ABOUT CARSON
Worked in Security Operations for 15 years
SOC Engineering Team Lead @ Microsoft
Previously SOC engineer, analyst & consultant @ MITRE
Check out my book if you haven’t

ABOUT CHRIS
Independent Consultant
SANS Institute Senior Instructor & Course Author
SOC Survey Author (2017, 2018, 2019)
Security Operations Summit Chair
Security Operations class on building & running a SOC
Engagements with Defense, Education, Energy, Financial, IT, Manufacturing, Science, Software Development

PICK SOMETHING YOU LOVE

AND MEASURE IT

MEASURING THINGS USUALLY DRIVES CHANGE
CMM maturity levels: Initial, Managed, Defined, Measured, Optimizing
Even if you’re not at CMM level 3, you can still get started!

METRICS ARE LIKE LIGHTSABERS

THEY CAN BE USED FOR GOOD

AND FOR EVIL

SOME DEFINITIONS
Metrics: things you can objectively measure
Input: behaviors and internal mechanisms
Output: results, typically customer-facing
Service level agreements (SLAs): agreement/commitment between provider and customer
Service level objectives (SLOs): performance metric or benchmark associated with an SLA

TOP TIPS
Metric data should be free and easy to calculate
½ of all SOCs collect metrics, according to the SANS SOC surveys of 2017 & 2018
There should be a quality measure that compensates for perversion anytime there’s a time-based metric
Metrics aren’t (necessarily) SLOs
The metric is there to help screen, diagnose, and assess performance
Don’t fall into the trap of working to some perceived metric objective
Any metric should have an intended effect; realize the measurement and calculation isn’t always entirely valid
Expectations, messaging, objectives: all distinct!

DATA SOURCES
SOC ticketing/case management system
SIEM / analytic platform / EDR: anywhere analysts create detections and investigate alerts
SOC code repository
SOC budget: CAPEX including hardware & software; OPEX including people & cloud
Enterprise asset management systems
Vulnerability scanners

EXISTING RESOURCES
M-Trends 2018 report
SOC-CMM: measure your SOC top to bottom
VERIS Framework: track your incidents well
SANS SOC Survey: recent polls from your peers
DBIR 2018 report executive summary


METRIC FOCUS 1: DATA FEED HEALTH
Is it “green”? What is green anyway?
Just because it’s up doesn’t mean all is well
Delays in receipt
Drops: temporary, permanent

HOW MANY EVENTS ARE WE RECEIVING?
Select count(*) group by bin(ReceiptTime, day)
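The pseudo-query above can be sketched against any SQL store. A minimal version using SQLite, assuming a hypothetical `events` table with a `ReceiptTime` timestamp column (the table, column names, and sample rows are invented for illustration):

```python
import sqlite3

# Minimal sketch: count events per day from an assumed "events" table
# with an ISO-8601 "ReceiptTime" text column. Schema is illustrative,
# not from any specific SIEM product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ReceiptTime TEXT, deviceHostName TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2019-06-01T10:00:00", "fw01"),
        ("2019-06-01T11:30:00", "fw01"),
        ("2019-06-02T09:15:00", "proxy01"),
    ],
)

# Equivalent of: select count(*) group by bin(ReceiptTime, day)
rows = conn.execute(
    "SELECT date(ReceiptTime) AS day, count(*) AS events "
    "FROM events GROUP BY day ORDER BY day"
).fetchall()

for day, events in rows:
    print(day, events)
```

Plotting these daily counts per collector is the simplest first look at feed health: a flat-lining series is usually the first sign of a drop.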


ADVANCED: AUTO DETECTION OF OUTAGES
OldCounts = Select OldCount = count(*)/7, OldDevices = distinct(deviceHostName)
  where ReceiptTime < ago(1 day) and ReceiptTime > ago(8 days)
  group by DataCollectorName, SourceEnvironment;
NewCounts = Select NewCount = count(*), NewDevices = distinct(deviceHostName)
  where ReceiptTime > ago(1 day)
  group by DataCollectorName, SourceEnvironment;
Join NewCounts on OldCounts by DataCollectorName, SourceEnvironment
  project CountRatio = NewCount/OldCount, DeviceRatio = NewDevices/OldDevices,
          IsBroken = OR(CountRatio < 25%, DeviceRatio < 50%)
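The ratio logic above can be restated in plain Python. The 25% and 50% thresholds follow the slide; the per-collector numbers are made-up illustrations:

```python
# Sketch of the outage-detection ratios in plain Python.
# Thresholds follow the slide; sample numbers are invented.
def is_broken(old_count_daily_avg, new_count, old_devices, new_devices,
              count_threshold=0.25, device_threshold=0.50):
    """Flag a collector whose last-day event volume or reporting-device
    population dropped sharply versus its prior 7-day daily average."""
    count_ratio = new_count / old_count_daily_avg if old_count_daily_avg else 0.0
    device_ratio = new_devices / old_devices if old_devices else 0.0
    return count_ratio < count_threshold or device_ratio < device_threshold

collectors = {
    "Collector A": (1000, 950, 40, 39),   # healthy
    "Collector B": (1000, 100, 40, 35),   # event volume collapsed
    "Collector C": (1000, 900, 40, 10),   # most devices stopped reporting
}
for name, stats in collectors.items():
    print(name, "BROKEN" if is_broken(*stats) else "ok")
```

Checking device count as well as event count matters: a collector can keep a healthy total volume from a few chatty hosts while most of its sources have silently dropped off.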

RESULT
One row per collector (A through E): OldCount, NewCount, OldDevices, NewDevices, IsBroken
Detection of dead, slow, or lagging collectors or sensors is fully automated
Consider human eyes-on: weekly or monthly

ADVANCED: MEASURE TIME EVERYWHERE
Pipeline: random syslog data, IoT & cloud logs, host telemetry, firewall & proxy logs, layer 3-7 NetFlow & SuperFlow → ETL → message bus (user-configurable data topics, exactly-once data delivery, short-term caching, data replication, node failover) → ETL → NRT analytic engine, NRT alert triage, query & data viz., data science platform
Latency as a factor of:
1. Clock skew
2. Systems rejoining the network & network outages
3. Lack of capacity:
   a. Ingest & parsing
   b. Decoration / enrichment
   c. NRT analytics & correlation
   d. Batched query

METRIC FOCUS 2: COVERAGE
Dimensions:
1. Absolute number and percentage of coverage per compute environment/enclave/domain
2. Kill chain or ATT&CK cell
3. Layer of the compute stack (network, OS, application, etc.)
4. Device covered (Linux, Windows, IoT, network device)
Tips:
1. Never drive coverage to 100%
2. There is always another environment to cover, customer to serve
   a. You don’t know what you don’t know
   b. Always a moving target
3. There will always be more stones to turn over; don’t ignore any of these dimensions

MANAGED VS WILDERNESS
Percentage of systems “managed”:
Inventoried? Tied to an asset/business owner? Tied to a known business/mission function? Subject to configuration management? Assigned to a responsible security team/POC? Risk assessed?
If all are yes: it’s managed
If not: it’s “wilderness”
SOC-observed device counts help identify “unknown unknowns” in the wilderness
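The managed-vs-wilderness test is a simple all-or-nothing check over the six questions. A minimal sketch, with hypothetical field names standing in for the six criteria:

```python
# Sketch: classify an asset as "managed" vs "wilderness" using the
# six yes/no questions from the slide. Field names are illustrative.
MANAGED_CRITERIA = (
    "inventoried",
    "has_owner",
    "known_business_function",
    "config_managed",
    "has_security_poc",
    "risk_assessed",
)

def classify(asset: dict) -> str:
    """Managed only if every criterion is answered yes;
    an unknown answer counts as no."""
    if all(asset.get(c, False) for c in MANAGED_CRITERIA):
        return "managed"
    return "wilderness"

server = {c: True for c in MANAGED_CRITERIA}
mystery_box = {"inventoried": True}  # everything else unknown
print(classify(server))
print(classify(mystery_box))
```

Treating unknown answers as "no" is the conservative choice here: it keeps half-documented assets in the wilderness bucket until someone actually answers the questions.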

VALIDATING DATA FEED & DETECTION COVERAGE
1. Expected heartbeat & true activity from every sensor and data feed
2. Detection triggers
   a. Injected late into pipeline as synthetic events: consider “unit” tests for each of your detections
   b. Injected early into pipeline as fake “bad” activity on hosts or networks
3. Blue/purple/red teaming: a strong way to test your SOC!
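Point 2a can be sketched as an ordinary unit test: feed a synthetic event into the detection function and assert that it fires. The detection logic and event shape below are hypothetical stand-ins, not any vendor's schema:

```python
# Sketch: "unit" tests for a detection, driven by synthetic events
# injected late in the pipeline. Detection and fields are hypothetical.
def detect_failed_admin_logon(event: dict) -> bool:
    # Hypothetical detection: failed logon to an admin-prefixed account.
    return event.get("outcome") == "failure" and event.get("user", "").startswith("adm_")

def test_fires_on_synthetic_bad_event():
    synthetic = {"user": "adm_backup", "outcome": "failure", "synthetic": True}
    assert detect_failed_admin_logon(synthetic), "detection failed its unit test"

def test_ignores_benign_event():
    benign = {"user": "alice", "outcome": "success"}
    assert not detect_failed_admin_logon(benign)

test_fires_on_synthetic_bad_event()
test_ignores_benign_event()
print("all detection unit tests passed")
```

Tagging injected events (here the `synthetic` flag) lets the pipeline route them out of real alert queues while still exercising every detection on a schedule.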

MONITORING SLAS/SLOS
SLA: agreement; monetary (or other) penalty for failing to meet
SLO: objective; no specific penalty agreed to for failing to meet
Institution & mission specific where these need to be set in place
Don’t monitor everything the same way! Instrumentation, custom detections, response times, retention
Basic Service:
  Host EDR
  Network logs
  Standard mix of detections
  Yearly engagement
Advanced Service:
  Basic, plus:
  3 application logs
  1 focused detection/quarter
  Quarterly engagement

METRIC FOCUS 3: SCANNING AND SWEEPING
Basic:
  # / % of known on-prem & cloud assets scanned for vulns
  Time to sweep and compile results for a given vuln or IOC:
    A given domain/forest identity plane
    Everything Internet-facing
    All user desktops/laptops
    Everything
Advanced:
  Amount of time it took to compile vulnerability/risk status on covered assets during the last high-CVSS-score “fire drill”
  Number of people needed to massage & compile these numbers monthly
  # / % of assets you can’t/don’t cover (IoT, network devices, etc.)

METRIC FOCUS 4: YOUR ANALYTICS
Basics:
1. Name
2. Description
3. Kill chain mapping
4. ATT&CK cell mapping
5. Depends on which data type(s) (OS logs, NetFlow, etc.)
Advanced:
6. Runs in what framework (streaming, batched query, etc.)
7. Covers which environments/enclaves
8. Created: who, when
9. Last modified: who, when
10. Last reviewed: who, when
11. Status: dev, preprod, prod, decom
12. Output routes to (analyst triage, automated notification, etc.)
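Captured in a structured repo, each analytic becomes one small record. A sketch of the fields above as a record type; the field names, defaults, and the example analytic are illustrative, not a prescribed schema:

```python
# Sketch: per-analytic metadata from the slide as a record type,
# suitable for export from a structured code repo or wiki.
from dataclasses import dataclass, field

@dataclass
class Analytic:
    name: str
    description: str
    kill_chain: str
    attack_cells: list
    data_types: list
    framework: str = "batched query"      # or "streaming"
    environments: list = field(default_factory=list)
    created_by: str = ""
    status: str = "dev"                   # dev, preprod, prod, decom
    output_routes: list = field(default_factory=list)

a = Analytic(
    name="psexec-lateral-movement",
    description="PsExec-style remote service install",
    kill_chain="lateral movement",
    attack_cells=["T1021.002"],
    data_types=["Windows event logs"],
    status="prod",
    output_routes=["analyst triage"],
)
print(a.name, a.status)
```

Once analytics live as records like this, the coverage and productivity charts on the following slides fall out of simple group-bys over `status`, `created_by`, and `attack_cells`.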

MEASURE ANALYST PRODUCTIVITY
Chart: Analytics Status for Last Month, per author (e.g., Trudy, Mallory), by status (dev through decom)
Is this good or evil? Can this be gamed?

HOW FRUITFUL ARE EACH AUTHOR’S DETECTIONS?
Chart: Alert Final Disposition by Detection Author (Alice, Bob, Charlie, Trudy, Mallory)
# of times a detection or analytic fired, attributed to the detection author, broken out by: Quick FP by Tier 1, Quick FP by Tier 2, True Positive, Garnered Further Work
Is this evil? How can this be gamed?


MAP YOUR ANALYTICS TO ATT&CK
Props to MITRE for the great example
Many places to do this: consider any structured code repo or wiki

METRIC FOCUS 5: ANALYST PERFORMANCE
1. Join date
2. Current role & time in role
3. Number of alerts triaged in last 30 days
4. % true positive rate for escalations
5. % response rate for customer escalations
6. Number of escalated cases handled in last 30 days
7. Mean time to close a case
8. Number of analytics/detections created that are currently in production
9. Number of detections modified that are currently in production
10. Total lines committed to SOC code repo in last 90 days
11. Success/fail rate of queries executed in last 30 days
12. Median run time per query
13. Mean lexical/structural similarity in queries run
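Most of these numbers are simple aggregations over case records. A sketch deriving two of them (true-positive rate for escalations, mean time to close) from hypothetical case data; the field names and values are invented:

```python
# Sketch: derive two analyst-performance numbers from case records.
# Field names and sample cases are hypothetical.
from statistics import mean

cases = [
    {"analyst": "chris", "escalated": True,  "true_positive": True,  "hours_open": 4.0},
    {"analyst": "chris", "escalated": True,  "true_positive": False, "hours_open": 1.0},
    {"analyst": "chris", "escalated": False, "true_positive": False, "hours_open": 0.5},
]

def tp_rate_for_escalations(cases):
    """Fraction of escalated cases that were true positives."""
    escalated = [c for c in cases if c["escalated"]]
    if not escalated:
        return 0.0
    return sum(c["true_positive"] for c in escalated) / len(escalated)

def mean_time_to_close(cases):
    return mean(c["hours_open"] for c in cases)

print(f"TP rate for escalations: {tp_rate_for_escalations(cases):.0%}")
print(f"mean time to close: {mean_time_to_close(cases):.2f} h")
```

Per the top tips earlier: pair any time-based number like mean time to close with a quality measure, or analysts will learn to close fast rather than close well.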

Analyst Baseball Card
Name: Christopher Crowley
Preferred First Name: Chris
Callsign: ChrisTwoGuns
Join Date: 2015-11-17
Current Role: NSM Analyst - Senior
Time in Role: 1 year
Plus: alerts triaged in last 30 days; percent true positive rate; response rate percent for customer escalations; escalated cases handled in last 30 days; mean time to close case; number of analytics created currently in production; number of detections modified currently in production; total lines committed to SOC code repository in last 90 days; success rate of queries against SIEM in last 30 days; median run time per query; mean lexical structure similarity in queries run in last 30 days

DAILY REVIEW DASHBOARD
Tier 1 inputs (bar chart): phone calls, web site, email, 10s of alerts, tips from hunt, tips from intel
Alert disposition / top time spent per case (bar chart): Quick FP by T1, Quick FP by T2, True Positive, Garnered Further Work, Auto Remediated, Auto Notified
Top firing detections

METRIC FOCUS 6: INCIDENT HANDLING
Mean/median adversary dwell time
Mean and median time to: triage & escalate; identify; contain; eradicate & recover
Divergence from SLA/SLO? Insufficient eradication? Threat attributed?
Top sources of confirmed incidents: proactive? reactive? user reports? SOC monitoring?
Data & “anecdata”: unforced errors and impediments
  Time waiting on other teams to do things
  No data / bad data / data lost
  Incorrect/ambiguous conclusions
  Time spent arguing with other parties
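Dwell and phase times reduce to differences between incident timestamps. A sketch, assuming each incident record carries `compromise`, `detect`, and `contain` times (field names and dates invented):

```python
# Sketch: mean/median dwell time and time-to-contain from incident
# timestamps. Field names and sample incidents are hypothetical.
from datetime import datetime
from statistics import mean, median

def hours(start, end):
    return (end - start).total_seconds() / 3600

incidents = [
    {"compromise": datetime(2019, 5, 1),  "detect": datetime(2019, 5, 4),
     "contain": datetime(2019, 5, 5)},
    {"compromise": datetime(2019, 5, 10), "detect": datetime(2019, 5, 11),
     "contain": datetime(2019, 5, 11, 12)},
]

dwell = [hours(i["compromise"], i["detect"]) for i in incidents]
to_contain = [hours(i["detect"], i["contain"]) for i in incidents]

print(f"dwell time: mean {mean(dwell):.1f} h, median {median(dwell):.1f} h")
print(f"time to contain: mean {mean(to_contain):.1f} h, median {median(to_contain):.1f} h")
```

Report the median alongside the mean: one long-running incident can drag the mean far from what a typical case looks like.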

TYPICAL INCIDENT METRICS
Chart: Incidents, Last 6 Months: open cases, escalated to 3rd party, closed cases
More ideas:
  Mean/median time to respond
  Cases left open past a time threshold
  Cases left open by initial reporting/detection type
  Stacked bar chart by case type

INCIDENT IMPACT
Low: few systems (or only a specific type); unimportant systems; unimportant data
Moderate: more systems (or many common types); important or high-value person, account, or system; important data at risk
High: most systems (or almost all types); highest-level accounts, users, and systems; business-critical data

INCIDENT IMPACT CATEGORY
Functional:
  Low: minimal function disruption
  Moderate: substantial disruption
  High: complete disruption
Informational:
  Intellectual property (L/M/H)
  Integrity manipulation (L/M/H)
  Privacy violated (such as PII / PHI)
Recoverable:
  Regular: predictable using resources on hand
  Supplemented: predictable with augmented resources
  Unrecoverable: data breach which cannot be undone
See more in the US-CERT Federal Incident Notification Guidelines, impact category descriptions

INCIDENT AVOIDABILITY
The vast majority of incidents are avoidable; not everyone realizes this
Collect metrics on how avoidable they were, and what could have been done to prevent them
Crowley’s Incident Avoidability metric:
1. A measure, already available in the environment, is applied to other systems/networks, but wasn’t applied here, resulting in the incident
2. A measure is available (generally) and something (economic, political) prevents implementing it within the organization
3. Nothing is available to prevent that method of attack
Attribution for the measure/mechanism in categories 1 & 2 is critical
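Once each incident is labeled with one of the three categories, the metric is a tally. A sketch, with invented incident records; the `missing_measure` field captures the attribution the slide calls critical for categories 1 and 2:

```python
# Sketch: tally incidents by the three avoidability categories.
# Category codes follow the slide; incident data is invented.
from collections import Counter

# 1 = measure existed in-house but wasn't applied to this system
# 2 = measure exists generally, blocked (economic/political) here
# 3 = nothing available to prevent this attack method
incidents = [
    {"id": "IR-101", "avoidability": 1, "missing_measure": "EDR deployment"},
    {"id": "IR-102", "avoidability": 1, "missing_measure": "patching"},
    {"id": "IR-103", "avoidability": 2, "missing_measure": "MFA"},
    {"id": "IR-104", "avoidability": 3, "missing_measure": None},
]

by_category = Counter(i["avoidability"] for i in incidents)
avoidable = by_category[1] + by_category[2]
print(dict(by_category))
print(f"{avoidable}/{len(incidents)} incidents were avoidable in principle")
```

Grouping the category 1 and 2 incidents by `missing_measure` then gives a ranked list of fixes, which feeds directly into the top-risk-areas metric later in the deck.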

METRIC FOCUS 7: INCIDENT FINANCIALS: COST
Routine handling:
  All alerts & reports fielded
  Per escalated event to Tier 2
  True positives
Consider: cost of people, technology, proportion of time spent for handling, for actual loss
Chart: cost to handle each incident vs. # of incidents
  The more incidents you handle, the more efficient, and cheaper, they will be to handle
  Only rare, awful incidents should be very costly to handle
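A first-order cost-per-incident number spreads the handling share of SOC spend across the incidents handled. A sketch; all dollar figures and proportions are invented placeholders, not benchmarks:

```python
# Sketch: rough cost-per-incident from people + technology spend.
# All figures below are invented placeholders.
def cost_per_incident(people_cost, tech_cost, handling_time_share, incidents):
    """Spread the share of SOC spend devoted to incident handling
    across the number of incidents handled."""
    if incidents == 0:
        return 0.0
    return (people_cost + tech_cost) * handling_time_share / incidents

annual_people = 1_200_000   # salaries + benefits (placeholder)
annual_tech = 400_000       # licenses, hardware, cloud (placeholder)
share_on_handling = 0.30    # fraction of SOC effort spent handling incidents

for n in (100, 400):
    each = cost_per_incident(annual_people, annual_tech, share_on_handling, n)
    print(f"{n} incidents -> ${each:,.0f} each")
```

The two runs illustrate the slide's curve: quadrupling incident volume on the same cost base quarters the per-incident cost, which is why only rare, awful incidents should remain expensive to handle.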

INCIDENT FINANCIALS: VALUE
Start with a standard impact value assigned to each incident: value saved / loss prevented
Routine incidents: standard calculation
Escalated & customized handling: you often must speculate
What to do? Past incidents; reporting from other orgs and the news; iterate with execs
Example implied value: loss prevention
  Incidents that were escalated to legal counsel, law enforcement
  Incidents handled that clobbered competitors
  Direct value of IP caught in exfil
  Value of systems not being bricked by an EFI bootkit

METRIC FOCUS 8: TOP RISK AREAS & HYGIENE
Make vulnerability management data available to customers
  Self-service model
  Scan results down to asset & item scanned
But don’t beat them over the head with every measure!
  Pick classic ones they will always be measured on: scanning, monitoring, patching
Pick top risk items from your own incident avoidability metrics and public intel reporting to focus on each year, semester, or quarter
  Internet-exposed devices
  Code signing enforcement
  EDR deployment
  Single-factor auth
  Non-managed devices & cloud resources


SUMMARY: INTERNAL METRICS
Analyst baseball card: raw output / productivity; technical & operational quality; pedigree, training, growth; kudos, “saves”
Analytic coverage: kill chain & ATT&CK cell; dependencies (source, detection framework); written by whom; daily alert volume & FP rate
Data feed health: up/down; latency; volume & success rates; customer coverage
Weekly intel & IOC processing volume
Weekly forensics/malware volume

SUMMARY: EXTERNAL METRICS
Key themes: cost, value, risk
Always be ready to answer: “what have you done for me lately?”
Managed vs unmanaged assets
Monitoring & scanning coverage
Top risk areas & hygiene
Top issues that are leading to incidents
Custom detections & value add
Incidents handled
Cost incurred & avoided
Causes & impediments
Mean/median dwell time
Mean/median time to identify, contain, eradicate, recover
Mean/median time to respond to a data call, such as an IOC sweep

SUMMARY: SLAS / SLOS
Key theme: for written agreements, select only the SLAs necessary to suit mission objectives
Examples:
  Response initiation within 4 hours
  Reporting / notification frequency at minimum daily regarding any active incident rated at moderate severity
  “Managed systems” percentage increases quarterly: by 5% if below 90%, by 1% above 90% (improvement in asset tracking and identification as well as business coordination)
  Increased performance on repeated incidents of the same nature on the same systems (demonstrated improvement in proficiency)

CLOSING
Whatever you do, measure something
You can do it, regardless of how mature, old, or big your SOC is
Pick your investments carefully
Iterate

