Splunk: Baselining
Introduction
When working with Splunk, understanding statistical measures is crucial for baselining and anomaly detection. Below are some common statistical functions and their usage:
Mean
- Command:
stats avg(field)
- Purpose: Calculate the average value of a field to identify the central tendency of your data.
- Importance: Provides a baseline for normal behavior, making it easier to spot deviations.
- Example: For values 10, 20, 30, 60, 80, the mean is (10 + 20 + 30 + 60 + 80) / 5 = 40.
Median
- Command:
stats median(field)
- Purpose: Find the middle value of a field, especially useful for datasets with outliers.
- Importance: Less affected by extreme values, offering a robust baseline.
- Example: For values 10, 20, 30, 60, 80, the median is 30.
Mode
- Command:
stats mode(field)
- Purpose: Determine the most frequently occurring value in a field.
- Importance: Highlights common patterns or repeated values in your dataset.
- Example: For values 10, 20, 20, 60, 80, the mode is 20.
Standard Deviation
- Command:
stats stdev(field)
- Purpose: Measure the amount of variation or dispersion in a dataset.
- Importance: Quantifies data spread, helping to define thresholds for normal versus abnormal behavior.
- Example: For values 10, 20, 30, 60, 80, the sample standard deviation (which Splunk's stdev computes) is approximately 29.15.
Quartiles and Percentiles
- Command:
stats perc75(field) perc25(field)
- Purpose: Calculate the 75th and 25th percentiles to segment and identify data spread.
- Importance: Provides a detailed view of data distribution for precise baselining and anomaly detection.
- Example: For values 10, 20, 30, 60, 80, the 25th percentile is 20 and the 75th percentile is 60.
Splunk Commands
Stats
- Command:
stats
- Purpose: Perform aggregations like sum, average, count, etc., on fields in your dataset.
- Importance: Essential for summarizing data and deriving insights.
- Example: | stats count by status counts the occurrences of each status value.
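For intuition, the aggregation performed by | stats count by status resembles a simple group-and-count; a rough Python analog with made-up status values:

```python
from collections import Counter

# Hypothetical status field values from web events
statuses = ["200", "200", "404", "200", "500", "404"]

# Analog of: | stats count by status
counts = Counter(statuses)
print(counts)
```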
Streamstats
- Command:
streamstats
- Purpose: Calculate running totals, averages, or other statistics over a stream of events.
- Importance: Useful for tracking trends or cumulative metrics in real-time.
- Example: | streamstats sum(bytes) as cumulative_bytes calculates a running total of bytes.
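streamstats keeps a running aggregate over the event stream; a Python analog of the running byte total, with hypothetical byte counts:

```python
from itertools import accumulate

# Hypothetical bytes field from successive events, in arrival order
bytes_per_event = [100, 250, 50, 600]

# Analog of: | streamstats sum(bytes) as cumulative_bytes
cumulative_bytes = list(accumulate(bytes_per_event))
print(cumulative_bytes)  # [100, 350, 400, 1000]
```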
Eventstats
- Command:
eventstats
- Purpose: Add aggregated statistics to each event without reducing the number of events.
- Importance: Allows for comparisons between individual events and overall statistics.
- Example: | eventstats avg(duration) as avg_duration adds the average duration to each event.
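Conceptually, eventstats computes the aggregate once and attaches it to every record without dropping any; a Python sketch with hypothetical events:

```python
import statistics

# Hypothetical events with a duration field
events = [{"duration": 2.0}, {"duration": 4.0}, {"duration": 9.0}]

# Analog of: | eventstats avg(duration) as avg_duration —
# the aggregate is attached to every event; no events are dropped
avg_duration = statistics.mean(e["duration"] for e in events)
for e in events:
    e["avg_duration"] = avg_duration

print(events[0])  # {'duration': 2.0, 'avg_duration': 5.0}
```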
Bin
- Command:
bin
- Purpose: Group numeric or time values into buckets for easier analysis.
- Importance: Simplifies data by categorizing it into ranges or intervals.
- Example: | bin span=5m _time groups events into 5-minute intervals.
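Under the hood, binning floors each timestamp to the start of its bucket; a Python sketch with hypothetical epoch timestamps:

```python
# Analog of: | bin span=5m _time — floor each epoch timestamp
# to the start of its 5-minute bucket (timestamps are hypothetical)
SPAN = 5 * 60  # 5 minutes in seconds

timestamps = [1700000000, 1700000120, 1700000400]
buckets = [t - (t % SPAN) for t in timestamps]
print(buckets)
```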
Timechart
- Command:
timechart
- Purpose: Create time-based visualizations of aggregated data.
- Importance: Ideal for monitoring trends and patterns over time.
- Example: | timechart span=1h count by status shows hourly counts of each status value.
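A timechart is effectively a time-bucketed count split by a field; a rough Python analog with hypothetical (timestamp, status) pairs:

```python
from collections import Counter

# Hypothetical (epoch_seconds, status) event pairs
events = [(1700000100, "200"), (1700001900, "404"),
          (1700003700, "200"), (1700004000, "200")]

SPAN = 3600  # 1 hour, matching span=1h

# Analog of: | timechart span=1h count by status —
# bucket each event by hour, then count per (hour, status) pair
counts = Counter((t - t % SPAN, status) for t, status in events)
print(counts)
```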
Baselining with Standard Deviation
Baselining with standard deviation is a statistical method used to identify normal behavior and detect anomalies in data. Here’s how you can use it in Splunk:
Steps to Baseline with Standard Deviation
- Calculate the Mean: Use eventstats avg(field) to append the average value of the field you want to baseline to every event (unlike stats, eventstats keeps the raw events so they can be filtered later).
| eventstats avg(response_time) as mean_response_time
- Calculate the Standard Deviation: Use eventstats stdev(field) to append the standard deviation of the field; both aggregates can be computed in a single eventstats call.
| eventstats stdev(response_time) as stddev_response_time
- Define Thresholds: Use the mean and standard deviation to define thresholds for normal behavior. For example:
- Lower Threshold: mean - (2 * stddev)
- Upper Threshold: mean + (2 * stddev)
- Filter Anomalies: Use a where clause to filter events that fall outside the thresholds.
| eval lower_threshold = mean_response_time - (2 * stddev_response_time)
| eval upper_threshold = mean_response_time + (2 * stddev_response_time)
| where response_time < lower_threshold OR response_time > upper_threshold
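The steps above can be sketched in Python with hypothetical response times (illustrative only; in Splunk the aggregation runs over indexed events):

```python
import statistics

# Hypothetical response times in milliseconds; 450 is an outlier
response_times = [120, 130, 125, 118, 122, 450, 127]

mean_rt = statistics.mean(response_times)
stdev_rt = statistics.stdev(response_times)

# Thresholds at two standard deviations, as in the steps above
lower = mean_rt - 2 * stdev_rt
upper = mean_rt + 2 * stdev_rt

anomalies = [t for t in response_times if t < lower or t > upper]
print(anomalies)  # [450]
```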
Example Use Case
Suppose you are monitoring the response time of a web application. You can use the following search to identify anomalies:
index=web_logs sourcetype=access_combined
| eventstats avg(response_time) as mean_response_time, stdev(response_time) as stddev_response_time
| eval lower_threshold = mean_response_time - (2 * stddev_response_time)
| eval upper_threshold = mean_response_time + (2 * stddev_response_time)
| where response_time < lower_threshold OR response_time > upper_threshold
This search calculates the mean and standard deviation of response times, defines thresholds two standard deviations from the mean, and keeps only the events that fall outside that range.
Baselining with Z-Score
Baselining with Z-Score is a statistical method that identifies anomalies by measuring how far a data point lies from the mean, in units of standard deviations. For normally distributed data, the empirical rule states:
- 68% of data falls within ±1 standard deviation
- 95% falls within ±2 standard deviations
- 99.7% falls within ±3 standard deviations
Steps to Baseline with Z-Score
- Calculate the Mean and Standard Deviation: Use eventstats to append the average and standard deviation of the field you want to baseline to every event, keeping the events themselves for later filtering.
| eventstats avg(response_time) as mean_response_time, stdev(response_time) as stddev_response_time
- Calculate the Z-Score: Use the eval command to calculate the Z-Score for each event. The formula is:
Z-Score = (value - mean) / standard deviation
In Splunk:
| eval z_score = (response_time - mean_response_time) / stddev_response_time
- Define Thresholds: Decide on a Z-Score threshold for flagging anomalies. A common choice is Z-Score > 3 or Z-Score < -3, indicating the data point is more than 3 standard deviations from the mean.
- Filter Anomalies: Use a where clause to filter events with Z-Scores outside the threshold.
| where abs(z_score) > 3
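Putting the steps together, a Python sketch with hypothetical response times (the last one an obvious outlier) shows the same logic:

```python
import statistics

# Hypothetical response times in ms; the last value is an outlier
response_times = [120, 130, 125, 118, 122, 127, 121, 124, 126, 119,
                  123, 128, 122, 125, 121, 124, 120, 126, 123, 450]

mean_rt = statistics.mean(response_times)
stdev_rt = statistics.stdev(response_times)

# Z-Score = (value - mean) / standard deviation
z_scores = [(t - mean_rt) / stdev_rt for t in response_times]

# Keep values more than 3 standard deviations from the mean
anomalies = [t for t, z in zip(response_times, z_scores) if abs(z) > 3]
print(anomalies)  # [450]
```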
Example Use Case
Suppose you are monitoring failed logins to a web application. You can use the following search to flag hours with an anomalous number of failures:
index=web_logs sourcetype=access_combined action=failed
| bin _time span=1h
| stats count as failed_logins by _time
| eventstats avg(failed_logins) as mean_failed, stdev(failed_logins) as stddev_failed
| eval z_score = (failed_logins - mean_failed) / stddev_failed
| where abs(z_score) > 3
This search counts failed logins per hour, computes the mean and standard deviation of those hourly counts, derives a Z-Score for each hour, and keeps only the hours whose counts are more than 3 standard deviations from the mean.
Sources
For further reading and practical examples, refer to the following resources:
- Splunk: Finding and Removing Outliers
- Z-Scoring Your Way to Better Threat Detection
- Stop Chasing Ghosts: How Five-Number Summaries Reveal Real Anomalies
- Unravel the Mysteries of Variance and Standard Deviation
- Using Stats in Splunk Part 1: Basic Anomaly Detection
- Statistics How To: Empirical Rule (68-95-99.7)