Splunk: Baselining
Introduction
When working with Splunk, understanding statistical measures is crucial for baselining and anomaly detection. Below are some common statistical functions and their usage:
Mean
- Command:
stats avg(field)
- Purpose: Calculate the average value of a field to identify the central tendency of your data.
- Importance: Provides a baseline for normal behavior, making it easier to spot deviations.
- Example: For values 10, 20, 30, 60, 80, the mean is (10 + 20 + 30 + 60 + 80) / 5 = 40.
Median
- Command:
stats median(field)
- Purpose: Find the middle value of a field, especially useful for datasets with outliers.
- Importance: Less affected by extreme values, offering a robust baseline.
- Example: For values 10, 20, 30, 60, 80, the median is 30.
Mode
- Command:
stats mode(field)
- Purpose: Determine the most frequently occurring value in a field.
- Importance: Highlights common patterns or repeated values in your dataset.
- Example: For values 10, 20, 20, 60, 80, the mode is 20.
Standard Deviation
- Command:
stats stdev(field)
- Purpose: Measure the amount of variation or dispersion in a dataset.
- Importance: Quantifies data spread, helping to define thresholds for normal versus abnormal behavior.
- Example: For values 10, 20, 30, 60, 80, the sample standard deviation (which Splunk's stdev computes) is approximately 29.15.
Quartiles and Percentiles
- Command:
stats perc75(field) perc25(field)
- Purpose: Calculate the 75th and 25th percentiles to segment and identify data spread.
- Importance: Provides a detailed view of data distribution for precise baselining and anomaly detection.
- Example: For values 10, 20, 30, 60, 80, the 25th percentile is 20 and the 75th percentile is 60.
Splunk Commands
Stats
- Command:
stats
- Purpose: Perform aggregations like sum, average, count, etc., on fields in your dataset.
- Importance: Essential for summarizing data and deriving insights.
- Example: | stats count by status counts the occurrences of each status value.
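For intuition, the aggregation performed by | stats count by status resembles a simple group-and-count; a rough Python analog with made-up status values:

```python
from collections import Counter

# Hypothetical status field values from web events
statuses = ["200", "200", "404", "200", "500", "404"]

# Analog of: | stats count by status
counts = Counter(statuses)
print(counts)
```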
Streamstats
- Command:
streamstats
- Purpose: Calculate running totals, averages, or other statistics over a stream of events.
- Importance: Useful for tracking trends or cumulative metrics in real-time.
- Example: | streamstats sum(bytes) as cumulative_bytes calculates a running total of bytes.
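streamstats keeps a running aggregate over the event stream; a Python analog of the running byte total, with hypothetical byte counts:

```python
from itertools import accumulate

# Hypothetical bytes field from successive events, in arrival order
bytes_per_event = [100, 250, 50, 600]

# Analog of: | streamstats sum(bytes) as cumulative_bytes
cumulative_bytes = list(accumulate(bytes_per_event))
print(cumulative_bytes)  # [100, 350, 400, 1000]
```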
Eventstats
- Command:
eventstats
- Purpose: Add aggregated statistics to each event without reducing the number of events.
- Importance: Allows for comparisons between individual events and overall statistics.
- Example: | eventstats avg(duration) as avg_duration adds the average duration to each event.
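Conceptually, eventstats computes the aggregate once and attaches it to every record without dropping any; a Python sketch with hypothetical events:

```python
import statistics

# Hypothetical events with a duration field
events = [{"duration": 2.0}, {"duration": 4.0}, {"duration": 9.0}]

# Analog of: | eventstats avg(duration) as avg_duration —
# the aggregate is attached to every event; no events are dropped
avg_duration = statistics.mean(e["duration"] for e in events)
for e in events:
    e["avg_duration"] = avg_duration

print(events[0])  # {'duration': 2.0, 'avg_duration': 5.0}
```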
Bin
- Command:
bin
- Purpose: Group numeric or time values into buckets for easier analysis.
- Importance: Simplifies data by categorizing it into ranges or intervals.
- Example: | bin span=5m _time groups events into 5-minute intervals.
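Under the hood, binning floors each timestamp to the start of its bucket; a Python sketch with hypothetical epoch timestamps:

```python
# Analog of: | bin span=5m _time — floor each epoch timestamp
# to the start of its 5-minute bucket (timestamps are hypothetical)
SPAN = 5 * 60  # 5 minutes in seconds

timestamps = [1700000000, 1700000120, 1700000400]
buckets = [t - (t % SPAN) for t in timestamps]
print(buckets)
```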
Timechart
- Command:
timechart
- Purpose: Create time-based visualizations of aggregated data.
- Importance: Ideal for monitoring trends and patterns over time.
- Example: | timechart span=1h count by status shows hourly counts of each status value.
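A timechart is effectively a time-bucketed count split by a field; a rough Python analog with hypothetical (timestamp, status) pairs:

```python
from collections import Counter

# Hypothetical (epoch_seconds, status) event pairs
events = [(1700000100, "200"), (1700001900, "404"),
          (1700003700, "200"), (1700004000, "200")]

SPAN = 3600  # 1 hour, matching span=1h

# Analog of: | timechart span=1h count by status —
# bucket each event by hour, then count per (hour, status) pair
counts = Counter((t - t % SPAN, status) for t, status in events)
print(counts)
```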
Baselining with Standard Deviation
Baselining with standard deviation is a statistical method used to identify normal behavior and detect anomalies in data. Here’s how you can use it in Splunk:
Steps to Baseline with Standard Deviation
- Calculate the Mean: Use eventstats avg(field) to append the average value of the field you want to baseline to every event (unlike stats, eventstats keeps the raw events so they can be filtered later).
| eventstats avg(response_time) as mean_response_time
- Calculate the Standard Deviation: Use eventstats stdev(field) to append the standard deviation of the field; both aggregates can be computed in a single eventstats call.
| eventstats stdev(response_time) as stddev_response_time
- Define Thresholds: Use the mean and standard deviation to define thresholds for normal behavior. For example:
- Lower Threshold: mean - (2 * stddev)
- Upper Threshold: mean + (2 * stddev)
- Filter Anomalies: Use a where clause to filter events that fall outside the thresholds.
| eval lower_threshold = mean_response_time - (2 * stddev_response_time)
| eval upper_threshold = mean_response_time + (2 * stddev_response_time)
| where response_time < lower_threshold OR response_time > upper_threshold
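The steps above can be sketched in Python with hypothetical response times (illustrative only; in Splunk the aggregation runs over indexed events):

```python
import statistics

# Hypothetical response times in milliseconds; 450 is an outlier
response_times = [120, 130, 125, 118, 122, 450, 127]

mean_rt = statistics.mean(response_times)
stdev_rt = statistics.stdev(response_times)

# Thresholds at two standard deviations, as in the steps above
lower = mean_rt - 2 * stdev_rt
upper = mean_rt + 2 * stdev_rt

anomalies = [t for t in response_times if t < lower or t > upper]
print(anomalies)  # [450]
```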
Example Use Case
Suppose you are monitoring the response time of a web application. You can use the following search to identify anomalies:
index=web_logs sourcetype=access_combined
| eventstats avg(response_time) as mean_response_time, stdev(response_time) as stddev_response_time
| eval lower_threshold = mean_response_time - (2 * stddev_response_time)
| eval upper_threshold = mean_response_time + (2 * stddev_response_time)
| where response_time < lower_threshold OR response_time > upper_threshold
This search calculates the mean and standard deviation of response times, defines thresholds two standard deviations from the mean, and keeps only the events that fall outside that range.
Baselining with Z-Score
Baselining with Z-Score is a statistical method that identifies anomalies by measuring how far a data point lies from the mean, in units of standard deviations. For normally distributed data, the empirical rule states:
- 68% of data falls within ±1 standard deviation
- 95% falls within ±2 standard deviations
- 99.7% falls within ±3 standard deviations
Steps to Baseline with Z-Score
- Calculate the Mean and Standard Deviation: Use eventstats to append the average and standard deviation of the field you want to baseline to every event, keeping the events themselves for later filtering.
| eventstats avg(response_time) as mean_response_time, stdev(response_time) as stddev_response_time
- Calculate the Z-Score: Use the eval command to calculate the Z-Score for each event. The formula is:
Z-Score = (value - mean) / standard deviation
In Splunk:
| eval z_score = (response_time - mean_response_time) / stddev_response_time
- Define Thresholds: Decide on a Z-Score threshold for flagging anomalies. A common choice is Z-Score > 3 or Z-Score < -3, indicating the data point is more than 3 standard deviations from the mean.
- Filter Anomalies: Use a where clause to filter events with Z-Scores outside the threshold.
| where abs(z_score) > 3
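Putting the steps together, a Python sketch with hypothetical response times (the last one an obvious outlier) shows the same logic:

```python
import statistics

# Hypothetical response times in ms; the last value is an outlier
response_times = [120, 130, 125, 118, 122, 127, 121, 124, 126, 119,
                  123, 128, 122, 125, 121, 124, 120, 126, 123, 450]

mean_rt = statistics.mean(response_times)
stdev_rt = statistics.stdev(response_times)

# Z-Score = (value - mean) / standard deviation
z_scores = [(t - mean_rt) / stdev_rt for t in response_times]

# Keep values more than 3 standard deviations from the mean
anomalies = [t for t, z in zip(response_times, z_scores) if abs(z) > 3]
print(anomalies)  # [450]
```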
Example Use Case
Suppose you are monitoring failed logins to a web application. You can use the following search to flag hours with an anomalous number of failures:
index=web_logs sourcetype=access_combined action=failed
| bin _time span=1h
| stats count as failed_logins by _time
| eventstats avg(failed_logins) as mean_failed, stdev(failed_logins) as stddev_failed
| eval z_score = (failed_logins - mean_failed) / stddev_failed
| where abs(z_score) > 3
This search counts failed logins per hour, computes the mean and standard deviation of those hourly counts, derives a Z-Score for each hour, and keeps only the hours whose counts are more than 3 standard deviations from the mean.
Sources
For further reading and practical examples, refer to the following resources:
- Splunk: Finding and Removing Outliers
- Z-Scoring Your Way to Better Threat Detection
- Stop Chasing Ghosts: How Five-Number Summaries Reveal Real Anomalies
- Unravel the Mysteries of Variance and Standard Deviation
- Using Stats in Splunk Part 1: Basic Anomaly Detection
- Statistics How To: Empirical Rule (68-95-99.7)