Taming the AWR Tsunami – An Innovative Approach

When I first encountered Oracle's Automated Workload Repository (AWR), I was blown away by the tsunami of data and metrics available to the performance tuner [Oracle's AWR exposes tens of thousands of metrics across hundreds of DBA_HIST% views, embedded in numerous columns in each view].  Much of the AWR data is exposed in an AWR report, but, as you may have experienced, wading through an AWR report can be exhausting.  Some of my first presentations on AWR were titled something like "Taming the AWR Tsunami"; they tried to put this performance information into a useful (but still traditional) format by organizing queries against the AWR according to the kinds of problems each view is useful for (e.g., issues related to CPU, I/O, memory, network, TEMP, UNDO, REDO, …).  From this traditional analysis perspective, you have to know a lot about the kind of performance problem the database application is having, and you have to know where in the AWR/AWR report to look for useful information about it.  These spread-out performance metrics complicate processing at scale, because you must go into each of these tables individually to extract the information, and you need to know which ones to look at in the first place [I found it nearly impossible to scale traditional analysis methods beyond a few dozen metrics].

In contrast, the novel Dynamic Oracle Performance Analytics (DOPA) method I developed, which uses data science analytical techniques, takes many AWR metrics and "normalizes" them on the fly into a single logical key-value-pair (KVP) structure (I've instrumented over 11 thousand metrics and can easily instrument many more).  The DOPA process is unique because, with the KVP-normalized metrics, I can now perform anomaly detection at massive scale across tens of thousands of performance metrics easily.
The DOPA process is a strikingly simple, SQL-based approach that flags anomalies in metric values from a problem interval compared to the values of those metrics in a "normal" processing interval.  So, how do I use a simple SQL-based approach to implement a complicated analysis and come up with meaningful results?  I implement several steps as with-clause sub-queries that incrementally build up the analysis and comparison.  The first step is key-value-pair normalization.  Next, I establish normal ranges for the metrics (pre-condition/assumption: you must have a date/time range where the system behaved normally).  Finally, I compare the metric values from the normal period to the metric values from the problem period and flag/report the anomalous metrics (pre-condition/assumption: you must have a date/time range where the system behaved abnormally).
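To make the shape of that with-clause pipeline concrete, here is a minimal end-to-end sketch using just one AWR source (DBA_HIST_SYSMETRIC_SUMMARY, whose AVERAGE column holds the per-interval value).  The sub-query and bind-variable names are my own illustrative choices, not the exact names in the DOPA code base, and the real process unions in many more sources:

```sql
-- Minimal sketch of the DOPA pipeline over a single AWR source.
-- Sub-query and bind-variable names are illustrative, not from the DOPA code.
WITH normalized_metrics AS (
  -- Step 1: KVP normalization (one source shown; DOPA unions in many more)
  SELECT snap_id, metric_name, average AS value
    FROM dba_hist_sysmetric_summary
), normal_ranges AS (
  -- Step 2: normal range = mean +/- 2 standard deviations, baseline window
  SELECT metric_name,
         AVG(value) - 2 * STDDEV(value) AS lower_bound,
         AVG(value) + 2 * STDDEV(value) AS upper_bound
    FROM normalized_metrics
   WHERE snap_id BETWEEN :normal_begin_snap AND :normal_end_snap
   GROUP BY metric_name
)
-- Step 3: flag problem-window values that fall outside the normal range
SELECT m.metric_name, m.snap_id, m.value, r.lower_bound, r.upper_bound
  FROM normalized_metrics m
  JOIN normal_ranges r ON r.metric_name = m.metric_name
 WHERE m.snap_id BETWEEN :problem_begin_snap AND :problem_end_snap
   AND (m.value < r.lower_bound OR m.value > r.upper_bound);
```

The sections below walk through each of these steps in turn.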

Key-Value-Pair Normalization

Some performance metrics found in the AWR are already in key-value-pair format, such as DBA_HIST_SYSMETRIC_SUMMARY; this view has about 161 metrics that are computed for each snapshot interval.  Similarly, DBA_HIST_SYSSTAT is in KVP format and has over 2,000 metrics in 19c; however, its values are cumulative, so you need to compute the "delta" between snapshots in order to make easy use of the data for a given snap_id.  Views like DBA_HIST_SYSTEM_EVENT and DBA_HIST_LATCH require additional processing to unpivot their columns into rows.  Once you have all the metrics in a comparable KVP format, you union them all together for subsequent analysis.
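A minimal sketch of that normalization and union might look like the following.  The column choices are assumptions on my part (e.g., the AVERAGE column of DBA_HIST_SYSMETRIC_SUMMARY, and a LAG analytic function to compute the cumulative-to-delta conversion), and for readability I assume a single database/instance:

```sql
-- Sketch: normalize three AWR sources into (snap_id, metric_name, value).
-- Single dbid/instance assumed; real code would also partition/filter on
-- dbid and instance_number.
WITH sysmetric AS (
  -- Already KVP: one value per metric per snapshot interval
  SELECT snap_id, metric_name, average AS value
    FROM dba_hist_sysmetric_summary
), sysstat AS (
  -- Cumulative counters: compute the per-snapshot delta with LAG
  SELECT snap_id, stat_name AS metric_name,
         value - LAG(value) OVER (PARTITION BY stat_name
                                  ORDER BY snap_id) AS value
    FROM dba_hist_sysstat
), sys_event AS (
  -- Unpivot a column of interest into rows (also cumulative, so delta it)
  SELECT snap_id,
         event_name || ' - time_waited_micro' AS metric_name,
         time_waited_micro
           - LAG(time_waited_micro) OVER (PARTITION BY event_name
                                          ORDER BY snap_id) AS value
    FROM dba_hist_system_event
)
SELECT * FROM sysmetric
UNION ALL SELECT * FROM sysstat
UNION ALL SELECT * FROM sys_event;
```

The same unpivot-and-delta pattern extends to the other DBA_HIST% sources.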

Establish Normal Ranges

Once you have your KVP-normalized metrics in the unioned set, you need to calculate the normal ranges.  The method I use is called the statistical rule of thumb**: plus or minus two standard deviations from the mean.  This relies on the simple, common aggregate functions AVG and STDDEV: the high bound of the normal range is the mean plus 2 standard deviations, and the low bound is the mean minus 2 standard deviations.  [N.B. Before calculating the normal ranges, I remove outliers in the data using the interquartile range method.]  Effectively, with a single calculation, I'm establishing what normal values look like for over 11 thousand metrics.  This simple but powerful construct of normal ranges is a key distinction compared to traditional methods, which tend not to tell you what the values for a metric normally are.
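Expressed as SQL, the normal-range calculation is just a GROUP BY over the baseline window.  This sketch also shows one way to write the interquartile-range outlier filter with PERCENTILE_CONT; the names are illustrative, and the exact filter in the DOPA code base may differ:

```sql
-- Sketch: normal ranges (mean +/- 2 std devs) after IQR outlier removal.
-- Names are illustrative; one AWR source shown for self-containment.
WITH normalized_metrics AS (
  SELECT snap_id, metric_name, average AS value
    FROM dba_hist_sysmetric_summary   -- union in other sources as above
), quartiles AS (
  -- Q1/Q3 per metric over the baseline ("normal") window
  SELECT metric_name,
         PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
         PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM normalized_metrics
   WHERE snap_id BETWEEN :normal_begin_snap AND :normal_end_snap
   GROUP BY metric_name
), no_outliers AS (
  -- Keep only values within 1.5 * IQR of the quartiles
  SELECT m.metric_name, m.value
    FROM normalized_metrics m
    JOIN quartiles q ON q.metric_name = m.metric_name
   WHERE m.snap_id BETWEEN :normal_begin_snap AND :normal_end_snap
     AND m.value BETWEEN q.q1 - 1.5 * (q.q3 - q.q1)
                     AND q.q3 + 1.5 * (q.q3 - q.q1)
)
SELECT metric_name,
       AVG(value) - 2 * STDDEV(value) AS lower_bound,
       AVG(value) + 2 * STDDEV(value) AS upper_bound
  FROM no_outliers
 GROUP BY metric_name;
```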

**Figure illustrating the Rule of Thumb (normal range = mean ± 2 standard deviations):

Flag and Report Metric Anomalies

This is where the rubber meets the road for the DOPA process.  Now that normal ranges are defined for all the metrics, I simply compare the metrics from the problem interval to the normal ranges and flag everything that falls outside them.  So rather than looking at, say, the top wait events, the SQL statement shows you the wait events that are abnormal (i.e., the anomalies).  Voilà, I've now tamed the AWR tsunami for over 11 thousand metrics, because the anomaly detection SQL statement shows me a value-based assessment of the metrics that are the key influencers of the performance problem at hand.  I didn't have to know the normal or expected values for Average Active Sessions, physical read/write I/O, read latencies, REDO, or any other metric for that matter.
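In sketch form, the comparison itself is just a join and a range test; here I add an illustrative HIGH/LOW flag column to make the report more readable (again, all names are my own, not the DOPA code's):

```sql
-- Sketch: flag problem-interval metric values outside their normal ranges.
-- One AWR source shown for self-containment; names are illustrative.
WITH normalized_metrics AS (
  SELECT snap_id, metric_name, average AS value
    FROM dba_hist_sysmetric_summary
), normal_ranges AS (
  SELECT metric_name,
         AVG(value) - 2 * STDDEV(value) AS lower_bound,
         AVG(value) + 2 * STDDEV(value) AS upper_bound
    FROM normalized_metrics
   WHERE snap_id BETWEEN :normal_begin_snap AND :normal_end_snap
   GROUP BY metric_name
)
SELECT m.metric_name, m.snap_id, m.value,
       r.lower_bound, r.upper_bound,
       -- Direction of the anomaly relative to the normal range
       CASE WHEN m.value > r.upper_bound THEN 'HIGH' ELSE 'LOW' END AS flag
  FROM normalized_metrics m
  JOIN normal_ranges r ON r.metric_name = m.metric_name
 WHERE m.snap_id BETWEEN :problem_begin_snap AND :problem_end_snap
   AND (m.value < r.lower_bound OR m.value > r.upper_bound)
 ORDER BY m.metric_name, m.snap_id;
```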

References

For comprehensive background on the Dynamic Oracle Performance Analytics (DOPA) process, I would refer you to the short Medium article Analytics Using Feature Selection for Anomaly Detection or my book, Dynamic Oracle Performance Analytics: Using Normalized Metrics to Improve Database Speed.

V1 code for the DOPA process is on GitHub:  https://github.com/Apress/dynamic-oracle-perf-analytics

The Back Story – Dynamic Oracle Performance Analytics

Tuning SQL Using Feature Engineering and Feature Selection - Part IV