About 12 years ago I became aware of the plethora of metrics persisted in the AWR (Oracle’s Automated Workload Repository [license required]). It could be argued that Oracle is the most instrumented software system on the planet. If you compare the AWR to prior incarnations of Oracle’s performance instrumentation (e.g. StatsPack, bstat/estat, v$ views), there really is no comparison; AWR is a historic repository of the requisite metrics for performance analysis. As you may know, the AWR warehouse of data contains tens of thousands of metrics, exposed in dozens of views and usually across multiple columns in those views. Many of the metrics are visible in various AWR reports (including ASH and ADDM reports), but I found these tools somewhat lacking because I was always concerned that I’d miss some observations essential to the root cause analysis. Moreover, these tools don’t tell you what the values of these zillions of metrics should be, or, let’s say, what the values usually are for the DB being examined.
The Small Model Approach
Thus, like many of us “tuners”, I’d rather query the AWR directly and follow my “instincts” as to where to look in the AWR for various problems, such as CPU, Memory (incl. PGA), Network, REDO, UNDO, TEMP, and so on. Ah, the “instincts”: I chose the metrics I wanted to look at based on the type of metric and my past experience of how useful that metric was for problem solving. I call the approach of selecting specific metrics for analysis the “small model approach” because it uses only a subset of all the data available. Herein lies the problem: I found the small model approach self-limiting and not always effective or efficient for solving an arbitrary type of performance problem. The small model approach requires a lot of knowledge as to where to look in the AWR for solving specific problems and a good amount of trial and error. Although I was able to gain proficiency over time, it was a difficult and sometimes tedious learning process [if you give it enough time, water will wear a rock down to a grain of sand]. The downfall of the small model approach is that by selecting a few metrics you wind up with decreased visibility into the tens of thousands of metrics available in the AWR, and thus a less complete and less accurate understanding of the performance problem at hand (i.e. the small model approach doesn’t scale).
Rescue Me, Analytics Gurus
Interestingly, my eldest son was in a PhD program in advanced analytics, and I got to rub elbows on occasion with his cohort of analytics gurus and other genius mathematicians. In conversation, I would bring up my quandary of lack of visibility into these tens of thousands of metrics, but I must admit, I made little progress on solving it until I “cornered” my son on this topic. He said, “Well, we do this thing called feature engineering and feature selection” [at that time these terms were totally new to me]. With my interest piqued, I said, “Tell me more.” He went on to explain one of many feature selection methods, in which it is sometimes easier/more convenient to count the intervals where a metric value exceeds some threshold; the metrics that exceed the threshold most often are likely the key influencers of the performance problem. I was still a bit skeptical because, with tens of thousands of metrics, it was still going to be difficult to make sense of them all. Whereupon he introduced me to the concept of using taxonomies to organize/bundle the metrics into broader categories.
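To make the threshold-counting idea concrete, here is a minimal sketch (not the actual DOPA code) against the real AWR view DBA_HIST_SYSMETRIC_SUMMARY; the snapshot range and the single bind-variable threshold are illustrative placeholders, since in practice every metric needs its own threshold, which is exactly the problem the next section addresses.

-- Count how many snapshot intervals each system metric exceeded a threshold.
-- :begin_snap, :end_snap and :threshold are illustrative placeholders; a real
-- analysis needs a per-metric threshold rather than one value for all metrics.
select metric_name,
       count(*) as intervals_over_threshold
  from dba_hist_sysmetric_summary
 where snap_id between :begin_snap and :end_snap
   and average > :threshold
 group by metric_name
 order by intervals_over_threshold desc;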
The Breakthrough
This started to make sense to me; the remaining difficulty was how to identify the threshold for tens of thousands of metrics. I knew some basic thresholds for a few metrics, but not for the many metrics available, so my son explained the technique of getting the normal ranges from the metric values in a “normal” period [reference Taming the AWR Tsunami – An Innovative Approach for an explanation of the rule-of-thumb method for establishing normal ranges and/or a short Medium article Analytics Using Feature Selection for Anomaly Detection]. Then it dawned on me that if I logically converted all the metrics into a key-value-pair structure, I could compute normal ranges for all the metrics in one shot and probably solve the whole problem with one SQL statement. Three months later I emerged with what I started to call the Dynamic Oracle Performance Analytics method (DOPA for short [I guess the PhD types with their long thesis names got to me too]); dynamic, because the metrics are assembled and analyzed on the fly, without having to know much about the metrics and their values ahead of time. The DOPA process substantially increases the number of metrics that can be included in an analysis (I’ve instrumented over 11,000 metrics), while simultaneously reducing the complexity of the analysis by identifying the metrics most relevant to solving a particular performance problem. The DOPA process is a descriptive modeling technique that enables the root cause analysis to be efficient, targeted, and specific. Using the DOPA process, the tuner will be able to quickly, accurately, and consistently identify root cause with confidence. Further, I built two taxonomies to overlay and bring structure to all the metrics: one taxonomy to classify the metrics by the infrastructure area they stem from (e.g. CPU, Network, Memory, IO) and another to classify the metrics by the Oracle subcomponent/subsystem they stem from (e.g. DBWR, LGWR, UNDO, REDO, TEMP, …). Bundling the metrics by these domain-specific taxonomies brings additional clarity as to the kind of problem the DB is having. Of course, there is no substitute for a performance analytics domain expert to interpret the findings of the model, but that is true in all analytics use cases.
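To illustrate the key-value-pair idea (a simplified sketch, not the DOPA code from the book), the following single SQL statement treats one AWR view as a key-value source, derives a normal range per metric from a baseline window using a rule of thumb I am assuming here for illustration (mean plus or minus two standard deviations; the referenced articles describe the actual rule-of-thumb method), and then counts how often each metric falls outside its range during the problem window. The view and column names are real AWR objects; the snapshot-range bind variables are placeholders.

-- Derive per-metric normal ranges from a baseline window, then flag metrics
-- whose values fall outside those ranges during the problem window.
-- DBA_HIST_SYSMETRIC_SUMMARY is a real AWR view; the bind variables and the
-- mean +/- 2 * stddev rule are illustrative assumptions.
with normalized as (
  select snap_id, metric_name as metric, average as value
    from dba_hist_sysmetric_summary   -- one of several AWR sources you could unpivot
),
normal_ranges as (
  select metric,
         avg(value) - 2 * stddev(value) as lower_bound,
         avg(value) + 2 * stddev(value) as upper_bound
    from normalized
   where snap_id between :normal_begin and :normal_end
   group by metric
)
select p.metric,
       count(*) as flagged_intervals
  from normalized p
  join normal_ranges n on n.metric = p.metric
 where p.snap_id between :problem_begin and :problem_end
   and (p.value < n.lower_bound or p.value > n.upper_bound)
 group by p.metric
 order by flagged_intervals desc;

In the full method, more AWR sources would be unpivoted into the same key-value shape and the flagged metrics would then be rolled up by the taxonomies described above.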
The Book
I had been presenting at professional conferences over the years on various performance tuning topics, so I decided to throw the DOPA process into the mix and began presenting on this topic as well. One of the editors from Apress saw the abstract I would be presenting at the 2018 IOUG conference and asked me if I would like to write a book on the topic. I really had no intention of writing a book on the subject – I simply wanted a way to analyze the thousands of Oracle performance metrics available. My original answer to the idea of writing a book on this was, “No. Writing a book takes too much time.” Thankfully, I was encouraged to go for it by my lovely wife. Several months later, I [we - she spent countless hours working with me, editing] emerged with the book [Dynamic Oracle Performance Analytics: Using Normalized Metrics to Improve Database Speed], which was published in late December 2018 [for a period of time it was the #1 best seller in the Oracle Database category]. The main challenge in writing the book was carefully unpacking and plumbing the depths of the work I had already done writing the SQL code for the DOPA process, along with use cases of this process in action.
Thanks for your time; I hope you found the back story at least somewhat interesting.