This post is Part II of a series that details my innovative approach to SQL-level performance diagnostics. Part I of this series can be found here. As previously stated, I termed the approach the Dynamic Oracle Performance Analytics process, or DOPA process for short. The DOPA process makes use of some interesting data science techniques of feature engineering and feature selection using plain SQL (i.e. no other specialized coding needed) to analyze many metrics, perform anomaly detection, and point you in the right direction to improve SQL performance in an Oracle Database. The book I wrote on this topic covers database-level performance metrics; however, I have found that many tuners are intensely interested in SQL-level diagnostics. In this series, I detail how I applied the DOPA process to SQL-level metrics.
My novel approach to database performance analytics represents a paradigm shift away from traditional methods of analysis, which tend to rely on a few hand-picked (favorite) metrics or, worse yet, on wading through multiple specialized tables and many columns of information such as that found in an AWR report. With the DOPA process applied to SQL-level metrics, you draw on all the available numeric metrics and flag up anomalies in the metric values, which will help you reach focused, impactful performance improvement conclusions. Thus, using feature engineering and feature selection, the problem of metric overload is solved and the tuning process is simplified.
Errata: In Part I on this topic and in the title, I reference SQL tuning using feature engineering and feature selection. Admittedly, I was a bit looser in this representation than I typically like to be. My main assertion is that the techniques I’m describing will help you interpret the anomalies in the performance metrics (and prioritize which metrics to focus on) and thus lead you to an accurate root cause analysis. SQL tuning on the other hand is the actions one would take after having an accurate root cause analysis. In other words, the Dynamic Oracle Performance Analytics (DOPA) method will lead you to a correct root cause analysis, but the model won’t tell you what actions to take to fix the problem. This of course is where one’s SQL tuning domain expertise comes into play (i.e. as with most analytical techniques, a domain expert is still needed to interpret the resulting analytical models).
With the errata out of the way, let me pick up a few loose ends from Part I. First, I’ll look at feature engineering as applied to the SQL- and session-level metrics exposed by the active session history view, DBA_HIST_ACTIVE_SESS_HISTORY. I’ll also cover how to apply feature engineering techniques to transform non-numeric date/time metrics into numeric values, to make computing on them easier and more impactful.
Feature Engineering in Action – Active Session History Example
As stated in Part I, to engineer features into your model, one would use their knowledge of the problem domain to pick the metrics. In this post I’ll cover feature engineering as applied to the SQL- and session-level metrics in the active session history (ASH) view, DBA_HIST_ACTIVE_SESS_HISTORY.
As many DBAs familiar with Oracle tuning will know, DBA_HIST_ACTIVE_SESS_HISTORY exposes the SQL/session-level samples taken every second into v$active_session_history; roughly one in ten of those samples is rolled up into DBA_HIST_ACTIVE_SESS_HISTORY at each snapshot period. Thus, a sample in DBA_HIST_ACTIVE_SESS_HISTORY represents roughly 10 seconds’ worth of samples from v$active_session_history. I may not be expressing this “by the book”, but that’s more or less how I think of the DBA_HIST version of ASH compared to the v$ version.
MCT View of ASH Metrics
Below, I have, as an example, a simple select on the multi-column table (MCT) version of dba_hist_active_sess_history for a single sql_id. For the purposes of my dynamic metric analysis method, I do not want to leave the metrics in this column-by-column/metric-by-metric format because it is much too cumbersome and because any comparison I do, whether calculated (as I do in the DOPA process) or visual, will also have to be done column by column/metric by metric. Thus, with data in the MCT format, it is difficult to scale column-by-column analysis to many metrics.
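A sketch of such a select follows (the exact column list isn’t important; the handful of numeric columns below is illustrative, using the same sql_id as the later examples):
select snap_id, sample_time, session_id, session_serial#
, wait_class, event, time_waited
, tm_delta_time, tm_delta_cpu_time, tm_delta_db_time
, delta_read_io_requests, delta_write_io_requests
, delta_read_io_bytes, delta_write_io_bytes
, pga_allocated, temp_space_allocated
from dba_hist_active_sess_history
where sql_id = '988a1hx9zc5rr'
order by snap_id, sample_time;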
I’ve taken screenshots (below) of the data values produced after running the above query on select dba_hist_active_sess_history columns. Again, trying to compare metric-by-metric is possible, but far too cumbersome to scale to many metrics.
Sidebar Note: You may have noticed that I’ve left off a number of valuable ASH columns, most of them non-numeric text values; these are a special case that I plan to deal with in a future post.
KVP View of ASH Metrics
Rather than including a massive SQL statement and letting the reader sort through it, I will review the following SQL piece by piece, one with-clause sub-query at a time. Taken together, the SQL can be used to unpivot the multi-column table version of dba_hist_active_sess_history into the Key-Value-Pair (KVP) “view”. As with the query on dba_hist_sqlstat, this may seem complicated, but it really follows a simple pattern: I unpivot various column sets based on their data characteristics, then I logically UNION all the unpivoted data sets into one big, tall-skinny KVP structure.
Query the ASH Data
This is the with-clause sub-query in which I subset out the ASH data of interest. In this example, I’m subsetting on a particular SQL_ID. In the full version I would also subset on the problem and normal intervals as well as [optionally] a set of sql_id's and/or a set of sessions (a sketch of such a predicate follows the code below). Code snippet follows:
with ash as /* get all the ASH data of interest ; unpivot later */
(
select
SNAP_ID
,DBID,INSTANCE_NUMBER
,SAMPLE_ID
,SAMPLE_TIME
,SESSION_ID,SESSION_SERIAL#
,SESSION_ID||':'||SESSION_SERIAL# SESSION_AND_SERIAL#
,decode(SESSION_TYPE, 'FOREGROUND', 'Y', 'N') as IS_FOREGROUND_SESSION
,decode(SESSION_TYPE, 'FOREGROUND', 'N', 'Y') as IS_BACKGROUND_SESSION
,FLAGS,USER_ID
,SQL_ID
,IS_SQLID_CURRENT
,SQL_CHILD_NUMBER || '' as SQL_CHILD_NUMBER
,SQL_OPCODE,SQL_OPNAME
,FORCE_MATCHING_SIGNATURE
,TOP_LEVEL_SQL_ID,TOP_LEVEL_SQL_OPCODE
,SQL_PLAN_HASH_VALUE||'' as SQL_PLAN_HASH_VALUE
,SQL_PLAN_LINE_ID,SQL_PLAN_OPERATION,SQL_PLAN_OPTIONS
,SQL_EXEC_ID
,SQL_EXEC_ID||'' as SQL_EXECUTION_ID
,SQL_EXEC_START,PLSQL_ENTRY_OBJECT_ID,PLSQL_ENTRY_SUBPROGRAM_ID,PLSQL_OBJECT_ID,PLSQL_SUBPROGRAM_ID
,QC_INSTANCE_ID,QC_SESSION_ID,QC_SESSION_SERIAL#
,decode(QC_INSTANCE_ID || QC_SESSION_ID || QC_SESSION_SERIAL#, null, 'N', 'Y') as IS_SQL_EXECUTING_IN_PARALLEL
,PX_FLAGS
, EVENT || decode(BLOCKING_SESSION||BLOCKING_SESSION_SERIAL#,null,null, ' [blocked event]') as EVENT
,EVENT_ID
,SEQ#
,seq#||'' as WAIT_EVENT_SEQUENCE_NUMBER
,P1TEXT,P1,P2TEXT,P2,P3TEXT,P3
,WAIT_CLASS,WAIT_CLASS_ID,WAIT_TIME
,SESSION_STATE
,decode(SESSION_STATE, 'WAITING', 'Y', 'N') as IS_SESSION_WAITING
,decode(SESSION_STATE, 'ON CPU', 'Y', 'N') as IS_SESSION_ON_CPU
,TIME_WAITED,BLOCKING_SESSION_STATUS
,decode(BLOCKING_SESSION||BLOCKING_SESSION_SERIAL#,null,'N', 'Y') as IS_BLOCKED_SESSION
,BLOCKING_INST_ID,BLOCKING_HANGCHAIN_INFO,CURRENT_OBJ#,CURRENT_FILE#,CURRENT_BLOCK#,CURRENT_ROW#,TOP_LEVEL_CALL#,TOP_LEVEL_CALL_NAME
,CONSUMER_GROUP_ID
,XID||'' as TRANSACTION_ID
,REMOTE_INSTANCE#,TIME_MODEL
,IN_CONNECTION_MGMT,IN_PARSE,IN_HARD_PARSE,IN_SQL_EXECUTION,IN_PLSQL_EXECUTION,IN_PLSQL_RPC,IN_PLSQL_COMPILATION,IN_JAVA_EXECUTION,IN_BIND
,IN_CURSOR_CLOSE,IN_SEQUENCE_LOAD
,CAPTURE_OVERHEAD,REPLAY_OVERHEAD
,IS_CAPTURED,IS_REPLAYED
,SERVICE_HASH,PROGRAM,MODULE,ACTION,CLIENT_ID,MACHINE,PORT,ECID,DBREPLAY_FILE_ID,DBREPLAY_CALL_COUNTER
,TM_DELTA_TIME,TM_DELTA_CPU_TIME
,TM_DELTA_DB_TIME
, TM_DELTA_DB_TIME - TM_DELTA_CPU_TIME as TM_DELTA_IDLE_TIME
,DELTA_TIME,DELTA_READ_IO_REQUESTS,DELTA_WRITE_IO_REQUESTS,DELTA_READ_IO_BYTES,DELTA_WRITE_IO_BYTES,DELTA_INTERCONNECT_IO_BYTES
,PGA_ALLOCATED
,TEMP_SPACE_ALLOCATED
from dba_hist_active_sess_history
where 1=1 /* in practice, subset on intervals, sql_id's, and sessions */
and sql_id = '988a1hx9zc5rr'
)
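For reference, a sketch of what that fuller subsetting predicate might look like (the bind variable names are just placeholders):
where 1=1
and snap_id between :begin_snap and :end_snap   /* problem or normal interval */
and sql_id in ('988a1hx9zc5rr')                  /* optionally, a set of sql_id's */
/* and session_id in ( ... )                        optionally, a set of sessions */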
Unpivot ASH Counts
For example, in the with clause sub-query “unpivot_ash_counts”, I’m taking all the “state” columns whose values are given as Y or N and decoding them to 1 or 0, then summing them up and multiplying by 10 to get the estimated number of seconds in that “state”. Thus, for example, the metric name (i.e. the key) for the column IN_PLSQL_EXECUTION becomes “total in_plsql_execution (sec)” and the value is the sum of all the Y’s decoded to 1 and N’s decoded to 0 (times 10), giving the total amount of time in seconds that the SQL statement was in the PL/SQL execution state for that snapshot period.
, unpivot_ash_counts as /* ASH: "total <metric_name> (sec)" */
(
select sql_id, snap_id, 'total ' || lower(metric_name) || ' (sec)' metric_name, sum(Decode(delta_value, 'Y', 1, 0))*10 delta_value
from (
select sql_id, snap_id, metric_name, delta_value
from ash
unpivot /* exclude nulls */
( delta_value for metric_name in(
IN_CONNECTION_MGMT
,IN_PARSE
,IN_HARD_PARSE
,IN_SQL_EXECUTION
,IN_PLSQL_EXECUTION
,IN_PLSQL_RPC
,IN_PLSQL_COMPILATION
,IN_JAVA_EXECUTION
,IN_BIND
,IN_CURSOR_CLOSE
,IN_SEQUENCE_LOAD
,IS_SQLID_CURRENT
,IS_FOREGROUND_SESSION
,IS_BACKGROUND_SESSION
,CAPTURE_OVERHEAD
,REPLAY_OVERHEAD
,IS_CAPTURED
,IS_REPLAYED
,IS_BLOCKED_SESSION
,IS_SESSION_WAITING
,IS_SESSION_ON_CPU
,IS_SQL_EXECUTING_IN_PARALLEL
)
)
)
group by sql_id, snap_id, metric_name
)
Unpivot ASH Times
Here I’m unpivoting the “time” columns and also converting the values to seconds. I’m planning to thoroughly test and perhaps rethink the feature engineering here regarding what to express; for example, I may express the ratio of DB time and CPU time using:
, round(100*TM_DELTA_DB_TIME/TM_DELTA_TIME,0) as DB_TIME_PCT
, round(100*TM_DELTA_CPU_TIME/TM_DELTA_TIME,0) as CPU_TIME_PCT
This is because, by definition of these columns (as per below), both TM_DELTA_DB_TIME and TM_DELTA_CPU_TIME are to be interpreted in relation to TM_DELTA_TIME.
TM_DELTA_TIME is the time interval (in microseconds) over which TM_DELTA_CPU_TIME and TM_DELTA_DB_TIME are accumulated
TM_DELTA_CPU_TIME is the amount of time this session spent on CPU over the last TM_DELTA_TIME (in microseconds)
TM_DELTA_DB_TIME is the amount of time spent by this session in database calls over the last TM_DELTA_TIME (in microseconds).
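If I do go that route, a rough sketch of how these ratio features might be added to the ash sub-query follows (the nullif guard is my own addition, to avoid a divide-by-zero should TM_DELTA_TIME ever come back as zero):
, round(100*TM_DELTA_DB_TIME /nullif(TM_DELTA_TIME,0),0) as DB_TIME_PCT
, round(100*TM_DELTA_CPU_TIME/nullif(TM_DELTA_TIME,0),0) as CPU_TIME_PCT
I’d likely unpivot these with an average rather than a sum, since percentages don’t add up meaningfully across samples.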
My current code snippet follows:
, unpivot_ash_times as /* ASH: "total <metric_name> (sec)" */
(
select sql_id, snap_id, 'total ' || lower(metric_name) || ' (sec)' metric_name , sum(delta_value)/1000000 delta_value
from (
select sql_id, snap_id, metric_name, delta_value
from ash
unpivot /* exclude nulls */
( delta_value for metric_name in(
TM_DELTA_TIME,TM_DELTA_CPU_TIME,TM_DELTA_DB_TIME
, TM_DELTA_IDLE_TIME
,DELTA_TIME
)
)
)
group by sql_id, snap_id, metric_name
)
Unpivot ASH IO
For the with clause sub-query “unpivot_ash_io”, I’m taking the two IO request columns (DELTA_READ_IO_REQUESTS, DELTA_WRITE_IO_REQUESTS) and summing them up to get the number of read or write IO requests recorded for the sessions in that snapshot period. Thus, for example, the metric name (i.e. the key) for the column DELTA_READ_IO_REQUESTS becomes “total delta_read_io_requests (sum)” and the value is the sum of all the read IO requests for that snapshot period. Code snippet follows:
, unpivot_ash_io as /* ASH: "total <metric_name> (sum)" */
(
select sql_id, snap_id, 'total ' || lower(metric_name) || ' (sum)' metric_name , sum(delta_value) delta_value
from (
select sql_id, snap_id, metric_name, delta_value
from ash
unpivot /* exclude nulls */
( delta_value for metric_name in(
DELTA_READ_IO_REQUESTS,DELTA_WRITE_IO_REQUESTS
)
)
)
group by sql_id, snap_id, metric_name
)
Unpivot ASH Bytes
Here I’m unpivoting “size” columns and also converting values to megabytes. Code snippet follows:
, unpivot_ash_bytes as /* ASH: "total max <metric_name> (meg)" */
(
select sql_id, snap_id, 'total max ' || lower(metric_name) || ' (meg)' metric_name , max(nvl(delta_value,0))/(1024*1024) delta_value
from (
select sql_id, snap_id, metric_name, delta_value
from ash
unpivot include nulls
( delta_value for metric_name in(
PGA_ALLOCATED
,TEMP_SPACE_ALLOCATED
,DELTA_READ_IO_BYTES,DELTA_WRITE_IO_BYTES,DELTA_INTERCONNECT_IO_BYTES
)
)
)
group by sql_id, snap_id, metric_name
)
Unpivot ASH Distinct
This sub-query unpivots columns that I’m counting distinct occurrences of. Code snippet follows:
, unpivot_ash_distinct as /* "count distinct <metric_name> (count)" */
(
select sql_id, snap_id, 'count distinct '|| lower(metric_name) || ' (count)' metric_name , count(distinct delta_value) delta_value
from (
select sql_id, snap_id, metric_name, delta_value
from ash
unpivot include nulls
( delta_value for metric_name in(
SESSION_AND_SERIAL#
, SQL_PLAN_HASH_VALUE
, TRANSACTION_ID
, WAIT_EVENT_SEQUENCE_NUMBER
, SQL_EXECUTION_ID
, SQL_CHILD_NUMBER
)
)
)
group by sql_id, snap_id, metric_name
)
Unpivot ASH Events
This sub-query unpivots columns that contain event level details. Code snippet follows:
, unpivot_ash_events as /* "total time in wait event: <metric_name> (sec)" */
(
select sql_id, snap_id, 'total time in wait event: '|| metric_name || ' (sec)' metric_name , sum(delta_value) delta_value
from (select sql_id, snap_id, nvl(event,'ON CPU') || decode(wait_class, null, null, ' ; wait class: ' || wait_class) metric_name, 10 delta_value from ash)
group by sql_id, snap_id, metric_name
union /* seq# */
select sql_id, snap_id, 'max ' || metric_name metric_name , max(delta_value) delta_value
from (select sql_id, snap_id, 'wait event sequence number' metric_name, seq# delta_value from ash)
group by sql_id, snap_id, metric_name
)
Unpivot ASH Event Class
Simply put, this sub-query calculates the amount of time spent in a particular wait class. Code snippet follows:
, unpivot_ash_event_class as /* "total time in wait class: <metric_name> (sec)" */
(
select sql_id, snap_id, 'total time in wait class: '|| metric_name || ' (sec)' metric_name , sum(delta_value) delta_value
from (select sql_id, snap_id, wait_class metric_name, 10 delta_value from ash where wait_class is not null)
group by sql_id, snap_id, metric_name
)
Unpivot ASH Objects
This sub-query is a bit tricky in practice because I’m embedding data into the metric_name (in prior sub-queries, the column name was the metric name). While this data is useful to the analysis, it poses a missing values problem that will require a specialized solution. Code snippet follows:
, unpivot_ash_objects as /* "total time in <object wait | sql plan step | executed by user>: <metric_name> (sec)" */
(
select sql_id, snap_id, 'total time in object wait: '|| metric_name || ' (sec)' metric_name , sum(delta_value) delta_value
from (select sql_id, snap_id,
object_type || ': ' || owner || '.' || object_name || ' (' || subobject_name || ')' metric_name
, 10 delta_value
FROM ash, DBA_OBJECTS WHERE data_object_id = CURRENT_OBJ#
)
group by sql_id, snap_id, metric_name
union
select sql_id, snap_id, 'total time in sql plan step: '|| metric_name || ' (sec)' metric_name , sum (delta_value) delta_value
from (select sql_id, snap_id,
'hash: ' || SQL_PLAN_HASH_VALUE || ' | line: ' || SQL_PLAN_LINE_ID || ' | operation: ' || SQL_PLAN_OPERATION || ' | options: ' || SQL_PLAN_OPTIONS metric_name
, 10 delta_value
FROM ash
)
group by sql_id, snap_id, metric_name
union /* USER_ID */
select sql_id, snap_id, 'total time executed by user: '|| metric_name || ' (sec)' metric_name, sum (delta_value) delta_value
from (select sql_id, snap_id,
nvl(username, 'SYSTEM') metric_name
, 10 delta_value
FROM ash, DBA_USERS du WHERE ash.user_id = du.user_id
)
group by sql_id, snap_id, metric_name
)
Union together all ASH Unpivots
Once all the ASH data of interest has been unpivoted, the next thing to do is union the unpivoted sets together so I can deal with them more easily as a single set. Code snippet follows:
, union_ash_data as /* UNIONing ASH Data */
(
select sql_id, snap_id, metric_name, delta_value from
(
select sql_id, snap_id, metric_name, delta_value from unpivot_ash_counts
union
select sql_id, snap_id, metric_name, delta_value from unpivot_ash_times
union
select sql_id, snap_id, metric_name, delta_value from unpivot_ash_bytes
union
select sql_id, snap_id, metric_name, delta_value from unpivot_ash_io
union
select sql_id, snap_id, metric_name, delta_value from unpivot_ash_distinct
union
select sql_id, snap_id, metric_name, delta_value from unpivot_ash_events
union
select sql_id, snap_id, metric_name, delta_value from unpivot_ash_event_class
union
select sql_id, snap_id, metric_name, delta_value from unpivot_ash_objects
)
)
Main Query to review the ASH Key-Value-Pairs
The only thing left to do is run the main query that produces the key-value-pairs for the ASH data.
select * from union_ash_data
order by sql_id, snap_id, metric_name
fetch first 50 rows only; -- only use this construct for testing
I’ve taken a screenshot of the data values produced after running the above query, which unpivots select dba_hist_active_sess_history columns. This is what I call the KVP format: for each snapshot interval and sql_id I have generated ~50 metrics, where metric_name is the key and delta_value is the value for that interval. With the data logically in the KVP format (i.e. no physical data transformation was done), it is much easier to compute on and compare the KVP metric values from the interval where the problem occurred to the normal interval; basically, I can do the comparison using one expression rather than n expressions (where n is the number of metrics instrumented - in this example ~30 metrics). Hopefully you are beginning to get a sense for how this technique allows me to easily scale the comparative analysis to many metrics.
As the code stands now, I will have to come up with a way of comparing metrics that may not exist in both the problem interval and the normal interval. For example, you may have noticed the metric “total time executed by user: RDCORNEJO (sec)”; if someone else executed the query in one of the intervals, that metric won’t exist in both intervals. A similar problem exists for other metrics that have data embedded in the name. I’ll have to come up with a better solution for these; for now, I’m thinking a straight comparison query that will show all mismatches in metrics. Alternatively, I can come up with a more sophisticated feature engineering technique to transform these non-numeric ASH data values into useful numeric metrics.
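To make the straight-comparison idea concrete, here is a rough sketch of the kind of mismatch check I have in mind. It assumes the sub-queries above (through union_ash_data) are reused in the same with clause, and the snap_id bind variables marking the problem and normal intervals are placeholders:
, problem_kvp as /* KVP totals for the problem interval */
(select metric_name, sum(delta_value) delta_value
from union_ash_data
where snap_id between :problem_begin_snap and :problem_end_snap
group by metric_name)
, normal_kvp as /* KVP totals for the normal interval */
(select metric_name, sum(delta_value) delta_value
from union_ash_data
where snap_id between :normal_begin_snap and :normal_end_snap
group by metric_name)
select coalesce(p.metric_name, n.metric_name) metric_name
, p.delta_value problem_value
, n.delta_value normal_value
from problem_kvp p
full outer join normal_kvp n on n.metric_name = p.metric_name
where p.metric_name is null or n.metric_name is null /* metrics present in only one interval */
order by 1;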
What’s Next?
As hinted at above, I expect some adjustments to the “unpivot_ash_<>” with clause sub-queries to better deal with the missing data problems encountered. Depending on the time I can put into this, I may have to defer these fixes [e.g. handling non-numeric column values such as the sql plan hash value; sql plan step/operation/options; P1, P2, P3 and their text; object#/file#/block#]. Also, I’d like to cover my plan for converting date/time variables into new discrete metrics that can be computed on more easily. For example, it should be interesting to see if I can compute, at the session level, the number of seconds since the last event (i.e. since the previous ASH sample for that session); my theory is that this should help by providing a surrogate metric for idle time, since idle events are not captured in ASH.
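To give a flavor of the kind of date/time feature engineering I have in mind (this is untested, and the column alias is just a placeholder), something like LAG over SAMPLE_TIME per session could yield a “seconds since the previous sample” metric:
select snap_id, sql_id, session_id, session_serial#, sample_time
, round( (cast(sample_time as date)
- cast(lag(sample_time) over (partition by session_id, session_serial# order by sample_time) as date)
) * 86400 ) as seconds_since_prior_sample /* possible surrogate for idle time between samples */
from dba_hist_active_sess_history
where sql_id = '988a1hx9zc5rr';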
More importantly, I’ve developed enough of the DOPA process on SQL-level metrics (let’s call it the beta-release) to actually run the anomaly detection on a problem interval compared to a normal interval and see what gets flagged up.
Again, thank you for your time and willingness to have me “drag you through the broken glass” so to speak; I hope you found this information useful. Stay tuned for more details.