Friday, March 30, 2012

Just What Are Blocking Reads and How Did They Get in My Database?

If you have been around Oracle for very long, I am sure you have heard that in Oracle readers don't block other readers and readers don't block writers. Of course, this refers to what happens within Oracle's locking structures and memory caches. What happens within the physical storage is a completely different story.

Even if you hand Oracle raw devices (which means the Oracle kernel handles the IO instead of passing it off to the operating system) you will still get blocking reads (and blocking writes) if you are using standard hard disk drives. Imagine it this way: a user initiates a full table scan with a multi-block read count of 64 and a block size of 8 kilobytes. The disk subsystem is a RAID10 with a stripe depth of 64 kilobytes going across 8 disks for a stripe width of 512 kilobytes. Each multiblock read is therefore 64 x 8 KB = 512 KB, a full stripe, so every read in the scan touches all 8 disks at once. What happens to everyone else's disk access while that full table scan of a multi-megabyte table is running? Look at Figure 1.



Figure 1: A Blocking Read

So, even if the second user in a blocking read situation is going after totally unrelated data, they have to wait until the first user finishes his read before they get access to the disks. Now multiply this by the number of full table scans in your system and the number of users trying to access data and you can see the issue. How can blocking reads be mitigated? Well, in a hard-drive-based system you need to align the stripe depth to the size read in a full table scan to minimize the number of disks involved in each read. Look at Figure 2.



Figure 2: Non-Blocking Read

In the situation in Figure 2 the disk stripe depth is aligned to the product of the db file multi-block read count setting and the block size. This allows each of the first user's reads to tie up only one disk at a time. Now when user two tries to get access, as long as they are going after data on a different disk, they can get to their data with no problem.
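As a quick sanity check, here is a minimal sketch of that arithmetic, pulled from the two standard initialization parameters:

-- full-scan read size in KB; align the stripe depth to this value
select max(decode(name, 'db_block_size', to_number(value))) *
       max(decode(name, 'db_file_multiblock_read_count', to_number(value)))
       / 1024 as full_scan_read_kb
from v$parameter
where name in ('db_block_size', 'db_file_multiblock_read_count');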

ASM uses a 1 megabyte stripe depth for data in a normal setup and 4 megabytes for an Exadata cell for just this purpose: to prevent reads from blocking other reads or writes. But why does this collision happen? The underlying cause is shown in Figure 3.




Figure 3: The Cause of Read Blocking

The ultimate cause of blocking reads and writes is the rotating platter and moving armature inside hard disks. The need to reposition the platters and armature to read or write individual data sectors blocks access to the disk during each discrete read or write by one user for all other users. Of course the only way to get rid of blocking reads and writes, when not accessing the same data block, is to use solid-state memory that has no need to rotate a platter or move an armature. If you could create a read "surface" that could read any point on the disk surface without repositioning the disk, you could get near memory speeds from disk based systems. Unfortunately you can't do this with current technologies, and, quite frankly, why would you want to? Modern flash technology provides read latency of 250 microseconds or less and, with a little help from DDR buffering and advanced write algorithms, write times of 85 microseconds or less.

But what if you don't want to replace your entire SAN system with $20/GB flash technology? Well, in my next blog entry next week I'll tell you how to eliminate read blocking and still benefit from flash while utilizing your existing SAN or disk technology.

Wednesday, March 28, 2012

Using AWR or Statspack for Tier Zero Object Placement Analysis

The SQL sections of either the Statspack or AWR report are valuable sources of information for determining which objects should be placed on Tier Zero. When combined with execution plan analysis from the V$SQL_PLAN view, the Statspack or AWR SQL areas can quickly isolate not only the most critical SQL statements but also their underlying objects. The first indication that moving to a Tier Zero storage layer would benefit your system is found in the "Load Profile" section of the Statspack or AWR report. Let's look at an example client report in Figure 1.



Figure 1 Load Profile from AWRRPT

The report excerpt in Figure 1 was taken from a client's OLTP-type database. Oracle is reporting 20.2K block IOPS against this setup, which is actually quite impressive. Of course most of these IOPS out of Oracle will be bundled at the controller, usually by a factor of 16, as can be seen in Figure 2 in the listing of total read and write IOs from the detailed statistics section of the report.



Figure 2 Total Read and Write Statistics from Instance Statistics

This concentration of IOs reduces the IOPS seen at the actual disks to around 750. Given that each disk is capable of between 90 and 150 random IOPS, we should be able to support about 2800 IOPS before we stress the disk subsystem, if all of the IOPS are non-colliding. Of course, with reads stomping on writes and writes stomping on reads, even this relatively low number of physical IOs can result in IO subsystem stress. If we want to know what kind of stress we are actually seeing, we need to compare CPU time and idle time next. Figure 3 shows the Operating System Statistics section of the same report as in Figure 2.
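If you want to cross-check Oracle's side of those numbers outside the AWR interval, here is a minimal sketch from the cumulative instance statistics (these are averages since instance startup, so peak rates will be higher):

-- average physical IO request rates since instance startup
select s.name,
       round(s.value / ((sysdate - i.startup_time) * 86400), 1) avg_iops
from v$sysstat s, v$instance i
where s.name in ('physical read total IO requests',
                 'physical write total IO requests');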



Figure 3: Operating System Statistics

In Figure 3 we see that the idle time is large in comparison to the busy time and that the IO wait time is more than half the busy time. But is the AWR report giving us the complete picture of the IO wait time it reports in this section? Let's look at the Instance Activity Statistics section again, shown in Figure 4.



Figure 4: User IOWait Statistic

This statistic indicates that IO wait time is actually greater than busy time, which means the CPUs are spending excessive time waiting for IOs to complete; the system is stressed at the IO subsystem. But how stressed is the system?

A key indicator of the amount of IO stress is IO latency. Disk latency is determined mostly by disk rotation speed. A 15K RPM disk should have a maximum latency of around 5 milliseconds, less if the disk is being "short stroked"; a short-stroked disk is one that is not allowed to get to more than 30% capacity, specifically to reduce IO latency. The Tablespace IO statistics section will give us an idea of what the Oracle system is seeing as far as latency in the IO subsystem. Figure 5 shows the Tablespace IO section of the report (I have excluded the tablespace names to protect the client's identity).
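If you want to see the same latencies outside of a report window, here is a minimal sketch against V$FILESTAT (READTIM is recorded in centiseconds and requires timed_statistics to be on):

-- average read latency per tablespace, in milliseconds
select d.tablespace_name,
       sum(f.phyrds) physical_reads,
       round(sum(f.readtim) * 10 / nullif(sum(f.phyrds), 0), 1) avg_read_ms
from v$filestat f, dba_data_files d
where f.file# = d.file_id
group by d.tablespace_name
order by 3 desc nulls last;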



Figure 5: Tablespace IO Report

From Figure 5 we can see that the read latency for our tablespaces is extremely high, 10-25 milliseconds (other than the first tablespace at 2 ms). These extremely high latencies definitely indicate that the IO subsystem where those tablespaces are located is being heavily stressed. High numbers of buffer waits also indicate long wait times for writes; the second tablespace is experiencing large numbers of buffer waits and each buffer wait is taking about 4 milliseconds to resolve. Other than moving all of the tablespaces, data tables, and indexes to Tier Zero, how can we determine the minimum number of objects that should be moved? Let's go back to our original report.

Going back to the beginning of the report, we can first examine the top wait events to see under which categories the IO stress occurs. Figure 6 shows the "Top Five Wait Events" for our data warehouse database report.



Figure 6: Top Five Wait Events

I'll bet you were expecting to see db file scattered read or db file sequential read as the top wait events, and you would have been correct in a non-partitioned, non-parallel query environment. However, this database is highly partitioned and uses table, index, and instance level (it is a 4-node RAC setup) parallel query. The direct path read wait event indicates asynchronous reads direct into the PGA, in this case because of the parallel queries being run by the system. The direct path read temp and direct path write temp events indicate that our PGA_AGGREGATE_TARGET may be insufficient for the size of the hashes and sorts being performed. So what does all this mean?

The direct path read waits are being driven by full table/partition scans. If we examine the "Segment Statistics" section of the AWR report we can see which objects are undergoing the most physical IO and, generally speaking, those will be the ones causing our direct path read waits. If we saw db file scattered reads or db file sequential reads we could look at the same report section to see which objects were most likely causing issues. Figure 7 shows the pertinent "Segment Statistics" sections of the report.



Figure 7: Segment Statistics Sections

Obviously the H_LINEITEM table is dominating our Segment Statistics. With only 8.1% of total IO being shown, the physical IO is evidently spread over a large number of objects, in this case probably more H_LINEITEM and H_ORDER partitions. From looking at all three sections we can see that the H_CUSTOMER, H_PART, and H_SUPPLIER tables are also seeing a great deal of full scans at the partition level. However, with less than 50% of any of the Segment Statistics actually being shown, we are probably missing a great deal of important data for determining what should be placed on Tier Zero. As a start, though, it looks like the following tables should be placed there:

• H_LINEITEM
• H_CUSTOMER
• H_PART
• H_PARTSUPP

In addition, the temporary tablespace stress indicates that the temporary tablespace datafiles should be moved there as well, assuming we cannot increase PGA_AGGREGATE_TARGET enough to relieve the temporary IO stress by moving the sorts and hashes into memory.
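Before moving anything, it is worth asking Oracle whether more PGA would actually help; here is a minimal sketch against the standard PGA advisory view:

-- estimated cache hit percentage for candidate PGA_AGGREGATE_TARGET sizes
select round(pga_target_for_estimate/1024/1024) pga_target_mb,
       estd_pga_cache_hit_percentage estd_hit_pct,
       estd_overalloc_count
from v$pga_target_advice
order by pga_target_for_estimate;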

Since less than half of the actual database activity is being recorded in the Segment Statistics area of the report, we need to look at the SQL area of the report to determine which SQL is causing the most stress. By analyzing the top 5-10 problem SQL statements, defined as the worst-performing statements in their area of the report, we can determine any additional objects that may need to move to Tier Zero or other high-speed, low-latency storage. Figure 8 shows the applicable top five SQL statements in the SQL physical reads section of the report.
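The same list can be pulled straight from the shared pool between AWR snapshots; here is a minimal sketch, assuming a release with SQL_ID available in V$SQLAREA (10g and later):

-- top ten statements currently cached, ranked by physical reads
select *
from (select sql_id, disk_reads, executions,
             round(disk_reads / greatest(executions, 1)) reads_per_exec
      from v$sqlarea
      order by disk_reads desc)
where rownum <= 10;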



Figure 8: Top 5 SQL Statements by Physical IOs

The full text of the top SQL statement for physical IOs from Figure 8, SQL ID 7zcfxggv196w2, with over 20 million reads, is shown in Figure 9:



Figure 9: Full Text for Top SQL Query

Just by examining the SQL in Figure 9 we can't really see much more than we knew before. However, if we generate an execution plan using the explain plan command, we get a much better look at which objects are actually being used by the query. Figure 10 shows the abbreviated execution plan for the query in Figure 9.
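Here is a minimal sketch of pulling that plan yourself; DBMS_XPLAN.DISPLAY_CURSOR reads it straight from the library cache using the SQL_ID reported in Figure 8:

-- plan for the top statement, straight from the cursor cache
select * from table(dbms_xplan.display_cursor('7zcfxggv196w2', null, 'BASIC'));
-- or, with the statement text in hand:
-- explain plan for <statement text>;
-- select * from table(dbms_xplan.display);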



Figure 10: Abbreviated Execution Plan for Top Query

In the execution plan only the objects accessed are shown; in this case we see the same tables from our full table scan list, in addition to the LINEITEM_IDX2 index. The H_NATION table is very small and is usually fully cached early in any processing cycle, so we can disregard it. In fact, if we query the V$SQL_PLAN view we find that the following indexes are also being heavily used by the other queries:
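Here is a minimal sketch of the kind of V$SQL_PLAN query that surfaces those heavily used indexes, assuming the schema owner is supplied at the prompt:

-- indexes referenced by plans currently in the SQL area, busiest first
select p.object_name index_name,
       count(distinct p.sql_id) statements_using_index
from v$sql_plan p
where p.operation = 'INDEX'
  and p.object_owner = upper('&owner')
group by p.object_name
order by statements_using_index desc;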




So after looking at IO statistics and physical waits and doing some SQL analysis, we have determined that the following objects should be placed on Tier Zero if space allows:

• Temporary tablespace
• H_LINEITEM table
• H_CUSTOMER table
• H_PART table
• H_PARTSUPP table
• H_ORDER table
• CUSTOMER_IDX1 index
• PARTSUPP_IDX2 index
• LINEITEM_IDX1 index
• LINEITEM_IDX2 index
The Results
What happens if we move the selected items from hard disk to Tier Zero storage (in this case solid state devices ranging from 0.2 ms latency for data and index areas to 0.015 ms latency for temporary areas)? The graph in Figure 11 shows the performance of 22 queries with the objects on the SSD arrays and on the original hard drives.



Figure 11 Results from Moving to Tier Zero

Figure 11 clearly shows the advantages of Tier Zero storage. On average the queries ran about 5 times faster on the Tier Zero storage, with some queries running almost 20 times faster. Also shown in Figure 11 are the results of moving just the temporary segments to Tier Zero; note the 4 queries (9, 13, 16 and 18) whose performance improved simply from moving the temporary tablespace. Of course moving all the suggested items to SSD gave the greatest improvement.

Summary

Through the analysis of IO profiles, wait events, and SQL object usage we were able to determine which objects would benefit from being moved to Tier Zero storage. The AWR reports and a few custom queries made that determination fairly easy. The final test of our selections, actually moving the objects to Tier Zero, provided a 5-fold increase in average performance, with over a 20-fold improvement in some queries.

Monday, March 26, 2012

What Do I Put on Tier Zero?

In the blog entry previous to this one we looked at tier zero usage in general. But how do we determine what should be placed on tier zero? The determination of the tables, indexes or other structures that should be placed on the often costly and low capacity (volume wise) Tier Zero storage can be a complex issue. There are generally three sources for analysis data:

1. Explain plans showing access patterns or SQL showing poor performance
2. Wait event analysis
3. Analysis of IO patterns of tablespaces

Of course if an object is small enough that it will be read once from storage and then reside in the SGA buffer cache, placing it on Tier Zero is a waste of Tier Zero space. So small tables that are easily cached (less than 5% of the SGA cache size) can be excluded from the list of tier zero candidates. In the next series of blogs let's look at each of these three sources and see how they can be used to determine optimal placement of assets on Tier Zero storage. We will start with the use of explain plans to determine tier zero eligibility.
Use of Explain Plans
Depending on your release of Oracle you may already have a very good source of explain plans built into your metadata. In Oracle9i and later releases the dynamic performance views include V$SQL_PLAN. The contents of the V$SQL_PLAN view are shown in Figure 1.



Figure 1: Contents of the V$SQL_PLAN Dynamic Performance View

If you have worked with explain plan and the plan_table then the contents of the V$SQL_PLAN view should be familiar; other than the ADDRESS, SQL_ID, and HASH_VALUE columns, the plan_table and the V$SQL_PLAN view are virtually identical.

The major difference between the plan_table table and the V$SQL_PLAN view is that the plan_table must be populated at the request of the user with the explain plan command and usually contains the plans for only a few specific SQL statements, while the V$SQL_PLAN view is automatically populated and contains the plans for all active SQL statements in the SGA SQL area.

Once your application has been running long enough to establish its working set of SQL, queries against the V$SQL_PLAN view can yield detailed information about which indexes are frequently used and which tables are accessed by inefficient (from the storage point of view) full or partial scans. An example SQL script to pull information about table and index access paths is shown in Figure 2.

rem Based on the V$SQL_PLAN view: totals the megabytes touched by table and
rem index access operations for one schema owner (prompts for &&owner)
col operation format a13
col object_name format a32
col options format a30
col fts_meg format 999,999,999.99
column dt new_value today noprint
select to_char(sysdate,'ddmonyyyyhh24miss') dt from dual;
set pages 55 lines 132 trims on
ttitle 'Table/Index Access for &&owner'
spool access&&today
select
  a.object_name, rtrim(a.operation) operation, a.options,
  sum((b.executions+1)*c.bytes)/(1024*1024) fts_meg
from
  v$sql_plan a, v$sqlarea b, dba_segments c
where
  a.object_owner=c.owner
  and a.object_name=c.segment_name
  and a.address=b.address
  and a.operation IN ('TABLE ACCESS','INDEX')
  and a.object_owner=upper('&&owner')
group by a.object_name, rtrim(a.operation), a.options
order by a.object_name, rtrim(a.operation), a.options
/
spool off
set pages 20
ttitle off

Figure 2: SQL Script to Extract Full Table and Index Accesses

An example report generated from the script in Figure 2 is shown in Listing 1.



Listing 1: Example Output of V$SQL_PLAN Report

To determine which objects are candidates for Tier Zero storage, look at the number of access paths that use the object, as shown by the OPTIONS column, and the amount of access, as shown by the FTS_MEG column. Objects with the greatest amount of access that are sized such that they are unlikely to fit in the SGA buffer area are excellent candidates for Tier Zero storage.

In the report in Listing 1 we see that, by overall volume of access, the H_LINEITEM table and its index LINEITEM_IDX2 are the first candidates for moving to Tier Zero. In fact, about the only objects which aren't candidates are the tables H_NATION, H_REGION and H_SUPPLIER and the index SUPPLIER_IDX1.

Of course if a table hasn't been accessed since startup, or its plan has been bumped out of the SQL cache, it won't show up in the report. It is suggested that the report be generated several times during the day, or over several days, to make sure all valid tables and indexes are accounted for. Once you have the list of candidates, review them for size; small tables or indexes that will be cached in the normal memory cache shouldn't be placed on tier zero. In fact, small tables and indexes that you know should be cached in memory should be placed into a designated KEEP cache, using DB_KEEP_CACHE_SIZE to size the cache and the ALTER TABLE or ALTER INDEX command to specify that the objects should use the keep cache, as sketched below.
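A minimal sketch, using the small H_NATION table from this example and a hypothetical index name:

-- size the KEEP pool (scope=both assumes an spfile is in use)
alter system set db_keep_cache_size = 512M scope=both;

-- pin the small, hot objects there; NATION_PK is a hypothetical index name
alter table h_nation storage (buffer_pool keep);
alter index nation_pk storage (buffer_pool keep);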

In a RAC environment you may need to run the report on multiple instances in the RAC cluster if access patterns are different based on instance usage.

In my next blog we will cover using statspack or AWR reports to determine tier zero eligibility.

Thursday, March 22, 2012

Using Tier Zero

It would be wonderful if we could put our entire database on fast, low latency storage. Imagine getting sub two hundred microsecond response times on the entire database. Well, we may soon be able to; in fact some are doing it now. Unfortunately not everyone can afford to replace all of their SAN assets with SSD just yet, so for many, tiering of data storage provides the answer. The lower the tier number, the higher the performance. Usually the lowest tier is tier zero, although some have proposed a tier -1...

Just what is "Tier Zero Storage"? No doubt you have heard the term applied to fast disks, Flash disks, and of course DDR-based SSD storage systems. Tier Zero is the fastest, lowest latency level of storage, reserved for the parts of your database or system architecture that need high speed (low latency) access to ensure the performance of the entire system.

Many systems suffer from IO contention, which results in high run queues, low CPU usage, and high IO waits. Increasing bandwidth usually is not a solution if the minimum latency for the existing storage media has already been reached.

Usually less than 10% of your database tables and indexes need the performance offered by Tier Zero storage, but the trick is picking which 10%. Many times we are sure that since table X or index Y is the most important it must go on Tier Zero; in many cases this is just not true.

Typically the speed increases and the latency decreases as you move from higher numbered tiers to lower numbered tiers up the latency pyramid, as shown in Figure 1.



Figure 1: Storage Latency Pyramid


What should go on Tier Zero?

The determination of the tables, indexes or other structures that should be placed on the often costly and low capacity (volume wise) Tier Zero storage can be a complex issue. There are three sources for analysis data:

1. Explain plans showing access patterns or SQL showing poor performance
2. Wait event analysis
3. Analysis of IO patterns of tablespaces

Of course if an object is small enough that it will be read once from storage and then reside in the SGA buffer space, placing it on Tier Zero is a waste of Tier Zero space.

Wednesday, March 21, 2012

Using StatspackAnalyzer.com

Part of my duties (Mike here) at TMS is managing the Statspack Analyzer website. The website management mostly consists of reviewing comments in the forum there and analyzing the Statspack and AWR reports that it can't handle. What can cause a report not to be handled by the website? Let's look at the possibilities.

1. The report contains Oracle generated errors
2. The report is in HTML format
3. The parser for one reason or another can't handle the report format

The first situation sends an error 4 code back to the user stating that an Oracle error has occurred. This means the parser found an Oracle error in the report, not that the site has an error. Usually this happens with Oracle10g or newer Statspack reports, and it comes from this bit of error stack somewhere in the report:

truncate table STATS$TEMP_SQLSTATS
*
ERROR at line 1:
ORA-00942: table or view does not exist

If you remove that bit of error text, your report will process in the parser just fine.

The next problem is that the parser, for now, can only handle text reports generated by the spreport.sql or awrrpt.sql scripts. It cannot handle HTML converted to text by an HTML converter. These report-generating SQL scripts are located in $ORACLE_HOME/rdbms/admin (or the equivalent on Windows) and can be run from SQL*Plus by any user that has DBA privileges. Do not spool the output; it must be the text file generated by the script itself. The scripts, unless you tell them otherwise, will write the reports into the directory from which you run the particular SQL script. Then you simply cut and paste the text from the report into the window on the StatspackAnalyzer site. I'm afraid we can't yet take the new RAC-based, multi-instance reports.
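For reference, a minimal sketch of generating the reports in the format the parser expects, run from SQL*Plus as a DBA user (?/ expands to $ORACLE_HOME):

SQL> @?/rdbms/admin/awrrpt.sql    -- answer 'text' when prompted for report_type
SQL> @?/rdbms/admin/spreport.sql  -- Statspack; output is already plain text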

The final thing which can prevent use of the site is a report that for one reason or another deviates from the standard Statspack or AWR format. For example, if you have set some of the optional reporting features in the Statspack or AWR setup tables or have customized the report output in any way, the result can be a report the parser doesn't recognize. Another issue can be that you are sending in a report that is just too new. Unfortunately we can't always keep the site up to date with all the new report formats Oracle comes out with, but please give us time!

So, if after looking this over and making any needed corrections on your side of the fence your AWR or Statspack reports still won't process, send them in using the link on the StatspackAnalyzer website and I will take a look and give you my recommendations.

Monday, March 19, 2012

Accelerating Your Existing Architecture

Very often we hear today that you need to completely throw away your existing disk based architecture and bring in the latest, greatest set of servers, disks, flash and who knows what else in order to get better performance, scalability and so forth. The new configurations pile on the storage, flash and CPUs and, of course, license fees. In some cases it has reached the point where you are paying 2-3 times the hardware cost in license fees alone!

What if you could double or triple your performance, not pay any additional license fees, and still be able to use your existing servers and disk storage? Would that make you a hero to your CFO or what? First let's discuss what would be required to perform this miracle.

In a perfect world we would use exactly the amount of storage we need, have instant access, use an exact amount of CPU, and hand the CPU to other processes when we are done, getting our work done as quickly as possible. Unfortunately what usually happens is that a process makes a request, the CPU issues an IO request, is told it has to wait, and so it sits there registering time as idle time. It is quite possible to have nearly idle CPUs and yet not be able to get any work done. This is usually due to IO wait conditions.

One major contributor to IO wait conditions is what I call read-poisoning. If all a disk had to do were writes, it would function very effectively, since writes can be optimized by controllers to be very efficient. Likewise, if all we did were reads, the disks would be happy and we could optimize for reads. Unfortunately we usually have a mixture of both reads and writes going to the same disks in the array, with reads usually outnumbering writes 4 to 1 (an 80/20 read-to-write ratio). With Oracle, anytime you slow down reads you will cause performance issues.

Oracle waits on reads to complete; it has to, unless the data is already stored in the DB or Flash caches. For most things Oracle is write agnostic. What do I mean by write agnostic? Oracle uses a process called delayed block cleanout whereby data is kept in memory until it absolutely has to be written, and when data is written it is usually done in batches. This is why Oracle doesn't report in AWR and Statspack reports the milliseconds it takes to write data; with a few exceptions, it really doesn't care!

When does Oracle need writes to be fast? When it is waiting on those writes! When does Oracle wait on writes? There are only a few instances when Oracle will be waiting on writes:

1. Redo log writes
2. Temporary tablespace writes
3. Undo tablespace writes (although these have been greatly reduced by in-memory undo)

Unless you are in a high-transaction environment like a stock exchange, redo writes rarely cause issues, and since the implementation of in-memory undo in Oracle10g, undo write issues have also faded into obscurity. Usually, when write waits do occur, temporary tablespace writes are what cause the most issues. Temporary tablespaces are used for:

1. Sorts
2. Hashes
3. Temporary table activity
4. Bitmap operations

Since more than sorts are done there, it is quite possible to have zero sort operations and yet have the temporary tablespace be a major source of IO; I have seen this with hash joins and temporary table operations.

So, to optimize an existing system we need to split reads off from writes and isolate the effects of large writes, such as temporary tablespace activity, from the general table and index storage. Luckily, Oracle11g R2 gives us a means to do this isolation of reads from writes. In Oracle11g R2 ASM there is the capability to designate a preferred-read failure group from the instances using ASM. Each instance can specify its own ASM preferred-read failure group within a specific disk group. This feature was intended to enable remote (relatively speaking) RAC instances to specify local storage for read activity to preserve performance. However, we can make use of this preferred-read failure group to optimize a single instance's performance. If you aren't using ASM, your disk management tool may also have this capability.

If we add a suitable amount of RamSan flash-based storage to a server's storage setup, we can specify that the flash half of a disk group in ASM (or another storage manager) be the preferred-read failure group. For example, let's put a RamSan-720 or RamSan-820 into an existing storage subsystem. The 720 and 820 provide no single point of failure within the devices themselves, so unless we just want the added security, there isn't a need to mirror them. The 720 comes in 6 or 12 terabyte SLC Flash configurations with 5 or 10 terabytes available after the high availability configuration is set. The 820 comes in 12 or 24 terabyte eMLC flash configurations with 10 or 20 terabytes available after HA configuration. Did I mention both of these are 1U rack mounts? Both of these units give sub 200 microsecond (0.2 millisecond) read times and sub 50 microsecond (0.05 millisecond) write times.

So, now we have, for argument's sake, a RamSan-820 with 20 terabytes and our existing SAN, of which we are using 10 terabytes for the database with the potential to grow to 15 terabytes over the next 3 years. We create a diskgroup (with the database still active, mind you, if we are currently using ASM) with the existing 15 terabytes of disk in one failure group and a 15 terabyte LUN we create on the RamSan in another. Once ASM finishes rebalancing the diskgroup, from the instance that is using the diskgroup we assign the diskgroup's RamSan failure group as the preferred-read mirror failure group.
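Here is a minimal sketch of that diskgroup creation, run on the ASM instance with hypothetical device paths; the failure group names match the HYBRID.DISK and HYBRID.SSD names used below:

create diskgroup hybrid normal redundancy
  failgroup "DISK" disk '/dev/mapper/san_lun01',
                        '/dev/mapper/san_lun02'
  failgroup "SSD"  disk '/dev/mapper/ramsan_lun01';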

SQL> alter system set ASM_PREFERRED_READ_FAILURE_GROUPS = 'HYBRID.SSD';

System altered.

Now we should see immediate performance improvements. But what about the 5 terabytes we have left on the RamSan? We use them for the redo logs, undo, and temporary tablespaces; all of these structures can be reassigned or rebuilt with the database up and running in most cases, as sketched below. This will provide high-speed writes (and reads) for the write-sensitive files, high-speed reads for data and indexes, and remove the read-poisoning from the existing disk-based SAN. Notice we did it without adding Oracle license fees! And, with any luck, we did it with zero or minimal downtime and no Oracle consulting fees!
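A minimal sketch, with hypothetical names, of giving the leftover RamSan space its own diskgroup and moving the write-sensitive structures onto it (the diskgroup is created on the ASM instance, the rest on the database):

create diskgroup fast_write external redundancy
  disk '/dev/mapper/ramsan_lun02';

alter database add logfile group 5 ('+FAST_WRITE') size 1G;

create temporary tablespace temp_ssd tempfile '+FAST_WRITE' size 100G;
alter database default temporary tablespace temp_ssd;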

Now the IO requests complete 10-20 or more times faster, which means the CPU spends less time waiting on IO and more time working. In tests using a RamSan-620 for 2,000,000 queries doing 14,000,000 IOs, the configuration using the RamSan as the preferred read mirror completed nearly ten times faster than the standard disk configuration.

When the test is run against the architecture with the preferred read failure group (PRG) set to HYBRID.DISK we see the following results:

• ~4000 IOPS per RAC node
o 16,000 IOPS total
• 12.25 minutes to complete with 4 nodes running (2m queries).
[oracle@opera1 ~]$ time ./spawn_50.sh

real: 12m15.434s
user: 0m5.464s
sys: 0m4.031s

When the test is run against the architecture with PRG set to HYBRID.SSD we see the following results:
• 40,000 IOPS per RAC node
o 160,000 total IOPS in this test
• 1.3 minutes to complete with 4 nodes running (2m queries).
[oracle@opera1 ~]$ time ./spawn_50.sh

real: 1m19.838s
user: 0m4.439s
sys: 0m3.215s

So, as you can see, you can optimize your existing architecture (assuming you have Oracle11g R2 or are using a disk manager that can do a preferred read mirror) to get 10-20 times the performance just by adding in a RamSan solid state storage appliance.

Friday, March 16, 2012

Cluster Headaches

In my SSD testing I use the standard benchmarks, TPC-C and TPC-H, to simulate OLTP and DSS/DWH environments. Instead of re-inventing the wheel, I use schema examples gleaned from tests on similar hardware that have been published at the http://www.tpc.org/ website.

In creating the TPC-C schema I used a schema model based on a successful TPC-C run on an HP platform. In this schema several of the tables were created as single-table or multi-table clusters. During the initial load I found that the multi-table cluster wouldn't load correctly, at least when loading via external tables, so I broke it into two indexed and referentially related tables. However, I left the single-table clusters alone, believing them to be more efficient.

In a single-table hash cluster the primary key values are hashed into a single set of blocks, making rows easier to look up: a simple hash function locates the blocks to scan without an index probe. This is supposed to be faster than an index lookup followed by a table lookup due to the hash lookup and the co-location of the primary key blocks.
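For reference, a minimal sketch of such a single-table hash cluster, with hypothetical column names and sizing:

create cluster customer_cluster (c_custkey number)
  size 300 single table hashkeys 1500000;

create table h_customer_hashed (
  c_custkey number,
  c_name    varchar2(25),
  c_address varchar2(40)
) cluster customer_cluster (c_custkey);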

I noticed during reloads that the single-table clusters took longer to load than non-clustered tables of a similar size. I decided to test whether the clustering was having a positive or negative effect on the performance of the database. In this test I replaced all clustered tables with table plus primary-key-index combinations and used the configuration that gave the best previous performance (no flash cache, no keep or recycle pools, a maximized db cache, and FIRST_ROWS(n) set to 1). The results are shown in Figure 1.



Figure 1: The Effect of Removing Table Clusters

Surprisingly, removing the clusters increased performance from a peak of 6435 tps to a peak of 7065 tps, nearly a 10% increase. This corresponds to a non-audited tpmC value of 197,378.310, which would be equivalent to a result from a system with around 200 disk drives; from my research I generally find about 1K tpmC per physical disk drive, depending on the amount of cache and the speed and type of disk used.

It appears that the SSD reduces latency to the point where disk-access-time-saving features such as table clustering may actually incur more overhead in processing than they save from the supposed reduction in IO.

Wednesday, March 14, 2012

Flash Over Disk

Since the test of the flash cache against an internal PCIe flash card proved inconclusive, I decided to have the lab hook up some disks and re-run the tests using a disk array containing 24 10K RPM 300 GB disks for the tables and indexes. The DB_CACHE_SIZE was increased to 50 GB and the DB_FLASH_CACHE_SIZE was set to 300 GB. Figure 1 shows the results for the disk array with and without a 300 GB flash cache.



Figure 1: Disk versus Disk plus Flash Cache Performance

As you can see from reviewing the graph, the flash cache definitely helped performance at all levels of our user range. It also showed that, with the same hardware, the sustained performance increase could be extrapolated to a larger number of users. So in the case of using a flash cache with disks, yes, performance is gained.

While running this test I had indications that over 160 gigabytes of data blocks were cached in the flash cache. Figure 2 shows the SQL script used to determine flash usage for a single schema owner, and Figure 3 shows an example of its output during the test runs.

set lines 132 pages 55
col object format a45
rem Counts cached buffers per object in V$BH, splitting buffers whose status
rem starts with 'flash' (in the flash cache) from those in the buffer cache
select owner||'.'||object_name object,
       sum(case when b.status like 'flash%' then 1 end) flash_blocks,
       sum(case when b.status like 'flash%' then 0 else 1 end) cache_blocks,
       count(*) total_cached_blocks
from v$bh b join dba_objects o
  on (objd=object_id)
where owner = upper('&owner')
group by owner, object_name
order by owner, 4 asc;

Figure 2: SQL Script to see cached objects for an owner



Figure 3: Example use of Flash Cache
Just to put things in perspective, let’s put the top pure-Flash database results against these disk and Flash cache results. Look at Figure 4.



Figure 4: Flash only, Disk Only and Disk plus Flash Cache Results

In reviewing Figure 4 you should first note that it is a logarithmic plot, which means each division on the left axis is a factor of 10. The figure shows that pure flash far outperforms even the best we can expect from a combination of flash and disk, in this case by nearly a factor of 7. The peak performance we obtained from disk combined with a flash cache was 1024 TPS, while the peak we obtained in our flash tests was over 7000 TPS. Even in previous testing with larger disk arrays, the peak performance I obtained from disk arrays was only in the 2000 TPS range, again showing that SSD technology is superior to an equivalent disk array.

Monday, March 12, 2012

Flash on Flash

Why exactly did Oracle create the Flash Cache concept? Generally speaking, flash is going to be faster than disk for data access, so putting a bit of flash into your server and using it as an L2 cache for the database buffer cache makes sense. In this test we aren't dealing with the Exadata cell-based flash cache, but with the DB, server-based flash cache.

But what happens when disk isn't the storage media? Let's look at a flash-on-flash test case using a RamSan-630 flash solid state storage appliance and a RamSan-70 PCIe server-mounted flash card.



Figure 1: Test Configuration

The flash cache was sized at the lower end of the suggested 2X to 10X of the database cache size (90 GB), and then a second run was made with the flash cache set to zero. Note that for the first run the appropriate tables and indexes were assigned to be kept in the flash cache, while the other tables were left at the default. Figure 2 shows the results of using the Smart Flash Cache with flash as the storage.
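A minimal sketch of that flash cache setup, with a hypothetical device path and object names:

-- point the DB flash cache at the PCIe card (instance restart required)
alter system set db_flash_cache_file = '/dev/ramsan70_lun01' scope=spfile;
alter system set db_flash_cache_size = 90G scope=spfile;

-- ask for the hot objects to be retained in the flash cache
alter table orders storage (flash_cache keep);
alter index orders_pk storage (flash_cache keep);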



Figure 2: TPS versus GB in the Flash Cache

At least for our testing, with the database on a RamSan-630 SSD and the flash cache placed on a RamSan-70 PCIe card, the results are not encouraging for the use of the flash cache with a flash-based SAN. Review of the AWR results showed that the flash cache was indeed being used, but due to the small difference in overall latency between the RS630 with IB interfaces and the RS70 in the PCIe slot, the overall effect of the flash cache was negligible. According to the AWR results, when the flash cache was set to zero the predominant wait event was db file sequential read; when the flash cache was set to 90 GB, the db flash cache single block physical read event dominated the report, showing that the cache was in fact being used.

These results demonstrate that for a database system that is based on high-speed flash storage, the DB flash cache will not be needed.