Monday, March 19, 2012

Accelerating Your Existing Architecture

Very often today we hear that you need to completely throw away your existing disk-based architecture and bring in the latest, greatest set of servers, disks, flash and who knows what else in order to get better performance, scalability and so forth. The new configurations pile on the storage, flash and CPUs and, of course, the license fees. In some cases it has reached the point where you are paying 2-3 times the hardware cost in license fees alone!

What if you could double or triple your performance, not pay any additional license fees and still be able to use your existing servers and disk storage? Would that make you a hero to your CFO or what? First, let's discuss what would be required to pull off this miracle.

In a perfect world a process would use exactly the amount of storage it needs, have instant access to it, consume exactly the amount of CPU it requires and then hand the CPU to other processes as soon as it is done, getting its work finished as quickly as possible. Unfortunately, what usually happens is that a process makes a request, the CPU issues an IO request, is told it has to wait and so it spins, registering the time as idle. It is quite possible to have nearly idle CPUs and yet not be able to get any work done; this is usually due to IO wait conditions.

One major contributor to IO wait conditions is what I call read-poisoning. If all a disk had to do was writes, it would function very effectively, since writes can be optimized by controllers to be very efficient. Likewise, if all we did were reads, the disks would be happy and we could optimize for reads. Unfortunately, we usually have a mixture of reads and writes going to the same disks in the array, with reads usually outnumbering writes 4 to 1 (an 80/20 ratio of reads to writes). With Oracle, anytime you slow down reads you will cause performance issues.

Oracle waits on reads to complete; it has to, unless the data is already stored in the DB or Flash caches. For most things, Oracle is write agnostic. What do I mean by write agnostic? Oracle uses a process called delayed block cleanout whereby data is kept in memory until it absolutely has to be written, and when data is written it is usually done in batches. This is why Oracle doesn't report in AWR and Statspack reports the milliseconds it takes to write data; with a few exceptions, it really doesn't care!

When does Oracle need writes to be fast? When it is waiting on those writes! When does Oracle wait on writes? There are only a few instances when Oracle will be waiting on writes:

1. Redo log writes
2. Temporary tablespace writes
3. Undo tablespace writes (although these have been greatly reduced by in-memory undo)
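
A quick way to check whether your own instance is actually waiting on writes is to look at the system-level wait events. The following is a minimal sketch against V$SYSTEM_EVENT; exact event names can vary a bit by version:

-- Write-related wait events, worst offenders first (run as a DBA user)
SELECT event,
       total_waits,
       time_waited_micro / 1000000 AS seconds_waited
  FROM v$system_event
 WHERE event IN ('log file sync',
                 'log file parallel write',
                 'direct path write temp',
                 'db file parallel write')
 ORDER BY time_waited_micro DESC;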

Unless you are in a high-transaction environment like a stock exchange, redo writes rarely cause issues, and since the implementation of in-memory undo in Oracle10g, undo write issues have also faded into obscurity. That usually leaves temporary tablespace writes as the source of the most trouble when write issues do occur. Temporary tablespaces are used for:

1. Sorts
2. Hashes
3. Temporary table activity
4. Bitmap operations

Since more than sorts are done there, it is quite possible to have zero sort operations and yet have the temporary tablespace be a major source of IO; I have seen this with hash joins and temporary table operations.
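
If you want to see what is actually consuming your temporary tablespace right now, a sketch like the following against V$TEMPSEG_USAGE breaks the usage down by segment type (SORT, HASH, DATA for global temporary tables, and so on):

-- Current temp space usage grouped by tablespace and segment type
SELECT tablespace,
       segtype,
       SUM(blocks) AS blocks_in_use
  FROM v$tempseg_usage
 GROUP BY tablespace, segtype
 ORDER BY blocks_in_use DESC;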

So, to optimize an existing system we would need to split off reads from writes and isolate the effects of large writes, such as temporary tablespace activity, from the general table and index storage. Luckily, Oracle11g R2 gives us a means to do this isolation of reads from writes. In Oracle11g R2 ASM there is the capability for a preferred-read failure group designation from the instances using ASM. Each instance can specify its own ASM preferred-read failure group within a specific disk group. This feature was intended to let remote (relatively speaking) RAC instances specify local storage for read activity to preserve performance. However, we can make use of this preferred-read failure group to optimize a single instance's performance. If you aren't using ASM, your disk management tool may also have this capability.
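
Before changing anything, it helps to see how your disks currently map to failure groups. The following is a minimal sketch, typically run from the ASM instance; the diskgroup and failure group names you see will of course be your own:

-- List each disk, its failure group and its preferred-read flag per diskgroup
SELECT dg.name AS diskgroup,
       d.failgroup,
       d.name AS disk_name,
       d.preferred_read,
       d.total_mb
  FROM v$asm_disk d
  JOIN v$asm_diskgroup dg ON dg.group_number = d.group_number
 ORDER BY dg.name, d.failgroup, d.name;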

If we add a suitable amount of RamSan flash-based storage to a server's storage setup, we can specify that the flash half of a disk group in ASM or another storage manager be the preferred-read failure group. For example, let's put a RamSan-720 or RamSan-820 into an existing storage subsystem. The 720 and 820 provide no single point of failure within the devices themselves, so unless we just want the added security there isn't a need to mirror them. The 720 comes in 6 or 12 terabyte SLC flash configurations with 5 or 10 terabytes available after the high-availability configuration is set. The 820 comes in 12 or 24 terabyte eMLC flash configurations with 10 or 20 terabytes available after HA configuration. Did I mention both of these are 1U rack mounts? Both of these units give sub-200-microsecond (0.2 millisecond) read times and sub-50-microsecond (0.05 millisecond) write times.

So, now we have, for argument's sake, a RamSan-820 with 20 terabytes and our existing SAN, of which we are using 10 terabytes for the database with the potential to grow to 15 terabytes over the next 3 years. We create a diskgroup (with the database still active, mind you, if we are currently using ASM) with the existing 15 terabytes of disk in one failure group and a 15 terabyte LUN we created on the RamSan in another. Once ASM finishes rebalancing the diskgroup, from the instance that is using the diskgroup we assign the diskgroup's RamSan failure group as the preferred-read mirror failure group.
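
A sketch of what that diskgroup creation might look like from the ASM instance; the HYBRID, DISK and SSD names and the LUN paths are purely illustrative:

-- Normal redundancy mirrors each extent across the two failure groups
CREATE DISKGROUP hybrid NORMAL REDUNDANCY
  FAILGROUP disk DISK '/dev/mapper/san_lun01',
                      '/dev/mapper/san_lun02'
  FAILGROUP ssd  DISK '/dev/mapper/ramsan_lun01';

The preferred-read assignment itself is then a single command from the instance using the diskgroup: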

SQL> alter system set ASM_PREFERRED_READ_FAILURE_GROUPS = 'HYBRID.SSD';

System altered.
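
To confirm the setting is doing its job, you can watch the read counters per failure group; after a while under load, reads should be accumulating against the SSD failure group while writes remain mirrored to both. A minimal sketch, again assuming the HYBRID diskgroup name used above:

-- Read/write activity per failure group for the HYBRID diskgroup
SELECT d.failgroup,
       SUM(d.reads)  AS reads,
       SUM(d.writes) AS writes
  FROM v$asm_disk d
 WHERE d.group_number = (SELECT group_number
                           FROM v$asm_diskgroup
                          WHERE name = 'HYBRID')
 GROUP BY d.failgroup;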

Now we should see immediate performance improvements. But what about the 5 terabytes we have left on the RamSan? We use them for the redo logs, undo and temporary tablespaces. All of these structures can be reassigned or rebuilt with the database up and running in most cases, as sketched below. This provides high-speed writes (and reads) for the write-sensitive files, high-speed reads for data and indexes, and removes the read-poisoning from the existing disk-based SAN. Notice we did it without adding Oracle license fees! And, with any luck, we did it with zero or minimal downtime and no Oracle consulting fees!
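
As a rough sketch, assuming the leftover flash capacity is carved into its own diskgroup (called +FLASH here purely for illustration, with sizes picked out of the air), the moves might look like this:

-- New redo log groups on the flash-only diskgroup; drop the old groups
-- one by one once they go INACTIVE and have been archived
ALTER DATABASE ADD LOGFILE GROUP 11 ('+FLASH') SIZE 1G;
ALTER DATABASE ADD LOGFILE GROUP 12 ('+FLASH') SIZE 1G;

-- New temporary tablespace on flash, made the database default
CREATE TEMPORARY TABLESPACE temp_ssd TEMPFILE '+FLASH' SIZE 100G;
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp_ssd;

-- New undo tablespace on flash, then switch the instance over to it
CREATE UNDO TABLESPACE undo_ssd DATAFILE '+FLASH' SIZE 100G;
ALTER SYSTEM SET undo_tablespace = 'UNDO_SSD';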

Now the IO requests complete 10-20 or more times faster, which means the CPU spends less time waiting on IO and more time working. In tests using a RamSan-620 for 2,000,000 queries doing 14,000,000 IOs, the configuration using the RamSan as the preferred read mirror completed nearly 10 times faster than the standard disk configuration.

When the test is run against the architecture with the preferred-read failure group (PRG) set to HYBRID.DISK we see the following results:

• ~4,000 IOPS per RAC node
    o 16,000 IOPS total
• 12.25 minutes to complete with 4 nodes running (2M queries).
[oracle@opera1 ~]$ time ./spawn_50.sh

real: 12m15.434s
user: 0m5.464s
sys: 0m4.031s

When the test is run against the architecture with PRG set to HYBRID.SSD we see the following results:

• 40,000 IOPS per RAC node
    o 160,000 IOPS total in this test
• 1.3 minutes to complete with 4 nodes running (2M queries).
[oracle@opera1 ~]$ time ./spawn_50.sh

real: 1m19.838s
user: 0m4.439s
sys: 0m3.215s

So, as you can see, you can optimize your existing architecture (assuming you have Oracle11g R2 or are using a disk manager that can do a preferred-read mirror) to get 10-20 times the performance just by adding a RamSan solid state storage appliance.

2 comments:

  1. How are you doing this?

    "We create a diskgroup (with the database still active mind you if we are currently using ASM) with the existing 15 terabytes of disk in one failure group and the 15 terabyte LUN we created on the RamSan in another. "


    Are you adding the RamSan disks to the existing diskgroup on which the data currently resides?

  2. If you have a 15 TB diskgroup already set up, even with existing failure groups, you can add the 15 TB SSD LUN as another failure group. Then, one disk at a time, drop and re-add the existing disks to move them into the other failure group. When you are finished you will have two failure groups in the same diskgroup, one pure hard disk based and the other pure SSD based. Oracle ASM will rebalance as needed to make the adjustments.
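
    A rough sketch of those steps, assuming a normal-redundancy diskgroup and with purely illustrative disk names and LUN paths:

    -- Add the SSD LUN to the existing diskgroup as a new failure group
    ALTER DISKGROUP hybrid ADD FAILGROUP ssd DISK '/dev/mapper/ramsan_lun01';

    -- Then, one at a time, drop each existing disk and re-add it into the
    -- hard-disk failure group, waiting for the rebalance (V$ASM_OPERATION)
    -- to finish between steps
    ALTER DISKGROUP hybrid DROP DISK san_disk01 REBALANCE POWER 4;
    ALTER DISKGROUP hybrid ADD FAILGROUP disk DISK '/dev/mapper/san_lun01' REBALANCE POWER 4;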
