Texas Memory Systems' Blog: April 2012

Thursday, April 26, 2012

Final Day of Collaborate 2012

Well, today was a half day. After seeing the wife and daughter off at 6am I did some email work, then ate breakfast and came down to the final sessions. At 12:15 I participated in the Rise of the ExaMachine debate on the con side. It was an interesting debate since the other side didn't really come prepared for a debate. They kept essentially carping on the same tired arguments and doing veiled personal attacks (when you can't attack the argument, attack the speaker). Anyway some good ideas where put forth by both sides and in general a good time was had by the audience which was what it was all about. So ends the 2012 Collaborate conference. Next year it will be in Denver. Now, what to do in Las Vegas until my 10pm flight home?

Wednesday, April 25, 2012

Day Three of Collaborate 2012

Well, at the half-way point of day three here at the booth. I just finished at the Virtualization SIG Expert panel. Day two was interesting in that the keynote by Oracle CIO Mark Sunday talking about how Oracle handles IT transformation internally. Since Oracle absorbs several companies a year (sometimes it seems per day!) the ability to rapidly absorb the best parts of their products, structures and capabilities is a vital component of how Oracle moves ahead. Interestingly enough he used the comparison I used for the Exadata with Oracle as the Borg (for those non-trekies the Borg are a race of half-machine-have sentient beings that absorb other civilizations by injecting them with nano-bots that convert them into Borgs, very appropo eh?) here is the slide I use in my Exadata: A Critical Review presentation:

So I had to laugh a little when he said the very same thing that Oracle was like the Borg.

Of the entire keynote I found the closing ten points list the most interesting. I wish I would have jotted them down, but the one that stuck with me most was that you can recover from a poor decision but not from no-decision. Actually I have always found the people who won't make a decision are actually using passive aggression to force you to make a decision for them so they will have no culpability should the decision go bad.

Tomorrow I participate in the closing session debate about the pros and cons of Exadata, should be an interesting time!

Monday, April 23, 2012

Keynote Collaborate 2012

Well, here I am at the Keynote given by Captain Mark Kelly Commander of the Space Shuttle Endeavour’s Final Mission. I can't help but think of my desires to be an astronaut when I was a pre-teen watching all the Apollo launches and the moon landings. But what if I had?

If I had become an astronaut 25 Oracle books probably wouldn't have been written, thousands of DBAs wouldn't have (hopefully) benefited from my mistakes as most of the books were the product of learning from mistakes. If I had become an astronaut I don't know if I would have had the same children I do now, or stayed married to the same wife. One decision and your whole life changes.

Of course, if you have attended one of my presentations you have probably heard the story of how I got started in Oracle. You see, I owe my Oracle career to the space program. In 1989 TVA hired Marvin Runyon (responsible for gutting the Post Office in the name of efficiency in his previous job) to streamline TVA. Every six months they laid off about 2-3000 people once he took the reins. In the third round of laying off all the non-union people with low seniority (i.e. my description) my boss called me into her office, she told me that I shouldn't turn down any job offers (her way of telling my my number was up.)

At that time I had experience with Informix (minor) and Ingre (several classes, 3 years of experience) and zero Oracle experience. I could spell Oracle (barely). Meanwhile NASA had experienced the Challenger disaster because of the effects of sub-zero temperatures on the e-ring in the boosters. NASA instituted the ASRM project (Advanced Solid Rocket Motor) in Iuka, Mississippi. The database was Oracle, the language Cobol. I used Fortan and Ingres, obviously a match made in heaven! Anyway, I sent my resume not expecting to be hired. No, I didn't lie, nowhere did I say I knew Oracle, just relational concepts and Ingres and Informix. I was very surprised when they called me and arranged for me to come to Iuka for an interview.

Iuka, for those who have no clue where it is, resides in the corners region of Mississippi (where Tennessee, Alabama and Mississippi all come together). They moved a bunch of Sacramento, California to Iuka, Mississippi. Talk about culture shock! At least I was coming from Hixson, Tennessee. I bought and read the only available Oracle book on the way to Iuka. Anyway I got the job, thank God for Oracle and DEC training credits! I think I got the job because I was the only one that didn't run screaming when they saw Iuka.

Well, enough rambling. So even though I didn't get to be an astronaut, the space program was directly responsible for my Oracle career.

Sunday, April 22, 2012

One Down, Three to Go! Collaborate 2012

finished up my Deep Dive on Oracle IO. I had about 35 attendees and everyone seemed to enjoy the presentation. The audience put forward many good questions. I started setting up the booth afterward but realized you needed at least 2 people, so I will wait until the other booth folks arrive.

Attendance looks good this year if the registration lines were any indication. The line for the A-B group of names went across the main hall at the conference center and along the wall and this was for "early" registration, I shudder at how long it will be tomorrow! I missed the Virtualization SIG meet and greet but hopefully I will make the rest of our events. To tell you the truth my feet were just too sore to stand up for another hour.

Look for another entry tomorrow after the official day 1of the conference.

Thursday, April 19, 2012

ASM Features best used with SSD

ASM in Oracle 11g includes advances that can exploit the performance of SSDs. The features of Preferred Mirror Read and Fast Mirror Resync are the two most prominent features that fit in this category.

Preferred Read Failure Groups

The Preferred Read is not a new idea, but is implemented in Oracle’s volume management in Oracle 11g. The concept is to read from the storage that can present the needed data at a lower latency. Initially, this was designed for WAN or site-specific storage in order to avoid higher-latency site connections. By restricting data reads to the local storage, the application would be able to service requests at nominal read speeds while writes were the only communication needed to traverse the long haul site link. This is a feature that is available to most Operating Systems with their included volume manager and as a feature to Symantec/Veritas through the title Preferred Plex.

As each node of a cluster runs its own instance of ASM, each node can also define its own preferred read failgroup. This allows for each node to operate at top performance with its local storage.

The Preferred Read Failure Groups are assigned in the initialization parameter ASM_PREFERRED_READ_FAILURE_GROUPS. The syntax for the failgroups is designated by DISKGROUP.FAILGROUP and comma-separated if working with multiple diskgroups.

The advantage of Preferred Reads for SSD is not in avoiding the intersite links. Since the concept of preferred reads is to read from a lower latency mirror, the strategy of preferring the reads from a SSD mirrored to HDD can also be employed. This allows SSDs to be deployed with full redundancy at a much lower cost by mirroring to HDD.

The preferred read allows for overall increases in performance. The read performance of course will be serviced at the speed of the SSD where SSDs excel. The writes on the other hand, will be as fast as the slowest disk member. This not a problem for tables and indexes as only reads result in foreground waits, while writes are performed in the background by the lazy writer. Also, since there is no longer a read+write contention on the disk, the disks can now focus performance on solely doing writes. This also emphasizes the gains from write-cached arrays as it allows the disk to service the writes in bursts from Oracle’s lazy writer. This architecture is not recommended for redo, undo, or temp where writes result in foreground waits.

The Preferred Read does not change the application’s read/write ratio. In order for the application to scale with the SSD, the writes must not become the bottleneck prematurely. An excellent option is to deploy a small diskgroup composed of RAM SSDs for the database logs, and deploy FLASH SSDs alongside disks with the preferred read mirror setting for the tables and indexes. This way all of the blocking I/O requests are served from SSD and the HDD merely provides inexpensive redundancy.

Fast Mirror Resync

Oracle 11g includes the Fast Mirror Resync which tracks the changed extents for a given window of repair, a default value of 3.6 hours. If a disk goes offline, then ASM will track the extents associated with the disk. When the disk is recovered and brought online (knowing that all previous data is still intact) the list of changed extents are applied to the recovered disk. This is extremely useful when working with very large LUNs.

For alls systems (SSDs and spinning disks) that undergo offline maintenance or code updates, the Fast Mirror Resync reduces maintenance impact dramatically. The system can be taken offline and quickly serviced and then added back to ASM and only the extents that changed while the service was performed are synchronized. Without the Fast Mirror Resync, ASM would have to completely rebuild the mirror which could take a long time.

The window for disk recovery can be set at a DiskGroup scope with the DISK_REPAIR_TIME attribute or it can be manually set when offlining a disk for scheduled maintenance. If a system is scheduled for 1 hour of maintenance, be sure to allocate enough time for unforeseen issues with a larger maintenance window and longer “DROP AFTER” clause.

ALTER DISKGROUP TIER0 OFFLINE DISK SSD01 DROP AFTER 4h;

This would allow for a recovery window of 4 hours for the expected 30 minute maintenance. When the LUN is reinstated with the ONLINE DISK command, all changed extents are rebalanced to the recovered disk.

Conclusion

The high availability, performance gain, and ease of management lead ASM to a win for storage management. Preventing hot spots on disks intuitively leads to lower mechanical costs. Ease of management incurs lower operations costs. With higher performance, the fundamental cost per transaction is also reduced, and, by using an industry-leading database engine to serve storage, security and availability are not compromised.

The fundamental compliment of ASM with tiered storage gets more value to the investment. SSD technology with ASM management can also ensure uptime without interruption to day-to-day operations, even in times of maintenance. Higher numbers of transactions through lower latency and higher throughput from unburdened disk storage illustrates the raw performance gain of using ASM with tiered storage.

Friday, April 13, 2012

Oracle Guidelines for ASM

There are many suggestions that Oracle has provided in order to achieve the greatest benefits from ASM. Overall, ASM is a very favorable and proven application. The design of ASM is still progressing, it was a new feature implemented in Oracle 10g and has developed more of its promise in 11g. Remember that a guideline is just that, a guide, not necessarily applicable to your situation, always take any best practice with a grain of salt. The following are considered good practices when using ASM:

1.) Separate the ASM Home from that of any normal database

Since ASM does run as a lightweight and independent database it should have a separate $ORACLE_HOME or $HOME. When using the Oracle Universal Installer (OUI) it places ASM in an independent home automatically. For best practice, the ASM home directory should be kept separate from that of any other database, despite that database’s dependency. By default, when using the DBCA, creating an initial database for ASM will result in a separate ASM directory. To initialize ASM with its own $ORACLE_HOME or $HOME directory, use the Universal Installer and follow the steps from the Oracle Database Installation Guide regarding Installing ASM.

By using a separate $ORACLE_HOME or $HOME for ASM, DBAs can manage the database(s) dependent upon ASM with upgrades, backups and recovery without having to interfere with ASM operations. Also, if multiple databases are reliant upon ASM while one database shares the $ORACLE_HOME or $HOME with ASM, management of the databases becomes exceptionally complicated.

For ASM management, the ASM database can be patched or upgraded independently of the dependent databases as long as home directories are separate. The backwards compatibility of ASM ensures that these steps can proceed without having to migrate storage between upgrades.

2.) Use homogenous diskgroups

Every disk in a diskgroup is treated the same (with the exception of the “Preferred Mirror Read” discussed in the next section). Therefore, high-speed drives will only exhibit their performance if grouped with other high-speed drives. Stripe performance of a hybrid diskgroup will be virtually equivalent to that of a diskgroup of the slowest type disk. This is known as the convoy affect.

3.) Assign task-specific diskgroups for tiered storage

Although ASM makes exceptionally efficient use of available disks, performance characteristics are only chartered through the ASM templates with extent size. Oracle suggests using a limited number, 2 specifically, of diskgroups. In smaller deployments, limiting the number of diskgroups but increasing the number of disk members will highlight the performance of ASM. In large-scale deployments, however, tiered storage with task-specific diskgroups will provide the greatest performance. With the RAM-based SSD as the premier solution for the hot-files atop the tiered storage pyramid, it is necessary to isolate the system into its own diskgroup. Frequently accessed data files should be placed on Flash-based storage, in its own diskgroup, with high performance disk based storage used for the remainder. Again, the performance of these drives is not exhibited if grouped with lower speed storage.

Archive and recovery storage should be kept on a large deployment of slower disks, potentially non-SCSI disks. ASM can provide significant bandwidth of data written to these drives, given the sequential data patterns of the archive storage in Oracle. With a sufficiently large diskgroup, the capacity, availability, and performance requirements can be met for the readily available data files of large-scaled databases.

This strategy conflicts with the concept of a minimal number of diskgroups, but it can be quite favorable to the performance of the database, especially when multiple classes of storage are available. To host multiple databases on a single ASM instance, this strategy can be scaled further to provide an even larger diskgroup for the data files and isolated log file diskgroups for each database.

4.) Take advantage of multipathing

Multipathing provides performance gains in throughput and data protection against path failure. Almost all multipathing solutions available to each Operating System are thoroughly tested and provide full compatibility with ASM.

Taking advantage of cache-enabled SANs can offer better response times than disks can deliver. In such cases, it is important to note the precautions of such a practice; cache front-ends can often fill quickly and as applications grow performance can suffer. In such an event, many disk systems will suffer poorer than average performance because the clients, as well as the cache contents, are continually reading/writing to the spindle. In write-intensive or low latency applications, be wary of using cache-enabled SANs because the de-staging process can be detrimental to the clients and customers.

5.) Allocate enough reserve capacity for groups where disk failures are probable

In the event that a disk failure occurs in a diskgroup with mirroring enabled, a second failure could result it lost data. In order to minimize this risk ASM will attempt to get back to full redundancy as quickly as possible, but space must be available on all other members of the failgroup to accommodate the capacity of the failed disk. This type of protection can be ensured as long as the free space in a diskgroup (taking into account the redundancy level) is greater than the size of a single disk.

For example: Say six 73GB disks are used in a single diskgroup (438 GB total) with normal redundancy (mirroring), and each disk is be holding 50GB of ASM data, leaving approximately 23GB of unused capacity. If a disk member fails, then ASM will immediately rebalance and create new backup extents in order to maintain normal redundancy. The 50GB of ASM data that is no longer available is repopulated evenly amongst the remaining drives in the failgroup. This results in each of the remaining drives using an additional 10GB of data (50GB divided by the remaining 5 disks). The free space available (23*6 = 138 GB) was greater than the size of a single disk so redundancy can quickly be ensured. However, the redundancy will be lost if a second disk fails before a replacement is added. After the first rebalancing each disk member containing now contains 60GB of ASM data and 13 GB of free space (13 * 5 = 65 GB total – less than the size of a single disk). If a 2nd disk were to fail, there would not be sufficient space available to allocate 15GB (60GB divided by remaining 4 disks) to each disk.

6.) Use External Redundancy for diskgroups from Arrays that Provide Mirroring and Replication

Duplicating data onto an already-duplicated storage medium can be overly redundant. This protects against a catastrophic failure, but array level replication may be providing protection against this already. The external redundancy is in effect disabling using ASM to provide redundancy.

7.) The default initialization parameters are fine, except for processes

Set processes according to this formula:

processes = 40 + (10 + [max number of concurrent database file creations, and file extend operations possible])*n

Where n is the number of databases connecting to ASM (ASM clients).

By following these best practices, the highest performance can be achieved using ASM.

Wednesday, April 11, 2012

An Oracle ASM Primer

Oracle Automatic Storage Manager (ASM) is a mature product capable of replacing high-cost disk management software for most Oracle applications. In this first blog of a series on ASM we will review ASM basics.

Automatic Storage Management (ASM) is an all-inclusive approach to storage management, performance, and availability. ASM is an excellent tool for managing mixed storage environments where both SSD and HDD technologies are being used. It is an intrinsic block manager database that balances file access across disks and disk subsystems to eliminate hotspots and increase storage efficiency. By striping extents across diskgroup members and mirroring across failgroups, ASM can generate RAID performance and protection at the file-level of databases. ASM can also be used in conjunction with high-end disk arrays for mixed-storage file management.

ASM operates as a lightweight Oracle database and offers the stability and reliability of a standard Oracle database. Therefore, multiple databases can be clients to a single ASM instance. This will allow a single pool of disks to be efficiently used for various databases. In addition, ASM can be used in a RAC environment where ASM failover can occur nearly seamlessly between active nodes, permitting non-stop uptime for storage management.

ASM does not have to replace, but can complement, disk subsystems and arrangements for efficient use in specific files. ASM will not automatically separate storage areas in diskgroups for log files, archived backups, or database files. ASM templates assist in the striping performance of each type of file, but sequential I/O is not differentiated from random I/O. It is still the duty of the architect to specify storage for specific roles. Sequential access log files should traditionally be held separate from the random-I/O generated by database files. This is not a flaw in ASM but a continuing practice of tiered storage and disk speeds. ASM is most effective when diskgroups are allocated with single homogenous disks as their components.

The use of high-speed SSD highlights the benefits of ASM. When multiple devices are available, the striping performance of ASM can linearly increase the throughput of the storage, reducing the response time to the client. In situations where extremely low-latency is needed, such as log writes, the impact of mirroring with ASM is still negligible and the performance promises of a single SSD are still met.

Figure 1: Tiered Storage Pyramid

ASM Internals

ASM provides its striping through a default of 1MB file extents with normal disk arrays and 4 MB extents if used within an Exadata cell. By striping across files instead of blocks, multiple files can be used on a single ASM instance but have independent levels of redundancy. Log files and database files are often held at a normal redundancy while control files are often in triplicate for extra protection. In addition, the default template for ASM specifies the extent size for each file. Control files and log files are assigned a smaller extent size of 128K while larger, more bandwidth-intensive operations default to 1MB extents. Testing in Oracle’s development labs showed the 1MB size as the most balanced size in bandwidth and latency and concurrency of access.

ASM will utilize disks in what may appear as a very sporadic layout. Extents are not sequential in disks or failgroups. The first extent may be on Disk 1, but the second extent may be on Disk 3 of the third failgroup. In addition, ASM will place extents throughout lengths of disk space. This results in the even use of disk platters and prevents hotspots within the disks themselves. The only guarantee in the placement of extents is the mirrored extent located in a different failgroup. All disks will be used as evenly as possible to prevent any disk having higher utilization than others. This is demonstrated in Figure 2.

Figure 2. ASM Internal Layout

From the layout in Figure 2, each disk contains an original extent and backup extents (B) are assigned to a non-correlating disk in a separate failgroup. It is evident that ASM can take full advantage of the disks included. Reading from the disk provides efficient use of the disks as a fully striped volume. Write performance is equivalent to that of a RAID10 as half the disks are written while the other half serves as a mirror to the file extent. Through this behavior, the performance gain is much greater than a standard mirrored array but still provide the redundancy necessary for file protection. If failgroups are assigned appropriately, then ASM will also continue to serve even amidst a controller failure.

Because ASM is a lightweight database, managing ASM is done with a limited command set nearly equivalent to that of a standard Oracle database. Management such as the assigning of disks to diskgroups and failure groups can also be done through GUI interfaces such as DBCA. The v$ Dynamic Views within ASM call the system-based functions directly to eliminate a layer of administration. When querying the v$asm_disk view with a standard SELECT statement, Oracle will automatically initiate a rescan of the SCSI bus to detect any new devices attached to the server. The v$asm_disk_stat and v$asm_diskgroup_stat return the current rows for the ASM instance without the inherent system calls of scanning for disks.

Diskgroups and Failgroups

A diskgroup is a set of disks defined to work in tandem to provide a single volume. Such as drive letter access in Windows (X:\), a diskgroup is given an ASM pseudo-filesystem to be accessed through the same reference by a file. All files located on ASM are referenced by the following:

“+\\filename”

where the + sign indicates ASM storage.

To navigate the ASM-based storage on disks through a filesystem interface, Oracle provides the ASMCMD executable. ASMCMD operates as an interactive shell for ASM in which you can view, delete, and alias ASM-based files and directories. The DBA may perform the same functions through an SQL terminal, but ASMCMD provides a more direct approach.

Diskgroups can be created with 3 different levels of redundancy: Normal, High, and External. Many expensive disk arrays rely on internal mechanisms for data protection. The External level of protection does not feature any duplication of data in the ASM diskgroup, but simply includes the volume into the ASM volume manager. This feature is commonly used for expensive SANs. These arrays typically hold cache-substitutes for dropped disks or can allocate a replacement disk from a hot-spare almost immediately. However, catastrophic failures are only sustained through mirroring systems.

Normal redundancy offers a 2x to 3x mirror level while High redundancy specifies a 3x mirror. This allows ASM to provide redundancy rather than requiring the purchase of a SAN that guarantees no single point of failure. These mirror levels specify the number of entries each extent will have on a diskgroup. ASM will provide the mirroring across specified failgroups. A failgroup is a subset of a diskgroup that defines a point of failure. This provides consistency across controllers and paths, in addition to disks. With Normal redundancy, each mirrored extent is guaranteed to be located on a failgroup separate from the original extent. With this configuration, the primary copy may fail, along with an entire failgroup, but the corresponding failgroup will contain data fully consistent with that of the original data. The 3x mirror will attempt to mirror across three failgroups. If there are only 2 failgroups in the setup, then the third copy will be duplicated within a failgroup but held on a separate disk if possible.

Enterprise SSDs offer protection against data corruption and power loss. However, uptime during maintenance can only be upheld through mirroring. ASM copies each extent to a non-corresponding location in a different failgroup. For system mirroring, failgroups would be defined as all drives included within a single system. Oracle recommends setting failgroups across controllers. The primary exception to this rule is to apply multipathing. Multipathing will increase throughput, provide failover, and will not require additional space allocated for mirroring the disk.

In my next blog I will discuss good practices for ASM when used with SSDs.

Monday, April 9, 2012

A Holistic Look at the Oracle CPU Layer

The CPU layer is where all of the processing occurs within a database. Of course now with the introduction of extremely distributed processing as in the Exadata systems, this line is blurring because some of the processing is actually being done at the storage level. In the future this may even blur more as some companies are offering CPUs built into memory chips (8 CPUs on a single gigabyte memory chip).

Insufficient CPU Resources

I have heard it said that tuning usually pushes the log jam from one part of the information stream to another; I have found this to be true. Once we tune the code, fix the IO subsystem, and fix the memory issues we may have with our system, we usually end up with higher CPU utilization. Higher CPU utilization in itself is not a bad thing, unless it leads to over utilization or overloading of the system CPUs.

When examining an AWR or Statspack report, I look at the balance between Busy, Idle, and IO Wait on the CPUs in the system before making a recommendation as to how much an SSD will improve performance. For example, if the percent busy is 30, the percent idle 20, and the IO wait percent 50, then even if we reduce the IO wait to less than 5% we can still only reclaim 20 percent of performance losses. In order to get the maximum bang for the performance buck for any improvement to the IO subsystem above 20 percent we would need to either tune to reduce CPU usage, add more CPUs, or replace the CPUs with ones that can sustain a higher level of operations. Figure 1 shows an Operating Statistics section of an AWR report.

Figure 1: Operating System Statistics

However, even looking at Figure 1 it may not tell the whole truth about CPU utilization or the amount of CPU cycles that could be regained from reclamation of IO waits. You should always verify the above number for IO wait time by using the statistic from the Server Statistics area on User IO Wait time. In many cases the value will be much larger than the one reported form the Operating System statistics. In figure 1 if a RamSan SSD would reduce the value for the IO statistic by a factor of 9, which would return 98,000 ticks to the CPU for processing, since the CPU has 20,819,000 idle ticks available, this would at least double the performance.

However, if in returning IO wait time to the CPU you exceed the amount of idle time available, then you won’t get the full benefit from adding in a low latency IO solution such as a RamSan SSD.

Other sections of the report also show how utilized the CPU is in your system, Figure 2 shows some of the additional statistics available.

Figure 2: CPU related areas of AWR

From Figure 2 we can also see that there is plenty of CPU available with a 99.8% idle figure and only 0.2% being utilized by this instance. Of that 0.2% the instance did spend 78.6% of that 0.2% doing actual work.

Review of the SQL area of the reports will also show what SQL is consuming the most CPU cycles. When tuning to decrease CPU utilization, the most attention should be paid to the SQL which consumes the most of the CPU cycles. Figure 3 shows an excerpt from an AWR report SQL using high CPU section.

Figure 3: SQL using most CPU AWR Section

When tuning SQL look at removing excessive sorts, use bind variables to eliminate recursive SQL. In PL/SQL avoid excessive looping and don’t do processing inside the loop that can be done once outside the loop. In a best case situation post-10g if the PL/SQL optimizer realizes a bit of code is deterministic such as using a function against a constant, will only be done once, but, if you are doing anything more complex it will add more cycles each time through the loop. In Figure 3 we can see that the first statement is actually a PL/SQL routine, you could start by seeing if that PL/SQL has excessive looping.

Memory Trashing

With automatic memory management (AMM) if you have improper settings for your SGA_MAX_SIZE and SGA_TARGET (pre-11) and MEMORY_TARGET and MEMORY_MAX_TARGET (post-11), then a condition called thrashing can occur. Thrashing is when the AMM processes allocate and deallocate memory continuously. In an example, if the DB cache size is too small, then the AMM will take memory form the other sections, usually the shared pool, to supplement the DB cache. This involves a lot of Mutex and latching operations. It also removes possibly reusable SQL and PL/SQL from the SQL and PL/SQL caches.

Now, what happens when the reverse happens and the shared pool needs more space? It takes it back from the cache possibly causing IO activity, lots of latch operations and other CPU stealing actions. Then once the shared pool gets the memory it probably reloads SQL that it had to offload before and has to redo all of the recursive SQL needed to parse it, again adding CPU cycles from latches, mutexs and parses of recursive SQL. Remember, it has been a true statement since the shared pool was introduced that a shared pool miss is many times more expensive CPU wise than a cache miss. Figure 4 shows an example of thrashing in the memory area of Oracle.

Figure 4: Example of Memory Trashing

Notice how all of the actions in Figure 4 are DEFFERED, this is because they involve waits for latches, mutexs, recursive SQL and a number of CPU cycle stealing operations to occur before they can be completed.

The way to mitigate thrashing is first to have properly tuned code, next, properly size the db cache, shared pool, large pool, java pool and streams pool using the base parameters, then provide proper wiggle room by setting the MAX parameters (SGA_MAX_SIZE, MEMORY_MAX_TARGET) 10-20% higher than the resulting TARGET settings (SGA_TARGET, MEMORY_TARGET). I see the most thrashing when the SGA or MEMORY MAX parameters are too small and the TARGET parameters are set equal to them with no base parameters set.

By setting the base parameters (DB_CACHE_SIZE, SHARED_POOL_SIZE, etc) you establish a floor value below which the AMM cannot go for that parameter. You set them at a point where you know you get reasonable performance.

Proper SQL and PL/SQL tuning will do wonders to reduce CPU usage. Optimizing memory settings to eliminate thrashing will also help.

Friday, April 6, 2012

A Holistic Look at the Oracle Memory Layer

The memory layer is the actual area where the database caches, pools, and other memory structures reside. As such, the memory layer is holistic in nature since any changes there are immediately felt on all other areas. Figure 1 is a diagram of the various memory areas in Oracle11g.

Figure 1: Oracle11g memory Areas

Insufficient Memory Resources

Generally speaking, insufficient memory resource shows up as excessive physical reads. Usually the first indication in a properly designed system will be excessive db file sequential reads. If a system is using automatic memory management it may also show as numerous deferred actions or excessive swapping of memory between the shared pool components and the database cache areas. Figure 2 shows an example of memory starvation.

Figure 2: Memory Starvation Indication

Another indication is when the cache analysis section of the AWR or Statspack indicates a large amount of change in physical reads from increasing the cache size as shown in Figure 3.

Figure 3: Buffer Pool Advisory

As you can see in the example shown, doubling the cache size (if possible) would result in at least a 50% reduction in physical reads.

Correcting insufficient memory resources may be as easy as increasing the settings for AMM or the floor values for the various components in the Oracle memory. However, if the system memory has been exhausted then buying more memory and inserting it in the system is indicated. Many times this may not be enough.

Access to memory and the back path to the IO subsystem can also be stressed if there is too much memory. This can be helped by switching to a RAC based system, thus spreading the memory and IO pathways amongst several servers. In a RAC environment excessive interconnect traffic can be a sign that the individual servers have insufficient memory.

Too large a buffer area can result in excessive latching and CPU overhead. It is usually easily detected by the same areas of the AWR or Statspack reports showing the estimated physical reads don’t change starting at a range below our current settings.

Improper Memory Configuration

Within the Oracle memory area there are multiple caches and pools. The wary DBA should be utilizing proper settings for the shared pool database cache, keep, recycle, and multiple block size caches. Additional settings for the large pool, Java pool, and streams pool should be used as needed in a particular environment. Even when AMM is used, floor values should be set for these parameters to ensure minimum levels of performance are met.

When using SGA_TARGET and SGA_MAX_SIZE or MEMORY_TARGET and MEMORY_MAX_SIZE, be sure they aren’t set equal to each other, because this leads to memory thrashing as the internal memory pools and caches are forced to release and reclaim memory, leading to deferred actions. I usually recommend that at least 1-2 gigabytes be between either set of settings. Of course, as you gain operational experience for your system’s needs this gap can be reduced, or increased. The dynamic performance view, v$sga_resize_ops and the AWR resize operations section can be utilized to see if you need to rethink the floor settings for specific parameters.

A typical v$sga_resize_ops query:

select
COMPONENT, OPER_TYPE, OPER_MODE,
INITIAL_SIZE, TARGET_SIZE, FINAL_SIZE,
STATUS,
to_char(START_TIME,'mmdd hh24:mi') start_time,
to_char(END_TIME,'mmdd hh24:mi') end_time
from V$SGA_RESIZE_OPS order by start_time
/

Figure 4 shows an example output from the memory resize query.

Figure 4: Resize Operations Query Results

In some databases, notably data warehouse (DWH) and decision support (DSS), it may be impossible to provide enough cache to support all operations. However, in many cases the developers or the DBAs or both may not be aware of all the options Oracle provides. Many times when reviewing the AWR or Statspack for DWH systems I see they are not utilizing star joins. Star joins will reduce 30 minute queries to 30 seconds when utilized correctly.

In the case where you must have db file scattered reads (full table or index scans), you have to make sure the IO subsystem is properly sized to handle the load. As was said in the section on physical IO, to get adequate performance from disk based systems it may be necessary to over buy the number of disks needed to the tune of 30-40 or more times the number of disks needed for storage capacity alone to get the needed number of IOPS. The major cause of the need for all of these drives is the contention and queuing that result from too-few spindles. By utilizing low latency IO subsystems that don’t block, such as SSDs, you don’t need to over buy storage capacity to meet IOPS needs.

In the case where is it not economical or feasible to add memory to a server or servers (RAC) then replacing some of the high latency disks with low latency IO subsystems such as large cache or SSDs is a method to get higher performance from your system.

Sort Areas

Another area of memory that people tend to set and forget is the setting for PGA_AGGREGATE_TARGET (PGAT). This PGAT setting controls the SORT_AREA_SIZE, HASH_AREA_SIZE and various other sort and hash parameters. One temporary activity that isn’t controlled are the sizes of the BITMAP_MERGE_AREA and BITMAP_SORT_AREA sizes which can be of critical importance if you are using bitmap indexes. I usually look at the instance activity and PGA histogram sections of the AWR or Statspack reports. Some example report sections are shown in Figure 5.

Figure 5: Example Sort Statistics

Notice that the majority of sorts in Figure 5 are in the range of 2-16 megabyte, these should easily be covered by the PGAT setting, 1.0 gb in this database, so why are we getting these small sorts? Processes that do direct IO to and from the PGA may avoid the PGAT and use the traditional SORT_AREA_SIZE setting. In this case you may to go ahead and set the SORT_AREA_SIZE to 16 megabytes to handle these sort needs. A majority of the rest of the sorts are equal to or less than 512 megabytes which should be handled if we boost the setting to 4 gigabytes. Also, in the case of the above report, the temporary tablespace was the number one source of IO in the system as shown in Figure 6.

Figure 6: Temporary Tablespace #1 Source of IO

The PGA Memory Advisor section from the same report is shown in Figure 7.

Figure 7: Example PGA Advisory

When looking at this PGA Memory Advisory section you need to pay attention to the Estd PGA Overalloc Count column. In this report the value for the Overallocation count column doesn’t go to zero until 2400 megabytes. I have found that this section, when followed, tends to under allocate what you need. In this case we have determined that 4 gb is actually a better number. There have been times when there have been sorts (other than those accountable to non-PGAT controlled processes) in the range where the PGAT should have handled and yet the Overallocation count said everything was fine.

DB Flash Cache

If you are running on Solaris or Oracle Linux you can utilize the new 11gR2 DB Flash cache. The DB flash cache can be thought of as an L2 cache for the normal database default memory cache. If you have some spare memory and db sequential reads are your majority wait event, use of a DB flash cache may be the solution to holistically improve the performance of the system. Usually it should be set to between 2 to 10 times the size of your default DB cache size.

The DB Flash Cache is maintained by the DB writer processes. If you configure the DB Flash cache I suggest doubling the number of the DB writer processes to facilitate the management of the DB Flash cache. Usually the DB Flash cache will be on a PCIe based Flash card such as the RamSan70. However, there is no reason why you can’t utilize an external Flash appliance such as the RamSan 630/700/800 series for the DB Flash cache. The only caveat is that the DB Flash cache assignment device must have a latency that is very much less than the main source of the cached blocks or you will not get good performance improvements.

Wednesday, April 4, 2012

Oracle Structure Layer Holistic Implications

In this next Holistic tuning blog, I will examine database structures. The structure layer consists of the internal physical structures that make up the database. In the beginning it was just tables and indexes; now of course we have several types of tables, several types of indexes, materialized views, LOBs, objects, and many other structures to be concerned about.
Improperly Designed Tables
Even though processors have gotten more powerful and disks and storage arrays more capable, you can still defeat them with improper table design. Generally, improper table designs fall into two broad categories: insufficiently normalized and over normalized designs.

Insufficiently normalized tables result in repeating values within a table. Insufficient table design usually produces excessive IO, excessive sorts, and excessive use of DISTINCT.

Over-normalized tables usually result in too many joins, excessive sorts caused by Cartesian products from insufficient WHERE clauses, and overly complex SQL.

In most cases Oracle will not produce an efficient execution plan for over 6 tables unless you have a data warehouse and utilize a star join path. The undocumented parameter “_optimizer_max_permutations” sets the number of path determinations to 2000. Since the number of permutations is based on the factorial of the number of tables in the join, you reach 2000 between 6 tables (6 n!=720) and 7 tables (7 n!=5040). Place the most important (i.e. the ones that reduce the result set the most) tables first in the FROM clause if you have over 6 tables as (at least last time I reviewed this) the tables are evaluated from left to right when it is going through possible paths.

Use of flex fields can also be a show stopper for performance. Flex fields are fields whose content type is determined by the context of previous rows. In one case I was called on to help with creating a data warehouse from an application that was based on Oracle Applications. In this case they had used the “flex” fields in Oracle applications to specify that a single column was one of: a number, a date, or a character field, depending on other values in the row, in fact there where several different fields in the same row used the same way. In order to resolve one entry into the companion data warehouse we had to recursively join this table to itself seven times.

Use of improperly typed fields can also cause performance and storage problems, for example using a large CHAR based field for small length character values. CHAR reserves whatever space is specified regardless of the length of the input data. This requires extra programming for comparison semantics and wastes storage space. Another example is using CHAR or VARCHAR2 for numeric data, thereby forcing conversion (either implicit or explicit).

Non-use of partitioning is another possible design flaw. Partitioning is an extra license but can be worth it in time and resources saved. The biggest benefit from partitioning is partition elimination. Partition elimination is when only the partitions needed to resolve the query are scanned. Partitions, when used properly (as in with a proper partition key), reduce IO requirements for the entire database by eliminating full table scans. On the other hand, if a partitioned table has an improper partition key it may make performance worse.
Use of Materialized Views
Materialized views, also known as summaries, allow you to pre-build reports, summaries and other “end results” so that when someone issues a query that can use the materialized view, the results appear instantly. Use of proper materialized views is important to reduce IO and CPU usage as well as reducing sort space usage. You need to determine if you need “fresh” data or can stomach using “stale” data.

With fresh data you would want to use an “On-commit” materialized view that ties its refresh cycle to the main table that is in the materialized view. You can also use an “on request” materialized view if data doesn’t have to be fresh. The on request materialized view can be used with a schedule to refresh it at whatever periodicity is needed (based on how stale the results can be.)
Insufficient Indexes
Insufficient indexing results in full table scans, excessive sorting, and excessive CPU usage. Indexes, when used properly are probably one of the most performance enhancing features of Oracle.

Another area that can really affect holistic performance is when you move referential integrity (RI) from the database to the application. There was a large movement to do this type of design to allow the application to be database layer agnostic a few years ago. Unfortunately this design was sometimes only partially utilized leaving some database defined RI mixed with application based RI. In this mixed RI case the foreign keys would often not be indexed resulting in excessive row-locking in the application. Of course this can also happen if you use database based RI and don’t index foreign keys.
Over-Indexing
Over indexing results in excessive times for DML operations such as insert, update, and delete because the amount of IOPS required to perform each action is multiplied several fold with the addition of each index.

Several years ago Kevin Loney did a study to see the effects of multiple indexes on Insert, Update and Delete operations. Essentially the time to do the operations increased by large amounts as the number of indexes increased due to the increased number of IO operations required.
Use of an Incorrect Index
Using the wrong type of index can result in either the index not being used at all, or being used inefficiently. For example, using a single bitmap index will usually result in poorer performance than using the equivalent B-tree index. Using a regular index when a function-based index is required will result in a full table scan. Another problem with indexes is improper use such as using a single concatenated index rather than single indexes. Single indexes can sometimes provide a greater advantage than a single composite by allowing a more diverse group of SQL statements to utilize the single indexes, either as single indexes or by using two or more of the indexes. There are many other issues of this type; be sure to research the proper index to use in each situation.

Use of proper indexing and making sure you aren’t over indexing reduces IO and CPU cycles as well as speeding up queries; it will also make sure you don’t excessively penalize DML (Insert, Update and Delete) operations.

Monday, April 2, 2012

Holistic Tuning Part 1

What does the term “Holistic” really mean? From the online Webster Dictionary:

Holistic:

”… relating to or concerned with complete systems rather than with the analysis of, treatment of, or dissection into parts… ”

Given the above definition of “Holistic,” this next series of blogs will deal with looking at the entire database when performing a tuning exercise, not just the immediate statement or other item that seems to be the problem.

This Holistic view of tuning means that we must consider the total effects of design, structure, applications, and physical systems on tuning and treat each as a part of the whole rather than piecing them out into individual areas.

Holistic View of Oracle

Oracle is like an onion; each layer rests on the one underneath. If any layer is “rotten” then the entire structure is in danger. Likewise, a change to any layer may propagate effects up and down through other layers. Let’s look at an example.

Given: Statement X is generating excessive physical IOPS when it executes

Local Fix: Add indexes

Immediate local effect: Statement X runs faster with fewer IOPS

Holistic effects: Statements Y and Z switch from proper indexes for them to indexes added for statement X

Holistic Result: X runs faster, but Y and Z run slower, overall result: Slower system even though the tuning effort on X was a success!

Now, this is not to say that all local actions are going to have a negative effect. Usually a switch to a new index is because the optimizer has looked at statistics and determined the index would be better before using it. However, we all know this may not always work as designed!

Of course everyone generally agrees that caching is a good thing. The more we can cache data for Oracle to use, the faster the application will run. However, this comes at a cost. The cost of caching is that now instead of handing off IO to an external IO processor, IO has to be handled by the CPU using logical IO. Now, logical IO is several orders of magnitude faster than physical IO, except if it introduces so many additional CPU cycles that the system becomes CPU bound.

The converse is also true; without enough caching or disk IO bandwidth, adding memory or CPU to a system that is IO bound may not help at all. So again we have to truly understand the holistic view to take the proper course of action.

Local Actions that Have a Negative Holistic Effect

In my job with TMS I am tasked with doing system evaluations using statspack and AWR reports to determine if a client really needs a new IO subsystem or if other issues are causing their system performance issues. Believe me, it does us no good as a company to have someone say they bought our solution and it didn’t help their problem!

It is amazing how many people still suffer from basic, local problems that affect the holistic performance of the entire database. We will look at a few of these “local” issues that have a global effect over the next few blog entries.

Texas Memory Systems' Blog