Friday, June 15, 2012

Best Practices for Utilizing SSDs in HA Configurations


Introduction

SSD technology (the use of Flash or DDR chips versus hard disks to store data) is the disruptive technology of this decade. By the end of 2020, virtually all data centers will place their most critical data on SSDs. Having no moving parts (other than switches and cooling fans), SSD technology promises to be more reliable and resilient than the disk drive technology it replaces. Disks, with their motors, armatures, head assemblies and other mechanical components, are prone to breakdown and are inherently less reliable than solid state devices. However, unless proper practices are followed when deploying SSD technology, single points of failure can remain, and a failure at one of them will require a shutdown and may cause loss of data on SSD based systems. This paper provides best practices to mitigate the possibility of data loss and system downtime due to a single point of failure in an SSD based SAN architecture.

The Problem

Given the ultra-reliable nature of SSD devices, many users are not paying attention to best practices for data center redundancy when the architecture uses SSDs. This paper is not meant to address the single-use, form-factor SSD type of installation, but rather the large scale (greater than 1-2 terabytes) installation of SSD systems in a rack mount configuration, such as is seen with the TMS RamSan-500, 620 and 630 systems and, to a lesser extent, with the RamSan-300/400 series. Note that these best practices also apply to other manufacturers that provide rack mount SSD based systems or enclosures.

SSD Reliability

Most properly designed SSD based rack mount systems are very reliable and resilient. They employ many technologies to make sure that a single point of failure at the card or device level is not fatal. For example, RamSan systems utilize RAID5 across the chips on a single component card, as well as ECC and ChipKill technologies that automatically correct chip or cell based errors or remap the affected data to good sectors, much as hard disk technology has done for years. In addition, many SSDs offer hot spare cards, so that should a single flash card fail, its data is remapped to a spare. However, in any single component device there are usually single points of failure which must be addressed.

Best Practices

  1. The entire storage system substructure must be redundant
    1. Dual, multi-ported HBAs, FC or IB
    2. Use of auto-failover MPIO software in the host
    3. Use of multiple cross-connected FC or IB switches
    4. Multiple cross linked FC or IB cards in the SSD
  2. The SSD must provide dual power supplies with auto-failover and have hot swap capability
  3. The cooling fans in the SSD units must be redundant and hot swappable
  4. If full reliability and redundancy are required, and a dual backplane is not provided internally, then at least two SSD units must be utilized in an externally mirrored (RAID1) configuration
  5. The rack power must come from at least two independent sources, at least one of which must be an uninterruptible power supply (UPS).
  6. Any disk based systems used in concert with the SSD based systems must share the same level of redundancy and reliability.
  7. For disaster recovery, a complete copy of the data must be sent to an offsite facility that is not on the same power grid or network branch as the main data center. The offsite data center must provide at least the minimum accepted performance level to meet service level agreements.
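
The seven practices above can be turned into a simple checklist audit. The following is a minimal Python sketch, not a vendor tool: the `StorageConfig` fields and the `audit` function are hypothetical names invented for illustration, and the thresholds simply restate the list (two or more of each redundant item, at least one UPS-backed power source).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorageConfig:
    """Hypothetical inventory of one SSD based SAN deployment."""
    hba_count: int                 # multi-ported HBAs (FC or IB) in the host
    mpio_auto_failover: bool       # auto-failover MPIO software in the host
    switch_count: int              # cross-connected FC or IB switches
    ssd_interface_cards: int       # cross-linked FC or IB cards per SSD unit
    power_supplies: int            # power supplies per SSD unit
    power_hot_swap: bool           # power supplies fail over and hot swap
    redundant_hot_swap_fans: bool  # fans are redundant and hot swappable
    dual_backplane: bool           # unit is internally redundant
    mirrored_ssd_units: int        # SSD units in the external RAID1 mirror
    power_sources: int             # independent rack power sources
    ups_backed_sources: int        # how many of those sources are a UPS
    disk_tier_matches: bool        # disk systems held to the same standard
    offsite_copy: bool             # full copy on a separate grid and network


def audit(cfg: StorageConfig) -> List[int]:
    """Return the numbers of the best practices the configuration violates."""
    checks = {
        1: (cfg.hba_count >= 2 and cfg.mpio_auto_failover
            and cfg.switch_count >= 2 and cfg.ssd_interface_cards >= 2),
        2: cfg.power_supplies >= 2 and cfg.power_hot_swap,
        3: cfg.redundant_hot_swap_fans,
        4: cfg.dual_backplane or cfg.mirrored_ssd_units >= 2,
        5: cfg.power_sources >= 2 and cfg.ups_backed_sources >= 1,
        6: cfg.disk_tier_matches,
        7: cfg.offsite_copy,
    }
    return [number for number, ok in checks.items() if not ok]


# Illustrative deployment: one single-backplane unit on a single power source.
example = StorageConfig(
    hba_count=2, mpio_auto_failover=True, switch_count=2,
    ssd_interface_cards=2, power_supplies=2, power_hot_swap=True,
    redundant_hot_swap_fans=True, dual_backplane=False, mirrored_ssd_units=1,
    power_sources=1, ups_backed_sources=1, disk_tier_matches=True,
    offsite_copy=True,
)
print(audit(example))  # prints [4, 5]
```

The example deployment passes practices 1-3 but fails practice 4 (a lone unit with no dual backplane) and practice 5 (only one power source, even though it is a UPS).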

Reasoning

If any part of the storage system is single threaded (that is, not redundant), then that part becomes a single point of failure, eliminating the protections of the other sections. For example, while all SSDs are redundant and self-protecting at the card or component level, these cards or components usually plug into a single backplane. That single backplane is a single point of failure. Designing with dual backplanes would drive system cost higher and involve complete re-engineering of existing systems. Therefore it is easier to purchase dual components and provide for true redundancy at the component level.
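
One way to see the backplane problem is to model the I/O path as a graph and remove one component at a time. The sketch below is illustrative only: the component names and both topologies are assumptions made up for this example, not a description of any particular RamSan wiring.

```python
from collections import deque


def reachable(links, start, goal, removed):
    """Breadth-first search from start to goal that skips the removed node."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start="host", goal="data"):
    """List each intermediate component whose loss cuts the host off from its data."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    return sorted(n for n in nodes - {start, goal}
                  if not reachable(links, start, goal, n))


# Hypothetical topology: redundant HBAs, switches and ports, one backplane.
single_unit = {
    "host": ["hba_a", "hba_b"],
    "hba_a": ["switch_a", "switch_b"],
    "hba_b": ["switch_a", "switch_b"],
    "switch_a": ["port_1"],
    "switch_b": ["port_2"],
    "port_1": ["backplane"],
    "port_2": ["backplane"],
    "backplane": ["data"],
}

# Hypothetical topology: the same fabric in front of two mirrored units.
mirrored = {
    "host": ["hba_a", "hba_b"],
    "hba_a": ["switch_a", "switch_b"],
    "hba_b": ["switch_a", "switch_b"],
    "switch_a": ["unit1_port", "unit2_port"],
    "switch_b": ["unit1_port", "unit2_port"],
    "unit1_port": ["unit1_backplane"],
    "unit2_port": ["unit2_backplane"],
    "unit1_backplane": ["data"],
    "unit2_backplane": ["data"],
}

print(single_points_of_failure(single_unit))  # prints ['backplane']
print(single_points_of_failure(mirrored))     # prints []
```

In the first topology every component in front of the backplane can fail without cutting off the host, yet the backplane alone still takes the data offline. Mirroring two units removes it from the list.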

If the system is fully redundant but the power supplies are fed from a single line source, or are not backed by a UPS, then the power feed to the systems becomes a single point of failure.

When additional storage is provided by hard disk based systems, if they are not held to the same level of redundancy as the SSD portion of the system, then they become a single point of failure.
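
The weakest-link argument can be made concrete with standard series and parallel availability arithmetic. This is a rough sketch that assumes component failures are independent, and every availability figure in it is an invented placeholder rather than a measured or vendor-published number.

```python
from math import prod


def series(*availabilities):
    """Every component is required, so the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one component suffices, so the group fails only if all fail."""
    return 1 - prod(1 - a for a in availabilities)


# Placeholder figures for illustration only.
ssd_unit = 0.999        # one single-backplane SSD unit
disk_tier = 0.99        # a disk system with less redundancy
power_feed = 0.995      # one power source

mirrored_ssd = parallel(ssd_unit, ssd_unit)
dual_power = parallel(power_feed, power_feed)

# Mirrored SSDs and dual power, but the weaker disk tier in the same path.
with_weak_disk = series(mirrored_ssd, dual_power, disk_tier)

# The same system with the disk tier mirrored to match.
with_matched_disk = series(mirrored_ssd, dual_power,
                           parallel(disk_tier, disk_tier))

print(f"weak disk tier:    {with_weak_disk:.6f}")
print(f"matched disk tier: {with_matched_disk:.6f}")
```

Because a series product of values no greater than 1 can never exceed its smallest term, the whole system is at best as available as its least redundant tier, however well the SSD portion is protected.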

Disaster recoverability in the case of main data center loss requires only that the systems at the offsite data center be minimally redundant, but the data itself must be fully redundant. To this end, technologies such as block level copy or a standby system such as Oracle Data Guard should be used. In order to provide performance at the needed levels, an additional set of SSD assets should be provided at the offsite data center. Another possibility with Oracle 11g is the use of a geo-remote RAC installation using the preferred-read technology available in Oracle 11g ASM.

Are TMS RamSans Fully Redundant?

All TMS RamSans (except the internal PCIe RamSan-70) can be purchased compliant with best practices 1-3. Due to the single backplane design of the RamSans, two units must be used in a mirrored configuration for assurance of complete redundancy and reliability (best practice 4). Best practices 5-7 depend on things external to the RamSans. The newly released RamSan-720 and RamSan-820 designs are fully internally redundant and can be used for HA as standalone appliances.

Summary

A system utilizing SSD rack mount technology can be made fully redundant by the use of two or more SSD components with external mirroring. Unless the new RamSan-720 or RamSan-820 is used, running fewer than two SSD components leaves the backplane as a single point of failure, and a backplane failure could result in downtime and/or loss of data.
