Introduction
SSD technology (the use of Flash or DDR chips versus hard
disks to store data) is the disruptive technology of this decade. By the end of
2020, virtually all data centers will place their most critical data on SSDs.
Having no moving parts (other than switches and cooling fans), SSD technology
promises to be more reliable and resilient than the disk drive technology it
replaces. Disks, with their motors, armatures, head assemblies and other
mechanical components, are prone to breakdown and are inherently less reliable
than solid state devices. However, unless best practices are followed when
deploying SSD technology, single points of failure can remain, and a failure
at one of them can force a shutdown and possibly cause loss of data on SSD
based systems. This paper provides best practices to mitigate the possibility
of data loss and system downtime due to a single point of failure in an SSD
based SAN architecture.
The Problem
Given the ultra-reliable nature of SSD devices, many users
are not paying attention to best practices for data center redundancy
when the architecture uses SSDs. This paper is not meant to address the
single-use, form-factor SSD installation, but rather the large scale
(greater than 1-2 terabytes) installation of SSD systems in a rack mount
configuration, as seen with the TMS RamSan-500, 620 and 630 systems and,
to a lesser extent, with the RamSan-300/400 series. Note that these best
practices also apply to other manufacturers that provide rack mount SSD
based systems or enclosures.
SSD Reliability
Most properly designed SSD based rack mount systems are very
reliable and resilient. They employ many technologies to make sure that a
single point of failure at the card or device level is not fatal. For example,
RamSan systems utilize RAID5 across the chips on a single component card, as
well as ECC and ChipKill technologies that automatically correct or remap chip
or cell based errors to good sectors, similar to how hard disk technology has
been doing it for years. In addition, many SSDs also support hot spare
cards so that, should a single flash card fail, a spare is remapped to take
over its data. However, in any single-unit device there are usually single
points of failure which must be addressed.
Best Practices
1. The entire storage system substructure must be redundant:
   - Dual, multi-ported HBAs (FC or IB)
   - Auto-failover MPIO software in the host
   - Multiple cross-connected FC or IB switches
   - Multiple cross-linked FC or IB cards in the SSD
2. The SSD must provide dual, hot-swappable power supplies with auto-failover.
3. The cooling fans in the SSD units must be redundant and hot swappable.
4. If full reliability and redundancy are required, and a dual backplane is not provided internally, then at least two SSD units must be used in an externally mirrored (RAID1) configuration.
5. The rack power must come from at least two independent sources, at least one of which must be an uninterruptible power supply (UPS).
6. Any disk based systems used in concert with the SSD based systems must share the same level of redundancy and reliability.
7. For disaster recovery, a complete copy of the data must be sent to an offsite facility that is not on the same power grid or network branch as the main data center. The offsite data center must provide at least the minimum accepted performance level to meet service level agreements.
Reasoning
If any part of the storage system is single threaded (that is,
not redundant), then that part becomes a single point of failure, eliminating
the protections of the other sections. For example, while SSDs are typically
redundant and self-protecting at the card or component level, these cards or
components usually plug into a single backplane. The single backplane is a
single point of failure. Designing with dual backplanes would drive system
cost higher and involve complete re-engineering of existing systems.
Therefore, it is easier to purchase dual components and provide true
redundancy at the component level.
If the system is fully redundant but the power supplies are
fed from a single line source, or are not fed using a UPS, then the power feed
to the systems becomes a single point of failure.
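The backplane argument can be checked mechanically. The following is a minimal sketch, not a description of any vendor tool: it models the data path as an undirected graph, fails one component at a time, and reports every component whose loss cuts the host off from its data. All component names (`hba1`, `ssdA_backplane`, and so on) and both topologies are hypothetical, chosen only to mirror the single-unit and mirrored layouts discussed here.

```python
from collections import deque

def build_links(edges):
    """Turn a list of (a, b) connections into an undirected adjacency map."""
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links

def reachable(links, start, goal, removed=None):
    """Breadth-first search: can start still reach goal if `removed` has failed?"""
    if removed in (start, goal):
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(edges, start, goal):
    """Return every component whose individual failure disconnects start from goal."""
    links = build_links(edges)
    return sorted(
        node for node in links
        if node not in (start, goal)
        and not reachable(links, start, goal, removed=node)
    )

# Hypothetical topology: redundant HBAs, switches and SSD ports, one backplane.
single_unit = [
    ("host", "hba1"), ("host", "hba2"),
    ("hba1", "switch1"), ("hba1", "switch2"),
    ("hba2", "switch1"), ("hba2", "switch2"),
    ("switch1", "ssdA_port1"), ("switch2", "ssdA_port2"),
    ("ssdA_port1", "ssdA_backplane"), ("ssdA_port2", "ssdA_backplane"),
    ("ssdA_backplane", "data"),
]

# Same fabric with a second, externally mirrored unit holding the same data.
mirrored = single_unit + [
    ("switch1", "ssdB_port1"), ("switch2", "ssdB_port2"),
    ("ssdB_port1", "ssdB_backplane"), ("ssdB_port2", "ssdB_backplane"),
    ("ssdB_backplane", "data"),
]

print(single_points_of_failure(single_unit, "host", "data"))
print(single_points_of_failure(mirrored, "host", "data"))
```

In this model the single-unit layout reports only the backplane, and the mirrored layout reports nothing. The sketch considers one failure at a time and treats the mirrored data as a single logical target; it does not model power, firmware, or simultaneous failures.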
When additional storage is provided by hard disk based
systems, if they are not held to the same level of redundancy as the SSD
portion of the system, then they become a single point of failure.
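The effect of one non-redundant link can also be put in rough numbers. The sketch below uses the standard series and parallel availability formulas and assumes that component failures are independent, which real installations only approximate. Every availability figure in it is hypothetical and chosen purely for illustration; none is a measured or vendor-published value.

```python
from math import prod

HOURS_PER_YEAR = 8760

def series(*availabilities):
    """All parts are required, so their availabilities multiply."""
    return prod(availabilities)

def parallel(*availabilities):
    """Any one part is enough, so their unavailabilities multiply."""
    return 1.0 - prod(1.0 - a for a in availabilities)

def downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

# Hypothetical per-component availabilities (assumptions, not measurements).
HBA = 0.999
SWITCH = 0.999
SSD_UNIT = 0.9995       # one unit, limited by its single backplane
UTILITY_FEED = 0.999    # a single line source
UPS_FEED = 0.9999       # an independent UPS-backed source

# Dual HBAs and dual switches form the redundant fabric in both designs.
fabric = series(parallel(HBA, HBA), parallel(SWITCH, SWITCH))

# One SSD unit on one power feed: two non-redundant links in the chain.
single_unit = series(fabric, SSD_UNIT, UTILITY_FEED)

# Two mirrored SSD units on two independent power feeds.
mirrored = series(
    fabric,
    parallel(SSD_UNIT, SSD_UNIT),
    parallel(UTILITY_FEED, UPS_FEED),
)

for name, a in (("single unit, single feed", single_unit),
                ("mirrored units, dual feed", mirrored)):
    print(f"{name}: availability {a:.6f}, "
          f"about {downtime_hours(a):.2f} hours down per year")
```

Because a series chain can never be more available than its weakest link, the redundant fabric does nothing to lift the single-unit design above the availability of its one power feed or one backplane.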
Disaster recoverability in the case of main data center loss
requires that the systems be minimally redundant at the offsite data center,
but the data must be fully redundant. To this end, technologies such as block
level copy or a standby system such as Oracle's Data Guard should be used. In
order to provide performance at the needed levels, an additional set of SSD
assets should be provided at the offsite data center. Another possibility with
Oracle 11g is a geo-remote RAC installation using the preferred-read
technology available in Oracle 11g ASM.
Are TMS RamSans Fully Redundant?
All TMS RamSans (except the internal PCIe RamSan-70) can be
purchased compliant with best practices 1-3. Due to the single backplane design
of the RamSans, two units must be used in a mirrored configuration for assurance
of complete redundancy and reliability (best practice 4). Best practices 5-7
depend on things external to the RamSans. The newly released RamSan-720 and
RamSan-820 designs are fully internally redundant and can be used for HA as a
standalone appliance.
Summary
A system utilizing SSD rack mount technology can be made
fully redundant by using two or more SSD components with external
mirroring. Unless the new RamSan-720 or RamSan-820 is used, running without
two or more mirrored SSD components means that a single point of failure in
the backplane of the SSD could result in downtime and/or loss of data.