High Availability for SAP Systems

Feature Article | October 29, 2010 by Timo Brüggemann

The ftServer 6300 is the top model among the fault-tolerant servers from Stratus (photo: Stratus)

High availability is a buzzword in today’s SAP world, too. Almost every system description, specification, and case study in existence mentions that the servers used are, naturally, highly available. For IT departments, high availability seems to be as ubiquitous as luxury sedans in the management parking lot.

The demand for failsafe systems that perform around the clock, 24/7, is of course perfectly justifiable. IT applications are, after all, essential to the running of every company, and every company models “mission-critical” processes in its Business Suite, i.e., processes on which the organization’s well-being directly depends. If the central servers go down, sooner or later the company’s entire operations will grind to a halt.

Although these requirements are, in some ways, extremely clear-cut, in others they are very vague. Anyone researching the terms “high availability” and “SAP” will realize that almost no company, be they user or provider, is keen to define exactly what is meant by high availability. Instead, one is much more likely to get the impression that high availability means exactly what each system is currently capable of performing. One company, for example, operates an SAP system with a “high availability of 99.5%” and a “maximum continuous downtime of 2 hours.” In another case, “SAP high availability” allows for downtime of up to 20 minutes.

A practical way to envisage this is to imagine all the things that can (or cannot) happen in a company during a 20-minute period. What, for example, would happen if the POS system was down for 20 minutes, or an ordering system was unavailable for that time? What if an emergency call system was unreachable for 20 minutes, or if trucks had to wait 20 minutes at the loading ramp before being processed? There are, of course, other examples. While it’s less dramatic if things in HR or FI come to a standstill, in today’s world, 20 minutes can sometimes be verging on the critical, even for these areas. While HR and Accounting are, at the end of the day, essential to the running of the company, the main aim here is to ensure availability of data. First and foremost, it’s important to make sure that nothing is lost. While this requires a high level of availability, true high availability is not required in every single case.


In fault-tolerant servers, all system components are duplicated, so that no single fault comes at the expense of system availability.

Process Availability Requires High Availability

When process availability, rather than data availability, is called for, things look different. In such cases, systems which, for example, monitor and control a production process in a MII (Manufacturing Integration and Intelligence) application must not go down at all, not even for a minute. Doing so would cause production to be interrupted—with all the consequences that entails for Just-in-Time and Just-in-Series processes. In certain cases, any production batches already started would need to be rejected, as seamless product tracking information would no longer be available for them. This level of process availability is regularly demanded in the manufacturing and pharmaceutical industries—and the requirements of IT are very clear: True high availability.

What this level of high availability actually means for IT seems to be unclear to many companies. The first surprises arise when the availability level is quantified. An availability of 99.5% sounds extremely high to many people—before they begin to do the math, that is. In 24/7 operation, 99.5% availability allows for an average downtime of over 43 hours per year. This is, of course, far removed from process availability, and is potentially unacceptable for many other SAP systems as well. Users also need to understand that unplanned downtime does not follow a schedule and that, in accordance with Murphy’s Law, it will occur at the worst possible time: POS systems go down on the Saturday before Christmas, and online stores fail in the early evening. The results are familiar to every IT director. Incidentally, since 99.5% is also the level of uptime providers are typically able to guarantee for externally hosted applications, it cannot serve as a basis for high availability.

An availability of 99.9%, a level which is often described as high availability, is also insufficient for process-critical applications. In shop floor control with SAP MII, an average downtime of almost nine hours per year is clearly too long, because it cannot adequately rule out process interruptions. Only once availability reaches 99.99%, which reduces the average downtime to around 52 minutes a year, can a system justifiably be referred to as “high availability”. Even 99.99% is, of course, insufficient for some applications. Examples include systems that control power stations or emergency systems in hospitals. Here, organizations must go the extra mile to ensure 99.999% or even 99.9999% (“six nines”) uptime, with an average downtime of around 5 minutes (or roughly half a minute) per year. This is true “continuous availability”.
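
The arithmetic behind these figures is simple enough to check. The following sketch (plain Python, not tied to any SAP tool) converts an availability percentage into the downtime it permits over a 24/7 year of 8,760 hours; it merely reproduces the numbers quoted above.

    # Convert an availability percentage into the yearly downtime it permits
    # in 24/7 operation (365 days x 24 hours = 8,760 hours per year).
    HOURS_PER_YEAR = 365 * 24

    def allowed_downtime(availability_percent: float) -> str:
        """Return the average yearly downtime permitted by an availability level."""
        downtime_hours = (1 - availability_percent / 100) * HOURS_PER_YEAR
        if downtime_hours >= 1:
            return f"{downtime_hours:.1f} hours"
        return f"{downtime_hours * 60:.1f} minutes"

    for level in (99.5, 99.9, 99.99, 99.999, 99.9999):
        print(f"{level}% availability -> up to {allowed_downtime(level)} of downtime per year")

    # 99.5%    -> ~43.8 hours     99.99%  -> ~52.6 minutes
    # 99.9%    -> ~8.8 hours      99.999% -> ~5.3 minutes
    # 99.9999% -> ~0.5 minutes (roughly half a minute, the "six nines" case)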

We’re not simply playing with numbers here. The number of “nines” in the description of the availability level is just the statistical expression of a very real business risk. Even though advances in hardware technology have made server systems considerably more stable than the problem children IT departments formerly had to wrangle, the risk remains: downtime has become less likely, but the potential losses have increased dramatically. Ten minutes of downtime for an SAP system today has completely different consequences for a business than it would have had ten or even fifteen years ago. Most companies are aware of the costs that IT downtime can incur. UPS, for example, has estimated the cost of downtime for its aircraft management system at around 25,000 dollars per minute.
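
To translate the number of nines into a business risk, the downtime figures above can be combined with a cost per minute of downtime. The sketch below (plain Python; it reuses the 25,000-dollar-per-minute figure from the UPS example purely as an illustration, not as a general benchmark) estimates the yearly exposure at different availability levels.

    # Rough yearly exposure: permitted downtime minutes x assumed cost per minute.
    MINUTES_PER_YEAR = 365 * 24 * 60
    COST_PER_MINUTE = 25_000   # illustrative value taken from the UPS example, in dollars

    for availability in (99.5, 99.9, 99.99, 99.999):
        downtime_minutes = (1 - availability / 100) * MINUTES_PER_YEAR
        exposure = downtime_minutes * COST_PER_MINUTE
        print(f"{availability}%: ~{downtime_minutes:,.0f} min/year "
              f"-> up to ${exposure:,.0f} at risk")

    # 99.5% -> ~2,628 min -> ~$65.7 million      99.99%  -> ~53 min -> ~$1.3 million
    # 99.9% -> ~526 min   -> ~$13.1 million      99.999% -> ~5 min  -> ~$131,000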

Companies in this situation haven’t, of course, merely been twiddling their thumbs. Instead, they’ve made efforts to increase the availability of their SAP systems. There are various different technologies available for doing this.


Cluster: High Availability with Restrictions

In today’s SAP world, high availability is achieved mainly through cluster systems. This technology connects (at least) two servers by means of control software. These cluster nodes are constantly monitored by a cluster service. If one node fails, the other takes over its job. These configurations can also comprise dozens of servers.

Many people are, however, unaware of the fact that even clusters do not operate entirely without interruption in the event of a failure. When one computer takes over the duties of another, there is a failover period during which the applications and data are unavailable. This may be due, for example, to system services and programs having to be restarted, or to database transactions being rolled back. Even though failovers are largely automated in modern cluster systems, it always takes several minutes until all systems are fully operational once more. During this period, application states—or entire transactions—can, of course, be lost. Cluster servers therefore achieve availability of only around 99.99%. While this is considerably better than a standalone server can offer, it is insufficient for process-critical applications.
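
To make the failover window concrete, here is a deliberately simplified monitoring loop in Python. It is a hypothetical sketch, not the control software of any real cluster product: the standby node notices a failure only after several missed heartbeats and then still has to restart services, so the application remains unreachable for the whole of that period.

    import time

    HEARTBEAT_INTERVAL = 1.0    # seconds between health checks (illustrative value)
    MISSED_BEATS_LIMIT = 3      # missed heartbeats tolerated before failover starts
    FAILOVER_DURATION = 4.0     # seconds here; typically minutes for a real SAP stack

    def primary_is_alive(tick: int) -> bool:
        """Stand-in health check: the primary node 'fails' after the fifth beat."""
        return tick < 5

    def fail_over_to_standby() -> None:
        """Stand-in for restarting services and rolling back open transactions."""
        print("failover: restarting services on the standby node ...")
        time.sleep(FAILOVER_DURATION)
        print("failover complete: applications reachable again")

    def monitor_primary() -> None:
        missed = 0
        for tick in range(1, 100):
            if primary_is_alive(tick):
                missed = 0
            else:
                missed += 1
                print(f"heartbeat missed ({missed}/{MISSED_BEATS_LIMIT})")
                if missed >= MISSED_BEATS_LIMIT:
                    # Users have already been cut off since the first missed beat;
                    # the outage ends only once the failover has finished.
                    fail_over_to_standby()
                    return
            time.sleep(HEARTBEAT_INTERVAL)

    if __name__ == "__main__":
        monitor_primary()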

From a practical standpoint, cluster systems are also difficult to administer. As a rule, two completely independent systems are always more complicated to maintain than a single one, and a cluster only functions if everything is kept in step on all nodes: updates, the rollout of new security guidelines, and so on. While the operation of two cluster nodes is not exactly straightforward, the effort and time required to monitor and control the cluster grow with every additional server involved. This type of configuration cannot be operated without knowledgeable staff. That makes cluster solutions relatively expensive in terms of total cost, even when comparatively inexpensive server hardware is used.


Fault-Tolerant Servers: High Availability?

Bearing in mind the design-related limitations of cluster technology, companies must find other solutions for performing process-critical tasks. Fault-tolerant servers are particularly suitable for this. Like clusters, this technology is based on redundancy, but it starts one level lower down: rather than duplicating entire servers, the individual computer components are duplicated. Fault-tolerant servers contain two of every component essential to the running of the machine, including processors, memory chips and I/O units—not just the power supplies and hard drives, as is usually the case in high-end systems. If one component fails, its counterpart continues operations—automatically and unnoticed by the user. This enables the application to keep running through disruptions of any kind, without loss of data or application state.
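
A back-of-the-envelope probability model shows why duplicating individual components pays off. The sketch below is an idealized calculation (assuming independent failures and instantaneous switchover), not a Stratus design figure: a redundant pair only fails when both copies fail at the same time.

    # Idealized model: a duplicated component pair fails only if both copies
    # fail at once (assuming independent failures and instant switchover).
    def duplicated_availability(component_availability: float) -> float:
        failure_probability = 1 - component_availability
        return 1 - failure_probability ** 2

    for a in (0.999, 0.9999):
        pair = duplicated_availability(a)
        print(f"single component {a:.4%} -> duplicated pair {pair:.6%}")

    # single 99.90% -> pair ~99.9999% ("six nines")
    # single 99.99% -> pair ~99.999999%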

In contrast to cluster systems, a fault-tolerant server works—as far as the user is concerned—like a black box. In such cases, high availability is a purely internal issue—it does not need to be implemented or ensured. This does, of course, have an effect on the system costs. Although fault-tolerant servers are more expensive upfront than conventional systems, their overall cost is well below that of comparable cluster systems due to their lower operating expenses.

Fault-tolerant servers also surpass cluster systems in terms of reliability. The ftServers offered by Stratus can, for example, achieve availability of up to 99.9999%. This exceeds the availability offered by mainframes—and is currently the highest level of availability provided by commercially available IT solutions. In addition to the process-critical systems mentioned, high availability is preferred for central systems in SAP landscapes, such as message servers or database management systems, as the availability of all related systems depends on them.

High Availability through Software

Recently, software-based high-availability solutions have begun to position themselves between clusters and fault-tolerant servers. These solutions are cost-effective and can be implemented and operated without much expense or effort. Stratus Avance, for example, can be used to link two standard x86 servers—including servers from Dell, HP, Tarox and Wortmann—into a single high-availability unit. The software automatically sets up a common logical server across both machines, on which, in turn, any number of virtual servers can be created. The two computers are linked via a normal network connection and are constantly monitored and synchronized by Avance. If one of the servers fails, the second automatically takes over.

This technology enables an availability level of 99.99 percent to be achieved, with no need for the user to deal with the complexities of a cluster system. Due to its ease of administration, Avance is suitable for SAP applications in distributed locations which have no experts on site.

Determining which technology will best suit a particular company does, of course, depend on its specific requirements. If the company’s priority is to ensure data availability, the organization must decide whether server availability of 99.99 percent is sufficient. If process availability is their main objective, this level of uptime may not be enough. It is important for all SAP users to know that not everything labeled “high availability” actually provides this feature.


2 comments

  1. Derek Prior

    This article is terrible! It is highly misleading. You just talk about unplanned downtime of the hardware that SAP runs on. You don’t even mention the real problem, which is the planned downtime needed for maintenance of SAP applications, Near Zero Downtime stuff, etc…

  2. JDS

    This article just lays out the problem, and the downsides of certain solutions. Then it ends with a plug for a commercial product.

    I was expecting something on SAP’s offerings in this area, like redundant message and enqueue servers.
