What is fault tolerance in a distributed system?

Fault tolerance is the ability of a system to keep working properly in spite of failures within it. Even after extensive testing, a system can still fail; in practice, no system can be made entirely error-free. Hence, systems are designed so that when an error or failure occurs, the system still does its work properly and gives the correct result.


Any system has two major components: hardware and software. A fault may occur in either of them, so there are separate fault-tolerance techniques for hardware and for software.

Hardware Fault-tolerance Techniques:
Making hardware fault-tolerant is simple compared to software. Fault-tolerance techniques keep the hardware working properly and giving correct results even when a fault occurs in some hardware part of the system. Two basic techniques are used for hardware fault tolerance:

BIST –
BIST stands for Built-In Self-Test. In this technique, the system tests itself repeatedly at regular intervals. When it detects a fault, it switches out the faulty component and switches in a redundant copy of it. In effect, the system reconfigures itself when a fault occurs.
TMR –
TMR stands for Triple Modular Redundancy. Three redundant copies of a critical component are created and run concurrently. Their results are put to a vote and the majority result is selected. TMR can tolerate a single fault at a time.
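
The voting step in TMR can be illustrated with a minimal Python sketch. The function names (tmr_vote, run_tmr, healthy, faulty) are made up for illustration, and the three copies are called one after another here, whereas real TMR hardware runs them concurrently on separate modules.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tmr_vote(results: Sequence[T]) -> T:
    """Return the value produced by a strict majority of the copies."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No strict majority: more faults than the voter can mask.
        raise RuntimeError("no majority among redundant copies")
    return value


def run_tmr(copies: Sequence[Callable[[int], int]], x: int) -> int:
    # Simplification: the copies are called in turn instead of in parallel.
    return tmr_vote([copy(x) for copy in copies])


def healthy(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    # Hypothetical faulty module that returns a corrupted result.
    return x * x + 1


if __name__ == "__main__":
    # One faulty copy out of three is outvoted by the two healthy ones.
    print(run_tmr([healthy, faulty, healthy], 7))
```

With two faulty copies that disagree with each other, the voter in this sketch raises an error instead of guessing, which matches the single-fault limit described above.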

Software Fault-tolerance Techniques:
Software fault-tolerance techniques are used to keep software reliable when faults and failures occur. Three techniques are used in software fault tolerance. The first two are common and are basically adaptations of hardware fault-tolerance techniques.

N-version Programming –
In N-version programming, N versions of the software are developed by N individuals or groups of developers. It is the software counterpart of TMR in hardware fault tolerance. All the redundant versions are run concurrently, and because their results may differ, the results are compared to select the output. The idea of N-version programming is basically to catch the errors introduced during development, since independently developed versions are unlikely to all contain the same mistake.
Recovery Blocks –
The recovery blocks technique is similar to N-version programming, but the redundant copies are generated using different algorithms only. The copies are not run concurrently; they are run one by one, with the next copy tried only if the result of the previous one is rejected. The recovery block technique can only be used where task deadlines are longer than the task computation time.
Check-pointing and Rollback Recovery –
This technique differs from the two software fault-tolerance techniques above. The state of the system is saved (checkpointed) each time a computation is performed, so that after a fault the system can be rolled back to the most recent saved state. This technique is mainly useful when there is a processor failure or data corruption.
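
As a rough illustration of how recovery blocks and checkpoint-and-rollback fit together, the Python sketch below saves the state before trying each alternate, checks the result with an acceptance test, and rolls back before trying the next alternate. All names (recovery_block, fast_but_buggy_sort, simple_sort, make_acceptance_test) are hypothetical, and the in-memory deep copy stands in for whatever checkpoint storage a real system would use.

```python
import copy
from typing import Callable, Dict, List

State = Dict[str, List[int]]


def recovery_block(
    state: State,
    alternates: List[Callable[[State], None]],
    acceptance_test: Callable[[State], bool],
) -> str:
    """Run the alternates one by one until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)  # checkpoint: save a known-good state
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return alternate.__name__
        except Exception:
            pass  # a crash is treated like a failed acceptance test
        # Rollback: restore the checkpoint before trying the next alternate.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternates failed the acceptance test")


def fast_but_buggy_sort(state: State) -> None:
    # Hypothetical primary algorithm that drops the last element by mistake.
    state["data"] = sorted(state["data"])[:-1]


def simple_sort(state: State) -> None:
    # Hypothetical alternate built on a different, simpler algorithm.
    state["data"] = sorted(state["data"])


def make_acceptance_test(expected_len: int) -> Callable[[State], bool]:
    def accept(state: State) -> bool:
        data = state["data"]
        in_order = all(a <= b for a, b in zip(data, data[1:]))
        return len(data) == expected_len and in_order
    return accept


if __name__ == "__main__":
    state = {"data": [3, 1, 2]}
    used = recovery_block(
        state, [fast_but_buggy_sort, simple_sort], make_acceptance_test(3)
    )
    print(used, state["data"])
```

Here the primary fails the acceptance test, the state is rolled back, and the alternate produces the accepted result. Because the alternates run in sequence, the worst-case run time is the sum of all of them, which is why the deadline must leave room beyond a single computation.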


Fault-tolerant technology is a capability of a computer system, electronic system or network to deliver uninterrupted service, despite one or more of its components failing. Fault tolerance also resolves potential service interruptions related to software or logic errors. The purpose is to prevent catastrophic failure that could result from a single point of failure.

VMware vSphere 6 Fault Tolerance is a branded, continuous data availability architecture that exactly replicates a VMware virtual machine on an alternate physical host if the main host server fails.

Fault-tolerant systems are designed to compensate for multiple failures. Such systems automatically detect a failure of the computer processor unit, I/O subsystem, memory cards, motherboard, power supply or network components. The failure point is identified, and a backup component or procedure immediately takes its place with no loss of service.

To ensure fault tolerance, enterprises need to purchase an inventory of formatted computer equipment and a secondary uninterruptible power supply device. The goal is to prevent the crash of key systems and networks, focusing on issues related to uptime and downtime.

Fault tolerance can be provided with software, embedded in hardware, or by some combination of the two.

In a software implementation, the operating system (OS) provides an interface that allows a programmer to checkpoint critical data at predetermined points within a transaction. In a hardware implementation (for example, with Stratus and its Virtual Operating System), the programmer does not need to be aware of the fault-tolerant capabilities of the machine.

At a hardware level, fault tolerance is achieved by duplexing each hardware component. Disks are mirrored. Multiple processors are lockstepped together and their outputs are compared for correctness. When an anomaly occurs, the faulty component is determined and taken out of service, but the machine continues to function as usual.

Fault tolerance vs. high availability

Fault tolerance is closely associated with maintaining business continuity via highly available computer systems and networks. Fault-tolerant environments are defined as those that restore service instantaneously following a service outage, whereas a high-availability environment strives for five nines of operational service.

In a high-availability cluster, sets of independent servers are loosely coupled together to guarantee system-wide sharing of critical data and resources. The clusters monitor each other's health and provide fault recovery to ensure applications remain available. Conversely, a fault-tolerant cluster consists of multiple physical systems that share a single copy of a computer's OS. Software commands issued by one system are also executed on the other system.

The trade-off between fault tolerance and high availability is cost. Systems with integrated fault tolerance incur a higher cost due to the inclusion of additional hardware.

What is graceful degradation?

Fault tolerance is often used synonymously with graceful degradation, although the latter is more aligned with the more holistic discipline of fault management, which aims to detect, isolate and resolve problems pre-emptively. A fault-tolerant system swaps in backup componentry to maintain high levels of system availability and performance. Graceful degradation allows a system to continue operations, albeit in a reduced state of performance.
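
A small Python sketch can make the contrast concrete: instead of swapping in a backup component, a gracefully degrading service falls back to a reduced answer. The names (recommendations, personalized, fallback) are hypothetical and the example is only one possible way to structure such a fallback.

```python
from typing import Callable, Dict, List, Union

Response = Dict[str, Union[List[str], bool]]


def recommendations(
    user_id: int,
    personalized: Callable[[int], List[str]],
    fallback: List[str],
) -> Response:
    """Serve personalized results, or a generic list if that component fails."""
    try:
        return {"items": personalized(user_id), "degraded": False}
    except Exception:
        # Reduced state of performance: the service still answers,
        # but with a static list instead of personalized results.
        return {"items": list(fallback), "degraded": True}


def broken_personalizer(user_id: int) -> List[str]:
    # Hypothetical component that is currently unavailable.
    raise ConnectionError("personalization service unavailable")


if __name__ == "__main__":
    print(recommendations(42, broken_personalizer, ["bestseller-1", "bestseller-2"]))
```

The caller still gets a usable response, and the degraded flag lets monitoring notice that the system is running in a reduced mode.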

Matching data protection and fault tolerance

Fault tolerance hinges on redundancy. Namely, information is redundantly protected via data replication or synchronous mirroring of volumes to an off-site data center. For physical redundancy, extra hardware equipment remains on standby for failover of operational systems.

Data backup is frequently combined with redundancy. Both strategies are intended as a safeguard against data loss, although backup tends to focus on point-in-time recovery, including granular recovery of a discrete data object. Redundant systems are engineered specifically for application workloads that tolerate very little downtime.

When implementing fault tolerance, enterprises should match data availability requirements to the appropriate level of data protection with redundant array of independent disks (RAID). The RAID technique ensures data is written to multiple hard disks, both to balance I/O operations and boost overall system performance.

Organizations that prioritize fault tolerance above speed and performance would be best served by RAID 1 disk mirroring or RAID 10, which combines disk mirroring and disk striping. If fault tolerance and system performance are equally important, an enterprise may find it worthwhile to spend a little extra money combining RAID 10 with RAID 6, or double-parity RAID, which tolerates two disk failures before data is lost. Aside from higher cost, the other drawback is that data writes occur more slowly to the RAID set.
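
To compare these RAID levels side by side, the Python sketch below computes usable capacity and the number of disk failures each level is guaranteed to survive. The function name raid_summary is made up for illustration, and the sketch assumes equal-sized disks and two-way mirrors for RAID 10; real arrays and controllers may behave differently.

```python
from typing import Dict


def raid_summary(level: str, disks: int, disk_tb: float) -> Dict[str, float]:
    """Usable capacity and guaranteed tolerated disk failures for one array."""
    if level == "1":
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        # Every disk holds a full copy, so capacity equals one disk.
        usable, tolerated = disk_tb, disks - 1
    elif level == "10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        # Striping over two-way mirrors: half the raw capacity is usable.
        # Only one failure is guaranteed, since a second failure in the
        # same mirror pair loses data.
        usable, tolerated = disks / 2 * disk_tb, 1
    elif level == "6":
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        # Double parity costs the capacity of two disks.
        usable, tolerated = (disks - 2) * disk_tb, 2
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    return {"usable_tb": usable, "tolerated_failures": tolerated}


if __name__ == "__main__":
    for level in ("10", "6"):
        print(level, raid_summary(level, disks=6, disk_tb=4.0))
```

For six 4 TB disks under these assumptions, RAID 10 yields 12 TB usable with one guaranteed failure tolerated, while RAID 6 yields 16 TB usable with two.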

Aside from hardware, a fault-tolerant architecture should be coordinated with regularly scheduled backups of critical data, perhaps including a mirrored copy at a secondary or alternate location. Security needs to be part of the planning to prevent unauthorized access, and to apply antivirus tools and the most recent version of the computing system OS.

Which industries depend on system fault tolerance?

Fault tolerance refers not only to the consequence of having redundant equipment, but also to the ground-up methodology computer makers use to engineer and design their systems for reliability. Fault tolerance is a required design specification for computer equipment used in online transaction processing systems, such as airline flight control and reservations systems. Fault-tolerant systems are also widely used in sectors such as distribution and logistics, electric power plants, heavy manufacturing, industrial control systems and retailing.


Fault Tolerance Definition

Fault Tolerance simply means a system’s ability to continue operating uninterrupted despite the failure of one or more of its components. This is true whether it is a computer system, a cloud cluster, a network, or something else. In other words, fault tolerance refers to how an operating system (OS) responds to and allows for software or hardware malfunctions and failures.

An OS’s ability to recover and tolerate faults without failing can be handled by hardware, software, or a combined solution leveraging load balancers (see more below). Some computer systems use multiple duplicate fault tolerant systems to handle faults gracefully. This is called a fault tolerant network.

FAQs

What is Fault Tolerance?

The goal of fault tolerant computer systems is to ensure business continuity and high availability by preventing disruptions arising from a single point of failure. Fault tolerance solutions therefore tend to focus most on mission-critical applications or systems.

Fault tolerant computing may include several levels of tolerance:

At the lowest level, the ability to respond to a power failure, for example.
A step up: during a system failure, the ability to use a backup system immediately.
Enhanced fault tolerance: a disk fails, and mirrored disks take over for it immediately. This provides functionality despite partial system failure, or graceful degradation, rather than an immediate breakdown and loss of function.
High level fault tolerant computing: multiple processors collaborate to scan data and output to detect errors, and then immediately correct them.

Fault tolerance software may be part of the OS interface, allowing the programmer to check critical data at specific points during a transaction.

Fault-tolerant systems ensure no break in service by using backup components that take the place of failed components automatically. These may include:

Hardware systems with identical or equivalent backup operating systems. For example, a server with an identical fault tolerant server mirroring all operations in backup, running in parallel, is fault tolerant. By eliminating single points of failure, hardware fault tolerance in the form of redundancy can make any component or system far safer and more reliable.
Software systems backed up by other instances of software. For example, if you replicate your customer database continuously, operations in the primary database can be automatically redirected to the second database if the first goes down.
Redundant power sources can help avoid a system fault if alternative sources can take over automatically during power failures, ensuring no loss of service.
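
The replicated-database example above can be sketched in Python as a client that redirects operations to the second database when the first goes down. The Database and FailoverClient classes are hypothetical stand-ins, not a real driver API, and the sketch ignores replication lag and failback.

```python
class Database:
    """Hypothetical stand-in for a database connection."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up

    def query(self, sql: str) -> str:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}: {sql}"


class FailoverClient:
    """Sends queries to the active database and fails over on error."""

    def __init__(self, primary: Database, replica: Database) -> None:
        self.active = primary
        self.standby = replica

    def query(self, sql: str) -> str:
        try:
            return self.active.query(sql)
        except ConnectionError:
            # Redirect to the replica. If it is also down, the error
            # propagates to the caller.
            self.active, self.standby = self.standby, self.active
            return self.active.query(sql)


if __name__ == "__main__":
    primary, replica = Database("primary"), Database("replica")
    client = FailoverClient(primary, replica)
    print(client.query("SELECT 1"))
    primary.up = False  # simulate the first database going down
    print(client.query("SELECT 1"))
```

The caller keeps using the same client object throughout, so the switch to the backup is automatic from its point of view.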

High Availability vs Fault Tolerance

Highly available systems are designed to minimize downtime to avoid loss of service. Availability is expressed as a system’s uptime as a percentage of total running time, and 99.999 percent uptime is the ultimate goal of high availability.

Although both high availability and fault tolerance reference a system’s total uptime and functionality over time, there are important differences and both strategies are often necessary. For example, a totally mirrored system is fault-tolerant; if one mirror fails, the other kicks in and the system keeps working with no downtime at all. However, that’s an expensive and sometimes unwieldy solution.

On the other hand, a highly available system such as one served by a load balancer allows minimal downtime and related interruption in service without total redundancy when a failure occurs. A system with some critical parts mirrored and other, smaller components duplicated has a hybrid strategy.

In an organizational setting, there are several important concerns when creating high availability and fault tolerant systems:

Cost. Fault tolerant strategies can be expensive, because they demand the continuous maintenance and operation of redundant components. High availability is usually part of a larger system, one of the benefits of a load balancing solution, for example.

Downtime. The greatest difference between a fault-tolerant system and a highly available system is downtime, in that a highly available system has some minimal permitted level of service interruption. In contrast, a fault-tolerant system should work continuously with no downtime even when a component fails. Even a system with the five nines standard for high availability will experience approximately 5 minutes of downtime annually.

Scope. High availability systems tend to share resources designed to minimize downtime and co-manage failures. Fault tolerant systems require more, including software or hardware that can detect failures and change to redundant components instantly, and reliable power supply backups.
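
The five nines figure mentioned under Downtime follows from simple arithmetic, sketched here in Python. The function name is made up for illustration and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year permitted by a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999):
        print(f"{nines}% -> {annual_downtime_minutes(nines):.2f} minutes per year")
```

At 99.999 percent this works out to about 5.26 minutes per year, compared with roughly 52.6 minutes at 99.99 percent and about 8.8 hours at 99.9 percent.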

Certain systems may require a fault-tolerant design, which is why fault tolerance is important as a basic matter. On the other hand, high availability is enough for others. The right business continuity strategy may include both fault tolerance and high availability, intended to maintain critical functions throughout both minor failures and major disasters.

What are Fault Tolerance Requirements?

Depending on the fault tolerance issues that your organization copes with, there may be different fault tolerance requirements for your system. That is because fault-tolerant software and fault-tolerant hardware solutions both offer very high levels of availability, but in different ways.

Fault-tolerant servers use a minimal amount of system overhead to achieve high availability with an optimal level of performance. Fault-tolerant software may be able to run on servers you already have in place that meet industry standards.

What is Fault Tolerance Architecture?

There is more than one way to create a fault-tolerant server platform and thus prevent data loss and eliminate unplanned downtime. Fault tolerance in computer architecture simply reflects the decisions administrators and engineers use to ensure a system persists even after a failure. This is why there are various types of fault tolerance tools to consider.

At the drive controller level, a redundant array of inexpensive disks (RAID) is a common fault tolerance strategy that can be implemented. Other facility level forms of fault tolerance exist, including cold, hot, warm, and mirror sites.

Fault tolerance computing also deals with outages and disasters. For this reason a fault tolerance strategy may include a backup power source, such as an uninterruptible power supply (UPS) or a generator, as a way to run independently from the grid should it fail.

Byzantine fault tolerance (BFT) is another issue for modern fault tolerant architecture. BFT systems are important to the aviation, blockchain, nuclear power, and space industries because these systems prevent downtime even if certain nodes in a system fail or are driven by malicious actors.

What is the Relationship Between Security and Fault Tolerance?

Fault tolerant design helps prevent security breaches by keeping your systems online and by ensuring they are well-designed. A naively designed system can be taken offline easily by an attack, causing your organization to lose data, business, and trust. Each firewall, for example, that is not fault tolerant is a security risk for your site and organization.

What is Fault Tolerance in Cloud Computing?

Conceptually, fault tolerance in cloud computing is mostly the same as it is in hosted environments. Cloud fault tolerance simply means your infrastructure is capable of supporting uninterrupted functionality of your applications despite failures of components.

In a cloud computing setting, that may be achieved through autoscaling across geographic zones or within the same data centers. There is likely more than one way to achieve fault tolerant applications in the cloud in most cases. The overall system will still demand monitoring of available resources and potential failures, as with any fault tolerance in distributed systems.

What Are the Characteristics of a Fault Tolerant Data Center?

To be called a fault tolerant data center, a facility must avoid any single point of failure. Therefore, it should have two parallel systems for power and cooling. However, total duplication is costly, gains are not always worth that cost, and infrastructure is not the only answer. Therefore, many data centers practice fault avoidance strategies as a mid-level measure.

Load Balancing Fault Tolerance Issues

Load balancing and failover solutions can work together in the application delivery context. These strategies provide quicker recovery from disasters through redundancy, ensuring availability, which is why load balancing is part of many fault tolerant systems.

Load balancing solutions remove single points of failure, enabling applications to run on multiple network nodes. Most load balancers also make various computing resources more resilient to slowdowns and other disruptions by optimizing distribution of workloads across the system components. Load balancing also helps deal with partial network failures, shifting workloads when individual components experience problems.
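
How a load balancer shifts workloads away from a failed node can be sketched with a simple round-robin picker in Python. The RoundRobinBalancer class is hypothetical and does not reflect how any particular product works; in practice the mark_down and mark_up calls would be driven by health checks.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Rotates over nodes and skips any that are marked unhealthy."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self.nodes: List[str] = list(nodes)
        self.healthy: Set[str] = set(self.nodes)
        self._next = 0

    def mark_down(self, node: str) -> None:
        self.healthy.discard(node)

    def mark_up(self, node: str) -> None:
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self) -> str:
        # Try each node at most once per call, starting where we left off.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
    print([lb.pick() for _ in range(3)])
    lb.mark_down("node-b")  # simulate a failed health check
    print([lb.pick() for _ in range(4)])
```

After node-b is marked down, requests alternate between the two remaining nodes, so no single node is a point of failure as long as at least one stays healthy.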

Does Avi Networks Offer a Fault Tolerance Solution?

Avi offers load balancing capabilities that can keep your systems online reliably. Avi aids fault tolerance by automatically instantiating virtual services when one fails, redistributing traffic, and handling workload moves or additions, reducing the chance of a single point of failure strangling your system.
