
CS 3200 (Computer Ethics) Semester Paper [Fall 2017]


            This paper discusses fault-tolerance and its importance as a concept, not just for operating systems but for technology at large. We discuss the common classifications of faults, and how one can prepare for and implement measures to deal with them safely without compromising the entire system. We then look at practical applications of fault-tolerance in the real world, along with a brief history of its usage in the past compared to today. The paper closes by examining what it means for an OS to be reliable, and whether such a state can even be compared amongst competitors. All of this comes together to make a strong case for why fault-tolerance is such a necessary idea, and one that will likely persist long into the future due to its practicality.

  1. Introduction

            Faults are inevitable. You can test and perfect a system all you want, but at some point, due to the environment, aging components, or spontaneous breakage, a fault will occur. This is why it is futile to try to design something incapable of faults; it is much more practical to plan ahead, accept that faults will always occur unexpectedly, and implement measures to detect and deal with them in an orderly fashion. The real danger faults pose is unexpected behavior, but fault-tolerance and redundancy help eliminate the unexpected and keep the system behaving exactly as intended. This concept of fault-tolerance, its inner workings, history, and applications, is what is explored further in the paper.
We explore how exactly one classifies and categorizes faults, so that one can understand the parameters of the problem to be anticipated and solved: a look at all the ways a system can be compromised, from the smallest bug in code to a burnt-out processor. We then delve deeper into how fault-tolerant measures aid in the fight against faults, discussing hardware and software solutions amongst other more specific examples. Next we take a more practical look at the concept and how such solutions are commonly achieved in the real world, followed shortly after by a brief look at the history of fault-tolerant implementations, and how some strategies persist from the past to this very day simply due to their ingenuity. The paper closes out by looking at how fault-tolerance ties into the concept of ‘reliability’, as most would like to use something that is unlikely to be down for long periods of time. This calls into question whether it is even possible to truly compare operating systems in terms of reliability, simply due to the breadth of different issues and implementations that occur. We then conclude by discussing the overall importance of fault-tolerance, and how it will likely persist long into the future as people develop new strategies and methods for combating faults.

  2. Fault Information

            Firstly, we need to know exactly what constitutes a fault in the first place. At its core, a fault is simply a malfunction in hardware, or a mistake in software, that ends up causing a system to experience errors; the error is the manifestation of the fault’s damage. For example, say a fault occurs in an adder circuit that renders it incapable of outputting anything but 1. Whenever the system uses that adder for further computation, it will feed erroneous data to other sections, affecting a wider and wider range of computations until the entire system is compromised. This is only one example, but it showcases exactly how dangerous a seemingly innocuous fault can become [5]. Several categories exist to generalize faults, making them easier to identify and match to the best solution. Two overarching categories exist: describing a fault by its temporal behavior, i.e. how long it affects a system, or by its output behavior, i.e. how it manipulates or changes the output of the affected area to something unexpected.
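The stuck-at adder scenario above can be sketched in a few lines of illustrative code (the adder functions and the running-total chain are assumptions for demonstration, not a model of any real circuit):

```python
def good_adder(a, b):
    return a + b

def faulty_adder(a, b):
    return 1  # a "stuck-at" fault: the output is always 1, regardless of input

def running_total(adder, values):
    # Each partial sum feeds the next addition, so a single faulty
    # component corrupts every result that depends on it.
    total = 0
    for v in values:
        total = adder(total, v)
    return total

print(running_total(good_adder, [2, 3, 4]))    # 9
print(running_total(faulty_adder, [2, 3, 4]))  # 1 — every downstream sum is wrong
```

The point of the sketch is that the fault itself is local (one bad adder), but the error it produces spreads to every computation that consumes its output.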

2.1 Temporal Classifications

As far as temporal behavior is concerned, there are three further groupings to take into account: Transient, Intermittent, and Permanent. Transient faults are the easiest to deal with and the best type one can hope for, as they simply appear for a short period of time and then vanish completely. A good example of this type of fault is when a network message doesn’t reach its destination, and the system has to wait and retransmit for it to be successful [1]. Intermittent faults, however, are the most frustrating to deal with. These faults appear and disappear like a transient fault would, but instead of going away forever they reappear time and time again to cause trouble and disrupt system operations. A good real-world example is a loose wire that needs to be plugged in fully for the system to stop reporting a problem [4]. Permanent faults are the last of the grouping, and possibly the worst in terms of cost to fix: once a permanent fault has compromised part of a system, it is impossible to get rid of short of replacing the components involved or repairing whatever damage has occurred [4].
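The retransmit strategy for transient faults can be sketched as follows; the flaky sender below is a made-up stand-in for an unreliable network channel, not a real networking API:

```python
def make_flaky_sender(failures_before_success):
    # Simulates a transient fault: the first few sends are lost,
    # then the channel recovers on its own.
    state = {"calls": 0}
    def send(msg):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TimeoutError("message lost in transit")
        return "ack"
    return send

def send_with_retry(send, msg, attempts=5):
    # Transient faults vanish on their own, so simply retrying the
    # operation is often enough to mask them entirely.
    for _ in range(attempts):
        try:
            return send(msg)
        except TimeoutError:
            continue  # wait and retransmit
    raise RuntimeError("still failing; treat as intermittent or permanent")

print(send_with_retry(make_flaky_sender(2), "hello"))  # "ack" on the third try
```

Note that the retry budget also doubles as a crude classifier: a fault that survives every attempt is, by this point, no longer plausibly transient.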

2.2 Output Classifications

Now it’s time to take a look at how we can group faults based on the way they affect a system’s output. Technically there is a wide range of specific categories in this group, but it is easier to understand by pooling them under two general headings: Fail-stop faults and Byzantine faults. A Fail-stop fault does what its name implies, halting the affected component’s output completely. The unit in question can no longer transmit any output, and is essentially shut down by the fault. This is simple to understand, and relatively easy to diagnose and fix when it occurs, but Byzantine faults are an entirely different story. These faults act like a virus, infecting a component and making it behave in unpredictable ways. Under the sway of a Byzantine fault, a component may pretend not to hear any input coming from the rest of the system, or take in input and then maliciously change the expected output to cause further damage and problems throughout the system, as we illustrated at the start of this section. These faults are typically much more challenging to prepare for and usually require significant additional hardware support. Again, there are more specific classifications, such as omission faults and commission faults, but the way they alter output generally falls under the umbrella provided by the Fail-stop and Byzantine classifications [2].



  3. Fault-Tolerant Strategies

Now that we know all about the various types of faults and how they damage and compromise systems, it’s time to look at the strategies employed to anticipate, control, and even correct these faults so that core functionality is preserved. For almost all methods of achieving fault-tolerance, redundancy is the chief idea involved. Faults are by and large inevitable, and systems need to be prepared in advance to expect them and deal with them in an orderly and predictable manner, typically through Backward Recovery or Forward Recovery. Four key implementations of redundancy are employed to achieve this goal: Hardware redundancy, Software redundancy, Information redundancy, and Time redundancy [5].

3.1 Hardware and Software Redundancy

            Let us first examine exactly how hardware and software redundancy are achieved. For both of these methods there are two styles of implementation, static and dynamic, and it is up to the designer to choose which of the two to focus on, or to implement a hybrid solution where different sections of a system rely on one style or the other. Regardless of which is chosen, in hardware redundancy several additional copies of the component in question are installed on the system, far more than it would normally need to function properly. In the static method of hardware redundancy, a tactic commonly referred to as “fault-masking” is employed. The system runs whatever computation or task on the component in question as well as on the spare components of the same type in parallel. These then send their output to a voting unit, which tallies the responses and determines the correct output through majority vote. The expectation in this strategy is that should a fault affect one or more of the components, there are enough other perfectly good ones running the same computation that the correct output will overwhelm any possible deviation, thus ensuring the system does not become compromised. The other method is dynamic, where the system has a built-in capability to detect when the currently utilized component has been compromised by a fault, and then automatically switches to one of the spare inactive components for the current task instead. Of course, additional mechanisms must be put in place so the system can perform such a switch in the first place, but this is usually more efficient than the static method [3].
Now is also a good time to go more in-depth into the previously mentioned Forward Recovery. As just explained, the static method of hardware redundancy uses parallel computations with voters in order to silence any faults that may occur and keep the system running. This is the same principle involved in the Forward Recovery method, and it is best understood through the visual guide below:
Figure 1. TMR Diagram example (A) No Redundancy (B) TMR [2].

            Figure 1 illustrates a common implementation of Forward Recovery known as Triple-Modular Redundancy (TMR), itself a specific instance of the more general N-Modular Redundancy technique. The basic premise is that all copies run the same computation and export their output to voters, which then use a method, typically majority rule, to determine the correct output to transmit elsewhere in the system. This method is very useful, but can become very costly as well, because the more faults the system is expected to encounter, the more spare components and voters must be installed to compensate. The TMR example above, for instance, can suffer up to 2 Fail-stop faults and remain functioning, but can only deal with a single Byzantine fault. This is because if two components simply fail, the one remaining output still gets through to the system and all is well, but a Byzantine fault may transmit corrupted data, so 2 additional ‘good’ components are needed to outvote it. In general, if k is the number of faults anticipated, then k+1 components are needed to deal with Fail-stop faults and 2k+1 components to deal with Byzantine faults [1].
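A minimal sketch of the voter at the heart of TMR, under the simplifying assumption that a fail-stop replica reports nothing (modeled here as None) while a Byzantine replica reports a wrong value:

```python
from collections import Counter

def majority_vote(outputs):
    # The voting unit of an N-modular-redundancy arrangement: the value
    # produced by the most replicas wins. Fail-stop replicas produce no
    # output (None) and are simply excluded from the tally.
    votes = Counter(o for o in outputs if o is not None)
    value, _count = votes.most_common(1)[0]
    return value

# Three replicas compute 2 + 2; one suffers a Byzantine fault and
# reports a corrupted value. The two good replicas outvote it.
print(majority_vote([4, 4, 7]))        # 4

# With two fail-stop faults, the single surviving output still gets through.
print(majority_vote([None, None, 4]))  # 4
```

The two calls mirror the k+1 vs. 2k+1 counts from the text: the lone survivor suffices against fail-stop faults, while outvoting one Byzantine value requires two agreeing good replicas.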
Software redundancy, as said before, utilizes similar methods. However, there is one key difference: one cannot simply run multiple copies of the same software and expect them to behave differently on the same input. If a fault caused the first copy to give an incorrect output, the same problem would occur with each and every successive copy. Instead, true software redundancy requires several different designs of the same software implemented on the system. This can be achieved by creating separate teams of people and giving them the same general design goal to write towards, by utilizing different tools and programming environments, or even by writing in different languages. The key is that the versions be different enough not to fail on the same input, while similar enough that no one implementation takes dramatically longer or shorter than any other and causes additional problems for the system in a different way. Beyond this, the methods are largely the same as in hardware redundancy: either one program is run, and when a fault is detected it is swapped out for a spare implementation to complete the same task; or all implemented versions are run in parallel and the output is voted on, in a method similar to TMR, to determine the correct final output of the entire computation [4].
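The design-diversity idea can be sketched with three deliberately different implementations of one trivial specification; the version functions and the planted bug are illustrative assumptions, not real N-version software:

```python
from collections import Counter

# Three independently written "versions" of the same specification:
# compute the mean of a list. Design diversity means a bug in one
# implementation is unlikely to be shared by the others.
def mean_v1(xs):
    return sum(xs) / len(xs)

def mean_v2(xs):
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs)

def mean_v3(xs):
    return sum(xs) / (len(xs) - 1)  # buggy version: off-by-one divisor

def n_version_mean(xs, versions=(mean_v1, mean_v2, mean_v3)):
    # Run every version on the same input and vote on the answers,
    # just as the TMR voter does for hardware replicas.
    results = [round(v(xs), 9) for v in versions]
    value, _ = Counter(results).most_common(1)[0]
    return value

print(n_version_mean([2.0, 4.0, 6.0]))  # 4.0 — the buggy version is outvoted
```

Running three identical copies of mean_v3 would have produced the wrong answer unanimously, which is exactly why copies alone are pointless for software.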

3.2 Information and Time Redundancy

Now that we know about hardware and software redundancy methods, let’s take a look at the more specific information and time redundancy implementations. Information redundancy has historically been used to safeguard sensitive data needed for transmission, or the kind typically found in memory. It achieves this by encoding data with additional ‘information’, such as check bits or error-correction codes, that can be used by other parts of the system to monitor for any changes, and even to correct bit errors should they occur. Since this is quite a low-level form of redundancy, it also requires additional hardware to be installed on the system so that all these checks and error-correction computations can be performed [1]. Time redundancy, on the other hand, is implemented by allowing components and software more time to perform a re-do should a fault be detected or assumed to have occurred. This may seem trivial, but a good portion of faults are Transient, so just because something failed once doesn’t necessarily mean a more serious fault is to blame, and simply waiting it out can solve the issue entirely [1].
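The check-bit idea behind information redundancy can be sketched with the simplest possible code, a single even-parity bit (real systems use stronger error-correcting codes such as Hamming codes, which can also repair the error rather than just detect it):

```python
def add_parity(bits):
    # Even-parity encoding: append one check bit so that the total
    # number of 1s in the codeword is even. The extra bit is the
    # "additional information" stored alongside the data.
    return bits + [sum(bits) % 2]

def parity_ok(codeword):
    # Any single flipped bit makes the count of 1s odd, exposing the fault.
    return sum(codeword) % 2 == 0

word = add_parity([1, 0, 1, 1])
print(parity_ok(word))   # True — no fault

word[2] ^= 1             # a transient fault flips one bit in memory
print(parity_ok(word))   # False — the error is detected
```

One parity bit can only detect an odd number of flipped bits, which is why memory and transmission hardware typically pairs it with, or replaces it by, full error-correction codes.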
Now is a good time to go more in-depth into the previously mentioned Backward Recovery. This type of recovery is most closely linked with Time redundancy, due to the creation and maintenance of ‘states’. This method requires that the system periodically save its current ‘state’ to serve as a checkpoint the system can revert to should a fault be encountered. A familiar example is a computer that has been seriously compromised in some way and needs to be completely restored to its ‘factory settings’: the computer goes all the way back to its original default state, before it saw any real use. In our case the state saving would be more frequent and periodic, so the recovery method could be applied more often and be more useful in the long run. Whenever the system encounters a fault during the course of a computation, it can stop, revert to its previously recorded state, and choose its next step: assume the fault was transient and try again with the same hardware and software as last time, assume the hardware was to blame and switch to a new component, or assume the software was at fault and switch to a different implementation of the program in question. Overall, the key difference between Backward and Forward Recovery is that Backward requires the system to constantly save ‘states’ or ‘checkpoints’ to revert to and then try again, whereas Forward notices there is a problem but is able to deal with it at the current system state and move on [4].
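The checkpoint-and-rollback cycle of Backward Recovery can be sketched as follows; the CheckpointedCounter class and its fault flag are illustrative assumptions, not any real recovery API:

```python
import copy

class CheckpointedCounter:
    # Backward Recovery in miniature: periodically save the current
    # state, and on a fault, roll back to the last checkpoint and retry.
    def __init__(self):
        self.state = {"total": 0}
        self.checkpoint = copy.deepcopy(self.state)

    def save_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self.checkpoint)

    def add(self, x, fault=False):
        self.state["total"] += x
        if fault:
            # A fault is detected mid-computation: revert to the last
            # known-good state and redo the step, rather than continuing
            # with possibly corrupted data.
            self.rollback()
            self.state["total"] += x

c = CheckpointedCounter()
c.add(5)
c.save_checkpoint()
c.add(3, fault=True)     # the faulty attempt is rolled back, then redone
print(c.state["total"])  # 8
```

A real system would also decide at the rollback point whether to retry on the same hardware, a spare component, or an alternate software version, as described above.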

  4. Different Faults and Respective Solutions

Different types of computer systems may require different sets of fault-tolerance measures. For example, a large company may use a file server as a hardware-redundancy backstop for many employee workstations: if data is lost, it can be restored from a backup saved on another piece of hardware. Below, Figure 2a portrays a simple server/workstation relationship and how they are networked together.


Figure 2a illustrates a server / client relationship [8]


4.1 Permanent Faults and Intermittent Faults

A server also requires a few specific fault-tolerant measures, such as a constant power source: if a server loses power, the business as a whole may lose a lot of money. According to an article by Oracle, multiple utility feeds or a generator are usually the main solution for this type of permanent fault [6]. Servers are also set up with dynamic hardware redundancy to protect against intermittent faults; network issues are one example. If a network card goes out, a server is usually equipped with more than one as a fail-safe. Those cards are also connected to different switches, so that if a switch goes out the server still has network access by falling back on the other switch. Below, Figure 2b shows server hardware with multiple built-in network cards whose ports go to different switches to combat connectivity loss.
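The failover behavior described above can be sketched as follows; the path names and functions are illustrative assumptions, not a real networking API:

```python
def send_via(path, working_paths, msg):
    # Stand-in for transmitting over one NIC/switch pair; a failed
    # card or switch shows up as a connection error.
    if path not in working_paths:
        raise ConnectionError(f"{path} is down")
    return f"sent via {path}"

def send_redundant(msg, paths, working_paths):
    # Dynamic redundancy: try the primary path, and on failure fall
    # back to the spare card wired to a different switch.
    for path in paths:
        try:
            return send_via(path, working_paths, msg)
        except ConnectionError:
            continue  # fall back to the next card/switch pair
    raise RuntimeError("all network paths down")

# Switch 1 has failed, taking the primary path with it.
print(send_redundant("report", ["nic1-switch1", "nic2-switch2"],
                     working_paths={"nic2-switch2"}))  # sent via nic2-switch2
```

Wiring each card to a different switch matters because it removes the switch itself as a single point of failure, not just the card.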

Figure 2b. Network redundancy [7]

  5. Fault Tolerant Systems From Past to Future

In the 1950s, a Czech computer scientist by the name of Antonin Svoboda invented the first known fault-tolerant computer, SAPO. Its basic design used magnetic drums with a voting method of memory error detection (triple modular redundancy). Other systems were also based on SAPO, mainly for military use [9]. SAPO was a fault detector; the user would then come and replace whatever was the cause of the fault. Over time, fault-tolerant systems became widely used in hospitals, utility companies, and even NASA. According to Dhiraj K. Pradhan, it was not enough just to detect a problem; the system had to be not only self-diagnosing but also self-repairing [9]. An example of this type of system is a RAID array, which is basically an array of hard drives linked together as a single unit. According to Webopedia, data can be mirrored on one or more other disks in the same array, so that if one disk fails, the data is preserved. Thanks to a technique known as “striping,” RAID also offers the option of reading or writing to more than one disk at the same time in order to improve performance [10].
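The mirroring behavior described above can be sketched with two dictionaries standing in for disks (a toy illustration of the RAID-1-style idea, not a real RAID implementation):

```python
# Two "disks"; every write goes to both, so the data
# survives the complete failure of either one.
disk_a = {}
disk_b = {}

def mirrored_write(block, data):
    disk_a[block] = data
    disk_b[block] = data  # the mirror copy

def read_with_failover(block):
    # If one disk has failed (its copy is gone), fall back to the mirror.
    if block in disk_a:
        return disk_a[block]
    return disk_b[block]

mirrored_write(0, "payroll records")
disk_a.clear()                  # disk A fails completely
print(read_with_failover(0))    # payroll records — preserved by the mirror
```

Striping is the complementary trick: splitting a block across disks so reads and writes proceed in parallel, trading the mirroring shown here for throughput.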

  6. Fault Tolerance and Reliability

System reliability, in the context of operating systems, can be regarded as a measure of the trustworthiness of the results the system produces, and of its ability to maintain the specific standards set by the user. Operating systems can be grouped according to functionality, and the common user may only know of or interact with the most basic types: servers, desktops, workstations, and handhelds. Other, more complex OSes can be classified as supercomputing, mainframe, real-time, and embedded systems. Comparing OS reliability can be adequate when their functional capabilities are similar and they generally serve the same purpose. However, when the functional capabilities are different, it is very difficult to find a means of comparing the two systems that would produce any meaningful results. Likewise, two different OSes can embody different opinions on what makes a system more reliable, which adds to the chaos. Thus there is no general overall standard for measuring reliability; as one looks in detail at what an OS is trying to accomplish, various categories elevate above others, but the last say on what is important is determined by the user. [10][12][13]

Hardware faults are handled differently by different OSes, but the reliability of catching them is roughly equal across systems, since hardware faults are typically well understood, and efficient hardware fault models exist that are simple to implement and inexpensive. As mentioned earlier in the paper, an OS can simply ignore certain hardware issues and continue to function normally, or, upon sensing the problem, activate a replacement part if one exists. Software faults are where most system outages occur, due to their complexity and ever-evolving software updates. For example, a study conducted by Ganapath showed that 65% of all OS crashes were due to device drivers. Unfortunately, due to differing OS architectures and operational environments, measuring the same software fault across systems is almost impossible unless the user specifies which aspects are most important. [10][12][14]

For example, an independent study, the ITIC 2013 global server hardware and server OS reliability survey, polled IT managers and C-level executives at over 500 companies worldwide from August 2012 through 2013, and the results are displayed in Figure 3.

    Figure 3 Highest Marks in ITIC Reliability Survey [11]

The data in Figure 3 shows that IBM’s AIX server OS was the most reliable in terms of downtime, which seemed to be what the polled users thought was most important. This is understandable, since servers usually need to be online 24/7, and a system whose faults lead to lots of downtime is not very reliable for the user. This is one example of many specific comparisons that can be used to measure reliability across different OSes. Comparing the four most recognizable operating systems, Mac OS, Windows, UNIX, and Linux, in general is nearly impossible, as stated before, because of their specialized purposes, which give them advantages in some areas and hinder them in others; only the user can tell whether these categories are actually beneficial for their purpose. For instance, Windows is known for the infamous blue screen of death, whereas Linux or UNIX hardly ever have these problems, yet the range of different applications they support is vastly underwhelming compared to Windows. Since software faults are a majority cause of system faults, one could argue that if Linux or UNIX supported as many applications as Windows, they would have the same amount of trouble dealing with the vast range of different software. Many comparisons and arguments like this exist between the four OSes, but even this one small example shows how hard it is to generalize the term “reliable”. [9][11][13]

  7. Conclusion

As one can see, the subject of fault-tolerance has had a considerable impact on how computers and technology have developed over the years. Making something capable of anticipating malfunctions and errors while still being able to function is an extremely important goal when designing almost anything, but it is especially critical when dealing with something as complicated as an operating system. As devices get more advanced, and interconnectivity amongst those devices increases, the population at large is going to expect, even demand, that precautions are put in place to ensure they work properly. Advances have been made since the early days of SAPO, but there is always going to be room for improvement: new, more efficient means of implementing extra hardware components to take up the slack, or entirely new architectures to frontload computations in the event of a serious fault. There will likely never be a time when fault-tolerance is not considered in the design of future technologies, because it is extremely practical to plan ahead for issues, especially when dealing with a population at large that doesn’t grasp or need to know the inner workings of a system to make use of it. Without fault-tolerance the slightest issue could result in full-blown failure, and much more time would be wasted on fixing things that could have fixed themselves if a little preparation had occurred. Overall, this paper should show the value and importance that fault-tolerance as a concept has had, not just in the vein of computer science and operating systems, but in the development of almost any practical work.


[1]. Krzyzanowski, Paul. (2009, April). Fault Tolerance: Dealing with an imperfect world [Online]. Available: https://www.cs.rutgers.edu/~pxk/rutgers/notes/content/ft.html

[2]. Kawash, Jalal et al. Fault Tolerance [Online]. Available: cse.unl.edu/~ylu/csce855/notes/Fault.ppt


[3]. Rennels, David. (1998). Fault-Tolerant Computing [Online]. Available: http://www.cs.ucla.edu/~rennels/article98.pdf


[4]. Eles, Petru. Fault Tolerance [Online]. Available: http://www.ida.liu.se/~TDDB37/lecture-notes/lect9-10.frm


[5]. Koren, Israel and Krishna, C. Mani. (2007). Fault Tolerant Systems [Online]. Available: http://www.ecs.umass.edu/ece/koren/FaultTolerantSystems/slides/Part1(Ch.1)-intro.ppt


[6] Sun Microsystems, Inc. (2006) Server Power and Cooling Requirements [Online]. Available:



[7] Dell’Oca, Luca (2013) Howto configure a small redundant iSCSI infrastructure for VMware [Online]. Available: http://www.virtualtothecore.com/en/howto-configure-a-small-redundant-iscsi-infrastructure-for-vmware/

[8] Campbell, Steve (2010) How to Set Up a Small Business Computer Network [Online] Available:



[9] Wikipedia (2014) Fault-tolerant computer system [Online] Available:http://en.wikipedia.org/wiki/Fault-tolerant_computer_system

[10] Webopedia Staff (2014) RAID – redundant array of independent disks [Online] Available:    http://www.webopedia.com/TERM/R/RAID.html

[9]. Is There Life Beyond Windows? Pros, Cons and Costs of the Major Operating Systems [Online]. Available: http://www.nashnetworks.ca/pros-cons-and-costs-of-operating-systems.htm

[10]. Building a Dependable Operating System: Fault Tolerance in MINIX 3 [Online]. Available: http://www.cs.vu.nl/~ast/Theses/herder-thesis.pdf

[11]. IBM, Dell, Fujitsu & Stratus Get Highest Marks in ITIC Reliability Survey [Online]. Available: http://itic-corp.com/blog/2013/02/ibm-dell-fujitsu-stratus-get-highest-marks-in-itic-reliability-survey/

[12]. Operating Systems: The Problems of Performance and Reliability [Online]. Available: http://www.cs.ncl.ac.uk/publications/inproceedings/papers/339.pdf

[13]. OSdata [Online]. Available: http://www.osdata.com/

[14]. Swift, Michael M., Bershad, Brian N., and Levy, Henry M. Improving the Reliability of Commodity Operating Systems [Online]. Available: http://nooks.cs.washington.edu/nooks-tocs.pdf


