Design of Fault-Tolerant Systems (ID2218, PhD F2B5472)
Fault tolerance is the ability of a system to continue performing its
intended function despite faults. In a broad sense, fault tolerance
is associated with reliability, successful operation, and
the absence of breakdowns.
The goal of fault tolerance is the development of a dependable
system. As society relies more and more on computer systems, the
dependability of these systems becomes a critical issue. In
airplanes, chemical plants or heart pacemakers, a system failure can
cost lives or cause an environmental disaster.
There are various approaches to achieving fault tolerance. Common to all
of them is a certain amount of redundancy. This can be a replicated
hardware component, an additional check bit attached to a string of
digital data, or a few lines of program code verifying the correctness
of the program's results.
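The three kinds of redundancy mentioned above can be illustrated with a minimal Python sketch. It is not taken from the course material: the function names are hypothetical, the check bit is assumed to be a single even-parity bit, and the replicated component is assumed to be triplicated with a majority voter.

```python
def add_parity_bit(bits):
    """Information redundancy: append an even-parity check bit to a list of 0/1 values."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True if the word (data bits plus check bit) contains an even number of 1s."""
    return sum(word) % 2 == 0


def majority_vote(a, b, c):
    """Hardware redundancy: return the value agreed on by at least two of three replicas."""
    # If a matches either other replica, a is the majority value;
    # otherwise b and c are the only possible agreeing pair.
    return a if a == b or a == c else b


def checked_sort(values):
    """Software redundancy: a few extra lines verify the result before it is returned."""
    result = sorted(values)
    if any(x > y for x, y in zip(result, result[1:])):
        raise RuntimeError("sort result failed the acceptance check")
    return result


word = add_parity_bit([1, 0, 1, 1])
corrupted = list(word)
corrupted[2] ^= 1  # simulate a single-bit fault
print(parity_ok(word), parity_ok(corrupted))
print(majority_vote(1, 0, 1))
```

A single parity bit only detects an odd number of flipped bits and cannot locate or correct them, which is why stronger codes are needed when correction is required.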
The aims of this course are:
- to create understanding of the fundamental concepts of fault-tolerance
- to learn basic techniques for achieving fault-tolerance in electronic,
communication and software systems
- to develop skills in modeling and evaluating
fault-tolerant architectures in terms of reliability, availability and safety
- to gain knowledge in sources of faults and means for their prevention
- to understand merits and limitations of fault-tolerant design
The following is a tentative list of topics to be covered:
- Definition of fault tolerance
- Applications of fault-tolerance
- Fundamentals of dependability
  - Attributes: reliability, availability, safety
  - Impairments: faults, errors and failures
  - Means: fault prevention, removal and forecasting
- Dependability evaluation
  - Common measures: failure rate, mean time to failure, mean time to repair, etc.
  - Reliability block diagrams
  - Markov processes
- Hardware redundancy
  - Redundancy schemes
  - Evaluation and comparison
- Information redundancy
  - Codes: linear, Hamming, cyclic, unordered, arithmetic, etc.
  - Encoding and decoding techniques
- Time redundancy
- Software fault tolerance
  - Specific features
  - Software fault tolerance techniques: N-version programming, recovery blocks, self-checking software, etc.
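As a rough illustration of the dependability evaluation and hardware redundancy topics, the sketch below computes a few common measures in Python. It is an assumption-laden example rather than course material: it assumes a constant failure rate (exponential failure law), independent component failures, and a perfect voter for triple modular redundancy. All function names are hypothetical.

```python
import math


def reliability(failure_rate, t):
    """R(t) = exp(-lambda * t), valid under a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def mttf(failure_rate):
    """Mean time to failure for a constant failure rate: 1 / lambda."""
    return 1.0 / failure_rate


def steady_state_availability(mttf_value, mttr_value):
    """Long-run fraction of time the system is operational."""
    return mttf_value / (mttf_value + mttr_value)


def series(reliabilities):
    """Series reliability block diagram: every block must work."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """Parallel reliability block diagram: at least one block must work."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def tmr(r):
    """Triple modular redundancy with a perfect voter: at least 2 of 3 modules work."""
    return 3 * r**2 - 2 * r**3


r = reliability(failure_rate=1e-4, t=1000.0)  # one module after 1000 hours
print(r, tmr(r))
print(series([0.9, 0.9]), parallel([0.9, 0.9]))
```

Under these assumptions, triplication raises a module reliability of 0.9 to 0.972, but it only helps while the module reliability is above 0.5; below that point the triplicated system is less reliable than a single module.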
The evaluation will be based on seven homework assignments (20%, grade A-F), a
midterm exam (20%, grade A-F) and a final exam (60%, grade A-F). For PhD students, an
additional task will be to read and present a paper approved by the
instructor (20 min talk).
The following lecture handouts
contain the material covered in the course.
Five assignments for the
course (each becomes available as its deadline approaches). Numbers refer to problems in the textbook.
The midterm exam will take place on Monday, April 24th, 13:15-14:00 in room F304 (same as lecture).
You don't need to register for it.
An example of last year's midterm exam.
The final exam will take place on Wednesday, June 1st, 8-12 in room 303.
Don't forget to register!
An example of exam with answers.
More examples without answers: exam 1
Basic understanding of circuits and digital logic.
- Course notes
- E. Dubrova, "Fault-Tolerant Design", Springer, 2013, ISBN 978-1-4614-2112-2
School of Information and Communication Technology
Royal Institute of Technology (KTH)