System and Software Safety (Week #07)

I had the privilege of attending our first international guest speaker's panel during the mid-semester break. The panellist, David Pumfrey from the University of York, spoke on safety within the domain of systems and software projects, talking extensively about systems and software safety and presenting many interesting examples.

I learnt that safety is an integral and important part of all systems, and that all systems must minimise risks, incorporate acceptable levels of safety, and provide evidence of the safety actually achieved. I also learnt that no system can ever be made completely safe; safety can only ever be approached asymptotically.

Safety is an interesting topic and one we all think about on a daily basis. It is a major human issue, and there is a lot of media coverage relating to safety, especially when human lives are affected. Safety is a big concern to everyone, as no-one wants to buy or use an unsafe product. Let's be honest, a business wouldn't survive if its products were unsafe: unsafe systems and products result in lost business and potential liability. Further, we all want some reassurance when it comes to safety, and we all automatically weigh up the risks of partaking in certain activities, whether it be bungee jumping, paragliding or simply driving a car. Systems and software safety engineering is no different. No-one wants to use a system that is unsafe or hazardous, so systems must be designed with awareness of safety issues and must incorporate safety devices into their designs. Following on from last week's topic on Requirements Engineering, it is during the early conceptual design process of requirements definition that safety issues should be identified and incorporated into the initial design in order to achieve acceptable levels of safety.

David was a polished speaker who was very engaging, or perhaps it was just his English accent that kept our interest and held our attention. He presented various examples of unsafe systems. I was especially intrigued by his example of the Therac-25, a medical machine designed to provide radiation treatment for people with tumours. Since it is one of the most serious software-related accidents to date, I did some further reading, only to discover that the manufacturer, Atomic Energy of Canada Limited (AECL), did not duplicate the hardware safety mechanisms and interlocks of the previous model because it relied on the software to maintain safety. This is a common occurrence, as companies save on expenses by relying more heavily on their software. The problem with the Therac-25 stemmed from software faults that affected the alignment of the machine's turntable, and hence patients received excessively high doses of radiation, which resulted in severe radiation burns and even death. It was interesting to note that AECL's safety analysis only considered hardware failures and not software failures. It's frightening to think that companies do not perform comprehensive quality and safety testing in order to cut costs, even at the expense of human lives.

However, this raises an important issue about the boundary between safety benefits and cost benefits in Systems Engineering. Because every system has a set budget and a set timeframe, an analysis of the cost and time involved in developing safety controls and procedures must be applied to each project individually, to determine and justify how much is spent on meeting the system's safety requirements. Relating back to the Therac-25 example, it took six serious accidents, several of them fatal, before the machine's software was investigated and improvements were made to its design, and even then not all the changes were made, probably again for cost reasons. The Therac-25 accidents show that software with safety-critical functionality must be thoroughly verified and tested, and continuously managed and engineered throughout the development and operational life cycle of the project.
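To make the failure mode a little more concrete, here is a minimal sketch of one of the documented Therac-25 flaws. This is hypothetical code written for this post, not the actual software (which was hand-written PDP-11 assembly): an eight-bit shared variable was incremented, rather than set, on each set-up pass, and the turntable position check only ran while the variable was non-zero, so on every 256th pass it wrapped back to zero and the check was silently skipped.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical illustration of the Therac-25 "Class3" overflow flaw.
 * The shared flag was incremented (not set) on every set-up pass, and
 * the turntable position check only ran while the flag was non-zero,
 * so every 256th pass the 8-bit value wrapped to 0 and the check was
 * silently skipped. */
static uint8_t class3 = 0;

static bool turntable_in_position(void)
{
    return false;  /* pretend the turntable is currently misaligned */
}

static void setup_test_pass(void)
{
    class3++;  /* wraps back to 0 after 255 */

    if (class3 != 0 && !turntable_in_position()) {
        printf("malfunction detected: treatment paused\n");
        return;
    }
    /* when class3 wraps to 0, the position check is never consulted */
    printf("proceeding with beam set-up\n");
}

int main(void)
{
    for (int pass = 1; pass <= 256; pass++) {
        if (pass == 1 || pass == 256) {
            printf("pass %d: ", pass);
            setup_test_pass();
        } else {
            class3++;  /* intermediate passes, output suppressed */
        }
    }
    return 0;
}
```

Running this prints a "malfunction detected" pause on the first pass, but on the 256th pass the beam set-up proceeds even though the turntable is still misaligned, which is essentially the kind of unsafe state the missing hardware interlocks would previously have caught.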

One thing that I have definitely learnt throughout my degree is the importance of validation and verification. Since computers and software are used to monitor and control safety functions and safety-critical functionality within more complex systems, that software must be thoroughly verified and tested. Techniques such as formal proof, as taught in COMP 2600 Formal Methods in Software Engineering, can be used to mathematically verify that a program meets its specification. Also, from personal experience, peer review can be used as a method for testing software, and it is extremely helpful in locating mistakes and unseen errors.
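As a small illustration of the difference between testing and proof, here is a hedged sketch (the function, values and checks are all made up for this post) of a dose-scaling routine with its precondition and postcondition written as runtime assertions. Testing with assertions only catches the violations we actually manage to trigger, whereas a formal proof of the kind covered in COMP 2600 would establish that the postcondition holds for every input satisfying the precondition.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical dose-scaling helper with its contract made explicit.
 * Precondition:  0 <= scale_percent <= 100 and no risk of integer overflow.
 * Postcondition: the scaled dose never exceeds the prescribed dose. */
static int scaled_dose(int prescribed, int scale_percent)
{
    assert(prescribed >= 0 && prescribed <= INT_MAX / 100);
    assert(scale_percent >= 0 && scale_percent <= 100);

    int result = (prescribed * scale_percent) / 100;

    assert(result <= prescribed);  /* postcondition check */
    return result;
}

int main(void)
{
    printf("80%% of 200 units = %d units\n", scaled_dose(200, 80));
    return 0;
}
```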

A very recent example of the exploitation of poor software safety is the Heartbleed Bug. Found on the 1st of April 2014, it is a weakness in the OpenSSL software library, which is used for securely transmitting information over the internet. Through an implementation problem amounting to a missing check of roughly one line of code, hackers were able to extract blocks of up to 64 KB of memory, over and over again, from any system running a vulnerable version of OpenSSL. This exposed passwords, security keys and private information belonging to the general public on a huge scale. A fixed version of OpenSSL containing the missing checks was released on the 7th of April 2014. It's amazing how one bug can affect the entire population of internet users, and it is scary to think that it had been around for over two years before it was even detected. It worries me to think of what other security threats are out there in the networking world, and how many malicious people are trying to exploit unsuspecting internet users. This example illustrates the importance of continual improvement of safety and security testing tools, which is how the bug was eventually discovered, and highlights just how important safe software systems are.
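The flaw itself follows a very common pattern: trusting a length field supplied by the other end of the connection. The sketch below is not the real OpenSSL code, just a stripped-down, hypothetical version of the vulnerable pattern and the one-line style of bounds check that fixed it.

```c
#include <stdio.h>
#include <string.h>

/* Simplified, hypothetical sketch of the Heartbleed pattern. The heartbeat
 * handler trusted the length field claimed by the peer and copied that many
 * bytes from its own memory into the reply, so a small payload with a large
 * claimed length leaked up to 64 KB of adjacent process memory per request. */

/* Vulnerable pattern: copies claimed_len bytes without ever checking it
 * against the number of bytes actually received. */
size_t reply_vulnerable(const unsigned char *payload, size_t actual_len,
                        size_t claimed_len, unsigned char *reply)
{
    (void)actual_len;                 /* never consulted: this is the bug */
    memcpy(reply, payload, claimed_len);
    return claimed_len;
}

/* Patched pattern: the fix was essentially a single bounds check. */
size_t reply_fixed(const unsigned char *payload, size_t actual_len,
                   size_t claimed_len, unsigned char *reply)
{
    if (claimed_len > actual_len)
        return 0;                     /* silently discard malformed request */
    memcpy(reply, payload, claimed_len);
    return claimed_len;
}

int main(void)
{
    unsigned char reply[65536];
    unsigned char payload[4] = "hi";

    /* The peer sends 2 real bytes but claims 64: the fixed handler refuses. */
    size_t n = reply_fixed(payload, 2, 64, reply);
    printf("fixed handler returned %zu bytes\n", n);
    return 0;
}
```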

System and software safety is a fascinating area of systems engineering, and judging from the examples above it is also a very difficult and challenging part of system design. I look forward to attempting to analyse and minimise potential hazards and risks in my future endeavours in software design.

References:

Therac-25

Heartbleed Bug

Ariane 5 Failure

Reducing Risk, Protecting People

The Goal Structuring Notation – A Safety Argument Notation