Ep 72: Mayday! Mayday!

Today, in the first of what will probably be a semi-regular series, ‘How is [career] like testing’, we cover Air Crash Investigations.

Investigating what has caused an accident/bug

  • First action?
    • Find the black boxes! (logs/monitoring)
    • Collect all of the evidence (effects of the bug, clues in data that led to it, eye witness accounts)
    • Come up with hypotheses from the data and attempt to disprove them
    • Simulate the problem/Steps to reproduce
    • Sometimes it’s not possible to identify the exact problem, only to give a summary of findings and suggest possible improvements
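
For testers, the "come up with hypotheses and attempt to disprove them" step can be sketched in code. This is only a minimal illustration, assuming a made-up list of log events and hypotheses written as simple checks against that evidence; every name here (Event, Hypothesis, the "db" and "api" sources) is hypothetical and not taken from any real tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since the start of the recording
    source: str       # which component logged the event (hypothetical names)
    message: str


@dataclass(frozen=True)
class Hypothesis:
    name: str
    # Returns True when the evidence contradicts this hypothesis.
    contradicted_by: Callable[[List[Event]], bool]


def first_time(events: List[Event], source: str) -> Optional[float]:
    """Earliest timestamp logged by the given source, or None if it never logged."""
    times = [e.timestamp for e in events if e.source == source]
    return min(times) if times else None


def api_first_contradicted(events: List[Event]) -> bool:
    db, api = first_time(events, "db"), first_time(events, "api")
    # Contradicted if the database complained before the API did.
    return db is not None and api is not None and db < api


def db_first_contradicted(events: List[Event]) -> bool:
    db, api = first_time(events, "db"), first_time(events, "api")
    # Contradicted if the database never complained, or the API complained earlier.
    return db is None or (api is not None and api < db)


def surviving_hypotheses(hypotheses: List[Hypothesis], evidence: List[Event]) -> List[str]:
    """Keep only the hypotheses that the evidence has failed to disprove."""
    return [h.name for h in hypotheses if not h.contradicted_by(evidence)]


# Invented evidence, standing in for the "black box" logs.
evidence = [
    Event(10.0, "db", "connection pool exhausted"),
    Event(10.5, "api", "request timed out"),
]

hypotheses = [
    Hypothesis("API failed first", api_first_contradicted),
    Hypothesis("database failed first", db_first_contradicted),
]

print(surviving_hypotheses(hypotheses, evidence))
```

A hypothesis that survives is not proven, only not yet disproved, which matches the point above that sometimes all you can give is a summary of findings.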

Implement new procedures

  • Providing evidence and explanation for new procedures
  • A tester/investigator only provides recommendations; they do not implement them

Post-incident investigation and action

  • Mentality
    • Focus on facts
    • Thinking outside of the box, outside of the immediately obvious
    • Gathering as much information as possible – even if it appears irrelevant – then assessing with all the information
    • Coming up with hypotheses and attempting to disprove them
    • Being unafraid of uncovering the truth, despite politics

  • Needing to work fast
    • Evidence may disappear over time
    • A need to quickly discover if there are any fatal design flaws that could affect other similar planes
  • Always considering human factors
    • Considering the mental state of those involved
      • What were they thinking?
      • Were they stressed or under pressure? Did they have other motivations which drove their decisions?
      • What was their experience?
    • Psychology, sociology
      • Seniority affecting juniors?
      • Popularity (“no one questions this person”)
      • Cultural or language barriers
        • Tenerife accident – “OK” was ambiguous: it could mean either acknowledging a message or giving clearance.
        • Korean Air Cargo Flight 8509 – communication breakdown because junior officers wouldn’t take control from the captain.

Not like testing?

  • It’s only investigating incidents, not really being directly involved in the process of building
  • We have to explore what hasn’t happened yet and try to create the problems, as well as triage and investigate them.

Some anecdotes…

Test environment not the same as production

Atlantic Southeast Airlines Flight 2311

  • Propellers got stuck in a position that caused drag
  • Was tested for this scenario, but only on the ground in a lab
  • Accident happened because it behaved differently at altitude, in the air
  • Investigator (tester) felt sure this was the case, but had to prove it with a live test that shocked the engine suppliers

Checklists – how difficult it is to write them

1999 South Dakota Learjet crash

  • Checklist for depressurisation warning started with diagnosing the problem
  • By the time the crew got to the check about donning their oxygen masks, they had already lost their ability to think
  • Checklist should start with donning masks no matter what
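
The same ordering lesson applies to runbooks and test checklists: put the actions that must not wait ahead of any diagnosis. Below is a rough sketch, assuming a hypothetical Step type with an "immediate" flag; the steps are illustrative only and are not the real aircraft checklist.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    description: str
    immediate: bool = False  # True for actions that must not wait for diagnosis


def order_checklist(steps: List[Step]) -> List[Step]:
    """Move immediate actions to the front, keeping relative order otherwise.

    sorted() is stable, and False sorts before True, so keying on
    `not immediate` puts immediate steps first without reshuffling the rest.
    """
    return sorted(steps, key=lambda s: not s.immediate)


# Illustrative draft that starts with diagnosis, as in the anecdote above.
draft = [
    Step("Diagnose the cause of the warning"),
    Step("Don oxygen masks", immediate=True),
    Step("Continue with remaining checks"),
]

ordered = order_checklist(draft)
for number, step in enumerate(ordered, start=1):
    print(number, step.description)
```

Marking a step as immediate is still a human judgment; the code only makes sure that, once the judgment is made, the step cannot end up buried behind diagnosis.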

The interaction of humans with automation – people tend to assume software is smarter than it really is

Asiana Airlines Flight 214

  • Plane crashed due to landing short of the runway
  • Caused by the pilots inputting the wrong setting into the autopilot
  • Pilots were trained to let the automation assist them
  • This broke down when they gave the automation the wrong command and they didn’t understand its behaviour
  • Lesson here on abstracting important procedures and relying on automation
  • Automation cannot think or second-guess, but people can believe it’s more intelligent than it really is
