The price of the single point of failure

Mark Lyndersay -
Mark Lyndersay -

After TSTT's network experienced widespread failures on August 9, the company issued general apologies about the issue, promising prompt restorative action.

It wasn't until Communications Workers Union secretary general Clyde Elder accused the company of neglect and poor maintenance that CEO Lisa Agard responded with details on the reasons for the hours-long outage.

"A GE breaker at Nelson Exchange designed to function uninterruptedly for 30 years malfunctioned when there was an electrical surge," Agard told the TT Guardian.

"This equipment is only eight years old. This led to a series of events which ultimately led to important telecommunications equipment not being supported by power."

Reading directly from that statement, questions arise about the reference to a breaker, which isolates equipment from an electrical surge, and its implied role in impacting support systems which should have supplied redundant stored power, normally from an industrial UPS.

A request to discuss the technical details of the incident was ignored by TSTT, so what's left is an admission of reliance on a single, probably expensive piece of equipment, which represented a point of failure that appears to have had no clear redundancies.

The problems that arise from a single point of failure are not unique to technology.

The concept was most dramatically illustrated by the story of Achilles, whose mother, according to mythology, dipped him in the River Styx, granting him immortality save for his heel, where she'd held him.

If there is any moral to be taken from this fanciful sidebar on the Trojan War, it is that the part of any system that is weakest and least replaceable will eventually suffer a catastrophic failure.

In February 2022, while the country was firmly under pandemic lockdown, a 21-metre-tall Palmiste tree, its pole rotted through with fungus infestation, came crashing down in Grant Trace, Rousillac. The falling tree severed a single-phase TTEC 12 KV distribution line then hit the 220KV line circuit which transfers most of the power from Trinidad Generation Unlimited (TGU) to TTEC.

TGU generates one third of the country's electricity and the sudden loss of that much capacity caused a cascading collapse of the grid shutting off electricity for more than twelve hours for most of Trinidad.

Six months later– again in Rousillac – a landslip caused the partial collapse of a 220 KV transmission tower, leading to load shedding and an outage for 30 per cent of the country, most in the south of Trinidad.

These are not mechanical failures, they are design problems, architectures with limited redundancy that made critical systems more fragile than they needed to be.

Ultimate redundancy is two of everything, but that's both needlessly and prohibitively expensive.

Design thinking in networks, whether they carry electricity or data, is a measured consideration of the eventual failure of components and how their function can be replaced or rerouted with minimal impact on the end user.

It's an idea that has wider relevance to society. Any company or government hierarchy that doesn't have a proper succession plan is also courting the problems that result from a single point of failure.

The contemplation of redundancy and efficiency in networks is a challenge even in small systems.

During a complete revamp of my desktop workstation, its peripherals and connections, a topological map of the system, freed from the tangle of wires and boxes, revealed a systemic problem.

I'd replaced both my workstation and laptop systems, both upgraded to 100base-T ports along the way, but the systems were connected using a 10base-T ethernet rated switch and cabling.

That simple, inexpensive change doubled throughput speeds, particularly useful since I move large files back and forth across that hardwired connection.

It would be a mistake to think about a single point of failure as being only technology related.

It's a weakness of systems designed by humans, not a cruel whim of fate. A fragility in the heel of presumptions of robustness or invulnerability that are inadequately tested and believed to be more secure and protected than they actually are.

Finding and assessing these weaknesses is even more critical when continuous availability and access are the most characteristic expected of a system.

It's why skydivers have a reserve parachute, but with every jump they hear the wind roaring past their ears and see the ground growing closer.

Their enthusiasm to design redundancy is commensurately more urgent in their planning before stepping through the door of an aircraft.

Mark Lyndersay is the editor of technewstt.com. An expanded version of this column can be found there.

Comments

"The price of the single point of failure"

More in this section