Courtleigh Cannick
24 followers
"They call me, MISTER President!"

Check out Troy Hudson's von Karman lecture on the seismic instruments for the InSight mission.
InSight Lecture
astro.computespace.net

Reporting from AP, CNBC, and Reuters confirms multiple statements from BA CEO Alex Cruz that the source of the outage was a “power supply issue”, more specifically a power surge that hit BA’s “flight, baggage, and communications systems. It also rendered the backup system ineffective.”

Does this indicate that the flight, baggage, communications, and backup infrastructure were all fundamentally tied to the same power source? Not necessarily, but the broad failure is unarguably an indicator of a failure in infrastructure architecture or this incident wouldn’t exist.
The caveats are, of course, that (1) Alex Cruz might have a poor understanding of the technical aspects of the event or be misinformed. I am going to assume he isn’t poorly educated on this topic since he has an engineering degree from Central Michigan University.

(2) We don’t know the actual design of BA’s IT architecture and are not likely to get that information. (Any BA whistleblowers out there, feel free to drop us a line.)

Even if all the above-mentioned functions were housed in a single data center (which is doubtful), I am curious about the nature of the power surge and how it was able to do so much damage. In most modern data center architectures, power is channelled to the racks via large-scale PDUs (Power Distribution Units) with integrated Transient Voltage Surge Suppression (TVSS) to prevent power spikes from reaching the racks where the servers are. Upstream from the PDU sits a flywheel UPS, which takes incoming AC, converts it to DC, and then converts it back to AC again. This double-conversion process levels out any spikes in the power signal before it gets to the PDU or the rack. Finally, upstream from the flywheel is a switchgear unit. Think of it as a sophisticated circuit breaker that interrupts or redirects the current when it detects a surge. Statistically, with all three components in place, we should get our infamous five nines (99.999%) of availability.
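
To put a number on what "five nines" allows, here is a minimal sketch of the yearly downtime budget at a given availability level. The function name and the 365-day year are my own assumptions for illustration, not anything taken from BA's setup.

```python
# Minutes in a non-leap year (assumption: 365 days).
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    """Return the downtime allowed per year at the given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, availability in [
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]:
    minutes = downtime_minutes_per_year(availability)
    print(f"{label}: {minutes:.2f} minutes of downtime per year")
```

At five nines the budget works out to a little over five minutes a year, which is the yardstick a multi-day outage gets measured against.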

Generally, for business continuity reasons in a large-scale firm, the various functions would be both virtualized and distributed across multiple data centers or service providers with unrelated power chains for additional insurance.
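
The value of unrelated power chains can be sketched with the textbook redundancy formula. This assumes each site fails independently of the others, which is exactly the property a shared power source would break; the function name and the 99.9% figure are hypothetical.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    Assumes failures are independent (no shared power chain, network, or
    operator error); with a common failure source this estimate is too rosy.
    """
    if sites < 1:
        raise ValueError("need at least one site")
    return 1.0 - (1.0 - site_availability) ** sites

# Hypothetical example: each site alone manages only three nines.
single = 0.999
for n in (1, 2, 3):
    print(f"{n} site(s): {combined_availability(single, n):.9f}")
```

The gain only holds while the failure modes stay unrelated. A surge that reaches both the primary and the backup turns the two sites back into one.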

So I’m curious how a power surge was able to do so much damage.

One thing I can comment on is that every large-scale company has a DR (Disaster Recovery) plan and, if indeed BA has one, it failed pretty spectacularly. If not, we wouldn’t be watching the result on CNN.

The DR plan is supposed to be fully documented, regularly updated, and signed off by the chief IT officers. I’m not saying there is no DR, but whatever occurred seemed to undercut the purpose of the DR in the first place.

The definition of DR is as follows: “Resuming the mission critical business functions from almost the exact point in time that the disaster struck.”* Can’t say that happened.
*Disaster Recovery and Business Continuity by Thejandra BS

Let’s take a look at the potential impact to BA from the point of view of costs DR is meant to prevent.

Loss of Business. This seems certain to follow the event. We will keep track to determine the magnitude of the loss.
Loss of Reputation. Dudes, you’re on CNN!
Loss of Customers. Inevitable, but we don’t know what percentage of customers will have other comparable options in air travel.
Stock Price Volatility. The event occurred going into a holiday weekend so we only have market numbers for the first day, Friday. However, it is notable that the share price fell from a weekly high of $617.50 to $613.45 at close. That is a drop of $4.05 a share which, at a volume of 2.10B shares, comes to a loss in value of roughly $8.5B.
Reduced staff productivity. Your IT staff, your contractors, and even your CEO are running around putting out fires instead of running an airline.
Fines and Penalties. The regulator-enforced compensation BA will be forced to pay is estimated at 61M euros ($68M) by Reuters. This excludes stranded customers’ lodging and meal reimbursement.
Lawsuits. You know and I know they’re coming.
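
The stock figure above is simple arithmetic, sketched here with the numbers as quoted in the post. One assumption to flag: multiplying the per-share drop by 2.10B only approximates lost market value if that figure is treated as the share count, and the variable names are mine.

```python
# Figures as quoted in the post (currency as given there).
weekly_high = 617.50
friday_close = 613.45
share_count = 2.10e9  # assumption: the 2.10B figure is used as a share count

drop_per_share = weekly_high - friday_close
estimated_loss = drop_per_share * share_count

print(f"Drop per share: {drop_per_share:.2f}")
print(f"Estimated loss in value: {estimated_loss / 1e9:.1f}B")
```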

I suspect that Cruz laid the blame for the outage at the feet of IT and an errant power supply to explicitly deny that any cyber-terrorism was involved. That is a kind of bad PR that, right now, is very easy for customers to get their heads around, whether it is the whole truth or not.

To the uninitiated, I say you should note that even if this proves to be the case, this is not like your household surge protector failing and your PC getting turned off. These systems cost hundreds of thousands to millions of dollars to implement and the people hired to implement and maintain them make an average of $150K annually.

If you’re a BA customer you should expect more. If you’re an IAG shareholder you should expect more.

Back in 2013, ComputerWeekly.com’s Angelica Mari did a story on BA’s IT team. At that time they were embarking on a 4-year program to migrate off 20-year-old legacy systems and appeared to be basing the new architecture on x86 systems running RHEL (Red Hat Enterprise Linux). They wanted to implement a Service-Oriented Architecture (SOA).

Well, the 4 years are up. I’d like to know the status of the project, particularly since the Independent suggested that BA’s mix of cutting-edge technology alongside legacy elements and processes had some relation to this week’s crash.

Due to over-marketing, SOA has lost a lot of its luster. I wouldn’t be surprised if it’s being called something else, but the principle is still valid: self-contained systems with well-defined interfaces. The question that comes to mind is this: if it had been in place, would the magnitude of the crash have been lessened?
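
As a rough illustration of that principle, here is a minimal sketch in which each function sits behind the same small interface, so one service failing does not take the others down with it. Every class and function name here is hypothetical and says nothing about how BA's systems are actually built.

```python
from typing import Dict, Iterable, Protocol

class Service(Protocol):
    """The well-defined interface every self-contained service exposes."""
    name: str
    def handle(self, request: str) -> str: ...

class BaggageService:
    name = "baggage"
    def handle(self, request: str) -> str:
        return f"baggage ok: {request}"

class CheckInService:
    name = "check-in"
    def handle(self, request: str) -> str:
        # Simulate this one service losing power.
        raise RuntimeError("power failure")

def dispatch(services: Iterable[Service], request: str) -> Dict[str, str]:
    """Call each service in isolation; a failure is contained to that service."""
    results: Dict[str, str] = {}
    for service in services:
        try:
            results[service.name] = service.handle(request)
        except Exception as exc:
            results[service.name] = f"unavailable: {exc}"
    return results

print(dispatch([BaggageService(), CheckInService()], "flight 117"))
```

The isolation here is only at the software boundary. If both services share one power chain, the interface does nothing to keep the second one alive.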

Mari also noted head of IT Mike Croucher had stated that BA was using AWS for the backend of their social presence. I’m a bit surprised it wasn’t being used in a high-availability context.

Some of these architectural implementations are expensive and few execs ever got fired for keeping expenses down. On the other hand, how much is this costing you now, BA?

We will continue to follow this story for a while. Aviation analysts have suggested that it may take a couple of weeks for the downstream effects of the crash to completely work through the system, which is itself an indicator of architectural inflexibility.

--cdc 5/29/17

Pretty good introductory article.

A simple but nicely done comparison
ComputeSpace
blog.computespace.net

Tomorrow is the 325th anniversary of Newton's Mathematical Principles of Natural Philosophy and I'm STILL waiting for the movie version. Ron Howard and Daniel Day-Lewis, are you listening??

Starting work on v7.0 of the project site today. Unveiling the last week of July.
