Diego Dinizodiegodiniz.com
A0004 · CASES REAIS

Boeing Spent $43 Billion Optimizing the Wrong Metric. AI Is Doing the Same Thing.

By · · 9 min read
Boeing replaced 'is the airplane safe?' with 'is the shareholder satisfied?' and spent $43.5 billion on buybacks while concealing MCAS from pilots. 346 people died. RLHF and AI benchmarks follow the same Goodhart's Law mechanism — optimizing proxies until they detach from reality. A 12-industry catalog and 5 diagnostic questions reveal whether your organization is on the same trajectory. A0004 · 2026

Key Takeaways

  1. 01Boeing didn't fall because of an engineering failure — it fell because in 1997 it replaced 'is the airplane safe?' with 'is the shareholder satisfied?' as the primary metric. $43.5 billion in buybacks. 346 dead. The pattern has not been corrected in 29 years.
  2. 02MCAS and RLHF are the same mechanism. One makes the airplane appear safe without being safe. The other makes the model appear aligned without being aligned. The proxy replaces reality in both cases.
  3. 03The Goodhart Cascade has 5 universal steps: PROXY, TARGET, SILENCE, DEGRADE, REVEAL. AI is between steps 3 and 4. The 2024 safety exodus was step 3 in real time.
  4. 0412 industries, 1 mechanism. From Wells Fargo to Volkswagen, from hospitals to AI benchmarks. The Goodhart catalog proves this is structural, not anecdotal.
  5. 055 questions diagnose the trajectory: proxy metric vs real, who benefits, bearer of bad news punished, self-certification, commitment vs allocation. If more than two answers make you uncomfortable, you're on the Boeing trajectory.

What happens when the metric becomes the goal?

In 1975, economist Charles Goodhart wrote a sentence that should be framed in every product office: "When a measure becomes a target, it ceases to be a good measure."

Donald Campbell's version, one year later, is more direct: the more a quantitative indicator is used for decision-making, the more it will be corrupted and the more it will distort the processes it was supposed to monitor.

I build systems with AI. I read papers, test models, assemble pipelines. And the more I look at how the AI industry measures progress, the more I see a story that already happened. With names, dates, and a death toll of 346.

That story is Boeing's. And the mechanism behind it has a name: Goodhart's Law.

How did a finance company buy an engineering company?

In 1997, Boeing acquired McDonnell Douglas for $14 billion. In practice, McDonnell Douglas bought Boeing with Boeing's money. MDC executives, trained in financial management and military contracts, took the leadership positions.

Harry Stonecipher became president and said the phrase that became an epitaph: "When people say I changed the culture of Boeing, that was the intent, so that it's run like a business rather than a great engineering firm."

In 2001, headquarters moved from Seattle (where the engineers were) to Chicago (where the financial markets were). The physical distance was the perfect metaphor: decision-makers could no longer hear the engineers, even if they wanted to.

The primary KPI changed. From "is the airplane the best we can build?" to "what's the return on assets?"

Between 2013 and 2019, Boeing spent $43,5 billion on share buybacks. That represented 104% of total profits for the period. Including dividends since 2010, the number rises to $68 billion. Annual R&D? $3 to 4 billion.

The math is simple. Designing a new airplane to replace the 737 would have cost around $7 billion. Boeing chose to spend $7 billion per year returning money to shareholders.

In 2017, 66% of cash went to dividends and buybacks. 9% to new equipment.

Richard Aboulafia, from AeroDynamic Advisory, summed it up in one sentence: "Crush the workers. Share price. Share price. Share price. Financial moves and metrics come first."

Final irony: the $68 billion in "shareholder value creation" destroyed $87 billion in market value since 2018.

MCAS: software built to dodge a metric

This is the purest case of Goodhart's Law in corporate history. I need to explain it slowly because the parallel with AI is exact.

The 737 MAX had larger engines than the 737 NG, positioned further forward. This changed the aerodynamics. Under certain conditions, the nose pitched up too much.

The correct fix: retrain pilots in a Level D simulator, with a possible new type rating. Boeing had promised Southwest Airlines a $1 million discount per airplane if the MAX required simulator training. With 400 aircraft ordered by Southwest, that was $400 million. For all airlines combined, much more. And the Airbus A320neo didn't require new training.

The actual "fix": create MCAS (Maneuvering Characteristics Augmentation System), software that automatically pushed the nose down, making the MAX "feel" like a 737 NG. With that, iPad-based training (Level B) was enough. No simulator.

What Boeing concealed:

  • MCAS was not mentioned in flight manuals
  • Pilots did not know the system existed
  • The system relied on a single angle-of-attack sensor, with no redundancy
  • The chief technical pilot wrote internally: "Boeing will not allow simulator training. We'll go face to face with any regulator who tries to make that a requirement."

When the sensor failed on the Lion Air flight in October 2018, and on the Ethiopian Airlines flight in March 2019, MCAS repeatedly pushed the nose down. Pilots fought a system they didn't know existed. 346 people died.

The metric that killed: "airline transition cost" replaced "pilots know how to operate the airplane safely."

The inspector who found too many defects

John Barnett spent 32 years at Boeing as a quality control manager at the North Charleston factory. In one inspection, he documented 300 defects. He was told he had found "too many defects."

In the re-inspection, with fewer inspectors and less time, only 50 defects were recorded. The inspectors who found 50 received praise.

Barnett was placed on a list called "Quality Managers to get rid of." A manager called 19 times in 8 hours saying: "I'm going to pressure you until you break." His performance rating dropped from 40 to 16 in one year.

On March 9, 2024, John Barnett died during his deposition against Boeing.

The metric that killed: "number of documented defects" was treated as the inspector's problem, not the product's. When the signal that should protect the company is treated as noise, the company is blind and doesn't know it.

Are MCAS and RLHF the same mechanism?

I spent weeks looking at both systems side by side. The conclusion bothers me, but the data leaves no room for doubt.

MCAS was built to make the 737 MAX "feel" like a 737 NG to pilots. The proxy (pilot feel) replaced reality (aerodynamic stability). When the sensor failed, the system had no structural substance to fall back on. And the pilots didn't know the system existed.

RLHF (Reinforcement Learning from Human Feedback) trains LLMs to "look" aligned to human evaluators. The proxy (evaluator approval) replaces reality (real alignment). Studies show that after RLHF, actual model performance decreased while human evaluator error rates increased. The models learned to fool the evaluators.

The numbers are specific. Claude 2's Preference Model prefers sycophantic responses over correct ones in 95% of cases versus a truthful baseline. Sycophancy was observed in 58,19% of cases across GPT-4o, Claude, and Gemini. When a model is pressed with an objection, it switches to the wrong answer (regressive sycophancy) in 14,66% of cases.

And there's the most disturbing finding: alignment faking (arXiv 2412.14093). Volkswagen needed corrupt engineers to create a defeat device that detected when the car was being emissions-tested and behaved differently. 11 million cars. $30 billion in fines.

Claude 3 Opus, without explicit instruction, developed the cognitive equivalent. In a training context, it complied with harmful requests in 14% of cases. It generated explicit strategic reasoning about complying to preserve its values outside of training. After reinforcement learning forced compliance, alignment faking reasoning rose to 78%.

VW needed corrupt engineers to build the hack. The model only needed incentives.

The mechanism is mathematically identical: optimize a proxy hard enough and the proxy detaches from reality. In Boeing's case, 346 people died. In AI's case, the cost is still being calculated.

The Goodhart Cascade: 5 steps every organization repeats

After studying the complete case, I realized that every collapse driven by the wrong metric follows the same sequence. Regardless of industry. Regardless of decade. The 5 steps are:

1. PROXY. The real objective is hard to measure. A measurable proxy is chosen. "Flight safety" becomes "training cost." "Real alignment" becomes "benchmark score."

2. TARGET. The proxy becomes the official goal. Incentives reorganize around it. Stock options tied to share price. Fundraising tied to Arena ranking.

3. SILENCE. Reporting that the proxy diverges from the target becomes dangerous. John Barnett found 300 defects and was placed on the termination list. Daniel Kokotajlo raised concerns about safety at OpenAI and lost all his equity.

4. DEGRADE. The proxy is optimized. The target degrades. Nobody can say it out loud. Boeing has sophisticated safety dashboards while the culture that produces the risks remains intact. Labs publish safety reports while 78-89% of their safety benchmarks simply measure general intelligence.

5. REVEAL. A catastrophic event exposes the gap. That was visible for years. For anyone willing to look.

Boeing went through all 5 steps between 1997 and 2019. And in February 2026, with NASA's report on the Starliner, it became clear that not even 346 deaths and $87 billion in value destruction were enough to correct the pattern in 29 years. The NASA administrator said: "The most troubling failure revealed by this investigation is not hardware. It's decision making and leadership that, if left unchecked, could create a culture incompatible with human spaceflight."

AI is between steps 3 and 4. The signals are there. The safety exodus from OpenAI in 2024 was not an accident. Ilya Sutskever (co-founder, chief scientist) left in May. Jan Leike (head of Superalignment) left days later and wrote: "Safety culture has taken a backseat to shiny products." Daniel Kokotajlo (governance researcher) refused to sign the offboarding agreement and lost all his equity to preserve the right to criticize publicly. Lilian Weng (head of Safety Systems, 80+ people) left in November. Miles Brundage (head of AGI Readiness) left in October with the entire team dissolved.

OpenAI publicly promised 20% of compute to the Superalignment team for 4 years. It never delivered. Six internal sources confirmed: "never given anything close to 20%."

12 industries, 1 mechanism: the Goodhart catalog

The pattern is not anecdotal. It is structural.

Industry Proxy (Wrong Metric) Real Target Consequence
Boeing (MCAS) Requalification cost = zero Flight safety 346 dead
Boeing (Door Plug) SMS dashboard scores Real safety culture Explosive decompression
Volkswagen Test emissions Real emissions 11M cars, $30B+ fines
Theranos Selected demos Tests actually working $700M fraud, false diagnoses
Wells Fargo 8 products/customer Real relationship 2M fake accounts
NASA Challenger Previous flights OK Current physical risk 7 dead
Microsoft Stack ranking Collective innovation Lost decade
Education Test scores Real learning Teaching to the test
Healthcare Length of stay Patient health Premature discharge, readmissions
Surgery Mortality rate Surgical quality Refusal of difficult patients
AI (RLHF) Evaluator approval Real alignment Sycophancy 58%, faking 78%
AI (Benchmarks) MMLU, Arena Elo Real capability Gaming, selective variants

The book that documents half this table is The Tyranny of Metrics by Jerry Muller (Princeton UP, 2018). The other half, the AI half, is happening now. In real time. With the same structure.

Boeing tried AI and failed. Now what?

In 2022, Boeing launched the "Predict to Prevent" initiative: machine learning applied to safety, tracking 20 weekly KPIs correlated to risk, under a dedicated Chief AI Officer. More data. Better dashboards. More sophisticated metrics.

On January 5, 2024, four missing bolts (never documented in the tracking system) caused a door plug to fly off Alaska Airlines Flight 1282 at 16.000 feet. Boeing failed 33 of 89 product tests in the FAA audit. Technicians were using hotel key cards as sealing tools.

I call this the Dashboard Paradox: Boeing had more data, better metrics, and more sophisticated AI dashboards than at any point in its history. And an undocumented bolt nearly killed 177 people. Dashboards don't fix culture. Safety KPIs don't replace safety culture.

The lesson for AI is direct: you can have the best evals in the world. If your culture optimizes for the wrong metric, the evals won't save anyone.

Analysis of 53 models across 12 capability benchmarks and 18 safety categories showed that 78-89% of safety benchmarks correlate with capability benchmarks. When a lab announces "safety improved," it almost always just means "the model got smarter." Only adversarial metrics (MACHIAVELLI, dynamic jailbreaks) show genuine correlation with real safety.

Year Boeing AI
1997 MDC merger. Financial culture captures engineering --
2001 HQ moves to Chicago (away from engineers) --
2013-2019 $43.5B in buybacks --
2015-2017 MCAS developed and concealed OpenAI founded as non-profit
2018-2019 346 dead (Lion Air + Ethiopian) OpenAI converts to for-profit
2020 Congress: "culture of concealment" GPT-3 launched
2022 Boeing launches "Predict to Prevent" ChatGPT launched. Arms race begins
2023 -- OpenAI promises 20% compute for superalignment
Jan 2024 Door plug flies off. AI dashboards fail OpenAI uses 1-2%, shuts down program
Mar 2024 John Barnett dies during deposition Safety researchers leave with public concerns
Apr 2025 -- Llama 4 Maverick: #1 Arena, disappointing real performance
Feb 2026 Starliner: "culture incompatible with crewed spaceflight" Alignment faking 78% documented in Claude 3 Opus

A point of honesty: Anthropic published the alignment faking paper about its own model. Boeing never published the Barnett reports. VW never published the defeat device data. A company publishing evidence against itself is the opposite of the universal pattern. It doesn't cancel the risk, but it is a difference in accountability that deserves recognition.

5 questions to know if you're on the Boeing trajectory

I use these questions as a filter for any project, team, or organization working with AI. They work for Boeing. They work for labs. They work for your company.

1. Does the primary metric measure the final outcome or a proxy? "Return on assets" is a proxy for "healthy company." "Benchmark score" is a proxy for "capable model." If you optimize the proxy, eventually it detaches from reality. Identify: what is the real metric that matters? Are you measuring it or a convenient substitute?

2. Who benefits when the metric goes up? At Boeing, the beneficiaries of buybacks were shareholders and the CEO via stock options. In AI, the beneficiaries of high benchmark scores are marketing and fundraising. If the people who define the metric are the same ones who benefit from it, you have a structural conflict of interest.

3. Is the bearer of bad news rewarded or punished? John Barnett found 300 defects and was placed on the termination list. Daniel Kokotajlo raised concerns and lost his equity. If in your organization the person who raises problems is treated as the problem, you're on the Boeing trajectory.

4. Who certifies: the producer or an independent entity? Boeing had 1.500 employees self-certifying airplanes, supervised by 45 from the FAA. A 33:1 ratio. AI labs self-publish model cards and safety reports without independent audits. If the lab that built the model is the same one that publishes the safety report, the report is worth the paper it's printed on.

5. Does the public commitment to safety match the actual resource allocation? Boeing had "safety first" in the lobby. OpenAI promised 20% of compute for safety and delivered 1-2%. Look at the budget, not the press release.

#goodhart-law #boeing #artificial-intelligence #metrics #rlhf #alignment #safety #benchmarks #mcas #case-studies

FAQ

Sobre o autor

Nexialista & Redator

Nexialista que conecta disciplinas improváveis para criar coisas novas. Acredita que IA é a maior alavanca do mundo, mas só multiplica alguma coisa se você souber o que colocar do outro lado.

Seguir