What will most likely happen when companies replace managers with AI

May 29, 2026

I went looking for management training data inside LLMs. What I found - or didn't find - changed how I think about the "replace middle management with AI" conversation.

The training data problem

The published training mixes tell the same story: some domains, especially code, biomedical text, legal text, math, and academic literature, appear as explicit sources or categories. Management does not.

Code gets dedicated, curated sourcing. Medicine gets dedicated sourcing. Law gets dedicated sourcing. Management does not appear as a dedicated category in any of them.

Here are the numbers from published model papers and dataset manifests:

[Figure: training data allocation across domains. Ribbons are illustrative and not drawn in proportion to the percentages.]

Code has GitHub scrapes, The Stack (3.1TB of permissively licensed source code across 30 languages in its first release, later expanded to 358 languages), and StarCoder's additional 35B Python tokens. Llama 3 allocates 17% of its 15 trillion training tokens to code, roughly 2.5 trillion tokens. Code also has purpose-built benchmarks: HumanEval, MBPP, and others.

Medicine gets PubMed Central at 14.4% of The Pile (EleutherAI's widely-used open training dataset), plus PubMed Abstracts at another 3.1%. Combined biomedical content: roughly 17.5% of that corpus.

Law gets FreeLaw (US case law) at 6.1% of The Pile, plus USPTO patent backgrounds at 3.7%.

Management gets nothing comparable. No major disclosed pretraining mix contains a dedicated management corpus. There is no explicit allocation, and no canonical, broadly adopted benchmark comparable to HumanEval or MBPP for code. Whatever management knowledge these models have comes from whatever happened to survive Common Crawl web filtering - mixed in with marketing blogs, e-commerce copy, and generic business content. I found no published study that attempts to quantify the share.
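The comparison above can be tallied in a few lines. This is a minimal sketch that uses only the percentages quoted in this section; the dictionary layout, domain labels, and function names are illustrative assumptions, not part of any dataset manifest.

```python
# Shares quoted in this essay, as percent of The Pile.
# The grouping into domains and all names here are illustrative assumptions.
PILE_SHARES = {
    "biomedical": {"PubMed Central": 14.4, "PubMed Abstracts": 3.1},
    "law": {"FreeLaw": 6.1, "USPTO Backgrounds": 3.7},
    "management": {},  # no dedicated source appears in the disclosed mix
}

# Llama 3 figures as quoted in this essay.
LLAMA3_TOTAL_TOKENS = 15e12
LLAMA3_CODE_PERCENT = 17


def domain_share(domain: str) -> float:
    """Sum the dedicated percentage shares listed for one domain."""
    sources = PILE_SHARES.get(domain, {})
    return round(sum(sources.values(), 0.0), 1)


def tokens_for_share(total_tokens: float, percent: float) -> float:
    """Convert a percentage allocation into an absolute token count."""
    return total_tokens * percent / 100


if __name__ == "__main__":
    for domain in PILE_SHARES:
        print(f"{domain}: {domain_share(domain)}% of The Pile (dedicated sources)")
    code_tokens = tokens_for_share(LLAMA3_TOTAL_TOKENS, LLAMA3_CODE_PERCENT)
    print(f"Llama 3 code: about {code_tokens / 1e12:.2f} trillion tokens")
```

A zero for management here means only that no dedicated source is listed. It does not mean the corpus contains no management text, since that material may still sit inside the general web and book categories.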

LLMs almost certainly have seen management writing. Management advice, HR policies, leadership books, business blogs, and career guidance all appear somewhere inside broad web and book corpora.

But "management" is not visible as a first-class training category in the major disclosed mixes. Code has GitHub, The Stack, HumanEval, MBPP, and dedicated code models. Medicine has PubMed and PubMed Central. Law has FreeLaw and patent data. Math and reasoning are now explicitly balanced in models like Llama 3. Management, by contrast, is buried inside undifferentiated web, books, and forum content.

That means the evidence supports a narrower claim: current LLMs may contain management knowledge, but there is no public evidence that frontier models were deliberately trained, balanced, and benchmarked for the judgment, contextual awareness, and interpersonal functions that define what managers actually do.

What management actually is

Google ran the cleanest natural experiment on this question. Around 2001, founders Larry Page and Sergey Brin eliminated all engineering manager roles, believing engineers were best left to their own devices. The experiment lasted a few months. Page and Brin were immediately flooded with requests about expense reports, interpersonal conflicts, and project prioritization. Employees complained about the lack of support and guidance.

Google reinstated managers. Then, in 2008, the company launched Project Oxygen to understand why managers mattered. The team collected over 10,000 observations across 100+ variables. The finding that surprised even Google's VP of People Operations: technical expertise ranked below behaviors such as coaching, communication, empowerment, concern for team members, and career development. What ranked highest were relational behaviors such as coaching and concern for employee well-being.

Those are the capabilities least visible in disclosed pretraining-source taxonomies and hardest to evaluate with current AI benchmarks.

[Figure: Project Oxygen findings]

The "flat org" graveyard

Google is not the only company that tried to remove the intermediary management layer. The pattern is remarkably consistent:

Zappos adopted holacracy - a system with no managers or job titles - starting in 2013. In 2015, when employees were offered a buyout as the alternative to working under the new system, 14% of the workforce (roughly 210 people) accepted. Employees reported confusion about who was in charge, what they were supposed to do, and how compensation worked. By 2023, the company had significantly modified the system, quietly reinstating management functions under different names.

Valve, the gaming company, is famous for its "no managers" culture. Former employees tell a different story. Jeri Ellsworth, a former engineer, described "popular kids that have acquired power in the company." Glassdoor reviews describe needing to belong to powerful in-groups to survive. Academic research (Foss and Klein, 2022, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4289193) concluded that when formal hierarchy was abandoned at Valve, it was replaced by informal hierarchy - invisible, unaccountable, and operating through social dominance rather than institutional role.

Medium adopted holacracy in 2013 and abandoned it in 2016, stating it was "getting in the way of the work." The specific problem: coordinating across functions became time-consuming and divisive at scale.

GitHub operated with a flat structure until 2016. During that period, a prominent developer reported harassment by a co-founder and his wife - a case that raised questions about flat structures and accountability, particularly regarding the absence of management layers between executives and individual contributors. GitHub reinstated departmental leadership and credited it with enabling better coordination and faster product releases.

Buffer gave up its flat structure, with co-founder Leo Widrich writing a post titled "What we got wrong" (https://buffer.com/resources/self-management-hierarchy/). Treehouse returned to managers, calling the no-boss attempt "naive."

In every case, the stated rationale for removing managers was the same: efficiency, empowerment, liberation from bureaucracy. In every case, the actual outcome was some combination of coordination failure, invisible power structures, loss of employee protection, and eventual reinstatement of the management layer under different vocabulary.

A pattern from 1945

In 1945, the French political philosopher Bertrand de Jouvenel published "On Power" (Du Pouvoir, https://dn790003.ca.archive.org/0/items/onpoweritsnature00injouv/onpoweritsnature00injouv.pdf), analyzing how centralized power grows across history. His core observation: the sovereign expands its authority by destroying intermediary bodies - feudal lords, guilds, churches, professional orders, local assemblies - that stand between the central authority and the individual.

The sovereign's rhetoric is always liberation. We are freeing you from these petty tyrants, these inefficient middlemen, these bureaucratic gatekeepers.

The actual result is the opposite. Once the intermediary layer is removed, the individual stands exposed to centralized power with no buffer, no advocate, no local protector who understands the specific conditions on the ground. The intermediary bodies were messy and sometimes corrupt - but they were the only structures with both the proximity to understand local reality and the standing to push back against the center.

[Figure: centralized power and intermediary bodies]

The structural parallel to "replace middle management with AI" is hard to ignore.

The executive layer (the sovereign) proposes to eliminate middle management (intermediary powers) using AI systems (the mechanism of removal), promising efficiency and empowerment for individual contributors (the people). But the AI system reports to whoever configured it - which is the executive layer. It optimizes for metrics the executive layer defines. It has no loyalty to the team, no career stake in the department's survival, no standing to push back on unreasonable demands from above.

Middle managers do four things an AI system configured from above does not. They translate between layers - converting executive strategy into actionable local decisions, and converting ground-level reality back into language the leadership can act on. They absorb ambiguity, making judgment calls in the gap between policy and reality. They protect - shielding teams from pressure from above, negotiating deadlines, advocating for resources. They hold contextual knowledge - knowing that the Q3 deadline is flexible because the client's timeline slipped, or that the new policy will trigger quiet departures.

None of these functions are merely information-processing problems solvable by pattern-matching over web-scraped text. They are relational, contextual, and accountable.

What the "flat org" experiments actually prove

Every "successful" flat organization turns out, on inspection, to have intermediary powers under different names. W.L. Gore caps plant size at roughly 200 people and has "leaders" who emerge organically. Morning Star has a formal bilateral commitment system between workers. Buurtzorg caps teams at 10-12 nurses with a coach. None of them eliminated the intermediary layer. They redesigned it.

The failed experiments are more instructive. In every case, removing formal hierarchy did not eliminate hierarchy. It made hierarchy invisible and unaccountable. Valve's "popular kids." Zappos's circles that were "still arranged hierarchically." GitHub's harassment controversy that raised questions about accountability in the absence of management layers.

De Jouvenel predicted this in 1945: power does not disappear when you abolish its formal structures. It becomes informal, opaque, and harder to challenge.

What this means for the "AI-replaces-managers" thesis

To be clear about what the evidence actually supports, this article rests on three distinct claims with different levels of support.

The first is a data claim: management is not isolated as a first-class category in any major disclosed pretraining mix. This is the strongest claim. It is empirically verifiable against published model papers and dataset manifests. Nobody has published a counterexample. It does not mean LLMs know nothing about management. It means no one has demonstrated that frontier models were deliberately trained, balanced, and benchmarked for it the way they were for code, medicine, or law.

The second is a capability claim: AI today remains weak at the full scope of what managers do. This is plausible but less well-evidenced. It follows from the training data gap and from the nature of management work (relational, contextual, accountable), but direct studies measuring AI performance on realistic management tasks - navigating a reorg, retaining a flight-risk employee, mediating a team conflict - are scarce. Google's Project Oxygen tells us what management is. It does not tell us, on its own, that AI cannot do it. This claim should be held as a reasonable inference, not a proven finding.

The third is a political-organizational claim: replacing the intermediary layer concentrates power at the center. This is an analytical framework drawn from de Jouvenel and applied to the flat org evidence, not a causal law discovered in it. The historical cases are real. The pattern is consistent. But the interpretation - that the same dynamic will repeat with AI-mediated management - is a projection, not a certainty. It should be weighed as the most plausible reading of the available evidence, not as a guarantee.

Beyond those three claims, there is an emerging empirical signal worth watching.

Bad code crashes visibly. Bad management erodes culture, increases turnover, and degrades decision quality over months or years - exactly the kind of slow damage that is hard to attribute causally. By the time the consequences are obvious, the institutional knowledge held by the displaced managers is gone and cannot be rebuilt quickly.

A 2025 study suggests this concern is not just theoretical. Dong et al. (Max Planck Institute for Human Development / Utrecht / Toulouse School of Economics, arxiv.org/abs/2505.21752) placed 382 workers under human, AI, or hybrid management in a customized Minecraft workplace - a platform chosen because it allows real-time behavioral tracking, repeated evaluation cycles with actual contingent pay, and enough task autonomy to approximate real working conditions. The authors note that prior experimental studies of AI management produced contradictory findings, partly because hypothetical scenarios and one-shot tasks fail to capture the psychological dynamics of ongoing AI supervision.

Their results: the AI manager, trained on human-defined evaluation principles, systematically assigned lower performance ratings and reduced wages by 40% - without any adverse effect on worker motivation or perceived fairness. Workers didn't protest because the emotional response that normally constrains exploitative practices was muted under AI evaluation. The system looked fine by every observable metric while extracting significantly more from workers. The authors' term for this is "silent exploitation": the very features that make AI appear impartial may suppress the social reactions that normally constrain extractive management practices.

Minecraft is not a corporate workplace and the tasks were production-oriented, not the complex interpersonal and strategic work that defines middle management. N=382 is respectable but not large. And the AI manager was applying human-defined criteria - it was the emotional layer that was missing, not the evaluation framework. Still, as a controlled demonstration that removing the human intermediary can enable extraction that workers don't resist, the finding is directly relevant.

Separately, a randomized trial (Cui & Yasseri, 2025, Trinity College Dublin, arxiv.org/abs/2502.17730) found that workplace gender biases carry over intact to AI managers: AI managers presented as female faced the same penalties as human female managers. The assumption that AI management would at least neutralize bias does not appear to hold.

Will AI change how management works? Yes, it will. The question is whether replacing the intermediary layer entirely - removing the humans who translate, absorb ambiguity, protect, and hold contextual knowledge - is wise when the tools being proposed for the replacement were never designed for that purpose, and when many prior attempts to remove that layer have produced recurring problems: coordination failure, shadow hierarchy, and eventual reintroduction of intermediary functions.

The historical evidence, the training data, and the organizational experiments all point in the same direction. The intermediary layer exists because there is a problem that does not go away when you remove the people solving it.