The Rise of AI Arbitrage in the Academic Commons

13 Jun, 2026

If you look at the recent headlines coming out of major scientific hubs, you would be forgiven for thinking that academic publishing has devolved into a high-stakes espionage thriller.

In March 2026, the Program Chairs of the International Conference on Machine Learning (ICML) announced that they had run a massive sting operation on their own peer reviewers. They embedded invisible, randomized "prompt-injection" watermarks inside submitted PDFs. If a reviewer fed a paper to a commercial Large Language Model (LLM) to write their peer review in violation of the conference policy, the LLM unwittingly followed the hidden instructions, regurgitating specific, otherwise unlikely phrases into the text of the review. The trap snapped shut on 506 unique reviewers, resulting in the desk rejection of 497 papers.
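
The mechanics of such a check can be sketched in a few lines of Python. This is a minimal illustration, not ICML's actual implementation: the phrase pool, the per-paper seeding scheme, and the helper names `assign_watermark` and `review_is_flagged` are all assumptions made for the example.

```python
import hashlib
import random

# Hypothetical pool of unlikely phrases (invented for this sketch).
PHRASE_POOL = [
    "a kaleidoscopic tapestry of evidence",
    "the lighthouse of methodological rigor",
    "an orchard of carefully pruned baselines",
    "a cartographer's view of the loss landscape",
]

def assign_watermark(paper_id: str, secret: str, k: int = 2) -> list[str]:
    """Pick k phrases for a paper, deterministically from a secret seed."""
    digest = hashlib.sha256(f"{secret}:{paper_id}".encode()).hexdigest()
    rng = random.Random(digest)  # seeded, so organizers can recompute it later
    return rng.sample(PHRASE_POOL, k)

def review_is_flagged(review_text: str, phrases: list[str]) -> bool:
    """Flag a review only if every hidden phrase appears in its text."""
    text = review_text.lower()
    return all(p.lower() in text for p in phrases)
```

Requiring every assigned phrase to appear, rather than just one, is a design choice in this sketch: it trades a few missed detections for a lower chance of flagging a reviewer who happened to use an unusual phrase on their own.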

Two months later, in May 2026, the popular preprint server arXiv clarified penalties under its Code of Conduct. If a submission contains "incontrovertible evidence" of unchecked LLM generation, such as hallucinated references or leftover conversational meta-comments (like "Here is a 200-word summary; would you like me to make any changes?"), the authors face the academic equivalent of a red card: a one-year ban from the platform, and a requirement that any future submissions first be accepted by a reputable, peer-reviewed journal before they can be posted.

And in June 2026, a group of 16 mathematicians issued the Leiden Declaration on Artificial Intelligence and Mathematics, warning that automated mathematics could erode the values the field depends on: proof, attribution, transparency, autonomy, and human responsibility.

As I thought about how to write about these developments, three natural angles emerged. The tempting move would be to choose one and make it the whole story. But that would miss the point. Each angle highlights a different failure mode: one in the metric system of academia, one in the technical structure of mathematics, and one in the shared institutions that make science possible.

The issue is not that every use of AI is illegitimate. Translation help, grammar correction, code assistance, literature search, and proof exploration can all be useful and entirely defensible when they are disclosed and checked. The issue is AI arbitrage: using an undisclosed tool to exploit a system that still prices papers, reviews, and polished prose as if their production costs had not collapsed.

Before getting to the three angles, it helps to notice that this pattern is not new. Every time a machine makes a previously expensive cultural artifact cheap, the same drama returns.

This has happened before

Photography is the cleanest historical analogy. In the 19th century, many artists and critics refused to treat photography as art because it looked too mechanical. JSTOR Daily summarizes the early reception bluntly: through much of that century, photography was not merely second-class art; it was an outcast. Critics described it as a thoughtless mechanism for replication, lacking the refined feeling of genius. The camera made likeness cheap, and that threatened portrait painters whose profession had priced faithful representation as skilled labor.

The first response was resistance. The deeper response was relocation. Painting did not die. Instead, once photography took over cheap resemblance, painting moved toward what the camera did not cheaply provide: mood, color, subjective perception, abstraction, impression, expression. Cornell's exhibit on photography and Impressionism makes exactly this point: as accurate imitation became the domain of photography, painters increasingly explored how painting could go beyond holding a mirror up to nature.

That is the pattern: when a machine captures one signal, the human field moves toward a harder signal.

A second example is music. After synchronized sound entered theaters in the late 1920s, live theater musicians faced a direct labor shock. In 1930, the American Federation of Musicians formed the Music Defense League and ran ads against the "Robot of Canned Music." The Smithsonian account is almost too perfect for the present moment: the ads pleaded for "Living Music" against a cold, unseen machine. The union's later recording bans in the 1940s were not just aesthetic complaints. They were labor politics. Musicians were recording performances that could then be replayed without rehiring them.

This maps closely to academia. Researchers write the papers, reviews, code, and formal libraries that become training material. The commons is built by human labor, then reused by systems that may cheapen parts of that same labor.

A third example is the Arts and Crafts movement. The V&A describes it as a response to machine-dominated production and the dehumanizing division of labor. Its advocates elevated handmade process, material honesty, and the connection between maker and object. That sounds noble, and often it was. But it also shows how authenticity becomes performative once machines make outputs cheap. Suddenly, the process matters publicly. The handmade mark becomes a signal.

Photography developed its own version of this. Pictorialist photographers used soft focus, special printing processes, and manual intervention to make the artist's hand visible. The Met's history of Pictorialism describes the late-19th-century controversy between those who wanted photography to remain unaltered and those who believed manual intervention was necessary to make the artist's role clear. Some of these practices created beautiful work. Some were also legitimacy rituals: proof that a person, not just a machine, had been there.

The lesson is not that machine assistance makes art fake. Duchamp's readymades complicate that easy story. A manufactured object could become art through selection, framing, title, context, and responsibility. Auto-Tune complicates it too. Cher's "Believe" turned pitch correction into an instrument; T-Pain later made it a signature sound. The issue was not whether the tool touched the output. The issue was whether there was still a meaningful artistic act behind it.

AI brings the same historical pattern into academia, but with a sharper edge. The cheapened artifact is not a portrait, a recording, or a corrected vocal take. It is the visible surface of intellectual effort itself. That is why the three angles matter.

Angle 1: Trust Arbitrage and the "Alpha Decay" of Prose

The first angle is financial game theory. In quantitative finance, arbitrage means exploiting a price discrepancy between two markets. The academic version is almost embarrassingly simple: buy cheap fluency, sell expensive credibility.

Before LLMs, polished academic writing was not a perfect signal of insight, but it was at least expensive. A clear introduction, a competent literature review, a coherent rebuttal, and a balanced peer review all required time, domain familiarity, and some level of care. The academy therefore treated fluency as a rough proxy for effort. Not as proof of quality, but as a useful clue.

Generative AI broke that proxy. It collapsed the cost of producing the visible surface of scholarship while leaving the reward system mostly unchanged. That is the arbitrage.

The trade has three parts:

A proxy metric: polished prose, long related-work sections, confident reviewer language, neat summaries, smooth rebuttals.
A hidden production function: an LLM produces the proxy while the author or reviewer appears to have done the work unaided.
Delayed settlement: the institution only discovers much later, if ever, whether the polished artifact was backed by genuine understanding.

This creates two obvious strategies.

Author arbitrage: use LLMs to produce professional-sounding introductions, literature reviews, abstracts, and response letters. Thin research can be wrapped in fluent packaging. The author captures the upside: more submissions, more polished papers, more apparent productivity.
Reviewer arbitrage: feed a confidential PDF into a model and obtain a structured review in seconds. This is more corrosive, because the reviewer is not merely producing text. They are entrusted with confidential ideas, expert judgment, and the authority to shape another researcher's career.

The ICML watermarking sting is reviewer arbitrage caught in the act. The reviewers wanted the status of expert referees, but offloaded the act of expert reading to a machine.

The important point is that this is not stable. In finance, private edges decay once enough traders discover them. The same happens here. If one author uses AI to polish weak prose, they gain an advantage. If everyone uses AI to polish everything, polished prose becomes worthless as a signal. We get paper inflation: more text, more submissions, more confident summaries, and not necessarily more insight.
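
The decay argument above can be made concrete with Bayes' rule. The sketch below is a toy model, not an empirical estimate: the base rate of insightful work, the two pre-AI polish probabilities, and the assumption that any AI-assisted paper comes out polished are all invented for illustration.

```python
def posterior_insight_given_polished(adoption, base_rate=0.3,
                                     polish_if_insight=0.8, polish_if_thin=0.2):
    """P(insight | polished) when a fraction `adoption` of authors use AI.

    Assumption: AI users always produce polished prose; everyone else is
    polished with the pre-AI probabilities given as keyword arguments.
    """
    p_polished_insight = adoption + (1 - adoption) * polish_if_insight
    p_polished_thin = adoption + (1 - adoption) * polish_if_thin
    numerator = base_rate * p_polished_insight
    return numerator / (numerator + (1 - base_rate) * p_polished_thin)

for adoption in (0.0, 0.25, 0.5, 1.0):
    p = posterior_insight_given_polished(adoption)
    print(f"adoption={adoption:.2f}  P(insight | polished)={p:.2f}")
```

With these made-up numbers, polished prose lifts the probability of insight from the 0.30 base rate to about 0.63 when nobody uses AI. At full adoption the posterior falls back to 0.30: the reader learns nothing from polish.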

That is where the recent statistics matter. When computer science abstracts and introductions show AI modification rates above 20%, fluent academic style is no longer scarce. It is abundant. Once abundance arrives, committees, reviewers, and readers need a new pricing mechanism. ICML's hidden watermarks and arXiv's one-year slop bans are attempts to restore price discovery in a market where the old signal has been flooded.

This is the strongest version of the first angle: AI arbitrage is not mainly about cheating at writing. It is about the collapse of a trust-based pricing system for intellectual labor.

Angle 2: The Verifier Arbitrage: Mathematics as Safe Haven and Training Ground

The second angle is structural and disciplinary. AI does not hit all fields equally, because not all fields use text in the same way.

In August 2025, a landmark study published in Nature Human Behaviour by Stanford researchers (Liang et al.) used a statistical word-frequency model to estimate the prevalence of LLM-modified sentences across more than 1.1 million papers. The estimates should not be overread: they measure detectable text modification, not the total amount of AI-assisted thinking. Still, the disciplinary breakdown is revealing:

Estimated LLM-Modified Content by Field (September 2024)
──────────────────────────────────────────────────────────────────
Computer Science (arXiv)                [██████████████████] 22.5%
Electrical Engineering & Systems        [████████████████]   18.0%
Statistics (arXiv)                      [██████████]         12.9%
Biomedical Science (bioRxiv)            [████████]           10.3%
Physics (arXiv)                         [████████]            9.8%
Nature Portfolio Journals               [███████]             8.9%
Mathematics (arXiv)                     [██████]              7.8%
──────────────────────────────────────────────────────────────────
Source: Liang et al., Nat Hum Behav (2025)

Mathematics is at the bottom, at 7.8%. Statistics, its nearby cousin, is at 12.9%. That gap is interesting.

The reason is not that mathematicians are morally purer or more technologically conservative. The reason is that mathematics has a different failure mode. In most empirical fields, prose helps build a plausible narrative around methods, experiments, tables, and interpretation. In mathematics, prose matters, but the proof is the object. A proof is not "approximately persuasive." A proof is either correct or it is not.

That makes mathematics relatively resistant to prose arbitrage. An LLM can make a bad introduction sound better. It can make a literature review smoother. It can even suggest proof sketches. But a single hallucinated implication can destroy the entire argument. In mathematics, the cost of a false step is not a small reputational discount. It is failure.

This is why mathematics looks like a safe haven. But it is only a partial safe haven. The same property that protects mathematics from cheap prose arbitrage makes it uniquely valuable to AI companies.

Here is where Lean 4 enters the story.

Lean 4 is an open-source programming language and interactive theorem prover. A proof written in Lean is not merely a paragraph of explanation. It is an executable mathematical object checked by a small trusted kernel. The system returns a brutally simple verdict: the proof checks, or it does not.
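
As a small illustration of what that verdict looks like, here is a Lean 4 snippet with two statements the kernel accepts and one it would reject. The theorem names are made up for this example; `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- Accepted: both sides evaluate to the same number, so `rfl` closes the goal.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- Accepted: commutativity of addition, proved by citing a core-library lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- Rejected if uncommented: Lean reports an error rather than accept a false claim.
-- theorem two_plus_two_is_five : 2 + 2 = 5 := rfl
```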

For AI training, that verdict is gold. Most domains lack clean feedback. If a model writes an economic forecast, a legal memo, or a philosophical argument, there is no instant objective oracle that says "correct" or "incorrect." Human judgment is slow, expensive, and contestable. But with Lean 4, mathematical reasoning becomes a game with a verifier:

The model proposes a formal statement or proof step.
Lean checks it.
The model receives an error message or a verified proof.
Successful proof paths become training data.
The next model becomes better at searching the proof space.
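
The five steps above can be sketched as a toy loop in Python. This is a deliberately simplified stand-in, not how AlphaProof, Aristotle, or any real system is implemented: the "proof checker" is a multiplication check, the "model" is a table of sampling weights, and every name and parameter here is hypothetical.

```python
import random

def verifier(candidate, target):
    """Stand-in for the proof checker: a strict pass/fail verdict."""
    a, b = candidate
    return a * b == target

def training_loop(target=91, rounds=2000, seed=0):
    rng = random.Random(seed)
    candidates = [(a, b) for a in range(2, 12) for b in range(a, 15)]
    weights = {c: 1.0 for c in candidates}  # the toy "model"
    verified = []                           # successful paths = training data
    for _ in range(rounds):
        # Step 1: the model proposes a candidate, sampled by current weights.
        proposal = rng.choices(
            candidates, weights=[weights[c] for c in candidates]
        )[0]
        # Steps 2-3: the checker returns pass or fail, nothing in between.
        if verifier(proposal, target):
            # Step 4: verified successes are kept as training data.
            verified.append(proposal)
            # Step 5: the next proposal is more likely to succeed.
            weights[proposal] += 1.0
    return verified, weights

verified, weights = training_loop()
print(len(verified), "verified proposals out of 2000 rounds")
```

The point of the sketch is the shape of the feedback: the model never needs a human to grade its output, because the checker's verdict is the reward.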

This is the verifier arbitrage. The public mathematical commons provides the language, libraries, examples, and correctness oracle. Private AI systems convert that commons into proprietary reasoning capability.

Google DeepMind's AlphaProof illustrates the loop. DeepMind describes AlphaProof as training itself to prove mathematical statements in the formal language Lean. It combined a language model with AlphaZero-style reinforcement learning, generated proof candidates, checked them in Lean, and used verified proofs to reinforce future behavior. At the 2024 International Mathematical Olympiad, the combined AlphaProof and AlphaGeometry system solved four of six problems, scored 28 out of 42 points, and reached the top end of the silver-medal category.

Harmonic AI's Aristotle pushes the same logic further. Its paper describes Lean 4 as providing a "reliable and fast reward signal" for mathematical reasoning. Aristotle's system combines informal reasoning, lemma generation, formalization, Lean proof search, and machine verification. It reports gold-medal-equivalent performance on the 2025 IMO by producing correct formal solutions to five of six problems.

The irony is sharp: mathematics is being harvested because it is rigorous. The field's insistence on exactness, formal proof, attribution, and public verification makes it harder for individual researchers to fake their way through. But it also makes mathematics an ideal training environment for commercial AI systems. The output of centuries of public mathematical labor, now formalized in libraries like Mathlib, becomes a ladder for private model capabilities.

This is why the Leiden Declaration matters. It is not simply a complaint that mathematicians dislike new tools. It is a warning about asymmetry. Mathematical culture treats proof as public, inspectable, attributable, and humanly meaningful: a vehicle for explanation, judgment, and community understanding. Commercial AI can treat proof as a training signal, a benchmark, and a route toward general-purpose reasoning products. Those are not the same value system.

Angle 3: The Commons Failure and the Drift Toward Counter-Espionage

The third angle is institutional and human. It treats the crisis not primarily as a story of individual misconduct, but as a prisoner's dilemma inside an already overstretched system.

The traditional academic ecosystem was built as a high-trust, mutual honor system. Reviewers read papers in good faith. Authors wrote papers in good faith. Editors trusted that both sides were overworked but basically sincere. That honor system was under strain long before ChatGPT arrived:

The publish-or-perish treadmill: researchers face immense pressure to produce papers, citations, rebuttals, grants, and visibility.
The unpaid review burden: peer review is a massive, mostly uncompensated service layered on top of teaching, advising, administration, and research.
The language bottleneck: English fluency remains a gatekeeper even when the underlying science is not about English at all.

In that world, generative AI becomes a rational private response. If everyone else uses AI to polish papers and accelerate reviews while you refuse, you may be slower, less polished, and less competitive. If you use it quietly, you gain time. If everyone uses it quietly, the commons deteriorates.

That is the tragedy. The individually rational action is not collectively stable.
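
The structure of that dilemma can be written down as a small payoff table. The numbers below are hypothetical and chosen only to reproduce the ordering described above; they are not measurements of anything.

```python
# Hypothetical payoffs (higher is better) for one researcher,
# indexed by (my choice, what everyone else does).
PAYOFF = {
    ("abstain", "abstain"): 3,  # slower, but the trust commons stays intact
    ("use_ai", "abstain"): 4,   # private time savings, commons still intact
    ("abstain", "use_ai"): 1,   # slower and less polished than everyone else
    ("use_ai", "use_ai"): 2,    # everyone is polished, so polish signals nothing
}
CHOICES = ("abstain", "use_ai")

def best_response(others):
    """The choice that maximizes my payoff given what the others do."""
    return max(CHOICES, key=lambda mine: PAYOFF[(mine, others)])

def is_dominant(choice):
    """True if `choice` is the best response whatever the others do."""
    return all(best_response(others) == choice for others in CHOICES)

print(is_dominant("use_ai"))  # quiet AI use wins in every case
print(PAYOFF[("use_ai", "use_ai")] < PAYOFF[("abstain", "abstain")])
```

Both lines print `True`: using AI quietly is the dominant strategy, yet the outcome where everyone does so is worse for each researcher than the outcome where nobody does.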

There are really two commons being depleted:

Reviewer attention: every low-effort or AI-inflated submission consumes scarce expert time.
Epistemic trust: every undisclosed AI-generated review or unchecked citation weakens confidence that the system is being operated by accountable humans.

This is why the ICML incident matters even if the detected reviews were only about 1% of submitted reviews. The numerical share is not the whole story. The scandal is that the conference had to run a covert watermarking operation to distinguish expert judgment from delegated machine output. Once a system reaches that point, the trust layer has already cracked.

arXiv's one-year ban policy comes from the same pressure. Moderators cannot manually audit the intellectual integrity of every submission. So they look for bright-line evidence: hallucinated references, leftover LLM instructions, meta-comments that should never have survived a human reading. The policy is less about detecting all AI use and more about saying: if you leave incontrovertible evidence that nobody checked the machine's output, the platform can no longer trust anything in the paper.
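
A bright-line check of this kind is simple to sketch. The patterns below are illustrative guesses at what leftover chatbot text looks like, not arXiv's actual criteria, and `find_leftover_comments` is a hypothetical helper name.

```python
import re

# Illustrative patterns only (assumptions made for this sketch).
LEFTOVER_PATTERNS = [
    r"would you like me to\b",
    r"\bhere is a \d+-word summary\b",
    r"\bas an ai language model\b",
    r"\bi hope this helps\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in LEFTOVER_PATTERNS]

def find_leftover_comments(text: str) -> list[str]:
    """Return matched snippets that look like unedited chatbot output."""
    hits = []
    for pattern in _COMPILED:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits

sample = "Here is a 200-word summary; would you like me to make any changes?"
print(find_leftover_comments(sample))
```

A scan like this catches only the careless cases, which matches the policy's logic: it does not detect AI use as such, only evidence that no human read the output before submitting it.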

The danger is that the defensive response becomes its own arms race. Authors learn to scrub AI fingerprints. Reviewers learn to paraphrase model output. Conferences add hidden prompts. Platforms add penalties. Researchers add more automation. Every round makes the system more adversarial and less collegial.

This is the institutional version of AI arbitrage: if the workload problem is not fixed, enforcement has to become increasingly elaborate. But elaborate enforcement is expensive, brittle, and culturally corrosive. It turns a scholarly community into a border checkpoint.

That does not mean there should be no rules. Quite the opposite. But the rules have to make the arbitrage unprofitable rather than merely making the detection game more theatrical. Disclosure norms, reviewer workload reductions, reproducibility requirements, citation verification, and stronger standards for computational artifacts all attack the incentive problem more directly than a permanent escalation of traps.

One-second takeaway

The wrong response is to pretend that AI can be kept out of scholarship by moral appeal or detection theater. The better response is to make undisclosed arbitrage less valuable. That means shifting academic evaluation away from the cheap artifacts of thought (fluency, formatting, volume, and speed) and toward the expensive parts that still matter: validation, verification, attribution, reproducibility, and accountable human judgment.

If we keep rewarding polished surfaces, AI will flood the market with polished surfaces. If we reward checkable work, disclosed methods, and verified claims, the arbitrage becomes much harder to exploit.

Links: On Violations of LLM Review Policies (ICML Blog) | ArXiv Slop Ban (The Verge) | Leiden Declaration on AI and Mathematics (Leiden Declaration) | Quantifying LLM Usage in Papers (Nature Human Behaviour) | AlphaProof and AlphaGeometry (Google DeepMind) | Aristotle: IMO-level Automated Theorem Proving (arXiv) | When Photography Wasn't Art (JSTOR Daily) | Musicians Wage War Against Evil Robots (Smithsonian) | International Pictorialism (Met Museum) | Arts and Crafts (V&A)

#academia #ai #game-theory #mathematics #peer-review