AI-Generated Throwaway Tests: Understanding Meta's Just-in-Time Catching Test Generation
Introduction
Recently I had the chance to speak with an ML engineer from Meta Platforms about how they implemented ideas from a newly published research paper inside their engineering workflow. That conversation sparked my curiosity.
One thing that became clear to me is that reading white papers can be an incredibly efficient way to understand emerging trends in software engineering.
With that in mind, I spent part of my weekend reading the paper "Just-in-Time Catching Test Generation."
The core idea is simple but powerful: instead of writing tests only to prevent regressions, what if we could automatically generate tests that actively try to break new code changes before they reach production?
The Shift-Left Context
Many engineering teams today follow the shift-left testing philosophy, where testing happens earlier in the development lifecycle.
This often means:
- Writing strong unit tests
- Running CI checks on pull requests
- Catching bugs before deployment
While this has improved software reliability, traditional testing still relies heavily on developers anticipating possible failures.
But developers cannot predict every edge case.
This is where the concept of catching tests comes in.
Hardening Tests vs Catching Tests
Most automated test generation research focuses on hardening tests.
Hardening tests:
- Pass when created
- Remain in the codebase
- Prevent future regressions
Example: A developer writes a unit test to ensure a function behaves correctly. These tests help maintain system stability.
But the paper proposes something fundamentally different: catching tests.
Catching tests are:
- Generated automatically
- Designed to fail if a bug exists
- Temporary
- Used specifically during code review
Their goal is bug discovery, not regression protection.
The Setup Inside Meta
At Meta, proposed code changes are called diffs.
Engineers submit these diffs to the internal CI system. A risk-scoring model evaluates each diff to determine whether it might introduce problems.
High-risk diffs are then analyzed overnight by the catching test generation system.
The key research question becomes: Can we automatically generate tests that find severe bugs before they reach production, without slowing engineers down?
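The gating step described above can be sketched in a few lines. Everything here is invented for illustration (the threshold, the features, and the scoring heuristic); Meta's actual risk model is a learned system, not a hand-written rule:

```python
# Hypothetical sketch of the nightly gating loop: a risk model scores
# each diff, and only high-risk diffs are handed to the catching test
# generator. Threshold and scoring logic are made up for illustration.

RISK_THRESHOLD = 0.7

def score_risk(diff: dict) -> float:
    # Stand-in for a learned risk model: larger diffs and changes that
    # touch core paths both raise the score in this toy version.
    score = min(diff["lines_changed"] / 500, 0.5)
    if diff["touches_core"]:
        score += 0.4
    return min(score, 1.0)

def select_for_nightly_run(diffs):
    """Keep only the diffs risky enough to justify overnight analysis."""
    return [d for d in diffs if score_risk(d) >= RISK_THRESHOLD]

diffs = [
    {"id": 1, "lines_changed": 20, "touches_core": False},
    {"id": 2, "lines_changed": 300, "touches_core": True},
]
print([d["id"] for d in select_for_nightly_run(diffs)])  # [2]
```

The design point is that generation is expensive, so spending it only on the riskiest diffs keeps the workflow affordable.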
Weak Catches vs Strong Catches
The paper introduces two important concepts.
A weak catch is a generated test that:
- Fails on the new code change
- Passes on the original version
This suggests the change might have introduced a bug.
A strong catch is a weak catch that turns out to be a real bug after investigation.
The challenge is distinguishing real bugs from false positives.
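The weak-catch definition above can be expressed as a tiny harness. This is a minimal sketch, assuming a generated test is simply a callable that raises `AssertionError` on failure; the function names and the floating-point bug are illustrative, not from the paper:

```python
# A weak catch: the generated test fails on the new revision but passes
# on the base revision, suggesting the change altered behavior.

def is_weak_catch(test, base_impl, new_impl):
    def passes(impl):
        try:
            test(impl)
            return True
        except AssertionError:
            return False
    return passes(base_impl) and not passes(new_impl)

# Hypothetical diff: the rounding step was accidentally dropped.
def base_total(prices):
    return round(sum(prices), 2)

def new_total(prices):
    return sum(prices)  # buggy: 0.1 + 0.2 != 0.3 in floating point

def generated_test(total):
    assert total([0.1, 0.2]) == 0.3

print(is_weak_catch(generated_test, base_total, new_total))  # True
```

Promoting a weak catch to a strong catch is then a human (or assessor-assisted) judgment: was this behavioral difference intentional or a bug?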
Approaches to Generating Catching Tests
The researchers evaluated several methods.
Baseline Approaches (Not Diff-Aware)
These methods generate tests without considering the code change itself. The results were modest:
- Coincidental catches: only 0.2% of tests identified issues
- LLM-based generation: around 2%
- Mutation-guided testing: roughly 0.8%
Diff-Aware Approaches
The more interesting results came from techniques that analyze the code diff directly.
Dodgy Diff
This approach treats the new diff as if it were a mutated version of the original code and generates tests to differentiate them.
This method achieved:
- 2.5% weak catch rate
- Bug signals in 4% of diffs
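Under the hood, this resembles classic differential testing: search for an input on which the two revisions disagree, then pin the base behavior in a test. A minimal sketch, where the candidate inputs (which an LLM or search procedure would propose in practice) and the discount bug are invented:

```python
# "Dodgy diff": treat the new revision as a mutant of the base and look
# for an input that tells the two apart.

def find_differentiating_input(base_fn, new_fn, candidates):
    """Return the first input where the two revisions disagree, if any."""
    for args in candidates:
        if base_fn(*args) != new_fn(*args):
            return args
    return None

def base_discount(price, rate):
    return price * (1 - rate)

def new_discount(price, rate):
    return price * (1 - rate / 100)  # buggy diff: rate mistakenly rescaled

candidates = [(100, 0), (100, 0.1), (50, 0.5)]
print(find_differentiating_input(base_discount, new_discount, candidates))
# (100, 0.1)
```

A differentiating input becomes the seed of a catching test: assert the base behavior, and the test fails on the new revision.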
Intent-Aware Generation
This was the most sophisticated technique.
The workflow looks like this:
- An LLM analyzes the diff
- It infers the intent of the change
- It predicts ways the implementation could fail
- It generates mutants representing those risks
- Tests are generated to catch those mutants
This achieved:
- 6.4% weak catch rate
- Bug signals in 7.9% of diffs
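The five steps above can be sketched as a data-flow skeleton. Every function body here is a stand-in for an LLM call (the paper's actual prompts and models are not reproduced in this post), so treat this as structure, not implementation:

```python
# Schematic of the intent-aware pipeline: diff -> intent -> predicted
# failure modes -> mutants -> catching tests. All outputs are canned.

def infer_intent(diff):
    # Stand-in for an LLM summarizing what the change is trying to do.
    return "clamp negative balances to zero"

def predict_failure_modes(intent):
    # Stand-in for an LLM enumerating plausible ways it could go wrong.
    return ["boundary value 0 handled wrong", "positive balances altered"]

def make_mutants(diff, failure_modes):
    # Each predicted risk becomes a mutant of the new code.
    return [f"mutant for: {m}" for m in failure_modes]

def generate_tests(mutants):
    # Tests are generated to distinguish each mutant from the real diff.
    return [f"test targeting {m}" for m in mutants]

diff = "def balance(x): return max(x, 0)"
risks = predict_failure_modes(infer_intent(diff))
tests = generate_tests(make_mutants(diff, risks))
print(len(tests))  # 2
```

The key difference from plain mutation testing is that the mutants are not random: they encode the specific ways this particular change could plausibly fail.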
Overall, diff-aware approaches produced:
- 4x more catches than traditional approaches
- 20x more than coincidental catches
Handling False Positives
One of the biggest challenges with automated bug detection is noise.
To address this, the system uses three assessors.
Rule-Based Filtering
Patterns that strongly indicate false positives are detected automatically.
Examples include:
- Broken mocks
- Reflection-based test failures
- Tests that try to call private methods directly
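A rule-based filter like this can be sketched as simple pattern matching over failure logs. The patterns below mirror the examples above, but the exact strings and log format are invented; Meta's real rules are presumably far richer:

```python
# Minimal sketch of rule-based false-positive filtering: known-noisy
# failure signatures are screened out before anything reaches a human.

FALSE_POSITIVE_PATTERNS = [
    "MockNotConfiguredError",        # broken mocks
    "ReflectionException",           # reflection-based test failures
    "cannot access private method",  # tests poking at private methods
]

def looks_like_false_positive(failure_log: str) -> bool:
    return any(p in failure_log for p in FALSE_POSITIVE_PATTERNS)

print(looks_like_false_positive("MockNotConfiguredError: stub missing"))   # True
print(looks_like_false_positive("AssertionError: expected 90.0, got 99.9"))  # False
```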
LLM Probability Scoring
A large language model evaluates whether the failure looks like a real bug, and the probability the model assigns to its own answer is used as a confidence signal.
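Turning a model's answer into a confidence score can be sketched like this, assuming the model exposes token log-probabilities (as most completion APIs do). The specific numbers are made up:

```python
# Convert an LLM's yes/no verdict plus its log-probability into a
# bug-confidence signal. The answer and logprob values are invented.

import math

def bug_confidence(answer: str, logprob: float) -> float:
    """Probability mass behind 'this is a real bug'."""
    p = math.exp(logprob)  # logprob -> probability of the emitted answer
    return p if answer == "yes" else 1 - p

print(round(bug_confidence("yes", -0.105), 2))  # 0.9
```

A confident "yes" yields a high score; a confident "no" yields a low one, and uncertain answers land in between.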
Ensemble Model Voting
Multiple models analyze the same case and classify it as:
- High likelihood bug
- Medium likelihood
- Low likelihood
The median result becomes the final score.
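The voting step above can be sketched by mapping each model's label to a rank and taking the median, which is robust to a single outlier model. The labels and rank values are my own framing of the three categories listed above:

```python
# Ensemble voting sketch: each model labels the failure, labels are
# ranked, and the median rank becomes the final verdict.

from statistics import median

LABEL_TO_SCORE = {"low": 0, "medium": 1, "high": 2}
SCORE_TO_LABEL = {v: k for k, v in LABEL_TO_SCORE.items()}

def ensemble_verdict(labels):
    """Median of the ranked labels (use an odd number of voters)."""
    scores = [LABEL_TO_SCORE[label] for label in labels]
    return SCORE_TO_LABEL[median(scores)]

print(ensemble_verdict(["high", "medium", "high"]))  # high
print(ensemble_verdict(["low", "medium", "high"]))   # medium
```

With an odd number of voters the median is always one of the three labels, so one overconfident or overcautious model cannot swing the result on its own.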
Together these approaches reduced manual review workload by 70%.
Did It Actually Catch Real Bugs?
Yes.
Engineers were contacted 41 times when the system detected strong signals.
Instead of presenting complex test code, engineers were simply asked: "Was this behavioral change intentional?"
Results:
- 8 confirmed real bugs
- 4 severe production failures prevented
- 4 additional code fixes or abandoned changes
Interestingly, about 50% of the confirmed bugs were severe, far higher than the usual 5-20% rate seen in bug databases.
An Unexpected Bonus
While generating catching tests, the system also produced thousands of passing tests.
Nearly 8,000 hardening tests were generated as a side effect.
This means catching workflows can simultaneously strengthen regression test coverage.
The Bigger Picture
This research suggests a new direction for automated testing.
Instead of maintaining only static test suites, CI systems may increasingly generate dynamic tests tailored to each code change.
AI would act as a temporary bug hunter, probing the change for weaknesses before it lands in production.
Final Thoughts
The results from this study are early but promising.
The system:
- Scales to extremely large codebases
- Detects real production-level bugs
- Reduces developer friction
- Integrates directly into the CI workflow
For me, the most interesting takeaway is that AI-generated throwaway tests could become a natural extension of shift-left testing.
Rather than replacing developer tests, they complement them by exploring scenarios developers might never think of.
It will be fascinating to see how this approach evolves across the industry.
