A few weeks ago, I wrote about why ML beats hardcoded rules for warranty fraud detection. The short version: rule-based systems are static, fraudsters aren’t, and ML models can learn the difference between a legitimate claim and a scam without someone writing every rule by hand. Good. But that article left a question on the table: once you’ve decided to build the ML model, who tunes it? Because that part—the feature engineering, the architecture choices, the hyperparameter sweeps, the retraining cycles—is where most fraud analytics projects quietly stall. What if an LLM could do that tuning on its own, overnight, while the team sleeps?
The Tuning Problem Nobody Talks About
Here’s how fraud model development usually goes. A data science team spends two weeks cleaning and labelling claim data. They build a baseline XGBoost model. It works OK—maybe an AUC-ROC of 0.81, decent precision, not great recall. Now the real work starts: tuning.
Should you add graph features from the dealer network? Switch to LightGBM? Try an autoencoder for anomaly pre-scoring? Upsample the minority class or use SMOTE? Change the classification threshold? Add time-decay to older claims? Every one of these is a valid idea. Every one takes a data scientist a day or two to implement, test, and evaluate. And they interact—adding graph features might help with LightGBM but not XGBoost. Threshold tuning that works before SMOTE doesn’t work after it.
A realistic team might test 3–4 serious model variants per week. Most aftermarket fraud teams I’ve seen take 6–10 weeks to go from baseline to production-ready model. That’s not because the people are slow. It’s because the search space is enormous and each experiment has overhead.
Building a fraud model isn’t hard. Building a good fraud model is a slow grind of try-measure-discard-repeat. The ideas aren’t the bottleneck. The testing is. What if you could compress 10 weeks of tuning into a few nights?
Enter Autoresearch
Andrej Karpathy—ex-Tesla AI lead, one of the people who built the modern deep learning stack—released a project in early 2026 called autoresearch. It’s one of those ideas that’s almost annoyingly simple.
You give an LLM a codebase. You tell it what metric to optimize. You give it a time budget per experiment. And you say: if the metric improves, keep the change. If not, revert. Now loop forever until I tell you to stop.
That’s the whole thing. Karpathy originally aimed it at neural network training—the LLM modifies model code, runs a 5-minute training session, checks the loss, keeps or reverts, repeats. Every experiment is a git commit. Every result goes into a TSV log. The human goes to sleep.
But here’s what caught my attention: there’s nothing in this pattern that’s specific to neural networks. It works on anything with a codebase, a metric, and a training loop. Like, say, a warranty fraud scoring model.
Think of it as a really persistent junior researcher who never needs coffee: come up with an idea → write the code → run it → check the results → keep or throw away → think of the next idea → repeat. There’s literally no stopping condition. The instructions say “NEVER STOP” in all caps. If the LLM runs out of ideas, it’s told to think harder.
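The loop itself is simple enough to sketch. Here's a minimal, hypothetical version of the keep-or-revert skeleton — `propose_change()` stands in for the LLM step that edits the model file, and the script is assumed to print its metric as the last line of output (this is a sketch of the pattern, not Karpathy's actual implementation):

```python
import csv
import subprocess

def decide(score, best):
    """Keep-or-revert rule: keep only a strict improvement."""
    return ("keep", score) if score > best else ("revert", best)

def run_experiment(script="fraud_model.py", timeout=600):
    """Run the training script; it must print the metric as its last line."""
    out = subprocess.run(["python", script], capture_output=True,
                         text=True, timeout=timeout)
    return float(out.stdout.strip().splitlines()[-1])

def autoresearch_loop(propose_change, log_path="experiments.tsv"):
    """propose_change() is the hypothetical LLM step that edits the model file."""
    best = run_experiment()                       # baseline score
    while True:                                   # the spec says NEVER STOP
        propose_change()                          # LLM rewrites fraud_model.py
        score = run_experiment()
        status, best = decide(score, best)
        if status == "keep":
            subprocess.run(["git", "commit", "-am", f"keep: {score:.4f}"])
        else:                                     # revert the failed experiment
            subprocess.run(["git", "checkout", "--", "fraud_model.py"])
        with open(log_path, "a", newline="") as f:
            csv.writer(f, delimiter="\t").writerow([f"{score:.4f}", status])
```

Every kept change is a commit, every result is a TSV row, and the human reads the log in the morning.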
How Fast Could It Realistically Go?
This depends on what’s being run. Karpathy’s original setup trains neural networks for 5 minutes per experiment on a GPU—that’s the bottleneck. But fraud models on tabular data are a different story. An XGBoost or LightGBM model trained on 100K–500K warranty claims typically fits in 30 seconds to 2 minutes, not 5. The real time cost per iteration is the LLM reasoning overhead—reading the code, thinking about what to change, writing the diff—which currently takes 1–3 minutes depending on the model and the complexity of the change.
So a realistic estimate for fraud model tuning: roughly 8–15 minutes per experiment, including LLM reasoning time, code modification, model training, and evaluation. That’s 4–7 experiments per hour, or somewhere around 30–50 experiments overnight. Not the 100 experiments Karpathy gets with GPU-bound neural net training, but still dramatically more than the 3–4 a human team runs per week.
Autoresearch for Fraud Scoring: How It Would Work
Let’s think through what this looks like if you applied the autoresearch pattern to the exact problem from my previous article—aftermarket warranty fraud scoring.
What’s locked down (prepare.py)
Just like Karpathy’s setup, you’d fix everything the LLM isn’t allowed to touch. The dataset: 18 months of historical warranty claims, labelled by fraud investigators (roughly 7% fraud rate—typical for aftermarket). The train/validation/test split. The evaluation function: train the model, score the validation set, report AUC-ROC, precision at 90% recall, and false positive rate. Time budget per experiment. The LLM can’t touch the data, can’t change the evaluation, can’t install new packages.
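The fixed evaluation function could look something like this sketch — a sklearn-based stand-in, with the "precision at 90% recall" bookkeeping spelled out (the metric names are from this article, not from any particular codebase):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def evaluate(y_true, y_score):
    """Locked-down evaluation: AUC-ROC, precision at 90% recall, and the
    false positive rate at the threshold that achieves that recall."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    auc = roc_auc_score(y_true, y_score)
    prec, rec, thr = precision_recall_curve(y_true, y_score)
    ok = rec[:-1] >= 0.90            # last (prec, rec) point has no threshold
    if not ok.any():                 # can't reach 90% recall at any cutoff
        return {"auc_roc": auc, "p_at_90r": 0.0, "fpr": 1.0}
    # among thresholds that keep recall >= 90%, take the most precise one
    i = np.flatnonzero(ok)[prec[:-1][ok].argmax()]
    flagged = y_score >= thr[i]
    return {
        "auc_roc": auc,
        "p_at_90r": prec[i],
        "fpr": flagged[y_true == 0].mean(),
    }
```

Because this function lives in the locked-down file, every experiment — human or LLM — is scored by exactly the same yardstick.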
What the LLM can mess with (fraud_model.py)
One file, everything inside it is fair game. That includes: which algorithm to use (XGBoost, LightGBM, random forest, logistic regression, neural net, ensembles), all hyperparameters, feature engineering logic (what to compute from raw claim data), class balancing strategy (SMOTE, undersampling, class weights, threshold tuning), feature selection, and how to handle missing data. The LLM reads the code, reasons about what might improve the score, makes a change, and runs the experiment.
The metric
You’d optimize for AUC-ROC as the primary metric, with a hard constraint: precision at 90% recall can’t drop below 0.40. In plain English—the model has to catch at least 90% of actual fraud, and when it flags something, it needs to be right at least 40% of the time. Any experiment that violates this gets discarded, no matter how good the AUC looks. This mirrors how fraud teams actually think: miss rate matters more than false alarm rate, but false alarms can’t be completely out of control.
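The keep/discard rule with the hard constraint is small enough to state as code (field names are illustrative):

```python
def accept(candidate, best, min_p_at_90r=0.40):
    """Keep a candidate only if it satisfies the hard precision constraint
    AND improves AUC-ROC over the current best."""
    if candidate["p_at_90r"] < min_p_at_90r:
        return False   # constraint violated: discard no matter how good the AUC
    return candidate["auc_roc"] > best["auc_roc"]
```

Note the ordering: the constraint is checked first, so a higher-AUC model that sacrifices precision never makes it into the keep pile.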
What a Plausible Overnight Run Might Look Like
| Commit | AUC-ROC | P@90R | FPR | Status | What It Tried |
|---|---|---|---|---|---|
| a1b2c3d | 0.8130 | 0.42 | 8.1% | keep | Baseline — vanilla XGBoost, raw features, default hyperparams |
| b2c3d4e | 0.8340 | 0.45 | 7.3% | keep | Added claim velocity features: claims per dealer per month, rolling 90-day averages |
| c3d4e5f | 0.8280 | 0.38 | 7.8% | discard | SMOTE oversampling on minority class — precision dropped below threshold |
| d4e5f6g | 0.8510 | 0.48 | 6.5% | keep | Engineered regional deviation features: dealer claim rate vs. regional baseline |
| e5f6g7h | 0.8490 | 0.47 | 6.7% | discard | Switched to random forest — marginal AUC drop, slower training |
| f6g7h8i | 0.8720 | 0.51 | 5.8% | keep | Switched to LightGBM + tuned key hyperparams within time budget |
| g7h8i9j | 0.8690 | 0.50 | 5.9% | discard | Added part-failure-rate-vs-shipped ratio — multicollinear with existing features |
| h8i9j0k | 0.8880 | 0.54 | 5.2% | keep | Added time-since-sale and claim-timing features (day-of-week, month, time-to-claim) |
| i9j0k1l | 0.000 | — | — | crash | Tried graph neural network on dealer-technician network. OOM on adjacency matrix. |
| j0k1l2m | 0.9010 | 0.57 | 4.6% | keep | Graph features via simple degree/PageRank on dealer network, fed into LightGBM |
| k1l2m3n | 0.9080 | 0.59 | 4.3% | keep | TF-IDF cosine similarity on technician diagnostic notes |
| l2m3n4o | 0.9050 | 0.58 | 4.5% | discard | Neural net ensemble with LightGBM — added complexity, marginal gain |
| m3n4o5p | 0.9210 | 0.62 | 3.8% | keep | Calibrated threshold + cost-sensitive learning (10:1 fraud-to-legit penalty) |
That’s 13 experiments shown—the interesting ones. In a real overnight run of 30–50 total iterations, the majority would be minor tweaks that didn’t move the needle and got silently reverted. That’s normal. That’s how research works.
Reading the Trajectory
The hypothetical experiment log is interesting not just for the final number but for how the progression would unfold. Based on what LLMs are already demonstrably good at—code comprehension, domain reasoning from schema, and iterative refinement—here’s the kind of research trajectory you’d expect:
Phase 1 — Feature engineering. The first thing an LLM would likely do is enrich the raw claim features. Claim velocity per dealer. Regional deviation scores. Time-to-claim ratios. These are exactly the features I described in the previous article under “Feature Engineering”—well-established fraud signals that an LLM could derive from reading the data schema and reasoning about what patterns distinguish fraudulent from legitimate claims.
Phase 2 — Algorithm selection. Once the features are solid, the LLM would likely experiment with algorithm switches. LLMs already know that LightGBM often outperforms XGBoost on mid-size tabular data with lower training time—it’s well-documented in their training data. The keep/discard loop would let it confirm this empirically rather than just guessing.
Phase 3 — Advanced signals. This is where things get more interesting. An LLM reading the data schema and seeing dealer-technician relationships could reasonably infer that network structure might signal coordinated fraud—the same insight I flagged as a key ML technique for catching fraud rings. Graph features like degree centrality and PageRank are straightforward to compute with NetworkX, and LLMs write this kind of code routinely. TF-IDF on diagnostic notes is similarly within reach—it’s a common NLP technique that LLMs know well.
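To make that concrete, here's the kind of NetworkX sketch an LLM could plausibly write — a toy claims table with hypothetical dealer/technician IDs, turned into graph features:

```python
import networkx as nx
import pandas as pd

# Hypothetical schema: each claim row links a dealer to a technician
claims = pd.DataFrame({
    "dealer_id":     ["D1", "D1", "D2", "D2", "D3"],
    "technician_id": ["T1", "T2", "T1", "T3", "T3"],
})

# Build the dealer-technician network from claim co-occurrence
G = nx.Graph()
G.add_edges_from(zip(claims["dealer_id"], claims["technician_id"]))

degree = dict(G.degree())     # how many distinct partners each node has
pagerank = nx.pagerank(G)     # relative influence within the network

# Join the graph features back onto the claims as model inputs
claims["dealer_degree"] = claims["dealer_id"].map(degree)
claims["dealer_pagerank"] = claims["dealer_id"].map(pagerank)
```

A dealer whose technicians also service several other flagged dealers ends up with unusual centrality scores — exactly the coordinated-fraud signal a tabular model can't see in per-claim features alone.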
Phase 4 — Calibration. Threshold tuning and cost-sensitive learning are the kind of production-readiness refinements that LLMs handle comfortably. This phase would likely yield the last few points of AUC improvement.
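A sketch of what that calibration step might look like, on synthetic data — here `LogisticRegression` with `class_weight` stands in for LightGBM's `scale_pos_weight`, and all numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
# Synthetic labels with a low positive rate, loosely mimicking fraud data
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=2000) > 3.0).astype(int)

# Cost-sensitive learning: a missed fraud costs 10x a false alarm.
# class_weight stands in here for e.g. LightGBM's scale_pos_weight.
clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# Calibrated threshold: the lowest cutoff that keeps ~90% of fraud above it
fraud_scores = np.sort(scores[y == 1])
threshold = fraud_scores[int(0.10 * len(fraud_scores))]
recall = (scores[y == 1] >= threshold).mean()
```

The point isn't the specific classifier — it's that the final points of improvement come from aligning the decision rule with the cost structure, not from a better model.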
Two plausible discards are worth calling out. SMOTE — a common oversampling technique for imbalanced data. It often hurts precision on fraud datasets where the class boundary is noisy, and many teams spend days tweaking SMOTE parameters before reaching this conclusion. An autoresearch loop would reach it in one iteration. A neural net ensemble — it might work marginally, but Karpathy’s spec explicitly values simplicity: “A small improvement that adds ugly complexity is not worth it.” An LLM following that instruction would likely revert a complex ensemble that adds 0.003 AUC.
Why This Isn’t Just AutoML
If you’ve worked with Optuna or scikit-learn’s GridSearchCV or any AutoML framework, you might be thinking: isn’t this just hyperparameter optimization with extra steps? It’s a fair question. The answer is no, and the difference matters.
AutoML tweaks numbers. Autoresearch rewrites code.
A hyperparameter sweep tries learning_rate=0.01 vs. 0.05 vs. 0.1. Autoresearch would read the feature engineering function, understand that claim velocity is a raw count, reason that a rolling 90-day average would be more robust, and rewrite the function. Adding TF-IDF features on diagnostic text isn’t a parameter toggle—it’s a new data pipeline. No grid search does that.
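For example, the velocity rewrite is a genuine code change, not a parameter: here's a hedged pandas sketch, with a hypothetical claims schema (`dealer_id`, `claim_date`):

```python
import pandas as pd

# Hypothetical schema: one row per claim, with dealer and claim date
claims = pd.DataFrame({
    "dealer_id":  ["D1"] * 4 + ["D2"] * 2,
    "claim_date": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-03-01",
                                  "2025-07-01", "2025-02-10", "2025-02-15"]),
})

def add_claim_velocity(df):
    """Per-dealer rolling 90-day claim count (the 'more robust' rewrite)."""
    df = df.sort_values(["dealer_id", "claim_date"]).copy()
    counts = (
        df.set_index("claim_date")
          .assign(one=1)                 # count each claim as 1
          .groupby("dealer_id")["one"]
          .rolling("90D")                # time-based window over claim_date
          .sum()
    )
    df["claims_90d"] = counts.to_numpy()  # same sort order as df
    return df
```

No hyperparameter grid contains this change; it only exists once something rewrites the feature function.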
It can reason about domain context.
An LLM wouldn’t compute dealer PageRank because some config file told it to try graph features. It would read the code, see that the data includes dealer-technician relationships, and reason that network structure might signal coordinated fraud. LLMs routinely demonstrate this kind of schema-to-insight reasoning in code generation tasks today.
It can make judgment calls about complexity.
The neural net ensemble would be discarded not because it failed, but because it wasn’t worth the added complexity. That’s not something you can encode in a search space. Karpathy’s spec asks the LLM to weigh improvement magnitude against code complexity—the same tradeoff a senior data scientist makes, except the loop runs continuously.
| | Traditional AutoML | Autoresearch |
|---|---|---|
| What it changes | Hyperparameters within a fixed model | Model code, features, architecture, everything |
| Feature engineering | You define the features upfront | LLM invents new features from the raw data |
| Algorithm selection | You pick the search space | LLM switches algorithms based on results |
| Domain reasoning | None — treats params as opaque numbers | Reads data schema, infers fraud domain context |
| Complexity judgment | Takes whatever scores highest | Can discard complex low-gain solutions |
| Cross-domain transfer | Needs new search space per problem | Same loop, any codebase, any metric |
What This Could Mean for Fraud Teams — and What It Signals for AI
If you run a fraud analytics team, the practical implication is worth thinking about. The 6–10 week model tuning cycle I described earlier? The part where a data scientist tries XGBoost vs. LightGBM, experiments with SMOTE, engineers graph features, tunes thresholds? Autoresearch could plausibly compress a big chunk of that into a few overnight runs. The team would come back to a set of model variants that have already been through 30–50 iterations, with a clean log of what was tried, what worked, and what didn’t. They review, validate, and decide what to deploy—the exploratory grunt work would be done.
And because the loop would run on labelled data that already exists in any fraud analytics pipeline, the setup cost is mostly just writing the initial fraud_model.py baseline. Most teams already have something close to it.
Can you describe your model training as a Python script that runs in under 10 minutes and outputs a score you want to maximize? If yes, you could point autoresearch at it. For fraud scoring: your script loads the data, trains the model, evaluates on a holdout set, prints AUC-ROC. That’s your fraud_model.py. The bar is low.
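A minimal sketch of such a script, to show how low the bar really is — the column name, CSV path, and `GradientBoostingClassifier` (standing in for XGBoost/LightGBM) are all placeholders for your own pipeline:

```python
# fraud_model.py -- minimal baseline of the kind described above.
# "is_fraud" and "claims.csv" are placeholder names for your own schema;
# GradientBoostingClassifier stands in for XGBoost/LightGBM.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def main(df):
    X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{auc:.4f}")          # the single number the loop maximizes
    return auc

if __name__ == "__main__":
    main(pd.read_csv("claims.csv"))
```

Everything the loop needs is here: load, train, evaluate, print one number.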
But zoom out a bit and something bigger is happening. There’s a reason autoresearch has 51,000+ GitHub stars. It’s one of the clearest demonstrations yet that LLMs can be set up to do iterative, hypothesis-driven research — the kind where you have an idea, test it, learn from the result, and use that learning to form a better idea. Not autocomplete. Not “helpful assistant.” Something closer to actual scientific reasoning, running unsupervised.
Autocomplete (2020–2022)
GPT-3 finishes your sentences. You do all the thinking.
Assistant (2023–2024)
ChatGPT, Claude, Copilot. Writes code, reasons through problems. You still drive every decision.
Researcher (2025–2026)
Autoresearch. Runs its own experiments, evaluates results, compounds discoveries. You review the log in the morning.
??? (2027+)
Same loop, but the model picks which problems to work on and designs its own evaluations. Not there yet. But the trajectory is visible.
The fraud scoring thought experiment makes this tangible. An LLM could plausibly rediscover feature engineering practices that take fraud analysts years to learn. It could make judgment calls about model complexity that mirror what senior data scientists do. It could build a coherent research trajectory — features first, then algorithm selection, then advanced signals, then calibration — that reads like a well-planned development sprint. The individual capabilities are already demonstrated. Autoresearch just gives them a loop to run in.
There are open questions, and they’re not small ones. Would the LLM get stuck in local optima after 20 experiments? Would its “creative” features accidentally encode protected attributes? How do you validate a model that was tuned by a process you didn’t supervise, in a regulated industry that demands explainability? These matter, and they don’t have easy answers yet.
But the direction is clear. Last time, I wrote that ML is doing fraud detection better than rules ever could. This time, the idea is that ML might soon be doing ML better than most teams can — and the tools to try it already exist.
The question isn’t whether this could work — the pattern is proven for neural networks, and tabular ML is a simpler problem. The question is whether your fraud team will be the one exploring it, or the one that hears about the results second-hand.
— On Not Waiting to See What Happens