A few weeks ago, I wrote about why ML beats hardcoded rules for warranty fraud detection. The short version: rule-based systems are static, fraudsters aren’t, and ML models can learn the difference between a legitimate claim and a scam without someone writing every rule by hand. Good. But that article left a question on the table: once you’ve decided to build the ML model, who tunes it? Because that part—the feature engineering, the architecture choices, the hyperparameter sweeps, the retraining cycles—is where most fraud analytics projects quietly stall. What if an LLM could do that tuning on its own, overnight, while the team sleeps?
The Tuning Problem Nobody Talks About
Here’s how fraud model development usually goes. A data science team spends two weeks cleaning and labelling claim data. They build a baseline XGBoost model. It works OK—maybe an AUC-ROC of 0.81, decent precision, not great recall. Now the real work starts: tuning.
Should you add graph features from the dealer network? Switch to LightGBM? Try an autoencoder for anomaly pre-scoring? Upsample the minority class or use SMOTE? Change the classification threshold? Add time-decay to older claims? Every one of these is a valid idea. Every one takes a data scientist a day or two to implement, test, and evaluate. And they interact—adding graph features might help with LightGBM but not XGBoost. Threshold tuning that works before SMOTE doesn’t work after it.
A realistic team might test 3–4 serious model variants per week. Most aftermarket fraud teams I’ve seen take 6–10 weeks to go from baseline to production-ready model. That’s not because the people are slow. It’s because the search space is enormous and each experiment has overhead.
Building a fraud model isn’t hard. Building a good fraud model is a slow grind of try-measure-discard-repeat. The ideas aren’t the bottleneck. The testing is. What if you could compress 10 weeks of tuning into a few nights?
Enter Autoresearch
Andrej Karpathy—ex-Tesla AI lead, one of the people who built the modern deep learning stack—released a project in early 2026 called autoresearch. It’s one of those ideas that’s almost annoyingly simple.
You give an LLM a codebase. You tell it what metric to optimize. You give it a time budget per experiment. And you say: if the metric improves, keep the change. If not, revert. Now loop forever until I tell you to stop.
That’s the whole thing. Karpathy originally aimed it at neural network training—the LLM modifies model code, runs a 5-minute training session, checks the loss, keeps or reverts, repeats. Every experiment is a git commit. Every result goes into a TSV log. The human goes to sleep.
But here’s what caught my attention: there’s nothing in this pattern that’s specific to neural networks. It works on anything with a codebase, a metric, and a training loop. Like, say, a warranty fraud scoring model.
Think of it as a really persistent junior researcher who never needs coffee: come up with an idea → write the code → run it → check the results → keep or throw away → think of the next idea → repeat. There’s literally no stopping condition. The instructions say “NEVER STOP” in all caps. If the LLM runs out of ideas, it’s told to think harder.
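The loop itself is simple enough to sketch. Here's a minimal, hypothetical version of the keep-or-revert skeleton — `propose_change()` stands in for the LLM step that edits the model file, and the script is assumed to print its metric as the last line of output (this is a sketch of the pattern, not Karpathy's actual implementation):

```python
import csv
import subprocess

def decide(score, best):
    """Keep-or-revert rule: keep only a strict improvement."""
    return ("keep", score) if score > best else ("revert", best)

def run_experiment(script="fraud_model.py", timeout=600):
    """Run the training script; it must print the metric as its last line."""
    out = subprocess.run(["python", script], capture_output=True,
                         text=True, timeout=timeout)
    return float(out.stdout.strip().splitlines()[-1])

def autoresearch_loop(propose_change, log_path="experiments.tsv"):
    """propose_change() is the hypothetical LLM step that edits the model file."""
    best = run_experiment()                       # baseline score
    while True:                                   # the spec says NEVER STOP
        propose_change()                          # LLM rewrites fraud_model.py
        score = run_experiment()
        status, best = decide(score, best)
        if status == "keep":
            subprocess.run(["git", "commit", "-am", f"keep: {score:.4f}"])
        else:                                     # revert the failed experiment
            subprocess.run(["git", "checkout", "--", "fraud_model.py"])
        with open(log_path, "a", newline="") as f:
            csv.writer(f, delimiter="\t").writerow([f"{score:.4f}", status])
```

Every kept change is a commit, every result is a TSV row, and the human reads the log in the morning.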
How Fast Could It Realistically Go?
This depends on what’s being run. Karpathy’s original setup trains neural networks for 5 minutes per experiment on a GPU—that’s the bottleneck. But fraud models on tabular data are a different story. An XGBoost or LightGBM model trained on 100K–500K warranty claims typically fits in 30 seconds to 2 minutes, not 5. The real time cost per iteration is the LLM reasoning overhead—reading the code, thinking about what to change, writing the diff—which currently takes 1–3 minutes depending on the model and the complexity of the change.
So a realistic estimate for fraud model tuning: roughly 8–15 minutes per experiment, including LLM reasoning time, code modification, model training, and evaluation. That’s 4–7 experiments per hour, or somewhere around 30–50 experiments overnight. Not the 100 experiments Karpathy gets with GPU-bound neural net training, but still dramatically more than the 3–4 a human team runs per week.
Autoresearch for Fraud Scoring: How It Would Work
Let’s think through what this looks like if you applied the autoresearch pattern to the exact problem from my previous article—aftermarket warranty fraud scoring.
What’s locked down (prepare.py)
Just like Karpathy’s setup, you’d fix everything the LLM isn’t allowed to touch. The dataset: 18 months of historical warranty claims, labelled by fraud investigators (roughly 7% fraud rate—typical for aftermarket). The train/validation/test split. The evaluation function: train the model, score the validation set, report AUC-ROC, precision at 90% recall, and false positive rate. Time budget per experiment. The LLM can’t touch the data, can’t change the evaluation, can’t install new packages.
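The fixed evaluation function could look something like this sketch — a sklearn-based stand-in, with the "precision at 90% recall" bookkeeping spelled out (the metric names are from this article, not from any particular codebase):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def evaluate(y_true, y_score):
    """Locked-down evaluation: AUC-ROC, precision at 90% recall, and the
    false positive rate at the threshold that achieves that recall."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    auc = roc_auc_score(y_true, y_score)
    prec, rec, thr = precision_recall_curve(y_true, y_score)
    ok = rec[:-1] >= 0.90            # last (prec, rec) point has no threshold
    if not ok.any():                 # can't reach 90% recall at any cutoff
        return {"auc_roc": auc, "p_at_90r": 0.0, "fpr": 1.0}
    # among thresholds that keep recall >= 90%, take the most precise one
    i = np.flatnonzero(ok)[prec[:-1][ok].argmax()]
    flagged = y_score >= thr[i]
    return {
        "auc_roc": auc,
        "p_at_90r": prec[i],
        "fpr": flagged[y_true == 0].mean(),
    }
```

Because this function lives in the locked-down file, every experiment — human or LLM — is scored by exactly the same yardstick.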
What the LLM can mess with (fraud_model.py)
One file, everything inside it is fair game. That includes: which algorithm to use (XGBoost, LightGBM, random forest, logistic regression, neural net, ensembles), all hyperparameters, feature engineering logic (what to compute from raw claim data), class balancing strategy (SMOTE, undersampling, class weights, threshold tuning), feature selection, and how to handle missing data. The LLM reads the code, reasons about what might improve the score, makes a change, and runs the experiment.
The metric
You’d optimize for AUC-ROC as the primary metric, with a hard constraint: precision at 90% recall can’t drop below 0.40. In plain English—the model has to catch at least 90% of actual fraud, and when it flags something, it needs to be right at least 40% of the time. Any experiment that violates this gets discarded, no matter how good the AUC looks. This mirrors how fraud teams actually think: miss rate matters more than false alarm rate, but false alarms can’t be completely out of control.
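The keep/discard rule with the hard constraint is small enough to state as code (field names are illustrative):

```python
def accept(candidate, best, min_p_at_90r=0.40):
    """Keep a candidate only if it satisfies the hard precision constraint
    AND improves AUC-ROC over the current best."""
    if candidate["p_at_90r"] < min_p_at_90r:
        return False   # constraint violated: discard no matter how good the AUC
    return candidate["auc_roc"] > best["auc_roc"]
```

Note the ordering: the constraint is checked first, so a higher-AUC model that sacrifices precision never makes it into the keep pile.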
What a Plausible Overnight Run Might Look Like
| Commit | AUC-ROC | P@90R | FPR | Status | What It Tried |
|---|---|---|---|---|---|
| a1b2c3d | 0.8130 | 0.42 | 8.1% | keep | Baseline — vanilla XGBoost, raw features, default hyperparams |
| b2c3d4e | 0.8340 | 0.45 | 7.3% | keep | Added claim velocity features: claims per dealer per month, rolling 90-day averages |
| c3d4e5f | 0.8280 | 0.38 | 7.8% | discard | SMOTE oversampling on minority class — precision dropped below threshold |
| d4e5f6g | 0.8510 | 0.48 | 6.5% | keep | Engineered regional deviation features: dealer claim rate vs. regional baseline |
| e5f6g7h | 0.8490 | 0.47 | 6.7% | discard | Switched to random forest — marginal AUC drop, slower training |
| f6g7h8i | 0.8720 | 0.51 | 5.8% | keep | Switched to LightGBM + tuned key hyperparams within time budget |
| g7h8i9j | 0.8690 | 0.50 | 5.9% | discard | Added part-failure-rate-vs-shipped ratio — multicollinear with existing features |
| h8i9j0k | 0.8880 | 0.54 | 5.2% | keep | Added time-since-sale and claim-timing features (day-of-week, month, time-to-claim) |
| i9j0k1l | 0.000 | — | — | crash | Tried graph neural network on dealer-technician network. OOM on adjacency matrix. |
| j0k1l2m | 0.9010 | 0.57 | 4.6% | keep | Graph features via simple degree/PageRank on dealer network, fed into LightGBM |
| k1l2m3n | 0.9080 | 0.59 | 4.3% | keep | TF-IDF cosine similarity on technician diagnostic notes |
| l2m3n4o | 0.9050 | 0.58 | 4.5% | discard | Neural net ensemble with LightGBM — added complexity, marginal gain |
| m3n4o5p | 0.9210 | 0.62 | 3.8% | keep | Calibrated threshold + cost-sensitive learning (10:1 fraud-to-legit penalty) |
That’s 13 experiments shown—the interesting ones. In a real overnight run of 30–50 total iterations, the majority would be minor tweaks that didn’t move the needle and got silently reverted. That’s normal. That’s how research works.
Reading the Trajectory
The hypothetical experiment log is interesting not just for the final number but for how the progression would unfold. Based on what LLMs are already demonstrably good at—code comprehension, domain reasoning from schema, and iterative refinement—here’s the kind of research trajectory you’d expect:
Phase 1 — Feature engineering. The first thing an LLM would likely do is enrich the raw claim features. Claim velocity per dealer. Regional deviation scores. Time-to-claim ratios. These are exactly the features I described in the previous article under “Feature Engineering”—well-established fraud signals that an LLM could derive from reading the data schema and reasoning about what patterns distinguish fraudulent from legitimate claims.
Phase 2 — Algorithm selection. Once the features are solid, the LLM would likely experiment with algorithm switches. LLMs already know that LightGBM often outperforms XGBoost on mid-size tabular data with lower training time—it’s well-documented in their training data. The keep/discard loop would let it confirm this empirically rather than just guessing.
Phase 3 — Advanced signals. This is where things get more interesting. An LLM reading the data schema and seeing dealer-technician relationships could reasonably infer that network structure might signal coordinated fraud—the same insight I flagged as a key ML technique for catching fraud rings. Graph features like degree centrality and PageRank are straightforward to compute with NetworkX, and LLMs write this kind of code routinely. TF-IDF on diagnostic notes is similarly within reach—it’s a common NLP technique that LLMs know well.
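To make that concrete, here's the kind of NetworkX sketch an LLM could plausibly write — a toy claims table with hypothetical dealer/technician IDs, turned into graph features:

```python
import networkx as nx
import pandas as pd

# Hypothetical schema: each claim row links a dealer to a technician
claims = pd.DataFrame({
    "dealer_id":     ["D1", "D1", "D2", "D2", "D3"],
    "technician_id": ["T1", "T2", "T1", "T3", "T3"],
})

# Build the dealer-technician network from claim co-occurrence
G = nx.Graph()
G.add_edges_from(zip(claims["dealer_id"], claims["technician_id"]))

degree = dict(G.degree())     # how many distinct partners each node has
pagerank = nx.pagerank(G)     # relative influence within the network

# Join the graph features back onto the claims as model inputs
claims["dealer_degree"] = claims["dealer_id"].map(degree)
claims["dealer_pagerank"] = claims["dealer_id"].map(pagerank)
```

A dealer whose technicians also service several other flagged dealers ends up with unusual centrality scores — exactly the coordinated-fraud signal a tabular model can't see in per-claim features alone.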
Phase 4 — Calibration. Threshold tuning and cost-sensitive learning are the kind of production-readiness refinements that LLMs handle comfortably. This phase would likely yield the last few points of AUC improvement.
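A sketch of what that calibration step might look like, on synthetic data — here `LogisticRegression` with `class_weight` stands in for LightGBM's `scale_pos_weight`, and all numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
# Synthetic labels with a low positive rate, loosely mimicking fraud data
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=2000) > 3.0).astype(int)

# Cost-sensitive learning: a missed fraud costs 10x a false alarm.
# class_weight stands in here for e.g. LightGBM's scale_pos_weight.
clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# Calibrated threshold: the lowest cutoff that keeps ~90% of fraud above it
fraud_scores = np.sort(scores[y == 1])
threshold = fraud_scores[int(0.10 * len(fraud_scores))]
recall = (scores[y == 1] >= threshold).mean()
```

The point isn't the specific classifier — it's that the final points of improvement come from aligning the decision rule with the cost structure, not from a better model.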
Two plausible discards are worth calling out. SMOTE — a common oversampling technique for imbalanced data. It often hurts precision on fraud datasets where the class boundary is noisy, and many teams spend days tweaking SMOTE parameters before reaching this conclusion. An autoresearch loop would reach it in one iteration. A neural net ensemble — it might work marginally, but Karpathy’s spec explicitly values simplicity: “A small improvement that adds ugly complexity is not worth it.” An LLM following that instruction would likely revert a complex ensemble that adds 0.003 AUC.
Why This Isn’t Just AutoML
If you’ve worked with Optuna or scikit-learn’s GridSearchCV or any AutoML framework, you might be thinking: isn’t this just hyperparameter optimization with extra steps? It’s a fair question. The answer is no, and the difference matters.
AutoML tweaks numbers. Autoresearch rewrites code.
A hyperparameter sweep tries learning_rate=0.01 vs. 0.05 vs. 0.1. Autoresearch would read the feature engineering function, understand that claim velocity is a raw count, reason that a rolling 90-day average would be more robust, and rewrite the function. Adding TF-IDF features on diagnostic text isn’t a parameter toggle—it’s a new data pipeline. No grid search does that.
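For example, the velocity rewrite is a genuine code change, not a parameter: here's a hedged pandas sketch, with a hypothetical claims schema (`dealer_id`, `claim_date`):

```python
import pandas as pd

# Hypothetical schema: one row per claim, with dealer and claim date
claims = pd.DataFrame({
    "dealer_id":  ["D1"] * 4 + ["D2"] * 2,
    "claim_date": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-03-01",
                                  "2025-07-01", "2025-02-10", "2025-02-15"]),
})

def add_claim_velocity(df):
    """Per-dealer rolling 90-day claim count (the 'more robust' rewrite)."""
    df = df.sort_values(["dealer_id", "claim_date"]).copy()
    counts = (
        df.set_index("claim_date")
          .assign(one=1)                 # count each claim as 1
          .groupby("dealer_id")["one"]
          .rolling("90D")                # time-based window over claim_date
          .sum()
    )
    df["claims_90d"] = counts.to_numpy()  # same sort order as df
    return df
```

No hyperparameter grid contains this change; it only exists once something rewrites the feature function.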
It can reason about domain context.
An LLM wouldn’t compute dealer PageRank because some config file told it to try graph features. It would read the code, see that the data includes dealer-technician relationships, and reason that network structure might signal coordinated fraud. LLMs routinely demonstrate this kind of schema-to-insight reasoning in code generation tasks today.
It can make judgment calls about complexity.
The neural net ensemble would be discarded not because it failed, but because it wasn’t worth the added complexity. That’s not something you can encode in a search space. Karpathy’s spec asks the LLM to weigh improvement magnitude against code complexity—the same tradeoff a senior data scientist makes, except the loop runs continuously.
| | Traditional AutoML | Autoresearch |
|---|---|---|
| What it changes | Hyperparameters within a fixed model | Model code, features, architecture, everything |
| Feature engineering | You define the features upfront | LLM invents new features from the raw data |
| Algorithm selection | You pick the search space | LLM switches algorithms based on results |
| Domain reasoning | None — treats params as opaque numbers | Reads data schema, infers fraud domain context |
| Complexity judgment | Takes whatever scores highest | Can discard complex low-gain solutions |
| Cross-domain transfer | Needs new search space per problem | Same loop, any codebase, any metric |
What This Could Mean for Fraud Teams — and What It Signals for AI
If you run a fraud analytics team, the practical implication is worth thinking about. The 6–10 week model tuning cycle I described earlier? The part where a data scientist tries XGBoost vs. LightGBM, experiments with SMOTE, engineers graph features, tunes thresholds? Autoresearch could plausibly compress a big chunk of that into a few overnight runs. The team would come back to a set of model variants that have already been through 30–50 iterations, with a clean log of what was tried, what worked, and what didn’t. They review, validate, and decide what to deploy—the exploratory grunt work would be done.
And because the loop would run on labelled data that already exists in any fraud analytics pipeline, the setup cost is mostly just writing the initial fraud_model.py baseline. Most teams already have something close to it.
Can you describe your model training as a Python script that runs in under 10 minutes and outputs a score you want to maximize? If yes, you could point autoresearch at it. For fraud scoring: your script loads the data, trains the model, evaluates on a holdout set, prints AUC-ROC. That’s your fraud_model.py. The bar is low.
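A minimal sketch of such a script, to show how low the bar really is — the column name, CSV path, and `GradientBoostingClassifier` (standing in for XGBoost/LightGBM) are all placeholders for your own pipeline:

```python
# fraud_model.py -- minimal baseline of the kind described above.
# "is_fraud" and "claims.csv" are placeholder names for your own schema;
# GradientBoostingClassifier stands in for XGBoost/LightGBM.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def main(df):
    X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{auc:.4f}")          # the single number the loop maximizes
    return auc

if __name__ == "__main__":
    main(pd.read_csv("claims.csv"))
```

Everything the loop needs is here: load, train, evaluate, print one number.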
But zoom out a bit and something bigger is happening. There’s a reason autoresearch has 51,000+ GitHub stars. It’s one of the clearest demonstrations yet that LLMs can be set up to do iterative, hypothesis-driven research — the kind where you have an idea, test it, learn from the result, and use that learning to form a better idea. Not autocomplete. Not “helpful assistant.” Something closer to actual scientific reasoning, running unsupervised.
Autocomplete (2020–2022)
GPT-3 finishes your sentences. You do all the thinking.
Assistant (2023–2024)
ChatGPT, Claude, Copilot. Writes code, reasons through problems. You still drive every decision.
Researcher (2025–2026)
Autoresearch. Runs its own experiments, evaluates results, compounds discoveries. You review the log in the morning.
??? (2027+)
Same loop, but the model picks which problems to work on and designs its own evaluations. Not there yet. But the trajectory is visible.
The fraud scoring thought experiment makes this tangible. An LLM could plausibly rediscover feature engineering practices that take fraud analysts years to learn. It could make judgment calls about model complexity that mirror what senior data scientists do. It could build a coherent research trajectory — features first, then algorithm selection, then advanced signals, then calibration — that reads like a well-planned development sprint. The individual capabilities are already demonstrated. Autoresearch just gives them a loop to run in.
There are open questions, and they’re not small ones. Would the LLM get stuck in local optima after 20 experiments? Would its “creative” features accidentally encode protected attributes? How do you validate a model that was tuned by a process you didn’t supervise, in a regulated industry that demands explainability? These matter, and they don’t have easy answers yet.
But the direction is clear. Last time, I wrote that ML is doing fraud detection better than rules ever could. This time, the idea is that ML might soon be doing ML better than most teams can — and the tools to try it already exist.
The question isn’t whether this could work — the pattern is proven for neural networks, and tabular ML is a simpler problem. The question is whether your fraud team will be the one exploring it, or the one that hears about the results second-hand.
— On Not Waiting to See What Happens