How we set out to build a personal data-science agent — not a chatbot that pretends to run code, but one that writes Python, executes it, reads real errors, and verifies answers — and what we learned training Gemma-4-2B on Modal with methods we had to invent along the way.
The hackathon asked for something ambitious: take a small open model and make it genuinely useful for data work — exploring tables, cleaning messy columns, aggregating, joining, visualizing, and answering questions with verifiable correctness, not plausible prose.
Our north star was simple to state and hard to achieve:
A 2B-parameter student agent that behaves like a junior data analyst: inspect schema first, run focused code steps, debug from real tracebacks, and only claim an answer after execution confirms it — with a training story credible enough for slides, papers, and a public Hugging Face demo.
Concretely, we targeted:
stdout / errors, not hallucinated <result> blocks
Answer: tags without executing anything.
We built on unsloth/gemma-4-E2B-it, Unsloth's packaging of Google's Gemma 4 E2B
instruction-tuned model (the "E" in E2B stands for effective parameters, roughly 2B).
It's small enough to fine-tune on a single GPU, yet designed with code and tool use in mind.
We used 4-bit quantization, LoRA rank 32, and a 2048-token context throughout.
The project began as three separate Kaggle notebooks covering supervised fine-tuning (SFT),
GRPO reinforcement learning, and DPO preference optimization. We consolidated them into
datasense_pipeline.py — a single Modal application with shared config in
datasense_utils.py — so training could run unattended on cloud GPUs with
checkpoints persisted to a Modal volume and pushed to Hugging Face.
Early runs were misleading because the ported notebooks had latent bugs. We fixed all nine before building the pipeline:
| # | Bug | Impact |
|---|---|---|
| 1 | sft_warmup KeyError | SFT wouldn't start |
| 2 | lora_target_modules KeyError | LoRA attach failed |
| 3 | result_str UnboundLocalError | Agent loop crashed mid-rollout |
| 4 | DPO pairs missing chat template prefix | Preference data malformed |
| 5 | skip_special_tokens=False | Decode pollution in rewards |
| 6 | Dead oci_sft_v1 variable | Confusing / broken cells |
| 7 | GRPO max_steps hardcoded | Config ignored |
| 8 | Shorter SYSTEM_PROMPT in DPO cell | Train/eval prompt drift |
| 9 | _PROBLEM_LOOKUP naming mismatch | Dataset indexing broken |
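Bugs 1, 2, and 7 all come down to config access: two missing keys and one hardcoded value that shadowed the config. One way to make that class of bug fail at startup rather than mid-run is a typed config object. The sketch below is illustrative only: `TrainConfig`, `load_config`, the field names, and most defaults are assumptions (only the model name, 4-bit loading, rank 32, and the 2048-token context come from the text), and the real datasense_utils.py may look quite different.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical shared config; field names and most defaults are illustrative."""
    model_name: str = "unsloth/gemma-4-E2B-it"
    load_in_4bit: bool = True       # from the text: 4-bit quantization
    lora_rank: int = 32             # from the text: LoRA rank 32
    max_seq_length: int = 2048      # from the text: 2048-token context
    sft_warmup: int = 10            # assumed default; bug 1 was a missing key
    grpo_max_steps: int = 300       # read from config, never hardcoded (bug 7)
    # Assumed module list; bug 2 was a missing key for this setting.
    lora_target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")


def load_config(overrides: dict) -> TrainConfig:
    """Build a config, rejecting unknown keys instead of failing mid-run."""
    known = {f.name for f in fields(TrainConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return TrainConfig(**overrides)


# A shortened GRPO run only changes the one field it needs to.
cfg = load_config({"grpo_max_steps": 100})
```

With defaults on every field, a stage that forgets to set `sft_warmup` still starts, and a typo such as `sft_warmpu` raises immediately instead of being silently ignored.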
Our first agent eval reported 0% accuracy for everyone — including SFT — while SFT already showed 100% execution success and ~5.6 agent steps vs base's 2% exec / 1.1 steps. That gap taught us the first big lesson: the model was learning to run code, but we weren't scoring against real data.
Eval workspaces used synthetic random CSVs when DataBench parquet wasn't mounted, but ground truth came from the real dataset. The agent analyzed fake data and was graded against true answers — guaranteed 0%.
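A guard like the following would have surfaced the problem on the first run. It is a minimal sketch, not the project's actual code: `resolve_eval_table` and `MissingEvalDataError` are hypothetical names, and the only detail taken from the text is the `sample.parquet` filename. The idea is simply that eval should abort when the real file is missing rather than fall back to synthetic data.

```python
import tempfile
from pathlib import Path


class MissingEvalDataError(RuntimeError):
    """Raised when the real dataset is absent, so eval results cannot be trusted."""


def resolve_eval_table(workspace: Path, filename: str = "sample.parquet") -> Path:
    """Return the real data file, or fail loudly. Never substitute synthetic data."""
    path = workspace / filename
    if not path.is_file():
        raise MissingEvalDataError(
            f"{path} not found: refusing to grade against ground truth "
            "from a dataset the agent never saw"
        )
    return path


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    try:
        resolve_eval_table(ws)
        fell_back = True
    except MissingEvalDataError:
        fell_back = False  # eval aborts instead of scoring fake data
    (ws / "sample.parquet").write_bytes(b"")  # stand-in for the mounted file
    found = resolve_eval_table(ws).name
```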
Most "data agent" demos finetune on static (question, code, answer) triples. The model learns
to format responses that look like an agent — Answer: tags, pandas snippets,
confident summaries — without ever closing the loop on execution.
We observed three failure modes immediately:
Base Gemma-4 could score well on easy boolean questions by emitting answer tags in a single turn — 0% code execution — beating SFT on accuracy while doing none of the work.
Models invent <result> blocks with fake stdout. RL rewards on text alone
reinforce the illusion of competence.
The fix wasn't "more SFT data." It was changing what we optimize and measure: real subprocess execution, multi-turn observe→fix→retry, and verifiers that compare parsed answers to typed ground truth (boolean, number, category, list types).
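To make "typed ground truth" concrete, here is a small comparator sketch covering the answer types named above. It is not the project's `databench_compare`; `typed_match`, the 1% relative tolerance, and the choice to compare lists order-insensitively are all assumptions made for illustration.

```python
import ast
import math


def _to_bool(text: str) -> bool:
    value = text.strip().lower()
    if value in {"true", "yes", "1"}:
        return True
    if value in {"false", "no", "0"}:
        return False
    raise ValueError(f"not a boolean: {text!r}")


def _to_list(text: str) -> list:
    parsed = ast.literal_eval(text.strip())
    if not isinstance(parsed, (list, tuple)):
        raise ValueError(f"not a list: {text!r}")
    return list(parsed)


def typed_match(pred: str, truth: str, kind: str, rel_tol: float = 1e-2) -> bool:
    """Return True only if pred parses as `kind` and equals truth."""
    try:
        if kind == "boolean":
            return _to_bool(pred) == _to_bool(truth)
        if kind == "number":
            return math.isclose(float(pred), float(truth), rel_tol=rel_tol)
        if kind == "category":
            return pred.strip().lower() == truth.strip().lower()
        if kind == "list[category]":
            norm = lambda xs: sorted(str(x).strip().lower() for x in xs)
            return norm(_to_list(pred)) == norm(_to_list(truth))
        if kind == "list[number]":
            a = sorted(map(float, _to_list(pred)))
            b = sorted(map(float, _to_list(truth)))
            return len(a) == len(b) and all(
                math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
            )
    except (ValueError, SyntaxError, TypeError):
        return False  # unparseable output never counts as correct
    raise ValueError(f"unknown answer type: {kind}")
```

The useful property is the failure path: a dtype dump or a dataframe string offered as a "number" fails to parse and scores as wrong, instead of being matched loosely as text.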
Every training rollout and eval episode follows the same production-shaped loop:
head(), describe(), small SQL LIMIT queries
<result> from sandbox
Answer: + Summary: after verified execution
The system prompt (shared across train, eval, and this HF demo) explicitly forbids hallucinated APIs
and requires the final printed value to match the answer tag. For DataBench we mount real
sample.parquet into the workspace; for DSBench we copy .xlsx workbooks
and use inspect_source for Excel structure.
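The execution step of that loop can be sketched with the standard library alone. This is an assumption-heavy illustration rather than the project's sandbox: `run_in_workspace` is a hypothetical name, the exact layout of the `[EXEC:real]` and `<result>` markers is guessed, and a plain subprocess with a timeout is not a security boundary on its own.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

REAL_TAG = "[EXEC:real]"  # marker named in the text; its exact placement is assumed


def run_in_workspace(code: str, workspace: Path, timeout_s: float = 30.0) -> dict:
    """Run one agent code step in its own process and return a real observation."""
    script = workspace / "step.py"
    script.write_text(code, encoding="utf-8")
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        body = f"TimeoutExpired after {timeout_s}s"
        return {"ok": False, "observation": f"{REAL_TAG} <result>{body}</result>"}
    # On failure the model sees the real traceback, not a summary of it.
    body = proc.stdout if proc.returncode == 0 else proc.stderr
    return {
        "ok": proc.returncode == 0,
        "observation": f"{REAL_TAG} <result>{body.strip()}</result>",
    }


with tempfile.TemporaryDirectory() as tmp:
    good = run_in_workspace("print(6 * 7)", Path(tmp))
    bad = run_in_workspace("print(undefined_name)", Path(tmp))
```

Because the observation is built by the harness and not by the model, a `<result>` block that appears inside the model's own text can be recognized as fabricated.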
Reward signal (simplified):
+ execution actually ran
+ stdout parseable
+ answer matches ground truth (typed comparator)
− hallucinated inline <result> without [EXEC:real]
− debug rambling / column dumps as "answers"
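As a rough illustration of how those terms could combine, the sketch below scores a rollout from five boolean signals. The weights are invented for the example and are not the values used by `compute_trajectory_reward()`; `RolloutSignals` and `sketch_reward` are hypothetical names. The one behavior it tries to capture faithfully is that, with `require_real_execution=True`, a text-only answer earns no positive reward even when it happens to be correct.

```python
from dataclasses import dataclass


@dataclass
class RolloutSignals:
    executed_real: bool        # code actually ran in the sandbox
    stdout_parseable: bool     # a typed value could be parsed from stdout
    answer_correct: bool       # typed comparator matched ground truth
    hallucinated_result: bool  # inline <result> without [EXEC:real]
    debug_rambling: bool       # dtype dumps or column lists offered as the answer


def sketch_reward(s: RolloutSignals, require_real_execution: bool = True) -> float:
    """Illustrative reward; all weights here are assumptions."""
    if require_real_execution and not s.executed_real:
        positive = 0.0  # a correct-looking tag without execution earns nothing
    else:
        positive = (
            0.25 * s.executed_real
            + 0.25 * s.stdout_parseable
            + 0.5 * s.answer_correct
        )
    penalty = 0.5 * s.hallucinated_result + 0.25 * s.debug_rambling
    return positive - penalty


honest = sketch_reward(RolloutSignals(True, True, True, False, False))
faked = sketch_reward(RolloutSignals(False, False, True, True, False))
```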
Our planned stack mirrors modern agent training — with execution at every stage:
Bulk SFT on DataBench-style traces plus agent supplements: multi-turn dialogs, Jupyter-agent
traces, dashboard examples, and code-feedback execution pairs. This produced our strongest
baseline — sanjaymalladi/DataSense-Modal-E2B-SFT.
Group Relative Policy Optimization with real Python rollouts per prompt.
Each step spawns multiple agent trajectories; rewards use compute_trajectory_reward()
with require_real_execution=True.
GRPO on Gemma-4 is brutally slow (~11 min/step on A100) because most wall time is
CPU-bound execution, not GPU matmul — 4 rollouts × up to 5 agent steps ×
subprocess sandboxing. We fixed trajectory forwarding bugs, KL instability
(final_logit_softcapping=30), and added parallel rollout workers — but full
300-step GRPO remained impractical within hackathon time. A shortened 100-step run was targeted.
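The time budget is easy to check from the figures above. The snippet only does the arithmetic, treating ~11 min/step as a constant, which is a simplification since step time varies with how many agent steps each rollout uses.

```python
MIN_PER_STEP = 11        # approximate A100 figure quoted above
ROLLOUTS_PER_PROMPT = 4
MAX_AGENT_STEPS = 5


def grpo_wall_hours(num_steps: int, min_per_step: float = MIN_PER_STEP) -> float:
    """Estimated wall-clock hours for a GRPO run at a fixed per-step cost."""
    return num_steps * min_per_step / 60


# Upper bound on sandboxed code executions per prompt in one GRPO step.
max_sandbox_runs = ROLLOUTS_PER_PROMPT * MAX_AGENT_STEPS

full_run_hours = grpo_wall_hours(300)   # 55 hours, roughly 2.3 days
short_run_hours = grpo_wall_hours(100)  # roughly 18.3 hours
```

So the full 300-step run needs about 55 hours of uninterrupted A100 time, while the 100-step target fits in under a day.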
Preference pairs from high vs low reward rollouts (min gap 0.15) — planned but deprioritized once EVTE-STaR showed more promise for hard-question gains within our compute budget.
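Since this stage was planned but not run, the following is only a sketch of how the pairing rule might look. `build_preference_pairs` is a hypothetical helper, pairing every two rollouts of the same prompt is an assumption, and the `prompt` / `chosen` / `rejected` keys follow the common preference-dataset layout. The `prompt` value is expected to already include the chat template prefix, which is what bug 4 was about.

```python
from itertools import combinations


def build_preference_pairs(prompt: str, rollouts: list, min_gap: float = 0.15) -> list:
    """Pair rollouts of one prompt whose rewards differ by at least min_gap.

    rollouts: list of (completion_text, reward) tuples.
    """
    pairs = []
    for (text_a, r_a), (text_b, r_b) in combinations(rollouts, 2):
        if abs(r_a - r_b) < min_gap:
            continue  # too close to call: no reliable preference signal
        chosen, rejected = (text_a, text_b) if r_a > r_b else (text_b, text_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


example_pairs = build_preference_pairs(
    "<templated prompt>",
    [("ran code, correct", 0.9), ("ran code, close", 0.8), ("fake result", 0.1)],
)
```

In this example the two high-reward rollouts are only 0.1 apart, so they are never paired against each other; each is paired only against the low-reward rollout.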
Before EVTE could work, we needed execution-grounded rollouts, typed verifiers, and honest eval. These are the plumbing; the novel research contribution is EVTE + EVTE-STaR (sections 07–11 below).
Every GRPO/DPO/EVTE trajectory runs code in an isolated workspace. Rewards ignore fake
<result> tags unless tagged [EXEC:real].
databench_compare + neural verifier)
Evidence-bound scoring chain: exec stdout → Answer: tag → LLM extract → typed compare
(boolean, float, category, list[category], list[number]).
Without this, mentors "fail" when extraction fails, not when reasoning fails.
DataBench lite scores against sample_answer on mounted parquet.
run_hackathon_benchmarks_parallel runs Base / SFT / Micro-1 across three benchmarks on T4.
EVTE is the method we built when classical distillation and STaR broke down for data agents. The name encodes three commitments:
Classical STaR (Self-Taught Reasoner) assumes a strong teacher can produce correct reasoning chains, filter them, and fine-tune the student offline. That fails for DataSense because:
Implemented in datasense_evte.py — run_evte_episode (offline collection) and run_evte_star_episode (online training).
2B student, up to 5 agent steps, real workspace (CSV/parquet/xlsx). Scored via score_rollout().
Up to 3 rounds of build_self_recovery_feedback() — real tracebacks, answer withheld.
31B mentor solves in a fresh workspace; must pass the same verifier before any hint.
generate_mentor_hint() under MENTOR_HINT_SYSTEM — no final answer, no full script.
Up to 2 attempts × 5 steps. Episode saved only if student verifies after reading the hint.
run_evte_star_episode (simplified control flow):
student_rollout = phase_1_student()
if clean_first_try_verified and not messy_recovery_in_trace:
    return SKIP  # already knows it; not trainable in STaR mode
if not verified:
    for i in 1..3:
        add_user(build_self_recovery_feedback())  # ← EVTE feedback
        student_rollout = student_retry()
mentor_ok, mentor_rollout = mentor_verify_solution(
    student_rollout=junior_trace  # mentor sees failed code
)
if not mentor_ok:
    return DISCARD  # mentor_unverified; no training signal
hint = generate_mentor_hint(student_rollout, mentor_rollout)
add_user("[MENTOR] " + hint)  # diagnostic only
for j in 1..2:
    student_rollout = student_retry()
    if verified:
        return SAVE_TRAINABLE_EPISODE  # mentor_assisted
_prioritize_evte_problems() sorts list[category], list[number],
and multi-answer types before easy booleans. EVTE compute is expensive (two models × multi-step agents);
we spend it where SFT v1 plateaus.
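The ordering can be expressed as a stable sort on a small priority key. This is a guess at the shape of `_prioritize_evte_problems()`, not its source: the `type` and `multi_answer` field names and the three-tier ranking are assumptions.

```python
HARD_TYPES = ("list[category]", "list[number]")


def evte_priority(problem: dict) -> int:
    """Lower value means collected earlier. Tiers are illustrative."""
    if problem["type"] in HARD_TYPES or problem.get("multi_answer", False):
        return 0  # list and multi-answer questions first
    if problem["type"] == "boolean":
        return 2  # easy booleans last
    return 1      # everything else in between


def prioritize(problems: list) -> list:
    # sorted() is stable, so the original order is kept within each tier.
    return sorted(problems, key=evte_priority)


queue = prioritize([
    {"id": "a", "type": "boolean"},
    {"id": "b", "type": "list[number]"},
    {"id": "c", "type": "number"},
    {"id": "d", "type": "category", "multi_answer": True},
    {"id": "e", "type": "list[category]"},
])
order = [p["id"] for p in queue]
```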
Student (2B) and mentor (31B) don't fit comfortably together on one A100. The STaR loop uses
on_micro_batch hooks to unload mentor → micro-SFT student → reload mentor
every 15 episodes. Progress persists to evte_star_progress.json with resume support.
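The batching and resume logic can be sketched independently of any model code. Everything below is illustrative: `record_episode`, `load_progress`, and the JSON keys are hypothetical, and the callback stands in for the real `on_micro_batch` hook that unloads the mentor, runs micro-SFT, and reloads it. Only the 15-episode interval and the progress filename come from the text.

```python
import json
import tempfile
from pathlib import Path

EPISODES_PER_MICRO_BATCH = 15


def load_progress(path: Path) -> dict:
    """Resume from disk if a progress file exists, else start fresh."""
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"saved_episodes": 0, "micro_batches_done": 0}


def record_episode(path: Path, on_micro_batch) -> dict:
    """Count one saved episode and fire the hook on every 15th."""
    state = load_progress(path)
    state["saved_episodes"] += 1
    if state["saved_episodes"] % EPISODES_PER_MICRO_BATCH == 0:
        # Real hook: unload mentor, micro-SFT the student, reload mentor.
        on_micro_batch(state["micro_batches_done"] + 1)
        state["micro_batches_done"] += 1
    path.write_text(json.dumps(state), encoding="utf-8")
    return state


with tempfile.TemporaryDirectory() as tmp:
    progress_file = Path(tmp) / "evte_star_progress.json"
    fired = []
    for _ in range(31):
        state = record_episode(progress_file, fired.append)
```

Because state is re-read from disk on every call, a restarted job picks up at the same episode count without retraining a finished micro-batch.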
The most underrated piece of EVTE is not the mentor — it's what we put in the user turn
when the student fails. This is build_self_recovery_feedback() in
datasense_evte.py.
Messy success = verified answer but conversation contains debug/recovery language
(trajectory_has_recovery_signal()). We don't want to reinforce "stumble into correctness"
without tutor review in STaR mode.
When we later fine-tuned only on recovery trajectories (SFT v2), the model learned the shape of debug prose — dtype dumps, column lists — without improving verified answers. Lesson: self-recovery feedback is essential during collection, but training must mix clean completions with mentor-assisted wins, not recovery-only soup.
The mentor is google/gemma-4-31B-it (4-bit via Unsloth). It is not an oracle
that whispers answers. It must earn the right to hint by passing the same execution verifier as the student.
| Mode | Behavior | Config |
|---|---|---|
| series | Same conversation; temps ramp 0.4 → 0.65 → 0.85 | evte_mentor_retry_mode=series |
| parallel | 3 independent workspaces; first verified wins; temps [0.2, 0.5, 0.7] | evte_mentor_retry_mode=parallel |
EVTE-STaR combines EVTE episode collection with online weight updates. Classical STaR: collect all successes → train offline once. EVTE-STaR: collect 15 verified mentor-assisted wins → micro-SFT 30 steps → student is slightly better → repeat.
Micro-batch 1, replayed in RAM, scored 100% on mentor-hard (5 problems). The saved Micro-1 checkpoint scored ~60% on a confirmatory run. Replay of batches 2–3 scored ~80%. The final batch 6 checkpoint scored ~40%, worse than SFT v1.
Online micro-SFT needs early stopping on a held-out hard set, not "more batches = better."
We only preserved micro-1 and final checkpoints on the volume — sweet-spot batches 2–3 were lost
until run_micro_replay_eval reconstructed them in RAM.
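A simple selection rule over held-out scores would have kept the right checkpoint. The sketch uses only the approximate mentor-hard figures reported above (batches 4 and 5 were not reported, so they are absent); `pick_checkpoint` and `should_stop` are hypothetical helpers, and a patience of 2 is an arbitrary choice.

```python
def pick_checkpoint(heldout_acc: dict) -> int:
    """Return the batch with the best held-out accuracy; ties go to the earlier batch."""
    return min(heldout_acc, key=lambda batch: (-heldout_acc[batch], batch))


def should_stop(history: list, patience: int = 2) -> bool:
    """Stop once `patience` evaluations have passed without beating the best score."""
    if not history:
        return False
    best_idx = max(range(len(history)), key=lambda i: history[i])  # first maximum
    return len(history) - 1 - best_idx >= patience


# Approximate mentor-hard accuracy by micro-batch, as reported above.
reported = {1: 0.60, 2: 0.80, 3: 0.80, 6: 0.40}
best_batch = pick_checkpoint(reported)
```

This only works if every micro-batch checkpoint is saved, or at least evaluated, before the next one overwrites it.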
Every episode ends in exactly one outcome. The outcome determines whether it enters training.
| Outcome | Meaning | EVTE-STaR: train? |
|---|---|---|
| self_solved_clean | First-try verified, no recovery signals in trace | Skip |
| self_recovered | Fixed via self-recovery feedback only | Optional |
| mentor_assisted | Failed → mentor verified → hint → student verified | Yes |
| discarded | Mentor couldn't pass execution verifier | No |
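The outcome-to-training rule is small enough to write down directly. This is a sketch of the policy as described here, not the project's code: `TRAIN_POLICY`, `is_trainable`, and the `include_self_recovered` switch (standing in for "Optional") are hypothetical.

```python
# Outcome names are from the text; the policy labels are shorthand for this sketch.
TRAIN_POLICY = {
    "self_solved_clean": "skip",
    "self_recovered": "optional",
    "mentor_assisted": "yes",
    "discarded": "no",
}


def is_trainable(outcome: str, include_self_recovered: bool = False) -> bool:
    """Decide whether an episode with this outcome enters the micro-SFT set."""
    policy = TRAIN_POLICY[outcome]  # unknown outcomes raise KeyError on purpose
    return policy == "yes" or (policy == "optional" and include_self_recovered)


episodes = ["self_solved_clean", "mentor_assisted", "self_recovered", "discarded"]
kept = [o for o in episodes if is_trainable(o)]
```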
SFT v1 consistently runs real Python (100% exec on many evals), uses ~4–5 agent steps, and beats base on hard questions where base "wins" without code. This is the behavioral foundation everything else builds on.
Saving only mentor-assisted verified trajectories produced high-signal data — multi-turn debug with real errors, not synthetic Q/A. 92 episodes is small but curated.
At ~11 min/step, hundreds of steps of execution-bound rollouts add up to multi-day runs. Parallel rollout workers helped but couldn't change the fundamental CPU/GPU pipeline stall. vLLM isn't available for Gemma 4 E2B, so generation stays on HF generate.
Training only on EVTE recovery trajectories taught debug prose — column dtype dumps, rambling — without improving answers. Mentor-hard: 40% vs SFT v1's 60%.
Agent accuracy on real data files (lite DataBench parquet, DSBench Excel, mentor-hard pool). Macro average = unweighted mean across three benchmarks (30 problems). Always pair accuracy with exec_ok — base can match easy booleans via answer tags without running code.
Parallel eval: run_hackathon_benchmarks_parallel · 3× T4 · June 2026.
| Model | DataBench (15) | DSBench (10) | Mentor-hard (5) | Macro avg | Total |
|---|---|---|---|---|---|
| Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 |
| SFT v1 | 86.7% | 0.0% | 60.0% | 48.9% | 16/30 |
| EVTE Micro-1 | 80.0% | 0.0%* | 100.0% | 60.0% | 17/30 |
*The official DSBench scorer gives 0% for all models. Micro-1's answer to Q15, $12,829,511, corresponds to option A (correct) but was graded wrong because the scorer compares option letters, not dollar values. A value-aware DSBench score would be 1/10, lifting Micro-1's macro average to 63.3%.
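The macro averages in the table can be reproduced from per-benchmark counts. The correct-answer counts below are derived from the reported percentages and benchmark sizes (for example, 86.7% of 15 is 13), not taken from raw logs.

```python
SIZES = (15, 10, 5)  # DataBench, DSBench, Mentor-hard

# Correct counts per benchmark, derived from the table's percentages.
CORRECT = {
    "Base": (9, 0, 1),
    "SFT v1": (13, 0, 3),
    "EVTE Micro-1": (12, 0, 5),
}


def macro_avg(correct, sizes=SIZES) -> float:
    """Unweighted mean of per-benchmark accuracy."""
    return sum(c / n for c, n in zip(correct, sizes)) / len(sizes)


def pooled_acc(correct, sizes=SIZES) -> float:
    """Accuracy over all 30 problems, weighting each problem equally."""
    return sum(correct) / sum(sizes)


macro = {name: round(100 * macro_avg(c), 1) for name, c in CORRECT.items()}
pooled = {name: round(100 * pooled_acc(c), 1) for name, c in CORRECT.items()}

# Crediting Micro-1's Q15 on DSBench (1/10) gives the value-aware macro average.
value_aware_micro1 = round(100 * macro_avg((12, 1, 5)), 1)
```

Note that the macro average weights the 5-problem mentor-hard pool as heavily as the 15-problem DataBench slice, which is why Micro-1's 60.0% macro sits above its pooled 17/30 (56.7%).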
| Eval | Base | SFT v1 | Micro-1 / SFT v2 |
|---|---|---|---|
| Quick DataBench (5) | 80% acc / 0% exec | 80% / 100% exec | SFT v2: 40% |
| Mentor-hard (5) | 40% / 0% exec | 60% / 100% exec | Micro-1 replay: 100% (RAM); saved ckpt ~60% |
Models often run code (50–100% exec_ok) but return dataframe strings, 0.0, or dollar amounts that map to the wrong MCQ letter. Only one case (Micro-1 Q15) was a true scoring-format bug. DSBench 0% is mostly real Excel/parsing failure, not a broken metric.
Micro-1 wins macro average (60% vs 48.9%) on paper — driven by a perfect 5/5 on mentor-hard. We still ship SFT v1 on this Hugging Face Space. Here's why:
| Factor | SFT v1 | EVTE Micro-1 |
|---|---|---|
| DataBench (breadth) | 86.7% — best on the largest held-out slice | 80.0% |
| Mentor-hard (depth) | 60% (3/5), 100% exec | 100% (5/5) on first complete run |
| Stability | Single bulk SFT — predictable at inference | Online micro-SFT batch 1 — replay 100% vs saved ckpt ~60% |
| Straggler reruns | Held up when Modal overwrote volume | Mentor-hard dropped to 60% on duplicate run |
| Live demo risk | Lower — fewer debug ramble / dtype dumps | Higher — tuned on hard pool, can overfit quirks |
| Story on slides | “Execution-grounded baseline that works” | “EVTE-STaR peak — best hard-pool result” |
Gradio Space → SFT v1 (sanjaymalladi/DataSense-Modal-E2B-SFT) for reliable live CSV demos.
Slides → show all three models; cite Micro-1 as evidence EVTE-STaR helps on the hard curated pool, not as the production default yet.
| Benchmark | Problems | What it tests | Status |
|---|---|---|---|
| DataBench test (lite) | 15 | SemEval-style QA on real parquet samples | integrated |
| DSBench analysis | 10 | ModelOff Excel financial modeling | integrated |
| Mentor-hard | 5 | Curated EVTE failures | integrated |
| Checkpoint | HF repo | Role |
|---|---|---|
| Base | unsloth/gemma-4-E2B-it | Frozen foundation |
| SFT v1 ★ demo | DataSense-Modal-E2B-SFT | Live HF Space adapter — stable execution |
| EVTE-STaR Micro-1 | DataSense-Modal-E2B-EVTE-Star-Micro1 | Best mentor-hard (5/5) — research checkpoint |
The Gradio app runs SFT v1 — same agent loop as training eval: load CSV → multi-step code generation → sandbox execution → Answer + Summary. Six built-in examples cover sales, employees, and students datasets.
hf_demo/ folder to a Gradio Space (GPU T4), set HF_TOKEN if needed.
README.md frontmatter
hf_demo/ including assets/illustrations/ and story.html
HF_TOKEN if adapter repo is private