DataSense E2B
The Full Story

How we set out to build a personal data-science agent: not a chatbot that pretends to run code, but one that writes Python, executes it, reads real errors, and verifies answers. Along the way, this is what we learned training Gemma 4 E2B on Modal with methods we had to invent.

Base: unsloth/gemma-4-E2B-it
Pipeline: Modal A100/T4
Team: DataSense E2B (Execution-verified, Tutor-escalation)

01 · The goal

The hackathon asked for something ambitious: take a small open model and make it genuinely useful for data work — exploring tables, cleaning messy columns, aggregating, joining, visualizing, and answering questions with verifiable correctness, not plausible prose.

Our north star was simple to state and hard to achieve:

North star

A 2B-parameter student agent that behaves like a junior data analyst: inspect schema first, run focused code steps, debug from real tracebacks, and only claim an answer after execution confirms it — with a training story credible enough for slides, papers, and a public Hugging Face demo.

Concretely, we targeted:

Execution-grounded behavior — rewards and eval tied to real stdout / errors, not hallucinated <result> blocks
Multi-benchmark credibility — DataBench, DSBench Excel analysis, and a curated hard pool from our own training data
A reproducible Modal pipeline — one app, volume checkpoints, automatic HF Hub pushes
Novel training for hard questions — when the student fails, a larger mentor verifies a solution and gives diagnostic hints without leaking the answer

Formatter that fakes answers versus a real execution-verified agent — **Fig 1 — Goal.** We optimize for an agent that runs code on real data and verifies answers — not a model that prints plausible `Answer:` tags without executing anything.

02 · Where we started

The base model

We built on unsloth/gemma-4-E2B-it, Unsloth's packaging of Google's instruction-tuned Gemma 4 model at the E2B size (the "E" stands for effective parameters, roughly 2B). It's small enough to fine-tune on a single GPU, yet designed with code and tool use in mind. We used 4-bit quantization, LoRA rank 32, and a 2048-token context throughout.

Three Kaggle notebooks → one Modal app

The project began as three separate Kaggle notebooks covering supervised fine-tuning (SFT), GRPO reinforcement learning, and DPO preference optimization. We consolidated them into datasense_pipeline.py — a single Modal application with shared config in datasense_utils.py — so training could run unattended on cloud GPUs with checkpoints persisted to a Modal volume and pushed to Hugging Face.

Nine bugs we fixed before trusting any number

Early runs were misleading because the ported notebooks had latent bugs. We fixed all nine before building the pipeline:

#	Bug	Impact
1	`sft_warmup` KeyError	SFT wouldn't start
2	`lora_target_modules` KeyError	LoRA attach failed
3	`result_str` UnboundLocalError	Agent loop crashed mid-rollout
4	DPO pairs missing chat template prefix	Preference data malformed
5	`skip_special_tokens=False`	Decode pollution in rewards
6	Dead `oci_sft_v1` variable	Confusing / broken cells
7	GRPO `max_steps` hardcoded	Config ignored
8	Shorter `SYSTEM_PROMPT` in DPO cell	Train/eval prompt drift
9	`_PROBLEM_LOOKUP` naming mismatch	Dataset indexing broken

Day-one eval: 0% accuracy (and why that was informative)

Our first agent eval reported 0% accuracy for every model, including SFT, even though SFT already showed 100% execution success and ~5.6 agent steps (versus 2% execution success and 1.1 steps for base). That gap taught us the first big lesson: the model was learning to run code, but the eval was grading it against answers from data it never saw.

Root cause

Eval workspaces used synthetic random CSVs when DataBench parquet wasn't mounted, but ground truth came from the real dataset. The agent analyzed fake data and was graded against true answers — guaranteed 0%.

Eval bug: synthetic workspace data scored against real ground truth — **Fig 2 — The 0% eval bug.** Early runs used random synthetic CSVs in the sandbox while ground truth came from real DataBench files — so even a good agent could never match.
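One way to make this class of bug impossible is to fail loudly when the real file is missing instead of falling back to random data. The sketch below illustrates the idea; `resolve_eval_table`, `MissingRealDataError`, and the `synthetic.csv` filename are hypothetical names for this example and are not taken from datasense_pipeline.py.

```python
from pathlib import Path


class MissingRealDataError(RuntimeError):
    """Raised when an eval would otherwise fall back to synthetic data."""


def resolve_eval_table(dataset_dir: Path, allow_synthetic: bool = False) -> Path:
    """Return the path of the table the agent should analyze.

    Assumption: the real DataBench sample lives at <dataset_dir>/sample.parquet.
    """
    real_path = dataset_dir / "sample.parquet"
    if real_path.is_file():
        return real_path
    if allow_synthetic:
        # Only acceptable for smoke tests that skip accuracy scoring.
        return dataset_dir / "synthetic.csv"
    # Refuse to score: ground truth comes from the real file, so any
    # substitute data would guarantee 0% accuracy.
    raise MissingRealDataError(f"{real_path} is not mounted; refusing to score")
```

With a guard like this, an unmounted dataset stops the eval immediately rather than producing a misleading 0%.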

03 · The problem with naive finetuning

Most "data agent" demos finetune on static (question, code, answer) triples. The model learns to format responses that look like an agent — Answer: tags, pandas snippets, confident summaries — without ever closing the loop on execution.

We observed two failure modes immediately:

Formatter, not agent

Base Gemma-4 could score well on easy boolean questions by emitting answer tags in a single turn with 0% code execution, matching SFT on accuracy while doing none of the work.

Hallucinated execution

Models invent <result> blocks with fake stdout. RL rewards on text alone reinforce the illusion of competence.

The fix wasn't "more SFT data." It was changing what we optimize and measure: real subprocess execution, multi-turn observe→fix→retry, and verifiers that compare parsed answers to typed ground truth (boolean, number, category, list types).

04 · The DataSense agent loop

Every training rollout and eval episode follows the same production-shaped loop:

THINK → EXPLORE → EXECUTE → DEBUG → ANSWER

THINK — inspect schema, dtypes, nulls before analysis
EXPLORE — head(), describe(), small SQL LIMIT queries
EXECUTE — one focused Python step; read real <result> from sandbox
DEBUG — fix column names, joins, dtypes from tracebacks
ANSWER — Answer: + Summary: after verified execution

The system prompt (shared across train, eval, and this HF demo) explicitly forbids hallucinated APIs and requires the final printed value to match the answer tag. For DataBench we mount real sample.parquet into the workspace; for DSBench we copy .xlsx workbooks and use inspect_source for Excel structure.

Reward signal (simplified):
  + execution actually ran
  + stdout parseable
  + answer matches ground truth (typed comparator)
  − hallucinated inline <result> without [EXEC:real]
  − debug rambling / column dumps as "answers"
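
The simplified signal above could be turned into a scalar roughly as follows. This is a sketch only: the `RolloutSummary` fields, the `sketch_reward` name, and every weight are assumptions chosen for illustration, not the values used by compute_trajectory_reward().

```python
from dataclasses import dataclass


@dataclass
class RolloutSummary:
    executed_for_real: bool   # at least one sandboxed run actually happened
    stdout_parseable: bool    # an answer could be parsed from real stdout
    answer_correct: bool      # typed comparator agreed with ground truth
    fake_result_blocks: int   # inline <result> blocks lacking [EXEC:real]
    rambling_answer: bool     # answer is a column dump or debug prose


def sketch_reward(r: RolloutSummary, require_real_execution: bool = True) -> float:
    """Combine the signals listed above into one number (hypothetical weights)."""
    # When real execution is required, positive terms are only earned
    # by rollouts that actually ran code.
    may_earn = r.executed_for_real or not require_real_execution
    reward = 0.0
    if r.executed_for_real:
        reward += 0.2
    if may_earn and r.stdout_parseable:
        reward += 0.2
    if may_earn and r.answer_correct:
        reward += 0.6
    # Penalties apply regardless; cap the fake-block penalty at two blocks.
    reward -= 0.3 * min(r.fake_result_blocks, 2)
    if r.rambling_answer:
        reward -= 0.2
    return reward
```

The point of the gating is that a rollout which guesses the right answer without running anything earns nothing positive, while a hallucinated result block is actively penalized.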

THINK EXPLORE EXECUTE DEBUG ANSWER agent loop — **Fig 3 — Agent loop.** Every rollout follows the same multi-step cycle: inspect, run code, read real output, debug, then answer.

05 · Training pipeline: SFT → GRPO → DPO

Our planned stack mirrors modern agent training — with execution at every stage:

SFT → GRPO → DPO → Eval

Stage 1 — Supervised fine-tuning (SFT v1) ✅

Bulk SFT on DataBench-style traces plus agent supplements: multi-turn dialogs, Jupyter-agent traces, dashboard examples, and code-feedback execution pairs. This produced our strongest baseline — sanjaymalladi/DataSense-Modal-E2B-SFT.

LoRA r=32, α=64 on all attention + MLP projections
~600 max steps, effective batch 8
Teaches the model to use the agent format and run multi-step code

Stage 2 — GRPO (execution-grounded RL) ⚠️ partial

Group Relative Policy Optimization with real Python rollouts per prompt. Each step spawns multiple agent trajectories; rewards use compute_trajectory_reward() with require_real_execution=True.

GRPO on Gemma-4 is brutally slow (~11 min/step on A100) because most wall time is CPU-bound execution, not GPU matmul — 4 rollouts × up to 5 agent steps × subprocess sandboxing. We fixed trajectory forwarding bugs, KL instability (final_logit_softcapping=30), and added parallel rollout workers — but full 300-step GRPO remained impractical within hackathon time. A shortened 100-step run was targeted.
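
A quick back-of-the-envelope calculation shows why the full run was out of reach. It uses only the approximate ~11 min/step figure quoted above and ignores restarts and checkpoint overhead; the function name is just for this example.

```python
MINUTES_PER_STEP = 11  # approximate figure quoted above for an A100


def grpo_wall_time_hours(steps: int, minutes_per_step: float = MINUTES_PER_STEP) -> float:
    """Estimated wall-clock hours for a GRPO run of the given length."""
    return steps * minutes_per_step / 60


for steps in (300, 100):
    hours = grpo_wall_time_hours(steps)
    print(f"{steps} steps: about {hours:.1f} h ({hours / 24:.1f} days)")
```

At this rate the 300-step run needs about 55 hours of uninterrupted compute, while the shortened 100-step run needs roughly 18 hours.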

Stage 3 — DPO ⏸️ deferred

Preference pairs from high vs low reward rollouts (min gap 0.15) — planned but deprioritized once EVTE-STaR showed more promise for hard-question gains within our compute budget.
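
Since DPO was deferred, the following is only a sketch of how such pairs might be built from scored rollouts of one prompt. The `build_preference_pairs` name and the dict layout are assumptions for this example; the 0.15 minimum gap is the value stated above.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_preference_pairs(
    prompt: str,
    rollouts: List[Tuple[str, float]],  # (completion_text, reward)
    min_gap: float = 0.15,
) -> List[Dict[str, str]]:
    """Pair up rollouts of the same prompt whose rewards differ enough."""
    pairs = []
    for (text_a, reward_a), (text_b, reward_b) in combinations(rollouts, 2):
        if abs(reward_a - reward_b) < min_gap:
            continue  # too close to call: a noisy preference would hurt DPO
        if reward_a > reward_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The gap filter discards near-ties, so the preference signal comes only from rollouts the reward function can clearly separate.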

SFT GRPO DPO training pipeline stages — **Fig 4 — Training stages.** SFT v1 shipped and works. Full GRPO was execution-bound and slow. DPO was deferred in favor of EVTE-STaR.

06 · Supporting infrastructure (not EVTE itself)

Before EVTE could work, we needed execution-grounded rollouts, typed verifiers, and honest eval. These are the plumbing; the novel research contribution is EVTE + EVTE-STaR (sections 07–11 below).

Execution-grounded rollouts

Every GRPO/DPO/EVTE trajectory runs code in an isolated workspace. Rewards ignore fake <result> tags unless tagged [EXEC:real].

Typed answer verification (`databench_compare` + neural verifier)

Evidence-bound scoring chain: exec stdout → Answer: tag → LLM extract → typed compare (boolean, float, category, list[category], list[number]). Without this, mentors "fail" when extraction fails, not when reasoning fails.
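
To make the last link of that chain concrete, here is a minimal typed comparator. It is a sketch, not the project's databench_compare: the `typed_compare` name, the 1% relative tolerance, the truthy/falsy vocabularies, and the order-insensitive list matching are all assumptions, and list inputs are assumed to be already parsed into Python lists.

```python
import math


def _norm(value) -> str:
    return str(value).strip().lower()


def typed_compare(predicted, expected, answer_type: str, rel_tol: float = 1e-2) -> bool:
    """Compare a parsed answer with ground truth according to its declared type."""
    if answer_type == "boolean":
        truthy, falsy = {"true", "yes", "1"}, {"false", "no", "0"}
        p, e = _norm(predicted), _norm(expected)
        return (p in truthy and e in truthy) or (p in falsy and e in falsy)
    if answer_type in ("number", "float"):
        try:
            return math.isclose(float(predicted), float(expected), rel_tol=rel_tol)
        except (TypeError, ValueError):
            return False  # e.g. the model printed a dataframe instead of a number
    if answer_type == "category":
        return _norm(predicted) == _norm(expected)
    if answer_type == "list[category]":
        return sorted(map(_norm, predicted)) == sorted(map(_norm, expected))
    if answer_type == "list[number]":
        try:
            p_sorted = sorted(float(v) for v in predicted)
            e_sorted = sorted(float(v) for v in expected)
        except (TypeError, ValueError):
            return False
        return len(p_sorted) == len(e_sorted) and all(
            math.isclose(p, e, rel_tol=rel_tol) for p, e in zip(p_sorted, e_sorted)
        )
    raise ValueError(f"unknown answer type: {answer_type}")
```

Keeping the comparison typed means a failed string match on "3.0" versus "3" is never mistaken for a reasoning failure.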

Lite eval & hackathon harness

DataBench lite scores against sample_answer on mounted parquet. run_hackathon_benchmarks_parallel runs Base / SFT / Micro-1 across three benchmarks on T4.

07 · EVTE — Execution-Verified Tutor Escalation

EVTE is the method we built when classical distillation and STaR broke down for data agents. The name encodes three commitments:

Execution — every claim of success must be backed by real code that ran on real files
Verified — student and mentor answers pass the same typed verifier
Tutor Escalation — a larger model intervenes only after student failure, and only as a coach, not an answer vending machine

Why we needed EVTE

Classical STaR (Self-Taught Reasoner) assumes a strong teacher can produce correct reasoning chains, filter them, and fine-tune the student offline. That fails for DataSense because:

Our 2B student often can't solve list/category questions at all
Our 31B mentor also fails verification on the hardest 5 problems (~40% mentor-hard pool)
Even when code is right, answer extraction fails (no tag, wrong stdout parse)
Distilling final answers teaches memorization; we need debugging under execution constraints

The five-phase episode (EVTE and EVTE-STaR share this skeleton)

Implemented in datasense_evte.py — run_evte_episode (offline collection) and run_evte_star_episode (online training).

Phase 1 · Student first attempt

2B student, up to 5 agent steps, real workspace (CSV/parquet/xlsx). Scored via score_rollout().

Phase 2 · Self-recovery feedback

Up to 3 rounds of build_self_recovery_feedback() — real tracebacks, answer withheld.

Phase 3 · Mentor independent verify

31B mentor solves in a fresh workspace; must pass the same verifier before any hint.

Phase 4 · Diagnostic mentor hint

generate_mentor_hint() under MENTOR_HINT_SYSTEM — no final answer, no full script.

Phase 5 · Post-hint student

Up to 2 attempts × 5 steps. Episode saved only if student verifies after reading the hint.

EVTE five phases from student attempt to mentor-assisted success — **Fig 5 — EVTE in five phases.** Student tries → self-recovery → mentor must verify independently → diagnostic hint → student retries. Only verified post-hint wins become training data.

run_evte_star_episode (simplified control flow):

  student_rollout = phase_1_student()
  if clean_first_try_verified and not messy_recovery_in_trace:
      return SKIP  # already knows it — not trainable in STaR mode

  if not verified:
      for i in 1..3:
          add_user(build_self_recovery_feedback())  # ← EVTE feedback
          student_rollout = student_retry()

  mentor_ok, mentor_rollout = mentor_verify_solution(
      student_rollout=student_rollout  # mentor sees the student's failed code
  )
  if not mentor_ok:
      return DISCARD  # mentor_unverified — no training signal

  hint = generate_mentor_hint(student_rollout, mentor_rollout)
  add_user("[MENTOR] " + hint)  # diagnostic only

  for j in 1..2:
      student_rollout = student_retry()
      if verified:
          return SAVE_TRAINABLE_EPISODE  # mentor_assisted

Hard-first curriculum

_prioritize_evte_problems() sorts list[category], list[number], and multi-answer types before easy booleans. EVTE compute is expensive (two models × multi-step agents); we spend it where SFT v1 plateaus.
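
A hard-first ordering of this kind can be expressed as a stable sort on answer type. The sketch below is not the body of _prioritize_evte_problems(); the `prioritize_hard_first` name, the `type` key, and the relative order of the middle tiers are assumptions, with only "list types first, booleans last" taken from the description above.

```python
from typing import Dict, List

# Assumed ranking: list-valued answers first, easy booleans last.
HARD_FIRST_ORDER = ["list[category]", "list[number]", "category", "number", "boolean"]


def prioritize_hard_first(problems: List[Dict]) -> List[Dict]:
    """Return problems sorted hardest type first; unknown types go last."""
    rank = {answer_type: i for i, answer_type in enumerate(HARD_FIRST_ORDER)}
    # sorted() is stable, so problems of the same type keep their original order.
    return sorted(problems, key=lambda p: rank.get(p["type"], len(rank)))
```

Because the sort is stable, any existing ordering within a type (for example by dataset) survives the reprioritization.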

Mentor hardware choreography

Student (2B) and mentor (31B) don't fit comfortably together on one A100. The STaR loop uses on_micro_batch hooks to unload mentor → micro-SFT student → reload mentor every 15 episodes. Progress persists to evte_star_progress.json with resume support.
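
The batching-with-resume pattern described here can be sketched without any GPU code. Everything below is illustrative: the `MicroBatchScheduler` class and the JSON layout are hypothetical, and the hook stands in for the unload mentor, micro-SFT, reload mentor sequence.

```python
import json
from pathlib import Path
from typing import Callable, List


class MicroBatchScheduler:
    """Collects verified episode ids and fires a hook every `batch_size` of them."""

    def __init__(self, progress_path: Path,
                 on_micro_batch: Callable[[List[str]], None],
                 batch_size: int = 15) -> None:
        self.progress_path = Path(progress_path)
        self.on_micro_batch = on_micro_batch
        self.batch_size = batch_size
        self.state = {"batches_done": 0, "pending": []}
        if self.progress_path.exists():
            # Resume: reload whatever was collected before an interruption.
            self.state = json.loads(self.progress_path.read_text())

    def add_verified_episode(self, episode_id: str) -> None:
        self.state["pending"].append(episode_id)
        if len(self.state["pending"]) >= self.batch_size:
            # In the real loop this is where the mentor would be unloaded,
            # the student micro-trained, and the mentor reloaded.
            self.on_micro_batch(list(self.state["pending"]))
            self.state["batches_done"] += 1
            self.state["pending"] = []
        # Persist after every episode so a crash loses at most one episode.
        self.progress_path.write_text(json.dumps(self.state))
```

Writing the progress file after every episode trades a little I/O for the ability to resume a long, expensive collection run mid-batch.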

08 · EVTE feedback — self-recovery without answer leakage

The most underrated piece of EVTE is not the mentor — it's what we put in the user turn when the student fails. This is build_self_recovery_feedback() in datasense_evte.py.

Self-recovery feedback loop with real errors but hidden ground truth — **Fig 6 — Self-recovery feedback.** The student sees wrong predictions, last code, and real tracebacks — never the correct answer.
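
A feedback builder with this property can enforce "no answer leakage" structurally, simply by never receiving the ground truth. The sketch below is not the real build_self_recovery_feedback(); the `make_recovery_feedback` name, its parameters, and the message wording are assumptions for illustration.

```python
def make_recovery_feedback(predicted: str, last_code: str, traceback_text: str,
                           attempt: int, max_attempts: int = 3,
                           max_traceback_lines: int = 15) -> str:
    """Build the user turn shown to the student after a failed attempt.

    The ground-truth answer is deliberately not a parameter, so this
    function cannot leak it.
    """
    # Keep only the tail of the traceback, where the actual error usually is.
    tail = "\n".join(traceback_text.strip().splitlines()[-max_traceback_lines:])
    parts = [
        f"[SELF-RECOVERY {attempt}/{max_attempts}] Your previous answer did not verify.",
        f"Your answer was: {predicted}",
        "Your last code:",
        last_code.strip(),
    ]
    if tail:
        parts += ["Real traceback from the sandbox:", tail]
    parts.append("Re-inspect the schema, fix the code, run it, and answer again.")
    return "\n".join(parts)
```

For example, `make_recovery_feedback("17", "print(df['sales'].sum())", "KeyError: 'sales'", attempt=1)` yields a message that shows the wrong prediction and the real KeyError but nothing about the expected value.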

Messy success = verified answer but conversation contains debug/recovery language (trajectory_has_recovery_signal()). We don't want to reinforce "stumble into correctness" without tutor review in STaR mode.

Why SFT v2 failed — feedback without balance

When we later fine-tuned only on recovery trajectories (SFT v2), the model learned the shape of debug prose — dtype dumps, column lists — without improving verified answers. Lesson: self-recovery feedback is essential during collection, but training must mix clean completions with mentor-assisted wins, not recovery-only soup.

09 · Mentor verify & hint protocol

The mentor is google/gemma-4-31B-it (4-bit via Unsloth). It is not an oracle that whispers answers. It must earn the right to hint by passing the same execution verifier as the student.

Mentor must pass verification gate before giving a diagnostic hint — **Fig 7 — Mentor gate.** The 31B mentor must verify its own solution by running code before it may give a hint — and the hint must not leak the final answer.

Mentor retry modes

Mode	Behavior	Config
series	Same conversation; temps ramp 0.4 → 0.65 → 0.85	`evte_mentor_retry_mode=series`
parallel	3 independent workspaces; first verified wins; temps [0.2, 0.5, 0.7]	`evte_mentor_retry_mode=parallel`
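
The two modes differ mainly in scheduling, which the sketch below tries to capture. It is a simplification under stated assumptions: `mentor_retry` and the `solve(temperature)` callback are hypothetical, threads stand in for independent workspaces, and the shared conversation state of the series mode is not modeled. Only the temperature schedules come from the table above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Tuple

SERIES_TEMPS = (0.4, 0.65, 0.85)   # ramped within one conversation
PARALLEL_TEMPS = (0.2, 0.5, 0.7)   # one temperature per independent workspace


def mentor_retry(solve: Callable[[float], Tuple[bool, str]],
                 mode: str = "series") -> Optional[str]:
    """Return the first verified mentor rollout, or None if every attempt fails.

    `solve(temperature)` is assumed to return (verified, rollout).
    """
    if mode == "series":
        for temp in SERIES_TEMPS:
            verified, rollout = solve(temp)
            if verified:
                return rollout
        return None
    if mode == "parallel":
        with ThreadPoolExecutor(max_workers=len(PARALLEL_TEMPS)) as pool:
            futures = [pool.submit(solve, temp) for temp in PARALLEL_TEMPS]
            for future in as_completed(futures):
                verified, rollout = future.result()
                if verified:
                    # Leaving the with-block still waits for the other attempts.
                    return rollout
        return None
    raise ValueError(f"unknown retry mode: {mode}")
```

Series mode spends compute only until the first success, while parallel mode pays for all three attempts up front in exchange for lower latency.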

10 · EVTE-STaR — online Self-Taught Reasoner with micro-SFT

EVTE-STaR combines EVTE episode collection with online weight updates. Classical STaR: collect all successes → train offline once. EVTE-STaR: collect 15 verified mentor-assisted wins → micro-SFT 30 steps → student is slightly better → repeat.

EVTE-STaR online micro-SFT every 15 verified episodes — **Fig 8 — EVTE-STaR online loop.** Every 15 mentor-assisted wins → 30-step micro-SFT at low LR → student continues on harder problems with nudged weights.

The overtraining curve (batches 2–3 vs batch 6)

Micro-batch 1 replay in RAM scored 100% on mentor-hard (5 problems). The saved Micro-1 checkpoint scored ~60% on a confirmatory run. Replay of batches 2–3: ~80%. Final batch 6 checkpoint: ~40%, worse than SFT v1.

Lesson

Online micro-SFT needs early stopping on a held-out hard set, not "more batches = better." We only preserved micro-1 and final checkpoints on the volume — sweet-spot batches 2–3 were lost until run_micro_replay_eval reconstructed them in RAM.
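
A patience-based rule is one simple way to implement that early stopping. The sketch below assumes a held-out hard-set accuracy is measured after every micro-batch and that each micro-batch checkpoint is saved; the `pick_checkpoint` name, the patience value, and the scores in the usage note are hypothetical, not measurements from our run.

```python
from typing import List, Tuple


def pick_checkpoint(hard_set_accuracy: List[float], patience: int = 2) -> Tuple[int, int]:
    """Return (best_batch, stop_batch), both 1-indexed.

    `hard_set_accuracy[i]` is the held-out accuracy after micro-batch i + 1.
    Training stops once `patience` consecutive batches fail to beat the best.
    """
    best_batch, best_acc, since_best = 0, float("-inf"), 0
    for batch, acc in enumerate(hard_set_accuracy, start=1):
        if acc > best_acc:
            best_batch, best_acc, since_best = batch, acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return best_batch, batch
    return best_batch, len(hard_set_accuracy)
```

With illustrative scores of `[0.6, 0.8, 0.8, 0.6, 0.4, 0.4]`, the rule keeps the batch 2 checkpoint and stops at batch 4 instead of training through to batch 6.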

11 · Episode outcomes & trainability gates

Every episode ends in exactly one outcome. The outcome determines whether it enters training.

Outcome	Meaning	EVTE-STaR: train?
`self_solved_clean`	First-try verified, no recovery signals in trace	Skip
`self_recovered`	Fixed via self-recovery feedback only	Optional
`mentor_assisted`	Failed → mentor verified → hint → student verified	Yes
`discarded`	Mentor couldn't pass execution verifier	No
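
The gate in this table is small enough to write down directly. In the sketch below the outcome strings are the ones from the table, while the `Outcome` enum, the `is_trainable` helper, and the `include_self_recovered` flag (modeling the "Optional" row) are assumptions for illustration.

```python
from enum import Enum


class Outcome(str, Enum):
    SELF_SOLVED_CLEAN = "self_solved_clean"
    SELF_RECOVERED = "self_recovered"
    MENTOR_ASSISTED = "mentor_assisted"
    DISCARDED = "discarded"


def is_trainable(outcome: Outcome, include_self_recovered: bool = False) -> bool:
    """Decide whether an episode with this outcome enters EVTE-STaR training."""
    if outcome is Outcome.MENTOR_ASSISTED:
        return True
    if outcome is Outcome.SELF_RECOVERED:
        return include_self_recovered  # the "Optional" row in the table
    return False  # clean solves are skipped; unverified mentor runs are dropped
```

Parsing a stored string with `Outcome("mentor_assisted")` raises a ValueError for anything outside the four outcomes, which keeps malformed episodes out of the training set.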

12 · What worked

✅ SFT v1 — real execution behavior

SFT v1 consistently runs real Python (100% exec on many evals), uses ~4–5 agent steps, and beats base on hard questions, where base's correct answers come without running any code. This is the behavioral foundation everything else builds on.

✅ EVTE episode quality filter

Saving only mentor-assisted verified trajectories produced high-signal data — multi-turn debug with real errors, not synthetic Q/A. 92 episodes is small but curated.

13 · What didn't work

❌ Full GRPO within hackathon time

~11 min/step × hundreds of steps × execution-bound rollouts ≈ multi-day runs. Parallel rollout workers helped but couldn't change the fundamental CPU/GPU pipeline stall. vLLM isn't available for Gemma 4 E2B, so generation stays on HF generate.

❌ SFT v2 (recovery-only fine-tune)

Training only on EVTE recovery trajectories taught debug prose — column dtype dumps, rambling — without improving answers. Mentor-hard: 40% vs SFT v1's 60%.

14 · Evaluation results

Agent accuracy on real data files (lite DataBench parquet, DSBench Excel, mentor-hard pool). Macro average = unweighted mean across three benchmarks (30 problems). Always pair accuracy with exec_ok — base can match easy booleans via answer tags without running code.

Three hackathon benchmarks across three models — **Fig 9 — Hackathon eval suite.** DataBench (15) + DSBench Excel (10) + mentor-hard (5) per model on T4.

Hackathon benchmark suite — final (first complete run)

Parallel eval: run_hackathon_benchmarks_parallel · 3× T4 · June 2026.

Model	DataBench (15)	DSBench (10)	Mentor-hard (5)	Macro avg	Total
Base	60.0%	0.0%	20.0%	26.7%	10/30
SFT v1	86.7%	0.0%	60.0%	48.9%	16/30
EVTE Micro-1	80.0%	0.0%*	100.0%	60.0%	17/30

*The official DSBench scorer gives 0% for all models. On Q15, Micro-1 computed $12,829,511, which is option A (correct), but it was graded wrong because the scorer compares option letters, not dollar values. A value-aware scorer would give Micro-1 1/10 on DSBench (macro 63.3%).
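
The macro averages in the table can be re-derived from per-benchmark correct counts. The counts below are inferred from the reported percentages and benchmark sizes (for example, 86.7% of 15 is 13) and agree with the Total column; the helper name is just for this example.

```python
from statistics import mean

BENCH_SIZES = {"databench": 15, "dsbench": 10, "mentor_hard": 5}

# Correct counts inferred from the percentages in the table above.
CORRECT = {
    "Base": {"databench": 9, "dsbench": 0, "mentor_hard": 1},
    "SFT v1": {"databench": 13, "dsbench": 0, "mentor_hard": 3},
    "EVTE Micro-1": {"databench": 12, "dsbench": 0, "mentor_hard": 5},
}


def macro_avg(correct: dict) -> float:
    """Unweighted mean of per-benchmark accuracy, in percent."""
    return 100 * mean(correct[name] / size for name, size in BENCH_SIZES.items())


for model, counts in CORRECT.items():
    print(f"{model}: macro {macro_avg(counts):.1f}%, total {sum(counts.values())}/30")

# Value-aware DSBench scoring would credit Micro-1 with Q15.
value_aware = dict(CORRECT["EVTE Micro-1"], dsbench=1)
print(f"EVTE Micro-1 (value-aware): macro {macro_avg(value_aware):.1f}%")
```

Because the macro average is unweighted, the five mentor-hard problems count as much as the fifteen DataBench problems, which is why Micro-1 leads on macro average while SFT v1 leads on DataBench.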

Earlier standalone evals (sanity checks)

Eval	Base	SFT v1	Micro-1 / SFT v2
Quick DataBench (5)	80% acc / 0% exec	80% / 100% exec	SFT v2: 40%
Mentor-hard (5)	40% / 0% exec	60% / 100% exec	Micro-1 replay: 100% (RAM); saved ckpt ~60%

How to read DSBench

Models often run code (50–100% exec_ok) but return dataframe strings, 0.0, or dollar amounts that map to the wrong MCQ letter. Only one case (Micro-1 Q15) was a true scoring-format bug. DSBench 0% is mostly real Excel/parsing failure, not a broken metric.
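
A value-aware scorer for the Q15 case could map a computed amount back to an option letter before comparing. This is a sketch of that idea, not the DSBench scorer: `parse_amount`, `value_to_letter`, and the tolerance are assumptions, and in the usage note only option A's value comes from our run (option B is a made-up placeholder).

```python
import math
import re
from typing import Dict, Optional


def parse_amount(text) -> Optional[float]:
    """Extract the first number from strings such as '$12,829,511'."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", str(text))
    return float(match.group().replace(",", "")) if match else None


def value_to_letter(predicted, options: Dict[str, str],
                    rel_tol: float = 1e-3) -> Optional[str]:
    """Return the option letter whose value matches the prediction, if any."""
    value = parse_amount(predicted)
    if value is None:
        return None  # e.g. the model returned a dataframe string
    for letter, option_text in options.items():
        option_value = parse_amount(option_text)
        if option_value is not None and math.isclose(value, option_value, rel_tol=rel_tol):
            return letter
    return None
```

For instance, with `options = {"A": "$12,829,511", "B": "$10,000,000"}`, the call `value_to_letter("12829511.0", options)` returns `"A"`, while a prediction of `"0.0"` maps to no letter and still counts as wrong.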

15 · Why SFT v1 for the live demo (not Micro-1)

Micro-1 wins macro average (60% vs 48.9%) on paper — driven by a perfect 5/5 on mentor-hard. We still ship SFT v1 on this Hugging Face Space. Here's why:

Factor	SFT v1	EVTE Micro-1
DataBench (breadth)	86.7% — best on the largest held-out slice	80.0%
Mentor-hard (depth)	60% (3/5), 100% exec	100% (5/5) on first complete run
Stability	Single bulk SFT — predictable at inference	Online micro-SFT batch 1 — replay 100% vs saved ckpt ~60%
Straggler reruns	Held up when Modal overwrote volume	Mentor-hard dropped to 60% on duplicate run
Live demo risk	Lower — fewer debug ramble / dtype dumps	Higher — tuned on hard pool, can overfit quirks
Story on slides	“Execution-grounded baseline that works”	“EVTE-STaR peak — best hard-pool result”

Decision

Gradio Space → SFT v1 (sanjaymalladi/DataSense-Modal-E2B-SFT) for reliable live CSV demos.
Slides → show all three models; cite Micro-1 as evidence EVTE-STaR helps on the hard curated pool, not as the production default yet.

16 · Benchmark suite

Benchmark	Problems	What it tests	Status
DataBench test (lite)	15	SemEval-style QA on real parquet samples	integrated
DSBench analysis	10	ModelOff Excel financial modeling	integrated
Mentor-hard	5	Curated EVTE failures	integrated

17 · Model checkpoints on Hugging Face

Checkpoint	HF repo	Role
Base	unsloth/gemma-4-E2B-it	Frozen foundation
SFT v1 ★ demo	DataSense-Modal-E2B-SFT	Live HF Space adapter — stable execution
EVTE-STaR Micro-1	DataSense-Modal-E2B-EVTE-Star-Micro1	Best mentor-hard (5/5) — research checkpoint

18 · This Hugging Face demo

The Gradio app runs SFT v1 — same agent loop as training eval: load CSV → multi-step code generation → sandbox execution → Answer + Summary. Six built-in examples cover sales, employees, and students datasets.

Agent loop used in the HF Space demo — **Same loop as eval.** Upload this `hf_demo/` folder to a Gradio Space (GPU T4), set `HF_TOKEN` if needed.

Deploy checklist

Create Space (Gradio, gpu-t4) — see README.md frontmatter
Upload hf_demo/ including assets/illustrations/ and story.html
Add HF_TOKEN as a Space secret if adapter repo is private
Smoke-test all 6 examples

Future work

DSBench MCQ letter mapping in scorer
Per-micro-batch checkpointing during EVTE-STaR
Optional Space variant with Micro-1 for hard-pool showcase