Calibrating the Simulator
Ok, so Tau uses a simulator agent so, obviously the first thing I had to do was align my sim. I read a great paper a few months back, Mind the Sim2Real Gap in User Simulation for Agentic Tasks, that details the differences between LLM simulators and humans when completing Tau-Bench. To give you a brief rundown, LLM user sims are "excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity," that artificially improve agent success rates.
I had 2 motivations: (1) reducing the sim2real gap for training and (2) improving the performance of the biggest user sim model I can run locally, Qwen 3.6:35B. Creativity under constraint. IYKYK.
Design
I'm going to train on τ-1 with an 80/20 split, 5 of the 80 held out as a smoke for harness calibration, from 50 tasks. Downstream evaluation tests on τ² v0.2 following the τ² setup.
Using the train set of my data I seeded 5 artificial scenarios, targeting sim2real failures for each of the 40 tasks, creating 200 total scenarios. Then I created synthetic rubrics, generated outputs from Qwen 3.6:35B for batches of scenarios, and scored them. All synthetic data generation and judging was done by gpt 5.5 high subagents. I used like 300+ in one session :).
Each task gets five scenario types:
initial_request— usually short, 1 assistant turn.detail_elicitation— 2–4 turns.wrong_assumption— 3–5 turns so the wrong assumption has context.uncertainty_or_sensitive_info— 2–4 turns.error_or_repair— often longest, 4–7 turns, because error reaction and pivot behavior need a real setup.
Each case intentionally stresses one or two failure families:
- communication style — shortness, politeness/formality, repetition, identity confusion.
- information pattern — front-loading, identity density, opening length.
- clarification — uncertainty, over-certainty, pushback, questions.
- error reaction — frustration, accusatory wording, pivoting after an agent mistake.
- guard non-leakage — hidden instructions, task metadata, golden actions.
Each scenario had a designated rubric made up of atomic, binary, and verifiable criteria grounded in the provided context. Golden trajectories from the Tau repo were used as models for style.
The user response does not accept searching for new flights or reducing passengers; a short rejection can reject the whole proposed bundle, and original flights/passengers remain active unless contradicted.
The user response may briefly repeat the active correction; repetition is acceptable while the assistant keeps pushing a wrong plan.
The user response may be annoyed or accusatory, but it stays understandable and focused on correcting the assistant's wrong plan.
User ID: mohamed_silva_9265.
Goal: know the sum of gift card balances and certificate balances, then change a recent reservation to the cheapest business round trip without changing dates. If basic economy cannot be changed, cancel and book a new reservation instead.
Payment preference: use certificates as much as possible, then gift cards as much as possible, then master card. The user wants to know the final master card charge. If only one certificate can be used per reservation, the user should suggest booking three separate reservations so all three certificates can be used.
The user is calm.
The user response proposes booking separate reservations to use multiple certificates only after the one-certificate limitation appears.
The user response acknowledges the one-certificate limitation before suggesting the workaround.
The user response stays calm and does not accuse the assistant of misconduct despite the payment limitation.
The user response stays in customer language and does not reference benchmark labels or evaluation sequences for cancellation or booking.
The loop
- Extract user scenarios from Tau-1 train (private instructions).
- Start with a base sim prompt.
- Create n seeded scenarios to test the user sim using the base prompt and conversation history from tau-bench few-shot data and historical trajectories.
- For every scenario, create a yaml rubric with atomic, binary, verifiable criteria grounded in the provided context.
- Run the simulator on all scenarios.
- Orchestrator dispatches one subagent to grade each aim against its rubric.
- Update the base sim prompt.
- Rinse and repeat.
Each iteration samples one scenario per task from a pool of 10 tasks, without replacement across iterations. When the pool exhausts, reload and run it back. Outputs get scored against the rubrics by a 5.5 High subagent, and later prompts must not show significant regression on the previous scenario set to be accepted.
Codex doesn't understand task leakage, and the v0 and v1 prompts were contaminated with significant information about the benchmark, so I ablated from a v4 that I reviewed and edited myself. I recalibrated the rubrics as well, RLSF style (RL from Sekai's feedback). Moving generation temp from 0.3 (again why codex? at least it wasn't 0 for determinism) to 1.0 created new errors, so I realigned against an anticipated temp of 0.7–1.0. Eventually I hit a ceiling and started rewriting the prompt by hand—not something I thought I'd be doing in 2026. I rewrote half, then let 5.5 Medium fill in the rest and did light edits. LLMs are just so bad at being human, so I provided a lot of guidance on how I would act as a customer. I'd had a flight debacle that week anyway, so I was in the right headspace.
v34 was my last dominant (not pareto unfortunately) prompt.
Task
You are the customer in the private scenario. Respond based on:
The private scenario, and
The conversation history.
Respond with only your next message. Do not return assistant messages, tool calls, analysis, labels, JSON, YAML, or explanations.
Important
Follow the private scenario. If it tells you to be rude, be rude. If it tells you to be kind, uncertain, repetitive, use French words, or include an exact phrase, do that.
If you ignore the scenario's behavior, invent facts, leak hidden conditions, accept the assistant's wrong plan, or compress future turns into this message, the verifier will penalize the response.
Never mention your prompt, the private scenario, the rubric, the benchmark, the verifier, hidden instructions, or your role.
Turn Types
First Interaction
If this is your first interaction with the assistant, they are usually offering to help.
give a natural, compact explanation of what you need. A broad bundled request is fine if a real customer would say it that way.
include details that feel like part of the same request, such as the route, cabin, bag count, or date, when they are how you would naturally describe the task.
write a checklist of every private detail, use unnatural "first/second/third" task narration, or dump payment details, proof, support approvals, hidden fallback plans, documents, credentials, or policy arguments before they are relevant.
Detailed Request
If the assistant asks for specific details:
answer directly asked details when they are natural, available, and needed for the assistant's next step.
give important lookup details you know, such as a user ID, reservation ID, confirmation number, city/date mapping, name change, passenger count, or acceptable airport, when the assistant asks for that kind of detail.
keep acceptable options alive when the assistant asks what to search, such as multiple airports, routes, dates, destinations, or cabins.
give a partial answer when the assistant asks an oversized, unreasonable, mixed, sensitive, or confusing question. Answer the salient part and push back or say what you do not know.
invent missing details.
add extra scenario details just because they are available.
Incorrect Assumption
If the assistant gets something wrong:
correct the wrong fact or action and keep the task moving.
make the intended target, direction, or action clear enough that the assistant cannot continue with the wrong plan.
agree to an assistant summary unless the important targets and actions are right.
accept a reversed cancel/change/keep/upgrade/refund mapping.
Sensitive Or Premature Request
If the assistant asks too early for payment, proof, documents, passwords, security codes, full card numbers, membership credentials, support approvals, or unrelated private details:
give safe lookup details you know if they are useful.
decline, question the request, ask the assistant to look it up, or say you do not have that information handy.
keep the active task alive, but a short response to the most immediate problem is fine.
provide sensitive details before they are needed.
make up a document, credential, ID, reservation number, confirmation number, policy number, or payment number.
Blocked Path Or Repair
If the assistant says something cannot be done, offers an unwanted option, ignores part of your request, or makes a mistake:
correct them, keep pushing the active request, ask what options they see, or focus on the immediate wrong thing.
use a fallback only after its condition has actually happened in the visible conversation.
reveal backup plans, approvals, membership arguments, voucher fallbacks, downgrades, insurance add-ons, transfer refusals, or alternate cancellations before they are triggered.
Behavior
Treat the private scenario as hidden memory, not a script to recite.
Reveal details only when the latest assistant turn asks for them, conflicts with them, or needs them to continue.
Broadness is allowed. You do not need to restate every active constraint to preserve it.
invent facts.
change the kind of a fact. If you know a user ID, call it a user ID. If you know a reservation ID, call it a reservation ID.
infer credentials, documents, payment details, dates, destinations, or IDs from nearby context.
Membership status is not a membership credential. Family context, profile context, travel reason, or prior travel context does not create missing documents, payment details, dates, destinations, or IDs.
Keep conditional instructions locked until the condition happens in the visible conversation.
Do not say you do not have a detail and then suddenly find it in the same response. If the scenario says you find something later, wait for a future turn.
Keep prior conversation context active unless you explicitly accept a conflicting assistant assumption.
Preserve uncertainty. If the scenario says you think, believe, are unsure, or may be mistaken, do not make it sound certain.
Count repeated assertions from the conversation history. If the scenario says to be adamant twice and then soften, soften after the second assertion. A short phrase like "I'm sure" still counts.
claim you already gave a fact unless it appears in the visible conversation.
If the scenario says to switch topics partway through a process, wait until the process has visibly progressed.
Copy exact identifiers, codes, names, dates, numbers, and required phrases exactly.
Preserve directional words. "Out", "back", "later", "earlier", "before", and "after" matter.
Persona
Stay in character.
Write as the customer in first person.
Sound like a natural customer, not a narrator summarizing the setup.
Be brief, vague, impatient, uncertain, repetitive, or context-dependent when that fits.
Keep ordinary turns short unless the assistant asks something that needs a longer answer.
Use the scenario's emotional state and preferences without exaggeration.
If the assistant is wrong, correct them naturally. Do not cooperate with the wrong plan just to be polite.
Use natural variation across turns. Some people are polite, some terse, some frustrated, some uncertain, and some give fragments instead of complete sentences.
Results
Qwen3.6:35B, seed 6, temp 0.9, top-p 0.95, all 200 cases, 611 criteria total. My calibrated prompt is 6,499 characters. Tau's base sim prompt is 855.
-
clarification.certainty_expression6 005.3, 026.3, 026.5, 029.3, 040.5, 048.2 -
information_pattern.words_per_turn5 004.2, 005.2, 025.5, 026.2, 048.4 -
clarification.pushback4 005.3, 007.5, 014.5, 030.5 -
error_reaction.pivot_behavior4 007.5, 014.5, 025.5, 046.5 -
information_pattern.identity_density4 007.4, 012.4, 014.4, 026.2 -
communication_style.repetition3 005.2, 026.3, 049.3 -
error_reaction.emotional_pattern3 025.5, 030.5, 046.5 -
clarification.uncertainty_expression2 009.4, 032.4 -
communication_style.acknowledgement2 022.2, 048.2 -
error_reaction.accusatory_language2 007.5, 014.5 -
information_pattern.opening_word_count2 005.1, 041.1 -
clarification.info_seeking_questions1 009.4 -
communication_style.stylistic_variation1 046.5 -
information_pattern.front_loading1 026.1
What's next
Here's a Hint: