I designed and built the AI Interviewer, a production conversational-control system for consulting case interviews. It runs structured simulations (problem identification → framework → analysis → recommendation) while measuring how far each candidate deviates from an ideal solution path and applying a correction policy in real time.
The interviewer speaks naturally; a separate orchestration layer owns belief state, advancement, and extraction. Domain experts configure cases declaratively, and the engine compiles that into phase gates, extraction ontologies, and behavioral policy at runtime.
The product thesis: a credible AI interviewer requires explicit state and governable policy, not a better persona.
Case interviews are rigidly structured. A real interviewer enforces phase discipline, catches structural mistakes early, and only advances when the key artifacts are locked: problem definition, a MECE framework, correct math, a clear recommendation.
Most AI interview products collapse this into one LLM call per turn. That fails in production: the model forgets gating rules, advances too early, or cannot tell whether the candidate stated segment-level profit or an option total.
The platform needed an interviewer that feels human and stays governable: coach-configurable, testable in CI, and debuggable turn by turn. I owned the core engine: orchestrator, perception layer, prompt policy, advancement logic, the coach authoring abstraction, and the automated regression harness.
The hard problem is controlled dialogue under partial observability.
Each turn runs through orchestrator.ts as a deterministic pipeline with selective LLM calls.
Each turn emits structured diagnostics: action status updates, drift-control artifact deltas, gate failures, judge verdicts, and optional advance proposals.
I modeled the interview as movement through buckets (global phases) along an ideal action graph. The ideal path is the ordered actions per bucket with required / sufficient / contributing / optional states; the observed path is extracted artifacts plus completed actions plus turn-band pressure.
Canonical state vector (DriftControlArtifacts)
Control signals
- ▹Turn bands (early → light → heavy → lastTurn): correction aggressiveness increases with time in phase.
- ▹Consecutive next-bucket evidence: a guarded override requires evidence in two consecutive turns, so a single-turn hallucination cannot trigger it.
- ▹Locked artifact keys: these become immutable on phase ADVANCE.
- ▹Advance proposals: borderline judge verdicts become offers the candidate can accept or decline.
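The first two signals can be sketched as small pure functions. This is an illustration only, not the engine's code: the function names, the streak counter, and the band cut-offs (40% and 70% of the turn cap) are assumptions chosen for the example.

```typescript
type TurnBand = "early" | "light" | "heavy" | "lastTurn";

// Map time-in-phase to a band. The 0.4 / 0.7 cut-offs are illustrative assumptions.
function turnBandFor(turnInPhase: number, turnCap: number): TurnBand {
  if (turnInPhase >= turnCap) return "lastTurn";
  const ratio = turnInPhase / turnCap;
  if (ratio < 0.4) return "early";
  if (ratio < 0.7) return "light";
  return "heavy";
}

// Count consecutive turns that show next-bucket evidence.
// Any turn without evidence resets the streak to zero.
function nextEvidenceStreak(previousStreak: number, sawNextBucketEvidence: boolean): number {
  return sawNextBucketEvidence ? previousStreak + 1 : 0;
}

// The guarded override fires only after the required number of consecutive turns.
function overrideAllowed(streak: number, requiredTurns = 2): boolean {
  return streak >= requiredTurns;
}
```

Keeping the streak as an explicit counter means a single misread turn can never advance the phase on its own; the evidence has to repeat.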
Instead of one monolithic prompt, interviewer behavior decomposes into composable policy dimensions mapped to prompt sections.
A two-tier prompt architecture composes these at runtime from case JSON: a universal base for behavioral realism, plus phase-specific rules (identify_problem, frame_solution, lead_analysis, provide_recommendations). bucketPromptBuilder.ts assembles case description, previous-bucket context, in-run artifacts, turn-band instructions, drift rules, calculation references, and coach-authored guidance.
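A rough sketch of how such a two-tier assembly might look follows. It is not the contents of bucketPromptBuilder.ts: the `PromptSection` shape, the `## ` section headers, and the rule of skipping empty sections are assumptions made for illustration.

```typescript
type Phase =
  | "identify_problem"
  | "frame_solution"
  | "lead_analysis"
  | "provide_recommendations";

// Hypothetical shape for one optional prompt section (case description, drift rules, ...).
interface PromptSection {
  title: string;
  body?: string;
}

// Tier 1 is the universal base; tier 2 is the phase-specific rule block.
// Optional sections are appended in order, and empty ones are skipped.
function buildBucketPrompt(
  basePrompt: string,
  rulesByPhase: Record<Phase, string>,
  phase: Phase,
  sections: PromptSection[],
): string {
  const parts: string[] = [basePrompt.trim(), rulesByPhase[phase].trim()];
  for (const section of sections) {
    const body = section.body?.trim();
    if (body) parts.push(`## ${section.title}\n${body}`);
  }
  return parts.join("\n\n");
}
```

Because each section is data rather than a string baked into one template, a missing section (for example, no coach-authored guidance) simply drops out of the prompt.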
The action extractor runs one LLM call per action, which keeps each classification narrow and lets the prompt carry action-specific examples. Key design choices:
- ▹Dyadic acceptance: framework_proposed only completes when the interviewer turn shows acceptance language, not a refinement question.
- ▹Solution-graph matching: MECE node alignment against a case-defined semantic graph.
- ▹Artifact merge with lock respect: upsert analysis results without overwriting locked keys.
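The lock-respecting merge in the last bullet can be illustrated as below. The `Artifacts` type, the function name, and the returned `skipped` list are assumptions for the example, not the engine's actual signature.

```typescript
type Artifacts = Record<string, string | number>;

interface MergeResult {
  merged: Artifacts;
  skipped: string[]; // locked keys the extractor tried to change
}

// Upsert incoming values into the current artifacts without touching locked keys.
// The input objects are not mutated.
function mergeArtifacts(
  current: Artifacts,
  incoming: Artifacts,
  lockedKeys: ReadonlySet<string>,
): MergeResult {
  const merged: Artifacts = { ...current };
  const skipped: string[] = [];
  for (const [key, value] of Object.entries(incoming)) {
    if (lockedKeys.has(key)) {
      skipped.push(key);
      continue;
    }
    merged[key] = value;
  }
  return { merged, skipped };
}
```

Returning the skipped keys, rather than silently dropping them, is one way to surface attempted overwrites in per-turn diagnostics.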
The framework compiler fixes a real bug class: a solution_structure extracted from a single acceptance turn misses frameworks built up over several turns. The compiler reads the full frame_solution transcript on ADVANCE; gating still uses the gate artifact, while display uses the compiled summary. Analysis extraction uses an explicit state space of dozens of granular slots (for example consumer_credit_revenue_cars versus consumer_credit_revenue) that aggregate to validation slots.
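To make the granular-to-validation aggregation concrete, here is a hedged sketch. It assumes aggregation is a plain sum and that a validation slot stays unset until every granular part is known; the slot `consumer_credit_revenue_motorbikes`, the mapping table, and the 1% tolerance are hypothetical.

```typescript
type SlotValues = Record<string, number | undefined>;

// Hypothetical mapping: validation slot -> granular slots that sum into it.
const AGGREGATION: Record<string, string[]> = {
  consumer_credit_revenue: [
    "consumer_credit_revenue_cars",
    "consumer_credit_revenue_motorbikes", // invented for this example
  ],
};

// Sum granular slots into a validation slot; undefined while any part is missing.
function aggregateSlot(
  validationSlot: string,
  values: SlotValues,
  mapping: Record<string, string[]> = AGGREGATION,
): number | undefined {
  const parts = mapping[validationSlot];
  if (!parts) return values[validationSlot]; // not an aggregate: read it directly
  let total = 0;
  for (const part of parts) {
    const value = values[part];
    if (value === undefined) return undefined;
    total += value;
  }
  return total;
}

// Compare against the case's expected result with a relative tolerance (assumed 1%).
function matchesExpected(actual: number | undefined, expected: number, relTolerance = 0.01): boolean {
  if (actual === undefined) return false;
  return Math.abs(actual - expected) <= Math.abs(expected) * relTolerance;
}
```

Tracking the segment-level slot separately from the total is what lets the engine tell a candidate who stated one segment's figure from one who stated the aggregate.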
Advancement is evidence-based, not conversational:
The semantic gate requires strong shape (for example decision + objective + metrics) before identify_problem can close. The advancement judge is a single LLM call over gate failures, turn band, merged artifacts, transcript, and per-phase rubric, returning advance / propose / hold. Effective bias escalates with turn pressure so candidates are not trapped. The judge is rescue-only: it can unblock a stuck candidate, but it never overrides a gate-ready advance.
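The gate-then-judge ordering described above might be expressed as a pure policy function along these lines. Treat it as a sketch under assumptions: the type names, the `ProblemArtifacts` fields, and the way the `ADVANCEMENT_JUDGE_ENABLED` and `ADVANCEMENT_PROPOSALS_ENABLED` flags are passed in as booleans are all illustrative, and the judge's LLM call is represented only by its verdict.

```typescript
type JudgeVerdict = "advance" | "propose" | "hold";
type Decision = "ADVANCE" | "PROPOSE" | "HOLD";

// Hypothetical artifact shape for the identify_problem gate.
interface ProblemArtifacts {
  decision?: string;
  objective?: string;
  metrics?: string[];
}

// Semantic gate: returns the missing pieces; an empty list means gate-ready.
function problemGateFailures(artifacts: ProblemArtifacts): string[] {
  const failures: string[] = [];
  if (!artifacts.decision?.trim()) failures.push("decision");
  if (!artifacts.objective?.trim()) failures.push("objective");
  if (!artifacts.metrics || artifacts.metrics.length === 0) failures.push("metrics");
  return failures;
}

interface AdvancementFlags {
  judgeEnabled: boolean; // stands in for ADVANCEMENT_JUDGE_ENABLED
  proposalsEnabled: boolean; // stands in for ADVANCEMENT_PROPOSALS_ENABLED
}

// Rescue-only judge: it is consulted only when the gate has failures,
// so it can unblock a candidate but never veto a gate-ready advance.
function decideAdvancement(
  gateFailures: string[],
  verdict: JudgeVerdict | undefined,
  flags: AdvancementFlags,
): Decision {
  if (gateFailures.length === 0) return "ADVANCE";
  if (!flags.judgeEnabled || verdict === undefined) return "HOLD";
  if (verdict === "advance") return "ADVANCE";
  if (verdict === "propose" && flags.proposalsEnabled) return "PROPOSE";
  return "HOLD";
}
```

Since the function takes only plain data, it can be unit-tested without calling a model or the API.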
Cases are declarative programs for the engine, with no code per scenario:
- ▹Buckets with objectives, actions, gates, turn caps, and turn-band instruction overrides
- ▹Calculations with expected results for in-prompt validation
- ▹Information-to-provide with triggers (exhibit routing, whitelisted facts)
- ▹Solution-graph node IDs for structure matching, plus a runtimeConfig ontology
- ▹Phase editors in the coach dashboard; the orchestrator always runs authoritative published content
Coaches input domain knowledge once; the engine interprets it as constraints on the state space, and the four behavioral abstractions inject at compile time rather than being hand-edited in TypeScript.
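As an illustration of what a declarative case could look like to the engine, the sketch below defines a minimal typed shape and a validation pass. Every field name here (`gateArtifacts`, `turnCap`, `expectedResult`, and so on) and the sample case content are assumptions; the real case JSON schema is not shown in this write-up.

```typescript
type ActionState = "required" | "sufficient" | "contributing" | "optional";

interface ActionDef { id: string; state: ActionState; }
interface BucketDef {
  id: string;
  objective: string;
  actions: ActionDef[];
  gateArtifacts: string[]; // artifact keys that must be present to close the bucket
  turnCap: number;
}
interface CalculationDef { id: string; expectedResult: number; }
interface CaseDef { title: string; buckets: BucketDef[]; calculations: CalculationDef[]; }

// Basic authoring checks a compile step might run before publishing a case.
function validateCase(caseDef: CaseDef): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  for (const bucket of caseDef.buckets) {
    if (seen.has(bucket.id)) errors.push(`duplicate bucket id: ${bucket.id}`);
    seen.add(bucket.id);
    if (bucket.turnCap < 1) errors.push(`${bucket.id}: turnCap must be at least 1`);
    if (bucket.gateArtifacts.length === 0) errors.push(`${bucket.id}: no gate artifacts`);
  }
  return errors;
}

// Hypothetical sample case with a single bucket.
const sampleCase: CaseDef = {
  title: "Sample case",
  buckets: [
    {
      id: "frame_solution",
      objective: "Agree on a MECE framework",
      actions: [{ id: "framework_proposed", state: "required" }],
      gateArtifacts: ["solution_structure"],
      turnCap: 8,
    },
  ],
  calculations: [{ id: "consumer_credit_revenue", expectedResult: 150 }],
};
```

Validating at authoring time keeps malformed cases out of the published content the orchestrator runs.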
Automated regression runs against the real orchestrator, not mocks:
- ▹Deterministic suite (hard gate): phase transitions, ADVANCE/ROLLBACK, artifact extraction, numeric parsing, gate readiness.
- ▹LLM scenario runners: behavioral regression, drift detection, key collection.
- ▹Turn-level telemetry and structured TurnDetail make prompt and policy iteration measurable.
The voice path shares the same orchestrator contract over HTTP, so text and voice are surfaces on one engine.
The interviewer sounds human; the orchestrator is paranoid. Mixing them creates confident-sounding premature advances.
Phase transitions are decided by artifact readiness, action completion, and turn caps. The judge rescues; it does not govern.
Artifacts are credited only when the interviewer validated them in the same turn, which prevents marking structure as complete when it was never accepted.
Different consumers, different correctness requirements.
New cases ship as JSON and DB content, so engineering effort scales with engine features rather than with the number of cases.
Ontology mapping before the LLM: cheaper, faster, auditable, the kind of NLU discipline production needs.
Higher-autonomy behaviors roll out safely behind feature flags (ADVANCEMENT_JUDGE_ENABLED, ADVANCEMENT_PROPOSALS_ENABLED).
The interviewer generates first; action extraction then runs on the full turn including coach acceptance, so advancement ties to interviewer behavior rather than pre-response guesswork.
One interview turn, end to end:
RunState (client + server)
Engineering properties
- ▹Pure policy functions in advancementPolicy.ts, testable without the API
- ▹Gate diagnostics on every turn, production-debuggable
- ▹Phase summaries plus a rolling message window (the last 6 messages are sent to the model) for bounded context
- ▹Exhibit eager-resolve and data-claim reconciliation, keeping spoken claims consistent with shown exhibits
- ▹API: POST /api/orchestrator/process (turn loop) and /opening; a Zustand store applies structured updates client-side
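The bounded-context property from the list above (phase summaries plus a rolling window) can be sketched as follows. The `Message` and `ModelContext` shapes and the function name are assumptions; only the window size of 6 comes from the text.

```typescript
interface Message {
  role: "interviewer" | "candidate";
  text: string;
}

interface ModelContext {
  phaseSummaries: string[];
  recentMessages: Message[];
}

// Keep every phase summary but only the most recent messages of the transcript.
function buildModelContext(
  phaseSummaries: string[],
  transcript: Message[],
  windowSize = 6,
): ModelContext {
  // slice(-0) would return the whole array, so a non-positive window is handled explicitly.
  const recentMessages = windowSize > 0 ? transcript.slice(-windowSize) : [];
  return { phaseSummaries: [...phaseSummaries], recentMessages };
}
```

With this shape the prompt grows with the number of completed phases, not with the length of the conversation.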
A production-grade conversational control system, running live in the product:
- ▹Coaches author full interview flows without engineering per case
- ▹Turn-level advancement is explainable: gate failures, missing artifacts, judge rationale
- ▹Automated regression catches drift in policy and language
- ▹The analysis phase supports fine-grained state tracking validated against case calculations
- ▹One orchestrator powers the text interview room, the voice agent, and the User Agent QA harness
You can try the live interviewer at myconsultingcoach.com/practice.