
How to Evaluate an AI Consulting Firm: 10 Questions You Should Be Asking

The AI consulting firm that produces a strategy document and the one that produces a running AI system both send professional proposals. Here are the 10 questions that tell them apart.

Phos Team · Phos AI Labs · AI Strategy

The AI consulting firm that produces a strategy document and the one that produces a running AI system will both send you a professional proposal, cite relevant case studies, and have a compelling founder story.

The difference between them is not visible in the pitch. It is visible in ten specific questions that the operational firm answers with specifics and the advisory firm answers with principles.

Ask these questions before you sign. The answers will tell you more than the proposal document.

These ten questions are not a gotcha list. They are the questions a well-informed buyer should ask when evaluating any significant service engagement, adapted specifically to the AI consulting context.

A firm that answers all ten with specifics has stood up to the scrutiny. One that hedges, deflects, or answers in principles when the question calls for numbers is telling you something important about what the engagement will produce.


Questions 1–4: What the engagement produces

Question 1: “What will be running and measurably working at the end of week six?”

Why this question matters: week six is late enough that a genuinely embedded firm will have something running and measurable, and early enough that a purely advisory firm will still be producing analysis.

If the firm answers with a deliverable list (strategy document, maturity assessment, roadmap), the engagement is advisory.

If it answers with operational outputs (context pack loaded, first workflow at a specific acceptance rate, two team members trained), the engagement is embedded.

What a strong answer looks like:

“By week six, you will have a complete context pack loaded into the shared workspace, workflow specifications for your three highest-priority processes, and your ops lead and one account manager trained on those workflows using real current work. We will have tracked the acceptance rates for two weeks and made the first round of improvement adjustments.”

What a weak answer looks like:

“We will have completed the discovery and assessment phase and be well into strategy development.”

This describes advisory work. Week six produces analysis. The follow-up if the answer is vague: “What specific document, system, or operational output will exist that did not exist before you arrived?”


Question 2: “Can you show me an example of a context pack you have built for a company in a similar industry?”

Why this question matters: the context pack is the central deliverable of Phase 1. A firm that has built many of them can show one quickly, either a sanitised version or a structural description with accurate content examples.

A firm that has not built many of them will describe what a context pack is rather than showing what one looks like.

What a strong answer looks like:

A specific, concrete walk-through:

“For a professional services firm serving manufacturing clients, the client archetype section typically includes these seven elements: role, operational situation, trigger to the conversation, primary concerns, communication preferences, success definition, and relationship sensitivities, and here is what each looks like for a COO at a $20M manufacturer.”

What a weak answer looks like:

A description of what a context pack is in general terms, without a concrete example of what one has looked like for a real client.


Question 3: “Walk me through the Phase 2 training session: specifically what happens in the room.”

Why this question matters: Phase 2 training is the moment when the AI system moves from something the firm built to something the team uses.

A firm that has delivered many training sessions describes them specifically: real current work, role-specific workflows, and a defined process for handling a first session that produces poor outputs.

What a strong answer looks like:

“We sit with each team member individually or in role-specific groups. We take a real task from their current work (a proposal they need to write today, a report they run every week) and we run the workflow on it together. If the first output meets the quality bar, we discuss what made it work. If it does not, we diagnose the gap together and fix it before the session ends. The session ends when the team member has produced an output they actually used, not when an hour is up.”

What a weak answer looks like:

“We run a training workshop where we demonstrate the tools and teach best practices for prompting.”

This describes awareness training, not adoption training. The question asked about adoption.


Question 4: “What does the AI system look like on the last day of the engagement: specifically, what can the team do that they could not do on the first day?”

Why this question matters: this question produces a concrete before/after picture that makes the engagement’s scope and value tangible. It also reveals whether the firm thinks in terms of operational change or in terms of deliverables.

What a strong answer looks like:

“On the first day, your team was using AI individually in their own accounts with no shared context, producing outputs that required 20–30 minutes of editing per use. On the last day, every team member using AI is running from the same company-specific context pack, the three highest-priority workflows are documented and running at 75%+ acceptance rate, your AI system owner is maintaining the adoption log independently, and the next round of workflow builds has a specific specification ready.”

What a weak answer looks like:

“Your team will have a much better understanding of how to use AI effectively and a clear roadmap for where to take it next.”

This describes an awareness outcome and a planning outcome. Neither is operational change.


Questions 5–7: How the firm works

Question 5: “Who specifically will be working on our engagement, and will they change during it?”

Why this question matters: AI consulting firms that sell at the partner level and deliver at the junior consultant level produce a different outcome from the one that was pitched.

The person who understands the company well enough to write an accurate context pack and build company-specific workflows is the person who built the pitch, not a junior team member who joined the firm six months ago.

What a strong answer looks like:

A specific named person with a specific role description and a clear commitment:

“I [named partner] will run the discovery and context pack build personally. [Named team member] will run the workflow documentation and team training. If anything changes on our side, you will hear about it before the engagement starts.”

What a weak answer looks like:

“Our team of AI specialists will be assigned to your engagement.”

This is a category description, not a commitment. Follow up with: “Can you tell me specifically who that will be?”


Question 6: “How do you write the context pack: what is the process, and what does it require from us?”

Why this question matters: the context pack build process reveals whether the firm has a structured methodology or improvises. It also reveals what the company will be asked to contribute.

What a strong answer looks like:

“The context pack build starts with a 3–4 hour structured session with you or your COO. It is not a generic intake form but a specific conversation that extracts the company’s voice, client archetypes, and decision rules. We produce a first draft within two days. Before finalising, we review it with you against real AI outputs by running the draft context on two or three actual tasks. The whole process takes one to two weeks and requires four to six hours of your time.”

What a weak answer looks like:

“We gather all the relevant information during discovery and build the context pack from there.”

This describes a process without specifics. Discovery could mean anything: four hours of founder time or four days of consultant interviews. The difference matters.


Question 7: “What is your minimum acceptable acceptance rate before you consider a workflow deployed?”

Why this question matters: acceptance rate is the operational quality standard for AI workflows.

A firm that does not use acceptance rate as a deployment gate has no objective quality standard, which means the workflow goes live when the timeline says it should, not when the quality says it can.

What a strong answer looks like:

“We target an 80% acceptance rate on core workflows before we consider them deployed. That means 80% of outputs are used without significant editing. Below 80%, the workflow is in improvement mode: we diagnose the gap, adjust the prompt or context pack entry, and re-evaluate. We track this in the adoption log from the first week of training.”

What a weak answer looks like:

“We make sure outputs are high quality before we sign off on the workflow.”

This is a statement about intent, not a standard. “High quality” is not a number. 80% is.
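The arithmetic behind that standard is simple enough to sketch. The Python below is a minimal illustration, not any firm's actual tooling: the `LogEntry` structure, the workflow name, and the sample counts are all hypothetical, and "accepted" is assumed to mean "used without significant editing", as in the strong answer above.

```python
from dataclasses import dataclass

DEPLOYMENT_THRESHOLD = 0.80  # the 80% figure quoted in the strong answer


@dataclass
class LogEntry:
    """One row of a hypothetical adoption log."""
    workflow: str
    used_without_significant_editing: bool


def acceptance_rate(entries: list[LogEntry], workflow: str) -> float:
    """Share of a workflow's logged outputs that were used without significant editing."""
    relevant = [e for e in entries if e.workflow == workflow]
    if not relevant:
        raise ValueError(f"no logged outputs for workflow {workflow!r}")
    accepted = sum(e.used_without_significant_editing for e in relevant)
    return accepted / len(relevant)


def is_deployed(entries: list[LogEntry], workflow: str,
                threshold: float = DEPLOYMENT_THRESHOLD) -> bool:
    """A workflow counts as deployed only once its rate reaches the threshold."""
    return acceptance_rate(entries, workflow) >= threshold


# Made-up sample: 17 of 20 outputs were used as produced.
log = [LogEntry("proposal_draft", True)] * 17 + [LogEntry("proposal_draft", False)] * 3
print(f"{acceptance_rate(log, 'proposal_draft'):.0%}", is_deployed(log, "proposal_draft"))
```

The point of the gate is that the decision comes from the log, not from the calendar: a workflow at 15 of 20 stays in improvement mode no matter what the timeline says.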


Questions 8–10: Accountability structures

Question 8: “What happens if the acceptance rate targets are not reached by the end of the engagement?”

Why this question matters: this question reveals whether the engagement has consequences for underperformance or whether the timeline is the only binding constraint.

A firm that is genuinely accountable for outcomes has a specific answer: additional sessions, a remediation process, an extended timeline. A firm accountable only for delivering to the schedule has no specific answer.

What a strong answer looks like:

“If a workflow is below the 80% target at the scheduled end of the engagement, we extend that workflow’s improvement cycle at no additional cost until it reaches the target. We have had this happen three times in 400+ engagements, and in all three cases, the gap was a context pack entry that needed updating, which we completed within two additional weeks. The target does not move because the timeline does.”

What a weak answer looks like:

“We are committed to delivering the highest quality work and will do everything we can to ensure you are satisfied.”

This is a values statement. The question asked about a specific accountability mechanism.


Question 9: “What condition does the AI system have to be in before this engagement ends?”

Why this question matters: most AI consulting engagement contracts specify a deliverable list and an end date. Neither guarantees an operational outcome.

The firm that commits to a system state is accountable for outcomes. The firm that commits to a deliverable list and a date is accountable for outputs.

What a strong answer looks like:

“The engagement ends when three conditions are true: the core workflows are running at 75%+ acceptance rate, every intended AI-using team member has completed their training session on real current work, and the AI system owner has been running the maintenance cadence independently for at least two weeks without our oversight. If those conditions are not met at the scheduled end, we extend at no additional cost.”

What a weak answer looks like:

“The engagement concludes with the delivery of the final documentation package and a closing presentation to your leadership team.”

This is a deliverable-based exit. The system may or may not be working. The engagement ends regardless.
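A system-state exit can be written down as a checklist that either passes or does not. The sketch below assumes the three conditions from the strong answer above; the `SystemState` fields, names, and sample figures are hypothetical and only illustrate the shape of the check.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """Hypothetical snapshot of the AI system at the scheduled end date."""
    workflow_acceptance_rates: dict[str, float]  # core workflow -> acceptance rate
    trained_on_real_work: dict[str, bool]        # team member -> session completed
    owner_independent_weeks: int                 # weeks of unsupervised maintenance


def unmet_exit_conditions(state: SystemState,
                          min_rate: float = 0.75,
                          min_owner_weeks: int = 2) -> list[str]:
    """Return the exit conditions not yet true; an empty list means the engagement can end."""
    unmet = []
    below = sorted(w for w, r in state.workflow_acceptance_rates.items() if r < min_rate)
    if below:
        unmet.append(f"workflows below {min_rate:.0%}: {', '.join(below)}")
    untrained = sorted(m for m, done in state.trained_on_real_work.items() if not done)
    if untrained:
        unmet.append(f"training outstanding: {', '.join(untrained)}")
    if state.owner_independent_weeks < min_owner_weeks:
        unmet.append("system owner has not yet run maintenance independently for long enough")
    return unmet


# Made-up example: one workflow is still short of the target.
state = SystemState(
    workflow_acceptance_rates={"proposal_draft": 0.82, "weekly_report": 0.71},
    trained_on_real_work={"ops_lead": True, "account_manager": True},
    owner_independent_weeks=3,
)
for condition in unmet_exit_conditions(state):
    print("Extend engagement:", condition)
```

Nothing in the check refers to a date. That is the difference between a system-state exit and a deliverable-based one.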


Question 10: “Can you share a reference from an engagement that ran into problems, not just your best case studies?”

Why this question matters: every firm has strong case studies.

The reference from a difficult engagement reveals something the polished case study does not: how the firm behaves when the work is harder than expected, when the team resists adoption, when the context pack needs three revision cycles instead of one.

A firm with 400+ engagements has had difficult ones. Their willingness to share a reference from a difficult engagement signals confidence in their process and honesty about the reality of AI implementation.

What a strong answer looks like:

“Yes, we had an engagement at a 35-person distribution company where team adoption in Phase 2 was lower than expected for the first four weeks. We ran an additional individual training cycle for the three non-adopters, adjusted two workflows based on the adoption log patterns, and reached the acceptance rate target by week ten instead of week eight. I can put you in touch with their COO directly.”

What a weak answer looks like:

“All of our engagements have produced strong outcomes. Here are three client references.”

This is not an answer to the question. Follow up explicitly: “I appreciate those references, but I specifically asked about an engagement that ran into challenges. Can you point me to one of those?”

A firm that cannot or will not answer this question is either very new (insufficient track record to have had a difficult engagement) or unwilling to discuss failure modes. Neither is reassuring.


How to use these questions: practical guidance for the evaluation conversation

When to ask each question

| Questions | When to ask | Who to ask |
| --- | --- | --- |
| 1–4 (what the engagement produces) | First substantive conversation, before the proposal is written | The partner or lead who will run the engagement |
| 5–7 (how the firm works) | During the proposal review conversation | The person who will do the actual work |
| 8–10 (accountability structures) | When reviewing the proposal contract before signing | The partner who is accountable for delivery |

The questions asked before the proposal shape what is in the proposal. The questions asked after the proposal is received are negotiation, not evaluation.

The four questions that matter most

If you can only ask four: ask Questions 1, 7, 9, and 10.

These are the questions that most directly reveal accountability for outcomes. Specific answers to all four indicate a firm that thinks in outcomes and has the track record to back them up.

| Question | What a specific answer signals |
| --- | --- |
| Q1: What is running in week six? | The firm builds systems, not documents |
| Q7: What is your acceptance rate deployment threshold? | The firm has an objective quality standard |
| Q9: What is the exit condition? | The firm is accountable for system state, not date |
| Q10: Can you share a difficult engagement reference? | The firm has honest experience and a resolution process |

How to interpret mixed results

Strong pattern: the firm answers all four key questions with specifics. Proceed to contract review, where the accountability terms should reflect the specifics given in conversation.

Caution pattern: the firm answers two or three of the four key questions with specifics and hedges on one. Identify which question produced the hedge and follow up directly. If the hedge is on Question 9 (exit condition), the accountability structure is the gap.

Disqualifying pattern: vague answers to Questions 1, 2, 4, 7, 8, and 9 collectively indicate an advisory firm presenting as embedded. The deliverable will be a document.
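For readers who like to score their notes after the call, the three patterns can be sketched as a small rubric. This is an illustrative sketch only: the function name and the `"unclassified"` fallback are assumptions added here, since the patterns above do not cover every combination of answers (for example, only one of the four key questions answered with specifics).

```python
KEY_QUESTIONS = (1, 7, 9, 10)
ADVISORY_SIGNAL_QUESTIONS = (1, 2, 4, 7, 8, 9)


def classify(specific: dict[int, bool]) -> str:
    """Map question number -> 'was the answer specific?' onto the patterns above."""
    # Disqualifying: every one of Questions 1, 2, 4, 7, 8, and 9 got a vague answer.
    if not any(specific[q] for q in ADVISORY_SIGNAL_QUESTIONS):
        return "disqualifying"
    key_specific = sum(specific[q] for q in KEY_QUESTIONS)
    if key_specific == len(KEY_QUESTIONS):
        return "strong"
    if key_specific in (2, 3):
        return "caution"
    # Assumption: combinations the article does not name are left for human judgment.
    return "unclassified"


# Made-up example: specifics everywhere except the exit condition (Question 9).
answers = {q: True for q in range(1, 11)}
answers[9] = False
print(classify(answers))
```

A rubric like this is a note-taking aid. The follow-up conversation on the hedged question still matters more than the label.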

How to use the reference from Question 10

When the firm provides a reference from a difficult engagement, the question to ask the reference is not “how did the engagement go overall?”

Ask: “What was the specific problem, and what specifically did the firm do to resolve it?”

The reference’s description of the firm’s response to difficulty is more informative than any performance description.


Common questions on evaluating AI consulting firms

“Should I ask all ten questions or just the most important ones?”

Ask all ten. Questions 5–7 (how the firm works) are easy to skip because they feel like process questions rather than outcome questions.

But the answers reveal the working model and often determine whether the Phase 1 deliverable is generic or company-specific.

“What if a firm refuses to answer Question 10?”

Treat refusal as a disqualifying signal.

A firm with sufficient track record to have produced difficult engagements, and to have resolved them successfully, is willing to share those references because the resolution demonstrates the process, not despite the difficulty.

A firm that will only share success cases is not being transparent about its track record.

“How do I evaluate a firm that is very new and has a limited track record?”

Apply Questions 1–4 and 7–9 as the primary evaluation.

A new firm cannot answer Question 10 with a specific difficult engagement reference because it has not run enough engagements to have one. That is a known limitation, not a disqualifying one.

The compensating questions: what methodology does the firm use for context pack build? What specific acceptance rate standard do they apply? What does the system owner handover process look like?

A firm with a structured methodology and specific standards can produce strong outcomes even with a limited track record.

“Should the questions be asked in writing or verbally?”

Verbally, in a live conversation, not a written questionnaire. The questions are designed to elicit specifics.

A written response allows the firm to prepare polished answers that obscure vagueness. A live conversation reveals whether the firm has the specifics memorised from experience or is constructing them on the fly.

“What if the firm’s answers are good but the proposal price is too high?”

The proposal price should reflect the scope. If the firm answers the ten questions with specifics, naming the acceptance rate target, the exit condition, and the difficult engagement reference, the price reflects an embedded engagement with measurable accountability.

The comparison is not “this firm’s proposal vs a cheaper proposal.” It is “this firm’s proposal vs the cost of running AI at the tool-first ceiling for another 12 months while the compounding gap widens.”


Want to ask these questions to Phos AI Labs and hear the specific answers?

Ten questions separate the AI consulting firm that produces a running system from the one that produces a document about a running system.

The most important four: what is running in week six, the minimum acceptance rate target, the exit condition, and whether the firm shares a reference from a difficult engagement.

These four questions most directly reveal accountability for outcomes.

A firm that answers all ten with specifics is not making a sales pitch. It is describing a process it has run many times before. That is the firm worth engaging.

Path one: run these questions on the firm you are currently evaluating. The pattern of answers tells you whether you are looking at an advisory or embedded firm before the contract is negotiated.

Path two: bring in a partner. Phos AI Labs answers all ten of these questions with specifics: a named partner doing the work, a documented context pack process, an 80% acceptance rate deployment standard, a system-state exit condition, and 400+ AI engagements that include difficult ones, with references available. Clients include Zapier, Coca-Cola, Medtronic, Dataiku, and American Express. Thirty minutes, no deck. Start here.

The fastest way to know whether we're the right fit is a conversation.
