Most AI tool selection decisions at non-tech companies are made by the wrong person, using the wrong criteria, at the wrong stage of implementation.
The wrong person: the most technically enthusiastic person on the leadership team, rather than the operations lead who knows what the team actually does.
The wrong criteria: feature breadth and demo impressiveness, rather than output quality on the company’s specific recurring tasks.
The wrong stage: before the context pack is built, so there is no way to evaluate which tool produces better company-specific outputs.
This article fixes all three.
This article gives a specific framework for selecting AI tools for a $5M to $25M non-tech company: the evaluation criteria that predict operational success, the selection process that uses the company’s actual tasks rather than demos, and the governance decisions that must be made before any tool is purchased.
The four stages of tool selection — in the right sequence
| Stage | What happens | Why it comes here |
|---|---|---|
| 1. Define the primary task mix | Identify the 5 to 8 most frequent, most AI-appropriate tasks before looking at any tool | Prevents evaluating tools on generic rather than specific capability |
| 2. Evaluate governance fit | Confirm data handling terms, BAA (business associate agreement) requirements, access controls | A governance failure eliminates a tool regardless of capability |
| 3. Run the two-week pilot | Test both candidate tools on the actual primary tasks with actual team members | Produces decision-relevant evidence that demos and reviews cannot |
| 4. Make the deployment decision | Choose based on pilot results and adoption behaviour | Grounds the decision in evidence, not enthusiasm |
Stage 1: Define the primary task mix before looking at any tool
What the primary task mix is
The five to eight most frequent, most time-consuming, most AI-appropriate recurring tasks the operations team runs every week. Not the most impressive AI use cases: the most operationally relevant ones.
How to identify it
Run a one-hour session with the operations lead and two or three senior team members. Ask three questions:
- “What are the five tasks you do most frequently that involve writing, drafting, or compiling information?”
- “Which of those tasks takes the most time?”
- “Which of those tasks produces the most frustration when the quality is inconsistent?”
The intersection of high-frequency, high-time-cost, and quality-sensitive is the primary task mix.
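The intersection can be worked out on paper, but a short script makes the ranking explicit. The sketch below is one possible way to do it, not a prescribed method: the `Task` fields, the ranking by total weekly minutes, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runs_per_week: int       # frequency, from the first question
    minutes_per_run: int     # time cost, from the second question
    quality_sensitive: bool  # frustration with inconsistent quality, from the third


def primary_task_mix(tasks, max_tasks=8):
    """Keep quality-sensitive tasks, ranked by total weekly minutes (an assumed ranking rule)."""
    candidates = [t for t in tasks if t.quality_sensitive]
    candidates.sort(key=lambda t: t.runs_per_week * t.minutes_per_run, reverse=True)
    return [t.name for t in candidates[:max_tasks]]


# Illustrative numbers only, not data from a real company.
session_notes = [
    Task("Back-order notifications", 40, 10, True),
    Task("RFQ responses", 12, 45, True),
    Task("Account health summaries", 5, 60, True),
    Task("Meeting room bookings", 30, 2, False),
]

print(primary_task_mix(session_notes))
```

With these sample numbers the booking task drops out because nobody flagged its quality as a problem, and the remaining tasks are ordered by how much team time they consume each week.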
Examples by sector:
| Sector | Primary task mix |
|---|---|
| Distribution | Back-order notifications, RFQ responses, account health summaries, supplier communications, management briefing |
| Healthcare | Payer appeal letters, compliance report narratives, referral communications, staff notifications, operations briefing |
| Professional services | Work product first drafts, client status communications, research synthesis, proposal sections, performance reports |
| Non-profit | Grant proposal sections, funder reports, donor cultivation letters, board communications, compliance narratives |
Why this step must come first
The company that evaluates AI tools before defining the primary task mix evaluates them on generic capability: how well they do anything.
The one that defines the task mix first evaluates them on specific capability: how well they do the things the company actually needs.
The first evaluation produces the tool that won the most awards. The second produces the tool that fits the company’s operational needs.
Stage 2: Evaluate governance and regulatory fit
Why this is a prerequisite gate
A tool that fails the governance evaluation is eliminated from the capability comparison. No output quality advantage overcomes a governance failure for a regulated company.
This is the most commonly skipped step and the one that produces the most expensive mistakes: companies that deploy a tool for six months before the compliance officer discovers a BAA requirement the tool does not meet.
The three governance questions
Question 1: What data types will the team enter into the AI tool?
Map the primary task mix against data types:
- Customer names and contact information (PII)
- Patient information, diagnosis, or treatment (PHI, HIPAA-applicable)
- Attorney-client communications or privileged legal information
- Student education records (FERPA)
- Substance use treatment records (42 CFR Part 2)
- Confidential client financial information
- Proprietary technical specifications or trade secrets
Question 2: What data handling requirements apply?
For non-regulated industries (manufacturing, distribution, general professional services, real estate, non-profit without health data): standard business data handling terms are typically adequate. Verify the tool’s business terms prevent training on company data.
For regulated industries: identify the specific regulatory requirement (HIPAA BAA requirement, professional conduct rules, financial data protection obligations) and confirm which tools can meet them before proceeding to capability evaluation.
Question 3: What access control requirements exist?
Identify which of these are required (not preferred):
- Role-based access control (different team members access different contexts)
- Audit logs of AI tool use (for compliance documentation)
- SSO integration with existing identity management
- Multi-site access management
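Because governance is a pass/fail gate rather than a score, it can be expressed as a simple set check. The following is a minimal sketch under stated assumptions: the tool names, the control labels, and the capabilities assigned to each tool are hypothetical, and real capabilities must be confirmed against each vendor's current terms.

```python
# Controls this company has marked as required (not preferred). Assumed example.
REQUIRED_CONTROLS = {"no_training_on_company_data", "audit_logs", "sso"}

# Hypothetical tools and capabilities; confirm real terms with each vendor.
candidate_tools = {
    "Tool A": {"no_training_on_company_data", "audit_logs", "sso", "role_based_access"},
    "Tool B": {"no_training_on_company_data", "sso"},
}


def governance_gate(tools, required):
    """Return (passed, eliminated), where eliminated maps tool name to missing controls."""
    passed, eliminated = [], {}
    for name, controls in tools.items():
        missing = required - controls
        if missing:
            eliminated[name] = sorted(missing)
        else:
            passed.append(name)
    return passed, eliminated


passed, eliminated = governance_gate(candidate_tools, REQUIRED_CONTROLS)
print(passed)      # tools that go forward to the pilot
print(eliminated)  # tools removed, with the reason recorded
```

Recording the missing controls alongside each eliminated tool gives the one-page decision document a ready-made explanation of why a tool was never piloted.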
Stage 3: The two-week pilot
Pilot setup
Duration: two weeks. Enough time to move past first-session novelty and evaluate consistent output quality.
Participants: five team members with this specific mix:
- Two who are moderately AI-experienced (will make any tool work with effort)
- Two who are at typical AI experience level (will provide the most decision-relevant data)
- One who is an advanced AI adopter (will identify the performance ceiling of each tool)
Do not select five AI enthusiasts. They will make any tool produce acceptable outputs, which defeats the purpose of the pilot.
Tools to pilot: maximum two. Piloting three or more tools simultaneously produces confusion and dilutes the quality of the context loaded into each.
Context loading requirement: before the pilot begins, load the same context pack into both tools: the same voice guides, communication standards, vocabulary guides, and workflow specifications.
The pilot that does not load context is evaluating generic AI capability. The pilot that loads context is evaluating operational AI fit. These are different evaluations and produce different results.
If you have not yet built your context pack, see what an AI context pack is for the document structure required before a meaningful pilot can run. For a head-to-head comparison of the two most common candidates at Stage 3, ChatGPT vs Claude for business evaluates both tools across the six operational dimensions most relevant to non-tech teams. And if the pilot reveals your team is accumulating too many tools rather than consolidating around one, why one AI tool beats five makes the case for consolidation over tool sprawl.
The pilot task set and metrics
Run the primary tasks from Stage 1 (the five to eight in the task mix). For each task, the five pilot participants run the same workflow in both tools on the same day, using the same inputs.
Collect for each task-tool combination:
| Metric | How to measure |
|---|---|
| First-attempt output quality | 1 to 5 rating against the company’s quality standard |
| Editing time required | Minutes from output to usable draft |
| Input effort required | 1 to 5 (1 = very easy, no prior AI experience needed) |
| “Would use again without being asked” | Yes/No endorsement from each participant |
The pilot decision
At the end of two weeks, calculate:
- Average first-attempt quality score per tool across all tasks
- Average editing time per task-tool combination
- Average adoption friction (input effort) per tool
- Number of “would use again” endorsements from the five participants
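These four figures are simple averages and counts, so a spreadsheet is enough. For teams that prefer a script, here is a rough sketch; the record layout, field names, tool names, and every number in it are invented for illustration.

```python
from statistics import mean

# One record per participant, task, and tool. Values are made up for illustration.
records = [
    {"tool": "Tool A", "task": "RFQ responses", "quality": 4, "edit_min": 6, "effort": 2},
    {"tool": "Tool A", "task": "Back-order notifications", "quality": 5, "edit_min": 3, "effort": 1},
    {"tool": "Tool B", "task": "RFQ responses", "quality": 3, "edit_min": 10, "effort": 2},
    {"tool": "Tool B", "task": "Back-order notifications", "quality": 4, "edit_min": 4, "effort": 3},
]

# One yes/no "would use again" answer per participant, per tool.
would_use_again = {
    "Tool A": [True, True, True, False, True],
    "Tool B": [True, False, False, True, False],
}


def summarize(records, endorsements):
    """Compute the four end-of-pilot figures for each tool."""
    summary = {}
    for tool in sorted({r["tool"] for r in records}):
        rows = [r for r in records if r["tool"] == tool]
        tasks = sorted({r["task"] for r in rows})
        summary[tool] = {
            "avg_quality": mean(r["quality"] for r in rows),
            "avg_effort": mean(r["effort"] for r in rows),
            "avg_edit_min_by_task": {
                task: mean(r["edit_min"] for r in rows if r["task"] == task)
                for task in tasks
            },
            "endorsements": sum(endorsements[tool]),
        }
    return summary


for tool, figures in summarize(records, would_use_again).items():
    print(tool, figures)
```

Editing time is kept per task rather than collapsed into one number, because a tool can be fast on short notifications and slow on long drafts.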
The weighting guidance:
- Weight quality higher for regulated or client-facing outputs where the cost of a substandard output is high
- Weight adoption friction higher for high-volume, lower-stakes tasks where team adoption rate is the primary concern
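The guidance above gives a direction for the weighting but no numbers. The sketch below shows how the same pilot averages can point to different tools under different weightings. The 0.7/0.3 splits, the profile names, and the pilot averages are assumptions chosen for illustration, not recommended values.

```python
# Assumed weights: the guidance gives a direction, not numbers.
WEIGHTS = {
    "regulated_or_client_facing": {"quality": 0.7, "friction": 0.3},
    "high_volume_low_stakes": {"quality": 0.3, "friction": 0.7},
}


def weighted_score(avg_quality, avg_effort, weights):
    """Combine quality and ease of use into one score between 0.2 and 1.0."""
    quality_part = avg_quality / 5   # 1 to 5 scale, higher is better
    ease_part = (6 - avg_effort) / 5  # 1 to 5 scale, lower effort is better
    return weights["quality"] * quality_part + weights["friction"] * ease_part


def preferred_tool(pilot_averages, profile):
    """Return the tool with the highest weighted score for the given profile."""
    weights = WEIGHTS[profile]
    return max(
        pilot_averages,
        key=lambda tool: weighted_score(*pilot_averages[tool], weights),
    )


# Hypothetical pilot averages: (avg_quality, avg_effort) per tool.
pilot_averages = {"Tool A": (4.5, 3.0), "Tool B": (3.5, 1.5)}

print(preferred_tool(pilot_averages, "regulated_or_client_facing"))
print(preferred_tool(pilot_averages, "high_volume_low_stakes"))
```

In this made-up case the higher-quality but harder-to-use tool wins under the regulated weighting, and the easier tool wins under the high-volume weighting. A single score should inform the decision, not replace the "would use again" endorsements or the governance gate.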
Stage 4: The deployment decision and governance documentation
The deployment decision
Deploy the tool that passes the governance evaluation, shows the strongest pilot performance on the primary task mix, and is the one the non-AI-enthusiast pilot participants are most likely to keep using without being prompted.
Document the decision in one page: the task mix evaluated, the governance requirements confirmed, the pilot results summary, and the tool selected.
This document is the evidence that the selection was made based on operational evaluation rather than vendor relationship or demo impressiveness.
The governance documentation before going live
Before any team member beyond the pilot uses the tool:
- Data handling standards document (one page): what data categories are appropriate for this tool, what are not, how sensitive data is de-identified before entry, and who reviews AI-assisted outputs before use
- For regulated industries: the signed BAA (or equivalent) on file
- Access configuration: the team member access list, the role-based access controls configured, the admin console access designated to the AI system owner
The five most common selection mistakes — and the correction for each
Mistake 1: Selecting based on the founder’s personal use
The founder uses Claude personally and selects it for the team without a pilot evaluation. The team’s primary tasks are very different from the founder’s tasks.
Correction: the pilot must include the actual team members who will run the actual workflows. Founder personal use is one data point, not the deployment decision.
Mistake 2: Selecting based on demo quality
The vendor demo produces impressive results using ideally prepared inputs, a well-configured context, and carefully selected example tasks. The team’s first deployment does not reproduce these conditions.
Correction: the pilot uses the team’s actual inputs on their actual tasks. Demo quality is the tool’s best case. Pilot quality is the tool’s operational reality.
Mistake 3: Selecting before the context pack is built
The team selects a tool and deploys it before building the context pack. Without the voice guides and communication standards loaded, the tool produces generic outputs. The team concludes the tool does not work for their industry.
Correction: build the context pack first, load it into the pilot tools, and evaluate the tools with the context loaded.
Mistake 4: Selecting the tool with the most features
The tool with the longest feature list wins the selection decision, even though the team only uses three of the forty features.
Correction: the selection criterion is output quality on the primary task mix. Feature breadth is relevant only when a specific feature is required for a specific primary task. Unused features are not a selection advantage.
Mistake 5: Skipping the governance evaluation
The team deploys a tool for six months before the compliance officer reviews the data handling terms and identifies a BAA requirement the tool does not meet.
Correction: the governance evaluation is Stage 2, before capability evaluation. Non-negotiable for regulated industries. Prudent for all industries.
Common questions on AI tool selection
“What if we can only afford one tool: should we still pilot two?”
Yes. The pilot is the cheapest way to get the decision right. Most tools offer a trial period (verify at each tool’s website). A two-week trial costs nothing beyond the pilot participants’ time.
The cost of selecting the wrong tool for twelve months is significantly higher than two weeks of pilot time.
“What if the governance evaluation eliminates all the tools we were considering?”
This means the tools evaluated do not yet offer the data handling terms your regulatory context requires.
There are two options: check whether a higher-tier offering from the same vendors meets the governance requirements (verify BAA availability and zero data retention, or ZDR, options), or work with a compliance consultant before the capability evaluation.
“What if the pilot produces a tie, with both tools performing equally on our tasks?”
Consolidate around the tool that the rest of the company uses, or around the tool with the stronger shared context architecture for the company’s primary function.
A tie on output quality means the other dimensions (adoption friction, governance fit, shared context architecture, cost) determine the decision.
“How often should we re-evaluate our tool selection?”
A formal re-evaluation every 12 to 18 months, or when either of these occurs: a significant new model release from the leading providers, or a measurable decline in the quality gap between the primary tool and alternatives.
The Foundation context pack you build is portable (it lives in text documents) and can be transferred to a different tool within two weeks if re-evaluation produces a different recommendation.
Want the task mix defined, the governance review completed, and the pilot run for your company?
Choosing the right AI tool for a non-tech company requires four stages in the right sequence.
The company that follows this sequence selects the right tool in four weeks and avoids the six-month sunk cost of the wrong one.
Path one: define your primary task mix today. Run the one-hour session with your operations lead. Ask the three questions. Write down the five to eight tasks at the intersection of high-frequency, high-time-cost, and quality-sensitive. That list is the evaluation criterion for every tool comparison you will make. No other step requires more than that list to begin.
Path two: bring in a partner. Phos AI Labs runs the task mix definition, the governance review, and the two-week pilot for your specific company. Thirty minutes, no deck. Start here.