Why most small-team AI pilots fail
Why 95% of AI pilots produce zero P&L impact, the three failure archetypes we see in small-team audits, and what audit-first looks like for sub-20-person teams.
In July 2025, a small research initiative inside the MIT Media Lab published a report that ended a lot of board-deck slides. The GenAI Divide: State of AI in Business 2025 came out of Project NANDA, based on a systematic review of 300 publicly disclosed AI initiatives, 52 structured executive interviews, and 153 survey responses from senior leaders gathered across four industry conferences. The headline finding made it into Fortune, then into every CFO chat thread for a week: 95% of GenAI pilots produce zero measurable P&L impact. Only 5% of integrated systems create significant value. [Source: MIT Project NANDA, July 2025; Fortune coverage, August 18, 2025]
The number itself isn't the story. The interesting part is the rest of the sentence: "This divide does not seem to be driven by model quality or regulation, but seems to be determined by approach." The 5% that worked were not running better models. They had framed a smaller, measurable problem first.
This article is for the founder or operator of a sub-twenty-person team who has read that statistic, felt a small cold spot, and now has a calendar invite from a vendor offering to "explore AI together." We will walk through why the 95% number is real, why it lands hardest on small teams (not the enterprises everyone assumes), the three failure archetypes we see most often in Garden's pipeline, and what audit-first actually looks like when you do not have a Chief AI Officer or a quarter to spare. The thesis this article defends: the 5% that succeed didn't pick a better model; they framed a smaller, measurable problem first.
What the 95% Number Actually Says
The MIT NANDA report does not say AI is broken. It says enterprise AI implementations are broken in a specific, recoverable way. Read carefully, the findings cluster around four observations that matter more than the headline:
General-purpose tools win. Custom tools mostly don't. Eighty percent of organizations have explored ChatGPT, Copilot, or similar tools, and roughly forty percent report deployment. These tools "primarily enhance individual productivity, not P&L performance." Meanwhile, custom or vendor-sold enterprise systems get evaluated by 60% of organizations, piloted by 20%, and reach production in just 5%. The drop-off is steeper than the headline suggests, and it lives at the boundary between individual use (where the tools work fine) and workflow use (where they don't).
The binding constraint is not the model. It's the integration with daily work. From the report: pilots fail "due to brittle workflows, lack of contextual learning, and misalignment with day-to-day operations." Pilots that integrated into a specific task, with measurable inputs and outputs, succeeded much more often than pilots that tried to do "AI strategy" across the whole company.
Mid-market beats enterprise on speed to value. Top mid-market performers reported pilot-to-implementation timelines of about 90 days. Enterprises took nine months or longer. This is the finding people forget when they read coverage of the report: smaller is faster, and faster is more likely to succeed.
Buying beats building. External partnerships with learning-capable tools reached deployment about 67% of the time, compared to about 33% for internally built tools. This does not mean you should never build. It means that if your team is small enough that the engineer who'd build it is also the engineer who runs your product, the math is rarely on your side.
The McKinsey State of AI 2025 survey (n=1,491 respondents across 101 nations, fielded July 2025) reaches a complementary conclusion. It finds that 39% of organizations report any enterprise-level EBIT impact attributable to AI, and most of those say it's under 5%. The small group McKinsey calls "AI high performers" (organizations where more than 5% of EBIT is attributed to AI) share a common trait: they are systematically redesigning workflows around AI, not bolting AI onto existing workflows. [Source: McKinsey, State of AI 2025]
So when the trade press says "95% of AI projects fail," what they actually mean is closer to: 95% of attempts to deploy AI into operational workflows produce no measurable financial outcome, primarily because the workflow was not understood before the AI was attached to it. That is a different problem from the one most vendors are selling against, and it has a different solution.
Why Small Teams Fail Differently Than Enterprises
The conventional reading of the 95% number is enterprise-shaped: legacy systems, fragmented data, change management bureaucracy, IT versus business turf wars. All of that is true, and the McKinsey and BCG reports document it carefully. But Garden works almost exclusively with teams of 5–20 people, and the failure modes we see are not the enterprise ones at smaller scale. They are different problems entirely.
Small teams do not fail at scale. They fail at scope. An enterprise pilot fails when it cannot survive the journey from one department to twelve. A small-team pilot fails when nobody can articulate what success would look like for one person doing one task on a Tuesday. The framing question is not "how does this work across the org" but "which afternoon ritual does this replace, and can we measure whether it actually got replaced." Most small-team pilots never answer that question, because nobody is paid to ask it.
Small teams over-trust demos. A demo from a vendor or YouTube tutorial shows one polished happy-path task. The small-team founder watches, sees their workflow mirrored back at them, and signs a contract. Three months later the pilot has stalled because the messy 30% of cases, the ones a senior person handles in their head with five years of context, completely break the tool, and there is no one whose job it is to clean up the mess.
Small teams under-budget the human cost. BCG's "Widening AI Value Gap" study (2025, n=1,250 senior executives across nine industries) found that top performers allocate roughly 10% of effort to algorithms, 20% to data and technology, and 70% to people, process, and organizational change. In a 500-person company, that 70% is funded: someone owns change management, someone runs training, someone manages adoption metrics. In a 12-person company, that 70% has to come out of the founder's calendar, or it does not happen. It usually does not happen. [Source: BCG, "The Widening AI Value Gap: Build for the Future 2025"]
Small teams have less margin for theatre. Enterprise AI engagements have a built-in tolerance for performative output: discovery decks, roadmap workshops, "AI maturity assessments," steering committees. None of that produces anything you can use, but it produces something the board can look at. Small teams cannot afford the theatre. If three weeks go by and nothing has shipped, the founder notices, and notices loudly. This is why small teams that succeed with AI tend to succeed faster than enterprises, and why small teams that fail with AI fail more visibly. There is nowhere to hide a €40,000 slide deck.
The result is that the 95% number, when you zoom into the small-team segment, looks both better and worse than the headline. Better, because small teams can move quickly enough to land on the right side of the divide. Worse, because they fail in ways that are not addressed by any of the standard enterprise consulting playbooks the market keeps selling.
The Three Failure Archetypes Garden Sees
We have run audits, sandboxes, and partial deployments with enough small teams now to have a typology. Most of the failures we see fall into one of three shapes. Naming them helps, because once a founder can locate themselves in one of these patterns, the remediation is usually obvious.
Archetype 1: The Demo-Driven Roadmap
Symptom: The team has a list of seven AI tools they are "exploring." Usually a mix of ChatGPT Team, Notion AI, a vector database SaaS, a vertical-specific copilot, an agent framework, a no-code automation platform, and at least one tool whose pricing page says "Book a Demo." Nobody on the team can name a single workflow that any of these tools has improved.
Underlying cause: The decision-making heuristic is "what tools should we use" instead of "what problems should we solve." Each tool was added because a demo was impressive, or because a peer at another company mentioned it, or because the founder watched a podcast. The portfolio of tools grew without a corresponding portfolio of measurable outcomes.
What it costs: In our experience, a 15-person team in this archetype is paying €800–€2,500/month in subscription fees for tools that produce no measurable productivity gain, plus an opportunity cost of 10–20 hours/month of founder and senior staff time spent "evaluating" things. Over a year, the subscriptions alone come to roughly €10k–€30k; with the staff time counted, that is €15k–€40k of real money, plus the strategic cost of having the team feel busy with AI without anything actually shipping.
The remediation: Audit first. Map the actual workflows. Pick one (exactly one) with a measurable baseline, and rebuild it end to end. Drop every tool that does not directly serve that one workflow. The team usually cancels three to five subscriptions in the first month, which more than pays for the audit.
Archetype 2: The Hallucination-Tolerant Pilot
Symptom: The team built or bought a tool that does something specific (customer support draft replies, contract summarization, market intelligence reports, lead enrichment). The demo looked great. In production, it gets the right answer about 70% of the time, the wrong answer about 25%, and a confidently wrong answer about 5%. Nobody flagged this because the team that built it tested on easy cases. The team that uses it does not have a way to tell when the tool is wrong, so the wrong outputs leak into customer-facing work. Three months later, somebody catches a serious error. Trust collapses. The tool gets shelved.
Underlying cause: No evaluation discipline. The team built the system without a way to measure correctness on representative cases, without a way for users to flag errors as they happen, and without a feedback loop into the prompt or model selection. The MIT NANDA report's "learning gap" (systems that don't retain feedback, don't adapt, and don't improve over time) manifests most painfully here.
What it costs: Direct: the build or license cost, plus cleaning up the customer-facing errors. Indirect: a much bigger number. Teams that get burned by an AI tool in production typically wait 12–18 months before trying again, and when they do, they over-correct toward zero AI, which is also wrong.
The remediation: Build evaluation in from day one. For every AI workflow, you need three things before it goes anywhere near a customer: a representative test set (~50 cases that span the realistic distribution, including the hard ones), a way for users to flag bad outputs in one click, and a weekly review where someone actually looks at the flagged cases. None of this is glamorous. All of it is non-negotiable.
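The three pieces named above (test set, one-click flag, weekly review) can be sketched in a few dozen lines. This is a minimal illustration, not a recommended framework: `EvalCase`, `evaluate`, and `FlagLog` are hypothetical names, and exact string matching stands in for whatever grading rubric the real workflow needs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str
    hard: bool = False  # the messy cases a demo would skip


def accuracy(run_workflow: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Share of cases where the output matches the expected answer.

    Exact match (case-insensitive) is a placeholder grader (assumption).
    """
    if not cases:
        raise ValueError("test set is empty")
    hits = sum(
        run_workflow(c.prompt).strip().lower() == c.expected.strip().lower()
        for c in cases
    )
    return hits / len(cases)


def evaluate(run_workflow: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Report accuracy overall and, separately, on the hard subset."""
    report = {"overall": accuracy(run_workflow, cases)}
    hard = [c for c in cases if c.hard]
    if hard:
        report["hard_only"] = accuracy(run_workflow, hard)
    return report


class FlagLog:
    """Collects one-click user flags for the weekly review."""

    def __init__(self) -> None:
        self._flags: list[tuple[str, str]] = []  # (workflow name, free-text note)

    def flag(self, workflow: str, note: str = "") -> None:
        self._flags.append((workflow, note))

    def weekly_summary(self) -> list[tuple[str, int]]:
        """Flag counts per workflow, most-flagged first."""
        return Counter(w for w, _ in self._flags).most_common()
```

Reporting the hard subset on its own line is the design choice that matters: an overall score can look healthy while the cases a senior person used to handle are failing.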
Archetype 3: The Sovereignty Disaster
Symptom: The team built a useful AI workflow on top of OpenAI or Anthropic six months ago, before it really thought about what data was flowing through it. Now a customer's procurement team asks the GDPR question, or a partner's NDA is up for renewal with a new AI clause, or the team is closing a deal with a healthcare or finance customer and discovers that the contract requires data residency the team cannot prove. The workflow is now blocking commercial work. Replacing it is a quarter of engineering time the team does not have.
Underlying cause: Architecture decisions made without considering the data they would carry. This is not malice or laziness. It is a structural mismatch between how fast AI tools onboard (sign up, paste API key, ship feature) and how slowly data-governance questions surface (months or years after the feature shipped, usually under deal pressure).
What it costs: Hard to bound. We have seen teams lose six-figure deals because they could not answer a data-residency question in writing. We have seen others spend three to four months of engineering time migrating from US-jurisdiction APIs to EU-sovereign infrastructure under deal pressure, with no commercial budget for the work.
The remediation: Decide the data architecture before you decide the model. If your customer data is covered by GDPR, by NDAs, or by sector-specific regulation, the question "where does this data go during inference" has to be answered on day one, not day 200. For most EU-based small teams, the answer is some combination of Mistral La Plateforme, IONOS AI Model Hub, Azure OpenAI in an EU data zone, or a fully local deployment. Once that boundary is set, model selection becomes a much smaller question.
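To make "boundary first, model second" concrete, here is a small sketch that derives a data boundary from a workflow's constraints and only then filters the hosting options. The rule in `required_boundary` is a deliberate simplification we invented for illustration, and the tier assigned to each option simply mirrors how the options are grouped in the paragraph above; check each provider's actual terms before relying on any of it.

```python
from enum import IntEnum


class Boundary(IntEnum):
    """Data-handling tiers, ordered from least to most restrictive."""
    CLOUD_OK = 1
    EU_SOVEREIGN = 2
    FULLY_LOCAL = 3


def required_boundary(
    *,
    covered_by_gdpr: bool,
    sector_regulated: bool,
    contract_forbids_external_hosting: bool,
) -> Boundary:
    """Illustrative simplification (assumption): the strictest constraint wins."""
    if contract_forbids_external_hosting:
        return Boundary.FULLY_LOCAL
    if covered_by_gdpr or sector_regulated:
        return Boundary.EU_SOVEREIGN
    return Boundary.CLOUD_OK


# Hypothetical tier assignments, following the grouping used in the text.
OPTIONS: dict[str, Boundary] = {
    "Generic US-hosted API": Boundary.CLOUD_OK,
    "Mistral La Plateforme": Boundary.EU_SOVEREIGN,
    "IONOS AI Model Hub": Boundary.EU_SOVEREIGN,
    "Azure OpenAI (EU data zone)": Boundary.EU_SOVEREIGN,
    "Local deployment": Boundary.FULLY_LOCAL,
}


def allowed_options(required: Boundary) -> list[str]:
    """Options at least as restrictive as the required boundary."""
    return [name for name, tier in OPTIONS.items() if tier >= required]
```

Model selection then happens inside whatever list `allowed_options` returns, which is why it becomes the smaller question.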
These three archetypes are not exhaustive, but in our pipeline they cover roughly four out of five engagements that come to us after something has already gone wrong. The pattern they share matters more than what distinguishes them: in every case, the team skipped the framing step. They picked a tool, or built a feature, before they had named the problem cleanly enough to evaluate the result.
The Pilot vs. the Production Gap
Somewhere between week six and week ten of most AI pilots, the project changes shape. Up until then, the work has been about getting the demo to work. After that, the work is about getting the demo to survive: surviving variance in input data, surviving the edge cases the demo skipped, surviving the political friction of replacing a workflow that an actual person used to do, surviving the cost structure when usage scales beyond the founder's personal account.
This moment is where most pilots die. The MIT NANDA report calls it the gap between "pilots" and "successfully implemented": about 60% of organizations evaluate custom enterprise tools, about 20% pilot them, and only about 5% reach production. Gap is the wrong word for a drop that steep. Cliff is closer.
What lives on the cliff:
| Dimension | Pilot mode | Production mode |
|---|---|---|
| Inputs | Cherry-picked examples | Real, messy, distribution-shifted user inputs |
| Outputs | Reviewed by the builder | Used by someone who didn't build it |
| Errors | Tolerated, noted as "we'll improve this" | Customer-facing, reputation-affecting |
| Cost | $20–$200 in API spend, ignored | Linear-to-superlinear scaling, suddenly visible |
| Evaluation | "It feels good" | "Here's the dashboard, here's the trend, here's last week's flags" |
| Ownership | Whoever built it | Whoever uses it |
| Latency tolerance | "It took 40 seconds, that's fine" | "It took 40 seconds and now my customer is gone" |
| Failure mode | Embarrassing | Expensive |
The mistake almost every team makes is treating the pilot as if it will smoothly upgrade into production. It does not. The architecture for production is meaningfully different (different observability, different evaluation, different cost model, different ownership), and the longer the pilot pretends otherwise, the more expensive the transition gets.
This is why Garden's three-phase model puts so much weight on the sandbox phase. The sandbox is not a pilot. The sandbox is where you find out, in a controlled environment, whether the workflow survives the transition. If it does, you have evidence (actual token counts, actual user adoption curves, actual error rates) to make a defensible procurement and deployment decision. If it does not, you have killed the project for less money and less trauma than letting it die in production.
What Audit-First Looks Like for a Sub-20-Person Team
The audit-first approach is not a novel idea. McKinsey, BCG, Deloitte, and every consulting firm worth its day rate has been writing about "AI readiness assessments" for two years. The problem is that the enterprise-shaped version of an audit takes four to six months, costs €150k–€500k, and is calibrated for organizations with multiple business units and a CIO who can ratify the output. None of that works for a 12-person team where the CEO is also doing customer support on Saturdays.
For small teams, audit-first has to be smaller and faster. Here is what we have found actually works, structured around the three phases of Garden's engagement model.
Phase 1: Audit (2–3 weeks, calendar time)
What it is. A structured mapping of how knowledge currently flows through the team: who produces what, who consumes it, where the bottlenecks are, where the repetitive work lives, where the institutional memory is captured (and where it isn't). The output is a workflow inventory, classified by data sensitivity, with two or three candidate workflows ranked for AI augmentation. Not a slide deck.
What it isn't. A generic "AI readiness assessment." A tools survey. A maturity model. A board presentation about "AI strategy." All of those exist; none of them produce decisions for a small team.
What it produces. Three artifacts a small-team operator can actually use:
- A workflow inventory: every recurring task documented by input type, output type, time-on-task estimate, monthly volume, current tooling, identified pain points. For a 15-person team, this is usually 20–40 workflows, and the act of writing them down often surfaces the answer.
- A data-handling classification: each workflow categorized by what data it touches, with implications for architecture (cloud OK, EU-sovereign required, fully local required).
- A shortlist of 2–3 candidate workflows ranked by projected time savings and data-handling cleanliness, with explicit success criteria. Not "AI will help with this," but "If we automate X% of the drafts, and Y% of them survive review without changes, we save Z hours per week."
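The three artifacts above fit in a very small data model. The sketch below is one possible shape, not Garden's actual tooling: the `Workflow` fields, the `data_path` labels, and the assumption that a draft failing review saves no time at all are ours, chosen to keep the X/Y/Z success criterion computable.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    minutes_per_task: float
    monthly_volume: int
    projected_savings: float  # fraction of time-on-task, e.g. 0.4 for 40%
    data_path: str  # "cloud_ok", "eu_sovereign", "local_only" or "unresolved"

    @property
    def hours_per_month(self) -> float:
        return self.minutes_per_task * self.monthly_volume / 60

    @property
    def hours_saved_per_month(self) -> float:
        return self.hours_per_month * self.projected_savings


def shortlist(inventory: list[Workflow], max_candidates: int = 3) -> list[Workflow]:
    """Rank workflows with a resolved data path by projected hours saved."""
    eligible = [w for w in inventory if w.data_path != "unresolved"]
    ranked = sorted(eligible, key=lambda w: w.hours_saved_per_month, reverse=True)
    return ranked[:max_candidates]


def weekly_hours_saved(
    drafts_per_week: int,
    minutes_per_draft: float,
    automated_share: float,  # X: share of drafts produced by the tool
    survive_share: float,    # Y: share of those that pass review unchanged
) -> float:
    """Z in the success criterion. Assumes a draft that fails review saves nothing."""
    baseline_hours = drafts_per_week * minutes_per_draft / 60
    return baseline_hours * automated_share * survive_share
```

For example, `weekly_hours_saved(40, 15, 0.5, 0.6)` describes 40 drafts a week at 15 minutes each, half of them automated, 60% of those surviving review.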
Threshold to proceed. At least one workflow with projected time savings of 30% or more, a clean data-handling path, and a person on the team willing to be the internal champion. If any of those three conditions fails, the right answer is to not proceed to a sandbox. The audit fee was the price of finding that out cheaply, which is much cheaper than building a sandbox you don't need.
This is the audit's most underrated function. An honest audit that says "your team is not ready for AI in the workflows you described, here is what to fix first" is more valuable than a sandbox that pretends otherwise. Most consultancies will not deliver that finding because their commercial model depends on the next phase being sold. The discipline of saying no when no is the right answer is the single most important thing to screen for when hiring anyone to do an AI audit.
Phase 2: Sandbox (4–6 weeks, overlapping with audit)
What it is. A working prototype (not a slide, not a demo, an actual system) that runs the candidate workflow on representative data, with appropriate handling, observed by real users.
What it tests. Five questions, in order:
- Is the output quality sufficient for the specific task? This is not "is the model good?" This is "is this output usable by this analyst, for this deliverable, without more cleanup than the original effort would have required?"
- Is the latency tolerance acceptable? Does response time fit the workflow, or does it create new friction?
- What is the realistic token volume per session? This is the single most important input for hardware sizing or cost modeling.
- What interface do users actually prefer? Chat-style, embedded assistant, workflow-triggered automation? Most teams default to chat. Most workflows want something else.
- Where do users abandon the tool? This is the leading indicator of failed adoption, and you have to catch it during the sandbox, not in production.
What infrastructure it runs on. For most EU-based small teams, the sandbox runs on EU-sovereign cloud infrastructure: typically IONOS AI Model Hub (German jurisdiction, GDPR-aligned, around €0.65 per million tokens for Llama 3.3 70B), Mistral La Plateforme (French jurisdiction, EU-hosted), or Azure OpenAI in an EU data zone. The choice depends on the data classification from the audit, not on the model preference of the consultant. Observability runs on Langfuse (open-source, self-hosted on Hetzner for around €30–€200/month all-in, or hosted Core at $29/month). The whole sandbox stack for a 30-user team typically costs €500–€2,500/month in infrastructure; the high end applies when Azure or AWS regional endpoints are required.
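Measured token volume turns into a cost line with one multiplication. The sketch below uses the €0.65 per million tokens quoted above as its default; the usage figures in the example call, the 21 working days, and the single blended price for input and output tokens are assumptions for illustration only.

```python
def monthly_tokens(
    users: int,
    sessions_per_user_per_day: float,
    tokens_per_session: int,
    working_days: int = 21,  # assumption: typical working days per month
) -> int:
    """Total tokens per month from measured sandbox usage."""
    return round(users * sessions_per_user_per_day * tokens_per_session * working_days)


def monthly_inference_cost_eur(tokens: int, eur_per_million_tokens: float = 0.65) -> float:
    """Inference spend only; hosting, observability and support are separate lines.

    Assumes one blended price for input and output tokens.
    """
    return tokens / 1_000_000 * eur_per_million_tokens


# Hypothetical sandbox: 15 users, 10 sessions a day, 8,000 tokens per session.
tokens = monthly_tokens(users=15, sessions_per_user_per_day=10, tokens_per_session=8_000)
cost = monthly_inference_cost_eur(tokens)
```

The point of measuring `tokens_per_session` in the sandbox rather than guessing it is that every other number in this calculation is known in advance.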
Threshold to proceed. Two or more workflows show measurable improvement on baseline KPIs, user adoption is at least 60% among the pilot cohort, and token volumes are documented well enough to size production infrastructure with confidence.
Phase 3: Deploy (4–8 weeks, after the sandbox passes)
What it is. Installation of the production system, instrumentation, and training the team to maintain it after the consultancy leaves.
What it includes.
- Production infrastructure: either the same EU-sovereign cloud the sandbox used, or, for teams whose data is too sensitive even for sovereign cloud, local hardware. The choice should be evidence-based, not preference-based. For a sub-20-person team, fully local deployment makes sense only when there is a hard contractual or regulatory constraint that forbids data from leaving the building. Otherwise, sovereign cloud is the right answer and the cost difference is meaningful: local 70B-class deployment for a small team is roughly €80k–€160k in CapEx plus €15k–€85k/year in OpEx, while sovereign-cloud equivalents run in the hundreds of euros per month.
- Observability: tracing, evaluation harness, cost monitoring, user-flag pipeline. All of this is open source if you want it to be.
- Documentation: every prompt, every model choice, every fallback rule, written down so the team can maintain the system after we leave. This is the part most consultancies skip, and it is the part that determines whether the engagement creates lasting value or vendor lock-in.
- Training: a one-day workshop for the full team, plus a 30-day check-in. The goal is for the internal champion to run the system unaided by week six.
Threshold to call it done. The system has been running in production for at least four weeks, cost-per-output is stable and within budget, the user-flag rate is trending down, and the internal champion has independently shipped at least one change without our help.
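The local-versus-sovereign-cloud comparison in the production infrastructure item is easiest to judge as a multi-year total. This sketch uses the CapEx and OpEx ranges quoted there; the three-year horizon and the €500/month cloud figure are assumptions standing in for "hundreds of euros per month."

```python
def local_tco_eur(capex_eur: float, annual_opex_eur: float, years: int) -> float:
    """Total cost of a local deployment: hardware up front plus yearly running cost."""
    return capex_eur + annual_opex_eur * years


def cloud_tco_eur(monthly_eur: float, years: int) -> float:
    """Total cost of a sovereign-cloud deployment billed monthly."""
    return monthly_eur * 12 * years


YEARS = 3  # assumption: evaluation horizon

local_low = local_tco_eur(80_000, 15_000, YEARS)    # low end of the quoted ranges
local_high = local_tco_eur(160_000, 85_000, YEARS)  # high end of the quoted ranges
cloud = cloud_tco_eur(500, YEARS)                   # hypothetical monthly fee
```

Under these assumptions the gap is large enough that only a hard contractual or regulatory constraint, not a preference, justifies the local option.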
How the three phases compound
The audit costs €20k–€30k. The sandbox costs €30k–€50k. The deployment costs €25k–€60k depending on the architecture choice. The total is roughly €75k–€140k for a complete engagement, over twelve to eighteen weeks.
The compounding effect is what makes this work for small teams. Each phase has a kill switch. If the audit shows the team is not ready, we stop, saving €55k–€110k of sandbox and deployment spend that would have been wasted. If the sandbox shows the workflow does not survive contact with real users, we stop, saving a deployment that would have died in production. By the time you reach deployment, the evidence is on your side: real workflows, real KPIs, real users, real data. It becomes much harder to land in the 95% when each step requires passing a threshold to continue.
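The kill-switch arithmetic can be checked by summing the phase ranges quoted above. A minimal sketch; the phase names and the `spend_avoided` helper are ours, and the figures are the published fee ranges only, excluding infrastructure.

```python
# Fee ranges (low, high) in euros, as quoted in the text.
PHASES: dict[str, tuple[int, int]] = {
    "audit": (20_000, 30_000),
    "sandbox": (30_000, 50_000),
    "deploy": (25_000, 60_000),
}
ORDER = ["audit", "sandbox", "deploy"]


def total_range(phases: list[str]) -> tuple[int, int]:
    """Sum the low and high ends of the given phases."""
    low = sum(PHASES[p][0] for p in phases)
    high = sum(PHASES[p][1] for p in phases)
    return low, high


def spend_avoided(stop_after: str) -> tuple[int, int]:
    """Fees not incurred if the engagement stops after the given phase."""
    remaining = ORDER[ORDER.index(stop_after) + 1:]
    return total_range(remaining)


full_engagement = total_range(ORDER)
saved_if_audit_says_no = spend_avoided("audit")
saved_if_sandbox_says_no = spend_avoided("sandbox")
```

Stopping earlier always avoids more, which is the whole argument for putting the cheapest phase first.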
The Thresholds That Decide Whether to Keep Going
The most important sentence in the previous section was buried: each phase has a kill switch. The absence of kill switches is the structural reason most consulting engagements run past their value.
When a vendor or consultancy sells you an AI engagement without explicit go/no-go gates, what they are selling is a commitment to run the full project regardless of whether it should be running. That is a fine commercial model for the vendor. It is rarely a good one for the buyer.
The thresholds we recommend, and contractually commit to:
Audit to sandbox. At least one workflow with projected time savings ≥30%, a clean data-handling path within your data sovereignty constraints, and an identified internal champion willing to allocate at least 20% of their time during the sandbox phase. If any of these three is missing, the right answer is to fix that gap before continuing, not to continue and hope it resolves itself.
Sandbox to deployment. At least two workflows showing measurable KPI improvement against the audit baseline, user adoption ≥60% among the pilot cohort, and token volume measurements that allow production sizing with confidence (not a wide range, but a defensible point estimate plus a peak).
Deployment complete. The system has been running for at least four weeks in production. Cost-per-output is within 20% of the sandbox projection. User-flag rate is trending downward. The internal champion has independently shipped one change without external help.
Each threshold is a place where the conversation can honestly be: we're not there. Let's stop, fix what's missing, and revisit. The goal is not to pass the threshold. The goal is to know whether the threshold has been passed, and to act accordingly.
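Written as code, the three gates are just checks that return what is still missing. The sketch below encodes the thresholds from this section; the function names are hypothetical, "within 20%" is read as a symmetric band around the projection, and "trending downward" is approximated by comparing the average flag rate of the later weeks with the earlier ones. Both readings are our assumptions.

```python
from statistics import mean
from typing import Optional


def audit_gate(projected_savings: float, clean_data_path: bool,
               champion_time_share: float) -> list[str]:
    """Unmet audit-to-sandbox conditions; an empty list means go."""
    unmet = []
    if projected_savings < 0.30:
        unmet.append("projected time savings below 30%")
    if not clean_data_path:
        unmet.append("no clean data-handling path")
    if champion_time_share < 0.20:
        unmet.append("champion has less than 20% of their time available")
    return unmet


def sandbox_gate(workflows_improved: int, adoption: float,
                 tokens_point_estimate: Optional[int],
                 tokens_peak: Optional[int]) -> list[str]:
    """Unmet sandbox-to-deployment conditions."""
    unmet = []
    if workflows_improved < 2:
        unmet.append("fewer than two workflows beat the audit baseline")
    if adoption < 0.60:
        unmet.append("pilot-cohort adoption below 60%")
    if tokens_point_estimate is None or tokens_peak is None:
        unmet.append("token volume lacks a point estimate or a peak")
    return unmet


def deploy_gate(weeks_in_production: int, cost_per_output: float,
                projected_cost_per_output: float,
                weekly_flag_rates: list[float],
                champion_changes_shipped: int) -> list[str]:
    """Unmet deployment-complete conditions."""
    unmet = []
    if weeks_in_production < 4:
        unmet.append("less than four weeks in production")
    deviation = abs(cost_per_output - projected_cost_per_output) / projected_cost_per_output
    if deviation > 0.20:
        unmet.append("cost-per-output more than 20% off the sandbox projection")
    half = len(weekly_flag_rates) // 2
    # Assumption: "trending down" means later weeks average lower than earlier weeks.
    trending_down = half >= 1 and mean(weekly_flag_rates[half:]) < mean(weekly_flag_rates[:half])
    if not trending_down:
        unmet.append("user-flag rate is not trending down")
    if champion_changes_shipped < 1:
        unmet.append("champion has not shipped a change independently")
    return unmet
```

Returning the list of unmet conditions, rather than a bare pass or fail, keeps the "fix what's missing and revisit" conversation concrete.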
This is the part of consulting almost nobody does, because admitting that a threshold has not been met means admitting that the next phase fee should not be charged. The long-term math is straightforward: a consultancy that walks away from work when the work should not continue ends up with much better client outcomes, much stronger references, and much less time wasted on engagements that should not have started.
When AI Is the Wrong Tool, Full Stop
Garden writes a "wrong tool" section into every long-form piece, because it is the section that distinguishes honest analysis from vendor marketing. There are workflows where AI is the wrong answer, and a small team trying to ship them is much better served by being told that early.
When the workflow's volume does not justify the engineering cost. If a task happens five times a month, and each instance takes a senior person 20 minutes, the total cost is about one hour and forty minutes of time per month. Automating it might cost €15k of engineering effort to build and €100/month to run. The payback period is on the order of years, in a domain where the underlying tools change every six months. Some workflows are too small for AI, and the right answer is to let them stay small.
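The payback claim above can be made explicit. The build cost, run cost, task volume, and task length come from the example in the text; the €100 hourly rate for the senior person is an assumption we add, as is the simplification that automation removes the task time entirely.

```python
import math


def payback_months(
    build_cost_eur: float,
    monthly_run_cost_eur: float,
    tasks_per_month: float,
    minutes_per_task: float,
    hourly_rate_eur: float,
) -> float:
    """Months until the build cost is recovered; infinity if it never is.

    Assumes the automation removes the full time-on-task (optimistic).
    """
    hours_saved = tasks_per_month * minutes_per_task / 60
    net_monthly_saving = hours_saved * hourly_rate_eur - monthly_run_cost_eur
    if net_monthly_saving <= 0:
        return math.inf
    return build_cost_eur / net_monthly_saving


# Example from the text, with a hypothetical hourly rate of 100 euros.
months = payback_months(
    build_cost_eur=15_000,
    monthly_run_cost_eur=100,
    tasks_per_month=5,
    minutes_per_task=20,
    hourly_rate_eur=100,
)
```

At that rate the calculation comes out at about 225 months, close to nineteen years, and at any rate of €60/hour or less the run cost alone exceeds the time saved, so the build cost is never recovered.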
When the cost of being wrong is disproportionate to the cost of being slow. Legal contract review, medical triage, regulatory filings, anything where a confidently wrong output creates a serious downstream cost. AI is not categorically excluded from these domains, but the architecture has to put a human-in-the-loop at a step that catches errors, and the labor savings drop sharply once that step is included.
When the bottleneck is not knowledge work. Some teams come to us thinking they have an AI problem, when actually they have a sales problem, a hiring problem, a focus problem, or a product-market-fit problem. AI does not fix any of these. The most useful thing we can do in those conversations is say so and redirect the founder to the kind of help they actually need.
When the team cannot maintain the system after we leave. If a small team has no internal capability to monitor, debug, and update an AI workflow (no engineer, no operator, no champion), then deploying one is setting them up for slow-motion failure. The honest answer is to either build the internal capability first, choose a deployment mode that requires no maintenance (off-the-shelf SaaS with strong contractual protections), or wait.
When the data does not exist. AI augments knowledge work. If the institutional knowledge a workflow depends on is not written down, if it lives in the heads of two senior people who joined when the company was three people, the workflow cannot be augmented until that knowledge is captured. The right intervention is documentation, not AI. We have ended up doing what amounts to knowledge archaeology before any AI tool gets considered. It is not glamorous. It is the work that has to happen first.
The unifying principle: AI is leverage on existing capability. Not a substitute for capability that doesn't exist. Small teams that succeed with AI have something coherent to leverage. Small teams that fail are usually trying to use AI to manufacture a capability they don't have, and that no tool can supply.
FAQ
Is the 95% failure number actually correct?
The figure comes from MIT Project NANDA's July 2025 report, The GenAI Divide: State of AI in Business 2025, based on 300 publicly disclosed AI initiatives, 52 structured executive interviews, and 153 senior-leader survey responses. The authors are careful to call their findings "directionally accurate based on individual interviews rather than official company reporting." The 95% refers specifically to integrated enterprise AI pilots producing no measurable P&L impact. It does not mean 95% of all AI usage is wasted; general-purpose tools like ChatGPT are widely adopted and useful for individual productivity. The 95% is about custom or vendor-sold systems trying to change how operational work gets done. [Source: MIT Project NANDA, July 2025]
Does this mean my small team should not even try?
No. It means the opposite. The mid-market segment in the MIT data moves faster than the enterprise segment: pilot-to-implementation timelines of about 90 days for top performers, versus nine months or more for enterprises. Smaller teams have a structural advantage if they frame the problem correctly. The risk is that small teams will run pilots in the same shape enterprises do and inherit the failure pattern without inheriting the failure tolerance.
What is the minimum credible budget for an AI engagement at a 10-person company?
A complete three-phase engagement runs roughly €75k–€140k over twelve to eighteen weeks, with infrastructure adding €500–€2,500/month during the sandbox phase. A leaner version, audit-only, starts around €20k–€30k and produces a workflow inventory, data-handling classification, and a candidate shortlist. That is sometimes the right size for a very small team that needs to know whether to invest further before investing further.
How long until we see ROI?
For workflows that survive the audit and sandbox thresholds, we typically see measurable productivity gain within four to eight weeks of production deployment. Payback period on the consulting fee depends heavily on the time-on-task baseline and the team's hourly cost, but for the workflows we usually pick, it sits in the six-to-twelve-month range. Anyone promising shorter is probably promising more confidently than the data supports.
What if our data cannot leave Germany / the EU / our office?
Then the architecture decision is mostly made for you, and that is fine. EU-sovereign cloud (IONOS, Mistral La Plateforme, Schwarz/STACKIT, Azure EU data zone) covers most small-team scenarios with strong contractual and jurisdictional guarantees. Fully local deployment is necessary only when contractual constraints prohibit any external infrastructure, for example NDA-bound data from clients who explicitly forbid third-party hosting. Local for a 20-person team typically runs €80k–€160k CapEx plus €15k–€85k/year OpEx; sovereign cloud equivalents are often hundreds of euros per month. The right answer matches your actual constraints, not the most prestigious-sounding option.
Is there a version of this that works without an external consultancy?
Yes. The three phases are not trade secrets; they are a sensible decomposition of the work, and a disciplined internal team can run them in-house. The constraint is that someone on the team has to own the framing, the evaluation, and the cost discipline, and that person cannot also be running the rest of the company at full intensity. Most small teams that try the DIY path stall at the sandbox phase, not because they can't build the prototype but because they can't carve out the calendar discipline to evaluate it honestly. If your team has that discipline, you do not need us. If it does not, the engagement is mostly buying that discipline rather than buying the technical work.
Closing
The 95% number is the most cited statistic in enterprise AI right now, and it is being used to argue at least three contradictory positions: that AI is overhyped, that AI is being deployed badly, that AI is structurally incompatible with how businesses operate. The second is closest to true. The first and third are mostly self-serving rhetoric from people who would benefit from a particular narrative.
The 5% that succeed are not running better models. They are framing smaller problems, evaluating against measurable baselines, choosing architectures that match their actual data constraints, and stopping when the evidence does not support continuing. None of this requires a Chief AI Officer. It requires a small amount of discipline applied at the start, when the discipline is cheapest. It is much harder to add later, after a tool has shipped, after a contract has been signed, after a workflow has been broken.
For sub-twenty-person teams, the three-phase model (audit, sandbox, deploy) exists because nothing smaller works and nothing larger fits. The audit costs less than a failed pilot. The sandbox costs less than a failed deployment. The deployment costs less, in three-year total cost of ownership, than the sum of false starts most teams accumulate when they try to skip the framing.
If you are trying to figure out which of these matters for your team, that is the conversation we have in a Garden audit. The honest version, not the slide-deck version. We do not always recommend continuing past Phase 1; about a third of audits we run end in some version of "your team is not ready for this in the workflows you described, here is what to fix first." Those engagements are the ones we are proudest of. They saved the team from the 95%.
Email angelina@gardenresearch.eu if you want to start one.
This article is part of Garden Research's investigation into how small teams adopt AI without inheriting the failure patterns of enterprise pilots. If you've been through one of these patterns and want to compare notes, we'd genuinely like to hear from you.