
Will Your AI Projects Survive the Next Model Update?

Every major AI model has been updated at least once in the past 18 months. The ones that did not break had something in common: they were built on foundations, not on prompts.

Phos Team

Are you building sandcastles? Will your AI projects survive the next model update?

Every major AI model has been updated at least once in the past 18 months. Output behaviour changed. Some workflows broke. Some companies spent weeks rebuilding what they thought was permanent.

The ones that did not rebuild had something in common: they had built on foundations, not on prompts.

The question is not whether the model will change. It will. The question is whether what you built depends on the model or on something more durable.


The three ways AI projects break — and what causes each one

Before assessing your own system, you need to know which failure mode you are most exposed to. Three specific patterns account for almost every AI workflow that breaks.

Failure mode 1 — Model version dependency

The workflow was built on a specific version of a model whose behaviour matched the prompt perfectly. A model update changes the output behaviour. The prompt now produces inconsistent outputs: sometimes right, sometimes not.

Who this affects most: companies that built prompts during a specific model version’s peak and relied on behaviour that was never guaranteed to persist. The “magic prompt” that worked perfectly in 2023 is not guaranteed to work in 2025.

The signal: the workflow was never tested across model versions or even across multiple runs. It worked once, was deployed, and was never stress-tested.

Failure mode 2 — Platform dependency

The workflow was built inside a specific platform’s ecosystem: a custom GPT, a specific agent builder, a proprietary workflow tool. The platform changes its pricing, its API, or its feature set, or is acquired.

The workflow cannot be migrated without being rebuilt from scratch because the logic is locked inside the platform’s interface and is not documented externally.

Who this affects most: companies that built inside third-party AI workflow tools without exporting the underlying logic to their own documentation. When the tool changes, the knowledge of how the workflow worked is inside the tool, not in the company’s hands.

The signal: the workflow exists inside one tool’s interface. If asked to describe the workflow step-by-step without opening the tool, nobody on the team can do it.

Failure mode 3 — Person dependency

The workflow works because one person knows how to run it: what context to load, which prompt variation produces the best result, how to handle edge cases. That person leaves, is promoted, or goes on holiday. The workflow stops producing usable outputs. Nobody else knows why.

Who this affects most: companies where the AI system was built by the founder or one technically capable person who documented nothing. The workflow is in their head. The company’s AI capability walks out the door with them.

The signal: if asked “could a new hire run this workflow at acceptable quality on day three?”, the answer is no.


What a durable AI system is built on — the three foundations

Three specific architectural properties make an AI system survive model changes, platform changes, and team changes. Audit your own system against each one.

Foundation 1 — Context-first design

A system built on context is model-agnostic. The quality comes from what is loaded before the prompt runs: the voice guide, the decision rules, the customer archetypes, the domain terminology.

It does not come from a specific combination of prompt wording and model behaviour.

Test: remove the context pack and run the same prompt. If the output quality collapses dramatically, the context is doing the work and the system is durable. If the output quality stays roughly the same, the context pack is thin and the system is fragile.

A robust context pack produces acceptable outputs across Claude, GPT-4, and Gemini with the same prompt. A fragile prompt breaks when the model version changes.
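The remove-the-context-pack test can be scripted as a rough sketch. Everything named here is an assumption for illustration: `generate` stands in for whatever call sends text to your model, `score` stands in for however you rate an output against your quality bar, and the two stub functions exist only so the example runs without a real model.

```python
from typing import Callable


def context_ablation_gap(
    generate: Callable[[str], str],
    score: Callable[[str], float],
    context_pack: str,
    prompt: str,
) -> float:
    """Return the score with the context pack minus the score without it."""
    with_context = generate(f"{context_pack}\n\n{prompt}")
    without_context = generate(prompt)
    return score(with_context) - score(without_context)


# Hypothetical stand-ins so the sketch runs without calling a real model.
def fake_generate(text: str) -> str:
    return "on-brand draft" if "VOICE GUIDE" in text else "generic draft"


def fake_score(output: str) -> float:
    return 1.0 if "on-brand" in output else 0.0


gap = context_ablation_gap(
    fake_generate,
    fake_score,
    context_pack="VOICE GUIDE: plain, warm, no jargon",
    prompt="Draft a customer delay notification.",
)
print(f"quality gap with vs without context: {gap}")
```

A large positive gap suggests the context is doing the work. A gap near zero suggests the prompt is carrying the quality and the context pack is thin.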

Foundation 2 — Model-agnostic architecture

The workflow is not hard-coded to a specific model, model version, or platform feature. The prompt logic is documented outside the tool.

The workflow can be rebuilt on a different model in hours if the current one becomes unavailable, changes behaviour, or increases in cost.

Test: is the workflow documented in a format that could be rebuilt on a different AI tool? If yes, the architecture is durable. If the workflow exists only inside one platform’s interface and nowhere else, the architecture is fragile.

Foundation 3 — Documented ownership

The workflow is documented well enough that a person who did not build it can run it, improve it, and fix it when something breaks. The documentation includes:

  • Inputs required (what data or text goes in)
  • Expected output format and quality bar
  • Prompt structure (the logic, not just the text)
  • Human checkpoint location
  • Common failure modes and their solutions

Test: give the workflow documentation to a capable team member who has never run it. Can they produce an acceptable output on the first attempt? Can they identify what went wrong if the output is bad?

If yes, ownership is documented. If no, the workflow is person-dependent and fragile.
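One way to keep that documentation honest is to store it as a structured record and check it for gaps. The sketch below is only an illustration: the class name `WorkflowDoc`, its field names, and the `missing_sections` helper are assumptions that mirror the five documentation items listed above, not a prescribed format.

```python
from dataclasses import dataclass, field, fields


@dataclass
class WorkflowDoc:
    """Hypothetical record mirroring the five documentation items."""

    name: str
    inputs: str = ""             # what data or text goes in
    expected_output: str = ""    # format and quality bar
    prompt_structure: str = ""   # the logic, not just the text
    human_checkpoint: str = ""   # where a human reviews the output
    failure_modes: dict[str, str] = field(default_factory=dict)  # problem -> fix

    def missing_sections(self) -> list[str]:
        """List the sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


doc = WorkflowDoc(
    name="customer delay notification",
    inputs="Order reference and revised delivery date",
)
print(doc.missing_sections())
```

A record with nothing missing is a reasonable precondition for handing the documentation to a team member who has never run the workflow.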


The durability audit — how to assess your current AI projects in 60 minutes

Run this on every active AI workflow in the business. Five to ten minutes per workflow.

For each workflow, answer these five questions:

| Question | Durable answer | Fragile answer |
| --- | --- | --- |
| 1. If the model version changes, would this workflow still produce acceptable output? | Yes; quality comes from context, not prompt wording | Not sure; prompt was tuned to this specific version |
| 2. If the platform changed its pricing or features, could we rebuild this elsewhere in under a day? | Yes; workflow logic is documented outside the platform | No; workflow only exists inside this tool’s interface |
| 3. If the person who built this left tomorrow, could someone else run and fix it? | Yes; there is documentation a new person could follow | No; they would need to ask the builder |
| 4. If the context pack were removed, would output quality drop significantly? | Yes; context is doing the work | No; context pack is thin; prompt is doing everything |
| 5. Has this workflow produced consistent acceptable output across at least 20 runs? | Yes; tested and stable | Partially; worked well at first and now is inconsistent |

Scoring:

  • 5 durable answers: this workflow is built to last; maintain the documentation and review quarterly
  • 3–4 durable answers: partially durable; identify the fragile point and address it specifically
  • 0–2 durable answers: this is a sandcastle; it needs to be rebuilt before the next model change reveals it
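The scoring bands above are simple enough to express as a small function, which helps when auditing many workflows in one sitting. This is a minimal sketch: the function name `durability_verdict` and the short verdict labels are assumptions, while the bands themselves come from the scoring list above.

```python
def durability_verdict(answers: list[bool]) -> str:
    """Map five audit answers (True = durable) to a scoring band."""
    if len(answers) != 5:
        raise ValueError("expected one answer per audit question (five in total)")
    durable = sum(answers)
    if durable == 5:
        return "built to last"
    if durable >= 3:
        return "partially durable"
    return "sandcastle"


# Example: durable on questions 1, 2 and 4, fragile on 3 and 5.
print(durability_verdict([True, True, False, True, False]))
```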

How to harden a fragile workflow — the rebuild process

The audit produces a list. This section turns that list into action. Four steps, in order.

Step 1 — Extract the logic from the platform

Open the workflow. Document every element in a plain text document outside the tool:

  • What is the input? (What data or text goes in?)
  • What is the expected output? (Format, length, quality standard?)
  • What is the prompt structure? (The logic: what context is loaded, what instruction is given, what constraints are set?)
  • What is the human checkpoint? (Where does a human review before the output is used?)
  • What are the common failure modes? (What goes wrong and why?)

This document is the workflow’s architecture. If the platform disappears tomorrow, this document is what rebuilds it.

Step 2 — Move quality from the prompt to the context

If current workflow quality depends on a very specific prompt, the quality is fragile. Move as much quality-producing information as possible into the context pack:

  • Company voice and tone → context pack
  • Industry terminology and decision rules → context pack
  • Client archetype and communication standards → context pack
  • Output format requirements → context pack as a “formatting rules” section

The prompt that remains should be simple and generic.

A prompt as simple as “Draft a customer delay notification using the context loaded” should produce good output. It should not take a prompt with fifteen specific instructions that works only because of how the model behaved in one particular version.
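As a rough sketch of what this looks like in practice, the context pack can be kept as named sections and joined ahead of a short, generic task. The function name `build_request`, the bracketed section layout, and the placeholder section bodies are assumptions for illustration; your own context pack supplies the real content.

```python
def build_request(context_pack: dict[str, str], task: str) -> str:
    """Join the named context sections, then append a short generic task."""
    sections = [f"[{title}]\n{body.strip()}" for title, body in context_pack.items()]
    sections.append(f"[Task]\n{task}")
    return "\n\n".join(sections)


# Placeholder bodies: replace each with your own documented material.
context_pack = {
    "Voice and tone": "Placeholder: company voice guide.",
    "Terminology and decision rules": "Placeholder: industry terms and rules.",
    "Client archetype": "Placeholder: who the reader is and how to address them.",
    "Formatting rules": "Placeholder: output format requirements.",
}

request = build_request(
    context_pack,
    "Draft a customer delay notification using the context loaded",
)
print(request)
```

Because the task line carries no model-specific tuning, the same request text can be sent to a different model without rewriting it.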

Step 3 — Test across models and prompt variations

Run the hardened workflow on at least three variations:

  • Different model (Claude vs GPT-4 vs Gemini)
  • Different prompt wording (same logic, different phrasing)
  • Different operator (a team member who did not build it)

If all three produce acceptable output, the workflow is durable. If any produces unacceptable output, the fragile element is in the variation that failed; identify it and fix it specifically.
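The model and prompt-wording variations can be run as a small test matrix. The sketch below assumes each model is wrapped in a plain function that takes a prompt and returns text, and that `passes` encodes your quality bar; the stub models and the pass check are hypothetical. The operator variation is a human exercise and is not covered here.

```python
from itertools import product
from typing import Callable


def failing_combinations(
    models: dict[str, Callable[[str], str]],
    prompts: list[str],
    passes: Callable[[str], bool],
) -> list[tuple[str, str]]:
    """Return every (model name, prompt) pair whose output fails the quality bar."""
    return [
        (name, prompt)
        for (name, generate), prompt in product(models.items(), prompts)
        if not passes(generate(prompt))
    ]


# Hypothetical stubs standing in for real model calls.
models = {
    "model_a": lambda prompt: "Dear customer, your order is delayed.",
    "model_b": lambda prompt: "delay",
}
prompts = [
    "Draft a customer delay notification using the context loaded",
    "Using the loaded context, write a delay notice for the customer",
]

failures = failing_combinations(
    models, prompts, passes=lambda out: out.startswith("Dear customer")
)
print(failures)
```

An empty list means every combination passed. Any entry names the exact model and wording to investigate.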

Step 4 — Transfer ownership

Find the team member who will own this workflow going forward. Walk through the documentation with them. Have them run the workflow independently once. Have them identify one thing they would improve.

If they can do both successfully, ownership is transferred. If they cannot, the documentation needs to be clearer before the workflow is considered hardened.


The model update question — what actually changes and what does not

Not every model update breaks existing workflows. Understanding what changes helps calibrate the level of hardening required.

What model updates typically change:

  • Output length defaults (newer models often produce longer or shorter responses by default)
  • Instruction-following behaviour (newer models are often more literal, less creative, or vice versa)
  • Safety filtering thresholds (some outputs that worked before are filtered after an update)
  • Formatting defaults (bullet points vs prose, heading use, markdown rendering)

What model updates do not typically change:

  • The ability to follow clear, explicit instructions
  • The ability to use loaded context to produce domain-specific outputs
  • The quality improvement that comes from well-structured context
  • The fundamental logic of well-designed workflows

Workflows that rely on the model inferring what you want from a vague prompt are vulnerable. Workflows that give the model explicit, structured instructions and load the context explicitly are stable.

The update changes the inference behaviour. It does not change the model’s ability to follow clear instructions.

The quarterly review cadence:

Every AI workflow should be reviewed against its quality bar once per quarter. Not rebuilt; reviewed. Run the workflow on five recent real inputs. Compare the outputs to the quality bar in the documentation:

  • Outputs pass: leave it alone
  • Outputs drift: identify whether the drift is in the model, the context pack, or the workflow design; fix the right element
  • Outputs break: use the architecture documentation to rebuild in hours, not weeks

The workflows most worth hardening first

Not every workflow warrants full hardening. Prioritise by two variables: how often it runs and how high the cost is if it breaks.

| Workflow type | Hardening priority | Reason |
| --- | --- | --- |
| Daily, client-facing (follow-up emails, notifications, communications) | Highest | Breaks daily; bad output reaches clients |
| Weekly, internal (ops reports, pipeline summaries) | High | Breaks weekly; reduces team trust in the system |
| Periodic, high-stakes (proposals, contracts, senior communications) | High | Low frequency but high consequence per failure |
| Occasional, low-stakes (internal research, draft notes) | Low | Breaks rarely; low consequence; easy to catch manually |
| Experimental (things being tested, not yet deployed) | None until stable | Not worth hardening until the workflow is proven |

The hardening effort is proportionate to the risk. Start with the daily client-facing workflow. That is the sandcastle that costs most when it breaks.
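If the audit produces a long list, the priority table can be turned into a sort order. This is a sketch under assumptions: the type labels, the numeric ranks, and the example workflow names are illustrative, and experimental workflows are left out because the table assigns them no priority until they are stable.

```python
# Lower rank = harden sooner. Ranks follow the priority table above.
PRIORITY = {
    "daily client-facing": 0,    # Highest
    "weekly internal": 1,        # High
    "periodic high-stakes": 1,   # High
    "occasional low-stakes": 2,  # Low
}


def hardening_order(workflows: dict[str, str]) -> list[str]:
    """Sort workflow names by priority, skipping experimental ones."""
    ranked = [
        (PRIORITY[kind], name)
        for name, kind in workflows.items()
        if kind != "experimental"
    ]
    return [name for _, name in sorted(ranked)]


# Hypothetical example inventory.
workflows = {
    "pipeline summary": "weekly internal",
    "follow-up emails": "daily client-facing",
    "prompt experiments": "experimental",
    "draft notes": "occasional low-stakes",
}
print(hardening_order(workflows))
```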


Common questions on AI workflow durability

“Should I be on the latest model version?”

Generally yes, but test your most critical workflows before upgrading, not after.

The upgrade path: update in a staging environment first, run your top five workflows against the quality bar in the documentation, confirm they pass, then update production. If a workflow fails the test, use Step 2 above to move more quality into the context pack before deploying.
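The upgrade path amounts to a simple gate: production moves only when every critical workflow passes in staging. A minimal sketch, assuming you record each staging result as pass or fail against the documented quality bar; the function name `upgrade_decision` and the example workflow names are hypothetical.

```python
def upgrade_decision(staging_results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (safe to update production, workflows that failed in staging)."""
    failing = sorted(name for name, passed in staging_results.items() if not passed)
    return (len(failing) == 0, failing)


# Hypothetical staging run of the top five workflows.
staging_results = {
    "follow-up emails": True,
    "delay notifications": False,
    "ops report": True,
    "pipeline summary": True,
    "proposal draft": True,
}
safe, to_fix = upgrade_decision(staging_results)
print(safe, to_fix)
```

Anything in the failing list goes back through Step 2 before production is updated.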

“What do I do when a workflow breaks after a model update?”

First, check whether the break is in the model behaviour or in the context pack.

Run the same prompt with the old model version (if available). If the output was fine before the update, the issue is model behaviour, so apply Step 2. If the output was not fine before either, the workflow was already fragile and the update revealed it.

“Is it worth building on one platform or should I use multiple?”

Platform diversification does not make the system more durable. It makes it more complex.

Durability comes from architecture (context-first design, documented logic, transferable ownership), not from using three tools instead of one. Use the best-fit platform for each workflow type. Document the logic outside it. That is the diversification that matters.

“How do I know if my context pack is doing the work or my prompt is?”

Run the remove-the-context-pack test from Foundation 1 above.

If the output quality drops significantly without the context pack, the context is doing the work and the system is durable. If the output quality stays similar, the prompt is doing the work and the system is fragile.

“What if I’ve built everything inside one vendor’s ecosystem?”

Do Step 1 immediately: extract the logic from every workflow into external documentation.

This does not require moving anything. It creates the architecture documentation that protects you if the vendor changes anything. Once the documentation exists, assess each workflow against the five durability questions above. The ones that score 0–2 get hardened first.

“How often should I test existing workflows?”

Quarterly is the right cadence for most workflows. An additional test after any major model update is non-negotiable. For daily client-facing workflows, monthly is safer.

The test is a five-output spot-check against the quality bar. If it passes, move on. If it does not, investigate.


Want to know which of your current AI workflows are sandcastles — before the next model update reveals them?

The model will update. The platform will change its pricing. The person who built the workflow will eventually leave. None of these needs to break the AI system, provided the system was built on context, documented thoroughly, and designed with transferable ownership from the start.

Path one: run the audit yourself. Take your five most-used AI workflows through the five durability questions above. You will know within an hour which ones are built to last and which ones need hardening.

Path two: bring in a partner. If you want the full durability audit run, the fragile workflows hardened, and the context layer rebuilt to the standard that survives model changes, that is the work Phos AI Labs does in Phase 1 and Phase 2. The fastest way to know if it is the right fit is a conversation. Thirty minutes, no deck. Start here.

