April 07, 2026

Horses for Courses — Choosing the Right AI Model

When building an AI pipeline for image cataloguing, we learnt the value of optimising model choice per stage, and of staggering that work: start with a general model everywhere, then move individual stages to more specific ones.

AI · Architecture · Performance · LLM

When we started building an AI pipeline for image cataloguing, the first priority was to get something working, so we used gpt-4o across the board [1]. That got us to a working pipeline reasonably quickly. Once it was stable, we started thinking about the various stages more carefully: with thousands of operations a day in mind, there was a reasonable case for optimising for speed and cost without compromising on quality.


The Pipeline

The pipeline takes an image through four stages — classification, parsing (extracting a description or raw text content), converting that into structured data for search, and finally generating tags for cataloguing. The specifics of each stage are less important than the overall shape of it: four distinct steps, each with different demands on the model.

We started with gpt-4o across all four stages. The pipeline worked well. Then we looked at whether each stage actually needed the same level of capability.
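That starting point can be sketched as a per-stage model map. This is a minimal illustration, assuming the model is kept as a per-stage setting; the stage names are from this post, but the config shape and function names are hypothetical, not our actual codebase.

```python
# Starting configuration: the same model at every stage. Keeping the
# model a per-stage setting is what later allows each stage to be
# tuned independently.
INITIAL_CONFIG = {
    "classification": "gpt-4o",
    "parsing": "gpt-4o",            # OCR / document parsing
    "structured_extraction": "gpt-4o",
    "tagging": "gpt-4o",
}

def model_for(stage: str, config: dict = INITIAL_CONFIG) -> str:
    """Look up the model assigned to a pipeline stage."""
    try:
        return config[stage]
    except KeyError:
        raise ValueError(f"unknown pipeline stage: {stage}")
```

Because each stage reads its model from the config rather than hard-coding it, swapping a model later is a one-line change.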


What We Changed

Classification — gpt-4o-mini

Classification asks one question: what type of image is this? It routes everything downstream based on the answer. We moved this stage to gpt-4o-mini [2] and found the quality on par with gpt-4o. Speed also improved markedly, which made sense given the task: it is constrained and well-defined, and does not require the kind of reasoning that justifies the larger model.
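As a sketch, a classification call looks something like the following, assuming the OpenAI Chat Completions message format with image input. The category list and prompt wording are illustrative, not our production prompt.

```python
def build_classification_request(image_url: str) -> dict:
    """Construct the request payload for the classification stage."""
    return {
        "model": "gpt-4o-mini",  # the smaller model proved on par for this task
        "messages": [
            {"role": "system",
             "content": "Classify the image into exactly one category: "
                        "document, photo, diagram, or other. "
                        "Reply with the category name only."},
            {"role": "user",
             "content": [{"type": "image_url",
                          "image_url": {"url": image_url}}]},
        ],
    }
```

The single-word answer then routes the image to the appropriate downstream stage.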

OCR and Document Parsing — gpt-4o, detail: high

Document parsing was the one stage we kept on gpt-4o. Extracting text accurately from varied document types — ID cards, scanned forms, mixed layouts — is where errors compound most. A misread here affects structured data and tagging downstream. The larger model remained necessary here and the cost was justified.
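The parsing call differs from classification in two ways: the larger model, and the `detail: "high"` image setting, which the Chat Completions API uses to process the image at higher resolution — useful for dense text. The prompt here is illustrative.

```python
def build_parsing_request(image_url: str) -> dict:
    """Construct the request payload for the OCR / document parsing stage."""
    return {
        "model": "gpt-4o",  # the one stage where the larger model stayed
        "messages": [
            {"role": "system",
             "content": "Extract all text from the document verbatim, "
                        "preserving the reading order."},
            {"role": "user",
             "content": [{"type": "image_url",
                          "image_url": {"url": image_url,
                                        # high-detail mode: full-resolution
                                        # tiling, needed for dense documents
                                        "detail": "high"}}]},
        ],
    }
```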

Structured Data Extraction — gpt-4o-mini

Once OCR has done its job and raw text is available, structured data extraction works from that text rather than the image directly. The visual complexity is already resolved at this point. We moved this stage to gpt-4o-mini and found no meaningful difference in the quality of the output.
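Since this stage works from plain text, the request is simpler. The sketch below assumes the Chat Completions `response_format` JSON-schema option to force well-formed output; the schema fields are hypothetical — the post does not specify the real pipeline's fields.

```python
def build_extraction_request(raw_text: str) -> dict:
    """Turn OCR output (plain text) into a structured-extraction request."""
    schema = {
        "name": "document_fields",
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
                "entities": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "date", "entities"],
            "additionalProperties": False,
        },
    }
    return {
        # Text-only input: the visual complexity is already resolved,
        # so the smaller model suffices.
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system",
             "content": "Extract the requested fields from the document text."},
            {"role": "user", "content": raw_text},
        ],
        "response_format": {"type": "json_schema", "json_schema": schema},
    }
```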

Tagging — gpt-4o-mini

By the time we reach tagging, the content is fully understood. The context is already there. We found gpt-4o-mini handled it reliably and the speed gains here were consistent with the other stages.
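Tagging takes the already-structured content as input, so the request is the lightest of the four. Again a hedged sketch: the prompt and tag limit are illustrative.

```python
import json

def build_tagging_request(structured: dict, max_tags: int = 8) -> dict:
    """Build the tagging request from already-structured content."""
    return {
        "model": "gpt-4o-mini",  # the context is already there
        "messages": [
            {"role": "system",
             "content": f"Suggest up to {max_tags} short catalogue tags "
                        "for this content. Return one tag per line."},
            {"role": "user", "content": json.dumps(structured)},
        ],
    }
```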


What We Found

Moving three of the four stages from gpt-4o to gpt-4o-mini made those stages roughly 90% faster, and token consumption dropped proportionally. The one stage left on gpt-4o — OCR on dense documents — continued to justify its cost clearly.


On Optimising Before Scale

During development, when volumes are low, the cost and speed differences are easy to ignore. But we found it was worth doing this analysis before releasing to users at scale rather than after — retrofitting model changes into a live pipeline with real users is a more uncomfortable exercise than doing it while the pipeline is still being validated.

That said, a staggered approach made more sense than trying to optimise everything at once. We started with what worked, identified the stages where the larger model was doing more than the task required, and changed those one at a time. Each change was tested in isolation before moving to the next. Doing it all at once would have made it harder to attribute any quality differences to a specific stage.
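The one-stage-at-a-time swap above can be sketched as a pure config change: alter exactly one stage's model, leave the rest untouched, and any quality difference is attributable to that stage alone. Names here are illustrative.

```python
def swap_stage_model(config: dict, stage: str, new_model: str) -> dict:
    """Return a new pipeline config with exactly one stage's model changed."""
    if stage not in config:
        raise KeyError(stage)
    updated = dict(config)  # copy, so the baseline stays intact for comparison
    updated[stage] = new_model
    return updated

baseline = {
    "classification": "gpt-4o",
    "parsing": "gpt-4o",
    "structured_extraction": "gpt-4o",
    "tagging": "gpt-4o",
}

# First experiment: move only classification to the smaller model,
# then compare its output against the baseline before touching anything else.
candidate = swap_stage_model(baseline, "classification", "gpt-4o-mini")
```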

The general finding — that different stages of the same pipeline have genuinely different capability requirements — seems like something worth carrying into future work.

References

  1. OpenAI — GPT-4o Documentation (2024)
  2. OpenAI — GPT-4o Mini Documentation (2024)
  3. OpenAI — Optimizing LLM Accuracy (2024)