6 min read · AI Engineering · LLMs · Data Engineering

What is AI Engineering?

By Alex

AI engineering is emerging as a distinct discipline. Here's what it actually involves, and why it matters now.

There is a pattern that keeps repeating in the tech industry: a new set of capabilities emerges, and for a while everyone scrambles to figure out what role it creates and who is supposed to fill it. That is roughly where we are with large language models. The models exist. They are capable in ways that feel genuinely new. But the discipline of building reliable, production-grade applications on top of them is still being worked out.

Chip Huyen’s book AI Engineering is one of the clearest attempts I’ve read to name and define that discipline. It’s worth spending some time on what the book actually says, because the term gets used loosely and the distinctions matter.

Building on top of foundation models

The core idea is straightforward: AI engineering is the discipline of building applications on top of foundation models that someone else has already trained. You are working with the model as an API, as an input-output system, not as something you designed from scratch. This is a meaningful distinction because it opens the door to a much wider group of engineers building things that would have previously required highly specialised teams.

The analogy that clicks for me: you don’t need to understand how a database engine works to build software that uses one well. You do need to understand its behaviour, its failure modes, how to structure your data, when it is the right tool and when it isn’t. AI engineering is developing that same body of practice for foundation models.

One of the most useful parts of the book is how it maps out the stack. At the bottom you have the foundation models themselves. On top of that, the application layer: how you design the system that uses those models to do something useful. The choices in the middle (how you prompt, whether you add retrieval, whether you fine-tune) are where most of the interesting engineering lives.

Prompt engineering is real engineering

The book treats prompt engineering seriously, which I appreciate, because there is still a tendency to dismiss it as not quite legitimate. The reality is that how you structure the input to a language model has an enormous effect on the quality of the output, in ways that are not always intuitive and that require systematic thinking.

What the book describes is not the “magic phrase” version of prompt engineering that gets mocked on social media. It is the discipline of thinking carefully about task decomposition, providing useful context and examples, structuring output formats, and designing prompts that degrade gracefully when the model is uncertain.
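To make that concrete, here is a rough sketch of what systematic prompt construction can look like: the prompt as a structured object with explicit slots for task, context, examples, and output format, assembled in a fixed order. The names here are mine, not the book's; treat this as one possible shape, not a recipe.

```python
from dataclasses import dataclass, field

@dataclass
class PromptSpec:
    """A structured description of a prompt: task, context, examples, format."""
    task: str
    context: str = ""
    examples: list = field(default_factory=list)  # list of (input, output) pairs
    output_format: str = ""

def build_prompt(spec: PromptSpec) -> str:
    """Assemble the prompt sections in a fixed, predictable order,
    so changes to one section can be tested in isolation."""
    parts = [f"Task: {spec.task}"]
    if spec.context:
        parts.append(f"Context:\n{spec.context}")
    for i, (inp, out) in enumerate(spec.examples, 1):
        parts.append(f"Example {i}:\nInput: {inp}\nOutput: {out}")
    if spec.output_format:
        parts.append(f"Respond in this format:\n{spec.output_format}")
    return "\n\n".join(parts)

prompt = build_prompt(PromptSpec(
    task="Classify the support ticket as billing, technical, or other.",
    examples=[("My invoice is wrong", "billing")],
    output_format="One word: billing, technical, or other.",
))
```

The point is less the template itself than the habit: once prompts are data rather than strings scattered through the codebase, you can version them, diff them, and evaluate changes systematically.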

This connects to something broader in the book: the idea that working with language models requires a shift in how you think about specifications. In traditional software, you write code that does exactly what you tell it. With an LLM-based system, you are working with a probabilistic component. Your job is not to control every output but to design a system where the distribution of outputs is useful and safe.
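One common way to engineer around that probabilistic component is to validate every output and retry within a budget, so the distribution of outputs that actually reaches users is constrained even though individual calls are not. A minimal sketch, with a stub standing in for the model call:

```python
import json
from itertools import cycle

def guarded_call(model, prompt, validate, max_attempts=3):
    """Call a probabilistic model, accepting only outputs that pass validation.
    `model` is any callable prompt -> str; `validate` returns True/False."""
    for _ in range(max_attempts):
        output = model(prompt)
        if validate(output):
            return output
    return None  # caller decides the fallback: default answer, human escalation, etc.

def is_valid_json(s):
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False

# Stub model that sometimes returns malformed output, to illustrate the point.
responses = cycle(['not json', '{"answer": 42}'])
stub_model = lambda prompt: next(responses)

result = guarded_call(stub_model, "Compute the answer as JSON.", is_valid_json)
```

You cannot guarantee any single call succeeds, but you can make the system's observable behaviour well-defined: a validated output or an explicit, handled failure.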

Retrieval-augmented generation

RAG gets a significant amount of space in the book, for good reason. It is the most common and often most practical way to give a model access to knowledge it was not trained on: instead of asking the model to answer from memory, you retrieve relevant documents from your own data and include them in the prompt. The model reasons over what you provide. This sidesteps the hallucination problem to a meaningful degree and lets you build on proprietary or recent information without any model training at all.

What the book makes clear is that the simple version of RAG is easy to implement and not very good. Chunking strategies, embedding quality, retrieval precision, prompt design, and citation handling all matter. Getting any one of them wrong degrades the whole system. The pipeline looks simple until you start asking it to be reliable.

Fine-tuning: when it is and isn’t the answer

The book is usefully precise about when fine-tuning is actually worth doing. The honest answer is: not as often as people assume. Prompting and retrieval can get you surprisingly far, and fine-tuning introduces cost, infrastructure complexity, and the ongoing problem of keeping a trained model fresh as your data changes. The cases where it pays off are when you need the model to reliably follow a very specific format, adopt a particular style, or perform well in a narrow domain where in-context examples just don’t cut it.

Evaluation is the hardest problem

If I had to name the chapter that will stick with me longest, it would be the section on evaluation. Evaluating the output of a language model is genuinely hard, and most teams do not do it well.

In traditional software, you write tests with known inputs and expected outputs. With generative AI, the output is open-ended and often evaluated on criteria that are inherently subjective. Is this response helpful? Is it accurate? Is it safe? The book covers a range of approaches: human evaluation, reference-based metrics, model-based evaluation (using another LLM to score outputs), and task-specific automated tests. The honest takeaway is that you need multiple layers and you need to be humble about what each one tells you. Teams that ship AI products without a serious evaluation strategy are not really in control of what they’ve built.

Agents and the limits of autonomy

The final major theme is agents: systems where the model doesn’t just respond to a query but takes actions, uses tools, and operates over a sequence of steps. This is where the most excitement lives right now, and also where the most caution is warranted.

Agents introduce compounding failure modes. A single bad step early in a sequence can invalidate everything that follows. The book is refreshingly clear-eyed about this: the current generation of agents works best in constrained, well-defined domains with good tool interfaces and frequent human checkpoints. Fully autonomous agents operating over long horizons are still a research problem more than a production pattern.

What this actually requires

Reading this book, what strikes me is that the engineers who will build the best AI applications are not necessarily the ones who understand the most about model internals. They are the ones who understand how to build reliable systems, how to think carefully about data, and how to evaluate whether what they built actually works. The discipline is new. The underlying engineering rigour is not.

If your team is starting to think seriously about building with AI, it is worth being clear-eyed about what you are actually building: not a chatbot that runs on vibes, but a system with all the usual engineering requirements, plus some new ones.
