What You Actually Need to Know About LLMs Before the Next Interview
"All models are wrong, but some are useful."
George Box
A short note before the piece. Starting next week, this newsletter will run on a steady rhythm: a short, practical piece every Tuesday and a long, structural piece every Saturday. The Saturday issues keep building the year-long curriculum, the diagnosis, the role map, the path from data engineer to AI expert. The Tuesday pieces are shorter and more practical. Each one closes a specific gap from what the first issue called the missing fifth: the things you do not yet know about LLMs, framed in terms of what you will actually need in the next interview.
This week, as a one-off, the first short piece arrives today instead of Tuesday, because the new rhythm starts cleanly next week. Next Saturday, the long piece continues the diagnosis movement with how to read a 2026 job posting without losing hope. From then on, Tuesdays for the short practical primers, Saturdays for the long structural pieces.
If you have applied to AI engineer roles in the last few weeks, you have probably noticed that the technical conversation now turns, at some point, to a question that sounds harmless but is doing real work. How do you think about model behaviour? Or, why might this model give a great answer to one question and a poor one to a similar question? Or, walk me through what happens when a user sends a prompt.
These questions are not testing whether you can implement attention from scratch. They are testing whether you have a useful mental model of what the system is doing. A senior candidate who has spent twenty years building data systems can reach that bar in an afternoon of careful study. You do not need a PhD. You need the right small set of ideas, in the right shape.
Here is the shape.
A transformer is a next-token predictor.
That is the entire system, at its core. You give the model a sequence of text. It produces a token (a word or part of a word). Then it appends that token to the sequence and produces the next one. Then the next. It keeps going until it generates a stop signal or hits a length limit.
There is no agent inside the model that decides
If you internalize nothing else from this piece, internalize this: predict the next token, given the context, based on what was learned during training.
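To make the loop concrete, here is a minimal sketch in Python. It is a toy, not how a real transformer is implemented: `TOY_MODEL`, `predict_next`, and `generate` are hypothetical names invented for illustration, and the "model" is a small lookup table that only looks at the last token, whereas a real LLM conditions on the whole context.

```python
import random

# Hypothetical toy "model": maps the previous token to next-token probabilities.
# A real LLM computes these probabilities from the entire context.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<stop>": 1.0},
}


def predict_next(context, rng):
    """Pick one next token given the context (this toy only uses the last token)."""
    probs = TOY_MODEL[context[-1]]
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(prompt, max_tokens=10, seed=0):
    """Repeat: predict a token, append it to the context, predict again."""
    rng = random.Random(seed)
    context = list(prompt)
    output = []
    for _ in range(max_tokens):      # the length limit
        token = predict_next(context, rng)
        if token == "<stop>":        # the stop signal
            break
        context.append(token)        # the new token becomes part of the input
        output.append(token)
    return output


print(generate(["<start>"]))
```

Everything the loop does is visible here: predict, append, repeat, and stop on a stop token or a length limit.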
Three foundations follow.
The model generates one token at a time. This sounds like a small detail. It is not. It explains why response latency scales mostly with output length rather than input length: the prompt is processed in one pass, while each output token needs its own step. It explains why streaming exists (the model can show you tokens as it produces them, rather than waiting to finish). It explains why an early mistake in a long response often compounds, because the model conditions each new token on everything it has already produced, including its own errors. And it explains why prompting techniques like "think step by step" work. You are giving the model more tokens to produce, which allows it to perform more intermediate computations to reach a better final answer.
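A rough sketch of the latency and streaming points, in Python. The function names and the timing numbers (`ms_per_token`, `prompt_overhead_ms`) are assumptions chosen for illustration, not measurements of any real model.

```python
def stream_tokens(tokens):
    """Yield tokens one at a time, the way a streaming response surfaces them."""
    for token in tokens:
        yield token


def estimate_latency_ms(output_tokens, ms_per_token=20.0, prompt_overhead_ms=200.0):
    """Back-of-the-envelope latency: a one-off cost for reading the prompt,
    plus a per-token cost for every token generated. Both defaults are made up."""
    return prompt_overhead_ms + output_tokens * ms_per_token


# The caller can display each token as soon as it arrives.
for token in stream_tokens(["The", " answer", " is", " 42"]):
    print(token, end="", flush=True)
print()

# Ten times the output costs roughly ten times the generation time.
print(estimate_latency_ms(50))
print(estimate_latency_ms(500))
```

The point of the estimate is the shape, not the numbers: the per-token term dominates as the output grows.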
The context window is the only memory the model has. The model itself is frozen after training. It does not learn from your conversation. Everything the model knows about your specific situation must be within the context window for the current request: the system prompt, the conversation history, and any documents the application has retrieved and inserted. When a chatbot appears to remember what you said three turns ago, it is because the application is resending those three turns to the model with each new request. When the model forgets something halfway through a long conversation, it is because the context window has filled up and earlier turns have been dropped.
This is the foundation of every retrieval-augmented generation system, every agentic workflow, every long-context application. The entire discipline of AI engineering is, in one sense, the work of deciding what goes into the context window and what does not.
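The "application resends the history" idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `count_tokens` is a crude word-count stand-in for a real tokenizer, `build_context` is a hypothetical helper, and real applications use more careful strategies (summarizing, retrieval) than simply dropping the oldest turns.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_context(system_prompt, history, new_message, budget):
    """Assemble what is sent to the model for one request.

    The system prompt and the new message are always included. Earlier turns
    are added newest-first until the token budget is used up, so the oldest
    turns are the first to be dropped.
    """
    used = count_tokens(system_prompt) + count_tokens(new_message)
    kept = []
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return [system_prompt] + kept + [new_message]


history = ["my name is Ana", "nice to meet you Ana", "what is a token"]
for message in build_context("be brief", history, "what is my name", budget=15):
    print(message)
```

With this small budget the first turn, the one that contains the name, no longer fits. The model has not forgotten anything. It was never shown that turn in this request.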
The output is probabilistic, not deterministic. Even with identical input, the model can produce different outputs across runs. This is by design. The model outputs a probability distribution over possible next tokens, and the sampling step selects one. A parameter called temperature controls how much the sampling favors the most probable token versus exploring less probable ones. Low temperature gives consistent, conservative output. High temperature yields varied, creative output at the cost of consistency.
When the interviewer asks why the model gave a great answer to one question and a poor one to a similar question, this is most of the answer. The two inputs were not as similar as they appeared on the surface. Small differences in phrasing, context, or the model's sampled path through the probability distribution produce different outputs. The model is not flaky. It is doing exactly what it was built to do, which is to produce statistically plausible continuations of text.
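Temperature is easier to reason about with numbers in front of you. The sketch below shows one common way it is applied: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The function names, candidate tokens, and logit values are made up for illustration, and real inference stacks add further sampling controls that are not shown here.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature.

    Temperature must be greater than zero in this sketch.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = ["Paris", "London", "banana"]   # hypothetical candidate next tokens
logits = [2.0, 1.0, 0.0]                 # hypothetical raw scores

for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

rng = random.Random(0)
print([sample(tokens, logits, 1.0, rng) for _ in range(5)])
```

At low temperature nearly all of the probability sits on the top token, so runs look alike. At high temperature the distribution flattens and less likely tokens get picked more often.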
What this gives you in the interview.
These three foundations let you answer the "how do you think about model behaviour" family of questions like a practitioner.
A model that produces inconsistent outputs is not broken. It is sampling from a probability distribution.
A model that hallucinates is not lying. It is producing a statistically plausible continuation given its training data, even when no good continuation exists.
A model that loses track of an earlier instruction is not failing. Its context window is finite, and the application has decided what to include in the current request.
A model that responds slowly when producing long outputs is not poorly engineered. It is producing tokens sequentially, and the latency grows with the length of the output.
You will sound like someone who has thought about this for longer than ten minutes. Because you have. That is the actual bar in the interview.
One layer this primer does not cover.
There is a deeper layer of understanding: how the model actually computes the next token from the input. That involves attention, embeddings, the network's layers, and the mathematics of the transformation between them. Most AI engineering interviews do not require this depth in the first conversation. They require the mental model above. The deeper layer is worth learning, and we will reach it in a future Tuesday piece.
For now, the next-token-predictor framing, the three foundations, and the implications above are enough to make you a credible technical conversation partner. Two hours with these ideas, tested against your own experience using ChatGPT or Claude in the last month, and you have made yourself meaningfully more interview-ready by the end of the day.
Understand the system, not just the symptoms.
Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.