Few-shot prompting

Overview

One of the most effective ways to improve model performance is to give a model examples of what you want it to do. The technique of adding example inputs and expected outputs to a model prompt is known as "few-shot prompting". The technique is based on the Language Models are Few-Shot Learners paper. There are a few things to think about when doing few-shot prompting:

  1. How are examples generated?
  2. How many examples are in each prompt?
  3. How are examples selected at runtime?
  4. How are examples formatted in the prompt?

Here are the considerations for each.

1. Generating examples

The first and most important step of few-shot prompting is coming up with a good dataset of examples. Good examples should be relevant at runtime, clear, informative, and provide information that was not already known to the model.

At a high level, the basic ways to generate examples are:

  • Manual: a person (or people) generates examples they think are useful.
  • Better model: a better (presumably more expensive/slower) model's responses are used as examples for a worse (presumably cheaper/faster) model.
  • User feedback: users (or labelers) leave feedback on interactions with the application and examples are generated based on that feedback (for example, all interactions with positive feedback could be turned into examples).
  • LLM feedback: same as user feedback but the process is automated by having models evaluate themselves.

Which approach is best depends on your task. For tasks where a small number of core principles need to be understood really well, it can be valuable to hand-craft a few really good examples. For tasks where the space of correct behaviors is broader and more nuanced, it can be useful to generate many examples in a more automated fashion so that there's a higher likelihood of having some highly relevant examples for any runtime input.
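
For instance, here's a minimal sketch of the user feedback approach. The interaction records and the "score" field are assumptions about how your application might log feedback, not any particular LangChain API:

```python
# Illustrative sketch: turn interactions that received positive feedback into
# few-shot examples. The record structure and `score` field are assumptions
# about how your application logs feedback.
interactions = [
    {"input": "How do I reset my password?", "output": "Go to Settings > Security > Reset password.", "score": 1},
    {"input": "Do you offer refunds?", "output": "I'm not sure.", "score": 0},
]

examples = [
    {"input": i["input"], "output": i["output"]}
    for i in interactions
    if i["score"] == 1  # keep only interactions users rated positively
]
```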

Single-turn vs. multi-turn examples

Another dimension to think about when generating examples is what the example is actually showing.

The simplest types of examples just have a user input and an expected model output. These are single-turn examples.

A more complex type of example is one where the example is an entire conversation, usually one in which a model initially responds incorrectly and a user then tells the model how to correct its answer. This is called a multi-turn example. Multi-turn examples can be useful for more nuanced tasks where it's helpful to show common errors and spell out exactly why they're wrong and what should be done instead.
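
As a rough sketch using langchain_core message objects (the example content here is illustrative), the two formats might look like:

```python
from langchain_core.messages import AIMessage, HumanMessage

# Single-turn: one input and the expected output.
single_turn_example = [
    HumanMessage("Convert 'Hello World!' to snake_case."),
    AIMessage("hello_world"),
]

# Multi-turn: the model first answers incorrectly, and the "user" turn explains
# exactly what was wrong and what to do instead.
multi_turn_example = [
    HumanMessage("Convert 'Hello World!' to snake_case."),
    AIMessage("Hello_World!"),
    HumanMessage("Not quite: lowercase all letters and drop the punctuation."),
    AIMessage("hello_world"),
]
```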

2. Number of examples

Once we have a dataset of examples, we need to think about how many examples should be in each prompt. The key tradeoff is that more examples generally improve performance, but larger prompts increase costs and latency, and beyond some threshold too many examples can start to confuse the model. Finding the right number is highly dependent on the model, the task, the quality of the examples, and your cost and latency constraints. Anecdotally, the better the model, the fewer examples it needs to perform well and the more quickly you hit steeply diminishing returns from adding more. But the only reliable way to answer this question is to run some experiments with different numbers of examples.
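
Such an experiment can be quite simple. Here's a minimal sketch, in which the task (sentiment labeling), the eval set, and the model name ("gpt-4o-mini" via langchain-openai) are all illustrative assumptions to swap for your own:

```python
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")  # illustrative model choice

examples = [
    {"input": "happy", "output": "positive"},
    {"input": "terrible", "output": "negative"},
    {"input": "fine, I guess", "output": "neutral"},
    {"input": "thrilled", "output": "positive"},
]
eval_set = [
    {"input": "awful", "expected": "negative"},
    {"input": "delighted", "expected": "positive"},
]

def build_prompt(text: str, shots: list[dict]) -> str:
    # Format the first k examples as input/label pairs ahead of the real input.
    formatted = "\n".join(f"Input: {s['input']}\nLabel: {s['output']}" for s in shots)
    return (
        "Classify the sentiment as positive, negative, or neutral.\n\n"
        f"{formatted}\n\nInput: {text}\nLabel:"
    )

def accuracy(k: int) -> float:
    correct = 0
    for case in eval_set:
        prediction = model.invoke(build_prompt(case["input"], examples[:k])).content
        correct += int(prediction.strip().lower() == case["expected"])
    return correct / len(eval_set)

# Compare performance (and prompt size / cost) at different example counts.
for k in (0, 2, 4):
    print(k, accuracy(k))
```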

3. Selecting examples

Assuming we are not adding our entire example dataset into each prompt, we need to have a way of selecting examples from our dataset based on a given input. We can do this:

  • Randomly
  • By (semantic or keyword-based) similarity of the inputs
  • Based on some other constraints, like token size

LangChain has a number of ExampleSelectors which make it easy to use any of these techniques.

Generally, selecting by semantic similarity leads to the best model performance. But how much this matters is again model- and task-specific, and is something worth experimenting with.
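
Here's a minimal sketch of semantic similarity selection with SemanticSimilarityExampleSelector, assuming langchain-openai, langchain-community, and faiss-cpu are installed; the example data is illustrative:

```python
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings

examples = [
    {"input": "How do I reset my password?", "output": "Go to Settings > Security > Reset password."},
    {"input": "How do I delete my account?", "output": "Contact support to delete your account."},
    {"input": "What payment methods do you accept?", "output": "We accept credit cards and PayPal."},
]

# Embed the example inputs and index them in a vector store so the most
# semantically similar examples can be retrieved for each new input.
selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    FAISS,
    k=2,
    input_keys=["input"],  # only embed the input, not the expected output
)

selector.select_examples({"input": "I forgot my login credentials"})
```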

4. Formatting examples

Most state-of-the-art models these days are chat models, so we'll focus on formatting examples for those. Our basic options are to insert the examples:

  • In the system prompt as a string
  • As their own messages

If we insert our examples into the system prompt as a string, we'll need to make sure it's clear to the model where each example begins and which parts are the input versus output. Different models respond better to different syntaxes, like ChatML, XML, TypeScript, etc.
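
For instance, here's a sketch of inlining XML-formatted examples into the system prompt; the tag names and example content are illustrative:

```python
from langchain_core.messages import HumanMessage, SystemMessage

examples = [
    {"input": "How do I reset my password?", "output": "Go to Settings > Security > Reset password."},
    {"input": "Do you offer refunds?", "output": "Yes, within 30 days of purchase."},
]

# Wrap each example in XML tags so the model can tell where each example
# begins and which part is the input versus the output.
formatted_examples = "\n".join(
    f"<example>\n  <input>{e['input']}</input>\n  <output>{e['output']}</output>\n</example>"
    for e in examples
)

messages = [
    SystemMessage(
        f"Answer support questions concisely.\n\nHere are some examples:\n{formatted_examples}"
    ),
    HumanMessage("How do I change my email address?"),
]
```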

If we insert our examples as messages, where each example is represented as a sequence of Human and AI messages, we might also want to assign names to our messages, like "example_user" and "example_assistant", to make it clear that these messages correspond to different actors than the latest input message.
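
A sketch of this message-based formatting is below; support for the name field varies by provider, and the example content is illustrative:

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

messages = [
    SystemMessage("Answer support questions concisely."),
    # Few-shot examples, named so they read as different actors than the user.
    HumanMessage("How do I reset my password?", name="example_user"),
    AIMessage("Go to Settings > Security > Reset password.", name="example_assistant"),
    HumanMessage("Do you offer refunds?", name="example_user"),
    AIMessage("Yes, within 30 days of purchase.", name="example_assistant"),
    # The actual user input comes last, without a name.
    HumanMessage("How do I change my email address?"),
]
```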

Formatting tool call examples

One area where formatting examples as messages can be tricky is when our example outputs have tool calls. This is because different models have different constraints on what types of message sequences are allowed when any tool calls are generated.

  • Some models require that any AIMessage with tool calls be immediately followed by ToolMessages for every tool call.
  • Some models additionally require that any ToolMessages be immediately followed by an AIMessage before the next HumanMessage.
  • Some models require that tools are passed to the model whenever there are any tool calls / ToolMessages in the chat history.

These requirements are model-specific and should be checked for the model you are using. If your model requires ToolMessages after tool calls and/or AIMessages after ToolMessages and your examples only include expected tool calls and not the actual tool outputs, you can try adding dummy ToolMessages / AIMessages to the end of each example with generic contents to satisfy the API constraints. In these cases it's especially worth experimenting with inserting your examples as strings versus messages, as having dummy messages can adversely affect certain models.
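
As a sketch, a tool-calling example padded with a dummy ToolMessage (and a trailing AIMessage for providers that require one) might look like this; the tool name, arguments, and IDs are illustrative:

```python
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

tool_call_example = [
    HumanMessage("What's the weather in Paris?", name="example_user"),
    # The behavior we want to demonstrate: calling the right tool with the right args.
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[{"name": "get_weather", "args": {"city": "Paris"}, "id": "call_1"}],
    ),
    # Dummy tool output with generic contents, added only to satisfy providers
    # that require a ToolMessage after every tool call.
    ToolMessage("You have correctly called this tool.", tool_call_id="call_1"),
    # Some providers also require an AIMessage after the ToolMessage.
    AIMessage("The tool was called as expected.", name="example_assistant"),
]
```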

You can see a case study of how Anthropic and OpenAI respond to different few-shot prompting techniques on two different tool calling benchmarks here.

