Advanced Prompt Engineering: Chain-of-Thought, ReAct, and Tree-of-Thoughts Explained

Introduction

A prompt is not just a question you type into a chat box. It is a set of instructions that shapes how the model thinks, what format it follows, and how deeply it reasons. Crafting prompts effectively is one of the highest-leverage skills in AI engineering, yet it is often treated as an afterthought.

Early applications relied on simple prompts: "Summarize this text" or "Answer this question." These work for straightforward tasks but fail when problems require multi-step reasoning, tool use, or exploring multiple solution paths. A model asked to solve a multi-step math problem with a direct answer prompt will often produce confident but incorrect results, not because the model lacks capability, but because the prompt does not give it the opportunity to reason carefully.

Modern prompt engineering techniques, Chain-of-Thought (CoT), ReAct, and Tree-of-Thoughts (ToT), unlock significantly better performance on complex tasks by changing how the model structures its thinking. Each technique comes with a different cost and complexity trade-off. Knowing when to use which is as important as knowing how they work.

Problem Statement

LLMs generate responses by predicting the most probable next token given everything before it. For simple factual questions, this works well. For multi-step reasoning, it often fails: when the model has to jump to a final answer without working through the intermediate steps, early prediction errors compound without correction.

The fundamental challenge is that the model's output format directly shapes the quality of the reasoning it performs. A prompt that asks for a direct answer receives a direct answer, but that answer may be reached by pattern matching rather than genuine reasoning. A prompt that asks for step-by-step reasoning leads the model to externalize each step, so that later steps can build on, and sometimes correct, earlier ones. The format of the output is not just cosmetic; it is a lever that controls the quality of the underlying computation.

Core Concepts and Terminology

Term	Definition
Zero-shot prompting	Asking the model to perform a task with no examples, relying entirely on pre-training knowledge.
Few-shot prompting	Providing a small number of input-output examples before the actual task to demonstrate the expected format and behavior.
Chain-of-Thought (CoT)	A prompting technique that asks the model to reason through a problem step by step before producing a final answer.
Self-consistency	Generating multiple independent CoT reasoning chains for the same problem and selecting the most common final answer by majority vote.
ReAct	A framework that interleaves Thought (reasoning about what to do next), Action (calling a tool), and Observation (receiving the tool result) in a loop.
Tree-of-Thoughts (ToT)	A framework that generates multiple candidate reasoning paths at each step, evaluates them, and selects the most promising branches, allowing backtracking.
Prompt template	A reusable prompt structure with variable placeholders that is filled in at runtime.
Token budget	The total number of tokens consumed by a prompt and its response, which determines API cost and latency.
Temperature	A parameter controlling the randomness of the model's outputs. Higher temperature produces more diverse responses, which is useful for self-consistency sampling.

How It Works

Foundation: Zero-Shot and Few-Shot

Before advanced techniques, the baseline approaches are zero-shot and few-shot prompting. Zero-shot prompting relies entirely on what the model learned during pre-training. You ask the question and the model answers. This is appropriate for common, well-defined tasks where the expected output format is unambiguous.

Few-shot prompting adds two to five examples before the actual question. The model infers the expected input-output pattern from the examples and applies it to the new input. This is significantly more effective for domain-specific tasks, custom output formats, or situations where a label such as "sentiment" could mean different things in different contexts. The examples remove ambiguity without requiring model retraining.
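
The few-shot pattern can be sketched as a small prompt builder. This is a minimal illustration, not a prescribed format: the function name `build_few_shot_prompt`, the "Input:"/"Output:" labels, and the support-ticket examples are all assumptions made for this example.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    new_input: str,
) -> str:
    """Assemble instruction, worked examples, then the new input (hypothetical helper)."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts.append(f"Input: {text}")
        parts.append(f"Output: {label}")
        parts.append("")
    # Leave the last output slot empty so the model completes it.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)


# Illustrative examples only; real examples should come from your own domain.
examples = [
    ("The app crashes every time I open settings.", "negative"),
    ("Setup took two minutes and everything just worked.", "positive"),
    ("The invoice arrived on the 3rd.", "neutral"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each support ticket as positive, negative, or neutral.",
    examples,
    "I was charged twice and nobody has replied.",
)
print(prompt)
```

Ending the prompt on an empty "Output:" slot is one common way to signal that the model should continue the established pattern rather than comment on it.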

Chain-of-Thought Prompting

Chain-of-Thought prompting works by asking the model to externalize its reasoning before giving a final answer. Think of it like the difference between a student who guesses an answer immediately and one who writes out their working. The student who writes out working catches their own errors mid-solution; the guesser cannot.

The simplest form is zero-shot CoT: appending "Let's think step by step" to a question. Kojima et al. (2022) showed that this single phrase elicits more careful, step-by-step reasoning even with no examples. The phrase appears to trigger a learned reasoning pattern from the model's pre-training on human-written explanations and worked solutions.

Few-shot CoT is more powerful: you provide two to three examples that each include a full reasoning chain alongside the answer. The model learns the specific style of reasoning you want (what level of detail to use, how to handle edge cases, when to check intermediate results) and applies that pattern to the new problem.
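
Both CoT variants come down to how the prompt string is assembled. The sketch below assumes a simple "Q:"/"A:" layout and a "The answer is N." closing convention; the helper names and the worked example are hypothetical and only illustrate the shape of the prompt.

```python
from typing import Dict, List

ZERO_SHOT_COT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str) -> str:
    """Append the trigger phrase so the model starts its answer with reasoning."""
    return f"Q: {question}\nA: {ZERO_SHOT_COT_TRIGGER}"


def few_shot_cot(examples: List[Dict[str, str]], question: str) -> str:
    """Prefix the question with examples that show reasoning before the answer."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The final block ends at "A:" so the model writes its own reasoning chain.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative worked example (an assumption, not taken from any benchmark).
worked = [
    {
        "question": "A shop sells pens in packs of 12. How many pens are in 7 packs?",
        "reasoning": "Each pack has 12 pens. 7 packs have 7 * 12 = 84 pens.",
        "answer": "84",
    }
]
zs_prompt = zero_shot_cot("If a train travels 60 km in 45 minutes, what is its speed in km/h?")
fs_prompt = few_shot_cot(worked, "A crate holds 24 bottles. How many bottles are in 9 crates?")
print(zs_prompt)
print(fs_prompt)
```

A fixed closing phrase such as "The answer is" also makes the final answer easy to extract programmatically afterwards.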

Figure: The multiheaded attention module at the core of transformer models. Chain-of-Thought prompting does not change this architecture; it gives the model intermediate reasoning tokens to attend to when it predicts the final answer. Source: Cosmia Nebula / Wikimedia Commons (CC BY-SA 4.0)

Self-Consistency

Self-consistency addresses the fact that a single reasoning chain can still make mistakes. The idea is to generate multiple independent reasoning paths for the same problem using a higher temperature, extract the final answer from each path, and select the answer that appears most often. If five independent chains all reach the same answer through different reasoning routes, that answer is far more likely to be correct than any single chain's output. The cost is five times as many API calls, which makes this appropriate only for high-stakes decisions.
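
The voting step can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_chain` stands in for a model call at a higher temperature, the canned chains replace real model output, and answers are assumed to follow a "The answer is N" convention.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Pull the final numeric answer out of a reasoning chain, if one is present."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None


def self_consistent_answer(
    sample_chain: Callable[[], str], n_samples: int = 5
) -> Optional[str]:
    """Sample several chains and return the most common extracted answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain())
        if answer is not None:  # chains without a parseable answer do not vote
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Canned chains standing in for five sampled model responses (hypothetical).
canned = iter([
    "12 * 7 = 84. The answer is 84.",
    "7 packs of 12 is 70 + 14 = 84. The answer is 84.",
    "12 * 7 = 74. The answer is 74.",
    "10 * 7 = 70, 2 * 7 = 14, total 84. The answer is 84.",
    "I am not sure.",
])
result = self_consistent_answer(lambda: next(canned), n_samples=5)
print(result)  # 84
```

Here one chain makes an arithmetic slip and one produces no answer at all, yet the majority vote still returns 84.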

ReAct: Reasoning and Acting

ReAct extends Chain-of-Thought by adding the ability to take actions (searching the web, querying databases, running calculations, calling APIs) and to incorporate the results of those actions into subsequent reasoning. Without tool access, an LLM can only work with information in its context window and training data. With ReAct, it can gather new information dynamically as reasoning reveals what is needed.

Each ReAct step follows a three-part cycle: Thought (the model reasons about what it needs to know or do next), Action (the model calls a tool with specific parameters), and Observation (the tool result is returned and added to context). This cycle repeats until the model determines it has enough information to produce a final answer. The explicit reasoning trace makes the agent's decision process inspectable and debuggable: you can see exactly why the model called each tool.
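
The Thought-Action-Observation cycle can be sketched as a loop. Everything here is an assumption made for illustration: the `Action: tool[input]` and `Final Answer:` text conventions, the `react_loop` function, and the scripted stand-in for a real model call. Production agents more often use a provider's structured tool-calling interface than regex parsing.

```python
import re
from typing import Callable, Dict, Tuple

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react_loop(
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> Tuple[str, str]:
    """Run Thought/Action/Observation cycles; return (answer, full transcript)."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)  # model sees everything so far
        transcript += step + "\n"
        final = FINAL_RE.search(step)
        if final:
            return final.group(1).strip(), transcript
        action = ACTION_RE.search(step)
        if action is None:
            observation = "Error: no valid action found."
        else:
            name, arg = action.group(1), action.group(2)
            tool = tools.get(name)
            if tool is None:
                observation = f"Error: unknown tool '{name}'."
            else:
                try:
                    observation = tool(arg)
                except Exception as exc:  # surface tool failures to the model
                    observation = f"Error: {exc}"
        transcript += f"Observation: {observation}\n"
    # Step limit reached: fail gracefully instead of looping forever.
    return "Could not find sufficient information.", transcript


def scripted_model(transcript: str) -> str:
    """Hypothetical stand-in for an LLM: acts once, then answers."""
    if "Observation:" not in transcript:
        return "Thought: I need the product of 12 and 7.\nAction: multiply[12,7]"
    return "Thought: The tool returned 84.\nFinal Answer: 84"


def multiply(arg: str) -> str:
    a, b = (int(x) for x in arg.split(","))
    return str(a * b)


answer, trace = react_loop(
    scripted_model, {"multiply": multiply}, "How many pens are in 7 packs of 12?"
)
print(answer)
print(trace)
```

Tool errors are fed back as observations rather than raised, so the model has a chance to choose a different action, and the step limit bounds the worst case.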

Tree-of-Thoughts

Tree-of-Thoughts generalizes Chain-of-Thought by making the reasoning process a search problem rather than a linear chain. Instead of committing to a single path from the start, the model generates several candidate next reasoning steps, evaluates each for promise, and expands only the most promising ones. If a path leads to a dead end or a clearly wrong intermediate result, it can be abandoned and an alternative explored.

This mirrors how a chess player considers multiple possible moves before choosing, or how a human writer drafts multiple opening sentences and selects the strongest. The cost is significant, because token usage grows with the breadth and depth of the search tree, but for genuinely difficult problems with multiple valid solution approaches, ToT can find substantially better solutions than linear CoT.
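
The search structure can be sketched on a toy problem. This is one simple breadth-first, beam-style variant, not the only way to run ToT. In a real system the `propose` and `score` functions would each be LLM calls; here they are deterministic stand-ins, and the number puzzle (reach 14 from 1 using "+3" or "*2") is an assumption chosen so the code runs without a model.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, Tuple[str, ...]]  # (current value, steps taken so far)
TARGET = 14


def propose(state: State) -> List[State]:
    """Generate candidate next steps (stand-in for an LLM proposal call)."""
    value, steps = state
    return [(value + 3, steps + ("+3",)), (value * 2, steps + ("*2",))]


def score(state: State) -> float:
    """Rate how promising a state looks (stand-in for an LLM evaluation call)."""
    return -abs(TARGET - state[0])


def is_goal(state: State) -> bool:
    return state[0] == TARGET


def tree_of_thoughts(
    root: State,
    propose: Callable[[State], List[State]],
    score: Callable[[State], float],
    is_goal: Callable[[State], bool],
    beam_width: int = 2,
    max_depth: int = 4,
) -> Optional[State]:
    """Expand every state on the frontier, keep only the best few, repeat."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Prune: only the most promising branches survive to the next level.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return None


root: State = (1, ())
wide = tree_of_thoughts(root, propose, score, is_goal, beam_width=2)
greedy = tree_of_thoughts(root, propose, score, is_goal, beam_width=1)
print(wide)    # (14, ('+3', '+3', '*2'))
print(greedy)  # None
```

With a beam width of 1 the search commits to the locally best step at each level and overshoots the target; keeping two branches alive lets the less promising intermediate value 7 survive long enough to reach 14.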

Figure: A simple decision tree illustrating the branching structure that Tree-of-Thoughts replicates at the reasoning level: at each node the model evaluates multiple candidate next steps, expands only the most promising branches, and can backtrack when a path leads to a dead end. Source: Eviatar Bach / Wikimedia Commons (CC0 1.0 Public Domain)

Practical Example

A legal research firm builds an internal assistant to answer questions like "What are the precedent cases for contract termination due to force majeure in commercial leases in the UK?" This is a multi-step task requiring external information retrieval, synthesis across multiple sources, and careful reasoning about applicability.

With a simple zero-shot prompt, the model either hallucinates case names or admits it does not have current information. With ReAct, the model reasons "I need to search for UK cases involving force majeure in commercial leases", calls a legal database search tool with appropriate query terms, receives a list of cases, reasons about which are most relevant, retrieves full text for the top three, synthesizes the precedents, and produces a well-sourced answer. The reasoning trace shows exactly which search queries were used and which cases were considered, making the answer auditable. When a lawyer asks a follow-up question about a specific case, the model can pick up from its previous observation without re-querying.

Advantages

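Measurable accuracy improvements: Few-shot CoT has been reported to improve accuracy on mathematical reasoning benchmarks by roughly 20 to 40 percentage points over direct prompting on the same large model (Wei et al., 2022). Self-consistency has been reported to add roughly another 5 to 15 points on top of that (Wang et al., 2022). Gains vary considerably with model size and benchmark.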
Inspectable reasoning: CoT and ReAct both produce explicit reasoning traces. When the model gets something wrong, you can see exactly where the reasoning broke down, a significant advantage over opaque direct answers.
Tool integration: ReAct enables LLMs to act as genuine agents that gather real-time information, take actions, and incorporate results into subsequent reasoning. This is the foundational pattern for most production AI agent systems.
Solution space exploration: Tree-of-Thoughts allows the model to explore creative or non-obvious solution paths that a linear reasoning chain would rule out by committing to one path too early. On design and planning tasks, this can produce qualitatively better outputs.

Limitations and Trade-offs

Token cost increases significantly. CoT roughly doubles token usage per response. Self-consistency multiplies it by the number of samples. Tree-of-Thoughts can multiply it by an order of magnitude. These costs accumulate quickly in high-volume production systems.
Latency is higher. Multi-step reasoning, multiple samples, and tool call cycles all increase response time. For real-time applications, this trade-off must be weighed carefully.
CoT helps less on simple tasks. For straightforward factual recall or basic classification tasks, adding step-by-step reasoning does not improve accuracy and wastes tokens. The technique is specifically valuable for tasks that genuinely require multi-step logic.
ReAct agents can get stuck. Without a maximum step limit, a ReAct agent that cannot find the information it needs may loop indefinitely. Tool call failures, unexpected API responses, and ambiguous observations can derail the reasoning chain.
Tree-of-Thoughts is expensive to evaluate. Assessing the quality of each candidate thought branch often requires an additional LLM call per branch per depth level. The actual cost in production can be 10 to 30 times that of a single CoT response.
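
The multipliers above can be made concrete with some back-of-the-envelope arithmetic. The cost model below is an assumption for illustration only: it treats CoT as twice the tokens of a direct answer, and it counts one generation call plus one evaluation call per ToT candidate in a beam-style search where the beam is full after the first level. The function names and the sample figures are hypothetical.

```python
def self_consistency_tokens(cot_tokens: int, n_samples: int) -> int:
    """Self-consistency runs the full CoT prompt once per sample."""
    return cot_tokens * n_samples


def tot_call_count(breadth: int, beam_width: int, depth: int) -> int:
    """Count LLM calls for a beam-style ToT search.

    Assumes the root is expanded once, then `beam_width` states are expanded
    at each later level, and that every candidate costs one generation call
    plus one evaluation call.
    """
    states_expanded = 1 + beam_width * (depth - 1)
    candidates = states_expanded * breadth
    return 2 * candidates


direct_tokens = 150                 # illustrative figure, not a measurement
cot_tokens = 2 * direct_tokens      # "roughly doubles" from the text above
sc_tokens = self_consistency_tokens(cot_tokens, n_samples=5)
tot_calls = tot_call_count(breadth=3, beam_width=2, depth=3)

print(cot_tokens)   # 300
print(sc_tokens)    # 1500
print(tot_calls)    # 30
```

Under these assumptions, even a small tree (three candidates per step, two branches kept, three levels deep) needs 30 model calls, which is consistent with the 10 to 30 times range quoted above.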

Common Mistakes

Over-engineering from the start. Teams that immediately reach for Tree-of-Thoughts or self-consistency add cost and complexity before establishing that simpler approaches are insufficient. Always start with the simplest prompt and add complexity only with evidence it helps.
Ignoring token budget constraints. Few-shot examples, full CoT chains, and multi-turn ReAct conversations consume context space quickly. For very long prompts, you risk hitting the model's context limit; depending on the API or framework, the request then either fails with an error or has its earlier content silently truncated.
Not testing edge cases. Prompts that work perfectly on representative inputs often fail on empty inputs, unusually long inputs, inputs in unexpected languages, or adversarial inputs. Systematic edge-case testing is essential before production deployment.
Assuming prompt determinism. Even at temperature zero, LLM outputs can vary slightly across API versions and model updates. Prompts must be re-tested after model upgrades, and systems must be designed to handle output variability gracefully.
Not versioning prompts. Treating prompts as ephemeral strings rather than versioned artifacts means changes cannot be audited, rolled back, or A/B tested. Prompt changes frequently cause silent regressions in downstream task quality.

Best Practices

Start with zero-shot. Move to few-shot when the output format or domain is non-standard. Add CoT only when the task requires multi-step reasoning. Apply self-consistency or ToT only when the stakes justify the cost.
Write few-shot examples that cover the full range of input types your system will encounter, not just the easy central cases. Include at least one example of an ambiguous or edge-case input.
For ReAct agents, always set a maximum step count and implement graceful failure handling for tool call errors. Never let an agent loop without bounds.
Version your prompts like code. Store them in source control, review changes, and use semantic versioning. A version bump signals that downstream evaluation is needed.
A/B test prompt changes on real traffic before fully deploying. Even seemingly minor wording changes can meaningfully shift output quality in unexpected directions.
Log the full reasoning trace for all CoT and ReAct interactions, not just the final answer. The intermediate steps are essential for debugging failures and auditing decisions.
Use role assignment, "You are an expert X with Y years of experience", to prime the model for domain-specific vocabulary and reasoning patterns.
Specify output format constraints explicitly: length limits, structure requirements, what the model should say when it cannot answer. Unspecified constraints produce inconsistent behavior at scale.
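
The advice on templates and versioning can be sketched with the standard library alone. The `PromptVersion` class, the prompt name, and the version string are hypothetical; in practice the template text would live in a file under source control rather than inline.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptVersion:
    """A named, versioned prompt template (hypothetical structure)."""
    name: str
    version: str
    template: Template

    def render(self, **values: str) -> str:
        # substitute() raises KeyError if a placeholder is left unfilled,
        # which catches template/caller mismatches early.
        return self.template.substitute(**values)


SUMMARIZE_V1 = PromptVersion(
    name="summarize_ticket",
    version="1.1.0",
    template=Template(
        "You are a support analyst.\n"
        "Summarize the ticket below in at most $max_sentences sentences.\n"
        "If the ticket is empty, reply exactly: NO CONTENT.\n\n"
        "Ticket:\n$ticket"
    ),
)

rendered = SUMMARIZE_V1.render(max_sentences="2", ticket="Login fails after password reset.")
print(f"[{SUMMARIZE_V1.name} v{SUMMARIZE_V1.version}]")
print(rendered)
```

Logging the name and version alongside each response makes it possible to trace a regression back to the prompt change that caused it.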

Comparison: Prompting Techniques

Technique	Reasoning Type	Relative Cost	Latency	When to Use
Zero-Shot	Direct answer	Lowest	Fastest	Simple, well-defined tasks with unambiguous output
Few-Shot	Pattern learning from examples	Low to Medium	Fast	Domain-specific tasks or custom output formats
Chain-of-Thought	Step-by-step reasoning	Medium	Medium	Multi-step arithmetic, logic, planning, and inference
Self-Consistency	Multiple chains with majority vote	High (N times CoT)	Slow	High-stakes decisions where accuracy justifies cost
ReAct	Reasoning interleaved with tool calls	Medium to High	Slow (includes tool latency)	Tasks requiring external information or multi-step actions
Tree-of-Thoughts	Search over multiple reasoning paths	Very High	Very Slow	Complex problems where early path choices have large downstream impact

FAQ

Does adding "Let's think step by step" always improve results?

No. Chain-of-Thought prompting helps most on tasks that genuinely require multi-step reasoning: mathematics, logic puzzles, code debugging, and planning. For simple factual recall ("What is the capital of France?") or straightforward classification tasks, adding step-by-step reasoning typically produces no accuracy improvement and wastes tokens. The technique is specifically valuable when the model must perform intermediate computations or logical deductions to reach a correct answer.

How is ReAct different from just giving the model tools to call?

Standard tool calling allows the model to invoke functions, but it does not enforce any structure on the reasoning around those calls. ReAct adds an explicit reasoning step before every action, forcing the model to articulate why it is calling a tool and what it expects to learn. This reasoning trace makes the agent's behavior inspectable and debuggable. It also tends to produce better tool selection: a model that writes out its reasoning before acting is less likely to call the wrong tool or pass incorrect parameters.

When is Tree-of-Thoughts actually worth the cost?

ToT is worth the cost when the problem has two properties: first, there are multiple meaningfully different solution approaches (not just one right path with minor variations); second, committing to the wrong approach early is costly, either because backtracking is expensive or because the wrong path produces a result that looks plausible but is wrong. Creative design tasks, constraint-satisfaction problems, and high-stakes planning scenarios can justify ToT. For most production question-answering, summarization, or extraction tasks, CoT or self-consistency gives equivalent quality at far lower cost.

Should I always use few-shot examples over zero-shot?

Not always. Few-shot examples add tokens and latency. For tasks where the model reliably produces the right output format zero-shot (common query types, standard sentiment analysis, straightforward summarization), adding examples provides marginal benefit at a meaningful cost. Use few-shot when the task is domain-specific, when the output format is unusual, or when you observe consistent zero-shot failure modes that well-chosen examples would address.

How do I handle a ReAct agent that gets stuck in a loop?

Always set a maximum step count and fail gracefully when it is reached. In practice, this means tracking the number of Thought-Action-Observation cycles and returning a "could not find sufficient information" response to the user rather than continuing to loop. Additionally, implement tool call timeouts and handle tool errors explicitly in the agent loop, because an unhandled tool error often causes the model to retry the same failed call indefinitely. Logging the full trace helps diagnose which step causes the loop and whether the cause is a tool failure or confused reasoning.
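
Alongside a step limit, one inexpensive guard is to notice when the agent keeps issuing the same call. The sketch below is an assumption-laden illustration: the `is_stuck` helper, the (tool name, argument) representation of an action, and the window of three are all hypothetical choices.

```python
from typing import List, Tuple

Action = Tuple[str, str]  # (tool name, argument string)


def is_stuck(history: List[Action], window: int = 3) -> bool:
    """Return True when the last `window` actions are all identical."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(action == recent[0] for action in recent)


# Illustrative action histories.
looping = [("search", "force majeure leases")] * 3
varied = [
    ("search", "force majeure leases"),
    ("search", "force majeure commercial leases UK"),
    ("fetch", "case-123"),
]
print(is_stuck(looping))  # True
print(is_stuck(varied))   # False
```

An agent loop could call such a check after each action and stop early with the same graceful failure message it uses when the step limit is reached.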

References

Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.
Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. NeurIPS 2022.

Key Takeaways

Chain-of-Thought prompting, triggered as simply as adding "Let's think step by step", significantly improves accuracy on multi-step reasoning tasks by forcing the model to externalize its logic rather than pattern-match to a direct answer.
Self-consistency trades cost for reliability: generating multiple CoT samples and taking a majority vote is worth the expense for high-stakes decisions, but overkill for routine tasks.
ReAct is the right pattern for tasks requiring external information or multi-step actions, interleaving explicit reasoning with tool calls at each step and producing an inspectable trace of every decision.
Tree-of-Thoughts is valuable specifically when early path choices have large downstream consequences and multiple meaningfully different approaches exist. For most production tasks, CoT is sufficient at far lower cost.
Use the simplest technique that solves your problem. Complexity should be proportional to task difficulty and stakes: over-engineered prompts are harder to maintain and can actually confuse the model with contradictory or redundant instructions.
Version, A/B test, and monitor your prompts in production the same way you would any other critical code path. Prompt changes frequently cause silent regressions that only systematic evaluation reveals.