AI Agents

How to Build an AI Agent from Scratch in Python

Build a real AI agent from scratch in Python — tools, memory, reasoning loops, and multi-agent systems. No framework hand-holding.

Everyone is shipping "AI agents" right now. Most of them are lying.

What they've actually shipped is a chatbot with a system prompt, or a pipeline that calls an LLM three times in a fixed sequence, or a workflow with a few API integrations bolted on. Useful, maybe. But agents? Not quite.

An actual agent does something different: it observes its environment, reasons about what to do next, takes action, evaluates the result, and loops — until the task is done. The LLM doesn't just respond. It decides. What tool to call. Whether to call one at all. When it's finished. That decision-making loop is what makes something an agent versus a very fancy API wrapper.

This article walks through how that loop actually works, from the minimal 50-line version up through a proper tool registry. By the end, you'll have working code and a clear mental model of what every production agent is doing under the hood — regardless of which framework you eventually reach for.


The Four-Part Definition

Before we write a single line of code, let's establish the definition precisely. An agent has four components:

The LLM (the brain). The language model is the decision-maker. It reads the current state of the conversation — your messages, previous tool results, context — and decides what to do next. Call a tool? Respond to the user? Ask a clarifying question? The LLM makes that call. It does not execute anything directly.

Tools (the hands). An LLM alone can only produce text. Tools are what let the agent actually do things: read a file, query a database, call an API, search the web. Critically, the LLM never executes tools itself — it outputs a structured request, and your code executes it. The LLM is the decision-maker. You are the executor.

Memory (the filing cabinet). Short-term memory is the conversation history within a session — the rolling context that lets the LLM see what it already tried and what came back. Long-term memory is anything that persists across sessions: vector stores, databases, files on disk. Most agents start with short-term memory only. That gets you further than you'd expect.
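The short-term case can be as simple as a rolling window over the message list. A minimal sketch, assuming a window of 20 messages (an arbitrary choice — real trimming also has to keep tool results paired with the tool calls that produced them):

```python
# Keep the system prompt plus a rolling window of recent messages.
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system message plus the most recent max_messages entries."""
    if len(messages) <= max_messages + 1:
        return messages
    system, rest = messages[0], messages[1:]
    return [system] + rest[-max_messages:]

history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user", "content": f"message {i}"} for i in range(30)]
trimmed = trim_history(history)
print(len(trimmed))  # 21: the system prompt plus the last 20 messages
```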

The reasoning loop (the nervous system). The loop ties everything together. Send the current state to the LLM. Check if it wants to use a tool. Execute the tool. Append the result. Loop. When the LLM produces a final text response instead of a tool call, the loop exits.

That loop is the heartbeat. Everything else is elaboration.


The Spectrum: What Is and Isn't an Agent

The terminology is polluted enough that it's worth drawing the lines clearly:

LESS AUTONOMOUS                                              MORE AUTONOMOUS
      |                                                              |
      v                                                              v

  Chatbot -----> Pipeline -----> Workflow -----> Agent -----> Autonomous Agent

  - Text in/out   - Fixed steps   - Branching     - Tool use     - Self-directed
  - No tools      - No decisions  - Pre-defined   - Loop         - Goal-seeking
  - No loop       - No loop         paths         - Evaluates    - Minimal human
  - Stateless     - Stateless    - LLM powers      results        oversight
                                   steps, not    - Decides next
                                   flow            action

A pipeline that calls summarize → extract → report in that order every time is not an agent. The sequence is hardcoded. There's no decision. An agent reads the current state and decides the next step — including whether there is a next step at all.
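For contrast, here is what that hardcoded pipeline looks like in code — summarize, extract, and report are hypothetical placeholders. Every step runs, in the same order, every time:

```python
# A fixed pipeline: no state inspection, no decision, no loop.
def summarize(text: str) -> str:
    return f"summary({text})"

def extract(summary: str) -> str:
    return f"facts({summary})"

def report(facts: str) -> str:
    return f"report({facts})"

def run_pipeline(document: str) -> str:
    # The sequence is baked in at write time, not decided at run time
    return report(extract(summarize(document)))

print(run_pipeline("quarterly earnings"))
# report(facts(summary(quarterly earnings)))
```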


The Simplest Possible Agent

Here's a complete, working agent in roughly fifty lines of Python. It can answer questions directly, or search the web when it decides it needs current information.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment
model = "gpt-4o"

# --- Tool Implementation ---
def web_search(query: str) -> str:
    """Simulate a web search. Replace with a real search API in production."""
    fake_results = [
        {"title": "Result 1", "snippet": f"Relevant information about: {query}"},
        {"title": "Result 2", "snippet": f"More details regarding: {query}"},
    ]
    return json.dumps(fake_results, indent=2)

available_tools = {"web_search": web_search}

# --- Tool Schema ---
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information on a topic.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query to look up."
                    }
                },
                "required": ["query"]
            }
        }
    }
]

# --- Agent Loop ---
def run_agent(user_message: str, max_iterations: int = 10) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Use web_search for current info."},
        {"role": "user", "content": user_message}
    ]

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
        )
        message = response.choices[0].message

        # No tool calls means the model is done — return the answer
        if message.tool_calls is None:
            return message.content

        # Process each tool call the model requested
        messages.append(message)
        for tool_call in message.tool_calls:
            name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            print(f"  [iteration {i+1}] calling {name}({args})")
            result = available_tools[name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Agent reached maximum iterations without producing a final answer."

if __name__ == "__main__":
    answer = run_agent("What is the current population of Tokyo?")
    print(answer)

Walk through what happens with "What is the current population of Tokyo?":

  1. The agent sends the system prompt and user message to the model.
  2. The model decides it needs current data and outputs a structured tool call: web_search({"query": "current population of Tokyo"}).
  3. Your code catches that, executes web_search, gets back JSON results.
  4. Those results get appended to the conversation history and sent back to the model.
  5. The model reads the results, synthesizes an answer, and returns plain text — no more tool calls.
  6. The loop sees no tool calls, returns the answer, exits.

Now try "What is 2 + 2?" The model knows this, skips the tool call entirely, and responds with "4." The loop runs once. No tools are called.

That decision — to act or not — is what makes this an agent. A script always searches. A chatbot never searches. The agent evaluates and chooses.

The Anthropic Version

If you prefer the Anthropic SDK, the structure is nearly identical. The main differences are how tool results are formatted and how stop_reason works:

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment
model = "claude-sonnet-4-20250514"

def web_search(query: str) -> str:
    fake_results = [
        {"title": "Result 1", "snippet": f"Relevant information about: {query}"},
    ]
    return json.dumps(fake_results, indent=2)

available_tools = {"web_search": web_search}

tools = [
    {
        "name": "web_search",
        "description": "Search the web for current information on a topic.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."}
            },
            "required": ["query"]
        }
    }
]

def run_agent(user_message: str, max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]
    system = "You are a helpful assistant. Use web_search for current info."

    for _ in range(max_iterations):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system,
            messages=messages,
            tools=tools,
        )

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})

            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  -> calling {block.name}({block.input})")
                    result = available_tools[block.name](**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })

            messages.append({"role": "user", "content": tool_results})
        else:
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""  # no text block in the final response

    return "Agent reached maximum iterations without producing a final answer."
The loop is the same loop. Send messages, check for tool calls, execute tools, feed results back, repeat. The API surface differs between providers; the pattern doesn't.


Building a Proper Tool Registry

One tool hardcoded into the script is fine for a prototype. Real agents need dozens of tools, and adding a new one shouldn't require touching the agent loop at all.

A ToolRegistry solves this. It stores tool functions, auto-generates their JSON schemas from Python type hints, and handles dispatch. Here's the implementation:

import json
import inspect
from typing import Callable, Any, get_type_hints

class ToolRegistry:
    """Stores tools, their schemas, and handles execution."""

    def __init__(self):
        self._tools: dict[str, Callable] = {}
        self._schemas: dict[str, dict] = {}

    def tool(self, description: str = "", **param_descriptions):
        """Decorator to register a function as an agent tool."""
        def decorator(func: Callable) -> Callable:
            name = func.__name__
            schema = self._generate_schema(func, description, param_descriptions)
            self._tools[name] = func
            self._schemas[name] = schema
            return func
        return decorator

    def _generate_schema(self, func, description, param_descriptions):
        hints = get_type_hints(func)
        sig = inspect.signature(func)
        desc = description or (func.__doc__ or "").strip() or f"Call {func.__name__}"

        properties = {}
        required = []

        for param_name, param in sig.parameters.items():
            if param_name == "self":
                continue
            param_type = hints.get(param_name, str)
            json_type = self._python_type_to_json(param_type)
            prop = {"type": json_type}
            if param_name in param_descriptions:
                prop["description"] = param_descriptions[param_name]
            properties[param_name] = prop
            if param.default is inspect.Parameter.empty:
                required.append(param_name)

        return {
            "type": "function",
            "function": {
                "name": func.__name__,
                "description": desc,
                "parameters": {
                    "type": "object",
                    "properties": properties,
                    "required": required,
                },
            },
        }

    @staticmethod
    def _python_type_to_json(python_type: type) -> str:
        type_map = {
            str: "string", int: "integer", float: "number",
            bool: "boolean", list: "array", dict: "object",
        }
        return type_map.get(python_type, "string")

    def get_schemas(self) -> list[dict]:
        return list(self._schemas.values())

    def call(self, name: str, arguments: dict[str, Any]) -> str:
        if name not in self._tools:
            return json.dumps({"error": f"Unknown tool: {name}"})
        try:
            result = self._tools[name](**arguments)
            return str(result)
        except TypeError as e:
            return json.dumps({"error": f"Invalid arguments for {name}: {e}"})
        except Exception as e:
            return json.dumps({"error": f"Tool {name} failed: {e}"})

Now registering a tool is one decorator:

registry = ToolRegistry()

@registry.tool(
    description="Search the web for current information on a topic.",
    query="The search query string."
)
def web_search(query: str) -> str:
    # Real implementation here
    return json.dumps([{"snippet": f"Results for: {query}"}])

@registry.tool(
    description="Read the contents of a file from the local filesystem.",
    filepath="Absolute or relative path to the file."
)
def read_file(filepath: str) -> str:
    with open(filepath, "r") as f:
        return f.read()

The agent loop uses the registry without knowing about any specific tool:

def run_agent(user_message: str, max_iterations: int = 10) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message}
    ]

    for _ in range(max_iterations):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=registry.get_schemas(),
        )
        message = response.choices[0].message

        if message.tool_calls is None:
            return message.content

        messages.append(message)
        for tool_call in message.tool_calls:
            name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            result = registry.call(name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Max iterations reached."

Add a new tool with a decorator. The loop never changes. That's the design goal.


The Five Problems Every Agent Must Eventually Solve

The 50-line version works, but it's fragile in predictable ways. Here's what breaks next and where the solutions live:

Too few tools, all hardcoded. The tool registry above is the fix. A mature agent might have dozens of tools loaded from a plugin directory or discovered dynamically via MCP (the Model Context Protocol, Anthropic's emerging standard for tool connectivity).

No memory across conversations. Every session starts from zero. Short-term fixes involve persisting the message history to disk between runs. Long-term fixes involve vector stores — embed past interactions, retrieve semantically relevant ones, inject them into the context.
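The short-term fix can be sketched in a few lines — the file path here is an arbitrary example, and assistant message objects returned by an SDK would need converting to plain dicts (e.g. via a pydantic `model_dump()` in the OpenAI SDK) before serializing:

```python
import json
from pathlib import Path

HISTORY_PATH = Path("agent_history.json")  # example path

def save_history(messages: list[dict]) -> None:
    # Persist the conversation between runs
    HISTORY_PATH.write_text(json.dumps(messages, indent=2))

def load_history() -> list[dict]:
    # Resume a previous session, or start fresh with just the system prompt
    if HISTORY_PATH.exists():
        return json.loads(HISTORY_PATH.read_text())
    return [{"role": "system", "content": "You are a helpful assistant."}]

messages = load_history()
messages.append({"role": "user", "content": "hello"})
save_history(messages)
```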

Can't plan multi-step tasks. Ask the simple agent to "research five EV competitors and write a comparison" and it tries to do everything in one search. Planning agents decompose the task first, then execute each step. The ReAct pattern (Reason → Act → Observe, repeat) makes this explicit and debuggable.
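The ReAct format itself is just structured text. A sketch of what a hypothetical model output looks like and how your code might parse it (the output string below is invented for illustration):

```python
import re

# Hypothetical model output in the ReAct format
model_output = """Thought: I need current data on Tokyo's population.
Action: web_search(query="Tokyo population 2024")"""

# Parse out the reasoning and the requested action
thought = re.search(r"Thought: (.*)", model_output).group(1)
action = re.search(r"Action: (\w+)\((.*)\)", model_output).group(1, 2)
print(thought)  # I need current data on Tokyo's population.
print(action)   # ('web_search', 'query="Tokyo population 2024"')
```

The explicit Thought line is what you log and debug against; the Action line is what you dispatch to the tool registry.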

No error handling. An exception in any tool call crashes the agent. Production agents need retry logic, fallback tools, and graceful degradation — plus a way to report errors back to the LLM so it can try a different approach.
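A sketch of that pattern — retry with exponential backoff, then report a structured error back to the model instead of raising. Here flaky_tool is a stand-in that fails twice before succeeding:

```python
import json
import time

def call_with_retry(func, arguments: dict, retries: int = 3, delay: float = 0.0) -> str:
    for attempt in range(retries):
        try:
            return str(func(**arguments))
        except Exception as e:
            if attempt == retries - 1:
                # Surface the failure to the LLM so it can try another approach
                return json.dumps({"error": f"{type(e).__name__}: {e}"})
            time.sleep(delay * (2 ** attempt))  # exponential backoff

flaky_calls = {"count": 0}

def flaky_tool(query: str) -> str:
    flaky_calls["count"] += 1
    if flaky_calls["count"] < 3:
        raise ConnectionError("transient network failure")
    return f"results for {query}"

print(call_with_retry(flaky_tool, {"query": "Tokyo"}))  # results for Tokyo
```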

Single-agent bottleneck. One LLM can't be an expert at everything. Multi-agent systems solve this with a supervisor that delegates subtasks to specialized worker agents — one for research, one for writing, one for code review.

The minimal loop in this article is the foundation that all of those patterns build on. The skeleton doesn't change. What changes is the sophistication of what happens inside it.


Writing Good Tool Descriptions

The single highest-leverage thing you can do to improve agent behavior — outside of the loop design itself — is write better tool descriptions.

The LLM doesn't pattern-match on function names. It reads your description, understands the intent, and makes a judgment call about whether this is the right tool for the current situation. A vague description produces inconsistent tool selection. A precise description produces reliable tool selection.

Compare these two descriptions for a file-reading tool:

# Vague
"Read a file."

# Precise
"Read the complete contents of a text file from the local filesystem. 
Returns the raw file content as a string. Use when the user refers to 
a specific file by path, or when you need to examine code, configuration, 
or documentation stored locally."

The second version tells the model when to use the tool, not just what it does. The model has much better signal about whether the current task warrants this tool versus a different one.

The same logic applies to parameter descriptions. Don't write "The file path." Write "Absolute or relative path to the file. Use forward slashes. Include the file extension." The model reads those descriptions when generating arguments — the more precise they are, the more precisely the arguments come back.


What to Read Next

The patterns here — the loop, the tool registry, the function calling protocol — are the complete foundation. Every agent framework (LangChain, LangGraph, CrewAI, AutoGen, the Anthropic agent SDK) is a more elaborate version of these same pieces.

If you want to go deeper on any of this:

  • MCP (Model Context Protocol) is the emerging standard for tool connectivity. Instead of registering tools directly in your code, MCP lets you connect to external tool servers using a standardized protocol. Anthropic introduced it in late 2024.
  • ReAct (Reason + Act) is the prompting pattern that makes multi-step reasoning explicit. The model outputs its reasoning before each tool call, which both improves accuracy and makes the agent's logic transparent.
  • Vector memory is the standard solution for long-term agent memory. Embed text into vectors, store them, retrieve by semantic similarity. You can build a working version from scratch with Python and NumPy.
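To make the vector-memory idea concrete, here is a toy version using a bag-of-words "embedding" and cosine similarity. A real system would swap in a proper embedding model, but the store-and-retrieve mechanics are the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class VectorMemory:
    def __init__(self):
        self.entries: list[tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        self.entries.append((embed(text), text))

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        # Rank stored memories by semantic (here: lexical) similarity
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]

memory = VectorMemory()
memory.store("The user prefers Python over JavaScript")
memory.store("The user's deployment target is AWS Lambda")
memory.store("The user asked about Tokyo's population")
print(memory.retrieve("python or javascript preference", top_k=1))
# ['The user prefers Python over JavaScript']
```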

The companion repository for these patterns is at github.com/grizzlypeaksoftware/gps-ai-agent-fundamentals. Every pattern above — the minimal agent, the tool registry, the Anthropic version, the verbose debugging loop — exists as a runnable script there.

If you want the complete treatment, from the minimal agent through multi-agent systems and production error handling, I've written it all up in Build Your Own AI Agent From Scratch, available on Amazon.

The architecture isn't complicated once you see it. The loop is the loop. Build it yourself once and you'll never be confused by agent frameworks again.
