In early 2023, the cutting edge of practical AI was a single language model answering questions in a chat interface. By 2026, teams of autonomous AI agents are executing multi-day software engineering tasks, conducting research across dozens of sources, managing infrastructure, and orchestrating workflows that span enterprise systems — with minimal human intervention. This is a compressed history of how we got here and what practitioners learned along the way.
When GPT-4 launched in March 2023, the dominant paradigm was simple: write a better prompt, get a better answer. The AI engineering community invested enormous intellectual energy in prompt engineering — crafting system prompts, few-shot examples, chain-of-thought instructions, and output format specifications that coaxed reliable behavior from a fundamentally unpredictable system. GPT-4 was genuinely powerful by the standards of what came before it, but it was also stateless, context-limited (initially 8,000 tokens, later 32,000), and entirely dependent on the quality of the instructions it received in each isolated conversation.
The limitations became apparent quickly. GPT-4 could write impressive code samples but could not run them to test their correctness. It could summarize documents but only those that fit in its context window. It could suggest a multi-step plan but had no mechanism to execute that plan, remember what it had done, or adapt when steps failed. The practical applications in 2023 were correspondingly narrow: customer support automation, first-draft content generation, code completion assistants (GitHub Copilot becoming a genuine productivity tool), and question-answering over small documents. Practitioners who understood the boundaries of these systems built valuable narrow applications; those who didn't built things that failed in production with predictable patterns.
Two architectural advances in 2024 fundamentally expanded what was possible with LLMs. Retrieval-Augmented Generation (RAG) solved the knowledge cutoff and context window problem by decoupling the model's reasoning capability from its factual knowledge. Instead of trying to stuff all relevant information into the context, RAG systems retrieve relevant chunks from a vector database at query time and provide them to the model as grounding context. Combined with better embedding models and more sophisticated retrieval strategies (hybrid dense-sparse search, re-ranking, multi-hop retrieval), RAG enabled LLM applications to reason accurately over arbitrarily large knowledge bases — corporate document repositories, codebases, research literature — without the model needing to "know" that information from training.
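The retrieve-then-ground loop described above can be sketched in a few lines. This is a toy illustration, not a production pipeline: the `embed` function is a bag-of-words stand-in for a real embedding model, and the in-memory list stands in for a vector database; the function names are illustrative.

```python
# Minimal RAG sketch: rank stored chunks against the query by cosine
# similarity, then assemble a grounded prompt for the model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a learned embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # In production this is a vector-database query, possibly combined
    # with sparse search and a re-ranking pass.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The key design point survives the simplification: the model never needs the whole knowledge base in context, only the top-k chunks relevant to this query.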
Function calling (tool use) was equally transformative. By allowing models to invoke external functions — querying databases, calling APIs, reading and writing files, executing code — applications could break out of the text-in, text-out paradigm and begin taking actions in the world. Claude's tool use capabilities and OpenAI's function calling API made it practical to build agents that could look up current information, perform calculations, interact with enterprise systems, and take verifiable actions rather than merely producing text about what actions might be taken. This was the foundation on which the agentic era would be built. Combined, RAG and function calling moved LLMs from conversational assistants to capable components in complex automated systems.
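At its core, the tool-use loop is simple: the model emits a structured call, the application executes it, and the result goes back to the model. The sketch below hard-codes the model's output as JSON and uses a stub tool registry; in a real system the structured call would come from a provider's function-calling API, and `get_weather` is a hypothetical tool name.

```python
# Tool-use dispatch sketch: parse a structured tool call emitted by the
# model, look up the registered function, and execute it with the
# model-chosen arguments.
import json

TOOLS = {
    # Stub tool; a real one would call an external API or database.
    "get_weather": lambda city: f"18C and cloudy in {city}",
}

def dispatch(model_output: str) -> str:
    call = json.loads(model_output)       # e.g. {"name": ..., "arguments": {...}}
    fn = TOOLS[call["name"]]              # look up the registered tool
    result = fn(**call["arguments"])      # execute with model-chosen args
    # In the full loop, this result is returned to the model so it can
    # produce a final grounded answer or issue another call.
    return result
```

The dispatch layer is where applications enforce safety: validating arguments, restricting which tools are registered, and logging every call.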
The practitioner insight that emerged from 2024 production deployments: the ceiling on LLM application quality is almost always the quality of the surrounding system — retrieval pipeline, tool implementations, error handling, and evaluation infrastructure — rather than the model itself. Teams that invested in these surrounding systems consistently outperformed those chasing model upgrades.
The launch of Claude Code in 2025 marked a qualitative shift in what AI systems could accomplish. Rather than answering questions about code, Claude Code could autonomously execute multi-step software engineering tasks: reading an entire codebase, understanding its architecture, making targeted changes, running tests, debugging failures, iterating based on results, and delivering working code — all without requiring step-by-step human guidance. For engineering teams, this was the difference between a capable autocomplete assistant and a junior engineer to whom work could be delegated for hours at a time.
Multi-agent architectures became production-viable in 2025. Rather than a single model handling an entire task, orchestrator-worker patterns emerged where a high-capability planning model would decompose complex tasks into subtasks and delegate them to specialized agents — a research agent that could browse the web, a coding agent with file system access, a data agent connected to databases, a communication agent with email and calendar access. These agents could work in parallel, share intermediate results, and resolve dependencies under the orchestrator's coordination. The Model Context Protocol (MCP) emerged as the standard interface for connecting agents to external tools and services, replacing the proprietary function-calling implementations that had proliferated in 2024 and creating a more interoperable ecosystem.
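The orchestrator-worker pattern can be sketched without any model calls at all: a planner decomposes the task, and specialized workers execute subtasks in parallel. Everything here is a stand-in, the `plan` function for a planning model, the worker lambdas for model-backed agents with their own tools.

```python
# Orchestrator-worker sketch: decompose a task, fan subtasks out to
# specialized workers in parallel, and collect their results.
from concurrent.futures import ThreadPoolExecutor

WORKERS = {
    # Stand-ins for model-backed agents with distinct tool access.
    "research": lambda spec: f"findings for: {spec}",
    "code":     lambda spec: f"patch for: {spec}",
}

def plan(task: str) -> list[tuple[str, str]]:
    # Stand-in for a planning model: a fixed decomposition into
    # (worker role, subtask specification) pairs.
    return [("research", f"background on {task}"),
            ("code", f"implementation of {task}")]

def orchestrate(task: str) -> dict[str, str]:
    subtasks = plan(task)
    with ThreadPoolExecutor() as pool:
        futures = {role: pool.submit(WORKERS[role], spec)
                   for role, spec in subtasks}
        # The orchestrator would normally synthesize these into a final
        # answer and resolve dependencies between subtasks.
        return {role: f.result() for role, f in futures.items()}
```

Real deployments add what this sketch omits: dependency ordering between subtasks, shared intermediate state, and the handoff validation discussed next.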
The reliability challenges in multi-agent systems were significant and not fully solved in 2025. Error propagation — where a mistake by one agent caused cascading failures in downstream agents that assumed correct input — was the most common failure mode in production systems. Teams learned to build explicit validation checkpoints between agent handoffs, implement rollback mechanisms, and design for graceful degradation rather than assuming happy-path execution. The organizations that succeeded with multi-agent deployment invested heavily in observability: logging agent reasoning, tracking tool calls, and building dashboards that made complex agent behavior legible to human supervisors.
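A validation checkpoint between handoffs can be as simple as a wrapper that refuses to pass unvalidated output downstream. The sketch below is a minimal illustration of that idea, with illustrative names throughout: retry on validation failure, then degrade gracefully to a fallback rather than letting a bad result cascade.

```python
# Checkpoint wrapper sketch: validate an agent step's output before
# handoff, retry on failure, and fall back instead of propagating
# errors to downstream agents.
def checkpointed(step, validate, retries: int = 2, fallback=None):
    def run(payload):
        for _attempt in range(retries + 1):
            out = step(payload)
            if validate(out):
                return out  # hand off only validated output
        # Graceful degradation: surface a safe fallback (e.g. flag for
        # human review) rather than silently passing bad data along.
        if fallback is not None:
            return fallback(payload)
        raise RuntimeError("validation failed after retries")
    return run

# Hypothetical usage: an extraction step that must return non-empty text.
extract = checkpointed(
    step=lambda text: text.strip(),
    validate=lambda out: len(out) > 0,
    fallback=lambda text: "NEEDS_HUMAN_REVIEW",
)
```

In production the validator is often itself a cheap model call or schema check, and every attempt is logged so the observability dashboards described above can reconstruct what happened.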
By mid-2026, the frontier has moved to agent orchestration at enterprise scale. Anthropic's Claude Opus 4.7 and equivalently capable models from other providers can maintain coherent reasoning across very long task horizons — multi-day software projects, extended research engagements, complex negotiation support — with improved reliability and substantially reduced hallucination rates in tool-calling contexts. The MCP ecosystem has matured significantly: hundreds of pre-built MCP servers cover common enterprise integration points (Salesforce, Jira, GitHub, major cloud providers, data warehouses), and internal MCP server development has become a standard part of enterprise AI platform engineering. The architecture of a modern AI-powered workflow looks less like a chatbot and more like a distributed system: multiple models playing different roles, connected through standardized interfaces, monitored by observability platforms, and coordinated by orchestration logic that handles retries, failures, and human escalation.
What practitioners have learned through this three-year evolution is that AI capability advances faster than the organizational capacity to absorb it. The limiting factor is no longer what the models can do — it is the quality of the data pipelines feeding them, the clarity of the task specifications guiding them, the robustness of the evaluation frameworks measuring their output, and the human judgment systems that know when to override them. Teams that built strong foundations in these areas in 2023 and 2024 are now deploying agent systems that deliver genuine economic value. Teams that skipped the fundamentals are still struggling with reliability in production despite having access to much more capable models.
Several questions will define the next phase of AI development for practitioners. How do you maintain meaningful human oversight of agent systems that operate faster and across more dimensions than any human can track in real time? How do you evaluate the output of agents working on tasks where the correct answer is genuinely ambiguous? How do you govern AI systems that have write access to production infrastructure, financial systems, and customer data? And how do organizations develop the internal capability to evaluate, customize, and trust AI systems rather than treating them as opaque vendor products? The technical progress of the next three years will be as dramatic as the last three — but the organizations that benefit most will be those that have learned to ask the right governance questions alongside the capability questions.
2023: Single-model, stateless interactions. Narrow but valuable applications in content, code assist, and Q&A. Context windows: 8K–32K tokens.
2024: Models gain memory and action. Claude 3 family, GPT-4o. Vector databases mainstream. First production agent deployments.
2025: Long-horizon autonomous tasks. Multi-agent orchestration in production. MCP standardizes tool integration. Observability becomes critical.
2026: Agent swarms on complex business workflows. Governance frameworks mature. The bottleneck shifts entirely to organizational readiness.