Natural language interfaces for data are no longer experimental—they’re becoming essential enterprise tools. But building a reliable, production-grade data agent requires moving beyond simple prompt engineering. This post shares architectural lessons learned from building an enterprise natural language to SQL (NL2SQL) agent.
The Problem with Pipeline Architectures
Most NL2SQL systems start with a pipeline approach:
User Question → Intent Detection → RAG Retrieval → SQL Generation → Validation → Response
This works for demos but creates problems at scale:
| Issue | Root Cause |
|---|---|
| Rigid flows | Each question type needs a predefined path |
| Context fragmentation | RAG retrieves fragments, loses coherence |
| No memory | Every session starts from scratch |
| Hard to debug | Which pipeline stage failed? |
| Limited extensibility | Adding capabilities means adding stages |
After building a system this way and hitting these ceilings, we rearchitected from first principles.
Design Philosophy: Agent-Native
The core shift is from “pipeline with steps” to “capable agent with rich context.”
Instead of orchestrating the agent through a fixed flow, we give it:
- Rich, structured context loaded deterministically
- Tools it can use as needed
- Memory that persists across sessions
- The autonomy to decide how to answer each question
The agent doesn’t follow a script—it reasons about the task and uses appropriate tools.
Guiding Principles
| Principle | What It Means |
|---|---|
| Context over prompting | Structured knowledge beats clever prompts |
| Tools over hardcoding | Agent picks tools, not forced through pipelines |
| Memory over statelessness | Interactions inform future ones |
| Skills over sub-agents | Composable workflows, not rigid hierarchies |
| Learning over curation | System grows from feedback, humans seed + review |
The Architecture
Here’s the high-level structure we arrived at:

*(Figure: Agent Architecture)*
1. Orchestrator with Workflow Graph
While we give the agent autonomy, we wrap it in a lightweight workflow graph with named phases. This isn’t a return to rigid pipelines—it provides structure for recovery and debugging while preserving flexibility.
Phases:
- Understand: Classify intent, assess risk
- Retrieve: Load scoped context
- Plan: Break down complex tasks (optional)
- Execute: Run tool calls
- Verify: Validate results with deterministic checks
- Reflect: Assess completeness, replan if needed
- Summarize: Generate response with provenance
- Persist: Update memory through policy gates
Each phase has explicit recovery paths. If verification fails, we can reflect and replan rather than failing the entire request.
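As a rough sketch of how this might be wired (the `Phase` enum, `run_workflow`, and the handler signature below are illustrative names, not our actual framework), the key idea is that a failed phase follows a recovery edge to Reflect instead of aborting the request:

```python
from enum import Enum, auto

class Phase(Enum):
    UNDERSTAND = auto()
    RETRIEVE = auto()
    PLAN = auto()
    EXECUTE = auto()
    VERIFY = auto()
    REFLECT = auto()
    SUMMARIZE = auto()
    PERSIST = auto()

# Forward edges, plus explicit recovery edges (a failed VERIFY routes to REFLECT,
# which replans rather than failing the whole request).
NEXT_ON_SUCCESS = {
    Phase.UNDERSTAND: Phase.RETRIEVE,
    Phase.RETRIEVE: Phase.PLAN,
    Phase.PLAN: Phase.EXECUTE,
    Phase.EXECUTE: Phase.VERIFY,
    Phase.VERIFY: Phase.SUMMARIZE,
    Phase.REFLECT: Phase.PLAN,
    Phase.SUMMARIZE: Phase.PERSIST,
}
NEXT_ON_FAILURE = {
    Phase.EXECUTE: Phase.REFLECT,
    Phase.VERIFY: Phase.REFLECT,
}

def run_workflow(handlers, state, max_replans=2):
    """Walk the phase graph; each handler returns True/False for its phase."""
    phase, replans = Phase.UNDERSTAND, 0
    while phase is not None:
        ok = handlers[phase](state)
        if ok:
            phase = NEXT_ON_SUCCESS.get(phase)          # None after PERSIST ends the loop
        elif phase in NEXT_ON_FAILURE and replans < max_replans:
            replans += 1
            phase = NEXT_ON_FAILURE[phase]              # recover instead of raising
        else:
            raise RuntimeError(f"{phase.name} failed with no recovery path")
    return state
```

The `max_replans` bound is one simple way to keep the reflect/replan loop from running forever; the real budget would come from configuration.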
2. Context Layer: Hierarchical, Not Retrieved
We replaced RAG-based retrieval with deterministic, hierarchical context loading.

*(Figure: Context Hierarchy)*
Why this works better:
- Debuggable: You know exactly what context was loaded
- Versionable: Context lives in git, changes are reviewed
- Fast: No embedding lookups, just file reads
- Coherent: Full documents, not fragments
The hierarchy follows our multi-tenant structure. Each level can override or extend the level above it: user preferences override facility defaults, which override customer settings, which override global knowledge.
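A minimal sketch of what deterministic loading can look like, assuming a hypothetical file layout (`context/global.md`, then per-customer, per-facility, and per-user files) rather than our actual repository structure:

```python
from pathlib import Path

CONTEXT_ROOT = Path("context")

def load_context(customer_id: str, facility_id: str, user_id: str) -> tuple[str, list[str]]:
    """Concatenate context files from broadest to most specific scope.

    More specific layers come later and override earlier ones by convention.
    The list of loaded paths is returned so traces show exactly what the
    agent saw -- no embedding lookups, just file reads.
    """
    layers = [
        CONTEXT_ROOT / "global.md",
        CONTEXT_ROOT / "customers" / f"{customer_id}.md",
        CONTEXT_ROOT / "facilities" / f"{facility_id}.md",
        CONTEXT_ROOT / "users" / f"{user_id}.md",
    ]
    loaded, parts = [], []
    for path in layers:
        if path.exists():                      # missing layers are simply skipped
            loaded.append(str(path))
            parts.append(f"<!-- source: {path} -->\n{path.read_text()}")
    return "\n\n".join(parts), loaded
```

Because the files live in git, a context change is a reviewable diff, and reproducing what the agent knew at any point in time is a checkout, not a vector-index archaeology exercise.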
3. Tools as Capabilities
Instead of sub-agents with narrow responsibilities, we expose capabilities as tools the agent can invoke:
| Tool | Purpose |
|---|---|
| query_data | Execute analytical queries |
| visualize | Generate charts and graphs |
| search_docs | Find relevant documentation |
| ask_user | Request clarification |
| invoke_skill | Run predefined workflows |
The agent decides which tools to use based on the question. A simple factual query might just use query_data. A complex analysis might chain multiple tools. An ambiguous request might start with ask_user for clarification.
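One way to expose capabilities like this is a small registry that pairs each handler with the description the model sees. The decorator and function names below are illustrative, not a specific provider's API:

```python
from typing import Callable

TOOLS: dict[str, dict] = {}

def tool(name: str, description: str):
    """Register a function as an agent-callable capability."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"description": description, "handler": fn}
        return fn
    return register

@tool("query_data", "Execute an analytical SQL query and return rows.")
def query_data(sql: str) -> list[dict]:
    ...  # dispatch to the warehouse client

@tool("ask_user", "Ask the user a clarifying question before proceeding.")
def ask_user(question: str) -> str:
    ...  # surface the question through the chat UI

def tool_manifest() -> list[dict]:
    """Shape the registry into the tool-spec list handed to the model."""
    return [{"name": n, "description": t["description"]} for n, t in TOOLS.items()]
```

Adding a capability means registering another tool, not threading a new stage through a pipeline.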
4. Memory: Three Tiers
Memory operates at three timescales:

*(Figure: Memory Tiers)*
| Tier | Scope | Lifetime | Purpose |
|---|---|---|---|
| Session | Conversation | Minutes/hours | “What did they just ask?” |
| User | Individual | Persistent | “How does this person work?” |
| Collective | Organization | Persistent | “What patterns work here?” |
Session memory enables follow-ups: “Show me the same for last week” works because the agent remembers what “the same” means.
User memory learns preferences: If a user corrects “best performer means lowest error rate, not highest volume,” we remember that.
Collective memory accumulates organizational knowledge: Successful query patterns, validated business term definitions, learned thresholds.
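A simplified sketch of the tiers, assuming an in-process store (the real user and collective tiers would be backed by persistent storage, and writes would pass through the policies described later):

```python
import time
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    key: str
    value: str
    created_at: float = field(default_factory=time.time)
    expires_at: float | None = None        # hypotheses can carry an expiry

class MemoryStore:
    """One store per tier; only the lifetime and scope differ."""
    def __init__(self):
        self._entries: dict[str, MemoryEntry] = {}

    def write(self, key: str, value: str, ttl_seconds: float | None = None):
        expires = time.time() + ttl_seconds if ttl_seconds else None
        self._entries[key] = MemoryEntry(key, value, expires_at=expires)

    def read(self, key: str) -> str | None:
        entry = self._entries.get(key)
        if entry and (entry.expires_at is None or entry.expires_at > time.time()):
            return entry.value
        return None

session_memory = MemoryStore()      # "what did they just ask?"
user_memory = MemoryStore()         # "how does this person work?"
collective_memory = MemoryStore()   # "what patterns work here?"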
5. Skills: Composable Workflows
Skills are reusable, multi-step workflows defined declaratively:
```markdown
# Skill: period_comparison

## Description
Compare a metric across two or more time periods.

## Inputs
- metric: The measure to compare
- dimension: How to group results
- periods: List of time periods

## Steps
1. Parse and validate period definitions
2. Generate query for each period
3. Execute queries
4. Calculate deltas and trends
5. Format comparison table
```
Skills can be:
- Human-authored: Engineers write them like any other code
- Agent-suggested: The agent drafts skills when it sees repeated patterns
Both go through the same lifecycle: draft → test → security scan → human approval → production.
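The lifecycle is enforceable in code. A sketch of what the `invoke_skill` tool might check before running anything (the `Skill` and `SkillStatus` types are illustrative):

```python
from dataclasses import dataclass
from enum import Enum

class SkillStatus(Enum):
    DRAFT = "draft"
    TESTED = "tested"
    SCANNED = "scanned"
    APPROVED = "approved"       # only approved skills run in production

@dataclass
class Skill:
    name: str
    description: str
    inputs: list[str]
    steps: list[str]            # step descriptions the agent expands into tool calls
    status: SkillStatus = SkillStatus.DRAFT

SKILLS: dict[str, Skill] = {}

def invoke_skill(name: str, **kwargs):
    """Entry point behind the invoke_skill tool: refuse anything not yet approved."""
    skill = SKILLS.get(name)
    if skill is None:
        raise KeyError(f"unknown skill: {name}")
    if skill.status is not SkillStatus.APPROVED:
        raise PermissionError(f"skill '{name}' has not completed the approval lifecycle")
    missing = [i for i in skill.inputs if i not in kwargs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    # In the real system the agent executes each step via its tools; here we
    # just return the plan it would follow.
    return [f"{step} ({kwargs})" for step in skill.steps]
```

Agent-suggested skills enter this registry as drafts, so nothing the agent writes for itself can execute until a human has approved it.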
Production Considerations
The architecture above is necessary but not sufficient. Production systems need additional guardrails.
Risk-Tier + Verifier Gating
We don’t trust the agent’s self-reported confidence. Instead, we use deterministic verifiers and risk classification:
Risk Tiers:
| Tier | Actions | Auto-Execute? |
|---|---|---|
| Read-only | Queries, searches | Yes (if verifiers pass) |
| Write-back | Memory updates | Requires staging |
| System-modify | New capabilities | Requires human approval |
Verifiers:
- SQL syntax validation
- Schema matching (referenced tables/columns exist)
- Result sanity checks (row counts, value ranges)
- Cost estimation
Only when risk is low AND verifiers pass AND historical success is high do we auto-execute. Everything else requires staging or explicit approval.
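The gate itself can be a small, boring function, which is the point: the decision is deterministic and auditable. The names and the 0.95 success threshold below are illustrative assumptions:

```python
from enum import Enum

class RiskTier(Enum):
    READ_ONLY = "read_only"
    WRITE_BACK = "write_back"
    SYSTEM_MODIFY = "system_modify"

def should_auto_execute(tier: RiskTier,
                        verifier_results: dict[str, bool],
                        historical_success_rate: float,
                        success_threshold: float = 0.95) -> str:
    """Combine risk tier, deterministic verifiers, and track record into one decision."""
    if tier is RiskTier.SYSTEM_MODIFY:
        return "require_human_approval"
    if tier is RiskTier.WRITE_BACK:
        return "stage_for_review"
    if all(verifier_results.values()) and historical_success_rate >= success_threshold:
        return "auto_execute"
    return "stage_for_review"

# A read-only query that passed every verifier and has a strong track record.
decision = should_auto_execute(
    RiskTier.READ_ONLY,
    {"sql_syntax": True, "schema_match": True, "result_sanity": True, "cost_estimate": True},
    historical_success_rate=0.98,
)  # -> "auto_execute"
```

Note that the agent's self-reported confidence never appears in the signature; only verifier outcomes and observed history count.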
Memory Write Policies
Memory can silently degrade an agent if you let it write unverified beliefs. We enforce explicit policies:
User Memory:
- ✅ Write explicit preferences (user confirms)
- ✅ Write patterns after 3+ consistent observations
- ❌ Never write business “facts” (must come from system of record)
- ⚠️ Hypotheses get 30-day expiry
Collective Memory:
- ✅ Patterns validated through staging process
- ❌ No direct writes from agent—everything staged first
- Human curation for glossary and schema annotations
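As a sketch of how these rules might be enforced at the write path (the kind names, the 3-observation minimum, and the 30-day expiry mirror the policies above; the function itself is illustrative):

```python
# Hypothetical policy constants; the real thresholds live in configuration.
MIN_OBSERVATIONS = 3
HYPOTHESIS_TTL_DAYS = 30

def can_write_user_memory(kind: str,
                          observations: int = 0,
                          user_confirmed: bool = False) -> tuple[bool, float | None]:
    """Return (allowed, ttl_seconds) for a proposed user-memory write."""
    if kind == "business_fact":
        return False, None                          # facts must come from the system of record
    if kind == "explicit_preference":
        return user_confirmed, None                 # only after the user confirms
    if kind == "observed_pattern":
        return observations >= MIN_OBSERVATIONS, None
    if kind == "hypothesis":
        return True, HYPOTHESIS_TTL_DAYS * 24 * 3600  # expires automatically
    return False, None                              # unknown kinds are rejected
```

Collective memory is stricter still: the agent can only propose entries into a staging area, and nothing reaches the shared store without passing validation and, for glossary or schema annotations, human review.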
Provenance and Observability
Every response carries provenance:
- What data sources were accessed
- What filters and joins were applied
- What assumptions were made
- What verifiers passed
- Full trace of tool calls
This isn’t cosmetic. It’s what makes stakeholders trust the system and what makes incidents debuggable.
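Concretely, provenance can travel as a structured record attached to every response and persisted with the trace. A minimal sketch, with field names that are illustrative rather than our exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Provenance:
    """Structured provenance attached to a response and stored alongside its trace."""
    sources: list[str] = field(default_factory=list)        # tables/views accessed
    filters: list[str] = field(default_factory=list)        # filters and joins applied
    assumptions: list[str] = field(default_factory=list)    # e.g. "week starts on Monday"
    verifiers_passed: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)    # full ordered trace of tool calls

    def summary(self) -> str:
        return (f"{len(self.sources)} source(s), {len(self.tool_calls)} tool call(s), "
                f"verifiers passed: {', '.join(self.verifiers_passed) or 'none'}")
```

During an incident, the question "why did the agent say that?" becomes a lookup against this record rather than a reconstruction exercise.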
Implementation Strategy: Skeleton First
You can’t build all of this at once. Our strategy:
Phase 1: Skeleton. Build the minimum viable foundation:
- Orchestrator with core phases
- Context layer (migrated existing knowledge)
- Core tools (query, visualize, search, clarify)
- Session memory (full)
- User/collective memory (stubs)
- Basic trace store
Phase 2+: Enhance. With the skeleton in production, layer on:
- Full memory system with write policies
- Skills framework with lifecycle
- Learning loop with risk-tier gating
- Self-extension capabilities
This approach de-risks the migration. We prove the architecture works before adding complexity.
Key Takeaways
- Pipeline → Agent-Native: Give agents autonomy with structure, not rigid flows
- RAG → Hierarchical Context: Deterministic loading beats probabilistic retrieval for core knowledge
- Confidence → Verifiers: Don’t trust LLM self-assessment; use deterministic checks
- Memory Needs Policies: Without write rules, memory degrades quality over time
- Provenance is Required: Enterprise trust requires knowing exactly what happened
The shift from “pipeline that calls LLMs” to “capable agent with tools” is more than architectural—it changes how the system can evolve. New capabilities become tools or skills, not new pipeline stages. The agent learns from usage rather than requiring manual updates. And the system becomes more debuggable, not less, as it grows more capable.
This post describes architectural patterns from building enterprise data agents. The specific implementation details are illustrative rather than prescriptive—your domain and constraints will differ.