Alkira > Resources > Network Infrastructure-as-a-Service > Building a New Operating Model: The Architectural Evolution of an Enterprise RAG System

Building a New Operating Model: The Architectural Evolution of an Enterprise RAG System

Building a New Operating Model: The Architectural Evolution of an Enterprise RAG System

 Introduction

In any large enterprise, corporate knowledge is both a critical asset and a monumental challenge. It’s scattered across wikis, ticketing systems, design documents, and support channels—often locked in silos, written by different teams with different terminology. The holy grail has always been a centralized, intelligent system that can bridge these silos, understand a user’s intent, and provide accurate, context-aware answers. We embarked on a journey to build such a system.

This wasn’t a straightforward path. We started with the prevailing trends, hit significant walls, and had to re-evaluate our core assumptions. Additionally, data sovereignty was a non-negotiable requirement – managed off-the-shelf RAG solutions would require sending our sensitive corporate data to third-party services, which was unacceptable. All our data needed to remain within Alkira’s infrastructure at all times. This article documents our architectural journey, from the failures of basic fine-tuning and off-the-shelf RAG frameworks to the design of a custom, high-performance hybrid RAG system. 

 Early Attempts and Lessons Learned

Our initial goal was to create a general-purpose, GPT-style chatbot for internal knowledge. 

The real-world constraints were clear: the system had to handle vague questions and scale across a knowledge base of ever-increasing size and complexity.

Part 1: The Fine-Tuning Fallacy

Our first instinct was to try fine-tuning several open-source models (Llama3- 8B, Qwen3-8B, Mistral-7B). The process was standard: generate Q/A pairs from our docs and train. The results were consistently subpar; the models could only provide correct answers if we phrased the prompt in a very specific way.
Fine-tuning adjusts a model’s weights to specialize its style and vocabulary, but it does not function as a queryable memory. This leads to two critical failures:

  1. Brittleness: The model learns statistical patterns, not factual relationships. It cannot reliably answer questions that deviate from the patterns it was trained on.
  2. Lack of Grounding: The model has no direct reference to the source material when generating an answer. This makes it impossible to verify accuracy and easy to introduce hallucinations.

We quickly put a pause on this approach. Fine-tuning may have its place for narrow, targeted applications, but it was not the solution for a broad, dynamic knowledge base.

Part 2: The RAG Rabbit Hole with Off-the-Shelf Frameworks

The logical next step was Retrieval-Augmented Generation (RAG). The principle is sound: retrieve relevant documents first, then use an LLM to synthesize an answer based on that context.

Our initial exploration of existing open source tools and frameworks proved difficult, with convoluted setups and unresolved issues that blocked even basic proof-of-concepts.

We eventually landed on LightRAG, which showed promise.

  • Prototype v0.1:Using LightRAG as-is with Neo4j and Qdrant, we ingested a small set of documents. The results were promising enough to continue.
  • Prototype v0.2:We forked LightRAG and introduced more sophisticated features: LLM-driven semantic chunking, contextual embeddings, and dynamic entity extraction. However, when we scaled up the knowledge base from ~100 to ~3000 documents, the system broke down.

The answers became unsatisfactory, often containing irrelevant information. We were facing a classic RAG problem:
context pollution.

Context pollution occurs when a retrieval system, overwhelmed by the volume of data, pulls in too many irrelevant or tangentially related documents. When this noisy context is fed to the generator LLM, it struggles to distinguish signal from noise, leading to diluted, incorrect, or nonsensical answers.

So, we concluded that an off-the-shelf framework, while great for getting started, is ultimately too generic and rigid. It doesn’t offer the granular control over ingestion and retrieval logic required to combat context pollution in a complex enterprise environment. We realized that to succeed, we had to build from scratch.

Part 3: The Turning Point – A Hybrid Proof of Concept (v0.3)

Before committing to a full custom build, we created another rapid prototype to validate our new architectural hypotheses. 

  • Knowledge Graph: We shifted to FalkorDB and used its graphrag-sdk for quick ingestion.
  • Vector Database & Hybrid Search: We used Qdrant to experiment with a powerful Hybrid Vector Search. For each document chunk, we generated:
  1. A Dense Vector (Qwen3-8B-Embedding) to capture semantic meaning (e.g., understanding that “how do I find a VM?” is similar to “can I look up a server?”).
  2. A Sparse Vector (SPLADE) to capture keyword importance, crucial for matching exact, specific terms (e.g., a function name like “get_instance_by_id” or an error code “CX-8042”).
  • Querying Backend: We built a minimal FastAPI server using gemini-2.5-flash for inference.

The results were outstanding. With ~5,000 documents, the quality was dramatically better. This validated our path.

Key Lessons from v0.3:

  • Validation: The hybrid Graph + Vector retrieval strategy was unequivocally correct.
  • Insight: In hindsight, we realized that supporting both sparse and dense vectors added significant query latency. It was a powerful technique, but a single, high-quality dense vector proved sufficient for our needs, especially in conjunction with graph based queries.
  • Discovery: We were using two databases because we hadn’t realized FalkorDB could handle vectors natively. This discovery directly led to the more streamlined, single-database architecture of v0.4.

With our core assumptions validated, we were ready to start building production-ready system.

The Solution: A Custom, Hybrid RAG Architecture (AKGPT v0.4)

Our final architecture, codenamed AKGPT, was designed around a central principle: total control. This control extends to data sovereignty – ensuring that all sensitive corporate knowledge remains within Alkira’s infrastructure and is never exposed to external services.


1. The Data Layer: FalkorDB at the Core

The foundation is FalkorDB, a high-performance fork of Redis. We chose it for two key reasons:

  • Low-Latency Hybrid Storage: It natively supports both a Knowledge Graph and Vector Embeddings in a single in-memory engine, eliminating the complexity and network overhead of syncing two separate databases.
  • High-Performance Graph Traversal: Under the hood, it represents the graph using sparse matrices. This allows for CPU cache-friendly sequential memory access during traversals, making it significantly faster than traditional graph databases that rely on pointer chasing.

2. The Ingestion Pipeline: Building Hybrid Knowledge

This is where we solve the context pollution problem. Our ingestion process is designed to create a rich, interconnected knowledge structure that supports multiple retrieval strategies. It’s a manually triggered, parallelized workload managed by a Redis queue.

  1. Pre-processing & Cleaning: First, an LLM acts as a gatekeeper, reviewing every input file to ensure it contains viable technical content (e.g., troubleshooting, bug fixes, configurations). It filters out administrative chatter, PII, and other noise. This is our first line of defense against garbage-in, garbage-out.
  2. Semantic Chunking: Cleaned files are chunked based on semantic meaning, not fixed token counts, to ensure that context is preserved within each chunk.
  3. Entity & Relationship Extraction: An LLM analyzes each chunk to extract key entities and their relationships based on a pre-defined schema (e.g., (Service)-[:HAS_PARAMETER]->(Parameter)).
  4. Vector Embedding: The exact same text chunk is then embedded into a dense vector using Qwen3-embedding-8b (4096 dimensions).
  5. Hybrid Linking: This is the most critical step. We create a unified knowledge structure by combining semantic and conceptual information in our graph database. The process involves establishing connections between different data representations to enable multiple retrieval pathways.

This hybrid approach creates explicit links between conceptual information and its source context, allowing our system to traverse from abstract concepts to their grounding documentation. This design enables both precise entity-based retrieval and broader semantic search within a single, cohesive framework.

3. The Retrieval Engine: A Dual-Path Hybrid Search 


Step 1:Query Enhancement

The user’s prompt is intercepted and expanded by an LLM armed with a detailed system prompt containing foundational knowledge. For example, the system prompt knows: “Alkira’s network is composed of Cloud Exchange Points (CXPs). A CXP connects to cloud resources via Connectors.” This reframes the user’s potentially vague query into a more technically precise question.


Step 2:Parallel Retrieval

The enhanced query is dispatched to two retrieval processes that run in parallel:

Path A:Graph Retrieval (Precision)

  1. Entity Extraction: We extract entities from the enhanced query using the same schema as our ingestion pipeline.Graph Traversal: The engine searches the Knowledge Graph for these entities and traverses the relationships to retrieve required chunks.

Path B:Semantic Retrieval (Recall)

  1. Embedding: The enhanced query is vectorized using the same model as our ingestion pipeline.
  2. Vector Search: A vector similarity search is performed against the chunk embeddings indexed in FalkorDB. 


Step 2.5:The Importance of the Reranker

The initial retrieval in both paths is optimized for speed and recall, meaning it casts a wide net and may still include noise. The retrieved results are then passed to a more specialized Reranker LLM (Qwen3-reranker-8b). This model’s sole job is to perform a more computationally expensive, fine-grained analysis, scoring the initial results for relevance against the specific query and filtering out any remaining context pollution. We take the top 10 from each path.

Step 3:Synthesis

The top 20 unique results (10 Graph +10 Semantic) are combined into a single, rich context. This context is then passed to our final generator LLM (gemini-2.5-flash), which synthesizes a comprehensive answer.

Verification

The system now excels where previous versions failed.

Vague Conceptual Query: “How does our billing system handle multi-region failover?”

  • The vector search path finds documents conceptually related to “billing” and “failover.”
  • The graph search path finds specific entities like BillingService, FailoverPolicy, and traverses to the exact chunks describing their interaction.
  • The final synthesis provides a detailed architectural explanation.

Specific Technical Query: “What’s the command to restart the auth-service in production?”

  • The graph search path immediately identifies the auth-service entity and finds chunks from runbooks or wikis containing restart commands.
  • The vector search provides additional context, perhaps explaining the impact of a restart.
  • The final answer gives the command and relevant warnings.

Caveats & Future Enhancements

This is not a simple, plug-and-play solution. It requires significant custom development and expertise in graph databases, vector search, and LLM orchestration. Key considerations include: 

  • Infrastructure: FalkorDB is in-memory, so RAM is a primary scaling cost.
  • Cost & Latency: The multi-LLM pipeline (cleaning, extraction, enhancement, reranking, synthesis) has cost and latency implications that must be monitored.
  • Schema Rigidity: The pre-defined entity schema is powerful but requires maintenance. As new types of knowledge are introduced, the schema and extraction logic may need to be updated.

Our roadmap is focused on maturing this platform into a fully automated, agentic system:

  • Automated Ingestion: Scheduling data downloads from sources like Jira, Confluence, and Slack.
  • Advanced Graph Retrieval: Incorporating temporal awareness (weighting newer information higher) and expanding searches to “2-hop” neighbors to capture indirect relationships.
  • Agentic Orchestration: Evolving from a pure RAG system to an agent that can actively retrieve live data or trigger actions, and perhaps even perform troubleshooting.
  • Performance Optimization: Implementing a corrective RAG loop for self-critique and a caching layer for frequently asked questions.

Conclusion

Building an enterprise AI assistant turned out to be quite the architectural adventure. We started with off-the-shelf frameworks which were great for getting started, but we quickly hit their limits – they trade fine-grained control for convenience. When you’re dealing with complex, ever-changing knowledge bases, a custom hybrid approach really shines. By mixing the exactness of a Knowledge Graph with the wide reach of semantic vector search, we created a system that sidesteps the context pollution problems that trip up so many RAG implementations, while maintaining complete data sovereignty within Alkira’s infrastructure. Sure, the dual-path retrieval adds some complexity, but it means our users get answers that are both thorough and firmly rooted in our collective know-how.

References 

FAQ

Why didn’t fine-tuning alone work for enterprise knowledge retrieval? +
Fine-tuning improved language style but failed at factual recall, leading to brittle responses and hallucinations without grounded source references.
What problems did off-the-shelf RAG frameworks encounter at scale? +
As document volume increased, retrieval quality degraded due to context pollution, where irrelevant data overwhelmed the LLM and diluted answers.
What makes Alkira’s hybrid RAG architecture different? +
It combines a knowledge graph and vector search in a single FalkorDB engine, enabling precise entity-based retrieval alongside semantic recall with low latency.
How does this approach ensure data sovereignty and answer accuracy? +
All data stays within Alkira’s infrastructure, and dual-path retrieval with reranking ensures responses are both relevant and grounded in verified source content.

Further Reading

Technical “Building A New Operating Model” Blog Series
Technical Blog Part 2: “Building a New Operating Model: Building Practical AI Agents with DGX Spark and Alkira CXPs

You May Also Like

Alkira mobile app screens

Introducing the Alkira Mobile App: Network Visibility Wherever, Whenever

Enterprise networks are expected to run 24/7, and the teams responsible for them need visibility wherever work happens. Cloud environments, partner connections, security services, and provisioning workflows are constantly changing. When something needs attention, network and operations teams need a fast way to understand what happened, assess impact, and take the right next step. That...
Jacob Donovan
Simple diagram showing a network as a platform

The Network Needs To Be Part of Your AI Strategy

Enterprises are moving quickly on AI, but many are still running networking models designed for a slower, more centralized and static era. Today’s network has to connect clouds, data centers, campuses, branches, partner environments, and increasingly private AI infrastructure while enforcing consistent policy across all of it. That creates a new operational reality: every new...
Calvin Nguyen
Blue network shield checkmark illustration

Navigating DORA: Operational Resilience and Security by Design

The Digital Operational Resilience Act (DORA) is reshaping how financial institutions in the European Union manage operational risk related to information and communication technology (ICT). As the regulation takes effect, organizations must ensure that their critical ICT service providers support strong operational resilience, risk management, and oversight capabilities. For technology providers supporting financial institutions, this...
Misbah Rehman