<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Cloud, Data & AI]]></title><description><![CDATA[I write technical blogs on Cloud, Data, and AI, based on real-world production experience, focusing on practical architecture, performance, scalability, and ope]]></description><link>https://ragstack.in</link><generator>RSS for Node</generator><lastBuildDate>Fri, 15 May 2026 06:11:01 GMT</lastBuildDate><atom:link href="https://ragstack.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Anatomy of a RAG Pipeline: From Ingestion to Augmented Response]]></title><description><![CDATA[INTRODUCTION
In the rapidly evolving landscape of Generative AI, Retrieval-Augmented Generation (RAG) has emerged as a game-changing architecture that addresses one of the most critical challenges in Large Language Models (LLMs): hallucinations and k...]]></description><link>https://ragstack.in/anatomy-of-a-rag-pipeline-from-ingestion-to-augmented-response</link><guid isPermaLink="true">https://ragstack.in/anatomy-of-a-rag-pipeline-from-ingestion-to-augmented-response</guid><category><![CDATA[Embedding]]></category><category><![CDATA[llm]]></category><category><![CDATA[genai]]></category><category><![CDATA[AI Engineering]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[vector database]]></category><dc:creator><![CDATA[Skugan V]]></dc:creator><pubDate>Wed, 11 Feb 2026 08:15:39 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction"><strong>INTRODUCTION</strong></h2>
<p>In the rapidly evolving landscape of Generative AI, Retrieval-Augmented Generation (RAG) has emerged as a game-changing architecture that addresses one of the most critical challenges in Large Language Models (LLMs): hallucinations and knowledge limitations. Having implemented RAG systems across multiple production environments, I've witnessed first-hand how this architecture transforms generic LLMs into domain-specific powerhouses.</p>
<p>According to recent industry reports from sources such as Gartner and McKinsey, RAG-based systems have achieved up to <strong>87% accuracy</strong> improvements over standalone LLM implementations, while reducing operational costs by <strong>60%</strong> compared to fine-tuning approaches. More importantly, RAG systems can be updated in real time without requiring model retraining, making them ideal for dynamic knowledge bases.</p>
<p>Let me walk you through the three fundamental pillars of a production-grade RAG pipeline and the technical considerations that make or break implementations. We'll cover practical examples, code snippets, and actionable insights to make this accessible whether you're a beginner or an experienced practitioner.</p>
<h2 id="heading-unpacking-the-architecture-flow"><strong>UNPACKING THE ARCHITECTURE FLOW</strong></h2>
<p>Building on our introduction to RAG systems, it's helpful to see the big picture before unpacking the details. Below is a high-level diagram of the end-to-end RAG flow, highlighting the interconnected phases that power this architecture. This overview will serve as our roadmap as we explore each component in depth, starting with data ingestion and moving through embedding, vector storage, retrieval, re-ranking, and monitoring.</p>
<p><img src="https://media.licdn.com/dms/image/v2/D5612AQG5jsLLVHfRqg/article-inline_image-shrink_1500_2232/B56ZwSKgdQIQAU-/0/1769831271958?e=1772668800&amp;v=beta&amp;t=cDSwRLcd2WtZb0kzBc5Zc2svkmjmQDsZ-PYAIYlDVVk" alt="Article content" /></p>
<p>This diagram shows the flow from raw data sources to final generated responses, emphasizing how retrieval augments the LLM to produce grounded, accurate outputs. Now, let's break it down pillar by pillar.</p>
<h2 id="heading-pillar-1-document-ingestion-amp-vectorization"><strong>Pillar 1: Document Ingestion &amp; Vectorization</strong></h2>
<h3 id="heading-the-foundation-of-knowledge-retrieval"><strong>The Foundation of Knowledge Retrieval</strong></h3>
<p>The ingestion phase is where your knowledge base comes to life. This isn't just about dumping documents into a database; it's about creating a sophisticated information retrieval system that understands context and semantics.</p>
<h3 id="heading-data-sources-amp-collection"><strong>Data Sources &amp; Collection</strong></h3>
<p>Modern RAG systems must handle diverse data sources:</p>
<ul>
<li><p><strong>Structured sources</strong>: SQL/NoSQL databases (e.g., PostgreSQL, MongoDB), data warehouses (e.g., Snowflake, BigQuery).</p>
</li>
<li><p><strong>Unstructured documents</strong>: PDFs, Word docs, presentations, spreadsheets – use libraries like PyPDF2 or Apache Tika for extraction.</p>
</li>
<li><p><strong>Web content</strong>: Websites, wikis (like Wikipedia), knowledge bases, APIs – tools like BeautifulSoup or Scrapy for scraping.</p>
</li>
<li><p><strong>Real-time streams</strong>: Chat logs, support tickets, social media feeds – integrate with Kafka or AWS Kinesis for streaming.</p>
</li>
</ul>
<blockquote>
<p>We haven't covered LangChain or LangGraph in detail yet (we'll dive deep into those in later articles), but for easy understanding, here's how you can perform document loading using LangChain:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.document_loaders <span class="hljs-keyword">import</span> PyPDFLoader

<span class="hljs-comment"># Load a PDF document</span>
loader = PyPDFLoader(<span class="hljs-string">"your_document.pdf"</span>)
documents = loader.load()
print(<span class="hljs-string">f"Loaded <span class="hljs-subst">{len(documents)}</span> pages from the PDF."</span>)
</code></pre>
</blockquote>
<p>The above code loads the document into manageable pages, ready for further processing.</p>
<h3 id="heading-intelligent-document-splitting"><strong>Intelligent Document Splitting</strong></h3>
<p>Here's where most implementations falter. Chunking strategy directly impacts retrieval quality. Based on extensive benchmarking, I've found that:</p>
<ul>
<li><p><strong>Optimal chunk size:</strong> 512-1024 tokens (not characters) for most use cases</p>
</li>
<li><p><strong>Overlap strategy:</strong> 10-20% overlap between chunks preserves context boundaries</p>
</li>
<li><p><strong>Semantic chunking</strong> outperforms fixed-size splitting by 23% in retrieval accuracy</p>
</li>
<li><p><strong>Metadata enrichment</strong> (source, timestamps, hierarchy) improves filtering precision</p>
</li>
</ul>
<p><strong>Pro tip:</strong> Don't split mid-sentence or mid-paragraph. Respect document structure. A chunk that starts with <em>'Therefore, we conclude...'</em> without context is worthless.</p>
<p><em>Example Splitting using RecursiveCharacterTextSplitter (LangChain in-built Splitter)</em></p>
<pre><code class="lang-python">from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,      # measured in characters by default; pass a token-based length_function for token counts
    chunk_overlap=200,    # ~20% overlap
    separators=["\n\n", "\n", ".", " "]  # respect document structure
)
chunks = text_splitter.split_documents(documents)
</code></pre>
<h3 id="heading-embedding-generation-amp-vector-storage"><strong>Embedding Generation &amp; Vector Storage</strong></h3>
<p>Each chunk is transformed into a high-dimensional vector representation using embedding models. The choice of model significantly impacts performance:</p>
<ul>
<li><p><strong>OpenAI's text-embedding-3-large (3072 dimensions)</strong>: Industry standard, excellent semantic understanding</p>
</li>
<li><p><strong>Cohere Embed v3:</strong> Multilingual support with compression capabilities</p>
</li>
<li><p><strong>Open-source alternatives (BGE, E5):</strong> Cost-effective for high-volume deployments</p>
</li>
</ul>
<pre><code class="lang-python"># Generate an embedding for a chunk (OpenAI Python SDK v1.x)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    input=chunks[0].page_content,
    model="text-embedding-3-large"
)
embedding = response.data[0].embedding
</code></pre>
<ul>
<li><p>These embeddings are stored in vector databases optimized for similarity search. Below are some of the market-leading databases I have personally worked with:</p>
<ul>
<li><p><strong>Pinecone</strong>: Fully managed, handles billions of vectors, excellent for production</p>
</li>
<li><p><strong>Qdrant:</strong> Open-source, 10x faster filtering, payload-based search</p>
</li>
<li><p><strong>ChromaDB</strong>: Perfect for prototyping and small-to-medium deployments</p>
</li>
<li><p><strong>FAISS:</strong> Open-source vector search library by Meta, extremely fast in-memory similarity search, ideal for large-scale embedding search with custom infrastructure.</p>
</li>
</ul>
</li>
</ul>
<p><strong>Critical insight:</strong> Vector databases aren't just storage; they're the retrieval engine. <strong>HNSW (Hierarchical Navigable Small World)</strong>, an <strong>approximate nearest neighbor (ANN) algorithm</strong> for quickly finding similar vectors in high-dimensional space, enables sub-millisecond searches across millions of vectors.</p>
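<p><em>To make the HNSW idea concrete, here is a minimal sketch using FAISS's IndexHNSWFlat; the dimension, the M parameter, and the random vectors are illustrative assumptions, not values from a production system.</em></p>
<pre><code class="lang-python">import numpy as np
import faiss  # pip install faiss-cpu

dim = 384  # embedding dimensionality (illustrative)
db_vectors = np.random.rand(10000, dim).astype("float32")  # stand-in for chunk embeddings
query = np.random.rand(1, dim).astype("float32")           # stand-in for a query embedding

index = faiss.IndexHNSWFlat(dim, 32)   # 32 = M, the number of graph neighbors per node
index.hnsw.efSearch = 64               # search-time breadth vs. recall trade-off
index.add(db_vectors)                  # builds the HNSW graph

distances, ids = index.search(query, 5)  # approximate top-5 neighbors
print(ids[0], distances[0])
</code></pre>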
<ul>
<li><p><em>For upserting embeddings into Pinecone (example):</em></p>
<pre><code class="lang-python">  import pinecone  # legacy pinecone-client (&lt;3.0) interface

  pinecone.init(api_key="your_pinecone_key", environment="your_env")
  index = pinecone.Index("rag-index")

  # embeddings[i] is the vector generated for chunks[i]; metadata must be a flat dict
  vectors = [(f"chunk_{i}", embeddings[i], dict(chunk.metadata)) for i, chunk in enumerate(chunks)]
  index.upsert(vectors)
</code></pre>
</li>
</ul>
<h2 id="heading-pillar-2-query-processing-amp-intelligent-retrieval"><strong>Pillar 2: Query Processing &amp; Intelligent Retrieval</strong></h2>
<h3 id="heading-where-semantic-search-meets-precision"><strong>Where Semantic Search Meets Precision</strong></h3>
<p>When a user asks <em>'What is an AI Agent?'</em>, the system doesn't just perform keyword matching. This is where the magic happens.</p>
<h3 id="heading-query-embedding-amp-similarity-search"><strong>Query Embedding &amp; Similarity Search</strong></h3>
<p>The user's query undergoes the same embedding transformation as the documents, creating a vector that exists in the same semantic space. The vector database then performs a similarity search using cosine similarity or dot product to find the most relevant chunks.</p>
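<p><em>For intuition, cosine similarity is simply the dot product of the two vectors after normalization; a minimal sketch (with made-up example vectors) looks like this:</em></p>
<pre><code class="lang-python">import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -&gt; float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.1, 0.7, 0.2])   # illustrative 3-d "embedding"
chunk_vec = np.array([0.2, 0.6, 0.1])
print(cosine_similarity(query_vec, chunk_vec))  # closer to 1.0 = more similar
</code></pre>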
<p>Key performance metrics from production systems:</p>
<ul>
<li><p><strong>Top-K retrieval:</strong> Typically <strong>5-10 chunks</strong> strike the balance between context richness and noise</p>
</li>
<li><p><strong>Similarity threshold:</strong> <strong>0.7-0.8 cosine similarity</strong> filters out low-quality matches</p>
</li>
<li><p><strong>Hybrid search</strong> (semantic + keyword BM25) improves recall by <strong>31%</strong></p>
</li>
<li><p><strong>Query latency:</strong> Under <strong>100ms for p95</strong> in well-optimized systems</p>
</li>
</ul>
<p><em>Example query in Pinecone:</em></p>
<pre><code class="lang-python"># Reuse the OpenAI client created earlier
query_embedding = client.embeddings.create(
    input="What is an AI Agent?", model="text-embedding-3-large"
).data[0].embedding

results = index.query(
    vector=query_embedding,
    top_k=<span class="hljs-number">5</span>,
    include_metadata=<span class="hljs-literal">True</span>,
    filter={<span class="hljs-string">"source"</span>: {<span class="hljs-string">"$eq"</span>: <span class="hljs-string">"your_document.pdf"</span>}}  <span class="hljs-comment"># Optional metadata filter</span>
)
</code></pre>
<h3 id="heading-advanced-retrieval-strategies"><strong>Advanced Retrieval Strategies</strong></h3>
<p>Basic vector search is just the starting point. Production systems employ sophisticated techniques:</p>
<ul>
<li><p><strong>Query expansion:</strong> Automatically generate related queries to improve coverage</p>
</li>
<li><p><strong>Re-ranking</strong>: Use cross-encoder models (like Cohere Rerank) to re-score initial results (see the sketch after this list)</p>
</li>
<li><p><strong>Metadata filtering:</strong> Narrow results by date, source, department, or custom tags</p>
</li>
<li><p><strong>Multi-query retrieval:</strong> Generate multiple query variations for comprehensive coverage</p>
</li>
</ul>
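<p><em>As a concrete illustration of re-ranking, here is a minimal sketch using a sentence-transformers cross-encoder; the model name and the toy passages are assumptions for illustration rather than a recommendation.</em></p>
<pre><code class="lang-python">from sentence_transformers import CrossEncoder  # pip install sentence-transformers

query = "What is an AI Agent?"
candidates = [
    "An AI agent perceives its environment and takes actions to achieve goals.",
    "Binance lists thousands of crypto trading pairs.",
    "Agents often combine an LLM with tools and memory.",
]

# A cross-encoder scores each (query, passage) pair jointly: slower than bi-encoders, but more accurate
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passage) for passage in candidates])

# Keep the highest-scoring passages first
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
</code></pre>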
<h3 id="heading-context-assembly-amp-augmentation"><strong>Context Assembly &amp; Augmentation</strong></h3>
<p>The retrieved chunks are now assembled into a coherent context. This step involves:</p>
<ul>
<li><p><strong>Deduplication</strong>: Remove redundant information from overlapping chunks</p>
</li>
<li><p><strong>Relevance ordering:</strong> Place the most relevant context first (LLMs tend to weight the beginning and end of the context more heavily than the middle)</p>
</li>
<li><p><strong>Token budget management:</strong> Ensure context fits within LLM limits (4K-128K tokens)</p>
</li>
<li><p><strong>Source attribution:</strong> Track which chunks came from which documents for citations</p>
</li>
</ul>
<p>The augmented prompt now contains: [System Instructions] + [Retrieved Context] + [User Query]. This structured approach ensures the LLM has all necessary information while maintaining clarity.</p>
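<p><em>A minimal sketch of this assembly step might look like the following; it assumes the Pinecone query response from earlier exposes a matches list whose entries carry a score and metadata (with the chunk text stored under a text key), which is an illustrative assumption about how the data was indexed.</em></p>
<pre><code class="lang-python">import tiktoken

def assemble_context(matches, max_tokens=3000):
    """Deduplicate, order by score, and trim retrieved chunks to a token budget."""
    enc = tiktoken.get_encoding("cl100k_base")
    seen, parts, used = set(), [], 0
    # Most relevant chunks first
    for m in sorted(matches, key=lambda m: m["score"], reverse=True):
        text = m["metadata"]["text"]
        if text in seen:                       # crude deduplication of overlapping chunks
            continue
        n_tokens = len(enc.encode(text))
        if used + n_tokens &gt; max_tokens:       # respect the LLM's context budget
            break
        seen.add(text)
        used += n_tokens
        parts.append(f"[Source: {m['metadata'].get('source', 'unknown')}]\n{text}")
    return "\n\n".join(parts)

assembled_context = assemble_context(results["matches"])
</code></pre>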
<h2 id="heading-pillar-3-answer-generation-amp-quality-assurance"><strong>Pillar 3: Answer Generation &amp; Quality Assurance</strong></h2>
<h3 id="heading-transforming-context-into-coherent-responses"><strong>Transforming Context into Coherent Responses</strong></h3>
<p>The final pillar is where retrieved knowledge transforms into human-readable answers. This is more nuanced than simply calling an LLM API.</p>
<h3 id="heading-llm-selection-amp-configuration"><strong>LLM Selection &amp; Configuration</strong></h3>
<p>Different LLMs excel at different tasks. Here's what I've learned from production deployments:</p>
<ul>
<li><p><strong>GPT-4 Turbo</strong>: Best for complex reasoning, 128K context window handles extensive documents</p>
</li>
<li><p><strong>Claude 3 Opus:</strong> Superior at following instructions, excellent for structured outputs</p>
</li>
<li><p><strong>Llama 3 70B:</strong> Cost-effective for high-volume, lower-complexity queries</p>
</li>
<li><p><strong>Mixtral 8x7B</strong>: Open-source alternative with strong multilingual capabilities</p>
</li>
</ul>
<h3 id="heading-prompt-engineering-for-rag"><strong>Prompt Engineering for RAG</strong></h3>
<p>The prompt structure is critical for grounded generation. A production-grade RAG prompt includes:</p>
<ul>
<li><p><strong>Role definition:</strong> 'You are an expert assistant with access to specific documents'</p>
</li>
<li><p><strong>Grounding instructions:</strong> 'Only use information from the provided context. If not found, explicitly state that.'</p>
</li>
<li><p><strong>Citation requirements:</strong> 'Include source references for each claim using [Source: document_name]'</p>
</li>
<li><p><strong>Output formatting:</strong> Specify tone, structure, and length expectations</p>
</li>
</ul>
<p><em>Example Prompt Engineering</em></p>
<pre><code class="lang-python">prompt = <span class="hljs-string">f"""
You are an expert assistant.
Context: <span class="hljs-subst">{assembled_context}</span>
Query: <span class="hljs-subst">{user_query}</span>
Answer based only on the context, citing sources.
"""</span>
# OpenAI Python SDK v1.x interface (reusing the client created earlier)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "system", "content": prompt}]
)
answer = response.choices[0].message.content
</code></pre>
<h3 id="heading-parameter-optimization"><strong>Parameter Optimization</strong></h3>
<p>Fine-tuning generation parameters dramatically affects output quality; a short example follows the list below:</p>
<ul>
<li><p><strong>Temperature:</strong> 0.0-0.3 for factual responses (higher = more creative)</p>
</li>
<li><p><strong>Max tokens:</strong> Conservative limits prevent rambling (500-1500 for most queries)</p>
</li>
<li><p><strong>Top-p sampling</strong>: 0.9-0.95 balances quality and diversity</p>
</li>
<li><p><strong>Stop sequences:</strong> Prevent generation beyond desired boundaries</p>
</li>
</ul>
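<p><em>Putting those knobs together, a hedged sketch (the values and the stop sequence are illustrative starting points, not universal defaults) looks like this:</em></p>
<pre><code class="lang-python"># Reuses `client` and `prompt` from the previous example
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "system", "content": prompt}],
    temperature=0.2,      # low temperature for factual, grounded answers
    max_tokens=1000,      # cap answer length to avoid rambling
    top_p=0.95,           # nucleus sampling
    stop=["\nUser:"],     # illustrative stop sequence to bound generation
)
print(response.choices[0].message.content)
</code></pre>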
<h3 id="heading-quality-assurance-amp-validation"><strong>Quality Assurance &amp; Validation</strong></h3>
<p>Generation is not the end. Production systems implement multiple validation layers (a small citation-check sketch follows the list):</p>
<ul>
<li><p><strong>Hallucination detection:</strong> Compare generated content against retrieved context</p>
</li>
<li><p><strong>Relevance scoring:</strong> Ensure answer addresses the original query</p>
</li>
<li><p><strong>Safety filtering:</strong> Screen for harmful, biased, or inappropriate content</p>
</li>
<li><p><strong>Citation validation:</strong> Verify all cited sources exist in retrieved context</p>
</li>
</ul>
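<p><em>As one concrete example of citation validation, a minimal sketch could check that every [Source: ...] tag in the answer refers to a document that was actually retrieved; the regex pattern and data shapes are assumptions tied to the prompt and query examples above.</em></p>
<pre><code class="lang-python">import re

def validate_citations(answer: str, matches) -&gt; list:
    """Return any cited sources that were not present in the retrieved context."""
    retrieved_sources = {m["metadata"].get("source", "unknown") for m in matches}
    cited_sources = set(re.findall(r"\[Source:\s*([^\]]+)\]", answer))
    return sorted(cited_sources - retrieved_sources)

unknown = validate_citations(answer, results["matches"])
if unknown:
    print(f"Warning: answer cites sources not in the retrieved context: {unknown}")
</code></pre>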
<h3 id="heading-real-world-impact-amp-performance-metrics"><strong>Real-World Impact &amp; Performance Metrics</strong></h3>
<p>The proof is in production. Well-architected RAG systems deliver measurable business value:</p>
<ul>
<li><p><strong>Customer support automation:</strong> 70% reduction in ticket resolution time</p>
</li>
<li><p><strong>Enterprise search:</strong> 4x faster information discovery compared to traditional search</p>
</li>
<li><p><strong>Knowledge management:</strong> 95% accuracy on domain-specific queries</p>
</li>
<li><p><strong>Developer productivity:</strong> 40% faster code documentation searches</p>
</li>
</ul>
<p>Cost considerations are equally important. While GPT-4 queries cost approximately $0.03-0.10 per 1K tokens (approximate figures; check OpenAI's official pricing page for current rates), a well-optimized RAG system with intelligent caching and retrieval can reduce per-query costs to under $0.01 while maintaining high accuracy.</p>
<h3 id="heading-key-takeaways-for-implementation"><strong>Key Takeaways for Implementation</strong></h3>
<p>If you're building a RAG system, focus on these critical success factors:</p>
<ul>
<li><p><strong>Chunk intelligently</strong>: Semantic chunking &gt; fixed-size splitting -- experiment with libraries like spaCy for NLP-based splits.</p>
</li>
<li><p><strong>Invest in embeddings</strong>: Quality embeddings are non-negotiable -- test multiple models on your data.</p>
</li>
<li><p><strong>Implement hybrid search</strong>: Combine semantic and keyword approaches -- e.g., via Qdrant's hybrid mode (see the fusion sketch after this list).</p>
</li>
<li><p><strong>Re-rank religiously</strong>: Initial retrieval is never perfect -- always apply a second pass.</p>
</li>
<li><p><strong>Prompt for grounding</strong>: Force the LLM to cite sources -- reduces hallucinations by 50%+.</p>
</li>
<li><p><strong>Monitor continuously</strong>: Track retrieval accuracy (e.g., NDCG), generation quality (e.g., ROUGE scores), and latency -- use Prometheus/Grafana, plus LangSmith/LangFuse for step-by-step trace and API cost/usage tracking.</p>
</li>
</ul>
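<p><em>On the hybrid-search point above, one simple, database-agnostic way to combine a keyword (BM25) ranking with a vector ranking is reciprocal rank fusion; the sketch below is illustrative and assumes you already have the two ranked ID lists.</em></p>
<pre><code class="lang-python">def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of chunk IDs into one, using RRF scoring."""
    scores = {}
    for ranked_ids in rankings:
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["c3", "c1", "c7"]      # IDs from a keyword/BM25 search (illustrative)
vector_ranking = ["c1", "c4", "c3"]    # IDs from the vector search (illustrative)
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))  # fused order
</code></pre>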
<p><strong>Remember</strong>: <em>RAG is not a silver bullet.</em> It's an architecture pattern that requires careful engineering, continuous optimization, and domain-specific tuning. But when done right, it transforms LLMs from general-purpose chatbots into specialized knowledge systems that deliver real business value.</p>
<h3 id="heading-disclaimer-and-recommendations-overview"><strong>Disclaimer and Recommendations Overview</strong></h3>
<p>The following recommendations for components and tools in an AI or machine learning pipeline are based purely on my personal experience and observations from working with various technologies. These suggestions are not exhaustive or universally optimal; they should be adapted to your specific needs, budget, scalability requirements, and use case. Every organization or individual brings their own expertise and preferred tool stack, and the AI landscape evolves rapidly, so there may be other solutions available in the market that offer superior capabilities, better integration, or cost efficiencies compared to the ones mentioned here. I strongly advise conducting thorough research, including evaluating alternatives, reading recent reviews, testing proofs-of-concept, and considering factors like data privacy, compliance, and vendor support before adopting any tool. Always consult with domain experts or perform a needs assessment to ensure the chosen solutions align with your goals.</p>
<ul>
<li><p>Embedding Model - OpenAI text-embedding-3-large, Cohere Embed v3</p>
</li>
<li><p>Vector Database - Pinecone (managed), Qdrant (self-hosted), ChromaDB (prototyping)</p>
</li>
<li><p>LLM - GPT-4 Turbo, Claude 3 Opus, Llama 3 70B, Mixtral 8x7B</p>
</li>
<li><p>Orchestration - LangChain, LlamaIndex, Haystack</p>
</li>
<li><p>Re-ranking - Cohere Rerank, Cross-encoder models</p>
</li>
<li><p>Monitoring - LangSmith, Weights &amp; Biases, Azure AI</p>
</li>
</ul>
<p><strong>The Path Forward</strong></p>
<p>As we move deeper into 2025, RAG architectures are evolving rapidly. We're seeing innovations in multi-modal RAG (incorporating images, audio, video), graph-based retrieval for complex knowledge graphs, and agentic RAG systems that can reason about which documents to retrieve.</p>
<p>The fundamentals, however, remain constant: high-quality embeddings, intelligent retrieval, and grounded generation. Master these three pillars, and you'll build RAG systems that don't just answer questions—they become trusted knowledge partners.</p>
<blockquote>
<p><em>Have you implemented RAG in your organization? I'd love to hear about your experiences, challenges, and lessons learned. Drop a comment below or reach out directly—let's push the boundaries of what's possible with retrieval-augmented generation.</em></p>
</blockquote>
<p>#AI #MachineLearning #RAG #LLM #GenerativeAI #VectorDatabases #NLP #ArtificialIntelligence #TechLeadership</p>
]]></content:encoded></item><item><title><![CDATA[From Demo to Production: The Enterprise RAG Roadmap]]></title><description><![CDATA[Over the past months, drawing from my knowledge and practical experience of designing and deploying internal RAG pipelines, it’s clear that moving from a compelling demo to a reliable, governed production system is a significant leap. The difference ...]]></description><link>https://ragstack.in/from-demo-to-production-the-enterprise-rag-roadmap</link><guid isPermaLink="true">https://ragstack.in/from-demo-to-production-the-enterprise-rag-roadmap</guid><category><![CDATA[llm]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Skugan V]]></dc:creator><pubDate>Wed, 11 Feb 2026 06:14:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770790302491/93a88b8f-5c95-4a21-88fc-a492d09a9e17.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past months, drawing from my knowledge and practical experience of designing and deploying internal RAG pipelines, it’s clear that moving from a compelling demo to a reliable, governed production system is a significant leap. The difference is rarely the model itself. It's the "<strong><em>Boring</em></strong>" but critical engineering layers: data security, latency optimization, retrieval accuracy, cost control, traceability, and governance.</p>
<p>This series will go far beyond the basics. We'll dive deep into advanced architectures like <strong>Agentic RAG</strong>, <strong>hybrid search with GraphDB integrations</strong>, <strong>self-reflecting and self-correcting agents</strong>, <strong>multi-hop reasoning</strong>, <strong>evaluation frameworks</strong>, and production patterns for scalability and observability.</p>
<p>But every strong building needs a solid foundation. So, let's begin with the fundamentals.</p>
<h3 id="heading-why-rag-has-become-the-de-facto-standard-for-enterprise-genai"><strong>Why RAG Has Become the De Facto Standard for Enterprise GenAI</strong></h3>
<p>We're well past the 2023 hype of "Look what ChatGPT can do!" Serious enterprises are now asking tougher, more practical questions:</p>
<ul>
<li><p>How do we make GenAI <strong>accurate</strong> on our proprietary data?</p>
</li>
<li><p>How do we make it <strong>safe</strong> and compliant?</p>
</li>
<li><p>How do we make it <strong>maintainable</strong> without constant retraining?</p>
</li>
</ul>
<p>The core challenge is <strong>trust</strong>. Large Language Models are extraordinary pattern matchers, but they are also confident hallucinators. Without grounding, they will happily invent Q4 revenue numbers, misinterpret internal policy documents, or confidently provide outdated compliance guidance.</p>
<p>This isn't a theoretical risk; it's a daily reality in enterprises attempting to roll out GenAI at scale.</p>
<p><strong>RAG solves this by anchoring every response in verified, retrieved context</strong></p>
<p>Think of a vanilla LLM as a brilliant consultant taking a closed-book exam: they can reason impressively from what they've memorized during training, but they have no access to your latest information.</p>
<p>A RAG system is that same consultant with secure, real-time access to your company's private library, your internal wikis, contract databases, CRM records, financial reports, and compliance docs. The model is <strong>forced</strong> to cite and reason only over the retrieved documents before generating a response.</p>
<h3 id="heading-the-strategic-advantages-of-rag"><strong>The Strategic Advantages of RAG</strong></h3>
<p>✅ <strong>Grounded Truth &amp; Reduced Hallucinations</strong>: Responses are constrained to retrieved evidence. Studies (e.g., from Stanford and various enterprise benchmarks in 2024–2025) consistently show RAG reduces factual errors by 60–90% compared to vanilla LLMs on domain-specific tasks.</p>
<p>✅ <strong>Data Sovereignty &amp; Governance</strong>: Your proprietary data never leaves your environment or gets used to train public models. You maintain full control and audit trails, essential for GDPR, HIPAA, SOC 2, and other regulations.</p>
<p>✅ <strong>Agility &amp; Low Maintenance</strong>: Unlike fine-tuning (which requires expensive retraining whenever data changes), RAG allows instant updates. Add a new policy document or quarterly report to your knowledge base, re-index, and the system immediately reflects the latest truth.</p>
<p>✅ <strong>Cost Efficiency</strong>: Fine-tuning large models is expensive and time-consuming. RAG leverages pre-trained models while keeping operational costs predictable, consisting mostly of vector DB storage and retrieval queries.</p>
<p>✅ <strong>Scalability to Institutional Knowledge</strong>: The true power emerges when you connect AI not just to documents, but to structured data (SQL + vector hybrid), knowledge graphs, and real-time APIs. This is where we move from simple Q&amp;A to sophisticated reasoning agents.</p>
<p>The future of work isn't about replacing humans with generic chatbots. It's about <strong>augmenting</strong> experts with AI that deeply understands your organization's unique knowledge, processes, and data.</p>
<p>In the coming articles, I'll break down the full architecture stack: chunking strategies, embedding models, hybrid retrieval, reranking, evaluation (RAGAS, ARES, etc.), agentic patterns, guardrails, and deployment blueprints.</p>
<p>If you're building or planning Enterprise GenAI systems, follow along. I'll be sharing battle-tested patterns, code snippets, pitfalls to avoid, and practical implementation details purely based on my knowledge and implementation techniques that I followed.</p>
<p>What challenges are you facing with RAG or Enterprise GenAI right now? Drop a comment; I'd love to hear about them and may cover them in the series. 🚀</p>
]]></content:encoded></item><item><title><![CDATA[How To Create API Key for Google Gemini]]></title><description><![CDATA[API Key Creation Steps

Go to Google AI Studio: Navigate to aistudio.google.com and sign in with your Google account.

Accept Terms: If it's your first time, you'll be prompted to review and accept the Google AI and Gemini API terms of service.

Find...]]></description><link>https://ragstack.in/how-to-create-api-key-for-google-gemini</link><guid isPermaLink="true">https://ragstack.in/how-to-create-api-key-for-google-gemini</guid><category><![CDATA[gemini apikey]]></category><category><![CDATA[google gemini]]></category><dc:creator><![CDATA[Skugan V]]></dc:creator><pubDate>Tue, 02 Dec 2025 16:50:26 GMT</pubDate><content:encoded><![CDATA[<p>API Key Creation Steps</p>
<ol>
<li><p><strong>Go to Google AI Studio</strong>: Navigate to <a target="_blank" href="http://aistudio.google.com">aistudio.google.com</a> and sign in with your Google account.</p>
</li>
<li><p><strong>Accept Terms</strong>: If it's your first time, you'll be prompted to review and accept the Google AI and Gemini API terms of service.</p>
</li>
<li><p><strong>Find the API Key Section</strong>: Look for the "Get API key" button or link, often located in the navigation menu or on the main dashboard.</p>
</li>
<li><p><strong>Create the Key</strong>: Click "Create API key". You will typically have an option to create the key in a new project (recommended for quick starts and beginners) or an existing Google Cloud project.</p>
</li>
<li><p><strong>Copy and Secure the Key</strong>:</p>
<ul>
<li><p>Once generated, the API key string will be displayed. Copy this key immediately and store it in a secure location.</p>
</li>
<li><p>For security, it is best practice to use the key as an environment variable in your code rather than hardcoding it directly (see the short usage sketch after these steps). You can also manage and view your existing keys, check usage, and set restrictions (like limiting the key to certain APIs) within the Google AI Studio interface.</p>
</li>
</ul>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764694186925/a78c8387-6ec1-40b9-ad20-c42f1e213862.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
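<p><em>Once you have the key, a minimal sketch of using it from an environment variable with the google-generativeai Python package might look like this; the environment variable name and the model name are assumptions you can adapt.</em></p>
<pre><code class="lang-python">import os
import google.generativeai as genai  # pip install google-generativeai

# Read the key from an environment variable instead of hardcoding it
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Say hello in one short sentence.")
print(response.text)
</code></pre>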
]]></content:encoded></item><item><title><![CDATA[Python FastMCP()]]></title><description><![CDATA[What is Python FastMCP ?
Python FastMCP is a high-level, Pythonic framework designed for building MCP (Model Context Protocol) servers and clients easily and efficiently. MCP is a standardized protocol that allows servers to expose data and functiona...]]></description><link>https://ragstack.in/python-fastmcp</link><guid isPermaLink="true">https://ragstack.in/python-fastmcp</guid><category><![CDATA[fastmcp]]></category><dc:creator><![CDATA[Skugan V]]></dc:creator><pubDate>Tue, 02 Dec 2025 15:19:04 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-what-is-python-fastmcp">What is Python FastMCP ?</h2>
<p>Python FastMCP is a high-level, Pythonic framework designed for building MCP (Model Context Protocol) servers and clients easily and efficiently. MCP is a standardized protocol that allows servers to expose data and functionality specifically tailored for interactions with large language models (LLMs). FastMCP handles the complex details of the MCP protocol and server management, letting developers focus on creating tools, resources, and prompts with minimal boilerplate code.</p>
<p>Key features of FastMCP include:</p>
<ul>
<li><p>Creating MCP servers that expose data ("Resources") and functionality ("Tools") to LLMs.</p>
</li>
<li><p>Defining reusable interaction patterns through "Prompts."</p>
</li>
<li><p>Proxying and composing servers for complex applications.</p>
</li>
<li><p>Generating servers from OpenAPI specs or FastAPI objects.</p>
</li>
<li><p>Enterprise authentication options (Google, GitHub, Azure, Auth0, and more).</p>
</li>
<li><p>Deployment tools, client libraries, and testing utilities.</p>
</li>
<li><p>High-level, Pythonic API designed to accelerate development.</p>
</li>
</ul>
<p>Typical usage involves decorating Python functions with @mcp.tool to expose them as callable tools in the MCP server environment, making it intuitive for Python developers. FastMCP 2.0 is actively maintained and considered the standard framework for developing production-grade MCP applications.</p>
<p><em>Example:</em></p>
<pre><code class="lang-python">from fastmcp import FastMCP

# Create an MCP instance with a name
mcp = FastMCP("Demo")

# Define a tool that adds two numbers
@mcp.tool
def add(a: int, b: int) -&gt; int:
    """Add two numbers"""
    return a + b

# Run the MCP server
if __name__ == "__main__":
    mcp.run()
</code></pre>
<p>This framework is ideal for building servers that integrate with AI applications by providing standardized, secure, and efficient endpoints designed specifically for LLMs.</p>
<h2 id="heading-will-i-also-be-creating-mcp-client-with-fastmcp">Can I Also Create an MCP Client with FastMCP?</h2>
<p>Yes, with FastMCP, you can also create MCP clients in addition to servers. FastMCP provides a built-in Client class that lets you interact programmatically with any MCP server. This client handles all the connection management and MCP protocol details automatically, allowing deterministic and controlled operations such as calling tools, listing available resources, and sending requests. The client supports various transport mechanisms, including in-memory servers (useful for testing), HTTP servers, and local Python scripts. It is designed for explicit function calls rather than autonomous agent behavior, making it ideal for testing MCP servers and building reliable applications. Example usage of the FastMCP client:</p>
<pre><code class="lang-python">import asyncio
from fastmcp import Client

async def main():
    # Connect to the MCP server
    async with Client("https://example.com/mcp") as client:
        # List available tools
        tools = await client.list_tools()
        print(f"Available tools: {tools}")

        # Call the "add" tool with arguments
        result = await client.call_tool("add", {"a": 5, "b": 3})

        # The result comes back as structured content
        print(f"Result: {result.content[0].text}")

# Run the async main function
asyncio.run(main())
</code></pre>
<h3 id="heading-how-it-works">How it works</h3>
<ul>
<li><p><code>Client("</code><a target="_blank" href="https://example.com/mcp"><code>https://example.com/mcp</code></a><code>")</code> → Connects to your MCP server (the one we created above; provide the MCP server URL, including the port the server is listening on).</p>
</li>
<li><p><code>list_tools()</code> → Queries the server for all registered tools (like <code>add</code>).</p>
</li>
<li><p><code>call_tool("add", {"a": 5, "b": 3})</code> → Calls the <code>add</code> tool with arguments <code>a=5</code> and <code>b=3</code>.</p>
</li>
<li><p><code>result.content[0].text</code> → Extracts the returned text from the tool’s response.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[All About MCP Resource]]></title><description><![CDATA[What is a MCP Resource? How it is related to MCP Tooling?
An MCP Resource is a read-only, addressable content entity exposed by the MCP server. Resources provide structured, contextual data that MCP clients can retrieve and deliver to LLMs for reason...]]></description><link>https://ragstack.in/all-about-mcp-resource</link><guid isPermaLink="true">https://ragstack.in/all-about-mcp-resource</guid><category><![CDATA[mcp]]></category><category><![CDATA[Model Context Protocol]]></category><category><![CDATA[Model Context Protocol (MCP)]]></category><category><![CDATA[mcp-resource]]></category><dc:creator><![CDATA[Skugan V]]></dc:creator><pubDate>Tue, 02 Dec 2025 14:41:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764687017667/c77ddb90-a472-4640-b8f4-ed9ac803a668.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-what-is-a-mcp-resource-how-it-is-related-to-mcp-tooling">What is an MCP Resource? How is it Related to MCP Tooling?</h2>
<p>An MCP Resource is a read-only, addressable content entity exposed by the MCP server. Resources provide structured, contextual data that MCP clients can retrieve and deliver to LLMs for reasoning or enhanced understanding. They typically include items like logs, configuration data, files, real-time statistics, or any other data that can be represented as text, JSON, or binary blobs (e.g., PDFs, images).</p>
<p>Key points about MCP Resources:</p>
<ul>
<li><p>They are read-only and deterministic, meaning no side effects or changes occur when accessed.</p>
</li>
<li><p>Resources are identified via unique URIs (e.g., note://, config://, file://).</p>
</li>
<li><p>Access is done via standard MCP requests like resources/list to discover and resources/read to fetch content.</p>
</li>
<li><p>They allow LLMs to have contextual information without executing commands or changing state.</p>
</li>
</ul>
<p><strong>How Resources relate to Tools:</strong></p>
<ul>
<li><p>Tools are actionable and executable commands exposed by MCP servers. They perform operations such as calculations, database writes, or API calls that can change state or provide dynamic outputs.</p>
</li>
<li><p>Resources are passive data sources that provide context to the LLM but do not execute or trigger actions.</p>
</li>
<li><p>Typically, an MCP server exposes both resources (context/data) and tools (actions/operations).</p>
</li>
<li><p>Resources can be used by tools to provide necessary context or input data for execution.</p>
</li>
<li><p>From the LLM’s perspective, tools enable it to <em>do</em> things while resources help it <em>know</em> things.</p>
</li>
</ul>
<p><strong>Example:</strong></p>
<ul>
<li><p>A resource could be a log file or pricing data accessible to the LLM's context.</p>
</li>
<li><p>A tool could be a function calculating the current price or executing a trade.</p>
</li>
</ul>
<p>Together, resources and tools empower LLMs with both rich context and actionable capabilities via the MCP protocol.</p>
<p>This clear distinction lets the MCP server expose data and functions cleanly and predictably, letting clients and LLMs consume context and invoke actions seamlessly.</p>
<h2 id="heading-what-mcp-resources-offer">What MCP Resources Offer</h2>
<p>In the context of MCP and FastMCP, resources are a core primitive that provides read-only access to data or content. Unlike tools (which are invocable functions for performing actions, like API calls or computations), resources expose static or dynamically generated data that clients can directly read and use as context in conversations or reasoning. This could include configurations, lists, files, database queries, or any structured information.</p>
<p>Resources help by:</p>
<ul>
<li><p>Providing scoped, persistent context to reduce the need for repeated tool calls or large prompts.</p>
</li>
<li><p>Allowing efficient data access without the overhead of function invocation (e.g., no need for the model to "decide" to call something).</p>
</li>
<li><p>Supporting metadata like descriptions, tags, and annotations for better discoverability.</p>
</li>
<li><p>Enabling dynamic generation (e.g., fetching fresh data on request) while remaining read-only.</p>
</li>
<li><p>Reducing token usage and latency in LLM interactions by injecting data directly into the context.</p>
</li>
</ul>
<p>They are particularly useful in scenarios like your code, where tools handle actions (e.g., getting specific prices), but resources can offer supplementary data (e.g., a list of available symbols) to guide or inform those actions.</p>
<p><strong>Communication Flow for Resources</strong></p>
<p>The flow involves a client-server interaction over the MCP protocol:</p>
<ol>
<li><p><strong>Client Request</strong>: An MCP client (e.g., an LLM application) sends a resources/read request to the server, specifying the resource's unique URI (e.g., "data://symbols"). This can happen automatically if the LLM references the URI in its prompt or reasoning, or explicitly via the client's API.</p>
</li>
<li><p><strong>Server Processing</strong>:</p>
<ul>
<li><p>The server (your FastMCP instance) matches the URI to a registered resource.</p>
</li>
<li><p>If the resource is defined as a function (dynamic), it executes the function lazily (only on request). Parameters can be passed if the URI uses templates (e.g., "data://symbols/{category}").</p>
</li>
<li><p>The function's return value is converted to MCP-compatible content: strings become text/plain, dicts/lists become application/json (auto-serialized), bytes become base64-encoded blobs.</p>
</li>
<li><p>Metadata (e.g., name, description, mime_type) is included in the response for client use.</p>
</li>
</ul>
</li>
<li><p><strong>Server Response</strong>: The server returns the content directly to the client. If the resource list changes (e.g., you add/enable/disable one during runtime), the server may send a notifications/resources/list_changed notification to active clients.</p>
</li>
<li><p><strong>Client Usage</strong>: The client (e.g., LLM) receives the data and incorporates it into its context. For example, an LLM could read a resource to get a list of valid symbols before calling your get_price tool, improving accuracy without extra steps.</p>
</li>
</ol>
<p>This flow is stateless and read-only by default (via annotations like readOnlyHint: True), ensuring safety. Errors (e.g., from API failures) are handled by raising exceptions, which translate to MCP error responses.</p>
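<p><em>From the client side, this flow can be exercised with FastMCP's built-in Client; the sketch below assumes the server from this article is reachable at the given local URL and that the resource returns text content (both are illustrative assumptions).</em></p>
<pre><code class="lang-python">import asyncio
from fastmcp import Client

async def main():
    # Connect to the MCP server that exposes the resource
    async with Client("http://localhost:8000/mcp") as client:
        # Discover the resources the server exposes
        resources = await client.list_resources()
        print([r.uri for r in resources])

        # Read a specific resource by its URI
        contents = await client.read_resource("data://available-symbols")
        print(contents[0].text)  # text/JSON payload returned by the resource function

asyncio.run(main())
</code></pre>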
<p><strong>Pros of Using Resources vs. Without Resources</strong></p>
<p><strong>With Resources</strong>:</p>
<ul>
<li><p><strong>Efficiency</strong>: Direct data access avoids the multi-step process of tool calls (e.g., model decides to call, invokes, waits for result). This reduces latency, token costs, and context overload.</p>
</li>
<li><p><strong>Better Context Management</strong>: Resources can be referenced in prompts or auto-injected, providing structured data (e.g., JSON lists) that helps the model reason more effectively without hallucinating.</p>
</li>
<li><p><strong>Flexibility</strong>: Supports dynamic data, templates for parameterization, async execution, and context access (e.g., via ctx: Context parameter for request-specific info).</p>
</li>
<li><p><strong>Scalability</strong>: Ideal for read-heavy scenarios, like exposing configs or lists that don't change often. Notifications keep clients updated on changes.</p>
</li>
<li><p><strong>Pros in Your Code Context</strong>: Complements your tools by providing upfront data (e.g., valid symbols), reducing invalid tool calls (e.g., bad symbols).</p>
</li>
</ul>
<p><strong>Without Resources (Relying Only on Tools)</strong>:</p>
<ul>
<li><p><strong>Pros</strong>: Simpler if all interactions are action-oriented—everything is a callable function, so no need to distinguish read vs. write. Tools can handle both data retrieval and mutations in one paradigm.</p>
</li>
<li><p><strong>Cons</strong>: Overkill for passive data access; each request becomes a full tool invocation, increasing steps, potential errors, and costs. For example, fetching a static list via a tool requires the model to explicitly call it every time, bloating prompts. In your code, without resources, users might misuse tools (e.g., call get_price with invalid symbols), leading to more failures.</p>
</li>
</ul>
<p>In summary, resources shine for "give me data" scenarios, while tools are for "do something." Using both (as in MCP best practices) creates a balanced server.</p>
<p>Consider the following code snippet, an example of the MCP resource decorator:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764685600251/ba8bf36c-cda2-4803-918b-60bd7b1c3d2c.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p><em>“Confusion: Can we use tools directly instead of resources? More detailed information is provided below.”</em></p>
</blockquote>
<p>We know that tools can perform the required actions, and in the above example we could fetch the trading symbols by calling <a target="_blank" href="http://api.binance.com/api/v3/exchangeInfo"><strong>api.binance.com/api/v3/exchangeInfo</strong></a> directly from a tool. So why use a resource?</p>
<h2 id="heading-why-use-an-mcp-resource-instead-of-directly-calling-the-mcp-tools">Why Use an MCP Resource Instead of Directly Calling the MCP Tools?</h2>
<p>Yes, you <em>could</em> expose get_available_symbols() as an MCP tool (e.g., @mcp.tool()), but it’s less ideal for this use case:</p>
<ul>
<li><p><strong>Tool Overhead</strong>: Tools are for actions and require the client (e.g., LLM) to decide to invoke them, pass parameters (even if none are needed here), and wait for the result. This adds unnecessary steps for simple data access.</p>
</li>
<li><p><strong>Resource Simplicity</strong>: Resources are read-only and can be directly referenced in prompts or auto-injected into the LLM’s context, making them more efficient for data like a symbol list.</p>
</li>
<li><p><strong>Client Expectation</strong>: In MCP, clients expect resources for data and tools for actions. Exposing get_available_symbols as a tool might confuse clients expecting a resource for a list.</p>
</li>
</ul>
<p><strong>Example Scenario to Illustrate the Difference</strong></p>
<p>Let’s say an LLM wants to get the price of a valid Binance symbol. Here’s how it works with and without the resource:</p>
<p><strong>With Resource (data://available-symbols):</strong></p>
<ol>
<li><p>The LLM’s prompt says: “Use data://available-symbols to pick a valid symbol, then get its price.”</p>
</li>
<li><p>The MCP server injects the symbol list (e.g., ["BTCUSDT", "ETHUSDT", ...]) into the LLM’s context automatically when data://available-symbols is referenced.</p>
</li>
<li><p>The LLM sees BTCUSDT is valid and calls the get_price("BTCUSDT") tool.</p>
</li>
<li><p><strong>Benefits</strong>: The LLM gets the symbol list without extra steps, avoids invalid inputs (e.g., get_price("INVALID")), and saves tokens by not invoking a tool for the list.</p>
</li>
</ol>
<p><strong>With Only a Function or Tool:</strong></p>
<ol>
<li><p>You’d need to expose get_available_symbols() as a tool (e.g., @mcp.tool()).</p>
</li>
<li><p>The LLM would need to:</p>
<ul>
<li><p>Decide to call the get_available_symbols tool.</p>
</li>
<li><p>Wait for the server to run it and return the list.</p>
</li>
<li><p>Process the list and then call get_price("BTCUSDT").</p>
</li>
</ul>
</li>
<li><p><strong>Drawbacks</strong>: Extra steps (tool invocation), more tokens used in the LLM’s conversation, and higher chance of errors if the LLM doesn’t call the tool first or misinterprets the list.</p>
</li>
</ol>
<p><strong>Direct Function Call:</strong></p>
<ol>
<li><p>If you call get_available_symbols() locally in a script, it works fine for you, but:</p>
<ul>
<li><p>The LLM (running remotely) can’t access it unless you manually send the data to the LLM’s environment.</p>
</li>
<li><p>You lose the benefits of the MCP server, which is designed to handle remote requests and integrate with clients.</p>
</li>
</ul>
</li>
<li><p><strong>Drawbacks</strong>: Not scalable for remote clients, no discoverability, and no integration with MCP’s resource system.</p>
</li>
</ol>
<p><strong>When Would You Call the Function Directly?</strong></p>
<p>You’d call get_available_symbols() directly if:</p>
<ul>
<li><p>You’re building a local script, not a server, and don’t need remote access.</p>
</li>
<li><p>You’re testing or debugging the function locally.</p>
</li>
<li><p>You don’t need MCP’s client-server architecture (e.g., no LLM or external clients).</p>
</li>
</ul>
<p>However, since your code uses FastMCP and runs <a target="_blank" href="http://mcp.run">mcp.run</a>(), it’s designed as a server for remote clients, so the resource approach is more appropriate.</p>
<p><strong>Simple Summary</strong></p>
<ul>
<li><p><strong>Why Use the Resource?</strong> The data://available-symbols resource makes the symbol list accessible to remote clients (like LLMs) over the MCP protocol. It’s efficient (no tool invocation), discoverable (via metadata), and fits MCP’s design for read-only data.</p>
</li>
<li><p><strong>Why Not Just Call the Function?</strong> Direct function calls work locally but don’t work for remote clients. The MCP resource lets clients like LLMs access the data seamlessly, reducing errors and improving efficiency in a client-server setup.</p>
</li>
<li><p><strong>Practical Benefit</strong>: The resource ensures an LLM can check valid symbols (e.g., BTCUSDT) before calling tools like get_price, making interactions faster and more reliable.</p>
</li>
</ul>
<p>I’m also pasting the full working code with the MCP resource Implementation</p>
<p><strong>Sample Code 1: With MCP Resource Implementation - Function acting as a local resource</strong></p>
<pre><code class="lang-python"># Official Python MCP implementation
# Abstracts away many complexities from the MCP-based protocol
from mcp.server.fastmcp import FastMCP
import requests
from typing import Any

mcp = FastMCP("Binance MCP")


@mcp.tool()
def get_price(symbol: str) -&gt; Any:
    """
    Get the current price of a crypto asset from Binance

    Args:
        symbol (str): The symbol of the crypto asset to get the price of

    Returns:
        Any: The current price of the crypto asset
    """
    symbol = get_symbol_from_name(symbol)
    url = f"https://api.binance.com/api/v3/ticker/price?symbol={symbol}"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()


@mcp.tool()
def get_price_change(symbol: str) -&gt; Any:
    """
    Get the last 24 hours price change of a crypto asset from Binance

    Args:
        symbol (str): The symbol of the crypto asset

    Returns:
        Any: The 24-hour price change data
    """
    symbol = get_symbol_from_name(symbol)
    url = f"https://data-api.binance.vision/api/v3/ticker/24hr?symbol={symbol}"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()


# Helper function (unchanged)
def get_symbol_from_name(name: str) -&gt; str:
    if name.lower() in ["bitcoin", "btc"]:
        return "BTCUSDT"
    elif name.lower() in ["ethereum", "eth"]:
        return "ETHUSDT"
    else:
        return name.upper()


# Resource section: Removed 'annotations' to fix TypeError
@mcp.resource(
    uri="data://available-symbols",
    name="Available Trading Symbols",
    description="Provides a list of active trading symbols available on Binance.",
    mime_type="application/json"
)
def get_available_symbols() -&gt; list:
    """
    Fetches and returns a list of active trading symbols from Binance.

    Returns:
        list: A list of strings representing active symbols (e.g., ['BTCUSDT', 'ETHUSDT']).
    """
    url = "https://api.binance.com/api/v3/exchangeInfo"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    # Filter for active symbols (status == 'TRADING')
    symbols = [s['symbol'] for s in data['symbols'] if s['status'] == 'TRADING']
    return symbols


if __name__ == "__main__":
    mcp.run()
</code></pre>
<p><strong>Sample Code 2: With MCP Resource Implementation (Local CSV File acting as the resource)</strong></p>
<pre><code class="lang-python"># Official Python MCP implementation
# Abstracts away many complexities from the MCP-based protocol
from mcp.server.fastmcp import FastMCP
import requests
from typing import Any
import csv

mcp = FastMCP("Binance MCP")


@mcp.tool()
def get_price(symbol: str) -&gt; Any:
    """
    Get the current price of a crypto asset from Binance

    Args:
        symbol (str): The symbol of the crypto asset to get the price of (e.g., BTCUSDT)

    Returns:
        Any: The current price of the crypto asset
    """
    # Note: MCP clients should use the symbol-mapping resource to validate/convert the symbol
    url = f"https://api.binance.com/api/v3/ticker/price?symbol={symbol}"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()


@mcp.tool()
def get_price_change(symbol: str) -&gt; Any:
    """
    Get the last 24 hours price change of a crypto asset from Binance

    Args:
        symbol (str): The symbol of the crypto asset (e.g., BTCUSDT)

    Returns:
        Any: The 24-hour price change data
    """
    # Note: MCP clients should use the symbol-mapping resource to validate/convert the symbol
    url = f"https://data-api.binance.vision/api/v3/ticker/24hr?symbol={symbol}"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()


# New resource: Reads symbols from a CSV file
@mcp.resource(
    uri="data://crypto-symbols",
    name="Crypto Symbols from CSV",
    description="Provides a list of crypto trading symbols from a local CSV file.",
    mime_type="application/json"
)
def get_crypto_symbols() -&gt; list:
    """
    Reads a list of crypto trading symbols from a CSV file.

    Returns:
        list: A list of strings representing trading symbols (e.g., ['BTCUSDT', 'ETHUSDT']).
    """
    file_path = r"C:\Users\Skugan\Desktop\github-cursor\mcp-course\crypto.csv"
    symbols = []
    try:
        with open(file_path, mode='r', encoding='utf-8') as file:
            reader = csv.DictReader(file)
            for row in reader:
                if 'symbol' in row:
                    symbols.append(row['symbol'])
    except FileNotFoundError:
        raise Exception(f"CSV file not found at {file_path}")
    except Exception as e:
        raise Exception(f"Error reading CSV file: {str(e)}")
    return symbols


# New resource: Provides symbol mapping
@mcp.resource(
    uri="data://symbol-mapping",
    name="Symbol Mapping",
    description="Provides a mapping of crypto names/aliases to Binance trading symbols.",
    mime_type="application/json"
)
def get_symbol_mapping() -&gt; dict:
    """
    Returns a mapping of crypto names/aliases to Binance trading symbols.

    Returns:
        dict: A dictionary mapping names to symbols (e.g., {'bitcoin': 'BTCUSDT', 'eth': 'ETHUSDT'}).
    """
    return {
        "bitcoin": "BTCUSDT",
        "btc": "BTCUSDT",
        "ethereum": "ETHUSDT",
        "eth": "ETHUSDT"
    }


if __name__ == "__main__":
    mcp.run()
</code></pre>
]]></content:encoded></item></channel></rss>