GraphRAG vs. Vector RAG: When Relationships Beat Similarity

Where vector RAG breaks

Vector RAG works by embedding your documents and your query into the same space, then returning the chunks most similar to the query. This is effective when the answer lives in a single chunk and similarity is the right signal.
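As a minimal sketch of that retrieval step, the snippet below ranks chunks by cosine similarity to the query. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the chunk texts are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical corpus: the answer lives in exactly one chunk.
chunks = [
    "refunds are processed within five business days",
    "the api rate limit is 100 requests per minute",
    "invoices are emailed on the first of each month",
]
result = top_k("what is the api rate limit", chunks, k=1)
```

Each chunk is scored against the query in isolation. Nothing in the ranking looks at how chunks relate to each other.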

It breaks on multi-hop questions: "Which customers used both feature A and feature B, and what did they have in common?" No single chunk contains that answer. The relevant information is distributed across multiple documents, and the relationship between them is load-bearing. Cosine similarity to the query does not surface the relationship.

Graph-based retrieval is designed for this failure mode.

Two approaches to graph retrieval

Microsoft GraphRAG

Microsoft's GraphRAG (microsoft.github.io/graphrag) builds a knowledge graph from your corpus at index time:

1. Entity extraction. An LLM extracts entities (people, places, organizations, concepts) and the relationships between them from every document.

2. Community clustering. Graph community detection (the Leiden algorithm) groups related entities into communities at multiple levels of granularity.

3. Summarization. An LLM generates a summary of each community.
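The shape of that indexing pipeline can be sketched in a few lines. This is not GraphRAG's code: the triples are hypothetical output of the LLM extraction step, connected components stand in for Leiden community detection, and `summarize` is a placeholder for the LLM summarization call.

```python
from collections import defaultdict

# Hypothetical output of the LLM extraction step: (entity, relation, entity).
triples = [
    ("Acme", "acquired", "Widgetly"),
    ("Widgetly", "founded_by", "Dana Lee"),
    ("Borealis", "partners_with", "Northwind"),
]


def build_graph(triples):
    """Build an undirected entity graph from extracted triples."""
    graph = defaultdict(set)
    for src, _relation, dst in triples:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


def communities(graph):
    """Group entities by connected component (a crude stand-in for Leiden)."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


def summarize(group):
    """Placeholder for the per-community LLM summarization call."""
    return "Community of " + ", ".join(group)


graph = build_graph(triples)
groups = communities(graph)
summaries = [summarize(g) for g in groups]
```

The expensive parts in a real index are the two LLM steps, which this sketch replaces with fixed data and a string join.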

At query time, GraphRAG offers three search modes:

Global search. Queries that require synthesizing across the whole corpus ("What are the main themes in this document set?"). Searches community summaries.

Local search. Queries about specific entities and their relationships. Traverses the knowledge graph from matched entities.

DRIFT search. Combines global and local: starts with a global community query, then drills into local entity relationships.
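To make the local mode concrete, here is a simplified sketch of "traverse from matched entities": find entities named in the query, then collect their neighborhood with a bounded breadth-first search. The graph, the substring matching, and the `max_hops` parameter are assumptions for illustration; GraphRAG's actual local search is more involved.

```python
from collections import deque

# Hypothetical entity graph: entity -> directly related entities.
graph = {
    "Acme": {"Widgetly"},
    "Widgetly": {"Acme", "Dana Lee"},
    "Dana Lee": {"Widgetly"},
    "Borealis": {"Northwind"},
    "Northwind": {"Borealis"},
}


def match_entities(query, graph):
    """Naive entity matching: entity name appears in the query text."""
    q = query.lower()
    return [entity for entity in graph if entity.lower() in q]


def local_neighborhood(seeds, graph, max_hops=1):
    """Breadth-first traversal from the seeds, returning hop distance per entity."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


seeds = match_entities("Who founded Widgetly?", graph)
context = local_neighborhood(seeds, graph, max_hops=1)
```

The entities in `context` (and the text attached to them) would then be passed to the LLM. The unrelated Borealis/Northwind cluster never enters the context.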

In Microsoft's own benchmarks, GraphRAG outperforms vector RAG on global comprehension questions. The cost is index construction, which requires many LLM calls (entity extraction runs on every chunk in your corpus). For a 10,000-document corpus, building the index can cost roughly $10-100 or more in LLM API fees, depending on the model and chunk size.
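A back-of-the-envelope estimate shows where a figure in that range can come from. Every number below (chunks per document, tokens per chunk, per-token prices) is a hypothetical placeholder, not a measurement or a quoted price; substitute your own.

```python
def index_cost_usd(
    num_docs: int,
    chunks_per_doc: int,
    input_tokens_per_chunk: int,
    output_tokens_per_chunk: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the LLM cost of one extraction call per chunk."""
    chunks = num_docs * chunks_per_doc
    input_cost = chunks * input_tokens_per_chunk / 1_000_000 * usd_per_million_input
    output_cost = chunks * output_tokens_per_chunk / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Assumed workload: 10,000 docs, 5 chunks each, 1,500 prompt tokens and
# 300 completion tokens per extraction call, at $0.50 / $1.50 per million tokens.
estimate = index_cost_usd(10_000, 5, 1_500, 300, 0.50, 1.50)  # 60.0
```

The estimate scales linearly in every factor, so a pricier model or smaller chunks can move the total by an order of magnitude. It also ignores the community summarization calls, which add to the bill.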

Mixture-of-PageRanks (MixPR)

A December 2024 paper (arXiv:2412.06078) proposes a lighter alternative: Mixture-of-PageRanks (MixPR).

Instead of building a full knowledge graph at index time, MixPR constructs a sparse graph at query time using document-level co-citation and entity overlap. It then runs PageRank variants on this sparse graph to score nodes by importance relative to the query.
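The sketch below shows the general idea of query-relative PageRank scoring using plain personalized PageRank. It is not the paper's algorithm: the mixture of PageRank variants is omitted, and the graph edges and seed weights are hypothetical inputs that a real system would derive from the query and corpus.

```python
def personalized_pagerank(adj, seed_weights, damping=0.85, iterations=50):
    """Power-iteration PageRank that restarts at the query-matched seed nodes."""
    nodes = list(adj)
    total = sum(seed_weights.values())
    restart = {n: seed_weights.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbors = adj[n]
            if neighbors:
                share = damping * rank[n] / len(neighbors)
                for m in neighbors:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Hypothetical sparse graph over chunks. A-B-C form a chain; D-E are unrelated.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": ["E"], "E": ["D"]}

# Assume only chunk A matched the query directly.
scores = personalized_pagerank(adj, {"A": 1.0})
```

Chunk C never matched the query, yet it receives a nonzero score because it is two hops from the seed. Chunks D and E, which share no path with the seed, stay at zero. That propagation is what lets graph scoring surface multi-hop evidence.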

The paper's key claim: MixPR matches or outperforms vector RAG on multi-hop questions without requiring LLM-powered entity extraction at index time. Graph construction cost is paid per query at retrieval time rather than as an upfront pass over the whole corpus. For corpora where relationships matter but upfront LLM indexing cost is prohibitive, this is the practical path.

Decision matrix

Signal	Vector RAG	GraphRAG	MixPR
Single-chunk answers	Excellent	Overkill	Overkill
Multi-hop relationships	Fails	Designed for this	Good
Global corpus synthesis	Poor	Excellent	Moderate
Index build cost	Embeddings only	High (LLM per chunk)	Query-time only
Query latency	Fast	Fast for local search (pre-built graph); slower for global search (LLM calls over community summaries)	Moderate (graph at query time)
Corpus size sweet spot	Any	Large (amortizes index cost)	Medium

When to use which

Vector RAG: Single-document Q&A, fact lookup, search over well-structured homogeneous content. If your users ask "find me the section about X," vector RAG is the right tool and GraphRAG is unnecessary overhead.

GraphRAG: Analyst-grade queries over heterogeneous document sets where entities and relationships are the unit of interest. Legal document analysis, research synthesis, customer data that spans multiple systems. Index construction cost is amortized over many queries.

MixPR: Multi-hop queries over medium-sized corpora where you cannot afford upfront LLM extraction. Also useful as a reranking layer on top of vector retrieval: retrieve broad vector candidates, then MixPR-score for relationship relevance.

Graph and vector retrieval share a downstream problem: the retrieved content still has to fit in a context window, and the LLM still has to attend to what matters.

GraphRAG community summaries can be verbose. MixPR-ranked documents still carry noise. Regardless of how your retriever finds content (by similarity or by graph centrality), what you inject into the LLM context determines answer quality more than which retrieval method you used.

Context compression at the injection layer (extracting the information relevant to the specific query from the retrieved set rather than injecting full chunks) is orthogonal to retrieval method and addresses the shared bottleneck.
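A minimal sketch of query-focused extraction, assuming plain lexical overlap as the relevance signal: keep only the sentences that share terms with the query, up to a budget. The function name, the `max_sentences` parameter, and the sample chunks are invented for illustration; a production system would likely use a stronger relevance model.

```python
import re


def compress(query: str, chunks: list[str], max_sentences: int = 3) -> str:
    """Keep the sentences with the most query-term overlap, in original order."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for chunk_idx, chunk in enumerate(chunks):
        sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
        for sent_idx, sentence in enumerate(sentences):
            terms = set(re.findall(r"\w+", sentence.lower()))
            overlap = len(q_terms & terms)
            if overlap:
                scored.append((-overlap, chunk_idx, sent_idx, sentence))
    scored.sort()  # highest overlap first, ties broken by document order
    kept = sorted(scored[:max_sentences], key=lambda item: (item[1], item[2]))
    return " ".join(sentence for _, _, _, sentence in kept)


# Hypothetical retrieved set, from any retriever (vector or graph).
retrieved = [
    "Deploys go through the pipeline. Rollback takes about six minutes. The team meets on Mondays.",
    "Alerts page the on-call engineer. Rollback needs no approval.",
]
context = compress("rollback duration minutes", retrieved, max_sentences=1)
```

The function takes retrieved text as input and does not care how it was retrieved, which is the sense in which compression is orthogonal to the retrieval method.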

What to build first

If you are starting a RAG system:

1. Build vector RAG first. It is simpler, cheaper to index, and sufficient for most queries (roughly 80%, as a rule of thumb).

2. Instrument it. Log retrieval hit rate (does the right chunk come back?) and answer quality per query category.

3. Watch for a pattern of multi-hop failures ("it should have connected X to Y"). That is the signal to add GraphRAG or MixPR for that query category.

4. Run both and route by query type. A lightweight classifier on the query (is this a relationship question?) can route each query to the right retrieval path.
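The routing step can start as something very small. The sketch below uses a keyword heuristic as the "lightweight classifier"; the cue list and the route names are assumptions for illustration, and a trained classifier could replace the regex once you have logged enough labeled queries.

```python
import re

# Hypothetical cue words suggesting the query is about relationships.
RELATIONSHIP_CUES = re.compile(
    r"\b(both|between|in common|related to|connected|relationship|compare)\b",
    re.IGNORECASE,
)


def route(query: str) -> str:
    """Return 'graph' for relationship-style queries, 'vector' otherwise."""
    if RELATIONSHIP_CUES.search(query):
        return "graph"
    return "vector"


multi_hop = route(
    "Which customers used both feature A and feature B, "
    "and what did they have in common?"
)
lookup = route("find me the section about refunds")
```

Logging the chosen route next to the answer-quality metrics from step 2 lets you check whether the router's decisions actually line up with the multi-hop failures you observed.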

Building GraphRAG first and paying the index cost upfront without evidence your queries need it is a common and expensive mistake.

Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{graphrag-vs-vector-rag-when-relationships-beat-similarity-2026,
  title  = {GraphRAG vs. Vector RAG: When Relationships Beat Similarity},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/graphrag-vs-vector-rag-when-relationships-beat-similarity},
  note   = {gotcontext.ai engineering blog.},
}

APA

Hollingsworth, J. (2026, May 8). GraphRAG vs. Vector RAG: When Relationships Beat Similarity. gotcontext.ai. https://gotcontext.ai/blog/graphrag-vs-vector-rag-when-relationships-beat-similarity
