Build semantic search, RAG pipelines, recommendation systems, and AI agent infrastructure using Redis Vector Sets—native HNSW-based similarity search with sub-millisecond latency.
See Also: For detailed Vector Set command reference and options, see Vector Sets and Similarity Search.
Vector search finds items based on semantic similarity rather than keyword matching. Text, images, and other data are converted to numerical vectors (embeddings) using machine learning models. Similar items have vectors close together in the embedding space.
Redis Vector Sets provide O(log N) approximate nearest neighbor search using HNSW graphs, with built-in quantization for memory efficiency and filtered queries for hybrid search.
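As a quick orientation, here is a minimal sketch of the add-then-query flow using redis-py's generic execute_command (the key name, vector dimension, and random vectors are purely illustrative; in practice the vectors come from an embedding model):

import numpy as np
import redis

r = redis.Redis()

def fp32_blob(vec):
    # VADD/VSIM accept a raw blob of 32-bit floats when the FP32 keyword is used.
    return np.asarray(vec, dtype=np.float32).tobytes()

# Add a few elements (random vectors stand in for real embeddings).
for element in ["doc:1", "doc:2", "doc:3"]:
    r.execute_command("VADD", "demo:index", "FP32", fp32_blob(np.random.rand(384)), element)

# Approximate nearest-neighbor query for a query vector.
hits = r.execute_command("VSIM", "demo:index", "FP32", fp32_blob(np.random.rand(384)),
                         "COUNT", 3, "WITHSCORES")
print(hits)  # elements paired with similarity scores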
RAG enhances Large Language Models by retrieving relevant context from your private data before generating responses.
The RAG Pipeline:
# Store document chunks with metadata
VADD docs:index FP32 <embedding> "chunk:doc1:p1" SETATTR '{"doc": "manual.pdf", "page": 1, "section": "intro"}'
VADD docs:index FP32 <embedding> "chunk:doc1:p2" SETATTR '{"doc": "manual.pdf", "page": 2, "section": "setup"}'
# Find relevant chunks for user question
VSIM docs:index FP32 <query_embedding> COUNT 5 WITHSCORES
# With metadata filtering (only certain documents, date ranges, etc.)
VSIM docs:index FP32 <query_embedding> COUNT 5 FILTER '.doc == "manual.pdf"'
def answer_question(question):
    # Generate query embedding
    query_embedding = embed_model.encode(question)

    # Retrieve relevant context
    results = redis.vsim('docs:index', 'FP32', query_embedding,
                         count=5, withscores=True)

    # Fetch chunk content (text is stored under "<chunk_id>:text" alongside the vector)
    context = [redis.get(f"{chunk_id}:text") for chunk_id, _ in results]

    # Generate answer with context
    return llm.generate(
        prompt=f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    )
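A companion ingestion sketch that writes both pieces the retrieval step needs: the embedding (with metadata for filtering) and the chunk text under the "<chunk_id>:text" key read back above. The function name, chunker output shape, and use of execute_command are assumptions, not a fixed API:

import json
import numpy as np

def index_document(r, embed_model, doc_name, chunks):
    # chunks: iterable of (chunk_id, text, metadata) produced by your chunking step.
    for chunk_id, text, metadata in chunks:
        embedding = np.asarray(embed_model.encode(text), dtype=np.float32)
        # Vector plus attributes, so filtered similarity queries work later.
        r.execute_command("VADD", "docs:index", "FP32", embedding.tobytes(), chunk_id,
                          "SETATTR", json.dumps({"doc": doc_name, **metadata}))
        # Raw text, keyed so answer_question() can fetch it by element id.
        r.set(f"{chunk_id}:text", text)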
Traditional caches fail with natural language because different phrasings produce different cache keys:
- "Who is the US President?"
- "Current POTUS"
- "Who leads the United States?"
These mean the same thing but hash to different keys.
# Check cache before calling LLM
VSIM query:cache FP32 <query_embedding> COUNT 1 WITHSCORES
If the top result has similarity > 0.95, return the cached response:
from uuid import uuid4

def cached_llm_call(query):
    query_embedding = embed_model.encode(query)

    # Check semantic cache
    cached = redis.vsim('query:cache', 'FP32', query_embedding,
                        count=1, withscores=True)
    if cached and cached[0][1] > 0.95:  # High similarity
        cache_id = cached[0][0]
        return redis.get(f"response:{cache_id}")

    # Cache miss - call LLM
    response = llm.generate(query)

    # Store in cache
    query_id = str(uuid4())
    redis.vadd('query:cache', 'FP32', query_embedding, query_id)
    redis.set(f"response:{query_id}", response, ex=3600)
    return response
Semantic caching can reduce LLM API costs by 40-60% for applications with repetitive queries like customer support or FAQ bots.
Voice and chat AI agents require maintaining session state across distributed servers.
When an agent session spans multiple requests, route them to the same server:
# Store session mapping
HSET agent:sessions client_id_123 '{"server": "server-5", "started": 1706648400}'
# Custom router looks up mapping before forwarding request
server = HGET agent:sessions client_id_123
route_to(server)
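A minimal router-side sketch of that lookup, assuming the hash layout above (the default server choice and client calls are illustrative):

import json
import time

def resolve_server(r, client_id, default_server="server-1"):
    # Reuse the existing session-to-server mapping if one exists.
    raw = r.hget("agent:sessions", client_id)
    if raw:
        return json.loads(raw)["server"]
    # First request of the session: pin it to a server and record the start time.
    r.hset("agent:sessions", client_id,
           json.dumps({"server": default_server, "started": int(time.time())}))
    return default_server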
Store conversation history with embeddings for context-aware responses:
# Store each message with embedding
VADD conv:user123 FP32 <msg_embedding> "msg:001" SETATTR '{"role": "user", "ts": 1706648400}'
VADD conv:user123 FP32 <msg_embedding> "msg:002" SETATTR '{"role": "assistant", "ts": 1706648401}'
# Retrieve relevant past messages for context
VSIM conv:user123 FP32 <current_msg_embedding> COUNT 10 FILTER '.role == "user"'
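Put together, a sketch of the write and read sides of conversation memory (the key layout for message text and the use of execute_command are assumptions):

import json
import time
import numpy as np

def remember(r, embed_model, user_id, msg_id, role, text):
    # Store the message embedding with role/timestamp attributes, plus the raw text.
    emb = np.asarray(embed_model.encode(text), dtype=np.float32)
    r.execute_command("VADD", f"conv:{user_id}", "FP32", emb.tobytes(), msg_id,
                      "SETATTR", json.dumps({"role": role, "ts": int(time.time())}))
    r.set(f"conv:{user_id}:{msg_id}", text)

def relevant_history(r, embed_model, user_id, current_msg, count=10):
    # Pull the past user messages most similar to the current one.
    emb = np.asarray(embed_model.encode(current_msg), dtype=np.float32)
    ids = r.execute_command("VSIM", f"conv:{user_id}", "FP32", emb.tobytes(),
                            "COUNT", count, "FILTER", '.role == "user"')
    return [r.get(f"conv:{user_id}:{mid.decode()}") for mid in ids]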
Find similar items to those a user has interacted with:
# User liked product X, find similar products
VSIM products:embeddings ELE "product:123" COUNT 20 FILTER '.category == "electronics" and .in_stock == true'
When a user has liked multiple items:
def recommend(user_liked_items, count=10):
    candidates = {}
    for item in user_liked_items:
        # Query by stored element (ELE) so no embedding call is needed
        similar = redis.vsim('products:embeddings', 'ELE', item,
                             count=20, withscores=True)
        for product_id, score in similar:
            if product_id not in user_liked_items:
                candidates[product_id] = candidates.get(product_id, 0) + score

    # Return items that are similar to multiple liked items
    return sorted(candidates.items(), key=lambda x: -x[1])[:count]
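Example usage (product IDs are illustrative); summing scores favors items that are close to several of the user's liked items rather than to just one:

liked = ["product:123", "product:456", "product:789"]
for product_id, score in recommend(liked, count=5):
    print(product_id, round(score, 3))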
Use vector similarity for zero-shot or few-shot classification:
# Store labeled examples
VADD classifier FP32 <embedding> "spam:ex1" SETATTR '{"label": "spam"}'
VADD classifier FP32 <embedding> "ham:ex1" SETATTR '{"label": "ham"}'
# Classify new item by nearest neighbors
VSIM classifier FP32 <new_item_embedding> COUNT 5 WITHATTRIBS
Majority vote among k-nearest neighbors determines the label.
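A sketch of that vote, fetching each neighbor's label with VGETATTR (using execute_command; the key and attribute names follow the example above):

import json
from collections import Counter
import numpy as np

def classify(r, embedding, k=5):
    # Nearest labeled examples, then a majority vote over their "label" attributes.
    blob = np.asarray(embedding, dtype=np.float32).tobytes()
    neighbors = r.execute_command("VSIM", "classifier", "FP32", blob, "COUNT", k)
    labels = []
    for element in neighbors:
        attrs = r.execute_command("VGETATTR", "classifier", element)
        if attrs:
            labels.append(json.loads(attrs)["label"])
    return Counter(labels).most_common(1)[0][0] if labels else None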
For real-time LLM response display, use Redis Streams to decouple generation from delivery:
# LLM service writes tokens as they're generated
XADD llm:stream:session123 * event token content "The"
XADD llm:stream:session123 * event token content " answer"
XADD llm:stream:session123 * event token content " is"
XADD llm:stream:session123 * event complete content ""
# Frontend reads from the Stream
XREAD BLOCK 5000 STREAMS llm:stream:session123 $
Benefits:
- Multiple frontends can read the same stream
- Handles reconnection gracefully (resume from the last-seen ID)
- Decouples slow LLM inference from client delivery
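A consumer-side sketch of that loop, assuming redis-py's xread; tracking the last-seen ID is what makes reconnection and resume work:

def consume_tokens(r, session_id, last_id="$", block_ms=5000):
    # Yield tokens as they arrive; pass a concrete last_id to resume after a reconnect.
    stream_key = f"llm:stream:{session_id}"
    while True:
        reply = r.xread({stream_key: last_id}, block=block_ms)
        if not reply:
            continue  # timed out with no new entries; keep waiting
        for _key, entries in reply:
            for entry_id, fields in entries:
                last_id = entry_id
                if fields.get(b"event") == b"complete":
                    return
                yield fields.get(b"content", b"").decode()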
Common embedding model dimensions:
- OpenAI text-embedding-3-small: 1536 dimensions
- OpenAI text-embedding-3-large: 3072 dimensions (or reduced)
- Sentence transformers: 384-768 dimensions
- Cohere embed-v3: 1024 dimensions
With default int8 quantization (Q8):
- 1536-dim vector ≈ 1.5KB per element (including graph overhead)
- 1 million documents ≈ 1.5GB
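The arithmetic behind those figures, as a rough estimator (Q8 stores about one byte per dimension; HNSW links and per-element metadata add some overhead on top of the raw number):

def estimate_q8_vector_bytes(num_vectors, dims):
    # Raw quantized vector storage only; graph links add extra per-element cost.
    return num_vectors * dims  # ~1 byte per dimension under Q8

print(estimate_q8_vector_bytes(1_000_000, 1536) / 1e9)  # ≈ 1.5 (GB) of raw vector data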
The default int8 quantization (Q8) provides a 4x memory reduction over full-precision floats with minimal recall loss (recall stays around 96%):
# Q8 is the default - no flag needed
VADD docs:index FP32 <embedding> "doc:1" ...
For full precision when recall is critical:
# Store without quantization
VADD docs:index FP32 <embedding> "doc:1" NOQUANT ...
Vector Sets use cosine similarity (normalized on insertion). This matches most text embedding models which produce direction-oriented embeddings.
For models that produce L2-normalized embeddings (OpenAI, Cohere), cosine similarity and dot product are equivalent.
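A tiny numeric check of that equivalence (illustrative only; random vectors stand in for real embeddings):

import numpy as np

a = np.random.rand(384)
b = np.random.rand(384)
# L2-normalize, as OpenAI/Cohere models (and Vector Sets on insertion) do.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
assert abs(cosine - dot) < 1e-9  # identical once the vectors are unit length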
These patterns draw on the Redis Vector Sets documentation and on AI application patterns from production deployments.