
Cache Endpoints

Complete reference for Vectorcache API endpoints.

Query Cache

Query the semantic cache for a response.

Endpoint

POST /v1/cache/query

Request

Headers

Header        | Value               | Required
------------- | ------------------- | --------
Authorization | Bearer YOUR_API_KEY | Yes
Content-Type  | application/json    | Yes

Body Parameters

Parameter            | Type    | Required | Default | Description
-------------------- | ------- | -------- | ------- | -----------
prompt               | string  | Yes      | -       | The text prompt to cache/query
model                | string  | Yes      | -       | LLM model identifier
similarity_threshold | number  | No       | 0.85    | Minimum similarity score (0-1)
context              | string  | No       | null    | Additional context for segmentation
include_debug        | boolean | No       | false   | Include debug information

Example Request

{
  "prompt": "What is machine learning?",
  "model": "gpt-4o",
  "similarity_threshold": 0.85,
  "context": "educational-content",
  "include_debug": false
}

Response

Success Response (200 OK)

Cache Hit:

{
  "cache_hit": true,
  "response": "Machine learning is a subset of artificial intelligence...",
  "similarity_score": 0.92,
  "cost_saved": 0.003,
  "llm_provider": "cache"
}

Cache Miss:

{
  "cache_hit": false,
  "response": "Machine learning is a subset of artificial intelligence...",
  "similarity_score": null,
  "cost_saved": 0,
  "llm_provider": "openai"
}

Response Fields

Field            | Type           | Description
---------------- | -------------- | -----------
cache_hit        | boolean        | Whether the query matched a cached entry
response         | string         | The LLM response text
similarity_score | number or null | Cosine similarity score (0-1); null on cache miss
cost_saved       | number         | Estimated cost saved in USD (0 on cache miss)
llm_provider     | string         | Source of response ('cache' or LLM provider name)
debug            | object         | Debug information (only if include_debug: true)

Debug Information

When include_debug: true:

{
  "cache_hit": true,
  "response": "...",
  "similarity_score": 0.92,
  "cost_saved": 0.003,
  "llm_provider": "cache",
  "debug": {
    "embedding_time_ms": 45,
    "search_time_ms": 12,
    "total_time_ms": 57,
    "matched_cache_entry_id": "uuid-here",
    "cache_entry_count": 1523
  }
}
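
For typed clients, the response maps onto a shape like the following. A TypeScript sketch; the interface names are illustrative, but the fields mirror the Response Fields table and the debug example above:

interface QueryCacheDebug {
  embedding_time_ms: number;
  search_time_ms: number;
  total_time_ms: number;
  matched_cache_entry_id: string;
  cache_entry_count: number;
}

interface QueryCacheResponse {
  cache_hit: boolean;
  response: string;
  similarity_score: number | null; // null on cache miss
  cost_saved: number;              // estimated USD saved; 0 on cache miss
  llm_provider: string;            // 'cache' or the LLM provider name
  debug?: QueryCacheDebug;         // present only when include_debug: true
}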

Error Responses

400 Bad Request

Invalid request parameters:

{
  "detail": "similarity_threshold must be between 0 and 1"
}

Common causes:

  • Missing required fields (prompt, model)
  • Invalid similarity_threshold (not between 0 and 1)
  • Invalid JSON format
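
Most of these can be caught before the request leaves your client. A minimal sketch of the check (the function name is illustrative):

function assertValidRequest(params: {
  prompt?: string;
  model?: string;
  similarity_threshold?: number;
}): void {
  // Mirror the server-side rules documented above.
  if (!params.prompt || !params.model) {
    throw new Error('prompt and model are required');
  }
  const t = params.similarity_threshold;
  if (t !== undefined && (t < 0 || t > 1)) {
    throw new RangeError('similarity_threshold must be between 0 and 1');
  }
}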

401 Unauthorized

Authentication failed:

{
  "detail": "Invalid or missing API key"
}

Causes:

  • Missing Authorization header
  • Invalid API key
  • Revoked API key

429 Too Many Requests

Rate limit exceeded:

{
  "detail": "Rate limit exceeded. Please try again later."
}

Headers included:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000
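
One way to honor these headers is to wait until X-RateLimit-Reset (a Unix timestamp in seconds) before retrying. A minimal sketch using fetch, falling back to exponential backoff when the header is missing:

async function queryWithRetry(body: object, apiKey: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch('https://api.vectorcache.ai/v1/cache/query', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    });
    if (res.status !== 429) return res;

    // Prefer the server's reset time; otherwise back off exponentially.
    const reset = Number(res.headers.get('X-RateLimit-Reset'));
    const waitMs = reset > 0 ? Math.max(0, reset * 1000 - Date.now()) : 2 ** attempt * 1000;
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  throw new Error('Rate limit retries exhausted');
}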

500 Internal Server Error

Server error:

{
  "detail": "Internal server error. Please try again or contact support."
}

Examples

Basic Query

curl -X POST "https://api.vectorcache.ai/v1/cache/query" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is AI?",
    "model": "gpt-4o"
  }'

With Context

curl -X POST "https://api.vectorcache.ai/v1/cache/query" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain neural networks",
    "context": "technical-documentation",
    "model": "gpt-4o",
    "similarity_threshold": 0.90
  }'

With Debug Info

curl -X POST "https://api.vectorcache.ai/v1/cache/query" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is deep learning?",
    "model": "gpt-4o",
    "include_debug": true
  }'

Supported Models

Vectorcache supports all major LLM providers:

OpenAI

  • gpt-4o
  • gpt-4o-mini
  • gpt-4-turbo
  • gpt-3.5-turbo

Anthropic

  • claude-3-5-sonnet-20241022
  • claude-3-5-haiku-20241022
  • claude-3-opus-20240229

Google

  • gemini-1.5-pro
  • gemini-1.5-flash

Other Providers

Check the dashboard for your configured LLM providers.

Context-Based Segmentation

Use the context parameter to segment your cache by use case:

// Educational content cache
await client.query({
  prompt: 'What is photosynthesis?',
  context: 'education-biology',
  model: 'gpt-4o'
});

// Scientific research cache
await client.query({
  prompt: 'What is photosynthesis?',
  context: 'scientific-research',
  model: 'gpt-4o'
});

Even with the same prompt, these will be cached separately due to different contexts.

Similarity Threshold

The similarity_threshold parameter controls cache sensitivity:

Threshold | Behavior     | Use Case
--------- | ------------ | --------
0.95-1.0  | Very strict  | Legal, medical, financial content
0.85-0.94 | Recommended  | General purpose, customer support
0.75-0.84 | Relaxed      | Educational content, FAQs
< 0.75    | Very relaxed | Not recommended (low accuracy)

Example: Testing Thresholds

// Strict - only nearly identical queries match
const strict = await client.query({
  prompt: 'What is machine learning?',
  model: 'gpt-4o',
  similarityThreshold: 0.95
});

// Relaxed - more cache hits, less precision
const relaxed = await client.query({
  prompt: 'What is machine learning?',
  model: 'gpt-4o',
  similarityThreshold: 0.80
});

Cost Calculation

The cost_saved field estimates the LLM API cost you saved from the cache hit:

{
  "cache_hit": true,
  "cost_saved": 0.003,  // $0.003 saved
  "llm_provider": "cache"
}

Calculation:

  • Based on the model's input/output token pricing
  • Includes both prompt and response tokens
  • Updated automatically with the latest pricing

Example savings:

  • gpt-4o query: ~$0.002-0.005 saved per query
  • 1,000 cache hits/day: ~$2-5 saved/day
  • Annual savings: ~$730-1,825
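
To verify these numbers against your own traffic, accumulate cost_saved from each response. A sketch (the running total lives in process memory here; the same idea works with any metrics store):

let totalSavedUsd = 0;

function recordSavings(result: { cache_hit: boolean; cost_saved: number }): void {
  // cost_saved is 0 on a miss, so the guard is belt-and-braces.
  if (result.cache_hit) {
    totalSavedUsd += result.cost_saved;
  }
}

// e.g. 1,000 hits/day at ~$0.003 each ≈ $3/day, roughly $1,095/year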

Performance

Response Times

Scenario                   | Typical Response Time
-------------------------- | ---------------------
Cache Hit                  | 50-150 ms
Cache Miss (with LLM call) | 1-5 seconds
Debug Mode                 | +10-20 ms

Optimization Tips

  1. Use appropriate thresholds - Higher thresholds yield faster searches
  2. Enable caching - The first query pays full LLM latency; similar follow-ups return from the cache in 50-150 ms
  3. Batch similar queries - Group related prompts together
  4. Monitor debug metrics - Use include_debug to find bottlenecks (see the sketch after this list)
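
A sketch of tip 4: parse the debug object documented above and log the timing breakdown (the helper name is illustrative):

function logDebugTimings(result: {
  cache_hit: boolean;
  debug?: { embedding_time_ms: number; search_time_ms: number; total_time_ms: number };
}): void {
  if (!result.debug) return; // debug is only present when include_debug: true
  const { embedding_time_ms, search_time_ms, total_time_ms } = result.debug;
  console.log(
    `cache_hit=${result.cache_hit} embedding=${embedding_time_ms}ms ` +
    `search=${search_time_ms}ms total=${total_time_ms}ms`
  );
}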

Rate Limits

Tier       | Requests/Minute | Burst
---------- | --------------- | -----
Free       | 100             | 120
Pro        | 1,000           | 1,200
Enterprise | Custom          | Custom

Rate limit headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640000000

Best Practices

  1. Always handle both cache hit and miss - Your application should work in both scenarios (see the sketch after this list)
  2. Use context for segmentation - Separate caches by use case
  3. Monitor similarity scores - Tune thresholds based on the scores you actually observe
  4. Implement retry logic - Handle 429 errors with exponential backoff (see the retry sketch under 429 Too Many Requests)
  5. Track cost savings - Monitor cost_saved to measure ROI
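
A sketch of practice 1: one code path serves both outcomes, since response is populated on hits and misses alike (the environment variable name is illustrative):

const res = await fetch('https://api.vectorcache.ai/v1/cache/query', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.VECTORCACHE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ prompt: 'What is AI?', model: 'gpt-4o' }),
});
const result = await res.json();

if (result.cache_hit) {
  console.log(`cache hit (score ${result.similarity_score}); saved $${result.cost_saved}`);
} else {
  console.log(`cache miss; answered by ${result.llm_provider}`);
}
console.log(result.response); // valid in both cases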

Next Steps