Cache Endpoints¶
Complete reference for Vectorcache API endpoints.
Query Cache¶
Query the semantic cache for a response.
Endpoint¶
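Cache queries are sent as an HTTP POST; the method and URL below are taken from the curl examples later on this page:

```
POST https://api.vectorcache.ai/v1/cache/query
```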
Request¶
Headers¶
| Header | Value | Required |
|---|---|---|
| `Authorization` | `Bearer YOUR_API_KEY` | Yes |
| `Content-Type` | `application/json` | Yes |
Body Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `prompt` | string | Yes | - | The text prompt to cache or query |
| `model` | string | Yes | - | LLM model identifier |
| `similarity_threshold` | number | No | 0.85 | Minimum similarity score (0-1) |
| `context` | string | No | null | Additional context for cache segmentation |
| `include_debug` | boolean | No | false | Include debug information in the response |
Example Request¶
```json
{
  "prompt": "What is machine learning?",
  "model": "gpt-4o",
  "similarity_threshold": 0.85,
  "context": "educational-content",
  "include_debug": false
}
```
Response¶
Success Response (200 OK)¶
Cache Hit¶
```json
{
  "cache_hit": true,
  "response": "Machine learning is a subset of artificial intelligence...",
  "similarity_score": 0.92,
  "cost_saved": 0.003,
  "llm_provider": "cache"
}
```
Cache Miss¶
```json
{
  "cache_hit": false,
  "response": "Machine learning is a subset of artificial intelligence...",
  "similarity_score": null,
  "cost_saved": 0,
  "llm_provider": "openai"
}
```
Response Fields¶
| Field | Type | Description |
|---|---|---|
| `cache_hit` | boolean | Whether the query matched a cached entry |
| `response` | string | The LLM response text |
| `similarity_score` | number \| null | Cosine similarity score (0-1); null on a cache miss |
| `cost_saved` | number | Estimated cost saved in USD (0 on a cache miss) |
| `llm_provider` | string | Source of the response (`cache` or the LLM provider name) |
| `debug` | object | Debug information (only when `include_debug` is true) |
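Because `response` carries usable text on both hit and miss, a caller only needs to branch on `cache_hit` for logging or metrics. A minimal fetch-based sketch (`YOUR_API_KEY` is a placeholder; the request body mirrors the Basic Query example below):

```javascript
const res = await fetch('https://api.vectorcache.ai/v1/cache/query', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ prompt: 'What is machine learning?', model: 'gpt-4o' })
});

const data = await res.json();
if (data.cache_hit) {
  console.log(`cache hit (score ${data.similarity_score}), saved $${data.cost_saved}`);
} else {
  console.log(`cache miss; answered by ${data.llm_provider}`);
}
console.log(data.response); // usable text in both cases
```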
Debug Information¶
When `include_debug: true`, the response includes a `debug` object:

```json
{
  "cache_hit": true,
  "response": "...",
  "similarity_score": 0.92,
  "cost_saved": 0.003,
  "llm_provider": "cache",
  "debug": {
    "embedding_time_ms": 45,
    "search_time_ms": 12,
    "total_time_ms": 57,
    "matched_cache_entry_id": "uuid-here",
    "cache_entry_count": 1523
  }
}
```
Error Responses¶
400 Bad Request¶
Returned for invalid request parameters. Common causes:
- Missing required fields (`prompt`, `model`)
- `similarity_threshold` outside the 0-1 range
- Malformed JSON
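A representative error body; the field names here are illustrative, not a guaranteed schema:

```json
{
  "error": {
    "code": "invalid_request",
    "message": "similarity_threshold must be between 0 and 1"
  }
}
```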
401 Unauthorized¶
Authentication failed. Causes:
- Missing Authorization header
- Invalid API key
- Revoked API key
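A representative body, with the same caveat that the field names are illustrative:

```json
{
  "error": {
    "code": "unauthorized",
    "message": "Invalid or missing API key"
  }
}
```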
429 Too Many Requests¶
Returned when the rate limit for your tier is exceeded (see Rate Limits below).
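The response includes rate-limit headers. Their exact names are not documented here; the sketch below assumes the conventional `X-RateLimit-*` set:

```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689600
Retry-After: 30
```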
500 Internal Server Error¶
An unexpected error occurred on the server.
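A representative body, again with illustrative field names:

```json
{
  "error": {
    "code": "internal_error",
    "message": "An unexpected error occurred"
  }
}
```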
Examples¶
Basic Query¶
```bash
curl -X POST "https://api.vectorcache.ai/v1/cache/query" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is AI?",
    "model": "gpt-4o"
  }'
```
With Context¶
```bash
curl -X POST "https://api.vectorcache.ai/v1/cache/query" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain neural networks",
    "context": "technical-documentation",
    "model": "gpt-4o",
    "similarity_threshold": 0.90
  }'
```
With Debug Info¶
```bash
curl -X POST "https://api.vectorcache.ai/v1/cache/query" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is deep learning?",
    "model": "gpt-4o",
    "include_debug": true
  }'
```
Supported Models¶
Vectorcache supports all major LLM providers:
OpenAI¶
- `gpt-4o`
- `gpt-4o-mini`
- `gpt-4-turbo`
- `gpt-3.5-turbo`
Anthropic¶
- `claude-3-5-sonnet-20241022`
- `claude-3-5-haiku-20241022`
- `claude-3-opus-20240229`
Google¶
- `gemini-1.5-pro`
- `gemini-1.5-flash`
Other Providers¶
Check the dashboard for your configured LLM providers.
Context-Based Segmentation¶
Use the context parameter to segment your cache by use case:
```javascript
// Educational content cache
await client.query({
  prompt: 'What is photosynthesis?',
  context: 'education-biology',
  model: 'gpt-4o'
});

// Scientific research cache
await client.query({
  prompt: 'What is photosynthesis?',
  context: 'scientific-research',
  model: 'gpt-4o'
});
```
Even with the same prompt, these will be cached separately due to different contexts.
Similarity Threshold¶
The `similarity_threshold` parameter controls how similar an incoming prompt must be to a cached entry to count as a hit:
| Threshold | Behavior | Use Case |
|---|---|---|
| 0.95-1.0 | Very strict | Legal, medical, financial content |
| 0.85-0.94 | Recommended | General purpose, customer support |
| 0.75-0.84 | Relaxed | Educational content, FAQs |
| <0.75 | Very relaxed | Not recommended (low accuracy) |
Example: Testing Thresholds¶
```javascript
// Strict - only nearly identical queries match
const strict = await client.query({
  prompt: 'What is machine learning?',
  model: 'gpt-4o',
  similarityThreshold: 0.95
});

// Relaxed - more cache hits, less precision
const relaxed = await client.query({
  prompt: 'What is machine learning?',
  model: 'gpt-4o',
  similarityThreshold: 0.80
});
```
Cost Calculation¶
The cost_saved field estimates the LLM API cost you saved from the cache hit:
Calculation:
- Based on the model's input/output token pricing
- Includes both prompt and response tokens
- Updated automatically with the latest pricing

Example savings:
- gpt-4o: ~$0.002-0.005 saved per query
- 1,000 cache hits/day: ~$2-5 saved/day
- Annual savings: ~$730-1,825
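To make the arithmetic concrete, here is a sketch of how such a per-query figure could be computed. The prices and the `estimateCostSaved` helper are illustrative, not part of the API or Vectorcache's actual pricing table:

```javascript
// Illustrative arithmetic only: sample per-1M-token prices, not real rates.
const INPUT_PRICE_PER_1M = 2.5;   // USD per 1M input tokens (sample gpt-4o rate)
const OUTPUT_PRICE_PER_1M = 10.0; // USD per 1M output tokens (sample rate)

function estimateCostSaved(promptTokens, responseTokens) {
  return (promptTokens / 1e6) * INPUT_PRICE_PER_1M
       + (responseTokens / 1e6) * OUTPUT_PRICE_PER_1M;
}

// A 200-token prompt with a 300-token response:
console.log(estimateCostSaved(200, 300)); // 0.0035 - within the ~$0.002-0.005 range
```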
Performance¶
Response Times¶
| Scenario | Typical Response Time |
|---|---|
| Cache Hit | 50-150ms |
| Cache Miss (with LLM call) | 1-5 seconds |
| Debug Mode | +10-20ms |
Optimization Tips¶
- Use appropriate thresholds - Higher thresholds = faster searches
- Enable caching - First query is slow, subsequent ones are fast
- Batch similar queries - Group related prompts together
- Monitor debug metrics - Use `include_debug` to find slow steps (see the sketch after this list)
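For example, a quick way to surface the timing fields from the Debug Information section above (a fetch-based sketch; the request body matches the earlier curl example):

```javascript
// Log the documented debug timings so slow embedding/search steps stand out.
const res = await fetch('https://api.vectorcache.ai/v1/cache/query', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    prompt: 'What is deep learning?',
    model: 'gpt-4o',
    include_debug: true
  })
});
const data = await res.json();
const { embedding_time_ms, search_time_ms, total_time_ms } = data.debug;
console.log(`embedding ${embedding_time_ms}ms / search ${search_time_ms}ms / total ${total_time_ms}ms`);
```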
Rate Limits¶
| Tier | Requests/Minute | Burst |
|---|---|---|
| Free | 100 | 120 |
| Pro | 1,000 | 1,200 |
| Enterprise | Custom | Custom |
Rate limit state is reported via response headers; a sketch of the conventional header set appears under 429 Too Many Requests above.
Best Practices¶
- Always handle both cache hit and miss - Your application should work in both scenarios
- Use context for segmentation - Separate caches by use case
- Monitor similarity scores - Tune thresholds based on actual scores
- Implement retry logic - Handle 429 errors with exponential backoff (see the sketch after this list)
- Track cost savings - Monitor `cost_saved` to measure ROI
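A minimal retry sketch with exponential backoff for 429 responses. It honors a `Retry-After` header when present, which is an assumption rather than something this reference documents:

```javascript
// Retry a cache query on 429, backing off 1s, 2s, 4s, ...
async function queryWithRetry(body, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch('https://api.vectorcache.ai/v1/cache/query', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(body)
    });
    if (res.status !== 429) return res.json();
    // Prefer the server's Retry-After (assumed header), else exponential delay.
    const retryAfter = Number(res.headers.get('Retry-After'));
    const delayMs = retryAfter > 0 ? retryAfter * 1000 : 1000 * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('Rate limited: retries exhausted');
}
```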
Next Steps¶
- Error Handling - Handle API errors
- Best Practices - Production tips
- Similarity Tuning - Optimize cache hits