Documentation
Complete guides and API reference for PromptOps
Getting Started with PromptOps
Welcome to PromptOps. This guide will get you from zero to running your first eval in under 10 minutes.
Prerequisites
- Node.js 18+ or Python 3.9+
- A PromptOps API key from the dashboard (request access first)
- An LLM provider API key (OpenAI, Anthropic, etc.)
Installation
TypeScript / JavaScript
npm install @promptops/sdk
Python
pip install promptops
Quick Start
import { PromptOps } from '@promptops/sdk';
const client = new PromptOps({
apiKey: process.env.PROMPTOPS_API_KEY,
});
const prompt = await client.prompts.create({
name: 'customer-support-classifier',
template: 'Classify this support ticket: {{ticket}}',
version: 'v1',
});
const result = await client.evals.run({
promptId: prompt.id,
testSet: 'golden-set-1',
model: 'gpt-4',
});
console.log('Pass rate:', result.passRate);
Next Steps
- [Learn about prompt versioning](/docs/prompt-registry)
- [Build your first eval harness](/docs/evaluations)
- [Set up CI integration](/docs/ci-integration)
Core Concepts
Understanding these concepts will help you use PromptOps effectively.
Prompts
A prompt is a versioned template with:
- Template string (with variables like {{input}})
- Model configuration (temperature, max tokens, etc.)
- System message and role definitions
- Tool/function definitions (if applicable)
Versions
Each prompt can have multiple versions, identified by tags (v1, v2, prod, staging, etc.) or commit-style hashes.
Version types:
- Draft: Editable, not ready for eval
- Candidate: Locked, ready for testing
- Production: Serving live traffic
- Archived: Deprecated, kept for history
Test Sets
A test set is a collection of input/output pairs used for evaluation:
- Inputs: The variables to inject into your prompt
- Expected outputs: Either exact matches or rubric-based criteria
- Metadata: Tags, difficulty level, scenario type
A good test set covers:
- Happy paths
- Edge cases
- Adversarial examples
- Previously discovered bugs
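Test sets usually grow as new failure cases surface. The sketch below adds a regression example with addExample from the Test Sets API; the test set ID is a placeholder, and the example shape mirrors testSets.create shown under Evaluations:
// Add a newly discovered failure case to an existing test set.
await client.testSets.addExample('testset-456', {
  inputs: { article: '...text that previously tripped up the prompt...', length: 50 },
  expected: { output: 'A faithful 50-word summary' },
  rubric: 'Must not invent facts that are absent from the article',
});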
Evaluations
An evaluation runs a prompt version against a test set and produces:
- Pass/fail rate
- Latency percentiles
- Token usage and cost
- Per-example scores and diffs
Scoring methods include:
- Exact match
- Semantic similarity
- LLM-as-judge with rubrics
- Custom scoring functions
Deployments
A deployment promotes a prompt version to an environment:
- Development
- Staging
- Production
Traces
A trace links a production request to:
- Prompt version used
- Model and parameters
- Input variables
- Output generated
- Cost and latency
- User feedback (if available)
Prompt Registry
The prompt registry is version control for your LLM templates.
Creating a Prompt
const prompt = await client.prompts.create({
name: 'summarizer',
  template: 'Summarize this article in {{length}} words:\n{{article}}',
systemMessage: 'You are a concise summarization assistant.',
config: {
model: 'gpt-4-turbo',
temperature: 0.3,
maxTokens: 500,
},
schema: {
inputs: {
article: 'string',
length: 'number',
},
output: 'string',
},
});
Versioning
Tag versions for easy reference:
await client.prompts.tag({
promptId: prompt.id,
version: 'v1.2.0',
});
Branching
Create a branch to experiment:
const branch = await client.prompts.branch({
  promptId: prompt.id,
  from: 'main',
name: 'experiment-shorter-context',
});
Diffing
Compare two versions:
const diff = await client.prompts.diff({
promptId: prompt.id,
versionA: 'v1.0.0',
versionB: 'v1.1.0',
});
This shows template changes, config changes, and output diffs on test sets.
Evaluations
Evals prevent regressions and validate improvements.
Creating a Test Set
const testSet = await client.testSets.create({
name: 'golden-summarization',
examples: [
{
inputs: { article: '...long text...', length: 50 },
expected: { output: 'Expected summary here' },
rubric: 'Must mention key facts A, B, C',
},
// ... more examples
],
});
Running an Eval
const result = await client.evals.run({
promptId: 'prompt-123',
version: 'v2.0.0',
testSetId: testSet.id,
scoringMethod: 'llm-judge',
judgeConfig: {
model: 'gpt-4',
rubric: 'Score 1-5 on accuracy, conciseness, and fluency',
},
});
console.log('Results:', {
passRate: result.passRate,
avgScore: result.avgScore,
avgCost: result.avgCost,
p95Latency: result.p95Latency,
});
Scoring Methods
Exact Match
Simple string comparison. Use for structured outputs.
Semantic Similarity
Embedding-based similarity. Use for paraphrases.
LLM-as-Judge
Use another LLM to score outputs. Flexible but slower and more expensive.
Custom Function
Provide your own scoring function:
await client.evals.run({
// ...
scoringMethod: 'custom',
customScorer: async (output, expected) => {
// Your logic here
return { score: 0.85, passed: true };
},
});
Regression Detection
Set thresholds to block bad changes:
await client.evals.setThresholds({
promptId: 'prompt-123',
minPassRate: 0.90,
maxAvgCost: 0.05,
maxP95Latency: 2000, // ms
});
Evals that do not meet thresholds will fail in CI.
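A CI gate can be as simple as a script that runs the eval and exits non-zero when a threshold is missed. A minimal sketch, assuming only the result fields shown above; 'candidate' and the test set ID are placeholders:
// ci-eval-gate.ts: block the merge if the candidate version regresses.
import { PromptOps } from '@promptops/sdk';
const client = new PromptOps({ apiKey: process.env.PROMPTOPS_API_KEY });
const result = await client.evals.run({
  promptId: 'prompt-123',
  version: 'candidate', // placeholder for the version under review
  testSetId: 'golden-summarization',
});
if (result.passRate < 0.9 || result.p95Latency > 2000) {
  console.error('Eval below thresholds:', result.passRate, result.p95Latency);
  process.exit(1);
}
console.log('Eval passed with pass rate', result.passRate);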
Deployments
Promote validated prompts to production with confidence.
Deploying a Prompt
await client.deployments.create({
promptId: 'prompt-123',
version: 'v2.1.0',
environment: 'production',
rolloutStrategy: 'immediate', // or 'canary'
});
Canary Deployments
Test in production with limited traffic:
await client.deployments.create({
promptId: 'prompt-123',
version: 'v2.2.0',
environment: 'production',
rolloutStrategy: 'canary',
canaryConfig: {
percentage: 10, // 10% of traffic
duration: 3600, // Run for 1 hour
successMetrics: {
minPassRate: 0.92,
maxErrorRate: 0.02,
},
},
});
If the success metrics hold, promote the canary to 100% of traffic; if they degrade, the deployment rolls back automatically.
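To promote, one option is to redeploy the same version with an immediate rollout, reusing deployments.create from above (a sketch; a dedicated promote call, if the platform offers one, is not covered here):
// Canary is healthy: roll v2.2.0 out to 100% of production traffic.
await client.deployments.create({
  promptId: 'prompt-123',
  version: 'v2.2.0',
  environment: 'production',
  rolloutStrategy: 'immediate',
});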
Rollback
Instant rollback to previous version:
await client.deployments.rollback({
promptId: 'prompt-123',
environment: 'production',
toVersion: 'v2.1.0',
});
Multi-Environment Strategy
A typical flow promotes a version from development to staging to production, with evals gating each promotion.
Configure environments in the dashboard.
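As a sketch of that flow, reusing deployments.create (the gating between steps is whatever checks your team requires):
// Promote the same validated version through each environment in order.
for (const environment of ['development', 'staging', 'production']) {
  await client.deployments.create({
    promptId: 'prompt-123',
    version: 'v2.1.0',
    environment,
  });
  // Run evals or smoke checks here before promoting to the next environment.
}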
Observability
Monitor prompt performance in production.
Sending Traces
Instrument your application:
const trace = client.traces.start({
promptId: 'prompt-123',
version: 'v2.1.0',
sessionId: 'user-session-abc',
});
const output = await runLLM({
prompt: trace.prompt,
inputs: { article: userArticle, length: 100 },
});
await trace.end({
output,
cost: 0.003,
latency: 1240, // ms
metadata: { userId: '123' },
});
Dashboards
View in PromptOps dashboard:
- Requests per prompt version
- Cost and latency trends
- Error rates
- User feedback (thumbs up/down)
Alerts
Set up alerts for anomalies:
await client.alerts.create({
promptId: 'prompt-123',
conditions: {
errorRate: { above: 0.05, duration: 300 }, // 5% errors for 5 min
p95Latency: { above: 3000, duration: 600 }, // 3s for 10 min
cost: { above: 0.10, duration: 3600 }, // $0.10/request for 1 hour
},
destinations: ['email', 'slack'],
});
Sampling
You cannot score every production output (too expensive). Use sampling:
await client.sampling.configure({
promptId: 'prompt-123',
sampleRate: 0.01, // 1% of requests
scoringMethod: 'llm-judge',
});
Sampled outputs get quality scores for monitoring.
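To review sampled outputs programmatically, one option is to search traces for the prompt and inspect their scores. This is only a sketch: the filter fields and the qualityScore property assumed below are not documented:
// Pull recent traces for this prompt; field names here are illustrative only.
const traces = await client.traces.query({
  promptId: 'prompt-123',
  from: '2024-06-01',
});
for (const t of traces) {
  if (t.qualityScore !== undefined) {
    console.log(t.id, t.qualityScore); // sampled outputs carry a quality score
  }
}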
API Reference
Complete SDK reference.
Client Initialization
import { PromptOps } from '@promptops/sdk';
const client = new PromptOps({
apiKey: process.env.PROMPTOPS_API_KEY,
baseURL: 'https://api.promptops.ai', // optional
timeout: 30000, // optional, ms
});
Prompts API
create(params)
Create a new prompt. Parameters:
- name: string
- template: string
- systemMessage?: string
- config?: ModelConfig
- schema?: IOSchema
list(filters?)
List prompts. Parameters:
- filters.name?: string
- filters.tags?: string[]
get(id)
Get prompt by ID. Returns: Promise<Prompt>
update(id, params)
Update draft prompt.
delete(id)
Delete prompt (archives it).
tag(id, version, tag)
Add tag to version.
branch(id, from, name)
Create branch.
diff(id, versionA, versionB)
Compare versions.
Test Sets API
create(params)
Create test set.
addExample(testSetId, example)
Add example to test set.
list()
List test sets.
Evals API
run(params)
Run evaluation.
get(evalId)
Get eval results.
list(filters?)
List eval runs.
Deployments API
create(params)
Deploy prompt version.
rollback(promptId, environment, toVersion)
Rollback deployment.
list(promptId)
List deployments.
Traces API
start(params)
Start trace.
end(traceId, result)
End trace.
get(traceId)
Get trace details.
query(filters)
Search traces.
For the Python SDK, see [Python API docs](https://github.com/promptops/python-sdk).