Product
Comprehensive tooling for prompt engineering teams. Version control, testing, deployment, and monitoring in one platform.
Prompt Registry
Version-controlled templates with parameter schemas, role definitions, and chain configurations. Git-like workflow for prompt engineering.
Evaluation Harness
Run golden test sets with custom rubrics. Track regression across prompt versions. Support for LLM-as-judge and deterministic assertions.
Side-by-Side Diff
Compare outputs across model versions, prompts, or parameters. Spot regressions before they hit production.
Tool Use Test Runner
Validate function calling and tool use with synthetic scenarios. Ensure your agent calls the right tools with correct arguments.
Safety & Refusal Packs
Pre-built test sets for jailbreaks, prompt injections, and content policy violations. Add custom adversarial examples.
Cost + Latency Tracking
Per-run token usage and cost breakdowns. Identify expensive prompts and optimize before scaling.
RAG Test Kit
Measure retrieval precision, context relevance, and answer faithfulness. Detect when your RAG pipeline degrades.
CI Integration
Block PRs when eval scores drop. GitHub Actions and GitLab CI examples included. Treat prompts like code.
How PromptOps works
Prompt Lifecycle
Write prompts in the registry with schemas and metadata
Run against golden sets, compare versions, track metrics
Promote to production with canary rollout or instant deployment
Track performance, costs, and quality in real-time
CI Integration Flow
Developer updates prompt template in codebase
Automated test suite runs against new version
Pass/fail status with detailed metrics and diffs
PR blocked if evals fail threshold requirements
SDK
Initialize client
import { PromptOps } from '@promptops/sdk';
const client = new PromptOps({
apiKey: process.env.PROMPTOPS_API_KEY,
});Create and version a prompt
const prompt = await client.prompts.create({
name: 'support-classifier',
template: 'Classify: {{ticket}}',
config: { model: 'gpt-4', temperature: 0.2 },
version: 'v1',
});Run evaluation
const eval = await client.evals.run({
promptId: prompt.id,
testSetId: 'golden-set-1',
model: 'gpt-4',
});
console.log('Pass rate:', eval.passRate);
console.log('Avg cost:', eval.avgCost);Roadmap
What we're building next. Subject to change based on user feedback.
Multi-model eval orchestration
Run the same eval across GPT-4, Claude, Gemini in parallel. Compare model performance on your workload.
Prompt optimization suggestions
LLM-powered recommendations for improving clarity, token efficiency, and output consistency.
Synthetic data generation
Generate diverse test cases from schemas. Expand eval coverage without manual work.
Custom metrics SDK
Plug in your own scoring functions. Track domain-specific quality measures.
Federated eval runs
Run evals in your own infrastructure. Keep sensitive data and prompts private.
Prompt marketplace
Share and discover tested prompts for common use cases. Community-driven quality scores.