Product

Comprehensive tooling for prompt engineering teams. Version control, testing, deployment, and monitoring in one platform.

Prompt Registry

Version-controlled templates with parameter schemas, role definitions, and chain configurations. Git-like workflow for prompt engineering.

Evaluation Harness

Run golden test sets with custom rubrics. Track regression across prompt versions. Support for LLM-as-judge and deterministic assertions.

Side-by-Side Diff

Compare outputs across model versions, prompts, or parameters. Spot regressions before they hit production.

Tool Use Test Runner

Validate function calling and tool use with synthetic scenarios. Ensure your agent calls the right tools with correct arguments.

Safety & Refusal Packs

Pre-built test sets for jailbreaks, prompt injections, and content policy violations. Add custom adversarial examples.

Cost + Latency Tracking

Per-run token usage and cost breakdowns. Identify expensive prompts and optimize before scaling.

RAG Test Kit

Measure retrieval precision, context relevance, and answer faithfulness. Detect when your RAG pipeline degrades.

CI Integration

Block PRs when eval scores drop. GitHub Actions and GitLab CI examples included. Treat prompts like code.

How PromptOps works

Prompt Lifecycle

Create & Version

Write prompts in the registry with schemas and metadata

Test & Evaluate

Run against golden sets, compare versions, track metrics

Deploy

Promote to production with canary rollout or instant deployment

Monitor

Track performance, costs, and quality in real-time

CI Integration Flow

PR opened with prompt changes

Developer updates prompt template in codebase

GitHub Action triggers eval

Automated test suite runs against new version

Results posted to PR

Pass/fail status with detailed metrics and diffs

Merge or block

PR blocked if evals fail threshold requirements

SDK

Initialize client

import { PromptOps } from '@promptops/sdk';

const client = new PromptOps({
  apiKey: process.env.PROMPTOPS_API_KEY,
});

Create and version a prompt

const prompt = await client.prompts.create({
  name: 'support-classifier',
  template: 'Classify: {{ticket}}',
  config: { model: 'gpt-4', temperature: 0.2 },
  version: 'v1',
});

Run evaluation

const eval = await client.evals.run({
  promptId: prompt.id,
  testSetId: 'golden-set-1',
  model: 'gpt-4',
});

console.log('Pass rate:', eval.passRate);
console.log('Avg cost:', eval.avgCost);

Roadmap

What we're building next. Subject to change based on user feedback.

Q1 2025

Multi-model eval orchestration

Run the same eval across GPT-4, Claude, Gemini in parallel. Compare model performance on your workload.

Q1 2025

Prompt optimization suggestions

LLM-powered recommendations for improving clarity, token efficiency, and output consistency.

Q2 2025

Synthetic data generation

Generate diverse test cases from schemas. Expand eval coverage without manual work.

Q2 2025

Custom metrics SDK

Plug in your own scoring functions. Track domain-specific quality measures.

Q3 2025

Federated eval runs

Run evals in your own infrastructure. Keep sensitive data and prompts private.

Q3 2025

Prompt marketplace

Share and discover tested prompts for common use cases. Community-driven quality scores.