Documentation

Complete guides and API reference for PromptOps

Getting Started

Getting Started with PromptOps

Welcome to PromptOps. This guide will get you from zero to running your first eval in under 10 minutes.

Prerequisites

Node.js 18+ or Python 3.9+
API key from PromptOps dashboard (request access first)
An LLM provider API key (OpenAI, Anthropic, etc.)

Installation

TypeScript / JavaScript

npm install @promptops/sdk

Python

pip install promptops

Quick Start

Initialize the SDK:

import { PromptOps } from '@promptops/sdk';
const client = new PromptOps({
  apiKey: process.env.PROMPTOPS_API_KEY,
});

Create your first prompt:

const prompt = await client.prompts.create({
  name: 'customer-support-classifier',
  template: 'Classify this support ticket: {{ticket}}',
  version: 'v1',
});

Run an eval:

const result = await client.evals.run({
  promptId: prompt.id,
  testSet: 'golden-set-1',
  model: 'gpt-4',
});
console.log('Pass rate:', result.passRate);

Next Steps

[Learn about prompt versioning](/docs/prompt-registry)
[Build your first eval harness](/docs/evaluations)
[Set up CI integration](/docs/ci-integration)

Concepts

Core Concepts

Understanding these concepts will help you use PromptOps effectively.

Prompts

A prompt is a versioned template with:

Template string (with variables like {{input}})
Model configuration (temperature, max tokens, etc.)
System message and role definitions
Tool/function definitions (if applicable)

Prompts are immutable once created. Changes create new versions.

Versions

Each prompt can have multiple versions, identified by tags (v1, v2, prod, staging, etc.) or commit-style hashes.

Version types:

Draft: Editable, not ready for eval
Candidate: Locked, ready for testing
Production: Serving live traffic
Archived: Deprecated, kept for history

Test Sets

A test set is a collection of input/output pairs used for evaluation:

Inputs: The variables to inject into your prompt
Expected outputs: Either exact matches or rubric-based criteria
Metadata: Tags, difficulty level, scenario type

Test sets should cover:

Happy paths
Edge cases
Adversarial examples
Previously discovered bugs

Evaluations

An evaluation runs a prompt version against a test set and produces:

Pass/fail rate
Latency percentiles
Token usage and cost
Per-example scores and diffs

Evals can use:

Exact match
Semantic similarity
LLM-as-judge with rubrics
Custom scoring functions

Deployments

A deployment promotes a prompt version to an environment:

Development
Staging
Production

You can rollback to any previous deployment instantly.

Traces

A trace links a production request to:

Prompt version used
Model and parameters
Input variables
Output generated
Cost and latency
User feedback (if available)

Traces enable debugging and quality monitoring in production.

Prompt Registry

The prompt registry is version control for your LLM templates.

Creating a Prompt

const prompt = await client.prompts.create({
  name: 'summarizer',
  template: 'Summarize this article in {{length}} words:
{{article}}',
  systemMessage: 'You are a concise summarization assistant.',
  config: {
    model: 'gpt-4-turbo',
    temperature: 0.3,
    maxTokens: 500,
  },
  schema: {
    inputs: {
      article: 'string',
      length: 'number',
    },
    output: 'string',
  },
});

Versioning

Tag versions for easy reference:

await client.prompts.tag({
  promptId: prompt.id,
  version: 'v1.2.0',
});

Branching

Create a branch to experiment:

const branch = await client.prompts.branch({
  from: 'main',
  name: 'experiment-shorter-context',
});

Diffing

Compare two versions:

const diff = await client.prompts.diff({
  promptId: prompt.id,
  versionA: 'v1.0.0',
  versionB: 'v1.1.0',
});

This shows template changes, config changes, and output diffs on test sets.

Best Practices

Use semantic versioning: Major version for breaking changes, minor for improvements

Tag production versions: Always know what is deployed

Add comments: Explain why you made changes

Link to evals: Reference eval results in version notes

Evaluations

Evals prevent regressions and validate improvements.

Creating a Test Set

const testSet = await client.testSets.create({
  name: 'golden-summarization',
  examples: [
    {
      inputs: { article: '...long text...', length: 50 },
      expected: { output: 'Expected summary here' },
      rubric: 'Must mention key facts A, B, C',
    },
    // ... more examples
  ],
});

Running an Eval

const result = await client.evals.run({
  promptId: 'prompt-123',
  version: 'v2.0.0',
  testSetId: testSet.id,
  scoringMethod: 'llm-judge',
  judgeConfig: {
    model: 'gpt-4',
    rubric: 'Score 1-5 on accuracy, conciseness, and fluency',
  },
});
console.log('Results:', {
  passRate: result.passRate,
  avgScore: result.avgScore,
  avgCost: result.avgCost,
  p95Latency: result.p95Latency,
});

Scoring Methods

Exact Match

Simple string comparison. Use for structured outputs.

Semantic Similarity

Embedding-based similarity. Use for paraphrases.

LLM-as-Judge

Use another LLM to score outputs. Flexible but slower and more expensive.

Custom Function

Provide your own scoring function:

await client.evals.run({
  // ...
  scoringMethod: 'custom',
  customScorer: async (output, expected) => {
    // Your logic here
    return { score: 0.85, passed: true };
  },
});

Regression Detection

Set thresholds to block bad changes:

await client.evals.setThresholds({
  promptId: 'prompt-123',
  minPassRate: 0.90,
  maxAvgCost: 0.05,
  maxP95Latency: 2000, // ms
});

Evals that do not meet thresholds will fail in CI.

Deployments

Promote validated prompts to production with confidence.

Deploying a Prompt

await client.deployments.create({
  promptId: 'prompt-123',
  version: 'v2.1.0',
  environment: 'production',
  rolloutStrategy: 'immediate', // or 'canary'
});

Canary Deployments

Test in production with limited traffic:

await client.deployments.create({
  promptId: 'prompt-123',
  version: 'v2.2.0',
  environment: 'production',
  rolloutStrategy: 'canary',
  canaryConfig: {
    percentage: 10, // 10% of traffic
    duration: 3600, // Run for 1 hour
    successMetrics: {
      minPassRate: 0.92,
      maxErrorRate: 0.02,
    },
  },
});

If metrics are good, promote to 100%. If bad, automatic rollback.

Rollback

Instant rollback to previous version:

await client.deployments.rollback({
  promptId: 'prompt-123',
  environment: 'production',
  toVersion: 'v2.1.0',
});

Multi-Environment Strategy

Typical flow:

Dev: Rapid iteration, no evals required

Staging: Full eval suite must pass

Canary: 5-10% production traffic for 1 hour

Production: Full rollout

Configure environments in dashboard.

Observability

Monitor prompt performance in production.

Sending Traces

Instrument your application:

const trace = client.traces.start({
  promptId: 'prompt-123',
  version: 'v2.1.0',
  sessionId: 'user-session-abc',
});
const output = await runLLM({
  prompt: trace.prompt,
  inputs: { article: userArticle, length: 100 },
});
await trace.end({
  output,
  cost: 0.003,
  latency: 1240, // ms
  metadata: { userId: '123' },
});

Dashboards

View in PromptOps dashboard:

Requests per prompt version
Cost and latency trends
Error rates
User feedback (thumbs up/down)

Alerts

Set up alerts for anomalies:

await client.alerts.create({
  promptId: 'prompt-123',
  conditions: {
    errorRate: { above: 0.05, duration: 300 }, // 5% errors for 5 min
    p95Latency: { above: 3000, duration: 600 }, // 3s for 10 min
    cost: { above: 0.10, duration: 3600 }, // $0.10/request for 1 hour
  },
  destinations: ['email', 'slack'],
});

Sampling

You cannot score every production output (too expensive). Use sampling:

await client.sampling.configure({
  promptId: 'prompt-123',
  sampleRate: 0.01, // 1% of requests
  scoringMethod: 'llm-judge',
});

Sampled outputs get quality scores for monitoring.

API Reference

Complete SDK reference.

Client Initialization

import { PromptOps } from '@promptops/sdk';
const client = new PromptOps({
  apiKey: process.env.PROMPTOPS_API_KEY,
  baseURL: 'https://api.promptops.ai', // optional
  timeout: 30000, // optional, ms
});

Prompts API

create(params)

Create a new prompt. Parameters:

name: string
template: string
systemMessage?: string
config?: ModelConfig
schema?: IOSchema

Returns: Promise

list(filters?)

List prompts. Parameters:

filters.name?: string
filters.tags?: string[]

Returns: Promise

get(id)

Get prompt by ID. Returns: Promise

update(id, params)

Update draft prompt.

delete(id)

Delete prompt (archives it).

tag(id, version, tag)

Add tag to version.

branch(id, from, name)

Create branch.

diff(id, versionA, versionB)

Compare versions.

Test Sets API

create(params)

Create test set.

addExample(testSetId, example)

Add example to test set.

list()

List test sets.

Evals API

run(params)

Run evaluation.

get(evalId)

Get eval results.

list(filters?)

List eval runs.

Deployments API

create(params)

Deploy prompt version.

rollback(promptId, environment, toVersion)

Rollback deployment.

list(promptId)

List deployments.

Traces API

start(params)

Start trace.

end(traceId, result)

End trace.

get(traceId)

Get trace details.

query(filters)

Search traces.

For Python SDK, see [Python API docs](https://github.com/promptops/python-sdk).