Documentation

Complete guides and API reference for PromptOps

Getting Started

Getting Started with PromptOps

Welcome to PromptOps. This guide will get you from zero to running your first eval in under 10 minutes.

Prerequisites

  • Node.js 18+ or Python 3.9+
  • API key from PromptOps dashboard (request access first)
  • An LLM provider API key (OpenAI, Anthropic, etc.)

Installation

TypeScript / JavaScript

npm install @promptops/sdk

Python

pip install promptops

Quick Start

  • Initialize the SDK:
  • import { PromptOps } from '@promptops/sdk';
    

    const client = new PromptOps({ apiKey: process.env.PROMPTOPS_API_KEY, });

  • Create your first prompt:
  • const prompt = await client.prompts.create({
      name: 'customer-support-classifier',
      template: 'Classify this support ticket: {{ticket}}',
      version: 'v1',
    });
  • Run an eval:
  • const result = await client.evals.run({
      promptId: prompt.id,
      testSet: 'golden-set-1',
      model: 'gpt-4',
    });
    

    console.log('Pass rate:', result.passRate);

    Next Steps

    • [Learn about prompt versioning](/docs/prompt-registry)
    • [Build your first eval harness](/docs/evaluations)
    • [Set up CI integration](/docs/ci-integration)
    Concepts

    Core Concepts

    Understanding these concepts will help you use PromptOps effectively.

    Prompts

    A prompt is a versioned template with:

    • Template string (with variables like {{input}})
    • Model configuration (temperature, max tokens, etc.)
    • System message and role definitions
    • Tool/function definitions (if applicable)
    Prompts are immutable once created. Changes create new versions.

    Versions

    Each prompt can have multiple versions, identified by tags (v1, v2, prod, staging, etc.) or commit-style hashes.

    Version types:

    • Draft: Editable, not ready for eval
    • Candidate: Locked, ready for testing
    • Production: Serving live traffic
    • Archived: Deprecated, kept for history

    Test Sets

    A test set is a collection of input/output pairs used for evaluation:

    • Inputs: The variables to inject into your prompt
    • Expected outputs: Either exact matches or rubric-based criteria
    • Metadata: Tags, difficulty level, scenario type
    Test sets should cover:
    • Happy paths
    • Edge cases
    • Adversarial examples
    • Previously discovered bugs

    Evaluations

    An evaluation runs a prompt version against a test set and produces:

    • Pass/fail rate
    • Latency percentiles
    • Token usage and cost
    • Per-example scores and diffs
    Evals can use:
    • Exact match
    • Semantic similarity
    • LLM-as-judge with rubrics
    • Custom scoring functions

    Deployments

    A deployment promotes a prompt version to an environment:

    • Development
    • Staging
    • Production
    You can rollback to any previous deployment instantly.

    Traces

    A trace links a production request to:

    • Prompt version used
    • Model and parameters
    • Input variables
    • Output generated
    • Cost and latency
    • User feedback (if available)
    Traces enable debugging and quality monitoring in production.

    Prompt Registry

    Prompt Registry

    The prompt registry is version control for your LLM templates.

    Creating a Prompt

    const prompt = await client.prompts.create({
      name: 'summarizer',
      template: 'Summarize this article in {{length}} words:
    

    {{article}}', systemMessage: 'You are a concise summarization assistant.', config: { model: 'gpt-4-turbo', temperature: 0.3, maxTokens: 500, }, schema: { inputs: { article: 'string', length: 'number', }, output: 'string', }, });

    Versioning

    Tag versions for easy reference:

    await client.prompts.tag({
      promptId: prompt.id,
      version: 'v1.2.0',
    });

    Branching

    Create a branch to experiment:

    const branch = await client.prompts.branch({
      from: 'main',
      name: 'experiment-shorter-context',
    });

    Diffing

    Compare two versions:

    const diff = await client.prompts.diff({
      promptId: prompt.id,
      versionA: 'v1.0.0',
      versionB: 'v1.1.0',
    });

    This shows template changes, config changes, and output diffs on test sets.

    Best Practices

  • Use semantic versioning: Major version for breaking changes, minor for improvements
  • Tag production versions: Always know what is deployed
  • Add comments: Explain why you made changes
  • Link to evals: Reference eval results in version notes
  • Evaluations

    Evaluations

    Evals prevent regressions and validate improvements.

    Creating a Test Set

    const testSet = await client.testSets.create({
      name: 'golden-summarization',
      examples: [
        {
          inputs: { article: '...long text...', length: 50 },
          expected: { output: 'Expected summary here' },
          rubric: 'Must mention key facts A, B, C',
        },
        // ... more examples
      ],
    });

    Running an Eval

    const result = await client.evals.run({
      promptId: 'prompt-123',
      version: 'v2.0.0',
      testSetId: testSet.id,
      scoringMethod: 'llm-judge',
      judgeConfig: {
        model: 'gpt-4',
        rubric: 'Score 1-5 on accuracy, conciseness, and fluency',
      },
    });
    

    console.log('Results:', { passRate: result.passRate, avgScore: result.avgScore, avgCost: result.avgCost, p95Latency: result.p95Latency, });

    Scoring Methods

    Exact Match

    Simple string comparison. Use for structured outputs.

    Semantic Similarity

    Embedding-based similarity. Use for paraphrases.

    LLM-as-Judge

    Use another LLM to score outputs. Flexible but slower and more expensive.

    Custom Function

    Provide your own scoring function:
    await client.evals.run({
      // ...
      scoringMethod: 'custom',
      customScorer: async (output, expected) => {
        // Your logic here
        return { score: 0.85, passed: true };
      },
    });

    Regression Detection

    Set thresholds to block bad changes:

    await client.evals.setThresholds({
      promptId: 'prompt-123',
      minPassRate: 0.90,
      maxAvgCost: 0.05,
      maxP95Latency: 2000, // ms
    });

    Evals that do not meet thresholds will fail in CI.

    Deployments

    Deployments

    Promote validated prompts to production with confidence.

    Deploying a Prompt

    await client.deployments.create({
      promptId: 'prompt-123',
      version: 'v2.1.0',
      environment: 'production',
      rolloutStrategy: 'immediate', // or 'canary'
    });

    Canary Deployments

    Test in production with limited traffic:

    await client.deployments.create({
      promptId: 'prompt-123',
      version: 'v2.2.0',
      environment: 'production',
      rolloutStrategy: 'canary',
      canaryConfig: {
        percentage: 10, // 10% of traffic
        duration: 3600, // Run for 1 hour
        successMetrics: {
          minPassRate: 0.92,
          maxErrorRate: 0.02,
        },
      },
    });

    If metrics are good, promote to 100%. If bad, automatic rollback.

    Rollback

    Instant rollback to previous version:

    await client.deployments.rollback({
      promptId: 'prompt-123',
      environment: 'production',
      toVersion: 'v2.1.0',
    });

    Multi-Environment Strategy

    Typical flow:

  • Dev: Rapid iteration, no evals required
  • Staging: Full eval suite must pass
  • Canary: 5-10% production traffic for 1 hour
  • Production: Full rollout
  • Configure environments in dashboard.

    Observability

    Observability

    Monitor prompt performance in production.

    Sending Traces

    Instrument your application:

    const trace = client.traces.start({
      promptId: 'prompt-123',
      version: 'v2.1.0',
      sessionId: 'user-session-abc',
    });
    

    const output = await runLLM({ prompt: trace.prompt, inputs: { article: userArticle, length: 100 }, });

    await trace.end({ output, cost: 0.003, latency: 1240, // ms metadata: { userId: '123' }, });

    Dashboards

    View in PromptOps dashboard:

    • Requests per prompt version
    • Cost and latency trends
    • Error rates
    • User feedback (thumbs up/down)

    Alerts

    Set up alerts for anomalies:

    await client.alerts.create({
      promptId: 'prompt-123',
      conditions: {
        errorRate: { above: 0.05, duration: 300 }, // 5% errors for 5 min
        p95Latency: { above: 3000, duration: 600 }, // 3s for 10 min
        cost: { above: 0.10, duration: 3600 }, // $0.10/request for 1 hour
      },
      destinations: ['email', 'slack'],
    });

    Sampling

    You cannot score every production output (too expensive). Use sampling:

    await client.sampling.configure({
      promptId: 'prompt-123',
      sampleRate: 0.01, // 1% of requests
      scoringMethod: 'llm-judge',
    });

    Sampled outputs get quality scores for monitoring.

    API Reference

    API Reference

    Complete SDK reference.

    Client Initialization

    import { PromptOps } from '@promptops/sdk';
    

    const client = new PromptOps({ apiKey: process.env.PROMPTOPS_API_KEY, baseURL: 'https://api.promptops.ai', // optional timeout: 30000, // optional, ms });

    Prompts API

    create(params)

    Create a new prompt. Parameters:
    • name: string
    • template: string
    • systemMessage?: string
    • config?: ModelConfig
    • schema?: IOSchema
    Returns: Promise

    list(filters?)

    List prompts. Parameters:
    • filters.name?: string
    • filters.tags?: string[]
    Returns: Promise

    get(id)

    Get prompt by ID. Returns: Promise

    update(id, params)

    Update draft prompt.

    delete(id)

    Delete prompt (archives it).

    tag(id, version, tag)

    Add tag to version.

    branch(id, from, name)

    Create branch.

    diff(id, versionA, versionB)

    Compare versions.

    Test Sets API

    create(params)

    Create test set.

    addExample(testSetId, example)

    Add example to test set.

    list()

    List test sets.

    Evals API

    run(params)

    Run evaluation.

    get(evalId)

    Get eval results.

    list(filters?)

    List eval runs.

    Deployments API

    create(params)

    Deploy prompt version.

    rollback(promptId, environment, toVersion)

    Rollback deployment.

    list(promptId)

    List deployments.

    Traces API

    start(params)

    Start trace.

    end(traceId, result)

    End trace.

    get(traceId)

    Get trace details.

    query(filters)

    Search traces.

    For Python SDK, see [Python API docs](https://github.com/promptops/python-sdk).