{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

Instructor

Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, parse complex JSON with type safety, and stream partial results using Instructor, a battle-tested structured-output library.

Skill Metadata

| Field | Value |
|---|---|
| Source | Optional. Install with hermes skills install official/mlops/instructor |
| Path | optional-skills/mlops/instructor |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | instructor, pydantic, openai, anthropic |
| Platforms | linux, macos, windows |
| Tags | Prompt Engineering, Instructor, Structured Output, Pydantic, Data Extraction, JSON Parsing, Type Safety, Validation, Streaming, OpenAI, Anthropic |

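Reference: Full SKILL.md

:::info The following is the complete skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees while the skill is active. :::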

Instructor: Structured LLM Outputs

When to Use This Skill

Use Instructor when you need to:

  • Extract structured data from LLM responses reliably
  • Validate outputs against Pydantic schemas automatically
  • Retry failed extractions with automatic error handling
  • Parse complex JSON with type safety and validation
  • Stream partial results for real-time processing
  • Support multiple LLM providers with a consistent API

GitHub Stars: 15,000+ | Battle-tested: 100,000+ developers

Installation

# Base installation
pip install instructor
 
# With specific providers
pip install "instructor[anthropic]"  # Anthropic Claude
pip install "instructor[openai]"     # OpenAI
pip install "instructor[all]"        # All providers

Quick Start

Basic Example: Extract User Data

import instructor
from pydantic import BaseModel
from anthropic import Anthropic
 
# Define output structure
class User(BaseModel):
    name: str
    age: int
    email: str
 
# Create instructor client
client = instructor.from_anthropic(Anthropic())
 
# Extract structured data
user = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "John Doe is 30 years old. His email is john@example.com"
    }],
    response_model=User
)
 
print(user.name)   # "John Doe"
print(user.age)    # 30
print(user.email)  # "john@example.com"

With OpenAI

from openai import OpenAI
 
client = instructor.from_openai(OpenAI())
 
user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=User,
    messages=[{"role": "user", "content": "Extract: Alice, 25, alice@email.com"}]
)

Core Concepts

1. Response Models (Pydantic)

Response models define the structure and validation rules for LLM outputs.

Basic Model

from pydantic import BaseModel, Field
 
class Article(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")
    word_count: int = Field(description="Number of words", gt=0)
    tags: list[str] = Field(description="List of relevant tags")
 
article = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Analyze this article: [article text]"
    }],
    response_model=Article
)

Benefits:

  • Type safety with Python type hints
  • Automatic validation (word_count > 0)
  • Self-documenting with Field descriptions
  • IDE autocomplete support
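The validation benefit can be checked locally, without calling an LLM. The following is a minimal sketch, assuming Pydantic v2 and using a hand-written dictionary (hypothetical data) in place of a real LLM response. It shows the `gt=0` constraint rejecting a bad `word_count` and the field descriptions ending up in the JSON schema.

```python
from pydantic import BaseModel, Field, ValidationError

class Article(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")
    word_count: int = Field(description="Number of words", gt=0)
    tags: list[str] = Field(description="List of relevant tags")

# Hand-written stand-in for an LLM response (hypothetical data).
bad_payload = {"title": "Intro", "author": "A. Writer", "word_count": 0, "tags": ["demo"]}

try:
    Article.model_validate(bad_payload)
    failed_fields = []
except ValidationError as exc:
    # Each error entry records the location of the offending field.
    failed_fields = [err["loc"][0] for err in exc.errors()]

print(failed_fields)

# Field descriptions become part of the JSON schema generated from the model.
schema = Article.model_json_schema()
print(schema["properties"]["word_count"]["description"])
```

The same schema is what a structured-output library can pass to the model, which is why descriptive fields tend to improve extraction quality.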

Nested Models

class Address(BaseModel):
    street: str
    city: str
    country: str
 
class Person(BaseModel):
    name: str
    age: int
    address: Address  # Nested model
 
person = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "John lives at 123 Main St, Boston, USA"
    }],
    response_model=Person
)
 
print(person.address.city)  # "Boston"

Optional Fields

from typing import Optional
 
class Product(BaseModel):
    name: str
    price: float
    discount: Optional[float] = None  # Optional
    description: str = Field(default="No description")  # Default value
 
# LLM doesn't need to provide discount or description
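To see how omitted fields are filled in, the sketch below validates a hand-written payload (a hypothetical stand-in for an LLM response) that leaves out both optional fields. It assumes Pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str
    price: float
    discount: Optional[float] = None
    description: str = Field(default="No description")

# Stand-in for an LLM response that omits discount and description.
product = Product.model_validate({"name": "Widget", "price": 9.99})

print(product.discount)
print(product.description)
# model_fields_set lists only the fields that were actually provided.
print(sorted(product.model_fields_set))
```

Checking `model_fields_set` is one way to tell a value the LLM supplied apart from a default that Pydantic filled in.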

Enums for Constraints

from enum import Enum
 
class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"
 
class Review(BaseModel):
    text: str
    sentiment: Sentiment  # Only these 3 values allowed
 
review = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "This product is amazing!"
    }],
    response_model=Review
)
 
print(review.sentiment)  # Sentiment.POSITIVE
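The enum constraint can also be exercised without an LLM. In this sketch (Pydantic v2 assumed, payloads are hypothetical), a raw string that matches an enum value is coerced to the enum member, while any other string fails validation.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError

class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"

class Review(BaseModel):
    text: str
    sentiment: Sentiment

# A matching raw string is converted to the enum member.
ok = Review.model_validate({"text": "Great!", "sentiment": "positive"})

# A value outside the enum is rejected.
try:
    Review.model_validate({"text": "Great!", "sentiment": "amazing"})
    rejected = False
except ValidationError:
    rejected = True

print(ok.sentiment, rejected)
```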

2. Validation

Pydantic validates LLM outputs automatically. If validation fails, Instructor retries.

Built-in Validators

from pydantic import BaseModel, Field, EmailStr, HttpUrl
 
class Contact(BaseModel):
    name: str = Field(min_length=2, max_length=100)
    age: int = Field(ge=0, le=120)  # 0 <= age <= 120
    email: EmailStr  # Validates email format (needs the email-validator package: pip install "pydantic[email]")
    website: HttpUrl  # Validates URL format
 
# If LLM provides invalid data, Instructor retries automatically

Custom Validators

import re
 
from pydantic import BaseModel, field_validator
 
class Event(BaseModel):
    name: str
    date: str
    attendees: int
 
    @field_validator('date')
    @classmethod
    def validate_date(cls, v: str) -> str:
        """Ensure date is in YYYY-MM-DD format."""
        # fullmatch rejects trailing text such as "2024-01-15abc"
        if not re.fullmatch(r'\d{4}-\d{2}-\d{2}', v):
            raise ValueError('Date must be YYYY-MM-DD format')
        return v
 
    @field_validator('attendees')
    @classmethod
    def validate_attendees(cls, v: int) -> int:
        """Ensure positive attendees."""
        if v < 1:
            raise ValueError('Must have at least 1 attendee')
        return v
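A detail worth checking in regex-based date validators is anchoring: `re.match` only anchors at the start of the string, so trailing characters slip through, whereas `re.fullmatch` requires the whole string to match. The standard-library sketch below illustrates the difference and adds a `datetime.strptime` check so that impossible dates are rejected too. The helper name `is_iso_date` is hypothetical.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_iso_date(value: str) -> bool:
    """Return True only for real calendar dates written as YYYY-MM-DD."""
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        # The pattern alone would accept month 13; strptime rejects it.
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

# re.match accepts a string with trailing junk; fullmatch does not.
prefix_only = re.match(r"\d{4}-\d{2}-\d{2}", "2024-01-15abc") is not None

print(is_iso_date("2024-01-15"), is_iso_date("2024-01-15abc"), is_iso_date("2024-13-40"))
```

A helper like this can be called from inside a `field_validator`, raising `ValueError` when it returns `False`.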

Model-Level Validation

from datetime import datetime
 
from pydantic import BaseModel, model_validator
 
class DateRange(BaseModel):
    start_date: str
    end_date: str
 
    @model_validator(mode='after')
    def check_dates(self):
        """Ensure end_date is not earlier than start_date."""
        start = datetime.strptime(self.start_date, '%Y-%m-%d')
        end = datetime.strptime(self.end_date, '%Y-%m-%d')
 
        # Equal dates are allowed; only a reversed range is rejected
        if end < start:
            raise ValueError('end_date must not be earlier than start_date')
        return self
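A model-level validator like this can be tried against hand-written payloads before it is wired into an LLM call. The sketch below assumes Pydantic v2. The `error_messages` helper is hypothetical and simply collects the messages that a failed validation produces.

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError, model_validator

class DateRange(BaseModel):
    start_date: str
    end_date: str

    @model_validator(mode="after")
    def check_dates(self):
        start = datetime.strptime(self.start_date, "%Y-%m-%d")
        end = datetime.strptime(self.end_date, "%Y-%m-%d")
        if end < start:
            raise ValueError("end_date must not be earlier than start_date")
        return self

def error_messages(payload: dict) -> list[str]:
    """Validate a payload and return the error messages (empty if valid)."""
    try:
        DateRange.model_validate(payload)
        return []
    except ValidationError as exc:
        return [err["msg"] for err in exc.errors()]

valid = error_messages({"start_date": "2024-01-01", "end_date": "2024-01-31"})
invalid = error_messages({"start_date": "2024-02-01", "end_date": "2024-01-31"})
print(valid, invalid)
```

Messages of this kind are what gets fed back to the LLM on a retry, so it pays to word them as clear instructions.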

3. Automatic Retrying

Instructor retries automatically when validation fails, providing error feedback to the LLM.

# Retries up to 3 times if validation fails
user = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Extract user from: John, age unknown"
    }],
    response_model=User,
    max_retries=3  # Default is 3
)
 
# If age can't be extracted, Instructor tells the LLM:
# "Validation error: age - field required"
# LLM tries again with better extraction

How it works:

  1. LLM generates output
  2. Pydantic validates
  3. If invalid: Error message sent back to LLM
  4. LLM tries again with error feedback
  5. Repeats up to max_retries
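The loop above can be sketched in a few lines. This is a simplified illustration of the idea, not Instructor's actual implementation: `extract_with_retries`, `fake_llm`, and `max_attempts` are hypothetical names, and the "LLM" is a canned sequence of replies so the example runs offline. Pydantic v2 is assumed.

```python
from typing import Callable

from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int

def extract_with_retries(
    ask_llm: Callable[[list[dict]], str], prompt: str, max_attempts: int = 3
) -> User:
    """Ask, validate, and on failure send the error back and ask again."""
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    for _ in range(max_attempts):
        raw = ask_llm(messages)
        try:
            return User.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            # Feed the failed output and the validation error back to the model.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"Validation failed, please fix:\n{exc}"})
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error

# Hypothetical fake LLM: the first reply is invalid, the second is valid.
canned = iter(['{"name": "John", "age": "unknown"}', '{"name": "John", "age": 30}'])
calls = []

def fake_llm(messages: list[dict]) -> str:
    calls.append(len(messages))  # record conversation length at each call
    return next(canned)

user = extract_with_retries(fake_llm, "Extract user from: John is 30")
print(user, calls)
```

The second call sees three messages (the prompt, the failed reply, and the error feedback), which is the extra context that lets the model correct itself.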

4. Streaming

Stream partial results for real-time processing.

Streaming Partial Objects

class Story(BaseModel):
    title: str
    content: str
    tags: list[str]
 
# Stream partial updates as LLM generates
for partial_story in client.messages.create_partial(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Write a short sci-fi story"
    }],
    response_model=Story
):
    # Fields that have not been generated yet may still be None
    print(f"Title: {partial_story.title}")
    print(f"Content so far: {(partial_story.content or '')[:100]}...")
    # Update UI in real-time
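While a response is still streaming, fields that have not been generated yet are typically `None`, so rendering code should tolerate missing values. The sketch below simulates this offline: `StorySnapshot` is a hypothetical stand-in for a partially streamed `Story`, and the snapshots are built by hand, not produced by Instructor.

```python
from typing import Optional

from pydantic import BaseModel

class StorySnapshot(BaseModel):
    """Hypothetical stand-in for a partially streamed Story."""
    title: Optional[str] = None
    content: Optional[str] = None
    tags: Optional[list[str]] = None

def render(snapshot: StorySnapshot) -> str:
    """Format a snapshot for display, tolerating fields that are still None."""
    title = snapshot.title or "(untitled)"
    preview = (snapshot.content or "")[:20]
    return f"{title}: {preview}"

# Hand-built snapshots imitating three stages of a stream.
snapshots = [
    StorySnapshot(),
    StorySnapshot(title="Drift"),
    StorySnapshot(title="Drift", content="The station hummed quietly in orbit."),
]
lines = [render(s) for s in snapshots]
print(lines)
```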

Streaming Iterables

class Task(BaseModel):
    title: str
    priority: str
 
# Stream list items as they're generated
tasks = client.messages.create_iterable(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Generate 10 project tasks"
    }],
    response_model=Task
)
 
for task in tasks:
    print(f"- {task.title} ({task.priority})")
    # Process each task as it arrives

Provider Configuration

Anthropic Claude

import instructor
from anthropic import Anthropic
 
client = instructor.from_anthropic(
    Anthropic(api_key="your-api-key")
)
 
# Use with Claude models
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[...],
    response_model=YourModel
)

OpenAI

from openai import OpenAI
 
client = instructor.from_openai(
    OpenAI(api_key="your-api-key")
)
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=YourModel,
    messages=[...]
)

Local Models (Ollama)

from openai import OpenAI
 
# Point to local Ollama server
client = instructor.from_openai(
    OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama"  # Required but ignored
    ),
    mode=instructor.Mode.JSON
)
 
response = client.chat.completions.create(
    model="llama3.1",
    response_model=YourModel,
    messages=[...]
)

Common Patterns

Pattern 1: Data Extraction from Text

class CompanyInfo(BaseModel):
    name: str
    founded_year: int
    industry: str
    employees: int
    headquarters: str
 
text = """
Tesla, Inc. was founded in 2003. It operates in the automotive and energy
industry with approximately 140,000 employees. The company is headquartered
in Austin, Texas.
"""
 
company = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Extract company information from: {text}"
    }],
    response_model=CompanyInfo
)

Pattern 2: Classification

class Category(str, Enum):
    TECHNOLOGY = "technology"
    FINANCE = "finance"
    HEALTHCARE = "healthcare"
    EDUCATION = "education"
    OTHER = "other"
 
class ArticleClassification(BaseModel):
    category: Category
    confidence: float = Field(ge=0.0, le=1.0)
    keywords: list[str]
 
classification = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Classify this article: [article text]"
    }],
    response_model=ArticleClassification
)

Pattern 3: Multi-Entity Extraction

class Person(BaseModel):
    name: str
    role: str
 
class Organization(BaseModel):
    name: str
    industry: str
 
class Entities(BaseModel):
    people: list[Person]
    organizations: list[Organization]
    locations: list[str]
 
text = "Tim Cook, CEO of Apple, announced at the event in Cupertino..."
 
entities = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Extract all entities from: {text}"
    }],
    response_model=Entities
)
 
for person in entities.people:
    print(f"{person.name} - {person.role}")

Pattern 4: Structured Analysis

class SentimentAnalysis(BaseModel):
    overall_sentiment: Sentiment
    positive_aspects: list[str]
    negative_aspects: list[str]
    suggestions: list[str]
    score: float = Field(ge=-1.0, le=1.0)
 
review = "The product works well but setup was confusing..."
 
analysis = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Analyze this review: {review}"
    }],
    response_model=SentimentAnalysis
)

Pattern 5: Batch Processing

def extract_person(text: str) -> Person:
    return client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Extract person from: {text}"
        }],
        response_model=Person
    )
 
texts = [
    "John Doe is a 30-year-old engineer",
    "Jane Smith, 25, works in marketing",
    "Bob Johnson, age 40, software developer"
]
 
people = [extract_person(text) for text in texts]
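In a batch, a single failed extraction raises inside the list comprehension and discards the results already collected. One possible way to isolate failures is sketched below using only the standard library. `extract_all` and `fake_extract` are hypothetical names, and `fake_extract` stands in for `extract_person` so that no LLM call is made.

```python
from typing import Callable

def extract_all(extract: Callable[[str], dict], texts: list[str]):
    """Run extract on every text, collecting successes and failures separately."""
    results, failures = [], []
    for text in texts:
        try:
            results.append(extract(text))
        except Exception as exc:
            # Keep going; record which input failed and why.
            failures.append((text, str(exc)))
    return results, failures

def fake_extract(text: str) -> dict:
    """Hypothetical stand-in for extract_person; no LLM call is made."""
    if not text.strip():
        raise ValueError("empty input")
    return {"name": text.split(",")[0].strip()}

results, failures = extract_all(
    fake_extract, ["Jane Smith, 25", "   ", "Bob Johnson, age 40"]
)
print(results)
print(failures)
```

The failed inputs can then be logged or retried separately without losing the successful extractions.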

Advanced Features

Union Types

from typing import Literal, Union
 
from pydantic import BaseModel, HttpUrl
 
class TextContent(BaseModel):
    type: Literal["text"] = "text"  # Literal tag keeps the two variants unambiguous
    content: str
 
class ImageContent(BaseModel):
    type: Literal["image"] = "image"
    url: HttpUrl
    caption: str
 
class Post(BaseModel):
    title: str
    content: Union[TextContent, ImageContent]  # Either type
 
# LLM chooses appropriate type based on content
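How Pydantic picks between the union members can be made explicit with a discriminated union. This is a sketch under the assumption of Pydantic v2: each variant tags itself with a `Literal` value, and `Field(discriminator="type")` tells Pydantic to select the variant from that tag. The payloads are hand-written stand-ins for LLM output.

```python
from typing import Literal, Union

from pydantic import BaseModel, Field, HttpUrl

class TextContent(BaseModel):
    type: Literal["text"] = "text"
    content: str

class ImageContent(BaseModel):
    type: Literal["image"] = "image"
    url: HttpUrl
    caption: str

class Post(BaseModel):
    title: str
    # The "type" tag in the payload decides which variant is built.
    content: Union[TextContent, ImageContent] = Field(discriminator="type")

image_post = Post.model_validate({
    "title": "Sunset",
    "content": {"type": "image", "url": "https://example.com/s.png", "caption": "Dusk"},
})
text_post = Post.model_validate({
    "title": "Note",
    "content": {"type": "text", "content": "Hello"},
})
print(type(image_post.content).__name__, type(text_post.content).__name__)
```

With a discriminator, a malformed payload produces an error about the selected variant only, which tends to give the LLM clearer retry feedback than errors for every union member.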

Dynamic Models

from pydantic import EmailStr, Field, create_model
 
# Create model at runtime
DynamicUser = create_model(
    'User',
    name=(str, ...),
    age=(int, Field(ge=0)),
    email=(EmailStr, ...)
)
 
user = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[...],
    response_model=DynamicUser
)
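A common reason to build models at runtime is that the field list comes from configuration. The sketch below assumes Pydantic v2. `field_spec` is a hypothetical specification, and the payloads are hand-written so the example runs without an LLM.

```python
from pydantic import Field, ValidationError, create_model

# Hypothetical field spec, e.g. loaded from configuration at runtime.
# Each entry is (type, default); "..." marks a required field.
field_spec = {
    "name": (str, ...),
    "age": (int, Field(ge=0)),
    "nickname": (str, "n/a"),
}
DynamicUser = create_model("DynamicUser", **field_spec)

user = DynamicUser.model_validate({"name": "Ada", "age": 36})

# Constraints declared through Field still apply to dynamic models.
try:
    DynamicUser.model_validate({"name": "Ada", "age": -1})
    negative_age_rejected = False
except ValidationError:
    negative_age_rejected = True

print(user, negative_age_rejected)
```

The resulting class behaves like a hand-written `BaseModel`, so it can be passed as `response_model` in the same way.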

Custom Modes

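# JSON mode for Claude, as a fallback when tool calling is not suitable
client = instructor.from_anthropic(
    Anthropic(),
    mode=instructor.Mode.ANTHROPIC_JSON  # from_anthropic expects an Anthropic-specific mode
)
 
# Commonly used modes:
# - Mode.ANTHROPIC_TOOLS (recommended for Claude)
# - Mode.ANTHROPIC_JSON (JSON fallback for Claude)
# - Mode.TOOLS (OpenAI tools)
# - Mode.JSON (JSON mode for OpenAI-compatible APIs, e.g. Ollama)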

Context Management

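# Single-use client: the underlying Anthropic client acts as the context manager
with Anthropic() as anthropic_client:
    client = instructor.from_anthropic(anthropic_client)
    result = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[...],
        response_model=YourModel
    )
    # Underlying HTTP client is closed automatically on exit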

Error Handling

Handling Validation Errors

from instructor.exceptions import InstructorRetryException
 
try:
    user = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[...],
        response_model=User,
        max_retries=3
    )
except InstructorRetryException as e:
    # Raised once all retries are exhausted; wraps the last validation error
    print(f"Failed after retries: {e}")
    # Handle gracefully
except Exception as e:
    print(f"API error: {e}")

Custom Error Messages

from pydantic import ConfigDict
 
class ValidatedUser(BaseModel):
    # json_schema_extra adds an example to the schema the LLM sees;
    # it does not change the text of Pydantic's error messages.
    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {
                    "name": "John Doe",
                    "age": 30,
                    "email": "john@example.com"
                }
            ]
        }
    )
 
    name: str = Field(description="Full name, 2-100 characters")
    age: int = Field(description="Age between 0 and 120", ge=0, le=120)
    email: EmailStr = Field(description="Valid email address")
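To confirm what the LLM actually receives, the generated JSON schema can be inspected directly. The sketch below assumes Pydantic v2 and uses a reduced model (without the `email` field, so that no extra package is needed) to show that the example and the numeric bounds both appear in the schema.

```python
from pydantic import BaseModel, ConfigDict, Field

class ValidatedUser(BaseModel):
    # The example is embedded in the schema; error messages are unchanged.
    model_config = ConfigDict(
        json_schema_extra={"examples": [{"name": "John Doe", "age": 30}]}
    )

    name: str = Field(description="Full name, 2-100 characters")
    age: int = Field(description="Age between 0 and 120", ge=0, le=120)

schema = ValidatedUser.model_json_schema()
print(schema["examples"])
print(schema["properties"]["age"])
```

Guidance for the model therefore comes from the descriptions, constraints, and examples in the schema, while the wording of validation errors is controlled by the exceptions raised in validators.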

Best Practices

1. Clear Field Descriptions

# ❌ Bad: Vague
class Product(BaseModel):
    name: str
    price: float
 
# ✅ Good: Descriptive
class Product(BaseModel):
    name: str = Field(description="Product name from the text")
    price: float = Field(description="Price in USD, without currency symbol")

2. Use Appropriate Validation

# ✅ Good: Constrain values
class Rating(BaseModel):
    score: int = Field(ge=1, le=5, description="Rating from 1 to 5 stars")
    review: str = Field(min_length=10, description="Review text, at least 10 chars")

3. Provide Examples in Prompts

messages = [{
    "role": "user",
    "content": """Extract person info from: "John, 30, engineer"
 
Example format:
{
  "name": "John Doe",
  "age": 30,
  "occupation": "engineer"
}"""
}]

4. Use Enums for Fixed Categories

# ✅ Good: Enum ensures valid values
class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
 
class Application(BaseModel):
    status: Status  # LLM must choose from enum

5. Handle Missing Data Gracefully

class PartialData(BaseModel):
    required_field: str
    optional_field: Optional[str] = None
    default_field: str = "default_value"
 
# LLM only needs to provide required_field

Comparison to Alternatives

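| Feature | Instructor | Manual JSON | LangChain | DSPy |
|---|---|---|---|---|
| Type Safety | ✅ Yes | ❌ No | ⚠️ Partial | ✅ Yes |
| Auto Validation | ✅ Yes | ❌ No | ❌ No | ⚠️ Limited |
| Auto Retry | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| Streaming | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
| Multi-Provider | ✅ Yes | ⚠️ Manual | ✅ Yes | ✅ Yes |
| Learning Curve | Low | Low | Medium | High |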

When to choose Instructor:

  • Need structured, validated outputs
  • Want type safety and IDE support
  • Require automatic retries
  • Building data extraction systems

When to choose alternatives:

  • DSPy: Need prompt optimization
  • LangChain: Building complex chains
  • Manual: Simple, one-off extractions

Resources

See Also

  • references/validation.md - Advanced validation patterns
  • references/providers.md - Provider-specific configuration
  • references/examples.md - Real-world use cases