Modal Serverless GPU: a serverless GPU cloud platform for running ML workloads

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without infrastructure management, deploying ML models as APIs, or running batch jobs with automatic scaling.

Skill metadata


Source	Optional, installed via `hermes skills install official/mlops/modal`
Path	`optional-skills/mlops/modal`
Version	`1.0.0`
Author	Orchestra Research
License	MIT
Dependencies	`modal>=0.64.0`
Platforms	linux, macos, windows
Tags	`Infrastructure`, `Serverless`, `GPU`, `Cloud`, `Deployment`, `Modal`

Reference: full SKILL.md

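:::info The following is the complete skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees while the skill is active. :::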

Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.

Use Modal when:

Running GPU-intensive ML workloads without managing infrastructure
Deploying ML models as auto-scaling APIs
Running batch processing jobs (training, inference, data processing)
Needing pay-per-second GPU pricing without idle costs
Prototyping ML applications quickly
Running scheduled jobs (cron-like workloads)
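
To make the pay-per-second point concrete, here is a rough arithmetic sketch. The hourly rate and the request volume are placeholders chosen for illustration, not Modal prices; the pricing page has the real figures. It also ignores cold-start and keep-warm seconds, which are billed too.

```python
# Hypothetical figures for illustration only; not Modal's actual prices.
HOURLY_RATE_USD = 2.00          # assumed price of one GPU-hour
SECONDS_PER_HOUR = 3600

def per_second_cost(busy_seconds: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost when billed only for the seconds the GPU is actually in use."""
    return busy_seconds * hourly_rate / SECONDS_PER_HOUR

def always_on_cost(hours: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of a reserved instance that runs for the whole period."""
    return hours * hourly_rate

# 500 requests a day at 3 GPU-seconds each, versus a GPU reserved for 24 hours.
daily_busy_seconds = 500 * 3
serverless = per_second_cost(daily_busy_seconds)
reserved = always_on_cost(24)
print(f"per-second billing: ${serverless:.2f}/day, always-on: ${reserved:.2f}/day")
```

The gap narrows as utilization rises, which is why reserved instances appear among the alternatives for steady, long-running work.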

Key features:

Serverless GPUs: T4, L4, A10G, L40S, A100, H100, H200, B200 on-demand
Python-native: Define infrastructure in Python code, no YAML
Auto-scaling: Scale to zero, scale to 100+ GPUs instantly
Sub-second cold starts: Rust-based infrastructure for fast container launches
Container caching: Image layers cached for rapid iteration
Web endpoints: Deploy functions as REST APIs with zero-downtime updates

Alternatives:

RunPod: For longer-running pods with persistent state
Lambda Labs: For reserved GPU instances
SkyPilot: For multi-cloud orchestration and cost optimization
Kubernetes: For complex multi-service architectures

Quick start

Installation

pip install modal
modal setup  # Opens browser for authentication

Hello World with GPU

import modal
 
app = modal.App("hello-gpu")
 
@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
 
@app.local_entrypoint()
def main():
    print(gpu_info.remote())

Run: modal run hello_gpu.py

Basic inference endpoint

import modal
 
app = modal.App("text-generation")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
 
@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)
 
    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]
 
@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))

Core concepts

Key components

Component	Purpose
`App`	Container for functions and resources
`Function`	Serverless function with compute specs
`Cls`	Class-based functions with lifecycle hooks
`Image`	Container image definition
`Volume`	Persistent storage for models/data
`Secret`	Secure credential storage

Execution modes

Command	Description
`modal run script.py`	Execute and exit
`modal serve script.py`	Development with live reload
`modal deploy script.py`	Persistent cloud deployment

GPU configuration

Available GPUs

GPU	VRAM	Best for
`T4`	16GB	Budget inference, small models
`L4`	24GB	Inference, Ada Lovelace arch
`A10G`	24GB	Training/inference, 3.3x faster than T4
`L40S`	48GB	Recommended for inference (best cost/perf)
`A100-40GB`	40GB	Large model training
`A100-80GB`	80GB	Very large models
`H100`	80GB	Fastest, FP8 + Transformer Engine
`H200`	141GB	Auto-upgrade from H100, 4.8TB/s bandwidth
`B200`	Latest	Blackwell architecture
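
As a rough way to read the table, the sketch below estimates which GPUs could hold a model's weights. The VRAM numbers come from the table; the 1.2 overhead factor and the helper names are assumptions for illustration, and real memory use also depends on batch size, context length, and whether you are training.

```python
# VRAM figures (GB) copied from the table above; B200 is omitted because
# the table gives no number for it.
GPU_VRAM_GB = {
    "T4": 16, "L4": 24, "A10G": 24, "L40S": 48,
    "A100-40GB": 40, "A100-80GB": 80, "H100": 80, "H200": 141,
}

def estimate_weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone: 1e9 params * bytes / 1e9 bytes per GB."""
    return params_billions * bytes_per_param

def gpus_that_fit(params_billions: float, bytes_per_param: int = 2,
                  overhead: float = 1.2) -> list[str]:
    """GPUs whose VRAM covers the weights times an assumed overhead factor."""
    needed = estimate_weights_gb(params_billions, bytes_per_param) * overhead
    fits = [name for name, vram in GPU_VRAM_GB.items() if vram >= needed]
    return sorted(fits, key=lambda name: GPU_VRAM_GB[name])

print(gpus_that_fit(7))    # 7B params in fp16: 14 GB of weights, about 16.8 GB needed
print(gpus_that_fit(70))   # 70B params in fp16: 140 GB of weights, about 168 GB needed
```

An empty result means no single GPU in the table is large enough under these assumptions, which is where multi-GPU specifications or quantization come in.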

GPU specification patterns

# Single GPU
@app.function(gpu="A100")
 
# Specific memory variant
@app.function(gpu="A100-80GB")
 
# Multiple GPUs (up to 8)
@app.function(gpu="H100:4")
 
# GPU with fallbacks
@app.function(gpu=["H100", "A100", "L40S"])
 
# Any available GPU
@app.function(gpu="any")
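
The `"TYPE:COUNT"` string format used above can be validated before deployment with a small helper. This is a hypothetical utility for illustration, not how Modal parses the string internally; the 1 to 8 bound mirrors the "up to 8" comment in the patterns above.

```python
def parse_gpu_spec(spec: str) -> tuple[str, int]:
    """Split a spec such as "H100:4" into ("H100", 4); count defaults to 1."""
    name, sep, count = spec.partition(":")
    if not name:
        raise ValueError(f"missing GPU type in {spec!r}")
    n = int(count) if sep else 1
    if not 1 <= n <= 8:          # the patterns above mention up to 8 GPUs
        raise ValueError(f"GPU count must be between 1 and 8, got {n}")
    return name, n

print(parse_gpu_spec("A100-80GB"))   # ('A100-80GB', 1)
print(parse_gpu_spec("H100:4"))      # ('H100', 4)
```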

Container images

# Basic image with pip
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.1.0", "transformers==4.36.0", "accelerate"
)
 
# From CUDA base
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11"
).pip_install("torch", "transformers")
 
# With system packages
# "openai-whisper" is the PyPI name of OpenAI's Whisper; "whisper" is an unrelated package
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("openai-whisper")

Persistent storage

volume = modal.Volume.from_name("model-cache", create_if_missing=True)
 
@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    # download_model() and load_from_path() are placeholders for your own helpers
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()
        model.save_pretrained(model_path)
        volume.commit()  # Persist changes
    return load_from_path(model_path)
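
The check-then-download pattern above can be tried locally without Modal. In this sketch a temporary directory plays the role of the `/models` mount, and `ensure_cached` and `fake_download` are hypothetical stand-ins; on Modal, the write would be followed by `volume.commit()` as in the example above.

```python
import tempfile
from pathlib import Path

def ensure_cached(cache_dir: Path, name: str, fetch) -> tuple[Path, bool]:
    """Return (path, downloaded); call fetch() only when the file is missing."""
    target = cache_dir / name
    if target.exists():
        return target, False
    target.write_bytes(fetch())       # on Modal, volume.commit() would follow
    return target, True

def fake_download() -> bytes:
    """Stand-in for a real model download."""
    return b"model-weights"

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)                  # plays the role of the /models mount
    first = ensure_cached(cache, "llama-7b.bin", fake_download)
    second = ensure_cached(cache, "llama-7b.bin", fake_download)
    print(first[1], second[1])         # True False
```

Only the first call pays the download cost; later calls, and later containers that mount the same volume, find the file already in place.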

Web endpoints

FastAPI endpoint decorator

@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}
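
A client for an endpoint like this might be sketched as follows. It assumes FastAPI's usual rule that a plain `str` parameter is read from the query string, even for POST; the URL is a made-up placeholder, since the real one is printed when the app is served or deployed. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical URL; `modal serve` or `modal deploy` prints the real one.
ENDPOINT_URL = "https://example.invalid/predict"

def build_predict_request(text: str, base_url: str = ENDPOINT_URL) -> Request:
    """Build (but do not send) a POST that passes `text` in the query string."""
    url = f"{base_url}?{urlencode({'text': text})}"
    return Request(url, method="POST")

req = build_predict_request("hello world")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req).read()
```

To accept a JSON body instead, the endpoint would typically take a Pydantic model as its parameter rather than a bare `str`.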

Full ASGI app

from fastapi import FastAPI
web_app = FastAPI()
 
@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}
 
@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app

Web endpoint types

Decorator	Use Case
`@modal.fastapi_endpoint()`	Simple function → API
`@modal.asgi_app()`	Full FastAPI/Starlette apps
`@modal.wsgi_app()`	Django/Flask apps
`@modal.web_server(port)`	Arbitrary HTTP servers

Dynamic batching

@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs automatically batched
    return model.batch_predict(inputs)
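
The two parameters trade latency for throughput: a batch is released when it reaches `max_batch_size` or when `wait_ms` has elapsed, whichever comes first. The asyncio sketch below imitates that rule locally; it is an illustration of the idea, not Modal's implementation, and `collect_batch` is a hypothetical helper.

```python
import asyncio

async def collect_batch(queue: asyncio.Queue, max_batch_size: int, wait_ms: int) -> list:
    """Wait for one item, then gather more until the batch is full or time is up."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + wait_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break                       # window closed before the batch filled
    return batch

async def demo() -> list[int]:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(10):                 # ten requests arrive at once
        queue.put_nowait(f"input-{i}")
    sizes = []
    while not queue.empty():
        batch = await collect_batch(queue, max_batch_size=4, wait_ms=50)
        sizes.append(len(batch))
    return sizes

batch_sizes = asyncio.run(demo())
print(batch_sizes)   # [4, 4, 2]
```

The last batch is short because the window expires with only two inputs waiting; a larger `wait_ms` fills batches better under light traffic at the cost of added latency.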

Secrets management

# Create secret
modal secret create huggingface HF_TOKEN=hf_xxx

@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]

Scheduling

@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily midnight
def daily_job():
    pass
 
@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass

Performance optimization

Cold start mitigation

@app.function(
    container_idle_timeout=300,  # Keep warm 5 min (newer Modal releases call this scaledown_window)
    allow_concurrent_inputs=10,  # Handle concurrent requests (newer releases use @modal.concurrent)
)
def inference():
    pass
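
The effect of the idle timeout can be estimated from a traffic trace. The sketch below uses a deliberately simple model (one container, instantaneous requests) and a hypothetical helper to count how many requests would hit a cold container.

```python
def count_cold_starts(arrival_times: list[float], idle_timeout: float) -> int:
    """Count requests that find no warm container (single-container model)."""
    cold = 0
    last_seen = None
    for t in sorted(arrival_times):
        if last_seen is None or t - last_seen > idle_timeout:
            cold += 1              # container had already scaled down
        last_seen = t
    return cold

# Requests at these second offsets; the gaps are 30, 200, 400 and 10 seconds.
arrivals = [0, 30, 230, 630, 640]
print(count_cold_starts(arrivals, idle_timeout=60))    # 3
print(count_cold_starts(arrivals, idle_timeout=300))   # 2
```

A longer timeout removes cold starts only for gaps shorter than the timeout, and the idle seconds are paid for, so the right value depends on how bursty the traffic is.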

Model loading best practices

@app.cls(gpu="A100")
class Model:
    @modal.enter()  # Run once at container start
    def load(self):
        self.model = load_model()  # Load during warm-up
 
    @modal.method()
    def predict(self, x):
        return self.model(x)

Parallel processing

@app.function()
def process_item(item):
    return expensive_computation(item)
 
@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out to parallel containers
    results = list(process_item.map(items))
    return results
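
For testing the same fan-out logic without Modal, a thread pool gives a local analogue. Threads stand in for containers here, and `process_item` is a trivial placeholder for the expensive work; results come back in input order, which should match the default ordering of `.map` (check the Modal reference to confirm).

```python
from concurrent.futures import ThreadPoolExecutor

def process_item(item: int) -> int:
    """Stand-in for the expensive per-item work."""
    return item * item

def run_parallel_locally(items: list[int], workers: int = 8) -> list[int]:
    """Fan items out to a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_item, items))

results = run_parallel_locally(list(range(10)))
print(results)   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```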

Common configuration

@app.function(
    gpu="A100",
    memory=32768,              # 32GB RAM
    cpu=4,                     # 4 CPU cores
    timeout=3600,              # 1 hour max
    container_idle_timeout=120, # Keep warm 2 min (scaledown_window in newer releases)
    retries=3,                 # Retry on failure
    concurrency_limit=10,      # Max concurrent containers (max_containers in newer releases)
)
def my_function():
    pass
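
To illustrate what `retries=3` is meant to achieve, here is a plain-Python sketch of retry semantics. It assumes one initial attempt plus up to three retries, with no backoff; `call_with_retries` and `flaky` are hypothetical names, and Modal's own retry policy may differ in detail.

```python
def call_with_retries(fn, retries: int = 3):
    """Call fn once, then up to `retries` more times if it raises."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fn(), attempt
        except Exception as exc:      # a real wrapper would filter exception types
            last_error = exc
    raise last_error

calls = {"n": 0}

def flaky():
    """Fails twice, then succeeds; a stand-in for a transient failure."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(call_with_retries(flaky))   # ('ok', 2)
```

Retries only help with transient failures, so the function body should be safe to run more than once for the same input.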

Debugging

# Test locally
if __name__ == "__main__":
    result = my_function.local()
 
# View logs
# modal app logs my-app

Common issues

Issue	Solution
Cold start latency	Increase `container_idle_timeout`, use `@modal.enter()`
GPU OOM	Use larger GPU (`A100-80GB`), enable gradient checkpointing
Image build fails	Pin dependency versions, check CUDA compatibility
Timeout errors	Increase `timeout`, add checkpointing

References

Advanced Usage - Multi-GPU, distributed training, cost optimization
Troubleshooting - Common issues and solutions

Resources

Documentation: https://modal.com/docs
Examples: https://github.com/modal-labs/modal-examples
Pricing: https://modal.com/pricing
Discord: https://discord.gg/modal
