110 changes: 92 additions & 18 deletions weave.mdx
@@ -4,30 +4,104 @@ description: "Track, test, and improve language model apps with W&B Weave"
mode: wide
---

<Info>
W&B Inference comes free with your account. Get access to open source models through the API and Weave Playground.
- [Quickstart](/weave/quickstart-inference)
- [Product page](https://wandb.ai/site/inference)
</Info>
W&B Weave is an observability and evaluation platform that helps you track, evaluate, and improve your LLM application's performance. With Weave, you can:

W&B Weave helps you build better language model apps. Use Weave to track, test, and improve your apps:
* [Trace](/weave/quickstart) your application's LLM calls, capturing inputs, outputs, costs, and latency
* [Evaluate](/weave/guides/core-types/evaluations) and [monitor](/weave/guides/evaluation/guardrails_and_monitors) your application's responses using scorers and LLM judges
* [Log versions](/weave/tutorial-weave_models) of your application's code, prompts, datasets, and other attributes
* [Create leaderboards](/weave/guides/core-types/leaderboards) to track and compare your application's performance over time
* [Integrate Weave into your W&B reinforcement-learning training runs](/weave/guides/tools/weave-in-workspaces) to gain observability into how your models perform during training

- **Track & Watch**: See how your language model calls work in live systems.
- **Test Changes**: Try new prompts, data, and models safely.
- **Run Tests**: Test models and prompts in the Playground.
- **Check Performance**: Use response evaluation tools to track and measure how well your LLM app performs.
- **Add Safety**: Protect your app with content filters and prompt guards.
Weave works with many [popular frameworks](/weave/guides/integrations) and has both [Python](/weave/reference/python-sdk) and [TypeScript SDKs](/weave/reference/typescript-sdk).
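
For example, tracing a call is as simple as decorating a function with `@weave.op`. The following is a minimal sketch drawn from the larger evaluation example later on this page; the project name, function name, and prompt are placeholders:

```python
import weave
from openai import OpenAI

# Initialize Weave with a project name (placeholder)
weave.init("my-first-project")

client = OpenAI()

# Decorating a function with @weave.op records its inputs, outputs,
# latency, and the LLM calls made inside it as a trace
@weave.op
def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

ask("What is the capital of France?")
```

Each call to `ask` then appears as a trace in the Weave UI, with its inputs, outputs, cost, and latency captured automatically.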

Connect Weave to your code with:
- [Python SDK](/weave/reference/python-sdk)
- [TypeScript SDK](/weave/reference/typescript-sdk)
- [Service API](/weave/reference/service-api)
## Get Started

Weave works with many language model providers, local models, and tools.
See the following quickstart docs to install Weave and learn how to integrate it into your code:

## Get started
* [Track LLM inputs and outputs](/weave/quickstart)
* [Learn Weave with W&B Inference](/weave/quickstart-inference)

New to Weave? Start with the [Python quickstart](/weave/quickstart) or TypeScript quickstart.
You can also review the following Python example for a quick overview of how Weave fits into your code:

<Accordion title="Send requests to OpenAI and evaluate their responses" >

The following example sends simple math questions to OpenAI and then evaluates the responses for correctness in parallel using a custom class-based `CorrectnessScorer`:

<a target="_blank" href="https://colab.research.google.com/github/wandb/docs/blob/main/weave/cookbooks/source/intro_page_example.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

```python lines
import weave
from openai import OpenAI
from weave import Scorer
import asyncio

# Initialize Weave
weave.init("parallel-evaluation")

# Create OpenAI client
client = OpenAI()

# Define your model as a weave.op function
@weave.op
def math_model(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# Create a dataset with questions and expected answers
dataset = [
    {"question": "What is 2+2?", "expected": "4"},
    {"question": "What is 5+3?", "expected": "8"},
    {"question": "What is 10-7?", "expected": "3"},
    {"question": "What is 12*3?", "expected": "36"},
    {"question": "What is 100/4?", "expected": "25"},
]

# Define a class-based scorer
class CorrectnessScorer(Scorer):
    """Scorer that checks if the answer is correct"""

    @weave.op
    def score(self, question: str, expected: str, output: str) -> dict:
        """Check if the model output contains the expected answer"""
        import re

        # Extract numbers from the output
        numbers = re.findall(r'\d+', output)

        if numbers:
            answer = numbers[0]
            correct = answer == expected
        else:
            correct = False

        return {
            "correct": correct,
            "extracted_answer": numbers[0] if numbers else None,
            "contains_expected": expected in output
        }

# Instantiate the scorer
correctness_scorer = CorrectnessScorer()

# Create an evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[correctness_scorer]
)

# Run the evaluation - automatically evaluates examples in parallel
asyncio.run(evaluation.evaluate(math_model))
```

To use this example, follow the [installation instructions](/weave/quickstart#1-install-w%26b-weave-and-create-an-api-key) in the first step of the quickstart. You also need an [OpenAI API key](https://platform.openai.com/api-keys).
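
If you run this script locally rather than in the notebook, a snippet like the following (mirroring the notebook's setup cell) lets you supply the key interactively before calling `weave.init`:

```python
import os
from getpass import getpass

# Prompt for an OpenAI API key only if one isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```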
</Accordion>

## Advanced guides

188 changes: 188 additions & 0 deletions weave/cookbooks/source/intro_page_example.ipynb
@@ -0,0 +1,188 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parallel Evaluation with W&B Weave\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/docs/blob/main/weave/cookbooks/source/parallel_evaluation_example.ipynb)\n",
"\n",
"This notebook demonstrates how to use W&B Weave to send math questions to OpenAI and evaluate the responses for correctness in parallel.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"First, install the required packages:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install weave openai -qU\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup API Keys\n",
"\n",
"Add your W&B and OpenAI API keys:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"# Set your OpenAI API key\n",
"if not os.environ.get(\"OPENAI_API_KEY\"):\n",
" os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")\n",
"\n",
"# Log in to W&B\n",
"import wandb\n",
"wandb.login()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parallel Evaluation Example\n",
"\n",
"Run the evaluation example:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import weave\n",
"from openai import OpenAI\n",
"from weave import Scorer\n",
"import asyncio\n",
"\n",
"# Initialize Weave\n",
"weave.init(\"parallel-evaluation\")\n",
"\n",
"# Create OpenAI client\n",
"client = OpenAI()\n",
"\n",
"# Define your model as a weave.op function\n",
"@weave.op\n",
"def math_model(question: str) -> str:\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": question}\n",
" ]\n",
" )\n",
" return response.choices[0].message.content\n",
"\n",
"# Create a dataset with questions and expected answers\n",
"dataset = [\n",
" {\"question\": \"What is 2+2?\", \"expected\": \"4\"},\n",
" {\"question\": \"What is 5+3?\", \"expected\": \"8\"},\n",
" {\"question\": \"What is 10-7?\", \"expected\": \"3\"},\n",
" {\"question\": \"What is 12*3?\", \"expected\": \"36\"},\n",
" {\"question\": \"What is 100/4?\", \"expected\": \"25\"},\n",
"]\n",
"\n",
"# Define a class-based scorer\n",
"class CorrectnessScorer(Scorer):\n",
" \"\"\"Scorer that checks if the answer is correct\"\"\"\n",
" \n",
" @weave.op\n",
" def score(self, question: str, expected: str, output: str) -> dict:\n",
" \"\"\"Check if the model output contains the expected answer\"\"\"\n",
" import re\n",
" \n",
" # Extract numbers from the output\n",
" numbers = re.findall(r'\\d+', output)\n",
" \n",
" if numbers:\n",
" answer = numbers[0]\n",
" correct = answer == expected\n",
" else:\n",
" correct = False\n",
" \n",
" return {\n",
" \"correct\": correct,\n",
" \"extracted_answer\": numbers[0] if numbers else None,\n",
" \"contains_expected\": expected in output\n",
" }\n",
"\n",
"# Instantiate the scorer\n",
"correctness_scorer = CorrectnessScorer()\n",
"\n",
"# Create an evaluation\n",
"evaluation = weave.Evaluation(\n",
" dataset=dataset,\n",
" scorers=[correctness_scorer]\n",
")\n",
"\n",
"# Run the evaluation - automatically evaluates examples in parallel\n",
"await evaluation.evaluate(math_model)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Note for Google Colab Users\n",
"\n",
"If you're running this notebook in Google Colab, you may need to handle async differently. Use this version instead:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For Google Colab, use this approach:\n",
"import nest_asyncio\n",
"nest_asyncio.apply()\n",
"\n",
"# Then run the evaluation\n",
"asyncio.run(evaluation.evaluate(math_model))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## View Results\n",
"\n",
"After running the evaluation, you can view the results in the W&B Weave dashboard. The evaluation shows:\n",
"\n",
"1. **Parallel execution**: All examples are evaluated simultaneously for faster results\n",
"2. **Correctness scores**: Each response is scored based on whether it contains the correct answer\n",
"3. **Detailed metrics**: Including extracted answers and whether the expected value was found\n",
"\n",
"Visit your [W&B Weave dashboard](https://wandb.ai/home) to explore the evaluation results in detail.\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}