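Evaluations (often called evals) test model outputs to ensure they meet the style and content criteria you specify. Writing evals to understand how well your LLM application performs against your expectations, especially when upgrading or trying new models, is an essential part of building reliable applications.
In this guide, we focus on configuring evals programmatically using the Evals API. If you prefer, you can also configure evals in the OpenAI dashboard.
If you are new to evaluations, or want a more iterative environment to experiment in as you build your eval, consider trying Datasets instead.
Broadly, there are three steps to building and running evals for your LLM application.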
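- Describe the task to be done as an eval
- Run your eval with test inputs (a prompt and input data)
- Analyze the results, then iterate and improve on your prompt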
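This process is somewhat similar to behavior-driven development (BDD), where you specify how the system should behave before implementing and testing it. Let's see how to complete each of these steps using the Evals API.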
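Create an eval for a task
Creating an eval begins by describing a task to be done by a model. Let's say we want to use a model to classify the contents of IT support tickets into one of three categories: Hardware, Software, or Other.
To implement this use case, you can use either the Chat Completions API or the Responses API. Both examples below combine a developer message with a user message containing the text of a support ticket.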
Using the Responses API:

from openai import OpenAI

client = OpenAI()

instructions = """
You are an expert in categorizing IT support tickets. Given the support
ticket below, categorize the request into one of "Hardware", "Software",
or "Other". Respond with only one of those words.
"""

ticket = "My monitor won't turn on - help!"

# The developer message carries the instructions; the user message carries the ticket.
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {"role": "developer", "content": instructions},
        {"role": "user", "content": ticket},
    ],
)

print(response.output_text)
Using the Chat Completions API:

from openai import OpenAI

client = OpenAI()

instructions = """
You are an expert in categorizing IT support tickets. Given the support
ticket below, categorize the request into one of "Hardware", "Software",
or "Other". Respond with only one of those words.
"""

ticket = "My monitor won't turn on - help!"

# The developer message carries the instructions; the user message carries the ticket.
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "developer", "content": instructions},
        {"role": "user", "content": ticket},
    ],
)

print(completion.choices[0].message.content)

Let's set up an eval to test this behavior via the API. An eval needs two key ingredients:
- data_source_config: a schema for the test data you will use with the eval.
- testing_criteria: the graders that determine whether the model output is correct.
curl https://api.openai.com/v1/evals \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "IT Ticket Categorization",
    "data_source_config": {
      "type": "custom",
      "item_schema": {
        "type": "object",
        "properties": {
          "ticket_text": { "type": "string" },
          "correct_label": { "type": "string" }
        },
        "required": ["ticket_text", "correct_label"]
      },
      "include_sample_schema": true
    },
    "testing_criteria": [
      {
        "type": "string_check",
        "name": "Match output to human label",
        "input": "{{ sample.output_text }}",
        "operation": "eq",
        "reference": "{{ item.correct_label }}"
      }
    ]
  }'

After you create the eval, it is assigned a UUID that you will need later when kicking off a run.
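
For readers scripting this step in Python, the request body from the curl command above can be assembled as a dictionary. The following is only a sketch: the helper name `build_eval_request` and its parameters are made up for illustration, it uses the standard library rather than an SDK, and it builds the request without sending it.

```python
import json
import os
import urllib.request


def build_eval_request(name, item_fields, label_field):
    """Build (but do not send) a POST request that creates an eval.

    Hypothetical helper: every field in item_fields is declared as a
    required string, mirroring the schema used in this guide.
    """
    payload = {
        "name": name,
        "data_source_config": {
            "type": "custom",
            "item_schema": {
                "type": "object",
                "properties": {f: {"type": "string"} for f in item_fields},
                "required": list(item_fields),
            },
            "include_sample_schema": True,
        },
        "testing_criteria": [
            {
                "type": "string_check",
                "name": "Match output to human label",
                "input": "{{ sample.output_text }}",
                "operation": "eq",
                "reference": "{{ item." + label_field + " }}",
            }
        ],
    }
    api_key = os.environ.get("OPENAI_API_KEY", "")
    return urllib.request.Request(
        "https://api.openai.com/v1/evals",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_eval_request(
    "IT Ticket Categorization", ["ticket_text", "correct_label"], "correct_label"
)
body = json.loads(request.data)
print(request.get_method(), request.full_url)
print(json.dumps(body["testing_criteria"][0], indent=2))
# Sending is left to the caller, for example urllib.request.urlopen(request).
```

Keeping the payload construction separate from the network call makes it easy to inspect or unit test the eval definition before anything is created on the platform.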
The API responds with the new eval object:

{
  "object": "eval",
  "id": "eval_67e321d23b54819096e6bfe140161184",
  "data_source_config": {
    "type": "custom",
    "schema": { ... omitted for brevity... }
  },
  "testing_criteria": [
    {
      "name": "Match output to human label",
      "id": "Match output to human label-c4fdf789-2fa5-407f-8a41-a6f4f9afd482",
      "type": "string_check",
      "input": "{{ sample.output_text }}",
      "reference": "{{ item.correct_label }}",
      "operation": "eq"
    }
  ],
  "name": "IT Ticket Categorization",
  "created_at": 1742938578,
  "metadata": {}
}

Now that we've created an eval that describes the desired behavior of our application, let's test a prompt with a set of test data.
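
To make the `string_check` criterion above more concrete, here is a small local stand-in for it. It assumes that the `eq` operation is an exact, case-sensitive string comparison; the function name and the sample outputs are invented for illustration, and this is not the platform's grader implementation.

```python
def string_check(model_output, reference, operation="eq"):
    """Local stand-in for a string_check criterion; only "eq" is sketched."""
    if operation != "eq":
        raise ValueError(f"operation {operation!r} is not covered by this sketch")
    # Assumption: "eq" means the two strings are identical, including case.
    return model_output == reference


# Hypothetical model outputs paired with human labels.
cases = [
    ("Hardware", "Hardware"),   # exact match passes
    ("hardware", "Hardware"),   # differs in case, so it fails here
    ("Hardware.", "Hardware"),  # trailing punctuation also fails
]
results = [string_check(output, label) for output, label in cases]
print(results)  # [True, False, False]
```

Under that assumption, a strict comparison like this is one reason the prompt asks the model to respond with only one of the three words.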
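Test a prompt with your eval
Now that we have defined how we want our app to behave in an eval, let's construct a prompt that reliably generates the correct output for a representative sample of test data.
Uploading test data
There are several ways to provide test data for eval runs, but it may be convenient to upload a JSONL file that contains data in the schema we specified when we created the eval. A sample JSONL file that conforms to that schema is below: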
{ "item": { "ticket_text": "My monitor won't turn on!", "correct_label": "Hardware" } }
{ "item": { "ticket_text": "I'm in vim and I can't quit!", "correct_label": "Software" } }
{ "item": { "ticket_text": "Best restaurants in Cleveland?", "correct_label": "Other" } }

This data set contains both the test inputs and the ground truth labels to compare model outputs against.
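
Before uploading a file like this, it can be worth checking locally that every line parses and carries the fields the schema requires. The sketch below does a simple hand-rolled check (not full JSON Schema validation); the function name `validate_jsonl` is made up for this example.

```python
import json

# Fields marked as required in the eval's item_schema.
REQUIRED_FIELDS = ("ticket_text", "correct_label")

SAMPLE_JSONL = """\
{ "item": { "ticket_text": "My monitor won't turn on!", "correct_label": "Hardware" } }
{ "item": { "ticket_text": "I'm in vim and I can't quit!", "correct_label": "Software" } }
{ "item": { "ticket_text": "Best restaurants in Cleveland?", "correct_label": "Other" } }
"""


def validate_jsonl(text):
    """Return (line_number, problem) pairs; an empty list means every row is usable."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        item = row.get("item") if isinstance(row, dict) else None
        if not isinstance(item, dict):
            problems.append((number, 'missing "item" object'))
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(item.get(field), str):
                problems.append((number, f"{field} must be a string"))
    return problems


print(validate_jsonl(SAMPLE_JSONL))                      # []
print(validate_jsonl('{"item": {"ticket_text": "x"}}'))  # [(1, 'correct_label must be a string')]
```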
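Next, let's upload the test data file to the OpenAI platform so we can reference it later. You can upload files in the dashboard, but you can also upload files via the API. The sample below assumes you are running the command in a directory where you saved the sample JSON data above to a file called tickets.jsonl: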
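curl https://api.openai.com/v1/files \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F purpose="evals" \
  -F file="@tickets.jsonl"

When you upload the file, make note of the unique id property in the response payload (it is also visible in the UI if you uploaded via the browser). We will need to reference that value later: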
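{
  "object": "file",
  "id": "file-CwHg45Fo7YXwkWRPUkLNHW",
  "purpose": "evals",
  "filename": "tickets.jsonl",
  "bytes": 208,
  "created_at": 1742834798,
  "expires_at": null,
  "status": "processed",
  "status_details": null
}

Creating an eval run
With our test data in place, let's evaluate a prompt and see how it performs against our test criteria. Via the API, we can do this by creating an eval run.
Make sure to replace YOUR_EVAL_ID and YOUR_FILE_ID with the unique IDs of the eval configuration and the test data file you created in the steps above.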
Using the Responses API:

curl https://api.openai.com/v1/evals/YOUR_EVAL_ID/runs \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Categorization text run",
    "data_source": {
      "type": "responses",
      "model": "gpt-4.1",
      "input_messages": {
        "type": "template",
        "template": [
          {"role": "developer", "content": "You are an expert in categorizing IT support tickets. Given the support ticket below, categorize the request into one of Hardware, Software, or Other. Respond with only one of those words."},
          {"role": "user", "content": "{{ item.ticket_text }}"}
        ]
      },
      "source": { "type": "file_id", "id": "YOUR_FILE_ID" }
    }
  }'
Using the Chat Completions API:

curl https://api.openai.com/v1/evals/YOUR_EVAL_ID/runs \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Categorization text run",
    "data_source": {
      "type": "completions",
      "model": "gpt-4.1",
      "input_messages": {
        "type": "template",
        "template": [
          {"role": "developer", "content": "You are an expert in categorizing IT support tickets. Given the support ticket below, categorize the request into one of Hardware, Software, or Other. Respond with only one of those words."},
          {"role": "user", "content": "{{ item.ticket_text }}"}
        ]
      },
      "source": { "type": "file_id", "id": "YOUR_FILE_ID" }
    }
  }'

When we create the run, we set up a prompt using either a Chat Completions messages array or a Responses input. This prompt is used to generate a model response for every line of test data in the data set. We can use the double curly brace syntax to template in the dynamic variable item.ticket_text, which is drawn from the current test data item.
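
The templating step can be pictured with a small local sketch. This is only an approximation of what happens for each row, assuming placeholders are dotted paths inside double curly braces; the `render` and `render_messages` helpers are invented for illustration, and the developer message is abbreviated.

```python
import re

# Matches {{ item.ticket_text }} with or without spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, context):
    """Replace each {{ dotted.path }} with the value found in context."""
    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


def render_messages(messages, item):
    """Fill in every message of a template for one test data item."""
    return [
        {"role": m["role"], "content": render(m["content"], {"item": item})}
        for m in messages
    ]


template = [
    {"role": "developer", "content": "Categorize the ticket as Hardware, Software, or Other."},
    {"role": "user", "content": "{{ item.ticket_text }}"},
]
item = {"ticket_text": "My monitor won't turn on!", "correct_label": "Hardware"}
rendered = render_messages(template, item)
print(rendered[1]["content"])  # My monitor won't turn on!
```

Each row of the JSONL file produces its own rendered message list, which is why the `item` keys in the template have to match the field names in the data.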
If the eval run is created successfully, you'll receive an API response that looks like this:
Using the Responses API:

{
  "object": "eval.run",
  "id": "evalrun_67e44c73eb6481909f79a457749222c7",
  "eval_id": "eval_67e44c5becec81909704be0318146157",
  "report_url": "https://platform.openai.com/evaluation/evals/abc123",
  "status": "queued",
  "model": "gpt-4.1",
  "name": "Categorization text run",
  "created_at": 1743015028,
  "result_counts": { ... },
  "per_model_usage": null,
  "per_testing_criteria_results": null,
  "data_source": {
    "type": "responses",
    "source": {
      "type": "file_id",
      "id": "file-J7MoX9ToHXp2TutMEeYnwj"
    },
    "input_messages": {
      "type": "template",
      "template": [
        {
          "type": "message",
          "role": "developer",
          "content": {
            "type": "input_text",
            "text": "You are an expert in...."
          }
        },
        {
          "type": "message",
          "role": "user",
          "content": {
            "type": "input_text",
            "text": "{{item.ticket_text}}"
          }
        }
      ]
    },
    "model": "gpt-4.1",
    "sampling_params": null
  },
  "error": null,
  "metadata": {}
}
Using the Chat Completions API:

{
  "object": "eval.run",
  "id": "evalrun_67e44c73eb6481909f79a457749222c7",
  "eval_id": "eval_67e44c5becec81909704be0318146157",
  "report_url": "https://platform.openai.com/evaluation/evals/abc123",
  "status": "queued",
  "model": "gpt-4.1",
  "name": "Categorization text run",
  "created_at": 1743015028,
  "result_counts": { ... },
  "per_model_usage": null,
  "per_testing_criteria_results": null,
  "data_source": {
    "type": "completions",
    "source": {
      "type": "file_id",
      "id": "file-J7MoX9ToHXp2TutMEeYnwj"
    },
    "input_messages": {
      "type": "template",
      "template": [
        {
          "type": "message",
          "role": "developer",
          "content": {
            "type": "input_text",
            "text": "You are an expert in...."
          }
        },
        {
          "type": "message",
          "role": "user",
          "content": {
            "type": "input_text",
            "text": "{{item.ticket_text}}"
          }
        }
      ]
    },
    "model": "gpt-4.1",
    "sampling_params": null
  },
  "error": null,
  "metadata": {}
}

Your eval run has now been queued. It will execute asynchronously, processing every row in your data set and generating responses for testing with the prompt and model we specified.
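Analyze the results
To receive updates when a run succeeds, fails, or is canceled, create a webhook endpoint and subscribe to the eval.run.succeeded, eval.run.failed, and eval.run.canceled events. See the webhooks guide for more details.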
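
As a rough illustration of what a handler for these events might do, the sketch below dispatches on the event name. It assumes the event body is JSON with a `type` field and the run ID under `data.id`; that layout, the `handle_webhook` function, and the sample payload are assumptions made for this example, so check the webhooks guide for the actual payload. Signature verification and the HTTP server itself are left out.

```python
import json

# The three eval run events named above, mapped to an illustrative follow-up.
HANDLED_EVENTS = {
    "eval.run.succeeded": "fetch and review the results",
    "eval.run.failed": "inspect the run's error field",
    "eval.run.canceled": "decide whether to start a new run",
}


def handle_webhook(raw_body):
    """Map an incoming event body to a follow-up action, or None if ignored."""
    event = json.loads(raw_body)
    action = HANDLED_EVENTS.get(event.get("type"))
    if action is None:
        return None  # not one of the three eval run events
    # Assumed field layout: the run ID lives under data.id.
    run_id = event.get("data", {}).get("id", "unknown")
    return f"{run_id}: {action}"


# Hypothetical payload, shaped according to the assumption above.
sample = json.dumps({"type": "eval.run.succeeded", "data": {"id": "evalrun_123"}})
print(handle_webhook(sample))  # evalrun_123: fetch and review the results
```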
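Depending on the size of your data set, the eval run may take some time to complete. You can view the current status in the dashboard, but you can also fetch the current status of an eval run via the API: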
curl https://api.openai.com/v1/evals/YOUR_EVAL_ID/runs/YOUR_RUN_ID \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json"

You'll need the UUIDs of both the eval and the eval run to fetch its status. When you do, you'll see eval run data that looks like this:
Using the Responses API:

{
  "object": "eval.run",
  "id": "evalrun_67e44c73eb6481909f79a457749222c7",
  "eval_id": "eval_67e44c5becec81909704be0318146157",
  "report_url": "https://platform.openai.com/evaluation/evals/xxx",
  "status": "completed",
  "model": "gpt-4.1",
  "name": "Categorization text run",
  "created_at": 1743015028,
  "result_counts": {
    "total": 3,
    "errored": 0,
    "failed": 0,
    "passed": 3
  },
  "per_model_usage": [
    {
      "model_name": "gpt-4o-2024-08-06",
      "invocation_count": 3,
      "prompt_tokens": 166,
      "completion_tokens": 6,
      "total_tokens": 172,
      "cached_tokens": 0
    }
  ],
  "per_testing_criteria_results": [
    {
      "testing_criteria": "Match output to human label-40d67441-5000-4754-ab8c-181c125803ce",
      "passed": 3,
      "failed": 0
    }
  ],
  "data_source": {
    "type": "responses",
    "source": {
      "type": "file_id",
      "id": "file-J7MoX9ToHXp2TutMEeYnwj"
    },
    "input_messages": {
      "type": "template",
      "template": [
        {
          "type": "message",
          "role": "developer",
          "content": {
            "type": "input_text",
            "text": "You are an expert in categorizing IT support tickets. Given the support ticket below, categorize the request into one of Hardware, Software, or Other. Respond with only one of those words."
          }
        },
        {
          "type": "message",
          "role": "user",
          "content": {
            "type": "input_text",
            "text": "{{item.ticket_text}}"
          }
        }
      ]
    },
    "model": "gpt-4.1",
    "sampling_params": null
  },
  "error": null,
  "metadata": {}
}
Using the Chat Completions API:

{
  "object": "eval.run",
  "id": "evalrun_67e44c73eb6481909f79a457749222c7",
  "eval_id": "eval_67e44c5becec81909704be0318146157",
  "report_url": "https://platform.openai.com/evaluation/evals/xxx",
  "status": "completed",
  "model": "gpt-4.1",
  "name": "Categorization text run",
  "created_at": 1743015028,
  "result_counts": {
    "total": 3,
    "errored": 0,
    "failed": 0,
    "passed": 3
  },
  "per_model_usage": [
    {
      "model_name": "gpt-4o-2024-08-06",
      "invocation_count": 3,
      "prompt_tokens": 166,
      "completion_tokens": 6,
      "total_tokens": 172,
      "cached_tokens": 0
    }
  ],
  "per_testing_criteria_results": [
    {
      "testing_criteria": "Match output to human label-40d67441-5000-4754-ab8c-181c125803ce",
      "passed": 3,
      "failed": 0
    }
  ],
  "data_source": {
    "type": "completions",
    "source": {
      "type": "file_id",
      "id": "file-J7MoX9ToHXp2TutMEeYnwj"
    },
    "input_messages": {
      "type": "template",
      "template": [
        {
          "type": "message",
          "role": "developer",
          "content": {
            "type": "input_text",
            "text": "You are an expert in categorizing IT support tickets. Given the support ticket below, categorize the request into one of Hardware, Software, or Other. Respond with only one of those words."
          }
        },
        {
          "type": "message",
          "role": "user",
          "content": {
            "type": "input_text",
            "text": "{{item.ticket_text}}"
          }
        }
      ]
    },
    "model": "gpt-4.1",
    "sampling_params": null
  },
  "error": null,
  "metadata": {}
}

The API response contains granular information about test criteria results, the API usage incurred while generating model responses, and a report_url property that takes you to a page in the dashboard where you can explore the results visually.
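
If you would rather script the fetch-and-inspect step than repeat it by hand, a polling loop is one option. The sketch below takes any callable that returns the run as a dictionary, so it runs here against canned data instead of the real endpoint. The set of terminal statuses is an assumption inferred from the webhook event names, and `wait_for_run` and `pass_rate` are hypothetical helpers.

```python
import time

# Assumed terminal statuses; only "queued" and "completed" appear in this guide.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_run(fetch_run, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Call fetch_run() until the run reports a terminal status, then return it."""
    for _ in range(max_polls):
        run = fetch_run()
        if run["status"] in TERMINAL_STATUSES:
            return run
        sleep(poll_seconds)
    raise TimeoutError("eval run did not finish within the polling budget")


def pass_rate(run):
    """Fraction of rows that passed, based on the run's result_counts."""
    counts = run["result_counts"]
    return counts["passed"] / counts["total"] if counts["total"] else 0.0


# Stand-in for the GET request shown earlier: queued twice, then completed.
responses = iter([
    {"status": "queued"},
    {"status": "queued"},
    {"status": "completed",
     "result_counts": {"total": 3, "errored": 0, "failed": 0, "passed": 3}},
])
run = wait_for_run(lambda: next(responses), sleep=lambda seconds: None)
print(run["status"], pass_rate(run))  # completed 1.0
```

In real use, `fetch_run` would wrap the status request shown above, and the default `time.sleep` would pace the polling.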
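In our simple test, the model reliably generated the content we wanted for a small sample of test cases. In practice, you will often need to run your eval with more criteria, different prompts, and different data sets. But the process above gives you all the tools you need to build robust evals for your LLM apps!
Next steps
Now you know how to create and run evals via the API and using the dashboard! Here are a few other resources that may be useful as you continue to improve your model results.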
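- Keep tabs on the performance of your prompts as you iterate on them.
- Compare the results of many different prompts and models at once.
- Examine stored completions to test for prompt regressions.
- Improve a model's ability to generate responses tailored to your use case.
- Learn how to distill large model results into smaller, cheaper, and faster models.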