LLM stress testing with evalscope
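Quick start
A stress-testing tool focused on large language models. It can be customized to support multiple dataset formats and different API protocol formats.
Usage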
Command line

    evalscope perf --help

    usage: evalscope <command> [<args>] perf [-h] --model MODEL [--url URL]
               [--connect-timeout CONNECT_TIMEOUT] [--read-timeout READ_TIMEOUT]
               [-n NUMBER] [--parallel PARALLEL] [--rate RATE]
               [--log-every-n-query LOG_EVERY_N_QUERY]
               [--headers KEY1=VALUE1 [KEY1=VALUE1 ...]]
               [--wandb-api-key WANDB_API_KEY] [--name NAME] [--debug]
               [--tokenizer-path TOKENIZER_PATH] [--api API]
               [--max-prompt-length MAX_PROMPT_LENGTH]
               [--min-prompt-length MIN_PROMPT_LENGTH] [--prompt PROMPT]
               [--query-template QUERY_TEMPLATE] [--dataset DATASET]
               [--dataset-path DATASET_PATH]
               [--frequency-penalty FREQUENCY_PENALTY] [--logprobs]
               [--max-tokens MAX_TOKENS] [--n-choices N_CHOICES] [--seed SEED]
               [--stop STOP] [--stream] [--temperature TEMPERATURE]
               [--top-p TOP_P]

    options:
      -h, --help            show this help message and exit
      --model MODEL         The test model name.
      --url URL
      --connect-timeout CONNECT_TIMEOUT
                            The network connection timeout
      --read-timeout READ_TIMEOUT
                            The network read timeout
      -n NUMBER, --number NUMBER
                            How many requests to be made, if None, will will
                            send request base dataset or prompt.
      --parallel PARALLEL   Set number of concurrency request, default 1
      --rate RATE           Number of requests per second. default None, if it
                            set to -1,then all the requests are sent at time 0.
                            Otherwise, we use Poisson process to synthesize the
                            request arrival times. Mutual exclusion with
                            parallel
      --log-every-n-query LOG_EVERY_N_QUERY
                            Logging every n query.
      --headers KEY1=VALUE1 [KEY1=VALUE1 ...]
                            Extra http headers accepts by key1=value1
                            key2=value2. The headers will be use for each
                            query.You can use this parameter to specify http
                            authorization and other header.
      --wandb-api-key WANDB_API_KEY
                            The wandb api key, if set the metric will be saved
                            to wandb.
      --name NAME           The wandb db result name and result db name,
                            default: {model_name}_{current_time}
      --debug               Debug request send.
      --tokenizer-path TOKENIZER_PATH
                            Specify the tokenizer weight path, used to calculate
                            the number of input and output tokens,usually in the
                            same directory as the model weight.
      --api API             Specify the service api, current support
                            [openai|dashscope]you can define your custom parser
                            with python, and specify the python file path,
                            reference api_plugin_base.py,
      --max-prompt-length MAX_PROMPT_LENGTH
                            Maximum input prompt length
      --min-prompt-length MIN_PROMPT_LENGTH
                            Minimum input prompt length.
      --prompt PROMPT       Specified the request prompt, all the query will use
                            this prompt, You can specify local file via
                            @file_path, the prompt will be the file content.
      --query-template QUERY_TEMPLATE
                            Specify the query template, should be a json string,
                            or local file,with local file, specified with
                            @local_file_path,will will replace model and prompt
                            in the template.
      --dataset DATASET     Specify the dataset
                            [openqa|longalpaca|line_by_line]you can define your
                            custom dataset parser with python, and specify the
                            python file path, reference dataset_plugin_base.py,
      --dataset-path DATASET_PATH
                            Path to the dataset file, Used in conjunction with
                            dataset. If dataset is None, each line defaults to a
                            prompt.
      --frequency-penalty FREQUENCY_PENALTY
                            The frequency_penalty value.
      --logprobs            The logprobs.
      --max-tokens MAX_TOKENS
                            The maximum number of tokens can be generated.
      --n-choices N_CHOICES
                            How may chmpletion choices to generate.
      --seed SEED           The random seed.
      --stop STOP           The stop generating tokens.
      --stop-token-ids      Set the stop token ids.
      --stream              Stream output with SSE.
      --temperature TEMPERATURE
                            The sample temperature.
      --top-p TOP_P         Sampling top p.
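The --rate help text says that request arrival times are synthesized with a Poisson process, and that -1 sends every request at time 0. The following is a minimal sketch of that idea, assuming exponentially distributed gaps between consecutive requests. The function name arrival_times and the fixed seed are illustrative only and are not part of evalscope.

```python
import random
from typing import List


def arrival_times(number: int, rate: float, seed: int = 0) -> List[float]:
    """Return send offsets (seconds from start) for `number` requests.

    Hypothetical helper, not evalscope's implementation.
    rate == -1 means "send everything at time 0". Otherwise the gap between
    consecutive requests is drawn from an exponential distribution with
    mean 1/rate, which produces a Poisson arrival process.
    """
    if rate == -1:
        return [0.0] * number
    if rate <= 0:
        raise ValueError("rate must be positive or -1")
    rng = random.Random(seed)
    times = []
    now = 0.0
    for _ in range(number):
        now += rng.expovariate(rate)  # exponential inter-arrival gap
        times.append(now)
    return times


if __name__ == "__main__":
    print(arrival_times(5, rate=2.0))
    print(arrival_times(3, rate=-1))
```

With a rate of 2.0 the gaps average about half a second, but individual gaps vary, which is what distinguishes this mode from a fixed concurrency set by --parallel.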
Result:

    Total requests: 10
    Succeed requests: 10
    Failed requests: 0
    Average QPS: 0.256
    Average latency: 3.859
    Throughput(average output tokens per second): 23.317
    Average time to first token: 0.007
    Average input tokens per request: 21.800
    Average output tokens per request: 91.100
    Average time per output token: 0.04289
    Average package per request: 93.100
    Average package latency: 0.042
    Percentile of time to first token:
        p50: 0.0021
        p66: 0.0023
        p75: 0.0025
        p80: 0.0030
        p90: 0.0526
        p95: 0.0526
        p98: 0.0526
        p99: 0.0526
    Percentile of request latency:
        p50: 3.9317
        p66: 3.9828
        p75: 4.0153
        p80: 7.2801
        p90: 7.7003
        p95: 7.7003
        p98: 7.7003
        p99: 7.7003
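The summary is a list of "name: value" lines, so it is easy to load into a dictionary for further checks. The sketch below parses a few of the values shown above and compares the reported throughput with Average QPS multiplied by Average output tokens per request (0.256 x 91.1 is about 23.32, against a reported 23.317). That the tool derives throughput this way is an assumption. The helper name parse_summary is hypothetical.

```python
from typing import Dict

# A subset of the summary printed above, one metric per line.
SUMMARY = """\
Total requests: 10
Succeed requests: 10
Failed requests: 0
Average QPS: 0.256
Average latency: 3.859
Throughput(average output tokens per second): 23.317
Average output tokens per request: 91.100
Percentile of request latency:
"""


def parse_summary(text: str) -> Dict[str, float]:
    """Turn 'name: value' lines into a dict, skipping lines without a number."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # header lines such as "Percentile of ...:" carry no value
    return metrics


metrics = parse_summary(SUMMARY)
estimate = metrics["Average QPS"] * metrics["Average output tokens per request"]
reported = metrics["Throughput(average output tokens per second)"]
print(f"QPS x tokens/request = {estimate:.3f}, reported throughput = {reported:.3f}")
```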
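Request parameters
You can set request parameters in the query template and also pass them on the command line (--stop, --stream, --temperature, etc.). The command-line values replace the matching fields in the request or are added to it.
Requests with parameters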
Example: request a llama3 model served by vLLM through its OpenAI-compatible interface.

    evalscope perf --url 'http://127.0.0.1:8000/v1/chat/completions' \
        --parallel 128 --model 'qwen' --log-every-n-query 10 \
        --read-timeout=120 --dataset-path './datasets/open_qa.jsonl' -n 1 \
        --max-prompt-length 128000 --api openai --stream \
        --stop '<|im_end|>' --dataset openqa --debug

    evalscope perf ' ' --parallel 128 --model 'qwen' --log-every-n-query 10 \
        --read-timeout=120 -n 10000 --max-prompt-length 128000 \
        --tokenizer-path "THE_PATH_TO_TOKENIZER/Qwen1.5-32B/" --api openai \
        --query-template '{"model": "%m", "messages": [{"role": "user","content": "%p"}], "stream": true,"skip_special_tokens": false,"stop": ["<|im_end|>"]}' \
        --dataset openqa --dataset-path 'THE_PATH_TO_DATASETS/open_qa.jsonl'
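Long JSON templates are easy to misquote in a shell. Since the help text says --query-template also accepts a local file given as @local_file_path, one option is to write the template with a JSON library and pass the file instead. The sketch below does that and prints the resulting command without running it. The file name query_template.json and the URL are placeholders chosen for illustration.

```python
import json
import shlex
import tempfile
from pathlib import Path

# Same template as in the command above. %m and %p are the placeholders
# that the tool replaces with the model name and the prompt.
template = {
    "model": "%m",
    "messages": [{"role": "user", "content": "%p"}],
    "stream": True,
    "skip_special_tokens": False,
    "stop": ["<|im_end|>"],
}

# Placeholder location; any readable path works.
template_path = Path(tempfile.gettempdir()) / "query_template.json"
template_path.write_text(json.dumps(template), encoding="utf-8")

args = [
    "evalscope", "perf",
    "--url", "http://127.0.0.1:8000/v1/chat/completions",  # placeholder URL
    "--parallel", "128",
    "--model", "qwen",
    "--api", "openai",
    "--dataset", "openqa",
    "--dataset-path", "THE_PATH_TO_DATASETS/open_qa.jsonl",
    "--query-template", f"@{template_path}",
]

# Print a copy-pasteable command instead of executing it.
print(shlex.join(args))
```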
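Using the query template
When a request is more complex, a template keeps the command line short. If a value is set both in the template and as a command-line parameter, the parameter value takes precedence. Query template example:

    evalscope perf --url 'http://127.0.0.1:8000/v1/chat/completions' \
        --parallel 12 --model 'llama3' --log-every-n-query 10 \
        --read-timeout=120 -n 1 --max-prompt-length 128000 --api openai \
        --query-template '{"model": "%m", "messages": [], "stream": true, "stream_options":{"include_usage": true},"n": 3, "stop_token_ids": [128001, 128009]}' \
        --dataset openqa --dataset-path './datasets/open_qa.jsonl'

For messages, the messages produced by the dataset processor replace the messages in the query template.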
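The precedence rules described here (command-line parameters override the template, and dataset messages replace the template's messages) can be sketched as a small merge function. This is an illustration of the rules only, not evalscope's code. The name build_body and the shape of cli_params are assumptions.

```python
import json
from typing import Any, Dict, List, Optional


def build_body(template_json: str,
               cli_params: Dict[str, Optional[Any]],
               dataset_messages: List[Dict[str, str]]) -> Dict[str, Any]:
    """Merge a query template, command-line values and dataset messages."""
    body = json.loads(template_json)
    for key, value in cli_params.items():
        if value is not None:  # only options actually given on the command line
            body[key] = value  # replaces a template field or adds a new one
    body["messages"] = dataset_messages  # dataset messages always win
    return body


template = ('{"model": "%m", "messages": [], "stream": true, '
            '"n": 3, "stop_token_ids": [128001, 128009]}')
body = build_body(
    template,
    cli_params={"model": "llama3", "temperature": 0.7, "n": None},
    dataset_messages=[{"role": "user", "content": "What is a Poisson process?"}],
)
print(json.dumps(body, indent=2))
```

In this example "n" keeps the template value 3 because no command-line value was given, while "temperature" is added and "model" is replaced.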
Starting the client

    # test openai service
    evalscope perf --url 'https://api.openai.com/v1/chat/completions' \
        --parallel 1 --headers 'Authorization=Bearer YOUR_OPENAI_API_KEY' \
        --model 'gpt-4o' --dataset-path 'THE_DATA_TO/open_qa.jsonl' \
        --log-every-n-query 10 --read-timeout=120 -n 100 \
        --max-prompt-length 128000 --api openai --stream --dataset openqa

    # open qa dataset
    # dataset address: https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese/blob/main/open_qa.jsonl
    evalscope perf --url 'http://IP:PORT/v1/chat/completions' \
        --parallel 1 --model 'qwen' --log-every-n-query 1 \
        --read-timeout=120 -n 1000 --max-prompt-length 128000 \
        --tokenizer-path "THE_PATH_TO_TOKENIZER/Qwen1.5-32B/" --api openai \
        --query-template '{"model": "%m", "messages": [{"role": "user","content": "%p"}], "stream": true,"skip_special_tokens": false,"stop": ["<|im_end|>"]}' \
        --dataset openqa --dataset-path 'THE_PATH_TO_DATASETS/open_qa.jsonl'
How to log metrics to wandb

    --wandb-api-key 'your_wandb_api_key' --name 'name_of_wandb_and_result_db'

How to debug
Pass the --debug option to print each request and its response.
How to analyze the results
The tool saves all data from a test run, including requests and responses, to a sqlite3 database, so the data can be analyzed after the test finishes.

    import base64
    import json
    import pickle
    import sqlite3

    result_db_path = 'db_name.db'
    con = sqlite3.connect(result_db_path)
    query_sql = ("SELECT request, response_messages, prompt_tokens, completion_tokens "
                 "FROM result WHERE success='True'")

    # Requests and responses are stored as base64-encoded pickles, saved like this:
    #   base64.b64encode(pickle.dumps(benchmark_data["request"])).decode("ascii")
    # pickle can execute code while loading, so only open databases you produced yourself.
    with con:
        rows = con.execute(query_sql).fetchall()

    for row in rows:
        request = pickle.loads(base64.b64decode(row[0]))
        responses = pickle.loads(base64.b64decode(row[1]))
        response_content = ''
        for response in responses:
            response = json.loads(response)
            # Some stream chunks (for example a usage-only chunk) have no choices or no content.
            choices = response.get('choices') or []
            if choices:
                response_content += choices[0].get('delta', {}).get('content') or ''
        print('prompt: %s, tokens: %s, completion: %s, tokens: %s'
              % (request['messages'][0]['content'], row[2], response_content, row[3]))
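Aggregates can also be computed directly in SQL. The sketch below uses an in-memory database so that it runs on its own. It assumes only the table name result and the columns that appear in the query above, and the three rows are synthetic values inserted for demonstration, not measurements.

```python
import sqlite3

# In-memory stand-in for the result database. Only the columns used by the
# query above are modeled; the real table may contain more.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE result (request TEXT, response_messages TEXT, "
    "prompt_tokens INTEGER, completion_tokens INTEGER, success TEXT)"
)
con.executemany(
    "INSERT INTO result VALUES (?, ?, ?, ?, ?)",
    [
        ("", "", 20, 90, "True"),    # synthetic demo rows
        ("", "", 24, 110, "True"),
        ("", "", 22, 0, "False"),
    ],
)

summary_sql = (
    "SELECT COUNT(*), AVG(prompt_tokens), AVG(completion_tokens) "
    "FROM result WHERE success='True'"
)
with con:
    succeeded, avg_prompt, avg_completion = con.execute(summary_sql).fetchone()

print(f"succeeded={succeeded} avg_prompt_tokens={avg_prompt:.1f} "
      f"avg_completion_tokens={avg_completion:.1f}")
```

To run the same query against a real test run, connect to the result database file instead of ":memory:" and drop the CREATE TABLE and INSERT statements.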
Supported APIs
Requests to the openai, dashscope and zhipu APIs are currently supported. Use --api to select the API. Use --query-template to customize the request, either as a JSON string such as '{"model": "%m", "messages": [{"role": "user","content": "%p"}], "stream": true,"skip_special_tokens": false,"stop": ["<|im_end|>"]}' or as a local file given as @to_query_template_path. %m is replaced with the model and %p with the prompt.
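To make the %m and %p substitution concrete, here is one way it could be done. How evalscope performs the replacement internally is not shown in this document, so this is a sketch under assumptions: it substitutes inside the parsed JSON rather than in the raw string, so that quotes or newlines in a prompt cannot break the JSON. The helper name fill_template is hypothetical.

```python
import json
from typing import Any


def fill_template(node: Any, model: str, prompt: str) -> Any:
    """Recursively replace %m and %p in every string of a parsed template."""
    if isinstance(node, str):
        return node.replace("%m", model).replace("%p", prompt)
    if isinstance(node, list):
        return [fill_template(item, model, prompt) for item in node]
    if isinstance(node, dict):
        return {key: fill_template(value, model, prompt) for key, value in node.items()}
    return node  # numbers, booleans and null are left untouched


template = json.loads(
    '{"model": "%m", "messages": [{"role": "user","content": "%p"}], '
    '"stream": true,"skip_special_tokens": false,"stop": ["<|im_end|>"]}'
)
prompt = 'Say "hi"\nplease'
body = fill_template(template, model="qwen", prompt=prompt)

# The body still serializes to valid JSON despite the quotes and newline.
print(json.dumps(body, ensure_ascii=False))
```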
How to extend the API
To extend the API, create a subclass of ApiPluginBase and annotate it with @register_api("name_of_api"). Implement build_request to build the request from the model, prompt and query template, and parse_responses to return number_of_prompt_tokens and number_of_completion_tokens. You can refer to openai_api.py.

    from abc import abstractmethod
    from typing import Any, Dict, List, Tuple

    # QueryParameters is provided by evalscope.

    class ApiPluginBase:
        def __init__(self, model_path: str) -> None:
            self.model_path = model_path

        @abstractmethod
        def build_request(self, messages: List[Dict], param: QueryParameters)->Dict:
            """Build a api request body.

            Args:
                messages (List[Dict]): The messages generated by dataset.
                param (QueryParameters): The query parameters.

            Raises:
                NotImplementedError: Not implemented.

            Returns:
                Dict: The api request body.
            """
            raise NotImplementedError

        @abstractmethod
        def parse_responses(self, responses: List, request: Any=None, **kwargs:Any) -> Tuple[int, int]:
            """Parser responses and return number of request and response tokens.

            Args:
                responses (List[bytes]): List of http response body, for stream output,
                    there are multiple responses, each is bytes, for general only one.
                request (Any): The request body.

            Returns:
                Tuple: (Number of prompt_tokens and number of completion_tokens).
            """
            raise NotImplementedError
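As an illustration of the extension points described here, the sketch below defines a small plugin with both methods. So that it runs without evalscope installed, it declares simplified stand-ins for ApiPluginBase, QueryParameters and register_api. Those stand-ins, the fields model and stream, the registry dict, and the plugin name my_chat_api are all assumptions made for this example. In real use the base class and decorator come from evalscope.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple, Type

# --- Stand-ins (assumptions) so the sketch is self-contained. ---
API_REGISTRY: Dict[str, Type] = {}


def register_api(name: str) -> Callable[[Type], Type]:
    def decorator(cls: Type) -> Type:
        API_REGISTRY[name] = cls
        return cls
    return decorator


@dataclass
class QueryParameters:
    model: str
    stream: bool = True


class ApiPluginBase(ABC):
    def __init__(self, model_path: str) -> None:
        self.model_path = model_path

    @abstractmethod
    def build_request(self, messages: List[Dict], param: QueryParameters) -> Dict: ...

    @abstractmethod
    def parse_responses(self, responses: List, request: Any = None,
                        **kwargs: Any) -> Tuple[int, int]: ...
# --- End of stand-ins. ---


@register_api("my_chat_api")
class MyChatApi(ApiPluginBase):
    def build_request(self, messages, param):
        return {"model": param.model, "messages": messages, "stream": param.stream}

    def parse_responses(self, responses, request=None, **kwargs):
        # Assumes the server reports token counts in a "usage" object in at
        # least one response body; the last reported values are returned.
        prompt_tokens = completion_tokens = 0
        for raw in responses:
            usage = json.loads(raw).get("usage") or {}
            prompt_tokens = usage.get("prompt_tokens", prompt_tokens)
            completion_tokens = usage.get("completion_tokens", completion_tokens)
        return prompt_tokens, completion_tokens


api = MyChatApi(model_path="")
request = api.build_request([{"role": "user", "content": "hello"}],
                            QueryParameters(model="qwen"))
chunks = [
    b'{"choices": [{"delta": {"content": "hi"}}]}',
    b'{"choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 1}}',
]
counts = api.parse_responses(chunks, request=request)
print(request, counts)
```

When a server does not report usage, a plugin would have to count tokens itself, which is what the --tokenizer-path option is described as being used for.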
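Supported datasets
The line_by_line, longalpaca and openqa datasets are currently supported. line_by_line uses each line as a prompt. longalpaca uses item['instruction'] as the prompt. openqa uses item['question'] as the prompt.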
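The three extraction rules can be sketched as small generator functions. This assumes that open_qa.jsonl holds one JSON object per line (as the .jsonl extension suggests) and that a LongAlpaca file is a single JSON list. The function names are hypothetical, skipping blank lines is a choice made here, and the sample inputs are made up for the demonstration.

```python
import json
from typing import Iterator


def line_by_line_prompts(text: str) -> Iterator[str]:
    """line_by_line: every non-empty line is a prompt."""
    for line in text.splitlines():
        if line.strip():
            yield line.strip()


def openqa_prompts(jsonl_text: str) -> Iterator[str]:
    """openqa: one JSON object per line, prompt taken from item['question']."""
    for line in jsonl_text.splitlines():
        if line.strip():
            yield json.loads(line)["question"]


def longalpaca_prompts(json_text: str) -> Iterator[str]:
    """longalpaca: a JSON list of objects, prompt taken from item['instruction']."""
    for item in json.loads(json_text):
        yield item["instruction"]


plain = list(line_by_line_prompts("first prompt\n\nsecond prompt\n"))
openqa = list(openqa_prompts('{"question": "Why is the sky blue?", "answer": "..."}\n'))
alpaca = list(longalpaca_prompts('[{"instruction": "Summarize the text.", "output": "..."}]'))
print(plain, openqa, alpaca)
```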
How to extend datasets
To extend the datasets, create a subclass of DatasetPluginBase, annotate it with @register_dataset('name_of_dataset'), and implement build_messages (the abstract method of the class below) to yield the request messages.

    import json
    from abc import abstractmethod
    from typing import Dict, Iterator, List

    # QueryParameters is provided by evalscope.

    class DatasetPluginBase:
        def __init__(self, query_parameters: QueryParameters):
            """Build data set plugin

            Args:
                dataset_path (str, optional): The input dataset path. Defaults to None.
            """
            self.query_parameters = query_parameters

        def __next__(self):
            for item in self.build_messages():
                yield item
            raise StopIteration

        def __iter__(self):
            return self.build_messages()

        @abstractmethod
        def build_messages(self)->Iterator[List[Dict]]:
            """Build the request.

            Raises:
                NotImplementedError: The request is not impletion.

            Yields:
                Iterator[List[Dict]]: Yield request messages.
            """
            raise NotImplementedError

        def dataset_line_by_line(self, dataset: str)->Iterator[str]:
            """Get content line by line of dataset.

            Args:
                dataset (str): The dataset path.

            Yields:
                Iterator[str]: Each line of file.
            """
            with open(dataset, 'r', encoding='utf-8') as f:
                for line in f:
                    yield line

        def dataset_json_list(self, dataset: str)->Iterator[Dict]:
            """Read data from file which is list of requests.
            Sample: https://huggingface.co/datasets/Yukang/LongAlpaca-12k

            Args:
                dataset (str): The dataset path.

            Yields:
                Iterator[Dict]: The each request object.
            """
            with open(dataset, 'r', encoding='utf-8') as f:
                content = f.read()
            data = json.loads(content)
            for item in data:
                yield item
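A custom dataset plugin following this pattern might look like the sketch below. To keep it runnable without evalscope, it declares simplified stand-ins for DatasetPluginBase, QueryParameters and register_dataset. The stand-ins, the attribute names dataset_path, min_prompt_length and max_prompt_length, the registry dict, and the dataset name my_lines are assumptions for illustration. Measuring prompt length in characters is also an assumption.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Type

# --- Stand-ins (assumptions) so the sketch is self-contained. ---
DATASET_REGISTRY: Dict[str, Type] = {}


def register_dataset(name: str) -> Callable[[Type], Type]:
    def decorator(cls: Type) -> Type:
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


@dataclass
class QueryParameters:
    dataset_path: str
    min_prompt_length: int = 0
    max_prompt_length: int = 128000


class DatasetPluginBase(ABC):
    def __init__(self, query_parameters: QueryParameters):
        self.query_parameters = query_parameters

    def __iter__(self):
        return self.build_messages()

    @abstractmethod
    def build_messages(self) -> Iterator[List[Dict]]: ...

    def dataset_line_by_line(self, dataset: str) -> Iterator[str]:
        with open(dataset, "r", encoding="utf-8") as f:
            for line in f:
                yield line
# --- End of stand-ins. ---


@register_dataset("my_lines")
class MyLinesDataset(DatasetPluginBase):
    """Each line of the file becomes one single-turn chat request."""

    def build_messages(self) -> Iterator[List[Dict]]:
        params = self.query_parameters
        for line in self.dataset_line_by_line(params.dataset_path):
            prompt = line.strip()
            # Keep only prompts whose character count is within the limits.
            if params.min_prompt_length <= len(prompt) <= params.max_prompt_length:
                yield [{"role": "user", "content": prompt}]


# Demo with a temporary file holding a short line, a blank line and a normal line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("hi\n\nExplain SSE streaming in one sentence.\n")
    path = f.name

messages = list(MyLinesDataset(QueryParameters(dataset_path=path, min_prompt_length=3)))
os.remove(path)
print(messages)
```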