# InternVL Best Practice
This document applies to the InternVL family of models. The walkthrough below uses internvl-chat-v1_5 as an example; you can switch to another model by specifying `--model_type`.
## FAQ
- The model page reports "The request model does not exist!"
This issue usually appears when trying to use the mini-internvl or InternVL2 models, because the corresponding models on modelscope require an application process. To resolve it, log in to modelscope, go to the corresponding model page, and apply for download access. Once approved, you can obtain the model in either of the following ways:
  - Use `snapshot_download` to download the model locally (the relevant code can be found in the model download section of the model page), then specify the local model path with `--model_id_or_path`.
  - Obtain your account's SDK token and pass it via the `--hub_token` parameter or the `MODELSCOPE_API_TOKEN` environment variable.
- Why is memory allocated unevenly across multiple GPU cards when running the model, causing OOM?
The automatic device mapping algorithm in transformers is not friendly to multimodal models and may lead to uneven memory allocation across GPU cards.
  - You can set the memory usage of each card with the `--device_max_memory` parameter; for example, in a four-card environment you can set `--device_max_memory 15GB 15GB 15GB 15GB`.
  - Alternatively, you can explicitly specify the device map with `--device_map_config_path`.
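For intuition, the per-card limits passed to `--device_max_memory` correspond to the kind of `max_memory` dictionary that transformers' `device_map='auto'` machinery accepts. The helper below is a hypothetical illustration of that mapping, not swift's actual implementation:

```python
# Hypothetical helper: turn a list of per-card limits (as passed to
# --device_max_memory) into a {device_index: limit} dict of the shape
# that transformers' device_map='auto' machinery accepts.
# This is an illustration, not swift's actual implementation.
def build_max_memory(limits):
    return {i: limit for i, limit in enumerate(limits)}

print(build_max_memory(['15GB', '15GB', '15GB', '15GB']))
# {0: '15GB', 1: '15GB', 2: '15GB', 3: '15GB'}
```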
- Differences between the InternVL2 models and their predecessors (InternVL-V1.5 and Mini-InternVL)
  - The InternVL2 models support multi-turn multi-image inference and training, i.e., multi-turn conversations with images, with text and images interleaved within a single turn; see the InternVL2 part of the Inference section and the Custom Dataset section for details. The predecessor models support multi-turn conversations, but images can only appear in a single turn.
  - The InternVL2 models support video input; the format is described in the Custom Dataset section.
## Table of Contents
- Environment Setup
- Inference
- Fine-tuning
- Custom Dataset
- Inference After Fine-tuning

## Environment Setup

```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
pip install Pillow
```
## Inference
**Note**

- To use a local model file, add the argument `--model_id_or_path /path/to/model`.
- If your GPU does not support flash attention, use the argument `--use_flash_attn false`. For the int8 model, `--dtype bf16` needs to be specified at inference time, otherwise the output may be garbled.
- The model's configuration specifies a relatively small `max_length` of 2048; it can be modified by setting `--max_length`.
- Memory consumption can be reduced with the argument `--gradient_checkpointing true`.

```shell
# Experimental environment: A100
# 55GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type internvl-chat-v1_5 --dtype bf16 --max_length 4096

# 2*30GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internvl-chat-v1_5 --dtype bf16 --max_length 4096
```
Output: (supports passing a local path or URL)

```
"""
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
--------------------------------------------------
<<< clear
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< clear
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 59,856.
--------------------------------------------------
<<< clear
<<< Write a poem based on the content of the picture.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
Token indices sequence length is longer than the specified maximum sequence length for this model (5142 > 4096). Running this sequence through the model will result in indexing errors
In the still of the night,
A lone boat sails on the light.
The stars above, a twinkling sight,
Reflecting in the water's might.

The trees stand tall, a silent guard,
Their leaves rustling in the yard.
The boatman's lantern, a beacon bright,
Guiding him through the night.

The river flows, a gentle stream,
Carrying the boatman's dream.
His journey long, his heart serene,
In the beauty of the scene.

The stars above, a guiding light,
Leading him through the night.
The boatman's journey, a tale to tell,
Of courage, hope, and love as well.
"""
```
The InternVL2 series models support multi-turn multi-image inference; within a single turn, images and text can be interleaved, and the position of each image is determined by the input.
Example of multi-turn multi-image input:

```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type internvl2-2b
```
```
<<< <image>describe the image
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
The image depicts a kitten with striking blue eyes and a mix of white and black fur. The kitten's fur appears soft and fluffy, and its ears are large and pointed. The eyes of the kitten are particularly expressive, giving it a curious and endearing appearance. The background is blurred, drawing attention to the kitten's detailed features. The overall composition of the image highlights the kitten's innocence and charm.
--------------------------------------------------
<<< <image>What is the difference from the last picture
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
The difference between the last picture and the one in the image is primarily in the facial expressions and the overall demeanor of the sheep.

In the first picture, the sheep have a calm and curious expression, with large, expressive eyes and a gentle demeanor. The background is blurred, emphasizing the sheep's features.

In the second picture, the sheep have a more serious and focused expression. Their eyes are slightly narrowed, and their ears are perked up, giving them a more alert and attentive appearance. The background remains the same, maintaining the focus on the sheep.

In the third picture, the sheep have a more neutral expression. Their eyes are slightly narrowed, and their ears are perked up, but their overall demeanor is still calm and attentive. The background remains the same, ensuring the focus remains on the sheep.

In the fourth picture, the sheep have a more neutral and calm expression. Their eyes are slightly narrowed, and their ears are perked up, but their overall demeanor is still calm and attentive. The background remains the same, maintaining the focus on the sheep.

Overall, the main difference is the facial expressions and the overall demeanor of the sheep, with the second picture showing a more alert and focused look compared to the other three pictures.
--------------------------------------------------
<<< What the two pictures have in common
The two pictures have several similarities:

1. **Foreground Sheep**: All four sheep in the images are in the foreground, standing on a grassy field. This creates a sense of unity and focus on the sheep.
2. **Expression and Demeanor**: Each sheep has a distinct expression and demeanor. The first picture shows a curious and calm sheep, the second shows a more alert and focused sheep, the third shows a neutral and calm sheep, and the fourth shows a neutral and calm sheep. Despite the differences in expressions, all four sheep maintain a calm demeanor.
3. **Background**: The background remains consistent in all four pictures. The background features a blurred green field with mountains in the distance, which helps to keep the focus on the sheep.
4. **Art Style**: The art style is consistent across all four pictures. The sheep are depicted in a cartoonish and friendly manner, with large eyes and expressive faces.
5. **Overall Composition**: The composition of the images is similar, with the sheep standing in the foreground and the background featuring a blurred natural landscape.

These similarities create a cohesive and engaging visual experience, despite the differences in expressions and demeanor.
--------------------------------------------------
<<< clear
<<< <video>Describe this video.
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4
In the video, a young child is seen sitting on a bed, engrossed in reading a book. The child is wearing a light blue shirt and dark glasses, and appears to be very focused on the book. The bed has a floral patterned cover, and there is a white blanket spread over it. The child's legs are crossed, and they are holding the book with both hands. The book is open, and the child is reading it with great interest.

As the child continues to read, they occasionally glance at the camera, seemingly curious about who is watching them. The child's expression is one of concentration and enjoyment, as they seem to be fully immersed in the story. The camera captures the child's face and the book, providing a clear view of their actions.

In the background, there is a glimpse of a room with a white wall and a wooden door. There is also a chair visible in the background, and a small table with a lamp on it. The room appears to be a bedroom, and the child seems to be in a comfortable and cozy environment.

The child's actions are repetitive, as they continue to read the book with great enthusiasm. The camera captures their movements and expressions, providing a detailed view of their reading experience. The child's focus and dedication to the book are evident, and the video conveys a sense of innocence and curiosity.

Overall, the video captures a heartwarming moment of a young child reading a book, showcasing their love for books and the joy of reading. The setting is simple and cozy, with a focus on the child's engagement with the book. The video is a delightful portrayal of childhood innocence and the simple pleasures of reading.
--------------------------------------------------
<<< clear
<<< image1: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img> image2: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img> What is the difference bewteen the two images?
The two images are of the same kitten, but the first image is a close-up shot, while the second image is a more distant, artistic illustration. The close-up image captures the kitten in detail, showing its fur, eyes, and facial features in sharp focus. In contrast, the artistic illustration is more abstract and stylized, with a blurred background and a different color palette. The distant illustration gives the kitten a more whimsical and dreamy appearance, while the close-up image emphasizes the kitten's realism and detail.
```

The example images are shown below:
cat:

animal:

math:

poem:
**Single-sample inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
# os.environ['MODELSCOPE_API_TOKEN'] = 'Your API Token'  # If the message "The request model does not exist!" appears.

from swift.llm import (
    get_model_tokenizer, get_template, inference,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = "internvl-chat-v1_5"
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_kwargs={'device_map': 'auto'})
# for GPUs that do not support flash attention
# model, tokenizer = get_model_tokenizer(model_type, torch.float16,
#                                        model_kwargs={'device_map': 'auto'},
#                                        use_flash_attn=False)

model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
query = 'How far is it from each city?'
response, history = inference(model, template, query, images=images)  # chat with image
print(f'query: {query}')
print(f'response: {response}')

# streaming
query = 'Which city is the farthest?'
gen = inference_stream(model, template, query, history)  # chat without image
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: How far is it from each city?
response: The distances from the location of the sign to each city are as follows:

- Mata: 14 kilometers
- Yangjiang: 62 kilometers
- Guangzhou: 293 kilometers

These distances are indicated on the road sign in the image.
query: Which city is the farthest?
response: The city that is farthest from the location of the sign is Guangzhou, which is 293 kilometers away.
history: [['How far is it from each city?', 'The distances from the location of the sign to each city are as follows:\n\n- Mata: 14 kilometers\n- Yangjiang: 62 kilometers\n- Guangzhou: 293 kilometers\n\nThese distances are indicated on the road sign in the image. '], ['Which city is the farthest?', 'The city that is farthest from the location of the sign is Guangzhou, which is 293 kilometers away. ']]
"""
```
The example image is shown below:

road:
## Fine-tuning
Fine-tuning multimodal large models usually uses a custom dataset. Here is a demo that can be run directly:
LoRA fine-tuning:

**Note**

- If your GPU does not support flash attention, use the argument `--use_flash_attn false`.
- By default, only the qkv of the LLM part is fine-tuned with LoRA. To fine-tune all linear layers, including those in the vision model part, specify `--lora_target_modules ALL`.

```shell
# Experimental environment: A100
# 80GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type internvl-chat-v1_5 \
    --dataset coco-en-2-mini \
    --max_length 4096

# device_map
# Experimental environment: 2*A100...
# 2*43GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift sft \
    --model_type internvl-chat-v1_5 \
    --dataset coco-en-2-mini \
    --max_length 4096

# ddp + deepspeed-zero2
# Experimental environment: 2*A100...
# 2*80GB GPU memory
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 swift sft \
    --model_type internvl-chat-v1_5 \
    --dataset coco-en-2-mini \
    --max_length 4096 \
    --deepspeed default-zero2
```
Full parameter fine-tuning:

```shell
# Experimental environment: 4 * A100
# device map
# 4 * 72GB GPU memory
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type internvl-chat-v1_5 \
    --dataset coco-en-2-mini \
    --sft_type full \
    --max_length 4096
```
## Custom Dataset
The json and jsonl formats are supported. Below is an example of a custom dataset:
Multi-turn conversations are supported; images can be given as local paths or URLs, with multiple images separated by commas ',':

```jsonl
{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": ["image_path"]}
```
(Data without images is also supported.)

```jsonl
{"query": "55555", "response": "66666"}
{"query": "eeeee", "response": "fffff", "history": []}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}
```
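As a quick sanity check, a dataset in the format shown above can be written and read back with just the standard library (a minimal sketch; the field names follow the examples above, and the file path is arbitrary):

```python
import json
import os
import tempfile

# Minimal sketch: write a swift-style custom dataset as jsonl and read it back.
samples = [
    {"query": "55555", "response": "66666", "images": ["image_path"]},
    {"query": "eeeee", "response": "fffff", "history": [],
     "images": ["image_path1", "image_path2"]},
    {"query": "EEEEE", "response": "FFFFF"},  # images are optional
]

path = os.path.join(tempfile.gettempdir(), "custom_dataset.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
assert loaded == samples  # round-trips unchanged
```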
In addition to the data format above, the InternVL2 models support multi-image multi-turn training. The `<image>` tag is used to indicate the position of each image in the conversation. If the dataset contains no `<image>` tag, the images are placed at the beginning of the last turn's query by default.

```jsonl
{"query": "Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.", "response": "xxxxxxxxx", "history": [["<image>Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]], "images": ["image_path1", "image_path2", "image_path3"]}
```
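Since each `<image>` tag (in the query and across the history) should correspond to one entry in `images`, a simple consistency check over a sample might look like this (illustrative only, not part of swift):

```python
# Illustrative check: the number of <image> tags across query + history
# should match the number of paths in "images". Not part of swift itself.
def count_image_tags(sample):
    text = sample["query"]
    for q, r in sample.get("history", []):
        text += q + r
    return text.count("<image>")

sample = {
    "query": "Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.",
    "response": "xxxxxxxxx",
    "history": [["<image>Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]],
    "images": ["image_path1", "image_path2", "image_path3"],
}
assert count_image_tags(sample) == len(sample["images"])  # 3 tags, 3 images
```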
Alternatively, use `<img>image_path</img>` to indicate both the image path and the image position:

```jsonl
{"query": "Image-1: <img>img_path</img>\n Image-2: <img>img_path2</img>\n Describe the two images in detail.", "response": "xxxxxxxxx", "history": [["<img>img_path3</img> Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]]}
```
The InternVL2 models support training with video datasets without needing to specify a `<video>` tag.

```jsonl
{"query": "Describe this video in detail. Don't repeat", "response": "xxxxxxxxx", "history": [], "videos": ["video_path"]}
```
The InternVL2 models support training on grounding tasks; the data should follow this format:

```jsonl
{"query": "Find <bbox>", "response": "<ref-object>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]"}
{"query": "Find <ref-object>", "response": "<bbox>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]"}
```
The `objects` field contains a JSON string with four fields:

- `caption`: description of the object corresponding to the bounding box.
- `bbox`: coordinates of the bounding box. Four integers (rather than floats) are recommended, representing the values `x_min`, `y_min`, `x_max`, and `y_max`.
- `bbox_type`: the bounding box type. Three types are currently supported: `real` / `norm_1000` / `norm_1`, representing actual pixel coordinates / thousandth-scale coordinates / normalized coordinates, respectively.
- `image`: the index of the corresponding image, starting from 0.
This format is converted into a format recognizable by InternVL2, specifically:

```jsonl
{"query": "Find <ref>the man</ref>", "response": "<box> [[200, 200, 600, 600]] </box>"}
```

You can also input the format above directly, but make sure the coordinates use thousandth-scale (`norm_1000`) coordinates.
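Converting real pixel coordinates into thousandth-scale coordinates is a straightforward rescale by the image size. A hedged sketch (the exact rounding behaviour used internally may differ):

```python
# Sketch: rescale a real-pixel bbox [x_min, y_min, x_max, y_max] to
# thousandth-scale (norm_1000) coordinates, given the image width/height.
# The exact rounding used by swift's conversion may differ.
def bbox_to_norm_1000(bbox, width, height):
    x_min, y_min, x_max, y_max = bbox
    return [
        round(x_min / width * 1000),
        round(y_min / height * 1000),
        round(x_max / width * 1000),
        round(y_max / height * 1000),
    ]

# e.g. the "guy in red" box from above, assuming a 640x480 image
print(bbox_to_norm_1000([138, 136, 235, 359], 640, 480))
# [216, 283, 367, 748]
```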
## Inference After Fine-tuning
Direct inference:

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/internvl-chat-v1_5/vx-xxx/checkpoint-xxx \
    --load_dataset_config true \
    --max_length 4096
```
Merge LoRA and infer:

```shell
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir "output/internvl-chat-v1_5/vx-xxx/checkpoint-xxx" \
    --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir "output/internvl-chat-v1_5/vx-xxx/checkpoint-xxx-merged" \
    --load_dataset_config true \
    --max_length 4096

# device map
CUDA_VISIBLE_DEVICES=0,1 swift infer \
    --ckpt_dir "output/internvl-chat-v1_5/vx-xxx/checkpoint-xxx-merged" \
    --load_dataset_config true \
    --max_length 4096
```