InternVL Official Fine-Tuning
Fine-Tuning on a Custom Dataset
Model Preparation
| Model Name | Type | Param | Download | Size |
| --- | --- | --- | --- | --- |
| InternVL2-1B | MLLM | 0.9B | 🤗 | 1.8 GB |
| InternVL2-2B | MLLM | 2.2B | 🤗 | 4.2 GB |
| InternVL2-4B | MLLM | 4.2B | 🤗 | 7.8 GB |
| InternVL2-8B | MLLM | 8.1B | 🤗 | 16 GB |
| InternVL2-26B | MLLM | 25.5B | 🤗 | 48 GB |
| InternVL2-40B | MLLM | 40.1B | 🤗 | 75 GB |
| InternVL2-Llama3-76B | MLLM | 76.3B | 🤗 | 143 GB |
Before starting the second fine-tuning, download the pre-trained model we provide.

```sh
cd pretrained/
# pip install -U huggingface_hub

# Download OpenGVLab/InternVL2-1B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-1B --local-dir InternVL2-1B
# Download OpenGVLab/InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
# Download OpenGVLab/InternVL2-4B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-4B --local-dir InternVL2-4B
# Download OpenGVLab/InternVL2-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B
# Download OpenGVLab/InternVL2-26B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-26B --local-dir InternVL2-26B
# Download OpenGVLab/InternVL2-40B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-40B --local-dir InternVL2-40B
# Download OpenGVLab/InternVL2-Llama3-76B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-Llama3-76B --local-dir InternVL2-Llama3-76B
```
The directory structure is:

```
pretrained
├── InternVL2-1B
├── InternVL2-2B
├── InternVL2-4B
├── InternVL2-8B
├── InternVL2-26B
├── InternVL2-40B
└── InternVL2-Llama3-76B
```
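If you prefer to script the downloads in Python instead of using `huggingface-cli`, a minimal sketch with `huggingface_hub.snapshot_download` (the same library that backs the CLI) looks like this; the model list and target directory simply mirror the commands above:

```python
# Minimal sketch: download the pretrained checkpoints via the huggingface_hub
# Python API instead of huggingface-cli. Adjust MODELS to the sizes you need.
from huggingface_hub import snapshot_download

MODELS = [
    "OpenGVLab/InternVL2-1B",
    "OpenGVLab/InternVL2-2B",
    # ... add OpenGVLab/InternVL2-4B, -8B, -26B, -40B, -Llama3-76B as needed
]

for repo_id in MODELS:
    # Mirrors `--local-dir pretrained/<model-name>`; recent versions of
    # huggingface_hub resume interrupted downloads automatically.
    snapshot_download(repo_id=repo_id, local_dir=f"pretrained/{repo_id.split('/')[-1]}")
```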
Prepare Your Customized Training Data
After downloading the pre-trained model, prepare your customized SFT (Supervised Fine-Tuning) data. Create a JSON file in `internvl_chat/shell/data/` similar to the example below.
The format of the JSON file should be:

```json
{
  "your-custom-dataset-1": {
    "root": "path/to/the/image/",
    "annotation": "path/to/the/jsonl/annotation",
    "data_augment": false,
    "repeat_time": 1,
    "length": "number of your data"
  },
  ...
}
```
Example:

```json
{
  "sharegpt4v_instruct_gpt4-vision_cap100k": {
    "root": "playground/data/",
    "annotation": "playground/opensource/sharegpt4v_instruct_gpt4-vision_cap100k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 102025
  }
}
```
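Since the `length` field is just the number of samples (lines) in the annotation JSONL, you can fill it in programmatically. Below is a minimal sketch; the dataset name, paths, and output file name are placeholders, not part of the official toolkit:

```python
# Hypothetical helper for assembling the meta JSON; all paths and the dataset
# name are placeholders -- replace them with your own.
import json

def count_jsonl_lines(path: str) -> int:
    """Number of samples in a JSONL annotation file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

annotation = "path/to/the/jsonl/annotation"
meta = {
    "your-custom-dataset-1": {
        "root": "path/to/the/image/",
        "annotation": annotation,
        "data_augment": False,
        "repeat_time": 1,
        "length": count_jsonl_lines(annotation),  # fills in "number of your data"
    }
}

with open("internvl_chat/shell/data/custom_data.json", "w", encoding="utf-8") as f:
    json.dump(meta, f, indent=2)
```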
The format of each specific type of JSONL (such as plain-text data, single-image data, multi-image data, and video data) can be organized according to the descriptions provided in the official documentation.
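As an illustration only, a single-image sample in the annotation JSONL typically follows the LLaVA-style conversation format used by InternVL's open-source training data (one JSON object per line, with an `<image>` placeholder in the first human turn); verify the exact field set against the official dataset descriptions for your data type:

```python
# Hedged sketch of writing one single-image JSONL sample. The field names
# ("id", "image", "conversations", "from", "value") follow the LLaVA-style
# format used by InternVL's open-source data; check them against the official
# documentation before training.
import json

sample = {
    "id": 0,
    "image": "images/0001.jpg",  # relative to the "root" set in the meta JSON
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the image in detail."},
        {"from": "gpt", "value": "A ground-truth description of the image."},
    ],
}

with open("path/to/the/jsonl/annotation", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```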
We recommend adding new domain-specific data on top of the general data. This enhances downstream capabilities while retaining the foundational skills. Of course, you can also choose to fine-tune only on the new data, depending on your needs.
Start Second Fine-Tuning
Fine-tune the pre-trained model using either the script for training the full LLM or the script for training LoRA, depending on your available GPU resources.
Before fine-tuning, set `--meta_path` to the path of the JSON file created in the previous step. The default path of the pre-trained model in these shell scripts is `./pretrained/InternVL2-1B`.
In the default settings, the visual encoder is frozen. You can unfreeze it if needed; in general, unfreezing the visual encoder leads to better performance.
💡 Fine-tuning the full LLM requires 8x 32G/40G GPUs, while fine-tuning the LoRA requires 2x 32G/40G GPUs.
💡 The number of GPUs and the hyperparameters used here are only examples. To achieve optimal results, you may need to adjust these settings based on your available hardware and dataset size.
Fine-tuning commands:

```sh
# Using 8 GPUs, fine-tune the full LLM, cost about 30G per GPU
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/internvl2.0/2nd_finetune/internvl2_1b_qwen2_0_5b_dynamic_res_2nd_finetune_full.sh

# Using 2 GPUs, fine-tune the LoRA, cost about 27G per GPU
GPUS=2 PER_DEVICE_BATCH_SIZE=1 sh shell/internvl2.0/2nd_finetune/internvl2_1b_qwen2_0_5b_dynamic_res_2nd_finetune_lora.sh

# Using 8 GPUs, fine-tune the LoRA, cost about 27G per GPU
GPUS=8 PER_DEVICE_BATCH_SIZE=1 sh shell/internvl2.0/2nd_finetune/internvl2_1b_qwen2_0_5b_dynamic_res_2nd_finetune_lora.sh
```
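After training finishes, a quick way to sanity-check the result is to load the checkpoint with `transformers` and run a pure-text query. This is a hedged sketch: it assumes the saved checkpoint keeps the standard InternVL2 remote-code interface (`model.chat`) shown on the model cards, and the checkpoint path is a placeholder for your actual output directory:

```python
# Hedged sanity check: load the fine-tuned checkpoint and ask a text-only
# question. The path is a placeholder; the chat() call follows the InternVL2
# model-card usage and assumes the checkpoint includes the remote code files.
import torch
from transformers import AutoModel, AutoTokenizer

path = "work_dirs/your_finetuned_checkpoint"  # placeholder
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, None, "Hello, who are you?", generation_config)
print(response)
```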
If you encounter any issues, please let us know, and we will update the training guide to improve its usability.
For reference, the complete example meta file combining the open-source datasets looks like this:

```json
{
  "sharegpt4v_instruct_gpt4-vision_cap100k": {
    "root": "playground/data/",
    "annotation": "playground/opensource/sharegpt4v_instruct_gpt4-vision_cap100k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 102025
  },
  "llava_instruct_150k_zh": {
    "root": "playground/data/coco/",
    "annotation": "playground/opensource/llava_instruct_150k_zh.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 157712
  },
  "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k": {
    "root": "playground/data/",
    "annotation": "playground/opensource/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 665058
  },
  "dvqa_train_200k": {
    "root": "playground/data/dvqa/",
    "annotation": "playground/opensource/dvqa_train_200k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 200000
  },
  "chartqa_train_18k": {
    "root": "playground/data/chartqa/",
    "annotation": "playground/opensource/chartqa_train_18k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 18317
  },
  "ai2d_train_12k": {
    "root": "playground/data/ai2d/",
    "annotation": "playground/opensource/ai2d_train_12k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 12413
  },
  "docvqa_train_10k": {
    "root": "playground/data/docvqa/",
    "annotation": "playground/opensource/docvqa_train_10k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 10211
  },
  "geoqa+": {
    "root": "playground/data/geoqa+/",
    "annotation": "playground/opensource/geoqa+.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 72318
  },
  "synthdog_en": {
    "root": "playground/data/synthdog-en/",
    "annotation": "playground/opensource/synthdog_en.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 29765
  }
}
```
Citation
If you find this project useful in your research, please consider citing:

```BibTeX
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}

@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
```