Recommended Classic Papers on AI Agents and Large Models

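Summary (AI-generated): a recommended list of classic papers on AI Agents and large models. It covers more interesting AI Agents, more useful AI Agents, task planning and decomposition, hallucination, multimodal models, image and video generation, speech synthesis, LLM foundations, GPT, open-source LLMs, fine-tuning, and performance optimization. The list includes core papers such as CLIP, ViT, LLaVA, Transformer, and Diffusion Models, with links, and aims to help readers understand how large models work and progress from prompt engineer to professional researcher.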

FlinkHero, published 2025/2/7, updated 2026/6/5
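This article compiles a list of classic papers on AI Agents and large models, organized by core research direction, for use as a research reference.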

More Interesting AI Agents

  • Generative Agents: Interactive Simulacra of Human Behavior https://arxiv.org/abs/2304.03442
  • RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models https://arxiv.org/abs/2310.00746
  • Role play with large language models https://www.nature.com/articles/s41586-023-06647-8
  • Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf https://arxiv.org/abs/2309.04658
  • MemGPT: Towards LLMs as Operating Systems https://arxiv.org/abs/2310.08560
  • Augmenting Language Models with Long-Term Memory https://arxiv.org/abs/2306.07174
  • Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models https://arxiv.org/pdf/2307.16180.pdf

More Useful AI Agents

  • The Rise and Potential of Large Language Model Based Agents: A Survey https://arxiv.org/abs/2309.07864
  • MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework https://arxiv.org/abs/2308.00352
  • Communicative Agents for Software Development https://arxiv.org/pdf/2307.07924.pdf
  • Large Language Models Can Self-Improve https://arxiv.org/abs/2210.11610
  • Evaluating Human-Language Model Interaction https://arxiv.org/abs/2212.09746
  • Large Language Models can Learn Rules https://arxiv.org/abs/2310.07064
  • AgentBench: Evaluating LLMs as Agents https://arxiv.org/abs/2308.03688
  • WebArena: A Realistic Web Environment for Building Autonomous Agents https://arxiv.org/abs/2307.13854
  • TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT https://arxiv.org/abs/2307.08674

Task Planning and Decomposition

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models https://arxiv.org/abs/2305.10601
  • Implicit Chain of Thought Reasoning via Knowledge Distillation https://arxiv.org/abs/2311.01460
  • ReAct: Synergizing Reasoning and Acting in Language Models https://arxiv.org/abs/2210.03629
  • ART: Automatic multi-step reasoning and tool-use for large language models https://arxiv.org/abs/2303.09014
  • Branch-Solve-Merge Improves Large Language Model Evaluation and Generation https://arxiv.org/abs/2310.15123
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions https://arxiv.org/pdf/2304.12244.pdf

Hallucination

  • Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models https://arxiv.org/pdf/2309.01219.pdf
  • Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback https://arxiv.org/abs/2302.12813
  • SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models https://arxiv.org/abs/2303.08896
  • WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus https://arxiv.org/abs/2304.04358

Multimodal

  • Learning Transferable Visual Models From Natural Language Supervision (CLIP) https://arxiv.org/abs/2103.00020
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) https://arxiv.org/abs/2010.11929
  • MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning https://arxiv.org/abs/2310.09478
  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models https://arxiv.org/abs/2304.10592
  • NExT-GPT: Any-to-Any Multimodal LLM https://arxiv.org/pdf/2309.05519.pdf
  • Visual Instruction Tuning (LLaVA) https://arxiv.org/pdf/2304.08485.pdf
  • Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) https://arxiv.org/abs/2310.03744
  • Sequential Modeling Enables Scalable Learning for Large Vision Models (LVM) https://arxiv.org/pdf/2312.00785.pdf
  • CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation https://arxiv.org/pdf/2311.18775.pdf
  • Neural Discrete Representation Learning (VQ-VAE) https://browse.arxiv.org/pdf/1711.00937.pdf
  • Taming Transformers for High-Resolution Image Synthesis (VQ-GAN) https://arxiv.org/abs/2012.09841
  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows https://arxiv.org/abs/2103.14030
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models https://browse.arxiv.org/pdf/2301.12597.pdf
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning https://browse.arxiv.org/pdf/2305.06500.pdf
  • ImageBind: One Embedding Space To Bind Them All https://arxiv.org/abs/2305.05665
  • Meta-Transformer: A Unified Framework for Multimodal Learning https://arxiv.org/abs/2307.10802

Image/Video Generation

  • High-Resolution Image Synthesis with Latent Diffusion Models https://arxiv.org/pdf/2112.10752.pdf
  • Structure and Content-Guided Video Synthesis with Diffusion Models (RunwayML Gen1) https://browse.arxiv.org/pdf/2302.03011.pdf
  • Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) https://arxiv.org/pdf/2204.06125.pdf
  • AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning https://arxiv.org/abs/2307.04725
  • Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) https://arxiv.org/abs/2302.05543
  • SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis https://arxiv.org/abs/2307.01952
  • Zero-1-to-3: Zero-shot One Image to 3D Object https://arxiv.org/abs/2303.11328
  • Scaling Vision Transformers to 22 Billion Parameters https://arxiv.org/abs/2302.05442
  • Glow: Generative Flow with Invertible 1×1 Convolutions https://browse.arxiv.org/pdf/1807.03039.pdf
  • Language Model Beats Diffusion – Tokenizer is Key to Visual Generation https://arxiv.org/pdf/2310.05737.pdf
  • InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation https://arxiv.org/pdf/2309.06380.pdf
  • Perceptual Losses for Real-Time Style Transfer and Super-Resolution https://arxiv.org/pdf/1603.08155.pdf
  • CogView: Mastering Text-to-Image Generation via Transformers https://arxiv.org/abs/2105.13290
  • Diffusion Models for Video Prediction and Infilling https://arxiv.org/abs/2206.07696

Speech Synthesis

  • Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS) https://browse.arxiv.org/pdf/2106.06103.pdf
  • Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) https://arxiv.org/abs/2301.02111
  • Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (VALL-E X) https://arxiv.org/pdf/2303.03926.pdf
  • MusicLM: Generating Music From Text https://arxiv.org/abs/2301.11325

LLM Foundations

  • Attention Is All You Need https://arxiv.org/abs/1706.03762
  • Sequence to Sequence Learning with Neural Networks https://arxiv.org/abs/1409.3215
  • Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/abs/1409.0473
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805
  • Scaling Laws for Neural Language Models https://arxiv.org/pdf/2001.08361.pdf
  • Emergent Abilities of Large Language Models https://openreview.net/pdf?id=yzkSU5zdwD
  • Training Compute-Optimal Large Language Models (Chinchilla scaling law) https://arxiv.org/abs/2203.15556
  • Scaling Instruction-Finetuned Language Models https://arxiv.org/pdf/2210.11416.pdf
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model https://arxiv.org/pdf/2305.18290.pdf
  • Progress measures for grokking via mechanistic interpretability https://arxiv.org/abs/2301.05217
  • Language Models Represent Space and Time https://arxiv.org/abs/2310.02207
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts https://arxiv.org/abs/2112.06905
  • Adam: A Method for Stochastic Optimization https://arxiv.org/abs/1412.6980
  • Efficient Estimation of Word Representations in Vector Space (Word2Vec) https://arxiv.org/abs/1301.3781
  • Distributed Representations of Words and Phrases and their Compositionality https://arxiv.org/abs/1310.4546

GPT

  • Language Models are Few-Shot Learners (GPT-3) https://arxiv.org/abs/2005.14165
  • Language Models are Unsupervised Multitask Learners (GPT-2) https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  • Improving Language Understanding by Generative Pre-Training (GPT-1) https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  • Training language models to follow instructions with human feedback (InstructGPT) https://arxiv.org/pdf/2203.02155.pdf
  • Evaluating Large Language Models Trained on Code https://arxiv.org/pdf/2107.03374.pdf
  • Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond https://arxiv.org/abs/2304.13712
  • Instruction Tuning with GPT-4 https://arxiv.org/pdf/2304.03277.pdf
  • The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) https://arxiv.org/abs/2309.17421
  • Sparks of Artificial General Intelligence: Early experiments with GPT-4 https://arxiv.org/abs/2303.12712
  • Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision https://arxiv.org/abs/2312.09390

Open-Source LLMs

  • LLaMA: Open and Efficient Foundation Language Models https://arxiv.org/abs/2302.13971
  • Llama 2: Open Foundation and Fine-Tuned Chat Models https://arxiv.org/pdf/2307.09288.pdf
  • Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality https://lmsys.org/blog/2023-03-30-vicuna/
  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset https://arxiv.org/abs/2309.11998
  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena https://arxiv.org/abs/2306.05685
  • How Long Can Open-Source LLMs Truly Promise on Context Length? https://lmsys.org/blog/2023-06-29-longchat/
  • Mixtral of experts https://mistral.ai/news/mixtral-of-experts/
  • OpenChat: Advancing Open-source Language Models with Mixed-Quality Data https://arxiv.org/abs/2309.11235
  • RWKV: Reinventing RNNs for the Transformer Era https://arxiv.org/abs/2305.13048
  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces https://arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf
  • Retentive Network: A Successor to Transformer for Large Language Models https://arxiv.org/abs/2307.08621
  • Baichuan 2: Open Large-scale Language Models https://arxiv.org/abs/2309.10305
  • GLM-130B: An Open Bilingual Pre-trained Model https://arxiv.org/abs/2210.02414
  • Qwen Technical Report https://arxiv.org/abs/2309.16609
  • Skywork: A More Open Bilingual Foundation Model https://arxiv.org/abs/2310.19341

Fine-Tuning

  • Learning to summarize from human feedback https://arxiv.org/abs/2009.01325
  • Self-Instruct: Aligning Language Models with Self-Generated Instructions https://arxiv.org/abs/2212.10560
  • Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning https://arxiv.org/abs/2303.15647
  • LoRA: Low-Rank Adaptation of Large Language Models https://arxiv.org/abs/2106.09685
  • VeRA: Vector-based Random Matrix Adaptation https://arxiv.org/pdf/2310.11454.pdf
  • QLoRA: Efficient Finetuning of Quantized LLMs https://arxiv.org/abs/2305.14314
  • Chain of Hindsight Aligns Language Models with Feedback https://arxiv.org/abs/2302.02676
  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models https://arxiv.org/pdf/2312.06585.pdf

Performance Optimization

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) https://arxiv.org/abs/2309.06180
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness https://arxiv.org/abs/2205.14135
  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters https://arxiv.org/abs/2311.03285
  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism https://arxiv.org/pdf/1909.08053.pdf
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models https://arxiv.org/pdf/1910.02054.pdf
