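Building a GitHub Trending Repository Crawler in Rust

This hands-on project uses Rust to build an asynchronous crawler that scrapes the GitHub Trending page for popular Rust repositories, collects details such as repository name, description, star count, and author, and exports the results to a JSON file. The code pays particular attention to error handling and to CSS selectors that rely on stable attributes, so that it is intended to keep working when GitHub makes minor changes to the page structure.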

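Technology Stack

HTTP requests: reqwest (the most widely used asynchronous HTTP client in the Rust ecosystem)
HTML parsing: scraper (built on the selectors crate, supports CSS selectors, lightweight and efficient)
JSON serialization: serde + serde_json (the de facto standard for serialization and deserialization in Rust)
Async runtime: tokio (the de facto standard for asynchronous Rust)
Logging: env_logger + log (useful for debugging and tracing the execution flow)
Error handling: anyhow (simplifies error propagation and avoids hand-writing complex error enum types)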
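To make the trade-off behind anyhow concrete, here is a rough sketch of the kind of hand-written error type it lets you skip. It uses only the standard library. `CrawlError`, its variants, and `parse_field` are hypothetical names chosen for illustration and are not part of the project.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical hand-written error type for a crawler (illustrative only).
#[derive(Debug)]
enum CrawlError {
    /// A required element was not found in the page.
    MissingElement(&'static str),
    /// A numeric field could not be parsed.
    BadNumber(ParseIntError),
}

impl fmt::Display for CrawlError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CrawlError::MissingElement(what) => write!(f, "missing element: {}", what),
            CrawlError::BadNumber(e) => write!(f, "bad number: {}", e),
        }
    }
}

impl std::error::Error for CrawlError {}

// Every wrapped error type needs its own `From` impl before `?` can convert it.
impl From<ParseIntError> for CrawlError {
    fn from(e: ParseIntError) -> Self {
        CrawlError::BadNumber(e)
    }
}

/// Parses a plain integer field, or reports that the element was absent.
fn parse_field(raw: Option<&str>) -> Result<u64, CrawlError> {
    let text = raw.ok_or(CrawlError::MissingElement("stars"))?;
    Ok(text.trim().parse::<u64>()?)
}
```

With anyhow, the enum, the `Display` impl, and the `From` impl are replaced by `anyhow::Result<T>` plus `.context(...)` calls at the points of failure.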

Project Structure Overview

github-trending-crawler/
├── Cargo.toml      # dependency configuration
├── src/
│   └── main.rs     # core logic
└── trending_repos.json # output file generated after a run


Environment Setup and Dependency Configuration

Initializing the Project

First, create a new Rust project and change into its directory:

cargo new github-trending-crawler
cd github-trending-crawler

Configuring `Cargo.toml`

Add the required dependencies to Cargo.toml. Check crates.io for the latest versions to ensure compatibility:

[package]
name = "github-trending-crawler"
version = "0.1.0"
edition = "2021"
description = "A crawler to fetch GitHub Trending Rust repositories"
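
[dependencies]
# HTTP client
reqwest = { version = "…", features = ["…", "…"] }
# HTML parsing
scraper = "…"
# JSON serialization
serde = { version = "…", features = ["…"] }
serde_json = "…"
# Async runtime
tokio = { version = "…", features = ["…"] }
# Logging
env_logger = "…"
log = "…"
# Error handling
anyhow = "…"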

Parsing the HTML and Extracting Repository Information

// Imports needed by this function.
use anyhow::{Context, Result};
use log::info;
use scraper::{Html, Selector};

// Parses the Trending page HTML into a list of `GithubRepo` values.
fn parse_repos(html: &str) -> Result<Vec<GithubRepo>> {
    info!("Starting to parse repositories...");
    let document = Html::parse_document(html);

    // Define the CSS selectors, preferring semantic attributes over style
    // classes to reduce the impact of page styling changes.
    let repo_selector = Selector::parse("article.Box-row")
        .map_err(|e| anyhow::anyhow!("Failed to parse repo selector: {}", e))?;
    let author_name_selector = Selector::parse("h2 a")
        .map_err(|e| anyhow::anyhow!("Failed to parse author-name selector: {}", e))?;
    let desc_selector = Selector::parse("p")
        .map_err(|e| anyhow::anyhow!("Failed to parse description selector: {}", e))?;
    // Select by href suffix, which is more stable than class names.
    let stars_selector = Selector::parse("a[href$='/stargazers']")
        .map_err(|e| anyhow::anyhow!("Failed to parse stars selector: {}", e))?;
    let forks_selector = Selector::parse("a[href$='/forks']")
        .map_err(|e| anyhow::anyhow!("Failed to parse forks selector: {}", e))?;
    // Select by the data-menu-button-text attribute, which is more stable.
    let today_stars_selector = Selector::parse("span[data-menu-button-text]")
        .map_err(|e| anyhow::anyhow!("Failed to parse today-stars selector: {}", e))?;

    let mut repos = Vec::new();
    for repo_node in document.select(&repo_selector) {
        // 1. Extract the author and repository name.
        let author_name_element = repo_node
            .select(&author_name_selector)
            .next()
            .context("Missing author/name element (GitHub page structure may have changed)")?;
        let author_name_text = author_name_element
            .text()
            .collect::<String>()
            .trim()
            .to_string();
        // `with_context` builds the message only when the split actually fails.
        let (author, name) = author_name_text.split_once('/').with_context(|| {
            format!(
                "Invalid author/name format: '{}' (expected 'author/name')",
                author_name_text
            )
        })?;
        let author = author.trim().to_string();
        let name = name.trim().to_string();

        // 2. Extract the full repository URL.
        let url = author_name_element
            .value()
            .attr("href")
            .context("Missing href attribute for repo link")?
            .to_string();
        let url = format!("https://github.com{}", url);

        // 3. Extract the description (optional; None when the repository has none).
        let description = repo_node
            .select(&desc_selector)
            .next()
            .map(|elem| elem.text().collect::<String>().trim().to_string());

        // 4. Extract the star count (defaults to "0" when missing, for fault tolerance).
        let stars = repo_node
            .select(&stars_selector)
            .next()
            .map(|elem| elem.text().collect::<String>().trim().to_string())
            .unwrap_or_else(|| "0".to_string());

        // 5. Extract the fork count.
        let forks = repo_node
            .select(&forks_selector)
            .next()
            .map(|elem| elem.text().collect::<String>().trim().to_string())
            .unwrap_or_else(|| "0".to_string());

        // 6. Extract the stars gained today.
        let today_stars = repo_node
            .select(&today_stars_selector)
            .next()
            .map(|elem| elem.text().collect::<String>().trim().to_string())
            .unwrap_or_else(|| "0".to_string());

        repos.push(GithubRepo {
            author,
            name,
            description,
            stars,
            forks,
            today_stars,
            url,
        });
    }

    info!("Successfully parsed {} repositories", repos.len());
    Ok(repos)
}
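The function above builds `GithubRepo` values and splits the link text on `/`. The sketch below shows one possible shape for that struct, with field names and types inferred from the struct literal in `parse_repos`, together with the same splitting and fallback logic isolated into small standard-library functions. The derive list is an assumption (a version that is written out with serde_json would presumably also derive serde's `Serialize`), and `split_author_name` and `repo_from_parts` are hypothetical helper names.

```rust
/// One trending repository. Field names follow the struct literal in `parse_repos`.
/// Assumption: the derive list here is illustrative and std-only.
#[derive(Debug, Clone, PartialEq)]
struct GithubRepo {
    author: String,
    name: String,
    description: Option<String>,
    stars: String,
    forks: String,
    today_stars: String,
    url: String,
}

/// Hypothetical helper: splits link text such as "rust-lang /\n  rust-analyzer"
/// into trimmed (author, name) parts. Returns `None` when there is no '/'
/// or when either part is empty.
fn split_author_name(text: &str) -> Option<(String, String)> {
    let (author, name) = text.split_once('/')?;
    let (author, name) = (author.trim(), name.trim());
    if author.is_empty() || name.is_empty() {
        return None;
    }
    Some((author.to_string(), name.to_string()))
}

/// Hypothetical helper: builds a `GithubRepo` using the same "0" fallback
/// that `parse_repos` applies when a counter element is absent.
fn repo_from_parts(link_text: &str, href: &str, stars: Option<&str>) -> Option<GithubRepo> {
    let (author, name) = split_author_name(link_text)?;
    Some(GithubRepo {
        author,
        name,
        description: None,
        stars: stars
            .map(|s| s.trim().to_string())
            .unwrap_or_else(|| "0".to_string()),
        forks: "0".to_string(),
        today_stars: "0".to_string(),
        url: format!("https://github.com{}", href),
    })
}
```

Keeping the string handling in plain functions like these makes it possible to unit test the fragile parts without fetching or parsing any HTML.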

Checking the Results

[
  {
    "author": "YaLTeR",
    "name": "niri",
    "description": "A scrollable-tiling Wayland compositor.",
    "stars": "14,823",
    "forks": "523",
    "today_stars": "0",
    "url": "https://github.com/YaLTeR/niri"
  },
  {
    "author": "librespot-org",
    "name": "librespot",
    "description": "Open Source Spotify client library",
    "stars": "6,131",
    "forks": "773",
    "today_stars": "0",
    "url": "https://github.com/librespot-org/librespot"
  },
  {
    "author": "zensical",
    "name": "zensical",
    "description": "A modern static site generator by the creators of Material for MkDocs",
    "stars": "859",
    "forks": "12",
    "today_stars": "0",
    "url": "https://github.com/zensical/zensical"
  },
  {
    "author": "atuinsh",
    "name": "atuin",
    "description": "✨ Magical shell history",
    "stars": "26,951",
    "forks": "730",
    "today_stars": "0",
    "url": "https://github.com/atuinsh/atuin"
  },
  {
    "author": "openai",
    "name": "codex",
    "description": "Lightweight coding agent that runs in your terminal",
    "stars": "50,236",
    "forks": "6,238",
    "today_stars": "0",
    "url": "https://github.com/openai/codex"
  },
  {
    "author": "bevyengine",
    "name": "bevy",
    "description": "A refreshingly simple data-driven game engine built in Rust",
    "stars": "43,023",
    "forks": "4,224",
    "today_stars": "0",
    "url": "https://github.com/bevyengine/bevy"
  },
  {
    "author": "topjohnwu",
    "name": "Magisk",
    "description": "The Magic Mask for Android",
    "stars": "56,884",
    "forks": "15,868",
    "today_stars": "0",
    "url": "https://github.com/topjohnwu/Magisk"
  },
  {
    "author": "uutils",
    "name": "coreutils",
    "description": "Cross-platform Rust rewrite of the GNU coreutils",
    "stars": "22,144",
    "forks": "1,637",
    "today_stars": "0",
    "url": "https://github.com/uutils/coreutils"
  },
  {
    "author": "regolith-labs",
    "name": "ore",
    "description": "It's time to mine.",
    "stars": "741",
    "forks": "262",
    "today_stars": "0",
    "url": "https://github.com/regolith-labs/ore"
  },
  {
    "author": "commonwarexyz",
    "name": "monorepo",
    "description": "Commonware Library Primitives and Examples",
    "stars": "369",
    "forks": "131",
    "today_stars": "0",
    "url": "https://github.com/commonwarexyz/monorepo"
  },
  {
    "author": "rust-lang",
    "name": "rust-analyzer",
    "description": "A Rust compiler front-end for IDEs",
    "stars": "15,670",
    "forks": "1,864",
    "today_stars": "0",
    "url": "https://github.com/rust-lang/rust-analyzer"
  },
  {
    "author": "chroma-core",
    "name": "chroma",
    "description": "Open-source search and retrieval database for AI applications.",
    "stars": "24,344",
    "forks": "1,912",
    "today_stars": "0",
    "url": "https://github.com/chroma-core/chroma"
  },
  {
    "author": "getzola",
    "name": "zola",
    "description": "A fast static site generator in a single binary with everything built-in. https://www.getzola.org",
    "stars": "16,120",
    "forks": "1,094",
    "today_stars": "0",
    "url": "https://github.com/getzola/zola"
  },
  {
    "author": "longbridge",
    "name": "gpui-component",
    "description": "Rust GUI components for building fantastic cross-platform desktop application by using GPUI.",
    "stars": "7,532",
    "forks": "297",
    "today_stars": "0",
    "url": "https://github.com/longbridge/gpui-component"
  },
  {
    "author": "fish-shell",
    "name": "fish-shell",
    "description": "The user-friendly command line shell.",
    "stars": "31,470",
    "forks": "2,159",
    "today_stars": "0",
    "url": "https://github.com/fish-shell/fish-shell"
  },
  {
    "author": "Spotifyd",
    "name": "spotifyd",
    "description": "A spotify daemon",
    "stars": "10,404",
    "forks": "487",
    "today_stars": "0",
    "url": "https://github.com/Spotifyd/spotifyd"
  },
  {
    "author": "tree-sitter",
    "name": "tree-sitter",
    "description": "An incremental parsing system for programming tools",
    "stars": "22,666",
    "forks": "2,198",
    "today_stars": "0",
    "url": "https://github.com/tree-sitter/tree-sitter"
  },
  {
    "author": "UwUDev",
    "name": "ygege",
    "description": "High-performance indexer for YGG Torrent written in Rust",
    "stars": "78",
    "forks": "9",
    "today_stars": "0",
    "url": "https://github.com/UwUDev/ygege"
  },
  {
    "author": "ankitects",
    "name": "anki",
    "description": "Anki is a smart spaced repetition flashcard program",
    "stars": "24,634",
    "forks": "2,583",
    "today_stars": "0",
    "url": "https://github.com/ankitects/anki"
  },
  {
    "author": "get-convex",
    "name": "convex-backend",
    "description": "The open-source reactive database for app developers",
    "stars": "8,066",
    "forks": "452",
    "today_stars": "0",
    "url": "https://github.com/get-convex/convex-backend"
  }
]

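The counts in this output are stored as display strings such as "14,823", so they cannot be compared or sorted numerically as they stand. If numeric values are needed downstream, a small conversion step along the following lines could be applied. This is only a sketch using the standard library; `parse_count` and `sort_by_count_desc` are hypothetical helpers, not part of the crawler.

```rust
use std::cmp::Reverse;

/// Hypothetical helper: converts a display count such as "14,823" into a number.
/// Returns `None` for empty input or anything other than ASCII digits and commas.
fn parse_count(text: &str) -> Option<u64> {
    let digits: String = text.trim().chars().filter(|c| *c != ',').collect();
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Hypothetical helper: sorts (name, count) pairs by count, largest first.
/// Counts that fail to parse are treated as 0.
fn sort_by_count_desc(items: &mut [(String, String)]) {
    items.sort_by_key(|(_, count)| Reverse(parse_count(count).unwrap_or(0)));
}
```

An alternative design would be to do this conversion inside the parser and store the counts as integers in the struct, at the cost of having to decide what to do when GitHub changes the number format.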

