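Building a GitHub Trending Crawler in Rust
This article walks through a crawler for GitHub Trending repositories written in Rust. The project uses reqwest for asynchronous HTTP requests, scraper to parse HTML, and serde to serialize data. CSS selectors that avoid styling class names, together with explicit error handling, make the code more tolerant of page changes and missing fields. The repository details (author, name, description, star count, and so on) are saved to a JSON file for later analysis.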


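The project relies on the following crates:
- reqwest: the most widely used HTTP client for Rust, with async support
- scraper: HTML parsing with CSS selectors, built on the selectors crate; lightweight and efficient
- serde + serde_json: the de facto standard serialization / deserialization libraries for Rust
- tokio: the de facto standard async runtime for Rust
- env_logger + log: simple log output for debugging
- anyhow: simplifies error propagation without hand-written error types

Project layout:
github-trending-crawler/
├── Cargo.toml            # dependency configuration
├── src/
│   └── main.rs           # core logic
└── trending_repos.json   # output file (generated at run time)

Create the project:
cargo new github-trending-crawler
cd github-trending-crawler

Add the dependencies to Cargo.toml (check crates.io for the latest compatible versions):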
[package]
name = "github-trending-crawler"
version = "0.1.0"
edition = "2021"
description = "A crawler to fetch GitHub Trending Rust repositories"
license = "MIT"
[dependencies]
# HTTP client (async)
reqwest = { version = "0.12", features = ["json", "rustls-tls"] }
# HTML parsing (CSS selectors)
scraper = "0.18"
# JSON serialization / deserialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
# Async runtime
tokio = { version = "1.0", features = ["full"] }
# Logging
log = "0.4"
env_logger = "0.10"
# Error handling (optional; simplifies error propagation)
anyhow = "1.0"
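
Imports and logger initialization in src/main.rs:
use anyhow::{Context, Result};
use log::info;
use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;
use std::fs::File;
use std::path::Path;

// Initialize logging (prints diagnostic messages at run time;
// defaults to the info level unless RUST_LOG is set)
fn init_logger() {
    env_logger::Builder::from_env(env_logger::Env::default().default_filter_or("info")).init();
}
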
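Define a struct to hold the repository information. Deriving serde::Serialize enables JSON serialization, and the fields map one-to-one onto what the GitHub Trending page shows:
#[derive(Debug, Serialize)]
struct GithubRepo {
    // Author or organization name
    author: String,
    // Repository name
    name: String,
    // Repository description (None when the repository has none)
    description: Option<String>,
    // Star count, kept as displayed on the page (e.g. "14,823")
    stars: String,
    // Fork count, kept as displayed on the page
    forks: String,
    // Stars gained today
    today_stars: String,
    // Repository URL
    url: String,
}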
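
The struct keeps stars and forks as strings exactly as the page displays them, thousands separators included. For numeric analysis later on, a conversion along the following lines might be used. This is only a sketch: parse_count is a hypothetical helper that is not part of the original project, and it assumes the text is a plain integer with optional commas.

```rust
/// Hypothetical helper: parses a count as displayed on the page,
/// e.g. "14,823", into a number. Returns None for empty or
/// non-numeric text instead of panicking.
fn parse_count(text: &str) -> Option<u64> {
    // Drop surrounding whitespace and thousands separators.
    let digits: String = text.trim().chars().filter(|c| *c != ',').collect();
    digits.parse().ok()
}
```

Returning Option rather than a default keeps "no value" distinguishable from a real zero, which the string default "0" used elsewhere in the crawler cannot do.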

Optimization: keep the browser-like User-Agent header and the timeout settings, which reduce the chance of GitHub rejecting the request, and attach context to every request error:
async fn fetch_trending_page(client: &Client) -> Result<String> {
    // GitHub Trending page for Rust (ranked by today's activity)
    let url = "https://github.com/trending/rust?since=daily";
    info!("Fetching page: {}", url);
    // Send a GET request (the User-Agent imitates a browser so that GitHub does not block it)
    let response = client
        .get(url)
        .header(
            "User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
        .send()
        .await
        .with_context(|| format!("Failed to request URL: {}", url))?;
    // Check the status code (200-299 counts as success)
    if !response.status().is_success() {
        anyhow::bail!("Request failed with status: {}", response.status());
    }
    // Read the response body (the HTML string) and log its size
    let html = response.text().await.context("Failed to read response body")?;
    info!("Successfully fetched page (size: {} bytes)", html.len());
    Ok(html)
}
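
The URL above is hard-coded for Rust and the daily window. If other languages or time ranges were wanted, the URL could be assembled by a small helper such as the one sketched here. Since and trending_url are hypothetical names that do not appear in the original project, and the language slug is assumed to be URL-safe and is inserted verbatim.

```rust
/// Time window for the Trending page (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
enum Since {
    Daily,
    Weekly,
    Monthly,
}

impl Since {
    /// Value of the `since` query parameter for this window.
    fn as_str(self) -> &'static str {
        match self {
            Since::Daily => "daily",
            Since::Weekly => "weekly",
            Since::Monthly => "monthly",
        }
    }
}

/// Builds a Trending URL with the same shape as the hard-coded one.
/// Assumption: `language` is already a URL-safe slug such as "rust".
fn trending_url(language: &str, since: Since) -> String {
    format!(
        "https://github.com/trending/{}?since={}",
        language,
        since.as_str()
    )
}
```

Calling trending_url("rust", Since::Daily) reproduces the URL used in fetch_trending_page.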
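
Optimizations:
- map_err: surfaces the details of a selector parse error, which helps debugging;
- Selectors: the description selector is a plain p tag, and the stars / forks selectors match on the href suffix (e.g. a[href$='/stargazers']), so that changes to GitHub's styling class names do not break parsing;
- unwrap_or_else: gives a missing stars / forks / today's-stars value the default "0" instead of letting the program panic.
fn parse_repos(html: &str) -> Result<Vec<GithubRepo>> {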
    info!("Starting to parse repositories...");
    let document = Html::parse_document(html);
    // CSS selectors (based on semantic attributes to limit the impact of styling changes)
    // 1. Root node of each repository (the list wraps every entry in article.Box-row)
    let repo_selector = Selector::parse("article.Box-row")
        .map_err(|e| anyhow::anyhow!("Failed to parse repo selector: {}", e))?;
    // 2. Author + repository name (the a tag under h2, which carries the repository path)
    let author_name_selector = Selector::parse("h2 a")
        .map_err(|e| anyhow::anyhow!("Failed to parse author-name selector: {}", e))?;
    // 3. Description (a plain p tag, so no specific class name is required)
    let desc_selector = Selector::parse("p")
        .map_err(|e| anyhow::anyhow!("Failed to parse description selector: {}", e))?;
    // 4. Star count (matched by the href suffix /stargazers)
    let stars_selector = Selector::parse("a[href$='/stargazers']")
        .map_err(|e| anyhow::anyhow!("Failed to parse stars selector: {}", e))?;
    // 5. Fork count (matched by the href suffix /forks)
    let forks_selector = Selector::parse("a[href$='/forks']")
        .map_err(|e| anyhow::anyhow!("Failed to parse forks selector: {}", e))?;
    // 6. Stars gained today (matched by the data-menu-button-text attribute).
    //    Note: every today_stars value in the sample output further down is "0",
    //    which suggests this selector matches nothing on the page; verify it
    //    against the live markup before relying on the field.
    let today_stars_selector = Selector::parse("span[data-menu-button-text]")
        .map_err(|e| anyhow::anyhow!("Failed to parse today-stars selector: {}", e))?;
    let mut repos = Vec::new();
    // Walk every repository node and extract its fields
    for repo_node in document.select(&repo_selector) {
        // 1. Author and repository name (link text in the form "author/name")
        let author_name_element = repo_node
            .select(&author_name_selector)
            .next()
            .context("Missing author/name element (GitHub page structure may have changed)")?;
        // Clean the text (strip surrounding spaces and line breaks)
        let author_name_text = author_name_element
            .text()
            .collect::<String>()
            .trim()
            .to_string();
        // Split author and name (the text must look like "author/name", otherwise fail)
        let (author, name) = author_name_text
            .split_once('/')
            .with_context(|| format!("Invalid author/name format: '{}' (expected 'author/name')", author_name_text))?;
        let author = author.trim().to_string();
        let name = name.trim().to_string();
        // 2. Full repository URL (the href attribute is a relative path, so prepend the GitHub domain)
        let url = author_name_element
            .value()
            .attr("href")
            .context("Missing href attribute for repo link")?
            .to_string();
        let url = format!("https://github.com{}", url); // build the absolute URL
        // 3. Description (optional: None when absent, so no unwrap that could panic)
        let description = repo_node
            .select(&desc_selector)
            .next()
            .map(|elem| elem.text().collect::<String>().trim().to_string());
        // 4. Star count (defaults to "0" when missing)
        let stars = repo_node
            .select(&stars_selector)
            .next()
            .map(|elem| elem.text().collect::<String>().trim().to_string())
            .unwrap_or_else(|| "0".to_string());
        // 5. Fork count (defaults to "0" when missing)
        let forks = repo_node
            .select(&forks_selector)
            .next()
            .map(|elem| elem.text().collect::<String>().trim().to_string())
            .unwrap_or_else(|| "0".to_string());
        // 6. Stars gained today (defaults to "0" when missing)
        let today_stars = repo_node
            .select(&today_stars_selector)
            .next()
            .map(|elem| elem.text().collect::<String>().trim().to_string())
            .unwrap_or_else(|| "0".to_string());
        // Build the repository record and append it to the list
        repos.push(GithubRepo {
            author,
            name,
            description,
            stars,
            forks,
            today_stars,
            url,
        });
    }
    info!("Successfully parsed {} repositories", repos.len());
    Ok(repos)
}
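
The author/name split inside parse_repos can be tried out in isolation. The sketch below uses only the standard library; split_author_name is a hypothetical helper, and the sample input assumes the link text carries spaces and line breaks around the slash (the original code trims both halves, which suggests as much). Unlike the original, this version also rejects an empty author or name.

```rust
/// Hypothetical helper: splits link text such as "YaLTeR /\n\n      niri"
/// into (author, name). Returns None when there is no slash or when
/// either half is empty after trimming.
fn split_author_name(text: &str) -> Option<(String, String)> {
    // split_once splits at the first '/' only, so the remainder of
    // the text stays intact in the second half.
    let (author, name) = text.trim().split_once('/')?;
    let (author, name) = (author.trim(), name.trim());
    if author.is_empty() || name.is_empty() {
        return None;
    }
    Some((author.to_string(), name.to_string()))
}
```

Keeping this step in a small pure function makes it testable without fetching or parsing any HTML.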

Serialize the parsed repository list as pretty-printed JSON, which is easy to read and to process later:
fn save_repos_to_json(repos: &[GithubRepo], path: &str) -> Result<()> {
    info!("Saving repositories to JSON file: {}", path);
    // Create the file (an existing file is overwritten)
    let file = File::create(Path::new(path))
        .with_context(|| format!("Failed to create file: {} (check directory permissions)", path))?;
    // Serialize and write (pretty mode: indented output for readability)
    serde_json::to_writer_pretty(file, repos)
        .context("Failed to serialize repos to JSON (invalid data format)")?;
    info!("Successfully saved {} repos to {}", repos.len(), path);
    Ok(())
}

The entry point uses the tokio async runtime and runs the pipeline "fetch page → parse → save", logging each step:
#[tokio::main]
async fn main() -> Result<()> {
    // Initialize logging at program start
    init_logger();
    info!("Starting GitHub Trending Rust Crawler...");
    // Create the HTTP client (timeouts keep a network problem from hanging the program)
    let client = Client::builder()
        .connect_timeout(std::time::Duration::from_secs(10)) // connection timeout: 10 s
        .timeout(std::time::Duration::from_secs(15)) // overall request timeout (connect through end of body): 15 s
        .build()
        .context("Failed to create HTTP client (check network or dependencies)")?;
    // 1. Fetch the GitHub Trending page HTML
    let html = fetch_trending_page(&client).await?;
    // 2. Parse the HTML and extract the repository information
    let repos = parse_repos(&html)?;
    // 3. Save the result to a JSON file (trending_repos.json in the project root)
    save_repos_to_json(&repos, "trending_repos.json")?;
    info!("Crawler finished successfully! Check 'trending_repos.json' for results.");
    Ok(())
}

# Run directly (logs at the info level by default)
cargo run
# (Optional) log at the debug level for a more detailed trace while debugging
RUST_LOG=debug cargo run

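After a successful run, trending_repos.json appears in the project root. Sample output from the optimized version follows. No field is missing, but every today_stars value is the fallback "0", which indicates that the today's-stars selector matched nothing on this page rather than that no repository gained stars: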
[
{
"author": "YaLTeR",
"name": "niri",
"description": "A scrollable-tiling Wayland compositor.",
"stars": "14,823",
"forks": "523",
"today_stars": "0",
"url": "https://github.com/YaLTeR/niri"
},
{
"author": "librespot-org",
"name": "librespot",
"description": "Open Source Spotify client library",
"stars": "6,131",
"forks": "773",
"today_stars": "0",
"url": "https://github.com/librespot-org/librespot"
},
{
"author": "zensical",
"name": "zensical",
"description": "A modern static site generator by the creators of Material for MkDocs",
"stars": "859",
"forks": "12",
"today_stars": "0",
"url": "https://github.com/zensical/zensical"
},
{
"author": "atuinsh",
"name": "atuin",
"description": "✨ Magical shell history",
"stars": "26,951",
"forks": "730",
"today_stars": "0",
"url": "https://github.com/atuinsh/atuin"
},
{
"author": "openai",
"name": "codex",
"description": "Lightweight coding agent that runs in your terminal",
"stars": "50,236",
"forks": "6,238",
"today_stars": "0",
"url": "https://github.com/openai/codex"
},
{
"author": "bevyengine",
"name": "bevy",
"description": "A refreshingly simple data-driven game engine built in Rust",
"stars": "43,023",
"forks": "4,224",
"today_stars": "0",
"url": "https://github.com/bevyengine/bevy"
},
{
"author": "topjohnwu",
"name": "Magisk",
"description": "The Magic Mask for Android",
"stars": "56,884",
"forks": "15,868",
"today_stars": "0",
"url": "https://github.com/topjohnwu/Magisk"
},
{
"author": "uutils",
"name": "coreutils",
"description": "Cross-platform Rust rewrite of the GNU coreutils",
"stars": "22,144",
"forks": "1,637",
"today_stars": "0",
"url": "https://github.com/uutils/coreutils"
},
{
"author": "regolith-labs",
"name": "ore",
"description": "It's time to mine.",
"stars": "741",
"forks": "262",
"today_stars": "0",
"url": "https://github.com/regolith-labs/ore"
},
{
"author": "commonwarexyz",
"name": "monorepo",
"description": "Commonware Library Primitives and Examples",
"stars": "369",
"forks": "131",
"today_stars": "0",
"url": "https://github.com/commonwarexyz/monorepo"
},
{
"author": "rust-lang",
"name": "rust-analyzer",
"description": "A Rust compiler front-end for IDEs",
"stars": "15,670",
"forks": "1,864",
"today_stars": "0",
"url": "https://github.com/rust-lang/rust-analyzer"
},
{
"author": "chroma-core",
"name": "chroma",
"description": "Open-source search and retrieval database for AI applications.",
"stars": "24,344",
"forks": "1,912",
"today_stars": "0",
"url": "https://github.com/chroma-core/chroma"
},
{
"author": "getzola",
"name": "zola",
"description": "A fast static site generator in a single binary with everything built-in. https://www.getzola.org",
"stars": "16,120",
"forks": "1,094",
"today_stars": "0",
"url": "https://github.com/getzola/zola"
},
{
"author": "longbridge",
"name": "gpui-component",
"description": "Rust GUI components for building fantastic cross-platform desktop application by using GPUI.",
"stars": "7,532",
"forks": "297",
"today_stars": "0",
"url": "https://github.com/longbridge/gpui-component"
},
{
"author": "fish-shell",
"name": "fish-shell",
"description": "The user-friendly command line shell.",
"stars": "31,470",
"forks": "2,159",
"today_stars": "0",
"url": "https://github.com/fish-shell/fish-shell"
},
{
"author": "Spotifyd",
"name": "spotifyd",
"description": "A spotify daemon",
"stars": "10,404",
"forks": "487",
"today_stars": "0",
"url": "https://github.com/Spotifyd/spotifyd"
},
{
"author": "tree-sitter",
"name": "tree-sitter",
"description": "An incremental parsing system for programming tools",
"stars": "22,666",
"forks": "2,198",
"today_stars": "0",
"url": "https://github.com/tree-sitter/tree-sitter"
},
{
"author": "UwUDev",
"name": "ygege",
"description": "High-performance indexer for YGG Torrent written in Rust",
"stars": "78",
"forks": "9",
"today_stars": "0",
"url": "https://github.com/UwUDev/ygege"
},
{
"author": "ankitects",
"name": "anki",
"description": "Anki is a smart spaced repetition flashcard program",
"stars": "24,634",
"forks": "2,583",
"today_stars": "0",
"url": "https://github.com/ankitects/anki"
},
{
"author": "get-convex",
"name": "convex-backend",
"description": "The open-source reactive database for app developers",
"stars": "8,066",
"forks": "452",
"today_stars": "0",
"url": "https://github.com/get-convex/convex-backend"
}
]
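
This project is a crawler, written in Rust, for the trending Rust repositories on GitHub Trending. It uses reqwest for asynchronous HTTP requests, scraper for HTML parsing, and serde / serde_json for JSON serialization, together with the tokio async runtime and anyhow for error handling. Compared with the initial version, the optimized code selects elements by semantic attributes (href suffixes and data attributes) rather than styling classes, which lowers the maintenance cost when GitHub restyles the page. For error handling, map_err reports selector parse errors explicitly and unwrap_or_else supplies a default when information is missing, so a missing field no longer aborts the run. Log output is kept throughout for debugging and troubleshooting. The result is a crawler that collects the author, name, description, star count, and other key details of each day's trending Rust repositories and stores them as a formatted JSON file.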
