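Problem Definition and Data Preparation
We need to process two Excel files:
small.xlsx: about 5,000 records.
large.xlsx: about 140,000 records.
The goal is to find every record in the large table whose ID card number also appears in the small table, and save the matches as result.xlsx. We assume the ID column is named id_card in both tables.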
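Stated in pandas terms, the goal amounts to a membership filter on the id_card column. The following is a minimal sketch on tiny in-memory tables, not a definitive implementation; the sample values and the other_data column are hypothetical stand-ins for the real files.

```python
import pandas as pd

# Tiny in-memory stand-ins for small.xlsx and large.xlsx (hypothetical values).
small_df = pd.DataFrame({"id_card": ["A1", "B2"]})
large_df = pd.DataFrame({
    "id_card": ["A1", "C3", "B2", "A1"],
    "other_data": [1, 2, 3, 4],  # hypothetical extra column
})

# Keep the rows of the large table whose id_card appears in the small table.
mask = large_df["id_card"].isin(small_df["id_card"])
result_df = large_df[mask]
print(result_df)
```

With real files, result_df would then be written out with to_excel, as the goal describes.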
First, install the required libraries:
pip install pandas openpyxl
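To demonstrate the approach and measure performance, we can first generate mock data (in real use, load your own files with pd.read_excel instead):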
import pandas as pd
import time
import random

def generate_id_card():
    """Generate a mock 18-character ID card number."""
    region_code = random.choice(['110101', '310104', '440301'])
    birth_date = f"19{random.randint(50, 99):02d}{random.randint(1, 12):02d}{random.randint(1, 28):02d}"
    sequence_code = f"{random.randint(0, 999):03d}"
    # The check character is picked at random, not computed from the other digits.
    check_code = random.choice(['X', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
    return region_code + birth_date + sequence_code + check_code

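# Generate the small table (5,000 rows)
small_data = {'id_card': [generate_id_card() for _ in range(5000)]}
small_df = pd.DataFrame(small_data)
small_df.to_excel('small.xlsx', index=False)

# Generate the large table (140,000 rows), reusing some IDs from the small
# table so that the two tables overlap.
# Assumed values: sample size 1000 and reuse probability 0.3.
ids_from_small = small_df['id_card'].tolist()
overlap_ids = random.sample(ids_from_small, 1000)
large_list = []
for _ in range(140000):
    if random.random() < 0.3 and overlap_ids:
        id_to_use = random.choice(overlap_ids)
    else:
        id_to_use = generate_id_card()
    large_list.append(id_to_use)
# Assumed: 'other_data' is a placeholder name for a filler column.
large_data = {'id_card': large_list, 'other_data': ['x'] * 140000}
large_df = pd.DataFrame(large_data)
large_df.to_excel('large.xlsx', index=False)
print(f"small.xlsx written: {len(small_df)} rows")
print(f"large.xlsx written: {len(large_df)} rows")
print("Mock data generated.")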
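When the mock data is replaced with real files, the ID column is worth loading as text, for example by passing dtype={'id_card': str} to pd.read_excel, since an 18-digit value read as a number can lose precision and a trailing X cannot be numeric at all. The sketch below shows one possible cleanup step under that assumption; normalize_id_column and the looks_valid column are hypothetical names introduced here, and the sample values are made up.

```python
import pandas as pd

def normalize_id_column(df: pd.DataFrame, column: str = "id_card") -> pd.DataFrame:
    """Return a copy whose ID column is stripped, upper-cased text."""
    out = df.copy()
    out[column] = out[column].astype(str).str.strip().str.upper()
    return out

# Made-up sample rows: stray spaces, a lower-case check character, a bad value.
raw = pd.DataFrame({
    "id_card": [" 11010119900101123x", "310104198512120017 ", "12345"],
})
clean = normalize_id_column(raw)

# Flag values that do not have the 17-digits-plus-check-character shape.
clean["looks_valid"] = clean["id_card"].str.fullmatch(r"\d{17}[\dX]")
print(clean)
```

Applying the same normalization to both tables before matching keeps values that differ only in whitespace or letter case from being treated as different IDs.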


