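Problem Definition and Data Preparation
There are typically two Excel files to process:
small.xlsx: about 5,000 records.
large.xlsx: about 140,000 records.
The goal is to quickly find every record in the large table whose ID card number also appears in the small table, and to export those records to a new file. Assume the ID card number column is named id_card in both tables.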
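As a minimal sketch of this goal on toy data: the ID values and the `name` column below are made up for illustration, and `Series.isin` is only one possible way to express the filter.

```python
import pandas as pd

# Toy stand-ins for small.xlsx and large.xlsx; all values are made up.
small_df = pd.DataFrame({"id_card": ["A001", "A002"]})
large_df = pd.DataFrame({
    "id_card": ["A001", "B999", "A002", "C123"],
    "name": ["w", "x", "y", "z"],  # hypothetical extra column
})

# Keep only the rows of the large table whose id_card appears in the small table.
matched = large_df[large_df["id_card"].isin(small_df["id_card"])]
print(matched)
```

The rows that survive the filter are the ones that would be written to the new file.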
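Start with some preparation: install the required libraries (pandas, plus an Excel engine such as openpyxl so that pandas can read and write .xlsx files) and generate some mock data for testing and performance estimation.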
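import pandas as pd
import time
import random

def generate_id_card():
    """Generate a mock 18-character ID card number."""
    # 6-digit region code, picked from a few sample values
    region_code = random.choice(['110101', '310104', '440301'])
    # 8-digit birth date, 1950-1999; day capped at 28 so every month is valid
    birth_date = f"19{random.randint(50,99):02d}{random.randint(1,12):02d}{random.randint(1,28):02d}"
    # 3-digit sequence code
    sequence_code = f"{random.randint(0,999):03d}"
    # The check character is chosen at random, not computed, so these IDs are only format-correct
    check_code = random.choice(['X','0','1','2','3','4','5','6','7','8','9'])
    return region_code + birth_date + sequence_code + check_code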
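# Generate the small table (5,000 rows)
small_data = {'id_card': [generate_id_card() for _ in range(5000)]}
small_df = pd.DataFrame(small_data)
small_df.to_excel('small.xlsx', index=False)

# Generate the large table (140,000 rows), seeding it with some IDs from the small table
N_LARGE = 140000
N_OVERLAP = 1000      # assumed value: how many small-table IDs may appear in the large table
OVERLAP_PROB = 0.05   # assumed value: chance that a large-table row reuses one of those IDs

large_list = []
ids_from_small = small_df['id_card'].tolist()
overlap_ids = random.sample(ids_from_small, N_OVERLAP)
for _ in range(N_LARGE):
    if overlap_ids and random.random() < OVERLAP_PROB:
        id_to_use = random.choice(overlap_ids)
    else:
        id_to_use = generate_id_card()
    large_list.append(id_to_use)

# 'other_data' is a hypothetical filler column standing in for the table's remaining fields
large_data = {'id_card': large_list, 'other_data': ['placeholder'] * N_LARGE}
large_df = pd.DataFrame(large_data)
large_df.to_excel('large.xlsx', index=False)

print("Mock data generated.")
print(f"small.xlsx: {len(small_df)} rows")
print(f"large.xlsx: {len(large_df)} rows")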
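For a rough performance estimate that leaves Excel I/O out of the picture, one option is to time the membership check on in-memory data with the same row counts. The sketch below assumes zero-padded counters as stand-ins for real ID numbers and uses `Series.isin` as the matching step; both are illustrative choices, not necessarily the method used on the real files.

```python
import time
import pandas as pd

# Synthetic string IDs with the same row counts as the two tables.
# Assumption: zero-padded counters stand in for real ID card numbers.
small_ids = pd.Series([f"{i:018d}" for i in range(5_000)])
large_ids = pd.Series([f"{i:018d}" for i in range(0, 280_000, 2)])  # 140,000 rows

# Time only the matching step.
start = time.perf_counter()
mask = large_ids.isin(small_ids)
elapsed = time.perf_counter() - start

print(f"matched {mask.sum()} of {len(large_ids)} rows in {elapsed:.4f} s")
```

Timing the matching step separately from reading and writing the files shows where the time actually goes.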


