Medical NLP in Practice: Electronic Health Record Analysis and Model Applications

[Figure: medical NLP application scenarios]

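Natural language processing (NLP) in healthcare is steadily moving from research into production. Whether the task is structuring electronic health records, supporting diagnosis, or detecting drug interactions, the core problem is the same: getting a machine to understand specialized medical text. Using domain-specific pretrained models such as BioBERT and ClinicalBERT and a hands-on Python project, this article walks through the full pipeline from data preprocessing to system deployment.

Core Scenarios for Medical NLP

Electronic Health Record Analysis

Electronic health records (EHRs) are the foundation of medical data. They contain largely unstructured text covering patient demographics, diagnostic records, and treatment plans. NLP lets us tackle the following key tasks:

Text classification: automatically identify the document type, such as admission note, discharge summary, or operative note.
Entity recognition: extract disease names, symptom descriptions, medication details, and similar mentions.
Relation extraction: uncover links between drugs and diseases, or between symptoms and diagnoses.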
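
To make the entity and relation tasks concrete, here is a minimal sketch of what their outputs could look like. The Entity and Relation classes, the toy LEXICON, and both helper functions are hypothetical illustrations, not part of any library; a dictionary lookup stands in for a trained model only to show the data shapes.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Entity:
    text: str    # surface form as it appears in the note
    label: str   # e.g. "DISEASE", "SYMPTOM", "DRUG"
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive

@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    label: str   # e.g. "TREATS"

# Hypothetical toy lexicon; a real system would use a trained NER model.
LEXICON = {"cough": "SYMPTOM", "chronic bronchitis": "DISEASE", "amoxicillin": "DRUG"}

def find_entities(note: str) -> List[Entity]:
    """Dictionary lookup: return every lexicon term found in the note, in text order."""
    lowered = note.lower()
    found = []
    for term, label in LEXICON.items():
        start = lowered.find(term)
        while start != -1:
            end = start + len(term)
            found.append(Entity(note[start:end], label, start, end))
            start = lowered.find(term, end)
    return sorted(found, key=lambda e: e.start)

def link_drugs_to_diseases(entities: List[Entity]) -> List[Relation]:
    """Naive relation extraction: pair every DRUG with every DISEASE in the same note."""
    drugs = [e for e in entities if e.label == "DRUG"]
    diseases = [e for e in entities if e.label == "DISEASE"]
    return [Relation(drug, disease, "TREATS") for drug in drugs for disease in diseases]

note = "Cough for one week. Diagnosed with chronic bronchitis; started amoxicillin."
entities = find_entities(note)
relations = link_drugs_to_diseases(entities)
for entity in entities:
    print(entity.label, entity.text, entity.start, entity.end)
```

Character offsets are kept on each entity so that downstream steps can highlight the span in the original note or align it with annotations.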

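Hands-On: Classification with ClinicalBERT

We load the pretrained checkpoint emilyalsentzer/Bio_ClinicalBERT from the Hugging Face Hub through the Transformers library; it was trained on clinical notes and captures clinical language well. Note that the checkpoint ships without a classification head: BertForSequenceClassification initializes one randomly, so the predicted labels only become meaningful after fine-tuning on labeled records.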

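from transformers import BertTokenizer, BertForSequenceClassification
import torch

def analyze_ehr(text, model_name='emilyalsentzer/Bio_ClinicalBERT', num_labels=3):
    # Load the tokenizer and model (the classification head is randomly
    # initialized until the model is fine-tuned on labeled records)
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    model.eval()  # make sure dropout is off for inference

    # Encode the input text, truncating to BERT's 512-token limit
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)

    # Run the forward pass without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Pick the most probable class
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    label = torch.argmax(probs, dim=-1).item()
    return label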
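
analyze_ehr returns only the winning class index. The sketch below shows, in plain Python, what the softmax and argmax steps compute on raw logits and how the index could be turned into a label name with a confidence score. The LABELS list and the decode helper are assumptions for illustration; the label order must match whatever order was used during fine-tuning.

```python
import math

# Assumed label order; it must match the order used when fine-tuning.
LABELS = ["admission note", "discharge summary", "operative note"]

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits, labels=LABELS):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Example logits for one document (made-up numbers)
name, confidence = decode([0.2, 2.1, -1.0])
print(f"{name} ({confidence:.2f})")
```

Surfacing the probability alongside the label lets an application flag low-confidence predictions for manual review instead of presenting every result as equally reliable.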

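Diagnostic Decision Support

Predicting likely diseases from symptoms and medical history is often approached with classical machine learning. The baseline below turns free-text symptom descriptions into TF-IDF features and trains a logistic regression classifier.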

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

def disease_diagnosis_assistance(data):
    # Preprocess: drop incomplete rows; copy so the caller's DataFrame is untouched
    data = data.dropna().copy()
    data['symptoms'] = data['symptoms'].astype(str)

    X = data['symptoms']
    y = data['disease']

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Vectorize the text; fit on the training split only to avoid leakage
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    # Train and evaluate; a higher max_iter helps convergence on sparse features
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_tfidf, y_train)
    y_pred = model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model accuracy: {accuracy:.3f}")

    # Return the vectorizer too: new text must be transformed with it before predict()
    return model, tfidf_vectorizer
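
To show what the TF-IDF step produces, here is a plain-Python sketch that follows scikit-learn's default weighting: raw term counts multiplied by the smoothed IDF ln((1 + n) / (1 + df)) + 1, then L2 normalization per document. It is a simplification for illustration: it splits on whitespace and skips stop-word removal, so the numbers will not match TfidfVectorizer exactly on real text. The tfidf function and the sample documents are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        vectors.append({term: weight / norm for term, weight in raw.items()} if norm else {})
    return vectors

docs = ["fever cough fatigue", "fever headache", "chest pain cough"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "fatigue" appears in only one document and therefore receives a higher weight than "fever" or "cough", which each appear in two. That is the property the classifier relies on: rarer symptoms carry more signal.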

Drug Interaction Detection

Identifying synergistic or antagonistic effects between drugs is critical to medication safety.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

def drug_interaction_detection(data):
    # Drop incomplete rows; copy so the caller's DataFrame is untouched
    data = data.dropna().copy()
    data['drug1'] = data['drug1'].astype(str)
    data['drug2'] = data['drug2'].astype(str)

    X = data[['drug1', 'drug2']]
    y = data['interaction']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Join the two drug names into one string per pair, then vectorize
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    X_train_combined = X_train['drug1'] + ' ' + X_train['drug2']
    X_test_combined = X_test['drug1'] + ' ' + X_test['drug2']

    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_combined)
    X_test_tfidf = tfidf_vectorizer.transform(X_test_combined)

    # Fix the seed so results are reproducible across runs
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train_tfidf, y_train)

    y_pred = model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model accuracy: {accuracy:.3f}")

    # Return the vectorizer too: new pairs must be transformed with it before predict()
    return model, tfidf_vectorizer
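
One detail worth considering when building pair features: the same two drugs can arrive in either column order and with inconsistent casing. Assuming interactions are symmetric (an assumption that should be checked against your dataset), a small normalization step gives both orderings the same feature string. The canonical_pair helper below is a hypothetical sketch, not part of the code above.

```python
def canonical_pair(drug1, drug2):
    """Normalize case and whitespace, then sort, so (A, B) and (B, A) yield one string."""
    first, second = sorted(name.strip().lower() for name in (drug1, drug2))
    return f"{first} {second}"

# Both orderings map to the same feature string
print(canonical_pair("Warfarin", " aspirin "))
print(canonical_pair("aspirin", "warfarin"))
```

With a pandas DataFrame, this could be applied row by row, for example data.apply(lambda row: canonical_pair(row['drug1'], row['drug2']), axis=1), before the text is passed to the vectorizer.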

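Key Technical Details

Medical Text Preprocessing

Medical text is full of specialized terminology, abbreviations, and special symbols, so off-the-shelf tokenization often performs poorly. The pipeline needs to be tailored to medical text:

Tokenization and stop-word removal: keep medical terms and drop function words that carry no meaning.
Terminology recognition: use tools such as spaCy to identify entities like diseases and anatomical sites.
Abbreviation expansion: restore common abbreviations such as "HTN" (hypertension).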

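import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy

# One-time setup: nltk.download('punkt'); nltk.download('stopwords')
# (newer NLTK releases may also need 'punkt_tab')
# Assumption: a biomedical NER model such as scispaCy's en_ner_bc5cdr_md
# (labels DISEASE and CHEMICAL) is installed. The general-purpose
# en_core_web_sm emits labels like PERSON and ORG, so filtering its
# output for DISEASE or DRUG would always return an empty list.
def preprocess_medical_text(text, model_name="en_ner_bc5cdr_md", labels=("DISEASE", "CHEMICAL")):
    nlp = spacy.load(model_name)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))

    # Drop stop words and tokens that are not purely alphabetic
    tokens = [token for token in tokens if token.lower() not in stop_words and token.isalpha()]

    # Keep only entities whose label is in the requested set
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents if ent.label_ in labels]

    return tokens, entities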
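
The abbreviation step from the list above is not covered by preprocess_medical_text. A minimal sketch of dictionary-based expansion follows. The ABBREVIATIONS map and the expand_abbreviations function are illustrative assumptions; a production system would need a curated dictionary and context-aware disambiguation, since many clinical abbreviations have more than one meaning.

```python
import re

# Illustrative map only; a real system needs a curated, context-aware dictionary.
ABBREVIATIONS = {
    "HTN": "hypertension",
    "DM": "diabetes mellitus",
    "SOB": "shortness of breath",
    "MI": "myocardial infarction",
}

# \b word boundaries keep "DM" from matching inside longer tokens such as "ADMIT".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")

def expand_abbreviations(text):
    """Replace each known uppercase abbreviation with its long form."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], text)

print(expand_abbreviations("Pt with HTN and DM presents with SOB."))
# -> Pt with hypertension and diabetes mellitus presents with shortness of breath.
```

Matching is case-sensitive on purpose, so ordinary lowercase words are left alone. Running the expansion before tokenization means the long forms survive stop-word filtering and are visible to the entity recognizer.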

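Model Selection and Optimization

General-purpose models often fall short in medical settings. BioBERT is pretrained on biomedical literature (PubMed abstracts and PMC full-text articles) and ClinicalBERT on clinical notes, so both capture domain semantics better. When training, pay particular attention to data quality and hyperparameter tuning, and evaluate with metrics such as F1-score, which is more informative than accuracy when classes are imbalanced.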
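
To illustrate why F1 is worth reporting alongside accuracy, the sketch below computes macro-averaged F1 in plain Python on a made-up imbalanced example. The function names and toy labels are assumptions for illustration; in practice, sklearn.metrics.f1_score with average='macro' computes the same quantity.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced toy labels: 8 "common" and 2 "rare"; the model always predicts "common".
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

Here accuracy is 0.80 even though the model never identifies the rare class, while macro F1 drops to about 0.44 because the rare class contributes an F1 of zero. In clinical data, the rare class is often the one that matters most.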

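Hands-On Project: An EHR Classification System

To bring these pieces together, we build a Tkinter desktop application that lets the user enter a record and classifies it on the spot.

Environment Setup

Make sure the required dependencies are installed (Tkinter ships with most Python distributions):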

pip install transformers torch

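Interface and Logic

The system uses a layered design: an input layer, a processing layer, and a presentation layer. The core components are outlined below.

Input module: provides a text box and a button for entering the record text. It is saved as text_input_frame.py, matching the import in the main program.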

import tkinter as tk
from tkinter import scrolledtext, messagebox

class TextInputFrame(tk.Frame):
    def __init__(self, parent, on_process):
        super().__init__(parent)
        self.on_process = on_process  # callback that receives the entered text
        self.create_widgets()

    def create_widgets(self):
        # Scrollable text box for the record
        self.text_input = scrolledtext.ScrolledText(self, width=60, height=10)
        self.text_input.pack(pady=10, padx=10, fill="both", expand=True)

        tk.Button(self, text="Classify text", command=self.process_text).pack(pady=10, padx=10)

    def process_text(self):
        # "1.0" means line 1, character 0; strip the trailing newline Tk appends
        text = self.text_input.get("1.0", tk.END).strip()
        if text:
            self.on_process(text)
        else:
            messagebox.showwarning("Warning", "Please enter the EHR text")

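Classification logic: reuses the analyze_ehr function from earlier, saved as ehr_analysis_functions.py so the main program can import it. The model is loaded once and cached so that each click does not reload the weights.

from functools import lru_cache
from transformers import BertTokenizer, BertForSequenceClassification
import torch

@lru_cache(maxsize=None)
def load_model(model_name, num_labels):
    # Load once and reuse across calls
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    model.eval()
    return tokenizer, model

def analyze_ehr(text, model_name='emilyalsentzer/Bio_ClinicalBERT', num_labels=3):
    tokenizer, model = load_model(model_name, num_labels)
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    label = torch.argmax(probs, dim=-1).item()
    return label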

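Result display and main program: map the predicted class index to a record type (admission note, discharge summary, and so on) and show it in the interface. ResultFrame is imported from result_frame, which is not listed in this article; it is expected to provide a display_result method.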

import tkinter as tk
from tkinter import ttk, messagebox
from text_input_frame import TextInputFrame
from result_frame import ResultFrame
from ehr_analysis_functions import analyze_ehr

class EhrAnalysisApp:
    def __init__(self, root):
        self.root = root
        self.root.title("EHR Text Classifier")
        self.create_widgets()

    def create_widgets(self):
        # Input area on top, result area below
        self.text_input_frame = TextInputFrame(self.root, self.process_text)
        self.text_input_frame.pack(pady=10, padx=10, fill="both", expand=True)

        self.result_frame = ResultFrame(self.root)
        self.result_frame.pack(pady=10, padx=10, fill="both", expand=True)

    def process_text(self, text):
        try:
            # Map the predicted class index to a record type
            classification = analyze_ehr(text)
            if classification == 0:
                result = "Admission note"
            elif classification == 1:
                result = "Discharge summary"
            else:
                result = "Operative note"
            self.result_frame.display_result(result)
        except Exception as e:
            messagebox.showerror("Error", f"Processing failed: {str(e)}")

if __name__ == "__main__":
    root = tk.Tk()
    app = EhrAnalysisApp(root)
    root.mainloop()
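
The if/elif chain in process_text sends every index other than 0 or 1 to the last branch, which would silently mislabel results if num_labels were ever increased. One possible alternative is a lookup table with an explicit fallback. LABEL_NAMES and label_to_name are hypothetical names, and the index order is an assumption that must match the order used during fine-tuning.

```python
# Assumed mapping; indices must match the label order used during fine-tuning.
LABEL_NAMES = {
    0: "Admission note",
    1: "Discharge summary",
    2: "Operative note",
}

def label_to_name(label_id):
    """Look up the display name; unknown ids are reported rather than guessed."""
    return LABEL_NAMES.get(label_id, f"Unknown category ({label_id})")

print(label_to_name(1))
print(label_to_name(7))
```

Keeping the mapping in one dictionary also makes it easy to add a category: extend the table and adjust num_labels, with no branching logic to revisit.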

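Testing and Running

After launching the program, try the following sample text:

Male patient, 65 years old, admitted for 'cough and sputum production for 1 week'. After admission, the relevant examinations were completed and the diagnosis was 'acute exacerbation of chronic bronchitis'. He received anti-infective, antitussive, and expectorant treatment; his symptoms resolved and he is discharged today.

Click the classification button and the system outputs the predicted record category.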

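Challenges and Outlook

Despite rapid progress, medical NLP still faces serious challenges. The first is data privacy: systems must comply with regulations such as HIPAA and de-identify patient information. The second is terminology ambiguity: the same term can mean very different things across specialties. The third is regulatory compliance: applications subject to oversight by bodies such as the FDA must be rigorously validated.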
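
As a small illustration of the de-identification step, the sketch below masks a few obvious identifier formats with regular expressions. The patterns, the placeholder tags, and the mask_phi function are assumptions made for this example. They cover only a handful of formats and are nowhere near sufficient for HIPAA compliance on their own, since names, addresses, and free-text identifiers need dedicated tooling and review.

```python
import re

# Illustrative patterns only; they cover a few obvious formats, not all identifiers.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),            # e.g. 123-45-6789
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),           # ISO dates
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),    # e.g. 555-201-7788
    (re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE), "[MRN]"),  # record numbers
]

def mask_phi(text):
    """Replace each matched identifier with a placeholder tag."""
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(mask_phi("MRN: 448291, seen 2023-04-17, callback 555-201-7788."))
# -> [MRN], seen [DATE], callback [PHONE].
```

Running a step like this before text reaches a model or a log file limits how far raw identifiers travel through the pipeline.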

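Overall, NLP is making a measurable difference to the efficiency and quality of healthcare. Mastering these development methods and techniques helps us build practical assistive tools and moves smart healthcare closer to real-world use. We hope the hands-on material in this article gives you ideas for your own projects.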