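I'm working on a project for a clinic that runs OCR on some lab documents, then parses the data and automatically enters it into their lab system. The raw data is semi-structured, so I can extract the data I need, in the order I need it, through a series of steps. I've been staring at the wall on this one for too long, so any help would be appreciated.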
The process is as follows.
This is the pattern I use to extract the test codes:
(?<=•\s*|\.\s*|\s*)(?<ORDER>[A-Z0-9]{3,9})(?=\s*\||\sJ\s|\sj\s|\sI\s|\s\[\s|\s\]\s)
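Below are two examples from real data. In the first, I match all of the test codes (the four-digit numbers at the start of each line), but I also match the three-character groups at the end (GFR, A1C).
The second image looks ideal: only the test codes are matched.
How can I avoid matching those three-character groups, given that my test codes can legitimately be three characters long (uppercase letters and digits)?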
Three examples of the raw text:
Adult health examination | ICD-10: ZOO.OO: Encounter for general adult medical examination without abnormal findings; Z13.6: Encounter for screening for cardiovascular disorders; • CBCWD | CBC w / differential | BILL: Third Party • LIPID | lipid panel | BILL: Third Party • THYCSCD J thyroid cascade profile | BILL: Third Party • GLYHB | glycated hemoglobin | BILL: Third Party • CMP | comprehensive metabolic panel | BILL: Third Party Vitamin D deficiency | ICD - 10: E55.9: Vitamin D deficiency, unspecified • VITD | vitamin D, 25 - hydroxy | BILL: Third Party Feces contents abnormal | ICD-10: R19.5: Other fecal abnormalities CXSTO1 | stool culture complete | BILL: Patient WBCST | WBC stool | BILL: Patient IFOBT | occult blood fecal(immunochemical) | BILL: Patient 8623 | ova and parasite exam | BILL: Patient Fatigue | ICD-10: R53.83: Other fatigue 2834 | TSH reflex to free T4 | BILL: Third Party 1000 | CBC w/auto diff | BILL: Third Party 9180 | comprehensive metabolic panel + E-GFR | BILL: Third Party 4937 | testosterone, free/total with shbg | BILL: Third Party 2708 | hemoglobin A1C | BILL: Third Party
Thanks for reading.
You need to anchor your regex better and take advantage of the regularly occurring pipe character (|):
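As a minimal sketch of that idea (not the pattern from the original answer), the Python below compares a loose pattern with a more tightly anchored one. Two things are worth noting about the pattern in the question: the \s* branch of its lookbehind can match the empty string, so the lookbehind never rejects anything, and its lookahead only asks for a pipe (or an OCR misread of one) after the token, which GFR and A1C also satisfy. The sketch assumes every order row has the shape code | description | BILL: ..., which is inferred from the sample text only. The names LOOSE_RE, ANCHORED_RE, SAMPLE and extract_orders are hypothetical, and SAMPLE is stitched together from fragments of the raw text in the question.

```python
import re

# Python adaptation of the question's pattern. The lookbehind is dropped:
# Python's re module only accepts fixed-width lookbehind, and the original's
# \s* branch can match the empty string, so it filtered nothing anyway.
LOOSE_RE = re.compile(r"[A-Z0-9]{3,9}(?=\s*\||\s[JjI\[\]]\s)")

# Hypothetical tighter pattern (an assumption, not the answer's regex):
# a code must be followed by a separator, then exactly one more field
# (the description), then "| BILL:".
ANCHORED_RE = re.compile(
    r"\b(?P<ORDER>[A-Z0-9]{3,9})"   # candidate test code
    r"(?:\s*\||\s[JjI\[\]]\s)"      # pipe, or an OCR misread of it
    r"(?=[^|]*\|\s*BILL:)"          # description field, then the BILL field
)

# Fragments copied from the raw text in the question.
SAMPLE = (
    "• CMP | comprehensive metabolic panel | BILL: Third Party "
    "• THYCSCD J thyroid cascade profile | BILL: Third Party "
    "Fatigue | ICD-10: R53.83: Other fatigue "
    "2834 | TSH reflex to free T4 | BILL: Third Party "
    "9180 | comprehensive metabolic panel + E-GFR | BILL: Third Party "
    "2708 | hemoglobin A1C | BILL: Third Party"
)


def extract_orders(text):
    """Return the test codes found by the anchored pattern, in order."""
    return [m.group("ORDER") for m in ANCHORED_RE.finditer(text)]


print(LOOSE_RE.findall(SAMPLE))
# ['CMP', 'THYCSCD', '2834', '9180', 'GFR', '2708', 'A1C']
print(extract_orders(SAMPLE))
# ['CMP', 'THYCSCD', '2834', '9180', '2708']
```

GFR and A1C drop out because the field after them is the BILL field itself rather than a description followed by BILL. The extra lookahead uses only a negated character class and a plain lookahead, so it should carry over to the regex flavor used in the question, though that has not been tested here.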