How do I remove the entire non-ASCII string that PyPDF2 extracts from a barcode in Python? Note: this is not about the text below the barcode [duplicate]

Question

This question already has answers here:

Remove text below barcode in python barcode.py library (2 answers)

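I am using the PyPDF2 package in Python to convert a PDF to text, and I want to extract specific words from that text. However, passing the sentences to my code causes an error, because PyPDF2 converts the barcode in the following way. Please help me resolve this error. I have a text file named "acc-53.txt" that looks like this: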

“-aUUID：F9F537ED-3066-4E99-B6D1-112D5C4551F0'RO76TCGA-OR-ACM-91A-PRIII 11111 IIIIHIIIIIIII1111111111111IIIIIIIIIIIIIIIIIIIIIIIII1111IIIedIII 1111111111111111111111111111111111111111111111111111111111111III 111111111111 IIIIIIIIIIIIIIIII我IIII我IIIIIII III IIII 11111111我IIIC'1E“。 i'+ 6 Jed＆rf0-7q.b ^ JcronnL k4ored6m aeeAly ,, ph / e .- ^^ 1 a ^ c ^ rr / et ,,, tzt av 1.1Procedure：L nephrectomy，preirenal and paraaortal LNGross description：11 x 10.5 x 9cm，497g诊断：肾上腺皮质癌，标本中的小LN无肿瘤参考病理学：诊断：肾上腺皮质癌，K167 5-10％，高10级/ 10 hpfWeiss评分：2Hough评分：1.69Van Slooten评分：5.7lun.o，网站riscrNpnney` ._... - _ I I IAAr I-re.y-'ri：你好啊！忽悠^ y Iij ^ oSYlun！IS; ncnlunou：，'Ji ^'r'J r''^ nlnEi' .Patient＃来自组织来源SiteDate of reportDate of Surgery / samples collectionSite（确认为肾上腺）有侧向指示左侧肿瘤大小11x10.5x9cmHistologic diagnosisACCLymph Node Status0 / 4PathologicinformationT2 NOWeiss得分2，但K167 5-10％和ACC的Dx，参考病理学家“

I have already tried the following patterns to remove this line:

regex = re.findall('\w+ k774$ ',text)
text.decode('unicode_escape').encode('latin')
regex = '\u00?'
from unidecode import unidecode
def remove_non_ascii(text): return unidecode(unicode(text, encoding="utf-8"))
regex = re.findall('\III IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII IIIIIN IIIIIIIIIIIIIII I I IH I!IIII I IIIIIII I IIIIIIII II !IIIIIIIIIIIIIli, l I I !III IIIIII IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII ',sentences)
re.sub('\u00E2||\u20AC', '', sentences)

from unidecode import unidecode

text = pdf_file.read()
sentences = sent_tokenize(text)

print(sentences)

def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

print(regex)

def findInfo():
    uuId = re.findall('\w{8}\-\w{4}\-\w{4}\-\w{4}\-\w{12}',sentences)   
    Gender= re.findall('female|Female|male|Male$',sentences)   
    tSize = re.findall(r'\d+?\.?\d+?\s?x?\s?\d+?\.?\d+?\s?x?\s?\d+?\.?\d+mm|cm$',sentences)

    Diag = re.findall(r'(DIAGNOSIS|Diagnosis):(.*?),',sentences)
    side = re.findall(r'(LEFT SIDE):(.*?),', sentences)

    return uuId,Gender,tSize, Diag , side

I want to remove the barcode-decoded string from the text so that the text can be processed further.

Answer 1

Assuming you are working with strings, you can simply use the .replace() function to remove the special characters. Like this:

line.replace('|', '')
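Keep in mind that Python strings are immutable, so .replace() returns a new string instead of changing line in place, and the result has to be assigned to a name. A minimal sketch, where the sample value of line is made up for illustration and the real text would come from the PDF extraction:

```python
# Hypothetical sample line (assumption); real input would be the extracted PDF text.
line = "III|11|III Diagnosis: ACC"

# str.replace() returns a new string, so keep the result.
cleaned = line.replace("|", "")

print(cleaned)
```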

Another example:

someline = 'red blue green'
print(someline.replace('blue', ''))

Which prints: "red  green" (two spaces remain where "blue" was removed).
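Since .replace() only removes fixed substrings and the barcode residue in the question varies in length, a regular expression may be a better fit. The sketch below is only one possible approach. It assumes that the residue consists mostly of the characters I, l, 1, | and ! (optionally separated by whitespace) and that a run of 8 or more of them is never legitimate text; both the character set and the threshold are guesses that would need tuning against the real files, and a long number made only of the digit 1 would be removed too. The names strip_barcode_runs and strip_non_ascii are hypothetical helpers, and the sample string is shortened and made up for illustration.

```python
import re

# Assumption: barcode residue shows up as long runs of these characters,
# possibly separated by whitespace. Tune the class and the minimum length.
BARCODE_RUN = re.compile(r"(?:[Il1|!]\s*){8,}")


def strip_barcode_runs(text):
    """Replace each long run of barcode-like characters with one space."""
    return BARCODE_RUN.sub(" ", text)


def strip_non_ascii(text):
    """Drop every character outside the ASCII range (Python 3 str)."""
    return text.encode("ascii", "ignore").decode("ascii")


# Hypothetical, shortened sample in the spirit of the text in the question.
sample = (
    "UUID:F9F537ED-3066-4E99-B6D1-112D5C4551F0 "
    "PRIII 11111 IIIIIIII1111 Diagnosis: ACC, left"
)

cleaned = strip_non_ascii(strip_barcode_runs(sample))
print(cleaned)
```

Short runs such as the "11" inside the UUID stay below the threshold, so a UUID pattern like the one in findInfo() can still match the cleaned text.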
