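I am trying to use the Python package tabula-py to read tables from a PDF, but line breaks inside a table cell seem to split the original cell's content across multiple cells.
I have searched through various Python packages for a solution. tabula-py appears to be the most stable package for converting PDF tables into pandas data. If this problem cannot be solved, however, I will have to fall back on an online service that gives me the Excel output I want.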
from tabula import read_pdf
df = read_pdf("C:/Users/Desktop/test.pdf", pages='all')
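I expect the PDF table to be converted correctly with this.
Tabula no longer provides the "spreadsheet" option. Instead, use the "lattice" option to keep line breaks from splitting a cell into new rows. The code is as follows: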
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf("FDA EPC Text Phrases (updated March 2018.pdf", pages='all',
lattice=True)
print(df)
You can use the "spreadsheet" option with the value True to get rid of the multiple rows of NaN values caused by line breaks.
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf("FDA EPC Text Phrases (updated March 2018.pdf", pages='all', spreadsheet=True)
print(df)
#print(df['Active Moiety Name'])
#print(df['FDA Established Pharmacologic Class\r(EPC) Text Phrase\rPLR regulations require that the following\rstatement is included in the Highlights\rIndications and Usage heading if a drug is a\rmember of an EPC [see 21 CFR\r201.57(a)(6)]: “(Drug) is a (FDA EPC Text\rPhrase) indicated for [indication(s)].” For\reach listed active moiety, the associated\rFDA EPC text phrase is included in this\rdocument. For more information about how\rFDA determines the EPC Text Phrase, see\rthe 2009 "Determining EPC for Use in the\rHighlights" guidance and 2013 "Determining\rEPC for Use in the Highlights" MAPP\r7400.13.'])
Output:
1758 ziconotide N-type calcium channel antagonist
1759 zidovudine HIV nucleoside analog reverse transcriptase in...
1760 zileuton 5-lipoxygenase inhibitor
1761 zinc cation copper absorption inhibitor
1762 ziprasidone atypical antipsychotic
1763 zoledronic acid bisphosphonate
1764 zoledronic acid anhydrous bisphosphonate
1765 zolmitriptan serotonin 5-HT1B/1D receptor agonist (triptan)
1766 zolmitriptan serotonin 5-HT1B/1D receptor agonist (triptan)
1767 zolpidem gamma-aminobutyric acid (GABA) A agonist
1768 zonisamide antiepileptic drug (AED)
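If some rows still come out split after extraction, one possible fallback is to merge the continuation rows yourself. The sketch below is not part of tabula-py: `is_blank` and `merge_continuation_rows` are hypothetical helpers, and they assume that a continuation row can be recognized by a blank first cell, which may not hold for every PDF. The sample rows are made up for illustration by splitting values from the output above.

```python
def is_blank(value):
    """Return True for None, empty strings, and float NaN (pandas' missing marker)."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is never equal to itself
        return True
    return isinstance(value, str) and value.strip() == ""


def merge_continuation_rows(rows, key_index=0, sep=" "):
    """Fold every row whose key cell is blank into the row before it.

    Assumes all rows have the same number of cells.
    """
    merged = []
    for row in rows:
        row = list(row)  # copy so the caller's data is not modified
        if merged and is_blank(row[key_index]):
            previous = merged[-1]
            for i, cell in enumerate(row):
                if is_blank(cell):
                    continue
                if is_blank(previous[i]):
                    previous[i] = cell
                else:
                    previous[i] = f"{previous[i]}{sep}{cell}"
        else:
            merged.append(row)
    return merged


# Illustrative data: the second row is the spilled-over part of the first.
rows = [
    ["zileuton", "5-lipoxygenase"],
    [None, "inhibitor"],
    ["zinc cation", "copper absorption inhibitor"],
]
print(merge_continuation_rows(rows))
```

With a DataFrame, the rows could be taken from `df.values.tolist()` and the merged result passed back to `pd.DataFrame(merged, columns=df.columns)`; the NaN check in `is_blank` is there for that case.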
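I suggest using the "lattice" parameter, so that line breaks are replaced with \r. Another approach is to store the table in a JSON file and load it back into a DataFrame, which ensures that column names containing line breaks are preserved.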
import tabula

# Use Tabula to extract the tables on a specific page and save each one as a JSON file.
# pdf_path is the path to the source PDF and must be defined beforehand.
for i, table in enumerate(tabula.read_pdf(pdf_path, pages="85", multiple_tables=True, lattice=True)):
    table.to_json(str(i) + "_.json")
Example of loading one of the JSON files:
import pandas as pd

data_test = pd.read_json("2_.json")
data_test.head()
Output: the head of data_test
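If the \r characters in column names are more of a nuisance than something to preserve (the long header in the commented-out print statement of the earlier answer is an example), they could also be normalized after extraction. This is only a sketch: `clean_label` is a hypothetical helper, and replacing each line break with a single space is an assumption about what the cleaned name should look like.

```python
import re


def clean_label(label, replacement=" "):
    """Collapse each run of carriage returns/newlines, plus any spaces around it,
    into a single separator."""
    return re.sub(r"\s*[\r\n]+\s*", replacement, str(label)).strip()


# Shortened version of the header shown in the earlier answer.
header = "FDA Established Pharmacologic Class\r(EPC) Text Phrase"
print(clean_label(header))
```

Because `DataFrame.rename` accepts a function for its `columns` argument, `df.rename(columns=clean_label)` should apply the same cleanup to every column name of an extracted table.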