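I want to extract the text inside text boxes using python-docx. I tried the paragraphs and tables properties, but neither returned the text. Please let me know whether python-docx can extract this text or whether I should try another library.
Here is the code snippet using the paragraphs property: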
!pip install python-docx
import docx
source_file = 'textbox.docx'
doc = docx.Document(source_file)
text = []
# Collect the text of every body paragraph
for para in doc.paragraphs:
    text.append(para.text)
text
Here is the code snippet using the tables property:
!pip install python-docx
import docx
source_file = 'textbox.docx'
doc = docx.Document(source_file)
text = []
# Collect the text of every cell in every table
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            text.append(cell.text)
text
The result is the same in both cases (no text is extracted):
[]
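python-docx does not support text boxes, but you can extract the text with win32com (part of the pywin32 package, which requires Windows with Word installed).
Iterate over the document's shapes and check whether each shape's Type is a text box, which has the value 17. If it is, print its text, as in the first branch of the code below. In your sample Word document, however, the text boxes are grouped, so the shape's Type is 6 (a group). In that case, iterate over the shape's GroupItems to find the text boxes (Type 17) and extract their text. Code example: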
import os
import win32com.client as win32

word = win32.gencache.EnsureDispatch('Word.Application')
# Word resolves relative paths against its own working directory, so pass an absolute path
source_file = os.path.abspath('textbox.docx')
doc = word.Documents.Open(source_file)
for sh in doc.Shapes:
    print(sh.Type)  # Type 17 is a text box
    if sh.Type == 17:
        print(sh.Name)
        print(sh.TextFrame.TextRange.Text)
    elif sh.Type == 6:  # 6 is a group
        for grp in sh.GroupItems:
            print(grp.Name)
            print(grp.Type)
            if grp.Type == 17:
                t = grp.TextFrame.TextRange.Text
                print(t.replace('\r', ''))
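If you want the text back as a list instead of printed, the same shape-walking logic can be wrapped in a small generator. This is only a sketch: the name iter_textbox_text is hypothetical, the constants are the numeric Type values 17 and 6 used above, and the recursive call into GroupItems is a precaution in case a group contains another group (whether Word ever exposes groups that way is an assumption).

```python
MSO_GROUP = 6      # Type value of a group shape
MSO_TEXT_BOX = 17  # Type value of a text box


def iter_textbox_text(shapes):
    """Yield the text of every text box in `shapes`, descending into groups.

    `shapes` is any iterable of objects exposing the Word Shape attributes
    used in the answer: Type, TextFrame.TextRange.Text and GroupItems.
    """
    for shape in shapes:
        if shape.Type == MSO_TEXT_BOX:
            # Word separates paragraphs with '\r'; strip it as the answer does
            yield shape.TextFrame.TextRange.Text.replace('\r', '')
        elif shape.Type == MSO_GROUP:
            # Hypothetical precaution: recurse in case a group holds a group
            yield from iter_textbox_text(shape.GroupItems)
```

With a document opened as in the script above, list(iter_textbox_text(doc.Shapes)) should then give one string per text box.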
Running the win32com script on the sample document prints (the text is truncated in this display):
Power Case Study: Solar PanelsMany households in Florida have adopted solar panels as a source of energy. There have been a total of 155,383 solar panel installations in Florida (
Yu et al. 2018), out of which 90.7% (140,265) are residential installations, and 9.3% (15118) are commercial installations. Over 90% of the solar panels have been installed since
2017.Solar panels can provide access to continuous power supply when a hurricane damages the primary grid, as exemplified in Hurricane Ian by Babcock Ranch, a community located 12
miles northeast of Fort Myers.
...
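If Word or Windows is not available, the text can usually still be reached without python-docx, because a .docx file is a zip archive and text box content is stored in w:txbxContent elements inside word/document.xml. The sketch below uses only the standard library. The function name textbox_paragraphs is hypothetical, it has not been run against the sample file from the question, and it reads the main document part only (not headers or footers). It skips mc:Fallback elements because Word commonly saves each text box twice, once in modern and once in legacy markup, which would otherwise duplicate the text.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
W_TXBX = f"{{{W_NS}}}txbxContent"
W_P = f"{{{W_NS}}}p"
W_T = f"{{{W_NS}}}t"
MC_FALLBACK = f"{{{MC_NS}}}Fallback"


def _walk(element, inside_box, out):
    """Collect paragraph text found under w:txbxContent, skipping fallbacks."""
    if element.tag == MC_FALLBACK:
        return  # legacy copy of a text box that was already visited
    if element.tag == W_TXBX:
        inside_box = True
    if inside_box and element.tag == W_P:
        # Join the runs of one paragraph into a single string
        text = "".join(t.text or "" for t in element.iter(W_T))
        if text:
            out.append(text)
        return
    for child in element:
        _walk(child, inside_box, out)


def textbox_paragraphs(docx_file):
    """Return the text of each non-empty paragraph inside a text box.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    _walk(root, False, paragraphs)
    return paragraphs
```

Calling textbox_paragraphs('textbox.docx') returns a list of strings, one per paragraph found inside a text box; ordinary body paragraphs are ignored.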