Extracting text and comments from a Google Doc with Python


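I need help extracting comments from one of my Google Docs. Basically, I want to get both the text that was commented on and the content of the comment box. For example, if I comment "This is out of place" on the sentence "Hello World", I would like to get both strings. If I can't have both, the content of the comment box matters more to me. Here is my code so far: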

def read_comments(comments):
    comment_text = ''
    for comment in comments:
        comment_text += comment['content']
    return comment_text

def main():
    credentials = get_credentials()
    http = credentials.authorize(Http())
    docs_service = discovery.build(
        'docs', 'v1', http=http, discoveryServiceUrl=DISCOVERY_DOC)
    
    doc = docs_service.documents().get(documentId=DOCUMENT_ID_2).execute()
    doc_content = doc.get('body').get('content')

    comments = docs_service.documents().get(documentId=DOCUMENT_ID_2).execute().get('comments', [])
    comments_text = read_comments(comments)

    print(comments_text)

    sentences = sent_tokenize(comments_text)
    for sentence in sentences:
        sentence = "{This is a PB}" + sentence + "{This is a PB}"
        print(sentence)

if __name__ == '__main__':
    main()

When I run this, I get no errors, but nothing is returned. The list is empty.

python google-cloud-platform google-docs text-extraction
1 Answer

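You need to use the Google Drive API, not the Google Docs API, to get the comments on a Google Doc. Comments are not part of the document content; they are metadata associated with the file. The Docs API `documents().get` response has no `comments` key, which is why your list is always empty. Here is a modified script that uses the Drive API to fetch both the comment content and the quoted file content: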

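from googleapiclient import discovery
from httplib2 import Http

def main():
    credentials = get_credentials()
    http = credentials.authorize(Http())
    # Comments live in the Drive API, so build a Drive v3 client.
    # DISCOVERY_DOC is dropped here on the assumption that it points at the
    # Docs discovery document, which would build the wrong service.
    gdrive_service = discovery.build("drive", "v3", http=http)

    # `fields` is required by comments.list; '*' returns every field.
    results = gdrive_service.comments().list(
        fileId=DOCUMENT_ID_2, fields="*"
    ).execute()
    comments = results.get("comments", [])

    # Now, each item in `comments` is a dictionary, with the following fields:
    # 'content', 'quotedFileContent', 'replies', 'author', 'deleted', 'htmlContent', ...
    # The 'content' field contains the comment text
    # The 'quotedFileContent' field is a dictionary whose 'value' holds the
    # text that was commented on

    comments_text = read_comments(comments)

    # Rest of the code
    ...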
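
Since the question asks for both the commented-on text and the comment itself, here is a minimal sketch of a helper that pairs them up. The name `extract_comment_pairs` and the `sample_comments` data are hypothetical and only illustrate the shape of the dictionaries described above; the quoted text may come back HTML-escaped, so treat it as raw input.

```python
def extract_comment_pairs(comments):
    """Return (quoted_text, comment_text) tuples from Drive API comment dicts."""
    pairs = []
    for comment in comments:
        # Skip comments that the API reports as deleted.
        if comment.get("deleted"):
            continue
        # 'quotedFileContent' is absent on comments not anchored to any text.
        quoted = comment.get("quotedFileContent", {}).get("value", "")
        pairs.append((quoted, comment.get("content", "")))
    return pairs


# Hypothetical sample shaped like the "comments" list of a comments.list response.
sample_comments = [
    {
        "content": "This is out of place",
        "quotedFileContent": {"mimeType": "text/html", "value": "Hello World"},
    },
    {"content": "General note"},
    {"content": "", "deleted": True},
]

for quoted, text in extract_comment_pairs(sample_comments):
    print(f"{quoted!r} -> {text!r}")
```

In `main()` you could call `extract_comment_pairs(comments)` in place of `read_comments(comments)` if you want to keep each comment next to the text it refers to.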

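Note that the Google Drive API must be enabled for your project, and if you authenticate with a service account, the document must be shared with that service account's email address.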
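
Once access is set up, keep in mind that `comments().list` returns results in pages, so a document with many comments may need several requests. The sketch below follows `nextPageToken` until it runs out. The function name `list_all_comments` is hypothetical, and `service` is assumed to be a Drive v3 client built as in the script above.

```python
def list_all_comments(service, file_id):
    """Collect every comment on a file by following nextPageToken."""
    comments = []
    page_token = None
    while True:
        response = service.comments().list(
            fileId=file_id,
            # Request only the fields used here, plus the paging token.
            fields="nextPageToken, comments(content, quotedFileContent, deleted)",
            pageSize=100,
            pageToken=page_token,
        ).execute()
        comments.extend(response.get("comments", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return comments
```

Usage would look like `comments = list_all_comments(gdrive_service, DOCUMENT_ID_2)`.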
