Is there a way to get content from Azure Document Intelligence in Markdown format, but separated page by page?


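This is what I tried. The content comes back as Markdown, but it is not separated page by page. On the other hand, if I go into the pages property, there is no Markdown.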

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(base64_source=doc_bytes),
    output_content_format=ContentFormat.MARKDOWN
)

As described above. I also tried the `pages` parameter:

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(base64_source=doc_bytes),
    output_content_format=ContentFormat.MARKDOWN,
    pages='1'
)

and iterated over the document, but even for a single page the analysis still takes as long as it does for the full document.

python azure azure-form-recognizer
1 Answer

To extract content from a document with Azure Document Intelligence and separate it page by page:

  • I followed the Azure AI Document Intelligence client library for Python documentation.
for page in result.pages:
    # Create a string to hold the Markdown content for this page
    markdown_content = ""
    
    # Iterate through each element on the page
    for element in page.lines:
        # Append each line to the markdown_content string
        markdown_content += f"{element.content}\n"
    
    # Append the markdown content to the markdown_pages list
    markdown_pages.append(markdown_content)

The Python code below extracts the content from the document and separates it page by page:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Set up the client
endpoint = "https://your-service-endpoint.cognitiveservices.azure.com/"
api_key = "your-api-key"
client = DocumentAnalysisClient(endpoint, AzureKeyCredential(api_key))

# Analyze the document
document_url = "https://sampath1233sa23344545.blob.core.windows.net/sampathpujari/sample-layout_merged%20(1).pdf?sp=r&st=2024-04-17T11:50:23Z&se=2024-04-17T19:50:23Z&sv=2022-11-02&sr=b&sig=r6USzR4zcaECjsNw9HAbJ%2BvzRwrk3InrBmaXklxbs5c%3D"

# Begin document analysis
analysis_poller = client.begin_analyze_document_from_url("prebuilt-document", document_url)
result = analysis_poller.result()  # Wait for the analysis to complete and get the result

# Initialize a list to hold the markdown content for each page
markdown_pages = []

# Iterate through each page
for page in result.pages:
    # Create a string to hold the Markdown content for this page
    markdown_content = ""
    
    # Iterate through each element on the page
    for element in page.lines:
        # Append each line to the markdown_content string
        markdown_content += f"{element.content}\n"
    
    # Append the markdown content to the markdown_pages list
    markdown_pages.append(markdown_content)

# The markdown_pages list now contains the text of each page
# (page.lines yields plain text lines, so no Markdown formatting is included)
for i, content in enumerate(markdown_pages):
    print(f"Page {i + 1}:\n{content}\n")



  • See the article on analyzing complex documents using Azure Document Intelligence Markdown output.
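
One approach that may get closer to what the question asks for: when the analysis is requested with Markdown output, `result.content` holds the whole document as a single Markdown string, and each entry in `result.pages` carries `spans` (an `offset` and a `length`) that locate that page inside the string. Slicing the content with those spans should give the Markdown for each page. The following is a minimal sketch, not a verified implementation. It assumes the span offsets are expressed in the same units as Python string indices (Unicode code points). `split_markdown_by_page` and the stub data are hypothetical names made up for illustration, and the stub stands in for a real analysis result so that the sketch runs without calling the service.

```python
from types import SimpleNamespace


def split_markdown_by_page(content, pages):
    """Return one string per page by slicing `content` with each page's spans.

    Assumption: span offsets and lengths index into `content` the same way
    Python string indices do (Unicode code points).
    """
    markdown_pages = []
    for page in pages:
        # A page may carry more than one span; join the slices in order.
        parts = [
            content[span.offset:span.offset + span.length]
            for span in page.spans
        ]
        markdown_pages.append("".join(parts))
    return markdown_pages


# Stub data (hypothetical) that mimics the shape of an analysis result:
# a full Markdown string plus pages whose spans point into it.
page1_md = "# Title\n\nFirst page text.\n"
page2_md = "## Section\n\nSecond page text.\n"
content = page1_md + "\n" + page2_md

pages = [
    SimpleNamespace(spans=[SimpleNamespace(offset=0, length=len(page1_md))]),
    SimpleNamespace(
        spans=[SimpleNamespace(offset=len(page1_md) + 1, length=len(page2_md))]
    ),
]

markdown_pages = split_markdown_by_page(content, pages)
for i, page_markdown in enumerate(markdown_pages, start=1):
    print(f"Page {i}:\n{page_markdown}")
```

With a real result from the question's `poller.result()` call, the equivalent usage would be `split_markdown_by_page(result.content, result.pages)`. This needs only the single analysis request the question already makes, so it does not require re-analyzing the document once per page.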