Is there a way to get content from Azure Document Intelligence in Markdown format, but separated page by page?

Question

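Here is what I tried. The content comes back as Markdown, but it is not separated page by page. On the other hand, if I go into the pages property, there is no Markdown.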

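from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, ContentFormat
from azure.core.credentials import AzureKeyCredential

# endpoint, key and doc_bytes are defined elsewhere
document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(base64_source=doc_bytes),
    output_content_format=ContentFormat.MARKDOWN
)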

As above. I also tried the pages parameter:

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(base64_source=doc_bytes),
    output_content_format=ContentFormat.MARKDOWN,
    pages='1'
)

and iterating over the document, but even for a single page the analysis takes as long as it does for the full document.

Answer 1

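To extract content from a document with Azure Document Intelligence and separate it page by page:

I followed the Azure AI Document Intelligence client library for Python documentation.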

# Holds one string per page
markdown_pages = []

for page in result.pages:
    # Create a string to hold the text content for this page
    markdown_content = ""
    
    # Iterate through each line on the page
    for element in page.lines:
        # Append each line's text (page.lines holds plain text) to the string
        markdown_content += f"{element.content}\n"
    
    # Append this page's content to the markdown_pages list
    markdown_pages.append(markdown_content)

The Python code below extracts the content of a document and separates it page by page.

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Set up the client
endpoint = "https://your-service-endpoint.cognitiveservices.azure.com/"
api_key = "your-api-key"
client = DocumentAnalysisClient(endpoint, AzureKeyCredential(api_key))

# Analyze the document
document_url = "https://sampath1233sa23344545.blob.core.windows.net/sampathpujari/sample-layout_merged%20(1).pdf?sp=r&st=2024-04-17T11:50:23Z&se=2024-04-17T19:50:23Z&sv=2022-11-02&sr=b&sig=r6USzR4zcaECjsNw9HAbJ%2BvzRwrk3InrBmaXklxbs5c%3D"

# Begin document analysis
analysis_poller = client.begin_analyze_document_from_url("prebuilt-document", document_url)
result = analysis_poller.result()  # Wait for the analysis to complete and get the result

# Initialize a list to hold the markdown content for each page
markdown_pages = []

# Iterate through each page
for page in result.pages:
    # Create a string to hold the Markdown content for this page
    markdown_content = ""
    
    # Iterate through each element on the page
    for element in page.lines:
        # Append each line to the markdown_content string
        markdown_content += f"{element.content}\n"
    
    # Append the markdown content to the markdown_pages list
    markdown_pages.append(markdown_content)

# The markdown_pages list now contains the text of each page
# (page.lines holds plain text lines, so no Markdown formatting is included)
for i, content in enumerate(markdown_pages):
    print(f"Page {i + 1}:\n{content}\n")
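
The loop above collects plain-text lines, so any Markdown formatting is lost. If the analysis was requested with Markdown output as in the question, one possible approach is to slice result.content with the spans of each page. The sketch below assumes that every page exposes a spans list whose offset and length index into result.content as Python string positions (with the newer SDK this may require requesting Unicode code point offsets). The helper name split_markdown_by_page and the stand-in result object are hypothetical and only there so the example runs without calling the service.

```python
from types import SimpleNamespace


def split_markdown_by_page(content, pages):
    """Return one string per page by slicing `content` with each page's spans."""
    per_page = []
    for page in pages:
        # A page may carry several spans; join their slices in order.
        chunks = [content[s.offset:s.offset + s.length] for s in page.spans]
        per_page.append("".join(chunks))
    return per_page


# Stand-in for an analysis result (hypothetical data, not real service output).
first_page = "# Title\n\nPage one text.\n\n"
second_page = "## Next section\n\nPage two text.\n"
fake_result = SimpleNamespace(
    content=first_page + second_page,
    pages=[
        SimpleNamespace(spans=[SimpleNamespace(offset=0, length=len(first_page))]),
        SimpleNamespace(
            spans=[SimpleNamespace(offset=len(first_page), length=len(second_page))]
        ),
    ],
)

for number, text in enumerate(
    split_markdown_by_page(fake_result.content, fake_result.pages), start=1
):
    print(f"Page {number}:\n{text}")
```

With a real result, the call would be split_markdown_by_page(result.content, result.pages). Because the document is still analyzed once as a whole, this does not reduce analysis time; it only splits the output.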

See Analyze complex documents with Azure Document Intelligence Markdown output.
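
Another option, assuming the Markdown returned by the service marks page boundaries with an HTML comment of the form <!-- PageBreak -->, is to split the full Markdown string on that marker. This is only a sketch: the helper name split_on_page_breaks is hypothetical, and it is worth checking that the marker actually appears in your own output before relying on it.

```python
import re

# Matches the assumed page-break comment plus any surrounding whitespace.
PAGE_BREAK = re.compile(r"\s*<!--\s*PageBreak\s*-->\s*")


def split_on_page_breaks(markdown):
    """Split a Markdown string into per-page strings at page-break comments."""
    # Empty pages are kept so that list positions still match page order.
    return PAGE_BREAK.split(markdown)


# Hypothetical sample shaped like Markdown output with one page break.
sample = "# Title\n\nPage one text.\n<!-- PageBreak -->\n## Next\n\nPage two text."
for number, text in enumerate(split_on_page_breaks(sample), start=1):
    print(f"Page {number}:\n{text}\n")
```

Splitting on the marker avoids any dependence on offset units, at the cost of relying on the exact comment text.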
