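This is what I tried. The returned content is in Markdown, but it is not separated page by page. On the other hand, if I go through the pages property, there is no Markdown.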
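# Imports assumed from the beta azure-ai-documentintelligence package
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, ContentFormat
from azure.core.credentials import AzureKeyCredential

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(base64_source=doc_bytes),
    output_content_format=ContentFormat.MARKDOWN,
)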
That is the call described above. I also tried the pages parameter:
document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(base64_source=doc_bytes),
    output_content_format=ContentFormat.MARKDOWN,
    pages='1'
)
and iterated over the document, but even for a single page the analysis still takes as long as it does for the full document.
To extract content from a document with Azure Document Intelligence and separate it page by page:
for page in result.pages:
    # Create a string to hold the Markdown content for this page
    markdown_content = ""
    # Iterate through each element on the page
    for element in page.lines:
        # Append each line to the markdown_content string
        markdown_content += f"{element.content}\n"
    # Append the markdown content to the markdown_pages list
    markdown_pages.append(markdown_content)
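The loop above joins the text of page.lines, and as the question notes, going through the pages property yields no Markdown. One possible alternative, offered only as a sketch: if each page in the result exposes spans (an offset and a length into result.content), the Markdown string can be sliced per page. The helper name split_content_by_page and the stand-in objects below are hypothetical and exist only so the example runs without calling the service. Whether the spans line up with the Markdown content should be checked against your SDK version.

```python
from types import SimpleNamespace


def split_content_by_page(content, pages):
    """Return one string per page by slicing `content` with each page's spans.

    Hypothetical helper: assumes every page has a `spans` list whose items
    carry `offset` and `length` attributes that index into `content`.
    """
    page_texts = []
    for page in pages:
        # A page may carry more than one span; join the slices in order.
        parts = [content[s.offset:s.offset + s.length] for s in page.spans]
        page_texts.append("".join(parts))
    return page_texts


# Stand-in data mimicking the shape of result.content and result.pages
# (an assumption for illustration, not real service output).
content = "# Title\n\nFirst page text.\n## Section\n\nSecond page text.\n"
split_at = content.index("## Section")
pages = [
    SimpleNamespace(spans=[SimpleNamespace(offset=0, length=split_at)]),
    SimpleNamespace(
        spans=[SimpleNamespace(offset=split_at, length=len(content) - split_at)]
    ),
]

markdown_pages = split_content_by_page(content, pages)
for i, text in enumerate(markdown_pages, start=1):
    print(f"Page {i}:\n{text}")
```

With a real result, the call would presumably be split_content_by_page(result.content, result.pages). This keeps a single analysis request for the whole document rather than one request per page.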
The Python code below extracts the content from a document and separates it page by page:
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
# Set up the client
endpoint = "https://your-service-endpoint.cognitiveservices.azure.com/"
api_key = "your-api-key"
client = DocumentAnalysisClient(endpoint, AzureKeyCredential(api_key))
# Analyze the document
document_url = "https://sampath1233sa23344545.blob.core.windows.net/sampathpujari/sample-layout_merged%20(1).pdf?sp=r&st=2024-04-17T11:50:23Z&se=2024-04-17T19:50:23Z&sv=2022-11-02&sr=b&sig=r6USzR4zcaECjsNw9HAbJ%2BvzRwrk3InrBmaXklxbs5c%3D"
# Begin document analysis
analysis_poller = client.begin_analyze_document_from_url("prebuilt-document", document_url)
result = analysis_poller.result() # Wait for the analysis to complete and get the result
# Initialize a list to hold the markdown content for each page
markdown_pages = []
# Iterate through each page
for page in result.pages:
    # Create a string to hold the Markdown content for this page
    markdown_content = ""
    # Iterate through each element on the page
    for element in page.lines:
        # Append each line to the markdown_content string
        markdown_content += f"{element.content}\n"
    # Append the markdown content to the markdown_pages list
    markdown_pages.append(markdown_content)
# The markdown_pages list now contains the content of each page in Markdown format
for i, content in enumerate(markdown_pages):
    print(f"Page {i + 1}:\n{content}\n")
Output: