400 Document pages exceed the limit: "PAGE_LIMIT_EXCEEDED"

Question · Votes: 0 · Answers: 2

The DocumentProcessorServiceAsyncClient.process_document method fails with the following error message:

400 Document pages exceed the limit: "PAGE_LIMIT_EXCEEDED"

According to the API documentation, this processor should be able to handle up to 200 pages. By using DocumentProcessorServiceAsyncClient instead of DocumentProcessorServiceClient, I assumed I would be able to take advantage of the asynchronous maximum page limit. That does not appear to be the case.

The sample code I am testing:

from google.cloud import documentai  # google-cloud-documentai client library

api_path = f'projects/{project_id}/locations/{gcloud_region}/processors/{processor_id}'
documentai_client = documentai.DocumentProcessorServiceAsyncClient() # maybe pass some client_options here?

async def invoke_invoice_processor(self, filebytes):
    raw_document = documentai.RawDocument(
        content=filebytes,
        mime_type="application/pdf",
    )
    request = documentai.ProcessRequest(
        name=api_path,
        raw_document=raw_document,
    )
    response = await documentai_client.process_document(request=request)
    return response.document

The code block above works for PDFs of 10 pages or fewer. It fails only when a PDF is larger than 10 pages.

My question: what changes do I need to make to the code above so that it can successfully process larger PDFs of more than 10 pages?

google-cloud-platform gcloud cloud-document-ai
2 Answers

2 votes
FYI, there is an actively-monitored tag for Document AI:

[cloud-document-ai]



This comment from @yan-hic is correct:

Late answer, but as I suspected, the 200-page limit applies to batch requests, which by definition are asynchronous. The confusion comes from the fact that the client library also includes an async client. Use

batch_process_documents

from either client to go beyond 10 pages.

To add more detail, follow the batch processing code sample provided in Send a processing request to submit multiple documents at once and more pages than online processing allows. The async client does not change the page limits of the processor or the platform.

https://cloud.google.com/document-ai/quotas#content_limits

import re

from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import RetryError
from google.cloud import documentai
from google.cloud import storage

# TODO(developer): Uncomment these variables before running the sample.
# project_id = 'YOUR_PROJECT_ID'
# location = 'YOUR_PROCESSOR_LOCATION'  # Format is 'us' or 'eu'
# processor_id = 'YOUR_PROCESSOR_ID'  # Create processor before running sample
# gcs_input_uri = "YOUR_INPUT_URI"  # Format: gs://bucket/directory/file.pdf
# input_mime_type = "application/pdf"
# gcs_output_bucket = "YOUR_OUTPUT_BUCKET_NAME"  # Format: gs://bucket
# gcs_output_uri_prefix = "YOUR_OUTPUT_URI_PREFIX"  # Format: directory/subdirectory/
# field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.


def batch_process_documents(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_input_uri: str,
    input_mime_type: str,
    gcs_output_bucket: str,
    gcs_output_uri_prefix: str,
    field_mask: str = None,
    timeout: int = 400,
):
    # You must set the api_endpoint if you use a location other than 'us'.
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    gcs_document = documentai.GcsDocument(
        gcs_uri=gcs_input_uri, mime_type=input_mime_type
    )

    # Load GCS Input URI into a List of document files
    gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
    input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)

    # NOTE: Alternatively, specify a GCS URI Prefix to process an entire directory
    #
    # gcs_input_uri = "gs://bucket/directory/"
    # gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
    # input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)
    #

    # Cloud Storage URI for the Output Directory
    # This must end with a trailing forward slash `/`
    destination_uri = f"{gcs_output_bucket}/{gcs_output_uri_prefix}"

    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=destination_uri, field_mask=field_mask
    )

    # Where to write results
    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

    # The full resource name of the processor, e.g.:
    # projects/project_id/locations/location/processor/processor_id
    name = client.processor_path(project_id, location, processor_id)

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # BatchProcess returns a Long Running Operation (LRO)
    operation = client.batch_process_documents(request)

    # Continually polls the operation until it is complete.
    # This could take some time for larger files
    # Format: projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID
    try:
        print(f"Waiting for operation {operation.operation.name} to complete...")
        operation.result(timeout=timeout)
    # Catch exception when operation doesn't finish before timeout
    except (RetryError, InternalServerError) as e:
        print(e.message)

    # NOTE: Can also use callbacks for asynchronous processing
    #
    # def my_callback(future):
    #   result = future.result()
    #
    # operation.add_done_callback(my_callback)

    # Once the operation is complete,
    # get output document information from operation metadata
    metadata = documentai.BatchProcessMetadata(operation.metadata)

    if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
        raise ValueError(f"Batch Process Failed: {metadata.state_message}")

    storage_client = storage.Client()

    print("Output files:")
    # One process per Input Document
    for process in metadata.individual_process_statuses:
        # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/INPUT_FILE_NUMBER/
        # The Cloud Storage API requires the bucket name and URI prefix separately
        matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
        if not matches:
            print(
                "Could not parse output GCS destination:",
                process.output_gcs_destination,
            )
            continue

        output_bucket, output_prefix = matches.groups()

        # Get List of Document Objects from the Output Bucket
        output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)

        # Document AI may output multiple JSON files per source file
        for blob in output_blobs:
            # Document AI should only output JSON files to GCS
            if ".json" not in blob.name:
                print(
                    f"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}"
                )
                continue

            # Download JSON File as bytes object and convert to Document Object
            print(f"Fetching {blob.name}")
            document = documentai.Document.from_json(
                blob.download_as_bytes(), ignore_unknown_fields=True
            )

            # For a full list of Document object attributes, please reference this page:
            # https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document

            # Read the text recognition output from the processor
            print("The document contains the following text:")
            print(document.text)
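For reference, a hypothetical invocation of the sample above; the project, processor, and bucket values below are placeholders, not real resources:

# Placeholder values -- substitute your own project, processor, and Cloud Storage buckets.
batch_process_documents(
    project_id="my-project",
    location="us",
    processor_id="1234567890abcdef",
    gcs_input_uri="gs://my-input-bucket/invoices/large-invoice.pdf",
    input_mime_type="application/pdf",
    gcs_output_bucket="gs://my-output-bucket",
    gcs_output_uri_prefix="docai-output/",
)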

0 votes
If you do not necessarily need all pages in order to process the document, you can set an option: processOptions

It has the following fields:

individualPageSelector (object, IndividualPageSelector): which pages to process (1-indexed).

fromStart (integer): only process this many pages from the start; if the document has fewer pages, all of them are processed.

fromEnd (integer): only process this many pages from the end, otherwise the same as above.

So you can set fromStart to the current maximum (e.g. 10), and the request will be processed using only those first pages, which eliminates the PAGE_LIMIT_EXCEEDED error.

Sample code (PHP, building the raw JSON request body) looks like this:

$filename = 'your-filename-here';
$object = new StdClass();
$object->skipHumanReview = true;
$object->rawDocument = new StdClass();
$object->rawDocument->mimeType = mime_content_type($filename);
$object->rawDocument->content = base64_encode(trim(file_get_contents($filename)));
$object->processOptions = new StdClass();
$object->processOptions->fromStart = 10;
$json = json_encode($object, JSON_UNESCAPED_SLASHES);
See https://cloud.google.com/document-ai/docs/reference/rest/v1/ProcessOptions for more information.
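For a Python version of the same idea, the request from the question can carry this option as well. This is a minimal sketch, assuming a recent google-cloud-documentai release that exposes documentai.ProcessOptions with a from_start field (mirroring the REST ProcessOptions linked above); the function name and its arguments are illustrative, not part of the question's code:

from google.cloud import documentai

documentai_client = documentai.DocumentProcessorServiceAsyncClient()

# Sketch only: process_options / from_start is assumed to be available in the
# installed google-cloud-documentai version; it limits processing to the first
# pages so the online request stays under the page limit.
async def invoke_invoice_processor_first_pages(api_path, filebytes):
    request = documentai.ProcessRequest(
        name=api_path,  # same processor resource path as in the question
        raw_document=documentai.RawDocument(
            content=filebytes,
            mime_type="application/pdf",
        ),
        process_options=documentai.ProcessOptions(
            from_start=10,  # only the first 10 pages are processed
        ),
    )
    response = await documentai_client.process_document(request=request)
    return response.document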
