Azure AI Search index - multiple indexers and chunking

Question

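I have an indexer that reads from blob storage, chunks the documents, and vectorizes the data into an index. This works well. I also have a key field, call it fileID, which is stored in the document's metadata and also in the index. It is unique per document, but after chunking it is no longer unique, because one document is split into multiple documents that all share the same fileid.

I want a second indexer that adds data from a SQL query to the index, joined on that fileid. However, I can no longer use fileid as the key, because of the chunking process and because keys must be unique. How can I merge the data from the SQL query indexer into the index?

I'm guessing this isn't possible right now, but if anyone has any suggestions, that would be great!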

Answer 1

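The key field in an index is unique per document, and it is the same for every chunk created from that document.

So it is not possible to create a unique ID for each chunk unless you create two separate indexes: one with the base fields, which uses a custom Web API skill to do the chunking, and another that loads the chunked data with a newly created unique field.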

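Here, the skillset takes the input, creates the chunked data for each document, and writes it to blob storage. That storage is then used as a data source, from which the key field is read as `uniqueid`.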

The fields of `tmp-index` are shown below.

[Image: `tmp-index` fields]

Next, the indexer is run against this index with the skillset.

Indexer definition.

{
  "@odata.context": "https://azsearch0303.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DC5466A843478A\"",
  "name": "tmp-indexer",
  "description": null,
  "dataSourceName": "tmp-datasource",
  "skillsetName": "tmp-skillset",
  "targetIndexName": "tmp-index",
  "disabled": false,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "indexedFileNameExtensions": ".txt,.md,.html,.pdf,.docx,.pptx,.deltrack",
      "dataToExtract": "contentAndMetadata"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "document_id",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    },
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "filename",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "url",
      "mappingFunction": null
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}

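Here, `dataSourceName` is set to your original data source, and the indexer is given the skillset, which takes the input and writes the chunked data to the storage account in JSON format.

Skillset definition.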

{
  "@odata.context": "https://azsearch0303.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8DC5466966EA9A3\"",
  "name": "tmp-skillset",
  "description": null,
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "tmp-skillset",
      "description": null,
      "context": "/document/content",
      "uri": "endpoint",
      "httpMethod": "POST",
      "timeout": "PT1M",
      "batchSize": 10,
      "degreeOfParallelism": 10,
      "inputs": [
        {
          "name": "document_id",
          "source": "/document/document_id"
        },
        {
          "name": "filename",
          "source": "/document/filename"
        },
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "url",
          "source": "/document/url"
        }
      ],
      "outputs": [
        {
          "name": "recordId",
          "targetName": "recordId"
        }
      ],
      "httpHeaders": {
        "num-tokens": "1024",
        "api-key": "api-key-to-endpoint",
        "connection-string": "<connection_string_to_storage_acc>",
        "container-name": "newdata-chunks",
        "metadata-mapping-json": "{}"
      }
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": null,
  "encryptionKey": null
}

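Here, a POST request is made to the Web API and the inputs are passed along with it. The `num-tokens` header carries the chunk size.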
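To make the request and response shape concrete, here is a minimal sketch of the handler side of such a custom Web API skill. It relies on the documented custom skill contract, where the request body holds a `values` array of records, each with a `recordId` and a `data` object, and the response echoes each `recordId`. The function name `handle_skill_request` is hypothetical, and the chunking and blob-writing work is only marked with a comment, since the original answer does not show that code.

```python
def handle_skill_request(body: dict) -> dict:
    """Build a custom Web API skill response for a batch of records."""
    results = []
    for record in body.get("values", []):
        record_id = record["recordId"]
        data = record.get("data", {})
        errors = []
        # The skillset above passes these four inputs for each document.
        expected = ("document_id", "filename", "text", "url")
        missing = [name for name in expected if name not in data]
        if missing:
            errors.append({"message": f"missing inputs: {', '.join(missing)}"})
        # Hypothetical: chunking and writing to blob storage would happen here.
        results.append({
            "recordId": record_id,
            # The skillset declares a single output named "recordId".
            "data": {} if errors else {"recordId": record_id},
            "errors": errors,
            "warnings": [],
        })
    return {"values": results}
```

Every record in the request gets exactly one entry in the response, matched by `recordId`, so a failure on one document does not block the rest of the batch.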

You can configure the skillset to suit your requirements. In the Web API, you can use these inputs as follows.

Use `text` to create chunks of size `1024`.
Use `document_id` to create a unique folder and `chunk_id` to create a unique file.
Write the content to that file. Below is a sample of the content written to the storage account.

{"chunk_id": "0", "content": "chunk zero content", "last_updated": "20240404044226", "title": "using System;", "url": "https://jgsblob.blob.core.windows.net/data/csvs/translattion-console-code-plain.txt"}


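You can create a new container and write to it however you need, but make sure to supply it as the data source in the next step.

So the script in the Web API should create the unique folders and files, and then write the content.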
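The chunking and naming steps described above could be sketched roughly as follows. This is not the answer author's script: it assumes a token is approximated by a whitespace-separated word, uses the file name as `title`, and invents the helper names `split_into_chunks` and `build_chunk_blobs`. It only builds the blob names and JSON bodies; uploading them to the container is left out.

```python
import json
from datetime import datetime, timezone


def split_into_chunks(text: str, num_tokens: int) -> list[str]:
    # Assumption: a "token" is approximated by a whitespace-separated word.
    words = text.split()
    return [" ".join(words[i:i + num_tokens])
            for i in range(0, len(words), num_tokens)]


def build_chunk_blobs(document_id, filename, url, text, num_tokens=1024):
    """Return (blob_name, json_body) pairs, one per chunk."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    blobs = []
    for i, chunk in enumerate(split_into_chunks(text, num_tokens)):
        record = {
            "chunk_id": str(i),
            "content": chunk,
            "last_updated": stamp,
            "title": filename,  # assumption: the sample's title source is unknown
            "url": url,
        }
        # One folder per document and one file per chunk keeps each path unique.
        blobs.append((f"{document_id}/{i}.json", json.dumps(record)))
    return blobs
```

Because the folder comes from `document_id` and the file from `chunk_id`, every chunk ends up at a distinct blob path, which is what the second indexer later turns into a unique key.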

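Next, create a new index and indexer with the following definitions.

Chunked index

[Image: chunked index fields]

Here you can also add extra fields to hold the results of the join query.

Indexer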

{
  "@odata.context": "https://azsearch0303.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DC546C5BE2EB06\"",
  "name": "indexer1712209927853",
  "description": null,
  "dataSourceName": "tmp-datasource-chunk",
  "skillsetName": null,
  "targetIndexName": "chunked-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "json",
      "indexedFileNameExtensions": ".json"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "uniqueid",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}

`tmp-datasource-chunk` is a data source created on blob storage that points to the container you wrote the data to earlier.
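The field mapping above derives `uniqueid` from `metadata_storage_path`, so each chunk file yields its own key. The sketch below only illustrates that idea with plain URL-safe Base64. The service's `base64Encode` mapping function may treat padding differently, so the exact key strings should not be assumed to match. The storage account name in the paths and the helper name `approximate_key` are hypothetical.

```python
import base64


def approximate_key(blob_path: str) -> str:
    # Assumption: plain URL-safe Base64 with padding stripped; the service's
    # own base64Encode mapping function may encode padding differently.
    encoded = base64.urlsafe_b64encode(blob_path.encode("utf-8"))
    return encoded.decode("ascii").rstrip("=")


# Hypothetical chunk paths: same document folder, different chunk files.
paths = [
    "https://example.blob.core.windows.net/newdata-chunks/doc1/0.json",
    "https://example.blob.core.windows.net/newdata-chunks/doc1/1.json",
]
keys = [approximate_key(p) for p in paths]
```

Two chunks of the same document differ in their file name, so their encoded paths differ too, and the encoded form avoids characters such as `/` and `:` that are not allowed in document keys.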

Now you can add a custom Web API skill here that joins on the unique ID.
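As a rough illustration of that join, the sketch below shows a Web API skill handler that looks up SQL data by `fileID` for each chunk record. It assumes each chunk carries its parent document's `fileID` as a skill input, and it uses an in-memory SQLite database as a stand-in for the real SQL source. The table `file_metadata`, its columns, and the function names are all hypothetical.

```python
import sqlite3


def fetch_sql_fields(conn: sqlite3.Connection, file_id: str) -> dict:
    # Hypothetical table and column names standing in for the real SQL query.
    row = conn.execute(
        "SELECT owner, department FROM file_metadata WHERE file_id = ?",
        (file_id,),
    ).fetchone()
    if row is None:
        return {}
    return {"owner": row[0], "department": row[1]}


def enrich_records(conn: sqlite3.Connection, body: dict) -> dict:
    """Web API skill handler: attach SQL fields to each chunk by fileID."""
    values = []
    for record in body["values"]:
        file_id = record["data"].get("fileID")
        values.append({
            "recordId": record["recordId"],
            "data": fetch_sql_fields(conn, file_id),
            "errors": [],
            "warnings": [],
        })
    return {"values": values}


# In-memory stand-in for the SQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_metadata (file_id TEXT PRIMARY KEY, owner TEXT, department TEXT)"
)
conn.execute("INSERT INTO file_metadata VALUES ('F-001', 'alice', 'finance')")

# Two chunks of the same source document share one fileID.
body = {"values": [
    {"recordId": "0", "data": {"fileID": "F-001"}},
    {"recordId": "1", "data": {"fileID": "F-001"}},
]}
result = enrich_records(conn, body)
```

Since the lookup is keyed on `fileID` rather than on the index key, every chunk of a document receives the same SQL fields while keeping its own unique key.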
