Why does the Azure Cognitive Search indexer unnecessarily create Base64 names?

Question

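I'm trying to use native blob soft delete with Azure Cognitive Search (per https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs) so that when a file is deleted from the container, it is also removed from the index. It isn't behaving the way I expected or the way the documentation describes.

I created a storage account and turned on "Enable soft delete for blobs". Then I created a storage container in that account.
I created a data source that uses the container, with the following settings (JSON):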

{
  "@odata.context": "https://sotestservice1.search.windows.net/$metadata#datasources/$entity",
  "@odata.etag": "\"0x8DC3D5D3E87D000\"",
  "name": "sotestdatasource1",
  "description": null,
  "type": "azureblob",
  "subtype": null,
  "credentials": { 
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=soteststorageaccount;AccountKey=OMITTED;EndpointSuffix=core.windows.net"
  },
  "container": {
    "name": "sotestcontainer"
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
  },
  "encryptionKey": null,
  "identity": null
}

I created an index on Azure Cognitive Search with the following settings (JSON):

{
  "@odata.context": "https://sotestservice1.search.windows.net/$metadata#indexes/$entity",
  "@odata.etag": "\"0x8DC3D5CD0D66187\"",
  "name": "sotestindex1",
  "defaultScoringProfile": null,
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "synonymMaps": []
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "synonymMaps": []
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "synonymMaps": []
    },
    {
      "name": "titleVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": 4,
      "vectorSearchProfile": "vector-profile-1709674759253",
      "synonymMaps": []
    },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": 4,
      "vectorSearchProfile": "vector-profile-1709674759253",
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": null,
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-config-1709674581416",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        },
        "exhaustiveKnnParameters": null
      }
    ],
    "profiles": [
      {
        "name": "vector-profile-1709674759253",
        "algorithm": "vector-config-1709674581416",
        "vectorizer": null
      }
    ],
    "vectorizers": []
  }
}

I created an indexer that uses the data source and the index defined above:

{
  "@odata.context": "https://sotestservice1.search.windows.net/$metadata#datasources/$entity",
  "@odata.etag": "\"0x8DC3D5D3E87D000\"",
  "name": "sotestdatasource1",
  "description": null,
  "type": "azureblob",
  "subtype": null,
  "credentials": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=soteststorageaccount;AccountKey=OMITTED;EndpointSuffix=core.windows.net"
  },
  "container": {
    "name": "sotestcontainer"
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
  },
  "encryptionKey": null,
  "identity": null
}

I uploaded JSON documents as follows:

AbrahamLincoln.json:

{
    "id": "https---en-wikipedia-org-wiki-Abraham-Lincoln",
    "content": "Lincoln was born into poverty in a log cabin in Kentucky and was raised on the frontier, primarily in Indiana.", 
    "contentVector": [-0.7, 0.3, 0.9, -0.8], 
    "title": "Abraham Lincoln", 
    "titleVector": [0.6, -0.7, 0.2, 0.4], 
    "@search.action": "mergeOrUpload"
}

FranklinRoosevelt.json:

{
    "id": "https----en-wikipedia-org-wiki-Franklin-D-Roosevelt",
    "content": "A member of the Delano family and Roosevelt family, after attending university, Roosevelt began to practice law in New York City.", 
    "contentVector": [0.5, 0.9, 0.3, 0.4], 
    "title": "Franklin D. Roosevelt", 
    "titleVector": [0.7, 0.1, 0.8, -0.3], 
    "@search.action": "mergeOrUpload"
}

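I ran the indexer. It reported success (2 documents), and a "*" search against the index returned both documents, exactly as expected. So far, so good.
I deleted AbrahamLincoln.json from the storage container.
I reran the indexer, and it succeeded.
At this point I expected the index to contain only a single document. Instead, it contains three: my original two documents, plus an additional one that looks like this: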

   {
      "id": "aHR0cHM6Ly9zb3Rlc3RzdG9yYWdlYWNjb3VudC5ibG9iLmNvcmUud2luZG93cy5uZXQvc290ZXN0Y29udGFpbmVyL0ZyYW5rbGluUm9vc2V2ZWx0Lmpzb241",
      "title": "Franklin D. Roosevelt",
      "content": "A member of the Delano family and Roosevelt family, after attending university, Roosevelt began to practice law in New York City.",
      "titleVector": [
        0.7,
        0.1,
        0.8,
        -0.3
      ],
      "contentVector": [
        0.5,
        0.9,
        0.3,
        0.4
      ]
    }

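So now I'm confused, because there are 3 documents instead of 1, and the third is a duplicate of an existing document. It also has a new id, which is the Base64 encoding of the blob URL (with a digit appended at the end indicating how many = padding characters there should be).

Is Cognitive Search doing something wrong here, or am I?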
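
To make the key format described above concrete, here is a minimal Python sketch that decodes the unexpected id back into a blob URL. It assumes the format is exactly as the question describes it: URL-safe Base64 with the trailing = padding replaced by a single digit giving the padding count. The helper names encode_key and decode_key are hypothetical and are not part of any Azure SDK.

```python
import base64


def encode_key(text: str) -> str:
    """URL-safe Base64 where '=' padding is replaced by a digit with its count."""
    encoded = base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")
    stripped = encoded.rstrip("=")
    return stripped + str(len(encoded) - len(stripped))


def decode_key(key: str) -> str:
    """Reverse encode_key: the last character is the number of '=' to restore."""
    body, pad_count = key[:-1], int(key[-1])
    return base64.urlsafe_b64decode(body + "=" * pad_count).decode("utf-8")


# The id of the unexpected third document from the question.
key = (
    "aHR0cHM6Ly9zb3Rlc3RzdG9yYWdlYWNjb3VudC5ibG9iLmNvcmUud2luZG93cy5uZXQv"
    "c290ZXN0Y29udGFpbmVyL0ZyYW5rbGluUm9vc2V2ZWx0Lmpzb241"
)
print(decode_key(key))
```

Under that assumption, the key decodes to https://soteststorageaccount.blob.core.windows.net/sotestcontainer/FranklinRoosevelt.json, which is the full path of the remaining blob rather than the id stored inside the JSON file.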

Answer 1

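Native blob soft delete has a few requirements. One of them is:

The document keys for the documents in your index must be mapped to either a blob property or blob metadata, such as "metadata_storage_path".

So you need to map the key to a blob property or blob metadata.

Modify your index definition as follows: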

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexes/$entity",
  "@odata.etag": "\"0x8DC3D9565D5ADC9\"",
  "name": "azureblob-index-2",
  "defaultScoringProfile": "",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      .........
    },
    {
      "name": "title",
      "type": "Edm.String",
     .........
    },
    {
      "name": "content",
      "type": "Edm.String",
      .........
    },
    {
      "name": "titleVector",
      "type": "Collection(Edm.Double)",
      .........
    },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Double)",
      .........
    },
    {
      "name": "metadata_storage_path",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": null,
  "vectorSearch": null
}

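Here I added the metadata field metadata_storage_path and made it the key.

Next, you are getting a Base64-encoded key because either base64EncodeKeys is set to true or the field mapping specifies base64Encode as the mapping function.

Below is the indexer definition.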

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DC3D95C7751E0D\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "blobsource",
  "skillsetName": null,
  "targetIndexName": "azureblob-index-2",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "json"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}
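
As a rough way to check the requirement above against a pair of definitions, the following Python sketch reports which source field feeds the index key and which mapping function, if any, is applied. It works on plain dictionaries shaped like the JSON in this answer. The helper name key_source is hypothetical, and the sketch assumes that a mapping without targetFieldName targets a field named like its source, and that a key with no explicit mapping is filled from a source field of the same name.

```python
def key_source(index: dict, indexer: dict) -> tuple:
    """Return (source field, mapping function name) that populates the index key."""
    key_field = next(f["name"] for f in index["fields"] if f.get("key"))
    for mapping in indexer.get("fieldMappings") or []:
        # Assumption: a missing targetFieldName defaults to the source name.
        target = mapping.get("targetFieldName") or mapping["sourceFieldName"]
        if target == key_field:
            function = (mapping.get("mappingFunction") or {}).get("name")
            return mapping["sourceFieldName"], function
    # No explicit mapping: assume the key is filled from a same-named source field.
    return key_field, None


# Trimmed-down versions of the index and indexer definitions shown above.
index = {
    "fields": [
        {"name": "id", "key": False},
        {"name": "metadata_storage_path", "key": True},
    ]
}
indexer = {
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "metadata_storage_path",
            "mappingFunction": {"name": "base64Encode", "parameters": None},
        }
    ]
}
print(key_source(index, indexer))
```

For the definitions in this answer the result is ('metadata_storage_path', 'base64Encode'), so the key comes from a blob property. An index keyed on id with no field mappings would give ('id', None), meaning the key comes from the document content instead.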

In this case, if you set base64EncodeKeys to true or use base64Encode as the mappingFunction, you will get a Base64-encoded key.

Data source definition:

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#datasources/$entity",
  "@odata.etag": "\"0x8DC3D923FFB589D\"",
  "name": "blobsource",
  "description": null,
  "type": "azureblob",
  "subtype": null,
  "credentials": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=jgsblob;AccountKey=..."
  },
  "container": {
    "name": "data",
    "query": "json"
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
  },
  "encryptionKey": null,
  "identity": null
}

With this setup, the indexer run successfully picked up the deleted JSON file.


Initially there were 2 documents. After deleting 1 and rerunning the indexer, I got 1 document in the index.
