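I'm trying to use native blob soft delete with Azure Cognitive Search (per https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs) so that deleting a file from the container also removes the corresponding document from the index, but it isn't behaving as I expected or as the documentation describes.
I created a storage account with "Enable soft delete for blobs" turned on. Then I created a storage container in that account.
I created a data source that uses the container, with the following settings (JSON):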
{
"@odata.context": "https://sotestservice1.search.windows.net/$metadata#datasources/$entity",
"@odata.etag": "\"0x8DC3D5D3E87D000\"",
"name": "sotestdatasource1",
"description": null,
"type": "azureblob",
"subtype": null,
"credentials": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=soteststorageaccount;AccountKey=OMITTED;EndpointSuffix=core.windows.net"
},
"container": {
"name": "sotestcontainer"
},
"dataChangeDetectionPolicy": null,
"dataDeletionDetectionPolicy": {
"@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
},
"encryptionKey": null,
"identity": null
}
I created an index with the following settings (JSON):
{
"@odata.context": "https://sotestservice1.search.windows.net/$metadata#indexes/$entity",
"@odata.etag": "\"0x8DC3D5CD0D66187\"",
"name": "sotestindex1",
"defaultScoringProfile": null,
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"normalizer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "title",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "standard.lucene",
"normalizer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "content",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "standard.lucene",
"normalizer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "titleVector",
"type": "Collection(Edm.Single)",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"normalizer": null,
"dimensions": 4,
"vectorSearchProfile": "vector-profile-1709674759253",
"synonymMaps": []
},
{
"name": "contentVector",
"type": "Collection(Edm.Single)",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"normalizer": null,
"dimensions": 4,
"vectorSearchProfile": "vector-profile-1709674759253",
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"normalizers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"encryptionKey": null,
"similarity": {
"@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
"k1": null,
"b": null
},
"semantic": null,
"vectorSearch": {
"algorithms": [
{
"name": "vector-config-1709674581416",
"kind": "hnsw",
"hnswParameters": {
"metric": "cosine",
"m": 4,
"efConstruction": 400,
"efSearch": 500
},
"exhaustiveKnnParameters": null
}
],
"profiles": [
{
"name": "vector-profile-1709674759253",
"algorithm": "vector-config-1709674581416",
"vectorizer": null
}
],
"vectorizers": []
}
}
AbrahamLincoln.json:
{
"id": "https---en-wikipedia-org-wiki-Abraham-Lincoln",
"content": "Lincoln was born into poverty in a log cabin in Kentucky and was raised on the frontier, primarily in Indiana.",
"contentVector": [-0.7, 0.3, 0.9, -0.8],
"title": "Abraham Lincoln",
"titleVector": [0.6, -0.7, 0.2, 0.4],
"@search.action": "mergeOrUpload"
}
FranklinRoosevelt.json:
{
"id": "https----en-wikipedia-org-wiki-Franklin-D-Roosevelt",
"content": "A member of the Delano family and Roosevelt family, after attending university, Roosevelt began to practice law in New York City.",
"contentVector": [0.5, 0.9, 0.3, 0.4],
"title": "Franklin D. Roosevelt",
"titleVector": [0.7, 0.1, 0.8, -0.3],
"@search.action": "mergeOrUpload"
}
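I ran the indexer, which reported success (2 documents), and a "*" search against the index returned both documents, exactly as expected. So far, so good.
I deleted AbrahamLincoln.json from the storage container.
I re-ran the indexer, and it succeeded.
At this point I expected the index to contain only a single document. Instead, it contains three: my original two documents, plus an additional one that looks like this: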
{
"id": "aHR0cHM6Ly9zb3Rlc3RzdG9yYWdlYWNjb3VudC5ibG9iLmNvcmUud2luZG93cy5uZXQvc290ZXN0Y29udGFpbmVyL0ZyYW5rbGluUm9vc2V2ZWx0Lmpzb241",
"title": "Franklin D. Roosevelt",
"content": "A member of the Delano family and Roosevelt family, after attending university, Roosevelt began to practice law in New York City.",
"titleVector": [
0.7,
0.1,
0.8,
-0.3
],
"contentVector": [
0.5,
0.9,
0.3,
0.4
]
}
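So now I'm confused, because there are 3 documents instead of 1, and the third is a copy of an existing document. In addition, it has a new id, which is the Base64 encoding of the blob URL (with a digit appended at the end indicating how many = padding characters there should be).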
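To check what that id actually holds, here is a minimal Python sketch. It assumes the key is URL-safe Base64 with the trailing digit standing in for the number of = padding characters; decode_url_token is a hypothetical helper name, not part of any Azure SDK.

```python
import base64


def decode_url_token(token: str) -> str:
    """Decode a key whose last character is the count of stripped '=' padding.

    Assumption: the body is URL-safe Base64 ('-' and '_' instead of '+' and '/').
    """
    if not token:
        return ""
    padding = int(token[-1])   # trailing digit = number of '=' removed
    body = token[:-1]          # the Base64 payload itself
    return base64.urlsafe_b64decode(body + "=" * padding).decode("utf-8")


# The id of the unexpected third document from the index.
doc_id = (
    "aHR0cHM6Ly9zb3Rlc3RzdG9yYWdlYWNjb3VudC5ibG9iLmNvcmUud2luZG93cy5uZXQv"
    "c290ZXN0Y29udGFpbmVyL0ZyYW5rbGluUm9vc2V2ZWx0Lmpzb241"
)

# prints https://soteststorageaccount.blob.core.windows.net/sotestcontainer/FranklinRoosevelt.json
print(decode_url_token(doc_id))
```

So the extra document is keyed by the path of the blob that still exists, not by the id inside the JSON file.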
Is Cognitive Search doing something wrong here, or am I?
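Native blob soft delete has a few requirements. One of them is:
- The document key of the documents in the index must be mapped to a blob property or blob metadata, such as "metadata_storage_path".
So you need to map the key to a blob property or blob metadata.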
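As a rough way to check this requirement against a pair of definitions, here is a small Python sketch. It is only a heuristic under my own assumptions: it treats any source field starting with "metadata_storage_" as a blob property (custom blob metadata would not match this prefix), and the helper names and the trimmed dictionaries are hypothetical.

```python
def key_source(index: dict, indexer: dict) -> str:
    """Return the name of the source field that feeds the index's key field."""
    key_field = next(f["name"] for f in index["fields"] if f.get("key"))
    for mapping in indexer.get("fieldMappings") or []:
        # targetFieldName is optional and defaults to the source field name.
        target = mapping.get("targetFieldName") or mapping["sourceFieldName"]
        if target == key_field:
            return mapping["sourceFieldName"]
    # No explicit mapping: the indexer matches source and target by name.
    return key_field


def key_comes_from_blob_property(index: dict, indexer: dict) -> bool:
    """Heuristic: does the key come from a metadata_storage_* blob property?"""
    return key_source(index, indexer).startswith("metadata_storage_")


# Trimmed, hypothetical stand-ins for the definitions in this thread.
index_keyed_on_id = {"fields": [{"name": "id", "key": True}, {"name": "title"}]}
index_keyed_on_path = {
    "fields": [{"name": "title"}, {"name": "metadata_storage_path", "key": True}]
}
indexer_without_mappings = {"fieldMappings": []}
indexer_with_mapping = {
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "metadata_storage_path",
            "mappingFunction": {"name": "base64Encode", "parameters": None},
        }
    ]
}

print(key_comes_from_blob_property(index_keyed_on_id, indexer_without_mappings))
print(key_comes_from_blob_property(index_keyed_on_path, indexer_with_mapping))
```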
Modify your index definition as follows:
{
"@odata.context": "https://jgsai.search.windows.net/$metadata#indexes/$entity",
"@odata.etag": "\"0x8DC3D9565D5ADC9\"",
"name": "azureblob-index-2",
"defaultScoringProfile": "",
"fields": [
{
"name": "id",
"type": "Edm.String",
.........
},
{
"name": "title",
"type": "Edm.String",
.........
},
{
"name": "content",
"type": "Edm.String",
.........
},
{
"name": "titleVector",
"type": "Collection(Edm.Double)",
.........
},
{
"name": "contentVector",
"type": "Collection(Edm.Double)",
.........
},
{
"name": "metadata_storage_path",
"type": "Edm.String",
"searchable": false,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"normalizer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"normalizers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"encryptionKey": null,
"similarity": {
"@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
"k1": null,
"b": null
},
"semantic": null,
"vectorSearch": null
}
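Here I added the metadata_storage_path field and made it the key.
Next, the Base64-encoded key you are seeing comes from either of two settings: base64EncodeKeys set to true, or base64Encode given as the mapping function in the field mappings.
Below is the indexer definition: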
{
"@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
"@odata.etag": "\"0x8DC3D95C7751E0D\"",
"name": "azureblob-indexer",
"description": "",
"dataSourceName": "blobsource",
"skillsetName": null,
"targetIndexName": "azureblob-index-2",
"disabled": null,
"schedule": null,
"parameters": {
"batchSize": null,
"maxFailedItems": 0,
"maxFailedItemsPerBatch": 0,
"base64EncodeKeys": null,
"configuration": {
"dataToExtract": "contentAndMetadata",
"parsingMode": "json"
}
},
"fieldMappings": [
{
"sourceFieldName": "metadata_storage_path",
"targetFieldName": "metadata_storage_path",
"mappingFunction": {
"name": "base64Encode",
"parameters": null
}
}
],
"outputFieldMappings": [],
"cache": null,
"encryptionKey": null
}
In this case, if you set base64EncodeKeys to true or use base64Encode as the mappingFunction, you get a Base64-encoded key.
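To illustrate what that encoding does to a blob path, here is a small Python sketch. It assumes the default behavior matches .NET's HttpServerUtility.UrlTokenEncode (URL-safe Base64 with the = padding replaced by a count digit); url_token_encode is a hypothetical helper, not an Azure SDK function.

```python
import base64


def url_token_encode(text: str) -> str:
    """Encode text as URL-safe Base64, replacing '=' padding with its count.

    Assumption: this mirrors what the base64Encode mapping function emits
    by default for a value such as metadata_storage_path.
    """
    raw = text.encode("utf-8")
    if not raw:
        return ""
    encoded = base64.urlsafe_b64encode(raw).decode("ascii")
    stripped = encoded.rstrip("=")
    # Append how many '=' characters were removed (0, 1, or 2).
    return stripped + str(len(encoded) - len(stripped))


# Blob URL taken from the question's storage account and container.
blob_path = (
    "https://soteststorageaccount.blob.core.windows.net/"
    "sotestcontainer/FranklinRoosevelt.json"
)
print(url_token_encode(blob_path))
```

The result contains only characters that are safe in a document key, which is why the blob path is encoded before it is used as one.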
The data source definition:
{
"@odata.context": "https://jgsai.search.windows.net/$metadata#datasources/$entity",
"@odata.etag": "\"0x8DC3D923FFB589D\"",
"name": "blobsource",
"description": null,
"type": "azureblob",
"subtype": null,
"credentials": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=jgsblob;AccountKey=..."
},
"container": {
"name": "data",
"query": "json"
},
"dataChangeDetectionPolicy": null,
"dataDeletionDetectionPolicy": {
"@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
},
"encryptionKey": null,
"identity": null
}
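With this configuration, the deletion of the JSON file was picked up successfully.
Initially there were 2 documents. After deleting one and re-running the indexer, I had 1 document in the index.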
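If you want to script that check instead of using the portal, a minimal sketch with Python's standard library could look like this. It only builds the two REST requests (run the indexer, count the documents); the api-version value and the admin key are assumptions you would replace with your own.

```python
import urllib.request

API_VERSION = "2023-11-01"  # assumption: use whichever api-version your service supports


def run_indexer_request(service: str, indexer: str, api_key: str) -> urllib.request.Request:
    """Build the POST request that asks the service to run an indexer on demand."""
    url = (
        f"https://{service}.search.windows.net/indexers/{indexer}/run"
        f"?api-version={API_VERSION}"
    )
    return urllib.request.Request(url, method="POST", headers={"api-key": api_key})


def count_documents_request(service: str, index: str, api_key: str) -> urllib.request.Request:
    """Build the GET request that returns the number of documents in an index."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}/docs/$count"
        f"?api-version={API_VERSION}"
    )
    return urllib.request.Request(url, method="GET", headers={"api-key": api_key})


# Names taken from the definitions above; the key is a placeholder.
run_req = run_indexer_request("jgsai", "azureblob-indexer", "<admin-api-key>")
count_req = count_documents_request("jgsai", "azureblob-index-2", "<admin-api-key>")

# To actually send them:
#   urllib.request.urlopen(run_req)
#   print(urllib.request.urlopen(count_req).read().decode())
```

The indexer runs asynchronously, so the count may take a short while to reflect the deletion after the run request is accepted.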