As the title says.
I used "Import and vectorize data" to create the index, and the index was chunked automatically.
The index schema looks like this:
"value": [
{
"@search.score":
"chunk_id": "",
"chunk": "",
"title": "",
"image": ""
},
Following the official documentation, I used "/document/normalized_images/*/data" to retrieve the base64 data of the normalized images, then processed it programmatically to convert it into image files. However, my goal is to get the base64 data that corresponds to each chunk. So I modified the skillset as follows, but I get this error message:
"One or more index projection selectors are invalid. Details: Input 'image' does not have a matching index field in index 'name'."
"indexProjections": {
"selectors": [
{
"targetIndexName": "name",
"parentKeyFieldName": "parent_id",
"sourceContext": "/document/pages/*",
"mappings": [
{
"name": "chunk",
"source": "/document/pages/*",
"sourceContext": null,
"inputs": []
},
{
"name": "vector",
"source": "/document/pages/*/vector",
"sourceContext": null,
"inputs": []
},
{
"name": "title",
"source": "/document/metadata_storage_name",
"sourceContext": null,
"inputs": []
},
{
"name": "image",
"sourceContext":"/document/pages/*",
"inputs": [
{
"source":"/document/normalized_images/*/pages/data",
"name":"imagedata"
}
]
}
]
}
    ]
}
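For context, the post-processing step mentioned above, converting each retrieved base64 payload into an image file, can be sketched in Python (the payload and output file name below are placeholders, not values from the actual index):

```python
import base64

# Placeholder payload: in practice this string comes from the
# "/document/normalized_images/*/data" enrichment path.
image_b64 = base64.b64encode(b"\x89PNG placeholder bytes").decode("ascii")

# Decode the base64 string back into raw bytes and write an image file.
with open("normalized_image_0.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```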
I want to get the base64 data corresponding to each indexed chunk's text. How can I adapt this approach, or what alternative solutions could I explore?
There is a mismatch between your index schema and your skillset configuration. The "image" field, which stores an image URL, doesn't look suitable for holding base64 data. Add a dedicated field such as "imageData" to the index, as shown below.

"fields": [
{ "name": "imageData", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "searchable": false }
]
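To apply this schema change, you can PUT the updated index definition back to the service. A minimal sketch follows; the endpoint, index name, and API version are assumptions, and the actual HTTP call (shown in a comment) needs your full index definition and an admin key:

```python
import json

# Placeholders: substitute your own service endpoint and index name.
endpoint = "https://<your-service>.search.windows.net"
index_name = "your_index_name"
api_version = "2023-11-01"

# The new field to append to the index's existing "fields" array.
image_data_field = {
    "name": "imageData",
    "type": "Edm.String",
    "filterable": False,
    "sortable": False,
    "facetable": False,
    "searchable": False,
}

update_url = f"{endpoint}/indexes/{index_name}?api-version={api_version}"
# e.g. requests.put(update_url,
#                   headers={"api-key": "<admin key>", "Content-Type": "application/json"},
#                   data=json.dumps(full_index_definition))
print(update_url)
print(json.dumps(image_data_field))
```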
After making that change, update the skillset as follows.
"skills": [
{
"@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
"name": "#1",
"inputs": [
{
"name": "chunk",
"source": "/document/pages/*"
}
],
"outputs": [
{
"name": "chunk"
},
{
"name": "imageData",
"targetName": "imageData"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Text.ExtractKeyPhrasesSkill",
"name": "#2",
"context": "/document",
"inputs": [
{
"name": "text",
"source": "/document/pages/*/text"
}
],
"outputs": [
{
"name": "keyPhrases",
"targetName": "keyPhrases"
}
]
}
]
"image"
字段中提取 Base64 图像数据并将其存储在 "imageData"
字段中。更新索引器:
"parameters": {
"configuration": {
"dataToExtract": "contentAndMetadata",
"imageAction": "generateNormalizedImages",
"indexedFileNameExtensions": ".pdf,.docx,.pptx,.xlsx",
"skillsetName": "your_updated_skillset_name",
"targetIndexName": "your_index_name",
"fieldMappings": [
{
"sourceFieldName": "/document/pages/*/text",
"targetFieldName": "text"
},
{
"sourceFieldName": "/document/pages/*/title",
"targetFieldName": "title"
}
]
}
}
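Since changing the skillset or imageAction only affects future enrichment runs, reset and rerun the indexer so existing documents are re-processed. A sketch of the two REST calls involved (all names are placeholders):

```python
# Placeholders: substitute your own service endpoint and indexer name.
endpoint = "https://<your-service>.search.windows.net"
indexer_name = "your_indexer_name"
api_version = "2023-11-01"

reset_url = f"{endpoint}/indexers/{indexer_name}/reset?api-version={api_version}"
run_url = f"{endpoint}/indexers/{indexer_name}/run?api-version={api_version}"

# Both are POST requests authenticated with an "api-key" header, e.g.:
# requests.post(reset_url, headers={"api-key": "<admin key>"})
# requests.post(run_url, headers={"api-key": "<admin key>"})
print(reset_url)
print(run_url)
```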
Your index projections definition is incorrect. First, you are creating nested inputs in the "image" mapping. That should only be used when the "image" field is of type Edm.ComplexType and you want to build an inline complex type to map into the index. Also, it looks like you're mapping "/document/normalized_images/*/pages/data"; you need to remove "pages" from that source path. With those changes, the specific mapping in your index projections definition should look like this:
{
"name": "image",
"source":"/document/normalized_images/*/data"
}
However, note that the sourceContext of your index projection is "/document/pages/*". That means there will be one document in the search index for every "page". Images, though, are tracked under a separate path, "/document/normalized_images/*", so pages to images is not necessarily a 1:1 mapping. As a result, if you use the mapping I shared above, it will actually output an array of strings containing the individual Base64 data of all of the parent document's images, repeated for every page of that document.
If you instead want a 1:1 mapping from image to search document, your skillset should do something like the following. Note that if any single image produces too much text output to vectorize, you will see an error.
{
"description": "Skillset to chunk documents by image and generate embeddings",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
"context": "/document/normalized_images/*",
"inputs": [
{
"name": "image",
"source": "/document/normalized_images/*"
}
],
"outputs": [
{
"name": "text",
"targetName": "text"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
"context": "/document/normalized_images/*",
"resourceUri": "<fill in>",
"apiKey": "<fill in>",
"deploymentId": "<fill in>",
"inputs": [
{
"name": "text",
"source": "/document/normalized_images/*/text"
}
],
"outputs": [
{
"name": "embedding",
"targetName": "vector"
}
]
}
],
"cognitiveServices": null,
"indexProjections": {
"selectors": [
{
"targetIndexName": "name",
"parentKeyFieldName": "parent_id",
"sourceContext": "/document/normalized_images/*",
"mappings": [
{
"name": "chunk",
"source": "/document/normalized_images/*/text"
},
{
"name": "vector",
"source": "/document/normalized_images/*/vector"
},
{
"name": "title",
"source": "/document/metadata_storage_name"
},
{
"name": "image",
"source": "/document/normalized_images/*/data"
}
]
}
],
"parameters": {
"projectionMode": "skipIndexingParentDocuments"
}
}
}
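With this skillset, each search document corresponds to one normalized image, so the "image" field holds a single base64 string rather than an array, and each chunk's image can be decoded directly. A sketch of querying the index and writing the images out (endpoint, index name, and key are placeholders, and the response here is simulated rather than fetched):

```python
import base64

# Placeholders: substitute your own service endpoint and index name.
endpoint = "https://<your-service>.search.windows.net"
index_name = "name"
api_version = "2023-11-01"
query_url = (
    f"{endpoint}/indexes/{index_name}/docs"
    f"?api-version={api_version}&search=*&$select=chunk_id,chunk,image"
)
# hits = requests.get(query_url, headers={"api-key": "<query key>"}).json()["value"]

# Simulated response for illustration:
hits = [
    {"chunk_id": "0", "chunk": "OCR text", "image": base64.b64encode(b"img-bytes").decode()}
]

for doc in hits:
    # Decode each chunk's base64 image back into an image file.
    with open(f"chunk_{doc['chunk_id']}.png", "wb") as f:
        f.write(base64.b64decode(doc["image"]))
```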