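I'm trying to use Azure AI Search to return the specific pages, from a set of PDFs, that match a search query. Right now I'm using the "generateNormalizedImagePerPage" image action to convert each page to an image, and then the OcrSkill to read the text from the generated images. That lets me split up the content, but the problem is that when you query the index it returns the entire PDF document rather than just the specific pages that matched.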
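For context, the setup described above can be sketched roughly as follows. This is only an illustration of the per-page image action plus OCR, not my exact code: the language and orientation settings are assumptions, and the indexing parameters still have to be attached to an indexer that is not shown here.

```csharp
using System.Collections.Generic;
using Azure.Search.Documents.Indexes.Models;

// OCR skill that runs once per normalized page image and writes its result
// to /document/normalized_images/*/text.
var ocrSkill = new OcrSkill(
    inputs: new List<InputFieldMappingEntry>
    {
        new InputFieldMappingEntry("image") { Source = "/document/normalized_images/*" }
    },
    outputs: new List<OutputFieldMappingEntry>
    {
        new OutputFieldMappingEntry("text") { TargetName = "text" }
    })
{
    Context = "/document/normalized_images/*",
    DefaultLanguageCode = OcrSkillLanguage.En, // assumption: English documents
    ShouldDetectOrientation = true             // assumption: pages may be rotated
};

// Indexer configuration that renders every PDF page as one normalized image.
var indexingParameters = new IndexingParameters
{
    IndexingParametersConfiguration = new IndexingParametersConfiguration
    {
        ImageAction = BlobIndexerImageAction.GenerateNormalizedImagePerPage
    }
};
```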
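I think I can use index projections to make each page of a PDF a separate document in the search index.
Here is what I have tried. First I created the index: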
var index = new SearchIndex(name: "myindex")
{
    Fields =
    [
        new SearchField(name: "id", type: SearchFieldDataType.String)
            { IsSearchable = true, IsKey = true },
        new SearchField(name: "content", type: SearchFieldDataType.String)
            { IsFilterable = true, IsKey = false },
        new SearchField(name: "pagetext", type: SearchFieldDataType.String)
            { IsSearchable = true },
        new SearchField(name: "pagenumber", type: SearchFieldDataType.String)
            { IsSearchable = true }
    ]
};
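Then I created the index projections, setting the projection mode to skip indexing the parent documents. I also set parentKeyFieldName to "content", because this article says the field must be of type Edm.String, cannot be the key field, and must have Filterable set to true.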
var mappings = new List<InputFieldMappingEntry>
{
    new (name: "pagetext")
    {
        Source = "/document/normalized_images/*/text"
    },
    new (name: "pagenumber")
    {
        Source = "/document/normalized_images/*/pageNumber"
    }
};

var selectors = new List<SearchIndexerIndexProjectionSelector>
{
    new (targetIndexName: "myindex",
         parentKeyFieldName: "content",
         sourceContext: "/document/normalized_images/*",
         mappings: mappings)
};

var indexProjections = new SearchIndexerIndexProjections(selectors)
{
    Parameters = new SearchIndexerIndexProjectionsParameters
    {
        ProjectionMode = IndexProjectionMode.SkipIndexingParentDocuments
    }
};
My problem is that I get an error when I try to create the skillset:
One or more index projection selectors are invalid.
Details: Index 'myindex' must contain field 'content', it must be of type Edm.String,
cannot be the key field and it must be filterable.
This error confuses me, because I thought I had met all the requirements that the article specifies for the targetIndexName index.
The content field in your index does not meet the requirements stated in the error message.
The index needs to be defined like this:
Fields =
{
    new SearchField("id", SearchFieldDataType.String) { IsSearchable = true, IsKey = true },
    new SearchField("content", SearchFieldDataType.String) { IsSearchable = true, IsFilterable = true },
    new SearchField("pagetext", SearchFieldDataType.String) { IsSearchable = true },
    new SearchField("pagenumber", SearchFieldDataType.Int32) { IsFilterable = true }
}
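To show where that field list fits, here is a minimal sketch that builds the index and sends it to the service. The environment variable names SEARCH_ENDPOINT and SEARCH_ADMIN_KEY are hypothetical placeholders, and the error handling is only illustrative; adapt both to your own setup.

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

// Hypothetical configuration source: replace with your own endpoint and admin key.
string endpoint = Environment.GetEnvironmentVariable("SEARCH_ENDPOINT")
    ?? throw new InvalidOperationException("SEARCH_ENDPOINT is not set");
string adminKey = Environment.GetEnvironmentVariable("SEARCH_ADMIN_KEY")
    ?? throw new InvalidOperationException("SEARCH_ADMIN_KEY is not set");

var indexClient = new SearchIndexClient(new Uri(endpoint), new AzureKeyCredential(adminKey));

var index = new SearchIndex("myindex")
{
    Fields =
    {
        new SearchField("id", SearchFieldDataType.String) { IsSearchable = true, IsKey = true },
        // Parent key field for the projection: a filterable, non-key string.
        new SearchField("content", SearchFieldDataType.String) { IsSearchable = true, IsFilterable = true },
        new SearchField("pagetext", SearchFieldDataType.String) { IsSearchable = true },
        new SearchField("pagenumber", SearchFieldDataType.Int32) { IsFilterable = true }
    }
};

try
{
    // CreateOrUpdate means an existing index does not have to be deleted first,
    // although some field changes may still require dropping and recreating it.
    indexClient.CreateOrUpdateIndex(index);
}
catch (RequestFailedException ex)
{
    Console.WriteLine("Failed to create the index\n Exception message: {0}\n", ex.Message);
    throw;
}
```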
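I then modified the skillset creation to include the index projections along with the skills. The CreateOrUpdateDemoSkillSetWithIndexProjections method now takes indexProjections as an additional parameter and sets it on the skillset.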
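A rough sketch of what that modified method could look like is below. Treat the property name as an assumption: I believe the preview SDK releases that expose the SearchIndexerIndexProjections type used in the question name it IndexProjections, while later releases renamed both the type and the property to the singular form, so check it against the package version you have installed. The SkillsetHelpers class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

public static class SkillsetHelpers // hypothetical wrapper class
{
    public static SearchIndexerSkillset CreateOrUpdateDemoSkillSetWithIndexProjections(
        SearchIndexerClient indexerClient,
        IList<SearchIndexerSkill> skills,
        string azureAiServicesKey,
        SearchIndexerIndexProjections indexProjections)
    {
        var skillset = new SearchIndexerSkillset("demoskillset", skills)
        {
            Description = "Demo skillset",
            CognitiveServicesAccount = new CognitiveServicesAccountKey(azureAiServicesKey),
            // Assumption: property name as in the preview SDK; verify for your version.
            IndexProjections = indexProjections
        };

        try
        {
            // The target index must already exist, because the service validates
            // the projection selectors against it when the skillset is saved.
            indexerClient.CreateOrUpdateSkillset(skillset);
        }
        catch (RequestFailedException ex)
        {
            Console.WriteLine("Failed to create the skillset\n Exception message: {0}\n", ex.Message);
            throw;
        }

        return skillset;
    }
}
```

Creating the index before calling this method matters, since the selector validation error in the question is raised at skillset creation time.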
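Note:
The Entity Recognition skill (v2) (Microsoft.Skills.Text.EntityRecognitionSkill) has been discontinued and is replaced by Microsoft.Skills.Text.V3.EntityRecognitionSkill. Follow the recommendations in Deprecated skills to migrate to a supported skill.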
Code taken from git:
private static SearchIndexerSkillset CreateOrUpdateDemoSkillSet(SearchIndexerClient indexerClient, IList<SearchIndexerSkill> skills, string azureAiServicesKey)
{
    // Azure AI services was formerly known as Cognitive Services.
    // The APIs still use the old name, so we need to create a CognitiveServicesAccountKey object
    SearchIndexerSkillset skillset = new SearchIndexerSkillset("demoskillset", skills)
    {
        Description = "Demo skillset",
        CognitiveServicesAccount = new CognitiveServicesAccountKey(azureAiServicesKey)
    };

    // Create the skillset in your search service.
    // The skillset does not need to be deleted if it was already created
    // since we are using the CreateOrUpdate method
    try
    {
        indexerClient.CreateOrUpdateSkillset(skillset);
    }
    catch (RequestFailedException ex)
    {
        Console.WriteLine("Failed to create the skillset\n Exception message: {0}\n", ex.Message);
        ExitProgram("Cannot continue without a skillset");
    }

    return skillset;
}
Output: