尽管字段被标记为索引= true,Solr仍无法搜索原始的爬网条目

问题描述 投票:1回答:1

我同时运行Nutch 1.16搜寻器实例和Solr版本8.3.0。我已经能够在本地目录上搜寻文件,并编辑nutch-site.xml,并从中提取一些元数据(尽管不如我期望的那样)并运行bin/crawl -s urls dircrawl 2 >& dircrawl.log。然后,已爬网的数据通过bin/nutch index dircrawl/crawldb/ -linkdb dircrawl/linkdb/ -dir dircrawl/segments/ -filter -normalize发送到Solr,然后通过条目的标签存储和管理条目。

现在,从用户界面运行Solr Admin,我正在尝试搜索数据。我确保将我感兴趣的所有条目签名为indexed=true。但是,运行除*:*以外的任何搜索都将返回零结果。我尝试了搜索字段的所有可能组合,也没有骰子。我将链接到我的配置文件的描述,首先是solr,然后是八卦...

schema.xml (becomes managed-schema when running it, for some reason)

<?xml version="1.0" encoding="UTF-8"?>
<schema name="nutch-crawler-indexing-config" version="1.6">
  <uniqueKey>id</uniqueKey>
  <fieldType name="_nest_path_" class="solr.NestPathField" omitTermFreqAndPositions="true" omitNorms="true" maxCharsForDocValues="-1" stored="false"/>
  <fieldType name="ancestor_path" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
  </fieldType>
  (all fieldTypes are the default ones)
  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="2" maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_gl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_gl.txt" ignoreCase="true"/>
      <filter class="solr.GalicianStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_hi" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.IndicNormalizationFilterFactory"/>
      <filter class="solr.HindiNormalizationFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_hi.txt" ignoreCase="true"/>
      <filter class="solr.HindiStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_hu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_hu.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_hy" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_hy.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Armenian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_id" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_id.txt" ignoreCase="true"/>
      <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_it" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ElisionFilterFactory" articles="lang/contractions_it.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_it.txt" ignoreCase="true"/>
      <filter class="solr.ItalianLightStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ja" class="solr.TextField" autoGeneratePhraseQueries="false" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
      <filter class="solr.JapaneseBaseFormFilterFactory"/>
      <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_ja.txt" ignoreCase="true"/>
      <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KoreanTokenizerFactory" outputUnknownUnigrams="false" decompoundMode="discard"/>
      <filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
      <filter class="solr.KoreanReadingFormFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_lv" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_lv.txt" ignoreCase="true"/>
      <filter class="solr.LatvianStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_nl.txt" ignoreCase="true"/>
      <filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemdict_nl.txt" ignoreCase="false"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_no.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_pt.txt" ignoreCase="true"/>
      <filter class="solr.PortugueseLightStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ro" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_ro.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Romanian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_ru.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_sv.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Swedish"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ThaiTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_th.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.TurkishLowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_tr.txt" ignoreCase="false"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <field name="_nest_path_" type="_nest_path_"/>
  <field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
  <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
  <field name="_version_" type="plong" indexed="false" stored="false"/>
  <field name="boost" type="pdoubles"/>
  <field name="content" type="text_general"/>
  <field name="digest" type="text_general"/>
  <field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
  <field name="metatag.author" type="text_general" indexed="true"/>
  <field name="metatag.channels" type="plongs"/>
  <field name="metatag.creator" type="text_general" indexed="true"/>
  <field name="metatag.samplerate" type="plongs"/>
  <field name="metatag.version" type="text_general"/>
  <field name="title" type="text_general" indexed="true"/>
  <field name="tstamp" type="pdates"/>
  <field name="url" type="text_general" stored="true"/>
  <dynamicField name="*_txt_en_split_tight" type="text_en_splitting_tight" indexed="true" stored="true"/>
  <dynamicField name="*_descendent_path" type="descendent_path" indexed="true" stored="true"/>
  <dynamicField name="*_ancestor_path" type="ancestor_path" indexed="true" stored="true"/>
  <dynamicField name="*_txt_en_split" type="text_en_splitting" indexed="true" stored="true"/>
  <dynamicField name="*_txt_sort" type="text_gen_sort" indexed="true" stored="true"/>
  <dynamicField name="ignored_*" type="ignored"/>
  <dynamicField name="*_txt_rev" type="text_general_rev" indexed="true" stored="true"/>
  <dynamicField name="*_phon_en" type="phonetic_en" indexed="true" stored="true"/>
  <dynamicField name="*_s_lower" type="lowercase" indexed="true" stored="true"/>
  <dynamicField name="*_txt_cjk" type="text_cjk" indexed="true" stored="true"/>
  <dynamicField name="random_*" type="random"/>
  <dynamicField name="*_t_sort" type="text_gen_sort" multiValued="false" indexed="true" stored="true"/>
  <dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ar" type="text_ar" indexed="true" stored="true"/>
  <dynamicField name="*_txt_bg" type="text_bg" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ca" type="text_ca" indexed="true" stored="true"/>
  <dynamicField name="*_txt_cz" type="text_cz" indexed="true" stored="true"/>
  <dynamicField name="*_txt_da" type="text_da" indexed="true" stored="true"/>
  <dynamicField name="*_txt_de" type="text_de" indexed="true" stored="true"/>
  <dynamicField name="*_txt_el" type="text_el" indexed="true" stored="true"/>
  <dynamicField name="*_txt_es" type="text_es" indexed="true" stored="true"/>
  <dynamicField name="*_txt_et" type="text_et" indexed="true" stored="true"/>
  <dynamicField name="*_txt_eu" type="text_eu" indexed="true" stored="true"/>
  <dynamicField name="*_txt_fa" type="text_fa" indexed="true" stored="true"/>
  <dynamicField name="*_txt_fi" type="text_fi" indexed="true" stored="true"/>
  <dynamicField name="*_txt_fr" type="text_fr" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ga" type="text_ga" indexed="true" stored="true"/>
  <dynamicField name="*_txt_gl" type="text_gl" indexed="true" stored="true"/>
  <dynamicField name="*_txt_hi" type="text_hi" indexed="true" stored="true"/>
  <dynamicField name="*_txt_hu" type="text_hu" indexed="true" stored="true"/>
  <dynamicField name="*_txt_hy" type="text_hy" indexed="true" stored="true"/>
  <dynamicField name="*_txt_id" type="text_id" indexed="true" stored="true"/>
  <dynamicField name="*_txt_it" type="text_it" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ja" type="text_ja" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ko" type="text_ko" indexed="true" stored="true"/>
  <dynamicField name="*_txt_lv" type="text_lv" indexed="true" stored="true"/>
  <dynamicField name="*_txt_nl" type="text_nl" indexed="true" stored="true"/>
  <dynamicField name="*_txt_no" type="text_no" indexed="true" stored="true"/>
  <dynamicField name="*_txt_pt" type="text_pt" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ro" type="text_ro" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ru" type="text_ru" indexed="true" stored="true"/>
  <dynamicField name="*_txt_sv" type="text_sv" indexed="true" stored="true"/>
  <dynamicField name="*_txt_th" type="text_th" indexed="true" stored="true"/>
  <dynamicField name="*_txt_tr" type="text_tr" indexed="true" stored="true"/>
  <dynamicField name="*_point" type="point" indexed="true" stored="true"/>
  <dynamicField name="*_srpt" type="location_rpt" indexed="true" stored="true"/>
  <dynamicField name="attr_*" type="text_general" multiValued="true" indexed="true" stored="true"/>
  <dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
  <dynamicField name="*_str" type="strings" docValues="true" indexed="false" stored="false" useDocValuesAsStored="false"/>
  <dynamicField name="*_dts" type="pdate" multiValued="true" indexed="true" stored="true"/>
  <dynamicField name="*_dpf" type="delimited_payloads_float" indexed="true" stored="true"/>
  <dynamicField name="*_dpi" type="delimited_payloads_int" indexed="true" stored="true"/>
  <dynamicField name="*_dps" type="delimited_payloads_string" indexed="true" stored="true"/>
  <dynamicField name="*_is" type="pints" indexed="true" stored="true"/>
  <dynamicField name="*_ss" type="strings" indexed="true" stored="true"/>
  <dynamicField name="*_ls" type="plongs" indexed="true" stored="true"/>
  <dynamicField name="*_bs" type="booleans" indexed="true" stored="true"/>
  <dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
  <dynamicField name="*_ds" type="pdoubles" indexed="true" stored="true"/>
  <dynamicField name="*_dt" type="pdate" indexed="true" stored="true"/>
  <dynamicField name="*_ws" type="text_ws" indexed="true" stored="true"/>
  <dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_l" type="plong" indexed="true" stored="true"/>
  <dynamicField name="*_t" type="text_general" multiValued="false" indexed="true" stored="true"/>
  <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
  <dynamicField name="*_f" type="pfloat" indexed="true" stored="true"/>
  <dynamicField name="*_d" type="pdouble" indexed="true" stored="true"/>
  <dynamicField name="*_p" type="location" indexed="true" stored="true"/>
  <copyField source="digest" dest="digest_str" maxChars="256"/>
  <copyField source="title" dest="title_str" maxChars="256"/>
  <copyField source="url" dest="url_str" maxChars="256"/>
  <copyField source="content" dest="content_str" maxChars="256"/>
  <copyField source="metatag.author" dest="metatag.author_str" maxChars="256"/>
  <copyField source="metatag.version" dest="metatag.version_str" maxChars="256"/>
  <copyField source="metatag.creator" dest="metatag.creator_str" maxChars="256"/>
</schema>

然后nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<configuration>
<property>
 <name>http.agent.name</name>
 <value>NutchSpiderTest</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchSpiderTest,*</value>
  <description>...
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika|metatags|text)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...
  </description>
</property>

<property>
 <name>file.content.limit</name>
 <value>-1</value>
 <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>The crawler is not restricted to the directories that you specified in the
    Urls file but it is jumping into the parent directories as well. For your own crawlings you can
    change this behavior (set to false) the way that only directories beneath the directories that you specify get
    crawled.</description>
</property>


<property>
    <name>parser.skip.truncated</name>
    <value>false</value>
    <description>Boolean value for whether we should skip parsing for truncated documents. By default this
        property is activated due to extremely high levels of CPU which parsing can sometimes take.
    </description>
</property>
<!--
  <value>protocol-file|protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
-->

<!-- Used only if plugin parse-metatags is enabled. -->
<property>
<name>metatags.names</name>
<value>*</value>
<description> ...
</description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords,metatag.author,metatag.channels,metatag.content_encoding,metatag.content_type,metatag.creator,metatag.dc_creator,metatag.dc_title,metatag.id,metatag.meta_author,metatag.samplerate,metatag.stream_content_type,metatag.stream_name,metatag.stream_size,metatag.stream_source_info,metatag.title,metatag.version,metatag.x_parsed_by,metatag.xmpdm_album,metatag.album,metatag.xmpdm_albumartist,metatag.albumartist,metatag.xmpdm_artist,metatag.artist,metatag.xmpdm_audiochanneltype,metatag.audiochanneltype,metatag.xmpdm_audiocompressor,metatag.audiocompressor,metatag.xmpdm_audiosamplerate,metatag.audiosamplerate,metatag.xmpdm_composer,metatag.composer,metatag.xmpdm_discnumber,metatag.discnumber,metatag.xmpdm_duration,metatag.duration,metatag.xmpdm_genre,metatag.genre,metatag.xmpdm_releasedate,metatag.releasedate,metatag.xmpdm_tracknumber,metatag.tracknumber,metatag.copyright,author,Genre</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>

</configuration>

对“ ”进行查询的结果:]]

{
  "responseHeader":{
    ...,
    "params":{
      "q":"*:*",
      "_":"..."}},
  "response":{"numFound":24,"start":0,"docs":[
      {...

运行任何其他类型的查询的响应:

{
  "responseHeader":{
    ...
    "params":{
      "q":"Bumblebee",
      "_":"..."}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

另外,我要索引的数据是免费音乐档案中的各种.mp3文件。

编辑:我要搜索的文件如下:

  {
        "metatag.author":["A Kombi",
          "A Kombi"],
        "metatag.samplerate":[44100,
          44100],
        "title":["Plight Of The Bumblebee"],
        "url":["file:/c:/Users/.../fma/fma_small/009/009476.mp3"],
        "content":["Plight Of The Bumblebee\nPlight Of The Bumblebee\nA Kombi\nMusic to Drive By, track 2\n2004-09-14T00:00:00\nField Recordings\n30014.912\n"],
        "metatag.creator":["A Kombi",
          "A Kombi"],
        "tstamp":["2020-04-02T15:26:29.507Z"],
        "digest":["ddd4ab2288c5799a5646592e1a63437f"],
        "boost":[0.20851442],
        "id":"file:/c:/Users/.../fma/fma_small/009/009476.mp3",
        "metatag.version":["MPEG 3 Layer III Version 1",
          "MPEG 3 Layer III Version 1"],
        "metatag.channels":[2,
          2],
        "_version_":1662875102548590596}

我同时运行Nutch 1.16搜寻器实例和Solr版本8.3.0。我已经能够搜寻本地目录中的文件,并编辑nutch-site.xml,从中提取一些元数据(尽管...

indexing solr web-crawler nutch
1个回答
0
投票

除非您have a default search field configured,否则必须设置要针对哪个字段进行搜索-在较旧的schema.xml版本中,可以为模式配置此属性,但是建议的方法是在查询本身中对其进行配置。

© www.soinside.com 2019 - 2024. All rights reserved.