The correct way to upload documents to FSCrawler for indexing in Elasticsearch


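I'm prototyping a Rails application that uploads documents to FSCrawler (running its REST interface) so they can be incorporated into an Elasticsearch index. Going by their example: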

response = `curl -F "file=@#{params[:document][:upload].tempfile.path}" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"`

The file is uploaded and its content is indexed. Here is an example of what I get back:

"{\n \"ok\" : true,\n \"filename\" : \"RackMultipart20200130-91061-16swulg.pdf\",\n \"url\" : \"http://127.0.0.1:9200/local/_doc/d661edecf3e28572676e97a6f0d1d\",\n \"doc\" : {\n \"content\" : \"\\n \\n \\n\\nBasically, what you need to know is that Dante is all IP-based, and makes use of common IT standards. Each Dante device behaves \\n\\nmuch like any other network device you would already find on your network. \\n\\nIn order to make integration into an existing network easy, here are some of the things that Dante does: \\n\\n▪ Dante...

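Because the response is JSON, the `ok` and `filename` fields can be checked in code rather than by eye. The following is a minimal sketch, assuming the response body has the shape shown above; the helper name `fscrawler_upload_result` and the sample values in the usage note are made up for illustration.

```ruby
require 'json'

# Hypothetical helper: pull the fields of interest out of an FSCrawler
# _upload response body (a JSON string shaped like the output above).
# Raises JSON::ParserError if the body is not valid JSON.
def fscrawler_upload_result(body)
  parsed = JSON.parse(body)
  {
    ok: parsed['ok'] == true,       # true only when FSCrawler reported success
    filename: parsed['filename'],   # the name FSCrawler recorded for the upload
    url: parsed['url']              # where the document landed in Elasticsearch
  }
end
```

For example, `fscrawler_upload_result(response)[:filename]` can be compared against `params[:document][:upload].original_filename` to detect the Tempfile-name problem described next.
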
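When I run curl from the command line, I get everything, including a correctly set "filename". If I use it as shown above, from a Rails controller, the filename is set to the Tempfile's name, as you can see. That is not a workable solution. Trying params[:document][:upload].tempfile (without .path), or just params[:document][:upload], fails completely.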
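
If shelling out to curl is kept for the prototype, curl's `-F` syntax accepts a `;filename=` suffix that overrides the name sent in the multipart part. Below is a minimal sketch of that idea. The helper names `curl_upload_args` and `upload_with_curl` are hypothetical, and the sketch assumes the Tempfile path itself contains no `;`, `,` or `"` characters.

```ruby
require 'open3'

# Hypothetical helper: build the curl argument list so the uploaded part
# carries the original filename instead of the Tempfile name.
def curl_upload_args(path, original_filename, url)
  # Inside a double-quoted filename, curl expects embedded double quotes
  # and backslashes to be escaped with a backslash.
  escaped = original_filename.gsub(/["\\]/) { |c| "\\#{c}" }
  ['curl', '-sS', '-F', %(file=@#{path};filename="#{escaped}"), url]
end

# Run curl without going through a shell, so the filename is never
# interpreted by the shell. Returns [stdout, Process::Status].
def upload_with_curl(path, original_filename, url)
  Open3.capture2(*curl_upload_args(path, original_filename, url))
end
```

In the controller this would be called with `params[:document][:upload].tempfile.path` and `params[:document][:upload].original_filename`. Passing the arguments as an array, rather than interpolating them into a backtick string, also avoids shell quoting problems with unusual filenames.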

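I'm trying to do this "the right way", but every incarnation that uses a proper HTTP client has failed. I can't figure out how to issue an HTTP POST that submits the file to FSCrawler the way curl does on the command line.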

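In this example, I simply try to send the file using the Tempfile object. For some reason, FSCrawler gives me the error shown in the comments; it picks up some metadata but does not index the content: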

## Failed to extract [100000] characters of text for ...
## org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
uri = URI("http://127.0.0.1:8080/fscrawler/_upload?debug=true")
request = Net::HTTP::Post.new(uri)
form_data = [['file', params[:document][:upload].tempfile,
  { filename: params[:document][:upload].original_filename,
  content_type: params[:document][:upload].content_type }]]
request.set_form form_data, 'multipart/form-data'
response = Net::HTTP.start(uri.hostname, uri.port) do |http|
  http.request(request)
end

If I change the above to use params[:document][:upload].tempfile.path, I no longer get the InputStream error, but I (still) don't get any content indexed. Here is an example of what I get:

 {"_index":"local","_type":"_doc","_id":"72c9ecf2a83440994eb87d28786e6","_version":3,"_seq_no":26,"_primary_term":1,"found":true,"_source":{"content":"/var/folders/bn/pcc1h8p16tl534pw__fdz2sw0000gn/T/RackMultipart20200130-91061-134tcxn.pdf\n","meta":{},"file":{"extension":"pdf","content_type":"text/plain; charset=ISO-8859-1","indexing_date":"2020-01-30T15:33:45.481+0000","filename":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"},"path":{"virtual":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf","real":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"}}}

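The result above is consistent with how `Net::HTTPHeader#set_form` treats its values: a String is sent as the literal content of the part, which is why the indexed `content` is the temp path itself, while an IO-like object is read and streamed. For the zero-byte error, one possible cause (an assumption; nothing in this thread confirms it) is that the Tempfile's read position was already at the end of the file. The sketch below sidesteps both by opening a fresh binary handle on the path. The helper names `build_upload_request` and `upload_to_fscrawler` are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a multipart POST whose "file" part is an open
# IO object (not a path String), labelled with the original filename.
def build_upload_request(uri, io, original_filename, content_type)
  request = Net::HTTP::Post.new(uri)
  form_data = [['file', io, { filename: original_filename, content_type: content_type }]]
  request.set_form(form_data, 'multipart/form-data')
  request
end

# Open the uploaded file fresh in binary mode so reading starts at byte 0,
# then send it. The handle stays open until the request has completed.
def upload_to_fscrawler(uri, path, original_filename, content_type)
  File.open(path, 'rb') do |io|
    request = build_upload_request(uri, io, original_filename, content_type)
    Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
  end
end
```

In the controller, `path`, `original_filename` and `content_type` would come from `params[:document][:upload].tempfile.path`, `.original_filename` and `.content_type` respectively. Whether this resolves the FSCrawler error has not been verified here.
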
If I try RestClient and send the file by referencing the Tempfile's actual path, I get this error message and nothing is indexed:

## Unsupported media type
response = RestClient.post 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  file: params[:document][:upload].tempfile.path,
  content_type: params[:document][:upload].content_type

If I .read() the file and submit the result, I break the FSCrawler form handling:

## Internal server error
request = RestClient::Request.new(
  :method => :post,
  :url => 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  :payload => {
    :multipart => true,
    :file => File.read(params[:document][:upload].tempfile),
    :content_type => params[:document][:upload].content_type
})
response = request.execute

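Clearly, I've been trying everything I can think of, but I can't replicate what curl does with any of the Ruby-based HTTP clients I know. I'm at a complete loss as to how to get Ruby to submit data to FSCrawler in a way that gets the document content indexed correctly. I've been at this far longer than I care to admit. What am I missing here?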

ruby elasticsearch curl rest-client net-http
1 Answer

I finally tried Faraday and, based on this answer, came up with the following:

connection = Faraday.new('http://127.0.0.1:8080') do |f|
  f.request :multipart
  f.request :url_encoded
  f.adapter :net_http
end
file = Faraday::UploadIO.new(
  params[:document][:upload].tempfile.path,
  params[:document][:upload].content_type,
  params[:document][:upload].original_filename
)
payload = { :file => file }
response = connection.post('/fscrawler/_upload', payload)

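Using Fiddler helped me see the results of my attempts as my requests got closer and closer to what curl sends. This snippet issues the request almost exactly as curl does. To route the call through the proxy, I only had to add , proxy: 'http://localhost:8866' to the end of the connection setup.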
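
The same proxy technique can be applied to the Net::HTTP attempts from the question, so their requests can be compared with curl's in the same tool. This is a minimal sketch, assuming the debugging proxy listens on localhost:8866 as in the answer; the helper name `http_via_debug_proxy` is hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: create a Net::HTTP client that sends its traffic
# through a local debugging proxy, so each request can be inspected.
def http_via_debug_proxy(uri, proxy_host = 'localhost', proxy_port = 8866)
  # The third and fourth arguments of Net::HTTP.new are the proxy address and port.
  Net::HTTP.new(uri.hostname, uri.port, proxy_host, proxy_port)
end
```

A request built elsewhere can then be sent with `http_via_debug_proxy(uri).request(request)`, which connects through the proxy instead of directly to FSCrawler.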
