Searching and merging CSV files in Ruby when processing large files

Question

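Summary

Other questions along these lines have not helped, because I am already reading the file line by line and so I am not running out of memory on the large file. In fact my memory usage is quite low, but it is taking a very long time to create the smaller file that I need so I can search the other CSV and concatenate it into that file.

Problem

It has been 5 days and I am not sure how much further it has to go, but it still has not exited the foreach loop over the main file, a CSV file with 17.8 million records. Is there a faster way to handle this processing in Ruby? Is there anything I can do on MacOSX to optimize it? Any advice would be great.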

require 'csv'

# -------------------------------------------------------------------------------------
# USED TO GET ID NUMBERS OF THE SPECIFIC ITEMS THAT ARE NEEDED
# -------------------------------------------------------------------------------------
etas_title_file = './HathiTrust ETAS Titles.csv'
oclc_id_array = []
angies_csv = []
CSV.foreach(etas_title_file, headers: true, header_converters: :symbol) do |row|
  oclc_id_array << row[:oclc]
  angies_csv << row.to_h
end 
oclc_id_array.uniq!


# -------------------------------------------------------------------------------------
# RUN ONCE IF DATABASE IS NOT POPULATED
# -------------------------------------------------------------------------------------

headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]

remove_keys = %i[access rights description source source_bib_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]

new_hathi_csv = []
processed_keys = []
CSV.foreach('./hathi_full_20200401.txt', headers: headers, col_sep: "\t", quote_char: "\0") do |row|
  next unless oclc_id_array.include? row[:oclc_num]
  next if processed_keys.include? row[:oclc_num]
  puts "#{row[:oclc_num]} included? #{oclc_id_array.include? row[:oclc_num]}"
  new_hathi_csv << row.to_h.except(*remove_keys)
  processed_keys << row[:oclc_num]
end

Answer 1

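As far as I can determine, OCLC IDs are alphanumeric. This means we want to use a Hash to store these IDs. A Hash has a typical lookup complexity of O(1), whereas your unsorted Array has a lookup complexity of O(n).

If you use an Array, the worst case for a single lookup is 18 million comparisons (to find one element, Ruby has to go through all 18 million IDs), whereas with a Hash it is one comparison. Put simply: using a Hash will be millions of times faster than your current implementation.

The pseudocode below should give you an idea of how to proceed. We will use a Set, which is like a Hash but is handy when all you need to do is check for inclusion: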
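
To make the difference concrete, here is a minimal timing sketch. The IDs are made up (`ocn1`, `ocn2`, ...) and the sizes are arbitrary assumptions chosen so that it runs quickly; the absolute numbers will vary by machine, so only the relative gap matters.

```ruby
require 'benchmark'
require 'set'

# Hypothetical data: 20,000 made-up ID strings standing in for OCLC numbers.
ids_array = (1..20_000).map { |n| "ocn#{n}" }
ids_set   = ids_array.to_set

# Probe with IDs from the end of the array, the worst case for a linear scan.
probes = ids_array.last(200)

array_hits = nil
set_hits   = nil

# Array#include? walks the array element by element until it finds a match.
array_time = Benchmark.realtime do
  array_hits = probes.count { |id| ids_array.include?(id) }
end

# Set#include? hashes the ID and checks a single bucket.
set_time = Benchmark.realtime do
  set_hits = probes.count { |id| ids_set.include?(id) }
end

puts format('Array#include?: %.4f s (%d hits)', array_time, array_hits)
puts format('Set#include?:   %.4f s (%d hits)', set_time, set_hits)
```

Both loops find the same IDs; only the amount of work per lookup differs, and the gap widens as the collection grows.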

require 'set'

oclc_ids = Set.new

CSV.foreach(...) {
  ...
  oclc_ids.add(row[:oclc])  # Add ID to Set
  ...
}

# No need to call uniq! on a Set.
# The elements in a Set are always unique.

processed_keys = Set.new

CSV.foreach(...) {
  next unless oclc_ids.include?(row[:oclc_num])   # Extremely fast lookup
  next if processed_keys.include?(row[:oclc_num]) # Extremely fast lookup
  ...
  processed_keys.add(row[:oclc_num])
}
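
The pseudocode above could be filled out roughly as follows. This is only a sketch: the helper names `load_ids` and `first_matches` are hypothetical, the CSV options are passed through by the caller rather than fixed, and skipping blank ID cells is an assumption that the original code does not make.

```ruby
require 'csv'
require 'set'

# Hypothetical helper: collect the distinct values of one column into a Set.
def load_ids(path, column)
  ids = Set.new
  CSV.foreach(path, headers: true, header_converters: :symbol) do |row|
    id = row[column]
    ids.add(id) if id # assumption: blank cells are not useful IDs
  end
  ids
end

# Hypothetical helper: stream a large file row by row and keep the first row
# seen for each wanted ID. Extra keyword arguments go straight to CSV.foreach.
def first_matches(path, wanted_ids, key, **csv_options)
  seen = Set.new
  matches = []
  CSV.foreach(path, **csv_options) do |row|
    id = row[key]
    next unless wanted_ids.include?(id) # O(1) on average
    next unless seen.add?(id)           # add? returns nil if id was already seen
    matches << row.to_h
  end
  matches
end
```

With the file names from the question, the calls would look something like `wanted = load_ids('./HathiTrust ETAS Titles.csv', :oclc)` followed by `first_matches('./hathi_full_20200401.txt', wanted, :oclc_num, headers: headers, col_sep: "\t", quote_char: "\0")`. Memory use stays low because the large file is still read one row at a time; only the two Sets and the matching rows are held in memory.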
