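I have some data that I need to export as CSV. There are currently about 10,000 records and the number will keep growing, so I want an efficient way to iterate over them, especially since I run several each loops one after another. My question: is there a way to avoid the multiple each loops described below? If not, is there something other than Ruby's each/map that I could use to keep processing time constant regardless of the size of the data?
For example, this is what I currently do:
First, I loop over the whole data set to flatten and rename the fields that hold array values, so a field like issue becomes issue_1 if its array contains one item, and issue_1 and issue_2 if it contains two items.
Next, I loop again to get the unique keys across all the hashes in the array.
Using the unique keys from step 2, I do another loop to sort those keys against a different array that holds the order in which the keys should be arranged.
Finally, I do one more loop to generate the CSV.
So I iterate over the data four times using Ruby's each/map, and the time needed to complete these loops grows with the size of the data.
The original data looks like this: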
def data
  [
    {"file"=> ["getty_883231284_200013331818843182490_335833.jpg"], "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified" => "2019-12-24", "book_title_1"=>"", "title"=> ["haha"], "edition"=> [""], "issue" => ["nov"], "creator" => ["yes", "some"], "publisher"=> ["Library"], "place_of_publication" => "London, UK"},
    {"file" => ["getty_883231284_200013331818843182490_335833.jpg"], "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified"=>"2019-12-24", "book_title"=> [""], "title" => ["try"], "edition"=> [""], "issue"=> ["dec", 'ten'], "creator"=> ["tako", "bell", 'big mac'], "publisher"=> ["Library"], "place_of_publication" => "NY, USA"}
  ]
end
The data is then remapped by flattening the arrays and renaming the keys that hold those arrays:
def csv_data
  @csv_data = [
    {"file_1"=>"getty_883231284_200013331818843182490_335833.jpg", "id"=>"60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded"=>"2019-12-24", "date_modified"=>"2019-12-24", "book_title_1"=>"", "title_1"=>"haha", "edition_1"=>"", "issue_1"=>"nov", "creator_1"=>"yes", "creator_2"=>"some", "publisher_1"=>"Library", "place_of_publication_1"=>"London, UK"},
    {"file_1"=>"getty_883231284_200013331818843182490_335833.jpg", "id"=>"60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded"=>"2019-12-24", "date_modified"=>"2019-12-24", "book_title_1"=>"", "title_1"=>"try", "edition_1"=>"", "issue_1"=>"dec", "issue_2" => 'ten', "creator_1"=>"tako", "creator_2"=>"bell", 'creator_3' => 'big mac', "publisher_1"=>"Library", "place_of_publication_1"=>"NY, USA"}
  ]
end
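The remapping step described above could be sketched roughly as follows. This is only an illustration under assumptions: `flatten_row` and `flatten_all` are hypothetical names, and scalar values keep their key unchanged here (the sample output above also suffixes some scalar fields, which this sketch does not reproduce). It collects the unique keys during the same pass, which folds steps 1 and 2 into one loop. Every record still has to be visited at least once, so the total time remains proportional to the number of records.

```ruby
require 'set'

# Hypothetical helper: expands array values into key_1, key_2, ... entries.
# Scalar values keep their key unchanged (an assumption of this sketch).
def flatten_row(row)
  row.each_with_object({}) do |(key, value), flat|
    if value.is_a?(Array)
      value.each_with_index { |item, i| flat["#{key}_#{i + 1}"] = item }
    else
      flat[key] = value
    end
  end
end

# Flattens every row and records each key seen, in a single pass.
def flatten_all(rows)
  seen_keys = Set.new
  flat_rows = rows.map do |row|
    flat = flatten_row(row)
    seen_keys.merge(flat.keys)
    flat
  end
  [flat_rows, seen_keys.to_a]
end

# Small made-up sample, not the real data
rows = [
  { "id" => "a1", "issue" => ["nov"], "creator" => ["yes", "some"] },
  { "id" => "b2", "issue" => ["dec", "ten"], "creator" => ["tako"] }
]
flat_rows, keys = flatten_all(rows)
```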
The headers for the above data are then sorted:
def csv_header
  csv_order = ["id", "edition_1", "date_uploaded", "creator_1", "creator_2", "creator_3", "book_title_1", "publisher_1", "file_1", "place_of_publication_1", "journal_title_1", "issue_1", "issue_2", "date_modified"]
  sorted_header = []
  # Collect every distinct key across all rows
  all_keys = csv_data.flat_map(&:keys).uniq.compact
  # Re-sort by numeric suffix, e.g. creator_isni_1 comes before creator_isni_2
  all_keys = all_keys.sort_by { |name| [name[/\d+/].to_i, name] }
  # For each entry in csv_order, append the matching keys in that order
  csv_order.each { |k| all_keys.each { |e| sorted_header << e if e.start_with?(k) } }
  sorted_header.uniq
end
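As an alternative to the nested loop over csv_order and all_keys, the ordering could be expressed as a single sort_by. The sketch below is an assumption-laden illustration, not a drop-in replacement: `order_headers` is a hypothetical name, the order list holds base names without the numeric suffix, and keys whose base name is not listed are placed last rather than dropped.

```ruby
# Hypothetical helper: orders flattened headers by the position of their base
# name in `base_order`, then by numeric suffix. Unknown base names go last.
def order_headers(keys, base_order)
  position = base_order.each_with_index.to_h # base name => rank
  keys.uniq.sort_by do |key|
    base = key.sub(/_\d+\z/, "")       # "creator_2" -> "creator"
    suffix = key[/_(\d+)\z/, 1].to_i   # "creator_2" -> 2, no suffix -> 0
    [position.fetch(base, base_order.length), suffix, key]
  end
end

# Small made-up sample, not the real header list
base_order = %w[id edition creator issue date_modified]
keys = %w[issue_2 creator_2 id date_modified creator_1 issue_1 edition_1 title_1]
ordered = order_headers(keys, base_order)
# => ["id", "edition_1", "creator_1", "creator_2", "issue_1", "issue_2", "date_modified", "title_1"]
```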
Then the CSV is generated, which involves yet more looping:
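require 'csv'

def to_csv
  # csv_header takes no arguments; keep the variable name consistent below
  sorted_header = csv_header
  CSV.generate(headers: true) do |csv|
    csv << sorted_header
    csv_data.each do |hash|
      # Missing keys come back as nil and are written as empty fields
      csv << hash.values_at(*sorted_header)
    end
  end
end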
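To be honest, more than the programming part, I was curious whether I could work out the logic you need without any further description (though of course I enjoyed the programming too; I haven't done Ruby in a long time, so this was a good refresher). Since the task is not clearly stated, it has to be "distilled" from your description, the input data and the code.
I think what you should do is keep everything in very basic, lightweight arrays and do the heavy lifting in one single big step while reading the data. I also made the assumption that if a key ends with a number, or a value is an array, you want it returned as {key}_{n}, even if only one value is present.
So far I have come up with the code below (the logic is described in the comments); a REPL demo is here.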
class CustomData
  # @keys array structure
  # 0: Key
  # 1: Maximum amount of values associated
  # 2: Is an array (Found a {key}_n key in feed,
  #    or value in feed was an array)
  #
  # @data: is a simple array of arrays
  attr_accessor :keys, :data

  CSV_ORDER = %w[
    id edition date_uploaded creator book_title publisher
    file place_of_publication journal_title issue date_modified
  ]

  def initialize(feed)
    @keys = CSV_ORDER.map { |key| [key, 0, false]}
    @data = []
    feed.each do |row|
      new_row = []
      # Sort keys in order to maintain the right order for {key}_{n} values
      row.sort_by { |key, _| key }.each do |key, value|
        is_array = false
        if key =~ /_\d+$/
          # If key ends with a number, extract key
          # and remember it is an array for the output
          key, is_array = key[/^(.*)_\d+$/, 1], true
        end
        if value.is_a? Array
          # If value is an array, even if the key did not end with a number,
          # we remember that for the output
          is_array = true
        else
          value = [value]
        end
        # Find position of key if exists or nil
        key_index = @keys.index { |a| a.first == key }
        if key_index
          # If you could have a combination of _n keys and array values
          # for a key in your feed, you need to change this portion here
          # to account for all previous values, which would add some complexity
          #
          # If current amount of values is greater than the saved one, override
          @keys[key_index][1] = value.length if @keys[key_index][1] < value.length
          @keys[key_index][2] = true if is_array and not @keys[key_index][2]
        else
          # It is a new key in @keys array
          key_index = @keys.length
          @keys << [key, value.length, is_array]
        end
        # Add value array at known key index
        # (will be padded with nil if idx is greater than array size)
        new_row[key_index] = value
      end
      @data << new_row
    end
  end

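  def to_csv_data(headers=true)
    result, header, body = [], [], []
    if headers
      @keys.each do |key|
        # Keys that never appeared in the feed get no column at all,
        # so the header stays aligned with the body rows
        next if key[1].zero?
        if key[2]
          # If the key should hold multiple values, build the header string
          key[1].times { |i| header << "#{key[0]}_#{i+1}" }
        else
          # Otherwise it is a singular value and the header goes unmodified
          header << key[0]
        end
      end
      result << header
    end
    @data.each do |row|
      new_row = []
      # Walk @keys rather than the row, so keys missing from this row
      # (nil or absent slots) are still padded with nil values
      @keys.each_with_index do |key, index|
        value = row[index] || []
        # Use the value counter from @keys to pad with nil values,
        # if a value is not present
        key[1].times do |count|
          new_row << value[count]
        end
      end
      body << new_row
    end
    # Append the rows individually so the result is a flat list of rows
    result.concat(body)
  end
end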
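To turn such a list of rows into CSV text, something along these lines should work. It is a minimal sketch that assumes the rows arrive as a flat array with the header row first; the `rows` literal is a made-up stand-in for the builder's output, not real data.

```ruby
require 'csv'

# Stand-in for the rows a builder like the one above might return:
# header first, then one array per record (nil marks a missing value).
rows = [
  ["id", "issue_1", "issue_2"],
  ["a1", "nov", nil],
  ["b2", "dec", "ten"]
]

# A single loop over the prepared rows; nil is written as an empty field
csv_string = CSV.generate do |csv|
  rows.each { |row| csv << row }
end
puts csv_string
```

Because the keys and column widths were already gathered while reading the feed, this final loop is the only other pass over the data.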