Using mapreduce to read CSVs (not all columns match) and combine them into a DataFrame

Question (votes: 1, answers: 1)

I am using Julia 1.4.2.

I want to use mapreduce() to

  1. first read a bunch of CSVs, and then

  2. combine them into one big DataFrame.

First, the preliminaries:

using CSV, DataFrames

# Create CSVs
df1 = DataFrame([['a', 'b', 'c'], [1, 2, 3]],
                ["name", "id"])
df2 = DataFrame([['d', 'e', 'f'], [4, 5, 6]],
                ["name", "id"])
# NOTE: This df has an extra column not present in the other two
df3 = DataFrame([['x', 'y', 'z'], [7, 8, 9], [11, 22, 33]],
                ["name", "id", "num"])
CSV.write("df1.csv", df1)
CSV.write("df2.csv", df2)
CSV.write("df3.csv", df3)

# Get Vector of file paths for the above-created CSVs.
# Regex because there might be other files in working directory.
files = filter(x -> occursin(r"df\d\.csv$", x),
               readdir(join=true))

If I call map() and reduce() separately, I get what I want:

# Import the above-created CSVs as a Vector of DataFrames
dfs = map(x -> CSV.File(x) |> DataFrame,
          files)

# Combine them into one big DataFrame
df = reduce(vcat, dfs, cols=:union)

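(Note: df3 has an extra column that the other two DataFrames lack, which is why I need the cols=:union argument.)

However, I would like to condense the map() and reduce() calls above into a single mapreduce() call. Here is what I tried: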

df = mapreduce(x -> CSV.File(x) |> DataFrame,
               x -> vcat(x, cols=:union),
               files)
# MethodError: no method matching (::var"#16#18")(::DataFrame, ::DataFrame)

df = mapreduce(x -> CSV.File(x) |> DataFrame,
               vcat,
               files,
               cols=:union)
# MethodError: no method matching _mapreduce_dim(::var"#21#22", ::typeof(vcat), ::NamedTuple{(:cols,),Tuple{Symbol}}, ::Array{String,1}, ::Colon)

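The root of my problem is that I do not understand the documentation for mapreduce(). How do I pass keyword arguments to the binary function (the op argument)? With reduce(op, itr), for example, I can add the cols=:union argument, as in reduce(vcat, dfs, cols=:union). How do I pass arguments to the binary function op in mapreduce(f, op, itrs...)?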

dataframe mapreduce julia
1 answer
0 votes

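op must be a function of two arguments, because it combines the current accumulated state with the newly mapped element. Any keyword arguments therefore have to be fixed inside op itself, for example with an anonymous function. Try this: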

df = mapreduce(x -> CSV.File(x) |> DataFrame,
               (x, y) -> vcat(x, y; cols=:union),
               files)
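
To see the same pattern without CSV or DataFrames, here is a minimal sketch that uses only Base. The names join_rows and load are hypothetical stand-ins (for vcat with cols=:union and for reading a file, respectively), not part of any library. The point it illustrates is that mapreduce calls op with exactly two positional arguments, so any keyword argument has to be captured in a two-argument closure. As far as I know, reduce(vcat, dfs, cols=:union) only works because DataFrames provides a dedicated reduce method for vcat that accepts cols, not because reduce forwards keywords to op.

```julia
# Hypothetical stand-in for vcat(x, y; cols=:union): a two-argument
# function that also accepts a keyword argument.
join_rows(a::Vector, b::Vector; dedupe::Bool=false) =
    dedupe ? unique(vcat(a, b)) : vcat(a, b)

# Hypothetical stand-in for reading one file: maps a number to a small vector.
load(n::Int) = [n, n + 1]

inputs = [1, 2, 3]

# Capture the keyword argument in a two-argument closure and pass that as op.
merged = mapreduce(load, (a, b) -> join_rows(a, b; dedupe=true), inputs)

# The separate map-then-reduce formulation with the same closure.
two_step = reduce((a, b) -> join_rows(a, b; dedupe=true), map(load, inputs))

println(merged)
```

Both formulations give the same result here. The closure simply fixes dedupe=true for every pairwise combination, just as (x, y) -> vcat(x, y; cols=:union) fixes cols for every pair of DataFrames.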