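I have some data with repeated fields, except for a single field that I would like to join. In the data, everything but report should stay the same for each day and each company. Companies can file multiple reports on the same day.
I can join the reports using the following code, but I lose the variables that are not in my by call. Any suggestions?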
using DataFrames
# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
               report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]
# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
    df[:, Symbol("v" * string(i))] = ""
end
for i in 1:size(x, 1), j in 1:nv
    df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v" * string(j))] =
        join(rand('a':'z', 3), "")
end
outdf = by(df, [:company, :day]) do sub
    t = DataFrame(fullreport = join(sub.report, "\n(Joined)\n"))
end
Here are some minor tweaks to your data preparation code:
using DataFrames
# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
               report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]
# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
    df[:, Symbol("v", i)] .= ""
end
for i in 1:size(x, 1), j in 1:nv
    df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v", j)] .= join(rand('a':'z', 3), "")
end
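Here is a by call that keeps all the other variables. It assumes they are constant within each group, and it should be efficient even for relatively large data: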
outdf = by(df, [:company, :day]) do sub
    merge((fullreport = join(sub.report, "\n(Joined)\n"),),
          copy(sub[1, Not([:company, :day, :report])]))
end
I put the fullreport variable first.
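If you are on a newer DataFrames.jl release, by may no longer be available (as far as I know it was deprecated and then removed around version 1.0). A minimal sketch of the same idea using combine and groupby is below. The tiny data frame and the column v1 are hypothetical stand-ins for your data, and the sketch again assumes that every column other than report is constant within each company/day group.

```julia
using DataFrames

# Hypothetical toy data: v1 is constant within each company/day group.
df = DataFrame(company = [1, 1, 2],
               day     = [0, 0, 1],
               report  = ["a b", "c", "d"],
               v1      = ["xyz", "xyz", "qrs"])

# One output row per group: join the reports and take the remaining
# columns from the first row of the group.
outdf = combine(groupby(df, [:company, :day])) do sub
    merge((fullreport = join(sub.report, "\n(Joined)\n"),),
          copy(sub[1, Not([:company, :day, :report])]))
end
```

The grouping columns come first in the result, followed by fullreport and then the remaining variables.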
And here is code that keeps all rows from the original data frame:
outdf = by(df, [:company, :day]) do sub
    insertcols!(select(sub, Not([:company, :day, :report])), 1,
                fullreport = join(sub.report, "\n(Joined)\n"))
end
Now you can check, for example, that unique(outdf) produces the same data frame as the one produced by the first by.
(In the code above I also dropped the :report variable, as I guess you do not want it in the result. Right?)
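For the variant that keeps every original row, a rough sketch on newer DataFrames.jl releases (where by may be unavailable) could use transform on the grouped data frame. The toy data and the column v1 are again hypothetical. Unlike the code above, this sketch appends fullreport as the last column rather than the first.

```julia
using DataFrames

# Hypothetical toy data: v1 is constant within each company/day group.
df = DataFrame(company = [1, 1, 2],
               day     = [0, 0, 1],
               report  = ["a b", "c", "d"],
               v1      = ["xyz", "xyz", "qrs"])

# transform keeps all rows; the joined string is repeated for every row
# of its company/day group.
alldf = transform(groupby(df, [:company, :day]),
                  :report => (r -> join(r, "\n(Joined)\n")) => :fullreport)

# Drop the individual reports, as in the code above.
select!(alldf, Not(:report))
```

Calling unique(alldf) then leaves one row per company/day group.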