Writing RFC 4180-compliant flat files with less aggressive quoting


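When using write.table or write.csv in R, double quotes are added by default around all non-numeric fields, whether or not the quotes are actually needed for the CSV file to be parsed correctly.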

Take the following Python script as an example:

import csv

# newline="" is what the csv docs recommend for files passed to csv.writer;
# the with block closes the file even if a write fails
with open("pytest.csv", "w", newline="") as f_out:
    wri = csv.writer(f_out, delimiter=',')
    wri.writerow(['c_numeric', 'c_str', 'c_str_spec'])
    wri.writerow([11, "r1c2", "r1c3 nothing special"])
    wri.writerow([21, "r2c2", "r2c3,with delim"])
    wri.writerow([31, "r3c2", "r3c3\nwith carriage return"])
    wri.writerow([41, "r4c2", "r3c3\"with double quote"])

This writes the following to pytest.csv:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,"r2c3,with delim"
31,r3c2,"r3c3
with carriage return"
41,r4c2,"r3c3""with double quote"

This is what I expect, and it matches what Excel would output.

Now let's read this file in R and write it back out, once with quoting and once without:

df <- read.csv("pytest.csv")
write.csv(df, 'Rtest.csv', row.names=FALSE)
write.csv(df, 'Rtest_NQ.csv', row.names=FALSE, quote=FALSE)

Here is Rtest.csv:

"c_numeric","c_str","c_str_spec"
11,"r1c2","r1c3 nothing special"
21,"r2c2","r2c3,with delim"
31,"r3c2","r3c3
with carriage return"
41,"r4c2","r3c3""with double quote"

Note the quotes around all non-numeric fields.
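For comparison, the two quoting styles can be reproduced side by side with Python's csv module. This is only an illustrative sketch: `render` is a hypothetical helper, and the data is a two-row subset of the example above. `csv.QUOTE_MINIMAL` (the default) quotes only where needed, while `csv.QUOTE_NONNUMERIC` quotes every non-numeric field, much like R's default.

```python
import csv
import io

rows = [
    ["c_numeric", "c_str", "c_str_spec"],
    [21, "r2c2", "r2c3,with delim"],
]

def render(quoting):
    """Write `rows` to an in-memory buffer using the given quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerows(rows)
    return buf.getvalue()

minimal = render(csv.QUOTE_MINIMAL)         # quotes only the field with a comma
nonnumeric = render(csv.QUOTE_NONNUMERIC)   # quotes every string field
print(minimal)
print(nonnumeric)
```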

Here is Rtest_NQ.csv:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,r2c3,with delim
31,r3c2,r3c3
with carriage return
41,r4c2,r3c3"with double quote

This file is technically broken, because no CSV reader can read it back correctly, so it is not a good option.
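To make the breakage concrete, here is a small sketch that feeds the unquoted text to Python's csv.reader. The string is copied from the Rtest_NQ.csv listing; the variable names are illustrative only.

```python
import csv
import io

# Contents of Rtest_NQ.csv as listed above (written with quote=FALSE)
unquoted = (
    "c_numeric,c_str,c_str_spec\n"
    "11,r1c2,r1c3 nothing special\n"
    "21,r2c2,r2c3,with delim\n"
    "31,r3c2,r3c3\n"
    "with carriage return\n"
    '41,r4c2,r3c3"with double quote\n'
)

parsed = list(csv.reader(io.StringIO(unquoted)))
field_counts = [len(row) for row in parsed]
print(len(parsed), field_counts)
```

The original data had a header plus four records of three fields each. Here the reader returns six rows with field counts 3, 3, 4, 3, 1, 3: the embedded comma adds a field and the embedded line break splits one record into two.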

My question: is there an RFC 4180-compliant writer in R that writes the way Excel, the Python csv library, and most other RFC 4180-compliant tools do?

r csv double-quotes rfc4180
1 answer

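You can write a simple function that builds the CSV yourself: convert the data frame to a character matrix, escape any double quotes, and then quote any string that contains a double quote, comma, or newline. You then add the column names and write the result out as a CSV with writeLines:
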
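write_unquoted <- function(df, path)
{
  x <- as.matrix(df)
  # A field needs quoting if it contains a double quote, comma, or line break
  needs_quote <- grepl("[\",\r\n]", x)
  # Double any embedded quotes, then wrap the field in quotes exactly once
  # (so a field with both a quote and a comma is not wrapped twice)
  x[needs_quote] <- paste0("\"", gsub("\"", "\"\"", x[needs_quote]), "\"")
  x <- c(paste0(colnames(x), collapse = ","), apply(x, 1, paste0, collapse = ","))
  writeLines(x, path)
}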

So, if we start with your example:

df
#>   c_numeric c_str                 c_str_spec
#> 1        11  r1c2       r1c3 nothing special
#> 2        21  r2c2            r2c3,with delim
#> 3        31  r3c2 r3c3\nwith carriage return
#> 4        41  r4c2     r3c3"with double quote

we run

write_unquoted(df, "my.csv")

We can see that it stores the data frame faithfully:

identical(read.csv("my.csv"),  df)
#> [1] TRUE

And if we look at the resulting CSV, it looks like this:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,"r2c3,with delim"
31,r3c2,"r3c3
with carriage return"
41,r4c2,"r3c3""with double quote"

That is, fields are quoted only where needed.
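The same quote-only-when-needed rule can be sketched in Python and cross-checked against csv.writer. This is a rough illustration under stated assumptions, not a full RFC 4180 writer: `quote_if_needed` and `to_csv_line` are hypothetical helper names, and the check covers only the example rows from the question.

```python
import csv
import io

def quote_if_needed(value):
    """Quote a field only if it contains a double quote, comma, or line break."""
    text = str(value)
    if any(ch in text for ch in '",\r\n'):
        # Double embedded quotes, then wrap the whole field once
        return '"' + text.replace('"', '""') + '"'
    return text

def to_csv_line(row):
    """Join the (possibly quoted) fields of one record with commas."""
    return ",".join(quote_if_needed(v) for v in row)

rows = [
    ["c_numeric", "c_str", "c_str_spec"],
    [11, "r1c2", "r1c3 nothing special"],
    [21, "r2c2", "r2c3,with delim"],
    [31, "r3c2", "r3c3\nwith carriage return"],
    [41, "r4c2", 'r3c3"with double quote'],
]
lines = [to_csv_line(row) for row in rows]

# Cross-check against csv.writer with its default QUOTE_MINIMAL mode
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
matches_csv_module = "\n".join(lines) + "\n" == buf.getvalue()
print(matches_csv_module)
```

Checking all three special characters in one pass means a field that contains both a quote and a comma is wrapped only once.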
