R + arrow: reading doubles with a comma as the decimal separator


Please see the code at the end of the post. You can download the input tsv file (nothing malicious!) from

https://e.pcloud.link/publink/show?code=XZ5eCWZdrwFuo5POSVzi7ywCmteHfE4rdmV

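I am trying to convert a text file to a Parquet file without loading it into memory. This fails because my tsv file uses a comma "," as the decimal separator. Is there a way to fix my code without changing the input file?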

Thanks!

library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:lubridate':
#> 
#>     duration
#> The following object is masked from 'package:utils':
#> 
#>     timestamp



data <- open_dataset("test.tsv",
  format = "tsv",
  skip_rows = 1, 
  schema = schema(
    AID_MEASURE_ID = string(), 
    DATE_CREATED = string(), 
    DATE_GRANTED = string(), 
    AA_PUBLISHED_DATE = string(), 
    SERVER_REF = string(), 
    AM_TITLE = string(), 
    AM_TITLE_EN = string(), 
    STATUS = string(), 
    AM_PROC_TYPE_CD = string(), 
    COFINANCE = string(), 
    OBJECTIVE = string(), 
    OTHER_OBJECTIVE_EN = string(), 
    AID_INSTRUMENT = string(), 
    OTHER_AID_INSTRUMENT_EN = string(), 
    BENEFICIARY_NAME = string(), 
    BENEFICIARY_NAME_ENGLISH = string(), 
    BENEFICIARY_NATIONAL_ID = string(), 
    BENEFICIARY_NAT_ID_TYPE_SD = string(), 
    BENEFICIARY_TYPE_SD = string(), 
    COUNTRY_SD = string(), 
    REGION_SD = string(), 
    SECTOR_SD = string(), 
    GRANTED_AMOUNT_FROM_EUR = double(), 
    NOMINAL_AMOUNT_EUR_FROM = double(), 
    GRANT_RANGE = string(),
    GRANTED_AMOUNT_RANGE_DESC=string(),
    GRANTING_AUTHORITY_NAME = string(), 
    GRANTING_AUTHORITY_NAME_EN = string(), 
    NUTS_CD = string(), 
    GRANTING_AUTHORITY_COUNTRY = string()
  )
)


write_dataset(
  data,
  format = "parquet",
  path = ".",
  max_rows_per_file = 1e7
)
#> Error: Invalid: Could not open CSV input source '/home/lorenzo/mega_pcloud/work/COMP/stat_support/tam_arrow/new_test/test.tsv': Invalid: In CSV column #22: Row #5: CSV conversion error to double: invalid value '631135,74'

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] arrow_13.0.0.1  lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
#>  [5] dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0    
#>  [9] tibble_3.2.1    ggplot2_3.4.2   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5         gtable_0.3.3      compiler_4.3.1    reprex_2.0.2     
#>  [5] tidyselect_1.2.0  assertthat_0.2.1  scales_1.2.1      yaml_2.3.7       
#>  [9] fastmap_1.1.1     R6_2.5.1          generics_0.1.3    knitr_1.43       
#> [13] munsell_0.5.0     R.cache_0.16.0    tzdb_0.4.0        pillar_1.9.0     
#> [17] R.utils_2.12.2    rlang_1.1.1       utf8_1.2.3        stringi_1.7.12   
#> [21] xfun_0.39         fs_1.6.2          bit64_4.0.5       timechange_0.2.0 
#> [25] cli_3.6.1         withr_2.5.0       magrittr_2.0.3    digest_0.6.31    
#> [29] grid_4.3.1        hms_1.1.3         lifecycle_1.0.3   R.methodsS3_1.8.2
#> [33] R.oo_1.25.0       vctrs_0.6.2       evaluate_0.21     glue_1.6.2       
#> [37] styler_1.10.1     fansi_1.0.4       colorspace_2.1-0  rmarkdown_2.22   
#> [41] tools_4.3.1       pkgconfig_2.0.3   htmltools_0.5.5

Created on 2023-10-03 with reprex v2.0.2

r apache-arrow decimal-point
1 Answer

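I'm afraid the decimal_point parameter of the C++ CSV ConvertOptions class is not yet exposed in the R bindings, which is what complicates things here. I have opened a ticket to expose it, and we will try to get it sorted before the next release.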
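Until that option is available from R, one possible workaround (a sketch, not something the answer itself proposes) is to declare the amount columns as string() in the schema and convert them inside the Arrow dplyr query, so the data still streams to Parquet without being collected into R. The two-column sample file below is hypothetical and only stands in for test.tsv. The sketch assumes the values contain no thousands separators and no empty fields, which would probably need to be handled separately before the cast.

```r
library(arrow)
library(dplyr)

# Hypothetical two-column sample standing in for test.tsv
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "BENEFICIARY_NAME\tGRANTED_AMOUNT_FROM_EUR",
  "A\t631135,74",
  "B\t1000",
  "C\t12,5"
), tsv_path)

# Read the comma-decimal column as a string so the CSV reader does not
# try (and fail) to convert it to double
ds <- open_dataset(
  tsv_path,
  format = "tsv",
  skip_rows = 1,
  schema = schema(
    BENEFICIARY_NAME = string(),
    GRANTED_AMOUNT_FROM_EUR = string()
  )
)

out_dir <- tempfile()

# Swap "," for "." and cast to double inside the query, then write the
# result as Parquet without calling collect() first
ds |>
  mutate(
    GRANTED_AMOUNT_FROM_EUR = as.numeric(
      gsub(",", ".", GRANTED_AMOUNT_FROM_EUR, fixed = TRUE)
    )
  ) |>
  write_dataset(path = out_dir, format = "parquet")

# Read the Parquet output back to check the converted column
result <- open_dataset(out_dir) |> collect()
```

With the full file from the question, the same mutate() step would be repeated for NOMINAL_AMOUNT_EUR_FROM, with both columns declared as string() in the schema.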
