Split a string column into 2 columns, one numeric and the other a date

Question (votes: 3, answers: 5)

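I have a data frame called `prices` that I obtained by web scraping. The goal is to track the daily prices of stocks on the Zimbabwe Stock Exchange.

Web scraping from the website: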

library(rvest)
library(stringr)
library(reshape2)
# Data from African Financials
url <- "https://africanfinancials.com/zimbabwe-stock-exchange-share-prices/"
prices <- url %>%
  read_html() %>%
  html_table(fill = TRUE)
prices <- prices[[1]]

The prices data frame:

> prices

                   Counter   PriceRTGS cents  Volume ChangeRTGS cents ChangePercent YTDPercent
1            AFDS.zw Afdis   169.75 4 Apr 19       0             0.00         0.00%     10.95%
2          ARIS.zw Ariston     2.90 4 Apr 19     572            -0.03        -1.02%     20.83%
3     ARTD.zw ART Holdings     9.20 4 Apr 19       0             0.00         0.00%      4.55%

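I want to split the `PriceRTGS cents` column into two columns, "Price RTGS Cents" and "Date".

I tried the following code, but it captures the day of the month (4) in the price column.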

str_split_fixed(prices$`PriceRTGS cents`," ", 2)
colsplit(prices$`PriceRTGS cents`," ",c("Price RTGS Cents", "Date"))

I would like the output to look like this:

                   Counter   Price RTGS Cents              Date         Volume ChangeRTGS cents ChangePercent YTDPercent
1            AFDS.zw Afdis             169.75         4/04/2019              0             0.00         0.00%     10.95%
2          ARIS.zw Ariston               2.90         4/04/2019            572            -0.03        -1.02%     20.83%
3     ARTD.zw ART Holdings               9.20         4/04/2019              0             0.00         0.00%      4.55%

Input data:

structure(list(Counter = c("AFDS.zw Afdis", "ARIS.zw Ariston", 
"ARTD.zw ART Holdings", "ASUN.zw Africansun", "AXIA.zw Axia", 
"BAT.zw BAT"), `PriceRTGS cents` = c("169.75 4 Apr 19", "2.90 4 Apr 19", 
"9.20 4 Apr 19", "15.00 4 Apr 19", "35.05 4 Apr 19", "3,000.00 4 Apr 19"
), Volume = c("0", "572", "0", "0", "8,557", "0"), `ChangeRTGS cents` = c(0, 
-0.03, 0, 0, 0, 0), ChangePercent = c("0.00%", "-1.02%", "0.00%", 
"0.00%", "0.00%", "0.00%"), YTDPercent = c("10.95%", "20.83%", 
"4.55%", "50.00%", "-22.11%", "-9.09%")), row.names = c(NA, 6L
), class = "data.frame")
r dataframe split
5 Answers
0
votes

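Here you go: this is similar to your str_split_fixed solution. It also removes the commas from your price variable so that it can be coerced to numeric, and it converts the date column to a Date.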

split_string <- str_split(prices$`PriceRTGS cents`, regex("\\s"), 2, simplify = TRUE)

prices$Price <- as.numeric(gsub(",", "", split_string[,1], fixed = TRUE))
prices$Date <- as.Date(split_string[,2], format = "%d %b %y")

head(prices[-2])
               Counter Volume ChangeRTGS cents ChangePercent YTDPercent   Price       Date
1        AFDS.zw Afdis      0             0.00         0.00%     10.95%  169.75 2019-04-04
2      ARIS.zw Ariston    572            -0.03        -1.02%     20.83%    2.90 2019-04-04
3 ARTD.zw ART Holdings      0             0.00         0.00%      4.55%    9.20 2019-04-04
4   ASUN.zw Africansun      0             0.00         0.00%     50.00%   15.00 2019-04-04
5         AXIA.zw Axia  8,557             0.00         0.00%    -22.11%   35.05 2019-04-04
6           BAT.zw BAT      0             0.00         0.00%     -9.09% 3000.00 2019-04-04

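The problem with splitting on a fixed (literal) space is that it does not recognize the whitespace after the price in the scraped data, i.e. only two of the three gaps are counted: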

table(str_count(prices$`PriceRTGS cents`, fixed(" ")))

 2 
55 

But it does when a regular expression for whitespace is used, i.e.:

table(str_count(prices$`PriceRTGS cents`, regex("\\s")))

 3 
55 
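
One plausible explanation, which the answer itself does not state, is that the character between the price and the day in the scraped text is a non-breaking space (U+00A0) rather than an ordinary space. The sketch below builds such a string by hand (the value `x` is made up for illustration) and reproduces the 2-versus-3 counts. It would also explain why splitting on " " left the day of the month attached to the price in the question.

```r
library(stringr)

# Hypothetical value: a non-breaking space (U+00A0) follows the price and
# ordinary spaces separate the date parts. This is an assumption about the
# scraped page, not something confirmed by the source.
x <- "169.75\u00a04 Apr 19"

n_fixed <- str_count(x, fixed(" "))    # counts only ordinary spaces
n_regex <- str_count(x, regex("\\s"))  # counts any whitespace character

# Splitting on whitespace in general puts the price alone in the first piece
parts <- str_split(x, regex("\\s"), n = 2, simplify = TRUE)
```

If the real data contains some other whitespace character, the same check with `str_count()` should still show whether a literal space is the right delimiter.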

1
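votes

I simply copied and pasted your first prices data into a text editor and replaced the column-separating spaces with ";" (I had not yet seen your dput version of the data).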

prices <- read.table("dat.txt", sep=";", header=T)

This code is "quick and dirty", but it works:

str_split_fixed(prices$PriceRTGS.cents," ", 2)
new_prices <- data.frame(prices$Counter, str_split_fixed(prices$PriceRTGS.cents," ", 2), prices$Volume, prices$ChangeRTGS.cents, prices$ChangePercent, prices$YTDPercent)
colnames(new_prices) <- c("Counter", "PriceRTGS_cents", "Date",  "Volume", "ChangeRTGS cents", "ChangePercent",  "YTDPercent")
new_prices$Date <- gsub("Apr", "04", new_prices$Date)
new_prices$Date <- gsub(" ", "/", new_prices$Date)
new_prices <- data.frame(prices$Counter, new_prices$PriceRTGS_cents, new_prices$Date, prices$Volume, prices$ChangeRTGS.cents, prices$ChangePercent, prices$YTDPercent)
colnames(new_prices) <- c("Counter", "PriceRTGS_cents", "Date",  "Volume", "ChangeRTGS cents", "ChangePercent",  "YTDPercent")
new_prices

If you have months other than "Apr", just add more lines (for example, for "Nov"):

new_prices$Date <- gsub("Nov", "11", new_prices$Date)
new_prices$Date <- gsub(" ", "/", new_prices$Date)
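
Rather than adding one gsub() line per month, the month abbreviation can be looked up in base R's built-in `month.abb` vector. This is only a sketch of that idea; the `dates` vector is made-up sample input in the same "day month year" layout, and the variable names are illustrative.

```r
# Example date strings in the scraped "day month-abbreviation year" layout
dates <- c("4 Apr 19", "12 Nov 19")

parts <- strsplit(dates, " ", fixed = TRUE)
day   <- sapply(parts, `[`, 1)
month <- match(sapply(parts, `[`, 2), month.abb)  # position in Jan..Dec
year  <- sapply(parts, `[`, 3)

# Zero-pad the month number and join the pieces with slashes
slash_dates <- sprintf("%s/%02d/%s", day, month, year)
```

Because `month.abb` always holds the English abbreviations, this lookup does not depend on the session locale.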

1
votes

An alternative. The separator (-), the date format, and the column names can all be changed:

library(dplyr)  # provides %>%, select() and mutate()

# Each string holds a single price, so str_extract()/str_remove() are enough
prices$Prices <- stringr::str_extract(prices$`PriceRTGS cents`, "\\d{1,}.*\\.\\d{1,}")
prices$Dates <- stringr::str_remove(prices$`PriceRTGS cents`, "\\d{1,}.*\\.\\d{1,} ")
prices %>% 
  select(-`PriceRTGS cents`) %>% 
  mutate(Dates = lubridate::dmy(Dates))

Result:

               Counter Volume ChangeRTGS cents ChangePercent YTDPercent   Prices      Dates
1        AFDS.zw Afdis      0             0.00         0.00%     10.95%   169.75 2019-04-04
2      ARIS.zw Ariston    572            -0.03        -1.02%     20.83%     2.90 2019-04-04
3 ARTD.zw ART Holdings      0             0.00         0.00%      4.55%     9.20 2019-04-04
4   ASUN.zw Africansun      0             0.00         0.00%     50.00%    15.00 2019-04-04
5         AXIA.zw Axia  8,557             0.00         0.00%    -22.11%    35.05 2019-04-04
6           BAT.zw BAT      0             0.00         0.00%     -9.09% 3,000.00 2019-04-04

0
votes

You could do something like this:

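library(data.table)
dt <- as.data.table(prices)  # assumes `prices` is the data frame from the question
# Drop the leading price and the whitespace after it, leaving the date
dt[, Date := sub("^\\S+\\s+", "", `PriceRTGS cents`)]
# Keep only the first whitespace-delimited token, i.e. the price
dt[, cents := sub("^\\s*(\\S+).*", "\\1", `PriceRTGS cents`)]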

Note: afterwards, remove the original column from dt:

> dt <- subset(dt,select = -c(`PriceRTGS cents`))
> dt
                Counter Volume ChangeRTGS cents ChangePercent YTDPercent    cents     Date
1:        AFDS.zw Afdis      0             0.00         0.00%     10.95%   169.75 4 Apr 19
2:      ARIS.zw Ariston    572            -0.03        -1.02%     20.83%     2.90 4 Apr 19
3: ARTD.zw ART Holdings      0             0.00         0.00%      4.55%     9.20 4 Apr 19
4:   ASUN.zw Africansun      0             0.00         0.00%     50.00%    15.00 4 Apr 19
5:         AXIA.zw Axia  8,557             0.00         0.00%    -22.11%    35.05 4 Apr 19
6:           BAT.zw BAT      0             0.00         0.00%     -9.09% 3,000.00 4 Apr 19

Edit: if you want the Date format you mentioned, do this (the year has two digits, so it is parsed with %y):

dt[, Date := format(as.Date(sub("^\\S+\\s+", "", `PriceRTGS cents`), format = "%d %b %y"), "%d/%m/%Y")]
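
The choice of year conversion matters for strings like "4 Apr 19". As far as I know, `%y` reads a two-digit year and maps 19 to 2019, whereas `%Y` expects a year with century and would read 19 as the literal year 19. A small check on a single made-up value follows; note that `%b` matches month abbreviations in the current locale, so an English (or C) locale is assumed here.

```r
# Two-digit year conversion: the intended reading of "19"
ok <- as.Date("4 Apr 19", format = "%d %b %y")

# Year-with-century conversion applied to the same text, for comparison
bad <- as.Date("4 Apr 19", format = "%d %b %Y")

ok_text <- format(ok, "%d/%m/%Y")  # day/month/year, as asked for in the question
same    <- identical(ok, bad)      # FALSE when the two conversions disagree
```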

0
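votes

A base R option is to split on whitespace and build two parts from the string: the first piece is the price, and the remaining pieces together form the date.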

t(sapply(strsplit(prices$`PriceRTGS cents`,"\\s+"), function(x) 
  c(x[1], format(as.Date(paste0(x[-1], collapse = "-"), "%d-%b-%y"), "%d/%m/%Y"))))

#         [,1]           [,2]        
#[1,] "169.75"   "04/04/2019"
#[2,] "2.90"     "04/04/2019"
#[3,] "9.20"     "04/04/2019"
#[4,] "15.00"    "04/04/2019"
#[5,] "35.05"    "04/04/2019"
#[6,] "3,000.00" "04/04/2019"

You can then cbind these two columns to the original data frame.

If you are fine with leaving the date column unformatted, we can drop as.Date and format and shorten the code.

t(sapply(strsplit(prices$`PriceRTGS cents`,"\\s+"), function(x) 
             c(x[1], paste0(x[-1], collapse = "-"))))
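
The cbind step mentioned in this answer is not spelled out, so here is one way it might look. This is a sketch on a two-row subset of the question's input data; the names `split_cols` and `result` are illustrative, and an English (or C) locale is assumed so that `%b` matches "Apr".

```r
# Two rows of the question's input data, enough to show the idea
prices <- data.frame(
  Counter = c("AFDS.zw Afdis", "BAT.zw BAT"),
  `PriceRTGS cents` = c("169.75 4 Apr 19", "3,000.00 4 Apr 19"),
  Volume = c("0", "0"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Same split as in the answer: first piece is the price, the rest is the date
split_cols <- t(sapply(strsplit(prices$`PriceRTGS cents`, "\\s+"), function(x)
  c(x[1], format(as.Date(paste0(x[-1], collapse = "-"), "%d-%b-%y"), "%d/%m/%Y"))))
colnames(split_cols) <- c("Price RTGS Cents", "Date")

# Bind the new columns on and leave out the original combined column
result <- cbind(prices[setdiff(names(prices), "PriceRTGS cents")],
                split_cols, stringsAsFactors = FALSE)

# The price is still text ("3,000.00"); strip the comma before converting
result$`Price RTGS Cents` <- as.numeric(
  gsub(",", "", result$`Price RTGS Cents`, fixed = TRUE))
```

The new columns end up at the right-hand side of `result`; reordering them to match the layout in the question is a matter of indexing the columns in the desired order.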