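I want to study some stocks or financial indices. I use the yfR package and its yf_get function to download data from Yahoo Finance. The function returns a df with many variables. I want to select some of them and then build a df with only the variables I need. Here is my problem: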
library(yfR)
library(tidyverse)
Symbols <- c("^GSPC", "^FTSE")
StartDate <- "2010-01-01"
EndDate <- "2019-12-31"
RawData <- yf_get(Symbols, first_date = StartDate, last_date = EndDate, freq_data = "daily", do_complete_data = TRUE)
# Here is the initial structure of the RawData df
str(RawData)
tibble [5,039 × 11] (S3: tbl_df/tbl/data.frame)
$ ticker : chr [1:5039] "^FTSE" "^FTSE" "^FTSE" "^FTSE" ...
$ ref_date : Date[1:5039], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
$ price_open : num [1:5039] 5413 5500 5522 5530 5527 ...
$ price_high : num [1:5039] 5500 5536 5536 5552 5549 ...
$ price_low : num [1:5039] 5411 5481 5498 5500 5495 ...
$ price_close : num [1:5039] 5500 5522 5530 5527 5534 ...
$ volume : num [1:5039] 7.51e+08 1.15e+09 9.98e+08 1.16e+09 1.01e+09 ...
$ price_adjusted : num [1:5039] 5500 5522 5530 5527 5534 ...
$ ret_adjusted_prices : num [1:5039] NA 0.004036 0.001358 -0.000597 0.001357 ...
$ ret_closing_prices : num [1:5039] NA 0.004036 0.001358 -0.000597 0.001357 ...
$ cumret_adjusted_prices: num [1:5039] 1 1 1.01 1 1.01 ...
- attr(*, "df_control")= tibble [2 × 5] (S3: tbl_df/tbl/data.frame)
..$ ticker : chr [1:2] "^FTSE" "^GSPC"
..$ dl_status : chr [1:2] "OK" "OK"
..$ n_rows : int [1:2] 2524 2515
..$ perc_benchmark_dates: num [1:2] 0.982 1
..$ threshold_decision : chr [1:2] "KEEP" "KEEP"
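Note that, because of national holidays and the like, the indices or stocks we want may have different lengths (different numbers of observations). So now we have 2515 observations for GSPC and 2524 for FTSE. Say I am interested in keeping the columns ref_date, price_adjusted and ticker (the last one to use later as some kind of filtering mechanism). I tried piping up to a certain point, and it goes like this: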
Returns <- RawData %>%
  select(ref_date, price_adjusted, ticker) %>%
  rename(Date = ref_date, Price = price_adjusted, Ticker = ticker)
# And we end up with this
str(Returns)
tibble [5,039 × 3] (S3: tbl_df/tbl/data.frame)
$ Date : Date[1:5039], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
$ Price : num [1:5039] 5500 5522 5530 5527 5534 ...
$ Ticker: chr [1:5039] "^FTSE" "^FTSE" "^FTSE" "^FTSE" ...
- attr(*, "df_control")= tibble [2 × 5] (S3: tbl_df/tbl/data.frame)
..$ ticker : chr [1:2] "^FTSE" "^GSPC"
..$ dl_status : chr [1:2] "OK" "OK"
..$ n_rows : int [1:2] 2524 2515
..$ perc_benchmark_dates: num [1:2] 0.982 1
..$ threshold_decision : chr [1:2] "KEEP" "KEEP"
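Here comes my question. I want the final product to be a df with 4 columns (Date_Stock1, Price_Stock1, Date_Stock2, Price_Stock2). If I had 3 stocks and 3 variables, the final product would be a df with 9 columns (Date_Stock1, Price_Stock1, Volume_Stock1, Date_Stock2, Price_Stock2, Volume_Stock2, Date_Stock3, Price_Stock3, Volume_Stock3).
I tried filter and subset, but failed. My best attempt used pivot_wider, which gave a df with 4 columns and 1 row whose cells are lists holding the values, but I don't know how to turn them back into a df.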
Returns <- RawData %>%
  select(ref_date, price_adjusted, ticker) %>%
  rename(Date = ref_date, Price = price_adjusted, Ticker = ticker) %>%
  pivot_wider(names_from = "Ticker", values_from = c(Date, Price))
# Also received this warning
Warning message:
Values from `Date` and `Price` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(Ticker) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
str(Returns)
tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
$ Date_^FTSE :List of 1
..$ : Date[1:2524], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
$ Date_^GSPC :List of 1
..$ : Date[1:2515], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
$ Price_^FTSE:List of 1
..$ : num [1:2524] 5500 5522 5530 5527 5534 ...
$ Price_^GSPC:List of 1
..$ : num [1:2515] 1133 1137 1137 1142 1145 ...
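How can I reach my end goal? Some kind of mutate or for loop, or maybe a map that I don't know about? I don't know how to work with these functions. I have only seen tutorials, and I can't get them to do this. Any ideas?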
To convert a list of values back into a data frame:
bro.df <- do.call(rbind.data.frame, data_list)
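To make the one-liner above concrete, here is a minimal sketch. It assumes that data_list is a list of data frames sharing the same columns; the two toy data frames and their values are hypothetical and only stand in for real data.

```r
# Hypothetical input: a list of data frames that share the same columns
data_list <- list(
  data.frame(Date = as.Date("2020-01-01"), Price = 1),
  data.frame(Date = as.Date("2020-01-02"), Price = 2)
)

# do.call() passes every list element as a separate argument to rbind.data.frame()
bro.df <- do.call(rbind.data.frame, data_list)
str(bro.df)
```

Note that this stacks the pieces as rows, so the result stays in long format; it does not by itself place the tickers side by side.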
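The challenge you face is restructuring the dataset from long to wide format while coping with series of different lengths, since each stock/index has a different number of observations. This is a common problem with time series data of this kind. pivot_wider from the tidyverse is indeed a reasonable choice for the task. However, your call puts Date in values_from, so no column is left to identify the rows, and pivot_wider collapses all values for each ticker into a single list-valued cell, which is what you observed. Because the tickers have different numbers of observations, those lists also do not line up row by row.
A good way to solve this is to split the data into a separate data frame for each stock/index, make sure they all have the same number of observations (filling missing dates with NA where necessary), and then bind them together column-wise. You can do it like this: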
library(yfR)
library(tidyverse)
# Define the symbols and date range
Symbols <- c("^GSPC", "^FTSE")
StartDate <- "2010-01-01"
EndDate <- "2019-12-31"
# Get the raw data from Yahoo Finance
RawData <- yf_get(Symbols, first_date = StartDate, last_date = EndDate, freq_data = "daily", do_complete_data = TRUE)
# Select and rename the columns of interest
Returns <- RawData %>%
  select(ref_date, price_adjusted, ticker) %>%
  rename(Date = ref_date, Price = price_adjusted, Ticker = ticker)
# Split the data into separate data frames for each stock/index
list_of_dfs <- split(Returns, Returns$Ticker)
# Ensure each data frame has the same number of observations by filling missing dates with NA
# First, create a sequence of dates that covers the entire range for both stocks
# (computed on the Date column directly, because sapply() would drop the Date class)
all_dates <- seq(min(Returns$Date), max(Returns$Date), by = "day")
# Now, for each stock/index data frame, ensure it has a row for each date in all_dates
list_of_dfs <- lapply(list_of_dfs, function(df) {
  df %>%
    full_join(data.frame(Date = all_dates), by = "Date") %>%
    arrange(Date)
})
# Now bind the separate data frames together column-wise
result <- NULL
for (i in seq_along(list_of_dfs)) {
  df <- list_of_dfs[[i]]
  # Take the ticker from the list name; df$Ticker[1] can be NA after the join
  ticker <- names(list_of_dfs)[i]
  # Create column names based on the stock/index ticker
  colnames(df)[colnames(df) == "Price"] <- paste0("Price_", ticker)
  colnames(df)[colnames(df) == "Date"] <- paste0("Date_", ticker)
  df$Ticker <- NULL # remove the Ticker column as it's no longer needed
  if (is.null(result)) {
    result <- df
  } else {
    result <- bind_cols(result, df)
  }
}
# View the resulting data frame
str(result)
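In this code snippet, split() produces one data frame per ticker, the full_join() against all_dates gives every ticker one row per calendar day, and bind_cols() places the renamed per-ticker columns side by side.
This gives you a data frame, result, with separate date and price columns for each stock/index, the same number of rows for each, and NA prices on the dates for which a ticker has no observation.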
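As an alternative to the loop, pivot_wider can also produce this layout if each row first gets an index within its ticker, so that the rows are uniquely identified. The sketch below uses a small toy tibble instead of the downloaded data; the tickers "A" and "B", their values, and the helper column name row are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy long-format data (hypothetical values): ticker "A" has 3 rows, "B" has 2
toy <- tibble(
  Ticker = c("A", "A", "A", "B", "B"),
  Date   = as.Date("2020-01-01") + c(0, 1, 2, 0, 2),
  Price  = c(1, 2, 3, 10, 30)
)

wide <- toy %>%
  group_by(Ticker) %>%
  mutate(row = row_number()) %>%  # within-ticker index that identifies each row
  ungroup() %>%
  pivot_wider(names_from = Ticker, values_from = c(Date, Price))

str(wide)
```

With this variant the rows are aligned by position rather than by calendar date, so the shorter series is padded with NA at the end, and each Date_* column keeps that ticker's own dates. The helper column can be dropped afterwards with select(-row).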