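I'm new to R and working on a project in which I want to sum my company's sales for each city it has sold in. I then want to account for each city's population so that I can adjust the total sales figures.
To do that, I have to merge two datasets: one containing all of the sales data, and another containing every US city and its population.
I downloaded the population data from the last file at this link: https://www2.census.gov/programs-surveys/popest/datasets/2010-2014/cities/totals/
I downloaded the sales data from: https://www.kaggle.com/datasets/roopeshbharatwajkr/ecommerce-dataset
I think there is an encoding problem with the two CSVs I'm using. The encodings don't seem to be the same, so values that look identical are not treated as equal and the merge doesn't work properly. I tried forcing UTF-8, but that does not work for the census CSV. Forcing Windows-1252 works for both CSVs, but it doesn't fix my matching problem.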
require(tidyverse)
library(dplyr)
library(stringr)
###loading and cleaning sales data
fullData <- read.csv("Us-Ecommerce Dataset.csv",header=TRUE,stringsAsFactors = FALSE)
#formatting Date column
fullData$Date <- as.Date(fullData$Date,format="%d/%m/%Y")
fullData$Date <- format(fullData$Date,"%m/%d/%Y")
#fix typos
fullData <- fullData %>%
  mutate(City = str_replace(City, "Los Angles", "Los Angeles"))
#make state and city lowercase
fullData$State <- tolower(fullData$State)
fullData$City <- tolower(fullData$City)
#remove whitespace
fullData$State <- trimws(fullData$State)
fullData$City <- trimws(fullData$City)
###loading and cleaning population data
censusData <- read.csv("sub-est2014_all.csv",header=TRUE,stringsAsFactors = FALSE,fileEncoding = ("UTF-8"))
#remove rows where state name is == to city name, not sure why data was formatted like that
censusData <- subset(censusData,NAME != STNAME)
#remove unnecessary columns, need both state and city in order to join w/ other data
censusData <- select(censusData,NAME,STNAME,POPESTIMATE2013)
##make state and city lowercase
censusData$STNAME <- tolower(censusData$STNAME)
#replacing invalid characters
censusData$NAME <- str_replace_all(censusData$NAME,"[^[:graph:]]", " ")
censusData$NAME <- tolower(censusData$NAME)
#removing whitespace
censusData$NAME <- trimws(censusData$NAME)
censusData$STNAME <- trimws(censusData$STNAME)
# Merge the two datasets based on City and State columns
merged_data <- merge(fullData, censusData, by.x = c("City", "State"), by.y = c("NAME", "STNAME"), all.x = TRUE)
# Rename the POPESTIMATE2013 column to Population
merged_data <- rename(merged_data, Population = POPESTIMATE2013)
# If there are missing values in the Population column, replace them with 0
merged_data$Population[is.na(merged_data$Population)] <- 0
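Running this code produces a new data frame in which every value of merged_data$Population except one is 0 (converted from NA). Obviously, that is not what I want.
The only value of merged_data$Population that is not 0 is the one where merged_data$City == "new york city" & merged_data$State == "new york". In that case the value is correct, about 8.4 million.
As you can see, I tried a few data-cleaning approaches to get the values to match: tolower() and trimws(). censusData also contained problematic characters, which I had to replace with str_replace_all().
When I run the code below (before turning the NAs into 0), it shows that most values were not matched.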
unmatched_rows <- merged_data[is.na(merged_data$Population), c("City", "State")]
print(unmatched_rows)
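One way to see why the keys fail to match is to compare them side by side. The sketch below uses small made-up data frames (sales_keys and census_keys are hypothetical stand-ins for the cleaned fullData and censusData, not the real files) and dplyr's anti_join() to list the sales keys that have no partner in the census keys. It assumes the lowercased census names carry a trailing " city", which would also explain why "new york city" is the only row that matches.

```r
library(dplyr)

# Hypothetical stand-ins for the cleaned key columns (illustrative values only)
sales_keys <- tibble(
  City  = c("los angeles", "seattle", "new york city"),
  State = c("california", "washington", "new york")
)
census_keys <- tibble(
  NAME   = c("los angeles city", "seattle city", "new york city"),
  STNAME = c("california", "washington", "new york")
)

# Sales keys that have no matching row in the census keys
unmatched <- anti_join(
  distinct(sales_keys),
  census_keys,
  by = c("City" = "NAME", "State" = "STNAME")
)
print(unmatched)

# Look up how the census side spells the unmatched places
candidates <- census_keys |>
  filter(grepl("los angeles|seattle", NAME))
print(candidates)
```

Printing the census spelling next to the unmatched keys usually makes the mismatch obvious without any guessing about encodings.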
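The census data contains several different geographic units. Census places are the best fit for the City column in the commerce data. The NAME column in the census data includes "city", "town", or "village", which is a classification and not part of the actual name. We have to take this into account when harmonizing the two datasets for the merge.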
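To illustrate the point about classification suffixes, here is a minimal base R sketch that strips a trailing "city", "town", or "village" from a few example names. The vector census_names and the pattern suffix_pattern are made up for illustration, and the real file may contain other classifications that this pattern does not cover.

```r
# Example names in the style of the census NAME column (illustrative only)
census_names <- c("Los Angeles city", "Seattle city", "New York city",
                  "Hempstead town", "Scarsdale village", "Kansas City city")

# Match the classification only at the very end of the string, so that
# a "City" that belongs to the actual name (e.g. "Kansas City") is kept
suffix_pattern <- " (city|town|village)$"

clean_names <- sub(suffix_pattern, "", census_names)
print(clean_names)
```

Anchoring the pattern with `$` is the important design choice here: it removes only the final classification word and leaves the rest of the name untouched.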
The commerce data only includes Los Angeles, New York, and Seattle. The code below handles only these three cities.
library(tidyverse)
census_data <- read_csv("sub-est2014_all.csv")
commerce_data <- read_csv("Us-Ecommerce Dataset.csv")
# Remove "City" from "New York City" and correct "Los Angeles"
commerce_data <-
  commerce_data |>
  mutate(City = str_remove(City, " City"),
         City = str_replace(City, "Los Angles", "Los Angeles"))
# Filter down to relevant NAMEs of class "city", select relevant columns,
# and remove "city" from NAME
census_data <-
  census_data |>
  filter(str_detect(NAME, "New York|Los Angeles|Seattle"),
         str_detect(NAME, "city")) |>
  distinct(NAME, STNAME, POPESTIMATE2013) |>
  mutate(NAME = str_remove(NAME, " city"))
# join/merge
res <- left_join(commerce_data,
                 census_data,
                 by = join_by(City == NAME, State == STNAME))
# check results
res |>
  distinct(State, City, POPESTIMATE2013)
#> # A tibble: 3 × 3
#> State City POPESTIMATE2013
#> <chr> <chr> <dbl>
#> 1 New York New York 8438379
#> 2 California Los Angeles 3897940
#> 3 Washington Seattle 653404
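Once the population column is attached, the original goal of adjusting total sales for population can be sketched as follows. This is only an illustration: res_toy is a hypothetical stand-in for res, the Sales column name and its values are assumptions (the real dataset may name it differently), and only the population figures are taken from the output above.

```r
library(dplyr)

# Hypothetical stand-in for the joined result; Sales values are made up
res_toy <- tibble(
  State = c("New York", "New York", "California", "Washington"),
  City  = c("New York", "New York", "Los Angeles", "Seattle"),
  Sales = c(100, 50, 200, 30),
  POPESTIMATE2013 = c(8438379, 8438379, 3897940, 653404)
)

# Total sales per city, then scaled to sales per 100,000 residents
sales_per_capita <- res_toy |>
  group_by(State, City, POPESTIMATE2013) |>
  summarise(total_sales = sum(Sales), .groups = "drop") |>
  mutate(sales_per_100k = total_sales / POPESTIMATE2013 * 1e5)

print(sales_per_capita)
```

Grouping by the population column as well keeps it available after summarise() without a second join. Rows whose population is NA would yield NA here, which is arguably safer than replacing missing populations with 0 and dividing by it.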