下面我使用函数主成分分析来降低 1989 年之前时间序列的维度。之后,我对 fPCA 分数进行聚类。考虑以下问题:
library(splines)
library(fda)
library(dplyr)
set.seed(1420)
data <-read.csv("https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv",
header=TRUE, sep=";")
# Imports our dataset
data = subset(data, select = c(State, Year,PacksPerCapita,treated))
# Restricts our dataset to the outcome, time, and panel
preperiod = nrow(data[data$treated== 1 & data$State=="California", ])
# Now we count the number of T_0 periods...
data <-data %>% select(-(treated))
#The treatment variable is no longer needed.
nptperiod=preperiod+2
if(preperiod < 3){
# If we have less than 3 pre-periods, stop.
stop("error message to print")
}
# Step 0.2: Rehshape our dataframe to wide.
trainframe = reshape(data, idvar = "State", timevar = "Year", direction = "wide")
# Step 1: Define X as the pre-intervention
#period for both treated unit and
#non-treated units in the data set
X = trainframe[,2:nptperiod]
last = nptperiod-1
x = seq(0,1,length=last)
# Step 1.01: Define our spline for the eigenfunction
splinebasis = create.bspline.basis(c(0,1),10)
smooth = smooth.basis(x,t(X),splinebasis)
# Step 1.2: Compute FPC scores xi for all units
# that explain most of the variation in the data
Xfun = smooth$fd
#Do PCA over it
pca = pca.fd(Xfun, 10)
var.pca = cumsum(pca$varprop)
nharm = sum(var.pca < 0.95) + 1
pc = pca.fd(Xfun, nharm)
# Step 2: Apply K-Means algorithm on
# FPC-scores and extract the donor pool
#Make our fPCA scores as matrices
cluster_x = as.matrix(pc$scores)
cluster_x <- scale(cluster_x)
# 2.1: Calculating the Silhouette statistics
library(cluster)
k.max = 8
sil = rep(0, k.max)
# Compute the average silhouette width
for(i in 2:k.max){
tmp = kmeans(cluster_x, centers = i, nstart = 10)
ss <- silhouette(tmp$cluster, dist(cluster_x))
# Our Silhouette stat
sil[i] <- mean(ss[, 3])
}
# Now, we get the row of the vector
# that has the largest Silhouette stat
optnum = which.max(sil)
# 2.2: K-Means Cluster Analysis
fit <- kmeans(cluster_x, optnum) # 5 cluster solution
# 5 cluster solution?
trainframe <- data.frame(trainframe, fit$cluster)
cols=ncol(trainframe)
colm1 = cols-1
California 集群(处理单元)属于集群 1,我想保留仅在集群 1 中的状态。但是,当我检查它是否在
new_data
之后像这样
trainframe[,c(1,cols)]
treat_cluster = fit$cluster[11]
new_data = trainframe[trainframe[,cols]==treat_cluster,1:colm1]
istr= sum(str_detect(new_data$fullname, '^California$')) > 0
if(istr == "FALSE"){
# If California isn't there, stop.
stop("You MUST have California in the cluster.")
}
R 告诉我找不到 California,它应该是。如何将集群 1 保存到它自己的数据框中?我怀疑这个问题与
treat_cluster = fit$cluster[11]
有关。