将集群的元素放入数据框

问题描述 投票:0回答:0

下面我使用函数主成分分析来降低 1989 年之前时间序列的维度。之后,我对 fPCA 分数进行聚类。考虑以下问题:

library(splines)
library(fda)
library(dplyr)

set.seed(1420)
data <-read.csv("https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv",
                header=TRUE, sep=";")

# Imports our dataset
data = subset(data, select = c(State, Year,PacksPerCapita,treated))
# Restricts our dataset to the outcome, time, and panel
preperiod = nrow(data[data$treated== 1 & data$State=="California", ])
# Now we count the number of T_0 periods...

data <-data %>% select(-(treated))
#The treatment variable is no longer needed.
nptperiod=preperiod+2
if(preperiod < 3){
  # If we have less than 3 pre-periods, stop.
  stop("error message to print")   
}
# Step 0.2: Rehshape our dataframe to wide.
trainframe = reshape(data, idvar = "State", timevar = "Year", direction = "wide")
# Step 1: Define X as the pre-intervention
#period for both treated unit and
#non-treated units in the data set
X = trainframe[,2:nptperiod]
last = nptperiod-1
x = seq(0,1,length=last)

# Step 1.01: Define our spline for the eigenfunction
splinebasis = create.bspline.basis(c(0,1),10)
smooth = smooth.basis(x,t(X),splinebasis)
# Step 1.2: Compute FPC scores xi for all units
# that explain most of the variation in the data
Xfun = smooth$fd
#Do PCA over it
pca = pca.fd(Xfun, 10)
var.pca = cumsum(pca$varprop)
nharm = sum(var.pca < 0.95) + 1
pc = pca.fd(Xfun, nharm)



# Step 2: Apply K-Means algorithm on
# FPC-scores and extract the donor pool
#Make our fPCA scores as matrices
cluster_x = as.matrix(pc$scores)
cluster_x <- scale(cluster_x)
# 2.1: Calculating the Silhouette statistics
library(cluster)
k.max = 8
sil = rep(0, k.max)
# Compute the average silhouette width 
for(i in 2:k.max){
  tmp = kmeans(cluster_x, centers = i, nstart = 10)
  ss <- silhouette(tmp$cluster, dist(cluster_x))
  # Our Silhouette stat
  sil[i] <- mean(ss[, 3])
}
# Now, we get the row of the vector
# that has the largest Silhouette stat
optnum = which.max(sil)
# 2.2: K-Means Cluster Analysis
fit <- kmeans(cluster_x, optnum) # 5 cluster solution
# 5 cluster solution?
trainframe <- data.frame(trainframe, fit$cluster)
cols=ncol(trainframe)
colm1 = cols-1

California 集群(处理单元)属于集群 1,我想保留仅在集群 1 中的状态。但是,当我检查它是否在

new_data
之后像这样

trainframe[,c(1,cols)]
treat_cluster = fit$cluster[11]
new_data = trainframe[trainframe[,cols]==treat_cluster,1:colm1]


istr= sum(str_detect(new_data$fullname, '^California$')) > 0

if(istr == "FALSE"){
  
  # If California isn't there, stop.
  
  stop("You MUST have California in the cluster.")   
}

R 告诉我找不到 California,它应该是。如何将集群 1 保存到它自己的数据框中?我怀疑这个问题与

treat_cluster = fit$cluster[11]
有关。

r cluster-analysis pca
© www.soinside.com 2019 - 2024. All rights reserved.