I understand it is highly unlikely, but I can't figure out why Python outputs a slightly different dataset after simple manipulations that I think are identical to the ones I do in Stata. So, ...

STATA (raw data: use "filename", clear):

    sum varname
    Obs: 610   Mean: 1339.482   Std.Dev.: 17.27477   Min: 1304   Max: 1368

Check whether there are missing values (mdesc varname):

    Missing: 10953   Total: 11563   Percent missing: 94.72

STATA (after drop if varname < 1500):

    Min: 1333   Max: 1368   Type: float

Meanwhile, Python:

    import pandas as pd
    df = pd.read_stata("filename.dta", convert_missing=False)
    df = df[df.varname < 1500]
    df.describe()

PYTHON (raw data: df=pd.read_stata("filename.dta")):

    varname
    Count: 610   Mean: 1339.481934   Std: 17.274755   Min: 1304.000000
    25%: 1326.000000   50%: 1341.000000   75%: 1353.000000   Max: 1368.000000

Check whether there are missing values (df.isnull().sum()):

    varname 10953

So the number of missing values in the raw data is the same in Stata and Python, but after dropping I get two different datasets.

PYTHON, after df = df[df.varname<1500]:

    Count: 288.000000   Mean: 1325.760376   Std: 13.369122   Min: 1304.000000
    25%: 1316.000000   50%: 1325.000000   75%: 1332.000000   Max: 1365.000000

In particular, the differences are in the counts of observations. For some variables there is a patterned difference, e.g. Stata: 11342 obs, Python: 5064 obs (roughly half as many). For other variables the difference is not patterned, just different values. The summary statistics are not very different, but they are different. I am new to Python, so can you please tell me whether it is indeed possible that it operates on data differently from Stata?

Edit: I figured out that I dropped incorrectly. Instead of df = df[df.varname<1500], I should have typed df_new = df.drop(df[df.varname< 1500].index). I don't know the difference, but now I have the dataset that I need. Thanks everyone for spending time here!

I guess you are misinterpreting how a boolean expression behaves inside df[]. In pandas, the statement inside df[statement] must be True for a row to be selected. df = df[df.varname<1500] returns the rows for which df.varname<1500 is True, so you get the rows satisfying varname < 1500. Stata's drop if does the opposite: it removes the rows that satisfy the condition.
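To make the point about boolean selection concrete, here is a minimal sketch on a small made-up frame. The values are hypothetical and are not the asker's data. It assumes the usual pandas behaviour that a comparison with NaN evaluates to False, and it relies on Stata treating missing values as larger than any number, so that drop if varname < 1500 leaves missing rows in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for varname; not the asker's dataset.
df = pd.DataFrame({"varname": [1304.0, 1341.0, np.nan, 1600.0, np.nan]})

# Boolean mask: True where varname < 1500. NaN < 1500 evaluates to False.
mask = df["varname"] < 1500

# df[mask] KEEPS the rows where the mask is True (NaN rows are discarded).
kept = df[mask]

# df.drop(...) REMOVES those rows instead; NaN rows stay,
# which mirrors Stata's `drop if varname < 1500`.
dropped = df.drop(df[mask].index)

# Inverting the mask gives the same result without the index round trip.
inverted = df[~mask]

print(len(kept), len(dropped), len(inverted))
```

On this toy frame the two operations split the rows into complementary sets, and the missing rows end up only in the "dropped" result. That would be consistent with the asker seeing different observation counts, although the exact counts depend on the real data.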