Question (votes: 0, answers: 1)

I understand it is highly unlikely, but I can't figure out why Python outputs a slightly different dataset after simple manipulations that I think are identical to those I do in Stata. So, ...

Stata:

use "filename", clear
drop if varname < 1500
sum

Python:

import pandas as pd
df = pd.read_stata("filename.dta", convert_missing = False)
df = df[df.varname<1500]
df.describe()

STATA (raw data):
Obs: 610
Mean: 1339.482
Std. Dev.: 17.27477
Min: 1304
Max: 1368

Check for missing values (mdesc varname):
Missing: 10953
Total: 11563
Percent missing: 94.72

STATA (after drop if varname < 1500):
Obs: 389
Mean: 1350.599
Std. Dev.: 9.564949
Min: 1333
Max: 1368
Type: float

Meanwhile, Python:

PYTHON (raw data: df=pd.read_stata("filename.dta")):
varname
Count: 610
Mean: 1339.481934
Std: 17.274755
Min: 1304.000000
25%: 1326.000000
50%: 1341.000000
75%: 1353.000000
Max: 1368.000000

Check for missing values (df.isnull().sum()):
varname 10953

So the number of missing values in the raw data is the same in Stata and Python, but after dropping I get two different datasets.

PYTHON (after df = df[df.varname<1500]):
Count: 288.000000
Mean: 1325.760376
Std: 13.369122
Min: 1304.000000
25%: 1316.000000
50%: 1325.000000
75%: 1332.000000
Max: 1365.000000

In particular, the differences are in the counts of observations. For some variables there is a patterned difference, e.g. Stata: 11 342 obs, Python: 5064 obs (roughly half as many). For other variables the difference is not patterned, just different values. The summary statistics are not too different, but they are different. I am new to Python, so can you please tell me whether it is indeed possible that it operates on data differently from Stata?

Edit: I figured out that I dropped incorrectly. Instead of df = df[df.varname<1500], I should have typed df_new = df.drop(df[df.varname< 1500].index). I don't know the difference, but now I have the dataset that I need. Thanks everyone for spending time here!

python pandas stata

1 answer

1 vote

I guess you are misinterpreting how a boolean operation behaves inside the df[] clause.

In pandas, the statement inside df[statement] must be True for a row to be selected.

In your example, df = df[df.varname<1500] returns the rows for which df.varname<1500 is True. So you get the rows satisfying df.varname<1500, instead of dropping them.
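The keep-versus-drop distinction described in the answer can be checked on a small example. This is only a minimal sketch: the toy values below are made up for illustration and are not the asker's data, and the names mask, kept, dropped_a and dropped_b are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the asker's dataset); one value is missing.
df = pd.DataFrame({"varname": [1304.0, 1450.0, 1500.0, 1620.0, np.nan]})

# True where the condition holds; a NaN comparison evaluates to False.
mask = df["varname"] < 1500

kept = df[mask]                      # boolean indexing KEEPS rows where mask is True
dropped_a = df[~mask]                # inverting the mask drops those rows instead
dropped_b = df.drop(df[mask].index)  # the form from the question's edit

print(len(kept), len(dropped_a), len(dropped_b))  # 2 3 3
print(dropped_a.equals(dropped_b))                # True
```

Note that the row with the missing value survives both drop variants, because NaN < 1500 is False. As far as I know, Stata's drop if varname < 1500 also keeps missing values, since Stata treats missing as larger than any number, so the two drop variants above should be the closer match to the Stata command.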
