Segment or group a DataFrame based on parameters or differences within columns across its rows?

Question (votes: 2, answers: 1)

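I am trying to figure out whether there is a way, given a DataFrame with multiple fields, to segment or group it into new DataFrames based on whether the values of specific columns are within some amount x of each other.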

   I.D  |      Created_Time            | Home_Longitude | Home_Latitude | Work_Longitude | Work_Latitude
  Faa1      2019-02-23 20:01:13.362           -77.0364         38.8951        -72.0364        38.8951

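The above shows what one row of the original df looks like; the real df has many rows. I want to create a new DataFrame containing all the rows (I.Ds) whose created times are within x minutes of each other, whose homes are within x miles of each other (using haversine), and whose work locations are within x miles of each other.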

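So basically, I am trying to filter this DataFrame down to a df that contains only the rows that are within x minutes of an order's created time, within x miles of one another's home, and within x miles of each other's work column values.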

python-3.x pandas dataframe group-by haversine
1 answer
0 votes

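I did it like this:

  1. Compute the distance (in miles) and the time relative to the first row. My logic: if n rows are each within x minutes/miles of the first row, then those n rows are treated as being within x minutes/miles of each other (strictly, this only guarantees they are within 2x of each other).
  2. Filter the data using the required distance and time conditions.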
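
The assumption in step 1 can be checked directly. The following is a minimal sketch, not part of the original answer: `pairwise_haversine_miles` is a hypothetical helper, and the three points and the 70-mile threshold are made up for illustration. It builds the full pairwise distance matrix with NumPy broadcasting and compares "within x of the first row" against "within x of every other row".

```python
import numpy as np

def pairwise_haversine_miles(lat_deg, lon_deg, earth_radius_km=6371.0):
    """Return the square matrix of great-circle distances (miles) between all points."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column vector against a row vector yields every pair (i, j)
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2.0) ** 2)
    # Clip guards against tiny floating-point overshoot outside [0, 1]
    km = earth_radius_km * 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return km * 0.621371

# Three points on the equator; the last two sit 1 degree either side of the first
lats = [0.0, 0.0, 0.0]
lons = [0.0, 1.0, -1.0]
dist = pairwise_haversine_miles(lats, lons)

x = 70.0  # hypothetical threshold in miles
within_x_of_first = dist[0] <= x           # every row vs. the first row
all_pairs_within_x = (dist <= x).all()     # every row vs. every other row

print(within_x_of_first)
print(all_pairs_within_x)
```

Here every point is within x of the first one, yet the two outer points are farther than x from each other. If true pairwise closeness matters, the matrix check is the safer test, at the cost of O(n^2) memory.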

Generate some dummy data

from random import uniform

import numpy as np
import pandas as pd

# Generate random (longitude, latitude) points
def newpoint():
    return uniform(-180, 180), uniform(-90, 90)

home_points = (newpoint() for x in range(289))
work_points = (newpoint() for x in range(289))

df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude'])
df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points)

# Insert `ID` column as sequence of integers
df.insert(0, 'ID', range(289))

# Generate random datetimes, separated by 5 minute intervals
# (you can choose your own interval)
times = pd.date_range('2012-10-01', periods=289, freq='5min')
df.insert(1, 'Created_Time', times)
print(df.head())

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude
0   0 2012-10-01 00:00:00      -48.885981     -39.412351      -68.756244      24.739860
1   1 2012-10-01 00:05:00       58.584893      59.851739     -119.978429     -87.687858
2   2 2012-10-01 00:10:00      -18.623484      85.435248      -14.204142      -3.693993
3   3 2012-10-01 00:15:00      -29.721788      71.671103      -69.833253     -12.446204
4   4 2012-10-01 00:20:00      168.257968     -13.247833       60.979050     -18.393925

Create a Python helper function for the haversine distance formula (vectorized; returns km)

def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002

    Calculate the great-circle distance between two points
    on the earth (specified in decimal degrees or in radians).
    Pass to_radians=True when the inputs are in decimal degrees.

    All (lat, lon) coordinates must have numeric dtypes. Array-like
    inputs must be of equal length; scalars are broadcast against them.
    The result is in km for the default earth_radius.
    """
    if to_radians:
        # Convert each argument separately so a Series can be mixed with scalars
        lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
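
As a quick sanity check of the formula, the sketch below (not from the original answer) uses a self-contained variant named `haversine_km`, a hypothetical name, that always takes decimal degrees. Two points on the equator 90 degrees of longitude apart should be a quarter of a great circle apart, i.e. pi * R / 2.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, earth_radius=6371.0):
    """Great-circle distance in km; all inputs are in decimal degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# Quarter of the equator: (lat 0, lon 0) to (lat 0, lon 90)
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
expected = 6371.0 * np.pi / 2

# A point compared with itself should be at distance zero
same_point = haversine_km(38.8951, -77.0364, 38.8951, -77.0364)

print(quarter, expected, same_point)
```

Note the argument order: latitude comes before longitude, and degrees must be converted to radians before the trigonometric functions are applied.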

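Use the haversine formula to compute the distance in km (relative to the first row), then convert km to miles. The coordinates are in decimal degrees, so they are converted to radians first, and latitude is passed before longitude to match the function signature.

KM_TO_MILES = 0.621371

df['Home_dist_miles'] = haversine(
    np.radians(df.Home_Latitude), np.radians(df.Home_Longitude),
    np.radians(df.loc[0, 'Home_Latitude']), np.radians(df.loc[0, 'Home_Longitude'])
) * KM_TO_MILES
df['Work_dist_miles'] = haversine(
    np.radians(df.Work_Latitude), np.radians(df.Work_Longitude),
    np.radians(df.loc[0, 'Work_Latitude']), np.radians(df.loc[0, 'Work_Longitude'])
) * KM_TO_MILES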

Compute the time differences, in minutes (relative to the first row)

  • For the dummy data here, the time differences will be multiples of 5 minutes (but in real data they could be anything)
df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time']
df['time_min'] = df['time'].dt.total_seconds() / 60
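
In real data the reference row is not necessarily the earliest one, so some differences can be negative. The sketch below, which uses made-up timestamps and a hypothetical frame `df_t`, shows the signed difference in minutes via `Series.dt.total_seconds()` and an absolute version that is more suitable for a "within x minutes" test.

```python
import pandas as pd

# Made-up timestamps; the third one is earlier than the reference (first) row
times = pd.to_datetime([
    "2019-02-23 20:01:13",
    "2019-02-23 20:11:13",
    "2019-02-23 19:51:13",
    "2019-02-24 20:01:13",
])
df_t = pd.DataFrame({"Created_Time": times})

# Signed difference relative to the first row, as a timedelta Series
delta = df_t["Created_Time"] - df_t.loc[0, "Created_Time"]

# total_seconds() covers days, seconds and microseconds in one call
df_t["time_min"] = delta.dt.total_seconds() / 60

# Absolute difference, so that earlier rows also count as "within x minutes"
df_t["abs_time_min"] = df_t["time_min"].abs()

print(df_t)
```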

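Apply the filters (method 1, boolean masks), then select any 2 rows that satisfy the conditions stated in the question. The printed values come from randomly generated data, so they will differ between runs.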

home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles
work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles
time_filter = df['time_min']<=25 # within 25 minutes
df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)]

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0
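
The filter above is always relative to the first row. One possible generalization, sketched here under assumptions, wraps the same three masks in a function that accepts any reference row. The names `rows_near` and `haversine_miles`, the sample rows and the thresholds are all hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

KM_TO_MILES = 0.621371

def haversine_miles(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance in miles; all inputs are in decimal degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius_km * 2 * np.arcsin(np.sqrt(a)) * KM_TO_MILES

def rows_near(df, ref_label, max_minutes, max_home_miles, max_work_miles):
    """Return the rows of df that are close to the row labelled ref_label."""
    minutes = (df["Created_Time"] - df.at[ref_label, "Created_Time"])
    minutes = minutes.dt.total_seconds().abs() / 60
    home = haversine_miles(df["Home_Latitude"], df["Home_Longitude"],
                           df.at[ref_label, "Home_Latitude"],
                           df.at[ref_label, "Home_Longitude"])
    work = haversine_miles(df["Work_Latitude"], df["Work_Longitude"],
                           df.at[ref_label, "Work_Latitude"],
                           df.at[ref_label, "Work_Longitude"])
    mask = (minutes <= max_minutes) & (home <= max_home_miles) & (work <= max_work_miles)
    return df.loc[mask]

# Made-up rows: the second is close to the first, the third was created hours later
demo = pd.DataFrame({
    "ID": ["Faa1", "Faa2", "Faa3"],
    "Created_Time": pd.to_datetime(["2019-02-23 20:01:13",
                                    "2019-02-23 20:10:00",
                                    "2019-02-23 23:00:00"]),
    "Home_Latitude": [38.8951, 38.9000, 38.8951],
    "Home_Longitude": [-77.0364, -77.0400, -77.0364],
    "Work_Latitude": [38.8951, 38.8900, 38.8951],
    "Work_Longitude": [-72.0364, -72.0300, -72.0364],
})

near_first = rows_near(demo, 0, max_minutes=15, max_home_miles=5, max_work_miles=5)
print(near_first["ID"].tolist())

# One candidate group per reference row (O(n^2) work overall)
groups = {label: rows_near(demo, label, 15, 5, 5)["ID"].tolist() for label in demo.index}
print(groups)
```

Calling the function once per row gives one candidate group per reference row. The groups can overlap, so whether they should be merged or deduplicated depends on what "grouped" is meant to mean in the question.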

Apply the filters (method 2, DataFrame.query), then select any 2 rows that satisfy the conditions stated in the question

multi_query = """Home_dist_miles<=12000 & \
                Work_dist_miles<=8000 & \
                time_min<=25"""
df_filtered = df.query(multi_query)

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0
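
The thresholds in the query string do not have to be hard-coded. As a small sketch with made-up numbers, `DataFrame.query` can reference local Python variables with the `@` prefix; the function name `filter_rows` and the `demo` frame are hypothetical.

```python
import pandas as pd

def filter_rows(df, max_home_miles, max_work_miles, max_minutes):
    """Keep rows that satisfy all three thresholds, using query with @variables."""
    return df.query(
        "Home_dist_miles <= @max_home_miles"
        " and Work_dist_miles <= @max_work_miles"
        " and time_min <= @max_minutes"
    )

# Made-up precomputed distances (miles) and time differences (minutes)
demo = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Home_dist_miles": [0.0, 3.2, 40.0, 1.1],
    "Work_dist_miles": [0.0, 4.8, 2.0, 0.5],
    "time_min": [0.0, 10.0, 5.0, 90.0],
})

close = filter_rows(demo, max_home_miles=5, max_work_miles=5, max_minutes=25)
print(close)
```

Row 2 is dropped because its home distance exceeds the limit, and row 3 because it was created too long after the reference row.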