geopandas to_crs returns fewer records than expected

Question (votes: 0, answers: 1)

I have a geopandas GeoDataFrame containing POLYGON and MULTIPOLYGON geometries, and I am trying to reproject it from another coordinate reference system (CRS) to EPSG:4326.

Because the GeoDataFrame has about 200,000 records, I

  • split the full GeoDataFrame into 200 smaller GeoDataFrames of about 1,000 records each
  • then ran
    small_gdf.to_crs('epsg:4326', inplace=True)
  • then exported each "small_gdf":
    small_gdf.to_file(f'small_gdf_{filecounter}.shp')

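The conversion took about 2 full days. After combining all the small_gdf parts back into a full GeoDataFrame with pd.concat, the result contains only about 60% of the records in the original GeoDataFrame. Can "to_crs" drop records when the conversion fails?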

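In the meantime, I am adding a new column to each "small_gdf" and rerunning the to_crs operation so that I can trace which records are dropped during the conversion.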
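One way to trace dropped records is to tag every row with a stable ID before splitting and then compare the ID sets after the parts are recombined. The sketch below shows only the comparison step in plain Python. The helper name `find_missing_ids` and the sample IDs are hypothetical, chosen for illustration, and are not part of the original post.

```python
def find_missing_ids(original_ids, recovered_ids):
    """Return the sorted IDs present before the round trip but absent after it."""
    return sorted(set(original_ids) - set(recovered_ids))


# Hypothetical example: 10 records went in, records 7 to 9 did not come back.
original_ids = range(10)
recovered_ids = [0, 1, 2, 3, 4, 5, 6]
missing = find_missing_ids(original_ids, recovered_ids)
print(missing)  # [7, 8, 9]
```

With GeoDataFrames, the IDs could come from a column added before chunking (for example an assumed `ORIG_ID` column filled with `range(len(gdf))`) and read back from the concatenated result. Keeping the column name at 10 characters or fewer avoids truncation when writing to a shapefile.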

Code sample [please excuse any typos; I had to retype this post]

import os

import geopandas as gpd
import pandas as pd

gdf = gpd.read_file('bigShapefilePath.shp')
n_records = len(gdf)

# create (start, end) index tuples for each chunk; end is inclusive
chunksize = 1000
list_start_end_idx_tuples = []
for start in range(0, n_records, chunksize):
    # clamp the last chunk to the final record
    end = min(start + chunksize - 1, n_records - 1)
    list_start_end_idx_tuples.append((start, end))

# convert in chunks
parts_folderpath = '<parts_folderpath>'  # placeholder: folder for the part files
file_counter = 1
for start, end in list_start_end_idx_tuples:
    # copy the slice so the in-place edits below do not act on a view of gdf
    small_gdf = gdf.iloc[start:end + 1].copy()
    # note: the shapefile format truncates field names to 10 characters
    small_gdf['WITHIN_PART_IDX'] = range(len(small_gdf))
    small_gdf.to_crs('epsg:4326', inplace=True)
    small_gdf.to_file(f'{parts_folderpath}/small_gdf_part{file_counter}.shp')

    file_counter += 1


# find file parts
full_folderpath = '<full_folderpath>'  # placeholder: folder for the combined file
list_smallgdf_filename = []
list_smallgdf_filenamenext = []

for dirpath, subdirs, filenames in os.walk(parts_folderpath):
    for filenamenext in filenames:
        # keep only the .shp files, skipping sidecars such as .shp.xml
        if filenamenext.endswith('.shp'):
            filename = filenamenext.split('.')[0]
            list_smallgdf_filename.append(filename)
            list_smallgdf_filenamenext.append(filenamenext)


# concat into full gdf
list_small_gdfs = []
for filename, filenamenext in zip(list_smallgdf_filename, list_smallgdf_filenamenext):
    small_gdf = gpd.read_file(f'{parts_folderpath}/{filenamenext}')
    # file names look like 'small_gdf_part12', so strip the 'part' prefix
    part_num = filename.split('_')[-1].replace('part', '')
    small_gdf['PART_NUM'] = int(part_num)
    list_small_gdfs.append(small_gdf)

# concatenate once rather than growing the frame inside the loop
concat_gdf = pd.concat(list_small_gdfs, ignore_index=True)

concat_gdf.to_file(f'{full_folderpath}/concat_gdf.shp')
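The chunk indexes in the code above can also be expressed as half-open ranges, which avoids the inclusive-end arithmetic and makes it easy to check that every record lands in exactly one chunk. This is a minimal sketch; the helper name `chunk_bounds` and the sample sizes are hypothetical and not from the original post.

```python
def chunk_bounds(n_records, chunksize):
    """Yield half-open (start, stop) pairs that cover range(n_records) exactly once."""
    for start in range(0, n_records, chunksize):
        yield start, min(start + chunksize, n_records)


bounds = list(chunk_bounds(2500, 1000))
print(bounds)  # [(0, 1000), (1000, 2000), (2000, 2500)]

# Every record belongs to exactly one chunk, so the chunk sizes add up to the total.
assert sum(stop - start for start, stop in bounds) == 2500
```

Each pair can be passed straight to positional slicing, for example `gdf.iloc[start:stop]`, with no `+ 1` adjustment.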
geopandas pyproj
1 Answer (0 votes)

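The problem was the chunk size.

  • What happened: chunksize was set to 1000. Within each 1,000-record chunk passed through "to_crs", roughly the first 800 records came through and the remaining 200 or so were apparently dropped.

  • How to fix it: lower chunksize to 100. This leaves you with 10 times as many chunks to combine later with pd.concat, but records are no longer lost during the "to_crs" coordinate conversion.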
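To confirm at which stage records go missing, it may help to record the row count of every part before export and again after reading the part file back, then compare the two per part. The following is a rough sketch of that comparison only; the helper name `audit_part_counts` and the sample counts are hypothetical, not measurements from the post.

```python
def audit_part_counts(expected_counts, actual_counts):
    """Return {part: (expected, actual)} for every part whose row counts differ."""
    mismatches = {}
    for part, expected in expected_counts.items():
        # a part that was never read back is treated as having 0 rows
        actual = actual_counts.get(part, 0)
        if actual != expected:
            mismatches[part] = (expected, actual)
    return mismatches


# Hypothetical counts: part 2 lost 200 rows, part 3 was never read back.
expected = {1: 1000, 2: 1000, 3: 500}
actual = {1: 1000, 2: 800}
report = audit_part_counts(expected, actual)
print(report)  # {2: (1000, 800), 3: (500, 0)}
```

In the workflow above, the expected count would be `len(small_gdf)` taken before `to_crs` and `to_file`, and the actual count would be the length of the frame returned by `gpd.read_file` for the same part. Taking a third count right after `to_crs` would show whether the loss happens in the reprojection or in the file round trip.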
