For a geopandas GeoDataFrame containing POLYGON and MULTIPOLYGON geometry data, I am trying to reproject from another coordinate reference system (CRS) to EPSG:4326.
Because the GeoDataFrame has about 200,000 records, I have been running, chunk by chunk:
small_gdf.to_crs('epsg:4326', inplace=True)
small_gdf.to_file(f'small_gdf_{filecounter}.shp')
This conversion process takes roughly two full days. After applying pd.concat to all the small_gdf parts to rebuild the full GeoDataFrame, the result contains only about 60% of the records of the original GeoDataFrame. Could records be getting dropped because the to_crs transformation fails on them?
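One way to narrow this down (a minimal sketch; the GeoDataFrame below is stand-in data built from scratch, not my actual shapefile) is to compare row counts before and after reprojecting a small sample. If the counts match, the loss is happening elsewhere, e.g. in the shapefile round-trip rather than in to_crs itself:

```python
import geopandas as gpd
from shapely.geometry import Point

# Stand-in data: a small GeoDataFrame in Web Mercator (EPSG:3857)
gdf = gpd.GeoDataFrame(
    {'id': range(5)},
    geometry=[Point(x * 1000.0, x * 2000.0) for x in range(5)],
    crs='epsg:3857',
)

n_before = len(gdf)
reprojected = gdf.to_crs('epsg:4326')
n_after = len(reprojected)

# If the counts differ, to_crs really is dropping rows;
# if they match, look at the to_file / read_file round-trip instead.
print(n_before, n_after)
```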
Meanwhile, I am adding a new column to each small_gdf and re-running the to_crs step, so that I can trace which records get dropped during the conversion.
Code example [please forgive any typos; I had to retype this post]:
import os

import geopandas as gpd
import pandas as pd

gdf = gpd.read_file('bigShapefilePath.shp')
n_records = len(gdf)

# create tuples for start-end indexes of each chunk
chunksize = 1000
list_start_end_idx_tuples = []
for start in range(0, n_records, chunksize):
    end = start + chunksize - 1
    if end > n_records - 1:
        end = n_records - 1
    start_end_idx_tuple = (start, end)
    list_start_end_idx_tuples.append(start_end_idx_tuple)
# convert in chunks
parts_folderpath = <parts_folderpath>
file_counter = 1
for each_start_end in list_start_end_idx_tuples:
    start, end = each_start_end
    small_gdf = gdf.iloc[start:end + 1].copy()  # copy to avoid SettingWithCopyWarning
    small_gdf['WITHIN_PART_IDX'] = range(len(small_gdf))
    small_gdf.to_crs('epsg:4326', inplace=True)
    small_gdf.to_file(f'{parts_folderpath}/small_gdf_part{file_counter}.shp')
    file_counter += 1
# find file parts
full_folderpath = <full_folderpath>
list_smallgdf_filename = []
list_smallgdf_filenamenext = []
for dir, subdir, filenames in os.walk(parts_folderpath):
    for filenamenext in filenames:
        # keep .shp files, skip .shp.xml sidecar files
        if ('.shp' in filenamenext) and ('.xml' not in filenamenext):
            filename = filenamenext.split('.')[0]
            list_smallgdf_filename.append(filename)
            list_smallgdf_filenamenext.append(filenamenext)
# concat into full gdf
i = 0
for filenamenext in list_smallgdf_filenamenext:
    small_gdf = gpd.read_file(f'{parts_folderpath}/{filenamenext}')
    small_filename = list_smallgdf_filename[i]
    part_num = small_filename.split('part')[-1]  # e.g. 'small_gdf_part3' -> '3'
    small_gdf['PART_NUM'] = int(part_num)
    if i < 1:
        concat_gdf = small_gdf
    else:
        concat_gdf = pd.concat([concat_gdf, small_gdf])
    i += 1
concat_gdf.to_file(f'{full_folderpath}/concat_gdf.shp')
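With the WITHIN_PART_IDX and PART_NUM columns in place, every surviving record can be mapped back to its position in the original GeoDataFrame, and the dropped records fall out as a set difference. A minimal sketch with plain pandas (the toy frame below is a stand-in for concat_gdf, with part 2 missing its last 200 records to mimic the symptom; column names match the ones added above):

```python
import pandas as pd

chunksize = 1000
n_records = 3000  # stand-in for len(gdf)

# Toy concat_gdf: part 2 lost its last 200 records
rows = []
for part_num, kept in [(1, 1000), (2, 800), (3, 1000)]:
    for idx in range(kept):
        rows.append({'PART_NUM': part_num, 'WITHIN_PART_IDX': idx})
concat_gdf = pd.DataFrame(rows)

# Reconstruct each surviving record's position in the original GeoDataFrame
orig_pos = (concat_gdf['PART_NUM'] - 1) * chunksize + concat_gdf['WITHIN_PART_IDX']
missing = sorted(set(range(n_records)) - set(orig_pos))

print(len(missing))             # 200
print(missing[0], missing[-1])  # 1800 1999
```

Running this against the real concat_gdf (with n_records = len(gdf)) gives the exact original positions of the lost records.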
The problem was the chunk size.

What happened: chunksize was set to 1000. For each 1000-record chunk that the to_crs transformation was applied to, to_crs apparently dropped the remaining ~200 records after roughly the first 800 records of the chunk.

How to solve the problem: lower chunksize to 100. The number of chunks you later UNION with pd.concat grows tenfold, but records are no longer lost during the to_crs coordinate transformation.
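After rerunning with the smaller chunk size, it is worth asserting that the concatenated result matches the original count before trusting the output. A sketch with plain pandas stand-ins for the real frames:

```python
import pandas as pd

# Stand-ins: gdf is the original frame, concat_gdf the chunked-and-rebuilt one
gdf = pd.DataFrame({'val': range(200)})
chunksize = 100
parts = [gdf.iloc[start:start + chunksize] for start in range(0, len(gdf), chunksize)]
concat_gdf = pd.concat(parts, ignore_index=True)

# The check that would have caught the 60% loss immediately
assert len(concat_gdf) == len(gdf), (
    f'lost {len(gdf) - len(concat_gdf)} records during chunked conversion'
)
print('record counts match:', len(concat_gdf))
```

Note that iloc[start:start + chunksize] clips at the end of the frame automatically, so the explicit start/end tuple bookkeeping in the code above is not strictly needed.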