I know the center latitude/longitude of every neighborhood in a city, and I have a set of restaurants with their latitudes and longitudes. I need to use something like K-means to determine which neighborhood is the densest. Obviously I'm a beginner. So let's say I have a first series of about ten latitude/longitude pairs, and a second series of roughly 200; how do I determine which of those ten locations is the densest, i.e. has the most of the second series nearby? Sorry if this isn't clear, but please help - I'm having a hard time even asking this question.
If you know each neighborhood's boundary from some map data for the city (or use a radius around its center as an approximation), you can directly check which neighborhood each restaurant falls into.
Otherwise, you can compute the distance between each restaurant and the neighborhood center points and assign each of the 200 restaurants to its nearest neighborhood (see the sketch below).
Then you can approximate each neighborhood's density as its restaurant count divided by the total number of restaurants.
I don't think you need any machine learning algorithm for this.
Of course, you can choose whatever distance metric suits your problem.
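A minimal sketch of that nearest-center assignment, assuming the ~10 neighborhood centers sit in a DataFrame called centers and the ~200 restaurants in one called restaurants, each with 'lat' and 'lon' columns (both names are placeholders, not from the question):

# a minimal sketch of the nearest-center assignment described above;
# `centers` and `restaurants` are placeholder DataFrame names with 'lat'/'lon' columns
import numpy as np
from sklearn.neighbors import BallTree

earth_radius_km = 6371.0088

# BallTree's haversine metric expects (lat, lon) in radians
center_coords = np.radians(centers[['lat', 'lon']].to_numpy())
restaurant_coords = np.radians(restaurants[['lat', 'lon']].to_numpy())

tree = BallTree(center_coords, metric='haversine')
dist_rad, nearest_idx = tree.query(restaurant_coords, k=1)  # nearest center for each restaurant

restaurants['neighborhood'] = nearest_idx.ravel()
restaurants['dist_to_center_km'] = dist_rad.ravel() * earth_radius_km

# share of all restaurants assigned to each neighborhood -- a rough density proxy
density = restaurants['neighborhood'].value_counts(normalize=True)
print(density.head())  # top row is the densest neighborhood center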
How about this?
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
# load the data set
df = pd.read_csv('C:\\your_path_here\\summer-travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()
# how many rows are in this data set?
len(df)
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
# represent points consistently as (lat, lon)
df_coords = df[['lat', 'lon']]
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters, not counting noise (labeled -1) as a cluster
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    size = 150
    if cluster_label == -1:  # make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    # plot only the points that belong to the current cluster label
    points = df_coords[cluster_labels == cluster_label]
    ax.scatter(x=points['lon'], y=points['lat'], c=[color], edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.3f}'.format(coefficient))
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
# Result:
Silhouette coefficient: 0.854
Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.3f}'.format(coefficient))
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print('Number of clusters: {}'.format(num_clusters))
# Result:
Number of clusters: 138
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
Final result:
cluster  row   lat        lon
0        1587  37.921659  22...
1        1658  37.933609  23...
2        1607  37.966766  23...
3        1586  38.149019  22...
4        1584  38.374766  21...
...
133      662   50.37369   18.889205
134      561   50.448704  19.0...
135      661   50.462271  19.0...
136      559   50.489304  19.0...
137      1     51.474005  -0.450999
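The great_circle and MultiPoint imports at the top of the script are never used above. One way to put them to work, and to answer "which area is densest", is to reduce each cluster to a single representative point and rank clusters by how many points they contain. This is only a sketch built on the clusters Series above; get_centermost_point is a helper name introduced here, not part of the code you ran:

# sketch: representative point per cluster, plus cluster sizes to find the densest area
def get_centermost_point(cluster):
    # centroid of the cluster, then the member point closest to that centroid
    centroid = (MultiPoint(cluster.values).centroid.x, MultiPoint(cluster.values).centroid.y)
    centermost_point = min(cluster.values, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

# one representative (lat, lon) per cluster
centermost_points = clusters.map(get_centermost_point)
print(centermost_points.head())

# cluster sizes: the largest cluster marks the densest area at this eps
cluster_sizes = clusters.map(len).sort_values(ascending=False)
print(cluster_sizes.head())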