
Clustering to Reduce Spatial Data Set Size


Read/cite the paper here.

In this tutorial, I demonstrate how to reduce the size of a spatial data set of GPS latitude-longitude coordinates using Python and its scikit-learn implementation of the DBSCAN clustering algorithm. All my code is in this IPython notebook in this GitHub repo, where you can also find the data.

Traditionally, the problem for researchers was having too little spatial data to answer useful questions or build compelling visualizations. Today, however, the problem is often that we have too much data. Too many scattered points on a map can overwhelm a viewer looking for a simple narrative. Furthermore, rendering a JavaScript web map (like Leaflet) with millions of data points on a mobile device can swamp the processor and leave the page unresponsive.

 

The data set

How can we reduce the size of a data set down to a smaller set of spatially representative points? Consider a spatial data set with 1,759 latitude-longitude coordinates. This manageable data set is not too large to map, but it serves as a useful object for this tutorial (for a more complex example clustering 1.2 million GPS coordinates, see this project).

I have discussed this data set in a series of posts, and reverse-geocoded the coordinates to add city and country data. Here is a simple Python matplotlib scatter plot of all the coordinates in the full data set:

At this scale, only a few dozen of the 1,759 data points are really visible. Even zoomed in very close, several locations have hundreds of data points stacked directly on top of each other due to the duration of time spent at one location. Unless we are interested in time dynamics, we simply do not need all of these spatially redundant points – they just bloat the data set’s size.

How much data do we need?

Look at the tight cluster of points representing Barcelona around the coordinate pair (2.15, 41.37). I stayed at the same place for a month and my GPS coordinates were recorded every 15 minutes, so I ended up with hundreds of rows in my data set corresponding to the coordinates of my apartment.

This high number of observations is useful for representing the duration of time spent at certain locations. However, it grows less useful if the objective is merely to represent where one has been. In that case, only a single data point is needed for each geographical location to demonstrate that it has been visited. This reduced-size data set would be far easier to render in an on-the-fly JavaScript web map. It’s also far easier to reverse-geocode only the spatially representative points rather than the thousands or possibly millions of points in the full data set.

Clustering algorithms: k-means and DBSCAN

The k-means algorithm is likely the most common clustering algorithm. But for spatial data, the DBSCAN algorithm is far superior. Why?

The k-means algorithm groups N observations (i.e., rows in an array of coordinates) into k clusters. However, k-means is not an ideal algorithm for latitude-longitude spatial data because it minimizes variance, not geodetic distance. There is substantial distortion at latitudes far from the equator, like those of this data set. The algorithm would still “work” but its results are poor and there isn’t much that can be done to improve them.
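To make that distortion concrete, here is a quick back-of-the-envelope check (my illustration, not part of the original analysis): at this data set’s latitudes, a degree of longitude covers much less ground than a degree of latitude, so Euclidean distance in raw degrees misrepresents real-world proximity.

import math

# At ~41°N (roughly Barcelona), one degree of longitude spans far less ground
# than one degree of latitude, so variance in raw degrees is a poor distance proxy.
lat = 41.37
km_per_deg_lat = 111.32                                # roughly constant everywhere
km_per_deg_lon = 111.32 * math.cos(math.radians(lat))  # shrinks toward the poles
print('1 degree of latitude is about {:.0f} km'.format(km_per_deg_lat))
print('1 degree of longitude is about {:.0f} km at {} N'.format(km_per_deg_lon, lat))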

With k-means, locations where I spent a lot of time – such as Barcelona – would still be over-represented because the initial random selection to seed the k-means algorithm would select them multiple times. Thus, more rows near a given location in the data set means a higher probability of having more rows selected randomly for that location. Even worse, due to the random seed, many locations would be missing from any clusters, and increasing the number of clusters would still leave patchy gaps throughout the reduced data set.
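For comparison only, here is a minimal sketch (mine, not the original analysis) of what a k-means reduction would look like. Note that k must be fixed in advance, and the returned centers are variance-minimizing centroids in raw degrees that need not coincide with any actually visited location:

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('summer-travel-gps-full.csv')
coords = df[['lat', 'lon']].to_numpy()

# k must be chosen up front; 138 is used here only to match the DBSCAN result
# below, so that the two reductions are directly comparable.
kmeans = KMeans(n_clusters=138, n_init=10, random_state=0).fit(coords)
centers = kmeans.cluster_centers_  # centroids in raw degrees, not actual GPS fixes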

Instead, let’s use an algorithm that works better with arbitrary distances: scikit-learn’s implementation of the DBSCAN algorithm. DBSCAN clusters a spatial data set based on two parameters: a physical distance from each point, and a minimum cluster size. This method works much better for spatial latitude-longitude data.

Spatial data clustering with DBSCAN

Time to cluster. I begin by importing the necessary Python modules and loading the full data set. I convert the latitude and longitude columns into a two-dimensional numpy array called coords:

import pandas as pd, numpy as np, matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint

df = pd.read_csv('summer-travel-gps-full.csv')
coords = df[['lat', 'lon']].to_numpy()  # df.as_matrix() was removed in pandas 1.0

Next I compute DBSCAN. The epsilon parameter is the max distance (1.5 km in this example) that points can be from each other to be considered a cluster. The min_samples parameter is the minimum cluster size (everything else gets classified as noise). I’ll set min_samples to 1 so that every data point gets assigned to either a cluster or forms its own cluster of 1. Nothing will be classified as noise.

I use the haversine metric and ball tree algorithm to calculate great circle distances between points. Notice my epsilon and coordinates get converted to radians, because scikit-learn’s haversine metric needs radian units:

kms_per_radian = 6371.0088  # mean radius of the Earth
epsilon = 1.5 / kms_per_radian  # 1.5 km, converted to radians
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))  # no noise label when min_samples=1
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
print('Number of clusters: {}'.format(num_clusters))

OK, now I’ve got 138 clusters. Unlike k-means, DBSCAN doesn’t require you to specify the number of clusters in advance: it determines the number automatically based on the epsilon and min_samples parameters.
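As an aside, if you do want sparse points filtered out as noise, raise min_samples above 1. Here is a minimal sketch (my addition, assuming the db object from the code above) of how that case would be handled, since DBSCAN marks noise with the special label -1:

# Sketch only: with min_samples > 1, DBSCAN labels points that fail the
# density test as -1, so that label must be excluded from the cluster count.
labels = db.labels_
n_noise = (labels == -1).sum()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('{} clusters, {} noise points'.format(n_clusters, n_noise))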

Finding a cluster’s center-most point

To reduce my data set size, I want to grab the coordinates of one point from each cluster that was formed. I could just take the first point in each cluster, but it is more spatially representative to take the point nearest the cluster’s centroid. Note that with DBSCAN, clusters may be non-convex and a centroid may fall outside the cluster itself; however, we just want to reduce each cluster to a single point, and the point nearest its center is perfectly suitable for this.

This function returns the center-most point from a cluster by taking a set of points (i.e., a cluster) and returning the point within it that is nearest to some reference point (in this case, the cluster’s centroid):

def get_centermost_point(cluster):
    # the centroid of the cluster's member points
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    # the member point with the smallest great-circle distance to that centroid
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

centermost_points = clusters.map(get_centermost_point)

The function above first calculates the centroid’s coordinates. Then I use Python’s built-in min function to find the member of the cluster with the smallest distance to that centroid. The key argument does this with a lambda function that calculates each point’s distance to the centroid in meters, via geopy’s great_circle function. Finally, the function returns the coordinates of that nearest point.
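As a quick sanity check (my own toy example, with hypothetical coordinates not drawn from the data set), here is the function applied to a tiny hand-made cluster:

# Three (lat, lon) points; the middle one lies nearest the cluster's centroid.
toy_cluster = np.array([[41.37, 2.15], [41.38, 2.16], [41.50, 2.30]])
print(get_centermost_point(toy_cluster))  # roughly (41.38, 2.16)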

To use this function, I map it to my pandas series of clusters. In other words, for each element (i.e., cluster) in the series, it gets the center-most point and then assembles all these center-most points into a new series called centermost_points. Then I turn these center-most points into a pandas dataframe of points which are spatially representative of my clusters (and in turn, my original full data set):

lats, lons = zip(*centermost_points)
rep_points = pd.DataFrame({'lon':lons, 'lat':lats})

Great! Now I’ve got my set of 138 spatially representative points. But, I also want the city, country, and date information that was contained in the original full data set. So, for each row of representative points, I pull the full row from the original data set where the latitude and longitude columns match the representative point’s latitude and longitude:

rs = rep_points.apply(lambda row: df[(df['lat']==row['lat']) & (df['lon']==row['lon'])].iloc[0], axis=1)
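One caveat of my own: exact float equality can be brittle if the coordinates were rounded or transformed anywhere along the way. An equivalent join via pandas merge, keeping the first match per coordinate pair just as .iloc[0] does above, is a reasonable alternative:

# Alternative join (sketch): merge on the coordinate columns, then keep the
# first matching full row per representative point, mirroring .iloc[0] above.
rs = (pd.merge(rep_points, df, on=['lat', 'lon'], how='left')
        .drop_duplicates(subset=['lat', 'lon'])
        .reset_index(drop=True))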

All done. I’ve reduced my original data set down to a spatially representative set of points with full details.

Final result from DBSCAN

I’ll plot the final reduced set of data points versus the original full set to see how they compare:

fig, ax = plt.subplots(figsize=[10, 6])
rs_scatter = ax.scatter(rs['lon'], rs['lat'], c='#99cc99', edgecolor='None', alpha=0.7, s=120)
df_scatter = ax.scatter(df['lon'], df['lat'], c='k', alpha=0.9, s=3)
ax.set_title('Full data set vs DBSCAN reduced set')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend([df_scatter, rs_scatter], ['Full set', 'Reduced set'], loc='upper right')
plt.show()

Looks good! You can see the 138 representative points, in green, approximating the spatial distribution of the 1,759 points of the full data set, in black. DBSCAN reduced the data set by 92.2%, from 1,759 points to 138 points. There are no gaps in the reduced data set, and heavily-trafficked spots (like Barcelona) are no longer drastically over-represented.

Source: Clustering to Reduce Spatial Data Set Size – Geoff Boeing
