CN110837540A - Method and system for processing spatial position data

Info

Publication number: CN110837540A
Application number: CN201911037134.XA
Authority: CN
Inventors: 鲁仕维; 黄亚平
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2019-10-29
Filing date: 2019-10-29
Publication date: 2020-02-25

Abstract

The invention discloses a method and a system for processing spatial position data. Quality evaluation of the spatio-temporal sampling must first be performed on sparsely sampled spatial position big data. Starting from the inherent spatio-temporal distribution of the sampling points, which is the basic feature of sparsely sampled individual position big data, the invention uses dynamic time period division, computer random simulation and calculation, and regression analysis to propose a quantitative loss assessment model of how well such data depicts the typical characteristics of urban residents' activities, and clearly explains the quality loss distribution corresponding to different sampling rates. Finally, a sample data set that meets actual needs is selected according to the distribution of the obtained deviation, so that the conclusions of related research and analysis are more reliable and guide practical application more scientifically.

Description

A method and system for processing spatial position data

Technical Field

The present invention relates to the technical field of spatial position data processing, and more particularly to a method and system for processing spatial position data.

Background

The rapid development of communication, information, and location-aware technologies makes it possible to collect the spatial movement data of large numbers of individuals, and to share that information, at low cost, over a wide area, and very quickly. Big data carrying time stamps and spatial position coordinates is now readily available. Spatio-temporal big data is applied in fields such as the analysis of urban residents' mobility dynamics, spatio-temporal pattern mining, traffic analysis, and urban planning. While it offers new research perspectives, answering the research questions effectively and reasonably requires addressing data quality issues during analysis.

Sparsely sampled position big data, such as mobile phone signaling data and check-in data with spatio-temporal positions, is an important data source in current research on and applications of urban spatial analysis. One reason is that mobile communication devices are widely used among urban residents, have a high penetration rate, are carried by their users, and are in use for long periods. Another is that communication base stations in cities are deployed over a wide area and at high density. However, in such position big data the sampled positions of an individual are to some degree random, sparse in space and time, and uncertain. In many current studies and applications based on position big data, the preprocessing stage merely filters the data and pays little attention to data quality, so the uncertainty this introduces into analysis and application cannot be measured.

Summary of the Invention

In view of the defects of the prior art, the purpose of the present invention is to solve the technical problem that, in many current studies and applications based on position big data, the preprocessing stage merely filters the data and pays little attention to data quality, so that the uncertainty introduced into analysis and application cannot be measured.

To achieve the above purpose, in a first aspect, the present invention provides a method for processing spatial position data, comprising the following steps:

Step 1: establish an urban spatial database and import sparsely sampled spatial position data into it; divide the sparsely sampled spatial position data into spatial position data covering all time periods and spatial position data covering only some time periods;

Step 2: divide the spatial position data covering all time periods by time period into time-series spatial position data of M time periods, where M is a positive integer greater than or equal to 2;

Step 3: randomly select C groups of spatial position data, each group covering m time periods, from the time-series spatial position data of the M time periods, such that the spatial position data of each of the M time periods is selected at least k times across the C groups; the initial value of m is 2 and m is a positive integer less than M; k is a positive integer less than or equal to C, and C is a positive integer;

Step 4: calculate the index values of each user for each group of m time periods, and take the average of each user's C groups of index values as that user's index values under m time periods; also calculate the index values of each user over all time periods from the time-series spatial position data of the M time periods; the index values comprise the user's spatial activity range, the length of the user's activity path within that range, and the difference and imbalance of the user's visits to different spatial positions within that range;

Step 5: determine the index value deviation corresponding to m time periods from each user's index values under m time periods and each user's index values over all time periods, and determine, based on that deviation, the quality loss coefficient of each user's index values under m time periods;

Step 6: if m = M, go to Step 7; if m is less than M, increase m by 1 and return to Step 3 with the new value of m;

Step 7: determine, in the spatial position data covering only some time periods, the number of users whose data covers m time periods; determine the weighted quality loss coefficient of the sparsely sampled spatial position data from that number of users, the quality loss coefficient of each user's index values under m time periods determined from the spatial position data covering all time periods, and the total number of users, where 2 ≤ m ≤ M.

In an optional embodiment, Step 1 specifically comprises:

importing the sparsely sampled spatial position data into the urban spatial database and converting each spatial position record into a preset coordinate system, each record comprising sampling coordinates and a sampling time.

In an optional embodiment, Step 2 specifically comprises:

dividing the spatial position data covering all time periods into spatial position data of M time periods according to actual requirements or in an adaptive manner.

In an optional embodiment, the index values specifically comprise:

The index of the spatial activity range is the radius of gyration R_g:

The index of the activity path length within the spatial activity range is the moving distance S:

The index of the difference and imbalance of visits to different spatial positions within the spatial activity range is the entropy E:

where n is the total number of position sampling points of each user, either in the spatial position data of a given combination of m time periods or in the spatial position data of the M time periods; (x_j, y_j) are the coordinates of the user's j-th sampling point; (x_c, y_c) is the centroid of all of the user's sampling points; n′ is the number of distinct sampling positions of the user; and p_i is the probability of occurrence of the user's i-th distinct sampling position;

The centroid (x_c, y_c) of all sampling points of each user is calculated as:
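
The formula images for these indices are not reproduced in this text. As a hedged reconstruction only, the standard definitions that match the symbols defined above would read as follows; the 1/n normalization inside R_g, the sum over consecutive sampling points in S, and the base-2 logarithm in E are assumptions, not confirmed by the source.

```latex
\begin{align}
R_g &= \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left[(x_j - x_c)^2 + (y_j - y_c)^2\right]} \\
S   &= \sum_{j=1}^{n-1}\sqrt{(x_{j+1} - x_j)^2 + (y_{j+1} - y_j)^2} \\
E   &= -\sum_{i=1}^{n'} p_i \log_2 p_i \\
x_c &= \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad y_c = \frac{1}{n}\sum_{j=1}^{n} y_j
\end{align}
```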

In an optional embodiment, Step 5 specifically comprises:

obtaining, from a deviation measurement model, the quality loss coefficient of each user's index values for each number of time periods m, the deviation measurement model being:

F_m(X_u) = A_m X_u - B

where F_m(X_u) is the index value of user u under m time periods, X_u is the index value of user u over all time periods, A_m is the regression coefficient, and B is a constant;

The quality loss coefficient QL_m is determined by the following formula:

QL_m = 1 - |A_m|

where |A_m| is the absolute value of the coefficient A_m. Substituting the regression coefficient obtained for each index under each number of time periods m into the above formula yields the quality loss coefficients QL_{m_Rg}, QL_{m_S} and QL_{m_E} corresponding to the radius of gyration, the moving distance and the entropy, respectively.
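
The regression step can be illustrated with a minimal Python sketch. It assumes that, for a fixed m, A_m and B are fitted by ordinary least squares across the users of the full-coverage data, with each user's all-period index value X_u as the predictor and that user's averaged m-period index value as the response. The fitting method and the function names `fit_line` and `quality_loss` are illustrative assumptions and are not stated in the source.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be non-zero
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quality_loss(full_values, sampled_values):
    """Return (QL_m, A_m, B) for the model F_m(X_u) = A_m * X_u - B.

    full_values[i]    : all-period index value X_u of user i
    sampled_values[i] : index value of user i under m time periods
    """
    a_m, intercept = fit_line(full_values, sampled_values)
    b = -intercept  # the model writes the constant as "- B"
    return 1 - abs(a_m), a_m, b
```

If the m-period values reproduce the all-period values exactly, A_m is 1 and QL_m is 0; a flatter fit gives a larger quality loss coefficient.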

In an optional embodiment, Step 7 specifically comprises:

determining the quality loss coefficient w_QL corresponding to each index of the sparsely sampled spatial position data by the following formula:

where users_m is the number of users, in the spatial position data covering only some time periods, whose data covers m time periods; users is the total number of users; and QL_m is the quality loss coefficient of each user's index value under m time periods, being one of QL_{m_Rg}, QL_{m_S} or QL_{m_E};

The weighted quality loss coefficient W_QL of the sparsely sampled spatial position data is calculated from the quality loss coefficients w_{QL_Rg}, w_{QL_S} and w_{QL_E} computed for the radius of gyration, the moving distance and the entropy, respectively:
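
The formula images for w_QL and W_QL are not reproduced in this text. The Python sketch below is therefore only one plausible reading of the symbol definitions. It assumes that w_QL weights each QL_m by the user share users_m / users and sums over m, and that W_QL combines the three per-index coefficients with weights that default to equal. Both forms, and the function names, are assumptions rather than the formulas of the source.

```python
def weighted_loss_for_index(users_by_m, ql_by_m, total_users):
    """Assumed form: w_QL = sum over m of (users_m / users) * QL_m.

    users_by_m : {m: number of partial-coverage users covering m periods}
    ql_by_m    : {m: QL_m for one index (R_g, S or E)}
    """
    return sum(
        users_by_m.get(m, 0) / total_users * ql for m, ql in ql_by_m.items()
    )


def overall_weighted_loss(w_rg, w_s, w_e, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed form: W_QL as a weighted sum of the three per-index values."""
    return weights[0] * w_rg + weights[1] * w_s + weights[2] * w_e
```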

In a second aspect, the present invention provides a system for processing spatial position data, comprising:

a data sampling unit, configured to establish an urban spatial database and import sparsely sampled spatial position data into it, and to divide the sparsely sampled spatial position data into spatial position data covering all time periods and spatial position data covering only some time periods;

a full-period data processing unit, configured to: divide the spatial position data covering all time periods by time period into time-series spatial position data of M time periods, where M is a positive integer greater than or equal to 2; randomly select C groups of spatial position data, each covering m time periods, from the time-series spatial position data of the M time periods, such that the data of each of the M time periods is selected at least k times across the C groups, where the initial value of m is 2, m is a positive integer less than M, k is a positive integer less than or equal to C, and C is a positive integer; calculate the index values of each user for each group of m time periods, and take the average of each user's C groups of index values as that user's index values under m time periods; calculate the index values of each user over all time periods from the time-series spatial position data of the M time periods, the index values comprising the user's spatial activity range, the length of the user's activity path within that range, and the difference and imbalance of the user's visits to different spatial positions within that range; determine the index value deviation corresponding to m time periods from each user's index values under m time periods and over all time periods, and determine, based on that deviation, the quality loss coefficient of each user's index values under m time periods; and, if m = M, end the processing, or, if m is less than M, increase m by 1 and continue randomly selecting C groups of spatial position data covering m time periods with the new value of m, so as to determine from the spatial position data covering all time periods the quality loss coefficient of each user's index values for each number of time periods m;

a partial-period data processing unit, configured to determine, in the spatial position data covering only some time periods, the number of users whose data covers m time periods, where m takes each integer value from 2 to M;

a data quality evaluation unit, configured to determine the weighted quality loss coefficient of the sparsely sampled spatial position data from the number of users whose data covers m time periods, as determined from the spatial position data covering only some time periods, the quality loss coefficient of each user's index values under m time periods, as determined from the spatial position data covering all time periods, and the total number of users, where 2 ≤ m ≤ M.

In an optional embodiment, the full-period data processing unit obtains, from a deviation measurement model, the quality loss coefficient of each user's index values for each number of time periods m. The deviation measurement model is F_m(X_u) = A_m X_u - B, where F_m(X_u) is the index value of user u under m time periods, X_u is the index value of user u over all time periods, A_m is the regression coefficient, and B is a constant. The quality loss coefficient QL_m is determined by the formula QL_m = 1 - |A_m|, where |A_m| is the absolute value of the coefficient A_m. Substituting the regression coefficient obtained for each index under each number of time periods m into this formula yields the quality loss coefficients QL_{m_Rg}, QL_{m_S} and QL_{m_E} corresponding to the radius of gyration, the moving distance and the entropy, respectively.

In an optional embodiment, the data quality evaluation unit determines the quality loss coefficient w_QL corresponding to each index of the sparsely sampled spatial position data by the following formula:

In a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for processing spatial position data according to the first aspect.

In general, compared with the prior art, the above technical solutions conceived by the present invention have the following beneficial effects:

The method and system for processing spatial position data provided by the present invention can overcome the randomness and sparsity of the spatio-temporal position sampling process and the uncertainty they cause. The evaluation of quality loss coefficients proposed herein for the position data of each time period and of different combinations of time periods can directly guide the sampling, in current practical applications, of spatio-temporal position data with a known quality and density.

The present invention proposes a quantitative evaluation model for the quality of spatial position data. The proposed calculation process fully considers the influence of different sampling characteristics on the evaluation results, which not only supports the unbiasedness of the calculation process but also fills a gap in data quality evaluation in current research and applications based on position big data.

The data sampling method of the present invention is scientific and well-founded. Based on the quality loss assessment results, data of different reliability and quality can be selected intuitively and effectively for urban spatial analysis, so that the analysis is suited to the data.

The present invention has a wide range of applications. The proposed evaluation model and method can be applied to many types of spatio-temporal big data with sparsely sampled individual positions, such as mobile phone signaling data, social media check-in data, and credit card transaction records.

The present invention allows data to be customized on demand. The proposed loss estimation evolution curves of the typical indices, and the method of calculating them, can be used to compute the deviation patterns of many types of sparsely sampled position big data, and provide intuitive, scientific guidance for selecting data sets tailored to specific needs.

The present invention saves costs. It requires no additional large-scale equipment, no costly surveys demanding extensive manpower and material resources, and only a small number of maintenance staff; instead it makes full use of the sampling characteristics of the data itself to evaluate data quality.

Description of Drawings

FIG. 1 is a flowchart of the method for processing spatial position data provided by the present invention;

FIG. 2 is a schematic diagram of residents' activity trajectories as reflected by sparsely sampled position big data, provided by the present invention;

FIG. 3 is a fitted curve of the quality loss coefficient provided by the present invention;

FIG. 4 is an architecture diagram of the system for processing spatial position data provided by the present invention.

Detailed Description

To make the purposes, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another provided they do not conflict.

To address the lack of standardization and the uncertainty in the preprocessing and use of existing spatial position big data, the present invention introduces the basic format of spatial position big data and its sparse sampling problem, and on that basis exploits an intuitive and effective inherent feature, the spatio-temporal distribution of the sampling points. By dividing the time periods of the sampling distribution with a dynamic time window, it proposes a deviation measurement model of how sparsely sampled data depicts the activity characteristics of urban residents. Using a large number of computer random simulations, it derives general conclusions about the deviation with which spatial position big data depicts those characteristics, and clearly explains the deviation distribution corresponding to different sampling rates. Finally, a sample data set that meets actual needs is selected according to the distribution of the obtained deviation, so that the conclusions of related research and analysis are more reliable and guide practical application more scientifically.

The present invention proposes a quantitative and effective method for the quality assessment and sampling of sparsely sampled position big data, to resolve the uncertainty that large-scale, sparsely sampled spatial position big data introduces into the analysis of urban residents' activities and its applications.

The technical solution of the present invention is a method for the quality assessment and sampling of sparsely sampled spatial position big data which, as shown in FIG. 1, comprises the following steps:

Step S101: establish an urban spatial database and import the sparsely sampled spatial position big data into the database;

In one example, each individual resident is assigned a unique ID, and the individual ID field is set as a commonly used index field. All spatial data imported into the database is converted into a coordinate system based on the China Geodetic Coordinate System 2000 (CGCS2000). The position big data is queried by individual ID, and for each resident a time series of trajectory sampling points {P_1, P_2, ..., P_n} is constructed, where P_i(x_i, y_i, t_i) is the i-th position sampling point, (x_i, y_i) are its geodetic coordinates with x_i the abscissa and y_i the ordinate, and t_i is the coordinate on the third, vertical axis, which represents time. CGCS2000 is used as the two-dimensional datum; Voronoi polygons are generated from the geodetic coordinates of the sampling points on this datum, and a space-time cuboid is constructed with time as the third, vertical axis. Within this space-time cuboid, the spatio-temporal trajectory of each resident is recovered from the time series of sampling points. FIG. 2 is a schematic diagram of residents' activity trajectories as reflected by sparsely sampled position big data. Based on the time series of each individual's trajectory sampling points and the recovered spatio-temporal trajectories, the calendar day is divided linearly into time periods.
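
The construction of the per-individual time series {P_1, ..., P_n} can be sketched in Python as follows. The record layout (user_id, x, y, t), with coordinates already converted to the target coordinate system, and the function name `build_trajectories` are illustrative assumptions; the Voronoi polygons and the space-time cuboid are not modeled here.

```python
from collections import defaultdict


def build_trajectories(records):
    """Group sampling points by individual ID and order them by time.

    records : iterable of (user_id, x, y, t) tuples (hypothetical layout)
    returns : {user_id: [(x, y, t), ...]} with each list sorted by t
    """
    trajectories = defaultdict(list)
    for user_id, x, y, t in records:
        trajectories[user_id].append((x, y, t))
    for points in trajectories.values():
        points.sort(key=lambda p: p[2])  # t is the third coordinate
    return dict(trajectories)
```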

It should be noted that the size of the time window used in the time period division is set according to actual requirements or adapted to the density distribution of the recorded sampling points. The division yields M time periods, with M greater than or equal to 2. Individuals covered in all time periods, that is, individuals with at least one sampled position in every time period, are matched from the data set to form sub-data set D, which contains N individuals in total; the remaining individuals form data set D′.

It will be understood that sub-data set D is the spatial position data covering all time periods, and data set D′ is the spatial position data covering only some time periods. Spatial position data covering all time periods means that every time period contains position data for every sampled individual in that data; spatial position data covering only some time periods means that not every time period contains position data for every sampled individual in that data. Here, a sampled individual is a resident, that is, a user.
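
The split into D and D′ can be sketched in Python as follows. The sketch assumes equal-width time periods over a 24-hour day and sampling times given as seconds since midnight; the adaptive time window mentioned above is not modeled. The record layout and the names `period_index` and `split_by_coverage` are illustrative assumptions.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def period_index(seconds_of_day, num_periods):
    """Map a time of day to one of num_periods equal-width time periods."""
    width = SECONDS_PER_DAY / num_periods
    return min(int(seconds_of_day // width), num_periods - 1)


def split_by_coverage(records, num_periods):
    """Split individuals into full coverage (D) and partial coverage (D').

    records : iterable of (user_id, x, y, seconds_of_day) tuples
    returns : (full_users, partial_users, covered), where covered maps
              each user to the set of period indices with a sample
    """
    covered = defaultdict(set)
    for user_id, _x, _y, t in records:
        covered[user_id].add(period_index(t, num_periods))
    full = {u for u, periods in covered.items() if len(periods) == num_periods}
    partial = set(covered) - full
    return full, partial, dict(covered)
```

The size of each set in `covered` is the number of time periods m covered by that individual, which is the count needed later for users_m.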

Optionally, typical indices that characterize the activities of urban residents may be introduced, namely the radius of gyration R_g, the moving distance S and the entropy E, and the all-period typical index values {R_gu, S_u, E_u} of each individual u are calculated, where u = 1, 2, ..., N;

Typical indices: the radius of gyration R_g characterizes a resident's spatial activity range; the moving distance S characterizes the length of the resident's activity path within that range; the entropy E characterizes the difference and imbalance of the resident's visits to different spatial positions within that range;

Radius of gyration:

Moving distance:

Entropy:

In the above formulas, n is the total number of positions visited by each individual, (x_j, y_j) are the geodetic coordinates of the j-th sampling point, (x_c, y_c) is the centroid of all of the individual's sampling points, n′ is the number of distinct sampling positions, and p_i is the probability of occurrence of the i-th distinct sampling position;

The centroid (x_c, y_c) of all sampling points of each individual is calculated as:
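
The three typical indices and the centroid can be computed as in the following Python sketch. Because the formula images are missing from this text, the sketch assumes the standard definitions: a root-mean-square distance to the centroid for R_g, the summed Euclidean distance between consecutive sampling points for S, and Shannon entropy with a base-2 logarithm over distinct positions for E. These choices and all function names are assumptions.

```python
import math
from collections import Counter


def centroid(points):
    """Mean position (x_c, y_c) of a list of (x, y) sampling points."""
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n


def radius_of_gyration(points):
    """Root-mean-square distance of the sampling points to their centroid."""
    xc, yc = centroid(points)
    total = sum((x - xc) ** 2 + (y - yc) ** 2 for x, y in points)
    return math.sqrt(total / len(points))


def moving_distance(points):
    """Summed distance between consecutive points (points in time order)."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def entropy(points):
    """Shannon entropy (base 2) of the visit frequencies of distinct positions."""
    n = len(points)
    counts = Counter(points)  # one entry per distinct position
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For an individual who alternates between (0, 0) and (3, 4) over four samples, this gives R_g = 2.5, S = 15 and E = 1.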

步骤S102、从匹配所得的子数据集D中，依次挑选出涵盖m(2≤m≤M,m＝2,3,……,M)个时段内的位置记录，分别计算m个时段下的每个个体u的采样时段典型指标值{R_gmu’,S_mu’,E_mu’}，m＝2,3,……,M；Step S102: From the sub-data set D obtained by matching, sequentially select the location records covering m (2≤m≤M, m=2,3,...,M) time periods, and calculate the position records under m time periods respectively. Typical index values of the sampling period of each individual u {R _gmu' ,S _mu' ,E _mu' }, m=2,3,...,M;

在一个实施例中，本发明所提出的挑选m个时段应遵守的规则如下：In one embodiment, the rules to be followed for selecting m time periods proposed by the present invention are as follows:

对于指定数量的时段，随机挑选次数为大于等于C次(C根据实际需求或数据特性而定，推荐默认值为1000)；并且每次随机情况下的时段组合方式均只出现一次，且它们之间相互独立；如果由于指定的时段数量无法满足随机挑选次数的要求，则按照理论组合方式的上限进行全部挑选，并满足组合方式均只出现一次，且它们之间相互独立；由此可以得到随机的随机次数C。For a specified number of time periods, the number of random selections is greater than or equal to C times (C is determined according to actual needs or data characteristics, and the recommended default value is 1000); and the time period combination method in each random case occurs only once, and their are independent of each other; if the specified number of time periods cannot meet the requirements of the number of random selections, all selections are carried out according to the upper limit of the theoretical combination method, and the combination method only appears once, and they are independent of each other; thus, the random selection can be obtained. Random times C.

Specifically, for a given number of time periods, after the C random selections each time period must have been selected at least k times (k is set according to actual needs or data characteristics; the recommended default is 10).

Specifically, every time period is selected during the random process, and the selections are distributed fairly evenly so that the time periods appear in a balanced way. For example, suppose the data is divided into M = 24 time periods and m = 3 periods are selected each time. If the combinations (#2, #5, #6), (#2, #6, #7), (#2, #7, #16), (#2, #7, #19), (#2, #5, #7), (#2, #6, #11), (#2, #6, #12), (#2, #7, #17), (#2, #7, #18) and (#2, #7, #22) have already been selected, then period #2 has been selected 10 times, whereas periods #5, #6, #7, #11, #12, #16, #17, #18, #19 and #22 have each appeared fewer than k times and the other 13 periods have not appeared at all. In the remaining C-k random selections, priority is therefore given to the periods that have not yet been selected k times;

It can be understood that the rules above ensure the breadth and depth of the random selections, as well as the unbiasedness and balance of the random process.

Finally, for each random selection, the user location records falling within the m selected time periods are extracted and reassembled into a time-ordered location sequence.
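The selection rules above can be sketched as follows. This is only one possible reading, not the procedure fixed by the patent: the function name `draw_period_combinations`, the seeded generator, and the specific way of prioritising under-selected periods (filling each draw first from periods seen fewer than k times) are assumptions made for illustration. The sketch also does not prove that every period reaches k selections for arbitrary M, m, C and k; it only steers the draws in that direction.

```python
import itertools
import math
import random


def draw_period_combinations(M, m, C=1000, k=10, seed=None):
    """Draw distinct combinations of m periods out of M (periods numbered 1..M).

    If fewer than or exactly C combinations exist, all of them are returned.
    Otherwise C distinct combinations are drawn, preferring periods that have
    been selected fewer than k times so far.
    """
    rng = random.Random(seed)
    periods = list(range(1, M + 1))
    if math.comb(M, m) <= C:
        # Not enough distinct combinations: take every one exactly once.
        return list(itertools.combinations(periods, m))

    counts = {p: 0 for p in periods}
    chosen, seen = [], set()
    while len(chosen) < C:
        # Periods still below the coverage threshold k get priority.
        under = [p for p in periods if counts[p] < k]
        first = rng.sample(under, min(m, len(under)))
        rest = [p for p in periods if p not in first]
        combo = tuple(sorted(first + rng.sample(rest, m - len(first))))
        if combo in seen:
            # Duplicate: fall back to a uniform draw, and retry if that is also a duplicate.
            combo = tuple(sorted(rng.sample(periods, m)))
            if combo in seen:
                continue
        seen.add(combo)
        chosen.append(combo)
        for p in combo:
            counts[p] += 1
    return chosen


# Example matching the text: M = 24 periods, m = 3 periods per draw.
combos = draw_period_combinations(24, 3, C=1000, k=10, seed=0)
```

For each returned combination, the location records of a user that fall inside the selected periods would then be filtered and sorted by timestamp to form the sampled sequence.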

Step S103: For each number of time periods m, calculate the mean of the sampling-period typical index values $\{R_{g(m,c)u'}, S_{(m,c)u'}, E_{(m,c)u'}\}$ of each individual u, where m = 2, 3, ..., M and c = 1, 2, ..., C. The number of time periods m increases step by step from 2 to M, and C random selections are made for each value of m;

Specifically, the detailed flow of steps S102 and S103 is given in steps 2 to 6 of the Summary of the Invention and is not repeated here.

Step S104: The means of the sampling-period typical index values $\{R_{g(m,c)u'}, S_{(m,c)u'}, E_{(m,c)u'}\}$ are compared with the full-period typical index values $\{R_{gu}, S_u, E_u\}$ computed over all M complete time periods. The subscript u denotes a parameter of individual (resident or user) u.

The full-period typical index values of the N residents are paired one-to-one with the corresponding sampling-period typical index means to form the coordinate pairs $(R_{gu}, R_{g(m,c)u'})$, $(S_u, S_{(m,c)u'})$ and $(E_u, E_{(m,c)u'})$; the deviation between the full-period and sampling-period typical index values is then calculated.

The present invention proposes a quantitative deviation measurement model, given by:

$$F_m(X_u) = A_m X_u - B$$

Here the full-period typical index value $R_{gu}$, $S_u$ or $E_u$, calculated from all location records over the full period, is the independent variable $X_u$, and the mean of the sampling-period typical index values $R_{g(m,c)u'}$, $S_{(m,c)u'}$ or $E_{(m,c)u'}$, calculated from the location records in the randomly selected subset of time periods, is used in turn as the dependent variable y of the regression model; $A_m$ is the regression coefficient and B is a constant.

In addition, when the full-period typical index value is 0, the corresponding sampling-period typical index value is theoretically also 0, so B in the regression model is forced to 0;

The quality loss coefficient (Quality Loss, QL) is then obtained from the following formula:

$$QL_m = 1 - |A_m|$$

where $|A_m|$ is the absolute value of the coefficient $A_m$.
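As a hedged illustration of the model above with B forced to 0, the slope $A_m$ can be estimated by ordinary least squares through the origin, $A_m = \sum_u X_u y_u / \sum_u X_u^2$. The patent text does not name the estimator, so least squares is an assumption here, as are the function names and the sample values.

```python
def fit_slope_through_origin(x_full, y_sampled):
    """Least-squares slope A_m of y = A_m * x with the intercept B fixed at 0."""
    sxx = sum(x * x for x in x_full)
    if sxx == 0:
        raise ValueError("all full-period index values are zero; slope is undefined")
    return sum(x * y for x, y in zip(x_full, y_sampled)) / sxx


def quality_loss(x_full, y_sampled):
    """QL_m = 1 - |A_m| for one typical index and one number of periods m."""
    return 1.0 - abs(fit_slope_through_origin(x_full, y_sampled))


# Hypothetical values: full-period radius of gyration of four users (x) and the
# mean radius of gyration over C random draws of m periods (y).
x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 1.5, 2.0]
ql_m = quality_loss(x, y)
```

With these sample values the fitted slope is 0.5, so half of the full-period index magnitude is recovered and the quality loss coefficient is 0.5.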

The quality loss coefficient $QL_m$ lies in the interval from 0 to 1; applying this formula to the C groups of spatial position data of m time periods gives the coefficients $QL_{m\_Rg}$, $QL_{m\_S}$ and $QL_{m\_E}$ corresponding to the typical indices radius of gyration, moving distance and entropy;

Finally, for each number of time periods m, one quality loss coefficient value is obtained for each typical index.

Further, for each number of time periods m, the maximum (max), minimum (min), quartiles and standard deviation (std) of the quality loss coefficient values of each typical index (radius of gyration, moving distance and entropy) are calculated. From the maximum and minimum values, the distribution band of the quality loss coefficient is plotted and stored. Then, for the N individuals, the mean quality loss coefficients of each typical index over the successive numbers of time periods m are fitted with a curve or straight line $f_{Rg}$, $f_S$ and $f_E$ (the fitted relationship may be a linear function, an exponential function, a power function or another quantitative mathematical relationship). Figure 3 shows the fitted quality loss coefficient curves of the present invention.
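A minimal sketch of the summary statistics and of one of the fitting options mentioned above, a power function $QL = a \cdot m^b$ fitted by least squares on logarithms. The choice of the power form, the function names and the sample numbers are assumptions for illustration; the text leaves the functional form open. The log-based fit requires strictly positive coefficient values, so a point with $QL = 0$ would have to be excluded or a different form chosen.

```python
import math
import statistics


def summarize(ql_values):
    """Max, min, quartiles and sample standard deviation of QL values for one m."""
    q1, q2, q3 = statistics.quantiles(ql_values, n=4)
    return {
        "max": max(ql_values),
        "min": min(ql_values),
        "quartiles": (q1, q2, q3),
        "std": statistics.stdev(ql_values),
    }


def fit_power_law(ms, mean_qls):
    """Fit mean_ql = a * m**b by ordinary least squares on (log m, log mean_ql)."""
    lx = [math.log(m) for m in ms]
    ly = [math.log(q) for q in mean_qls]  # requires every value to be > 0
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b


# Hypothetical mean quality loss coefficients for m = 2, 4, 8, 16 periods.
ms = [2, 4, 8, 16]
mean_qls = [0.8 * m ** -0.5 for m in ms]
a, b = fit_power_law(ms, mean_qls)
stats = summarize([0.1, 0.2, 0.3, 0.4])
```

The fitted function can then be evaluated at any number of covered periods, and the max and min per m give the lower and upper edges of the distribution band.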

This yields, for each typical index, the mathematical rule describing how the loss coefficient evolves with the number of time periods. The fitted curves $f_{Rg}$, $f_S$ and $f_E$ can be used directly to screen out data sets with different estimation deviations, so that a lower bound on data quality can be set according to actual needs for applications such as urban spatial analysis, analysis of urban residents' mobility dynamics, and spatiotemporal pattern mining.

Step S105: Finally, the weighted quality loss coefficient of the entire data set can be calculated from $f_{Rg}$, $f_S$ and $f_E$ as follows:

For each individual in data set D', count the number of time periods covered by its sampling records and substitute that number into $f_{Rg}$, $f_S$ and $f_E$ in turn to calculate the quality loss coefficients of the individual's typical indices;

Set the quality loss coefficient of each individual in data set D to 0;

Using the numbers of users in data sets D and D' that cover each number of time periods as weights, calculate the weighted quality loss coefficient $w_{QL}$ of each typical index for the data set;

where $users_m$ is the number of users whose location records cover m time periods, $users$ is the total number of individuals, and $QL_m$ is the quality loss coefficient of one typical index for that number of time periods;
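The weighting formula itself is not reproduced in this text. One reading consistent with the symbol definitions is a user-count-weighted mean, $w_{QL} = \sum_m \frac{users_m}{users} QL_m$, and the sketch below implements that assumed form. The function name, the dictionary layout and the sample counts are hypothetical.

```python
def weighted_quality_loss(users_by_m, ql_by_m, total_users=None):
    """Assumed form: sum over m of (users_m / users) * QL_m for one typical index.

    users_by_m maps a number of covered periods m to the number of users covering
    exactly m periods; ql_by_m maps m to the quality loss coefficient for that m.
    If total_users is omitted, it is taken as the sum of users_by_m.
    """
    users = total_users if total_users is not None else sum(users_by_m.values())
    if users <= 0:
        raise ValueError("total number of users must be positive")
    return sum(n * ql_by_m[m] for m, n in users_by_m.items()) / users


# Hypothetical data set with M = 4: users covering all 4 periods belong to D
# and therefore carry a quality loss coefficient of 0.
users_by_m = {2: 10, 3: 30, 4: 60}
ql_rg_by_m = {2: 0.5, 3: 0.2, 4: 0.0}
w_ql_rg = weighted_quality_loss(users_by_m, ql_rg_by_m)
```

In this example the weighted coefficient for the radius of gyration is (10 × 0.5 + 30 × 0.2 + 60 × 0) / 100 = 0.11; the same function would be called once per typical index.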

From the weighted quality loss coefficients $w_{QL\_Rg}$, $w_{QL\_S}$ and $w_{QL\_E}$ of the typical indices radius of gyration, moving distance and entropy, the weighted quality loss coefficient $W_{QL}$ of the entire data can be calculated;

The weighted quality loss coefficient $W_{QL}$ of the entire sparsely sampled data is calculated as:

Specifically, the weighted quality loss coefficient $W_{QL}$ is used to evaluate the quality of the entire data set: the smaller the coefficient, the higher the overall quality of the data, and the larger the coefficient, the lower the overall quality. Any value within the distribution band of the quality loss coefficient corresponds to one sampling combination, and can therefore be used to guide the extraction of data records of a specified quality.

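Figure 4 shows a system for processing spatial position data provided by the present invention. As shown in Figure 4, the system comprises a data sampling unit 410, a full-period data processing unit 420, a partial-period data processing unit 430 and a data quality evaluation unit 440.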

The data sampling unit 410 is configured to establish an urban spatial database, import sparsely sampled spatial position data into the urban spatial database, and divide the sparsely sampled spatial position data into spatial position data covering the full period and spatial position data covering part of the period;

The full-period data processing unit 420 is configured to: divide the spatial position data covering the full period into time-series spatial position data of M time periods, M being a positive integer greater than or equal to 2; randomly select C groups of spatial position data of m time periods from the time-series spatial position data of the M time periods, such that the spatial position data of each of the M time periods is selected at least k times among the C groups, where m has an initial value of 2 and is a positive integer smaller than M, k is a positive integer less than or equal to C, and C is a positive integer; calculate the index values of each user for each group of spatial position data of m time periods, and average each user's C groups of index values to obtain that user's index values under m time periods; calculate the index values of each user over the full period from the time-series spatial position data of the M time periods, the index values comprising the user's spatial activity range, the length of the user's activity path within that range, and the difference and imbalance of the user's visits to different spatial positions within that range; determine the index value deviation corresponding to m time periods from each user's index values under m time periods and over the full period, and determine from that deviation the quality loss coefficient of each user's index values under m time periods; and, if m = M, end the processing, or, if m is smaller than M, increase m by 1 and continue randomly selecting C groups of spatial position data of m time periods from the time-series spatial position data of the M time periods, so that the quality loss coefficient of each user's index values under each number of time periods m is determined from the spatial position data covering the full period;

The partial-period data processing unit 430 is configured to determine, in the spatial position data covering part of the period, the number of users whose spatial position data covers m time periods, m taking each integer value from 2 to M;

The data quality evaluation unit 440 is configured to determine the weighted quality loss coefficient of the sparsely sampled spatial position data from the number of users covering m time periods, determined from the spatial position data covering part of the period, the quality loss coefficient of each user's index values under m time periods, determined from the spatial position data covering the full period, and the total number of users, where 2 ≤ m ≤ M.

In an optional embodiment, the full-period data processing unit 420 obtains the quality loss coefficient of each user's index values under each number of time periods m according to a deviation measurement model, the deviation measurement model being $F_m(X_u) = A_m X_u - B$, where $F_m(X_u)$ is the index value of each user u under the number of time periods m, $X_u$ is the index value of each user over the full period, $A_m$ is the regression coefficient and B is a constant. The quality loss coefficient $QL_m$ is determined by $QL_m = 1 - |A_m|$, where $|A_m|$ is the absolute value of the coefficient $A_m$. Substituting the regression coefficient corresponding to each index under each number of time periods m into this formula gives the quality loss coefficients $QL_{m\_Rg}$, $QL_{m\_S}$ and $QL_{m\_E}$ corresponding to the radius of gyration, moving distance and entropy index values of each user under each number of time periods m.

In an optional embodiment, the data quality evaluation unit 440 determines the quality loss coefficient $w_{QL}$ corresponding to each index of the sparsely sampled spatial position data by the following formula:

where $users_m$ is the number of users, in the spatial position data covering part of the period, whose spatial position data covers m time periods, and $users$ is the total number of users; $QL_m$ is the quality loss coefficient of each user's index value under m time periods and is specifically $QL_{m\_Rg}$, $QL_{m\_S}$ or $QL_{m\_E}$. From the quality loss coefficients $w_{QL\_Rg}$, $w_{QL\_S}$ and $w_{QL\_E}$ calculated for the radius of gyration, moving distance and entropy respectively, the weighted quality loss coefficient $W_{QL}$ of the sparsely sampled spatial position data is calculated:

Specifically, for the functions of each unit, reference may be made to the foregoing method embodiments; they are not repeated here.

Those skilled in the art will readily understand that the above are only preferred embodiments of the present invention and are not intended to limit it. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for processing spatial location data, comprising the steps of:

step 1, establishing an urban spatial database, and importing sparsely sampled spatial position data into the urban spatial database; dividing the sparsely sampled spatial position data into spatial position data covering a full period and spatial position data covering a partial period;

step 2, dividing the spatial position data covering the full period into time-series spatial position data of M time periods according to the time periods; M is a positive integer greater than or equal to 2;

step 3, randomly selecting C groups of spatial position data of m time periods from the time-series spatial position data of the M time periods, wherein the spatial position data of each time period in the time-series spatial position data of the M time periods is selected at least k times among the C groups of spatial position data of m time periods; m has an initial value of 2 and is a positive integer smaller than M; k is a positive integer less than or equal to C, and C is a positive integer;

step 4, calculating an index value of each user corresponding to the spatial position data of each group of m time periods, and averaging C groups of index values of each user in the m time periods to serve as the index values of each user in the m time periods; calculating the index value of each user corresponding to the time sequence space position data of the M time periods in the full time period; the index value includes: the spatial activity range of the user, the activity path length of the user in the spatial activity range, and the difference and the imbalance of the user at different spatial positions in the spatial activity range;

step 5, determining index value deviations corresponding to m time periods according to the index values of each user in m time periods and the index values of each user in the full time period, and determining the quality loss coefficient of the index value of each user in m time periods based on the index value deviations corresponding to m time periods of each user;

step 6, if m is equal to M, executing step 7, and if m is smaller than M, adding 1 to m to obtain a new value of m, and executing step 3;

step 7, determining, in the spatial position data covering the partial period, the number of users whose spatial position data covers m time periods, and determining the weighted quality loss coefficient of the sparsely sampled spatial position data according to the number of users covering m time periods determined from the spatial position data covering the partial period, the quality loss coefficient of the index value of each user in m time periods determined from the spatial position data covering the full period, and the number of all users, wherein m is greater than or equal to 2 and less than or equal to M.

2. The method for processing spatial location data according to claim 1, wherein the step 1 specifically includes the steps of:

importing the sparsely sampled spatial position data into the urban spatial database, and converting each item of spatial position data into a preset coordinate system, wherein each item of spatial position data comprises a sampling coordinate and a sampling time.

3. The method for processing spatial location data according to claim 1, wherein the step 2 specifically includes the steps of:

dividing the spatial position data covering the full period into spatial position data of M time periods according to actual requirements or in a self-adaptive manner.

4. The method for processing spatial location data according to claim 1, wherein the index value specifically includes:

the index of the spatial activity range is the radius of gyration $R_g$:

the index of the length of the activity path within the spatial activity range is the moving distance S:

the index of the difference and imbalance of visits to different spatial positions within the spatial activity range is the entropy E:

where n is the total number of spatial position samples of each user in each group of spatial position data of m time periods or in the spatial position data of the M time periods, $(x_j, y_j)$ are the coordinates of the j-th sampling point of each user, $(x_c, y_c)$ is the center of gravity of all sampling point positions of each user, n' is the number of distinct sampling points of each user, and $p_i$ is the probability of occurrence of the i-th distinct sampling point of each user;

the center of gravity $(x_c, y_c)$ of all sampling point positions of each user is calculated as:

5. The method for processing spatial location data according to claim 4, wherein the step 5 specifically includes the steps of:

obtaining, according to a deviation measurement model, the quality loss coefficient of the index value of each user under each number of time periods m, wherein the deviation measurement model is:

$$F_m(X_u) = A_m X_u - B$$

wherein $F_m(X_u)$ represents the index value of each user u under each number of time periods m, $X_u$ represents the index value of each user over the full period, $A_m$ is a regression coefficient, and B is a constant;

the quality loss coefficient $QL_m$ is determined by the following formula:

$$QL_m = 1 - |A_m|$$

wherein $|A_m|$ is the absolute value of the coefficient $A_m$; substituting the regression coefficient corresponding to each index value of each user under each number of time periods m into the formula gives the quality loss coefficients $QL_{m\_Rg}$, $QL_{m\_S}$ and $QL_{m\_E}$ corresponding to the radius of gyration index value, the moving distance index value and the entropy index value of each user under each number of time periods m.

6. The method for processing spatial location data according to claim 5, wherein the step 7 specifically includes the steps of:

determining the quality loss coefficient $w_{QL}$ corresponding to each index value of the sparsely sampled spatial position data by the following formula:

wherein $users_m$ represents the number of users, in the spatial position data covering the partial period, whose spatial position data covers m time periods, and $users$ represents the number of all users; $QL_m$ represents the quality loss coefficient of the index value of each user in m time periods, and specifically is $QL_{m\_Rg}$, $QL_{m\_S}$ or $QL_{m\_E}$;

calculating the weighted quality loss coefficient $W_{QL}$ of the sparsely sampled spatial position data from the quality loss coefficients $w_{QL\_Rg}$, $w_{QL\_S}$ and $w_{QL\_E}$ calculated for the radius of gyration, the moving distance and the entropy respectively:

7. A system for processing spatial location data, comprising:

a data sampling unit, configured to establish an urban spatial database and import sparsely sampled spatial position data into the urban spatial database, and to divide the sparsely sampled spatial position data into spatial position data covering a full period and spatial position data covering a partial period;

a full-period data processing unit, configured to: divide the spatial position data covering the full period into time-series spatial position data of M time periods according to the time periods, M being a positive integer greater than or equal to 2; randomly select C groups of spatial position data of m time periods from the time-series spatial position data of the M time periods, wherein the spatial position data of each time period in the time-series spatial position data of the M time periods is selected at least k times among the C groups of spatial position data of m time periods, m has an initial value of 2 and is a positive integer smaller than M, k is a positive integer less than or equal to C, and C is a positive integer; calculate an index value of each user corresponding to the spatial position data of each group of m time periods, and average the C groups of index values of each user in the m time periods to serve as the index value of each user in the m time periods; calculate the index value of each user over the full period from the time-series spatial position data of the M time periods, the index value including the spatial activity range of the user, the activity path length of the user in the spatial activity range, and the difference and imbalance of the user at different spatial positions in the spatial activity range; determine the index value deviation corresponding to m time periods according to the index value of each user in m time periods and the index value of each user over the full period, and determine the quality loss coefficient of the index value of each user in m time periods based on that deviation; and, if m is equal to M, end the processing, or, if m is smaller than M, add 1 to m to obtain a new value of m and continue to randomly select C groups of spatial position data of m time periods from the time-series spatial position data of the M time periods, so as to determine the quality loss coefficient of each user's index value under each number of time periods m according to the spatial position data covering the full period;

a partial time period data processing unit for determining the number of users covering the m time period spatial position data in the spatial position data covering the partial time period; m is an integer from 2 to M;

and a data quality evaluation unit, configured to determine the weighted quality loss coefficient of the sparsely sampled spatial position data according to the number of users covering m time periods of spatial position data determined from the spatial position data covering the partial period, the quality loss coefficient of the index value of each user in m time periods determined from the spatial position data covering the full period, and the number of all users, wherein m is greater than or equal to 2 and less than or equal to M.

8. The system for processing spatial location data according to claim 7, wherein the full-period data processing unit obtains the quality loss coefficient of the index value of each user under each number of time periods m according to a deviation measurement model, the deviation measurement model being $F_m(X_u) = A_m X_u - B$, wherein $F_m(X_u)$ represents the index value of each user u under each number of time periods m, $X_u$ represents the index value of each user over the full period, $A_m$ is a regression coefficient, and B is a constant; the quality loss coefficient $QL_m$ is determined by the formula $QL_m = 1 - |A_m|$, wherein $|A_m|$ is the absolute value of the coefficient $A_m$; substituting the regression coefficient corresponding to each index value of each user under each number of time periods m into the formula gives the quality loss coefficients $QL_{m\_Rg}$, $QL_{m\_S}$ and $QL_{m\_E}$ corresponding to the radius of gyration index value, the moving distance index value and the entropy index value of each user under each number of time periods m.

9. The system for processing spatial location data according to claim 7, wherein the data quality evaluation unit determines the quality loss coefficient $w_{QL}$ corresponding to each index value of the sparsely sampled spatial position data by the following formula:

10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out a method of processing spatial position data according to any one of claims 1 to 6.