CN117456003A - Category-level object 6D pose estimation method and system based on dynamic key point detection - Google Patents
Category-level object 6D pose estimation method and system based on dynamic key point detection
- Publication number
- CN117456003A (application CN202311546440.2A)
- Authority
- CN
- China
- Prior art keywords
- features
- point
- key
- scene
- key points
- Prior art date
- Legal status: Granted
Classifications

- G06T7/77—Determining position or orientation of objects or cameras using statistical methods
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06T2207/10004—Still image; Photographic image
- G06T2207/10028—Range image; Depth image; 3D point clouds
Abstract
Description
Technical Field

The present invention relates to the field of computer vision, and in particular to a category-level object 6D pose estimation method and system based on dynamic key point detection.
Background Art

Object 6D pose estimation is an important technology in computer vision and robotics: it precisely determines the pose of a three-dimensional object in six degrees of freedom, i.e., 3D translation and 3D rotation. The technique has broad uses in many applications, such as automated manufacturing, robotic manipulation, augmented reality, virtual reality, and autonomous driving.

Although instance-level 6D object pose estimation methods based on fixed key point detection already achieve good accuracy and robustness, they are effective only for a single known instance, which leaves them far from practical deployment. Category-level 6D object pose estimation methods based on a normalized category coordinate space were therefore proposed; such methods can estimate the poses of different instances within the same object category, generalize well, and are closer to the needs of real production.

Current category-level 6D object pose estimation methods fall roughly into two groups. The first directly regresses the pose parameters from extracted features. However, because the group of 3D rotation matrices is complex and non-convex, such methods are hard to optimize and often fail to reach the accuracy and robustness required by real application scenarios. The second is based on dense object-coordinate prediction: it estimates, for every point in the scene, its position in the normalized object coordinate system, and then uses these correspondences with PnP or a pose-prediction network as post-processing to output the object pose. These methods turn pose regression into per-point coordinate prediction, which is easier for a network to fit and optimize. Although this avoids the optimization difficulty of the 3D rotation group, real point clouds contain considerable noise, and directly predicting object-space coordinates for every scene point lets noisy points degrade network performance. Moreover, scene point clouds are often very large, so predicting all points incurs excessive computation and storage, which hinders practical deployment. A category-level object 6D pose estimation method based on dynamic key point detection is therefore proposed to solve these problems of existing methods.
Summary of the Invention

(1) Technical problems solved

To address the shortcomings of the prior art, the present invention provides a category-level object 6D pose estimation method and system based on dynamic key point detection, which can adaptively extract an object's key points from the observed scene and achieves good results even when the scene contains many noise points or severe occlusion.

(2) Technical solutions

To achieve the above objectives, the present invention is realized through the following technical solutions:
In a first aspect, a category-level object 6D pose estimation method based on dynamic key point detection is provided, including:

receiving image data of an object, the image data including an RGB image and a point cloud, the point cloud being formed by randomly sampling pixels of a depth map and projecting them into the scene using the camera intrinsic parameters;

extracting image features from the RGB image and point cloud features from the point cloud, and concatenating and fusing the two to obtain fused features;

feeding the fused features into a preset dynamic key point detection network to extract the key points of the object;

feeding the key points into a preset multi-scale pose prediction network, aggregating local structure information into the key points to obtain key points with multi-scale information, predicting each key point's position in the object coordinate system from these multi-scale key point features, concatenating the key points' positions in the scene, their features in the scene, and their positions and features in the object coordinate system to form multiple sets of correspondences, and outputting the final object 6D pose through a multi-layer perceptron.
Preferably, the feature extractor for the RGB image is a ResNet18 convolutional neural network.
Preferably, extracting the image features of the RGB image and the point cloud features of the point cloud specifically includes:

feeding the input RGB image into the ResNet18 convolutional neural network to extract the image feature map $f_{rgb} \in \mathbb{R}^{h \times w \times c}$;

feeding the point cloud into the Pointnet++ point cloud feature extraction network to extract its structural features $f_{point} \in \mathbb{R}^{N \times C}$, where $N$ is the number of points (a sketch of this two-branch extractor follows).
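As an illustration of the two-branch extractor, the following PyTorch sketch truncates a torchvision ResNet18 to produce a dense feature map and uses a shared per-point MLP as a stand-in for the Pointnet++ branch; the stand-in, the channel width, and all variable names are illustrative assumptions rather than the patented implementation:

```python
import torch
import torch.nn as nn
import torchvision

class TwoBranchExtractor(nn.Module):
    """RGB branch: truncated ResNet18 -> dense feature map f_rgb.
    Point branch: per-point MLP stand-in for Pointnet++ -> f_point."""
    def __init__(self, c: int = 128):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Keep layers up to layer2: a stride-8, 128-channel feature map.
        self.rgb_backbone = nn.Sequential(*list(resnet.children())[:6])
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, c))

    def forward(self, rgb, points):
        f_rgb = self.rgb_backbone(rgb)    # (B, c, h, w), h = H/8, w = W/8
        f_point = self.point_mlp(points)  # (B, N, c)
        return f_rgb, f_point

extractor = TwoBranchExtractor()
f_rgb, f_point = extractor(torch.randn(1, 3, 480, 640),   # RGB image
                           torch.randn(1, 1024, 3))       # N sampled points
```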
Preferably, concatenating and fusing the image features and the point cloud features to obtain the fused features specifically includes:

projecting the scene points onto the image feature map using the camera intrinsic parameters, and extracting by bilinear interpolation the corresponding image features $f_{point \to rgb} \in \mathbb{R}^{N \times C}$ for each point;

concatenating the image features and the point cloud structural features and passing them through a multi-layer MLP to obtain the fused features $f_{fusion} \in \mathbb{R}^{N \times C}$ (a projection-and-fusion sketch follows).
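A minimal sketch of this fusion step, assuming a pinhole intrinsic matrix K and a stride-8 feature map; torch.nn.functional.grid_sample performs the bilinear interpolation, and the normalization convention and MLP widths are assumptions:

```python
import torch
import torch.nn.functional as F

def fuse_features(points, f_rgb, f_point, K, fusion_mlp):
    """points: (B, N, 3) camera-frame points; f_rgb: (B, C, h, w) image
    feature map at stride 8; f_point: (B, N, C); K: (3, 3) intrinsics."""
    B, N, _ = points.shape
    _, C, h, w = f_rgb.shape
    # Pinhole projection to pixel coordinates: u = fx*X/Z + cx, v = fy*Y/Z + cy.
    z = points[..., 2].clamp(min=1e-6)
    u = K[0, 0] * points[..., 0] / z + K[0, 2]
    v = K[1, 1] * points[..., 1] / z + K[1, 2]
    # Normalize to [-1, 1] for grid_sample; the map covers an 8h x 8w image.
    grid = torch.stack([u / (8 * w) * 2 - 1, v / (8 * h) * 2 - 1], dim=-1)
    f_proj = F.grid_sample(f_rgb, grid.unsqueeze(1),      # bilinear sampling
                           mode="bilinear", align_corners=False)
    f_proj = f_proj.squeeze(2).transpose(1, 2)            # (B, N, C)
    # Concatenate the two modalities and fuse with a multi-layer MLP.
    return fusion_mlp(torch.cat([f_proj, f_point], dim=-1))

fusion_mlp = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128))
```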
Preferably, feeding the fused features into the preset dynamic key point detection network to extract the key points of the object specifically includes:

introducing an attention mechanism and Transformer layers to dynamically detect object key points, where $f_{kpt} \in \mathbb{R}^{N_s \times C}$ denotes $N_s$ KPT queries that are randomly initialized and continuously updated during training, representing the $N_s$ key points in the scene;

interacting the queries representing the different key points with the fused scene features $f_{fusion} \in \mathbb{R}^{N \times C}$ through a cross-attention layer, and updating the KPT queries scene-adaptively:

$f'_{kpt} = \mathrm{MHCA}(f_{fusion};\ f_{kpt}),$

then, with a similarity-based heat map generation strategy, computing the similarity between each KPT query and the scene points and generating the 3D positions and 3D features of the key points by heat map weighting:

$\mathrm{heatmap} = \mathrm{Softmax}(\mathrm{Similarity}(f'_{kpt},\ f_{fusion}))$

where $\mathrm{heatmap} \in \mathbb{R}^{N_s \times N}$ is the weight map encoding each key point detector's similarity to the scene points, and the finally detected key point coordinates (and, analogously, the key point features) are obtained as the heatmap-weighted sum over the scene points (see the detection sketch below).
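One plausible reading of this module is sketched below, assuming MHCA is standard multi-head cross-attention (the KPT queries attend to the fused scene features) and Similarity is a scaled dot product; the head count, dimensions, and class name are assumptions:

```python
import torch
import torch.nn as nn

class DynamicKeypointDetector(nn.Module):
    def __init__(self, num_kpt: int = 16, c: int = 128):
        super().__init__()
        # N_s randomly initialized KPT queries, updated during training.
        self.kpt_query = nn.Parameter(torch.randn(num_kpt, c))
        self.mhca = nn.MultiheadAttention(c, num_heads=4, batch_first=True)

    def forward(self, f_fusion, points):
        """f_fusion: (B, N, C) fused scene features; points: (B, N, 3)."""
        B = f_fusion.shape[0]
        q = self.kpt_query.unsqueeze(0).expand(B, -1, -1)    # (B, N_s, C)
        # Scene-adaptive update: queries cross-attend to the scene features.
        f_kpt, _ = self.mhca(q, f_fusion, f_fusion)          # (B, N_s, C)
        # Scaled dot-product similarity between queries and scene points.
        sim = f_kpt @ f_fusion.transpose(1, 2) / f_kpt.shape[-1] ** 0.5
        heatmap = sim.softmax(dim=-1)                        # (B, N_s, N)
        # Heatmap-weighted 3D positions and features of the keypoints.
        p_kpt = heatmap @ points                             # (B, N_s, 3)
        f_kpt3d = heatmap @ f_fusion                         # (B, N_s, C)
        return p_kpt, f_kpt3d, heatmap
```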
Preferably, feeding the key points into the preset multi-scale pose prediction network and aggregating local structure information into the key points to obtain key points with multi-scale information specifically includes:

for each detected 3D key point $p_{kpt}^{i}$, extracting the fused features of its $k$ nearest scene points and aggregating the local structure information into the key point through cross attention:

$f_{kpt}^{local} = \mathrm{MHCA}\big(\mathrm{index}(f_{fusion},\ \mathrm{knn}(p_{kpt}, P));\ f'_{kpt}\big)$

where $\mathrm{knn}$ denotes the k-nearest neighbors in Euclidean space, $\mathrm{index}$ denotes the indexing operation, and $P$ the scene points (a kNN-gather sketch follows).
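A hedged sketch of the knn, index, and aggregation steps; the neighbor count k and the use of nn.MultiheadAttention for the cross attention are assumptions:

```python
import torch
import torch.nn as nn

def aggregate_local(p_kpt, f_kpt, points, f_fusion, mhca, k=16):
    """p_kpt: (B, Ns, 3); f_kpt: (B, Ns, C); points: (B, N, 3);
    f_fusion: (B, N, C); mhca: nn.MultiheadAttention with batch_first=True."""
    B, Ns, _ = p_kpt.shape
    C = f_fusion.shape[-1]
    # knn: indices of the k nearest scene points (Euclidean) per keypoint.
    idx = torch.cdist(p_kpt, points).topk(k, dim=-1, largest=False).indices
    # index: gather the fused features of those neighbors -> (B, Ns, k, C).
    f_local = torch.gather(
        f_fusion.unsqueeze(1).expand(B, Ns, -1, C), 2,
        idx.unsqueeze(-1).expand(B, Ns, k, C))
    # Cross attention: each keypoint query attends to its k neighbors.
    out, _ = mhca(f_kpt.reshape(B * Ns, 1, C),
                  f_local.reshape(B * Ns, k, C),
                  f_local.reshape(B * Ns, k, C))
    return out.reshape(B, Ns, C)  # keypoint features with local, multi-scale info

mhca = nn.MultiheadAttention(128, num_heads=4, batch_first=True)
```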
Preferably, predicting each key point's position in the object coordinate system from the multi-scale key point features, concatenating the output key point positions in the scene, key point features in the scene, and key point positions and features in the object coordinate system to form multiple sets of correspondences, and outputting the final object 6D pose through a multi-layer perceptron specifically includes:

predicting each key point's position in the object coordinate space from its key point features:

$p_{obj} = \mathrm{MLP}(f_{kpt}^{local})$

and concatenating the output key point positions in the scene, the key point features in the scene, and the key point positions and features in the object coordinate system to form $N_s$ sets of correspondences, from which a multi-layer perceptron outputs the final object 6D pose:

$\mathrm{pose} = \mathrm{MLP}\big(\mathrm{concat}(p_{kpt},\ f_{kpt},\ p_{obj},\ f_{obj})\big)$

(a correspondence-head sketch follows).
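A minimal sketch of this correspondence head; the object-space branch as a small MLP, the pooling over correspondences, and the output parameterization (a continuous 6D rotation representation plus translation) are assumptions, since the patent states only that an MLP outputs the 6D pose:

```python
import torch
import torch.nn as nn

class CorrespondencePoseHead(nn.Module):
    def __init__(self, c: int = 128):
        super().__init__()
        # Predict each keypoint's object-space position (and features).
        self.obj_head = nn.Sequential(nn.Linear(c, c), nn.ReLU(),
                                      nn.Linear(c, 3 + c))
        # MLP over the Ns correspondences -> 6D rotation rep + translation.
        self.pose_mlp = nn.Sequential(nn.Linear(3 + c + 3 + c, 256),
                                      nn.ReLU(), nn.Linear(256, 9))

    def forward(self, p_kpt, f_kpt):
        """p_kpt: (B, Ns, 3) scene positions; f_kpt: (B, Ns, C) features."""
        obj = self.obj_head(f_kpt)
        p_obj, f_obj = obj[..., :3], obj[..., 3:]
        # One correspondence per keypoint: scene and object positions/features.
        corr = torch.cat([p_kpt, f_kpt, p_obj, f_obj], dim=-1)
        pose = self.pose_mlp(corr).mean(dim=1)  # pool over Ns correspondences
        rot6d, trans = pose[..., :6], pose[..., 6:]
        return rot6d, trans, p_obj
```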
In a second aspect, a category-level object 6D pose estimation system based on dynamic key point detection is provided, the system including:

a receiving module, configured to receive image data of an object, the image data including an RGB image and a point cloud, the point cloud being formed by randomly sampling pixels of a depth map and projecting them into the scene using the camera intrinsic parameters;

a feature extraction and fusion module, configured to extract image features from the RGB image and point cloud features from the point cloud, and to concatenate and fuse the two to obtain fused features;

a key point extraction module, configured to feed the fused features into a preset dynamic key point detection network to extract the key points of the object;

a processing and output module, configured to feed the key points into a preset multi-scale pose prediction network, aggregate local structure information into the key points to obtain key points with multi-scale information, predict each key point's position in the object coordinate system from these multi-scale key point features, concatenate the key points' positions in the scene, their features in the scene, and their positions and features in the object coordinate system to form multiple sets of correspondences, and output the final object 6D pose through a multi-layer perceptron.
In a third aspect, a computer-readable storage medium storing one or more programs is provided, the one or more programs including instructions that, when executed by a computing device, cause the computing device to perform any of the methods described above.

In a fourth aspect, a computing device is provided, including:

one or more processors, a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.

(3) Beneficial effects

The category-level object 6D pose estimation method and system based on dynamic key point detection of the present invention can adaptively extract an object's key points from the observed scene and achieve good results even when the scene contains many noise points or severe occlusion. In addition, two modules are designed: one that aggregates the local features around each key point, and a correspondence-based pose prediction network. Together they better extract the local spatial-geometric features around the key points and regress the object pose from the correspondences. The model is trained in a digital twin simulation system and substantially improves the accuracy of category-level object 6D pose estimation on existing datasets.
Description of the Drawings

Figure 1 is a flow chart of the category-level object 6D pose estimation method based on dynamic key point detection of the present invention;

Figure 2 is an explanatory diagram of the method in an embodiment of the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment
As shown in Figures 1 and 2, an embodiment of the present invention provides a category-level object 6D pose estimation method based on dynamic key point detection, including:

receiving image data of an object, the image data including an RGB image and a point cloud, the point cloud being formed by randomly sampling pixels of a depth map and projecting them into the scene using the camera intrinsic parameters;

extracting image features from the RGB image and point cloud features from the point cloud, and concatenating and fusing the two to obtain fused features;

feeding the fused features into a preset dynamic key point detection network to extract the key points of the object;

feeding the key points into a preset multi-scale pose prediction network, aggregating local structure information into the key points to obtain key points with multi-scale information, predicting each key point's position in the object coordinate system from these multi-scale key point features, concatenating the key points' positions in the scene, their features in the scene, and their positions and features in the object coordinate system to form multiple sets of correspondences, and outputting the final object 6D pose through a multi-layer perceptron.
Further, for the RGB-D input, ResNet18 is used as the feature extractor for the RGB image. For the depth map D, pixels are randomly sampled and back-projected into the scene using the camera intrinsics to form a point cloud (a back-projection sketch follows below). For the different input modalities, the network first feeds the RGB image into the ResNet18 convolutional neural network to extract the image feature map $f_{rgb} \in \mathbb{R}^{h \times w \times c}$. The point cloud is fed into the Pointnet++ feature extraction network to extract its structural features $f_{point} \in \mathbb{R}^{N \times C}$, where $N$ is the number of points. The scene point cloud is then projected onto the image feature map using the camera intrinsics, and the corresponding features $f_{point \to rgb} \in \mathbb{R}^{N \times C}$ are extracted by bilinear interpolation. Finally, the features of the two modalities are concatenated and passed through a multi-layer MLP to obtain the fused features $f_{fusion} \in \mathbb{R}^{N \times C}$.
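As a hedged illustration of the depth-to-point-cloud step only (the feature extraction and fusion steps are sketched earlier), the function below back-projects randomly sampled depth pixels through a pinhole intrinsic matrix; the sampling count and the pinhole assumption are illustrative:

```python
import torch

def depth_to_points(depth, K, n_samples=1024):
    """depth: (H, W) depth map in meters; K: (3, 3) pinhole intrinsics.
    Returns up to n_samples camera-frame 3D points from sampled pixels."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    valid = depth > 0                        # keep pixels with valid depth
    u, v, z = u[valid].float(), v[valid].float(), depth[valid]
    # Random sampling of pixels, as described in the patent.
    sel = torch.randperm(z.numel())[:n_samples]
    u, v, z = u[sel], v[sel], z[sel]
    # Inverse pinhole projection: X = (u - cx)Z/fx, Y = (v - cy)Z/fy.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return torch.stack([x, y, z], dim=-1)    # (n_samples, 3)
```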
Further, for each scene input, after the multi-modal fused features are extracted, a dynamic key point detection network as shown in Figure 1 is designed to adaptively extract object key points from the scene. To achieve adaptive, dynamic detection of key points across different scenes, an attention mechanism and Transformer layers are introduced. $f_{kpt} \in \mathbb{R}^{N_s \times C}$ denotes $N_s$ KPT queries that are randomly initialized and continuously updated during training, representing the $N_s$ key points in the scene. These queries, each representing a different key point, interact with the fused scene features $f_{fusion} \in \mathbb{R}^{N \times C}$ through a cross-attention layer, and the KPT queries are updated scene-adaptively:

$f'_{kpt} = \mathrm{MHCA}(f_{fusion};\ f_{kpt}),$

The updated KPT queries aggregate scene-adaptive features and feed the subsequent key point detection module. Next, a similarity-based heat map generation strategy is used: after the similarity between each KPT query and the scene points is computed, the 3D positions and 3D features of the key points are generated by heat map weighting. Specifically:

$\mathrm{heatmap} = \mathrm{Softmax}(\mathrm{Similarity}(f'_{kpt},\ f_{fusion}))$

where $\mathrm{heatmap} \in \mathbb{R}^{N_s \times N}$ is the weight map encoding each key point detector's similarity over the scene, and the finally detected key point coordinates are obtained as the heatmap-weighted sum over the scene points. Dynamically detected key points adapt to different scenes and changes: no matter how the target object's position, viewing angle, or lighting conditions change, the key points can still be detected accurately, and the same set of key points generalizes to different instances of the same category. This makes the model more generalizable, better suited to category-level object pose estimation, and enables more accurate and robust pose estimation downstream.
Further, a multi-scale pose prediction network is designed, consisting mainly of two modules: a local feature aggregation module and a correspondence-based pose prediction network.
Local feature aggregation module. To let each key point better capture local information in the scene and thereby produce multi-scale features, a local feature aggregation module at the key point locations is proposed. Specifically, for each detected 3D key point $p_{kpt}^{i}$, the fused features of its $k$ nearest scene points are extracted, and the local structure information is aggregated into the key point through cross attention:

$f_{kpt}^{local} = \mathrm{MHCA}\big(\mathrm{index}(f_{fusion},\ \mathrm{knn}(p_{kpt}, P));\ f'_{kpt}\big)$

where $\mathrm{knn}$ denotes the k-nearest neighbors in Euclidean space and $\mathrm{index}$ denotes the indexing operation. Aggregating local features through local attention gives the key points multi-scale information and enables better prediction of the poses of scene objects.
Correspondence-based pose prediction network. To regress the object pose from the correspondences output by the network, a deep neural network is used to emulate the traditional least-squares algorithm, making the pose obtained from the fitted correspondences more robust. The method first predicts each key point's position in the object coordinate space from its key point features:

$p_{obj} = \mathrm{MLP}(f_{kpt}^{local})$

It then concatenates the output key point positions in the scene, the key point features in the scene, and the key point positions and features in the object coordinate system to form $N_s$ sets of correspondences, and outputs the final object pose through a multi-layer perceptron. Specifically:

$\mathrm{pose} = \mathrm{MLP}\big(\mathrm{concat}(p_{kpt},\ f_{kpt},\ p_{obj},\ f_{obj})\big)$

Prediction based on correspondences matches the mathematical structure of least-squares fitting of object coordinate pairs, which makes it easier for the network to learn the pose mapping. Moreover, the correspondences produced from the adaptive key points remove the influence of noisy scene points by selecting only the most representative key points, so the overall computation is more accurate, more robust, and more efficient (the underlying least-squares objective is stated below).
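For reference, a hedged statement of the classical similarity-transform least-squares objective that the correspondence MLP emulates; the scale factor $s$ follows the standard Umeyama formulation and is an assumption, since the patent mentions only least-squares fitting of object coordinate pairs:

$\min_{s,\ R \in SO(3),\ t}\ \sum_{i=1}^{N_s} \big\| p_{kpt}^{i} - \big(s\,R\,p_{obj}^{i} + t\big) \big\|_2^{2}$

Here $p_{obj}^{i}$ is the predicted object-space position of the $i$-th key point and $p_{kpt}^{i}$ its detected scene position; this problem admits a closed-form SVD solution, which the learned MLP replaces with a mapping that is more tolerant of noisy correspondences.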
This patent also designs a method of using digital twin simulation to assist model training and to verify model performance. To collect 6D object pose data, virtual RGB-D sensors are deployed in the digital twin environment; these sensors simulate real-world perception devices such as cameras and LiDAR. The virtual sensors record object positions and attitudes in real time and generate a large amount of simulation data, which is used to train and validate the deep learning model. Simulation experiments in the digital twin environment verify the model's performance, including its accuracy, robustness, and generalization ability.

In this patent, the ResNet convolutional neural network and the Pointnet++ point cloud feature extraction network of the RGB-D multi-modal feature extraction backbone are prior art described in the background. On top of them, this patent adds an adaptive key point detection network, designs a new pose estimation network based on local feature aggregation, and conducts simulation experiments in a digital twin environment to verify the effectiveness of the technique.
Yet another embodiment of the present invention provides a category-level object 6D pose estimation system based on dynamic key point detection, the system including:

a receiving module, configured to receive image data of an object, the image data including an RGB image and a point cloud, the point cloud being formed by randomly sampling pixels of a depth map and projecting them into the scene using the camera intrinsic parameters;

a feature extraction and fusion module, configured to extract image features from the RGB image and point cloud features from the point cloud, and to concatenate and fuse the two to obtain fused features;

a key point extraction module, configured to feed the fused features into a preset dynamic key point detection network to extract the key points of the object;

a processing and output module, configured to feed the key points into a preset multi-scale pose prediction network, aggregate local structure information into the key points to obtain key points with multi-scale information, predict each key point's position in the object coordinate system from these multi-scale key point features, concatenate the key points' positions in the scene, their features in the scene, and their positions and features in the object coordinate system to form multiple sets of correspondences, and output the final object 6D pose through a multi-layer perceptron.
Embodiments of the present application may be provided as a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code. The solutions in the embodiments of the present application may be implemented in various computer languages, for example, the object-oriented programming language Java and the interpreted scripting language JavaScript.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311546440.2A CN117456003B (en) | 2023-11-20 | 2023-11-20 | Category-level object 6D pose estimation method and system based on dynamic key point detection |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117456003A | 2024-01-26 |
| CN117456003B | 2025-06-10 |
Family
ID=89587443
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311546440.2A Active CN117456003B (en) | 2023-11-20 | 2023-11-20 | Category-level object 6D pose estimation method and system based on dynamic key point detection |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117456003B (en) |
Patent Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11688139B1 (en) * | 2019-03-22 | 2023-06-27 | Bertec Corporation | System for estimating a three dimensional pose of one or more persons in a scene |
| CN112270249A (en) * | 2020-10-26 | 2021-01-26 | 湖南大学 | Target pose estimation method fusing RGB-D visual features |
| KR20220081261A (en) * | 2020-12-08 | 2022-06-15 | 삼성전자주식회사 | Method and apparatus for object pose estimation |
| CN114359377A (en) * | 2021-12-31 | 2022-04-15 | 清华大学深圳国际研究生院 | A real-time 6D pose estimation method and computer-readable storage medium |
| US20230316563A1 (en) * | 2022-04-05 | 2023-10-05 | Bluewrist Inc. | Systems and methods for pose estimation via radial voting based keypoint localization |
| CN114663514A (en) * | 2022-05-25 | 2022-06-24 | 浙江大学计算机创新技术研究院 | Object 6D attitude estimation method based on multi-mode dense fusion network |
| CN115147599A (en) * | 2022-06-06 | 2022-10-04 | 浙江大学 | A six-degree-of-freedom pose estimation method for multi-geometric feature learning for occluded and truncated scenes |
| CN115601430A (en) * | 2022-10-27 | 2023-01-13 | 西安交通大学(Cn) | Texture-free high-reflection object pose estimation method and system based on key point mapping |
| CN116152799A (en) * | 2023-01-18 | 2023-05-23 | 美的集团(上海)有限公司 | Image pose processing method and device, readable storage medium and robot |
| CN116580085A (en) * | 2023-03-13 | 2023-08-11 | 联通(上海)产业互联网有限公司 | Deep learning algorithm for 6D pose estimation based on attention mechanism |
| CN116630394A (en) * | 2023-07-25 | 2023-08-22 | 山东中科先进技术有限公司 | Multi-mode target object attitude estimation method and system based on three-dimensional modeling constraint |
| CN116958958A (en) * | 2023-07-31 | 2023-10-27 | 中国科学技术大学 | Self-adaptive class-level object attitude estimation method based on graph convolution double-flow shape prior |
Non-Patent Citations (6)
| Title |
|---|
| SHENG, Xihua, et al., "Deep-PCAC: An End-to-End Deep Lossy Compression Framework for Point Cloud Attributes", IEEE Transactions on Multimedia, 31 December 2022 |
| WANG, Chen, et al., "6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints", arXiv, 23 October 2019 |
| XIAO LIN, et al., "Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation", 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16 September 2024 |
| ZOU, L., et al., "6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-Based Instance Representation Learning", IEEE Transactions on Image Processing, 31 December 2022 |
| WANG, Taiyong, et al., "Six-degree-of-freedom pose estimation method based on key point feature fusion", Journal of Tianjin University (Science and Technology), 7 March 2022 |
| WANG, Tao, "Laser point cloud recognition of robot grasping targets", China Master's Theses Full-text Database, 15 February 2023 |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118247351A (en) * | 2024-05-23 | 2024-06-25 | 浙江大学 | Real-time object three-dimensional pose estimation method based on multi-frame monocular camera |
| CN118799393A (en) * | 2024-06-20 | 2024-10-18 | 哈尔滨工业大学 | Bidirectional fusion 6D object pose estimation method |
| CN118799393B (en) * | 2024-06-20 | 2025-06-06 | 哈尔滨工业大学 | Bidirectional fusion 6D object pose estimation method |
| CN120107356A (en) * | 2025-02-17 | 2025-06-06 | 同济大学 | A method and system for class-level object pose estimation |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117456003B (en) | 2025-06-10 |
Similar Documents

| Publication | Title |
|---|---|
| CN113205466B (en) | A residual defect cloud completion method based on latent space topological structure constraints |
| CN111507222B (en) | A framework for 3D object detection based on multi-source data knowledge transfer |
| CN117456003A (en) | Category-level object 6D pose estimation method and system based on dynamic key point detection |
| CN114255238A (en) | Three-dimensional point cloud scene segmentation method and system fusing image features |
| CN113450408A (en) | Irregular object pose estimation method and device based on depth camera |
| CN114663502A (en) | Object posture estimation and image processing method and related equipment |
| Wang et al. | Stream query denoising for vectorized hd-map construction |
| US12175698B2 | Method and apparatus with object pose estimation |
| CN112149590A (en) | A method of hand key point detection |
| CN113313176B (en) | A point cloud analysis method based on dynamic graph convolutional neural network |
| CN114612494B (en) | A design method for visual odometry of mobile robots in dynamic scenes |
| CN113592015B (en) | Method and device for positioning and training feature matching network |
| CN112562001B (en) | Method, device, equipment and medium for 6D pose estimation of an object |
| CN115205654A (en) | A novel monocular vision 3D object detection method based on keypoint constraints |
| CN117593368A (en) | 6D pose estimation method based on iterative attention fusion network |
| Wu et al. | SC-WLS: Towards interpretable feed-forward camera re-localization |
| CN116182894A (en) | A monocular visual odometer method, device, system and storage medium |
| CN114266967A (en) | Cross-source remote sensing data target identification method based on symbolic distance characteristics |
| CN113888629A (en) | RGBD camera-based rapid object three-dimensional pose estimation method |
| CN110634160B (en) | 3D keypoint extraction model construction and pose recognition method of target in 2D graphics |
| CN116912238B (en) | Weld joint pipeline identification method and system based on multidimensional identification network cascade fusion |
| CN117710645A (en) | Dynamic scene VSLAM optimization method based on fusion attention mechanism and lightweight neural network |
| CN112818965B (en) | Multi-scale image target detection method and system, electronic equipment and storage medium |
| CN119762584A (en) | A target 6D pose estimation method guided by neighborhood perception information |
| Yuan et al. | SHREC 2020 track: 6D object pose estimation |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |