CN111353506B

CN111353506B - Adaptive line of sight estimation method and device

Info

Publication number: CN111353506B
Application number: CN201811582119.9A
Authority: CN
Inventors: 郭天楚; 张辉; 刘夏冰; 韩在濬; 崔昌圭
Original assignee: Samsung Electronics Co Ltd
Current assignee: Samsung Electronics Co Ltd
Priority date: 2018-12-24
Filing date: 2018-12-24
Publication date: 2024-10-01
Anticipated expiration: 2038-12-24
Also published as: KR20200079170A; KR102868991B1; CN111353506A

Abstract

An adaptive line-of-sight estimation method and apparatus are provided. The sight line estimation method comprises the following steps: acquiring characteristics of data for calibration and characteristics of data for estimation; line-of-sight estimation is performed by the acquired features. According to the sight line estimation method and apparatus of the present disclosure, the sight line estimation effect for a specific person can be improved.

Description

Adaptive line-of-sight estimation method and apparatus

Technical Field

The present disclosure relates generally to the field of gaze estimation, and more particularly, to an adaptive gaze estimation method and apparatus.

Background

The estimation method in the related art generally uses a basic model by which the line of sight of any person can be estimated. However, the basic model is generally established by fitting a large number of human eye images as training data to obtain a common parameter, and since the eye shapes of each user are different, the prediction effect of the basic model is good (for example, the accuracy of the prediction result is high) for a person whose eye shape is similar to the eye shape corresponding to the common parameter, and conversely, the prediction effect of the basic model is poor for a person having a large difference between the eye shape and the eye shape corresponding to the common parameter.

Disclosure of Invention

Exemplary embodiments of the present disclosure provide an adaptive gaze estimation method and apparatus to solve the problem of low accuracy of gaze estimation results for a specific person in the prior art.

According to an exemplary embodiment of the present disclosure, an adaptive line-of-sight estimation method is provided. The sight line estimation method comprises the following steps: acquiring characteristics of data for calibration and characteristics of data for estimation; line-of-sight estimation is performed by the acquired features.

Optionally, the line-of-sight estimation method further includes: acquiring a neural network model, wherein the step of acquiring the characteristics comprises: features of the data for calibration are extracted by the neural network model and/or features of the data for estimation are extracted by the neural network model.

Optionally, the step of obtaining the neural network model includes: the neural network model is trained with data for training.

Optionally, the data for training includes a first user image and a second user image, where the first user image and the second user image are an image of the same user when looking at the first object and an image of the same user when looking at the second object, respectively; wherein training the neural network model comprises: the neural network model is trained by taking as input a first user image and a second user image, and taking as output a relative position of a first object and a second object.

Optionally, the data for training includes image-related data for training and a line-of-sight tag for training, and the step of training the neural network model includes: converting the sight tag used for training into a two-class tag; determining a loss function corresponding to the classification label; training a first neural network model by image-related data for training, the classification labels, and the loss function; wherein the step of converting the line-of-sight tag for training into a two-class tag comprises: determining a coordinate Y _a of the sight line label for training on a specific coordinate axis, wherein Y _amin≤Y_a≤Y_amax,Y_amin and Y _amax are the minimum value and the maximum value of the coordinate Y _a respectively; setting a plurality of nodes at a predetermined pitch on the specific coordinate axis, wherein the size of the predetermined pitch is bin_size, generating a bi-classified label including a vector having a dimension which is the number of the plurality of nodes, wherein the value of each dimension of the vector is determined by the size of the predetermined pitch and the coordinate Y _a, wherein the loss function is calculated by the value of each dimension of the vector, and an activation probability calculated by data for training corresponding to each node.

Optionally, the data for training includes image-related data for training and a line-of-sight tag for training, and the step of training the neural network model includes: extracting two pairs of samples from the image-related data for training and the line-of-sight tag for training, wherein the two pairs of samples correspond to the same user, each pair of samples comprising one image-related data for training and one corresponding line-of-sight tag for training, the difference between the two line-of-sight tags of the two pairs of samples being greater than a first threshold and less than a second threshold; a second neural network model is trained through the two pairs of samples.

Optionally, the step of training the neural network model further includes: extracting two other pairs of samples by the step of extracting two pairs of samples, wherein a difference between two line-of-sight labels of the two other pairs of samples is greater than a third threshold and less than a fourth threshold, wherein the third threshold is greater than or equal to the first threshold and the fourth threshold is less than or equal to the second threshold; continuing to train the second neural network model through the other two pairs of samples, wherein the step of extracting two pairs of samples is performed at least twice such that a difference between two line-of-sight labels of two pairs of samples extracted each time is smaller than a difference between two line-of-sight labels of two pairs of samples extracted the previous time.

Optionally, before training the second neural network model, the line-of-sight estimation method further includes: setting parameters of a second neural network model based on the first neural network model, wherein the second neural network model and the first neural network model have the same network layer for feature extraction, the step of training the second neural network model by the two pairs of samples comprising: the classifier of the second neural network model is trained by the two image-related data for training of the two pairs of samples and the two classification labels corresponding to the two image-related data for training.

Optionally, the step of training the classifier of the second neural network model includes: respectively extracting the characteristics of the two image related data for training through a trained first neural network model; calculating a feature difference between features of the two image-related data for training; the classifier of the second neural network model is trained by taking the feature differences as input and the classification labels corresponding to the two image-related data for training as output.

Optionally, extracting features of the data for calibration by the neural network model and/or extracting features of the data for estimation by the neural network model includes: features of the data for calibration are extracted by the second neural network model and/or features of the data for estimation are extracted by the second neural network model.

Optionally, the step of performing line-of-sight estimation by the acquired features includes: and estimating the position of the gaze point in the gaze area through the acquired neural network model according to the acquired characteristics of the data for estimation and the acquired characteristics of the data for calibration.

Optionally, the step of performing line-of-sight estimation by the acquired features includes: calculating a feature difference between the extracted features of the data for estimation and the extracted features of the data for calibration; estimating a classifier output result corresponding to the calculated feature difference by using the acquired neural network model; calculating a probability that a gaze point corresponding to data for estimation belongs to each of a plurality of sub-regions divided from a gaze region according to an estimated classifier output result; the center of the sub-region with the highest probability is determined as the estimated gaze point.

Optionally, when the gazing area is an area on a two-dimensional plane, the gazing area is divided by: setting two straight lines perpendicularly intersecting each of the calibration points for each of the calibration points, and dividing the gazing area into a plurality of sub-areas by the set respective straight lines, or when the gazing area is an area in a three-dimensional space, the gazing area is divided by: three straight lines perpendicular to each other and intersecting each of the calibration points are provided for each of the calibration points, and the fixation area is divided into a plurality of sub-areas by the provided respective straight lines.

Optionally, the step of calculating the probability that the gaze point corresponding to the data for estimating belongs to each of the plurality of sub-regions divided from the gaze region includes: for the classifier output result corresponding to each calibration point, respectively determining the probability that the coordinate of each dimension of the gaze point is smaller than and larger than the coordinate of each calibration point about each dimension; and calculating the probability that the fixation point belongs to each sub-region according to the determined probability.

Optionally, the probability that the gaze point corresponding to the data for estimating belongs to each sub-region of the plurality of sub-regions is calculated by comparing the probability of relation with respect to the corresponding calibration point.

Optionally, before the line of sight estimation by the acquired features, the line of sight estimation method further includes: and acquiring data for calibration when the specific point is used as one of the calibration points according to the operation of the user on the specific point.

Optionally, the specific point includes at least one of: a specific point on the screen of the device, a specific button on the device, a specific point with a determined relative position to the device.

Optionally, the line-of-sight estimation method further includes: displaying the calibration points; acquiring a user image of a user when the user gazes at the calibration point as the data for calibration; and calibrating according to the data for calibrating.

Optionally, the step of acquiring the image of the user when the user gazes at the calibration point includes: and in response to receiving the gesture aiming at the calibration point, judging the distance between the operation point corresponding to the gesture and the calibration point, and acquiring a user image as the data for calibration when the distance is smaller than a distance threshold.

According to another exemplary embodiment of the present disclosure, there is provided a line of sight estimating apparatus including: a feature acquisition unit that acquires features of data for calibration and features of data for estimation; and an estimation unit that performs line-of-sight estimation from the acquired features.

Optionally, the line-of-sight estimating apparatus further includes: and the model training unit is used for acquiring a neural network model, wherein the characteristic acquisition unit is used for extracting the characteristics of the data for calibration through the neural network model and/or extracting the characteristics of the data for estimation through the neural network model.

Optionally, the model training unit trains the neural network model through data for training.

Optionally, the data for training includes a first user image and a second user image, where the first user image and the second user image are an image of the same user when looking at the first object and an image of the same user when looking at the second object, respectively; wherein the model training unit trains the neural network model by taking the first user image and the second user image as inputs and the relative positions of the first object and the second object as outputs.

Optionally, the data for training includes image related data for training and a sight line label for training, and the model training unit converts the sight line label for training into a two-class label; determining a loss function corresponding to the classification label; training a first neural network model by image-related data for training, the classification labels, and the loss function; the model training unit determines a coordinate Y _a of the sight line label for training on a specific coordinate axis, wherein Y _amin≤Y_a≤Y_amax,Y_amin and Y _amax are the minimum value and the maximum value of a coordinate Y _a respectively; setting a plurality of nodes at a predetermined pitch on the specific coordinate axis, wherein the size of the predetermined pitch is bin_size, generating a bi-classified label including a vector having a dimension which is the number of the plurality of nodes, wherein the value of each dimension of the vector is determined by the size of the predetermined pitch and the coordinate Y _a, wherein the loss function is calculated by the value of each dimension of the vector, and an activation probability calculated by data for training corresponding to each node.

Optionally, the data for training includes image-related data for training and a line-of-sight tag for training, the model training unit extracts two pairs of samples from the image-related data for training and the line-of-sight tag for training, wherein the two pairs of samples correspond to the same user, each pair of samples includes one image-related data for training and one corresponding line-of-sight tag for training, and a difference between the two line-of-sight tags of the two pairs of samples is greater than a first threshold and less than a second threshold; a second neural network model is trained through the two pairs of samples.

Optionally, the model training unit extracts two other pairs of samples through the step of extracting two pairs of samples, wherein a difference between two sight labels of the two other pairs of samples is greater than a third threshold and less than a fourth threshold, wherein the third threshold is greater than or equal to the first threshold and the fourth threshold is less than or equal to the second threshold; continuing to train the second neural network model through the other two pairs of samples, wherein the step of extracting two pairs of samples is performed at least twice such that a difference between two line-of-sight labels of two pairs of samples extracted each time is smaller than a difference between two line-of-sight labels of two pairs of samples extracted the previous time.

Optionally, before training the second neural network model, the model training unit sets parameters of the second neural network model based on the first neural network model, where the second neural network model and the first neural network model have the same network layer for feature extraction; and a model training unit for training the classifier of the second neural network model by using the two image related data for training of the two pairs of samples and the two classification labels corresponding to the two image related data for training.

Optionally, the model training unit extracts the features of the two image related data for training through the trained first neural network model respectively; calculating a feature difference between features of the two image-related data for training; the classifier of the second neural network model is trained by taking the feature differences as input and the classification labels corresponding to the two image-related data for training as output.

Optionally, the feature acquiring unit extracts features of the data for calibration through the second neural network model and/or extracts features of the data for estimation through the second neural network model.

Optionally, the estimating unit estimates the position of the gaze point in the gaze area according to the acquired characteristics of the data for estimation and the acquired characteristics of the data for calibration by the acquired neural network model.

Optionally, the estimating unit calculates a feature difference between the feature of the extracted data for estimation and the feature of the extracted data for calibration; estimating a classifier output result corresponding to the calculated feature difference by using the acquired neural network model; calculating a probability that a gaze point corresponding to data for estimation belongs to each of a plurality of sub-regions divided from a gaze region according to an estimated classifier output result; the center of the sub-region with the highest probability is determined as the estimated gaze point.

Optionally, when the gazing area is an area on a two-dimensional plane, the gazing area is divided by: setting two straight lines perpendicularly intersecting each of the calibration points for each of the calibration points, and dividing the fixation area into a plurality of sub-areas by the set straight lines, or

When the gazing region is a region in a three-dimensional space, the gazing region is divided by: three straight lines perpendicular to each other and intersecting each of the calibration points are provided for each of the calibration points, and the fixation area is divided into a plurality of sub-areas by the provided respective straight lines.

Optionally, the estimating unit determines, for the classifier output result corresponding to each calibration point, a probability that the coordinates of each dimension of the gaze point are smaller and larger than the coordinates of each calibration point with respect to the each dimension, respectively; and calculating the probability that the fixation point belongs to each sub-region according to the determined probability.

Optionally, the estimating unit calculates, by a comparison relation probability of each of the plurality of sub-regions with respect to the corresponding calibration point, a probability that the gaze point corresponding to the data for estimation belongs to the each sub-region.

Optionally, the line-of-sight estimating apparatus further includes: and the calibration unit is used for acquiring data for calibration when the specific point is used as one of the calibration points according to the operation of the user on the specific point before the sight line estimation is carried out through the acquired characteristics.

Optionally, the calibration unit displays the calibration point; acquiring a user image of a user when the user gazes at the calibration point as the data for calibration; and calibrating according to the data for calibrating.

Optionally, the calibration unit is configured to determine, in response to receiving a gesture for the calibration point, a distance between an operation point corresponding to the gesture and the calibration point, and acquire, when the distance is smaller than a distance threshold, a user image as the data for calibration.

According to another exemplary embodiment of the present disclosure, a computer readable storage medium storing a computer program is provided, wherein the computer program, when executed by a processor, implements the gaze estimation method as described above.

The method and the device improve the traditional sight line estimation method and the traditional sight line estimation device, can realize self-adaptive sight line estimation for specific people, can reduce or avoid the operation of calculating model parameters or retraining a model in the calibration process, and can reduce the consumption of hardware resources; the sight line estimation performance or effect (for example, the accuracy of the sight line estimation result) can be effectively improved through calibration, and particularly, the sight line estimation performance or effect for a specific person can be improved; the obtained neural network model has stronger characteristic characterization capability. The beneficial effects of the present disclosure may be embodied in at least one of the following:

In a first aspect, for conventional solutions that extract features based on deep learning (i.e., appearance-based solutions), the defined loss function may result in poor characterizations of the model. In view of this, the present disclosure employs a new loss function that is different from the conventional solution, which may result in an improved characterization capability of the obtained model.

In a second aspect, conventional solutions require training of a general basic model and a specific model for a specific person, and at least require training of the specific model at the mobile device of the user, resulting in complex operations and increased resource consumption of the mobile device. The network structure defined in the disclosure can calculate the feature differences in a specific order, remove the appearance differences among different people through the calculated feature differences, and reduce or avoid model training operation in the calibration process through calibration and test (also called estimation) of the classifier obtained through training.

In a third aspect, in the present disclosure, data and line of sight acquired under a specific environment may be used for calibration, so that calibration efficiency may be effectively improved.

Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.

Drawings

The foregoing and other objects and features of exemplary embodiments of the present disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate the embodiments by way of example, in which:

FIGS. 1-3 illustrate schematic diagrams of calibration points according to exemplary embodiments of the present disclosure;

FIG. 4 illustrates a flow chart of a gaze estimation method in accordance with an exemplary embodiment of the present disclosure;

FIG. 5 illustrates a schematic diagram of a process of training a first neural network model, according to an exemplary embodiment of the present disclosure;

FIG. 6 illustrates a schematic diagram of a process of training a second neural network model, according to an exemplary embodiment of the present disclosure;

FIG. 7 shows a schematic diagram of a process based on slicing an operational neural network model, in accordance with an exemplary embodiment of the present disclosure;

fig. 8-11 illustrate schematic diagrams of calibration points according to exemplary embodiments of the present disclosure;

FIG. 12 illustrates a schematic diagram of an operation of extracting features at calibration according to an exemplary embodiment of the present disclosure;

FIG. 13 illustrates a schematic diagram of operations for line-of-sight estimation by extracted features according to an exemplary embodiment of the present disclosure;

FIG. 14 illustrates a schematic diagram of dividing regions based on calibration points according to an exemplary embodiment of the present disclosure;

FIG. 15 illustrates a region probability distribution histogram according to an exemplary embodiment of the present disclosure;

fig. 16 shows a schematic diagram of repartitioning a gaze area according to an exemplary embodiment of the present disclosure;

Fig. 17 shows a block diagram of a gaze estimation device in accordance with an exemplary embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments will be described below in order to explain the present disclosure by referring to the figures.

In practice, the gaze estimation product to which the gaze estimation method is applied is designed for use by a specific person (e.g., the gaze estimation product is a mobile device), or for a period of time, the gaze estimation product is only used by a specific person (e.g., the gaze estimation product is augmented Reality (Augmented Reality), virtual Reality (Virtual Reality) glasses, abbreviated as AVR glasses). In this case, to enhance the user experience, modifications to the basic model are required, which is called calibration. The calibration is to solve the problem that the eye shape of a specific person is different from the eye shape corresponding to the basic model, and the calibration is usually divided into two steps. The first step is as follows: acquiring data of a specific person when the specific person performs calibration; the second step is as follows: the general parameters of the basic model are updated with the acquired data in order to adapt the basic model to the specific model of the specific person. Estimating the line of sight of the particular person using the particular model can significantly improve the predictive effect on the particular person, although it is also possible to reduce the predictive effect on other persons.

The basic model is typically built by the following solutions: geometric model-based solutions and appearance-based solutions.

The geometric model-based solution performs a distance line of sight estimation using general theory of pupil center and cornea reflection, specifically, determines an algorithm formula for calculating a line of sight direction from metric features (e.g., pupil center position, cornea reflection position, etc.), calculates a required metric feature using human eye data acquired through an infrared camera and an Red Green Blue (RGB) camera, and substitutes the calculated metric feature into the algorithm formula to calculate the line of sight direction. Geometric model-based solutions typically use parameters related to the user (e.g., the angle between the visual and optical axes of the human eye, the distance between the pupil center and the center of the corneal curve, etc.). In order to improve the accuracy of the gaze estimation result for a specific person, the above parameters need to be calculated by calibration, since these parameters typically cannot be measured directly, but need to be calculated by a special calibration device and calibration algorithm. Additionally, geometric model-based solutions typically require that the computed metrology features have high accuracy, which is dependent on the particular data acquisition device (e.g., infrared light source and infrared camera).

Appearance-based solutions typically acquire a user image through an RGB camera and extract features from the user image that correspond to the appearance of the user. The feature extraction method is divided into a method for manually screening features and a feature extraction method based on deep learning. A mapping between the input x and the position of the eye's line of sight (e.g., line of sight tag Y) is established, wherein the mapping may be represented by a classifier or a regressor. For example, the mapping relationship is expressed as the formula y=f (x; w). For the method of manually screening features, x is a feature extracted from a user image, and may be a feature extracted by using scale-invariant feature transform (SCALE INVARIANT Feature Transform, abbreviated as SIFT) or the like; for a feature extraction method based on deep learning, x is the input image. F is a mapping function, and w is a parameter of the mapping function. Feature extraction methods based on deep learning are generally divided into two phases, training and testing (also referred to as estimation). In the training phase, the parameters w of the mapping function F are learned using the training data pairs (x, Y) and the mapping function F. In the test phase, the line-of-sight estimation result Y 'is obtained using the test data x' and the parameter w obtained by learning. With the development of deep learning, feature extraction methods based on deep learning in appearance-based solutions are increasingly adopted.

Calibration is generally divided into two steps: the first step is to interactively capture the data of the user while looking at a fixed point (e.g., a calibration point); the second step is to adjust parameters (e.g., parameters w) of the base model to update the base model to a particular model based on the captured user data and the corresponding line-of-sight tag Y.

For a basic model built by a geometric model-based solution, calibration must be performed in order to determine parameters related to the user (e.g., the angle between the visual and optical axes of the human eye, the distance between the pupil center and the cornea curve center, etc.).

For a basic model built by appearance-based solutions, the mapping function is typically redetermined by calibration and the parameters w of the mapping function are trained. For example, support vector regression (Support Vector Regression, abbreviated as SVR), random Forest (Random Forest), or the like is used for deep learning.

However, for geometric model-based solutions, specific data acquisition devices (e.g., infrared light sources and infrared cameras) are typically required. In this case, the line-of-sight estimation cannot be realized with a general-purpose device (e.g., a mobile device with a front RGB camera).

For appearance-based solutions, the problems are presented in the following ways:

In the first aspect, it is necessary to redetermine the mapping function by calibration and learn the parameters of the mapping function, such an operation needs to be performed at the mobile device of the user, and this operation needs to be performed every time calibration is performed, resulting in an increase in hardware resource consumption. Here, it is emphasized that since the above operation requires collection of user data and the user data relates to personal privacy, the above operation cannot be performed by transmitting the user data to the server side and performing calculation by the server side in order to protect the personal privacy.

In the second aspect, the kinds of mapping functions redetermined by calibration are limited due to limitations of methods (e.g., SVR and random forest, etc.) employed when deep learning is performed, resulting in limitation of the calibration mode. In this case, part of the data acquired at the time of calibration cannot be used for training, resulting in poor prediction effect for a specific person, in other words, the prediction effect for a specific person cannot be effectively improved by calibration.

In the third aspect, the feature extracted by the existing method is poor in versatility. In other words, the basic model in the existing method has poor characteristic characterization capability, and a mapping function applicable to all persons cannot be established, i.e., the characteristics of all persons cannot be effectively characterized.

The reason for the above problems is as follows:

As described above, the differences in the eye shapes of different persons are not considered in the training of the basic model, and in order to reduce the influence of the differences on the sight line estimation result, calibration must be performed to redetermine the mapping function of the basic model; in order to preserve personal privacy, calibration and model training is required at the user's mobile device. In this case, the consumption of hardware resources of the mobile device increases.

The method is limited by the method (such as SVR and random forest) adopted in deep learning, and the data acquired in calibration must meet a specific distribution to enable the classifier to be effective, so that the method can be applied to the SVR and random forest. Fig. 1 to 3 show calibration points set when calibration is performed, wherein five calibration points, nine calibration points, and thirteen calibration points are respectively provided in fig. 1 to 3. In performing calibration, a preset calibration point must be used. However, there may be situations where user data and line-of-sight tags corresponding to the data acquired during use by the user (e.g., during a testing phase) cannot be used for calibration. For example, personal settings may be made during the first use of the personal mobile device by the user. In making a personal setting, typically a user may click a particular button. In this case, if calibration is performed using data of the user when clicking the specific button and a sight line tag corresponding to the specific button, accuracy of the sight line estimation result can be improved. However, due to limitations of SVR, random forest, and other methods, part of the data and line-of-sight labels obtained during the process of personal setting can be used for calibration, and the rest of the data and line-of-sight obtained during the process of personal setting cannot be used for adjusting parameters of the existing basic model, but also cannot be used for training the specific model.

During the training of the basic model, there is a large difference between the data used for training, at least in the eye shape, resulting in that it is not easy to obtain a global optimum of the parameters of the basic model.

An adaptive line-of-sight estimation method according to an exemplary embodiment of the present disclosure may include at least one of the following: the device comprises a model training part, a calibration part and an actual use part.

The model training portion may be used to build a mapping relationship of the model (e.g., mapping function y=f (x)), namely: the parameters w of the model representing the mapping relation are trained from the data for training. The parameter w obtained by training can be fixed and calibrated and estimated on the basis of this (also called test or prediction). The operation of the model training part may be an offline operation or an online operation, preferably an offline operation, and may be performed by an electronic device such as a server or a mobile phone. The calibration part and the usage part may use a trained model, and the parameters w of the used trained model are not changed in the operation of the calibration part and the actual usage part. Each of the operation of the calibration section and the operation of the actual use section may be either an online operation or an offline operation, and preferably each of the operation of the calibration section and the operation of the actual use section is an online operation, which can be performed by an electronic device such as a mobile phone.

The line of sight estimation method according to the exemplary embodiments of the present disclosure may be applied to various electronic devices, for example, a cellular phone, a tablet computer, a smart watch, etc. The cell phone is illustrated in some exemplary embodiments of the present disclosure, but this should not be construed as limiting the present disclosure.

Fig. 4 shows a flowchart of a line-of-sight estimation method according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the line-of-sight estimating method according to an exemplary embodiment of the present disclosure includes the steps of:

In step S10, the characteristics of the data for calibration and the characteristics of the data for estimation are acquired. In step S20, line-of-sight estimation is performed by the acquired features. The operation of acquiring the features of the data for calibration may correspond to a calibration section, the operation of acquiring the features of the data for estimation and step S20 may correspond to an actual use section, which may also be referred to as a prediction section, a test section or an estimation section.

As an example, the line-of-sight estimation method further includes: acquiring a neural network model, wherein the step of acquiring the characteristics comprises: features of the data for calibration are extracted by the neural network model and/or features of the data for estimation are extracted by the neural network model.

As an example, the step of acquiring the neural network model includes: the neural network model is trained with data for training. The step of acquiring the neural network model may correspond to a model training portion.

In exemplary embodiments of the present disclosure, the data may include data related to a user looking at a specific point (also referred to as a point of gaze), e.g., a user image, depth data of the user (e.g., depth data of a point on the user's face), etc., in other words, the data may include image-related data (e.g., RGB image or depth image), e.g., appearance-related data, which may be referred to as apparent data; the data may also include line-of-sight tags; the line-of-sight tag may be a two-dimensional line-of-sight tag (e.g., two-dimensional coordinates) three-dimensional sight line label three-dimensional coordinates), and the like; the neural network model may include a network layer and a classifier for feature extraction; the network layer and classifier for feature extraction may be determined by the model training section, and the network layer for feature extraction may be used for feature acquisition in step S10; the classifier may be used for line of sight estimation at step S20.

In another exemplary embodiment of the present disclosure, the data for training includes a first user image and a second user image, wherein the first user image and the second user image are an image of the same user when looking at a first object and an image when looking at a second object, respectively; wherein training the neural network model comprises: the neural network model is trained by taking as input a first user image and a second user image, and taking as output a relative position of a first object and a second object.

The above settings are merely for convenience in description of the embodiments of the present disclosure and are not intended to limit the scope of the present disclosure.

In exemplary embodiments of the present disclosure, the neural network model may be trained in advance and/or features of the data for calibration may be acquired in an off-line manner. The trained neural network model and/or the features of the acquired data for calibration may be stored to a storage device, in particular, an electronic device (e.g., a cell phone) for implementing a gaze estimation method according to an exemplary embodiment of the present disclosure. The characteristics of the data for estimation may then be acquired in real time and the line of sight estimation performed in real time.

In training the neural network model, the data for training and/or the line-of-sight tags for training may be processed to train a more efficient neural network model, which may include at least one of: the line-of-sight labels used for training are converted into two-class labels, and image-related data (e.g., RGB images) used for training are subjected to a slicing operation.

As an example, the data for training may include a third user image, and the step of training the neural network model may include: extracting at least one slice from the third user image; the neural network model is trained using the at least one slice and the line-of-sight tag for training.

As an example, the data for calibration comprises a fourth user image, the data for estimation comprises a fifth user image, and the step of acquiring the feature comprises: features of the fourth user image are extracted by the trained neural network model and/or features of the fifth user image are extracted by the trained neural network model.

As an example, the data for training includes image-related data for training and line-of-sight labels for training, the step of training the neural network model comprising: converting the sight tag used for training into a two-class tag; determining a loss function corresponding to the classification label; the first neural network model is trained by the image-related data for training, the classification labels, and the loss function.

As an example, the step of converting the line-of-sight tag for training into a two-class tag includes: determining a coordinate Y _a of the sight line label for training on a specific coordinate axis, wherein Y _amin≤Y_a≤Y_amax,Y_amin and Y _amax are the minimum value and the maximum value of the coordinate Y _a respectively; setting a plurality of nodes at a predetermined pitch on the specific coordinate axis, wherein the size of the predetermined pitch is bin_size, generating a bi-classified label including a vector having a dimension which is the number of the plurality of nodes, wherein the value of each dimension of the vector is determined by the size of the predetermined pitch and the coordinate Y _a, wherein the loss function is calculated by the value of each dimension of the vector, and an activation probability calculated by data for training corresponding to each node.

As an example, the data for training includes a sixth user image and a line-of-sight tag for training, and the step of training the neural network model includes: extracting at least one slice from the sixth user image; converting the sight tag used for training into a two-class tag; determining a loss function corresponding to the classification label; training a neural network model by the at least one slice, the classification labels, and the loss function.

Fig. 5 shows a schematic diagram of a process of training a first neural network model according to an exemplary embodiment of the present disclosure.

As shown in fig. 5, the data for training includes images (also referred to as pictures). At least one slice may be extracted from the image for training, and the extracted at least one slice may be used for training of the first neural network model. The line-of-sight labels used for training can also be converted into two-class labels, and the converted two-class labels can also be used for training the first neural network model.

When the transformed bi-classification labels are used for training of the first neural network model, the trained first neural network model includes a bi-classifier and a network layer for feature extraction. The network layer for feature extraction has an input of an image for training and an output of the image for training. The input of the two classifiers is the characteristic of the image used for training, and the output is the two-class label. The process of training the first neural network may specifically include a training operation a as follows.

In training operation a, a line-of-sight tag for training may be converted into a bi-classified tag, a first loss function corresponding to the bi-classified tag is determined, and a first neural network model is trained by the image for training, the bi-classified tag, and the first loss function. Training the target may include minimizing a first loss function, and the parameter that minimizes the first loss function may be obtained through training. The parameters may include weights for layers of the first neural network model.

In particular, the line-of-sight tag Y for training may be converted into a series of bi-classified tags Y ', the first neural network model may be trained by the image for training and the bi-classified tags Y ', and the first neural network model may also be trained using at least one slice extracted (or segmented) from the image for training and the bi-classified tags Y ', wherein an operation for extracting (or segmenting) the at least one slice from the image for training will be described in detail below. The first loss function may be obtained using the image for training (or at least one slice obtained by extraction or segmentation) and the classification labels Y' in order to train the first neural network model. Neural network models trained using the two-class labels Y' converge more easily or converge more quickly than neural network models trained using the line-of-sight labels Y for training. Features of data (e.g., images) input to the first neural network model that are characteristically strong may be acquired through the trained first neural network model.

As an example, the step of converting the line-of-sight tag for training into a two-class tag may include: determining a coordinate Y _a of the sight line label for training on a specific coordinate axis, wherein Y _amin≤Y_a≤Y_amax,Y_amin and Y _amax are the minimum value and the maximum value of the coordinate Y _a respectively; setting a plurality of nodes on the specific coordinate axis at a predetermined interval, wherein the size of the predetermined interval is bin_size, and the number bin_num of the plurality of nodes is an integer part of a result obtained by (Y _amax-Y_amin)/bin_size+1; generating a binary class label comprising a vector of dimensions bin num, the values of the respective dimensions of the vector being calculated by a binary function:

Wherein, i is more than or equal to 1 and is more than or equal to bin_num.

For example, the specific coordinate axis is one axis in a two-dimensional, three-dimensional or higher latitude coordinate system, and Y _amin and Y _amax define a range of one coordinate on the one axis. In this case, the bin_num nodes (e.g., equally spaced bin_num nodes spaced 20 pixels apart) may be disposed on the one axis. A column vector or row vector of elements having a value Y' _ai and a dimension bin num may be set, and a bi-classified label corresponding to the one coordinate on the one axis may include the column vector or row vector. The two-class labels corresponding to the line-of-sight labels represented by two-dimensional coordinates include 2 vectors (i.e., vectors corresponding to the x-axis and the y-axis, respectively) and correspond to the bin_num_x+bin_num_y classifiers, where bin_num_x represents the dimension of the vector corresponding to the x-axis and bin_num_y represents the dimension of the vector corresponding to the y-axis. The two-class labels corresponding to the line-of-sight labels represented by three-dimensional coordinates include 3 vectors (i.e., vectors corresponding to the x-axis, the y-axis, and the z-axis, respectively) and correspond to the bin_num_x+bin_num_y+bin_num_z classifiers, where bin_num_x represents the dimension of the vector corresponding to the x-axis, bin_num_y represents the dimension of the vector corresponding to the y-axis, and bin_num_z represents the dimension of the vector corresponding to the z-axis.

As an example, the line-of-sight tag is represented by two-dimensional coordinates (Y _x,Y_y), and nodes are set based on coordinate axes on the screen of the cellular phone. For example, the upper left corner of the screen of the cell phone is defined as the origin of coordinates (0, 0), the direction from the upper left corner of the screen to the upper right corner of the screen is the positive x-axis (e.g., horizontal axis) direction, and the direction from the upper left corner of the screen to the lower left corner of the screen is the positive y-axis (e.g., vertical axis) direction. The maximum value of the horizontal axis is Y _xmax, and the minimum value is Y _xmin. The vertical axis has a maximum value of Y _ymax and a minimum value of Y _ymin.

In such a coordinate system, nodes are set on the x-axis with a pitch bin_size_x of 20 pixels. The number of nodes on the x-axis, bin_num_x, is the integer portion of the result of (Y _xmax-Y_xmin)/bin_size_x+1. On the y-axis, nodes are set at a pitch bin_size_y of 20 pixels. The number of nodes on the Y-axis, bin_num_y, is the integer portion of the result of (Y _ymax-Y_ymim)/bin_size_y+1. In this way, a vector of dimension bin_num_x and a vector of dimension bin_num_y can be generated from the line-of-sight tag of coordinates (Y _x,Y_y). The respective elements Y' _xi of the vector of dimension bin_num_x are:

the respective elements Y' _yi of the vector of dimension bin_num_y are:

In this case, the two-class label corresponding to the line-of-sight label having the coordinates (Y _x,Y_y) includes a vector having the dimension bin_num_x and a vector having the dimension bin_num_y. The number of the classifiers corresponding to the line-of-sight tag having the coordinates (Y _x,Y_y) is bin_num_x+bin_num_y.

As an example, the loss function is a cross entropy loss calculated based on a two-classifier. For example, the first loss function is calculated by the following formula:

Loss_i＝-Y'_ai×log(P_ai)-(1-Y'_ai)×log(1-P_ai)，

Wherein the Loss function is Loss, and the activation probability P _ai of the ith node is expressed as:

Where zi is the input of the ith node (e.g., a portion or all of the data corresponding to the ith node for training).

As an example, the following steps may also be performed: training a second neural network model by samples extracted from the data for training and the line-of-sight tag for training; the first neural network model is trained with the data for training and the line-of-sight tag for training, the second neural network model is set based on the first neural network model, and the second neural network model (in particular, a classifier of the second neural network model) is trained with samples extracted from the data for training and the line-of-sight tag for training. Here, the line-of-sight labels used for training may be converted into the class-of-class labels, and the converted class-of-class labels are used to train the corresponding neural network models.

Specifically, the second neural network model may be trained without training the first neural network model, in other words, the step of training the second neural network model may include: extracting two pairs of samples from the image-related data for training and the line-of-sight tag for training, wherein the two pairs of samples correspond to the same user, each pair of samples comprising one image-related data for training and one corresponding line-of-sight tag for training, the difference between the two line-of-sight tags of the two pairs of samples being greater than a first threshold and less than a second threshold; a second neural network model is trained through the two pairs of samples. The second neural network model may also be trained with the first neural network model, in other words, the step of training the neural network model may comprise: converting the sight tag used for training into a two-class tag; determining a loss function corresponding to the classification label; training a first neural network model with the image-related data for training, the classification labels, and the determined loss function; parameters of the second neural network model are set based on the trained first neural network model, wherein the trained second neural network model and the trained first neural network model have the same network layer for feature extraction, in which case a classifier of the second neural network model may be trained by two image-related data for training of the two pairs of samples and a classification tag corresponding to the two image-related data for training.

As an example, the step of training the neural network model further comprises: extracting two other pairs of samples by the step of extracting two pairs of samples, wherein a difference between two line-of-sight labels of the two other pairs of samples is greater than a third threshold and less than a fourth threshold, wherein the third threshold is greater than or equal to the first threshold and the fourth threshold is less than or equal to the second threshold; continuing to train the second neural network model through the other two pairs of samples, wherein the step of extracting two pairs of samples is performed at least twice such that a difference between two line-of-sight labels of two pairs of samples extracted each time is smaller than a difference between two line-of-sight labels of two pairs of samples extracted the previous time.

The model training section of the exemplary embodiment of the present disclosure will be described below taking, as an example, a case where a first neural network model is trained, and a second neural network model is set based on the trained first neural network model, and the set second neural network model is trained. The model training part comprises a training operation B and a training operation C.

In training operation B, two pairs of samples may be extracted from the image-related data for training and the line-of-sight tag for training, wherein the two pairs of samples correspond to the same user, each pair of samples including one image-related data for training and one corresponding line-of-sight tag for training, a difference between the two line-of-sight tags of the two pairs of samples being greater than a first threshold and less than a second threshold; taking a function for representing the position relation of two sight labels for training of the two pairs of samples as a second loss function; a second neural network model is trained by the two pairs of samples and the second loss function. Training the target may include minimizing a second loss function, and the parameter that minimizes the second loss function may be obtained through training. The parameters may include weights for layers of the second neural network model. The parameters of the second neural network model in training operation B may be set based on the first neural network model trained in training operation a such that the second neural network model has the same network layer for feature extraction as the first neural network model. In this case, the second neural network model may be trained by training a classifier of the second neural network model. Preferably, the second loss function is identical to the first loss function.

As an example, the classifier of the second neural network model may be trained by: respectively extracting the characteristics of the two image related data for training through a trained first neural network model; calculating a feature difference between features of the two image-related data for training; the classifier of the second neural network model is trained by taking the feature differences as input and the classification labels corresponding to the two image-related data for training as output. The feature of the image-related data may be a vector, in which case the feature difference is the difference of the vector.

In training operation C, extracting two other pairs of samples by the step of extracting two pairs of samples, wherein a difference between two line-of-sight labels of the two other pairs of samples is greater than a third threshold and less than a fourth threshold, wherein the third threshold is greater than or equal to the first threshold and the fourth threshold is less than or equal to the second threshold; the training of the second neural network model is continued through the other two pairs of samples, thereby progressively reducing the difference between the two line-of-sight labels of the two pairs of samples extracted.

Fig. 6 shows a schematic diagram of a process of training a second neural network model according to an exemplary embodiment of the present disclosure.

The network layer for feature extraction shown in fig. 6 can be obtained with reference to the description for fig. 5. Sample pairs 1 and 2 obtained by extraction are from the image-related data for training and the line-of-sight tag for training and correspond to the same user, each sample pair comprising one image-related data for training and one corresponding line-of-sight tag for training, the difference between the two line-of-sight tags of the two pairs of samples being greater than a first threshold and less than a second threshold. The image-related data of the present embodiment is an image, which may be an image of a user who looks at the gaze point. As described above, in the case where the second neural network model is set based on the first neural network model, the network layer for feature extraction of the second neural network model has been determined. In this case, the classifier of the second neural network model may be trained by sample pair 1 and sample pair 2. The second loss function corresponding to the second neural network model may represent a positional relationship between the line-of-sight tag of the sample pair 1 and the line-of-sight tag of the sample pair 2.

The line-of-sight labels for training for sample pair 1 and sample pair 2 may be converted into two-class labels, and the classifier of the second neural network model may be trained by the converted two-class labels, the image in sample pair 1, the image in sample pair 2. In this case, the classifier of the second neural network model is a classifier (classifier in fig. 6). The input to the classifier may be a feature corresponding to the image. For example, feature 1 of the image in sample pair 1 is extracted by the network layer for feature extraction, feature 2 of the image in sample pair 2 is extracted by the network layer for feature extraction, and the feature difference between feature 1 and feature 2 is taken as an input to the classifier. The output of the classifier may represent a positional relationship between a predetermined object in the first image and a corresponding object in the second image among the image of the sample pair 1 and the image of the sample pair 2 (hereinafter also referred to as a first image and a second image, respectively) from the same user point of view, or a positional relationship between a screen gaze point corresponding to the first image and a screen corresponding to the second image from the same user point of view. For example, the first image is an image of the same user when viewing a point on the left side of the mobile phone screen, the second image is an image of the same user when viewing a point on the right side of the mobile phone screen, and the two-class label may indicate which side of the gaze point corresponding to the first image the gaze point corresponding to the second image is on from the same user's perspective (e.g., a two-class label with a value of 1 indicates the right side and a two-class label with a value of 0 indicates the left side). The images can be two images obtained by a front camera of the mobile phone when the same user looks at a preset object.

In this embodiment, the two images are from the same user, and the feature differences from the two images are used to train the two classifiers. In this case, deviation due to the difference in appearance of the person can be removed. In addition, the order of inputting the two images is not limited, and the first image may be input first to determine the feature 1 corresponding to the first image, and then the second image may be input to determine the feature 2 corresponding to the second image, or the second image may be input first to determine the feature 2 corresponding to the second image, and then the first image may be input to determine the feature 1 corresponding to the first image. Subsequently, feature differences corresponding to the two images may be calculated.

As an example, the feature obtained by the network layer for feature extraction may be a vector, and the feature difference is a difference of the vectors, that is, the feature difference is a vector having a difference of corresponding elements of the two vectors as elements.

As an example, the training data may be sampled with difficult samples such that the difference between the line of sight tag of the newly sampled sample pair and the line of sight tag of the sample pair from the previous sample is gradually reduced. The second neural network model continues to be trained with the newly sampled pairs of samples. In addition, because resampling can be continuously performed, namely, the difference value between sight line labels of sample pairs obtained by sampling twice in sequence can be continuously reduced according to the network convergence condition, a new training set is obtained by continuously sampling, and therefore, training samples input to the neural network are easy to converge from simple to difficult. It is generally considered that the larger the difference between two samples inputted, the easier it is to judge the relationship between the two samples, and two or more samples having a large difference may be referred to as simple samples; the smaller the difference between the two samples input, the less easily the relationship between the two samples is judged, and the two or more samples having small differences may be referred to as difficult samples. The simple to difficult input training samples means that the differences between the input samples are large to small.

As an example, the image-related data used in exemplary embodiments of the present disclosure may be an image, preferably a slice extracted from the image, which slice may include at least one of the following: a face image, a left eye image, and a right eye image.

For example, an image captured by a camera (e.g., a user image) may be used as image-related data for training. Since the line of sight is related to the eyeball position, the head pose, etc., at least one slice may be extracted from the acquired image. The at least one slice shown may be part of an image, each slice corresponding to one of the sub-networks of the first neural network model.

Fig. 7 shows a schematic diagram of a process of operating a neural network model based on slicing, according to an example embodiment of the present disclosure.

As shown in fig. 7, three slices, respectively, a face image, a left-eye image, and a right-eye image, are provided; the three slices correspond to three sub-networks, respectively, namely: face network, left eye network, and right eye network. In the exemplary embodiments of the present disclosure, in the model training section, the calibration section, and the actual use section, the features corresponding to the image may be extracted by a neural network model (preferably, a network layer of the neural network model for feature extraction). In the present exemplary embodiment, three features respectively corresponding to a face image, a left-eye image, and a right-eye image may be extracted. The three features may be stitched together and the stitched features may be output to corresponding classifiers.

Taking the line of sight estimation by a cell phone as an example, the slice may include a face image, a left eye image, and a right eye image. However, the present disclosure is not limited thereto, and for example, line-of-sight estimation may also be performed by AVR glasses. The present disclosure is also not limited to using only a single RGB camera to acquire image-related data, but may use multiple cameras, may use infrared cameras, near infrared cameras, etc., and the acquired image-related data may include depth data as well as fusion of data of one or more modalities. In other words, the image-related data or slices may include other data that may be used for line-of-sight estimation, such as face-to-camera depth data, etc.

The neural network model (e.g., the first neural network model and/or the second neural network model) used in the exemplary embodiments of the present disclosure has a deeper number of layers than existing neural network models, has a greater number of convolution kernels, and the convolution kernel stacking manner facilitates more efficient extraction of features of the image. For example, the neural network model of the exemplary embodiments of the present disclosure may include three inputs, a face image input, a left eye image input, and a right eye image input, respectively. The left-eye network corresponding to the left-eye image and the right-eye network corresponding to the right-eye image may share a partial network structure (e.g., network layer). The neural network model may be set by the parameters shown in tables 1,2 and 3 below, where tables 1 and 2 correspond to training the neural network model by a slice (e.g., a slice extracted from an image), respectively, and table 3 corresponds to a process of fusing features of the slice. Of course, tables 1 and 2 correspond only to the preferred neural network model of the exemplary embodiments of the present disclosure, and are not intended to limit the present disclosure. The structure, parameters, input data, and structure of the sub-network, etc. of the neural network model implemented on the basis of the disclosed concept are not limited thereto.

TABLE 1

TABLE 2

TABLE 3 Table 3

Layer(s)	Input junction count	Number of output nodes
			Fc1	128*3＝384	256
Classifier	256	Bin_num

In table 3 above, fc1 corresponds to the network layer used for feature extraction, classifier corresponds to the Classifier, and bin_num corresponds to the number of nodes described above.

In exemplary embodiments according to the present disclosure, various methods may be employed to obtain characteristics of data for calibration and/or characteristics of data for estimation.

As an example, extracting features of the data for calibration by the neural network model and/or extracting features of the data for estimation by the neural network model includes: features of the data for calibration are extracted by the second neural network model and/or features of the data for estimation are extracted by the second neural network model.

The calibration part of the line of sight estimation method according to an exemplary embodiment of the present disclosure may include: displaying the calibration points; acquiring a user image of a user when the user gazes at the calibration point as the data for calibration; and calibrating according to the data for calibrating.

As an example, the step of acquiring the image of the user while looking at the calibration point comprises: and in response to receiving the gesture aiming at the calibration point, judging the distance between the operation point corresponding to the gesture and the calibration point, and acquiring a user image as the data for calibration when the distance is smaller than a distance threshold.

Specifically, the calibration portion of the line-of-sight estimation method according to an exemplary embodiment of the present disclosure may include a calibration operation a and a calibration operation B.

In the calibration operation a, a calibration point is set, and data x_cali_1 to data x_cali_n of the user when looking at the calibration point are acquired and saved.

Fig. 8-11 illustrate schematic diagrams of calibration points according to exemplary embodiments of the present disclosure.

As shown in fig. 8-11, five calibration points are included in each figure. This is by way of example only and is not intended to limit the scope of the present disclosure, that is, the number and location of the calibration points is not limited.

As an example, the calibration points may be obtained by calibration in advance and the gaze area may be divided according to the calibration points calibrated in advance.

Taking a mobile phone as an example, the screen can be divided into 6 parts in the horizontal direction and the vertical direction according to the calibration points. It should be noted that the division into 6 parts is for descriptive purposes only and is not intended to limit the present disclosure, in other words, more or less parts may be divided. In fig. 8 to 11, the screen is divided into 36 sub-areas by respective straight lines intersecting perpendicularly at the calibration points. The exemplary embodiments of the present disclosure are not limited to the respective sub-regions as shown in fig. 8 to 11, in other words, the exemplary embodiments of the present disclosure do not limit the number of sub-regions divided from the screen and the manner in which the sub-regions are divided.

In the process of calibrating by using the mobile phone, in the calibration operation A, the user can be prompted to watch the calibration point displayed on the screen, so that the interaction between the mobile phone and the user is realized. The interaction may include: prompting a user to look at a pointing point displayed on a screen and click the pointing point, and receiving the clicked position on the screen; when a click on the screen is received, the distance between the clicked position and the standard point can be judged; when the distance is less than a distance threshold (e.g., 100 pixels), determining the click as a click against a target point may also determine that the user is looking at the target point, so that an image of the user may be acquired by a camera or the like. In this case, another set point may be displayed and the user may be prompted to look at the other set point on the screen and click on the other set point. When the distance is greater than or equal to a distance threshold, the calibration may be determined to be invalid and the user may be prompted to gaze again and click on the calibration point.

It should be noted that the exemplary embodiments of the present disclosure are described with respect to a cellular phone as an example only for convenience in describing the concepts of the present disclosure, and the present disclosure may be implemented on other apparatuses or devices. The above operations of displaying the calibration points, interacting the user with the mobile phone, judging whether the calibration is valid, etc. may be considered as a preferred embodiment of the present disclosure, but are not limited to the present disclosure, and other operations are also possible.

At calibration operation B, features feat _1 to feat _n of data x_cali_1 to data x_cali_n may be extracted and saved using the neural network model described in exemplary embodiments of the present disclosure. The neural network model to which the extracted features are applied may be a first neural network model or a second neural network model, preferably the second neural network model.

In addition, features of the data may be extracted offline rather than in real-time. In other words, for example, the features of the data may be extracted and saved in advance without performing the step of training the neural network model. The feature is extracted only once, so that the execution time of the actual use portion is not affected by the calibration portion, for example, the number of calibration points, the time consumed in extracting the feature in the calibration portion, and the like.

FIG. 12 illustrates a schematic diagram of operations for extracting features at calibration time according to an exemplary embodiment of the present disclosure.

As shown in fig. 12, the data for calibration for which the features are extracted may be calibration pictures, and the features of the calibration pictures may be extracted using a network (e.g., a second neural network model), and the extracted features may be referred to as calibration features.

In an exemplary embodiment according to the present disclosure, step S20 may include: and estimating the position of the gaze point in the gaze area through the acquired neural network model according to the acquired characteristics of the data for estimation and the acquired characteristics of the data for calibration.

As an example, step S20 may include: calculating a feature difference between the extracted features of the data for estimation and the extracted features of the data for calibration; estimating a classifier output result corresponding to the calculated feature difference by using the acquired neural network model; calculating a probability that a gaze point corresponding to data for estimation belongs to each of a plurality of sub-regions divided from a gaze region according to an estimated classifier output result; the center of the sub-region with the highest probability is determined as the estimated gaze point.

As an example, when the gazing area is an area on a two-dimensional plane, the gazing area is divided by: setting two straight lines perpendicularly intersecting each of the calibration points for each of the calibration points, and dividing the gazing area into a plurality of sub-areas by the set respective straight lines, or when the gazing area is an area in a three-dimensional space, the gazing area is divided by: three straight lines perpendicular to each other and intersecting each of the calibration points are provided for each of the calibration points, and the fixation area is divided into a plurality of sub-areas by the provided respective straight lines.

As an example, the step of calculating the probability that the gaze point corresponding to the data for estimation belongs to each of the plurality of sub-regions divided from the gaze region includes: for the classifier output result corresponding to each calibration point, respectively determining the probability that the coordinate of each dimension of the gaze point is smaller than and larger than the coordinate of each calibration point about each dimension; and calculating the probability that the fixation point belongs to each sub-region according to the determined probability.

As an example, a probability that a gaze point corresponding to the data for estimation belongs to each of the plurality of sub-regions is calculated by comparing the probability of relation with respect to the corresponding calibration point for each of the sub-regions.

As an example, before the line of sight estimation by the acquired features, the line of sight estimation method further comprises: and acquiring data for calibration when the specific point is used as one of the calibration points according to the operation of the user on the specific point.

As an example, the specific point includes at least one of: a specific point on the screen of the device, a specific button on the device, a specific point with a determined relative position to the device.

As an example, the probability P _area that the gaze point corresponding to the data for estimation belongs to the each sub-region is calculated by the following formula:

wherein, For the comparison relation probability of each sub-region relative to the ith calibration point, cali_num is the number of calibration points.

As an example, when new data is acquired, the step of estimating the line of sight by the extracted features further includes: acquiring the characteristics of the new data; combining the characteristics of the new data with the characteristics of the data previously acquired for calibration; recalculating feature differences between features of the data for estimation and the combined features; estimating a new classifier output result corresponding to the recalculated feature difference; recalculating the probability that the gaze point corresponding to the data for estimation belongs to each sub-region according to the new classifier output result; the center of the sub-region with the highest probability of being recalculated is determined as the estimated gaze point.

As an example, when the new data includes new data for calibration, a plurality of sub-regions are re-divided on the gazing area based on the existing calibration points and the calibration points corresponding to the new data for calibration, wherein the step of re-calculating the probability includes: and recalculating the probability that the gaze point corresponding to the data for estimation belongs to each of the re-divided sub-regions according to the new classifier output result.

As an example, the new data and the calibration point corresponding to the new data are obtained in any one of the following cases: in the case of recalibration, in the case of operation for a specific button, and in the case of operation for a specific position.

Calculating a feature difference between the feature of the extracted data for estimation and the feature of the extracted data for calibration using the trained neural network model; estimating classifier output results corresponding to the calculated feature differences by using a trained neural network model; calculating a probability that a gaze point corresponding to data for estimation belongs to each of a plurality of sub-regions divided from a gaze region according to an estimated classifier output result; the center of the sub-region with the highest probability is determined as the estimated gaze point.

As an example, when new data is acquired, the step of estimating the line of sight by the acquired features further includes: extracting features of the new data using a trained neural network model; combining the features of the new data with features of previously extracted data for calibration; recalculating feature differences between the features of the extracted data for estimation and the combined features; estimating a new classifier output result corresponding to the recalculated feature difference using a trained neural network model; recalculating the probability that the gaze point corresponding to the data for estimation belongs to each sub-region according to the new classifier output result; the center of the sub-region with the highest probability of being recalculated is determined as the estimated gaze point.

In the case of using the trained second neural network model, the step of performing line-of-sight estimation by the acquired features includes: calculating a feature difference between the feature of the data for estimation extracted by the second neural network model and the feature of the data for calibration extracted by the second neural network model; estimating classifier output results corresponding to the calculated feature differences using the trained second neural network model; calculating a probability that a gaze point corresponding to data for estimation belongs to each of a plurality of sub-regions divided from a gaze region according to an estimated classifier output result; the center of the sub-region with the highest probability is determined as the estimated gaze point.

As an example, when new data is acquired, the step of estimating the line of sight by the acquired features further includes: extracting features of the new data using the trained second neural network model; combining the features of the new data with features of previously extracted data for calibration; recalculating feature differences between the features of the extracted data for estimation and the combined features; estimating a new classifier output result corresponding to the recalculated feature difference using the trained second neural network model; recalculating the probability that the gaze point corresponding to the data for estimation belongs to each sub-region according to the new classifier output result; the center of the sub-region with the highest probability of being recalculated is determined as the estimated gaze point.

Operations of line-of-sight estimation by extracted features according to an exemplary embodiment of the present disclosure, which may include test operation a, test operation B, test operation C, and test operation D, are described below with reference to fig. 13.

In test operation a, when the user gazes at a location of interest (which may be referred to as a gaze point), an image is acquired by the camera, and features feat _x corresponding to the acquired image are extracted using the second neural network model. The extraction operation is feature extraction in the test stage, and can collect images in real time and extract features feat _X corresponding to the collected images. Taking a mobile phone as an example, acquiring a user image in real time through a camera of the mobile phone, and extracting characteristics according to the acquired user image in real time.

In test operation B, the feature differences diff_1 to diff_n between the feature feat _x and the features feat _1 to feat _n obtained at the time of calibration, respectively, are calculated. Features may be represented by vectors and feature differences may be differences between the vectors. The classification result of the feature differences diff_1 to diff_n may be calculated using a classifier (classifier) of the second neural network model, so that a comparison relation probability between an image for estimation (for example, an image acquired at the test operation a) and images x_cali_1 to x_cali_n for calibration (images acquired at the time of calibration) may be obtained. The comparison of the relationship probabilities can be understood as: the feature of the image for estimation obtained by the two classifiers is larger and smaller than the feature probability of the image for calibration. The feature differences may be calculated in accordance with a predetermined order rule, for example, the feature difference diff_i obtained by subtracting the features feat _1 to feat _n of the images obtained at the time of calibration from the features feat _x corresponding to the acquired images, respectively, that is: diff_i= feat _x-feat _i, 1.ltoreq.i.ltoreq.n, N being a natural number.

In the test operation C, in the case where the gazing area is divided into a plurality of sub-areas by the calibration points, the probability that the gazing point corresponding to the image for estimation (i.e., the point at which the current user gazes) falls on each sub-area may be calculated from the comparison relation probability. The center point of the region where the probability or the probability accumulation value or the probability expectation value is largest may be determined as the gaze point of the user.

Referring to fig. 13, features (e.g., test features) are extracted from an image (e.g., test picture) for estimation through a network (e.g., second neural network model), and calibration features acquired from the image for calibration are input to a corresponding comparison classifier together with the test features. Feature differences between the calibration features and the test features may be obtained and probabilities corresponding to the feature differences (e.g., comparison relationship probabilities) are output by the comparison classifier.

Fig. 14 shows a schematic diagram of dividing regions based on calibration points according to an exemplary embodiment of the present disclosure. Fig. 15 illustrates a region probability distribution histogram according to an exemplary embodiment of the present disclosure.

As shown in fig. 14, each calibration point may divide the axis into two parts, and the probability (comparison relation probability) p_i output by the classifier may represent the probability that the gaze point coordinate is less than or equal to the ith calibration point coordinate, p_i_l, and the probability that the gaze point coordinate is greater than the ith calibration point coordinate, p_i_g, where p_i_l+p_i_g=1. For example, p_1 may include a probability p_1_l that the gaze point coordinate is less than or equal to the coordinate of the gaze point 1 and a probability p_1_g that the gaze point coordinate is greater than the coordinate of the gaze point 1, and p_2 may represent a probability p_2_l that the gaze point coordinate is less than or equal to the coordinate of the gaze point 2 and a probability p_2_G that the gaze point coordinate is greater than the coordinate of the gaze point 2.

The probability P _area that the gaze point corresponding to the image used for estimation belongs to each of the plurality of sub-regions (e.g., region a between the origin of coordinates and the calibration point 1) is calculated by the following formula:

wherein, For the probability of the comparison of each sub-region with respect to the index point i, according to FIG. 14, if the sub-region is to the left of the index point iIf the sub-region is to the right of the mark point, thenCali_num is the number of index points.

As shown in fig. 15, the probability corresponding to the region a is p_2_l+p_1_l, the probability corresponding to the region between the calibration point 1 and the calibration point 2 is p_2_l+p_1_g, and the probability corresponding to the region between the calibration point 2 and the maximum value is p_2_g+p_1_g.

Taking line of sight estimation by a cell phone as an example, it is necessary to estimate the gaze point coordinates on the screen of the cell phone at which the user gazes as (y_x, y_y). On the x-axis and the Y-axis, the regions have been divided by the calibration points, the probability maximum region in which y_x falls on the x-axis and the probability maximum region in which y_y falls on the Y-axis are calculated by the operations described above, respectively, and the center of the region defined by these two probability maximum regions (as shown by x in fig. 13) is determined as the gaze point.

In the test operation D, when a new image for calibration and a corresponding line-of-sight tag are acquired during use by the user (for example, calibration operation or setting operation of the electronic apparatus), features corresponding to the new image for calibration and the corresponding line-of-sight tag are re-extracted, and the re-extracted features are combined with the existing features feat _i, for example, in the form of vector addition. The combined features may be obtained when a new image for calibration and corresponding line-of-sight tag are obtained to expand the set of features feat _i, thereby improving the characterizability of the features. The combined features may be used for testing (e.g., performing test operation a, test operation B, and test operation C), so that the accuracy of the test results may be improved.

In performing a test (e.g., performing test operation a, test operation B, and test operation C) using the combined features, the screen may be repartitioned with the existing index points and the newly added index points.

Fig. 16 shows a schematic diagram of repartitioning a gaze area according to an exemplary embodiment of the present disclosure. As shown in fig. 16, there are two straight lines (shown as a solid line in fig. 16) perpendicular to and intersecting each of the existing calibration points, and two straight lines (shown as a broken line in fig. 16, an intersecting portion of the two broken lines is a solid line) perpendicular to and intersecting the newly added calibration point, and the gazing area (i.e., screen) can be re-divided using the above straight lines to obtain a new sub-area. The probability maximum sub-region among the respective new sub-regions into which the gaze point falls may be determined by the operations described above.

The newly added calibration point and the new data corresponding to the newly added calibration point may be obtained when any one of the following operations is performed: receiving a click of a key by a user when setting mobile phone parameters, and aiming at the operation of a specific point, wherein the specific point comprises at least one of the following items: a specific point on the screen of the device, a specific button on the device, a specific point with a determined relative position to the device.

In an exemplary embodiment of the present disclosure, in the operation of extracting the features, the appearance difference may be removed by the feature difference; the neural network model does not need to be trained during calibration, so that the calibration process is simplified, and the calculated amount of electronic equipment such as mobile phones is reduced; when a new calibration point is obtained, data corresponding to the new calibration point may affect the prediction result, and as data corresponding to the new calibration point increases, the gaze area may be divided in more detail, so that the accuracy of the gaze estimation result may be improved.

Fig. 17 shows a line-of-sight estimating apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 17, a line-of-sight estimating apparatus according to an exemplary embodiment of the present disclosure includes: a feature acquisition unit 110 that acquires features of data for calibration and features of data for estimation; the estimation unit 120 performs line-of-sight estimation from the acquired features.

As an example, the line-of-sight estimating apparatus further includes: a model training unit for acquiring a neural network model, wherein the feature acquiring unit 110 extracts features of data for calibration through the neural network model and/or extracts features of data for estimation through the neural network model.

As an example, the model training unit trains the neural network model with data for training.

As an example, the data for training includes a first user image and a second user image, wherein the first user image and the second user image are an image of the same user when looking at a first object and an image when looking at a second object, respectively; wherein the model training unit trains the neural network model by taking the first user image and the second user image as inputs and the relative positions of the first object and the second object as outputs.

As an example, the data for training includes image-related data for training and a line-of-sight tag for training, and the model training unit converts the line-of-sight tag for training into a two-class tag; determining a loss function corresponding to the classification label; training a first neural network model by image-related data for training, the classification labels, and the loss function; the model training unit determines a coordinate Y _a of the sight line label for training on a specific coordinate axis, wherein Y _amin≤Y_a≤Y_amax,Y_amin and Y _amax are the minimum value and the maximum value of a coordinate Y _a respectively; setting a plurality of nodes at a predetermined pitch on the specific coordinate axis, wherein the size of the predetermined pitch is bin_size, generating a bi-classified label including a vector having a dimension which is the number of the plurality of nodes, wherein the value of each dimension of the vector is determined by the size of the predetermined pitch and the coordinate Y _a, wherein the loss function is calculated by the value of each dimension of the vector, and an activation probability calculated by data for training corresponding to each node.

As an example, the data for training comprises image-related data for training and line-of-sight labels for training, the model training unit extracting two pairs of samples from the image-related data for training and the line-of-sight labels for training, wherein the two pairs of samples correspond to the same user, each pair of samples comprising one image-related data for training and one corresponding line-of-sight label for training, the difference between the two line-of-sight labels of the two pairs of samples being greater than a first threshold and less than a second threshold; a second neural network model is trained through the two pairs of samples.

As an example, the model training unit extracts two further pairs of samples by the step of extracting two pairs of samples, wherein a difference between two line of sight labels of the two further pairs of samples is greater than a third threshold value and less than a fourth threshold value, wherein the third threshold value is greater than or equal to the first threshold value and the fourth threshold value is less than or equal to the second threshold value; continuing to train the second neural network model through the other two pairs of samples, wherein the step of extracting two pairs of samples is performed at least twice such that a difference between two line-of-sight labels of two pairs of samples extracted each time is smaller than a difference between two line-of-sight labels of two pairs of samples extracted the previous time.

As an example, before training the second neural network model, a model training unit sets parameters of the second neural network model based on the first neural network model, wherein the second neural network model and the first neural network model have the same network layer for feature extraction; and a model training unit for training the classifier of the second neural network model by using the two image related data for training of the two pairs of samples and the two classification labels corresponding to the two image related data for training.

As an example, the model training unit extracts features of the two image-related data for training through the trained first neural network model, respectively; calculating a feature difference between features of the two image-related data for training; the classifier of the second neural network model is trained by taking the feature differences as input and the classification labels corresponding to the two image-related data for training as output.

As an example, the feature acquisition unit 110 extracts features of data for calibration by the second neural network model and/or extracts features of data for estimation by the second neural network model.

As an example, the estimation unit 120 estimates the position of the gaze point in the gaze area by the acquired neural network model from the characteristics of the acquired data for estimation and the characteristics of the acquired data for calibration.

As an example, the estimation unit 120 calculates a feature difference between the feature of the extracted data for estimation and the feature of the extracted data for calibration; estimating a classifier output result corresponding to the calculated feature difference by using the acquired neural network model; calculating a probability that a gaze point corresponding to data for estimation belongs to each of a plurality of sub-regions divided from a gaze region according to an estimated classifier output result; the center of the sub-region with the highest probability is determined as the estimated gaze point.

As an example, when the gazing area is an area on a two-dimensional plane, the gazing area is divided by: setting two straight lines perpendicularly intersecting each of the calibration points for each of the calibration points, and dividing the fixation area into a plurality of sub-areas by the set straight lines, or

As an example, the estimation unit 120 determines, for the classifier output result corresponding to each calibration point, probabilities that the coordinates of each dimension of the gaze point are smaller and larger than the coordinates of each calibration point with respect to the each dimension, respectively; and calculating the probability that the fixation point belongs to each sub-region according to the determined probability.

As an example, the estimation unit 120 calculates a probability that the gaze point corresponding to the data for estimation belongs to each of the plurality of sub-areas by a comparison relation probability with respect to the corresponding calibration point for each of the sub-areas.

As an example, the line-of-sight estimating apparatus further includes: and the calibration unit is used for acquiring data for calibration when the specific point is used as one of the calibration points according to the operation of the user on the specific point before the sight line estimation is carried out through the acquired characteristics.

As an example, a calibration unit displays calibration points; acquiring a user image of a user when the user gazes at the calibration point as the data for calibration; and calibrating according to the data for calibrating.

As an example, the calibration unit, in response to receiving a gesture for the calibration point, determines a distance between an operation point corresponding to the gesture and the calibration point, and when the distance is smaller than a distance threshold, acquires a user image as the data for calibration.

It should be appreciated that the specific implementation of the line of sight estimation apparatus according to the exemplary embodiments of the present disclosure may be implemented with reference to the related specific implementations described in connection with fig. 1 to 16, and will not be described here again.

According to another exemplary embodiment of the present disclosure, there is provided an electronic device, wherein the electronic device includes: a processor; and a memory storing a computer program which, when executed by the processor, implements the gaze estimation method as described above.

The computer readable storage medium is any data storage device that can store data which can be read by a computer system. Examples of the computer-readable recording medium include: read-only memory, random access memory, compact disc read-only, magnetic tape, floppy disk, optical data storage device, and carrier waves (such as data transmission through the internet via wired or wireless transmission paths).

Further, it should be understood that the various units of the gaze estimation device according to exemplary embodiments of the present disclosure may be implemented as hardware components and/or as software components. The individual units may be implemented, for example, using a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), depending on the processing performed by the individual units as defined.

Furthermore, the line-of-sight estimation method according to the exemplary embodiments of the present disclosure may be implemented as computer code in a computer-readable storage medium. The computer code may be implemented by those skilled in the art in light of the description of the above methods. The above-described methods of the present disclosure are implemented when the computer code is executed in a computer.

Although a few exemplary embodiments of the present disclosure have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the disclosure, the scope of which is defined in the claims and their equivalents.

Claims

1. An adaptive line of sight estimation method, wherein the method comprises:

Obtaining features of data used for calibration and features of data used for estimation;

The line of sight is estimated by acquiring features.

The steps of performing sight line estimation by using the acquired features include:

Based on a feature difference between features of the data used for estimation and features of the data used for calibration, a probability that the fixation point is included in each of a plurality of sub-regions divided from the fixation region is calculated using a neural network model,

The center position of the sub-region having the maximum probability among the probabilities calculated for the sub-regions is determined as the position of the gaze point.

2. The line of sight estimation method according to claim 1, wherein the line of sight estimation method further comprises: obtaining the neural network model.

3. The line of sight estimation method according to claim 2, wherein the step of acquiring the neural network model comprises: training the neural network model using data for training.

4. The line of sight estimation method according to claim 1, wherein the step of acquiring features comprises: extracting features of data used for calibration through the neural network model and/or extracting features of data used for estimation through the neural network model.

5. The line of sight estimation method according to claim 3, wherein the data used for training comprises a first user image and a second user image,

The first user image and the second user image are respectively an image of the same user when gazing at the first object and an image of the same user when gazing at the second object;

The step of training the neural network model includes:

The neural network model is trained by taking the first user image and the second user image as input and taking the relative position of the first object and the second object as output.

6. The line of sight estimation method according to claim 3 or 5, wherein the data used for training includes image-related data used for training and line of sight labels used for training, and the step of training the neural network model includes:

Convert the sight labels used for training into binary classification labels;

Determine a loss function corresponding to the binary classification label;

The first neural network model is trained using the image-related data for training, the binary classification labels and the loss function.

7. The sight line estimation method according to claim 6, wherein the step of converting the sight line labels used for training into binary classification labels comprises:

Determine the coordinates of the sight labels used for training on specific coordinate axes;

A plurality of nodes are arranged at predetermined intervals on the specific coordinate axis,

generating a binary classification label including a vector having a dimension equal to the number of the plurality of nodes,

The values of the dimensions of the vector are determined by the size of the predetermined spacing and the coordinates.

The loss function is calculated by the values of each dimension of the vector and the activation probability, and the activation probability is calculated by the data for training corresponding to each node.

8. The line of sight estimation method according to any one of claims 3 to 5 and 7, wherein the data used for training comprises image-related data used for training and line of sight labels used for training, and the step of training the neural network model comprises:

Extracting two pairs of samples from the image-related data for training and the sight-line labels for training, wherein the two pairs of samples correspond to the same user, each pair of samples includes one image-related data for training and one corresponding sight-line label for training, and the difference between the two sight-line labels of the two pairs of samples is greater than a first threshold and less than a second threshold;

The second neural network model is trained by using the two pairs of samples.

9. The line of sight estimation method according to claim 8, wherein the step of training the neural network model further comprises:

Extracting another two pairs of samples by the step of extracting two pairs of samples, wherein the difference between the two sight labels of the another two pairs of samples is greater than a third threshold and less than a fourth threshold, wherein the third threshold is greater than or equal to the first threshold, and the fourth threshold is less than or equal to the second threshold;

The second neural network model is continued to be trained by using the other two pairs of samples.

10. The line of sight estimation method according to claim 9, wherein the step of extracting two pairs of samples is performed at least twice so that the difference between the two line of sight labels of the two pairs of samples extracted each time is smaller than the difference between the two line of sight labels of the two pairs of samples extracted previously.

11. The sight line estimation method according to claim 8, wherein before training the second neural network model, it further comprises:

Setting parameters of the second neural network model based on the first neural network model,

The second neural network model and the first neural network model have the same network layer for feature extraction, and the step of training the second neural network model using the two pairs of samples includes: training the classifier of the second neural network model using two image-related data for training of the two pairs of samples and binary classification labels corresponding to the two image-related data for training.

12. The line of sight estimation method according to claim 11, wherein the step of training the classifier of the second neural network model comprises:

Extracting features of the two image-related data for training respectively through the trained first neural network model;

Calculating a feature difference between features of the two image-related data for training;

The classifier of the second neural network model is trained by taking the feature difference as input and taking the binary classification labels corresponding to the two image-related data for training as output.

13. The line of sight estimation method according to claim 8, wherein extracting features of data used for calibration through the neural network model and/or extracting features of data used for estimation through the neural network model comprises: extracting features of data used for calibration through the second neural network model and/or extracting features of data used for estimation through the second neural network model.

14. The method according to claim 1, wherein the method further comprises:

Acquire an image including the user's eye area,

The step of acquiring features of the data for calibration and features of the data for estimation comprises: extracting features of the data for estimation from the acquired image,

The data used for calibration is data used to calibrate the neural network model.

15. The line of sight estimation method according to claim 1, wherein the step of calculating the probability that the gaze point is included in each of the plurality of sub-regions divided from the gaze area comprises:

For the evaluation output results of the neural network model corresponding to each of the calibration points, respectively determining the probability that the coordinate of each dimension of the gaze point is less than and greater than the coordinate of each calibration point with respect to each dimension;

The probability that the fixation point is included in each sub-region is calculated according to the determined probability.

16. The line of sight estimation method according to claim 1, wherein the step of performing line of sight estimation by using the acquired features further comprises:

Calculate the feature difference between the features extracted for estimation and the features extracted for calibration,

The step of using the neural network model to calculate the probability that the gaze point is included in each of the plurality of sub-regions divided from the gaze area comprises:

using the obtained neural network model to estimate the classifier output corresponding to the calculated feature difference, and

A probability that a fixation point corresponding to the data for estimation is included in each of a plurality of sub-regions divided from the fixation region is calculated based on the estimated classifier output result.

17. The line of sight estimation method according to claim 1 or 15, wherein, when the gaze area is on a two-dimensional plane, the gaze area is divided in the following manner: two vertically intersecting straight lines are set at each calibration point, and the gaze area is divided into multiple sub-areas by the set straight lines.

18. The line of sight estimation method according to claim 1 or 15, wherein, when the gaze area is in a three-dimensional space, the gaze area is divided in the following manner: three mutually perpendicular and intersecting straight lines are set at each calibration point, and the gaze area is divided into a plurality of sub-areas by the set straight lines.

19. The line of sight estimation method according to claim 15 or 16, wherein the step of calculating the probability that the gaze point corresponding to the data for estimation is included in each of the plurality of sub-regions divided from the gaze area comprises:

For each calibration point, the classifier output result corresponding to each calibration point is used to determine the probability that the coordinate of each dimension of the gaze point is smaller than or greater than the coordinate of each calibration point in each dimension.

20. The line of sight estimation method according to claim 15 or 16, wherein the step of calculating the probability that the gaze point is included in each of the sub-regions comprises: calculating the probability that the gaze point is included in each of the multiple sub-regions by comparing the relationship probability of each sub-region relative to the corresponding calibration point.

21. The line of sight estimation method according to any one of claims 14 to 16, wherein before performing line of sight estimation by using the acquired features, the line of sight estimation method further comprises:

When a point is used as one of the calibration points, data to be used for calibration of the neural network is acquired based on a user action performed with respect to the point.

22. The line of sight estimation method according to claim 21, wherein the point comprises at least one of the following: a point on a device screen, a point on a button on the device, and a specific point having a determined relative position with respect to the device.

23. The line of sight estimation method according to any one of claims 1 to 5 and 14 to 16, wherein the line of sight estimation method further comprises:

Display calibration points;

Acquire a calibration image obtained by capturing an image of a user when the user is gazing at the calibration point;

The neural network model is calibrated according to the obtained calibration image.

24. The line of sight estimation method according to claim 23, wherein the step of acquiring a calibration image comprises:

When a gesture for executing on the calibration point is received, a distance between an operation point corresponding to the gesture and the calibration point is determined, and when the distance is less than a distance threshold, a user image is determined as a calibration image for executing calibration.

25. The line of sight estimation method according to claim 24, wherein the step of performing calibration on the neural network model comprises: performing calibration on the neural network model using a calibration image.

26 . A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the line of sight estimation method according to any one of claims 1 to 25 is implemented.

27. An electronic device comprising:

one or more processors; and

A memory storing a computer program which, when executed by the processor, implements the line of sight estimation method according to any one of claims 1 to 25.