CN118378158B - Video editing processing method and system based on artificial intelligence - Google Patents

Video editing processing method and system based on artificial intelligence Download PDF

Info

Publication number
CN118378158B
CN118378158B
Authority
CN
China
Prior art keywords
video clip
semantic
video
scheme
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410182301.4A
Other languages
Chinese (zh)
Other versions
CN118378158A (en)
Inventor
郭勇
苑朋飞
靳世凯
王彭
赵存喜
庄麒达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongying Nian Nian Beijing Technology Co ltd
Original Assignee
Zhongying Nian Nian Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongying Nian Nian Beijing Technology Co ltd filed Critical Zhongying Nian Nian Beijing Technology Co ltd
Priority to CN202410182301.4A priority Critical patent/CN118378158B/en
Publication of CN118378158A publication Critical patent/CN118378158A/en
Application granted granted Critical
Publication of CN118378158B publication Critical patent/CN118378158B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses a video editing processing method and system based on artificial intelligence, relating to the field of video editing processing. First, content recognition is performed on each image frame in a video to be clipped to obtain a content description of each image frame. A first video clipping scheme is then obtained from the video to be clipped, and semantic association coding is performed on the content descriptions of its image frames to obtain a first video clipping scheme semantic coding feature vector; semantic coding of the video clipping requirement text description yields a video clipping requirement text semantic understanding feature vector. Finally, whether the degree of adaptation between the video clipping requirement and the first video clipping scheme exceeds a predetermined threshold is determined based on the video clipping requirement-video clipping scheme semantic interaction feature obtained by performing semantic feature interaction association analysis on the first video clipping scheme semantic coding feature vector and the video clipping requirement text semantic understanding feature vector. In this way, a better viewing experience may be provided to the user.

Description

Video editing processing method and system based on artificial intelligence
Technical Field
The application relates to the field of video clip processing, and more particularly, to an artificial intelligence-based video clip processing method and system.
Background
With the rapid growth of the internet and digital media, the volume and diversity of video content are increasing. However, it is becoming increasingly challenging to browse large amounts of video content and obtain useful information from it. Video editing is a technique that edits and processes raw video material to generate a video summary that is more attractive and information-dense.
However, conventional video editing typically requires specialized editors to manually select and process video material. This demands extensive editing experience and skill, limits the reach and efficiency of video editing, and increases cost and time consumption. Moreover, the traditional approach cannot automatically select the best editing scheme according to the requirements and preferences of the user and instead relies on the subjective judgment of editors. That is, conventional video editing schemes require editors to understand the needs of users and are easily affected by subjective factors, so the generated edited video may not meet the needs and preferences of users. Furthermore, conventional video editing schemes require editors to manually screen, crop and combine video material, which limits the efficiency and flexibility of video editing; this is not feasible for large-scale video content and real-time editing requirements, especially where rapid generation of video summaries is required.
Accordingly, an artificial intelligence based video clip processing scheme is desired.
Disclosure of Invention
The present application has been made to solve the above-mentioned technical problems. The embodiments of the application provide an artificial intelligence-based video clip processing method and system, which can improve video editing efficiency, reduce manual intervention, and eliminate subjectivity and personal preference, while also adapting to large-scale video content and real-time editing requirements, thereby providing a better viewing experience for users.
According to one aspect of the present application, there is provided an artificial intelligence based video clip processing method, comprising:
acquiring a video to be edited;
Performing content recognition on each image frame in the video to be clipped to obtain content description of each image frame, wherein the content description comprises characters, objects, scenes and actions;
acquiring a first video editing scheme from the video to be edited;
carrying out semantic association coding on the content description of each image frame of the first video clipping scheme to obtain semantic coding feature vectors of the first video clipping scheme;
Acquiring a video clip requirement text description;
carrying out semantic coding on the video clip requirement text description to obtain a video clip requirement text semantic understanding feature vector;
carrying out semantic feature interaction correlation analysis on the semantic coding feature vector of the first video editing scheme and the semantic understanding feature vector of the video editing requirement text so as to obtain semantic interaction features of the video editing requirement-video editing scheme;
and determining whether a degree of adaptation between the video clip requirements and the first video clip scheme exceeds a predetermined threshold based on the video clip requirements-video clip scheme semantic interaction characteristics.
According to another aspect of the present application, there is provided an artificial intelligence based video clip processing system comprising:
the video acquisition module is used for acquiring videos to be clipped;
The content identification module is used for carrying out content identification on each image frame in the video to be clipped to obtain content description of each image frame, wherein the content description comprises characters, objects, scenes and actions;
the editing scheme acquisition module is used for acquiring a first video editing scheme from the video to be edited;
The content description semantic association coding module is used for carrying out semantic association coding on the content description of each image frame of the first video clip scheme to obtain a first video clip scheme semantic coding feature vector;
The demand text description acquisition module is used for acquiring the demand text description of the video clip;
The text description semantic coding module is used for carrying out semantic coding on the video clip demand text description to obtain a video clip demand text semantic understanding feature vector;
The semantic feature interaction correlation analysis module is used for carrying out semantic feature interaction correlation analysis on the first video clip scheme semantic coding feature vector and the video clip requirement text semantic understanding feature vector to obtain video clip requirement-video clip scheme semantic interaction features;
and the adaptation degree judging module is used for determining whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds a preset threshold or not based on the video clip requirement-video clip scheme semantic interaction characteristic.
Compared with the prior art, in the artificial intelligence-based video editing processing method and system provided by the application, content recognition is first performed on each image frame in the video to be clipped to obtain the content description of each image frame. A first video clipping scheme is then obtained from the video to be clipped, and the content descriptions of its image frames are semantically association-coded to obtain a first video clipping scheme semantic coding feature vector. Next, the video clipping requirement text description is semantically coded to obtain a video clipping requirement text semantic understanding feature vector. Finally, whether the degree of adaptation between the video clipping requirement and the first video clipping scheme exceeds a predetermined threshold is determined based on the video clipping requirement-video clipping scheme semantic interaction feature obtained by performing semantic feature interaction analysis on the first video clipping scheme semantic coding feature vector and the video clipping requirement text semantic understanding feature vector. In this way, a better viewing experience may be provided to the user.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly introduced below, the following drawings not being drawn to scale with respect to actual dimensions, emphasis instead being placed upon illustrating the gist of the present application.
FIG. 1 is a flow chart of an artificial intelligence based video clip processing method according to an embodiment of the present application.
Fig. 2 is a flowchart of sub-step S180 of an artificial intelligence based video clip processing method according to an embodiment of the present application.
Fig. 3 is a flowchart of sub-step S181 of the artificial intelligence based video clip processing method according to an embodiment of the present application.
Fig. 4 is a flowchart of sub-step S184 of an artificial intelligence based video clip processing method according to an embodiment of the present application.
FIG. 5 is a block diagram of an artificial intelligence based video clip processing system according to an embodiment of the present application.
Fig. 6 is an application scenario diagram of an artificial intelligence based video clip processing method according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are also within the scope of the application.
As used in the specification and in the claims, the terms "a," "an," and/or "the" do not denote the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
A flowchart is used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
In view of the above technical problems, the technical idea of the present application is to acquire a video to be clipped and then identify the content of each image frame in the video to be clipped to obtain a content description, wherein the content description includes characters, objects, scenes and actions. By introducing a semantic analysis and understanding algorithm at the back end, the semantic understanding information of each candidate video editing scheme in the video to be edited is interactively associated with the text description of the user requirement, so that the degree of adaptation between the video editing requirement and the first video editing scheme can be detected and evaluated. In this way, the efficiency of video editing can be improved, manual intervention can be reduced, and subjectivity and personal preference can be eliminated. Meanwhile, the method can also meet the requirements of large-scale video content and real-time clipping, and provide a better viewing experience for users.
FIG. 1 is a flow chart of an artificial intelligence based video clip processing method according to an embodiment of the present application. As shown in fig. 1, the video clip processing method based on artificial intelligence according to the embodiment of the application comprises the following steps: s110, acquiring a video to be clipped; s120, carrying out content recognition on each image frame in the video to be clipped to obtain content description of each image frame, wherein the content description comprises characters, objects, scenes and actions; s130, acquiring a first video editing scheme from the video to be edited; s140, carrying out semantic association coding on the content description of each image frame of the first video clip scheme to obtain semantic coding feature vectors of the first video clip scheme; s150, acquiring a text description of video clip requirements; s160, carrying out semantic coding on the video clip requirement text description to obtain a video clip requirement text semantic understanding feature vector; s170, carrying out semantic feature interaction association analysis on the semantic coding feature vector of the first video clip scheme and the semantic understanding feature vector of the video clip requirement text so as to obtain semantic interaction features of the video clip requirement-video clip scheme; and S180, determining whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds a preset threshold or not based on the video clip requirement-video clip scheme semantic interaction characteristics.
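For orientation, the following minimal Python sketch shows one way the steps S110 to S180 could be chained end to end. Every function in it is a hypothetical placeholder with toy outputs; it is not the patented implementation, only an illustration of the data flow between the steps.

from typing import List

def recognize_frames(frames: List[str]) -> List[str]:
    # S120: per-frame content description (characters, objects, scenes, actions) - placeholder
    return ["description of " + f for f in frames]

def encode_clip_scheme(descriptions: List[str]) -> List[float]:
    # S140: semantic association coding of the candidate clip scheme - toy vector
    return [float(len(d)) for d in descriptions]

def encode_requirement(text: str) -> List[float]:
    # S160: semantic understanding of the editing requirement text - toy vector
    return [float(ord(c)) for c in text[:4]]

def interact(scheme_vec: List[float], req_vec: List[float]) -> List[float]:
    # S170: semantic feature interaction - placeholder (concatenation)
    return scheme_vec + req_vec

def adaptation_exceeds_threshold(interaction_vec: List[float], threshold: float = 0.5) -> bool:
    # S180: adaptation-degree decision - placeholder score
    return (sum(interaction_vec) % 1.0) > threshold

if __name__ == "__main__":
    frames = ["frame_0001", "frame_0002"]                      # S110/S130: video and candidate scheme
    requirement = "a 30-second highlight of the match"         # S150: requirement text description
    interaction = interact(encode_clip_scheme(recognize_frames(frames)),
                           encode_requirement(requirement))
    print("first clip scheme accepted:", adaptation_exceeds_threshold(interaction))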
Specifically, in the technical scheme of the application, firstly, a video to be clipped is obtained, and content identification is carried out on each image frame in the video to be clipped so as to obtain the content description of each image frame, so that the content and the semantics of the video material can be better understood. Specifically, through content recognition, element information such as characters, objects, scenes, actions and the like in each image frame can be extracted, so that a basis is provided for subsequent editing and processing.
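The patent does not name a concrete recognition model. As one hypothetical realization of the character/object part of the content recognition step, the sketch below runs an off-the-shelf torchvision detector on a single frame and turns the detected labels into a short content description; scene and action recognition would require additional models (for example a scene classifier and a temporal action model) and are not shown.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]          # COCO labels, including "person"
preprocess = weights.transforms()

@torch.no_grad()
def describe_frame(frame: torch.Tensor, score_threshold: float = 0.7) -> str:
    # frame: CHW float tensor (or PIL image); returns a coarse content description
    prediction = detector([preprocess(frame)])[0]
    labels = [categories[int(i)] for i, s in zip(prediction["labels"], prediction["scores"])
              if float(s) >= score_threshold]
    return ", ".join(sorted(set(labels))) or "no salient objects detected"

print(describe_frame(torch.rand(3, 480, 640)))   # a random frame, so usually "no salient objects detected"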
When actually selecting a video clip scheme based on user requirements, it is critical to perform semantic feature matching between each candidate video clip scheme and the user requirements, so as to evaluate whether the two are suited to each other. Based on this, in the technical solution of the present application, a first video clip scheme is first obtained from the video to be clipped. The content descriptions of the image frames of the first video clip scheme are then semantically coded by a semantic encoder comprising an embedded layer to extract global context semantic association feature information between the content descriptions of the image frames of the first video clip scheme, thereby obtaining the first video clip scheme semantic coding feature vector.
For the user's needs, a video clip requirement text description is first obtained. Semantic coding is then performed on the video clip requirement text description, for example by segmenting the text into words and processing it with a context semantic encoder comprising an embedded layer, so as to capture global context semantic association feature information in the video clip requirement text description and thereby obtain the video clip requirement text semantic understanding feature vector.
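As an illustration of this text branch, the sketch below builds a small context semantic encoder in PyTorch: an embedding layer over segmented word ids followed by a Transformer encoder and mean pooling. The vocabulary size, dimensions, and tokenization are assumptions made for the example, not parameters taken from the patent.

import torch
import torch.nn as nn

class RequirementTextEncoder(nn.Module):
    # Embedding layer + Transformer context encoder + mean pooling over tokens
    def __init__(self, vocab_size: int = 30000, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.context = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer ids produced by any word segmenter
        contextual = self.context(self.embed(token_ids))   # (batch, seq_len, dim)
        return contextual.mean(dim=1)                      # text semantic understanding feature vector

token_ids = torch.randint(0, 30000, (1, 12))               # stand-in for a segmented requirement text
requirement_vector = RequirementTextEncoder()(token_ids)   # shape (1, 256)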
Further, an inter-feature attention layer is used to perform attention-mechanism-based feature interaction on the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector to obtain a video clip requirement-video clip scheme semantic interaction feature vector, so that the association and interaction between the video clip requirement text semantic understanding features and the content-description semantic association features of each image frame of the first video clip scheme are captured. It should be appreciated that the goal of the traditional attention mechanism is to learn an attention weight matrix in which a greater weight is given to important features and a lesser weight to secondary features, thereby selecting information that is more critical to the current task goal. This approach focuses on weighting the importance of individual features while ignoring the dependencies between features. By contrast, through feature interaction based on the attention mechanism, the inter-feature attention layer can capture the correlation and mutual influence between the video clip requirement text semantic understanding features and the content-description semantic association features of each image frame of the first video clip scheme, learn the dependency relationships between these different semantic features, and interact and integrate the features according to those dependencies, thereby obtaining the video clip requirement-video clip scheme semantic interaction feature vector.
Accordingly, in step S140, performing semantic association encoding on the content descriptions of the image frames of the first video clip scheme to obtain a first video clip scheme semantic encoding feature vector, including: the content description of each image frame of the first video clip scheme is passed through a semantic encoder comprising an embedded layer to obtain the first video clip scheme semantically encoded feature vector. It should be noted that, in the video clip scheme, the content descriptions of the image frames are semantically related encoded to obtain a feature vector representing semantic information of the entire video clip scheme. This can be achieved by using a semantic encoder comprising an embedded layer. The embedded layer is a hierarchical structure in the deep learning model for converting input data into a more characterizable representation. In a semantic encoder, the embedding layer functions to convert the content description of an image frame into a low-dimensional semantically encoded feature vector. This feature vector contains semantic information of the input description and typically has a low dimension to facilitate subsequent processing and analysis. The primary purpose of the embedding layer is to learn the representation of the data so that similar inputs are closer together in the embedding space, while dissimilar inputs are farther apart. This helps extract key features of the data and can reduce the dimensionality of the data, thereby reducing computational complexity and improving generalization ability of the model. By inputting the content description of the image frames into a semantic encoder comprising an embedded layer, a feature vector representing the semantic information of the video clip scheme can be obtained. The feature vector can be used for subsequent tasks such as similarity calculation, cluster analysis, semantic retrieval and the like, so that the processing efficiency and accuracy of a video editing scheme are improved.
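A minimal sketch of such an encoder, assuming PyTorch and illustrative dimensions, is given below: each frame's content description is embedded and pooled, and a Transformer layer then models the global context association across frames. The design choices (mean pooling, Transformer, sizes) are assumptions for the example only, not the patented implementation.

import torch
import torch.nn as nn

class ClipSchemeEncoder(nn.Module):
    # Embedding layer over description tokens + cross-frame context encoder
    def __init__(self, vocab_size: int = 30000, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.frame_context = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, description_tokens: torch.Tensor) -> torch.Tensor:
        # description_tokens: (num_frames, tokens_per_description) ids of each frame's description
        frame_vectors = self.embed(description_tokens).mean(dim=1)     # (num_frames, dim)
        context = self.frame_context(frame_vectors.unsqueeze(0))       # global association across frames
        return context.squeeze(0)                                      # per-frame semantic coding features

tokens = torch.randint(1, 30000, (8, 16))            # 8 frames, 16 tokens per content description
scheme_features = ClipSchemeEncoder()(tokens)        # shape (8, 256)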
Accordingly, in step S170, performing semantic feature interaction correlation analysis on the first video clip scheme semantic coding feature vector and the video clip requirement text semantic understanding feature vector to obtain a video clip requirement-video clip scheme semantic interaction feature, including: and performing feature interaction based on an attention mechanism on the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector by using an inter-feature attention layer to obtain a video clip requirement-video clip scheme semantic interaction feature vector as the video clip requirement-video clip scheme semantic interaction feature. It is worth mentioning that the inter-feature attention layer is a hierarchical structure in a deep learning model for modeling and weighting the associations between different features. In the video editing task, the inter-feature attention layer can be used for carrying out interactive association analysis on the text semantic understanding feature vector of the video editing requirement and the semantic coding feature vector of the first video editing scheme so as to obtain semantic interactive features of the video editing requirement-video editing scheme. The inter-feature attention layer is able to automatically learn the relevance and importance between features by introducing an attention mechanism and weight the features according to these associated information. Specifically, for the text semantic understanding feature vector of the video clip requirement and the semantic encoding feature vector of the first video clip scheme, the attention layer between the features calculates the similarity or the correlation between the feature vectors, and the two feature vectors are weighted and summed according to the weight of the similarity to obtain the semantic interaction feature vector of the video clip requirement-video clip scheme. The feature interaction based on the attention mechanism can capture semantic association information between video clip requirements and video clip schemes, so that the accuracy and effect of video clips are improved. By learning the attention weights between features, the model can automatically focus on important feature parts and ignore irrelevant or noisy features, thereby improving the expressive and generalization capabilities of the model. In other words, the role of the inter-feature attention layer in the video editing task is to perform cross-correlation analysis on the text semantic understanding feature vector of the video editing requirement and the semantic encoding feature vector of the first video editing scheme, and generate the semantic interactive feature vector of the video editing requirement-video editing scheme, so that the quality and effect of the video editing are improved.
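The following PyTorch sketch shows one way such an inter-feature attention layer could be realized, using two multi-head cross-attention passes (the requirement attending to the scheme features and vice versa) followed by a linear fusion. The specific layout, head count, and dimensions are assumptions made for the example.

import torch
import torch.nn as nn

class InterFeatureAttention(nn.Module):
    # Cross-attention in both directions, then a linear fusion of the two context vectors
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.req_to_scheme = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scheme_to_req = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, requirement_vec: torch.Tensor, scheme_features: torch.Tensor) -> torch.Tensor:
        # requirement_vec: (batch, dim); scheme_features: (batch, num_frames, dim)
        query = requirement_vec.unsqueeze(1)                                   # (batch, 1, dim)
        req_ctx, _ = self.req_to_scheme(query, scheme_features, scheme_features)
        scheme_ctx, _ = self.scheme_to_req(scheme_features, query, query)
        pooled = scheme_ctx.mean(dim=1)                                        # (batch, dim)
        return self.fuse(torch.cat([req_ctx.squeeze(1), pooled], dim=-1))      # interaction feature vector

interaction_vector = InterFeatureAttention()(torch.randn(1, 256), torch.randn(1, 8, 256))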
Then, the video clip requirement-video clip scheme semantic interaction feature vector is passed through a classifier to obtain a classification result, where the classification result is used to indicate whether the degree of adaptation between the video clip requirement and the first video clip scheme exceeds a predetermined threshold. That is, the degree of matching between the first video clip scheme and the user demand is evaluated by using the interactive correlation features between the video clip requirement text semantic understanding features and the content-description semantic association features of each image frame of the first video clip scheme, so as to judge whether that degree of matching exceeds a predetermined threshold and thus whether to select the video clip scheme. Meanwhile, the method can also meet the requirements of large-scale video content and real-time clipping, and provide a better viewing experience for users.
Accordingly, as shown in fig. 2, determining whether the degree of adaptation between the video clip requirement and the first video clip scheme exceeds a predetermined threshold based on the video clip requirement-video clip scheme semantic interaction feature comprises: S181, passing the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector through a Clip-like model to obtain an incidence matrix; S182, fusing the video clip requirement-video clip scheme semantic interaction feature vector with the incidence matrix to obtain an optimized video clip requirement-video clip scheme semantic interaction feature vector; S183, correcting each feature value of the optimized video clip requirement-video clip scheme semantic interaction feature vector to obtain a corrected optimized video clip requirement-video clip scheme semantic interaction feature vector; and S184, passing the corrected optimized video clip requirement-video clip scheme semantic interaction feature vector through a classifier to obtain a classification result, where the classification result is used to indicate whether the degree of adaptation between the video clip requirement and the first video clip scheme exceeds a predetermined threshold. It should be appreciated that, in step S181, a Clip-like model is used to calculate an incidence matrix between the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector. The Clip-like model is a deep learning model that can understand and encode semantic associations between images and text. By inputting the video clip requirement text features and the semantic feature vector of the first video clip scheme into the Clip-like model, an incidence matrix representing the degree of semantic association between them can be obtained. In step S182, the video clip requirement-video clip scheme semantic interaction feature vector is fused with the incidence matrix to obtain the optimized video clip requirement-video clip scheme semantic interaction feature vector; this fusion can be implemented by simple vector multiplication or other fusion methods, so as to combine the semantic interaction features with the association information in the incidence matrix and further refine the measurement of the degree of adaptation between the video clip requirement and the first video clip scheme. In step S183, each feature value of the optimized video clip requirement-video clip scheme semantic interaction feature vector is corrected to obtain the corrected optimized video clip requirement-video clip scheme semantic interaction feature vector. In step S184, the corrected optimized video clip requirement-video clip scheme semantic interaction feature vector is input into a classifier to obtain a classification result. The classifier may be a binary classifier configured to determine whether the degree of adaptation between the video clip requirement and the first video clip scheme exceeds a predetermined threshold; the classification result may be a probability value or a binary label indicating the level of adaptation, and with the classification result it can be judged whether the adaptation between the video clip requirement and the first video clip scheme meets the predetermined threshold.
That is, S181 to S184 are a series of steps for determining whether the degree of adaptation between the video clip requirement and the first video clip scheme exceeds a predetermined threshold. The steps include calculating an incidence matrix, fusing the semantic interaction features with the incidence matrix, correcting each feature value, and obtaining a classification result of the degree of adaptation through a classifier. Through the combination of these steps, the semantic associations between video clip requirements and video clip schemes can be analyzed and evaluated to determine their degree of adaptation.
More specifically, in step S181, as shown in fig. 3, passing the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector through a Clip-like model to obtain an incidence matrix includes: S1811, processing the video clip requirement text semantic understanding feature vector by using a sequence encoder of the Clip-like model to obtain a requirement text feature vector; S1812, processing the first video clip scheme semantic coding feature vector by using a sequence encoder of the Clip-like model to obtain a scheme feature vector; and S1813, fusing the requirement text feature vector and the scheme feature vector by using a joint encoder of the Clip-like model to obtain the incidence matrix.
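A small PyTorch sketch of such a Clip-like module is given below: two sequence encoders (standing in for S1811 and S1812) summarize each modality, and a joint step (standing in for S1813) forms an association matrix from the two summaries. The GRU encoders and the normalized outer product are illustrative assumptions; the patent does not specify these internals.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipLikeAssociation(nn.Module):
    # Two sequence encoders plus a joint outer-product step producing an association matrix
    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_seq_encoder = nn.GRU(dim, dim, batch_first=True)      # stand-in for S1811
        self.scheme_seq_encoder = nn.GRU(dim, dim, batch_first=True)    # stand-in for S1812

    def forward(self, requirement_seq: torch.Tensor, scheme_seq: torch.Tensor) -> torch.Tensor:
        # requirement_seq: (batch, L_text, dim); scheme_seq: (batch, L_frames, dim)
        _, req_hidden = self.text_seq_encoder(requirement_seq)          # final hidden state (1, batch, dim)
        _, sch_hidden = self.scheme_seq_encoder(scheme_seq)
        req_vec = F.normalize(req_hidden.squeeze(0), dim=-1)            # (batch, dim)
        sch_vec = F.normalize(sch_hidden.squeeze(0), dim=-1)
        return req_vec.unsqueeze(2) * sch_vec.unsqueeze(1)              # (batch, dim, dim) incidence matrix

incidence = ClipLikeAssociation()(torch.randn(1, 12, 256), torch.randn(1, 8, 256))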
In particular, in the technical solution of the present application, it is considered that the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector respectively express the text semantic features of the video clip requirement text description and the image semantic features of the image content of the first video clip scheme. When the inter-feature attention layer is used to perform attention-mechanism-based feature interaction on these two vectors in order to extract the dependency relationship features between them, it is desirable that the two vectors have semantic fusion representations that are aligned as closely as possible. Therefore, an incidence matrix between the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector is preferably first calculated in the manner of a Clip-like model, and the video clip requirement-video clip scheme semantic interaction feature vector is then fused with that incidence matrix to obtain the optimized video clip requirement-video clip scheme semantic interaction feature vector.
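One plausible form of this fusion, assuming the incidence matrix is square in the feature dimension as in the sketch after step S181, is a matrix-vector product with a residual connection. The patent describes the fusion only as "simple vector multiplication or other fusion methods", so the concrete formula below is an assumption for illustration.

import torch

def fuse_with_incidence(interaction_vec: torch.Tensor, incidence: torch.Tensor) -> torch.Tensor:
    # interaction_vec: (batch, dim); incidence: (batch, dim, dim)
    projected = torch.bmm(incidence, interaction_vec.unsqueeze(2)).squeeze(2)
    return interaction_vec + projected            # optimized interaction feature vector (assumed residual form)

optimized_vec = fuse_with_incidence(torch.randn(1, 256), torch.randn(1, 256, 256))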
However, it is further considered that, when the incidence matrix performs the cross-modal semantic alignment and association expression between the vectors, the text semantic features and the image semantic features expressed by the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector serve as foreground object features, so background distribution noise is also introduced into the expression. Moreover, in the high-rank distribution expression between the vectors and the matrix, the spatially heterogeneous distribution of the heterogeneous high-dimensional text semantic features and image semantic features causes probability density mapping errors of the incidence matrix relative to the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector. This affects the significance of the feature distribution information expressed by the optimized video clip requirement-video clip scheme semantic interaction feature vector, and when that vector undergoes quasi-probability regression mapping through the classifier, it is difficult for its feature distribution to converge stably to the local feature distribution of the classification target during training, which affects the accuracy of the classification result.
Based on the above, the applicant of the present application corrects each feature value of the optimized video clip requirement-video clip scheme semantic interaction feature vector at each classifier iteration, expressed as: correcting each characteristic value of the optimized video clipping requirement-video clipping scheme semantic interaction characteristic vector by using the following correction formula to obtain the corrected optimized video clipping requirement-video clipping scheme semantic interaction characteristic vector; wherein, the correction formula is:
wherein, in the correction formula, the quantities are, respectively: the optimized video clip requirement-video clip scheme semantic interaction feature vector; a feature value of the optimized video clip requirement-video clip scheme semantic interaction feature vector; the 1-norm and the square of the 2-norm of the optimized video clip requirement-video clip scheme semantic interaction feature vector; the length of the optimized video clip requirement-video clip scheme semantic interaction feature vector; a hyperparameter weight; a logarithmic function value with base 2; and the corresponding feature value of the corrected optimized video clip requirement-video clip scheme semantic interaction feature vector.
Specifically, based on the scale and structural parameters of the optimized video clip requirement-video clip scheme semantic interaction feature vector, geometric registration of the shape of its high-dimensional feature manifold is performed, so that, within the feature set formed by its feature values, the features with rich feature semantic information, that is, the distinguishable and stable interest features that express dissimilarity based on local context information during the iterative training of the classifier, are highlighted. This marks the significance of the feature information of the optimized video clip requirement-video clip scheme semantic interaction feature vector in the classifier training process, and improves the expression effect of the optimized video clip requirement-video clip scheme semantic interaction feature vector, the training speed of the model, and the accuracy of the classification result obtained by the classifier. In this way, the degree of adaptation between the video editing scheme and the user requirement can be evaluated based on the semantic interaction between the two, so that a video editing scheme can be selected, thereby improving the efficiency of video editing, reducing manual intervention, and eliminating subjectivity and personal preference. Meanwhile, the method can also meet the requirements of large-scale video content and real-time editing, so that video editing is completed automatically and a better viewing experience is provided for users.
Further, in step S184, as shown in fig. 4, the corrected optimized video clip requirement-video clip scheme semantic interaction feature vector is passed through a classifier to obtain a classification result, where the classification result is used to indicate whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds a predetermined threshold, and includes: s1841, performing full-connection coding on the corrected optimized video editing requirement-video editing scheme semantic interaction feature vector by using a full-connection layer of the classifier to obtain a coding classification feature vector; and S1842, inputting the coding classification feature vector into a Softmax classification function of the classifier to obtain the classification result.
That is, in the technical solution of the present disclosure, the labels of the classifier are that the degree of adaptation between the video clip requirement and the first video clip scheme exceeds a predetermined threshold (first label) and that it does not exceed the predetermined threshold (second label), and the classifier determines, through a soft maximum function, to which classification label the optimized video clip requirement-video clip scheme semantic interaction feature vector belongs. It should be noted that the first label p1 and the second label p2 do not involve any artificially set concept; in fact, during training, the computer model has no concept of "whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds the predetermined threshold". There are simply two classification labels, and the output feature is assigned a probability under each of them, i.e., the sum of p1 and p2 is one. Therefore, the classification result of whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds the predetermined threshold is in fact a classification probability distribution over the two labels, which conforms to a natural law; in essence, the physical meaning of the natural probability distribution of the labels is used rather than the linguistic meaning of "whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds the predetermined threshold".
It should be appreciated that the role of the classifier is to learn classification rules from given classes of known training data and then classify (or predict) unknown data. Logistic regression, SVM and the like are commonly used to solve binary classification problems; for multi-class classification, logistic regression or SVM can also be used, but multiple binary classifiers must be combined to form a multi-class classifier, which is error-prone and inefficient. The commonly used multi-class method is the Softmax classification function.
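As a concrete illustration of steps S1841 and S1842, the sketch below applies a fully connected encoding layer followed by a two-way Softmax head. The hidden size and layer layout are assumptions for the example; the two outputs correspond to the labels p1 (the adaptation degree exceeds the threshold) and p2 (it does not).

import torch
import torch.nn as nn

class AdaptationClassifier(nn.Module):
    # Full-connection coding (S1841) followed by a Softmax over the two adaptation labels (S1842)
    def __init__(self, dim: int = 256, hidden: int = 128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 2)

    def forward(self, corrected_vec: torch.Tensor) -> torch.Tensor:
        logits = self.head(self.fc(corrected_vec))
        return torch.softmax(logits, dim=-1)          # (p1, p2), with p1 + p2 = 1

probabilities = AdaptationClassifier()(torch.randn(1, 256))
exceeds_threshold = bool(probabilities[0, 0] > probabilities[0, 1])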
In summary, an artificial intelligence based video clip processing method according to an embodiment of the present application is illustrated, which may provide a better viewing experience for a user.
Fig. 5 is a block diagram of an artificial intelligence based video clip processing system 100 according to an embodiment of the present application. As shown in fig. 5, an artificial intelligence based video clip processing system 100 according to an embodiment of the present application includes: a video acquisition module 110, configured to acquire a video to be clipped; a content recognition module 120, configured to perform content recognition on each image frame in the video to be clipped to obtain a content description of each image frame, where the content description includes a person, an object, a scene, and an action; a clipping scheme acquisition module 130, configured to acquire a first video clipping scheme from the video to be clipped; a content description semantic association encoding module 140, configured to perform semantic association encoding on content descriptions of each image frame of the first video clip scheme to obtain a first video clip scheme semantic encoding feature vector; a requirement text description acquisition module 150, configured to acquire a requirement text description of a video clip; a text description semantic coding module 160, configured to perform semantic coding on the video clip requirement text description to obtain a video clip requirement text semantic understanding feature vector; the semantic feature interaction correlation analysis module 170 is configured to perform semantic feature interaction correlation analysis on the first video clip scheme semantic coding feature vector and the video clip requirement text semantic understanding feature vector to obtain video clip requirement-video clip scheme semantic interaction features; and an adaptation degree judging module 180, configured to determine whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds a predetermined threshold based on the video clip requirement-video clip scheme semantic interaction feature.
In one example, in the artificial intelligence based video clip processing system 100 described above, the content description semantic association encoding module 140 is configured to: the content description of each image frame of the first video clip scheme is passed through a semantic encoder comprising an embedded layer to obtain the first video clip scheme semantically encoded feature vector.
In one example, in the artificial intelligence based video clip processing system 100 described above, the semantic feature cross-correlation analysis module 170 is configured to: and performing feature interaction based on an attention mechanism on the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector by using an inter-feature attention layer to obtain a video clip requirement-video clip scheme semantic interaction feature vector as the video clip requirement-video clip scheme semantic interaction feature.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective modules in the above-described artificial intelligence-based video clip processing system 100 have been described in detail in the above description of the artificial intelligence-based video clip processing method with reference to fig. 1 to 4, and thus, repetitive descriptions thereof will be omitted.
As described above, the artificial intelligence based video clip processing system 100 according to the embodiment of the present application may be implemented in various wireless terminals, for example, a server or the like having an artificial intelligence based video clip processing algorithm. In one example, the artificial intelligence based video clip processing system 100 according to embodiments of the present application may be integrated into a wireless terminal as a software module and/or hardware module. For example, the artificial intelligence based video clip processing system 100 may be a software module in the operating system of the wireless terminal or may be an application developed for the wireless terminal; of course, the artificial intelligence based video clip processing system 100 could equally be one of many hardware modules of the wireless terminal.
Alternatively, in another example, the artificial intelligence based video clip processing system 100 and the wireless terminal may be separate devices, and the artificial intelligence based video clip processing system 100 may be connected to the wireless terminal via a wired and/or wireless network and communicate interaction information in accordance with an agreed data format.
Fig. 6 is an application scenario diagram of an artificial intelligence based video clip processing method according to an embodiment of the present application. As shown in fig. 6, in this application scenario, first, a video to be clipped (e.g., D1 illustrated in fig. 6) is acquired, and a video clip requirement text description (e.g., D2 illustrated in fig. 6) is then input into a server in which an artificial intelligence-based video clip processing algorithm is deployed (e.g., S illustrated in fig. 6), wherein the server is capable of processing the video to be clipped and the video clip requirement text description using the artificial intelligence-based video clip processing algorithm to obtain a classification result indicating whether the degree of adaptation between video clip requirements and a first video clip scheme exceeds a predetermined threshold.
Furthermore, those skilled in the art will appreciate that the various aspects of the application are illustrated and described in the context of a number of patentable categories or circumstances, including any novel and useful procedures, machines, products, or materials, or any novel and useful modifications thereof. Accordingly, aspects of the application may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.) or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the application may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the following claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (10)

1. A method for processing video clips based on artificial intelligence, comprising:
acquiring a video to be edited;
Performing content recognition on each image frame in the video to be clipped to obtain content description of each image frame, wherein the content description comprises characters, objects, scenes and actions;
acquiring a first video editing scheme from the video to be edited;
carrying out semantic association coding on the content description of each image frame of the first video clipping scheme to obtain semantic coding feature vectors of the first video clipping scheme;
Acquiring a video clip requirement text description;
carrying out semantic coding on the video clip requirement text description to obtain a video clip requirement text semantic understanding feature vector;
carrying out semantic feature interaction correlation analysis on the semantic coding feature vector of the first video editing scheme and the semantic understanding feature vector of the video editing requirement text so as to obtain semantic interaction features of the video editing requirement-video editing scheme;
and determining whether a degree of adaptation between the video clip requirements and the first video clip scheme exceeds a predetermined threshold based on the video clip requirements-video clip scheme semantic interaction characteristics.
2. The artificial intelligence based video clip processing method of claim 1, wherein performing semantic association encoding on the content descriptions of the respective image frames of the first video clip scheme to obtain the first video clip scheme semantic coding feature vector comprises:
the content description of each image frame of the first video clip scheme is passed through a semantic encoder comprising an embedded layer to obtain the first video clip scheme semantically encoded feature vector.
3. The artificial intelligence based video clip processing method of claim 2, wherein performing semantic feature cross-correlation analysis on the first video clip schema semantic coding feature vector and the video clip requirement text semantic understanding feature vector to obtain video clip requirement-video clip schema semantic interactive features, comprises:
And performing feature interaction based on an attention mechanism on the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector by using an inter-feature attention layer to obtain a video clip requirement-video clip scheme semantic interaction feature vector as the video clip requirement-video clip scheme semantic interaction feature.
4. The artificial intelligence based video clip processing method of claim 3, wherein determining whether a degree of adaptation between a video clip requirement and a first video clip scheme exceeds a predetermined threshold based on the video clip requirement-video clip scheme semantic interaction feature comprises:
passing the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector through a Clip-like model to obtain an incidence matrix;
Fusing the video clip requirement-video clip scheme semantic interaction feature vector with the incidence matrix to obtain an optimized video clip requirement-video clip scheme semantic interaction feature vector;
Correcting each characteristic value of the optimized video clipping requirement-video clipping scheme semantic interaction characteristic vector to obtain a corrected optimized video clipping requirement-video clipping scheme semantic interaction characteristic vector;
And passing the corrected optimized video clip requirement-video clip scheme semantic interaction feature vector through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds a preset threshold.
5. The artificial intelligence based video Clip processing method of claim 4, wherein passing the video Clip requirement text semantic understanding feature vector and the first video Clip scheme semantic coding feature vector through a Clip-like model to obtain an incidence matrix comprises:
Processing the video Clip demand text semantic understanding feature vector by using a sequence encoder of the Clip-like model to obtain a demand text feature vector;
Processing the first video Clip scheme semantic coding feature vector by using a sequence encoder of the Clip-like model to obtain a scheme feature vector;
and fusing the demand text feature vector and the scheme feature vector by using a joint encoder of the Clip-like model to obtain the incidence matrix.
6. The artificial intelligence based video clip processing method of claim 5, wherein correcting the respective feature values of the optimized video clip need-video clip solution semantic interaction feature vector to obtain a corrected optimized video clip need-video clip solution semantic interaction feature vector, comprises: correcting each characteristic value of the optimized video clipping requirement-video clipping scheme semantic interaction characteristic vector by using the following correction formula to obtain the corrected optimized video clipping requirement-video clipping scheme semantic interaction characteristic vector;
wherein, the correction formula is:
wherein, in the correction formula, the quantities are, respectively: the optimized video clip requirement-video clip scheme semantic interaction feature vector; a feature value of the optimized video clip requirement-video clip scheme semantic interaction feature vector; the 1-norm and the square of the 2-norm of the optimized video clip requirement-video clip scheme semantic interaction feature vector; the length of the optimized video clip requirement-video clip scheme semantic interaction feature vector; a hyperparameter weight; a logarithmic function value with base 2; and the corresponding feature value of the corrected optimized video clip requirement-video clip scheme semantic interaction feature vector.
7. The artificial intelligence based video clip processing method of claim 6, wherein passing the corrected optimized video clip requirement-video clip scheme semantic interaction feature vector through a classifier to obtain a classification result, the classification result being used to represent whether a degree of adaptation between a video clip requirement and a first video clip scheme exceeds a predetermined threshold, comprising: performing full-connection coding on the corrected optimized video clipping requirement-video clipping scheme semantic interaction feature vector by using a full-connection layer of the classifier to obtain a coding classification feature vector;
and inputting the coding classification feature vector into a Softmax classification function of the classifier to obtain the classification result.
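A small illustrative sketch of such a classifier, assuming a single fully connected encoding layer ahead of a two-class Softmax head (class 1: the adaptation degree exceeds the predetermined threshold); the class name FitnessClassifier and the feature dimension are assumptions.

    import torch
    import torch.nn as nn

    class FitnessClassifier(nn.Module):
        # Full-connection coding of the corrected interaction feature vector,
        # followed by a Softmax over two classes.
        def __init__(self, dim: int = 256):
            super().__init__()
            self.fc = nn.Linear(dim, dim)
            self.head = nn.Linear(dim, 2)

        def forward(self, corrected_vec: torch.Tensor) -> torch.Tensor:
            encoded = torch.relu(self.fc(corrected_vec))       # coding classification feature vector
            return torch.softmax(self.head(encoded), dim=-1)   # classification result probabilities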
8. An artificial intelligence based video clip processing system, comprising:
The video acquisition module is used for acquiring a video to be clipped;
The content identification module is used for carrying out content identification on each image frame in the video to be clipped to obtain content description of each image frame, wherein the content description comprises characters, objects, scenes and actions;
The clipping scheme acquisition module is used for acquiring a first video clip scheme from the video to be clipped;
The content description semantic association coding module is used for carrying out semantic association coding on the content description of each image frame of the first video clip scheme to obtain a first video clip scheme semantic coding feature vector;
The requirement text description acquisition module is used for acquiring a video clip requirement text description;
The text description semantic coding module is used for carrying out semantic coding on the video clip requirement text description to obtain a video clip requirement text semantic understanding feature vector;
The semantic feature interaction correlation analysis module is used for carrying out semantic feature interaction correlation analysis on the first video clip scheme semantic coding feature vector and the video clip requirement text semantic understanding feature vector to obtain video clip requirement-video clip scheme semantic interaction features;
and the adaptation degree judging module is used for determining whether the adaptation degree between the video clip requirement and the first video clip scheme exceeds a predetermined threshold based on the video clip requirement-video clip scheme semantic interaction feature.
9. The artificial intelligence based video clip processing system of claim 8, wherein the content description semantic association encoding module is configured to:
the content description of each image frame of the first video clip scheme is passed through a semantic encoder comprising an embedded layer to obtain the first video clip scheme semantically encoded feature vector.
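An illustrative sketch of a semantic encoder containing an embedding layer, as referenced in claim 9; the GRU sequence encoder behind the embedding layer, the mean pooling into a single vector, and the names SemanticEncoder, vocab_size, and dim are assumptions.

    import torch
    import torch.nn as nn

    class SemanticEncoder(nn.Module):
        # Embedding layer followed by a lightweight sequence encoder; token states
        # are mean-pooled into one semantically encoded feature vector.
        def __init__(self, vocab_size: int = 30000, dim: int = 256):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            x = self.embedding(token_ids)   # (batch, seq_len, dim)
            states, _ = self.encoder(x)
            return states.mean(dim=1)       # (batch, dim) feature vector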
10. The artificial intelligence based video clip processing system of claim 9, wherein the semantic feature interaction correlation analysis module is configured to:
perform feature interaction based on an attention mechanism on the video clip requirement text semantic understanding feature vector and the first video clip scheme semantic coding feature vector by using an inter-feature attention layer, so as to obtain a video clip requirement-video clip scheme semantic interaction feature vector as the video clip requirement-video clip scheme semantic interaction feature.
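A minimal sketch of an inter-feature attention layer for claim 10, assuming standard multi-head cross-attention in which the requirement-text vector queries the clip-scheme vector; the class name InterFeatureAttention and the head count are assumptions.

    import torch
    import torch.nn as nn

    class InterFeatureAttention(nn.Module):
        # Cross-attention between the two feature vectors: the requirement text
        # vector attends to the clip scheme vector, yielding the semantic
        # interaction feature vector.
        def __init__(self, dim: int = 256, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, requirement_vec: torch.Tensor, scheme_vec: torch.Tensor) -> torch.Tensor:
            q = requirement_vec.unsqueeze(1)   # (batch, 1, dim) query
            kv = scheme_vec.unsqueeze(1)       # (batch, 1, dim) key / value
            out, _ = self.attn(q, kv, kv)
            return out.squeeze(1)              # (batch, dim) interaction feature vector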
CN202410182301.4A 2024-02-19 2024-02-19 Video editing processing method and system based on artificial intelligence Active CN118378158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410182301.4A CN118378158B (en) 2024-02-19 2024-02-19 Video editing processing method and system based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN118378158A CN118378158A (en) 2024-07-23
CN118378158B true CN118378158B (en) 2024-10-08

Family

ID=91911491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410182301.4A Active CN118378158B (en) 2024-02-19 2024-02-19 Video editing processing method and system based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN118378158B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593018A (en) * 2008-05-27 2009-12-02 盛大计算机(上海)有限公司 Human-computer interface device and method of operating
CN101719030A (en) * 2008-10-09 2010-06-02 华硕电脑股份有限公司 Electronic device with touch control function and input method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8354997B2 (en) * 2006-10-31 2013-01-15 Navisense Touchless user interface for a mobile device
KR20110137587A (en) * 2010-06-17 2011-12-23 한국전자통신연구원 Space input / output interfacing device and method

Also Published As

Publication number Publication date
CN118378158A (en) 2024-07-23

Similar Documents

Publication Publication Date Title
Changpinyo et al. Synthesized classifiers for zero-shot learning
Zhu et al. Exploring auxiliary context: Discrete semantic transfer hashing for scalable image retrieval
Minhas et al. Incremental learning in human action recognition based on snippets
CN112417306B (en) Method for optimizing performance of recommendation algorithm based on knowledge graph
EP2568429A1 (en) Method and system for pushing individual advertisement based on user interest learning
CN112766079A (en) Unsupervised image-to-image translation method based on content style separation
Wang et al. Cross-modal retrieval: a systematic review of methods and future directions
CN108985370B (en) Image annotation sentence automatic generation method
Bouguila A model-based approach for discrete data clustering and feature weighting using MAP and stochastic complexity
CN116911929B (en) Advertisement service terminal and method based on big data
Yi et al. Multi-modal learning for affective content analysis in movies
CN114758283B (en) A video tag classification method, system and computer readable storage medium
Lee et al. Photo aesthetics analysis via DCNN feature encoding
CN117812381B (en) Video content making method based on artificial intelligence
Yang et al. Zero-shot domain adaptation via kernel regression on the grassmannian
CN117635275A (en) Intelligent electronic commerce operation commodity management platform and method based on big data
Irie et al. A bayesian approach to multimodal visual dictionary learning
CN119202826A (en) SKU intelligent classification and label generation method based on visual pre-training model
CN118378158B (en) Video editing processing method and system based on artificial intelligence
Heyden et al. An integral projection-based semantic autoencoder for zero-shot learning
CN102509121A (en) Natural scene classification and sorting method based on categorical distribution
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
Thrilokachandran et al. Zero and few shot action recognition in videos with caption semantic and generative assist
Zhang et al. A tensor-driven temporal correlation model for video sequence classification
CN114445662B (en) A robust image classification method and system based on label embedding

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 701, 7th floor, and 801, 8th floor, Building 1, Courtyard 8, Gouzitou Street, Changping District, Beijing, 102200

Applicant after: Zhongying Nian Nian (Beijing) Technology Co.,Ltd.

Address before: No. 6304, Beijing shunhouyu Business Co., Ltd., No. 32, Wangfu street, Beiqijia Town, Changping District, Beijing 102200

Applicant before: China Film annual (Beijing) culture media Co.,Ltd.

Country or region before: China

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant