CN108806670B - Audio recognition method, device and storage medium - Google Patents


Info

Publication number
CN108806670B
Authority
CN
China
Prior art keywords
voice
target
terminal
speech
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810758435.0A
Other languages
Chinese (zh)
Other versions
CN108806670A (en)
Inventor
李国华
戴帅湘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shitian Cultural Development Co ltd
Original Assignee
Beijing Moran Cognitive Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moran Cognitive Technology Co Ltd
Priority to CN201810758435.0A
Publication of CN108806670A
Application granted
Publication of CN108806670B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/0638 Interactive procedures
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a speech recognition method, a device, and a storage medium, belonging to the technical field of speech processing. The method includes: collecting a target speech to be recognized and acquiring acoustic features of the target speech; invoking a target recognition model, inputting the acoustic features into the target recognition model, and outputting a behavioral intent label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intent corresponding to any speech according to the acoustic features of that speech. In the embodiments of the present invention, the behavioral intent corresponding to either standard or non-standard speech can be recognized by the target recognition model on the basis of acoustic features, which enhances the applicability of speech recognition.

Description

Speech recognition method, device and storage medium

Technical Field

Embodiments of the present invention relate to the technical field of speech processing, and in particular to a speech recognition method, device, and storage medium.

Background

At present, speech recognition technology is widely used. For example, while using a terminal, a user may rely on speech recognition to control the terminal, for instance to instruct the terminal to turn on its camera.

In the related art, after the terminal collects the speech input by the user, it sends the speech to a speech conversion server, which converts the speech into text and returns the converted text to the terminal. After receiving the text, the terminal sends the text to a semantic recognition server, which performs semantic recognition on the text and feeds the recognition result back to the terminal. The terminal can then perform the corresponding operation based on the recognition result.

However, in the above process only standard speech can be recognized, that is, speech recognition is limited to Mandarin, so the applicability of speech recognition is poor.

Summary of the Invention

Embodiments of the present invention provide a speech recognition method, device, and storage medium, which can solve the problem of poor applicability of speech recognition in the related art. The technical solution is as follows:

In a first aspect, a speech recognition method is provided, the method comprising:

collecting a target speech to be recognized;

acquiring acoustic features of the target speech;

invoking a target recognition model, inputting the acoustic features into the target recognition model, and outputting a behavioral intent label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intent corresponding to any speech according to the acoustic features of that speech.

Optionally, before invoking the target recognition model, the method further includes:

acquiring acoustic features of at least one speech training sample and a behavioral intent label corresponding to each speech training sample;

training a recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample, to obtain the target recognition model.

Optionally, acquiring the behavioral intent label corresponding to each speech training sample includes:

acquiring at least one speech;

determining a behavioral operation corresponding to each speech in the at least one speech;

generating a behavioral intent label corresponding to each behavioral operation;

determining the at least one speech as the at least one speech training sample, and determining each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample.

Optionally, before acquiring the at least one speech, the method further includes:

querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal;

when the at least one speech all comes from the target user, performing the operation of acquiring the at least one speech.

Optionally, querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from the target user includes:

determining a difference value between the voiceprint feature of each speech and a preset voiceprint feature;

when the difference value between the voiceprint feature of each speech and the preset voiceprint feature is less than a preset threshold, determining that the at least one speech all comes from the target user.

Optionally, after training the recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample to obtain the target recognition model, the method further includes:

sharing the target recognition model with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In a second aspect, a speech recognition device is provided, the device comprising:

a collection module, configured to collect a target speech to be recognized;

a first acquisition module, configured to acquire acoustic features of the target speech;

an invoking module, configured to invoke a target recognition model, input the acoustic features into the target recognition model, and output a behavioral intent label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intent corresponding to any speech according to the acoustic features of that speech.

Optionally, the device further includes:

a second acquisition module, configured to acquire acoustic features of at least one speech training sample and a behavioral intent label corresponding to each speech training sample;

a training module, configured to train a recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample, to obtain the target recognition model.

Optionally, the second acquisition module is configured to:

acquire at least one speech;

determine a behavioral operation corresponding to each speech in the at least one speech;

generate a behavioral intent label corresponding to each behavioral operation;

determine the at least one speech as the at least one speech training sample, and determine each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample.

Optionally, the second acquisition module is further configured to:

query, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal;

when the at least one speech all comes from the target user, perform the operation of acquiring the at least one speech.

Optionally, the second acquisition module is further configured to:

determine a difference value between the voiceprint feature of each speech and a preset voiceprint feature;

when the difference value between the voiceprint feature of each speech and the preset voiceprint feature is less than a preset threshold, determine that the at least one speech all comes from the target user.

Optionally, the device further includes:

a sharing module, configured to share the target recognition model with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In a third aspect, a computer-readable storage medium is provided, on which instructions are stored; when the instructions are executed by a processor, the speech recognition method described in the first aspect is implemented.

In a fourth aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to execute the speech recognition method described in the first aspect.

In a fifth aspect, a computing device is provided, comprising:

one or more processors; and

a memory for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to execute the speech recognition method described in the first aspect.

The beneficial effects brought by the technical solutions provided in the embodiments of the present invention are as follows:

The target speech to be recognized is collected, and its acoustic features are acquired. A target recognition model is invoked; since the target recognition model can recognize the behavioral intent corresponding to any speech according to its acoustic features, after the acquired acoustic features are input into the target recognition model, the behavioral intent label corresponding to the target speech can be output. In the embodiments of the present invention, the behavioral intent corresponding to either standard or non-standard speech can be recognized by the target recognition model on the basis of acoustic features, which enhances the applicability of speech recognition.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a speech recognition method according to an exemplary embodiment;

FIG. 2 is a flowchart of a speech recognition method according to another exemplary embodiment;

FIG. 3 is a schematic structural diagram of a speech recognition device according to an exemplary embodiment;

FIG. 4 is a schematic structural diagram of a speech recognition device according to another exemplary embodiment;

FIG. 5 is a schematic structural diagram of a speech recognition device according to another exemplary embodiment;

FIG. 6 is a schematic structural diagram of a terminal 600 according to another exemplary embodiment.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Before the speech recognition method provided by the embodiments of the present invention is described in detail, the application scenarios and implementation environments involved in the embodiments are briefly introduced.

First, the application scenarios involved in the embodiments of the present invention are briefly introduced.

In some application scenarios, to make operation more convenient, there is a need to control a terminal by speech; such scenarios include, but are not limited to, home and in-vehicle environments. When controlling a terminal by speech, speech recognition is required in order to determine the behavioral intent corresponding to the speech. At present, the speech recognition process converts the speech to be recognized into text and then performs semantic recognition on the converted text. However, in this process text conversion and semantic recognition can only be performed on standard speech, and non-standard speech (such as a dialect) cannot be recognized, resulting in poor applicability of speech recognition.

To this end, the embodiments of the present invention provide a speech recognition method that recognizes the corresponding behavioral intent through a target recognition model based on the acoustic features of the speech. Since the method does not require semantic recognition, speech recognition can be performed on both standard and non-standard speech, which increases the applicability of speech recognition. For the specific implementation process, refer to the embodiments shown in FIG. 1 or FIG. 2 below.

Next, the implementation environment involved in the embodiments of the present invention is briefly introduced.

The speech recognition method provided by the embodiments of the present invention may be performed by a first terminal, which may be equipped with a speech collection device, for example a microphone array, for speech collection. In some embodiments, the first terminal may be any electronic product capable of human-computer interaction with the user through one or more of a keyboard, a touch pad, a touch screen, a remote control, voice interaction, a handwriting device, and the like, for example a PC, a mobile phone, a smartphone, a PDA, a wearable device, a pocket PC (PPC), a tablet computer, a smart in-vehicle device, a smart TV, a smart speaker, and so on. In practical applications, when the first terminal is an electronic product that can interact with the user by voice, it may carry or install a client (for example in the form of an app) capable of recognizing, parsing, understanding, processing, and responding to the user's natural language commands and outputting the response results; alternatively, the client may only perform speech recognition on the natural language commands input by the user, while a corresponding server parses, understands, processes, and responds to the natural language commands and returns the response results to the client for output.

Further, the first terminal may be connected to at least one second terminal. In a possible implementation, the at least one second terminal and the first terminal may all belong to the same user.

Here, the second terminal may likewise be any electronic product capable of human-computer interaction with the user through one or more of a keyboard, a touch pad, a touch screen, a remote control, voice interaction, a handwriting device, and the like, for example a PC, a mobile phone, a smartphone, a PDA, a wearable device, a pocket PC (PPC), a tablet computer, a smart in-vehicle device, a smart TV, a smart speaker, and so on. In practical applications, when the second terminal is an electronic product that can interact with the user by voice, it may carry or install a client (for example in the form of an app) capable of recognizing, parsing, understanding, processing, and responding to the user's natural language commands and outputting the response results; alternatively, the client may only perform speech recognition on the natural language commands input by the user, while a corresponding server parses, understands, processes, and responds to the natural language commands and returns the response results to the client for output.

Here, both the first terminal and the second terminal include an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions; their hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.

Having introduced the application scenarios and implementation environments involved in the embodiments of the present invention, the speech recognition method provided by the embodiments of the present invention is described in detail below with reference to the accompanying drawings.

Referring to FIG. 1, FIG. 1 is a flowchart of a speech recognition method according to an exemplary embodiment. The speech recognition method may be performed by the above-mentioned first terminal and may include the following implementation steps:

Step 101: Collect the target speech to be recognized.

Step 102: Acquire acoustic features of the target speech.

Step 103: Invoke a target recognition model, input the acoustic features into the target recognition model, and output the behavioral intent label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intent corresponding to any speech according to the acoustic features of that speech.

Further, before the target recognition model is invoked, acoustic features of at least one speech training sample and a behavioral intent label corresponding to each speech training sample are acquired, and a recognition model to be trained is trained based on the acoustic features of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample, to obtain the target recognition model.

The process of acquiring the behavioral intent label corresponding to each speech training sample may include: acquiring at least one speech; determining the behavioral operation corresponding to each speech in the at least one speech; generating a behavioral intent label corresponding to each behavioral operation; determining the at least one speech as the at least one speech training sample; and determining each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample.

Further, before the at least one speech is acquired, it is queried, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal; when the at least one speech all comes from the target user, the operation of acquiring the at least one speech is performed.

In a possible implementation, querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from the target user may include: determining a difference value between the voiceprint feature of each speech and a preset voiceprint feature; and, when the difference value between the voiceprint feature of each speech and the preset voiceprint feature is less than a preset threshold, determining that the at least one speech all comes from the target user.

In some embodiments, after the recognition model to be trained is trained based on the acoustic features of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample to obtain the target recognition model, the target recognition model is shared with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In the embodiments of the present invention, the target speech to be recognized is collected and its acoustic features are acquired. A target recognition model is invoked; since the target recognition model can recognize the behavioral intent corresponding to any speech according to its acoustic features, after the acquired acoustic features are input into the target recognition model, the behavioral intent label corresponding to the target speech can be output. In the embodiments of the present invention, the behavioral intent corresponding to either standard or non-standard speech can be recognized by the target recognition model on the basis of acoustic features, which enhances the applicability of speech recognition.

FIG. 2 is a flowchart of a speech recognition method according to another exemplary embodiment. In this embodiment, the speech recognition method is described as being applied to the above-mentioned first terminal; the speech recognition method may include the following implementation steps:

Step 201: Collect the target speech to be recognized.

When the user wants to control the first terminal by speech, the user can speak directly into a speech collection device, such as a microphone array, of the first terminal. Accordingly, the first terminal can collect what the user says through the speech collection device, that is, collect the target speech to be recognized.

Further, the first terminal may collect the target speech to be recognized upon receiving a speech recognition instruction. The speech recognition instruction may be triggered by the user through a specified operation, which may include a click operation, a slide operation, and the like; this is not limited in the embodiments of the present invention.

For example, a speech recognition option may be provided in the first terminal. When the user wants to control the first terminal by speech, the user can click the speech recognition option to trigger the speech recognition instruction. After receiving the speech recognition instruction, the first terminal collects the target speech to be recognized.

Step 202: Acquire acoustic features of the target speech.

To facilitate subsequent speech recognition of the target speech, the first terminal acquires the acoustic features of the target speech. The acoustic features may be used to describe at least one of the loudness, pitch, frequency, and timbre of the target speech.
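As a minimal illustration only (not part of the patent text), the acoustic features could be represented by MFCC frames extracted from the collected waveform; the library, sampling rate, and frame parameters below are assumptions for this sketch, since the patent only states that the features describe loudness, pitch, frequency, and/or timbre.

```python
# Hypothetical sketch: extract acoustic features (MFCCs) from a collected utterance.
# librosa and the chosen parameters are assumptions; any comparable time-frequency
# representation could serve as input to the target recognition model.
import numpy as np
import librosa

def extract_acoustic_features(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Load the target speech and return an (n_frames, n_mfcc) feature matrix."""
    waveform, sr = librosa.load(wav_path, sr=sr)                    # resample to a fixed rate
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    return mfcc.T                                                   # one feature vector per frame
```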

Step 203: Invoke the target recognition model, input the acoustic features into the target recognition model, and output the behavioral intent label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intent corresponding to any speech according to the acoustic features of that speech.

Since the target recognition model can be used to recognize the behavioral intent corresponding to any speech according to its acoustic features, the first terminal can output the behavioral intent label corresponding to the target speech after inputting the acoustic features of the target speech into the target recognition model.

In some embodiments, the behavioral intent label may be a behavioral intent sequence; in other words, the behavioral intent label may describe a series of behavioral actions, for example the series "open the camera, then log in to WeChat, then play music".

Further, after determining the behavioral intent label corresponding to the target speech, the first terminal may perform the behavioral operation corresponding to the behavioral intent label. For example, if the behavioral intent label is a "start camera" label, the first terminal may start the installed camera application. In this way, the purpose of controlling the first terminal with the target speech is achieved.
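The following sketch (not taken from the patent) shows one way the inference step and the label-to-operation dispatch could be wired together; the model interface, the label names, and the handler functions are illustrative assumptions.

```python
# Hypothetical glue code: run the target recognition model on the acoustic features
# and dispatch the predicted behavioral intent label to the corresponding operation.
# `target_model`, the label list, and the handlers are illustrative assumptions.
import torch

INTENT_LABELS = ["start_camera", "start_baidu_map", "play_music"]   # example labels

def start_camera():    print("launching camera app ...")
def start_baidu_map(): print("launching Baidu Map ...")
def play_music():      print("starting music playback ...")

HANDLERS = {"start_camera": start_camera,
            "start_baidu_map": start_baidu_map,
            "play_music": play_music}

def recognize_and_execute(target_model: torch.nn.Module, features: torch.Tensor) -> str:
    """features: (n_frames, n_mfcc) tensor for one utterance."""
    target_model.eval()
    with torch.no_grad():
        logits = target_model(features.unsqueeze(0))      # add batch dimension
    label = INTENT_LABELS[int(logits.argmax(dim=-1))]     # behavioral intent label
    HANDLERS[label]()                                     # perform the corresponding operation
    return label
```

A sequence-valued label, as mentioned above, could be handled the same way by mapping one label to a list of handlers executed in order.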

Further, before the target recognition model is invoked, model training needs to be performed to obtain the target recognition model. In a possible implementation, the training process may include: acquiring acoustic features of at least one speech training sample and a behavioral intent label corresponding to each speech training sample, and training a recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample, to obtain the target recognition model.

That is, the first terminal acquires the acoustic features of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample, inputs the acquired acoustic features and labels into the recognition model to be trained, and performs deep learning and training on the recognition model, thereby obtaining a target recognition model capable of recognizing the behavioral intent of any speech based on its acoustic features.

In some embodiments, the recognition model to be trained may be a CNN (Convolutional Neural Network) model, which is not limited in the embodiments of the present invention.
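As an illustration only, such a model could be a small one-dimensional CNN over the feature frames, trained roughly as follows; the architecture, optimizer, and hyperparameters are assumptions and are not specified by the patent.

```python
# Hypothetical training sketch: a small 1-D CNN mapping acoustic feature frames
# to behavioral intent labels. Architecture and hyperparameters are assumptions.
import torch
import torch.nn as nn

class IntentCNN(nn.Module):
    def __init__(self, n_mfcc: int = 13, n_labels: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool over time so any utterance length works
        )
        self.fc = nn.Linear(64, n_labels)

    def forward(self, x):                      # x: (batch, n_frames, n_mfcc)
        x = x.transpose(1, 2)                  # -> (batch, n_mfcc, n_frames)
        return self.fc(self.conv(x).squeeze(-1))

def train_target_model(samples, labels, n_labels, epochs: int = 20) -> IntentCNN:
    """samples: list of (n_frames, n_mfcc) tensors; labels: list of int label ids."""
    model = IntentCNN(n_mfcc=samples[0].shape[1], n_labels=n_labels)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, label in zip(samples, labels):
            optim.zero_grad()
            logits = model(feats.unsqueeze(0))
            loss = loss_fn(logits, torch.tensor([label]))
            loss.backward()
            optim.step()
    return model
```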

The process of acquiring the behavioral intent label corresponding to each speech training sample may include: acquiring at least one speech and determining the behavioral operation corresponding to each speech in the at least one speech; generating a behavioral intent label corresponding to each behavioral operation; determining the at least one speech as the at least one speech training sample; and determining each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample.

In a possible implementation, the behavioral operation corresponding to each speech is the operation performed when an operation instruction triggered for that speech is received. In some embodiments, the user may input at least one speech to the first terminal and, after inputting each speech in the at least one speech, may trigger a behavioral operation for that speech by manual triggering. For example, the user may say "start Baidu Map" to the first terminal in a dialect and then manually start Baidu Map. At this time, the first terminal can collect each speech in the at least one speech and determine the corresponding behavioral operation, for example the operation of starting Baidu Map. The first terminal then generates the behavioral intent label of the determined behavioral operation; for example, the behavioral intent label may be a "start Baidu Map" label. The first terminal determines each acquired speech as a speech training sample in the at least one speech training sample, and determines each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample. In this way, the first terminal can train the recognition model to be trained based on the determined at least one speech training sample and the behavioral intent label corresponding to each speech training sample.

Of course, the above description takes as an example the case where the behavioral operation corresponding to each speech is the operation performed when an operation instruction triggered for that speech is received. In another embodiment, the behavioral operation corresponding to each speech may also be the operation performed when a standard control speech entered for that speech is collected, where the standard control speech usually refers to Mandarin. In some embodiments, since the first terminal can recognize Mandarin, the user inputs at least one speech to the first terminal and, after inputting each speech in the at least one speech, can trigger the behavioral operation for that speech through a standard control speech. For example, the user may say "start Baidu Map" to the first terminal in a dialect and then control the first terminal in Mandarin to start Baidu Map. At this time, the first terminal can collect each speech in the at least one speech and determine the corresponding behavioral operation, for example the operation of starting Baidu Map. The first terminal then generates the behavioral intent label of the determined behavioral operation; for example, the behavioral intent label may be a "start Baidu Map" label. The first terminal determines each acquired speech as a speech training sample in the at least one speech training sample, and determines each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample.
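A schematic of how such (utterance, observed operation) pairs could be accumulated into labeled training samples is sketched below; the data structures and the way operations are observed are assumptions, and the same pairing covers both the manual-trigger and the Mandarin-control variants described above.

```python
# Hypothetical collection of training samples: each dialect utterance is paired
# with the operation the user subsequently triggered (manually or via a Mandarin
# command), and that operation name becomes the behavioral intent label.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TrainingSample:
    features: np.ndarray      # acoustic features of the utterance, (n_frames, n_mfcc)
    intent_label: str         # e.g. "start_baidu_map"

def collect_samples(utterance_features: List[np.ndarray],
                    observed_operations: List[str]) -> List[TrainingSample]:
    """Pair each collected utterance with the behavioral operation observed for it."""
    return [TrainingSample(features=f, intent_label=op)
            for f, op in zip(utterance_features, observed_operations)]
```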

Further, before the at least one speech is acquired, the source of the at least one speech may also be restricted. The corresponding processing includes: querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal; when the at least one speech all comes from the target user, performing the operation of acquiring the at least one speech.

In some embodiments, for example in a multi-person scenario, the first terminal may train only on the speech of the target user associated with it; for example, the target user may be the owner of the first terminal. In this case, before acquiring the at least one speech, the first terminal needs to determine whether the at least one speech comes from the target user. The first terminal can query, according to the voiceprint feature of each speech, whether the at least one speech all comes from the target user.

In a possible implementation, querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from the target user may include: determining a difference value between the voiceprint feature of each speech and a preset voiceprint feature; and, when the difference value between the voiceprint feature of each speech and the preset voiceprint feature is less than a preset threshold, determining that the at least one speech all comes from the target user.

The preset voiceprint feature may be stored in the first terminal in advance and may be the voiceprint feature of the target user. In this way, after acquiring the voiceprint feature of each speech, the first terminal can compare the acquired voiceprint feature of each speech with the pre-stored preset voiceprint feature and determine the difference value between them.

In some embodiments, the acquired voiceprint feature of each speech may be pattern-matched against the pre-stored preset voiceprint feature to determine the difference value between the voiceprint feature of that speech and the preset voiceprint feature. The pattern matching method may include probabilistic and statistical methods, artificial neural network methods, and the like, which are not limited in the embodiments of the present application.

When the difference value is less than the preset threshold, the difference between the voiceprint feature of the compared speech and the preset voiceprint feature is small, and it can be determined that the compared speech comes from the target user. Conversely, if the difference value is greater than the preset threshold, the difference between the voiceprint feature of the compared speech and the preset voiceprint feature is large, so it can be determined that the compared speech does not come from the target user. In this way, by comparing the difference value of each speech's voiceprint feature with the preset threshold, it can be determined whether the at least one speech all comes from the target user.
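For illustration only, one simple way to realize the "difference value" check is a distance between fixed-length voiceprint embeddings; the use of cosine distance and the threshold value below are assumptions, since the patent leaves the matching method open (probabilistic, statistical, or neural approaches).

```python
# Hypothetical voiceprint check: compare each utterance's voiceprint embedding with
# the preset (enrolled) voiceprint of the target user. The embedding source and the
# cosine-distance choice are assumptions for this sketch.
import numpy as np

def voiceprint_difference(voiceprint: np.ndarray, preset_voiceprint: np.ndarray) -> float:
    """Cosine distance in [0, 2]; smaller means more similar."""
    cos = np.dot(voiceprint, preset_voiceprint) / (
        np.linalg.norm(voiceprint) * np.linalg.norm(preset_voiceprint))
    return 1.0 - float(cos)

def all_from_target_user(voiceprints, preset_voiceprint, threshold: float = 0.3) -> bool:
    """True only if every collected speech is close enough to the preset voiceprint."""
    return all(voiceprint_difference(v, preset_voiceprint) < threshold for v in voiceprints)
```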

The preset threshold may be customized by the user according to actual needs or set by default by the first terminal, which is not limited in the embodiments of the present invention.

Further, the above takes querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from the target user only as an example. In another embodiment, whether the at least one speech all comes from the target user may be queried based on other information; for example, the sound source position of each speech may also be used to determine whether the at least one speech comes from the target user.

In a possible implementation, the first terminal may determine, according to the voiceprint feature of any one speech in the collected at least one speech, whether that speech comes from the target user. When it is determined that that speech comes from the target user, the first terminal then determines whether the sound source positions of the other speeches in the at least one speech are the same as the sound source position of that speech, that is, whether the at least one speech all comes from the same direction. When the sound source positions of the at least one speech are all the same, it is determined that the at least one speech comes from the same user, that is, from the target user. Otherwise, it can be determined that the at least one speech does not all come from the same target user. The sound source position of each speech may be determined based on parameters such as the received strength of the collected speech, which is not limited in the embodiments of the present invention.

It should be noted that, when the at least one speech does not all come from the target user, the speeches that do not come from the target user may be deleted from the at least one speech, and the remaining speeches are then acquired.

At this point, the speech recognition method involved in the embodiments of the present invention has been realized. Further, since the user may use multiple terminals, in order for other terminals to perform speech recognition in the same way as the first terminal, the first terminal may also share the target recognition model after training the recognition model to be trained to obtain it; for the specific implementation, refer to step 204 below.

Step 204: Share the target recognition model with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In a possible implementation, the association relationship means belonging to the same user as the first terminal, having a connection with the first terminal, being in the same environment as the first terminal, and so on.

Next, the case where the second terminal and the first terminal belong to the same user is taken as an example; for instance, both the second terminal and the first terminal belong to the above-mentioned target user. That is, so that the target user can also control the second terminal by speech while using it, the first terminal, after obtaining the target recognition model through training, can send the target recognition model to the second terminal belonging to the same user. Accordingly, after receiving the target recognition model shared by the first terminal, the second terminal stores the target recognition model locally.

Of course, the first terminal may also receive a trained target recognition model sent by the second terminal; that is, the second terminal may also perform model training in the manner described above and share the trained target recognition model with the first terminal. The first terminal receives the trained target recognition model shared by the second terminal. Further, when a target recognition model is already stored in the first terminal, if the latest target recognition model shared by the second terminal is received, the first terminal may delete the originally stored target recognition model and save the latest target recognition model shared by the second terminal.
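A minimal sketch of how the trained model could be serialized, shared, and used to replace an older local copy is given below; the file-based transfer is an assumption, since the patent does not prescribe a transport mechanism between the associated terminals.

```python
# Hypothetical model sharing: serialize the trained target recognition model and
# hand it to an associated terminal, replacing any older local copy there.
# The transport (a shared path) is an assumption for this sketch.
import os
import torch

def share_target_model(model: torch.nn.Module, destination_path: str) -> None:
    """Serialize the model weights so an associated terminal can load them."""
    torch.save(model.state_dict(), destination_path)

def receive_shared_model(model: torch.nn.Module, incoming_path: str, local_path: str) -> None:
    """Replace the locally stored target recognition model with the newly shared one."""
    if os.path.exists(local_path):
        os.remove(local_path)                      # delete the originally stored model
    os.replace(incoming_path, local_path)          # keep the latest shared model
    model.load_state_dict(torch.load(local_path))
```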

It is worth mentioning that the first terminal shares the target recognition model obtained after training with the second terminal, so that the second terminal stores the same target recognition model as the first terminal. In this way, the second terminal can use the target recognition model directly for speech recognition without having to perform training itself, which reduces the number of training runs required on the second terminal.

In the embodiments of the present invention, the target speech to be recognized is collected and its acoustic features are acquired. A target recognition model is invoked; since the target recognition model can recognize the behavioral intent corresponding to any speech according to its acoustic features, after the acquired acoustic features are input into the target recognition model, the behavioral intent label corresponding to the target speech can be output. In the embodiments of the present invention, the behavioral intent corresponding to either standard or non-standard speech can be recognized by the target recognition model on the basis of acoustic features, which enhances the applicability of speech recognition.

FIG. 3 is a schematic structural diagram of a speech recognition device according to an exemplary embodiment. The speech recognition device may be implemented by software, hardware, or a combination of the two. The speech recognition device may include:

a collection module 301, configured to collect a target speech to be recognized;

a first acquisition module 302, configured to acquire acoustic features of the target speech;

an invoking module 303, configured to invoke a target recognition model, input the acoustic features into the target recognition model, and output a behavioral intent label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intent corresponding to any speech according to the acoustic features of that speech.

Optionally, referring to FIG. 4, the device further includes:

a second acquisition module 304, configured to acquire acoustic features of at least one speech training sample and a behavioral intent label corresponding to each speech training sample;

a training module 305, configured to train a recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample, to obtain the target recognition model.

Optionally, the second acquisition module 304 is configured to:

acquire at least one speech;

determine a behavioral operation corresponding to each speech in the at least one speech;

generate a behavioral intent label corresponding to each behavioral operation;

determine the at least one speech as the at least one speech training sample, and determine each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample.

Optionally, the second acquisition module 304 is further configured to:

query, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal;

when the at least one speech all comes from the target user, perform the operation of acquiring the at least one speech.

Optionally, the second acquisition module 304 is further configured to:

determine a difference value between the voiceprint feature of each speech and a preset voiceprint feature;

when the difference value between the voiceprint feature of each speech and the preset voiceprint feature is less than a preset threshold, determine that the at least one speech all comes from the target user.

Optionally, referring to FIG. 5, the device further includes:

a sharing module 306, configured to share the target recognition model with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In the embodiments of the present invention, the target speech to be recognized is collected and its acoustic features are acquired. A target recognition model is invoked; since the target recognition model can recognize the behavioral intent corresponding to any speech according to its acoustic features, after the acquired acoustic features are input into the target recognition model, the behavioral intent label corresponding to the target speech can be output. In the embodiments of the present invention, the behavioral intent corresponding to either standard or non-standard speech can be recognized by the target recognition model on the basis of acoustic features, which enhances the applicability of speech recognition.

It should be noted that, when the speech recognition device provided in the above embodiments implements the speech recognition method, the division into the above functional modules is only used as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech recognition device provided in the above embodiments and the embodiments of the speech recognition method belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.

图6示出了本发明一个示例性实施例提供的终端600的结构框图。该终端600可以是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio LayerIV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端600还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。FIG. 6 shows a structural block diagram of a terminal 600 provided by an exemplary embodiment of the present invention. The terminal 600 can be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, the standard audio layer of the moving picture experts compression), MP4 (Moving Picture Experts Group Audio Layer IV, the standard audio layer of the moving picture experts compression) 4) Player, laptop or desktop computer. Terminal 600 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.

通常,终端600包括有:处理器601和存储器602。Generally, the terminal 600 includes: a processor 601 and a memory 602 .

处理器601可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器601可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器601也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central ProcessingUnit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器601可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器601还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。The processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 601 may adopt at least one hardware form among DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array, programmable logic array) accomplish. The processor 601 may also include a main processor and a coprocessor. The main processor is a processor used to process data in a wake-up state, also called a CPU (Central Processing Unit, central processing unit); A low-power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen. In some embodiments, the processor 601 may further include an AI (Artificial Intelligence, artificial intelligence) processor, where the AI processor is used to process computing operations related to machine learning.

The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 602 is used to store at least one instruction, which is executed by the processor 601 to implement the speech recognition method provided by the method embodiments of this application.

In some embodiments, the terminal 600 may optionally further include a peripheral device interface 603 and at least one peripheral device. The processor 601, the memory 602, and the peripheral device interface 603 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 603 through a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 604, a touch display screen 605, a camera 606, an audio circuit 607, a positioning component 608, and a power supply 609.

The peripheral device interface 603 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 601 and the memory 602. In some embodiments, the processor 601, the memory 602, and the peripheral device interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral device interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 604 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 604 communicates with communication networks and other communication devices via electromagnetic signals, converting electrical signals into electromagnetic signals for transmission, or converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 604 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 604 may communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may further include circuitry related to NFC (Near Field Communication), which is not limited in this application.

The display screen 605 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, it can also acquire touch signals on or above its surface, and such a touch signal may be input to the processor 601 as a control signal for processing. In this case, the display screen 605 may also provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 605 arranged on the front panel of the terminal 600; in other embodiments, there may be at least two display screens 605 arranged on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display screen 605 may be a flexible display screen arranged on a curved or folding surface of the terminal 600. The display screen 605 may even be set as a non-rectangular irregular shape, that is, a shaped screen, and may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Usually the front camera is arranged on the front panel of the terminal and the rear camera on the back. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, or a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background-blur function, or the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 606 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.

The audio circuit 607 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment and convert them into electrical signals, which are input to the processor 601 for processing or to the radio frequency circuit 604 to realize voice communication. For stereo collection or noise reduction, there may be multiple microphones arranged at different parts of the terminal 600; the microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 607 may also include a headphone jack.

The positioning component 608 is used to determine the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.

The power supply 609 is used to supply power to the various components in the terminal 600. The power supply 609 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 609 includes a rechargeable battery, it may be a wired rechargeable battery charged through a wired line or a wireless rechargeable battery charged through a wireless coil. The rechargeable battery may also support fast-charging technology.

In some embodiments, the terminal 600 further includes one or more sensors 610, including but not limited to an acceleration sensor 611, a gyroscope sensor 612, a pressure sensor 613, a fingerprint sensor 614, an optical sensor 615, and a proximity sensor 616.

The acceleration sensor 611 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 can detect the components of gravitational acceleration on the three coordinate axes. The processor 601 can control the touch display screen 605 to display the user interface in landscape or portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 can also be used to collect motion data for games or for the user.

The gyroscope sensor 612 can detect the body orientation and rotation angle of the terminal 600 and can cooperate with the acceleration sensor 611 to collect the user's 3D actions on the terminal 600. Based on the data collected by the gyroscope sensor 612, the processor 601 can implement functions such as motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

The pressure sensor 613 may be arranged on the side frame of the terminal 600 and/or under the touch display screen 605. When the pressure sensor 613 is arranged on the side frame, it can detect the user's grip on the terminal 600, and the processor 601 can perform left-hand/right-hand recognition or quick operations according to the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is arranged under the touch display screen 605, the processor 601 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 605. The operable controls include at least one of button controls, scroll-bar controls, icon controls, and menu controls.

The fingerprint sensor 614 is used to collect the user's fingerprint. The processor 601 identifies the user's identity from the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 itself identifies the user's identity from the collected fingerprint. When the user's identity is identified as trusted, the processor 601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 614 may be arranged on the front, back, or side of the terminal 600. When the terminal 600 has a physical button or a manufacturer logo, the fingerprint sensor 614 may be integrated with the physical button or the manufacturer logo.

The optical sensor 615 is used to collect ambient light intensity. In one embodiment, the processor 601 can control the display brightness of the touch display screen 605 according to the ambient light intensity collected by the optical sensor 615: when the ambient light intensity is high, the display brightness of the touch display screen 605 is increased; when the ambient light intensity is low, the display brightness is decreased. In another embodiment, the processor 601 can also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
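Purely as a hedged illustration of this brightness rule (the embodiment states only that brightness is raised when ambient light is high and lowered when it is low), the sketch below maps an ambient-light reading to a brightness level; the lux thresholds and the linear interpolation are assumptions, not values from the disclosure.

```python
def brightness_for_ambient_light(lux: float,
                                 low_lux: float = 50.0,
                                 high_lux: float = 1000.0,
                                 min_brightness: float = 0.2,
                                 max_brightness: float = 1.0) -> float:
    """Map an ambient-light reading (lux) to a display brightness in [0, 1].

    Below low_lux the brightness is clamped to min_brightness, above high_lux
    to max_brightness, and in between it is interpolated linearly. All
    constants are illustrative.
    """
    if lux <= low_lux:
        return min_brightness
    if lux >= high_lux:
        return max_brightness
    ratio = (lux - low_lux) / (high_lux - low_lux)
    return min_brightness + ratio * (max_brightness - min_brightness)
```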

The proximity sensor 616, also called a distance sensor, is usually arranged on the front panel of the terminal 600 and is used to measure the distance between the user and the front of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front of the terminal 600 is gradually decreasing, the processor 601 controls the touch display screen 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 616 detects that the distance is gradually increasing, the processor 601 controls the touch display screen 605 to switch from the screen-off state to the screen-on state.
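This screen on/off behavior can be read as a small state machine driven by the distance trend. The sketch below is one hedged way to realize it; the window size and the definition of "gradually decreasing/increasing" are assumptions.

```python
from collections import deque

class ProximityScreenController:
    """Assumed sketch: turn the screen off when the measured distance is
    trending downward, and back on when it is trending upward."""

    def __init__(self, window: int = 3):
        self.readings = deque(maxlen=window)
        self.screen_on = True

    def on_distance(self, distance_cm: float) -> bool:
        """Feed one proximity reading; return the resulting screen state."""
        self.readings.append(distance_cm)
        if len(self.readings) == self.readings.maxlen:
            samples = list(self.readings)
            deltas = [b - a for a, b in zip(samples, samples[1:])]
            if all(d < 0 for d in deltas):      # distance gradually decreasing
                self.screen_on = False
            elif all(d > 0 for d in deltas):    # distance gradually increasing
                self.screen_on = True
        return self.screen_on
```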

Those skilled in the art will understand that the structure shown in FIG. 6 does not constitute a limitation on the terminal 600; the terminal may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.

An embodiment of the present application further provides a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by the processor of a mobile terminal, the mobile terminal can perform the speech recognition method provided by the embodiment shown in FIG. 1 or FIG. 2.

An embodiment of the present application further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the speech recognition method provided by the embodiment shown in FIG. 1 or FIG. 2.

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware, and that such a program can be stored in a computer-readable storage medium; the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (6)

1. A speech recognition method, applied to a first terminal, characterized in that the method comprises:
collecting a target speech to be recognized;
acquiring an acoustic feature of the target speech;
calling a target recognition model, inputting the acoustic feature into the target recognition model, and outputting a behavioral intent label corresponding to the target speech, wherein the target recognition model is used to identify, according to the acoustic feature of any speech, the behavioral intent corresponding to that speech;
wherein, before calling the target recognition model, the method further comprises:
acquiring an acoustic feature of at least one speech training sample and a behavioral intent label corresponding to each speech training sample;
training a to-be-trained recognition model based on the acoustic feature of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample, to obtain the target recognition model;
wherein acquiring the behavioral intent label corresponding to each speech training sample comprises:
acquiring at least one speech;
determining a behavior operation corresponding to each speech in the at least one speech, wherein the behavior operation corresponding to each speech comprises at least one of the following: a behavior operation performed when an operation instruction triggered for that speech is received; or a behavior operation performed when a standard control speech entered for that speech is collected;
generating a behavioral intent label corresponding to each behavior operation;
determining the at least one speech as the at least one speech training sample, and determining each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample;
wherein, before acquiring the at least one speech, the method further comprises:
querying, according to a voiceprint feature of each speech, whether the at least one speech all comes from a target user;
when the at least one speech all comes from the target user, performing the operation of acquiring the at least one speech;
after training the to-be-trained recognition model based on the acoustic feature of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample to obtain the target recognition model, the method further comprises:
sharing the target recognition model with a second terminal, wherein the second terminal refers to a terminal having an association relationship with the first terminal.
2. The method according to claim 1, characterized in that querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from the target user comprises:
determining a difference value between the voiceprint feature of each speech and a preset voiceprint feature;
when the difference values between the voiceprint features of the speeches and the preset voiceprint feature are all smaller than a preset threshold, determining that the at least one speech all comes from the target user.
3. A speech recognition apparatus, applied to a first terminal, characterized in that the apparatus comprises:
a collection module, configured to collect a target speech to be recognized;
a first acquisition module, configured to acquire an acoustic feature of the target speech;
a calling module, configured to call a target recognition model, input the acoustic feature into the target recognition model, and output a behavioral intent label corresponding to the target speech, wherein the target recognition model is used to identify, according to the acoustic feature of any speech, the behavioral intent corresponding to that speech;
the apparatus further comprises:
a second acquisition module, configured to acquire an acoustic feature of at least one speech training sample and a behavioral intent label corresponding to each speech training sample;
a training module, configured to train a to-be-trained recognition model based on the acoustic feature of the at least one speech training sample and the behavioral intent label corresponding to each speech training sample, to obtain the target recognition model;
the second acquisition module is configured to:
acquire at least one speech;
determine a behavior operation corresponding to each speech in the at least one speech;
generate a behavioral intent label corresponding to each behavior operation;
determine the at least one speech as the at least one speech training sample, and determine each generated behavioral intent label as the behavioral intent label of the corresponding speech training sample;
the second acquisition module is further configured to:
query, according to a voiceprint feature of each speech, whether the at least one speech all comes from a target user, wherein the target user refers to a user having an association relationship with the first terminal;
when the at least one speech all comes from the target user, perform the operation of acquiring the at least one speech;
the apparatus further comprises:
a sharing module, configured to share the target recognition model with a second terminal, wherein the second terminal refers to a terminal having an association relationship with the first terminal.
4. The apparatus according to claim 3, characterized in that the second acquisition module is further configured to:
determine a difference value between the voiceprint feature of each speech and a preset voiceprint feature;
when the difference values between the voiceprint features of the speeches and the preset voiceprint feature are all smaller than a preset threshold, determine that the at least one speech all comes from the target user.
5. A computer-readable storage medium having instructions stored thereon, characterized in that the instructions, when executed by a processor, implement the method according to any one of claims 1-2.
6. A computing device, comprising:
one or more processors; and
a memory for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to perform the method according to any one of claims 1-2.
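For readers who want a concrete picture of the voiceprint gate in claims 1 and 2, the following sketch compares each speech's voiceprint feature with a preset voiceprint feature and keeps the speeches as training samples only when every difference value is below the preset threshold. The Euclidean distance, the threshold value, and the vector representation are assumptions; the claims require only some difference value and threshold.

```python
import numpy as np

def all_from_target_user(voiceprints: list[np.ndarray],
                         preset_voiceprint: np.ndarray,
                         preset_threshold: float = 0.8) -> bool:
    """Return True only if every voiceprint differs from the preset voiceprint
    by less than the preset threshold (Euclidean distance is an assumption)."""
    return all(float(np.linalg.norm(v - preset_voiceprint)) < preset_threshold
               for v in voiceprints)

def collect_training_samples(candidate_speeches, voiceprints, preset_voiceprint):
    """Keep the candidate speeches as training samples only when they all come
    from the target user, mirroring the gating step described in claim 1."""
    if all_from_target_user(voiceprints, preset_voiceprint):
        return list(candidate_speeches)   # used as speech training samples
    return []                             # otherwise, do not acquire them
```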
CN201810758435.0A 2018-07-11 2018-07-11 Audio recognition method, device and storage medium Active CN108806670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810758435.0A CN108806670B (en) 2018-07-11 2018-07-11 Audio recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810758435.0A CN108806670B (en) 2018-07-11 2018-07-11 Audio recognition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN108806670A CN108806670A (en) 2018-11-13
CN108806670B true CN108806670B (en) 2019-06-25

Family

ID=64076065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810758435.0A Active CN108806670B (en) 2018-07-11 2018-07-11 Audio recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN108806670B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111939559A (en) * 2019-05-16 2020-11-17 北京车和家信息技术有限公司 Control method and device for vehicle-mounted voice game
CN110246499B (en) * 2019-08-06 2021-05-25 思必驰科技股份有限公司 Voice control method and device for household equipment
CN110364146B (en) * 2019-08-23 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
CN110930989B (en) * 2019-11-27 2021-04-06 深圳追一科技有限公司 Speech intention recognition method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320439A (en) * 2007-06-08 2008-12-10 鹏智科技(深圳)有限公司 Biology-like device with automatic learning function
CN103778915A (en) * 2012-10-17 2014-05-07 三星电子(中国)研发中心 Speech recognition method and mobile terminal
CN105700897A (en) * 2014-11-24 2016-06-22 宇龙计算机通信科技(深圳)有限公司 Method and device for launching application program, and terminal device
CN107667399A (en) * 2015-06-25 2018-02-06 英特尔公司 Speech-recognition services


Also Published As

Publication number Publication date
CN108806670A (en) 2018-11-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220818
Address after: Room 35201, 5th Floor, Zone 2, Building 3, No. 2, Zhuantang Science and Technology Economic Zone, Xihu District, Hangzhou City, Zhejiang Province, 310024
Patentee after: Hangzhou suddenly Cognitive Technology Co.,Ltd.
Address before: 4-0001, East Zone, No. 1, Building 4, Building 1, No. 1, Xueyuan Road, Haidian District, Beijing 100083
Patentee before: BEIJING XIAOMO ROBOT TECHNOLOGY CO.,LTD.

TR01 Transfer of patent right

Effective date of registration: 20241224
Address after: 1102, 11th Floor, Momo Building, 199 Chaoyang North Road, Chaoyang District, Beijing 100020
Patentee after: Beijing Manxiang Time Culture Media Co.,Ltd.
Country or region after: China
Address before: Room 35201, 5th Floor, Zone 2, Building 3, No. 2, Zhuantang Science and Technology Economic Zone, Xihu District, Hangzhou City, Zhejiang Province, 310024
Patentee before: Hangzhou suddenly Cognitive Technology Co.,Ltd.
Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20250828
Address after: 101121 Beijing Tongzhou District Yangzhuang Road No. 1 Building 3 No. 1793
Patentee after: Beijing Shitian Cultural Development Co.,Ltd.
Country or region after: China
Address before: 1102, 11th Floor, Momo Building, 199 Chaoyang North Road, Chaoyang District, Beijing 100020
Patentee before: Beijing Manxiang Time Culture Media Co.,Ltd.
Country or region before: China