CN108806670A - Speech recognition method, device and storage medium - Google Patents

Speech recognition method, device and storage medium

Info

Publication number
CN108806670A
Authority
CN
China
Prior art keywords
voice
target
terminal
training sample
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810758435.0A
Other languages
Chinese (zh)
Other versions
CN108806670B (en)
Inventor
李国华
戴帅湘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shitian Cultural Development Co ltd
Original Assignee
Beijing Moran Cognitive Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moran Cognitive Technology Co Ltd filed Critical Beijing Moran Cognitive Technology Co Ltd
Priority to CN201810758435.0A priority Critical patent/CN108806670B/en
Publication of CN108806670A publication Critical patent/CN108806670A/en
Application granted granted Critical
Publication of CN108806670B publication Critical patent/CN108806670B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/0638 Interactive procedures
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a speech recognition method, device and storage medium, belonging to the technical field of speech processing. The method includes: collecting a target speech to be recognized and acquiring the acoustic features of the target speech; invoking a target recognition model, inputting the acoustic features into the target recognition model, and outputting the behavioral intention label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intention corresponding to any speech according to the acoustic features of that speech. In the embodiments of the invention, whether the input is standard speech or non-standard speech, the corresponding behavioral intention can be recognized by the target recognition model based on acoustic features, which enhances the applicability of speech recognition.

Description

Speech recognition method, device and storage medium

Technical field

Embodiments of the present invention relate to the technical field of speech processing, and in particular to a speech recognition method, device and storage medium.

Background

At present, speech recognition technology is widely used. For example, while using a terminal, a user can control the terminal by voice, for instance instructing the terminal to turn on the camera.

In the related art, after the terminal collects the speech input by the user, it sends the speech to a speech conversion server, which converts the speech into text and returns the converted text to the terminal. After receiving the text, the terminal sends the text to a semantic recognition server, which performs semantic recognition on the text and feeds the recognition result back to the terminal. The terminal can then perform the corresponding operation based on the recognition result.

However, in the above process only standard speech can be recognized, that is, speech recognition is limited to Mandarin, so its applicability is poor.

Summary of the invention

Embodiments of the present invention provide a speech recognition method, device and storage medium, which can solve the problem in the related art that speech recognition has poor applicability. The technical solution is as follows:

In a first aspect, a speech recognition method is provided, the method comprising:

collecting a target speech to be recognized;

acquiring the acoustic features of the target speech;

invoking a target recognition model, inputting the acoustic features into the target recognition model, and outputting the behavioral intention label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intention corresponding to any speech according to the acoustic features of that speech.

Optionally, before invoking the target recognition model, the method further includes:

acquiring the acoustic features of at least one speech training sample and the behavioral intention label corresponding to each speech training sample;

training a recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intention label corresponding to each speech training sample, to obtain the target recognition model.

Optionally, acquiring the behavioral intention label corresponding to each speech training sample includes:

acquiring at least one speech;

determining the behavioral operation corresponding to each speech in the at least one speech;

generating the behavioral intention label corresponding to each behavioral operation;

determining the at least one speech as the at least one speech training sample, and determining each generated behavioral intention label as the behavioral intention label of the corresponding speech training sample.

Optionally, before acquiring the at least one speech, the method further includes:

querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal;

when the at least one speech all comes from the target user, performing the operation of acquiring the at least one speech.

Optionally, querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from the target user includes:

determining the difference value between the voiceprint feature of each speech and a preset voiceprint feature;

when the difference values between the voiceprint feature of each speech and the preset voiceprint feature are all smaller than a preset threshold, determining that the at least one speech all comes from the target user.

Optionally, after training the recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intention label corresponding to each speech training sample to obtain the target recognition model, the method further includes:

sharing the target recognition model with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In a second aspect, a speech recognition device is provided, the device comprising:

a collection module, configured to collect a target speech to be recognized;

a first acquisition module, configured to acquire the acoustic features of the target speech;

an invoking module, configured to invoke a target recognition model, input the acoustic features into the target recognition model, and output the behavioral intention label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intention corresponding to any speech according to the acoustic features of that speech.

Optionally, the device further includes:

a second acquisition module, configured to acquire the acoustic features of at least one speech training sample and the behavioral intention label corresponding to each speech training sample;

a training module, configured to train a recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intention label corresponding to each speech training sample, to obtain the target recognition model.

Optionally, the second acquisition module is configured to:

acquire at least one speech;

determine the behavioral operation corresponding to each speech in the at least one speech;

generate the behavioral intention label corresponding to each behavioral operation;

determine the at least one speech as the at least one speech training sample, and determine each generated behavioral intention label as the behavioral intention label of the corresponding speech training sample.

Optionally, the second acquisition module is further configured to:

query, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal;

when the at least one speech all comes from the target user, perform the operation of acquiring the at least one speech.

Optionally, the second acquisition module is further configured to:

determine the difference value between the voiceprint feature of each speech and a preset voiceprint feature;

when the difference values between the voiceprint feature of each speech and the preset voiceprint feature are all smaller than a preset threshold, determine that the at least one speech all comes from the target user.

Optionally, the device further includes:

a sharing module, configured to share the target recognition model with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In a third aspect, a computer-readable storage medium is provided, on which instructions are stored; when the instructions are executed by a processor, the speech recognition method described in the first aspect is implemented.

In a fourth aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to execute the speech recognition method described in the first aspect.

In a fifth aspect, a computing device is provided, including:

one or more processors;

a memory for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to execute the speech recognition method described in the first aspect.

The beneficial effects brought by the technical solutions provided in the embodiments of the present invention are as follows:

A target speech to be recognized is collected, and the acoustic features of the target speech are acquired. A target recognition model is invoked; since the target recognition model can recognize the behavioral intention corresponding to any speech according to its acoustic features, after the acquired acoustic features are input into the target recognition model, the behavioral intention label corresponding to the target speech can be output. In the embodiments of the present invention, whether the input is standard speech or non-standard speech, the corresponding behavioral intention can be recognized by the target recognition model based on acoustic features, which enhances the applicability of speech recognition.

Brief description of the drawings

In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a speech recognition method according to an exemplary embodiment;

Fig. 2 is a flowchart of a speech recognition method according to another exemplary embodiment;

Fig. 3 is a schematic structural diagram of a speech recognition device according to an exemplary embodiment;

Fig. 4 is a schematic structural diagram of a speech recognition device according to another exemplary embodiment;

Fig. 5 is a schematic structural diagram of a speech recognition device according to another exemplary embodiment;

Fig. 6 is a schematic structural diagram of a terminal 600 according to another exemplary embodiment.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Before the speech recognition method provided by the embodiments of the present invention is described in detail, the application scenarios and implementation environment involved in the embodiments are briefly introduced.

First, the application scenarios involved in the embodiments of the present invention are briefly introduced.

In some application scenarios, in order to make operation more convenient, there is a need to control a terminal by voice; such scenarios include, but are not limited to, home and in-vehicle environments. In the process of controlling a terminal by voice, speech recognition is required in order to determine the behavioral intention corresponding to the speech. At present, speech recognition requires converting the speech to be recognized into text and then performing semantic recognition on the converted text. However, in this process only standard speech can be converted into text and semantically recognized, while non-standard speech (such as dialect) cannot be recognized, resulting in poor applicability of speech recognition.

To this end, the embodiments of the present invention provide a speech recognition method that recognizes the corresponding behavioral intention through a target recognition model based on the acoustic features of the speech. Since this method does not need to recognize semantics, speech recognition can be achieved for both standard and non-standard speech, which increases the applicability of speech recognition. For the specific implementation, refer to the embodiments shown in Fig. 1 or Fig. 2 below.

Next, the implementation environment involved in the embodiments of the present invention is briefly introduced.

The speech recognition method provided by the embodiments of the present invention may be executed by a first terminal, which may be equipped with a speech collection device, such as a microphone array, for collecting speech. In some embodiments, the first terminal may be any electronic product that can interact with the user through one or more means such as a keyboard, touch pad, touch screen, remote control, voice interaction or handwriting device, for example a PC, mobile phone, smartphone, PDA, wearable device, pocket PC (PPC), tablet computer, smart in-vehicle unit, smart TV, smart speaker and so on. In practical applications, when the first terminal is an electronic product capable of voice interaction with the user, it may carry or install a client (possibly in the form of an app) capable of recognizing, parsing, understanding, processing and responding to the user's natural-language commands and outputting the response results; alternatively, the client may only perform speech recognition on the natural-language commands input by the user, while a corresponding server parses, understands, processes and responds to the commands and returns the response results to the client for output.

Further, the first terminal may be connected to at least one second terminal. In one possible implementation, the at least one second terminal and the first terminal may all belong to the same user.

Here, the second terminal may likewise be any electronic product that can interact with the user through one or more means such as a keyboard, touch pad, touch screen, remote control, voice interaction or handwriting device, for example a PC, mobile phone, smartphone, PDA, wearable device, pocket PC (PPC), tablet computer, smart in-vehicle unit, smart TV, smart speaker and so on. In practical applications, when the second terminal is an electronic product capable of voice interaction with the user, it may carry or install a client (possibly in the form of an app) capable of recognizing, parsing, understanding, processing and responding to the user's natural-language commands and outputting the response results; alternatively, the client may only perform speech recognition on the natural-language commands input by the user, while a corresponding server parses, understands, processes and responds to the commands and returns the response results to the client for output.

Here, the first terminal and the second terminal each include an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions; the hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices and so on.

After the application scenarios and implementation environment involved in the embodiments of the present invention have been introduced, the speech recognition method provided by the embodiments of the present invention is described in detail below with reference to the accompanying drawings.

Referring to Fig. 1, which is a flowchart of a speech recognition method according to an exemplary embodiment, the speech recognition method may be executed by the above first terminal and may include the following implementation steps:

Step 101: collect a target speech to be recognized.

Step 102: acquire the acoustic features of the target speech.

Step 103: invoke a target recognition model, input the acoustic features into the target recognition model, and output the behavioral intention label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intention corresponding to any speech according to the acoustic features of that speech.

Further, before the target recognition model is invoked, the acoustic features of at least one speech training sample and the behavioral intention label corresponding to each speech training sample are acquired, and a recognition model to be trained is trained based on these acoustic features and labels, to obtain the target recognition model.

The process of acquiring the behavioral intention label corresponding to each speech training sample may include: acquiring at least one speech; determining the behavioral operation corresponding to each speech in the at least one speech; generating the behavioral intention label corresponding to each behavioral operation; and determining the at least one speech as the at least one speech training sample and each generated behavioral intention label as the behavioral intention label of the corresponding speech training sample.

Further, before the at least one speech is acquired, whether the at least one speech all comes from a target user is queried according to the voiceprint feature of each speech, where the target user refers to a user associated with the first terminal; when the at least one speech all comes from the target user, the operation of acquiring the at least one speech is performed.

In one possible implementation, querying whether the at least one speech all comes from the target user according to the voiceprint feature of each speech may include: determining the difference value between the voiceprint feature of each speech and a preset voiceprint feature; and when the difference values between the voiceprint feature of each speech and the preset voiceprint feature are all smaller than a preset threshold, determining that the at least one speech all comes from the target user.

In some embodiments, after the recognition model to be trained is trained based on the acoustic features of the at least one speech training sample and the behavioral intention label corresponding to each speech training sample to obtain the target recognition model, the target recognition model is shared with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In the embodiments of the present invention, a target speech to be recognized is collected and its acoustic features are acquired. A target recognition model is invoked; since the target recognition model can recognize the behavioral intention corresponding to any speech according to its acoustic features, after the acquired acoustic features are input into the target recognition model, the behavioral intention label corresponding to the target speech can be output. Whether the input is standard speech or non-standard speech, the corresponding behavioral intention can be recognized by the target recognition model based on acoustic features, which enhances the applicability of speech recognition.

Fig. 2 is a flowchart of a speech recognition method according to another exemplary embodiment. In this embodiment, the speech recognition method is illustrated as applied to the above first terminal, and may include the following implementation steps:

Step 201: collect a target speech to be recognized.

When the user wants to control the first terminal by voice, the user can speak directly to a speech collection device in the first terminal, such as a microphone array. Accordingly, the first terminal can collect what the user says through the speech collection device, that is, collect the target speech to be recognized.

Further, the first terminal may collect the target speech to be recognized upon receiving a speech recognition instruction. The speech recognition instruction may be triggered by the user through a specified operation, which may include a click operation, a slide operation and so on; this is not limited in the embodiments of the present invention.

For example, the first terminal may provide a speech recognition option. When the user wants to control the first terminal by voice, the user can click the speech recognition option to trigger the speech recognition instruction. After receiving the speech recognition instruction, the first terminal collects the target speech to be recognized.

Step 202: acquire the acoustic features of the target speech.

To facilitate the subsequent speech recognition of the target speech, the first terminal acquires the acoustic features of the target speech. The acoustic features may be used to describe at least one of the loudness, pitch, frequency and timbre of the target speech.
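The patent does not fix how the acoustic features are computed beyond saying they may describe loudness, pitch, frequency and timbre. As a purely illustrative sketch, MFCC features extracted with librosa are one common choice; the sampling rate, feature count and normalisation below are assumptions, not the patent's stated method.

    # Minimal sketch of acoustic feature extraction, assuming MFCCs as the feature type.
    import librosa
    import numpy as np

    def extract_acoustic_features(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
        """Load a recording and return an (n_mfcc, frames) MFCC matrix."""
        audio, _ = librosa.load(wav_path, sr=sr)          # resample to a fixed rate
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
        # Per-utterance mean/variance normalisation keeps recordings comparable.
        return (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)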

Step 203: invoke a target recognition model, input the acoustic features into the target recognition model, and output the behavioral intention label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intention corresponding to any speech according to the acoustic features of that speech.

Since the target recognition model can recognize the behavioral intention corresponding to any speech according to its acoustic features, after the first terminal inputs the acoustic features of the target speech into the target recognition model, the behavioral intention label corresponding to the target speech can be output.

In some embodiments, the behavioral intention label may be a sequence of behavioral intentions; in other words, the behavioral intention label may describe a series of actions, for example "open the camera, then log in to WeChat, and then play music".

Further, after the first terminal determines the behavioral intention label corresponding to the target speech, it may perform the behavioral operation corresponding to that label. For example, if the behavioral intention label is a "start camera" label, the first terminal may start the installed camera application. In this way, the purpose of controlling the first terminal by the target speech is achieved.
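The dispatch from label to operation described above could look roughly like the sketch below. The intent label names, the model.predict interface and the launch_app helper are hypothetical placeholders for illustration, not interfaces defined by the patent.

    # Hedged sketch: recognise the intent label, then run the matching terminal operation.
    import numpy as np

    def launch_app(name: str) -> None:
        print(f"launching {name}")                 # stand-in for the real terminal call

    INTENT_ACTIONS = {
        "open_camera": lambda: launch_app("camera"),
        "start_baidu_map": lambda: launch_app("baidu_map"),
    }

    def recognize_and_execute(model, features: np.ndarray) -> str:
        """Feed acoustic features to the trained model and execute the matching action."""
        label = model.predict(features)            # e.g. "open_camera"
        action = INTENT_ACTIONS.get(label)
        if action is not None:
            action()                               # perform the behavioral operation
        return label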

Further, before the target recognition model is invoked, model training needs to be performed to obtain the target recognition model. In one possible implementation, the training process may include: acquiring the acoustic features of at least one speech training sample and the behavioral intention label corresponding to each speech training sample, and training a recognition model to be trained based on these acoustic features and labels, to obtain the target recognition model.

That is, the first terminal acquires the acoustic features of at least one speech training sample and the behavioral intention label corresponding to each speech training sample, inputs them into the recognition model to be trained, and performs deep learning and training on the recognition model, thereby obtaining a target recognition model capable of recognizing the behavioral intention of any speech based on the acoustic features of that speech.

In some embodiments, the recognition model to be trained may be a CNN (Convolutional Neural Network) model, which is not limited in the embodiments of the present invention.
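Since the patent names a CNN as one possible recognition model, a minimal PyTorch training sketch is shown below. The layer sizes, the assumption of fixed-length MFCC inputs and the optimiser settings are illustrative choices, not details specified by the patent.

    # Minimal sketch: train a small 1-D CNN that maps MFCC sequences to intent labels.
    import torch
    import torch.nn as nn

    class IntentCNN(nn.Module):
        def __init__(self, n_mfcc: int = 40, n_intents: int = 10):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),           # pool over time to a fixed size
            )
            self.fc = nn.Linear(64, n_intents)

        def forward(self, x):                      # x: (batch, n_mfcc, frames)
            return self.fc(self.conv(x).squeeze(-1))

    def train(model, loader, epochs: int = 20, lr: float = 1e-3):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for feats, labels in loader:           # feats: MFCC tensors, labels: intent ids
                opt.zero_grad()
                loss = loss_fn(model(feats), labels)
                loss.backward()
                opt.step()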

The process of acquiring the behavioral intention label corresponding to each speech training sample may include: acquiring at least one speech and determining the behavioral operation corresponding to each speech in the at least one speech; generating the behavioral intention label corresponding to each behavioral operation; determining the at least one speech as the at least one speech training sample; and determining each generated behavioral intention label as the behavioral intention label of the corresponding speech training sample.

In one possible implementation, the behavioral operation corresponding to each speech is the operation executed when an operation instruction triggered for that speech is received. In some embodiments, the user may input at least one speech to the first terminal and, after inputting each speech, may manually trigger the corresponding behavioral operation. For example, the user may say "start Baidu Map" to the first terminal in dialect and then manually start Baidu Map. At this time, the first terminal can collect each speech and determine the corresponding behavioral operation, for example the operation of starting Baidu Map. The first terminal then generates the behavioral intention label of the determined behavioral operation, for example a "start Baidu Map" label, determines each acquired speech as a speech training sample in the at least one speech training sample, and determines each generated label as the behavioral intention label of the corresponding speech training sample. In this way, the first terminal can train the recognition model to be trained based on the determined at least one speech training sample and the behavioral intention label corresponding to each speech training sample.

Of course, the above only describes the case where the behavioral operation corresponding to each speech is executed when an operation instruction triggered for that speech is received. In another embodiment, the behavioral operation corresponding to each speech may also be executed when a standard control speech recorded for that speech is collected, where the standard control speech usually refers to Mandarin. In some embodiments, since the first terminal can recognize Mandarin, the user inputs at least one speech to the first terminal and, after inputting each speech, triggers the corresponding behavioral operation through the standard control speech. For example, the user may say "start Baidu Map" to the first terminal in dialect and then control the first terminal in Mandarin to start Baidu Map. At this time, the first terminal can collect each speech and determine the corresponding behavioral operation, for example the operation of starting Baidu Map. The first terminal then generates the behavioral intention label of the determined behavioral operation, for example a "start Baidu Map" label, determines each acquired speech as a speech training sample in the at least one speech training sample, and determines each generated label as the behavioral intention label of the corresponding speech training sample.
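An illustrative sketch of how (speech, observed operation) pairs could be assembled into labelled training samples follows. The dataclass and the label registry are assumptions introduced for the example; the patent does not define these structures.

    # Hedged sketch: pair each recorded utterance with the operation the user then performed.
    from dataclasses import dataclass

    @dataclass
    class TrainingSample:
        wav_path: str          # recorded (possibly dialect) utterance
        intent_label: str      # label generated from the observed behavioral operation

    LABELS = {}                # observed operation name -> generated intent label

    def make_sample(wav_path: str, observed_operation: str) -> TrainingSample:
        label = LABELS.setdefault(observed_operation, f"intent_{observed_operation}")
        return TrainingSample(wav_path=wav_path, intent_label=label)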

Further, before the at least one speech is acquired, the source of the at least one speech may also be restricted. The corresponding processing includes: querying, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal; when the at least one speech all comes from the target user, performing the operation of acquiring the at least one speech.

In some embodiments, for example in a multi-person scenario, the first terminal may train only on the speech of the target user associated with it; for example, the target user may be the owner of the first terminal. In this case, before acquiring the at least one speech, the first terminal needs to determine whether the at least one speech comes from the target user, which it can query according to the voiceprint feature of each speech.

In one possible implementation, querying whether the at least one speech all comes from the target user according to the voiceprint feature of each speech may include: determining the difference value between the voiceprint feature of each speech and a preset voiceprint feature; and when the difference values between the voiceprint feature of each speech and the preset voiceprint feature are all smaller than a preset threshold, determining that the at least one speech all comes from the target user.

The preset voiceprint feature may be stored in the first terminal in advance and may be the voiceprint feature of the target user. Thus, after acquiring the voiceprint feature of each speech, the first terminal can compare it with the pre-stored preset voiceprint feature and determine the difference value between the two.

In some embodiments, pattern matching may be performed between the acquired voiceprint feature of each speech and the pre-stored preset voiceprint feature to determine the difference value between them. Pattern-matching methods may include probabilistic-statistical methods, artificial neural network methods and so on, which are not limited in the embodiments of the present application.

When the difference value is smaller than the preset threshold, the difference between the voiceprint feature of the compared speech and the preset voiceprint feature is small, and it can be determined that the compared speech comes from the target user. Conversely, if the difference value is greater than the preset threshold, the difference between the two is large, and it can be determined that the compared speech does not come from the target user. In this way, by comparing the voiceprint feature of each speech against the preset threshold, it can be determined whether the at least one speech all comes from the target user.
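The threshold comparison described above might be implemented roughly as follows. Cosine distance over fixed-length voiceprint vectors is one possible difference measure, and the threshold value is an illustrative assumption; the patent fixes neither the metric nor how the voiceprint features are obtained.

    # Hedged sketch: accept the utterances only if every voiceprint is close to the preset one.
    import numpy as np

    def voiceprint_difference(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine distance between two voiceprint feature vectors."""
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def all_from_target_user(voiceprints, preset: np.ndarray, threshold: float = 0.3) -> bool:
        return all(voiceprint_difference(v, preset) < threshold for v in voiceprints)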

The preset threshold may be customized by the user according to actual needs, or may be set by default by the first terminal, which is not limited in the embodiments of the present invention.

Further, the above only takes querying whether the at least one speech all comes from the target user according to the voiceprint feature of each speech as an example. In another embodiment, other information may be used to make this query; for example, the sound source position of each speech may also be used to determine whether the at least one speech comes from the target user.

In one possible implementation, the first terminal may determine, according to the voiceprint feature of any one of the collected speeches, whether that speech comes from the target user. When it is determined that this speech comes from the target user, the first terminal then determines whether the sound source positions of the other speeches are the same as that of this speech, that is, whether the at least one speech all comes from the same direction. When the sound source positions of the at least one speech are all the same, it is determined that the at least one speech comes from the same user, that is, from the target user; otherwise, it can be determined that the at least one speech does not all come from the same target user. The sound source position of each speech may be determined based on parameters such as the received strength of the collected speech, which is not limited in the embodiments of the present invention.
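A hedged sketch of combining the two checks: confirm one utterance by voiceprint, then require the remaining utterances to share its estimated source direction. The angle representation and the tolerance are assumptions; the patent only notes that the source position may be derived from parameters such as received signal strength.

    # Illustrative sketch: one voiceprint match plus a same-direction check for the rest.
    def same_direction(angles_deg, tolerance_deg: float = 15.0) -> bool:
        """True if every estimated source angle lies within the tolerance of the first one."""
        ref = angles_deg[0]
        return all(abs(a - ref) <= tolerance_deg for a in angles_deg)

    def all_from_target_by_direction(first_is_target: bool, angles_deg) -> bool:
        return first_is_target and same_direction(angles_deg)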

It should be noted that, when the at least one speech does not all come from the target user, the speeches that do not come from the target user may be deleted from the at least one speech, and the remaining speeches are then acquired.

At this point, the speech recognition method involved in the embodiments of the present invention has been described. Further, since the user may use multiple terminals, in order to enable other terminals to perform speech recognition in the same way as the first terminal, the first terminal may also share the target recognition model after training the recognition model to be trained to obtain it; for the specific implementation, see step 204 below.

Step 204: share the target recognition model with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In one possible implementation, the association relationship means belonging to the same user as the first terminal, having a connection relationship with the first terminal, being in the same environment as the first terminal, and so on.

Next, the case where the second terminal and the first terminal belong to the same user is taken as an example; for instance, both belong to the above target user. That is, so that the target user can also control the second terminal by voice when using it, after the first terminal obtains the target recognition model through training, it can send the target recognition model to the second terminal belonging to the same user. Accordingly, after receiving the target recognition model shared by the first terminal, the second terminal stores it locally.

Of course, the first terminal may also receive a trained target recognition model sent by the second terminal; that is, the second terminal may also perform model training in the above manner and share the trained target recognition model with the first terminal. The first terminal receives the trained target recognition model shared by the second terminal. Further, when the first terminal already stores a target recognition model and then receives the latest target recognition model shared by the second terminal, the first terminal may delete the originally stored target recognition model and save the latest one shared by the second terminal.
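Exporting the trained model and replacing an older copy on the receiving terminal could be sketched as below. The file-based transfer and the PyTorch serialisation are assumptions made for illustration; the patent does not specify the sharing mechanism.

    # Minimal sketch: sender serialises the model, receiver overwrites any older copy.
    import torch

    def export_model(model, path: str = "target_recognition_model.pt") -> str:
        torch.save(model.state_dict(), path)       # serialise the weights for sharing
        return path

    def import_shared_model(model, path: str):
        # Replace whatever target recognition model the receiving terminal stored before.
        model.load_state_dict(torch.load(path, map_location="cpu"))
        return model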

It is worth mentioning that the first terminal shares the trained target recognition model with the second terminal, so that the second terminal holds the same target recognition model as the first terminal. In this way, the second terminal can use the target recognition model directly for speech recognition without having to train one itself, which reduces the number of training runs required on the second terminal.

In the embodiments of the present invention, a target speech to be recognized is collected and its acoustic features are acquired. A target recognition model is invoked; since the target recognition model can recognize the behavioral intention corresponding to any speech according to its acoustic features, after the acquired acoustic features are input into the target recognition model, the behavioral intention label corresponding to the target speech can be output. Whether the input is standard speech or non-standard speech, the corresponding behavioral intention can be recognized by the target recognition model based on acoustic features, which enhances the applicability of speech recognition.

Fig. 3 is a schematic structural diagram of a speech recognition device according to an exemplary embodiment. The speech recognition device may be implemented by software, hardware or a combination of the two, and may include:

a collection module 301, configured to collect a target speech to be recognized;

a first acquisition module 302, configured to acquire the acoustic features of the target speech;

an invoking module 303, configured to invoke a target recognition model, input the acoustic features into the target recognition model, and output the behavioral intention label corresponding to the target speech, where the target recognition model is used to recognize the behavioral intention corresponding to any speech according to the acoustic features of that speech.

Optionally, referring to Fig. 4, the device further includes:

a second acquisition module 304, configured to acquire the acoustic features of at least one speech training sample and the behavioral intention label corresponding to each speech training sample;

a training module 305, configured to train a recognition model to be trained based on the acoustic features of the at least one speech training sample and the behavioral intention label corresponding to each speech training sample, to obtain the target recognition model.

Optionally, the second acquisition module 304 is configured to:

acquire at least one speech;

determine the behavioral operation corresponding to each speech in the at least one speech;

generate the behavioral intention label corresponding to each behavioral operation;

determine the at least one speech as the at least one speech training sample, and determine each generated behavioral intention label as the behavioral intention label of the corresponding speech training sample.

Optionally, the second acquisition module 304 is further configured to:

query, according to the voiceprint feature of each speech, whether the at least one speech all comes from a target user, where the target user refers to a user associated with the first terminal;

when the at least one speech all comes from the target user, perform the operation of acquiring the at least one speech.

Optionally, the second acquisition module 304 is further configured to:

determine the difference value between the voiceprint feature of each speech and a preset voiceprint feature;

when the difference values between the voiceprint feature of each speech and the preset voiceprint feature are all smaller than a preset threshold, determine that the at least one speech all comes from the target user.

Optionally, referring to Fig. 5, the device further includes:

a sharing module 306, configured to share the target recognition model with a second terminal, where the second terminal refers to a terminal associated with the first terminal.

In the embodiments of the present invention, a target speech to be recognized is collected and its acoustic features are acquired. A target recognition model is invoked; since the target recognition model can recognize the behavioral intention corresponding to any speech according to its acoustic features, after the acquired acoustic features are input into the target recognition model, the behavioral intention label corresponding to the target speech can be output. Whether the input is standard speech or non-standard speech, the corresponding behavioral intention can be recognized by the target recognition model based on acoustic features, which enhances the applicability of speech recognition.

It should be noted that, when the speech recognition device provided in the above embodiments implements the speech recognition method, the division into the above functional modules is only used as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech recognition device provided in the above embodiments and the embodiments of the speech recognition method belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.

图6示出了本发明一个示例性实施例提供的终端600的结构框图。该终端600可以是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio LayerIV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端600还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。Fig. 6 shows a structural block diagram of a terminal 600 provided by an exemplary embodiment of the present invention. The terminal 600 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, moving picture experts compress standard audio layer 3), MP4 (Moving Picture Experts Group Audio Layer IV, moving picture experts compress standard audio layer 4) Player, laptop or desktop computer. The terminal 600 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.

通常,终端600包括有:处理器601和存储器602。Generally, the terminal 600 includes: a processor 601 and a memory 602 .

处理器601可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器601可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器601也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central ProcessingUnit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器601可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器601还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。The processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 601 can adopt at least one hardware form in DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array, programmable logic array) accomplish. The processor 601 may also include a main processor and a coprocessor, the main processor is a processor for processing data in a wake-up state, and is also called a CPU (Central Processing Unit, central processing unit); Low-power processor for processing data in standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used for rendering and drawing content that needs to be displayed on the display screen. In some embodiments, the processor 601 may further include an AI (Artificial Intelligence, artificial intelligence) processor, where the AI processor is configured to process computing operations related to machine learning.

The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 602 stores at least one instruction to be executed by the processor 601 to implement the speech recognition method provided by the method embodiments of this application.

In some embodiments, the terminal 600 may optionally further include a peripheral device interface 603 and at least one peripheral device. The processor 601, the memory 602, and the peripheral device interface 603 may be connected by buses or signal lines. Each peripheral device may be connected to the peripheral device interface 603 through a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 604, a touch display screen 605, a camera 606, an audio circuit 607, a positioning component 608, and a power supply 609.

The peripheral device interface 603 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 601 and the memory 602. In some embodiments, the processor 601, the memory 602, and the peripheral device interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral device interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 604 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 604 communicates with communication networks and other communication devices through electromagnetic signals, converting electrical signals into electromagnetic signals for transmission and converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 604 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 604 can communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may also include circuits related to NFC (Near Field Communication), which is not limited in this application.

The display screen 605 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, it can also collect touch signals on or above its surface; such a touch signal may be input to the processor 601 as a control signal for processing. In this case, the display screen 605 may also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there is one display screen 605, set on the front panel of the terminal 600; in other embodiments, there are at least two display screens 605, arranged on different surfaces of the terminal 600 or in a folding design; in still other embodiments, the display screen 605 may be a flexible display arranged on a curved or folded surface of the terminal 600. The display screen 605 may even be set as a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 605 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Usually, the front camera is set on the front panel of the terminal and the rear camera is set on the back. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, or a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 606 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation under different color temperatures.

The audio circuit 607 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals that are input to the processor 601 for processing, or to the radio frequency circuit 604 for voice communication. For stereo capture or noise reduction, there may be multiple microphones, arranged at different parts of the terminal 600. The microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional film speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 607 may also include a headphone jack.
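
As a further illustration only, the sketch below shows how a voice picked up by the microphone might be captured and converted into the same kind of acoustic features used by the recognition sketch above; it assumes Python with the sounddevice package for recording and librosa for features, and the function names are hypothetical.

    import numpy as np
    import sounddevice as sd  # assumed available for microphone capture
    import librosa

    def capture_target_voice(duration_s=3.0, sr=16000):
        # Record from the default microphone and return a mono float32 waveform.
        frames = int(duration_s * sr)
        audio = sd.rec(frames, samplerate=sr, channels=1, dtype="float32")
        sd.wait()  # block until the recording finishes
        return np.squeeze(audio)

    def features_from_microphone(duration_s=3.0, sr=16000, n_mfcc=13):
        # Capture speech and compute the same MFCC-based feature vector used for recognition.
        y = capture_target_voice(duration_s, sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return mfcc.mean(axis=1)

The recording length, sample rate, and feature choice here are placeholders; any capture path that delivers the target voice to the processor for feature extraction fits the description above.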

The positioning component 608 is used to determine the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be based on the GPS (Global Positioning System) of the United States, China's BeiDou system, or Russia's GLONASS system.

The power supply 609 is used to supply power to the various components of the terminal 600. The power supply 609 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 609 includes a rechargeable battery, the battery may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charging technology.

In some embodiments, the terminal 600 further includes one or more sensors 610, including but not limited to an acceleration sensor 611, a gyroscope sensor 612, a pressure sensor 613, a fingerprint sensor 614, an optical sensor 615, and a proximity sensor 616.

The acceleration sensor 611 can detect acceleration along the three axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 can detect the components of gravitational acceleration on the three axes. The processor 601 may, according to the gravitational acceleration signal collected by the acceleration sensor 611, control the touch display screen 605 to display the user interface in landscape or portrait view. The acceleration sensor 611 can also be used to collect motion data for games or for the user.

The gyroscope sensor 612 can detect the body orientation and rotation angle of the terminal 600 and can cooperate with the acceleration sensor 611 to capture the user's 3D actions on the terminal 600. Based on the data collected by the gyroscope sensor 612, the processor 601 can implement functions such as motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

The pressure sensor 613 may be arranged on a side frame of the terminal 600 and/or under the touch display screen 605. When the pressure sensor 613 is arranged on the side frame, it can detect the user's grip on the terminal 600, and the processor 601 performs left- or right-hand recognition or shortcut operations based on the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is arranged under the touch display screen 605, the processor 601 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 605. The operable controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.

The fingerprint sensor 614 is used to collect the user's fingerprint; the processor 601 identifies the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the user according to the collected fingerprint. When the user's identity is recognized as trusted, the processor 601 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 614 may be provided on the front, back, or side of the terminal 600. When the terminal 600 is provided with a physical button or a manufacturer logo, the fingerprint sensor 614 may be integrated with the physical button or the manufacturer logo.

The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, the processor 601 controls the display brightness of the touch display screen 605 according to the ambient light intensity collected by the optical sensor 615: when the ambient light intensity is high, the display brightness of the touch display screen 605 is increased; when the ambient light intensity is low, the display brightness is decreased. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
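
For illustration only, a minimal sketch of the brightness adjustment just described, assuming a simple linear mapping from the measured ambient light (in lux) to a normalized brightness level; the constants and function name are hypothetical.

    def display_brightness(ambient_lux, min_level=0.2, max_level=1.0, max_lux=500.0):
        # Map measured ambient light to a display brightness level in [min_level, max_level]:
        # brighter surroundings raise the screen brightness, darker surroundings lower it.
        level = ambient_lux / max_lux
        return max(min_level, min(max_level, level))

For example, display_brightness(50.0) returns the floor value 0.2 in dim light, while display_brightness(600.0) saturates at 1.0 in bright light.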

The proximity sensor 616, also called a distance sensor, is usually arranged on the front panel of the terminal 600 and is used to measure the distance between the user and the front of the terminal 600. In one embodiment, when the proximity sensor 616 detects that this distance is gradually decreasing, the processor 601 controls the touch display screen 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 616 detects that the distance is gradually increasing, the processor 601 controls the touch display screen 605 to switch from the screen-off state to the screen-on state.

Those skilled in the art can understand that the structure shown in Fig. 6 does not limit the terminal 600, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.

An embodiment of the present application also provides a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by the processor of a mobile terminal, the mobile terminal is enabled to execute the speech recognition method provided by the embodiments shown in Fig. 1 or Fig. 2.

An embodiment of the present application also provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the speech recognition method provided by the embodiments shown in Fig. 1 or Fig. 2.

Those of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments may be completed by hardware, or by instructing the relevant hardware through a program. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (14)

1. A speech recognition method, applied to a first terminal, wherein the method comprises:
collecting a target voice to be recognized;
obtaining an acoustic feature of the target voice; and
invoking a target recognition model, inputting the acoustic feature into the target recognition model, and outputting a behavior intention label corresponding to the target voice, wherein the target recognition model is used to recognize, according to the acoustic feature of any voice, the behavioral intention corresponding to that voice.
2. The method according to claim 1, wherein, before invoking the target recognition model, the method further comprises:
obtaining acoustic features of at least one voice training sample and a behavior intention label corresponding to each voice training sample; and
training a recognition model to be trained, based on the acoustic features of the at least one voice training sample and the behavior intention label corresponding to each voice training sample, to obtain the target recognition model.
3. The method according to claim 2, wherein obtaining the behavior intention label corresponding to each voice training sample comprises:
obtaining at least one voice;
determining a behavior operation corresponding to each voice in the at least one voice;
generating a behavior intention label corresponding to each behavior operation; and
determining the at least one voice as the at least one voice training sample, and determining each generated behavior intention label as the behavior intention label of the corresponding voice training sample.
4. The method according to claim 3, wherein, before obtaining the at least one voice, the method further comprises:
querying, according to the voiceprint feature of each voice, whether the at least one voice all comes from a target user; and
when the at least one voice all comes from the target user, performing the operation of obtaining the at least one voice.
5. The method according to claim 4, wherein querying, according to the voiceprint feature of each voice, whether the at least one voice all comes from the target user comprises:
determining a difference value between the voiceprint feature of each voice and a preset voiceprint feature; and
when the difference values between the voiceprint features of the voices and the preset voiceprint feature are all less than a preset threshold, determining that the at least one voice all comes from the target user.
6. The method according to claim 2, wherein, after training the recognition model to be trained based on the acoustic features of the at least one voice training sample and the behavior intention label corresponding to each voice training sample to obtain the target recognition model, the method further comprises:
sharing the target recognition model with a second terminal, the second terminal being a terminal having an association relationship with the first terminal.
7. A speech recognition apparatus, applied to a first terminal, wherein the apparatus comprises:
a collection module, configured to collect a target voice to be recognized;
a first obtaining module, configured to obtain an acoustic feature of the target voice; and
an invoking module, configured to invoke a target recognition model, input the acoustic feature into the target recognition model, and output a behavior intention label corresponding to the target voice, wherein the target recognition model is used to recognize, according to the acoustic feature of any voice, the behavioral intention corresponding to that voice.
8. The apparatus according to claim 7, wherein the apparatus further comprises:
a second obtaining module, configured to obtain acoustic features of at least one voice training sample and a behavior intention label corresponding to each voice training sample; and
a training module, configured to train a recognition model to be trained, based on the acoustic features of the at least one voice training sample and the behavior intention label corresponding to each voice training sample, to obtain the target recognition model.
9. The apparatus according to claim 8, wherein the second obtaining module is configured to:
obtain at least one voice;
determine a behavior operation corresponding to each voice in the at least one voice;
generate a behavior intention label corresponding to each behavior operation; and
determine the at least one voice as the at least one voice training sample, and determine each generated behavior intention label as the behavior intention label of the corresponding voice training sample.
10. The apparatus according to claim 9, wherein the second obtaining module is further configured to:
query, according to the voiceprint feature of each voice, whether the at least one voice all comes from a target user, the target user being a user having an association relationship with the first terminal; and
when the at least one voice all comes from the target user, perform the operation of obtaining the at least one voice.
11. The apparatus according to claim 10, wherein the second obtaining module is further configured to:
determine a difference value between the voiceprint feature of each voice and a preset voiceprint feature; and
when the difference values between the voiceprint features of the voices and the preset voiceprint feature are all less than a preset threshold, determine that the at least one voice all comes from the target user.
12. The apparatus according to claim 8, wherein the apparatus further comprises:
a sharing module, configured to share the target recognition model with a second terminal, the second terminal being a terminal having an association relationship with the first terminal.
13. A computer-readable storage medium having instructions stored thereon, wherein the method according to any one of claims 1-6 is implemented when the instructions are executed by a processor.
14. A computing device, comprising:
one or more processors; and
a memory for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to execute the method according to any one of claims 1-6.
CN201810758435.0A 2018-07-11 2018-07-11 Audio recognition method, device and storage medium Active CN108806670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810758435.0A CN108806670B (en) 2018-07-11 2018-07-11 Audio recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810758435.0A CN108806670B (en) 2018-07-11 2018-07-11 Audio recognition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN108806670A true CN108806670A (en) 2018-11-13
CN108806670B CN108806670B (en) 2019-06-25

Family

ID=64076065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810758435.0A Active CN108806670B (en) 2018-07-11 2018-07-11 Audio recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN108806670B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320439A (en) * 2007-06-08 2008-12-10 鹏智科技(深圳)有限公司 Biology-like device with automatic learning function
CN103778915A (en) * 2012-10-17 2014-05-07 三星电子(中国)研发中心 Speech recognition method and mobile terminal
CN105700897A (en) * 2014-11-24 2016-06-22 宇龙计算机通信科技(深圳)有限公司 Method and device for launching application program, and terminal device
CN107667399A (en) * 2015-06-25 2018-02-06 英特尔公司 Speech-recognition services

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111939559A (en) * 2019-05-16 2020-11-17 北京车和家信息技术有限公司 Control method and device for vehicle-mounted voice game
CN110246499A (en) * 2019-08-06 2019-09-17 苏州思必驰信息科技有限公司 The sound control method and device of home equipment
CN110246499B (en) * 2019-08-06 2021-05-25 思必驰科技股份有限公司 Voice control method and device for household equipment
CN110364146A (en) * 2019-08-23 2019-10-22 腾讯科技(深圳)有限公司 Audio recognition method, device, speech recognition apparatus and storage medium
CN110364146B (en) * 2019-08-23 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
CN110930989A (en) * 2019-11-27 2020-03-27 深圳追一科技有限公司 Speech intention recognition method and device, computer equipment and storage medium
CN110930989B (en) * 2019-11-27 2021-04-06 深圳追一科技有限公司 Speech intention recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108806670B (en) 2019-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220818

Address after: Room 35201, 5th Floor, Zone 2, Building 3, No. 2, Zhuantang Science and Technology Economic Zone, Xihu District, Hangzhou City, Zhejiang Province, 310024

Patentee after: Hangzhou suddenly Cognitive Technology Co.,Ltd.

Address before: 4-0001, East Zone, No. 1, Building 4, Building 1, No. 1, Xueyuan Road, Haidian District, Beijing 100083

Patentee before: BEIJING XIAOMO ROBOT TECHNOLOGY CO.,LTD.

TR01 Transfer of patent right

Effective date of registration: 20241224

Address after: 1102, 11th Floor, Momo Building, 199 Chaoyang North Road, Chaoyang District, Beijing 100020

Patentee after: Beijing Manxiang Time Culture Media Co.,Ltd.

Country or region after: China

Address before: Room 35201, 5th Floor, Zone 2, Building 3, No. 2, Zhuantang Science and Technology Economic Zone, Xihu District, Hangzhou City, Zhejiang Province, 310024

Patentee before: Hangzhou suddenly Cognitive Technology Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20250828

Address after: 101121 Beijing Tongzhou District Yangzhuang Road No. 1 Building 3 No. 1793

Patentee after: Beijing Shitian Cultural Development Co.,Ltd.

Country or region after: China

Address before: 1102, 11th Floor, Momo Building, 199 Chaoyang North Road, Chaoyang District, Beijing 100020

Patentee before: Beijing Manxiang Time Culture Media Co.,Ltd.

Country or region before: China