TWI809335B - Personalized speech recognition method and speech recognition system - Google Patents

Personalized speech recognition method and speech recognition system

Info

Publication number
TWI809335B
TWI809335B TW109143838A
Authority
TW
Taiwan
Prior art keywords
speech recognition
specific
voice
speech
recognition result
Prior art date
Application number
TW109143838A
Other languages
Chinese (zh)
Other versions
TW202223875A (en)
Inventor
莊郁強
Original Assignee
中華電信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中華電信股份有限公司
Priority to TW109143838A
Publication of TW202223875A
Application granted granted Critical
Publication of TWI809335B

Landscapes

  • Telephonic Communication Services (AREA)
  • Selective Calling Equipment (AREA)

Abstract

The disclosure provides a personalized speech recognition method and a speech recognition system. The method includes: in response to obtaining a voice input message, forwarding the voice input message to a cloud speech recognition system; receiving specific acoustic recognition information and a specific speech recognition result corresponding to the voice input message from the cloud speech recognition system; and using a personalized speech recognition model dedicated to a speech recognition module to correct the specific speech recognition result into a first speech recognition result based on the specific acoustic recognition information.

Description

Personalized speech recognition method and speech recognition system

The present invention relates to speech recognition technology, and more particularly to a personalized speech recognition method and a speech recognition system.

Nowadays, speech recognition is mostly performed through cloud speech recognition systems, which mainly perform recognition according to a preset language. However, when a user's voice message contains mixed Chinese and English, homophones, or domain-specific terms, the recognition results often fall short of expectations.

In this case, to correct the recognition results, the user must upload relevant personal data/information to the cloud speech recognition system to build a personalized speech recognition model. However, with the rising awareness of personal data protection, collecting personal data for correcting speech recognition results raises concerns about privacy and the reasonableness of data use.

In view of this, the present invention provides a personalized speech recognition method and a speech recognition system that can be used to solve the above technical problems.

The present invention provides a personalized speech recognition method suitable for a first speech recognition module, including: in response to obtaining a voice input message, forwarding the voice input message to a cloud speech recognition system, wherein the cloud speech recognition system generates, in response to the voice input message, specific acoustic recognition information and a specific speech recognition result corresponding to the voice input message; receiving the specific acoustic recognition information and the specific speech recognition result corresponding to the voice input message from the cloud speech recognition system; and using a first personalized speech recognition model dedicated to the first speech recognition module to correct the specific speech recognition result into a first speech recognition result based on the specific acoustic recognition information.

The present invention provides a speech recognition system including a first speech recognition module. The first speech recognition module is configured to: in response to obtaining a voice input message, forward the voice input message to a cloud speech recognition system, wherein the cloud speech recognition system generates, in response to the voice input message, specific acoustic recognition information and a specific speech recognition result corresponding to the voice input message; receive the specific acoustic recognition information and the specific speech recognition result corresponding to the voice input message from the cloud speech recognition system; and use a first personalized speech recognition model dedicated to the first speech recognition module to correct the specific speech recognition result into a first speech recognition result based on the specific acoustic recognition information.

Please refer to FIG. 1, which is a schematic diagram of a speech recognition system and a cloud speech recognition system according to a first embodiment of the present invention. In embodiments of the present invention, the speech recognition system 110 may be any of various smart devices, a voice assistant, or any electronic device with a speech recognition function, but is not limited thereto.

In FIG. 1, the speech recognition system 110 may include a speech recognition module 111, which may store a first personalized speech recognition model dedicated to a first user. In embodiments of the present invention, the first personalized speech recognition model is trained on a plurality of pieces of first personal information specific to the first user, where the first personal information may include at least one of the first user's address book, web browsing information, social media posts, and messaging app conversations, but is not limited thereto.

In one embodiment, after the speech recognition module 111 obtains the above first personal information of the first user, it may convert each piece of first personal information into a plurality of corresponding feature vectors to serve as training data for the first personalized speech recognition model. The speech recognition module 111 may then train the first personalized speech recognition model on this training data. In this way, the first personalized speech recognition model can learn characteristics such as the first user's word choices and habitual pronunciations, which serve as the basis for subsequently correcting speech recognition results, but is not limited thereto.
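One way to picture the "personal information → feature vectors → training data" step above is the toy featurizer below. This is only a sketch under stated assumptions: the hashed-character-bigram scheme, the dimension of 64, and all names are illustrative choices of ours, since the patent does not specify how the feature vectors are produced.

```python
def to_feature_vector(text, dim=64):
    """Toy featurizer: hash character bigrams of one piece of personal
    information (e.g. a contact name) into a fixed-size count vector.
    A real system would likely use learned embeddings instead."""
    vec = [0.0] * dim
    for a, b in zip(text, text[1:]):
        vec[hash(a + b) % dim] += 1.0
    return vec

# Each piece of first personal information becomes one training vector.
contacts = ["王曉銘", "Wang Xiaoming"]
training_data = [to_feature_vector(name) for name in contacts]
```

Each vector then serves as one training example for the personalized model; the training itself happens entirely on the local device.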

In addition, as shown in FIG. 1, the speech recognition system 110 may be connected to a cloud speech recognition system 120. In embodiments of the present invention, the cloud speech recognition system 120 may store a general speech recognition model. However, as noted above, when a user's voice message contains mixed Chinese and English, homophones, or domain-specific terms, the recognition results provided by such a general model are often unsatisfactory; the speech recognition system 110 of the present invention can improve this situation by executing the personalized speech recognition method of FIG. 2.

Please refer to FIG. 2, which is a flowchart of a personalized speech recognition method according to an embodiment of the present invention. The method of this embodiment can be executed by the speech recognition system 110 of FIG. 1; the details of each step of FIG. 2 are described below with reference to the components shown in FIG. 1.

In the embodiments of the present invention, it is assumed that the first user wants to control the speech recognition system 110 to perform a specific operation by uttering a voice input message VS to it.

In this case, in step S210, in response to obtaining the voice input message VS, the speech recognition module 111 may forward the voice input message VS to the cloud speech recognition system 120. Correspondingly, (the general speech recognition model of) the cloud speech recognition system 120 may generate, in response to the voice input message VS, specific acoustic recognition information AR and a specific speech recognition result VR corresponding to the voice input message, and may return the specific acoustic recognition information AR and the specific speech recognition result VR to the speech recognition system 110.

Correspondingly, in step S220, the speech recognition module 111 may receive the specific acoustic recognition information AR and the specific speech recognition result VR corresponding to the voice input message VS from the cloud speech recognition system 120.

Then, in step S230, the speech recognition module 111 may use the first personalized speech recognition model dedicated to it to correct the specific speech recognition result VR into a speech recognition result VO1 based on the specific acoustic recognition information AR.

Moreover, after obtaining the speech recognition result VO1, the speech recognition system 110 may further perform the specific operation corresponding to the speech recognition result VO1.
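The S210–S230 flow can be sketched in code. This is a minimal illustration under assumed interfaces: `CloudASR`, `PersonalizedModel`, and every method name here are hypothetical stand-ins, not the patent's actual API. The point of the sketch is the division of labor: only raw audio goes to the cloud, while the correction and all personal data stay local.

```python
class CloudASR:
    """Stand-in for cloud system 120: returns (acoustic info AR, result VR)."""
    def recognize(self, voice_message):
        # A real system would run a general acoustic + language model here.
        return "d a3 ...", "generic transcript"

class PersonalizedModel:
    """Stand-in for the first personalized model: rewrites the transcript."""
    def correct(self, acoustic_info, transcript):
        return transcript.replace("generic", "personalized")

class LocalRecognitionModule:
    """Sketch of module 111. Personal data never leaves this object."""
    def __init__(self, cloud, model):
        self.cloud = cloud
        self.model = model

    def handle(self, voice_message):
        # S210: forward the voice input message VS to the cloud system;
        # S220: receive the acoustic info AR and the cloud's result VR.
        acoustic_info, cloud_result = self.cloud.recognize(voice_message)
        # S230: correct VR locally using the personalized model.
        return self.model.correct(acoustic_info, cloud_result)

module = LocalRecognitionModule(CloudASR(), PersonalizedModel())
result = module.handle(b"...audio bytes...")
```

The returned `result` plays the role of VO1, on which the system would then act.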

To make the above concepts easier to understand, several application scenarios are described below.

In the first application scenario, assume that the voice input message VS uttered by the first user is 「打給 王曉銘」 ("call 王曉銘", where 王曉銘 is, for example, a contact in the first user's address book). Correspondingly, the specific acoustic recognition information AR and the specific speech recognition result VR provided by the cloud speech recognition system 120 in response to the voice input message VS may be "d a3 d i4 e4 nn4 h u4 a4 g e3 i3 u2 a2 ng2 x i3 a3 u3 m i2 ng2" and 「打給 王小明」, respectively.

In this case, if the speech recognition module 111 performed subsequent operations directly based on the specific speech recognition result VR (i.e., 「打給 王小明」), it would be unable to dial the correct number.

Therefore, in this embodiment, after receiving the above specific acoustic recognition information AR and specific speech recognition result VR, the speech recognition module 111 may, for example, use the first personalized speech recognition model to correct the specific speech recognition result VR (i.e., 「打給 王小明」) into the speech recognition result VO1 of 「打給 王曉銘」 based on the specific acoustic recognition information AR.

Specifically, since the first personalized speech recognition model has been trained on the first user's various first personal information (including the address book), it can determine from the specific acoustic recognition information AR that the term 「王小明」 should in fact be the contact 「王曉銘」 in the first user's address book, and can generate the above speech recognition result VO1 accordingly.

In this way, the speech recognition system 110 can correctly perform the specific operation required by the first user (e.g., dialing 王曉銘) based on the speech recognition result VO1.
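A crude way to see how the acoustic information AR lets a local model recover 「王曉銘」 from the homophone 「王小明」 is string similarity against romanized address-book entries. Everything here is an illustrative assumption: the romanization stored per contact, the 0.5 cutoff, and `difflib.SequenceMatcher` as the matcher. The patent's model learns such mappings from training data rather than looking them up, so this is a sketch of the effect, not the mechanism.

```python
import difflib

# Hypothetical address-book entry with an assumed romanization.
contacts = {"王曉銘": "u2 a2 ng2 x i3 a3 u3 m i2 ng2"}

def correct_name(acoustic_info, cloud_name, threshold=0.5):
    """Return the contact whose romanization best matches the acoustic
    string; keep the cloud's guess if nothing clears the threshold."""
    best, best_score = cloud_name, threshold
    for name, roman in contacts.items():
        score = difflib.SequenceMatcher(None, roman, acoustic_info).ratio()
        if score > best_score:
            best, best_score = name, score
    return best

# The AR string from the scenario; its tail matches 王曉銘's romanization.
ar = "d a3 d i4 e4 nn4 h u4 a4 g e3 i3 u2 a2 ng2 x i3 a3 u3 m i2 ng2"
corrected = correct_name(ar, "王小明")
```

Because the contact's romanization appears almost verbatim inside AR, the similarity score clears the threshold and the cloud's homophonous guess is replaced.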

In the second application scenario, assume that the voice input message VS uttered by the first user is 「電鍋 啟動」 ("start the rice cooker"), but the first user's accent/pronunciation habits (e.g., habitually speaking Taiwanese) make the word 「電鍋」 (rice cooker) sound closer to 「點歌」 (song request), while 「啟動」 (start) is pronounced correctly. Correspondingly, the specific acoustic recognition information AR and the specific speech recognition result VR provided by the cloud speech recognition system 120 in response to the voice input message VS may be "d i3 e3 nn3 g er1 q i3 d o4 ng4" and 「點歌 啟動」, respectively.

In this case, if the speech recognition module 111 performed subsequent operations directly based on the specific speech recognition result VR (i.e., 「點歌 啟動」), it might be unable to correctly perform the operation of starting the rice cooker.

Therefore, in this embodiment, after receiving the above specific acoustic recognition information AR and specific speech recognition result VR, the speech recognition module 111 may, for example, use the first personalized speech recognition model to correct the specific speech recognition result VR (i.e., 「點歌 啟動」) into the speech recognition result VO1 of 「電鍋 啟動」 based on the specific acoustic recognition information AR.

Specifically, since the first user may habitually speak Taiwanese, his or her web browsing information may include a large number of Taiwanese-language web pages (e.g., Taiwanese video pages). In this case, since the first personalized speech recognition model has been trained on the first user's various first personal information (including such web browsing information), it can determine from the specific acoustic recognition information AR that the term 「點歌」 should in fact be 「電鍋」, and can generate the above speech recognition result VO1 accordingly.

In this way, the speech recognition system 110 can correctly perform the specific operation required by the first user (e.g., starting the rice cooker) based on the speech recognition result VO1.

As can be seen from the above, the method of the present invention allows the speech recognition module to correct the specific speech recognition result into the first speech recognition result based on the first personalized speech recognition model dedicated to the first user. Since the first speech recognition result is closer to the first user's pronunciation and word-choice habits, the speech recognition module can perform more accurate operations accordingly. In addition, since the above technique does not involve uploading any of the first user's first personal information to the cloud speech recognition system, it also achieves the effect of protecting personal data.

In other embodiments, the above first speech recognition result may be further integrated with speech recognition results provided by other speech recognition modules to generate an integrated speech recognition result corresponding to a specific group.

Please refer to FIG. 3, which is a schematic diagram of a speech recognition system and a cloud speech recognition system according to a second embodiment of the present invention. In FIG. 3, the speech recognition system 110 may include a plurality of speech recognition modules 111–11N, each of which may be dedicated to a corresponding user. For ease of understanding, the following description uses an embodiment in which the speech recognition system 110 includes only the speech recognition modules 111 and 112, but the present invention is not limited thereto.

In the second embodiment, for details of the speech recognition module 111, refer to the description of the first embodiment, which is not repeated here. In addition, the speech recognition module 112 may store a second personalized speech recognition model dedicated to a second user, where the second user and the first user may belong to the same specific group (e.g., members of the same office, the same laboratory, or the same team).

In embodiments of the present invention, the second personalized speech recognition model is trained on a plurality of pieces of second personal information specific to the second user, where the second personal information may include at least one of the second user's address book, web browsing information, social media posts, and messaging app conversations, but is not limited thereto.

In the second embodiment, after the speech recognition module 112 obtains the above second personal information of the second user, it may convert each piece of second personal information into a plurality of corresponding feature vectors to serve as training data for the second personalized speech recognition model. The speech recognition module 112 may then train the second personalized speech recognition model on this training data. In this way, the second personalized speech recognition model can learn characteristics such as the second user's word choices and habitual pronunciations, which serve as the basis for subsequently correcting speech recognition results, but is not limited thereto.

In the second embodiment, it is assumed that the first/second user, belonging to the above specific group, wants to control the speech recognition system 110 to perform a specific operation by uttering a voice input message VS to it.

In this case, the speech recognition system 110 may forward the voice input message VS to the cloud speech recognition system 120 according to the previous teaching, and the cloud speech recognition system 120 may correspondingly return the specific acoustic recognition information AR and the specific speech recognition result VR corresponding to the voice input message VS.

Afterwards, the speech recognition module 111 may use the first personalized speech recognition model to correct the specific speech recognition result VR into the speech recognition result VO1, as taught in the previous embodiment.

In addition, after the speech recognition module 112 obtains the specific acoustic recognition information AR and the specific speech recognition result VR corresponding to the voice input message VS, it may use the second personalized speech recognition model dedicated to it to correct the specific speech recognition result VR into a speech recognition result VO2 based on the specific acoustic recognition information AR. That is, the second personalized speech recognition model may correct the specific speech recognition result VR into the speech recognition result VO2 based on the second user's word-choice/pronunciation habits, but is not limited thereto.

Afterwards, the speech recognition system 110 may generate an integrated speech recognition result VO corresponding to the above specific group based on the speech recognition results VO1 and VO2.

It follows that the integrated speech recognition result VO generated according to the above teaching better matches the word-choice/pronunciation habits of the members of the specific group, yielding a better speech recognition effect. Moreover, since the above technique does not involve uploading any group member's personal information to the cloud speech recognition system, it also achieves the effect of protecting each member's personal data.

In one embodiment, the speech recognition system 110 may integrate the speech recognition results VO1 and VO2 into the integrated speech recognition result VO corresponding to the specific group based on a plurality of weights corresponding to the first user and the second user.

For example, assume that the first user is the leader/manager of the specific group, while the second user is an ordinary member of the group. In this case, the weight corresponding to the first user may be set higher than that of the second user. Therefore, when the speech recognition system 110 integrates the speech recognition results VO1 and VO2, the resulting integrated speech recognition result VO may, for example, be closer to the first user's word-choice/pronunciation habits, but the present invention is not limited thereto.

In addition, in other embodiments, if the specific group includes N members in total, the speech recognition system 110 may be adapted as shown in FIG. 3. That is, the speech recognition system 110 may include speech recognition modules 111–11N corresponding respectively to the N members. In this case, after the speech recognition system 110 receives the specific acoustic recognition information AR and the specific speech recognition result VR, the speech recognition modules 111–11N may each use their corresponding personalized speech recognition model to correct the specific speech recognition result VR into speech recognition results VO1–VON. The speech recognition results VO1–VON may then be integrated as taught above to generate the integrated speech recognition result VO. In this way, the integrated speech recognition result VO better matches the word-choice/pronunciation habits of the specific group, allowing the speech recognition system 110 to more accurately perform the specific operation required by the group, but is not limited thereto.
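At its simplest, the weight-based integration of the per-member results VO1–VON can be sketched as a weighted vote over candidate sentences. The data shapes and the specific weights below are assumptions of ours for illustration; the patent describes the weighting (e.g., leader outweighing ordinary members) but leaves the integration function at this level of generality.

```python
def integrate(member_results):
    """member_results: list of (recognition_result, weight) pairs, one per
    group member. Returns the candidate sentence with the highest total
    weight -- one simple realization of the weighted integration."""
    scores = {}
    for text, weight in member_results:
        scores[text] = scores.get(text, 0.0) + weight
    return max(scores, key=scores.get)

# Illustrative weights: the group leader's result VO1 outvotes the two
# ordinary members' identical results VO2 and VO3 (0.6 > 0.25 + 0.15? no --
# 0.6 > 0.40, so the leader's candidate wins the vote).
vo = integrate([
    ("打給 王曉銘", 0.6),   # VO1, group leader
    ("打給 王小明", 0.25),  # VO2, ordinary member
    ("打給 王小明", 0.15),  # VO3, ordinary member
])
```

Here `vo` plays the role of the integrated result VO; with different weights the same function would instead favor the majority candidate.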

Please refer to FIG. 4, which is a schematic diagram of the first personalized speech recognition model according to an embodiment of the present invention. In this embodiment, the first personalized speech recognition model 400 dedicated to the first user is, for example, a traditional back-off N-gram language model or a recurrent neural network language model (RNNLM), but is not limited thereto.

As shown in FIG. 4, assume the first personalized speech recognition model 400 includes an input layer 401, a hidden layer 402, and an output layer 403. In this embodiment, after the speech recognition module 111 obtains, with the first user's authorization, a piece of the first user's personal information (e.g., messaging app conversations, web browsing information, address book, commonly used languages and locations of use), the speech recognition module 111 may convert this personal information into a feature vector and feed it to the input layer 401. In this embodiment, the hidden layer 402 includes, for example, 50 layers, each of which includes, for example, 512 nodes. In addition, the output layer 403 may output a probability, from which the speech recognition module 111 may solve for the parameters of the first personalized speech recognition model 400 by a back-propagation algorithm.

Afterwards, the trained first personalized speech recognition model 400 (i.e., P(W) in the formula below) corrects or rescores the specific speech recognition result VR, and the combination with the higher probability under the first personalized speech recognition model 400 is selected as the final speech recognition result VO1, but is not limited thereto.

Ŵ = argmax_W p(W|O) = argmax_W p(O|W)·P(W) / p(O)

In the above formula, O represents the voice input message, which may include a sequence of speech feature vectors [o1, o2, ..., oT]. In addition, the specific speech recognition result VR recognized by the cloud speech recognition system 120 may be represented by a corresponding word sequence W, which includes, for example, a series of words [w1, w2, ..., wm]. In embodiments of the present invention, the operation performed by the first personalized speech recognition model 400 can be understood as finding the word sequence with the maximum a posteriori (MAP) probability, that is, the most likely output word sequence Ŵ corresponding to O.

In this embodiment, p(W|O) is, for example, the posterior probability of the word sequence W given the voice input message O, and can be further expressed in terms of P(W), p(O|W), and p(O), where p(O|W) is the probability density of the acoustic model generating O (i.e., the voice input message), used to estimate the acoustic-model similarity between the voice input message O and the word sequence W. P(W) is the probability of the first personalized speech recognition model 400 generating the word sequence W, used to evaluate the plausibility of W with respect to the training corpus and to correct word confusions made by the acoustic model, so that the final first speech recognition result matches the first user's expectation.
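In the log domain, the MAP rescoring above reduces to summing an acoustic score log p(O|W) and a language-model score log P(W) for each candidate and taking the argmax: p(O) does not depend on W, so it drops out of the maximization. The scores below are made-up numbers for the 「王小明」/「王曉銘」 homophone case, not outputs of any real model.

```python
def map_rescore(candidates):
    """candidates: {word_sequence: (log_p_acoustic, log_p_lm)}.
    Returns argmax over W of log p(O|W) + log P(W); the constant
    log p(O) is omitted since it does not affect the argmax."""
    return max(candidates, key=lambda w: sum(candidates[w]))

best = map_rescore({
    "打給 王小明": (-12.1, -9.8),  # homophones: near-equal acoustic scores
    "打給 王曉銘": (-12.3, -4.2),  # personalized LM strongly prefers this
})
```

Because the two candidates are acoustically almost indistinguishable, the personalized language-model term P(W) decides the outcome, which is exactly the correction behavior described in the scenarios above.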

In summary, the present invention has at least the following features: (1) under conditions that protect personal privacy, authorized personal information can be used to train a personalized speech recognition model in the local speech recognition system, and speech recognition results can be corrected through this personalized model; (2) the personalized speech recognition model can be applied in other speech recognition scenarios; (3) speech recognition models can be integrated for individuals, specific domains, or specific groups according to the application scenario.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary skill in the art may make some changes and modifications without departing from the spirit and scope of the present invention; therefore, the scope of protection of the present invention shall be defined by the appended claims.

110: speech recognition system
111–11N: speech recognition modules
120: cloud speech recognition system
400: first personalized speech recognition model
401: input layer
402: hidden layer
403: output layer
VO1–VON: speech recognition results
VO: integrated speech recognition result
AR: specific acoustic recognition information
VR: specific speech recognition result
VS: voice input message
S210–S230: steps

FIG. 1 is a schematic diagram of a speech recognition system and a cloud speech recognition system according to a first embodiment of the present invention.
FIG. 2 is a flow chart of a personalized speech recognition method according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a speech recognition system and a cloud speech recognition system according to a second embodiment of the present invention.
FIG. 4 is a schematic diagram of a first personalized speech recognition model according to an embodiment of the present invention.

S210~S230: steps

Claims (11)

1. A personalized speech recognition method, adapted to a first speech recognition module, comprising: in response to obtaining a voice input message, forwarding the voice input message to a cloud speech recognition system, wherein the cloud speech recognition system generates, in response to the voice input message, specific acoustic recognition information and a specific speech recognition result corresponding to the voice input message; receiving the specific acoustic recognition information and the specific speech recognition result corresponding to the voice input message from the cloud speech recognition system; and using a first personalized speech recognition model dedicated to the first speech recognition module to correct the specific speech recognition result into a first speech recognition result based on the specific acoustic recognition information, wherein the first personalized speech recognition model includes an acoustic model corresponding to the voice input message, and the first personalized speech recognition model scores the specific speech recognition result and finds the word sequence with the maximum posterior probability as the first speech recognition result.

2. The method as claimed in claim 1, wherein the first personalized speech recognition model is trained based on a plurality of pieces of first personal information dedicated to a first user.
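The training described in claims 2 and 4 — converting a user's authorized personal texts into training data for a personalized language model — could be illustrated with a toy n-gram model; the choice of a bigram model, the add-one smoothing, and the name `train_bigram_lm` are illustrative assumptions, not the patent's actual method:

```python
from collections import Counter
import math

def train_bigram_lm(personal_texts):
    """Train a tiny bigram language model from a user's personal texts
    (e.g., contact names, messages, browsing keywords). Returns a
    log-probability scoring function with add-one smoothing, so that
    word sequences resembling the user's own language score higher."""
    unigrams, bigrams = Counter(), Counter()
    for text in personal_texts:
        tokens = ["<s>"] + text.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = len(unigrams) + 1  # +1 for unseen words

    def logprob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
            for a, b in zip(tokens[:-1], tokens[1:])
        )

    return logprob
```

A sequence containing a contact name the user actually writes (e.g., "call alice") would then outscore an acoustically confusable alternative, which is exactly the correction the personalized model is meant to supply.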
3. The method as claimed in claim 2, wherein the first personal information includes at least one of the first user's address book, web browsing information, social-media posts, and messaging-app conversations.

4. The method as claimed in claim 2, further comprising: converting each piece of the first personal information into a plurality of corresponding feature vectors as a plurality of training data of the first personalized speech recognition model; and training the first personalized speech recognition model based on the training data.

5. The method as claimed in claim 1, further comprising performing a specific operation based on the first speech recognition result.

6. A speech recognition system, comprising: a first speech recognition module configured to: in response to obtaining a voice input message, forward the voice input message to a cloud speech recognition system, wherein the cloud speech recognition system generates, in response to the voice input message, specific acoustic recognition information and a specific speech recognition result corresponding to the voice input message; receive the specific acoustic recognition information and the specific speech recognition result corresponding to the voice input message from the cloud speech recognition system; and use a first personalized speech recognition model dedicated to the first speech recognition module to correct the specific speech recognition result into a first speech recognition result based on the specific acoustic recognition information, wherein the first personalized speech recognition model includes an acoustic model corresponding to the voice input message, and the first personalized speech recognition model scores the specific speech recognition result and finds the word sequence with the maximum posterior probability as the first speech recognition result.

7. The speech recognition system as claimed in claim 6, wherein the first personalized speech recognition model is trained based on a plurality of pieces of first personal information dedicated to a first user, and the first user belongs to a specific group.

8. The speech recognition system as claimed in claim 7, further comprising a second speech recognition module configured to: receive the specific acoustic recognition information and the specific speech recognition result corresponding to the voice input message from the cloud speech recognition system; and use a second personalized speech recognition model dedicated to the second speech recognition module to correct the specific speech recognition result into a second speech recognition result based on the specific acoustic recognition information.

9. The speech recognition system as claimed in claim 8, wherein the second personalized speech recognition model is trained based on a plurality of pieces of second personal information dedicated to a second user.

10. The speech recognition system as claimed in claim 9, wherein the speech recognition system generates an integrated speech recognition result corresponding to the specific group based on the first speech recognition result and the second speech recognition result.
11. The speech recognition system as claimed in claim 10, wherein the speech recognition system integrates the first speech recognition result and the second speech recognition result into the integrated speech recognition result corresponding to the specific group based on a plurality of weights corresponding to the first user and the second user.
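The weighted integration of claim 11 could be sketched as a weighted vote over the per-user recognition results; the triple format `(hypothesis, confidence, user_weight)` and the name `integrate_results` are assumptions for illustration, not the patent's defined interface:

```python
def integrate_results(results):
    """Integrate per-user recognition results into one group-level result.
    `results` is a list of (hypothesis, confidence, user_weight) triples;
    identical hypotheses accumulate weight-scaled confidence, and the
    highest-scoring hypothesis is returned as the integrated result."""
    scores = {}
    for hyp, conf, weight in results:
        scores[hyp] = scores.get(hyp, 0.0) + weight * conf
    return max(scores, key=scores.get)
```

Under this scheme, agreement among higher-weighted users dominates: two users whose models both produce "turn on light" outweigh a single dissenting result even if that result has higher individual confidence.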
TW109143838A 2020-12-11 2020-12-11 Personalized speech recognition method and speech recognition system TWI809335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW109143838A TWI809335B (en) 2020-12-11 2020-12-11 Personalized speech recognition method and speech recognition system


Publications (2)

Publication Number Publication Date
TW202223875A TW202223875A (en) 2022-06-16
TWI809335B true TWI809335B (en) 2023-07-21

Family

ID=83062363

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109143838A TWI809335B (en) 2020-12-11 2020-12-11 Personalized speech recognition method and speech recognition system

Country Status (1)

Country Link
TW (1) TWI809335B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120047266A1 (en) * 2010-08-23 2012-02-23 InContact. Inc. Multi-tiered media services using cloud computing for globally interconnecting business and customers
US20140188471A1 (en) * 2010-02-25 2014-07-03 Apple Inc. User profiling for voice input processing
TW201503105A (en) * 2013-07-15 2015-01-16 Chunghwa Picture Tubes Ltd Speech recognition system and method
TW201839607A (en) * 2017-04-24 2018-11-01 美商英特爾股份有限公司 Computational optimization mechanism for deep neural networks
CN108733438A (en) * 2017-01-09 2018-11-02 苹果公司 App integration with digital assistants



Similar Documents

Publication Publication Date Title
US9047868B1 (en) Language model data collection
WO2022141678A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN103187052B (en) A kind of method and device setting up the language model being used for speech recognition
CN111583909A (en) Voice recognition method, device, equipment and storage medium
US20250061286A1 (en) Hallucination detection and handling for a large language model based domain-specific conversation system
CN111144124B (en) Machine learning model training methods, intent recognition methods and related devices and equipment
TW201503105A (en) Speech recognition system and method
CN108304424B (en) Text keyword extraction method and text keyword extraction device
CN103744836A (en) Man-machine conversation method and device
WO2024088039A1 (en) Man-machine dialogue method, dialogue network model training method and apparatus
CN109767758A (en) Vehicle speech analysis method, system, storage medium and device
CN117312641A (en) Method, device, equipment and storage medium for intelligently acquiring information
CN112487381A (en) Identity authentication method and device, electronic equipment and readable storage medium
CN108536680B (en) Method and device for acquiring house property information
CN118606574B (en) Knowledge response method and system based on large model, electronic equipment and storage medium
US11947912B1 (en) Natural language processing
CN114242047A (en) A voice processing method, device, electronic device and storage medium
CN108682415A (en) voice search method, device and system
CN108153574A (en) Applied program processing method, device and electronic equipment
TWI809335B (en) Personalized speech recognition method and speech recognition system
WO2021077333A1 (en) Simultaneous interpretation method and device, and storage medium
CN111538998B (en) Text encryption method and device, electronic equipment and computer readable storage medium
CN117059082B (en) Outbound call conversation method, device, medium and computer equipment based on large model
CN110634489B (en) Voiceprint confirmation method, voiceprint confirmation device, voiceprint confirmation equipment and readable storage medium
CN112084359A (en) Picture retrieval method and device and electronic equipment