TWI798867B - Video processing method and associated system on chip - Google Patents

Video processing method and associated system on chip

Info

Publication number
TWI798867B
TWI798867B
Authority
TW
Taiwan
Prior art keywords
sound
image data
specific area
circuit
detection circuit
Prior art date
Application number
TW110138102A
Other languages
Chinese (zh)
Other versions
TW202301870A (en)
Inventor
陳慶隆
鄭家鈞
Original Assignee
瑞昱半導體股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 瑞昱半導體股份有限公司
Priority to US17/750,427 (published as US20220415003A1)
Publication of TW202301870A
Application granted
Publication of TWI798867B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers, the transducers being microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/40 Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Image Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a system on chip (SoC) that includes a person recognition circuit, a sound detection circuit and a processing circuit. The person recognition circuit obtains image data from an image capture device in real time and performs person recognition on the image data to generate a recognition result. The sound detection circuit obtains a plurality of sound signals from a plurality of microphones to determine a sound feature value of a main sound. The processing circuit, coupled to the person recognition circuit and the sound detection circuit, determines a specific region of the image data according to the recognition result and the sound feature value of the main sound, and processes the image data to highlight that specific region.

Description

Video processing method and associated system on chip

The present invention relates to a video processing method for live streaming.

Live streaming is now widely used in many areas of society, for example in remote video conferencing. However, when one party in a remote video conference has multiple participants in the video frame, the participants on the other side may sometimes find it hard to tell who in the frame is speaking. Specifically, suppose a first party and a second party are holding a remote video conference, where the first party has several participants in a physical conference room whose audio and video are captured with a microphone and a camera and then transmitted over the network to the remote participants of the second party. Because of the postures and positions of the first party's participants, the second party's participants may be unable to see which person is speaking, which confuses them and reduces the efficiency of the meeting.

Therefore, one objective of the present invention is to provide a person tracking technique for remote video that highlights the person currently speaking in the video frame, so as to solve the problems described above.

In one embodiment of the present invention, a system on chip is disclosed, which includes a person recognition circuit, a sound detection circuit and a processing circuit. The person recognition circuit obtains image data from an image capture device in real time and performs person recognition on the image data to generate a recognition result; the sound detection circuit obtains a plurality of sound signals from a plurality of microphones to determine a sound feature value of a main sound; and the processing circuit, coupled to the person recognition circuit and the sound detection circuit, determines a specific region of the image data according to the recognition result and the sound feature value of the main sound, and processes the image data to highlight the specific region.

In one embodiment of the present invention, a video processing method is disclosed, which includes the following steps: obtaining image data from an image capture device in real time and performing person recognition on the image data to generate a recognition result; obtaining a plurality of sound signals from a plurality of microphones to determine a sound feature value of a main sound; determining a specific region of the image data according to the recognition result and the sound feature value of the main sound; and processing the image data to highlight the specific region.

110: electronic device
120: electronic device
200: system on chip
202: image capture device
204_1~204_N: microphones
210: person recognition circuit
220: voice activity detection circuit
230: sound direction detection circuit
240: processing circuit
300~314: steps
410~450: regions

FIG. 1 is a schematic diagram of a remote video conference.

FIG. 2 is a schematic diagram of an electronic device according to an embodiment of the present invention.

FIG. 3 is a flowchart of a video processing method according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of multiple persons recognized by the person recognition circuit in an image frame.

FIG. 5 is a schematic diagram of highlighting the speaking person in an image frame.

FIG. 1 is a schematic diagram of a remote video conference. As shown in FIG. 1, a first conference room contains an electronic device 110 that captures images of the first conference room in real time and records the sound in the room, then transmits them over the network to a second conference room, where an electronic device 120 plays back the image and sound of the first conference room. At the same time, the electronic device 120 in the second conference room also captures the image of the second conference room and records its sound in real time and transmits them over the network to the first conference room, so that the electronic device 110 in the first conference room can play back the image and sound of the second conference room. In this embodiment, the electronic devices 110 and 120 may be any electronic devices with video/audio capture and playback functions and network communication functions, such as a television, a notebook computer, a tablet computer, a mobile phone, and so on.

As described in the background, when one party in a remote video conference has multiple participants in the video frame, the other party's participants may sometimes find it hard to tell who in the frame is speaking. For example, if the participants in the second conference room are not familiar with the voices of the participants in the first conference room, if the participant who is speaking in the first conference room is not facing the camera, or because of other video transmission factors, the participants in the second conference room may at times have difficulty telling who is speaking from the audio and video played by the electronic device 120, which causes confusion. Therefore, in this embodiment the system on chip of the electronic device 110 implements a method for highlighting the speaking participant in the image, so that the participants in the second conference room can clearly see which participant in the first conference room is speaking, thereby solving the above problem.

FIG. 2 is a schematic diagram of the electronic device 110 according to an embodiment of the present invention. As shown in FIG. 2, the electronic device 110 includes a system on chip 200, an image capture device 202 and a plurality of microphones 204_1~204_N, where N is any suitable positive integer greater than one. The system on chip 200 includes a person recognition circuit 210, a voice activity detection circuit 220, a sound detection circuit (a sound direction detection circuit 230 is taken as an example in this embodiment) and a processing circuit 240. In this embodiment, the image capture device 202 may be a camera or a video camera that continuously captures images of the first conference room in real time to produce image data for the system on chip 200, where the image data received by the system on chip 200 may be raw image data or data that has already undergone some image processing operations. The microphones 204_1~204_N may be digital microphones placed at different positions on the electronic device 110 to respectively produce multiple sound signals for the system on chip 200. Note that in the embodiment of FIG. 2 the image capture device 202 and the microphones 204_1~204_N are built into the electronic device 110; in other embodiments, however, they may be externally connected to the electronic device 110.

Within the system on chip 200, the person recognition circuit 210 performs person recognition on the image data received from the image capture device 202 to determine whether any persons are present in the received image data, and to determine a feature value for each person as well as each person's position/region in the frame. Specifically, the person recognition circuit 210 may process each frame of the image data with deep learning or neural-network methods, for example applying multiple different convolution filters to a frame over several convolution passes to recognize whether persons appear in it. In addition, for each detected person, the same deep-learning or neural-network approach is used to determine a feature value for that person (or for the region in which that person is located), where the feature value may be expressed as a multi-dimensional vector, for example a vector of dimension 512. Note that circuit designs for person recognition are well known to those with ordinary skill in the art, and one focus of this embodiment is the use of the persons and feature values recognized by the person recognition circuit 210, so further details of the person recognition circuit 210 are omitted here.
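Purely as an illustration (not the patent's own circuit design), the sketch below shows the kind of output this stage is described as producing: one bounding box plus one multi-dimensional feature vector (here 512-D) per detected person, together with the cosine-similarity comparison that later steps can apply to such vectors. The detector itself is a hypothetical stub; any CNN-based person detector/embedder is assumed to fill that role.

```python
import numpy as np

def detect_persons(frame: np.ndarray):
    """Hypothetical stand-in for the CNN-based person recognition step.

    Returns a list of (bounding_box, feature_vector) pairs, where each
    feature vector is a 512-D embedding of the image content in the box.
    """
    # Placeholder: a real implementation would run convolution filters /
    # a neural network over the frame. Here we simply fake two detections.
    h, w = frame.shape[:2]
    boxes = [(0, 0, w // 2, h), (w // 2, 0, w, h)]
    feats = [np.random.randn(512) for _ in boxes]
    return list(zip(boxes, feats))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity measure commonly used to compare such embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
```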

The voice activity detection circuit 220 receives the sound signals from the microphones 204_1~204_N and determines whether those sound signals contain a speech component. Specifically, the voice activity detection circuit 220 mainly performs the following operations: applying noise reduction to the received sound signals, converting the sound signals to the frequency domain and processing a block of samples to obtain a feature value, and comparing the obtained feature value with a reference value to decide whether the sound signal is a speech signal. Note that, since circuit designs for voice activity detection are well known to those with ordinary skill in the art, and one focus of this embodiment is the subsequent operations performed according to the decision of the voice activity detection circuit 220, further details of the voice activity detection circuit 220 are omitted here. In addition, in another embodiment, the voice activity detection circuit 220 may receive the sound signals from only some of the microphones 204_1~204_N rather than from all of them.
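As a minimal sketch of the decision described above (not the patent's actual VAD circuit), the example below transforms a block of samples to the frequency domain, derives a simple feature, and compares it with a reference threshold. The speech band, the 16-bit PCM input format and the threshold value are illustrative assumptions; practical VADs add noise reduction and more robust features.

```python
import numpy as np

def is_speech(block: np.ndarray, sample_rate: int = 16000,
              energy_threshold: float = 1e-3) -> bool:
    """Return True if the block likely contains a speech component."""
    block = block.astype(np.float64) / 32768.0            # normalize 16-bit PCM
    spectrum = np.abs(np.fft.rfft(block * np.hanning(len(block))))
    freqs = np.fft.rfftfreq(len(block), d=1.0 / sample_rate)
    # Energy concentrated in a typical speech band (~300-3400 Hz) is used
    # here as a crude stand-in for the feature value described in the text.
    band = (freqs >= 300) & (freqs <= 3400)
    band_energy = np.mean(spectrum[band] ** 2)
    return band_energy > energy_threshold
```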

Regarding the operation of the sound direction detection circuit 230: since the positions of the microphones 204_1~204_N on the electronic device 110 are known, the sound direction detection circuit 230 can use the time differences (that is, the phase differences) of the sound signals received from the microphones 204_1~204_N to determine the azimuth of the main sound in the first conference room, that is, the direction and angle of the main speaker relative to the electronic device 110. In this embodiment, the sound direction detection circuit 230 determines only one direction; that is, if several people in the first conference room are speaking at the same time, it uses certain characteristics of the received sound signals (for example, signal strength) to decide which direction the main sound comes from. Note that, since circuit designs for sound direction detection are well known to those with ordinary skill in the art, and one focus of this embodiment is the subsequent operations performed according to the decision of the sound direction detection circuit 230, further details of the sound direction detection circuit 230 are omitted here.
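The following is a minimal sketch, under a two-microphone far-field assumption, of estimating an azimuth from the inter-microphone time difference that the text refers to. Real arrays use more microphones and more robust estimators (e.g. GCC-PHAT); the microphone spacing and the simple cross-correlation used here are illustrative.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s, approximate speed of sound in air

def estimate_azimuth(sig_a: np.ndarray, sig_b: np.ndarray,
                     mic_spacing_m: float, sample_rate: int) -> float:
    """Estimate the arrival angle (degrees) of the dominant sound source."""
    # Cross-correlate the two channels to find the sample delay between them.
    corr = np.correlate(sig_a, sig_b, mode="full")
    delay_samples = int(np.argmax(corr)) - (len(sig_b) - 1)
    delay_sec = delay_samples / sample_rate
    # Far-field model: delay = spacing * sin(theta) / c
    sin_theta = np.clip(delay_sec * SPEED_OF_SOUND / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```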

For the overall operation of the system on chip 200, refer to the flowchart of the video processing method according to an embodiment of the present invention shown in FIG. 3. In step 300 the flow begins: the electronic device 110 is powered on and completes its connection with the electronic device 120 in the second conference room. In step 302, the voice activity detection circuit 220 receives the sound signals from the microphones 204_1~204_N and determines whether those sound signals contain a speech component; if so, the flow proceeds to step 304; if not, the flow stays at step 302 to keep checking whether the received sound signals contain a speech component. In step 304, once informed that the voice activity detection circuit 220 has detected a speech component in the sound signals, the processing circuit 240 enables the person recognition circuit 210, which then begins to perform person recognition on the received image data to determine whether any persons are present, and to determine each person's feature value and each person's position/region in the frame. Taking FIG. 4 as an example, the person recognition circuit 210 detects five persons in the image, so it can determine the regions 410~450 occupied by each person in the frame and compute a feature value for the image content of each of the regions 410~450 to serve as the feature value of the corresponding person. In step 306, the processing circuit 240 enables the sound direction detection circuit 230, which begins using the time differences of the sound signals from the microphones 204_1~204_N to determine the direction and angle of the main sound relative to the electronic device 110. Note that steps 304 and 306 may be performed simultaneously; that is, execution of this embodiment is not limited to the order shown in FIG. 3.

In step 308, the processing circuit 240 uses the region of each person in the image frame determined by the person recognition circuit 210 (for example, the regions 410~450 of FIG. 4), together with the direction and angle of the main speaker relative to the electronic device 110 detected by the sound direction detection circuit 230, to determine which person in the image frame is speaking. In step 310, after determining which person in the image frame is speaking, the processing circuit 240 processes the image data from the image capture device 202 to highlight the main speaker in the image data. Specifically, referring to FIG. 5, if the processing circuit 240 determines that the person in the region 440 is the main speaker, the processing circuit 240 may process the image data to enlarge the person in the region 440, add a label/arrow, or apply any other image processing method, so as to enhance the visual prominence of the person in the region 440. After processing the image data to enhance the visual prominence of the person in the region 440, the processing circuit 240 sends the processed image data to back-end circuits for further image processing and then transmits it over the network to the electronic device 120 in the second conference room, so that the participants in the second conference room can clearly know who is currently speaking in the first conference room.
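Purely for illustration (the patent does not specify how the azimuth is matched against the person regions), the sketch below shows one plausible way step 308 could combine the two results: project the detected azimuth onto a horizontal pixel position and choose the region whose center is closest. The camera field of view and the assumption that the camera and microphone array share the same forward axis are not from the source.

```python
def pick_speaking_region(regions, azimuth_deg, frame_width,
                         camera_fov_deg=90.0):
    """regions: list of (x0, y0, x1, y1) person boxes; returns the best match."""
    # Map the azimuth onto a horizontal pixel position (assumed linear mapping
    # across the camera's field of view, centered at azimuth 0).
    x_expected = (0.5 + azimuth_deg / camera_fov_deg) * frame_width

    def center_x(box):
        x0, _, x1, _ = box
        return (x0 + x1) / 2.0

    # The region whose horizontal center lies closest to the expected position
    # is taken as the region of the person who is speaking.
    return min(regions, key=lambda box: abs(center_x(box) - x_expected))
```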

Note that the visual enhancement of the person in the region 440 described above does not have to cover the entire region 440; visually enhancing only a part of the region 440 achieves the same effect. Taking FIG. 5 as an example, the region 440 contains the person's head and body, and the processing circuit 240 may enlarge only the head portion.

In step 312, the processing circuit 240 keeps tracking the previously highlighted person and keeps processing the image data from the image capture device 202 so as to highlight that person in the image data. Specifically, the person recognition circuit 210 can keep determining the region occupied by each person in the image frame and its feature value, and the processing circuit 240 can use the feature value of the previously highlighted person to keep highlighting that person in the current and subsequent image frames. Taking the region 440 of FIG. 5 as an example, the processing circuit 240 can track the region/person in subsequently received image frames whose feature value is similar to the feature value of the region 440 (for example, whose feature-value difference lies within a range), so as to keep highlighting that person in the subsequent image frames, even if the highlighted person does not speak for a short while in those frames and the sound direction detection circuit 230 does not detect sound coming from that person's direction.
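A minimal sketch of this tracking step, assuming the per-region embeddings from the earlier recognition sketch: keep the feature vector of the currently highlighted person and, in each new frame, continue highlighting the detected region whose feature vector stays within a similarity threshold. The cosine-similarity measure and the threshold value are illustrative assumptions, not the patent's specified criterion.

```python
import numpy as np

def track_highlighted_region(detections, target_feature, min_similarity=0.8):
    """detections: list of (box, feature) pairs for the current frame.

    Returns the (box, feature) of the best match, or None if the person is lost.
    """
    def cos_sim(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    best = max(detections, key=lambda d: cos_sim(d[1], target_feature),
               default=None)
    if best is None or cos_sim(best[1], target_feature) < min_similarity:
        return None          # feature difference outside the allowed range
    return best
```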

Note that, because the speaking person may move and may not speak continuously, step 312 prevents the visual enhancement of the speaker from being switched on and off repeatedly in the image, which would disturb the participants in the second conference room.

In step 314, the processing circuit 240 uses the region of each person in the image frame determined by the person recognition circuit 210, the direction and angle of the main speaker relative to the electronic device 110 detected by the sound direction detection circuit 230, and whether someone is speaking as detected by the voice activity detection circuit 220 (that is, whether the received sound signals contain a speech component), to determine whether the speaking person has changed. If not, the flow returns to step 312 to keep tracking the current speaker; if so, the flow returns to step 308 to determine the new speaker. Specifically, since the sound direction detection circuit 230 can only detect the directionality of sound and cannot tell whether the sound from the determined direction is a human voice, it works together with the voice activity detection circuit 220: only when the voice activity detection circuit 220 detects a speech component in the current sound signals and the direction and angle of the main speaker relative to the electronic device 110 detected by the sound direction detection circuit 230 has shifted to another person's position can the processing circuit 240 conclude that the speaking person has changed. Note that, to prevent the processing circuit 240 from constantly changing the highlighted person in the image data, step 314 needs to observe for a relatively long period before making its decision.

In another embodiment, to further confirm whether the speaking person has changed, the processing circuit 240 may additionally include a voiceprint recognition mechanism to assist the detection result of the sound direction detection circuit 230. Specifically, since each person's voice has unique speech characteristics, the voiceprint recognition mechanism in the processing circuit 240 can continuously capture sound segments and determine whether the sound feature values of those segments belong to the same person, to help decide who is speaking. For example, if the person recognition circuit 210, the voice activity detection circuit 220 and the sound direction detection circuit 230 indicate that the speaking person has changed, but the voiceprint recognition mechanism determines that the sound segments' feature values belong to the same person, the processing circuit 240 may postpone deciding whether the speaking person has changed and make the decision only after observing for a further period.
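As a minimal sketch of this cross-check (with the voiceprint embedding function left abstract and the similarity threshold chosen only for illustration), the helper below defers a "speaker changed" decision from the direction/VAD path whenever the voiceprints of recent sound segments still appear to belong to the same person.

```python
import numpy as np

def should_switch_speaker(direction_says_changed: bool,
                          prev_voiceprint: np.ndarray,
                          new_voiceprint: np.ndarray,
                          same_voice_threshold: float = 0.75) -> bool:
    """Decide whether to accept a speaker change, given voiceprint evidence."""
    if not direction_says_changed:
        return False
    sim = float(np.dot(prev_voiceprint, new_voiceprint) /
                (np.linalg.norm(prev_voiceprint) *
                 np.linalg.norm(new_voiceprint) + 1e-9))
    # If the voiceprints still match, treat it as the same person moving
    # (or a direction-detection glitch) and postpone the switch.
    return sim < same_voice_threshold
```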

In the previous embodiments, the sound direction detection circuit 230 serves as the sound detection circuit; the present invention, however, is not limited to this. In other embodiments, the voiceprint recognition mechanism may replace the sound direction detection circuit 230 of the previous embodiments, determining the speaker based solely on the voiceprint recognition result and choosing the highlighted target accordingly. In other words, the sound detection circuit of the present invention may obtain multiple sound signals from multiple microphones to determine a sound feature value of a main sound, where the sound feature value may be an azimuth of the main sound or the sound feature value of a sound segment used by the voiceprint recognition mechanism.

To briefly summarize: in the video processing method of the present invention, by detecting the person who is currently speaking and highlighting that person in the image data, the participants in the remote conference room can clearly know who is speaking, which effectively improves the efficiency of the meeting.

The above are merely preferred embodiments of the present invention; all equivalent changes and modifications made according to the claims of the present invention shall fall within the scope of the present invention.

110: electronic device
200: system on chip
202: image capture device
204_1~204_N: microphones
210: person recognition circuit
220: voice activity detection circuit
230: sound direction detection circuit
240: processing circuit

Claims (9)

1. A system on chip, comprising: a person recognition circuit, configured to obtain image data from an image capture device in real time and perform person recognition on the image data to generate a recognition result; a sound detection circuit, configured to obtain a plurality of sound signals from a plurality of microphones to determine a sound feature value of a main sound; and a processing circuit, coupled to the person recognition circuit and the sound detection circuit, configured to determine a specific region of the image data according to the recognition result and the sound feature value of the main sound, and to process the image data to highlight the specific region; wherein the recognition result comprises a plurality of regions, each region containing a person; the processing circuit selects one of the plurality of regions as the specific region according to the sound feature value of the main sound; and the recognition result further comprises a plurality of feature values respectively corresponding to the plurality of regions, and the processing circuit tracks the feature value of the specific region to determine the position of the specific region in subsequent image data and processes the subsequent image data to highlight the specific region.

2. The system on chip of claim 1, further comprising: a voice activity detection circuit, configured to determine, according to at least a portion of the plurality of sound signals, whether the at least a portion of the sound signals contains a speech component; wherein the processing circuit decides, according to whether the at least a portion of the sound signals contains a speech component, whether to determine the specific region of the image data according to the recognition result and the sound feature value of the main sound and to process the image data to highlight the specific region.

3. The system on chip of claim 2, wherein the processing circuit determines the specific region of the image data according to the recognition result and the sound feature value of the main sound and processes the image data to highlight the specific region only when the voice activity detection circuit indicates that the at least a portion of the sound signals contains a speech component.

4. The system on chip of claim 1, further comprising: a voice activity detection circuit, configured to determine, according to at least a portion of the plurality of sound signals, whether the at least a portion of the sound signals contains a speech component; wherein the processing circuit determines whether the speaking person has changed according to the plurality of feature values respectively corresponding to the plurality of regions determined by the person recognition circuit, the sound feature value of the main sound detected by the sound detection circuit, and whether the at least a portion of the sound signals contains a speech component as detected by the voice activity detection circuit, so as to decide whether to select another of the plurality of regions as the specific region.

5. The system on chip of claim 1, wherein the processing circuit processes the image data to enlarge the person within the specific region.

6. The system on chip of claim 1, 2, 3, 4 or 5, wherein the sound detection circuit is a sound direction detection circuit, and the sound feature value of the main sound is an azimuth of the main sound.

7. A video processing method, comprising: obtaining image data from an image capture device in real time, and performing person recognition on the image data to generate a recognition result; obtaining a plurality of sound signals from a plurality of microphones to determine a sound feature value of a main sound; determining a specific region of the image data according to the recognition result and the sound feature value of the main sound; and processing the image data to highlight the specific region; wherein the recognition result comprises a plurality of regions, each region containing a person, and the recognition result further comprises a plurality of feature values respectively corresponding to the plurality of regions; and the video processing method further comprises: tracking the feature value of the specific region to determine the position of the specific region in subsequent image data, and processing the subsequent image data to highlight the specific region.

8. The video processing method of claim 7, further comprising: determining, according to at least a portion of the plurality of sound signals, whether the at least a portion of the sound signals contains a speech component; and determining whether the speaking person has changed according to the plurality of feature values respectively corresponding to the plurality of regions, the sound feature value of the main sound, and whether the at least a portion of the sound signals contains a speech component, so as to decide whether to select another of the plurality of regions as the specific region.

9. The video processing method of claim 7, wherein processing the subsequent image data to highlight the specific region comprises: processing the image data to enlarge the person within the specific region.
TW110138102A 2021-06-27 2021-10-14 Video processing method and associated system on chip TWI798867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/750,427 US20220415003A1 (en) 2021-06-27 2022-05-23 Video processing method and associated system on chip

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163215515P 2021-06-27 2021-06-27
US63/215,515 2021-06-27

Publications (2)

Publication Number Publication Date
TW202301870A (en) 2023-01-01
TWI798867B (en) 2023-04-11

Family

ID=84694694

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110138102A TWI798867B (en) 2021-06-27 2021-10-14 Video processing method and associated system on chip

Country Status (2)

Country Link
CN (1) CN115529432A (en)
TW (1) TWI798867B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110505399A (en) * 2019-08-13 2019-11-26 聚好看科技股份有限公司 Control method, device and the acquisition terminal of Image Acquisition
CN112532911A (en) * 2020-11-12 2021-03-19 深圳市慧为智能科技股份有限公司 Image data processing method, device, equipment and storage medium
CN112866617A (en) * 2019-11-28 2021-05-28 中强光电股份有限公司 Video conference device and video conference method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI302609B (en) * 2006-07-11 2008-11-01 Compal Electronics Inc Method for tracking vocal target
TWI471826B (en) * 2010-01-06 2015-02-01 Fih Hong Kong Ltd System and method for detecting sounds and sending alert messages
CN103679125B (en) * 2012-09-24 2016-12-21 致伸科技股份有限公司 Methods of Face Tracking
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN108900787B (en) * 2018-06-20 2021-06-04 广州视源电子科技股份有限公司 Image display method, apparatus, system and device, and readable storage medium
TWM594202U (en) * 2019-10-21 2020-04-21 大陸商南京深視光點科技有限公司 Speaker audio tracking system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110505399A (en) * 2019-08-13 2019-11-26 聚好看科技股份有限公司 Control method, device and the acquisition terminal of Image Acquisition
CN112866617A (en) * 2019-11-28 2021-05-28 中强光电股份有限公司 Video conference device and video conference method
CN112532911A (en) * 2020-11-12 2021-03-19 深圳市慧为智能科技股份有限公司 Image data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115529432A (en) 2022-12-27
TW202301870A (en) 2023-01-01

Similar Documents

Publication Publication Date Title
US11343446B2 (en) Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote
KR101497168B1 (en) Techniques for detecting a display device
WO2019140161A1 (en) Systems and methods for decomposing a video stream into face streams
CN101223786A (en) Processing method and apparatus with video temporal up-conversion
US11405584B1 (en) Smart audio muting in a videoconferencing system
EP3005690B1 (en) Method and system for associating an external device to a video conference session
WO2022062471A1 (en) Audio data processing method, device and system
CN117997882A (en) Conference speaker recognition method, device, equipment and storage medium
CN117135305B (en) Teleconference implementation method, device and system
US20220415003A1 (en) Video processing method and associated system on chip
TWI798867B (en) Video processing method and associated system on chip
TWI813153B (en) Video processing method and associated system on chip
CN111901621A (en) Interactive live broadcast teaching throttling device and method based on live broadcast content recognition
CN112752059B (en) Video conference system and video conference method
TWI857325B (en) Video processing method for performing partial highlighting with aid of hand gesture detection, and associated system on chip
CN113542466A (en) Audio processing method, electronic device and storage medium
TWI857326B (en) Video processing method for performing partial highlighting with aid of auxiliary information detection, and associated system on chip
Hung et al. Towards audio-visual on-line diarization of participants in group meetings
CN117542071A (en) Video processing method and system for local emphasis using gesture detection
TWI751866B (en) Audiovisual communication system and control method thereof
CN113301291B (en) Anti-interference method, system, equipment and storage medium in network video conference
CN117544745A (en) Video processing method and system chip for local emphasis by aid of auxiliary information
TWI687917B (en) Voice system and voice detection method
TW202226222A (en) External intelligent audio noise-reduction device improving the convenience for human voice data collection within a specific angle range and reducing environmental interfaces
TW202301320A (en) System for motion detection in viewing direction to control corresponding device and method thereof