KR20210004036A

KR20210004036A - Method and apparatus for operating independent classification model using metadata

Info

Publication number: KR20210004036A
Application number: KR1020190079803A
Authority: KR
Inventors: 조위덕; 최선탁; 이주영
Original assignee: 아주대학교산학협력단
Priority date: 2019-07-03
Filing date: 2019-07-03
Publication date: 2021-01-13
Anticipated expiration: 2039-07-03
Also published as: KR102267487B1

Abstract

A method of operating independent classification models using metadata is disclosed. The method according to an embodiment of the present invention includes: training, by a classification model training unit, a classification model corresponding to each of a plurality of data sets obtained by dividing a plurality of training data based on metadata; determining, by a data comparison unit, a selected data set, which is the data set among the plurality of data sets that corresponds to target data to be classified, by using at least one of the metadata and a predetermined data similarity criterion; and classifying, by a data classification unit, the target data using a selected classification model, which is the classification model corresponding to the selected data set.

Description

Method and apparatus for operating independent classification model using metadata

The present invention relates to a method of operating independent classification models, in which an independent classification model is created and trained for each of the data sets divided based on metadata for pattern recognition, and to an apparatus therefor.

To overcome the limitations of pattern recognition using a single algorithm, research has been conducted on the ensemble approach, which designs a classifier by combining a plurality of algorithms in parallel or in series.

First, referring to FIG. 4(a), the bagging technique combines a bootstrap technique with an aggregating technique. The bootstrap technique generates subsets that allow duplicates (random sampling with replacement) from a randomly shuffled training data set and trains one classification model per generated subset. The aggregating technique, when data to be classified arrives, runs classification on all of the classification models and determines the classification result by aggregating their outputs (selection, voting, etc.). The classification models are processed in parallel and their outputs are combined in the aggregation step. Consequently, when the classification models are not independent of one another and are positively correlated, errors may instead be amplified.
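As a rough illustration of the bootstrap and aggregating steps described above, the following Python sketch draws samples with replacement, trains one model per sample, and combines the outputs by majority vote. The nearest-neighbor base learner, the toy data, and all function names are assumptions made for illustration and do not appear in the specification.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (duplicates are allowed)."""
    return [rng.choice(data) for _ in data]

def train_nearest_neighbor(sample):
    """Hypothetical base learner: predict the label of the closest training point."""
    def predict(x):
        _, label = min(sample, key=lambda item: abs(item[0] - x))
        return label
    return predict

def bagging_predict(models, x):
    """Run every model and aggregate the outputs by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Toy (speed, label) pairs; the values are made up for illustration.
data = [(3.0, "walk"), (4.0, "walk"), (5.0, "walk"),
        (8.0, "run"), (9.0, "run"), (10.0, "run")]
models = [train_nearest_neighbor(bootstrap_sample(data, rng)) for _ in range(5)]
prediction = bagging_predict(models, 4.5)
```

Every model is evaluated for each input in this scheme, so the cost of classification grows with the number of models.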

Referring to FIG. 4(b), the boosting technique trains on a data set divided by random sampling with replacement. A weak classifier is selected from the given data set using a simple condition. The weak classifier is designed on the condition that it gets at least one thing reliably right; for example, the classifier with the smallest number of false positives (FP) or false negatives (FN) in the confusion matrix may be selected. A weight is then assigned to the data that was misclassified, and the process is repeated, using conditions that do not overlap for the misclassified regions, to design a strong classifier that combines a series of weak classifiers. In other words, the classification models are processed in series and tackle hard cases by concentrating on the mistakes. Boosting is therefore sensitive to erroneous data (outliers) and may cause overfitting.
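The weighting of misclassified data can be illustrated with the update rule of AdaBoost, one well-known boosting algorithm. The specification does not name a particular algorithm, so the choice of AdaBoost, the function name, and the toy weights are assumptions. The sketch also assumes that the weighted error lies strictly between 0 and 1.

```python
import math

def reweight(weights, misclassified):
    """One AdaBoost-style round: raise the weights of misclassified samples, lower the rest."""
    # Weighted error of the current weak classifier (assumed to be in (0, 1)).
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Vote weight of this weak classifier in the final strong classifier.
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # normalize so the weights sum to 1

weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, False, True]  # only the last sample was wrong
alpha, new_weights = reweight(weights, misclassified)
```

After the update, the misclassified sample carries more weight than the correctly classified ones, which is why a single outlier can come to dominate later rounds.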

Therefore, there is a growing need for a new ensemble technique that overcomes the problems of these existing ensemble techniques.

Korean Patent Application Publication No. 10-2017-0140757 (2017.12.21.)

An object of the present invention is to solve the above problems by providing a method and an apparatus for operating independent classification models that can improve the classification performance for input data by dividing training data according to metadata, training a plurality of classification models, and then selecting, based on the data, the most suitable classification model among them to perform the classification.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above object, a method of operating independent classification models using metadata according to an embodiment of the present invention includes: training, by a classification model training unit, a classification model corresponding to each of a plurality of data sets obtained by dividing a plurality of training data based on metadata; determining, by a data comparison unit, a selected data set, which is the data set among the plurality of data sets that corresponds to target data to be classified, by using at least one of the metadata and a predetermined data similarity criterion; and classifying, by a data classification unit, the target data using a selected classification model, which is the classification model corresponding to the selected data set.

Preferably, in the determining of the selected data set, if information about the metadata of the target data exists, at least one of the metadata and the similarity criterion may be used, and if information about the metadata of the target data does not exist, the similarity criterion may be used.

Preferably, the method may further include, between the training of the classification model and the determining of the selected data set, storing, by the classification model training unit, each individual data set included in the plurality of data sets in a storage, paired with the classification model corresponding to that individual data set, wherein the data comparison unit obtains the plurality of data sets from the storage and the data classification unit obtains the selected classification model from the storage.

Preferably, the training of the classification model may include: training, for each of the plurality of data sets, a temporary classification model that is one of a plurality of classification models; evaluating the performance of the trained temporary classification model based on a confusion matrix; and performing the training of the temporary classification model and the evaluating of its performance for all of the plurality of classification models, and determining one classification model according to the evaluated performance.
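The per-data-set selection loop described above might look like the following Python sketch. The two candidate trainers, the use of accuracy as the score derived from the confusion matrix, and the separate hold-out data are all assumptions for illustration; the specification only states that performance is evaluated based on a confusion matrix.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(counts):
    tp, fp, fn, tn = counts
    return (tp + tn) / (tp + fp + fn + tn)

def train_majority(xs, ys):
    """Candidate 1 (hypothetical): always predict the most frequent label."""
    label = Counter(ys).most_common(1)[0][0]
    return lambda x: label

def train_threshold(xs, ys):
    """Candidate 2 (hypothetical): split at the midpoint between the class means."""
    run = [x for x, y in zip(xs, ys) if y == "run"]
    walk = [x for x, y in zip(xs, ys) if y == "walk"]
    cut = (sum(run) / len(run) + sum(walk) / len(walk)) / 2
    return lambda x: "run" if x > cut else "walk"

def select_model(candidates, train, holdout):
    """Train every candidate as a temporary model, score it, and keep the best."""
    best = None
    for name, trainer in candidates.items():
        model = trainer(*train)
        predictions = [model(x) for x in holdout[0]]
        score = accuracy(confusion_counts(holdout[1], predictions, positive="run"))
        if best is None or score > best[2]:
            best = (name, model, score)
    return best

candidates = {"majority": train_majority, "threshold": train_threshold}
train = ([3.0, 4.0, 5.0, 8.0, 9.0, 10.0], ["walk"] * 3 + ["run"] * 3)
holdout = ([4.5, 6.0, 7.0, 9.5], ["walk", "walk", "run", "run"])
best_name, best_model, best_score = select_model(candidates, train, holdout)
```

In this sketch `select_model` would be called once per data set, so different data sets may end up with different kinds of models.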

Preferably, the plurality of classification models may be selected from among classification models based on probability and statistics, domain transformation, artificial neural networks, expert systems, instance-based learning, decision trees, and ensemble techniques.

Preferably, feature extraction, which extracts representative values from the plurality of training data and the target data according to a predetermined criterion, and dimensionality reduction, which reduces the dimension of the feature space composed of the representative values, may be performed by one of a data processing unit and the classification model according to a preset configuration, or may be divided between the data processing unit and the classification model.

Preferably, when the data processing unit performs feature extraction or dimensionality reduction, the method may further include: performing, by the data processing unit, feature extraction or dimensionality reduction on the plurality of training data before the training of the classification model; and performing, by the data processing unit, feature extraction or dimensionality reduction on the target data before the determining of the selected data set.

Preferably, the data similarity criterion may be whether the similarity between the data included in each of the plurality of data sets and the target data is greater than or equal to a predetermined similarity threshold, or whether the error between the data included in each of the plurality of data sets and the target data is less than or equal to a predetermined error threshold.

In addition, to achieve the above object, an independent classification model apparatus using metadata according to an embodiment of the present invention includes: a storage; a classification model training unit that trains a classification model corresponding to each of a plurality of data sets obtained by dividing a plurality of training data based on metadata, and stores each individual data set included in the plurality of data sets in the storage, paired with the classification model corresponding to that individual data set; a data comparison unit that determines a selected data set, which is the data set among the plurality of data sets that corresponds to target data to be classified, by using at least one of the metadata and a predetermined data similarity criterion; and a data classification unit that classifies the target data using a selected classification model, which is the classification model corresponding to the selected data set.

Preferably, if information about the metadata of the target data exists, the data comparison unit may use at least one of the metadata and the similarity criterion, and if information about the metadata of the target data does not exist, the data comparison unit may use the similarity criterion.

Preferably, the classification model training unit may, for each of the plurality of data sets, train a temporary classification model that is one of a plurality of classification models and evaluate the performance of the trained temporary classification model based on a confusion matrix, repeat this process for all of the plurality of classification models, and determine one classification model according to the evaluated performance.

According to an embodiment of the present invention, individual classification models are trained after the training data is divided according to metadata. Because no training data is duplicated across data sets, overfitting can be prevented, and the approach is particularly useful for training data whose variables are distinct but which is difficult to generalize.

In addition, according to an embodiment of the present invention, classification is performed by selecting a single classification model from among a plurality of classification models, so the system load and operating time are lower than with existing ensemble techniques that use a plurality of classification models in series or in parallel.

In addition, according to an embodiment of the present invention, a plurality of classification models can be used without restriction, so a classification model suited to each individual situation can be applied easily, and because no training data is duplicated across data sets, the classification models can be trained quickly.

FIG. 1 is a flowchart illustrating a method of operating independent classification models using metadata according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method of training a classification model according to an embodiment of the present invention.
FIG. 3 is a block diagram of an independent classification model apparatus using metadata according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating ensemble techniques according to the prior art and an ensemble technique according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining a confusion matrix.
FIG. 6 is a diagram illustrating the separation of a data processing unit and a classification model according to an embodiment of the present invention.

Since the present invention may be modified in various ways and may have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. Like reference numerals are used for like elements throughout the description of the drawings.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, but it should be understood that intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a method of operating independent classification models using metadata according to an embodiment of the present invention.

In the present invention, metadata is information that is not directly required to design a classification model and can be interpreted as a variable. For example, in activity recognition using an acceleration sensor, the signal collected from the acceleration sensor is the data, and the specific activity during which the signal was measured is called the class or label. In this case, the metadata may be variable information that describes the data to be classified, such as the collection environment, the subject's gender or age, and the subject number.

In step S110, the classification model training unit trains a classification model corresponding to each of a plurality of data sets obtained by dividing a plurality of training data based on metadata.

Here, the plurality of training data may contain various kinds of metadata, or the metadata may be provided separately. The plurality of training data may be divided into a plurality of data sets based on predetermined metadata. Preferably, the training data is divided so that the same training data does not appear in more than one data set.

Meanwhile, the metadata used as the criterion for dividing the data may be determined according to the classification model designer's empirical or experimental findings, the characteristics of the classification target, or domain knowledge. For example, when walking and running are measured at different speeds on a treadmill, women begin to run at a lower speed than men, so gender may be set as the metadata that serves as the dividing criterion.
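A minimal Python sketch of dividing training data by a single metadata field, using the treadmill example above, is shown here. The record layout, the field names, and the values are hypothetical.

```python
from collections import defaultdict

def divide_by_metadata(records, key):
    """Group records by one metadata field; each record lands in exactly one group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["meta"][key]].append(record)
    return dict(groups)

# Hypothetical treadmill records: speed in km/h, activity label, and metadata.
records = [
    {"speed": 6.5, "label": "run", "meta": {"gender": "female", "subject": 1}},
    {"speed": 6.5, "label": "walk", "meta": {"gender": "male", "subject": 2}},
    {"speed": 4.0, "label": "walk", "meta": {"gender": "female", "subject": 3}},
    {"speed": 9.0, "label": "run", "meta": {"gender": "male", "subject": 4}},
]
data_sets = divide_by_metadata(records, "gender")
```

Because each record is appended to exactly one group, the resulting data sets do not share any training data. In the toy records the same speed carries different labels for the two groups, which is the kind of situation where a separate model per group may help.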

The classification model training unit may then create and train various kinds of classification models, each optimized for one of the divided data sets. The better the variable is chosen, the less similar the resulting (trained) classification models may be to one another. However, the way to assess similarity between classification models may differ from model to model, and no such method may exist.

In step S120, the data comparison unit determines a selected data set, which is the data set among the plurality of data sets that corresponds to the target data to be classified, by using at least one of the metadata and a predetermined data similarity criterion.

That is, the data comparison unit may determine the data set corresponding to the target data from among the plurality of data sets by using at least one of the metadata and the data similarity criterion.

If the data comparison unit uses the metadata, it may determine as the selected data set the data set, among the plurality of data sets, that was formed from metadata identical or similar to the metadata of the target data.

If the data comparison unit uses the data similarity criterion, it may determine whether the training data included in each of the plurality of data sets is similar to the target data and then determine the selected data set from among the plurality of data sets.

In another embodiment, the data comparison unit may determine the selected data set in different ways depending on whether metadata of the target data exists.

That is, if information about the metadata of the target data exists, the data comparison unit may determine the selected data set using at least one of the metadata and the similarity criterion.

If information about the metadata of the target data does not exist, however, the data comparison unit may determine the selected data set using only the similarity criterion.

This is because, depending on the target data, the metadata that was used to divide the data sets may or may not be included.

If the target data includes the relevant metadata and the data comparison unit determines the selected data set using only that metadata, the data comparison process based on the data similarity criterion can be omitted, which reduces the amount of computation.
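The branching on metadata availability can be sketched in Python as follows. The data layout, the mean absolute error used as the fallback criterion, and the rule of picking the data set with the smallest error among those under the threshold are assumptions for illustration.

```python
def mean_abs_error(samples, value):
    """Average absolute difference between the target value and the samples."""
    return sum(abs(s - value) for s in samples) / len(samples)

def select_data_set(target, data_sets, meta_key, error_threshold):
    """Return the key of the selected data set, or None if nothing qualifies."""
    meta_value = target.get("meta", {}).get(meta_key)
    if meta_value is not None and meta_value in data_sets:
        # Metadata is available: a direct lookup, no data comparison needed.
        return meta_value
    # No usable metadata: fall back to the data similarity criterion.
    errors = {name: mean_abs_error(samples, target["value"])
              for name, samples in data_sets.items()}
    best = min(errors, key=errors.get)
    return best if errors[best] <= error_threshold else None

# Hypothetical data sets keyed by the metadata value used to divide them.
data_sets = {"female": [6.0, 6.5, 7.0], "male": [8.0, 8.5, 9.0]}

with_meta = select_data_set({"value": 8.8, "meta": {"gender": "female"}},
                            data_sets, "gender", error_threshold=1.0)
without_meta = select_data_set({"value": 8.4}, data_sets, "gender",
                               error_threshold=1.0)
no_match = select_data_set({"value": 20.0}, data_sets, "gender",
                           error_threshold=1.0)
```

In the first call the metadata decides the result without touching the stored samples; the second and third calls have to scan every data set.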

In yet another embodiment, the data similarity criterion may be whether the similarity between the data included in each of the plurality of data sets and the target data is greater than or equal to a predetermined similarity threshold, or whether the error between the data included in each of the plurality of data sets and the target data is less than or equal to a predetermined error threshold.

For example, the data comparison unit may calculate the similarity between the target data and each item of training data included in an individual data set among the plurality of data sets, take the average, and determine that the individual data set is similar to the target data if the average similarity is greater than or equal to the similarity threshold.

Likewise, the data comparison unit may calculate the error between the target data and each item of training data included in an individual data set among the plurality of data sets, take the average, and determine that the individual data set is similar to the target data if the average error is less than or equal to the error threshold.
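The two averaged criteria can be written down as a short Python sketch. The specification does not define the similarity or error measure, so the Euclidean distance used as the error and the mapping 1 / (1 + distance) used as the similarity are assumptions, as are the threshold values.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_error(data_set, target):
    """Average error (here: Euclidean distance) between the target and each sample."""
    return sum(euclidean(sample, target) for sample in data_set) / len(data_set)

def mean_similarity(data_set, target):
    """Average similarity, with similarity assumed to be 1 / (1 + distance)."""
    return sum(1.0 / (1.0 + euclidean(sample, target))
               for sample in data_set) / len(data_set)

def similar_by_error(data_set, target, error_threshold):
    return mean_error(data_set, target) <= error_threshold

def similar_by_similarity(data_set, target, similarity_threshold):
    return mean_similarity(data_set, target) >= similarity_threshold

target = (1.0, 2.0)
data_set_a = [(1.0, 1.0), (1.0, 3.0)]    # each sample is at distance 1 from the target
data_set_b = [(4.0, 6.0), (4.0, -2.0)]   # each sample is at distance 5 from the target

a_ok = similar_by_error(data_set_a, target, error_threshold=2.0)
b_ok = similar_by_error(data_set_b, target, error_threshold=2.0)
```

With these toy values, `data_set_a` passes both criteria and `data_set_b` fails both, but in general the two criteria need not agree unless the thresholds are chosen consistently.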

Finally, in step S130, the data classification unit classifies the target data using the selected classification model, which is the classification model corresponding to the selected data set.

That is, the data classification unit may classify the target data using the selected classification model corresponding to the selected data set.

In other words, once the data comparison unit determines the selected data set based on the similarity between the target data and the training data included in the data sets, the data classification unit may classify the target data using the selected classification model corresponding to that selected data set.

In another embodiment, the classification model training unit may store each individual data set included in the plurality of data sets in a storage, paired with the classification model corresponding to it; the data comparison unit may obtain the plurality of data sets from the storage; and the data classification unit may obtain the selected classification model from the storage.

That is, in the present invention the data comparison unit must determine the selected data set corresponding to the target data from among the plurality of data sets obtained by dividing the plurality of training data. This means that the plurality of data sets continue to be used even after training is completed in step S110, which distinguishes the invention from other ensemble techniques that do not use the training data once training is complete.

To this end, the classification model training unit may store the plurality of data sets in the storage so that each data set is paired with its corresponding classification model. The data comparison unit and the data classification unit may then operate using the data sets and classification models stored in the storage. Because each data set and its classification model are stored as a pair, the data classification unit can easily obtain from the storage the selected classification model that is paired with the selected data set.
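One way to picture the paired storage is the following Python sketch, in which an in-memory dictionary stands in for the storage. The class name `PairStore`, its methods, the threshold trainer, and the toy data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, str]  # (speed, label)

@dataclass
class PairStore:
    """In-memory stand-in for the storage: one (data set, model) pair per key."""
    pairs: Dict[str, Tuple[List[Sample], Callable[[float], str]]] = field(default_factory=dict)

    def save(self, key, data_set, model):
        self.pairs[key] = (data_set, model)

    def data_sets(self):
        """What the data comparison unit reads."""
        return {key: data_set for key, (data_set, _) in self.pairs.items()}

    def model_for(self, key):
        """What the data classification unit reads once a data set is selected."""
        return self.pairs[key][1]

def train_threshold(data_set):
    """Hypothetical trainer: split walk/run at the midpoint between the class means."""
    walk = [x for x, y in data_set if y == "walk"]
    run = [x for x, y in data_set if y == "run"]
    cut = (sum(walk) / len(walk) + sum(run) / len(run)) / 2
    return lambda x: "run" if x > cut else "walk"

store = PairStore()
female = [(4.0, "walk"), (5.0, "walk"), (7.0, "run"), (8.0, "run")]
male = [(5.0, "walk"), (6.0, "walk"), (9.0, "run"), (10.0, "run")]
store.save("female", female, train_threshold(female))
store.save("male", male, train_threshold(male))

# The same speed is classified differently depending on the selected pair.
female_result = store.model_for("female")(6.5)
male_result = store.model_for("male")(6.5)
```

Only the model paired with the selected data set is invoked for a given target, so the other stored models add no cost at classification time.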

한편, 본 발명의 저장소는 데이터베이스 서버, HDD, SSD등과 같은 저장 장치, 클라우드 저장소 등과 같이 다양한 형태일 수 있으나, 나열된 예시로 한정되지 않음은 물론이다.Meanwhile, the storage of the present invention may be in various forms, such as a database server, a storage device such as an HDD, an SSD, or a cloud storage, but is not limited to the listed examples.

또 다른 실시예에서는, 복수의 학습 데이터 및 대상 데이터에 대하여 소정의 기준에 따른 대표값을 추출하는 특징추출(feature extraction) 및 그 대표값으로 구성된 특징 공간의 차원을 축소하는 차원축소(dimensionality reduction)는, 사전 설정에 따라서 데이터 처리부 및 분류모델 중 하나에 의해 수행되거나, 데이터 처리부 및 분류모델에서 나뉘어 수행될 수 있다.In another embodiment, feature extraction for extracting a representative value according to a predetermined criterion for a plurality of training data and target data, and dimensionality reduction for reducing the dimension of a feature space composed of the representative values May be performed by one of a data processing unit and a classification model according to a preset setting, or may be separately performed by a data processing unit and a classification model.

For example, referring to FIG. 6(a), both feature extraction and dimensionality reduction may be performed on the training data or target data inside the classification model, depending on factors such as the environment in which the classification model is used or the type of input data.

Referring to FIG. 6(b), feature extraction may be performed on the training data or target data by the data processing unit, and dimensionality reduction may then be performed on the extracted features inside the classification model.

Referring to FIG. 6(c), both feature extraction and dimensionality reduction may be performed on the training data or target data by the data processing unit.

In this way, the present invention allows feature extraction and dimensionality reduction to be divided flexibly between the data processing unit and the classification model.
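The three arrangements of FIG. 6 can be written down as a preset table that records which component owns each stage. The sketch below is only one way to express such a preset; the names `Owner`, `SPLITS`, and `stages_for` are hypothetical and are not taken from the specification.

```python
from enum import Enum
from typing import Dict, List


class Owner(Enum):
    DATA_PROCESSOR = "data_processor"
    MODEL = "classification_model"


# One entry per arrangement of FIG. 6; keys "a", "b", "c" mirror the figure labels.
SPLITS: Dict[str, Dict[str, Owner]] = {
    "a": {"feature_extraction": Owner.MODEL,
          "dimensionality_reduction": Owner.MODEL},
    "b": {"feature_extraction": Owner.DATA_PROCESSOR,
          "dimensionality_reduction": Owner.MODEL},
    "c": {"feature_extraction": Owner.DATA_PROCESSOR,
          "dimensionality_reduction": Owner.DATA_PROCESSOR},
}


def stages_for(owner: Owner, split: str) -> List[str]:
    """Return the stages that `owner` must run under the given preset."""
    return [stage for stage, who in SPLITS[split].items() if who is owner]
```

Under preset "b", for instance, `stages_for(Owner.DATA_PROCESSOR, "b")` lists only feature extraction, leaving dimensionality reduction to the classification model.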

In another embodiment, when the data processing unit performs feature extraction or dimensionality reduction, it may do so on the plurality of training data before step S110 and on the target data before step S120.

That is, the data processing unit may perform feature extraction or dimensionality reduction on the training data before the classification models are trained, and on the target data before the target data is input to the selection classification model.

Meanwhile, general pre-processing that removes noise or outliers from the data, such as filtering or quantization, and segmentation that divides the data into units of time or records are preferably performed by the data processing unit.

As described above, because the present invention divides the training data according to metadata and then trains a separate classification model for each resulting data set, no training data is shared between models, which can prevent the overfitting problem. It also makes better use of training data whose governing variables are distinct but which is difficult to generalize.

FIG. 2 is a flowchart illustrating a method of training a classification model according to an embodiment of the present invention.

In step S210, the classification model learning unit trains a temporary classification model, which is one of a plurality of classification models.

For example, the classification model learning unit may select one of the plurality of classification models as the temporary classification model and train it.

In another embodiment, the plurality of classification models may be selected from classification models based on probability and statistics, domain transformation, artificial neural networks, expert systems, instance-based learning, decision trees, and ensemble techniques.

That is, the plurality of classification models may be selected from classification models based on the various methods listed above, and classification models based on methods not mentioned here may of course also be selected.

In step S220, the classification model learning unit evaluates the performance of the trained temporary classification model based on a confusion matrix.

For example, the classification model learning unit may generate a confusion matrix for the trained temporary classification model and analyze it to evaluate performance. More specifically, the classification model learning unit may evaluate performance using measures calculated from the confusion matrix, such as precision, recall, and accuracy.

Meanwhile, FIG. 5 shows a confusion matrix, which records where the predicted results agree with the actual results and where they differ.

Here, True Positive (TP) is the case where both the actual value and the prediction are YES (positive); False Negative (FN) is the case where the actual value is YES but the prediction is NO; False Positive (FP) is the case where the actual value is NO but the prediction is YES; and True Negative (TN) is the case where both the actual value and the prediction are NO.

In this case, accuracy is calculated as (TP + TN) / (TP + FN + FP + TN), precision as TP / (TP + FP), and recall as TP / (TP + FN).
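The three formulas above translate directly into code. The following is a minimal sketch; the function name `confusion_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the specification addresses.

```python
from typing import Dict


def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Assumption: report 0.0 rather than raising when the denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }
```

For example, with TP = 40, FN = 10, FP = 20, and TN = 30, accuracy is 70/100 = 0.7, precision is 40/60 (about 0.667), and recall is 40/50 = 0.8.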

Finally, in step S230, the classification model learning unit performs steps S210 and S220 for all of the plurality of classification models and determines one classification model according to the evaluated performance.

For example, when there are ten classification models, the classification model learning unit may train each of the ten, evaluate its performance, and determine the one that shows the best performance. The determined classification model may then be set as the classification model corresponding to the data set in question.
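Steps S210 to S230 amount to a loop over candidate models that keeps the best scorer. The sketch below illustrates that loop under several assumptions that are not in the specification: candidates are represented as hypothetical trainer functions, accuracy is the only score used, evaluation uses a separate `eval_data` set, and the two toy trainers (`always_zero`, `threshold_at_mean`) exist purely for demonstration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Sample = Tuple[float, int]            # (feature value, binary label)
Model = Callable[[float], int]
Trainer = Callable[[Sequence[Sample]], Model]


def accuracy(model: Model, data: Sequence[Sample]) -> float:
    """Fraction of samples whose predicted label equals the actual label."""
    return sum(1 for x, y in data if model(x) == y) / len(data)


def select_model(trainers: List[Trainer], train: Sequence[Sample],
                 eval_data: Sequence[Sample]) -> Tuple[Optional[Model], float]:
    best_model, best_score = None, -1.0
    for trainer in trainers:
        model = trainer(train)                # S210: train a temporary model
        score = accuracy(model, eval_data)    # S220: evaluate its performance
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score             # S230: keep the best performer


def always_zero(data: Sequence[Sample]) -> Model:
    # Toy candidate: ignores the data and always predicts label 0.
    return lambda x: 0


def threshold_at_mean(data: Sequence[Sample]) -> Model:
    # Toy candidate: predicts 1 when the feature exceeds the training mean.
    mean = sum(x for x, _ in data) / len(data)
    return lambda x: int(x > mean)
```

In the scheme described above, this selection would be repeated once per data set, so different data sets may end up paired with different kinds of model.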

FIG. 3 is a block diagram of an independent classification model device using metadata according to an embodiment of the present invention.

Referring to FIG. 3, the independent classification model device 300 using metadata according to an embodiment of the present invention includes a storage 310, a classification model learning unit 320, a data comparison unit 330, and a data classification unit 340. It may optionally further include a data processing unit (not shown).

Meanwhile, the independent classification model device 300 using metadata according to an embodiment of the present invention may be mounted on a desktop computer, a smartphone, a tablet, a notebook computer, a server, or the like.

The storage 310 stores and retains, in its internal storage space, the data it is requested to store.

The classification model learning unit 320 trains a classification model corresponding to each of a plurality of data sets obtained by dividing a plurality of training data according to metadata, pairs each individual data set included in the plurality of data sets with its corresponding classification model, and stores the pairs in the storage 310.
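The behaviour of the learning unit can be illustrated by grouping records on one metadata field and training one model per group. This is a sketch under stated assumptions: the record layout (`"metadata"`, `"features"`, `"label"` keys), the `sensor_position` field, and the function names are all hypothetical, and a trivial majority-label trainer stands in for a real classification model.

```python
from collections import Counter, defaultdict
from typing import Any, Callable, Dict, Hashable, List, Tuple

Record = Dict[str, Any]
Model = Callable[[Any], str]


def partition_by_metadata(records: List[Record], field: str) -> Dict[Hashable, List[Record]]:
    """Divide the training records into data sets keyed by one metadata field."""
    groups: Dict[Hashable, List[Record]] = defaultdict(list)
    for record in records:
        groups[record["metadata"][field]].append(record)
    return dict(groups)


def majority_label_trainer(rows: List[Record]) -> Model:
    # Placeholder trainer: the "model" always predicts the most common label.
    label = Counter(row["label"] for row in rows).most_common(1)[0][0]
    return lambda features: label


def train_per_dataset(datasets: Dict[Hashable, List[Record]],
                      trainer: Callable[[List[Record]], Model]
                      ) -> Dict[Hashable, Tuple[List[Record], Model]]:
    """Train one independent model per data set and keep each pair together."""
    return {key: (rows, trainer(rows)) for key, rows in datasets.items()}
```

Because every record lands in exactly one group, the per-group models never see one another's training data.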

In another embodiment, for each of the plurality of data sets, the classification model learning unit 320 may train a temporary classification model, which is one of a plurality of classification models, and evaluate the performance of the trained temporary classification model based on a confusion matrix, repeat this process for all of the plurality of classification models, and determine one classification model according to the evaluated performance.

In yet another embodiment, the plurality of classification models may be selected from classification models based on probability and statistics, domain transformation, artificial neural networks, expert systems, instance-based learning, decision trees, and ensemble techniques.

The data comparison unit 330 determines the selection data set, which is the data set corresponding to the target data to be classified, from among the plurality of data sets by using at least one of the metadata and a predetermined data similarity determination criterion.

In another embodiment, the data comparison unit 330 may use at least one of the metadata and the similarity determination criterion when metadata information for the target data exists, and may use the similarity determination criterion when no metadata information for the target data exists.
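One possible reading of this selection rule is sketched below: use the metadata when it is present and names a known data set, and otherwise fall back to an error threshold. The specifics are assumptions made for illustration only: the error is taken as the Euclidean distance between the target and each data set's centroid, and `select_dataset`, `centroid`, and `error_threshold` are hypothetical names. The specification leaves the actual similarity or error measure open.

```python
import math
from typing import Dict, List, Optional, Sequence


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a data set's feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def select_dataset(datasets: Dict[str, List[List[float]]],
                   target: Sequence[float],
                   metadata: Optional[str] = None,
                   error_threshold: float = 1.0) -> Optional[str]:
    # Case 1: the target carries metadata that names a stored data set.
    if metadata is not None and metadata in datasets:
        return metadata
    # Case 2: no usable metadata, so fall back to the similarity criterion.
    best_key, best_error = None, math.inf
    for key, rows in datasets.items():
        error = math.dist(centroid(rows), target)   # assumed error measure
        if error < best_error:
            best_key, best_error = key, error
    # Accept the closest data set only if its error is within the threshold.
    return best_key if best_error <= error_threshold else None
```

Returning `None` when no data set passes the threshold is a design choice of this sketch; an implementation could equally fall back to the closest data set regardless.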

The data classification unit 340 classifies the target data using the selection classification model, which is the classification model corresponding to the selection data set.

The data processing unit (not shown) performs pre-processing, segmentation, feature extraction, and dimensionality reduction on the training data or target data according to a preset configuration.

In another embodiment, feature extraction, which extracts representative values from the plurality of training data and the target data according to a predetermined criterion, and dimensionality reduction, which reduces the dimension of the feature space formed by those representative values, may each be performed by one of the data processing unit and the classification model according to a preset configuration, or may be divided between the data processing unit and the classification model.

The embodiments of the present invention described above can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium.

The computer-readable recording medium includes magnetic storage media (for example, ROM, floppy disks, and hard disks) and optically readable media (for example, CD-ROMs and DVDs).

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

Claims

A method of operating an independent classification model using metadata, the method comprising:
training, by a classification model learning unit, a classification model corresponding to each of a plurality of data sets obtained by dividing a plurality of training data according to metadata;
determining, by a data comparison unit, a selection data set, which is the data set corresponding to target data to be classified, from among the plurality of data sets by using at least one of the metadata and a predetermined data similarity determination criterion; and
classifying, by a data classification unit, the target data using a selection classification model, which is the classification model corresponding to the selection data set.

The method of claim 1,
wherein the determining of the selection data set uses at least one of the metadata and the similarity determination criterion when metadata information for the target data exists,
and uses the similarity determination criterion when no metadata information for the target data exists.

The method of claim 1,
further comprising, between the training of the classification model and the determining of the selection data set:
pairing, by the classification model learning unit, each individual data set included in the plurality of data sets with the classification model corresponding to that data set, and storing the pairs in a storage,
wherein the data comparison unit obtains the plurality of data sets from the storage, and the data classification unit obtains the selection classification model from the storage.

The method of claim 1,
wherein the training of the classification model comprises, for each of the plurality of data sets:
training a temporary classification model, which is one of a plurality of classification models;
evaluating the performance of the trained temporary classification model based on a confusion matrix; and
performing the training and the evaluating for all of the plurality of classification models, and determining one classification model according to the evaluated performance.

The method of claim 4,
wherein the plurality of classification models are selected from classification models based on probability and statistics, domain transformation, artificial neural networks, expert systems, instance-based learning, decision trees, and ensemble techniques.

The method of claim 1,
wherein feature extraction, which extracts representative values from the plurality of training data and the target data according to a predetermined criterion, and dimensionality reduction, which reduces the dimension of the feature space formed by the representative values,
are each performed by one of a data processing unit and the classification model according to a preset configuration, or are divided between the data processing unit and the classification model.

The method of claim 6,
further comprising, when the data processing unit performs feature extraction or dimensionality reduction:
performing, by the data processing unit, feature extraction or dimensionality reduction on the plurality of training data before the training of the classification model; and
performing, by the data processing unit, feature extraction or dimensionality reduction on the target data before the determining of the selection data set.

The method of claim 1,
wherein the data similarity determination criterion is
whether the similarity between the data included in each of the plurality of data sets and the target data is greater than or equal to a predetermined similarity threshold, or whether the error between the data included in each of the plurality of data sets and the target data is less than or equal to a predetermined error threshold.

An independent classification model device using metadata, the device comprising:
a storage;
a classification model learning unit that trains a classification model corresponding to each of a plurality of data sets obtained by dividing a plurality of training data according to metadata, pairs each individual data set included in the plurality of data sets with the classification model corresponding to that data set, and stores the pairs in the storage;
a data comparison unit that determines a selection data set, which is the data set corresponding to target data to be classified, from among the plurality of data sets by using at least one of the metadata and a predetermined data similarity determination criterion; and
a data classification unit that classifies the target data using a selection classification model, which is the classification model corresponding to the selection data set.

The device of claim 9,
wherein the data comparison unit
uses at least one of the metadata and the similarity determination criterion when metadata information for the target data exists,
and uses the similarity determination criterion when no metadata information for the target data exists.

The device of claim 9,
wherein the classification model learning unit,
for each of the plurality of data sets,
trains a temporary classification model, which is one of a plurality of classification models, and evaluates the performance of the trained temporary classification model based on a confusion matrix, repeats this process for all of the plurality of classification models, and determines one classification model according to the evaluated performance.

The device of claim 11,
wherein the plurality of classification models
are selected from classification models based on probability and statistics, domain transformation, artificial neural networks, expert systems, instance-based learning, decision trees, and ensemble techniques.

The device of claim 9,
wherein feature extraction, which extracts representative values from the plurality of training data and the target data according to a predetermined criterion, and dimensionality reduction, which reduces the dimension of the feature space formed by the representative values,
are each performed by one of a data processing unit and the classification model according to a preset configuration, or are divided between the data processing unit and the classification model.

The device of claim 9,
wherein the data similarity determination criterion is
whether the similarity between the data included in each of the plurality of data sets and the target data is greater than or equal to a predetermined similarity threshold, or whether the error between the data included in each of the plurality of data sets and the target data is less than or equal to a predetermined error threshold.