CN103678540A

CN103678540A - In-depth mining method for translation requirements

Info

Publication number: CN103678540A
Application number: CN201310638833.6A
Authority: CN
Inventors: 江潮
Original assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Current assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2013-11-30
Filing date: 2013-11-30
Publication date: 2014-03-26

Abstract

The invention discloses an in-depth mining method for translation requirements. The method comprises: extracting a plurality of translated documents; creating a document information set from the translation information in the translated documents; merging all records in the document information set by client to obtain a transaction database; and performing association calculation on each record in the transaction database to derive association rules between client requirement sets and their subsets. Because the association rules between client data and business data are processed, mined and output by computer, accuracy is high and the data processing load on the computer is effectively reduced.

Description

An in-depth mining method for translation requirements

Technical field

The present invention relates to the field of translation technology, and in particular to an in-depth mining method for translation requirements.

Background art

Data mining (DM), also known as knowledge discovery in databases (KDD), is a hot research topic in artificial intelligence and databases. Data mining uses the data-processing capability of computers to extract, from large volumes of incomplete, noisy, fuzzy and random real-world application data, the information and knowledge implicit in it that exhibit particular relationships. The information and knowledge mined are not required to be universal truths, nor new theorems of natural science or pure mathematical formulas, still less machine theorem proofs. In fact, all discovered knowledge is relative: it holds under specific premises and constraints, is oriented to a specific domain, and must also be easy for users to understand. Ideally, the results can be expressed in natural language.

Because similar enterprises in the same industry and the same region have highly similar foreign-trade characteristics, their translation requirements are often highly correlated as well. Statistics on the translation requirements of a large number of clients show that, within a given period and geographic area, clients' translation requirements are highly similar, and that, depending on region and time, they are often strongly correlated in translation direction, industry and subject field. An individual enterprise, however, is often unaware of the translation requirements it actually has. Mining the associations among client demands makes it possible to extend a user's demands, extend the services offered to that user, and increase the business volume of the translation platform.

Finding such business-demand data usually requires lengthy surveys and statistics on the demands, which is very inefficient, and the relationships between the data obtained from such surveys are of low accuracy.

Summary of the invention

The present invention aims to provide an in-depth mining method for translation requirements, which solves the problems of low accuracy of the relationships between data and low efficiency.

The invention discloses an in-depth mining method for translation requirements, comprising:

extracting a number of translation documents and building a document information set from the translation information in those documents, each record in the document information set corresponding to one translation document;

each record in the document information set comprising the following features: the client, the region where the client is located, the category of the corresponding translation document, and the translation direction of that document;

merging all records in the document information set by client to obtain a transaction database, each record in the transaction database containing a client demand set obtained by combining the region where the client is located, the category of the corresponding translation document, and the translation direction of that document;

performing association calculation on every record in the transaction database to derive association rules between a client demand set and its subsets;

according to the association rules, promoting the services in the client demand set to clients who have a subset X of that client demand set.
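The merging step above can be illustrated with a minimal sketch. It assumes each document record is a tuple of (document number, client, region, category, translation direction) and that a demand item is written as category.direction.region, as in the transaction database of the embodiment. The function name build_transaction_db is hypothetical, and the sample rows are taken from Table 1.

```python
from collections import defaultdict

# Document records: (document number, client, region, category, direction).
# These rows mirror a few entries of Table 1 in the embodiment.
records = [
    ("T0006", "C003", "BJ", "B", "ENCN"),
    ("T0007", "C003", "BJ", "C", "ENCN"),
    ("T0008", "C004", "BJ", "A", "CNEN"),
    ("T0009", "C004", "BJ", "B", "ENCN"),
    ("T0010", "C004", "BJ", "D", "ENCN"),
]


def build_transaction_db(records):
    """Merge document records by client into one demand set per client."""
    db = defaultdict(set)
    for _doc_id, client, region, category, direction in records:
        # A demand item combines category, direction and region (assumed format).
        db[client].add(f"{category}.{direction}.{region}")
    return dict(db)


transaction_db = build_transaction_db(records)
for client, demands in sorted(transaction_db.items()):
    print(client, sorted(demands))
```

Using a set per client means that several documents with the same category, direction and region contribute a single demand item.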

Preferably, the association calculation comprises:

recursively deriving the frequent (k+1)-itemsets from the records in the transaction database, calculating the degree of association between each subset of a frequent (k+1)-itemset and that itemset, and outputting the association rule if the result meets the confidence threshold.

Preferably, the process of recursively deriving the frequent (k+1)-itemsets comprises:

the client demand set of every record in the transaction database contains at least one client demand;

scanning the transaction database and obtaining all 1-itemsets in it from the client demands in its records;

calculating the support of each 1-itemset and obtaining the frequent 1-itemsets, whose support is not less than the minimum support threshold;

merging, without duplication, the frequent k-itemsets with the frequent 1-itemsets to generate the frequent (k+1)-itemsets whose support is not less than the minimum support threshold.

Preferably, the method further comprises: each 1-itemset corresponds to a Boolean array whose length is the total number of records in the transaction database, the positions of the Boolean array corresponding one to one to the records of the transaction database, in record order;

if a record in the transaction database contains the item of the 1-itemset, the logical value at the position corresponding to that record is set to 1; otherwise it is set to 0;

calculating the support of all the 1-itemsets and discarding those whose support is less than the minimum support threshold, to obtain the frequent 1-itemsets;

wherein the support is the ratio of the number of 1s in the Boolean array to the length of the Boolean array.

Preferably, the method further comprises: each (k+1)-itemset and its Boolean array are obtained by merging, without duplication, a frequent k-itemset and its Boolean array with a frequent 1-itemset and its Boolean array;

during the merge, a logical AND is applied to the values at the same positions of the Boolean array of the frequent k-itemset and the Boolean array of the frequent 1-itemset, giving the Boolean array of the candidate frequent (k+1)-itemset;

calculating the support of all candidate frequent (k+1)-itemsets and discarding the (k+1)-itemsets whose support is less than the minimum support threshold, to obtain the frequent (k+1)-itemsets.

Preferably, the translation documents are categorized by language, industry and subject field.

The association-rule mining method of the present invention has the following advantages:

1. Performing association calculation on client demands improves the accuracy of the data, so that associated services can be offered to clients;

2. The method of the present invention for searching for and detecting frequent itemsets scans the transaction database D only once, when the 1-itemset table is generated. Compared with most other association-rule algorithms, which read the transaction database repeatedly, this greatly reduces the I/O overhead of reading the transaction database. When frequent itemsets are generated, no candidate itemsets need to be produced first: the frequent k-itemsets are generated directly from the frequent 1-itemsets and the frequent (k-1)-itemsets. Compared with the FP-growth method, which likewise scans the transaction database only once but must compress it into a frequent-pattern tree (FP-tree), the method consumes less memory;

3. In this method, frequent itemsets are mined with Boolean arrays, and the most expensive computation is the logical AND operation, which matches the low-level operation model of a computer. Software designed in this way not only runs fast but also consumes the least CPU and memory.
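As an illustration of the third advantage, a Boolean array can be packed into a single integer so that merging two itemsets costs one bitwise AND. This is only a sketch of one possible encoding, not something the text prescribes; the two bit strings are the Boolean arrays of A.CNEN.BJ and B.ENCN.BJ from Table 3 of the embodiment, and the helper name support_count is hypothetical.

```python
# Boolean arrays for two demand items, taken from Table 3 of the embodiment
# (one bit per transaction record, leftmost bit = first record).
a_bits = "100110111"  # A.CNEN.BJ
b_bits = "111101011"  # B.ENCN.BJ


def support_count(mask: int) -> int:
    """Return the number of records whose bit is set in the mask."""
    return bin(mask).count("1")


# Pack each Boolean array into one integer (assumed encoding).
a_mask = int(a_bits, 2)
b_mask = int(b_bits, 2)

# A single bitwise AND yields the records that contain both items.
ab_mask = a_mask & b_mask
ab_bits = format(ab_mask, f"0{len(a_bits)}b")

print(ab_bits, support_count(ab_mask))
```

The result agrees with the entry for A, B in the frequent 2-itemset table of the embodiment (Boolean array 100100011, support count 4).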

Brief description of the drawings

The accompanying drawing described here provides a further understanding of the present invention and forms part of this application. The illustrative embodiments of the present invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 shows the flow chart of the embodiment.

Embodiments

The present invention is described in detail below with reference to the accompanying drawing and in conjunction with the embodiments.

Each record in the document information set comprises the following features: the client, the region where the client is located, the category of the corresponding translation document, and the translation direction of that document;

Preferably, the association calculation comprises:

Preferably, the translation documents are categorized by language, industry and subject field.

Further, the present invention also provides a preferred embodiment:

Based on the translation documents in a cloud translation platform, a document demand information table is built, as shown in Table 1;

Table 1 is as follows:

Document number	Client number	Region	Category	Translation direction
T0006	C003	BJ	B	ENCN (English to Chinese)
T0007	C003	BJ	C	ENCN (English to Chinese)
T0008	C004	BJ	A	CNEN (Chinese to English)
T0009	C004	BJ	B	ENCN (English to Chinese)
T0010	C004	BJ	D	ENCN (English to Chinese)
T0011	C005	BJ	A	CNEN (Chinese to English)
T0012	C005	BJ	C	ENCN (English to Chinese)
T0013	C006	BJ	B	ENCN (English to Chinese)
T0014	C006	BJ	C	ENCN (English to Chinese)
T0015	C007	BJ	A	CNEN (Chinese to English)
T0016	C007	BJ	C	ENCN (English to Chinese)
T0017	C008	BJ	A	CNEN (Chinese to English)
T0018	C008	BJ	B	ENCN (English to Chinese)
T0019	C008	BJ	C	ENCN (English to Chinese)
T0020	C008	BJ	E	CNEN (Chinese to English)
T0021	C009	BJ	A	CNEN (Chinese to English)
T0022	C009	BJ	B	ENCN (English to Chinese)
T0023	C009	BJ	C	ENCN (English to Chinese)

For example, the first row of the table indicates that document T0001 belongs to category "A", its translation direction is Chinese to English, its client is C001, and the client is located in Beijing.

The demand information items in the client document demand information table are merged by client, giving the transaction database D on which the association-rule analysis is finally performed. The transaction database has 2 fields: client number and client demand items.

Table 2: transaction database D

Client number	Client demand items
C001	A.CNEN.BJ, B.ENCN.BJ, E.CNEN.BJ
C002	B.ENCN.BJ, D.ENCN.BJ
C003	B.ENCN.BJ, C.ENCN.BJ
C004	A.CNEN.BJ, B.ENCN.BJ, D.ENCN.BJ
C005	A.CNEN.BJ, C.ENCN.BJ
C006	B.ENCN.BJ, C.ENCN.BJ
C007	A.CNEN.BJ, C.ENCN.BJ
C008	A.CNEN.BJ, B.ENCN.BJ, C.ENCN.BJ, E.CNEN.BJ
C009	A.CNEN.BJ, B.ENCN.BJ, C.ENCN.BJ

Scan the transaction database D and build a demand item table from it, as shown in Table 3. The table has 3 fields: the first is the sequence number of the demand item; the second is the name of the demand item; the third is a Boolean array whose length is the number of records in the transaction database D. The Boolean array takes its values as follows: if the corresponding demand item is present in the i-th record of the transaction database D, the i-th element of the array is set to 1 (true); otherwise it is set to 0.

Table 3: demand item table

Sequence number	Demand item name	Boolean array
1	A.CNEN.BJ	100110111
2	B.ENCN.BJ	111101011
3	C.ENCN.BJ	001011111
4	D.ENCN.BJ	010100000
5	E.CNEN.BJ	100000010
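The construction of the demand item table can be sketched as follows. This is a minimal illustration that assumes the transaction database is held as an ordered mapping from client number to demand items, with the data copied from Table 2; the function name build_item_table is hypothetical. Record order determines bit order, so the first client corresponds to the leftmost position.

```python
# Transaction database D from Table 2: client number -> demand items.
# Python dicts keep insertion order, which fixes the order of the records.
D = {
    "C001": ["A.CNEN.BJ", "B.ENCN.BJ", "E.CNEN.BJ"],
    "C002": ["B.ENCN.BJ", "D.ENCN.BJ"],
    "C003": ["B.ENCN.BJ", "C.ENCN.BJ"],
    "C004": ["A.CNEN.BJ", "B.ENCN.BJ", "D.ENCN.BJ"],
    "C005": ["A.CNEN.BJ", "C.ENCN.BJ"],
    "C006": ["B.ENCN.BJ", "C.ENCN.BJ"],
    "C007": ["A.CNEN.BJ", "C.ENCN.BJ"],
    "C008": ["A.CNEN.BJ", "B.ENCN.BJ", "C.ENCN.BJ", "E.CNEN.BJ"],
    "C009": ["A.CNEN.BJ", "B.ENCN.BJ", "C.ENCN.BJ"],
}


def build_item_table(db):
    """Return {demand item: Boolean array as a string of 0s and 1s}."""
    items = sorted({item for demands in db.values() for item in demands})
    table = {}
    for item in items:
        # One position per record: 1 if the record contains the item, else 0.
        table[item] = "".join("1" if item in demands else "0" for demands in db.values())
    return table


item_table = build_item_table(D)
for number, (item, bits) in enumerate(item_table.items(), start=1):
    print(number, item, bits)
```

The database is read only once here; every later step works on the Boolean arrays alone.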

Calculate the frequent 1-itemsets from Table 3: select the 1-itemsets for which the number of true values in the Boolean array of the demand item is not less than the minimum support count (set to 2 here), giving the frequent 1-itemset table.

Table 4: frequent 1-itemset table

Sequence number	Demand item name	Boolean array	Support count
1	A.CNEN.BJ	100110111	6
2	B.ENCN.BJ	111101011	7
3	C.ENCN.BJ	001011111	6
4	D.ENCN.BJ	010100000	2
5	E.CNEN.BJ	100000010	2

Apply the AND operation to the corresponding elements of the Boolean arrays of the i-th and j-th records in the frequent 1-itemset table (Table 4). If the number of true values in the resulting Boolean array is not less than the minimum support count, the 2-itemset formed by the demand items of the i-th and j-th records is a frequent itemset. This gives the frequent 2-itemsets in the following table:

Table 5: frequent 2-itemset table

Sequence number	Demand item names	Boolean array	Support count
1	A, B	100100011	4
2	A, C	000010111	4
3	A, E	100000010	2
4	B, C	001001011	4
5	B, D	010100000	2
6	B, E	100000010	2
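The pairwise AND step can be reproduced with a short sketch. It assumes the frequent 1-itemsets of Table 4 are abbreviated to A through E and their Boolean arrays are kept as strings; the helper name and_bits and the constant MIN_SUPPORT_COUNT are hypothetical.

```python
from itertools import combinations

MIN_SUPPORT_COUNT = 2  # minimum support count used in the embodiment

# Frequent 1-itemsets and their Boolean arrays (Table 4), abbreviated A..E.
frequent_1 = {
    "A": "100110111",
    "B": "111101011",
    "C": "001011111",
    "D": "010100000",
    "E": "100000010",
}


def and_bits(x, y):
    """Element-wise logical AND of two Boolean arrays given as strings."""
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(x, y))


frequent_2 = {}
for (i, bits_i), (j, bits_j) in combinations(frequent_1.items(), 2):
    merged = and_bits(bits_i, bits_j)
    # Keep the pair only if enough records contain both items.
    if merged.count("1") >= MIN_SUPPORT_COUNT:
        frequent_2[(i, j)] = merged

for number, (pair, bits) in enumerate(frequent_2.items(), start=1):
    print(number, ", ".join(pair), bits, bits.count("1"))
```

Pairs such as A, D are dropped because only one record contains both items.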

Examine the i-th record of the frequent k-itemset table and the j-th record of the frequent 1-itemset table. If merging their demand item names gives a (k+1)-itemset that has not been merged before, mark this (k+1)-itemset as "merged" and apply the AND operation to the Boolean arrays of the two records. If the number of true values in the new Boolean array is not less than the minimum support count, the (k+1)-itemset is a frequent itemset.

Table 7: frequent 3-itemset table obtained from the frequent 2-itemsets and the frequent 1-itemsets

Sequence number	Demand item names	Boolean array	Support count
1	A, B, C	000000011	2
2	A, B, E	100000010	2

The frequent 4-itemsets obtained from the frequent 3-itemsets and the frequent 1-itemsets are empty, so the search for frequent itemsets stops. The frequent itemsets obtained are as follows:

Sequence number	Demand item name	Boolean array	Support count
1	A.CNEN.BJ	100110111	6
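The whole level-wise search, from the frequent 1-itemsets until an empty level is reached, can be sketched as below. This is an illustrative sketch under assumptions: the items are abbreviated A through E, each Boolean array is packed into an integer, and the names mine_frequent_itemsets and merged_seen are hypothetical. The set merged_seen plays the role of the "merged" mark described for the (k+1)-itemsets.

```python
MIN_SUPPORT_COUNT = 2  # minimum support count used in the embodiment

# Frequent 1-itemsets and their Boolean arrays (Table 4), abbreviated A..E.
FREQUENT_1 = {
    "A": "100110111",
    "B": "111101011",
    "C": "001011111",
    "D": "010100000",
    "E": "100000010",
}


def mine_frequent_itemsets(frequent_1, min_count):
    """Return one dict per level k, mapping each frequent itemset to its bitmask."""
    singles = {frozenset([name]): int(bits, 2) for name, bits in frequent_1.items()}
    levels = [singles]
    while levels[-1]:
        merged_seen = set()  # (k+1)-itemsets already marked as merged
        next_level = {}
        for itemset, mask in levels[-1].items():
            for single, single_mask in singles.items():
                candidate = itemset | single
                # Skip merges that add no new item or were already evaluated.
                if len(candidate) != len(itemset) + 1 or candidate in merged_seen:
                    continue
                merged_seen.add(candidate)
                new_mask = mask & single_mask
                if bin(new_mask).count("1") >= min_count:
                    next_level[candidate] = new_mask
        levels.append(next_level)
    return levels[:-1]  # drop the final empty level


levels = mine_frequent_itemsets(FREQUENT_1, MIN_SUPPORT_COUNT)
for k, level in enumerate(levels, start=1):
    print(k, sorted("".join(sorted(itemset)) for itemset in level))
```

On the data of the embodiment the loop stops after the third level, because no 4-itemset reaches the minimum support count.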

Using the association-degree formula, calculate the degree of association between each of the subsets A, B, C, AB, AC, BC and the itemset {A, B, C}, and compare each result with the confidence threshold;

The calculation is as follows:

support_count(L) / support_count'(S), the result being compared with min_conf;

where min_conf is the minimum confidence threshold, support_count(L) is the support of the frequent itemset L, and support_count'(S) is the final support value of the frequent itemset S.

If the result meets the minimum confidence threshold min_conf, the association rule S => L is output;

that is, a client with the demand S may also have the demands in the itemset L;

According to the association rule, the services in the client demand items are promoted to clients who have a subset of those client demand items.
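The confidence check for the itemset {A, B, C} can be illustrated with the support counts from the tables above. This is a sketch under assumptions: the text does not state a value for min_conf, so 0.5 is chosen purely for illustration, and the function name rules_for is hypothetical.

```python
from itertools import combinations

# Support counts taken from Tables 4, 5 and 7 of the embodiment.
support_count = {
    frozenset("A"): 6,
    frozenset("B"): 7,
    frozenset("C"): 6,
    frozenset("AB"): 4,
    frozenset("AC"): 4,
    frozenset("BC"): 4,
    frozenset("ABC"): 2,
}

MIN_CONF = 0.5  # assumed value; the text does not specify min_conf


def rules_for(itemset, support_count, min_conf):
    """Return (subset, confidence) for each proper subset S that meets min_conf."""
    rules = []
    for size in range(1, len(itemset)):
        for subset in combinations(sorted(itemset), size):
            # Confidence of S => L is support_count(L) / support_count(S).
            confidence = support_count[itemset] / support_count[frozenset(subset)]
            if confidence >= min_conf:
                rules.append(("".join(subset), confidence))
    return rules


rules = rules_for(frozenset("ABC"), support_count, MIN_CONF)
for subset, confidence in rules:
    print(f"{subset} => ABC  confidence={confidence:.2f}")
```

With this assumed threshold, only the two-item subsets AB, AC and BC yield rules, so a client who already has two of the three demands would be offered the third.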

The foregoing is only the preferred embodiments of the present invention and is not intended to limit the invention. For a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An in-depth mining method for translation requirements, characterized by comprising:

merging all records in the document information set according to the client feature to obtain a transaction database, each record in the transaction database containing a client demand set obtained by combining the region where the client is located, the category of the corresponding translation document, and the translation direction of that document;

performing association calculation on every record in the transaction database to obtain association rules between a client demand set and its subsets.

2. The method according to claim 1, characterized in that the association calculation comprises:

3. The method according to claim 2, characterized in that the process of recursively deriving the frequent (k+1)-itemsets comprises:

4. The method according to claim 3, characterized by further comprising: each 1-itemset corresponds to a Boolean array whose length is the total number of records in the transaction database, the positions of the Boolean array corresponding one to one to the records of the transaction database, in record order;

calculating the support of all the 1-itemsets and discarding those whose support is less than the minimum support threshold, to obtain the frequent 1-itemsets;

5. The method according to claim 4, characterized by further comprising: each (k+1)-itemset and its Boolean array are obtained by merging, without duplication, a frequent k-itemset and its Boolean array with a frequent 1-itemset and its Boolean array;

6. The method according to claim 1, characterized in that the translation documents are categorized by language, industry and subject field.