Summary of the invention
The present invention aims to provide a deep mining method for translation requirements, which solves the problems of low accuracy and low efficiency in associating data.
The invention discloses a deep mining method for translation requirements, comprising:
Extracting a number of translation documents and, according to the translation information in the translation documents, establishing a document information set, in which each record corresponds to one translation document;
Each record in the document information set comprises the following features: the client, the region where the client is located, the category of the corresponding translation document, and the translation direction of that document;
Merging all records in the document information set by client to obtain a transaction database; each record in the transaction database includes a customer demand set obtained by merging the region where the client is located, the category of the corresponding translation document, and the translation direction of that document;
Performing an association calculation on the records in the transaction database to formulate association rules between a customer demand set and its subsets;
According to the association rules, promoting the services in the customer demand set to clients who have a subset X of that customer demand set.
Preferably, the association calculation comprises:
According to the records in the transaction database, recursively deriving the frequent (k+1)-itemsets, and calculating the degree of association between each subset of a frequent (k+1)-itemset and that frequent (k+1)-itemset; if the result meets the confidence threshold, the association rule is output.
Preferably, the process of recursively deriving the frequent (k+1)-itemsets comprises:
The customer demand set of each record in the transaction database comprises at least one customer demand;
Scanning the transaction database and, according to the customer demands in its records, obtaining all 1-itemsets in the transaction database;
Calculating the support of each 1-itemset to obtain the frequent 1-itemsets, whose support is not less than the minimum support threshold;
Merging the frequent k-itemsets with the frequent 1-itemsets without duplication to generate the frequent (k+1)-itemsets whose support is not less than the minimum support threshold.
Preferably, the method further comprises: each 1-itemset corresponds to a Boolean array whose length is the total number of records in the transaction database; the positions of the Boolean array correspond one to one to the records of the transaction database, in the order in which the records appear in the transaction database;
If a record in the transaction database contains the item of the 1-itemset, the logical value at the position corresponding to that record is set to 1; otherwise it is set to 0;
Calculating the support of all 1-itemsets and discarding the 1-itemsets whose support is less than the minimum support threshold, to obtain the frequent 1-itemsets.
Here, the support is the ratio of the number of 1s in the Boolean array to the length of the Boolean array.
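The support definition above can be written compactly as follows. The symbols are illustrative notation only and do not appear in the original text: n is the number of records in the transaction database, b_x is the Boolean array of the 1-itemset x, and min_sup is the minimum support threshold.

```latex
\[
\operatorname{support}(x) = \frac{1}{n}\sum_{i=1}^{n} b_x[i],
\qquad
x \text{ is frequent} \iff \operatorname{support}(x) \ge \mathit{min\_sup}
\]
```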
Preferably, the method further comprises: a (k+1)-itemset and its corresponding Boolean array are obtained by merging, without duplication, a frequent k-itemset and its Boolean array with a frequent 1-itemset and its Boolean array;
During the duplicate-free merging, a logical AND is applied to the values at the same positions of the Boolean array of the frequent k-itemset and the Boolean array of the frequent 1-itemset, giving the Boolean array of the candidate frequent (k+1)-itemset;
Calculating the support of all candidate frequent (k+1)-itemsets and discarding the (k+1)-itemsets whose support is less than the minimum support threshold, to obtain the frequent (k+1)-itemsets.
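In symbols, the position-wise AND of the merging step might be expressed as below. The notation is assumed for illustration: X is a frequent k-itemset, y is the item of a frequent 1-itemset not already in X, b denotes the corresponding Boolean array, and n is the number of records.

```latex
\[
b_{X \cup \{y\}}[i] = b_X[i] \land b_y[i],
\qquad i = 1, \dots, n, \quad y \notin X
\]
```

Because position i of the merged array is 1 only when record i contains every item of X and also y, the number of 1s in the merged array is exactly the number of records that contain the whole (k+1)-itemset.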
Preferably, the translation documents are classified by language, industry, and subject field.
The method of the present invention for mining association rules between translation capabilities has the following advantages:
1. By performing association calculations on customer demands, the accuracy of the data is improved, and associated services can be offered to clients;
2. The method of the present invention for searching for and detecting frequent itemsets needs to scan the transaction database D only once, when the 1-itemset table is generated. Compared with most other association rule algorithms, which read the transaction database repeatedly, this greatly reduces the I/O overhead of reading the transaction database. When frequent itemsets are generated, candidate items need not be produced first: the frequent k-itemsets are generated directly from the frequent 1-itemsets and the frequent (k-1)-itemsets. Compared with the FP-growth method, which likewise needs only one scan of the transaction database but must compress the transaction database into a frequent pattern tree (FP-tree), this method consumes less memory;
3. In this method, frequent itemsets are mined using Boolean arrays, so the dominant computational cost is the logical AND operation, which matches the low-level operation mode of a computer. Software designed in this way not only runs fast but is also very economical in its use of CPU and memory.
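As a rough illustration of advantage 3, a Boolean array can be packed into a single integer so that the merge becomes one bitwise AND. This is only a sketch under assumptions: the helper names `to_mask` and `support_count` are hypothetical, and the two bit patterns are those listed for items A and B in Table 3 of the embodiment.

```python
def to_mask(bits: str) -> int:
    """Pack a Boolean array written as a string of 0s and 1s into an integer."""
    return int(bits, 2)


def support_count(mask: int) -> int:
    """Count the 1 bits, i.e. the number of records that contain the itemset."""
    return bin(mask).count("1")


a = to_mask("100110111")  # Boolean array of item A (Table 3)
b = to_mask("111101011")  # Boolean array of item B (Table 3)

ab = a & b  # a single bitwise AND merges the two arrays
print(format(ab, "09b"), support_count(ab))
```

Packing into integers is one possible design choice; the text itself only requires that the arrays be combined position by position with a logical AND.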
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
The invention discloses a deep mining method for translation requirements, comprising:
Extracting a number of translation documents and, according to the translation information in the translation documents, establishing a document information set, in which each record corresponds to one translation document;
Each record in the document information set comprises the following features: the client, the region where the client is located, the category of the corresponding translation document, and the translation direction of that document;
Merging all records in the document information set by client to obtain a transaction database; each record in the transaction database includes a customer demand set obtained by merging the region where the client is located, the category of the corresponding translation document, and the translation direction of that document;
Performing an association calculation on the records in the transaction database to formulate association rules between a customer demand set and its subsets;
According to the association rules, promoting the services in the customer demand set to clients who have a subset X of that customer demand set.
Preferably, the association calculation comprises:
According to the records in the transaction database, recursively deriving the frequent (k+1)-itemsets, and calculating the degree of association between each subset of a frequent (k+1)-itemset and that frequent (k+1)-itemset; if the result meets the confidence threshold, the association rule is output.
Preferably, the process of recursively deriving the frequent (k+1)-itemsets comprises:
The customer demand set of each record in the transaction database comprises at least one customer demand;
Scanning the transaction database and, according to the customer demands in its records, obtaining all 1-itemsets in the transaction database;
Calculating the support of each 1-itemset to obtain the frequent 1-itemsets, whose support is not less than the minimum support threshold;
Merging the frequent k-itemsets with the frequent 1-itemsets without duplication to generate the frequent (k+1)-itemsets whose support is not less than the minimum support threshold.
Preferably, the method further comprises: each 1-itemset corresponds to a Boolean array whose length is the total number of records in the transaction database; the positions of the Boolean array correspond one to one to the records of the transaction database, in the order in which the records appear in the transaction database;
If a record in the transaction database contains the item of the 1-itemset, the logical value at the position corresponding to that record is set to 1; otherwise it is set to 0;
Calculating the support of all 1-itemsets and discarding the 1-itemsets whose support is less than the minimum support threshold, to obtain the frequent 1-itemsets.
Here, the support is the ratio of the number of 1s in the Boolean array to the length of the Boolean array.
Preferably, the method further comprises: a (k+1)-itemset and its corresponding Boolean array are obtained by merging, without duplication, a frequent k-itemset and its Boolean array with a frequent 1-itemset and its Boolean array;
During the duplicate-free merging, a logical AND is applied to the values at the same positions of the Boolean array of the frequent k-itemset and the Boolean array of the frequent 1-itemset, giving the Boolean array of the candidate frequent (k+1)-itemset;
Calculating the support of all candidate frequent (k+1)-itemsets and discarding the (k+1)-itemsets whose support is less than the minimum support threshold, to obtain the frequent (k+1)-itemsets.
Preferably, the translation documents are classified by language, industry, and subject field.
Further, the present invention also provides a preferred embodiment:
Taking the translation documents on a cloud translation platform as the basis, a document demand information table is established, as shown in Table 1;
Table 1 is as follows:
| Document number | Customer number | Region | Category | Translation direction |
|---|---|---|---|---|
| T0006 | C003 | BJ | B | ENCN (English to Chinese) |
| T0007 | C003 | BJ | C | ENCN (English to Chinese) |
| T0008 | C004 | BJ | A | CNEN (Chinese to English) |
| T0009 | C004 | BJ | B | ENCN (English to Chinese) |
| T0010 | C004 | BJ | D | ENCN (English to Chinese) |
| T0011 | C005 | BJ | A | CNEN (Chinese to English) |
| T0012 | C005 | BJ | C | ENCN (English to Chinese) |
| T0013 | C006 | BJ | B | ENCN (English to Chinese) |
| T0014 | C006 | BJ | C | ENCN (English to Chinese) |
| T0015 | C007 | BJ | A | CNEN (Chinese to English) |
| T0016 | C007 | BJ | C | ENCN (English to Chinese) |
| T0017 | C008 | BJ | A | CNEN (Chinese to English) |
| T0018 | C008 | BJ | B | ENCN (English to Chinese) |
| T0019 | C008 | BJ | C | ENCN (English to Chinese) |
| T0020 | C008 | BJ | E | CNEN (Chinese to English) |
| T0021 | C009 | BJ | A | ENCN (English to Chinese) |
| T0022 | C009 | BJ | B | ENCN (English to Chinese) |
| T0023 | C009 | BJ | C | ENCN (English to Chinese) |
For example, the first row of the table above indicates that document T0001 belongs to category "A", its translation direction is Chinese to English, the client it belongs to is C001, and the client is located in Beijing.
The requirement information items in the client document demand information table are merged by client, which yields the transaction database D on which the association rule analysis is finally performed. The transaction database has 2 fields: customer number and customer demand items.
Table 2: transaction database D
| Customer number | Customer demand items |
|---|---|
| C001 | A.CNEN.BJ, B.ENCN.BJ, E.CNEN.BJ |
| C002 | B.ENCN.BJ, D.ENCN.BJ |
| C003 | B.ENCN.BJ, C.ENCN.BJ |
| C004 | A.CNEN.BJ, B.ENCN.BJ, D.ENCN.BJ |
| C005 | A.CNEN.BJ, C.ENCN.BJ |
| C006 | B.ENCN.BJ, C.ENCN.BJ |
| C007 | A.CNEN.BJ, C.ENCN.BJ |
| C008 | A.CNEN.BJ, B.ENCN.BJ, C.ENCN.BJ, E.CNEN.BJ |
| C009 | A.CNEN.BJ, B.ENCN.BJ, C.ENCN.BJ |
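The merge from Table 1 into Table 2 might look like the following minimal sketch. It uses only the Table 1 rows for clients C003 and C004; the function name `build_transaction_db` and the tuple layout are assumptions made for illustration, while the "category.direction.region" item format follows Table 2.

```python
from collections import OrderedDict

# Rows copied from Table 1 for clients C003 and C004:
# (document, client, region, category, direction)
records = [
    ("T0006", "C003", "BJ", "B", "ENCN"),
    ("T0007", "C003", "BJ", "C", "ENCN"),
    ("T0008", "C004", "BJ", "A", "CNEN"),
    ("T0009", "C004", "BJ", "B", "ENCN"),
    ("T0010", "C004", "BJ", "D", "ENCN"),
]


def build_transaction_db(rows):
    """Merge per-document records into one list of demand items per client."""
    db = OrderedDict()
    for _doc, client, region, category, direction in rows:
        item = f"{category}.{direction}.{region}"
        items = db.setdefault(client, [])
        if item not in items:  # keep each demand item once per client
            items.append(item)
    return db


transaction_db = build_transaction_db(records)
for client, items in transaction_db.items():
    print(client, items)
```

For these five rows the result agrees with the C003 and C004 rows of Table 2.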
Transaction database D is scanned and a requirement item table is built from it, as shown in Table 3. The table has 3 fields: the first is the requirement item sequence number; the second is the requirement item name; the third is a Boolean array whose length is the number of records in transaction database D. The Boolean array takes its values as follows: if the corresponding requirement item is present in the i-th record of transaction database D, the i-th element of the array is set to the true value 1; otherwise it is set to 0.
Table 3: requirement item table
| Sequence number | Requirement item name | Boolean array |
|---|---|---|
| 1 | A.CNEN.BJ | 100110111 |
| 2 | B.ENCN.BJ | 111101011 |
| 3 | C.ENCN.BJ | 001011111 |
| 4 | D.ENCN.BJ | 010100000 |
| 5 | E.CNEN.BJ | 100000010 |
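Building Table 3 from Table 2 in a single pass over the transaction database could be sketched as follows. The dictionary `D` copies Table 2; the function name `build_item_table` and the choice to represent each Boolean array as a string are assumptions of this sketch.

```python
# Transaction database D, copied from Table 2 (client -> demand items).
D = {
    "C001": ["A.CNEN.BJ", "B.ENCN.BJ", "E.CNEN.BJ"],
    "C002": ["B.ENCN.BJ", "D.ENCN.BJ"],
    "C003": ["B.ENCN.BJ", "C.ENCN.BJ"],
    "C004": ["A.CNEN.BJ", "B.ENCN.BJ", "D.ENCN.BJ"],
    "C005": ["A.CNEN.BJ", "C.ENCN.BJ"],
    "C006": ["B.ENCN.BJ", "C.ENCN.BJ"],
    "C007": ["A.CNEN.BJ", "C.ENCN.BJ"],
    "C008": ["A.CNEN.BJ", "B.ENCN.BJ", "C.ENCN.BJ", "E.CNEN.BJ"],
    "C009": ["A.CNEN.BJ", "B.ENCN.BJ", "C.ENCN.BJ"],
}


def build_item_table(db):
    """Scan the database once and return {item name: Boolean array string}."""
    n = len(db)
    arrays = {}
    for i, items in enumerate(db.values()):  # the single scan of D
        for item in items:
            # Create an all-zero array on first sight, then mark record i.
            arrays.setdefault(item, [0] * n)[i] = 1
    return {name: "".join(map(str, bits)) for name, bits in sorted(arrays.items())}


item_table = build_item_table(D)
for name, bits in item_table.items():
    print(name, bits)
```

The sketch relies on Python dictionaries preserving insertion order, so position i of each array corresponds to the i-th record of D.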
The frequent 1-itemsets are computed from Table 3: the 1-itemsets for which the number of true values in the corresponding Boolean array is not less than the minimum support count (set to 2 here) are selected, giving the frequent 1-itemset table.
Table 4: frequent 1-itemset table
| Sequence number | Requirement item name | Boolean array | Support count |
|---|---|---|---|
| 1 | A.CNEN.BJ | 100110111 | 6 |
| 2 | B.ENCN.BJ | 111101011 | 7 |
| 3 | C.ENCN.BJ | 001011111 | 6 |
| 4 | D.ENCN.BJ | 010100000 | 2 |
| 5 | E.CNEN.BJ | 100000010 | 2 |
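The filtering step that produces Table 4 can be sketched as below. The Boolean arrays are copied from Table 3 and the minimum support count of 2 is the value stated in the text; the function name `frequent_one_itemsets` is hypothetical.

```python
# Boolean arrays copied from Table 3.
item_table = {
    "A.CNEN.BJ": "100110111",
    "B.ENCN.BJ": "111101011",
    "C.ENCN.BJ": "001011111",
    "D.ENCN.BJ": "010100000",
    "E.CNEN.BJ": "100000010",
}

MIN_SUPPORT_COUNT = 2  # minimum support count used in the embodiment


def frequent_one_itemsets(table, min_count):
    """Keep the items whose Boolean array holds at least min_count 1s."""
    result = {}
    for name, bits in table.items():
        count = bits.count("1")  # support count = number of true values
        if count >= min_count:
            result[name] = (bits, count)
    return result


frequent_1 = frequent_one_itemsets(item_table, MIN_SUPPORT_COUNT)
for name, (bits, count) in frequent_1.items():
    print(name, bits, count)
```

With a minimum support count of 2 all five items survive, as in Table 4; raising the threshold to 3 in this sketch would drop D.ENCN.BJ and E.CNEN.BJ.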
A logical AND is applied to the corresponding elements of the Boolean arrays of the i-th and j-th records in the frequent 1-itemset table (Table 4), giving a new Boolean array. If the number of true values in this Boolean array is not less than the minimum support count, the 2-itemset formed by the requirement items of the i-th and j-th records is a frequent itemset. This yields the frequent 2-itemsets, shown in the following table:
Table 5: frequent 2-itemset table
| Sequence number | Requirement item name | Boolean array | Support count |
|---|---|---|---|
| 1 | A, B | 100100011 | 4 |
| 2 | A, C | 000010111 | 4 |
| 3 | A, E | 100000010 | 2 |
| 4 | B, C | 001001011 | 4 |
| 5 | B, D | 010100000 | 2 |
| 6 | B, E | 100000010 | 2 |
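The pairwise AND that leads from Table 4 to Table 5 might be sketched as follows. Item names are shortened to their category letters as in Table 5; the helper `and_bits` is a hypothetical name.

```python
from itertools import combinations

# Frequent 1-itemsets from Table 4, keyed by category letter.
frequent_1 = {
    "A": "100110111",
    "B": "111101011",
    "C": "001011111",
    "D": "010100000",
    "E": "100000010",
}
MIN_SUPPORT_COUNT = 2


def and_bits(x, y):
    """Position-wise logical AND of two equally long Boolean array strings."""
    return "".join("1" if a == b == "1" else "0" for a, b in zip(x, y))


frequent_2 = {}
for (n1, b1), (n2, b2) in combinations(frequent_1.items(), 2):
    merged = and_bits(b1, b2)
    count = merged.count("1")
    if count >= MIN_SUPPORT_COUNT:  # keep pairs that reach the threshold
        frequent_2[(n1, n2)] = (merged, count)

for pair, (bits, count) in frequent_2.items():
    print(pair, bits, count)
```

Pairs such as (A, D) or (C, E) yield arrays with a single 1 and are discarded, which leaves the six rows of Table 5.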
The i-th record of the frequent k-itemset table and the j-th record of the frequent 1-itemsets are examined. If merging their requirement item names gives a (k+1)-itemset, and this (k+1)-itemset has not been merged before, the (k+1)-itemset is marked as "merged", and a logical AND is applied to the Boolean array of the i-th record of the frequent k-itemset table and the Boolean array of the j-th record of the frequent 1-itemsets. If the number of true values in the resulting Boolean array is not less than the minimum support count, this (k+1)-itemset is a frequent itemset.
Table 7: frequent 3-itemset table obtained from the frequent 2-itemsets and the frequent 1-itemsets
| Sequence number | Requirement item name | Boolean array | Support count |
|---|---|---|---|
| 1 | A, B, C | 000000011 | 2 |
| 2 | A, B, E | 100000010 | 2 |
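The general step from frequent k-itemsets to frequent (k+1)-itemsets, including the "merged" marking, can be sketched as a loop. This is an illustrative reading of the description rather than a definitive implementation; the function name `mine_frequent_itemsets` and the use of a `seen` set for the "merged" mark are assumptions.

```python
def and_bits(x, y):
    """Position-wise logical AND of two equally long Boolean array strings."""
    return "".join("1" if a == b == "1" else "0" for a, b in zip(x, y))


def mine_frequent_itemsets(frequent_1, min_count):
    """Return a list of levels; level k-1 maps each frequent k-itemset to its array."""
    current = {frozenset([name]): bits for name, bits in frequent_1.items()}
    levels = []
    while current:
        levels.append(current)
        seen = set()  # (k+1)-itemsets already marked as "merged"
        nxt = {}
        for itemset, bits in current.items():
            for name, one_bits in frequent_1.items():
                candidate = itemset | {name}
                if name in itemset or candidate in seen:
                    continue  # not a new (k+1)-itemset
                seen.add(candidate)
                merged = and_bits(bits, one_bits)
                if merged.count("1") >= min_count:
                    nxt[candidate] = merged
        current = nxt  # an empty level ends the search
    return levels


# Frequent 1-itemsets from Table 4, keyed by category letter.
frequent_1 = {
    "A": "100110111",
    "B": "111101011",
    "C": "001011111",
    "D": "010100000",
    "E": "100000010",
}
levels = mine_frequent_itemsets(frequent_1, 2)
for k, level in enumerate(levels, start=1):
    print(k, {"".join(sorted(s)): b for s, b in level.items()})
```

On the example data the loop produces three non-empty levels and then stops, since every candidate 4-itemset falls below the minimum support count.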
Merging the frequent 3-itemsets with the frequent 1-itemsets gives an empty set of frequent 4-itemsets, so the search for frequent itemsets stops. The frequent itemsets obtained are therefore as follows:
| Sequence number | Requirement item name | Boolean array | Support count |
|---|---|---|---|
| 1 | A.CNEN.BJ | 100110111 | 6 |
Using the association degree formula, the degree of association between each of the subsets A, B, C, AB, AC, BC and the itemset {A, B, C} is calculated, and each result is compared with the confidence threshold;
The calculation is as follows:
support_count(L) / support_count'(S), and the result is compared with min_conf;
Here, min_conf is the minimum confidence threshold, support_count(L) is the support of the frequent itemset L, and support_count'(S) is the final value of the support of the frequent itemset S.
If the result meets the threshold min_conf, the association rule "S is associated with L" is output;
That is, a client who has the demand S may at the same time have a demand for the requirement items in L;
According to the association rules, the services in a customer demand itemset are promoted to the clients who have a subset of that customer demand itemset.
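The confidence check for the itemset {A, B, C} might be sketched as follows. The support counts are taken from Tables 4, 5 and 7; the value of `MIN_CONF` is purely hypothetical, since the text gives no numeric threshold, and the comparison uses "not less than" as an assumption about how the threshold is applied.

```python
from itertools import combinations

# Support counts taken from Tables 4, 5 and 7.
support_count = {
    frozenset("A"): 6,
    frozenset("B"): 7,
    frozenset("C"): 6,
    frozenset("AB"): 4,
    frozenset("AC"): 4,
    frozenset("BC"): 4,
    frozenset("ABC"): 2,
}

MIN_CONF = 0.5  # hypothetical threshold, not specified in the text


def rules_for(L, supports, min_conf):
    """Return (S, confidence) for each proper subset S of L that meets min_conf."""
    rules = []
    for size in range(1, len(L)):
        for subset in combinations(sorted(L), size):
            conf = supports[L] / supports[frozenset(subset)]
            if conf >= min_conf:
                rules.append((subset, conf))
    return rules


rules = rules_for(frozenset("ABC"), support_count, MIN_CONF)
for subset, conf in rules:
    print(subset, "->", "ABC", conf)
```

Under this assumed threshold only the two-item subsets AB, AC and BC produce rules, each with a ratio of 2/4; the single items A, B and C fall below it.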
The foregoing is merely the preferred embodiments of the present invention and is not intended to limit the present invention; for a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.