CN103678540A - In-depth mining method for translation requirements - Google Patents

In-depth mining method for translation requirements

Info

Publication number
CN103678540A
Authority
CN
China
Prior art keywords
collection
frequent
item
boolean
array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310638833.6A
Other languages
Chinese (zh)
Inventor
江潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Original Assignee
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310638833.6A
Publication of CN103678540A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an in-depth mining method for translation requirements. The method includes extracting a plurality of translated documents, creating a document information set from the translation information in the translated documents, and merging all records in the document information set by client to obtain a transaction database; then performing association calculation on each record in the transaction database to derive association rules between client requirement sets and their subsets. Because the association rules between client data and business data are processed, mined and output by computer, accuracy is high and the data processing load on computers can be effectively reduced.

Description

An in-depth mining method for translation requirements
Technical field
The present invention relates to the field of translation technology, and in particular to an in-depth mining method for translation requirements.
Background art
Data mining (DM), also known as Knowledge Discovery in Databases (KDD), is a current research focus in artificial intelligence and databases. Data mining uses the data-processing capability of computers to extract, from large volumes of incomplete, noisy, fuzzy and random real-world application data, the information and knowledge that is implicit in that data and exhibits particular relationships. The mined information and knowledge need not be universally valid truths, new theorems of natural science or pure mathematics, or machine theorem proofs. In practice, all discovered knowledge is relative: it holds under specific premises and constraints, is oriented to a specific domain, and must be easy for users to understand. Ideally the results can be expressed in natural language.
Because similar enterprises in the same industry and region have highly similar foreign-trade characteristics, the translation services they require also tend to be highly correlated. Statistics on the translation requirements of a large number of clients show that, within a given time period and region, client translation requirements are very similar, and that requirements are often strongly associated across translation direction, industry and subject field as region and time vary. An individual enterprise, however, is often not aware of all the translation services it needs. Mining the associations among client requirements makes it possible to extend a user's requirements, broaden the services offered to that user, and increase the business volume of the translation platform.
Finding such business-requirement data has usually required long-running surveys and statistics on demand, which is very inefficient, and the relationships between data obtained from survey statistics are of low accuracy.
Summary of the invention
The present invention aims to provide an in-depth mining method for translation requirements that addresses the low accuracy of the relationships between data and the low efficiency described above.
The invention discloses an in-depth mining method for translation requirements, comprising:
extracting a number of translated documents and building a document information set from the translation information in the translated documents, each record in the document information set corresponding to one translated document;
each record in the document information set comprising the following features: the client, the region where the client is located, the category of the corresponding translated document, and the translation direction of that document;
merging all records in the document information set by client to obtain a transaction database, each record in the transaction database containing a client requirement set obtained by combining the region where the client is located, the category of the corresponding translated document and the translation direction of that document;
performing association calculation on the records in the transaction database to derive association rules between a client requirement set and its subsets;
according to the association rules, promoting the services in the client requirement set to clients whose requirements match a subset X of that client requirement set.
Preferably, the association calculation comprises:
deriving frequent (k+1)-itemsets iteratively from the records in the transaction database, calculating the degree of association between each subset of a frequent (k+1)-itemset and that frequent (k+1)-itemset, and outputting the association rule when the result meets the confidence threshold.
Preferably, the process of deriving the frequent (k+1)-itemsets comprises:
the client requirement set of each record in the transaction database containing at least one client requirement;
scanning the transaction database and obtaining all 1-itemsets in the transaction database from the client requirements in its records;
calculating the support of each 1-itemset to obtain the frequent 1-itemsets, whose support is not less than the minimum support threshold;
merging frequent k-itemsets with frequent 1-itemsets without repetition to generate the frequent (k+1)-itemsets whose support is not less than the minimum support threshold.
Preferably, the method further comprises: each 1-itemset corresponds to a Boolean array whose length is the total number of records in the transaction database, and the positions of the Boolean array correspond one-to-one to the records of the transaction database in record order;
if a record in the transaction database contains the item of the 1-itemset, the logical value at the position corresponding to that record is set to 1; otherwise it is set to 0;
the support of every 1-itemset is calculated, and the 1-itemsets whose support is less than the minimum support threshold are rejected, giving the frequent 1-itemsets;
wherein the support is the ratio of the number of 1s in the Boolean array to the length of the Boolean array.
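The support definition above can be illustrated with a minimal Python sketch. It assumes that a Boolean array is stored as a string of '0' and '1' characters; the function name `support` and the threshold of 2/9 are hypothetical choices made for illustration and are not taken from the patent. The example array is the one listed for requirement item A.CNEN.BJ in the embodiment.

```python
def support(bool_array: str) -> float:
    """Support = number of '1' digits divided by the array length."""
    if not bool_array:
        raise ValueError("Boolean array must not be empty")
    return bool_array.count("1") / len(bool_array)


# Boolean array of one 1-itemset over nine transaction records.
example = "100110111"
ratio = support(example)        # six 1s out of nine positions

# Assumed minimum support threshold (hypothetical value): 2 of 9 records.
min_support = 2 / 9
is_frequent = ratio >= min_support
```

With this representation, rejecting infrequent 1-itemsets reduces to comparing `support(...)` with the chosen threshold.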
Preferably, the method further comprises: each (k+1)-itemset and its Boolean array are obtained by merging, without repetition, a frequent k-itemset and its Boolean array with a frequent 1-itemset and its Boolean array;
during the merge without repetition, a logical AND is applied to the logical values at the same positions of the Boolean array of the frequent k-itemset and the Boolean array of the frequent 1-itemset, giving the Boolean array of a candidate frequent (k+1)-itemset;
the support of every candidate frequent (k+1)-itemset is calculated, and the (k+1)-itemsets whose support is less than the minimum support threshold are rejected, giving the frequent (k+1)-itemsets.
Preferably, the translated documents are categorized by language, industry and subject field.
The in-depth mining method for translation requirements of the present invention has the following advantages:
1. Performing association calculation on client requirements improves the accuracy of the data, so that associated services can be offered to clients.
2. The method of searching for and detecting frequent itemsets scans the transaction database D only once, when the 1-itemset table is generated. Compared with most other association rule algorithms, which read the transaction database repeatedly, this greatly reduces the I/O overhead of reading the transaction database. When frequent itemsets are generated, no separate candidate set has to be produced first: frequent k-itemsets are generated directly from frequent 1-itemsets and frequent (k-1)-itemsets. Compared with the FP-growth method, which likewise avoids repeated scans of the transaction database but must compress it into a frequent pattern tree (FP-tree), the method consumes less memory.
3. Frequent itemsets are mined with Boolean arrays, so the dominant computation is the logical AND operation, which matches the low-level operations of a computer. Software designed this way runs fast and is economical in CPU and memory consumption.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows the flow chart of the embodiment.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
The invention discloses an in-depth mining method for translation requirements, comprising:
extracting a number of translated documents and building a document information set from the translation information in the translated documents, each record in the document information set corresponding to one translated document;
each record in the document information set comprising the following features: the client, the region where the client is located, the category of the corresponding translated document, and the translation direction of that document;
merging all records in the document information set by client to obtain a transaction database, each record in the transaction database containing a client requirement set obtained by combining the region where the client is located, the category of the corresponding translated document and the translation direction of that document;
performing association calculation on the records in the transaction database to derive association rules between a client requirement set and its subsets;
according to the association rules, promoting the services in the client requirement set to clients whose requirements match a subset X of that client requirement set.
Preferably, the association calculation comprises:
deriving frequent (k+1)-itemsets iteratively from the records in the transaction database, calculating the degree of association between each subset of a frequent (k+1)-itemset and that frequent (k+1)-itemset, and outputting the association rule when the result meets the confidence threshold.
Preferably, the process of deriving the frequent (k+1)-itemsets comprises:
the client requirement set of each record in the transaction database containing at least one client requirement;
scanning the transaction database and obtaining all 1-itemsets in the transaction database from the client requirements in its records;
calculating the support of each 1-itemset to obtain the frequent 1-itemsets, whose support is not less than the minimum support threshold;
merging frequent k-itemsets with frequent 1-itemsets without repetition to generate the frequent (k+1)-itemsets whose support is not less than the minimum support threshold.
Preferably, the method further comprises: each 1-itemset corresponds to a Boolean array whose length is the total number of records in the transaction database, and the positions of the Boolean array correspond one-to-one to the records of the transaction database in record order;
if a record in the transaction database contains the item of the 1-itemset, the logical value at the position corresponding to that record is set to 1; otherwise it is set to 0;
the support of every 1-itemset is calculated, and the 1-itemsets whose support is less than the minimum support threshold are rejected, giving the frequent 1-itemsets;
wherein the support is the ratio of the number of 1s in the Boolean array to the length of the Boolean array.
Preferably, the method further comprises: each (k+1)-itemset and its Boolean array are obtained by merging, without repetition, a frequent k-itemset and its Boolean array with a frequent 1-itemset and its Boolean array;
during the merge without repetition, a logical AND is applied to the logical values at the same positions of the Boolean array of the frequent k-itemset and the Boolean array of the frequent 1-itemset, giving the Boolean array of a candidate frequent (k+1)-itemset;
the support of every candidate frequent (k+1)-itemset is calculated, and the (k+1)-itemsets whose support is less than the minimum support threshold are rejected, giving the frequent (k+1)-itemsets.
Preferably, the translated documents are categorized by language, industry and subject field.
Further, the present invention also provides a preferred embodiment:
Based on the translated documents on a cloud translation platform, a document requirement information table is built, as shown in Table 1.
Table 1 is as follows:
[Figure BDA0000427072320000061: image containing the first rows of Table 1, not reproduced]
Document No.  Client No.  Region  Category  Translation direction
T0006  C003  BJ  B  ENCN (English to Chinese)
T0007  C003  BJ  C  ENCN (English to Chinese)
T0008  C004  BJ  A  CNEN (Chinese to English)
T0009  C004  BJ  B  ENCN (English to Chinese)
T0010  C004  BJ  D  ENCN (English to Chinese)
T0011  C005  BJ  A  CNEN (Chinese to English)
T0012  C005  BJ  C  ENCN (English to Chinese)
T0013  C006  BJ  B  ENCN (English to Chinese)
T0014  C006  BJ  C  ENCN (English to Chinese)
T0015  C007  BJ  A  CNEN (Chinese to English)
T0016  C007  BJ  C  ENCN (English to Chinese)
T0017  C008  BJ  A  CNEN (Chinese to English)
T0018  C008  BJ  B  ENCN (English to Chinese)
T0019  C008  BJ  C  ENCN (English to Chinese)
T0020  C008  BJ  E  CNEN (Chinese to English)
T0021  C009  BJ  A  CNEN (Chinese to English)
T0022  C009  BJ  B  ENCN (English to Chinese)
T0023  C009  BJ  C  ENCN (English to Chinese)
For example, the first row of the table indicates that document T0001 belongs to category "A", its translation direction is Chinese to English, its client is C001, and the client is located in Beijing.
The requirement information items in the client document requirement information table are merged by client, giving the transaction database D on which the association rule analysis is finally performed. The transaction database has two columns: client number and client requirement items.
Table 2: transaction database D
Client number  Client requirement items
C001  A.CNEN.BJ, B.ENCN.BJ, E.CNEN.BJ
C002  B.ENCN.BJ, D.ENCN.BJ
C003  B.ENCN.BJ, C.ENCN.BJ
C004  A.CNEN.BJ, B.ENCN.BJ, D.ENCN.BJ
C005  A.CNEN.BJ, C.ENCN.BJ
C006  B.ENCN.BJ, C.ENCN.BJ
C007  A.CNEN.BJ, C.ENCN.BJ
C008  A.CNEN.BJ, B.ENCN.BJ, C.ENCN.BJ, E.CNEN.BJ
C009  A.CNEN.BJ, B.ENCN.BJ, C.ENCN.BJ
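The merge from Table 1 into Table 2 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the tuple layout, the function name `build_transactions` and the "category.direction.region" item format are assumptions inferred from the tables, and only rows T0006 to T0010 are used as sample input.

```python
from collections import defaultdict

# (document no., client no., region, category, direction): rows T0006-T0010 of Table 1.
records = [
    ("T0006", "C003", "BJ", "B", "ENCN"),
    ("T0007", "C003", "BJ", "C", "ENCN"),
    ("T0008", "C004", "BJ", "A", "CNEN"),
    ("T0009", "C004", "BJ", "B", "ENCN"),
    ("T0010", "C004", "BJ", "D", "ENCN"),
]


def build_transactions(rows):
    """Merge document records by client into lists of requirement items."""
    merged = defaultdict(list)
    for _doc_no, client, region, category, direction in rows:
        item = f"{category}.{direction}.{region}"
        if item not in merged[client]:   # keep each requirement item once per client
            merged[client].append(item)
    return dict(merged)


transactions = build_transactions(records)
```

For the sample rows, `transactions` holds the same requirement items that Table 2 lists for clients C003 and C004.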
The transaction database D is scanned and a requirement item table is built from it, as shown in Table 3. The table has three columns: the first is the sequence number of the requirement item; the second is the name of the requirement item; the third is a Boolean array whose length is the number of records in the transaction database D. The Boolean array takes its values as follows: if the corresponding requirement item is present in the i-th record of D, the i-th element of the array is set to 1 (true); otherwise it is set to 0.
Table 3: requirement item table
Sequence number  Requirement item name  Boolean array
1  A.CNEN.BJ  100110111
2  B.ENCN.BJ  111101011
3  C.ENCN.BJ  001011111
4  D.ENCN.BJ  010100000
5  E.CNEN.BJ  100000010
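A minimal Python sketch of the single scan that produces the Boolean arrays of Table 3 is given below. It assumes the transaction database is held as a dictionary in record order, with requirement item names shortened to their category letters A to E; the function name `boolean_arrays` is hypothetical.

```python
# Transaction database D from Table 2, item names shortened to A..E.
D = {
    "C001": {"A", "B", "E"},
    "C002": {"B", "D"},
    "C003": {"B", "C"},
    "C004": {"A", "B", "D"},
    "C005": {"A", "C"},
    "C006": {"B", "C"},
    "C007": {"A", "C"},
    "C008": {"A", "B", "C", "E"},
    "C009": {"A", "B", "C"},
}


def boolean_arrays(db):
    """Scan db once and build one '0'/'1' string per item, one digit per record."""
    items = sorted(set().union(*db.values()))
    arrays = {item: [] for item in items}
    for record_items in db.values():      # dicts preserve insertion (record) order
        for item in items:
            arrays[item].append("1" if item in record_items else "0")
    return {item: "".join(bits) for item, bits in arrays.items()}


table3 = boolean_arrays(D)
```

Each string in `table3` has nine digits, one per record of D, in the order C001 to C009.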
Frequent 1-itemsets are calculated from Table 3: the 1-itemsets whose Boolean array contains at least the minimum support count of true values (here the minimum support count is set to 2) are selected, giving the frequent 1-itemset table.
Table 4: frequent 1-itemset table
Sequence number  Requirement item name  Boolean array  Support count
1  A.CNEN.BJ  100110111  6
2  B.ENCN.BJ  111101011  7
3  C.ENCN.BJ  001011111  6
4  D.ENCN.BJ  010100000  2
5  E.CNEN.BJ  100000010  2
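The selection of frequent 1-itemsets can be sketched in Python as follows, assuming the Boolean arrays of Table 3 are stored as '0'/'1' strings keyed by category letter. The function name `frequent_1_itemsets` is hypothetical, and the second call with a minimum support count of 3 is an added illustration that does not appear in the patent.

```python
# Boolean arrays from Table 3, keyed by category letter.
table3 = {
    "A": "100110111",
    "B": "111101011",
    "C": "001011111",
    "D": "010100000",
    "E": "100000010",
}


def frequent_1_itemsets(arrays, min_count):
    """Keep items whose Boolean array holds at least min_count true values."""
    counts = {item: bits.count("1") for item, bits in arrays.items()}
    return {item: n for item, n in counts.items() if n >= min_count}


table4_counts = frequent_1_itemsets(table3, min_count=2)   # threshold used in the text
stricter = frequent_1_itemsets(table3, min_count=3)        # hypothetical stricter threshold
```

With the threshold of 2 all five items survive, as in Table 4; raising it to 3 would reject D and E.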
The Boolean arrays of the i-th and j-th records in the frequent 1-itemset table (Table 4) are combined element by element with a logical AND, giving a new Boolean array. If the number of true values in this array is not less than the minimum support count, the 2-itemset formed by the requirement items of the i-th and j-th records is a frequent itemset. This gives the frequent 2-itemsets in the following table:
Table 5: frequent 2-itemset table
Sequence number  Requirement item name  Boolean array  Support count
1  A, B  100100011  4
2  A, C  000010111  4
3  A, E  100000010  2
4  B, C  001001011  4
5  B, D  010100000  2
6  B, E  100000010  2
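The pairwise AND step can be sketched in Python as follows. The sketch assumes Boolean arrays stored as '0'/'1' strings and converts them to integers so that the AND is a single bitwise operation; the names `and_arrays` and `frequent_2_itemsets` are hypothetical.

```python
from itertools import combinations

# Frequent 1-itemsets and their Boolean arrays (Table 4).
frequent_1 = {
    "A": "100110111",
    "B": "111101011",
    "C": "001011111",
    "D": "010100000",
    "E": "100000010",
}


def and_arrays(x: str, y: str) -> str:
    """Logical AND of two equal-length '0'/'1' strings."""
    return format(int(x, 2) & int(y, 2), f"0{len(x)}b")


def frequent_2_itemsets(f1, min_count):
    """AND every pair of frequent 1-itemsets and keep the frequent pairs."""
    result = {}
    for a, b in combinations(sorted(f1), 2):
        bits = and_arrays(f1[a], f1[b])
        if bits.count("1") >= min_count:
            result[(a, b)] = bits
    return result


table5 = frequent_2_itemsets(frequent_1, min_count=2)
```

Pairs such as (A, D), whose AND result contains a single true value, fall below the minimum support count and are dropped.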
The i-th record of the frequent k-itemset table and the j-th record of the frequent 1-itemset table are then examined. If merging their requirement item names gives a (k+1)-itemset that has not been merged before, that (k+1)-itemset is marked as "merged" and a logical AND is applied to the Boolean arrays of the two records. If the number of true values in the new Boolean array is not less than the minimum support count, the (k+1)-itemset is a frequent itemset.
Table 7: frequent 3-itemset table obtained from the frequent 2-itemsets and the frequent 1-itemsets
Sequence number  Requirement item name  Boolean array  Support count
1  A, B, C  000000011  2
2  A, B, E  100000010  2
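The level-by-level growth from frequent k-itemsets to frequent (k+1)-itemsets can be sketched in Python as follows. This is an illustrative reading of the step described above, not the patent's own code: the function name `mine_frequent_itemsets`, the use of `frozenset` keys and the `seen` set that plays the role of the "merged" mark are all assumptions.

```python
# Frequent 1-itemsets and their Boolean arrays (Table 4).
frequent_1 = {
    "A": "100110111",
    "B": "111101011",
    "C": "001011111",
    "D": "010100000",
    "E": "100000010",
}


def mine_frequent_itemsets(f1, min_count):
    """Grow frequent k-itemsets into (k+1)-itemsets until a level is empty."""
    def and_bits(x, y):
        # Logical AND of two equal-length '0'/'1' strings.
        return format(int(x, 2) & int(y, 2), f"0{len(x)}b")

    levels = [{frozenset([item]): bits for item, bits in f1.items()}]
    while levels[-1]:
        seen, next_level = set(), {}
        for itemset, bits in levels[-1].items():
            for item, item_bits in f1.items():
                candidate = itemset | {item}
                if item in itemset or candidate in seen:
                    continue              # repeated merge, skip it
                seen.add(candidate)       # mark the candidate as "merged"
                merged = and_bits(bits, item_bits)
                if merged.count("1") >= min_count:
                    next_level[candidate] = merged
        levels.append(next_level)
    return levels[:-1]                    # drop the trailing empty level


levels = mine_frequent_itemsets(frequent_1, min_count=2)
```

Here `levels[0]`, `levels[1]` and `levels[2]` hold the frequent 1-, 2- and 3-itemsets; the loop ends when no frequent 4-itemset is found.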
The frequent 4-itemsets obtained from the frequent 3-itemsets and the frequent 1-itemsets are empty, so the search for frequent itemsets stops. The frequent itemsets obtained are as follows:
Sequence number  Requirement item name  Boolean array  Support count
1  A.CNEN.BJ  100110111  6
[Figure BDA0000427072320000111: image containing the remaining rows of this table, not reproduced]
Using the association-degree formula, the degree of association between each of the subsets A, B, C, AB, AC and BC and the itemset {A, B, C} is calculated and compared with the confidence threshold.
The calculation is as follows:
confidence(S ⇒ L) = support_count(L) / support_count(S), and the result is compared with min_conf;
where min_conf is the minimum confidence threshold, support_count(L) is the support count of the frequent itemset L, and support_count(S) is the support count of the frequent itemset S, a subset of L.
If the result is not less than min_conf, the association rule "S is associated with L" is output;
that is, a client with the requirements in S is likely to have the requirement items in L as well.
According to the association rule, the services in the client requirement items are promoted to clients whose requirements match a subset of those items.
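The confidence check for the itemset {A, B, C} can be sketched in Python as follows. The support counts are those of Tables 4, 5 and 7; the function name `rules_for` and the threshold `min_conf=0.5` are assumptions made for illustration, since the patent does not state a numeric confidence threshold.

```python
from itertools import combinations

# Support counts taken from Tables 4, 5 and 7.
support_count = {
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 6,
    frozenset("AB"): 4, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("ABC"): 2,
}


def rules_for(itemset, counts, min_conf):
    """Return {S: confidence} for proper subsets S with confidence >= min_conf."""
    rules = {}
    for size in range(1, len(itemset)):
        for subset in combinations(sorted(itemset), size):
            # confidence(S => L) = support_count(L) / support_count(S)
            conf = counts[itemset] / counts[frozenset(subset)]
            if conf >= min_conf:
                rules[subset] = conf
    return rules


# min_conf = 0.5 is a hypothetical threshold chosen for this example.
rules = rules_for(frozenset("ABC"), support_count, min_conf=0.5)
```

Under this assumed threshold only the two-item subsets AB, AC and BC yield rules (confidence 2/4 each), while the single items A, B and C fall below it (2/6, 2/7 and 2/6).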
The above are only preferred embodiments of the present invention and do not limit it. For those skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (6)

1. An in-depth mining method for translation requirements, characterized by comprising:
extracting a number of translated documents and building a document information set from the translation information in the translated documents, each record in the document information set corresponding to one translated document;
each record in the document information set comprising the following features: the client, the region where the client is located, the category of the corresponding translated document, and the translation direction of that document;
merging all records in the document information set by the client feature to obtain a transaction database, each record in the transaction database containing a client requirement set obtained by combining the region where the client is located, the category of the corresponding translated document and the translation direction of that document;
performing association calculation on the records in the transaction database to obtain association rules between a client requirement set and its subsets.
2. The method according to claim 1, characterized in that the association calculation comprises:
deriving frequent (k+1)-itemsets iteratively from the records in the transaction database, calculating the degree of association between each subset of a frequent (k+1)-itemset and that frequent (k+1)-itemset, and outputting the association rule when the result meets the confidence threshold.
3. The method according to claim 2, characterized in that the process of deriving the frequent (k+1)-itemsets comprises:
the client requirement set of each record in the transaction database containing at least one client requirement;
scanning the transaction database and obtaining all 1-itemsets in the transaction database from the client requirements in its records;
calculating the support of each 1-itemset to obtain the frequent 1-itemsets, whose support is not less than the minimum support threshold;
merging frequent k-itemsets with frequent 1-itemsets without repetition to generate the frequent (k+1)-itemsets whose support is not less than the minimum support threshold.
4. The method according to claim 3, characterized by further comprising: each 1-itemset corresponds to a Boolean array whose length is the total number of records in the transaction database, and the positions of the Boolean array correspond one-to-one to the records of the transaction database in record order;
if a record in the transaction database contains the item of the 1-itemset, the logical value at the position corresponding to that record is set to 1; otherwise it is set to 0;
the support of every 1-itemset is calculated, and the 1-itemsets whose support is less than the minimum support threshold are rejected, giving the frequent 1-itemsets;
wherein the support is the ratio of the number of 1s in the Boolean array to the length of the Boolean array.
5. The method according to claim 4, characterized by further comprising: each (k+1)-itemset and its Boolean array are obtained by merging, without repetition, a frequent k-itemset and its Boolean array with a frequent 1-itemset and its Boolean array;
during the merge without repetition, a logical AND is applied to the logical values at the same positions of the Boolean array of the frequent k-itemset and the Boolean array of the frequent 1-itemset, giving the Boolean array of a candidate frequent (k+1)-itemset;
the support of every candidate frequent (k+1)-itemset is calculated, and the (k+1)-itemsets whose support is less than the minimum support threshold are rejected, giving the frequent (k+1)-itemsets.
6. The method according to claim 1, characterized in that the translated documents are categorized by language, industry and subject field.
CN201310638833.6A 2013-11-30 2013-11-30 In-depth mining method for translation requirements Pending CN103678540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310638833.6A CN103678540A (en) 2013-11-30 2013-11-30 In-depth mining method for translation requirements

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310638833.6A CN103678540A (en) 2013-11-30 2013-11-30 In-depth mining method for translation requirements

Publications (1)

Publication Number Publication Date
CN103678540A true CN103678540A (en) 2014-03-26

Family

ID=50316085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310638833.6A Pending CN103678540A (en) 2013-11-30 2013-11-30 In-depth mining method for translation requirements

Country Status (1)

Country Link
CN (1) CN103678540A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1578955A (en) * 2001-09-04 2005-02-09 国际商业机器公司 Sampling approach for data mining of association rules
US20100268734A1 (en) * 2004-07-16 2010-10-21 International Business Machines Corporation System and method for distributed privacy preserving data mining

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
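徐艳等 (Xu Yan et al.): "电信客户服务需求的关联规则挖掘" (Association rule mining of telecom customer service requirements), 《信息通信》 *
方炜炜等 (Fang Weiwei et al.): "基于布尔矩阵的关联规则算法研究" (Research on association rule algorithms based on Boolean matrices), 《计算机应用研究》 *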

Similar Documents

Publication Publication Date Title
CN104809242B (en) A kind of big data clustering method and device based on distributed frame
CN110134719B (en) A method for identifying and classifying sensitive attributes of structured data
CN110377605B (en) A Sensitive Attribute Identification and Classification Method for Structured Data
CN110688549B (en) Artificial intelligence classification method and system based on knowledge system map construction
CN104239553A (en) Entity recognition method based on Map-Reduce framework
CN103678530A (en) Rapid detection method of frequent item sets
CN110321446B (en) Related data recommendation method and device, computer equipment and storage medium
CN107944465A (en) A kind of unsupervised Fast Speed Clustering and system suitable for big data
CN114490667B (en) Multi-dimensional data analysis method, device, electronic device and medium
Christen Towards parameter-free blocking for scalable record linkage
CN110781943A (en) A Clustering Method Based on Adjacent Grid Search
CN103678540A (en) In-depth mining method for translation requirements
Priya et al. Entity resolution for high velocity streams using semantic measures
Lu et al. Knowledge extraction from structured engineering drawings
Islambekov et al. A fast topological approach for predicting anomalies in time-varying graphs
Maradana et al. Original Research Article One shot alpha numeric weight based clustering algorithm with user threshold
Manikandan et al. The Study on Clustering Analysis in Data Mining
Zhang et al. An approximate approach to frequent itemset mining
CN119312810B (en) A big data relationship mining method based on graph computing
Kothari et al. ’Survey of various clustering techniques for big data in data mining’
Routray et al. Adaptation of Fast Modified Frequent Pattern Growth approach for frequent item sets mining in Telecommunication Industry
He et al. Enterprise human resources information mining based on improved Apriori algorithm
CN117785987A (en) Data tag information mining method
Meng et al. CABGD: an improved clustering algorithm based on grid-density
Yang et al. An integrated approach for detecting approximate duplicate records

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant before: Wuhan Transn Information Technology Co., Ltd.

COR Change of bibliographic data
RJ01 Rejection of invention patent application after publication

Application publication date: 20140326