forked from munibanust/febrl
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathpytrainhmm.py
More file actions
334 lines (269 loc) · 12.5 KB
/
pytrainhmm.py
File metadata and controls
334 lines (269 loc) · 12.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
# =============================================================================
# pyTrainHMM.py - Main module to train a Hidden Markov Model (HMM)
#
# Freely extensible biomedical record linkage (Febrl) Version 0.1
# See http://datamining.anu.edu.au/projects/linkage.html
#
# =============================================================================
# AUSTRALIAN NATIONAL UNIVERSITY OPEN SOURCE LICENSE (ANUOS LICENSE)
# VERSION 1.0
#
# The contents of this file are subject to the ANUOS License Version 1.0 (the
# "License"); you may not use this file except in compliance with the License.
# Software distributed under the License is distributed on an "AS IS" basis,
# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for
# the specific language governing rights and limitations under the License.
# The Original Software is "pyTrainHMM.py".
# The Initial Developers of the Original Software are Dr Peter Christen
# (Department of Computer Science, Australian National University), Dr Tim
# Churches (Centre for Epidemiology and Research, New South Wales Department
# of Health) and Drs Markus Hegland, Stephen Roberts and Ole Nielsen
# (Mathematical Sciences Insitute, Australian National University). Copyright
# (C) 2002 the Australian National University and others. All Rights Reserved.
# Contributors:
#
# =============================================================================
"""Module pyTrainHMM.py - Main module to train a Hidden Markov Model (HMM)
USAGE:
pyTrainHMM.py [project_module] [tag_mode] [train_file] [hmm_file]
[options]
ARGUMENTS:
project_module A Python module containing all project settings.
tag_mode Mode for tagging, must be either 'name' or 'locality'
train_file Name of the input data file containing the tagged
training records.
hmm_file The name of the HMM file to be written. It's a text file
containing the state and observation (tag) names; initial,
transition and observation probabilities.
OPTIONS (after arguments);
-s [method] Smoothing of HMM parameters. The method can either be set
to 'laplace' (for Laplace smoothing) or 'absdiscount'
(for the absolute discount smoothing method).
-l [file_name] Log warnings and errors into the given file
-v1 Verbose output level 1 (low output)
-v2 Verbose output level 2 (high output)
-nowarn Don't print warning messages
DESCRIPTION:
This module can be used to train a Hidden Markov Model (HMM) using tagged
data as created by 'pyTagData.py' and then edited manually. The 'tag_mode'
argument (either 'name' or 'locality') is used to get the possible HMM
states and observations from the 'config.py' module. They are also given
in the file: './hmm/hmm-states.txt'
The format of the input file is as follows:
- Comment lines must start with a hash character (#).
- Empty lines are possible.
- Each non-empty line that is not a comment line must contain one training
record.
- Training records must contain a comma separated list of pairs
'observation:state' (i.e. tag:state). Only known states and observations
(as listed in 'config.py') are allowed, unknown states or observations
will result in an error.
For verbose output, add the option '-v1' or '-v2' with the second option
producing more detailed output than the first one.
To suppress the printing of warning messages used the '-nowarn' option.
If logging is activated, warning messages are always logged.
If you want to save warning and error messages into a file, use the '-l'
option followed by the file name for the log file. If the log file already
exists, new information will be appended, otherwise the file will be
created. Logging is done according to the verbose level. If no verbose
level is given, only warning and error messages will be saved to the log
file.
The output file is a text file with all parameters of the HMM
(see 'simplehmm.py' module for more details).
For more information on the smoothing methods see the 'simplehmm.py'
module ('train' routine) or e.g.
V.Borkar et.al., Automatic Segmentation of Text into Structured Records
Section 2.2
EXAMPLE:
The following lines are an example of the input file format:
# 0: |dr peter baxter dea|
# TI:, GM:, GM:, GF:
# TI:, GM:, SN:, GF:
TI:title, GM:givenname, GM:givenname, SN:surname
# TI:, GM:, SN:, SN:
# 1: |miss monica mitchell meyer|
# TI:, UN:, GM:, SN:
TI:title, UN:givenname, SN:surname, SN:surname
# 2: |phd tim william jones harris|
TI:title, GM:givenname, GM:altgivenname, UN:surname, SN:altsurname
"""
# -----------------------------------------------------------------------------
import sys
import os
import xreadlines
import time
import config
import inout
import simplehmm
# -----------------------------------------------------------------------------
def trainhmm():
"""Main routine, open file, read lines, train HMM and save it to file.
USAGE:
trainhmm()
ARGUMENTS:
None
DESCRIPTION:
Main routine, see description of module above.
"""
# Process command line arguments and check for correctness - - - - - - - - -
#
if (len(config.options) < 3):
print '***** Error: %s needs at least four arguments:'% (sys.argv[0])
print '***** - Name of the project module'
print '***** - Tagging mode: "name" or "locality"'
print '***** - Input training file name'
print '***** - HMM output file name'
print '***** plus options'
raise Exception()
if (config.options[1] == config.options[2]):
print '*** Error: Input and output files must differ'
print '*** Input training file name:', config.options[1]
print '*** HMM output file name: ', config.options[1]
raise Exception()
in_file_name = config.options[1]
hmm_file_name = config.options[2]
# Get tagging mode/lookup-tables used - - - - - - - - - - - - - - - - - - - -
#
tag_mode = config.options[0]
if (tag_mode in ['name','na','n']):
tag_mode = 'name'
elif (tag_mode in ['locality','lolty','loc','l']):
tag_mode = 'loc'
else:
print '***** Error: Illegal tagging mode:', tag_mode
print '***** Must be either "name" or "locality"'
raise Exception()
# Check for optional arguments and process if any - - - - - - - - - - - - - -
#
config.verbose = 0 # Default: No verbose output
config.logging = 0 # Default: No logging into a file
smoothing = None # Default: No smoothing
config.nowarn = 0 # Deactivate no warning flag (print/log warning
# messages)
if (len(config.options) > 3):
options = config.options[3:]
while (options != []): # Do a loop processing all options
if (options[0] == '-nowarn'):
config.nowarn = 1 # Activate no warning flag
options = options[1:] # Remove processed '-nowarn' option
elif (options[0] == '-v1'):
config.verbose = 1 # Set to verbose output level 1
options = options[1:] # Remove processed '-v1' option
elif (options[0] == '-v2'):
config.verbose = 2 # Set to verbose output level 2
options = options[1:] # Remove processed '-v2' option
elif (options[0] == '-l'):
config.logging = 1
if (len(options) > 1):
if (options[1][0] != '-'): # Not another option, must be a file name
config.log_file = options[1] # Get name of log file
options = options[1:] # Remove file_name
options = options[1:] # Remove processed -'l' option only
try:
f_log = open(config.log_file,'a') # Test if file is appendable
except:
print '***** Error ********************',
print '***** Cannot write to log file: '+config.log_file
raise IOError()
# Write (append) header to log file
#
f_log.write(os.linesep)
f_log.write('##################################################')
f_log.write("############"+os.linesep)
f_log.write("#"+os.linesep)
f_log.write("# 'pyTrainHMM.py - Version 0.1' process started at: ")
f_log.write(time.ctime(time.time())+os.linesep)
f_log.write("#"+os.linesep)
f_log.write("# Input file name: "+in_file_name+os.linesep)
f_log.write("# HMM file name: "+hmm_file_name+os.linesep)
f_log.write(os.linesep)
f_log.close()
elif (options[0] == '-s'):
smoothing = 1 # Set to do a HMM smoothing
smoothing = options[1]
if (smoothing in ['l','la','lap','laplac','laplace']):
smoothing = 'laplace'
elif (smoothing in ['a','ad','abs','absd','absdis','absdisc',\
'absdiscount']):
smoothing = 'absdiscount'
else: # Illegal value
print "*** Error: Illegal value for 'smoothing' argument:", smoothing
print "*** Possible are: 'laplace' or 'absdiscount'"
raise Exception()
options = options[2:] # Remove processed option
else:
print '*** Error: Illegal option:', options[0]
raise Exception()
# Get HMM states and observations from configuration module - - - - - - - - -
#
if (tag_mode == 'name'):
state_list = config.name_hmm_states
obser_list = config.name_hmm_obser
else:
state_list = config.geoloc_hmm_states
obser_list = config.geoloc_hmm_obser
# Open input file - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
try:
f_in = open(in_file_name,'r')
except:
inout.log_message('Cannot open input file: '+in_file_name,'err')
raise IOError()
line_count = 0 # Counter for lines read
rec_count = 0 # Counter for training records read
# Read lines, discard comment lines and process training data lines - - - - -
#
training_data = [] # List of training records
train_list = [] # List of training sequences (dictionaries), extracted from
# training data
for line in xreadlines.xreadlines(f_in):
if (line[0] != '#') and (line.strip() != ''):
# Line must contain a training record
line = line.strip() # Remove line separators
config.curr_line = line # Make a copy of the unprocessed current line
line_list = line.split(',') # Split into a list of elements
line_data = [] # Training data list for one training record
inout.log_message(['Record number: '+str(rec_count)],'v1')
config.curr_line_no = line_count # Store current line number
for elem in line_list:
[k,v] = elem.split(':') # Split into key and value
tag = k.strip()
state = v.strip()
line_data.append((state,tag))
if (state not in state_list):
msg = ['Illegal state name in training record: '+state, \
'Line: '+str(line_count)+', record: '+str(rec_count), \
'Possible values: '+str(state_list)]
inout.log_message(msg,'err')
raise Exception()
if (tag not in obser_list):
msg = ['Illegal observation (tag) name in training record: '+tag, \
'Line: '+str(line_count)+', record: '+str(rec_count), \
'Possible values: '+str(obser_list)]
inout.log_message(msg,'err')
raise Exception()
inout.log_message(' Training record '+str(rec_count)+':'+ \
str(line_data),'v1')
train_list.append(line_data)
rec_count += 1
inout.log_message('','v1') # Print empty lines between records
line_count += 1
# Close input file - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
f_in.close()
inout.log_message('','v1') # Print empty lines between records
# Initalise HMM and train it with training data - - - - - - - - - - - - - - -
#
myhmm = simplehmm.hmm(state_list, obser_list)
myhmm.train(train_list,smoothing)
myhmm.print_hmm()
# Save trained HMM - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
myhmm.save_hmm(hmm_file_name)
inout.log_message(['Read '+str(line_count)+' lines, processed '+ \
str(rec_count)+' training records', 'End.'],'v1')
# ----------------------------------------------------------------------------
# Start main data HMM training routine
#
trainhmm()
# ----------------------------------------------------------------------------