comparison roundup/pygettext.py @ 8080:d1c29284ccd9

feat: issue2551287 - roundup-gettext extracts strings from detectors/extensions

Enhance roundup_gettext.py to extract strings from detectors/extensions. If the polib module is available, roundup-gettext will extract translatable strings from the tracker's Python code. If polib is missing, it will print a warning.

Marcus did most of the work; I had to do a Python 2 -> 3 conversion of pygettext.py.
author John Rouillard <rouilj@ieee.org>
date Sat, 13 Jul 2024 18:27:11 -0400
parents
children a4127d7afaa9
8079:e3c5f85af7d5 8080:d1c29284ccd9
1 #! /usr/bin/env python
2 # Originally written by Barry Warsaw <barry@python.org>
3 #
4 # Minimally patched to make it even more xgettext compatible
5 # by Peter Funk <pf@artcom-gmbh.de>
6 #
7 # 2002-11-22 Jürgen Hermann <jh@web.de>
8 # Added checks that _() only contains string literals, and
9 # command line args are resolved to module lists, i.e. you
10 # can now pass a filename, a module or package name, or a
11 # directory (including globbing chars, important for Win32).
12 # Made docstring fit in 80 chars wide displays using pydoc.
13 #
14 # 2024-07-13 John Rouillard (rouilj@users.sourceforge.net)
15 # Converted from python 2.
16
17 from __future__ import print_function
18
19 # for selftesting
20 try:
21 import fintl
22 _ = fintl.gettext
23 except ImportError:
24 _ = lambda s: s
25
26 __doc__ = _("""pygettext -- Python equivalent of xgettext(1)
27
28 Many systems (Solaris, Linux, Gnu) provide extensive tools that ease the
29 internationalization of C programs. Most of these tools are independent of
30 the programming language and can be used from within Python programs.
31 Martin von Loewis' work[1] helps considerably in this regard.
32
33 There's one problem though; xgettext is the program that scans source code
34 looking for message strings, but it groks only C (or C++). Python
35 introduces a few wrinkles, such as dual quoting characters, triple quoted
36 strings, and raw strings. xgettext understands none of this.
37
38 Enter pygettext, which uses Python's standard tokenize module to scan
39 Python source code, generating .pot files identical to what GNU xgettext[2]
40 generates for C and C++ code. From there, the standard GNU tools can be
41 used.
42
43 A word about marking Python strings as candidates for translation. GNU
44 xgettext recognizes the following keywords: gettext, dgettext, dcgettext,
45 and gettext_noop. But those can be a lot of text to include all over your
46 code. C and C++ have a trick: they use the C preprocessor. Most
47 internationalized C source includes a #define for gettext() to _() so that
48 what has to be written in the source is much less. Thus these are both
49 translatable strings:
50
51 gettext("Translatable String")
52 _("Translatable String")
53
54 Python of course has no preprocessor so this doesn't work so well. Thus,
55 pygettext searches only for _() by default, but see the -k/--keyword flag
56 below for how to augment this.
57
58 [1] http://www.python.org/workshops/1997-10/proceedings/loewis.html
59 [2] http://www.gnu.org/software/gettext/gettext.html
60
61 NOTE: pygettext attempts to be option and feature compatible with GNU
62 xgettext wherever possible. However, some options are still missing or are
63 not fully implemented. Also, xgettext's use of command line switches with
64 option arguments is broken, and in these cases, pygettext just defines
65 additional switches.
66
67 Usage: pygettext [options] inputfile ...
68
69 Options:
70
71 -a
72 --extract-all
73 Extract all strings.
74
75 -d name
76 --default-domain=name
77 Rename the default output file from messages.pot to name.pot.
78
79 -E
80 --escape
81 Replace non-ASCII characters with octal escape sequences.
82
83 -D
84 --docstrings
85 Extract module, class, method, and function docstrings. These do
86 not need to be wrapped in _() markers, and in fact cannot be for
87 Python to consider them docstrings. (See also the -X option).
88
89 -h
90 --help
91 Print this help message and exit.
92
93 -k word
94 --keyword=word
95 Keywords to look for in addition to the default set, which are:
96 %(DEFAULTKEYWORDS)s
97
98 You can have multiple -k flags on the command line.
99
100 -K
101 --no-default-keywords
102 Disable the default set of keywords (see above). Any keywords
103 explicitly added with the -k/--keyword option are still recognized.
104
105 --no-location
106 Do not write filename/lineno location comments.
107
108 -n
109 --add-location
110 Write filename/lineno location comments indicating where each
111 extracted string is found in the source. These lines appear before
112 each msgid. The style of comments is controlled by the -S/--style
113 option. This is the default.
114
115 -o filename
116 --output=filename
117 Rename the default output file from messages.pot to filename. If
118 filename is `-' then the output is sent to standard out.
119
120 -p dir
121 --output-dir=dir
122 Output files will be placed in directory dir.
123
124 -S stylename
125 --style=stylename
126 Specify which style to use for location comments. Two styles are
127 supported:
128
129 Solaris # File: filename, line: line-number
130 GNU #: filename:line
131
132 The style name is case insensitive. GNU style is the default.
133
134 -v
135 --verbose
136 Print the names of the files being processed.
137
138 -V
139 --version
140 Print the version of pygettext and exit.
141
142 -w columns
143 --width=columns
144 Set width of output to columns.
145
146 -x filename
147 --exclude-file=filename
148 Specify a file that contains a list of strings that are not to be
149 extracted from the input files. Each string to be excluded must
150 appear on a line by itself in the file.
151
152 -X filename
153 --no-docstrings=filename
154 Specify a file that contains a list of files (one per line) that
155 should not have their docstrings extracted. This is only useful in
156 conjunction with the -D option above.
157
158 If `inputfile' is -, standard input is read.
159 """)
160
161 import os
162 import importlib
163 import sys
164 import glob
165 import time
166 import getopt
167 import token
168 import tokenize
169 import operator
170
171 from functools import reduce
172
173 __version__ = '1.5'
174
175 default_keywords = ['_']
176 DEFAULTKEYWORDS = ', '.join(default_keywords)
177
178 EMPTYSTRING = ''
179
180 # The normal pot-file header. msgmerge and Emacs's po-mode work better if it's
181 # there.
182 pot_header = _('''\
183 # SOME DESCRIPTIVE TITLE.
184 # Copyright (C) YEAR ORGANIZATION
185 # FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
186 #
187 msgid ""
188 msgstr ""
189 "Project-Id-Version: PACKAGE VERSION\\n"
190 "POT-Creation-Date: %(time)s\\n"
191 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\\n"
192 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\\n"
193 "Language-Team: LANGUAGE <LL@li.org>\\n"
194 "Language: \\n"
195 "MIME-Version: 1.0\\n"
196 "Content-Type: text/plain; charset=CHARSET\\n"
197 "Content-Transfer-Encoding: ENCODING\\n"
198 "Generated-By: pygettext.py %(version)s\\n"
199
200 ''')
201
202 def usage(code, msg=''):
203 print(__doc__ % globals(), file=sys.stderr)
204 if msg:
205 print(msg, file=sys.stderr)
206 sys.exit(code)
207
208
209 escapes = []
210
211 def make_escapes(pass_iso8859):
212 global escapes
213 escapes = [chr(i) for i in range(256)]
214 if pass_iso8859:
215 # Allow iso-8859 characters to pass through so that e.g. 'msgid
216 # "Höhe"' would not result in 'msgid "H\366he"'. Otherwise we
217 # escape any character outside the 32..126 range.
218 mod = 128
219 else:
220 mod = 256
221 for i in range(mod):
222 if not(32 <= i <= 126):
223 escapes[i] = "\\%03o" % i
224 escapes[ord('\\')] = '\\\\'
225 escapes[ord('\t')] = '\\t'
226 escapes[ord('\r')] = '\\r'
227 escapes[ord('\n')] = '\\n'
228 escapes[ord('\"')] = '\\"'
229
230
231 def escape(s):
232 global escapes
233 s = list(s)
234 for i in range(len(s)):
235 s[i] = escapes[ord(s[i])] if ord(s[i]) < 256 else s[i]
236 return EMPTYSTRING.join(s)
237
238
239 def safe_eval(s):
240 # unwrap quotes, safely
241 return eval(s, {'__builtins__':{}}, {})
242
243
244 def normalize(s):
245 # This converts the various Python string types into a format that is
246 # appropriate for .po files, namely much closer to C style.
247 lines = s.split('\n')
248 if len(lines) == 1:
249 s = '"' + escape(s) + '"'
250 else:
251 if not lines[-1]:
252 del lines[-1]
253 lines[-1] = lines[-1] + '\n'
254 for i in range(len(lines)):
255 lines[i] = escape(lines[i])
256 lineterm = '\\n"\n"'
257 s = '""\n"' + lineterm.join(lines) + '"'
258 return s
259
260 def containsAny(str, set):
261 """Check whether 'str' contains ANY of the chars in 'set'"""
262 return 1 in [c in str for c in set]
263
264
265 def _get_modpkg_path(dotted_name, pathlist=None):
266 """Get the filesystem path for a module or a package.
267
268 Return the file system path to a file for a module, and to a directory for
269 a package. Return None if the name is not found, or is a builtin or
270 extension module.
271 """
272 # The imp module was removed in Python 3.12; use importlib's path-based
273 # finder, which searches pathlist (or sys.path when pathlist is None).
274 from importlib.machinery import PathFinder, SOURCE_SUFFIXES
275
276 # split off top-most name
277 parts = dotted_name.split('.', 1)
278
279 spec = PathFinder.find_spec(parts[0], pathlist)
280 if spec is None:
281 return None
282
283 if len(parts) > 1:
284 # we have a dotted path, check if the top-level name is a package
285 if spec.submodule_search_locations:
286 # recursively handle the remaining name parts
287 pathname = _get_modpkg_path(
288 parts[1], list(spec.submodule_search_locations))
289 else:
290 pathname = None
291 elif spec.submodule_search_locations:
292 # plain name of a package: return its directory
293 pathname = list(spec.submodule_search_locations)[0]
294 elif spec.origin and spec.origin.endswith(tuple(SOURCE_SUFFIXES)):
295 # plain name of a Python source module: return its file
296 pathname = spec.origin
297 else:
298 # extension module or other non-source file
299 pathname = None
300
301 return pathname
302
303
304 def getFilesForName(name):
305 """Get a list of module files for a filename, a module or package name,
306 or a directory.
307 """
308 if not os.path.exists(name):
309 # check for glob chars
310 if containsAny(name, "*?[]"):
311 files = glob.glob(name)
312 list = []
313 for file in files:
314 list.extend(getFilesForName(file))
315 return list
316
317 # try to find module or package
318 name = _get_modpkg_path(name)
319 if not name:
320 return []
321
322 if os.path.isdir(name):
323 # find all python files in directory
324 list = []
325 # get extension for python source files
326 if '_py_ext' not in globals():
327 global _py_ext
328 from importlib.machinery import SOURCE_SUFFIXES
329 _py_ext = SOURCE_SUFFIXES[0]
330 for root, dirs, files in os.walk(name):
331 # don't recurse into CVS directories
332 if 'CVS' in dirs:
333 dirs.remove('CVS')
334 # add all *.py files to list
335 list.extend(
336 [os.path.join(root, file) for file in files
337 if os.path.splitext(file)[1] == _py_ext]
338 )
339 return list
340 elif os.path.exists(name):
341 # a single file
342 return [name]
343
344 return []
345
346 class TokenEater:
347 def __init__(self, options):
348 self.__options = options
349 self.__messages = {}
350 self.__state = self.__waiting
351 self.__data = []
352 self.__lineno = -1
353 self.__freshmodule = 1
354 self.__curfile = None
355
356 def __call__(self, ttype, tstring, stup, etup, line):
357 # dispatch
358 ## import token
359 ## print(('ttype:', token.tok_name[ttype], \
360 ## 'tstring:', tstring), file=sys.stderr)
361 self.__state(ttype, tstring, stup[0])
362
363 def __waiting(self, ttype, tstring, lineno):
364 opts = self.__options
365 # Do docstring extractions, if enabled
366 if opts.docstrings and not opts.nodocstrings.get(self.__curfile):
367 # module docstring?
368 if self.__freshmodule:
369 if ttype == tokenize.STRING:
370 self.__addentry(safe_eval(tstring), lineno, isdocstring=1)
371 self.__freshmodule = 0
372 elif ttype not in (tokenize.COMMENT, tokenize.NL):
373 self.__freshmodule = 0
374 return
375 # class docstring?
376 if ttype == tokenize.NAME and tstring in ('class', 'def'):
377 self.__state = self.__suiteseen
378 return
379 if ttype == tokenize.NAME and tstring in opts.keywords:
380 self.__state = self.__keywordseen
381
382 def __suiteseen(self, ttype, tstring, lineno):
383 # ignore anything until we see the colon
384 if ttype == tokenize.OP and tstring == ':':
385 self.__state = self.__suitedocstring
386
387 def __suitedocstring(self, ttype, tstring, lineno):
388 # ignore any intervening noise
389 if ttype == tokenize.STRING:
390 self.__addentry(safe_eval(tstring), lineno, isdocstring=1)
391 self.__state = self.__waiting
392 elif ttype not in (tokenize.NEWLINE, tokenize.INDENT,
393 tokenize.COMMENT):
394 # there was no class docstring
395 self.__state = self.__waiting
396
397 def __keywordseen(self, ttype, tstring, lineno):
398 if ttype == tokenize.OP and tstring == '(':
399 self.__data = []
400 self.__lineno = lineno
401 self.__state = self.__openseen
402 else:
403 self.__state = self.__waiting
404
405 def __openseen(self, ttype, tstring, lineno):
406 if ttype == tokenize.OP and tstring == ')':
407 # We've seen the last of the translatable strings. Record the
408 # line number of the first line of the strings and update the list
409 # of messages seen. Reset state for the next batch. If there
410 # were no strings inside _(), then just ignore this entry.
411 if self.__data:
412 self.__addentry(EMPTYSTRING.join(self.__data))
413 self.__state = self.__waiting
414 elif ttype == tokenize.STRING:
415 self.__data.append(safe_eval(tstring))
416 elif ttype not in [tokenize.COMMENT, token.INDENT, token.DEDENT,
417 token.NEWLINE, tokenize.NL]:
418 # warn if we see anything else than STRING or whitespace
419 print(_(
420 '*** %(file)s:%(lineno)s: Seen unexpected token "%(token)s"'
421 ) % {
422 'token': tstring,
423 'file': self.__curfile,
424 'lineno': self.__lineno
425 }, file=sys.stderr)
426 self.__state = self.__waiting
427
428 def __addentry(self, msg, lineno=None, isdocstring=0):
429 if lineno is None:
430 lineno = self.__lineno
431 if not msg in self.__options.toexclude:
432 entry = (self.__curfile, lineno)
433 self.__messages.setdefault(msg, {})[entry] = isdocstring
434
435 def set_filename(self, filename):
436 self.__curfile = filename
437 self.__freshmodule = 1
438
439 def write(self, fp):
440 options = self.__options
441 timestamp = time.strftime('%Y-%m-%d %H:%M+%Z')
442 # The time stamp in the header doesn't have the same format as that
443 # generated by xgettext...
444 print(pot_header % {'time': timestamp, 'version':
445 __version__}, file=fp)
446 # Sort the entries. First sort each particular entry's keys, then
447 # sort all the entries by their first item.
448 reverse = {}
449 for k, v in self.__messages.items():
450 keys = v.keys()
451 keys = sorted(keys)
452 reverse.setdefault(tuple(keys), []).append((k, v))
453 rkeys = reverse.keys()
454 for rkey in sorted(rkeys):
455 rentries = reverse[rkey]
456 rentries.sort()
457 for k, v in rentries:
458 isdocstring = 0
459 # If the entry was gleaned out of a docstring, then add a
460 # comment stating so. This is to aid translators who may wish
461 # to skip translating some unimportant docstrings.
462 if reduce(operator.__add__, v.values()):
463 isdocstring = 1
464 # k is the message string, v is a dictionary-set of (filename,
465 # lineno) tuples. We want to sort the entries in v first by
466 # file name and then by line number.
467 v = v.keys()
468 v = sorted(v)
469 if not options.writelocations:
470 pass
471 # location comments are different b/w Solaris and GNU:
472 elif options.locationstyle == options.SOLARIS:
473 for filename, lineno in v:
474 d = {'filename': filename, 'lineno': lineno}
475 print(_(
476 '# File: %(filename)s, line: %(lineno)d') % d, file=fp)
477 elif options.locationstyle == options.GNU:
478 # fit as many locations on one line, as long as the
479 # resulting line length doesn't exceed 'options.width'
480 locline = '#:'
481 for filename, lineno in v:
482 d = {'filename': filename, 'lineno': lineno}
483 s = _(' %(filename)s:%(lineno)d') % d
484 if len(locline) + len(s) <= options.width:
485 locline = locline + s
486 else:
487 print(locline, file=fp)
488 locline = "#:" + s
489 if len(locline) > 2:
490 print(locline, file=fp)
491 if isdocstring:
492 print('#, docstring', file=fp)
493 print('msgid', normalize(k), file=fp)
494 print('msgstr ""\n', file=fp)
495
496
497 def main():
498 global default_keywords
499 try:
500 opts, args = getopt.getopt(
501 sys.argv[1:],
502 'ad:DEhk:Kno:p:S:Vvw:x:X:',
503 ['extract-all', 'default-domain=', 'escape', 'help',
504 'keyword=', 'no-default-keywords',
505 'add-location', 'no-location', 'output=', 'output-dir=',
506 'style=', 'verbose', 'version', 'width=', 'exclude-file=',
507 'docstrings', 'no-docstrings=',
508 ])
509 except getopt.error as msg:
510 usage(1, msg)
511
512 # for holding option values
513 class Options:
514 # constants
515 GNU = 1
516 SOLARIS = 2
517 # defaults
518 extractall = 0 # FIXME: currently this option has no effect at all.
519 escape = 0
520 keywords = []
521 outpath = ''
522 outfile = 'messages.pot'
523 writelocations = 1
524 locationstyle = GNU
525 verbose = 0
526 width = 78
527 excludefilename = ''
528 docstrings = 0
529 nodocstrings = {}
530
531 options = Options()
532 locations = {'gnu' : options.GNU,
533 'solaris' : options.SOLARIS,
534 }
535
536 # parse options
537 for opt, arg in opts:
538 if opt in ('-h', '--help'):
539 usage(0)
540 elif opt in ('-a', '--extract-all'):
541 options.extractall = 1
542 elif opt in ('-d', '--default-domain'):
543 options.outfile = arg + '.pot'
544 elif opt in ('-E', '--escape'):
545 options.escape = 1
546 elif opt in ('-D', '--docstrings'):
547 options.docstrings = 1
548 elif opt in ('-k', '--keyword'):
549 options.keywords.append(arg)
550 elif opt in ('-K', '--no-default-keywords'):
551 default_keywords = []
552 elif opt in ('-n', '--add-location'):
553 options.writelocations = 1
554 elif opt in ('--no-location',):
555 options.writelocations = 0
556 elif opt in ('-S', '--style'):
557 options.locationstyle = locations.get(arg.lower())
558 if options.locationstyle is None:
559 usage(1, _('Invalid value for --style: %s') % arg)
560 elif opt in ('-o', '--output'):
561 options.outfile = arg
562 elif opt in ('-p', '--output-dir'):
563 options.outpath = arg
564 elif opt in ('-v', '--verbose'):
565 options.verbose = 1
566 elif opt in ('-V', '--version'):
567 print(_('pygettext.py (xgettext for Python) %s') % __version__)
568 sys.exit(0)
569 elif opt in ('-w', '--width'):
570 try:
571 options.width = int(arg)
572 except ValueError:
573 usage(1, _('--width argument must be an integer: %s') % arg)
574 elif opt in ('-x', '--exclude-file'):
575 options.excludefilename = arg
576 elif opt in ('-X', '--no-docstrings'):
577 fp = open(arg)
578 try:
579 while 1:
580 line = fp.readline()
581 if not line:
582 break
583 options.nodocstrings[line[:-1]] = 1
584 finally:
585 fp.close()
586
587 # calculate escapes
588 make_escapes(not options.escape)
589
590 # calculate all keywords
591 options.keywords.extend(default_keywords)
592
593 # initialize list of strings to exclude
594 if options.excludefilename:
595 try:
596 fp = open(options.excludefilename)
597 options.toexclude = [line.rstrip('\n') for line in fp]
598 fp.close()
599 except IOError:
600 print(_(
601 "Can't read --exclude-file: %s") % options.excludefilename, file=sys.stderr)
602 sys.exit(1)
603 else:
604 options.toexclude = []
605
606 # resolve args to module lists
607 expanded = []
608 for arg in args:
609 if arg == '-':
610 expanded.append(arg)
611 else:
612 expanded.extend(getFilesForName(arg))
613 args = expanded
614
615 # slurp through all the files
616 eater = TokenEater(options)
617 for filename in args:
618 if filename == '-':
619 if options.verbose:
620 print(_('Reading standard input'))
621 fp = sys.stdin
622 closep = 0
623 else:
624 if options.verbose:
625 print(_('Working on %s') % filename)
626 fp = open(filename)
627 closep = 1
628 try:
629 eater.set_filename(filename)
630 try:
631 for token in tokenize.generate_tokens(fp.readline):
632 eater(*token)
633 except tokenize.TokenError as e:
634 print('%s: %s, line %d, column %d' % (
635 e.args[0], filename, e.args[1][0], e.args[1][1]), file=sys.stderr)
636 finally:
637 if closep:
638 fp.close()
639
640 # write the output
641 if options.outfile == '-':
642 fp = sys.stdout
643 closep = 0
644 else:
645 if options.outpath:
646 options.outfile = os.path.join(options.outpath, options.outfile)
647 fp = open(options.outfile, 'w')
648 closep = 1
649 try:
650 eater.write(fp)
651 finally:
652 if closep:
653 fp.close()
654
655 if __name__ == '__main__':
656 main()
657 # some more test strings
658 _(u'a unicode string')
659 # this one creates a warning
660 _('*** Seen unexpected token "%(token)s"') % {'token': 'test'}
661 _('more' 'than' 'one' 'string')
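
The TokenEater state machine in the listing above is hard to follow once the indentation is flattened. As a rough illustration only, the same idea (wait for a keyword, expect an opening parenthesis, then collect adjacent string literals until the closing parenthesis) can be sketched as a standalone function. The name `extract_messages` is hypothetical and not part of pygettext.py, and `ast.literal_eval` is swapped in here for the file's `safe_eval` helper.

```python
import ast
import io
import tokenize


def extract_messages(source, keywords=('_',)):
    """Return (lineno, message) pairs for keyword("...") calls in source.

    Hypothetical helper: a simplified sketch of the TokenEater states
    (waiting -> keyword seen -> open paren seen), not the real class.
    """
    messages = []
    state = 'waiting'
    parts, lineno = [], -1
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if state == 'waiting':
            if tok.type == tokenize.NAME and tok.string in keywords:
                state = 'keyword'
        elif state == 'keyword':
            if tok.type == tokenize.OP and tok.string == '(':
                # remember where the call started
                parts, lineno = [], tok.start[0]
                state = 'open'
            else:
                state = 'waiting'
        elif state == 'open':
            if tok.type == tokenize.OP and tok.string == ')':
                # adjacent literals are concatenated into one message
                if parts:
                    messages.append((lineno, ''.join(parts)))
                state = 'waiting'
            elif tok.type == tokenize.STRING:
                # literal_eval unwraps the quotes without running code
                parts.append(ast.literal_eval(tok.string))
            elif tok.type not in (tokenize.COMMENT, tokenize.NL,
                                  tokenize.NEWLINE, tokenize.INDENT,
                                  tokenize.DEDENT):
                # anything other than a string literal: give up on this call
                state = 'waiting'
    return messages


sample = "print(_('more' 'than' 'one' 'string'))\nx = _(name)\n"
print(extract_messages(sample))
```

The second call in `sample` passes a variable rather than a literal, so the sketch skips it silently; the real TokenEater prints a warning to stderr in that case.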

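
To make the output of `normalize()` and `escape()` in the listing concrete, here is a minimal standalone sketch of the same quoting rules for a msgid. The names `po_escape` and `po_quote` are hypothetical, and the sketch assumes the default mode in which non-ASCII characters pass through unescaped (no `-E` flag); it is an illustration, not the file's implementation.

```python
def po_escape(text):
    """Escape one piece of text for a double-quoted .po string.

    Hypothetical helper; assumes non-ASCII characters are passed through.
    """
    special = {'\\': '\\\\', '\t': '\\t', '\r': '\\r',
               '\n': '\\n', '"': '\\"'}
    out = []
    for ch in text:
        code = ord(ch)
        if ch in special:
            out.append(special[ch])
        elif code < 32 or code == 127:
            # remaining control characters become octal escapes
            out.append('\\%03o' % code)
        else:
            out.append(ch)
    return ''.join(out)


def po_quote(text):
    """Quote text the way a msgid appears in a .pot file."""
    lines = text.split('\n')
    if len(lines) == 1:
        return '"' + po_escape(text) + '"'
    if not lines[-1]:
        # text ended with a newline: fold it into the last real line
        del lines[-1]
        lines[-1] += '\n'
    # multi-line form: an empty first string, then one string per line
    return '""\n"' + '\\n"\n"'.join(po_escape(l) for l in lines) + '"'


print('msgid', po_quote('Hello "world"\nbye\n'))
```

A single-line message stays on the `msgid` line, while a multi-line message starts with an empty string and continues with one quoted string per source line, which is the layout the GNU tools expect.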
Roundup Issue Tracker: http://roundup-tracker.org/