Introduction
The RtfTools package is a set of free PHP classes that operate on files containing Microsoft Rich Text Format data (*.rtf). Each class has been designed to accomplish a specific task that may be useful if you have to process Rtf files in various ways :
All the classes in the RtfTools package have been designed to be able to process Rtf documents that may be larger than the available memory. This is especially useful when you need to handle several Rtf documents at once, whose total size may exceed your current PHP memory limit. However, you always have the choice between a version that relies on an underlying file (allowing for processing files bigger than the available memory) and its twin version that operates on Rtf contents stored as a string in memory (allowing for faster processing). For example, the RtfTemplater class is simply an abstract class which has two derived classes :
The same dichotomy exists with all the other classes, with the exception of RtfMerger. The Overview of the RtfTools classes section provides more information about the hierarchy of the RtfTools classes, especially regarding when to use the derived classes that operate on string contents and the ones that operate directly on files. You will also find an Examples section, which gives a general overview of how to use the classes. If you would like more realistic examples, using real sample Rtf files, you can also have a look at the examples directory in the .zip file containing the latest release of the distribution (which can be downloaded here : http://www.rtftools.net/download.php?version=latest). Finally, the Reference section gives a complete description of the RtfTools classes, their properties and their methods.
Licensing
The applicable licensing scheme for using this package is GPL V3.
Prerequisites
This package requires PHP >= 5.6.
Installation
There is no particular installation process. Just extract the files located in the sources directory of the .ZIP archive to your preferred include directory location. You can also extract the whole archive if you like.
Overview of the RtfTools classes
This section will give you an overview of how the classes in the RtfTools package are organized and why they were organized this way.
You will discover that most classes come in two versions : a string-based version which operates on a whole Rtf document directly loaded into memory, and a file-based version that loads chunks of data from an Rtf document.
Finally, you will find a small discussion about when to choose the string-based version and when to choose the file-based version.
Design requirements
Classes of this package have been designed with the following requirements in mind :
Introduction to the RtfTools class hierarchy
Maybe the easiest way to understand how the classes of the RtfTools package are organized is to start from the root, the RtfDocument class ; the diagram below, which uses a home-made formalism, describes the origin of it all :
Diagram explanations
The diagram above needs a few explanations :
The RtfDocument class hierarchy
Of course, the diagram above was not only an example ; it describes the various components that are articulated around the base abstract class, RtfDocument. This diagram shows that the RtfDocument class implements the IRtfDocument interface. This is not completely true, in reality : the actual implementation of the IRtfDocument interface has been delegated to the two traits, RtfStringSupport and RtfFileSupport.
But here comes the dichotomy : at the next abstraction level, the RtfDocument class splits into two final versions : RtfStringDocument and RtfFileDocument. The first one will load the contents of an Rtf document entirely into memory, while the second one will read the document contents from disk, only when they are needed. The first approach is focused on performance, while the second one is focused on reducing memory usage.
Classes derived from RtfDocument
Based on this modeling approach, most of the specialized classes of the RtfTools package roughly follow the same scheme. An example is given below for the RtfTemplater class :
The above diagram shows that the RtfTemplater class inherits from the RtfDocument one ; like its parent, this is an abstract class that later specializes into two classes, RtfStringTemplater and RtfFileTemplater. With the exception of the RtfMerger class, all other classes inherit more or less directly from RtfDocument.
String-based vs file-based classes
Now that we have understood the dichotomy between the string-based and file-based classes, there is one big question that may come to your mind : "Why do I have to choose between a string-based version and a file-based one ?". Here are a few hints, which are not to be taken as absolute truths :
Choose the string-based version of an RtfDocument class when :
Conversely, choose the file-based version when :
Whichever solution you choose, please keep in mind that the API will remain exactly the same, whether you choose the string-based version of a class or its file-based counterpart.
Examples
You will find below a few examples of how to use the various classes from the RtfTools package. You will also find running examples in the examples directory of the .ZIP archive containing the RtfTools package.
Processing a template Rtf document
Merging multiple Rtf documents together
Merging Rtf files is fairly simple ; first, create an instance of the RtfMerger class ; you can supply a list of files to be merged together, or add them later by calling the Add() method :
include ( 'path/to/RtfMerger.phpclass' ) ;
$merger = new RtfMerger ( 'sample1.rtf', 'sample2.rtf' ) ;
$merger -> Add ( 'sample3.rtf' ) ;
The above example specified the names of the files to be merged ; but you can also give objects inheriting from the RtfDocument class, such as in the example below :
$merger = new RtfMerger ( ) ;
$merger -> Add ( new RtfFileDocument ( 'sample3.rtf' ) ) ;
$merger -> Add ( new RtfStringDocument ( file_get_contents ( 'sample4.rtf' ) ) ) ;
$template_variables = [ 'a' => 'this is variable A', 'b' => 'this is variable b' ] ;
$merger -> Add ( new RtfFileTemplater ( 'sample5.rtf', $template_variables ) ) ;
Related class : RtfMerger
Extracting text from an Rtf document
Extracting text from an Rtf document is easy ; the following example extracts plain text contents from files "sample1.rtf" and "sample2.rtf", and puts them in files "sample1.txt" and "sample2.txt", respectively. The plain text contents of file "sample2.rtf" are also echoed on the standard output :
include ( 'path/to/RtfTexter.phpclass' ) ;
// Use the string-based version of the class for the first file
$contents = file_get_contents ( 'sample1.rtf' ) ;
$doc = new RtfStringTexter ( $contents ) ;
$doc -> SaveTo ( 'sample1.txt' ) ;
// Use the file-based version of the class for the second file
$doc = new RtfFileTexter ( 'sample2.rtf' ) ;
echo $doc -> AsString ( ) ;
$doc -> SaveTo ( 'sample2.txt' ) ;
Related class : RtfTexter
Pretty-printing Rtf document contents
The following example will process two files, sample1.rtf and sample2.rtf, and will write their pretty-printed output to files sample1.txt and sample2.txt, respectively :
include ( 'path/to/RtfBeautifier.phpclass' ) ;
// Use the string-based version of the class for the first file
$contents = file_get_contents ( 'sample1.rtf' ) ;
$doc = new RtfStringBeautifier ( $contents ) ;
$doc -> SaveTo ( 'sample1.txt' ) ;
// Use the file-based version of the class for the second file
$doc = new RtfFileBeautifier ( 'sample2.rtf' ) ;
$doc -> SaveTo ( 'sample2.txt' ) ;
Now, if you are running Unix, you can type the following command to compare the contents of both documents :
$ diff sample1.txt sample2.txt | more
On Windows systems, you can use the Windiff command, which graphically displays its comparison results :
C:\ > windiff sample1.txt sample2.txt
(the windiff command can be downloaded here : http://www.grigsoft.com/download-windiff.htm)
Related class : RtfBeautifier
Parsing an Rtf file
Class reference
RtfDocument class
The RtfDocument class is an abstract class from which all other classes of the RtfTools package inherit (with the exception of RtfMerger). It supports the IRtfDocument interface, but it does not implement the methods declared in it : this role is delegated to the RtfStringSupport and RtfFileSupport traits, which are later used by specialized (non-abstract) classes such as RtfStringDocument and RtfFileDocument.
You may notice that there is a mix of naming conventions for method names ; some use joined words with their first letter uppercased, some use lowercase words separated by an underscore.
Class diagram
If you have read the Overview of the RtfTools classes section, then you are already familiar with this diagram :
Constructor
The RtfDocument constructor does not accept any parameters of its own ; it simply delegates instantiation to the __specialized_construct method of the RtfStringSupport and RtfFileSupport traits, passing all the arguments it received. You can have a look at the String and File support traits section later in this chapter for an explanation of how parent and derived class constructors pass their parameters to each other.
Methods
public static function DecodeSpecialChars ( $contents, $convert_accents = false )
Decodes characters using the Rtf notation \'xy, where x and y are hexadecimal digits, and replaces them with their Ansi counterparts. Parameters :
Return value :
Returns the input text, with all special characters converted.
Notes :
The following conversions apply :
public static function EscapeString ( $value )
Some strings (designated as #PCDATA in the Rtf specifications) may contain characters that could be interpreted as Rtf instructions ; such characters are :
Parameters :
Return value :
Returns the escaped value.
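As an illustration of what such escaping involves, here is a minimal sketch. It is not the RtfTools implementation : the function name rtf_escape is hypothetical, and it assumes that the characters concerned are the three that carry syntax in Rtf (backslash, opening brace and closing brace), each being escaped by a preceding backslash.

```php
<?php
// Hypothetical helper, not the RtfTools implementation : prefixes each character
// that has a syntactic meaning in Rtf with a backslash.
function rtf_escape ( $value )
{
	// strtr() with an array scans the input only once, so the backslashes it
	// inserts are never escaped a second time
	return ( strtr ( $value, [ '\\' => '\\\\', '{' => '\\{', '}' => '\\}' ] ) ) ;
}

echo rtf_escape ( 'Set {x} in C:\temp' ), "\n" ;	// prints : Set \{x\} in C:\\temp
```

Using a single strtr() call rather than three successive str_replace() calls avoids having to worry about the order in which the replacements are made.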
public static function GetCompoundTag ( $data, $tag, $offset = 0, $include_tag = false )
Extracts a compound tag from Rtf data, handling multiple nesting levels if necessary. For example, the color table present in the header part of an Rtf document has the following structure :
{\colortbl;\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;
\red0\green255\blue0;\red255\green0\blue255;}
The \colortbl tag is enclosed within curly braces. The GetCompoundTag method locates such a tag in the Rtf data supplied by the $data parameter, and returns the enclosed contents, without the curly braces. Parameters :
Return value :
Returns the tag contents (including nested tags) if found, or false otherwise.
public static function get_document_start ( )
Rtf documents contain a header part and a body part. To process a document body, we need to know where the header part ends and where the body part starts. Unfortunately, there is no precise point in an Rtf document that says : "this is the end of the header, and the start of the body part".
A header is made of several parts, such as tags that define the character set used globally in the document, as well as compound structures such as font tables, color tables, style sheets and so on. The get_document_start method is able to locate the end of the very last part of a document header, which signals the start of the body part.
If no header end has been found (which should not happen except for very ill-formed documents), then the get_document_start method will try to locate the first \sectd (section start) or \pard (reset paragraph settings to their defaults) tag.
Return value :
Returns the byte offset, in the Rtf document, of the start of the document body.
public static function ToClosingDelimiter ( $string, $start = 0 )
Suppose that you have a compound statement such as the following font table, located inside Rtf contents :
(some Rtf contents)
{\fonttbl {\f1 ... {\panose ...} Times New Roman;} {\f2 ... Arial;}}
(some other Rtf contents)
The ToClosingDelimiter method will find the last closing brace, provided that you supply the index of the opening brace in the Rtf document.
Although the RtfStringDocument and RtfFileDocument classes implement a to_closing_delimiter method that searches string or file contents until a closing brace has been found, it is sometimes handy to do it on a simple string. This is why there is also a generic closing delimiter search method that operates on strings, whatever the underlying document implementation looks like (string-based or file-based). Like its specialized counterpart, this method is able to handle nested constructs. Parameters :
Return value :
Returns the byte offset of the closing brace of the compound construct starting at offset $start - 1,
or false if the supplied Rtf data has imbalanced nested opening/closing braces.
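The kind of nested search described above can be sketched with a simple depth counter. This is only an illustration under stated assumptions, not the code used by RtfTools : the function name find_closing_brace is hypothetical, $start is assumed to be the offset of the opening brace, and the character following a backslash is skipped so that escaped braces (\{ and \}) do not count.

```php
<?php
// Hypothetical helper, not part of RtfTools : returns the offset of the brace that
// closes the group opened at $start, or false when the braces are not balanced.
function find_closing_brace ( $rtf, $start = 0 )
{
	$length = strlen ( $rtf ) ;

	if ( $start >= $length || $rtf [$start] !== '{' )
		return ( false ) ;

	$depth = 0 ;

	for ( $i = $start ; $i < $length ; $i ++ )
	{
		switch ( $rtf [$i] )
		{
			case '\\' :		// Skip the next character, so that \{ and \} are ignored
				$i ++ ;
				break ;

			case '{' :		// One more nesting level
				$depth ++ ;
				break ;

			case '}' :		// Back to level 0 : this is the brace we are looking for
				if ( -- $depth === 0 )
					return ( $i ) ;
				break ;
		}
	}

	return ( false ) ;		// End of data reached before the group was closed
}

$rtf   = 'x{\fonttbl {\f1 Times New Roman;} {\f2 Arial;}}y' ;
$start = 1 ;
$end   = find_closing_brace ( $rtf, $start ) ;

// Prints the whole {\fonttbl ...} group, braces included
echo substr ( $rtf, $start, $end - $start + 1 ), "\n" ;
```

A counter is enough here because Rtf groups are strictly nested ; no stack is needed as long as only the matching brace of one group is wanted.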
public static function TwipsToCm ( $value )
Converts a value expressed in twips (1/1440 of an inch) to centimeters.
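The conversion itself is plain arithmetic : 1440 twips make one inch, and one inch is 2.54 centimeters. The sketch below shows the computation ; the function name twips_to_cm is hypothetical and the code is not taken from RtfTools.

```php
<?php
// Hypothetical helper, not the RtfTools implementation
function twips_to_cm ( $twips )
{
	// 1440 twips = 1 inch = 2.54 cm
	return ( $twips / 1440 * 2.54 ) ;
}

echo twips_to_cm ( 1440 ), "\n" ;			// one inch : prints 2.54
echo round ( twips_to_cm ( 11906 ), 1 ), "\n" ;		// width of an A4 page : prints 21
```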
Properties
protected $DecodingTable
This table is used by the DecodeSpecialChars method to replace special character specifications of the form \'xy (where x and y are hexadecimal digits providing a character code) with their ascii equivalent. It also provides translations for the following characters :
protected $DecodingTableWithAccents
This table is used by the DecodeSpecialChars method to replace accented characters with their unaccented ascii equivalent, when the $convert_accents parameter is true.
public $Name
For file-based documents, contains the name of the supplied input Rtf document.
protected $RecordSize
Contains the record size used when writing output documents.
Constants
TWIPS_PER_CM
Number of twips per centimeter.
TOKEN_* constants
The TOKEN_* constants represent a syntactic element of an Rtf file :
IRtfDocument interface
The methods declared in the IRtfDocument interface are implemented by all classes derived from the RtfDocument class. All classes inheriting from RtfDocument (and therefore supposed to implement the IRtfDocument interface) implement the ArrayAccess, Countable and Iterator interfaces. This means for example that calling the count() builtin function on an object inheriting from the RtfDocument class will return the number of characters present in your Rtf document :
$doc = new RtfStringDocument ( file_get_contents ( 'sample.rtf' ) ) ;
echo count ( $doc ) ;
Note that it will return the number of characters in the Rtf code, not the number of characters of the plain text.
You can iterate through each character of the Rtf data present in your document by using a for loop :
$doc = new RtfFileDocument ( 'sample.rtf' ) ;
for ( $i = 0, $count = count ( $doc ) ; $i < $count ; $i ++ )
echo "CHAR at position $i = [{$doc [$i]}]\n" ;
Note that you can use array index notation to retrieve an individual character, such as in $doc [$i].
Similarly, you can use a foreach loop to iterate through individual characters :
$doc = new RtfFileDocument ( 'sample.rtf' ) ;
foreach ( $doc as $ch )
{
// Do something with $ch...
}
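To show what supporting count(), array index notation and foreach involves on the PHP side, here is a small self-contained sketch. The CharBuffer class is hypothetical and has nothing to do with the RtfTools sources ; it simply wraps a string and implements the three standard interfaces mentioned above, in read-only mode. The #[\ReturnTypeWillChange] lines are attributes on PHP 8.1 and later (where they silence deprecation notices) and plain comments on older versions.

```php
<?php
// Hypothetical class, not part of RtfTools : exposes a string through the same
// three standard interfaces, so that count(), [] and foreach all work on it.
class CharBuffer implements ArrayAccess, Countable, Iterator
{
	private $data ;
	private $position = 0 ;

	public function __construct ( $data )
	   { $this -> data = $data ; }

	// Countable : called by count ( $object )
	#[\ReturnTypeWillChange]
	public function count ( )
	   { return ( strlen ( $this -> data ) ) ; }

	// ArrayAccess : called by $object [$offset] (read-only in this sketch)
	#[\ReturnTypeWillChange]
	public function offsetExists ( $offset )
	   { return ( $offset >= 0 && $offset < strlen ( $this -> data ) ) ; }

	#[\ReturnTypeWillChange]
	public function offsetGet ( $offset )
	   { return ( $this -> data [$offset] ) ; }

	#[\ReturnTypeWillChange]
	public function offsetSet ( $offset, $value )
	   { throw new LogicException ( 'This buffer is read-only' ) ; }

	#[\ReturnTypeWillChange]
	public function offsetUnset ( $offset )
	   { throw new LogicException ( 'This buffer is read-only' ) ; }

	// Iterator : called by foreach ( $object as $key => $value )
	#[\ReturnTypeWillChange]
	public function rewind ( )
	   { $this -> position = 0 ; }

	#[\ReturnTypeWillChange]
	public function valid ( )
	   { return ( $this -> position < strlen ( $this -> data ) ) ; }

	#[\ReturnTypeWillChange]
	public function current ( )
	   { return ( $this -> data [ $this -> position ] ) ; }

	#[\ReturnTypeWillChange]
	public function key ( )
	   { return ( $this -> position ) ; }

	#[\ReturnTypeWillChange]
	public function next ( )
	   { $this -> position ++ ; }
}

$buffer = new CharBuffer ( '{\rtf1}' ) ;

echo count ( $buffer ), "\n" ;		// prints 7
echo $buffer [0], "\n" ;		// prints {
```

A file-based class would implement the same methods, but read the requested character from disk (or from a cache) instead of indexing a string held in memory.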
public function AsString ( )
Returns the contents of the underlying Rtf document as a string.
public function SaveTo ( $filename )
Saves the current document to the specified file.
public function get_contents ( )
Returns the whole contents of the underlying Rtf document, as a string.
public function strchr ( $cset, $start = 0 )
Searches for the first character in the Rtf document that is present in the $cset string, starting at the character position specified by $start.
This function behaves like a mix between the builtin strchr() and strcspn() functions. The reason for this is that most of the classes belonging to the RtfTools package need to parse Rtf contents ; most of their needs consist in finding the next character having semantics in the Rtf language : backslash, opening and closing brace. The method returns the offset of the found character, or false otherwise.
public function strlen ( )
Returns the number of characters present in the underlying Rtf document.
echo count ( $doc ) ;
echo $doc -> strlen ( ) ;
public function strpos ( $searched_string, $start = 0 )
Searches the underlying Rtf document for the string specified by the $searched_string parameter, starting at the character offset specified by $start. The method returns the offset of the found string, or false otherwise.
public function substr ( $start, $length = false )
Returns a substring of the underlying Rtf document.
public function write ( $fp, $start, $length = false )
Writes characters from the underlying Rtf document, starting at the offset specified by the $start parameter, to the file resource specified by $fp. If the $length parameter has been specified, only this number of characters will be written to the output file ; otherwise, all the characters from $start until the end of file will be written.
public function to_closing_delimiter ( $start = 0 )
Searches for the closing delimiter of a compound construct, starting at the character offset specified by the $start parameter. You can have a look at the RtfDocument::ToClosingDelimiter method for a more detailed explanation.
String and File support traits
The RtfStringSupport and RtfFileSupport traits have two characteristics in common :
RtfStringSupport trait
The specialized constructor of the RtfStringSupport trait has the following signature :
protected function __specialized_construct ( $rtfdata, $chunk_size ) ;
The parameters are the following :
RtfFileSupport trait
The specialized constructor of the RtfFileSupport trait has the following signature :
protected function __specialized_construct ( $rtffile, $record_size = 16384, $cache_size = 8 ) ;
The parameters are the following :
Although the latest release of this SearchableFile class is available in the latest releases of the RtfTools package, you can also find it at phpclasses.org.
String and File document classes
As will be the case for almost all the classes of the RtfTools package, you will at some point have to decide whether to use the string-based versions (consuming more memory, but less cpu and I/O) or the file-based versions (consuming much less memory, but more I/O). Both versions provide exactly the same features ; the choice is thus driven by the amount of data you will have to process, and by how much memory and cpu are available to you. Although those classes are of limited interest by themselves (you can only perform searches on the initial data, extract portions of it, and write contents to an output file), they have been designed so that the RtfMerger class only has to work with objects inheriting from the RtfDocument class. They have different constructors, however : you will discover them in the following sections.
RtfStringDocument class
public function __construct ( $rtfdata, $chunk_size = 4 * 1024 * 1024 )
Loads an Rtf document into memory. Look at the RtfStringSupport trait for an explanation about the constructor's parameters. A typical usage could be :
$doc = new RtfStringDocument ( file_get_contents ( 'sample.rtf' ) ) ;
RtfFileDocument class
public function __construct ( $file, $record_size = 16384, $cache_size = 8 )
Opens an Rtf document stored in a file ; its contents are read from disk only when they are needed. Look at the RtfFileSupport trait for an explanation about the constructor's parameters. A typical usage could be :
$doc = new RtfFileDocument ( 'sample.rtf' ) ;
RtfBeautifier class
The goal of the RtfBeautifier class is to take an Rtf document and to produce a pretty-printed output. But why would you want to pretty-print Rtf documents ? Suppose that you have two Rtf documents whose contents are almost identical, and that you want to compare them. Since the raw Rtf data can have several instructions grouped on the same line, you will have to find the differences between two files that may have lines of Rtf data that are hundreds of characters long. Comparing data formatted in such a way can be a brain-killer ; suppose for example that the files you need to compare both have the same line, but one is 700 characters long while the other one is 705 characters long, because some \pard tag has been inserted somewhere within. When using tools such as the Unix diff or the Windows windiff command, you will find in the output that those lines differ in both files, but you will have to visually compare a 700-character line with a 705-character one. It will be a tough task to identify that there is an additional \pard tag located inside the line of the second file.
This is where the RtfBeautifier class comes on the scene : it is a debugging aid that takes a file and pretty-prints it by putting every Rtf syntactic element on a separate line, taking care of indentation levels.
Pretty-printing an Rtf document is very simple ; consider the following PHP script which takes file sample1.rtf as input, and generates an output file, sample1.txt, containing the pretty-printed contents ; it then repeats the same process with file sample2.rtf :
<?php
include ( 'path/to/RtfBeautifier.phpclass' ) ;
$beautifier = new RtfFileBeautifier ( 'sample1.rtf' ) ;
$beautifier -> SaveTo ( 'sample1.txt' ) ;
$beautifier = new RtfFileBeautifier ( 'sample2.rtf' ) ;
$beautifier -> SaveTo ( 'sample2.txt' ) ;
Now you are able to compare files sample1.txt and sample2.txt using the diff or windiff commands (or whatever diff-like command you prefer). To give you an idea of what the output of the RtfBeautifier is, consider the following Rtf sample file contents (for the sake of brevity, only the start of the file is listed here, and the same line is shown over 3 lines) :
{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff0\deff0\stshfdbch0\stshfloch0\stshfhich0\stshfbi0
\deflang1036\deflangfe1036{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}
Times New Roman;}{\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
...
The output of the RtfBeautifier class will be :
{
\rtf1
\adeflang1025
\ansi
\ansicpg1252
\uc1
\adeff0
\deff0
\stshfdbch0
\stshfloch0
\stshfhich0
\stshfbi0
\deflang1036
\deflangfe1036
{
\fonttbl
{
\f0
\froman
\fcharset0
\fprq2
{
\*\panose
02020603050405020304
}
Times New Roman;
}
{
\f1
\fswiss
\fcharset0
\fprq2
{
\*\panose
020b0604020202020204
}
Arial;
}
...
You may not have guessed it but, even if it looks like Rtf contents, the output of the RtfBeautifier class is not valid Rtf. Although not clearly stated in the Microsoft Rtf Specifications, spaces cannot be put everywhere ; for example, there must be no space or line break between an opening brace ("{") and tags such as "\fonttbl" or "\rtf1". In conclusion, the RtfBeautifier class is definitely a debugging tool that generates output for easy comparison of Rtf files, nothing more...
Class diagram
Constructor
The constructor of the abstract class RtfBeautifier has the following signature :
public function __construct ( $options, $indentation_size )
The parameters are the following :
Methods
public function AsString ( )
Returns the pretty-printed contents of an Rtf document as a string.
public function SaveTo ( $filename )
Pretty-prints the underlying Rtf document and saves the generated output to a file.
Properties
public $Options
Options that condition the behavior of the pretty-printing process. See the Constants section for more explanations on this set of flags. The AsString and SaveTo methods use the current value of this property to process pretty-printing options, so it is safe to modify it just before calling them.
public $IndentationSize
Number of spaces to be used for each indentation level.
Constants
BEAUTIFIER_* constants
The BEAUTIFIER_* constants allow you to specify a set of flags that condition the process of pretty-printing.
The following flags are available :
RtfStringBeautifier class
public function __construct ( $rtfdata, $options = self::BEAUTIFY_ALL, $indentation_size = 4, $chunk_size = 4 * 1024 * 1024 )
Creates an RtfBeautifier object, using the specified Rtf data. A typical usage could be :
$doc = new RtfStringBeautifier ( file_get_contents ( 'sample.rtf' ) ) ;
$doc -> SaveTo ( 'sample.txt' ) ; // Save pretty-printed contents to output file
echo $doc -> AsString ( ) ; // Echo pretty-printed contents to standard output
The parameters are the following :
RtfFileBeautifier class
public function __construct ( $file, $options = self::BEAUTIFY_ALL, $indentation_size = 4, $record_size = 16384 )
Creates an RtfBeautifier object, without loading the file contents into memory. A typical usage could be :
$doc = new RtfFileBeautifier ( 'sample.rtf' ) ;
$doc -> SaveTo ( 'sample.txt' ) ; // Save pretty-printed contents to output file
echo $doc -> AsString ( ) ; // Echo pretty-printed contents to standard output
The parameters are the following :
RtfMerger class
The RtfMerger class allows you to combine the contents of several Rtf files into a single one. It can be used for example for mass printing or for storing a set of related files into a single Rtf document. Unlike all the other classes of this package that process Rtf contents, this class does not inherit from RtfDocument.
Merging documents together
Merging documents together is a simple three-step process :
Class diagram
The RtfMerger acts as a container for objects inheriting from the RtfDocument class.
Merging process overview
Note : a little knowledge of the Rtf Specifications would be welcome here to better understand the merging process.
Merging several Rtf documents together requires a few manipulations. Before explaining them, a short overview of the Rtf document format is needed. Rtf documents have a header and a body part ; the Microsoft Rtf Specifications state that an Rtf document is built like this :
<file> ::= '{' <header> <body> '}'
The above description states that an Rtf document always starts with an opening brace, followed by a header part, then by a body part, and is finally terminated by a closing brace. If we have a further look at the <header> part, we will find something like this (a question mark after a construct means that it is optional) :
<header> ::= \rtf1 \fbidis? <character set> <from>? <deffont> <deflang>
<fonttbl>? <filetbl>? <colortbl>? <stylesheet>? <stylerestrictions>?
<listtables>? <revtbl>? <rsidtable>? <mathprops>? <generator>?
Globally, a header starts with the \rtf1 tag (an Rtf document always starts with the string {\rtf1),
followed by a certain number of tags which are more or less to be seen as global document properties ; then you will see compound structures such as
the font table, the color table, the style sheet table, etc.
The RtfMerger class discards any information related to Xml namespaces, but it allows you to specify author information that will be put in the final document. Tables in the header part of a document define a set of items : the color table defines the colors used in the document, the font table defines the fonts used in the document, the stylesheet table defines style sheets used in the document, and so on.
Each entry in these tables can be referred to later in the document body by using the appropriate tag (control word). For example, setting the foreground
color in a paragraph can be specified with the \cfx tag, where x is the entry number in the document color table.
The problem comes when merging multiple documents together ; each document (probably) uses its own header tables for colors, fonts and stylesheets, and it may happen that an entry from document x conflicts with the same entry in document x+1. This is why, during the processing of a document to be merged, tables local to the document will hold entries in the corresponding RtfMergerDocument object, indicating which references should be renumbered because there was already an entry having that id in the global header, but with a different definition.
The global header that will be generated will include all the entries coming from the first document to be merged, plus the renumbered entries coming from the subsequent documents. The following sections give a little more detail about each of these elements, and explain how they are handled during the merging process. You will see that some tables require specialized handling when renumbering references to their elements.
Global document properties
The term Global document properties is used here to indicate tags (Control words, in the Microsoft terminology) that define settings at the document level. The following example specifies a default language code of 1025 when paragraph settings are reset to their default (using the \plain control word) ; it also specifies that the document uses the ansi character set (\ansi) along with code page 1036 (\ansicpg1036) :
\deflang1025\ansi\ansicpg1036
The RtfMerger class will collect all those various tags coming from the headers of the documents to be merged. However, if a tag has been found having a different parameter value in a previously processed document, it will not be overridden and a warning such as in the following example will be issued :
Tag \ansicpg value mismatch : current = 1057, previous = 1036
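The "first value wins" behavior described above can be sketched as follows. This is an illustration only, not the RtfMerger code : the function name merge_global_properties is hypothetical, and the header tags of each document are assumed to have already been parsed into an array of tag name => parameter value pairs.

```php
<?php
// Hypothetical helper, not part of RtfTools : adds the header tags of one document
// to the global set. A tag already present with another value is left untouched,
// and a warning message is returned for it.
function merge_global_properties ( array &$global, array $properties )
{
	$warnings = [] ;

	foreach ( $properties as $tag => $value )
	{
		if ( ! array_key_exists ( $tag, $global ) )
			$global [$tag] = $value ;		// First time this tag is seen : keep it
		else if ( $global [$tag] !== $value )
			$warnings [] = 'Tag \\' . $tag . ' value mismatch : current = ' . $value .
					', previous = ' . $global [$tag] ;
	}

	return ( $warnings ) ;
}

$global = [] ;
merge_global_properties ( $global, [ 'deflang' => 1025, 'ansicpg' => 1036 ] ) ;	// first document

foreach ( merge_global_properties ( $global, [ 'ansicpg' => 1057 ] ) as $warning )	// second document
	echo $warning, "\n" ;
```

With the two sample documents above, the global set keeps \ansicpg1036 and the second call reports the mismatch.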
In its current state of development, the RtfMerger class simply ignores conflicting global document properties that may come from documents processed after the first one in the merging process.
Color tables
Color tables are specified in a compound structure that starts with the \colortbl tag and contains color specifications in RGB format (using the \red, \green and \blue tags) ; color specifications are separated with a semicolon :
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;...}
In the above table, 3 colors are defined :
Color indexes are zero-based. The tags that reference a color within the body part of a document are :
When building a global color table regrouping all the colors referenced by the documents to be merged, the following rules apply :
Font tables
Font tables are specified in a compound structure that starts with the \fonttbl tag and contains font specifications defined in nested compound structures. The following example defines three fonts, Times New Roman, Arial and Calibri, that are referenced inside the document (the Rtf code has been intentionally indented for better readability) :
{\fonttbl
{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}
{\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
{\f39\fswiss\fcharset0\fprq2{\*\panose 020f0502020204030204}Calibri;}
}
While color indexes in color tables are assigned sequentially (ie, the first color in the color table has index 0, the second has index 1, and so on),
fonts have their own numbering scheme, specified by the \f tag, followed by a font number. The tags that reference a font within the body part of a document are \f and \af. Within a single document, all font numbers are unique. However, when it comes to merging multiple documents together, you need to take care that the font numbers used in an individual document will not conflict with the font numbers used in another document so, again, there will be a renumbering operation during the merging process. When building a global font table regrouping all the fonts defined by the documents to be merged, the following rules apply :
Note that the fonts coming from individual documents are renumbered sequentially, starting from 0. This means that a global font table coming from the font table example above will look like this (note that the Calibri font, which was initially referred to as font #39, is now font #2) :
{\fonttbl
{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}
{\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
{\f2\fswiss\fcharset0\fprq2{\*\panose 020f0502020204030204}Calibri;}
}
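The sequential renumbering shown above can be sketched in a few lines. This is a simplified illustration for a single document, not the RtfMerger code : both function names are hypothetical, the font numbers are assumed to have been collected from the font table beforehand, and the regular expression only looks for the \f and \af tags (it would also rewrite a literal backslash followed by "f" and digits in the text, which a real parser must avoid).

```php
<?php
// Hypothetical helper : maps each original font number to its new sequential number,
// in order of first appearance ( [0, 1, 39] gives 0 => 0, 1 => 1, 39 => 2 ).
function build_font_map ( array $font_numbers )
{
	$map = [] ;

	foreach ( $font_numbers as $old )
	{
		if ( ! isset ( $map [$old] ) )
			$map [$old] = count ( $map ) ;
	}

	return ( $map ) ;
}

// Hypothetical helper : rewrites the \fN and \afN references found in Rtf data.
// Tags such as \fs24 are left alone, since "\f" must be directly followed by digits.
function renumber_font_references ( $rtf, array $map )
{
	return ( preg_replace_callback
	   (
		'/\\\\(a?f)(\d+)/',
		function ( $match ) use ( $map )
		   {
			$old = ( int ) $match [2] ;
			$new = isset ( $map [$old] ) ? $map [$old] : $old ;

			return ( '\\' . $match [1] . $new ) ;
		    },
		$rtf
	    ) ) ;
}

$map = build_font_map ( [ 0, 1, 39 ] ) ;

echo renumber_font_references ( '{\f39\fs24 Hello}', $map ), "\n" ;	// prints : {\f2\fs24 Hello}
```

When several documents are merged, the same map would have to be shared across documents and keyed on the font definition rather than on the font number alone ; that part is left out of this sketch.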
List tables
A list table is a case similar to a color table : it contains list definitions, whose id is assigned sequentially. The difference is that list numbers start at 1 while color numbers are 0-based. The list table of a document contains list definitions that can be referenced in the body part ; they define properties such as list levels, which in turn specify attributes such as the picture to be used for bullets, the numbering scheme for this level, etc.
A list table definition is a compound statement that starts with the \listtable tag ; it contains list definitions that in turn start with the \list tag, such as in the following example (for brevity, the contents of each list have been replaced by an ellipsis) :
{\listtable {\list \listidx...} {\list \listidy...} ...}
Lists in a document are referenced by the \lsx tag, where x is the 1-based list entry index into the list table. When building a global list table regrouping all the lists referenced by the documents to be merged, the following rules apply :
List Override tables
List override tables are complements to existing list definitions ; there are generally two types of list overrides :
A list override table definition is a compound statement that starts with the \listoverridetable tag ; it contains list override definitions that in turn start with the \listoverride tag, as in the following example (for brevity, the contents of each list override have been replaced by an ellipsis) :
{\listoverridetable {\listoverride \listidx \lsa...} {\listoverride \listidy \lsb...} ...}
Each list override definition contains two important tags :
In the document body, lists are referred to using the \lsx tag. The process of handling conflicting list override entries is nearly the same as the one used for font tables : conflicting list override entries are "anonymized" by removing the \listid and \ls tags. This "anonymized" version is used to check whether such a list override definition has already been encountered in a previous document. Depending on the comparison result, the same renumbering process as the one used for font definitions is applied. The only difference is that the unique list ids (\listid tags) are renumbered together with the list override ids (\ls tags).
Stylesheet tables
A stylesheet table is a list of nested stylesheet definitions, which are a shorthand for specifying character, paragraph or section formatting. A stylesheet table is a compound statement that starts with the \stylesheet tag ; it contains in turn stylesheet definitions that can specify any character, paragraph or section formatting tags, as in the following (abbreviated) definition :
{\stylesheet
{\ql \li0\ri0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright ...}
{\*\ts11\tsrowd\trftsWidthB3\trpaddl108\trpaddr108 ...}
{\*\cs10 \additive \ssemihidden Default Paragraph Font;}
}
There are a few kinds of styles, which are given by a specific tag followed by the stylesheet id ; the possible style identification tags are :
You will notice in the above example that the very first style (the one which starts with the \ql tag) has no id (i.e., none of the style id tags listed above appears in it). This is the default style, which is by convention numbered as style #0. Thus, the process of renumbering stylesheets in the global header is very similar to the one used for font tables, with the following exceptions :
In addition, style definitions can contain tags whose parameter is a style id ; these ids must also be renumbered during the merge process. Such tags are :
RSID
RSIDs (Revision Save IDs) are used for revision tracking. The merging process will remove anything related to revision tracking ; this includes :
Preserving or removing RSIDs across the various documents to be merged is not a great issue in itself ; RSIDs are normally chosen randomly, so there is little chance that an RSID from document x conflicts with an RSID coming from document y. However, tracking the individual history of each source document inside the final merged document does not make much sense. After all, the main purpose of a merged document is not to be edited afterwards ; a typical usage will be to print it, or to store it unmodified in a database. This is why the merging process removes any revision information, as if the whole document had been edited all at once before the first save (although a first save would by itself create a first RSID).
Shapes
Shapes in a document can take various forms : they can be text areas, geometric shapes or more complex structures. Each shape has its own id, identified by the \shplid tag. Merging two documents having shapes with the same id may lead to strange results :
This is why renumbering shapes so that they all have a unique id in the merged document is really important.
The merging process
Now that you have a global view of what the merging process takes care of, describing its overall actions will be simpler :
The final merged document will only be generated when you call either the AsString or the SaveTo method. In the first case, you will need as much memory as necessary to hold the global header plus the body parts of each document to be merged. In the second case, you will only need enough memory to hold the biggest document body or the global document header, whichever is bigger.
Constructor
The RtfMerger class constructor has the following signature :
public function __construct ( [ [documents...] options] ) ;
An RtfMerger instance without any documents in it can be created this way :
$merger = new RtfMerger ( ) ;
You can then add existing documents by using the Add method or the array access methods :
$merger -> Add ( "sample1.rtf" ) ;
$merger [] = "sample2.rtf" ;
$merger [] = new RtfFileTemplater ( "sample3.rtf", $variables ) ;
You can also specify filenames or objects inheriting from the RtfDocument class to the constructor :
$merger = new RtfMerger ( "sample1.rtf", "sample2.rtf",
new RtfFileTemplater ( "sample3.rtf", $variables ) ) ;
Methods
public function Add ( $args...)
Adds the specified documents to the merger object. $args can be of two types :
Array access
The RtfMerger class implements the Countable, ArrayAccess and IteratorAggregate interfaces, which gives you access to the documents that have been added to the merger object :
$merger = new RtfMerger ( "sample1.rtf", "sample2.rtf" ) ;
echo count ( $merger ) ; // Will display "2"
$doc2 = $merger [1] ; // $doc2 will be set to the RtfFileDocument object that
// has been created from "sample2.rtf" file contents
// Iterate through each document
foreach ( $merger as $doc )
// do something with $doc, which is of type RtfFileDocument
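To show what implementing these three interfaces involves, here is a minimal, self-contained sketch. The DocumentList class is hypothetical and only stores plain filenames ; it is not the RtfMerger implementation, which wraps each document in an object. The #[\ReturnTypeWillChange] lines are attributes on PHP 8 and plain comments on older versions.

```php
<?php
// Hypothetical container (not part of RtfTools) that supports count(), the
// [] syntax and foreach, in the same way the RtfMerger class is described to.
class DocumentList implements Countable, ArrayAccess, IteratorAggregate
   {
	private $items	=  [] ;

	// Countable : called by count ( $list )
	#[\ReturnTypeWillChange]
	public function count ( )
	   { return ( count ( $this -> items ) ) ; }

	// ArrayAccess : called by isset ( $list [ $offset ] )
	#[\ReturnTypeWillChange]
	public function offsetExists ( $offset )
	   { return ( isset ( $this -> items [ $offset ] ) ) ; }

	// ArrayAccess : called by $list [ $offset ]
	#[\ReturnTypeWillChange]
	public function offsetGet ( $offset )
	   { return ( $this -> items [ $offset ] ) ; }

	// ArrayAccess : called by $list [ $offset ] = $value ; $offset is null for $list [] = $value
	#[\ReturnTypeWillChange]
	public function offsetSet ( $offset, $value )
	   {
		if  ( $offset === null )
			$this -> items []		=  $value ;
		else
			$this -> items [ $offset ]	=  $value ;
	    }

	// ArrayAccess : called by unset ( $list [ $offset ] )
	#[\ReturnTypeWillChange]
	public function offsetUnset ( $offset )
	   { unset ( $this -> items [ $offset ] ) ; }

	// IteratorAggregate : called by foreach ( $list as ... )
	#[\ReturnTypeWillChange]
	public function getIterator ( )
	   { return ( new ArrayIterator ( $this -> items ) ) ; }
    }

$list		=  new DocumentList ( ) ;
$list []	=  'sample1.rtf' ;
$list []	=  'sample2.rtf' ;

echo count ( $list ), "\n" ;

foreach  ( $list  as  $index => $name )
	echo "$index : $name\n" ;
```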
public function AsString ( )
Returns the merged document contents as a string.
public function SaveTo ( $filename )
Saves the merged document contents to the specified filename.
Properties
protected $Documents
Every document added through the Add or array access methods is put in this array, after being wrapped in an RtfMergerDocument object.
Document information properties
You can define some document-information properties that will be put in the final merged document :
Note that the Keywords property is an array of strings.
public $Options
The $Options property is a set of RTF_MERGE_* constants that condition the behavior of the RtfMerger object.
private $GlobalHeader
The $GlobalHeader property holds an object of class RtfMergerHeader. As more documents of type RtfMergerDocument are added to the merger object, this object is complemented by the new colors, fonts, stylesheets and list definitions brought by the new documents. When the final document is generated with either the AsString or SaveTo methods, this object returns the mandatory Rtf code needed to build the document header.
Constants
RTF_MERGE_*
The RTF_MERGE_* constants are used for the $Options property to define the behavior of the RtfMerger class for the merging process ; it can be any combination of the following :
RtfParser class
The RtfParser class is a general class that you can use to parse Rtf contents. It is a little more than a parser, however, because it handles special constructs that are specific to certain tags (or control words) ; it features the following :
Parsing an Rtf file only requires calling the NextToken() method repeatedly ; it returns an object inheriting from the RtfToken class, which gives all the necessary information about the next available Rtf token.
Overview
The simplest program to parse Rtf contents could look like this :
<?php
require ( 'RtfParser.phpclass' ) ;
$file = "sample.rtf" ;
$parser = new RtfFileParser ( $file ) ;
while ( ( $token = $parser -> NextToken ( ) ) !== false )
{
// do something with $token
}
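To give an idea of what a token is, the following sketch splits a small piece of Rtf into control words, character symbols, control symbols, braces and text. It is a deliberately simplified illustration : the tokenizeRtf function and the array layout of its tokens are hypothetical, they are not the RtfParser or RtfToken implementation, and the sketch ignores binary data (\bin) as well as the special handling of line breaks.

```php
<?php
// Simplified, hypothetical tokenizer (not part of RtfTools).
function tokenizeRtf ( $rtf )
   {
	$pattern	=  '/\\\\([a-zA-Z]+)(-?\d+)? ?'		// control word, optional parameter, optional delimiting space
			.  '|\\\\\'([0-9a-fA-F]{2})'		// character symbol : \'hh
			.  '|\\\\(.)'				// any other control symbol, such as \* or \{
			.  '|([{}])'				// opening or closing brace
			.  '|([^\\\\{}]+)/s' ;			// plain text
	$tokens		=  [] ;

	preg_match_all ( $pattern, $rtf, $matches, PREG_SET_ORDER ) ;

	foreach  ( $matches  as  $m )
	   {
		if  ( isset ( $m [1] )  &&  $m [1]  !==  '' )
		   {
			$param		=  ( isset ( $m [2] )  &&  $m [2]  !==  '' ) ?  (int) $m [2] : null ;
			$tokens []	=  [ 'type' => 'word', 'name' => $m [1], 'param' => $param ] ;
		    }
		else if  ( isset ( $m [3] )  &&  $m [3]  !==  '' )
			$tokens []	=  [ 'type' => 'char', 'code' => hexdec ( $m [3] ) ] ;
		else if  ( isset ( $m [4] )  &&  $m [4]  !==  '' )
			$tokens []	=  [ 'type' => 'symbol', 'name' => $m [4] ] ;
		else if  ( isset ( $m [5] )  &&  $m [5]  !==  '' )
			$tokens []	=  [ 'type' => ( $m [5]  ===  '{' ) ?  'open' : 'close' ] ;
		else
			$tokens []	=  [ 'type' => 'text', 'value' => $m [6] ] ;
	    }

	return ( $tokens ) ;
    }

// Six tokens : opening brace, \rtf1, \b, the text "Hello", \par and closing brace
$tokens		=  tokenizeRtf ( '{\rtf1\b Hello\par}' ) ;

foreach  ( $tokens  as  $token )
	echo json_encode ( $token ), "\n" ;
```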
Class diagram
Constructor
The RtfParser abstract base class has the following constructor :
public function __construct ( ) ;
No particular parameter is required.
Methods
public function GetControlWordValue ( $word, $default = '' )
Gets the currently applicable parameter value for the specified control word.
\u10356
The Rtf specification states that Unicode characters are followed by character symbols (using the "\'" tag) which specify the
code page character(s) that best match the preceding Unicode character :
\u10356\'a1\'b0
The number of character symbols that follow a Unicode character specification is given by the \uc tag ; in the above example, it should be written like this :
\uc2 \u10356\'a1\'b0
However, the specification states that this number (the parameter of the \uc tag) should be tracked, and that a stack of applicable values depending on the current curly brace nesting level should be maintained (the \uc tag may be present elsewhere in the document, not specifically before Unicode character specifications, and its default value is 1). So, in the above example, we have to answer the question : "What is the current value of the \uc tag ?" whenever we encounter a \u tag, to be able to determine the number of character symbols that follow it. For example, if the current value of the \uc tag is 1, then the following sequence will be interpreted as Unicode character #10356, whose nearest code page equivalent is character 161 (0xa1) ; the Unicode character is followed by an uppercase A (\'41) :
\u10356\'a1\'41
If the current value of the \uc tag is 2, then the above sequence will be interpreted as Unicode character #10356, whose nearest code page equivalent is the double-byte character 41281 (0xa141). To handle such a situation, you first have to call the TrackControlWord() method to tell the parser that you want to track the current value of the \uc tag, as in the following example :
$parser -> TrackControlWord ( 'uc', true, 1 ) ;
Then whenever a \u control word specifying a Unicode character is encountered when parsing an Rtf document, you can call this method to retrieve the currently applicable value for \uc :
$uc_value = $parser -> GetControlWordValue ( 'uc' ) ;
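The effect of the \uc value on the \u10356\'a1\'41 sequence discussed above can be checked with a short sketch. The readUnicodeCharacter function is hypothetical and is not part of RtfParser ; it only assumes that a \u tag is followed by at most $uc character symbols of the form \'hh.

```php
<?php
// Hypothetical helper (not part of RtfTools) : reads a \uN tag and the $uc
// character symbols that follow it, and returns what is left unread.
function readUnicodeCharacter ( $rtf, $uc )
   {
	if  ( ! preg_match ( '/^\\\\u(-?\d+) ?/', $rtf, $match ) )
		return ( false ) ;

	$code_point	=  (int) $match [1] ;
	$rest		=  (string) substr ( $rtf, strlen ( $match [0] ) ) ;
	$fallback	=  0 ;

	// Consume up to $uc character symbols ; each one contributes one byte
	for  ( $i = 0 ; $i  <  $uc ; $i ++ )
	   {
		if  ( ! preg_match ( '/^\\\\\'([0-9a-fA-F]{2})/', $rest, $symbol ) )
			break ;

		$fallback	=  ( $fallback  <<  8 )  |  hexdec ( $symbol [1] ) ;
		$rest		=  (string) substr ( $rest, 4 ) ;	// a character symbol is 4 bytes long
	    }

	return ( [ 'unicode' => $code_point, 'fallback' => $fallback, 'rest' => $rest ] ) ;
    }

$sequence	=  "\\u10356\\'a1\\'41" ;			// \u10356\'a1\'41
$with_uc1	=  readUnicodeCharacter ( $sequence, 1 ) ;	// fallback 161 (0xa1), \'41 left unread
$with_uc2	=  readUnicodeCharacter ( $sequence, 2 ) ;	// fallback 41281 (0xa141), nothing left unread

print_r ( $with_uc1 ) ;
print_r ( $with_uc2 ) ;
```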
The parameters are the following :
public function IgnoreCompounds ( $list )
When parsing an Rtf document, not all control words may be of interest to you. This method allows you to supply the list of control words you want to be ignored, as an array of strings :
$parser -> IgnoreCompounds ( [ 'fonttbl', 'listtable', 'listoverridetable', 'pict' ] ) ;
When implementing a class inheriting from RtfParser, you should consider which tags (or control words) are useless for the task you want to carry out ; parsing a whole Rtf document can be slow, so telling the parser which tags can be safely ignored will improve performance. This is especially true for control words such as \pict or \bin, which can embed huge amounts of data.
public function NextToken ( )
Returns the next available token from the Rtf input stream. The returned value is of type RtfToken, or false if all tokens have been processed. This method skips the tokens that are to be ignored, i.e. the ones that have been specified in a call to the IgnoreCompounds method.
public function Reset ( )
Resets the parser object, so that parsing can start again from the beginning.
public function SkipCompound ( )
Some Rtf constructs may not be of interest to you, depending on the result you want to achieve. Suppose for example that you do not want to interpret anything coming from the font table, which has a definition that looks like :
{\fonttbl{font definition 1}...{font definition n}}
SkipCompound allows you to continue past the closing brace that terminates the font table started by {\fonttbl, ignoring any content between these two delimiters. Note that the function will decrement the current brace nesting level.
public function TrackControlWord ( $word, $stackable, $default_value = false )
Tracks a control word specification in the current Rtf document. This allows, for example, raw data to be associated with a control word, as for the \pict tag. It also allows you to track control words whose value can be changed when entering a new nesting level and must be restored when exiting this nesting level (this is the case, for example, of the \uc tag). The parameters are the following :
Properties
$CurrentColumn
Returns the current column position in the parsed Rtf document. Columns are numbered from 1.
$CurrentLine
Returns the current line in the parsed Rtf document. Lines are numbered from 1.
$CurrentPosition
Returns the current byte offset from the start of the file. Byte offsets start at 0.
$NestingLevel
Current nesting level of curly braces.
RtfStringParser class
public function __construct ( $rtfdata, $chunk_size = 4 * 1024 * 1024 )
Creates an RtfParser object, using the specified Rtf data as a string. The parameters are the following :
RtfFileParser class
public function __construct ( $file, $record_size = 16384 )
Creates an RtfParser object, using the specified Rtf document. The parameters are the following :
RtfTemplater class
The RtfTemplater class allows for processing template documents using a specific macro language, in order to generate different final Rtf documents whose contents will depend on the input you supplied. Such input is mainly given through variables with as many different values as you have documents to process.
Overview
The principle of templating documents is really simple and needs only 3 steps :
Creating your first template
To create your first template, simply use your favorite word processor as long as it can save or export contents into Rtf format. Such an editor could be Microsoft Word, OpenOffice, LibreOffice or even Wordpad ! The following example document (let's assume this is an Rtf document) references 4 variables : TITLE, FIRSTNAME, LASTNAME and SENDER. It also uses the PHP date() function to put the current date :
Date : %( date ( 'd/m/Y' ) )%
Dear %$TITLE% %$FIRSTNAME% %$LASTNAME%,
Your reservation for the year 2016 Annual Congress of Pataphysical Scientists
has been confirmed.
Regards,
%$SENDER%.
You can notice a few things from the above document template :
Generating personalized documents
A simple script will allow us to generate personalized documents from the document template we saw in the previous section. The first thing we need to do is to include the RtfTemplater.phpclass file :
include ( 'RtfTemplater.phpclass' ) ;
$template_file = 'example.rtf' ; // Assume this is our example template above
Now, we will need to supply some data to generate personalized documents, using different values for the TITLE, FIRSTNAME, LASTNAME and SENDER variables referenced in our template. To do that, we have to put individual values in an array ; the example below declares an array that contains the variable substitutions for 3 recipients :
$recipients =
[
[
'TITLE' => 'Ms',
'FIRSTNAME' => 'Jane',
'LASTNAME' => 'Doe',
'SENDER' => 'Alfred Jarry, Senior Pataphysics Engineer'
],
[
'TITLE' => 'Mr',
'FIRSTNAME' => 'John',
'LASTNAME' => 'Smith',
'SENDER' => 'Alfred Jarry, Senior Pataphysics Engineer'
],
[
'TITLE' => 'Mr',
'FIRSTNAME' => 'Peter',
'LASTNAME' => 'Watson',
'SENDER' => 'Alfred Jarry, Senior Pataphysics Engineer'
]
] ;
Now we can generate an output document for each entry in our $recipients array ; we will build a loop, and create a new instance of the RtfFileTemplater class (RtfTemplater itself is abstract), using our base template document and recipient data :
for ( $index = 1, $count = count ( $recipients ) ; $index <= $count ; $index ++ )
{
$recipient = $recipients [ $index - 1 ] ; // Array entries are 0-based, output files are numbered from 1
$templater = new RtfFileTemplater ( $template_file, $recipient ) ;
$templater -> SaveTo ( "output.$index.rtf" ) ;
}
Viewing the results
The sample code above will generate 3 files : "output.1.rtf", "output.2.rtf" and "output.3.rtf". Let's view one of them :
Date : 25/10/2016
Dear Ms Jane Doe,
Your reservation for the year 2016 Annual Congress of Pataphysical Scientists
has been confirmed.
Regards,
Alfred Jarry, Senior Pataphysics Engineer.
That's all ! You just created your first mailing script.
Class diagram
Templater macro-language reference
The templating pseudo-language implements a few simple control structures. All expressions can reference variables that have been passed to the constructor of the RtfStringTemplater or RtfFileTemplater class :
$variables =
[
'VNAME1' => 'the value of vname1',
'VNAME2' => 'the value of vname2',
'INDEX' => 17,
'ARRAY' => [ 'string a', 'string b', 'string c' ],
'TITLE' => 'M.'
] ;
$document = new RtfStringTemplater ( $contents, $variables ) ;
Array keys are simply variable names, which are case-sensitive, while array values represent the string that will be substituted whenever the variable is referenced in the document. Note that in the above example, one of the variables, ARRAY, is not scalar ; such an array variable can be used in FOREACH constructs.
Language overview
The macro templating language provides the following constructs :
Every macro language construct must be surrounded by percent signs, as in the following examples :
%$VARIABLE%
%( date ( 'd/m/Y' ) )%
%FOR ( $I = 1 TO $INDEX )%
Paragraph marks (line breaks) between the enclosing percent signs of an instruction are ignored. Compound statements such as IF or loops can be nested.
The RtfTemplater class tries to be as smart as possible when differentiating macro constructs
from regular document contents.
Expressions
Expressions can reference variables passed to the class constructor, but they can also use any operators or functions provided by PHP. Expressions are replaced with their evaluation result in the output contents. As in PHP, variable names must be prefixed by the "$" sign ; for example (using our example $variables array described above) :
%$VNAME1%
will be substituted with :
the value of vname1
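The substitution of a simple variable reference can be sketched in a few lines of PHP. This is only an illustration working on plain text : the substituteVariables function is hypothetical, it does not evaluate expressions or control structures, and it ignores the Rtf tags that the real templater has to cope with. Undefined variables are replaced with an empty string, as the templater is documented to do.

```php
<?php
// Hypothetical helper (not part of RtfTools) : replaces every %$NAME% reference
// found in $text with the corresponding entry of $variables.
function substituteVariables ( $text, array $variables )
   {
	return  ( preg_replace_callback
	   (
		'/%\$([A-Za-z_][A-Za-z0-9_]*)%/',
		function ( $match )  use  ( $variables )
		   {
			$name	=  $match [1] ;		// Variable names are case-sensitive

			// Undefined (or non-scalar) variables expand to an empty string
			if  ( ! isset ( $variables [ $name ] )  ||  ! is_scalar ( $variables [ $name ] ) )
				return ( '' ) ;

			return ( (string) $variables [ $name ] ) ;
		    },
		$text
	    ) ) ;
    }

$variables	=  [ 'VNAME1' => 'the value of vname1', 'INDEX' => 17 ] ;

echo substituteVariables ( 'VNAME1 is "%$VNAME1%", INDEX is %$INDEX%.', $variables ), "\n" ;
```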
Referencing a variable name can be considered as the simplest possible expression ; when it comes to more complex expressions, you will need to enclose them between %( and )% :
Current index : %($INDEX + 100)%
Today is : %( date ( 'd/m/Y' ) )%
An expression can use any syntactic element allowed by PHP ; in addition, you can also call builtin functions, such as the date() function in the above example. Undefined variables will be expanded to an empty string and a warning will be issued, unless the $warnings parameter of the class constructor has been set to false. Note that variable names are case-sensitive.
IF constructs
IF constructs are a way to conditionally include text in your output document ; as with the traditional if statements of various programming languages, the IF statement accepts an expression enclosed in parentheses. You can use any syntactic element recognized by PHP, call builtin functions and reference variables passed to the class constructor :
%IF ( $TITLE == 'M.' )%The value of TITLE is : "M"%END%
In the above example, the output document will contain the following string if the value of the $TITLE variable has been set to "M." :
The value of TITLE is : "M"
An IF construct can have as many ELSEIF alternatives as needed, and an optional ELSE statement :
%IF ( $INDEX == 19 )%
index = 19
%ELSEIF ( $INDEX == 18 )%
index = 18
%ELSE%
index is neither 19 nor 18.
%END%
Using our example $variables array where the $INDEX variable has the value 17, you will notice that the output document will contain two empty lines before the string «index is neither 19 nor 18.». This is due to the fact that a paragraph mark (a line break) has been inserted after the ending percent sign of each IF, ELSEIF and ELSE statement. If you would like no line break to be inserted in the output, and still preserve the readability of your macro-language constructs, then you could put the ending percent sign at the beginning of the next line :
%IF ( $INDEX == 19 )
%index = 19
%ELSEIF ( $INDEX == 18 )
%index = 18
%ELSE
%index is neither 19 nor 18.
%END%
FOR loops
FOR loops are a way to repeat text a certain number of times. Specify a start and end index :
%FOR ( $i = 1 TO $INDEX )
%This is line #%$i%
%END%
The above example will insert 17 lines in the output document (the INDEX variable has been defined to be 17 in our variables array), from «This is line #1» to «This is line #17». You can also specify an optional step :
%FOR ( $i = 1 TO $INDEX BY 2 )%
or :
%FOR ( $i = 1 TO $INDEX STEP 2 )%
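The values taken by the loop variable can be pictured with the sketch below. The expandForLoop function is hypothetical and works on plain text only ; it assumes that the end index is inclusive (as in the 17-line example above) and that the step is a positive integer.

```php
<?php
// Hypothetical helper (not part of RtfTools) : repeats $body for each value of
// the loop variable, replacing the %$i% reference with the current value.
function expandForLoop ( $body, $start, $end, $step = 1 )
   {
	if  ( $step  <=  0 )
		throw new InvalidArgumentException ( 'The step must be a positive integer.' ) ;

	$output		=  '' ;

	// The end index is assumed to be inclusive
	for  ( $i = $start ; $i  <=  $end ; $i += $step )
		$output	.=  str_replace ( '%$i%', (string) $i, $body ) ;

	return ( $output ) ;
    }

// Outputs lines #1, #3 and #5
echo expandForLoop ( "This is line #%\$i%\n", 1, 5, 2 ) ;
```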
FOREACH loops
FOREACH loops are based on array variables (look at the 'ARRAY' entry of the $variables array in the example above). The following instruction will output the text "string a", "string b", "string c" on separate lines :
%FOREACH ( $value IN $ARRAY )%%$value%
%END%
REPEAT loops
REPEAT loops are only a shortcut for FOR loops :
%REPEAT ( $i = $INDEX )%
is equivalent to :
%FOR ( $i = 1 TO $INDEX )%
Predefined variables
The following variables are predefined and can be referenced anywhere :
Coping with percent signs
The templater class does its best to distinguish control statements from pure text. It will for example correctly handle the following case :
Tax rate is : 20%
some other text
%$VNAME%
However, if you follow "20%" with a sign that is recognized as the start of an expression, such as an opening parenthesis :
Tax rate is : 20% (since 2016)
some other text
%$VNAME%
then it will try to interpret the string "% (since 2016) some other text%" as an expression and will issue a warning, because this is not a valid computed expression. To avoid such situations, simply double the percent sign, as in the following :
Tax rate is : 20%% (since 2016)
Under the hood...
Constructor
The RtfTemplater abstract base class has the following constructor :
public function __construct ( $variables, $warnings = true )
The parameters are the following :
Methods
public function AsString ( )
Returns the preprocessed contents of an Rtf template as a string, using the variables that have been specified to the class constructor.
public function SeparateTextFromRtf ( $contents )
Separates the tags and text parts of a piece of Rtf contents. This function is especially used for extracting template constructs delimited by percent signs. Due to user manipulations in the template document, Rtf tags may end up interspersed with real template constructs. Imagine for example that in the string "%$VNAME2%", the "%V" and "2%" parts have been put in boldface ; the corresponding Rtf code may look like :
%$VN}{\rtlch\fcs1 \af0 \ltrch\fcs0 \lang2057\langfe1036\langnp2057\insrsid15075231 AM}
{\rtlch\fcs1 \af0 \ltrch\fcs0 \b\lang2057\langfe1036\langnp2057\insrsid15075231
\charrsid15075231 E2%
This method ensures that both the original text contents (%$VNAME2%) and Rtf data will be preserved. It returns an associative array containing two entries :
public function SaveTo ( $filename )
Saves the preprocessed contents of an Rtf template to a file, using the variables that have been specified to the class constructor.
Properties
public $Variables
Document variables, as specified to the class constructor. Since this property is public, it can be freely changed after instantiating the class.
public $Warnings
Enables/disables warnings. The initial value of this property is the one passed to the constructor.
private static $TagsWithTextParameter
This internal array is used by the SeparateTextFromRtf method to identify Rtf tags that always include a text parameter, which is not to be confused with regular text coming from the document.
RtfStringTemplater class
public function __construct ( $rtfdata, $variables = [], $warnings = true, $chunk_size = 4 * 1024 * 1024 )
Creates an RtfTemplater object, using the specified Rtf data as the template. A typical usage could be :
$variables = [ 'FIRSTNAME' => 'Jane', 'LASTNAME' => 'Doe' ] ;
$doc = new RtfStringTemplater ( file_get_contents ( 'sample.rtf' ), $variables ) ;
$doc -> SaveTo ( 'sample.txt' ) ; // Save templated contents to output file
echo $doc -> AsString ( ) ; // Echo templated contents to standard output
The parameters are the following :
RtfFileTemplater class
public function __construct ( $file, $variables = [], $warnings = true, $record_size = 16384 )
Creates an RtfTemplater object, using the specified Rtf file as a template. A typical usage could be :
$variables = [ 'FIRSTNAME' => 'Jane', 'LASTNAME' => 'Doe' ] ;
$doc = new RtfFileTemplater ( 'sample.rtf', $variables ) ;
$doc -> SaveTo ( 'sample.txt' ) ; // Save templated contents to output file
echo $doc -> AsString ( ) ; // Echo templated contents to standard output
The parameters are the following :
RtfTexter class
The RtfTexter class extracts text from an Rtf document. Extracting text from an Rtf document is pretty simple, as shown by the following example :
<?php
include ( 'path/to/RtfTexter.phpclass' ) ;
$texter = new RtfFileTexter ( 'sample.rtf' ) ;
echo $texter -> AsString ( ) ; // Echo text contents
$texter -> SaveTo ( 'sample.txt' ) ; // Save text contents to sample.txt
Class diagram
Constructor
The constructor of the RtfTexter class has the following signature :
public function __construct ( $options = self::TEXTEROPT_ALL, $page_width = 80 )
Parameters are the following :
Methods
public function AsString ( )
Returns the text contents of an Rtf document as a string.
public function SaveTo ( $filename )
Saves the text contents of an Rtf document to the specified file.
protected function FormatParagraphs ( $data )
Internal method. Formats the specified paragraph(s) (which may contain several lines) to fit the width specified by the $PageWidth property. This method is called only if the $Options property has the TEXTEROPT_WRAP_TEXT flag set.
protected function SetOptions ( $flags )
protected function TextifyData ( &$data, $nesting_level_to_reach = false )
Internal method. Processes the text data to be extracted.
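As an illustration of the kind of wrapping that FormatParagraphs is described to perform, the following sketch wraps each paragraph of a text to a maximum width using the standard PHP wordwrap() function. The wrapParagraphs function and its parameters are hypothetical ; the actual method may well proceed differently.

```php
<?php
// Hypothetical helper (not part of RtfTools) : wraps each paragraph of $text
// so that no line is longer than $page_width characters.
function wrapParagraphs ( $text, $page_width = 80, $eol = "\n" )
   {
	$paragraphs	=  explode ( $eol, $text ) ;

	// Wrap each paragraph separately ; words longer than the width are cut
	foreach  ( $paragraphs  as  $index => $paragraph )
		$paragraphs [ $index ]	=  wordwrap ( $paragraph, $page_width, $eol, true ) ;

	return ( implode ( $eol, $paragraphs ) ) ;
    }

echo wrapParagraphs ( "The quick brown fox\nsat over the lazy dog", 10 ), "\n" ;
```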
Properties
public $Eol
String used for end of lines.
public $Options
Option flags (a combination of TEXTEROPT_* constants).
public $PageWidth
Maximum width, in characters, of a page. This setting will be enforced only if the TEXTEROPT_WRAP_TEXT flag is set for the $Options property.
protected static $IgnoreList = [ ... ]
Compound tags that can be safely ignored during text extraction.
protected static $TranslatedCharacters = [ ... ]
Characters that must be substituted to avoid spurious data in the output. Such characters are for example the left and right double-quotes.
protected static $TranslatedTags = [ ... ]
Tags that are to be translated either to their ascii or html entity equivalents.
Constants
TEXTEROPT_* constants
Gets/sets the flags that will condition the text extraction process. It can be any combination of the following flags :
RtfStringTexter class
public function __construct ( $rtfdata, $options = self::TEXTEROPT_ALL, $page_width = 80 )
Loads Rtf data for further extraction. The parameters are the following :
A typical usage could be :
$doc = new RtfStringTexter ( file_get_contents ( 'sample.rtf' ) ) ;
echo $doc -> AsString ( ) ; // Echo text contents from file sample.rtf
$doc -> SaveTo ( 'sample.txt' ) ; // Save text contents to file sample.txt
RtfFileTexter class
public function __construct ( $file, $options = self::TEXTEROPT_ALL, $page_width = 80 )
Loads Rtf data from the specified file for further extraction. The parameters are the following :
A typical usage could be :
$doc = new RtfFileTexter ( 'sample.rtf' ) ;
echo $doc -> AsString ( ) ; // Echo text contents from file sample.rtf
$doc -> SaveTo ( 'sample.txt' ) ; // Save text contents to file sample.txt
Internal classes reference
This section provides a reference to the classes that are used internally by the RtfTools package and are not normally exposed to the outside world.
RtfMergerDocument class
Whenever a document is added to a merger object, it is wrapped by an RtfMergerDocument object which basically performs the following tasks :
Constructor
The constructor of the RtfMergerDocument class has the following signature :
public function __construct ( $parent, $document, $global_header )
Parameters are the following :
Methods
protected function ExtractColorTable ( $header )
Extracts the color table from the document header. Updates the global header accordingly and holds a table of color renumberings in case of conflicts.
protected function ExtractFontTable ( $header )
Extracts the font table from the document header. Updates the global header accordingly and holds a table of font renumberings in case of conflicts.
protected function ExtractListTable ( $header )
Extracts the list table from the document header. Updates the global header accordingly and holds a table of list renumberings in case of conflicts.
protected function ExtractOverrideListTable ( $header )
Extracts the override list table from the document header. Updates the global header accordingly and holds a table of list override renumberings in case of conflicts.
protected function ExtractStylesheetTable ( $header )
Extracts the stylesheet table from the document header. Updates the global header accordingly and holds a table of stylesheet renumberings in case of conflicts.
protected function ExtractSettings ( $header )
Extracts the various settings that can be found in the header of an Rtf document, specified as single tags. In the current version, a warning will be issued if one of the documents has a header setting different from the first one that has been encountered. Future versions may be able to handle different setting values more gracefully.
public function GetBody ( $remove_rsid = true )
Returns the body of the underlying document, once all the renumbering operations have been applied for the color tables, font tables, stylesheet tables and so on.
protected function ReplaceReferences ( $text, $remove_rsid = false, $renumber_shapes = false )
This method is called by the GetBody method to replace any reference to colors, fonts, styles and lists with their new number in the merged document. The $remove_rsid parameter specifies whether revision history information should be removed from the document body.
Although the default value of this parameter is false, the RtfMerger class always sets it to true. The $renumber_shapes parameter specifies whether shapes should also be renumbered. The only reason why this parameter should be false is when processing stylesheet contents.
protected function ReplaceStylesheetReferences ( )
Since stylesheets contain formatting tags, some of them may reference elements that need to be renumbered (colors, fonts, etc.). This method calls the ReplaceReferences method to perform the necessary replacements that apply to stylesheet contents.
Properties
protected $BodyOffset
Holds the byte offset, into the underlying Rtf document, of the body start.
private static $DEBUG = false
Outputs debug information when set to a combination of the RTFMERGER_DEBUG_* constants.
protected $Document
Holds the underlying document object.
protected $Parent
Holds the parent RtfMerger object.
Constants
RTFMERGER_DEBUG_* constants
The RTFMERGER_DEBUG_* constants can be used to set the RtfMergerDocument::$DEBUG property to output useful debug information :
RtfMergerHeader class
The RtfMergerHeader class is used internally by the RtfMerger class to collect header information from the various documents to be merged.
During this process of collecting information, the class has to be considered as passive : it is manipulated by the various
RtfMergerDocument instances that represent the documents to be merged.
Constructor
The constructor has no parameter and only instantiates an object of class RtfMergerHeader.
Methods
public function BuildHeader ( )
Returns the Rtf code for the header (aka Global header) of the output merged file.
public function GetColorTable ( )
Returns the Rtf code for the color table containing all the colors coming from the documents to be merged.
public function GetDocumentInformation ( )
Returns the Rtf code for document information (see the Document Information Properties section for more information).
public function GetFontTable ( )
Returns the Rtf code for the font table containing all the font definitions coming from the documents to be merged.
public function GetGenerator ( )
Returns the Rtf code for the generator entry, which specifies the software that has generated the document.
public function GetListTable ( )
Returns the Rtf code for the list table containing all the list definitions coming from the documents to be merged.
public function GetListOverrideTable ( )
Returns the Rtf code for the list override table containing all the list overrides coming from the documents to be merged.
public function GetStylesheetTable ( )
Returns the Rtf code for the stylesheet table containing all the stylesheets coming from the documents to be merged.
Properties
public $ColorTable = []
An associative array whose keys are the color definitions, and whose values are color indexes. See the Color tables section of Merger process for more information on how this table is built and upgraded when processing new documents to be merged.
Document information properties
Document information is a special compound tag (\info) that allows you to specify creator's information in an Rtf document. All the properties below can be accessed through the RtfMerger object, without the "Info" prefix :
All those properties default to the value false. When set to a string, they will be written in the \info tag when generating the merged document. The creation and revision times will also be automatically added to the output document information. Note that the InfoKeywords property is an array of strings. See the Document information properties section of Merger process for more information on how these properties can be accessed directly through an RtfMerger object.

public $FontTable = []
An associative array whose keys are the md5 hash of the "anonymized" version of the font definition, and whose values are associative arrays containing the following entries :
See the Font tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

public $ListTable = []
An associative array whose keys are the md5 hash of the "anonymized" version of the list definition, and whose values are associative arrays containing the following entries :
See the List tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

public $ListOverrideTable = []
An array containing the list overrides, where the references to the list entries have been renumbered when necessary. See the List override tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

public $NextShapeId = 1000
Shapes are one of the rare elements contained in the body part of a document that need to be renumbered to avoid conflicts across multiple documents. This number is incremented each time a new shape is found in a document. See the Shapes section of Merger process for more information on how shape numbering is processed across multiple documents.

public $Settings = []
An associative array whose keys are tags (also known as control words in the Microsoft documentation), and whose values are the tag parameters. See the Global properties section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

public $StylesheetTable = []
An associative array whose keys are the md5 hash of the "anonymized" version of the stylesheet definition, and whose values are associative arrays containing the following entries :
Note that stylesheets can include tags such as \sbasedonx, \snextx and
\slinkx, where x is also a stylesheet index. See the Stylesheet tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

RtfToken classes
The NextToken method of the RtfParser class is used to parse Rtf documents and retrieve the next token available from the Rtf stream. The value returned by this method is always an object inheriting from the RtfToken abstract class.

If you have browsed the RtfTools package documentation and source code, you may have noticed that most of the classes use their own simplified internal parser. This is the case, for example, for classes such as RtfBeautifier and RtfTemplater, whose parsing needs are basic and do not require an elaborate method to analyze Rtf contents.

In some situations, however, you may have more complex parsing needs. This is the case for the RtfTexter class, which needs to differentiate between pure plain text and what is to be considered as a parameter of a compound statement ; you will encounter such situations with font definitions for example, which look like this :
{\f1 ... Times New Roman;}
...
{\par ... This is a sample paragraph.}
The string "Times New Roman;" in the above example is not plain text, but rather the display name of the font identified by id #1 (\f1 tag). The second line, however, introduces a new paragraph, whose contents are "This is a sample paragraph.". By using the RtfParser class, you can tell whether the text found before the closing brace is to be interpreted as plain text or not.

The sections below describe the various kinds of objects returned by the NextToken method of the RtfParser class. The section related to the RtfToken class shows the methods and properties common to all its derived classes. The sections related to classes inheriting from RtfToken only show the differences and additions specific to those classes. All of these classes are instantiated by the NextToken method of the RtfParser class, and are not meant to be instantiated from other places.

Class diagram

RtfToken class
The RtfToken abstract class provides public properties and methods that are common to every syntactic element that can be found in an Rtf document.

Constructor
The RtfToken class constructor is called by all its derived classes and has the following signature :
public function __construct ( $type, $text, $space_after, $offset, $line, $column )
The parameters are the following :
Methods

public function ToRtf ( )
Returns the whole token, as it was found in the Rtf stream.

public function ToText ( )
Returns the whole token, as it was found in the Rtf stream.

public function __tostring ( )
A synonym for ToRtf.

Properties

public $Column
Column number of the start of the Rtf tag in the input document. Column numbers start at 1.

public $Line
Line number of the start of the Rtf tag in the input document. Line numbers start at 1.

public $Offset
Byte offset of the start of the Rtf tag in the input document. Byte offsets start at 0.

public $SpaceAfter
Set to true if the related tag is followed by a space (a space after a control word is considered part of the control word, not plain text).

public $Text
Contains the Rtf syntactic element, as it was found in the input Rtf stream.

public $Type
Token type, as described in the TOKEN_* section of the RtfDocument class.

RtfControlSymbolToken class
Implements a control symbol token, such as \~ (non-breaking space) or \- (optional hyphen).

Constructor
The constructor of the RtfControlSymbolToken class has the following signature :
public function __construct ( $char, $offset, $line, $column )
The $char parameter indicates the character following the leading backslash. Other parameters are the same as for the RtfToken class.

Methods

public function ToText ( )
Returns the token text, i.e. the actual character expressed by the input Rtf tag. The following substitutions occur :
RtfControlWordToken class
Implements a control word, such as \par or \f12. This class also handles "special" control words, which are preceded by the \* special construct, as in \*\panose.

Constructor
The class constructor has the following signature :
public function __construct ( $word, $space, $special, $offset, $line, $column )
The $word parameter holds the control word itself, followed by its optional integer parameter. The $special parameter is a boolean value that indicates whether the control word was preceded by the \* special construct or not. Other parameters are the same as for the RtfToken class.

Properties

ControlWord
Control word. For a tag such as \*\pnseclvl1, this property will contain the string "pnseclvl".

Parameter
Holds the optional integer parameter after the control word. For a tag such as \*\pnseclvl1, this property will contain the integer value 1. If the control word does not have any parameter, this property will be set to the empty string.

Special
A boolean value that indicates whether the control word is a special one, i.e. preceded by the \* construct.

RtfDataToken class
The RtfDataToken class is an abstract base class for Rtf compound constructs that end with some data before the last closing brace. Such a construct could be, for example, a picture, denoted by the \pict control word.

Constructor
The constructor of the RtfDataToken class has the following signature :
public function __construct ( $type, $data, $offset, $line, $column )
The $data parameter holds the data that has been found before the last closing brace. Other parameters are the same as for the RtfToken class.

RtfBDataToken class
The RtfBDataToken class handles what may be the only tag in the Rtf specification whose parameter gives the length of the data immediately following it : the \bin tag. The following example defines some binary data which is 10 bytes long :
{\bin10 0123456789}
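The length parameter matters because binary data may itself contain braces or backslashes, so the payload cannot be delimited by scanning for the closing brace. The following is a minimal sketch of that idea, not the RtfParser implementation : the read_bin_data function is a hypothetical helper, and it assumes that a single space separates the control word from the data, as in the example above.

```php
<?php
/**
 * Hypothetical helper (not part of RtfTools) : extracts the payload of the
 * first \binN tag found at or after $offset.
 * Returns the raw bytes, or false if no complete \bin payload is found.
 */
function read_bin_data(string $rtf, int $offset = 0)
{
    // \bin, then the byte count, then an optional single space delimiter
    if (!preg_match('/\\\\bin(\d+) ?/', $rtf, $match, PREG_OFFSET_CAPTURE, $offset)) {
        return false;
    }

    $length = (int) $match[1][0];
    $start  = $match[0][1] + strlen($match[0][0]);

    // The payload is taken by length, not by looking for the closing brace
    if ($start + $length > strlen($rtf)) {
        return false;
    }

    return substr($rtf, $start, $length);
}

$data = read_bin_data('{\bin10 0123456789}');
```

With the sample above, $data receives the ten characters that follow the delimiter. A payload shorter than the announced length is reported as false rather than silently truncated.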
Constructor
The constructor of the RtfBDataToken class has the following signature :
public function __construct ( $data, $offset, $line, $column )
The $data parameter holds the binary data located just after the \bin control word (in the example above, this would be the string "0123456789"). Other parameters are the same as for the RtfToken class.

Properties

RelatedControlWord
Indicates the control word which is related to this data entry (\pict for pictures, \bin for binary data, and any other control word that starts a compound statement containing character data).

RtfPCDataToken class
Holds free-form text data specified within curly braces.

Constructor
The constructor of the RtfPCDataToken class has the following signature :
public function __construct ( $data, $offset, $line, $column )
The $data parameter holds text data located just before the closing brace. Other parameters are the same as for the RtfToken class.

Methods

ToText
Returns the character data after removing newlines and carriage returns, which are not part of the text.

RtfSDataToken class
The RtfSDataToken class holds hexadecimal data that represent an embedded image. This is typically the kind of data found in \pict tags :
{\pict 0ABC2937DF...}
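Since the picture is stored as hexadecimal text, it has to be converted back to bytes before it can be used as an image. The sketch below illustrates one way to do it ; decode_pict_hex is a hypothetical helper that is not part of RtfTools, and it assumes that the hexadecimal text has already been isolated from the surrounding group and that any whitespace within it can be discarded.

```php
<?php
/**
 * Hypothetical helper (not part of RtfTools) : converts the hexadecimal text
 * found in a \pict group into raw image bytes.
 * Returns false if the text is not valid hexadecimal data.
 */
function decode_pict_hex(string $hex)
{
    // Whitespace, including line breaks, is assumed not to be picture data
    $hex = preg_replace('/\s+/', '', $hex);

    // hex2bin() requires an even number of hexadecimal digits
    if ($hex === '' || strlen($hex) % 2 !== 0 || !ctype_xdigit($hex)) {
        return false;
    }

    return hex2bin($hex);
}

$bytes = decode_pict_hex("0ABC\r\n2937DF");
```

Validating the input first avoids the warning that hex2bin() raises on malformed data, which is useful when the source document may be truncated.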
Constructor
The constructor of the RtfSDataToken class has the following signature :
public function __construct ( $data, $offset, $line, $column )
The $data parameter holds text data located just before the closing brace. Other parameters are the same as for the RtfToken class.

RtfEscapedCharacterToken class
Holds a character representation specified using the \'xy notation, where x and y are hexadecimal digits representing the character code in the Windows Ansi character set.

Constructor
The constructor of the RtfEscapedCharacterToken class has the following signature :
public function __construct ( $hex, $offset, $line, $column )
The $hex parameter specifies the integer code of the character specification that has been found in the input Rtf stream. Other parameters are the same as for the RtfToken class.

Methods

public function ToText()
Returns the underlying character value, as a string.

Properties

public $Char
Holds the string value of the character.

public $Ord
Holds the integer character code.

RtfEscapedExpressionToken class
The RtfEscapedExpressionToken class is designed to represent escaped characters that may have a special syntactic meaning within an Rtf document, such as \{, \} and \\. Although such cases could have been covered by the RtfControlSymbolToken class, they have been intentionally kept distinct so that more advanced parsers can tell the two cases apart without further testing.

Constructor
The constructor of the RtfEscapedExpressionToken class has the following signature :
public function __construct ( $char, $offset, $line, $column )
The $char parameter specifies the character immediately after the backslash. Other parameters are the same as for the RtfToken class.

Methods
Returns the character following the backslash, as a string.

Properties

public $Char
Holds the string value of the character.

RtfInvalidToken class
In some cases, the NextToken method of the RtfParser class can return an RtfInvalidToken object, to indicate that something unexpected was found in the input Rtf stream. Such cases can arise in the following situations :
Constructor
The constructor of the RtfInvalidToken class has the following signature :
public function __construct ( $text, $offset, $line, $column )
All parameters have the same meaning as for the RtfToken class.

RtfLeftBraceToken class
The RtfLeftBraceToken class represents an opening brace, which is one of the basic Rtf syntactic elements.

Constructor
The constructor of the RtfLeftBraceToken class has the following signature :
public function __construct ( $space_after, $offset, $line, $column )
All parameters have the same meaning as for the RtfToken class.

RtfNewlineToken class
The RtfNewlineToken class represents a line break that has been encountered in the input Rtf stream. Since line breaks, which are normally represented by newlines or cr+lf sequences, are not significant, extended parsers relying on the RtfParser class can safely ignore them (note that the current line and column positions in the Rtf input stream will be updated accordingly anyway).

Constructor
The constructor of the RtfNewlineToken class has the following signature :
public function __construct ( $text, $offset, $line, $column )
All parameters have the same meaning as for the RtfToken class.

RtfRightBraceToken class
The RtfRightBraceToken class represents a closing brace, which is one of the basic Rtf syntactic elements.

Constructor
The constructor of the RtfRightBraceToken class has the following signature :
public function __construct ( $space_after, $offset, $line, $column )
All parameters have the same meaning as for the RtfToken class.
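As a recap of the backslash-based token categories described in this section, the sketch below classifies a single sequence as an escaped character, an escaped expression, a control word or a control symbol. It is only an illustration under stated assumptions, not the RtfParser implementation : classify_sequence and the array keys it returns are hypothetical names, and the input string is assumed to start at the backslash.

```php
<?php
/**
 * Hypothetical classifier (not the RtfParser implementation) : reports which
 * kind of token the backslash sequence at the start of $rtf corresponds to.
 */
function classify_sequence(string $rtf): array
{
    // \'xy : character given by two hexadecimal digits
    if (preg_match('/^\\\\\'([0-9a-fA-F]{2})/', $rtf, $m)) {
        return ['kind' => 'escaped character', 'ord' => hexdec($m[1])];
    }

    // \{ \} \\ : escaped characters that have a syntactic meaning
    if (preg_match('/^\\\\([{}\\\\])/', $rtf, $m)) {
        return ['kind' => 'escaped expression', 'char' => $m[1]];
    }

    // Optional \* prefix, letters, optional integer parameter, optional space
    if (preg_match('/^(\\\\\*)?\\\\([a-zA-Z]+)(-?\d+)?( ?)/', $rtf, $m)) {
        return [
            'kind'        => 'control word',
            'special'     => $m[1] !== '',
            'word'        => $m[2],
            // Empty string when the control word has no parameter
            'parameter'   => $m[3] === '' ? '' : (int) $m[3],
            'space_after' => $m[4] === ' ',
        ];
    }

    // Backslash followed by any other non-letter character, such as \~ or \-
    if (preg_match('/^\\\\([^a-zA-Z])/', $rtf, $m)) {
        return ['kind' => 'control symbol', 'char' => $m[1]];
    }

    return ['kind' => 'invalid'];
}

$token = classify_sequence('\*\pnseclvl1');
```

The order of the tests matters : \'xy and the escaped braces are checked before the generic control symbol rule, which would otherwise match them too.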