Automating the Structural Markup Process in the Conversion of Print Documents to Electronic Text

Casey Palowitch, Darin Stewart

University of Pittsburgh Library System
Pittsburgh PA USA 15260
Tel: 1-412-648-7859



This paper presents the initial results of an initiative to construct a system for automatically identifying structural features and applying Standard Generalized Markup Language (SGML) tagging, based on the Text Encoding Initiative Document Type Definition (TEI DTD), to text captured from print documents via optical character recognition (OCR). The system interprets typographical and geometric analysis output from the OCR process, mapping combinations of characteristic features to TEI constructs based on a user-generated document analysis specification. This development is part of a pilot project to create from the original paper document a TEI-encoded edition of the Transactions of the American Medical Association, Vol. 2, 1849, a research resource for 19th century United States medical and urban historical study. The goal of the project is a generalizable solution for automated minimal tagging of printed documents, allowing a more rapid creation of structurally-encoded texts for digital libraries.

KEYWORDS: automatic tagging, SGML, optical character recognition, document structure recognition, electronic document conversion


Retrospective conversion of printed documents into digital forms is an increasingly necessary and desirable task facing those building digital libraries. Networked access technologies like the World Wide Web (WWW) allow remote use of digital collections and a variety of new methodologies for interaction and manipulation. Digital conversion efforts along these lines are becoming commonplace, with hundreds of projects underway around the world to capture existing printed monographs, serials, and special-format items with digital technologies. Digital imaging technology is inexpensive and can provide high-resolution and color representations, but images cannot by themselves be indexed or rapidly searched for content. OCR and structural markup systems such as the Standard Generalized Markup Language (ISO 8879) [8] allow the creation of digital documents that can be indexed and searched as textual databases. However, for the high-production needs of libraries, the inaccuracy of commercial OCR software on all but the most modern and uniform documents still demands thorough proofreading and error correction, and the intensive manual labor required for SGML text markup is often a barrier to its more widespread use. While some commercial projects may be able to justify investment in manual markup effort [7], libraries that wish to provide their patrons with the access, searching, and manipulation possibilities inherent in structurally marked-up texts must find alternatives. One alternative is automation of the markup process.

Many current projects in document image analysis and structure and pattern recognition take a top-down approach, identifying structural elements at the page or text-block level before proceeding to the recognition of individual characters [1]. Others build document representations in a bottom-up manner, parsing document 'tokens' to build structures [4][5][6]. The present system differs from these efforts in two fundamental respects. First, rather than 'parsing' document structures, the present system works only at the level of layout, mapping combinations of presentation characteristics to structural elements. Second, the present system does not attempt to 'read' or 'understand' the information content of the document [11], merely to encode its structure as accurately and efficiently as possible for digital library applications such as network delivery or automated information retrieval.

The present system operates on information generated by the OCR software Aurora to recognize structures from their physical characteristics. In addition to its extremely accurate character recognition, Aurora is designed to provide page layout information from which text structures can be inferred, specifically line geometry information (vertical and horizontal positions on the page, length of line, height of line) and typographic information (font size and style information). One recent similar effort to define a system architecture for translating the Xerox XDOC format (word-level geometry and font data) into structural markup has influenced the present approach [10]. At least one commercial software system, FastTag, by Avalanche Development, Inc., provides a similar solution to that of the present effort, but the advantage of Aurora's exceptional character recognition accuracy, especially with older printed documents, was the significant factor in its choice as the OCR engine for the system.

Figure 1. System Architecture


Original print document. Our project is focused on creating an electronic-text edition of the Transactions of the American Medical Association, Vol. 2, 1849, the precursor to the Journal of the AMA. This document was chosen for the project primarily because of its research importance and heavy interlibrary-loan demand at the Falk Library of the Health Professions at the University of Pittsburgh. The 1000-page volume is also fragile: the paper crumbles with normal use, and its spine and binding have disintegrated. A searchable e-text edition will allow remote use of the document without requiring its physical transfer.

Optical scan. Because the original was fragile and available only for a limited time, it was necessary to capture the highest quality image in one pass. A single high-resolution TIFF image of each page was captured using a Ricoh IS-50 scanner on a Macintosh IIcx running the Optix document image management system from Blueridge Technologies, Inc. All pages were scanned at the scanner's native resolution of 400 pixels per inch in 8-bit grayscale. This produced large individual page image files of approximately 6 megabytes each, but allowed the derivation of other formats as required without rescanning. This also provided the data needed to allow Aurora to perform its most accurate OCR processing. Due to the size of these files and limited disk space, a maximum of 150 pages could be scanned in one session, and because of document fragility, scanning was painstaking. Each scanning session lasted approximately three hours. The resulting 5.5 GB of image data were stored on 4mm DAT tapes and retrieved from tape for later processing.
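The scanning figures reported above can be cross-checked with simple arithmetic. The following is only an illustrative sketch in Python: it uses the numbers stated in this section, treats the approximate values as exact, and assumes decimal units (1 GB = 1000 MB) and sessions that always reach the stated 150-page maximum, neither of which the text confirms.

```python
import math

# Figures taken from the text; all are approximate in the original.
PAGES = 1000               # length of the volume
MB_PER_PAGE = 6            # size of one 400 ppi, 8-bit grayscale page image
PAGES_PER_SESSION = 150    # stated maximum per scanning session
HOURS_PER_SESSION = 3      # approximate session length
REPORTED_TOTAL_GB = 5.5    # total image data reported


def estimated_total_gb(pages=PAGES, mb_per_page=MB_PER_PAGE):
    """Total volume if every page were exactly mb_per_page (decimal GB)."""
    return pages * mb_per_page / 1000


def session_volume_mb(pages=PAGES_PER_SESSION, mb_per_page=MB_PER_PAGE):
    """Disk space needed to hold one full scanning session."""
    return pages * mb_per_page


def seconds_per_page(hours=HOURS_PER_SESSION, pages=PAGES_PER_SESSION):
    """Average handling time per page in a full session."""
    return hours * 3600 / pages


def minimum_sessions(pages=PAGES, per_session=PAGES_PER_SESSION):
    """Fewest sessions that could cover the whole volume."""
    return math.ceil(pages / per_session)


def implied_mb_per_page(total_gb=REPORTED_TOTAL_GB, pages=PAGES):
    """Average page size implied by the reported total."""
    return total_gb * 1000 / pages


if __name__ == "__main__":
    print(estimated_total_gb(), "GB estimated vs", REPORTED_TOTAL_GB, "GB reported")
    print(session_volume_mb(), "MB per full session")
    print(seconds_per_page(), "seconds per page")
    print(minimum_sessions(), "sessions at minimum")
    print(implied_mb_per_page(), "MB per page implied by the total")
```

Under these assumptions the estimate and the reported total agree to within about ten percent, which is consistent with page files of "approximately" 6 megabytes.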

Document Analysis. In order for the post-processing application to associate geometric and typographical data with the existence of a structural element, it must be provided with a document analysis specification. The specification contains three parts. The first is a precise specification of the presentational and geometric characteristics that can identify uniquely the structural components. The second is a structure-hierarchical model of the text, which must identify the hierarchical relationships among the structural components (i.e. nesting of elements). The third component is a mapping of font changes to desired markup (e.g. emphasis, quotations) and the mapping of the document's glyph vocabulary to SGML entities for substitution. The document analysis specification is prepared in much the same manner as document analysis prior to SGML document type definition design, referring back to the original document (or document images) to determine characteristics and relationships.
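To make the three parts of the specification concrete, the sketch below shows one way such a specification might be represented. It is written in Python for illustration only; the paper does not publish its specification format, so every key, threshold, and font code here is a hypothetical placeholder. Only the entity names (amp, eacute, mdash, sect) are standard ISO 8879 entity names.

```python
# Hypothetical document analysis specification. All names, thresholds,
# and font codes are illustrative assumptions, not the authors' format.
SPEC = {
    # Part 1: presentational and geometric characteristics that
    # identify each structural component (made-up values).
    "features": {
        "HEAD": {"centered": True, "max_length": 200},
        "P": {"min_first_line_indent": 10},
    },
    # Part 2: structural hierarchy, outermost element first.
    "hierarchy": ["DIV:section", "DIV:subsection", "P"],
    # Part 3a: font-change codes mapped to markup (codes are invented).
    "fonts": {"i": "EMPH", "q": "Q"},
    # Part 3b: glyphs mapped to SGML entity references.
    "entities": {
        "&": "&amp;",
        "\u00e9": "&eacute;",   # e with acute accent
        "\u2014": "&mdash;",    # em dash
        "\u00a7": "&sect;",     # section sign
    },
}


def nesting_level(element, spec=SPEC):
    """Depth of an element in the hierarchy (0 is outermost)."""
    return spec["hierarchy"].index(element)


def substitute_entities(text, entities=SPEC["entities"]):
    """Replace special glyphs with entity references in a single pass.

    Working character by character means the '&' inside an entity
    reference that was just emitted is never escaped a second time.
    """
    return "".join(entities.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(nesting_level("P"))
    print(substitute_entities("Smith & Co."))
```

Keeping the hierarchy as an ordered list is one simple design choice: the position of an element in the list is enough to decide which open elements must close when a new one begins.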

Because the Transactions text was to be marked up using the Text Encoding Initiative DTD [3], the specification phase produced a mapping of the structural elements of the Transactions to the elements provided for in the TEI DTD. For the first phase of the project, the following TEI elements were amenable to unique identification using a combination of geometric and presentation characteristics:

· Four levels of nested structural divisions (DIV elements of three types and P paragraphs)

· Headers at each structural level when present (HEAD elements)

· Footnotes, both occurrence point in text and at page bottom (NOTE elements)

· Tables identified and wrapped for later cell-oriented markup (TABLE elements)

· Page breaks, with page numberings inserted as attributes (PB elements)

· Emphasis markup (e.g. EMPH, Q elements) mapped with attributes

@24: h 13 d -8 x 43 y 445 x 311 l 268

tageous their use will be eminently injurious.

@25: h 14 d -4 x 178 y 486 x 313 l 135


@26: h 16 d -4 x 60 y 514 x 447 l 387

With these general observations your committee will introduce the

@27: h 14 d -3 x 43 y 532 x 448 l 405

special reports from individual members, embracing an account of

@28: h 14 d -3 x 44 y 549 x 446 l 402

the sanitary condition of the cities of Portland, Concord, Boston,

@29: h 15 d -3 x 42 y 567 x 445 l 403

Lowell, New York, Philadelphia, Baltimore, Charleston, New Or-

@30: h 14 d -4 x 42 y 584 x 445 l 403

leans, and Louisville, so far as it may be developed by answers to

@31: h 13 d -4 x 42 y 601 x 377 l 335

the questions propounded in the circular issued by them.

@32: h 14 d -3 x 55 y 619 x 445 l 390

Figure 3. Sample of Aurora output

Optical Character Recognition. Aurora's output consists of alternating lines of geometric data and the character stream; the character stream is interspersed with presentation data as well. Each original page image resulted in its own independently coded output file. A sample of this CGP data format (characters, geometry, and presentation) is presented in Figure 3. The geometric data contains a line number, line height, vertical and horizontal position of the line start and end relative to the left and right page edges and the page top, and overall length. Although there are none marked in the example, font changes are indicated with an escape symbol '@' and a typeface indicator, immediately before the characters in the character stream.
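As an illustration of how the geometry lines in Figure 3 might be read, the following Python sketch parses one geometry line into named fields and pairs it with the character line that follows. The field names are assumptions inferred from the description above and from the sample (in every sample line the second x value minus the first equals the l value). The meaning of the d field is not stated in the text, so it is kept under its own name. This is not the authors' Perl code.

```python
import re
from typing import NamedTuple, Optional

# Pattern for a geometry line such as:
#   @24: h 13 d -8 x 43 y 445 x 311 l 268
GEOMETRY = re.compile(
    r"^@(\d+):\s+h\s+(-?\d+)\s+d\s+(-?\d+)\s+x\s+(-?\d+)"
    r"\s+y\s+(-?\d+)\s+x\s+(-?\d+)\s+l\s+(-?\d+)$"
)


class GeometryLine(NamedTuple):
    number: int    # line number on the page
    height: int    # 'h' field
    d: int         # 'd' field; meaning not given in the paper
    x_start: int   # first 'x' field (assumed: start of line)
    y: int         # 'y' field (vertical position)
    x_end: int     # second 'x' field (assumed: end of line)
    length: int    # 'l' field


def parse_geometry(line: str) -> Optional[GeometryLine]:
    """Return the parsed geometry, or None if this is not a geometry line."""
    match = GEOMETRY.match(line.strip())
    if match is None:
        return None
    return GeometryLine(*map(int, match.groups()))


def read_cgp(lines):
    """Pair each geometry line with the character line after it.

    Blank lines are skipped. A geometry line that is followed directly
    by another geometry line is paired with an empty string.
    """
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        geometry = parse_geometry(line)
        if geometry is not None:
            records.append([geometry, ""])
        elif records and records[-1][1] == "":
            records[-1][1] = line
    return [(geometry, text) for geometry, text in records]


SAMPLE = """\
@24: h 13 d -8 x 43 y 445 x 311 l 268
tageous their use will be eminently injurious.
@25: h 14 d -4 x 178 y 486 x 313 l 135
@26: h 16 d -4 x 60 y 514 x 447 l 387
With these general observations your committee will introduce the
"""

if __name__ == "__main__":
    for geometry, text in read_cgp(SAMPLE.splitlines()):
        print(geometry.number, geometry.x_start, geometry.length, repr(text))
```

A character line that begins with a font escape would not match the geometry pattern, because the pattern requires the full sequence of labelled numeric fields after the line number.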

Markup Processing. In the system architecture in Fig. 1, the structural and presentation markup phases are implemented as an iterative process. The markup processing is performed by a program written in the Perl programming language [12] and executed in a UNIX environment. Perl was chosen as the prototype language for its strengths in line-oriented input processing and in string processing using regular expressions. In its current implementation, the program operates on four lines of input at a time: two lines of geometric data and two lines of character data. The geometric data is parsed and stored in variables; the text is held in temporary buffers. Using a combination of comparisons, the contents of the variables and buffers are matched against the document analysis specification to indicate the beginning (or end) of a structural division. Unlike the raw OCR output, which is independently coded on a per-page basis, the structural information detected at this stage can cross page boundaries. A stack is maintained to preserve the structural hierarchy and to ensure a parsable final output. The stack operations are governed by the structural hierarchy provided in the document analysis specification. Currently 'open' elements are maintained on the stack, and the indication of the opening of a higher-level structural unit forces the closing of the remaining elements on the stack. After structures have been identified, the temporary text buffers are flushed, and font change indicators are replaced with the proper tags. These are also kept on the stack. Finally, SGML entities are substituted for special characters as per the document analysis specification. The final output is a parsable (if at the moment simple) SGML document.
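The stack discipline described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' Perl program. The level table and the division type names are hypothetical, and the rule used here (opening an element first closes every open element at the same or a deeper level) is one reading of the description, chosen because it is consistent with the tag sequence shown in Figure 4.

```python
# Hypothetical nesting levels, outermost first (0 is the highest level).
LEVELS = {"section": 0, "subsection": 1, "P": 2}


def start_tag(kind):
    """Opening tag for a paragraph or a typed division."""
    return "<P>" if kind == "P" else "<DIV type='%s'>" % kind


def end_tag(kind):
    """Closing tag matching start_tag(kind)."""
    return "</P>" if kind == "P" else "</DIV>"


class TagStack:
    """Tracks currently open elements so the output stays well nested."""

    def __init__(self, levels=LEVELS):
        self.levels = levels
        self.stack = []    # open elements, innermost last
        self.output = []   # emitted tags and text, in order

    def close_top(self):
        self.output.append(end_tag(self.stack.pop()))

    def open(self, kind):
        # Close every open element at the same or a deeper level first.
        while self.stack and self.levels[self.stack[-1]] >= self.levels[kind]:
            self.close_top()
        self.stack.append(kind)
        self.output.append(start_tag(kind))

    def head(self, text):
        # Headers open and close at once, so they never sit on the stack.
        self.output.append("<HEAD>%s</HEAD>" % text)

    def text(self, text):
        self.output.append(text)

    def finish(self):
        # Close whatever is still open so the result is parsable.
        while self.stack:
            self.close_top()
        return "".join(self.output)


if __name__ == "__main__":
    tags = TagStack()
    tags.open("section")
    tags.open("subsection")
    tags.open("P")
    tags.text("tageous their use will be eminently injurious.")
    tags.open("subsection")   # forces </P></DIV> before the new DIV
    tags.head("CONCLUDING REMARKS.")
    tags.open("P")
    tags.text("With these general observations ...")
    print(tags.finish())
```

Because the stack persists between calls, nothing in this scheme depends on page boundaries, which matches the observation that detected structures can span pages.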

tageous their use will be eminently injurious.</P></DIV><DIV type='subsection'><HEAD>CONCLUDING REMARKS.</HEAD><P> With these general observations your committee will introduce thespecial reports from individual members, embracing an account of the sanitary condition of the cities of Portland, Concord, Boston, Lowell, New York, Philadelphia, Baltimore,Charleston, New Or-

leans, and Louisville, so far as it may be developed by answers to the questions propounded in the circular issued by them.</P>

Figure 4. Tagged output for previous example


Performance Criteria [9]. Four performance measures are useful in evaluating this effort. With respect to the automatic tagging process: 1) the proportion of tags placed correctly to tags placed erroneously; and 2) the proportion of the overall tagging effort on the AMA Transactions that is automated. With respect to the OCR process: 3) character recognition accuracy. Overall: 4) the speed of the system.

With a quality page image in hand, the processing time for both OCR and autotagging is under five seconds per page on a SPARC-10 Model 50 system. Aurora's output contained approximately 90% fewer OCR errors than two mid-range commercial OCR software packages tested on the same texts. Moreover, even at this early stage, the autotagging process has reduced by approximately 40% the number of element identifications and manual tagging operations by the human text-taggers in the Transactions project. And from a forty-page sample taken from the autotagging output, in which over 600 tags were placed, only approximately 8% of the tag placements were erroneous, either misattributions or unparsable SGML, which again is a quite satisfactory result at this stage of the project. These initial statistics will be supplemented by a full report at project completion, with more substantive evaluation.
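The figures above can be turned into the first and fourth performance measures with a little arithmetic. The Python sketch below is illustrative only. It treats "over 600 tags" as exactly 600 and "approximately 8%" as exactly 8%, so the results are rough estimates, and it treats the five-second figure as an upper bound per page.

```python
def tagging_summary(tags_placed, error_rate):
    """Estimate correct and erroneous tag counts and their ratio."""
    erroneous = round(tags_placed * error_rate)
    correct = tags_placed - erroneous
    return {
        "correct": correct,
        "erroneous": erroneous,
        "correct_per_error": correct / erroneous,
    }


def batch_upper_bound_minutes(pages, max_seconds_per_page):
    """Upper bound on OCR plus autotagging time for a batch of pages."""
    return pages * max_seconds_per_page / 60


if __name__ == "__main__":
    # 40-page sample: about 600 tags, about 8% placed erroneously.
    print(tagging_summary(600, 0.08))
    # Whole volume: about 1000 pages at under five seconds per page.
    print(batch_upper_bound_minutes(1000, 5))
```

On these assumptions the sample contains roughly 48 erroneous and 552 correct tag placements, or about 11.5 correct tags for each erroneous one, and processing the whole volume would take under about 84 minutes once page images are available.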

The image acquisition, storage, and OCR components of the system require a large volume of temporary and permanent storage, as well as significant CPU power. If the printed document is fragile or access to it is inconvenient, the digital images need to be accessed repeatedly in the document analysis phase, which consumes time and network bandwidth. In the present case it was necessary to create a set of easily manipulable page images at 1/16 the file size for viewing purposes.

It is expected that through further work identifying additional presentation-markup relationships, and adding logic for identifying dates and other numerical constructs, the proportion of tagging performed by the system can be increased even further. This labor saving is achieved at the cost of an up-front investment in preparing the document analysis specification, and at the moment is cost-effective only for large documents or document collections. It is important to point out, however, that projects producing critical or scholarly editions, in which a high proportion of the tagging involves expert evaluation, may see a smaller benefit relative to the overall tagging effort. Such projects may nevertheless use a system of this type for initial markup of basic structures. Indeed, the initial automatic markup of the Transactions was followed by markup by human specialists. Furthermore, the current system is able to discern only those structures that can be defined unambiguously in terms of their geometric or typographical characteristics. Although for most print documents this set is quite large, it is by no means the complete set of regularly occurring structures.

The current project also has pointed out difficulties with the use of strict hierarchical representations such as SGML and applications of SGML such as the TEI in encoding documents with concurrent and complex organic structures, a characteristic shared by many printed texts. In several cases so far, automatic markup of the Transactions has been complicated by such problems.


The focus of the present effort has been to develop a usable prototype for recognizing major structural features of a target text, and our implementation has been successful enough to proceed with enhancing and expanding the system's competence in handling other textual features that can be identified programmatically, such as dates. Other directions for the project include: 1) the design of a 'standard' document analysis specification file format, which would draw on the existing standards for document output formats, with the structural-hierarchy component of the specification derived directly from a given DTD; and 2) implementation in a compiled language such as C++ to enable deployment apart from an interpreter, to enable run-time interaction with the Aurora OCR application and other applications, such as an SGML parser, and to utilize object-oriented software engineering methodologies.


1. Baird, H. S., H. Bunke, and K. Yamamoto (eds.). Structured Document Image Analysis. Berlin: Springer-Verlag, 1992.

2. Burnard, Lou, and C. M. Sperberg-McQueen. Encoding for Interchange: An Introduction to the TEI, TEI U5. Chicago: Text Encoding Initiative, 1994.

3. Burnard, Lou, and C. M. Sperberg-McQueen (eds.). Guidelines for Electronic Text Encoding and Interchange, TEI P3. Chicago: Text Encoding Initiative, 1994.

4. Conway, Alan. "Page grammars and page parsing: a syntactic approach to document layout recognition." in Proc. 2nd. Int'l Conf. on Document Analysis and Recognition (Tsukuba Science City, October 20-22, 1993), IEEE Press, pp. 761-764.

5. Crawford, R. G. and Susan Lee. "A prototype for fully automated entry of structured documents." Canadian Journal of Information Science 15(4): 39-50, December 1990.

6. Dengel, Andreas, Rainer Bleisinger, Rainer Hoch, Frank Fein, and Frank Hones. "From paper to office document standard representation." IEEE Computer 25(5): 63-67, July 1992.

7. Fawcett, Heather. "The new Oxford English Dictionary project." Technical Communication 40(3): 379-382, August 1993.

8. Goldfarb, Charles. The SGML Handbook. Oxford: Clarendon Press, 1990.

9. Kanai, J., T. A. Nartker, S. V. Rice, G. Nagy. "Performance metrics for document understanding systems" in Proc. 2nd. Int'l Conf. on Document Analysis and Recognition (Tsukuba Science City, October 20-22, 1993), IEEE Press, pp. 424-427.

10. Taghva, Kazem, Allen Condit, Julie Borsack, and Srinivasulu Erva. "Structural markup of OCR Generated Text." UNLV Information Science Research Institute 1994 Annual Report. Las Vegas: ISRI.

11. Taylor, S. Liebowitz, M. Lipshutz, D. A. Dahl, C. Weir. "An intelligent document understanding system" in Proc. 2nd. Int'l Conf. on Document Analysis and Recognition (Tsukuba Science City, October 20-22, 1993), IEEE Press, pp. 107-110.

12. Wall, Larry, and Randal L. Schwartz. Programming perl. Sebastopol CA: O'Reilly & Associates, Inc., 1991.