SGML: Technology Survey

 

CPSC 689/602

Special Topics in Digital Libraries

Fall 1998


Table of Contents

Introduction

Markups

SGML Basics

Implications for Digital Libraries

References


Introduction

Historically, two design trends can be identified in text-editing applications: designs that model the author as typist and designs that model the author as programmer-typist. As a result, applications tend to have the look and feel of a super typewriter or of a programming interface. They provided adequate support for secretaries and programmers. However, these designs devoted far more attention to keyboards, printers, fonts, displays, graphics, colors, and similar features than to the retrieval and structuring of information, or even to the verification of spelling and grammar. Coombs et al. observe that WYSIWYG (What You See Is What You Get) systems tend to focus authors' attention on appearance all the time, not just when the document is ready for submission [Coombs et al. 1987]. This tendency changed scholarly authoring, shifting the role of the author from text composer to typist or typesetter.

Given the nature of scholarly work, better support was required for the creation of texts and documents. Some functionality, such as spelling and grammar verification, has been incorporated into traditional text-editing applications. However, one requirement that is typically ignored by the author-as-typist and author-as-programmer-typist models is the specification of document structure. This is particularly relevant for scholarly work, since most scholarly texts and documents have characteristic structures.

As a result, alternative document models and approaches began to enter standard practice in order to provide better support for the specification of document structure. One such approach is the use of markup and the specification of a markup language. Markup allows the encoding of document structure within the document itself and can be used to define how the document should be presented at reading or printing time. Different markup techniques have been used, such as procedural markup and descriptive markup. Goldfarb, one of the original creators of GML (Generalized Markup Language), helped clarify the advantages of descriptive markup over procedural markup.

However, the use of markup is not a new concept. Since the 1960s, people such as William Tunnicliffe, Stanley Rice and Norman Scharpf had suggested the use of markup and the separation of document content from its format [SGML Users' Group 1990]. The use of markup can be traced back several centuries if we consider that, from a technical point of view, even punctuation, capitalization and spacing can be regarded as markup [Goldfarb 1981]. Based on the ideas of Rice and Tunnicliffe, Charles Goldfarb, Edward Mosher and Raymond Lorie invented GML, a descriptive markup language, as part of their research project at IBM. Major portions of GML were implemented in IBM’s applications and achieved considerable acceptance in the industry.

In 1978 the American National Standards Institute (ANSI) [ANSI 1986], with the approval of the International Organization for Standardization (ISO), began to look for a standard description of text for computational purposes. Goldfarb led a project to create a text description language standard based on GML. The Graphic Communications Association (GCA) supported the effort and provided a nucleus of dedicated people for the task of developing Goldfarb's basic language design for SGML into a standard. The first working draft of the SGML standard was published in 1980, and by 1983 the GCA recommended the sixth working draft as an industry standard, which was adopted by institutions such as the US Internal Revenue Service (IRS) and the US Department of Defense. In 1985, a draft proposal for an international standard was published and the international SGML Users' Group was founded. In 1986 SGML was finally published as an ISO standard.

Early SGML applications were frequently developed for use by a single organization or a small community of users. However, the Association of American Publishers (AAP) endorsed the ANSI-ISO SGML and developed its first application, the Electronic Manuscript Project. The US Department of Defense also used SGML for the documentation component of its Computer-aided Acquisition and Logistic Support (CALS) initiative.

SGML has recently been criticized even by supporters of descriptive markup. Above all, critics consider SGML too complicated for both authors and implementers. However, tools have been developed for UNIX, Macintosh and Windows platforms to facilitate document creation. Another criticism of SGML concerns its lack of support for mathematics, graphics, and tabular material. However, because SGML is a metalanguage, it is possible to create SGML applications that support mathematics and tabular material. It also seems reasonable to believe that the standard can support graphics through descriptive markup as well as through referential markup.


Markups

Traditionally, the term markup refers to annotations and other marks within a text that specify how the text should be presented or printed. For instance, boldface, font, size, etc. can be indicated by the use of special symbols. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special markup codes inserted into electronic texts to govern formatting, printing, or other processing. Markup can therefore be defined as any means of making explicit an interpretation of a text [ACH/ACL/ALLC 1994].

A text is more than an undifferentiated sequence of words. Typically, a text has a structure with different structural components such as chapters, sections, paragraphs and sentences. Structural components may be of different kinds, such as syntactic, semantic, stylistic, presentational, etc. These structural components can overlap each other, since they may represent different levels of knowledge. Structural components are very valuable since they increase the usability of the text by enabling processes such as its analysis and interpretation.

As mentioned before, it can be argued that all texts contain markups, since punctuation marks, capitalization, even the spaces between words, might be regarded as a kind of markup. These markups help the human reader determine the beginning and end of structural components such as a word, or how to identify structural units such as headings or dependent clauses or sentences. Coombs et al. express this as:

  • There is no such thing as "no markup". Whenever an author writes anything, he or she "marks it up." For example, spaces between words indicate word boundaries, commas indicate phrase boundaries, and periods indicate sentence boundaries. The markup is not part of the text or content of the expression but tells us something about it. When we "translate" writing into speech (that is, when we read aloud), we do not normally read the markup directly; instead, we interpret the markup and use various paralinguistic gestures to convey the appropriate information. A question mark, for example, might become a raising of the voice or the eyebrow.
  • Nevertheless, we typically consider this markup technique an intrinsic characteristic of writing. Other kinds of markup techniques are possible, however. These markups are added to the text at creation time and used to present it properly for reading. For the purposes of this explanation, the process of inserting markup into a text is referred to hereafter as encoding.

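    A markup language, in turn, should be understood as a set of markup conventions used together for encoding texts. A markup language specifies what markup is allowed, what markup is required, how markup is to be distinguished from text, and what the markup means [ACH/ACL/ALLC 1994].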

    SGML provides the means for doing the first three. The last is provided by other documentation sources such as "The SGML/XML Web Page" [Cover 1998] and "A Gentle Introduction to SGML" [ACH/ACL/ALLC 1994].

    Coombs et al. presented an analysis of different markup techniques [Coombs et al. 1987]. They distinguish six types of markup:

    Punctuational. Punctuational markup consists of the use of a closed set of marks to provide primarily syntactic information about written utterances. This kind of markup has been used for centuries and is considered part of the writing process. From this point of view, only some ancient manuscripts have no markup (scriptio continua).

    Presentational. A higher-level encoding allows authors to make the presentation clearer in different ways, for instance through horizontal and vertical spacing, folios, page breaks, enumeration of lists and notes, and many other symbols and devices. Presentational markup clarifies the presentation of a document and makes it suitable for reading.

    Procedural. In order to implement automatic systems, many text-processing systems replace the presentational markup by procedural markup. Procedural markup consists of commands that indicate how text should be formatted. In other words, procedural markup instructs a text formatter X to "do Y", for example, skip three lines. Therefore markups are processed according to the rules specified in the documentation for the particular system.

    Descriptive. Descriptive markup allows authors to identify the element types of text tokens or units. Authors who are accustomed to procedural markup often think of descriptive markup as if it were procedural and may even use tags procedurally. The main difference between descriptive and procedural markup is that the latter indicates what a particular text formatter should do, while the former indicates what a text element is. In other words, descriptive markup declares that a segment of a document is a member of a particular class. Following the previous example, descriptive markup tells all text formatters, "this is an X," for example, a long quotation. Normally, descriptive markup can be mapped onto procedural markup. This allows descriptive markup to be processed by an open set of applications.

    Referential. Referential markup refers to external entities. The markups are replaced by the specified entities during document processing.

    Metamarkup. Metamarkup provides authors and support personnel with a facility for controlling the interpretation of markup and for extending the vocabulary of descriptive markup languages.

    Of these six types, only three (presentational, procedural, and descriptive) actually compete against each other. Their pros and cons can be weighed in terms of markup processing. Currently there are three major categories of markup processing: reading, formatting, and open-ended processing by other applications.

    Presentational markup is designed for reading. Procedural markup is designed for formatting, but usually only by a single program. Descriptive markup is moderately well suited for reading but is primarily designed to support an open class of applications, such as information retrieval. Descriptive markup also helps authors focus on the structure and content of documents, whereas both presentational and procedural markup tend to focus authors' attention on physical presentation. Some examples and a comparison between procedural and descriptive markup can be found in [ArborText 1998].
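The idea that one descriptive encoding can be mapped onto the procedural commands of several formatters can be illustrated with a minimal Python sketch. The element names, the two style tables and the command strings are hypothetical, chosen only for illustration; they do not correspond to any real formatter.

```python
# Hypothetical style tables: each maps a descriptive element type to the
# procedural commands one particular formatter would need.
PRINT_STYLE = {
    "heading": ["skip 3 lines", "use bold"],
    "long-quotation": ["skip 1 line", "indent 5 spaces"],
}
SCREEN_STYLE = {
    "heading": ["use large font"],
    "long-quotation": ["indent 2 spaces", "use italics"],
}


def to_procedural(elements, style):
    """Turn (element_type, text) pairs into a list of formatter commands."""
    commands = []
    for element_type, text in elements:
        # Unknown element types get no special formatting commands.
        commands.extend(style.get(element_type, []))
        commands.append(f"emit {text!r}")
    return commands


# The descriptive markup says only what each unit is, not how it looks.
document = [
    ("heading", "SGML Example"),
    ("long-quotation", "There is no such thing as no markup."),
]

print(to_procedural(document, PRINT_STYLE))
print(to_procedural(document, SCREEN_STYLE))
```

The document itself is encoded once; only the style table changes from one application to the next, which is the sense in which descriptive markup supports an open set of applications.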

    Descriptive markup solves many of the problems that scholars face in document development and offers several advantages over presentational and procedural markup techniques. Without descriptive markup, only special systems with incompatible formats will offer even a portion of the authorial support that scholars have a right to expect from their computers. Some of the advantages of descriptive markup are outlined below.

    Descriptive markup has strong support from the industry, and many supporters are trying to establish it as the de facto standard. The evolution of the industry during the last few years and the wide success of the Web have also encouraged the acceptance of descriptive markup. It does in fact possess many advantages compared to the other markup systems. However, acceptance of descriptive markup is still being slowed by the desire to retain familiar technologies and practices and by developers' use of proprietary formats intended to lock users into their products.

    With regard to SGML, even supporters of descriptive markup have recently criticized it. Critics consider SGML too complicated for both authors and implementers, even though tools have been developed for UNIX, Macintosh and Windows platforms to facilitate document creation. Another criticism of SGML concerns its lack of support for mathematics, graphics, and tabular material. However, because SGML is a metalanguage, it is possible to create SGML applications that support mathematics and tabular material. It also seems reasonable to believe that the standard can support graphics through descriptive markup as well as through referential markup.


    SGML Basics

    As SGML has become increasingly popular, multiple sources of documentation about it have become available. However, there are many differences between them. For the purposes of this survey, the OASIS bibliographic recommendations are followed [Cover 1998]. OASIS maintains a bibliographic database with over 2000 selected documents. This section was prepared by combining and summarizing the top recommendations by OASIS, namely [ACH/ACL/ALLC 1994], [ArborText 1998] and [SoftQuad 1991].

    SGML, the acronym for Standard Generalized Markup Language, is an international standard (ISO 8879:1986) for the definition of device-independent, system-independent methods of representing texts in electronic form [ACH/ACL/ALLC 1994]. It is a metalanguage in the sense that it supplies a formal notation for the definition of generalized markup languages. SGML provides for the structuring and hypertext linking of document and database information in a vendor-neutral, machine-independent, and human-readable format. A basic design goal of SGML was to ensure that documents encoded according to its provisions should be transportable from one hardware and software environment to another without loss of information. SGML also provides a general-purpose, machine-independent mechanism for string substitution.

    From a logical point of view, a document can be divided into three distinct layers: structure, content and style [ArborText 1998]. SGML recognizes these three layers, although it mainly deals with the content and structure layers. The style layer is usually handled by proprietary systems. However, other standards have been created in an effort to manage the style and presentation of documents. Notable among these standards are the Output Specification (OS) and ISO’s Document Style Semantics and Specification Language (DSSSL).

    In order to deal with document structure, SGML introduces the notion of a document type, and hence a document type definition or DTD. The DTD formally defines the document structure by specifying its constituent parts and their structure. The DTD also specifies the rules and relationships between the different structural components. The DTD accompanies a document wherever it goes. Parser programs can be written to process documents, take advantage of the encapsulated knowledge, verify that the document follows the rules of the DTD, and even verify that the DTD itself is structurally correct. SGML uses a simple and consistent mechanism for the markup or identification of structural components. A document instance is a document whose content has been tagged in conformance with a particular DTD.

    The content layer corresponds to the information itself. The information can be an arbitrary structural component. In SGML, structural components are referred to as elements. Each element has its own name. However, SGML does not assign any interpretation to the particular semantics of the name. It only allows the relationships between different elements to be defined. For instance, an element <myelement> can be decomposed into <mysubelement> elements. The following section presents in more detail how SGML deals with document content. A discussion of the DTD and document structure follows.

    Document Content

    In order to mark up texts, SGML normally uses pairs of tags that indicate the beginning (start-tag) and the end (end-tag) of an element. These tags bracket off the element occurrences within the running text, in much the same way as different types of parentheses or quotation marks are used in conventional punctuation. Conventionally, tags are enclosed in angle brackets, and a solidus character (slash) precedes the name in the end-tag. The following example illustrates the use of markup in SGML.

    <paragraph> this is a paragraph of my document </paragraph>

    SGML allows a hierarchical structuring of documents where elements can be nested within elements. For instance in the following example, a heading and two paragraphs are embedded in a section element.

    <!-- This is an example of embedded tags -->

    <section>

    <heading>SGML Example</heading>

    <paragraph>

    <line>this is the first line of the SGML example</line>

    <line>this is the second line of the SGML example</line>

    </paragraph>

    <paragraph>

    <line>this is the first line of the SGML example</line>

    <line>this is the second line of the SGML example</line>

    </paragraph>

    </section>

    The first line in the previous example illustrates the use of comments. In SGML, comments are not treated as part of the document. The line breaks and spaces between the tags in this example likewise carry no meaning; they are used for readability.

     

    As mentioned before, an important characteristic of SGML is that rules can be defined for the relationships among different elements. This makes it possible to specify rules like:

    1. A section begins with a single heading element, which precedes all other elements of the section.
    2. Apart from the heading element, a section element contains only paragraph elements.
    3. Every paragraph is contained in a section.
    4. Paragraph elements consist only of line elements.
    5. Every line is contained in a paragraph.
    6. Nothing can follow a paragraph except another paragraph or the end of the section.
    7. Nothing can follow a line except another line or the end of the paragraph.

    Based on the previous rules, it is possible to infer missing markup. This feature gives the author greater flexibility in composing documents. For instance, based on rules 1 and 2, the end of the heading element can be inferred even if the author does not mark it up explicitly. In the same manner, rules 6 and 7 make it possible to infer a missing end-tag for a line or paragraph element. This allows the author to write the following simplified text.

    <!-- This is an example of embedded tags -->

    <section>

    <heading>SGML Example

    <paragraph>

    <line>this is the first line of the SGML example

    <line>this is the second line of the SGML example

    <paragraph>

    <line>this is the first line of the SGML example

    <line>this is the second line of the SGML example

    </section>
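The way omitted end-tags can be inferred from containment rules can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the table `ALLOWED_CHILDREN` is a hand-written restatement of the section, heading, paragraph and line rules above, the function name `restore_end_tags` is hypothetical, and a real SGML parser works from the full content models rather than from a simple containment table.

```python
import re

# Which element types each element may directly contain (an assumed,
# simplified restatement of the rules for the section example).
ALLOWED_CHILDREN = {
    "section": {"heading", "paragraph"},
    "paragraph": {"line"},
    "heading": set(),
    "line": set(),
}

# A token is a start-tag, an end-tag, or a run of character data.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")


def restore_end_tags(text):
    """Return the markup with every omitted end-tag made explicit."""
    output, open_elements = [], []
    for match in TOKEN.finditer(text):
        closing, name, data = match.groups()
        if data is not None:
            if data.strip():
                output.append(data.strip())
        elif closing:
            # Close everything up to and including the named element.
            while open_elements:
                top = open_elements.pop()
                output.append(f"</{top}>")
                if top == name:
                    break
        else:
            # Close open elements that cannot contain the new element.
            while open_elements and name not in ALLOWED_CHILDREN.get(
                open_elements[-1], set()
            ):
                output.append(f"</{open_elements.pop()}>")
            open_elements.append(name)
            output.append(f"<{name}>")
    # Close whatever is still open at the end of the input.
    while open_elements:
        output.append(f"</{open_elements.pop()}>")
    return "".join(output)


minimized = (
    "<section><heading>SGML Example"
    "<paragraph><line>first<line>second"
    "<paragraph><line>third</section>"
)
print(restore_end_tags(minimized))
```

The sketch keeps a stack of open elements: a start-tag that the innermost open element cannot contain forces that element to be closed first, which is how a missing heading, line or paragraph end-tag is recovered.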

    The following section presents how to specify rules such as the ones used in the previous example and discusses in more detail the DTD.

    Document Structure

    The formal specification of the structure of a document is accomplished by designing the document type definition, or DTD, which can be as lax or as restrictive as required. However, the document designer must balance how real texts actually behave against the convenience of simple rules. When designing a new document, it is generally easiest to impose a more precise structure. It is important to notice that every DTD is an interpretation of the text. It is therefore possible to have different DTDs for the same collection of documents. Nevertheless, most applications of SGML today address domains where uniformity of documents is desired.

    A DTD is expressed in SGML as a set of declarative statements, using a simple syntax defined in the standard.

    <!ELEMENT section - - (heading?, paragraph+)>

    An element declaration, like a tag, is delimited by angle brackets. The first character following the opening bracket must be an exclamation mark, followed immediately by one of a small set of SGML-defined keywords specifying the kind of object being declared, in this case ELEMENT. After the keyword there are three more parts: a name or group of names, two characters specifying the minimization rules, and a content model. Components of the declaration are separated by white space, that is, one or more blanks, tabs or new lines [ACH/ACL/ALLC 1994]. The first part of each declaration gives the generic identifier of the element being declared, for example 'section' or 'paragraph'. It is possible to declare several elements in one statement. Before continuing, consider the following declarations, which would be appropriate for the previous section example.

    <!ELEMENT section - - (heading?, paragraph+)>

    <!ELEMENT paragraph - O (line+)>

    <!ELEMENT heading - O (#PCDATA) >

    <!ELEMENT line - O (#PCDATA) >

    The minimization rules determine whether or not start- and end-tags must be explicitly marked up in every occurrence of the element concerned. The first character relates to the start-tag, and the second to the end-tag. In either case, a hyphen indicates that the tag must be present and a letter O (for "omissible" or "optional") indicates that it may be omitted. Thus, in this example, every element must have a start-tag and only the <section> elements must have end-tags as well.
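To make the parts of an element declaration concrete, the following Python sketch splits a declaration into its name group, minimization characters and content model. It is an assumption-laden illustration rather than a conforming parser: the regular expression and the function name `parse_element_declaration` are hypothetical, and only the simple forms of declaration shown in this section are handled.

```python
import re

# Name or name group, two minimization characters, then the content model.
DECLARATION = re.compile(
    r"<!ELEMENT\s+(?P<names>\([^)]*\)|\S+)\s+"
    r"(?P<start>[-O])\s+(?P<end>[-O])\s+"
    r"(?P<model>.+?)\s*>$"
)


def parse_element_declaration(declaration):
    """Split one element declaration into its three parts."""
    match = DECLARATION.match(declaration.strip())
    if match is None:
        raise ValueError(f"not a simple element declaration: {declaration!r}")
    # A name group such as (heading | line) declares several elements.
    names = [n.strip() for n in match["names"].strip("()").split("|")]
    return {
        "names": names,
        # "O" means the tag may be omitted; "-" means it is required.
        "start_tag_omissible": match["start"] == "O",
        "end_tag_omissible": match["end"] == "O",
        "content_model": match["model"],
    }


print(parse_element_declaration("<!ELEMENT section - - (heading?, paragraph+)>"))
print(parse_element_declaration("<!ELEMENT (heading | line) - O (#PCDATA) >"))
```

For the section declaration both tags are reported as required, while for heading and line only the end-tag is omissible, matching the reading of the declarations given above.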

     

    The content model of the element, enclosed in parentheses, specifies what element occurrences may legitimately be contained. Contents are specified either in terms of other elements or using special reserved words. The most commonly encountered reserved word is #PCDATA, which stands for parsed character data. #PCDATA means that the element being defined may contain any valid character data. In our example, the <heading> and <line> content models specify #PCDATA only and name no embedded elements; therefore these elements may not contain any embedded elements.

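    In order to indicate how many times an element named in a content model may occur, SGML uses occurrence indicators. There are three occurrence indicators in the SGML syntax:

    The plus sign (+) indicates that the element may occur one or more times.

    The question mark (?) indicates that the element is optional: it may occur at most once.

    The asterisk (*) indicates that the element may occur zero or more times.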

    Thus, in order to change the paragraph element declaration so that it allows a paragraph to be empty, the declaration should be as follows.

    <!ELEMENT paragraph - O (line*)>

    When the content model contains more than one component, it must specify the order in which the elements may appear. This ordering is determined by the group connector used between the components. There are three possible group connectors:

    The comma (,) means that the components it connects must both appear, in the order specified by the content model.

    The ampersand (&) indicates that the components it connects must both appear but may appear in any order.

    The vertical bar (|) indicates that only one of the components it connects may appear.

    <!ELEMENT section - - (heading?, paragraph+)>

    If the comma in our example were replaced by an ampersand in the section element declaration, then the heading could appear either before the paragraphs or at the end (but not between paragraphs). If it were replaced by a vertical bar, then a section would consist of either a heading or just paragraphs, but not both.
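The effect of occurrence indicators and group connectors can be checked mechanically. The Python sketch below translates an element-only content model into a regular expression over the sequence of child element names. It is a simplification made for illustration: the function names are hypothetical, the ampersand connector and #PCDATA are deliberately not handled, and a real SGML parser does not work this way internally.

```python
import re

# A token is an element name or one of the content model delimiters.
TOKEN = re.compile(r"\s*([A-Za-z][\w.-]*|[(),|?+*&])")


def content_model_to_regex(model):
    """Translate an element-only content model into a compiled regex."""
    model = model.strip()
    parts, pos = [], 0
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError(f"unsupported content model text at offset {pos}")
        token, pos = match.group(1), match.end()
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in the regex
        elif token == "&":
            raise NotImplementedError("the & connector is not handled here")
        elif token in {")", "|", "?", "+", "*"}:
            parts.append(token)  # same meaning in regex syntax
        else:
            # Each child is written as its name followed by one space.
            parts.append(f"(?:{re.escape(token)} )")
    return re.compile("".join(parts))


def is_valid_content(model, children):
    """Check a list of child element names against a content model."""
    sequence = "".join(f"{child} " for child in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


print(is_valid_content("(heading?, paragraph+)", ["heading", "paragraph"]))
print(is_valid_content("(heading?, paragraph+)", ["paragraph", "heading"]))
print(is_valid_content("(heading? | paragraph+)", ["heading", "paragraph"]))
```

With the comma, a heading followed by paragraphs is accepted and the reverse order is rejected; with the vertical bar, a heading together with paragraphs is rejected, as described above.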

    When different elements have similar properties, they can share the same declaration. In this situation, it is convenient to supply a name group as the first component of a single element declaration, rather than give a series of declarations differing only in the names used. For instance:

    <!ELEMENT (line | line1 | line2) - O (#PCDATA) >

    In addition, in order to specify complex content models that go beyond single elements or #PCDATA, SGML allows the use of models in which the components are lists of elements combined by group connectors. Such lists, known as model groups, may also be modified by occurrence indicators and themselves combined by group connectors.

    <!ELEMENT section - - (heading?, (paragraph | line)+ ) >

    Quite complex models can easily be built up in this way, to match the structural complexity of many types of text. In the previous declaration, a section may start with a heading, followed by one or more paragraphs or lines in any mixture.

    While this hierarchic approach is very effective for a large number of purposes, it is not adequate for the full complexity of real textual structures. In particular, it does not cater for the case of more or less freely floating elements that can appear at almost any hierarchic level in the structure, and it does not cater for the case where different elements overlap or several different trees may be identified in the same document. To deal with the first case, SGML provides the exception mechanism; to deal with the second, SGML permits the definition of ‘concurrent’ document structures.

    Exceptions are added after the content model. They name elements that may exceptionally be included in (plus sign) or excluded from (minus sign) the content of the element being declared. For instance, the following declaration indicates that lines can appear at any point in the content of a section (even before the heading).

    <!ELEMENT section - - (heading?, paragraph+) +(line)>

    Concurrent document structures allow multiple ways of looking at the same data. Since a DTD specifies one particular hierarchy, a separate DTD must be created for each hierarchic tree into which the text is to be structured.

    The definition we have built up for the section document looks, in full, like this:

    <!DOCTYPE paper [

    <!ELEMENT paper - - (section+)>

    <!ELEMENT section - - (heading?, paragraph+)>

    <!ELEMENT paragraph - O (line+)>

    <!ELEMENT (heading | line) - O (#PCDATA) >

    ]>

    As this example shows, the name of a document type must always be the same as the name of the largest element in it, that is, the element at the top of the hierarchy. If, for example, we did not want headings to be printed at the bottom of a page, we could define a concurrent document type that describes the page layout.

    <!DOCTYPE page.section [

    <!ELEMENT page.section - - (page+) >

    <!ELEMENT page - - ((heading?, line+)+) >

    <!ELEMENT (heading|line) - O (#PCDATA) >

    ]>

    We have now defined two different ways of looking at the same basic text: the #PCDATA components that both document type definitions group into lines or headings. In one view, the lines are grouped into paragraphs and sections, while in the other they are grouped into pages only. Notice that exactly the same text is visible in both views. However, the two hierarchies allow us to arrange the text in two different ways.

    To mark up the two views, it will be necessary to indicate which hierarchy each element belongs to. This is done by including the name of the document type (the view) within parentheses immediately before the identifier concerned, inside both start- and end-tags. Thus, pages (which are only visible in the <page.section> document type) must be tagged with a <(page.section)page> tag at their start and a </(page.section)page> tag at their end. In the same way, sections and paragraphs must be tagged using <(paper)section> and <(paper)paragraph> tags respectively. For the line and heading elements, however, which appear in both hierarchies, no document type specification need be given: any tag containing only a name is assumed to mark an element present in every active document type. This makes it possible to process the document according to the interests of the user. However, CONCUR is an optional feature of SGML; not all available SGML software systems support it, and those that do, do not always do so according to the letter of the standard.
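How a processor might pull a single view out of concurrently marked-up text can be sketched in Python as follows. This is a hypothetical illustration only: the function `select_view` and the sample text are assumptions made for this example, and the sketch simply filters tags by their document type qualifier without validating either hierarchy.

```python
import re

# A tag with an optional document type qualifier, e.g. <(paper)section>.
TAG = re.compile(r"<(/?)(?:\(([\w.-]+)\))?([\w.-]+)>")


def select_view(text, document_type):
    """Keep unqualified tags and tags qualified with the given type."""
    def keep_or_drop(match):
        slash, qualifier, name = match.groups()
        if qualifier is None or qualifier == document_type:
            return f"<{slash}{name}>"
        return ""  # the tag belongs to another view

    return TAG.sub(keep_or_drop, text)


# Hypothetical sample: one paragraph that is split across two pages.
marked_up = (
    "<(page.section)page><(paper)section><heading>SGML Example"
    "<(paper)paragraph><line>first line</(page.section)page>"
    "<(page.section)page><line>second line</(paper)paragraph>"
    "</(paper)section></(page.section)page>"
)

print(select_view(marked_up, "paper"))
print(select_view(marked_up, "page.section"))
```

The paragraph overlaps the page boundary in the combined text, yet each extracted view is a properly nested hierarchy, which is the situation concurrent structures are meant to handle.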

    SGML also provides more advanced markup features, such as attributes. Attributes are used to record information that is in some sense descriptive of a specific element occurrence but not regarded as part of its content. For example, you might wish to add a status attribute to occurrences of some elements in a document to indicate their degree of reliability, or to add an identifier attribute so that you could refer to particular element occurrences from elsewhere within a document.

    Although different elements may have attributes with the same name, they are always regarded as different, and may have different values assigned to them. If an element has been defined as having attributes, the attribute values are supplied in the document instance as attribute-value pairs inside the start-tag for the element occurrence. An end-tag may not contain an attribute-value specification, since it would be redundant. The following example illustrates the use of attributes.

    <section param="location" value="server"> ... </section>

    In SGML it is possible to encode and name arbitrary parts of the actual content of a document in a portable way by using entities. An entity is a named part of a marked-up document, irrespective of any structural considerations. An entity might be a string of characters or a whole file of text. Entities are included in a document by entity references. As an example, consider the following internal entity declaration (the first) and external entity declaration (the second).

    <!ENTITY myname "Joe Aggie">

    <!ENTITY ChapterTwo SYSTEM "chp2.txt">

    In the second case, the system identifier is the name of an operating system file and the replacement text of the entity is the contents of the file. Once an entity has been declared, it may be referenced anywhere within a document by supplying its name prefixed with an ampersand character and followed by a semicolon. When an SGML parser encounters such an entity reference, it immediately substitutes the value declared for the entity name. Thus, the passage "In representation of &myname;, I …" is interpreted as "In representation of Joe Aggie, I …"

    Similarly, whenever an SGML parser finds "&ChapterTwo;", it expands the reference so that the document includes whatever the system finds in the file chp2.txt.
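The substitution of entity references can be approximated in a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, external entities are resolved through a caller-supplied `load_file` function (here a dictionary standing in for the file system, with made-up file contents), and details of real SGML entity handling such as nested references are ignored.

```python
import re

# Internal:  <!ENTITY name "replacement text">
# External:  <!ENTITY name SYSTEM "system identifier">
ENTITY_DECLARATION = re.compile(
    r'<!ENTITY\s+([\w.-]+)\s+(SYSTEM\s+)?"([^"]*)"\s*>'
)
ENTITY_REFERENCE = re.compile(r"&([\w.-]+);")


def read_declarations(dtd_text, load_file):
    """Build a table that maps entity names to their replacement text."""
    entities = {}
    for name, system, literal in ENTITY_DECLARATION.findall(dtd_text):
        # For an external entity the literal names the file to load.
        entities[name] = load_file(literal) if system else literal
    return entities


def expand(text, entities):
    """Replace each declared entity reference with its replacement text."""
    # References to undeclared entities are left unchanged.
    return ENTITY_REFERENCE.sub(
        lambda m: entities.get(m.group(1), m.group(0)), text
    )


declarations = """
<!ENTITY myname "Joe Aggie">
<!ENTITY ChapterTwo SYSTEM "chp2.txt">
"""
# Stand-in for the file system; the file content is invented.
files = {"chp2.txt": "Text of the second chapter."}

entities = read_declarations(declarations, files.__getitem__)
print(expand("In representation of &myname;, I", entities))
print(expand("&ChapterTwo;", entities))
```

Passing the file loader as a parameter keeps the sketch self-contained; a real system would resolve the system identifier against the operating system's files.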

    Applications

    There are many applications of the SGML/XML family of standards, including ISO-HTML, HyTime, DSSSL, XSL, XLL, XLink, XPointer, SPDL, CGM, and several others. HTML is in fact one pre-defined set of SGML markup; in other words, it is a DTD. It simply happens to be such a widely accepted DTD that it is sometimes mistaken for a language in its own right. HTML files are collections of information with markup describing that information. HTML browsers read the markup in HTML files and decide how to display the information based on that markup. Another influential subset of SGML is the Extensible Markup Language, or XML. XML is a simplified form of SGML optimized for use on the Internet. It is being developed under the auspices of the W3C as an application profile of SGML.

    Industry point of view

    As mentioned before, SGML is strongly supported by different industry sectors, such as publishing and the Internet. Many organizations currently support SGML, among them the W3C, the SGML Users' Group and the US IRS. Some of these organizations bring together influential corporations and institutions; one example is OASIS. OASIS is a nonprofit, international consortium dedicated to accelerating the adoption of product-independent formats based on public standards. These standards include SGML, XML and HTML, as well as others that are related to structured information processing. Members of OASIS (shown in Table I) are providers, users and specialists of the technologies that make these standards work in practice.

    ActiveSystems, Inc. | Adobe Systems, Inc. | AIS/Berger-Levrault
    AND-USA Inc. | ArborText, Inc. | Architag International Corporation
    L. A. Burman Associates | Center for Electronic Text in the Law | Chrystal Software
    CITEC International Ltd. | James Clark | Crane Softwrights Ltd.
    CSW Informatics | Data Conversion Laboratory | Database Publishing Systems, Ltd.
    DataChannel, Inc. | Datalogics Incorporated | Document Management Solutions, Inc
    Electronic Data Foundry, Inc. | Electronic Information Arts, Inc. | Enigma Incorporated
    Ericsson Telecom | Folio Corporation | Fuji Xerox Information Systems, Co. Ltd.
    Charles Goldfarb | Graphic Communications Association | IBM Corporation
    INERA Inc. | InfoObjects Inc. | Information Strategies, Inc
    INSO Corporation | Interleaf Inc. | International Press Telecommunications Council (IPTC)
    ISOGEN International Corp | Jouve Software, Inc. | Lonergan Digital SARL
    Matthew Bender & Company, Inc. | Microstar Software, Ltd. | Mulberry Technologies, Inc.
    Multilingual Technology Ltd. | Neville & Associates | Noldor Technologies
    NPC Digital Services | Office of the Courts, State of Utah | Okina Consulting
    PharmaSoft AB | POET Software Corporation | ProText
    RivCom | SAAB Service Partners AB | SAS Institute, Inc.
    SGMLWorks! | SoftQuad Inc. | Software AG
    Solvera Information Services | Soph-Ware Associates | Sörman Information AB
    STEP (Sturtz Electronic Publishing GmbH) | Structured Information Consulting | Sun Microsystems
    Synergy Incubate, Inc. | Synex Information AB | Synth-Bank
    Tata Infotech Ltd. | TechnoTeacher | Texcel
    The Word Electric | Thomson Corporation | Tokyo Electron America, Inc.
    Uniscope Inc. | Veo Systems | XMLXperts
    Xyvision, Inc. | Yuri Rubinsky Insight Foundation |

    Table I. OASIS Members.

    Another very influential organization supporting descriptive markup is the World Wide Web Consortium, or W3C. The W3C was established in October 1994 in collaboration with CERN, with support from DARPA and the European Commission. W3C intends to develop common protocols that promote the evolution of the World Wide Web and ensure its interoperability. W3C is an international industry consortium hosted by the Massachusetts Institute of Technology Laboratory for Computer Science (MIT/LCS) in the United States, the Institut National de Recherche en Informatique et en Automatique (INRIA) in Europe, and the Keio University Shonan Fujisawa Campus in Japan. The Consortium is led by Tim Berners-Lee, Director, and Jean-François Abramatic, Chairman. It provides services including a repository of information about the World Wide Web for developers and users, reference code implementations to embody and promote standards, and various prototype and sample applications to demonstrate the use of new technology.


    Implications for Digital Libraries

    There are many benefits and implications of SGML for Digital Libraries. SGML, as a descriptive markup language, can be beneficial for authors, publishers and readers. For instance:

    1. Authors can share documents and collaborate with colleagues without the current concerns about incompatibility between text formatters and printing devices.
    2. Publishers do not have to re-key documents, which eliminates an expensive and error-prone task.
    3. The proofing process becomes easier. As a result, publishers can save considerable administrative costs and reduce the time required to get a document into print. Moreover, publishers will no longer have to negotiate with authors who want to make changes after the galleys have been set. For their part, authors will be relieved of the burden of proofreading documents that were correct at the time of submission.
    4. Source files can be reused. Subsequent editions, revisions, or collections may be generated from the source files of the document without re-keying.
    5. Bibliographic information can be generated automatically. This makes citations available almost immediately to users of on-line bibliographic databases and reduces possible errors in citations. The time from submission of a text to its entry in the literature of a field can be reduced considerably. It also allows the development of other applications that recognize fields or specialties and produce automatic citation indexes [Giles et al. 1998, Chen and Carr 1998].
    6. Documents may be included directly in on-line databases for electronic publishing and full-text retrieval, which is another way of introducing them into the literature almost instantaneously.
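
    The automatic generation of bibliographic information mentioned in item 5 can be sketched as follows. The example is an assumption-laden illustration: the element names (article, front, author, title, year) are hypothetical rather than taken from any real DTD, and because the Python standard library has no SGML parser, the sample document is written as well-formed XML and read with xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; a real DTD would define its own.
SOURCE = """
<article>
  <front>
    <author>Coombs, James H.</author>
    <author>Renear, Allen H.</author>
    <author>DeRose, Steven J.</author>
    <title>Markup Systems and the Future of Scholarly Text Processing</title>
    <year>1987</year>
  </front>
  <body><p>...</p></body>
</article>
"""


def citation(document):
    """Build a citation string from the front matter of a marked-up article."""
    front = ET.fromstring(document).find("front")
    authors = "; ".join(a.text for a in front.findall("author"))
    return f'{authors} {front.findtext("year")}. "{front.findtext("title")}."'


entry = citation(SOURCE)
print(entry)
```

    Because the author, title and year are identified by the markup itself, no re-keying or guessing from layout is needed to produce the citation.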

    Publishers and authors have already begun to demand these improvements in the publishing process. With the expenses of scholarly publishing rising continually, cost containment will become more and more important, and authors will find properly marked electronic manuscripts more marketable than other electronic manuscripts and typescripts.

    Because Digital Library designers tend to make their collections available over the Internet, many social and regulatory issues may also arise, and Digital Libraries can benefit from SGML in several ways. From a social point of view, public acceptance and current industry investments seem to define the web as a de facto standard. From a regulatory point of view, business and control over the Internet demand features such as security, interoperability, information sharing and distributed processing, all of which depend heavily on standards. From a functional point of view, SGML can encapsulate the required document metainformation (semantic and structural) in order to improve information retrieval and the automatic generation of presentations. Universal access to Digital Libraries can also be facilitated by the adoption of an international standard such as SGML. While many issues remain to be solved, such as intellectual property and the relationship between traditional and digital libraries, SGML seems to be the standard that will define the industry for the next decade.


    References

    ACH/ACL/ALLC 1994.

    "A Gentle Introduction to SGML."

    in Guidelines for Electronic Text Encoding and Interchange (TEI P3).

    C.M. Sperberg-McQueen and Lou Burnard. Chicago. April 8 1994. 2 volumes, xxvi +, Pages 13-36

    Also Internet Document URL: http://sable.ox.ac.uk/ota/teip3sg/

    ACH/ACL/ALLC [Association for Computers and the Humanities, Association for Computational Linguistics, Association for Literary and Linguistic Computing]

     

    American National Standards Institute (ANSI) 1986.

    "Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML)"

    ISO 8879-1986(E). New York: ANSI, 1986.

     

    ArborText 1998.

    "Getting Started with SGML: A Guide to the Standard Generalized Markup Language and Its Role in Information Management".

    White Paper. Internet Document: http://www.arbortext.com/wp.html#knowsgml

    Last Modified: Thursday, July 02, 1998

     

    Chen, Chaomei and Carr, Les 1998.

    "Trailblazing the Literature of Hypertext: Author Co-Citation Analysis (1989-1998)"

     

    Coombs, James H., Renear, Allen H., DeRose, Steven J. 1987.

    "Markup Systems and the Future of Scholarly Text Processing." Communications of the Association for Computing Machinery 30/11 (1987) 933-947. ISSN: 0001-0782.

     

    Cover, Robin 1998.

    "The SGML/XML Web Page"

    OASIS. Internet Document URL: http://www.oasis-open.org/cover/sgml-xml.html

    Last modified: October 22, 1998

     

    Giles, Bollacker, and Lawrence 1998.

    "CiteSeer"

    ACM Digital Libraries 1998, pp. 89-98

     

    Goldfarb, C. F. 1981.

    "A Generalized Approach to Document Markup."

    Proceedings of the ACM SIGPLAN SIGOA Symposium on Text Manipulation.

    New York: ACM, 1981. 68-73. (Adapted as "Annex A. Introduction to Generalized Markup" in ISO 8879.)

     

    SGML Users' Group 1990.

    "A Brief History of the Development of SGML"

    Internet Document URL: http://www.sgmlsource.com/history/sgmlhist.htm

     

    SoftQuad 1991.

    "The SGML Primer"

    Toronto: SoftQuad Inc., December, 1991. 36 pages.

    Internet Document URL: http://www.softquad.com/

     

    W3C 1998.

    "World Wide Web Consortium Homepage"

    Internet Document URL: http://www.w3.org/