Hypervideo Properties

One of the topics on the hypervideo research agenda concerns the rhetoric of hypervideo. Hypervideo, like other dynamic media types, requires a different rhetoric than static media types. These changes in rhetoric affect not only the video or other dynamic media in a hypermedia system but the hypermedia system as a whole. Hardman et al. had to extend the assumptions for hypertext links in order to include context in hypermedia links [Hardman et al. 1993]. Listøl explains how, in order to reduce the user confusion arising from discontinuities between different media types, he modified the text to make it more like video and the video to make it more like text by using footnotes [Listøl 1994]. Sawhney et al. proposed a rhetorical framework for hypervideo. These new rhetorical and aesthetic requirements have implied changes in traditional concepts such as nodes and links, and the inclusion of time in the whole presentation.

Concepts like nodes and links have been greatly influenced by media types such as text and graphics. Nevertheless, these media types differ from video in more than one way. Listøl points out that reading requires the user to look in an "active" way, while video requires a more "passive" way of looking [Listøl 1994]. Text and graphics are also static: their message is present on the screen at all times, while video’s message depends on the changes in the images that occur over time. One more difference is that in static media the user sets the tempo of information acquisition, while in dynamic media the tempo is set by the machinery that creates the illusion of motion. Because text and graphics are static, it is easier to encapsulate ideas or concepts in nodes. In video, however, it is more difficult to map the concept of a node onto segments of video. At the lowest end, a node could be a single frame or a scene. The next two sections provide a perspective on different approaches to linking and to segmenting video into nodes. Following those sections is a discussion of the rhetoric of hypervideo as a whole and of the specification of time.

Linking and Navigation

Different approaches to linking hypervideo have been proposed over time. Some of these approaches are described next.

Video footnotes

The idea of video footnotes borrows from traditional text footnotes the concept of a link or reference to other information [Listøl 1994]. The reason for using a metaphor from traditional text is to help bridge the discontinuity gap when linking video to text. In addition, because video footnotes sit outside the display window, they are easier to implement and represent. In his implementation of the Interactive Kon-Tiki Museum, Listøl uses micons (moving icons) [Brøndmo and Davenport 1991] to represent the video footnotes.

Linking opportunities

Linking opportunities were presented in HyperCafe [Sawhney et al. 1996]. The work on HyperCafe assumes a video-centric medium and focuses mainly on video-to-video linking. However, its concepts can also be extended to linking between different media types, such as text-to-video and video-to-text. To provide linking functionality, three different linking opportunities can be used [Balcom 1996, Sawhney et al. 1996].

As a result of the linking opportunities, two new kinds of link are possible: spatio-temporal links and temporal links. With a spatio-temporal link, selecting a specified area of the source video during a specified time interval can trigger a destination video. Temporal links are time-based references between different scenes.
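
The two kinds of link can be illustrated with a minimal Python sketch. It assumes a rectangular hot area and times measured in seconds from the start of the source scene; the names `SpatioTemporalLink`, `TemporalLink`, and `resolve_click` are hypothetical and are not taken from HyperCafe.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpatioTemporalLink:
    """Hypothetical record: a screen area that is selectable during a time interval."""
    x: float
    y: float
    width: float
    height: float
    start: float       # seconds from the start of the source scene
    end: float         # seconds from the start of the source scene
    destination: str   # identifier of the destination video

    def is_hit(self, px: float, py: float, t: float) -> bool:
        # The selection must fall inside the area AND inside the interval.
        in_area = (self.x <= px <= self.x + self.width
                   and self.y <= py <= self.y + self.height)
        in_time = self.start <= t <= self.end
        return in_area and in_time


@dataclass(frozen=True)
class TemporalLink:
    """Hypothetical record: a time-based reference with no spatial component."""
    start: float
    end: float
    destination: str

    def is_active(self, t: float) -> bool:
        return self.start <= t <= self.end


def resolve_click(links: List[SpatioTemporalLink],
                  px: float, py: float, t: float) -> Optional[str]:
    """Return the destination of the first link hit at (px, py, t), if any."""
    for link in links:
        if link.is_hit(px, py, t):
            return link.destination
    return None
```

For example, a link declared as `SpatioTemporalLink(10, 10, 100, 50, 2.0, 5.0, "scene_b")` would be triggered by a selection at (50, 30) three seconds into the scene, but not by the same selection six seconds in.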

Travelling hotspot

This approach is used in various authoring tools and systems [Hirata et al. 1993, Hirata et al. 1996, OZER 1997, VEON 1998]. It consists of specifying hot areas in each frame and linking the areas to a common destination. Some systems provide facilities that automate the task of drawing the area individually in each frame. By comparing images from frame to frame it is possible to specify the hotspot for each frame, even if its XY coordinates change. Whenever the algorithm fails, the user must specify the hotspots manually. This technique is similar to spatio-temporal links. However, travelling hotspot systems typically consider each frame a node, whereas spatio-temporal link systems consider a scene a node. The difference is that for a spatio-temporal link the whole scene is a valid link opportunity, so a single specification is enough for the whole scene. Travelling hotspots have no concept of a scene, so more work is required, but it is easier to follow a particular moving image in the video. With spatio-temporal link opportunities, the image should remain reasonably stable in its XY coordinates on the screen.
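
A rough Python sketch of a per-frame hotspot lookup follows. Note that it does not implement the frame-to-frame image comparison mentioned above; as a simpler stand-in, it assumes the author marks the hotspot on a few keyframes and the rectangle is linearly interpolated for the frames in between. The function name `hotspot_at` and the rectangle layout are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, Optional, Tuple

Rect = Tuple[float, float, float, float]  # x, y, width, height


def hotspot_at(keyframes: Dict[int, Rect], frame: int) -> Optional[Rect]:
    """Return the hotspot rectangle for `frame`, or None outside the marked range."""
    if not keyframes:
        return None
    frames = sorted(keyframes)
    if frame < frames[0] or frame > frames[-1]:
        return None
    if frame in keyframes:
        # The author specified this frame directly.
        return keyframes[frame]
    # Find the keyframes on either side and interpolate between them.
    i = bisect_right(frames, frame)
    f0, f1 = frames[i - 1], frames[i]
    weight = (frame - f0) / (f1 - f0)
    r0, r1 = keyframes[f0], keyframes[f1]
    return tuple(a + weight * (b - a) for a, b in zip(r0, r1))
```

With keyframes at frame 0 and frame 10, a lookup for frame 5 yields a rectangle halfway between the two marked positions. A real system would replace the interpolation step with image tracking and fall back to manual marking where tracking fails.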

Content-oriented integration

Other navigational strategies have been proposed, such as content-oriented integration, which consists of concept-based navigation and media-based navigation [Hirata et al. 1993, Hirata et al. 1996]. In content-oriented integration, a media-dependent representation (e.g. a sketch) is translated into a concept that is media-independent. Concepts are linked together, and an inverse process is used to translate a concept into an appropriate media representation. Currently, content-oriented integration in hypervideo uses travelling hotspots as part of the media-dependent navigation. A sketch may be used to define an image in a frame. The system can look for that image and retrieve the frames that contain it. The user can also move the sketch in a certain pattern, instructing the system to find a set of frames in which the hotspot follows a similar pattern of movement. As technologies such as image recognition evolve, it will be possible to provide better ways to specify hotspots in hypervideo. The strength of this approach is that it avoids the discontinuity between input and output when using video.

Node segmentation

There are different possible positions regarding the granularity of hypervideo. One position is to define the smallest unit of hypervideo to be a scene. Another is to define the smallest unit of hypervideo to be a frame. A third option is to consider a whole video as a node and use indexing schemes to point into it.

Frame granularity

Certainly this seems to be the finest granularity possible for video: the frame is the atomic unit on the time axis. From the system point of view it is easier to manipulate, since it maps onto a kind of static media and meta-information can be associated with each node directly. Nevertheless, it may prove to be too fine. It requires the definition of meta-information frame by frame. While some frames may need to be referenced directly, other frames do not require any direct reference. From the rhetorical point of view, a frame may not be considered "video" at all, since one of the intrinsic characteristics of video is change over time. Therefore, in most cases a single frame cannot contain the message of the video. From a rhetorical and conceptual point of view, this approach is similar to creating a node for each individual phrase or word in a hypertext.


Scene granularity

A scene is a set of frames presented in a sequential fashion [Sawhney et al. 1996]. This definition implies that the set of frames, presented in a strictly sequential way, provides a certain meaning. From a rhetorical point of view, this approach also includes the concept of time. A scene would therefore be the minimum sequential set of frames that conveys meaning. This concept provides a good granularity for hypervideo. It still allows individual frames inside a scene to be referenced (anchors could be used for that task). Also, by linking scenes together, scene granularity allows the creation of a larger concept, namely the narrative sequence. Sawhney et al. define the narrative sequence as a possible path through a set of linked video scenes, dynamically assembled based on user interaction. It is at this level that Bernstein’s contours are built. From a system point of view, this approach provides small files that ease memory management. Nevertheless, this approach requires the author to define the scenes, and therefore imposes a fixed concept of meaning. For instance, in a computer-aided procedure guidance application, a hypervideo may aid a user by displaying the sequence of steps required for a procedure. In this scenario, scenes should be defined as the minimal steps. Beginners may require the system to stop at each step, while more advanced users may require the system to display several steps without stopping. Different narrative sequences may be programmed to provide this functionality.
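
The procedure guidance scenario can be sketched in Python as follows. This is only an illustration under stated assumptions: a scene is modelled as an inclusive range of frame numbers, and a narrative sequence as a list of groups of scenes, where the system plays each group without stopping and pauses between groups. The names `Scene`, `NarrativeSequence`, and `playback_runs`, as well as the scene names, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Scene:
    """Hypothetical scene node: a strictly sequential run of frames."""
    name: str
    first_frame: int
    last_frame: int  # inclusive


# Assumed model: each inner list is played without stopping,
# and the system pauses between inner lists.
NarrativeSequence = List[List[str]]


def playback_runs(scenes: Dict[str, Scene],
                  sequence: NarrativeSequence) -> List[List[Tuple[int, int]]]:
    """Return the frame ranges played in each uninterrupted run."""
    return [
        [(scenes[name].first_frame, scenes[name].last_frame) for name in group]
        for group in sequence
    ]


steps = [
    Scene("open_panel", 0, 99),
    Scene("remove_screw", 100, 249),
    Scene("replace_part", 250, 399),
]
scenes = {scene.name: scene for scene in steps}

# Beginners stop after every step; advanced users see all steps in one run.
beginner = [["open_panel"], ["remove_screw"], ["replace_part"]]
advanced = [["open_panel", "remove_screw", "replace_part"]]
```

The same three scenes serve both audiences; only the narrative sequence differs, which is the point of keeping scenes as the minimal steps.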

Whole video

This approach considers a whole video as a single node. Internal references are possible by specifying a time coordinate and X, Y coordinates. While this approach may seem to impose too much complexity on addressing, for some applications it may be the most appropriate. In the computer-aided procedure guidance example, different users may have different backgrounds and may therefore interpret a scene in different ways. The author’s idea of a scene could be two scenes for one user, or just part of a scene for another. Such systems require scenes to be defined dynamically. A dynamic indexing of the video could provide this functionality. For applications such as this, segmenting the video into predefined scenes results in a higher degree of complexity. Therefore a whole-video-as-node approach seems more appropriate for these applications.
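
One way to picture dynamic indexing is the following Python sketch. It assumes the whole video is a single node addressed by time in seconds, and that each user has an individual list of cut points that can change at run time. The class `VideoIndex` and its methods are hypothetical, and the X, Y part of the address is left out for brevity.

```python
from bisect import bisect_right, insort
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoIndex:
    """Hypothetical per-user index over a single whole-video node."""
    duration: float                                         # length in seconds
    boundaries: List[float] = field(default_factory=list)   # sorted cut points

    def add_boundary(self, t: float) -> None:
        # Cut points can be added at run time; duplicates and
        # out-of-range values are ignored.
        if 0.0 < t < self.duration and t not in self.boundaries:
            insort(self.boundaries, t)

    def segment_at(self, t: float) -> Tuple[float, float]:
        """Return the (start, end) of the dynamically defined scene containing t."""
        i = bisect_right(self.boundaries, t)
        start = self.boundaries[i - 1] if i > 0 else 0.0
        end = self.boundaries[i] if i < len(self.boundaries) else self.duration
        return (start, end)


author = VideoIndex(60.0, [20.0, 40.0])
novice = VideoIndex(60.0, [20.0, 40.0])
novice.add_boundary(30.0)          # the author's scene becomes two scenes
expert = VideoIndex(60.0, [40.0])  # the author's scene is part of a longer one
```

Here the same time coordinate, 25 seconds, falls in a different scene for each of the three indexes, without the video file itself being segmented.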

Hypervideo rhetoric

Another issue concerning hypervideo is the specification of the presentation for both the destination video and the source video, which Hardman et al. define as context [Hardman et al. 1993]. This specification is even more important when more than one video is active at a time. For instance, when traversing a link from a source video, different behaviors for the source and destination videos may be desirable. The source video may keep running, or it may pause until the destination video finishes playing. If the source video pauses, it is important to determine what will happen when the destination video ends: the source could resume from the point in time where it was paused, or it could restart from the beginning. In another context, traversing a link may imply that the destination video replaces the source video. This requirement for context, in Hardman’s words, is not limited to hypervideo but extends to other media used in the hypermedia session. With all these possibilities, and the fact that different paths through the hypervideo are possible, it is important that the narrative sequence and the whole presentation remain coherent.

Scenario-based hypermedia systems such as Videobook [Ogawa et al. 1990, Ogawa et al. 1992] consider the integration of different media. The specification of media and their synchronization is done using a nodal structure and timer-driven links. Another research thread related to the specification of presentation for multimedia is the creation of meta-languages like HyTime and SMIL. Also, due to the distributed nature of the Web, a standard for the specification of multimedia is very important. HyTime and SMIL are attempts to create such a standard for the Internet. The next sections briefly describe both HyTime and SMIL.
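
The link context choices discussed in this section (the source keeps running, pauses, or is replaced, and a paused source either resumes or restarts) can be written down as a small Python sketch. The enumeration and function names are hypothetical and do not come from Hardman et al.; times are assumed to be in seconds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceBehavior(Enum):
    CONTINUE = "continue"  # source keeps running beside the destination
    PAUSE = "pause"        # source waits until the destination ends
    REPLACE = "replace"    # destination takes the place of the source


class ReturnBehavior(Enum):
    RESUME = "resume"      # continue from the point where the source paused
    RESTART = "restart"    # start the source again from the beginning


@dataclass(frozen=True)
class LinkContext:
    """Hypothetical context attached to a link."""
    source: SourceBehavior
    on_return: ReturnBehavior = ReturnBehavior.RESUME


def source_position_after(ctx: LinkContext, link_time: float,
                          destination_length: float,
                          source_length: float) -> Optional[float]:
    """Return the source playback position when the destination ends.

    None means the source is no longer presented.
    """
    if ctx.source is SourceBehavior.REPLACE:
        return None
    if ctx.source is SourceBehavior.CONTINUE:
        # The source kept playing; it cannot run past its own end.
        return min(link_time + destination_length, source_length)
    if ctx.on_return is ReturnBehavior.RESUME:
        return link_time
    return 0.0
```

Making the context an explicit part of each link, rather than a global player setting, lets an author choose a different behavior for every traversal while keeping the overall presentation coherent.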


HyTime

HyTime is an ISO standard meta-language that includes general representations for links and anchors, and for positioning and projecting arbitrary objects in time and space [Buford, 1996]. It was designed during the late 1980s. At the time of its standardization very little had actually been implemented. Given the technology of that time, the great variety of systems, and the lack of interoperability, HyTime required considerable complexity. HyTime extends, and also requires, SGML (Standard Generalized Markup Language). Support for HyTime derived from its position as an international standard and from the idea that the Web would migrate to an SGML Web, for which HyTime would represent the natural evolution. HyTime allows anchors to be formed dynamically. It also allows the representation of synchronization relationships. HyTime is primarily concerned with the addressing, association, and structuring of hypermedia information. Nevertheless, HyTime does not represent the interaction and presentation aspects of multimedia content, and it provides only very limited support for scripting language integration. The complexity and limitations of HyTime present serious obstacles to its becoming the standard for a multimedia Web.

SMIL: Synchronized Multimedia Integration Language

SMIL is an open, XML-based layout language proposed by the World Wide Web Consortium (W3C) as a way of choreographing interactive multimedia content for real-time delivery over the World Wide Web [Real Networks, 1998]. XML is a W3C recommendation for a syntactical infrastructure, or meta-language, that provides a framework for creating and describing other markup languages, such as SMIL. More specifically, XML provides the syntactical rules for creating new tags. SMIL is a markup language very similar in syntax to HTML. It does not require a programming language and can be authored using a simple text editor, which makes it widely accessible. SMIL is complementary to HTML: while HTML describes the presentation of a Web page, SMIL describes the presentation of the multimedia data that fills the Web page.

SMIL defines the mechanism for composing multimedia presentations, synchronizing where and when the different media are presented to the end user over the Internet. It is designed to support the layout of any data type and file format, including container file formats. It provides a way of taking different media and placing them relative to one another on a timeline and relative to one another on the screen. SMIL is designed to work in a distributed environment and to make efficient use of network resources over the Internet. For instance, a SMIL file can list custom presentation choices for different bandwidths or different language preferences.
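
To give a feel for such a file, the following Python sketch builds a small SMIL 1.0 style document with the standard library's `xml.etree.ElementTree`. The element and attribute names (`layout`, `region`, `par`, `switch`, `system-bitrate`) follow SMIL 1.0 as far as the author of this sketch is aware; the file names, region names, sizes, and the function `build_smil` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_smil() -> str:
    """Build an illustrative SMIL document and return it as a string."""
    smil = ET.Element("smil")

    # Where: regions place the media relative to one another on the screen.
    head = ET.SubElement(smil, "head")
    layout = ET.SubElement(head, "layout")
    ET.SubElement(layout, "root-layout", width="320", height="300")
    ET.SubElement(layout, "region", id="video_region",
                  top="0", left="0", width="320", height="240")
    ET.SubElement(layout, "region", id="caption_region",
                  top="240", left="0", width="320", height="60")

    # When: children of <par> are presented in parallel on the timeline.
    body = ET.SubElement(smil, "body")
    par = ET.SubElement(body, "par")

    # <switch> lists alternatives; the first acceptable one is chosen,
    # so the high-bandwidth clip comes first and the fallback comes last.
    switch = ET.SubElement(par, "switch")
    ET.SubElement(switch, "video", {"src": "clip_high.rm",
                                    "region": "video_region",
                                    "system-bitrate": "56000"})
    ET.SubElement(switch, "video", {"src": "clip_low.rm",
                                    "region": "video_region"})

    # The caption starts two seconds after the parallel group begins.
    ET.SubElement(par, "text", src="caption.txt",
                  region="caption_region", begin="2s")

    return ET.tostring(smil, encoding="unicode")


document = build_smil()
```

Printing `document` gives markup that could equally have been typed by hand in a text editor, which is how SMIL is normally authored; the Python wrapper is only a convenience for this illustration.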

SMIL was developed by a group made up of research organizations and companies from the CD-ROM, interactive television, Web, and audio and video streaming industries. Since SMIL is an open language, it is open to modifications and refinements. It is a markup language that is neutral with respect to platform, data type, and file format.

The importance of SMIL is that it may become the standard for multimedia on the Web.

Due to the complexity added by the inclusion of time and the rhetoric imposed by time-based media, one of the areas that requires much development is authoring systems. The next section presents a brief discussion of some hypervideo authoring systems.
