Vascular Plant Image Library
Query the Full Text Index - Information
What is the format of the collection?
The collection is based on each contributor's determinations of plant name (genus, species) and family assignment for each plant photographed. The structure and content of the comments for each image are also determined by the contributor. During processing, we are working on building linkages to ancillary sources of information for unique strings that might be present in a given contributor's comments, especially supplemental location data for natural areas that are frequently referenced.
How will the collection be updated in the future?
The base image library file is re-processed
by Dr. Wilson whenever individual contributors change the files they provide,
whether by additions or corrections. The large text file produced
(dftout1.txt) is sent via FTP to the CSDL server and reindexed there by an
automated system that Dr. Wilson activates from a web page.
To find particular information stored in computer text files, one has a few alternative courses of action. One can operate directly on the text files with utilities such as UNIX grep, or one can process the text files into some form of database. Grep is generally limited to identifying lines that match a regular expression. If the collection of files that grep operates on becomes large, then repeated passes over the entire text on each query become expensive. However, grep is simple to use, as no auxiliary files need to be created.
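The grep approach described above can be illustrated with a minimal Python sketch. It is not how grep itself is implemented; it only shows the essential behavior: every query reads every line and tests it against a regular expression, so the cost of a query grows with the size of the text. The function name `grep_lines` and the sample records are hypothetical and are not taken from the image library.

```python
import re
from typing import Iterable, Iterator, Tuple


def grep_lines(pattern: str, lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line matching the regular expression."""
    regex = re.compile(pattern)
    # Every query scans every line: no index or auxiliary file is used.
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line.rstrip("\n")


# Hypothetical sample records, not taken from the actual library file.
SAMPLE = [
    "Quercus alba, Fagaceae, upland woods\n",
    "Acer rubrum, Aceraceae, stream bank\n",
    "Quercus stellata, Fagaceae, sandy prairie\n",
]

matches = list(grep_lines(r"^Quercus\b", SAMPLE))
for number, line in matches:
    print(number, line)
```

Because nothing is precomputed, a second query repeats the whole scan, which is the expense that an index avoids.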
A database consists of some data and indexes into that data. With indexes, one can query a large database quickly. Standard databases divide the data into records made up of fields, which means that the granularity of search is a field. In a full-text system such as MG, there are no fields (or there is an arbitrarily sized list of word fields per document); instead, every word is indexed. With this method, the system can accept free-form information and still search quickly. The next question is how much overhead this database carries. In MG, most of the files produced are in compressed form. The two notable compressed files are the given data and the index, called an "inverted file". Because the files are compressed, the database can be smaller than the source data.
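As a rough illustration of the "inverted file" idea, the following Python sketch maps every word to the set of documents that contain it, so a query looks up words instead of rescanning the text. It is a simplified assumption about the general technique and does not reproduce MG's file formats or its compression. The names `tokenize`, `build_inverted_index`, and `search_all`, and the sample documents, are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents: List[str]) -> Dict[str, Set[int]]:
    """Map each word to the set of document numbers in which it occurs."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)


def search_all(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical free-form comments, not taken from the actual collection.
DOCS = [
    "Quercus alba photographed in upland woods",
    "Acer rubrum along a stream bank",
    "Quercus stellata on sandy prairie near the stream",
]

index = build_inverted_index(DOCS)
print(search_all(index, "quercus stream"))
```

The index is built once and reused for every query. This sketch keeps it uncompressed in memory, whereas MG stores both the text and the inverted file in compressed form.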
The most common use for MG has been as a
search database on UNIX mail files. However, any set of text data
can be used; one just needs to determine what constitutes a document (see
mgintro++(1)).
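To make the notion of "what constitutes a document" concrete, here is a minimal Python sketch that treats each message in a UNIX mail file as one document, assuming the common mbox convention that a message begins with a line starting with "From ". This is only an illustration of choosing document boundaries; it is not MG's own splitting mechanism, and it ignores mbox details such as escaped "From " lines in message bodies. The function name `split_mbox` and the sample mail text are hypothetical.

```python
from typing import List


def split_mbox(text: str) -> List[str]:
    """Split mail-file text into documents, one per message.

    Assumes each message starts with a line beginning with "From ".
    """
    documents: List[str] = []
    current: List[str] = []
    for line in text.splitlines(keepends=True):
        # A new "From " line closes the message collected so far.
        if line.startswith("From ") and current:
            documents.append("".join(current))
            current = []
        current.append(line)
    if current:
        documents.append("".join(current))
    return documents


# Hypothetical mail file contents, used only for illustration.
SAMPLE_MAIL = (
    "From alice@example.org Mon Jan  4 09:00:00 1999\n"
    "Subject: oak photos\n"
    "\n"
    "New Quercus images attached.\n"
    "From bob@example.org Tue Jan  5 10:30:00 1999\n"
    "Subject: maple\n"
    "\n"
    "Acer rubrum in flower.\n"
)

docs = split_mbox(SAMPLE_MAIL)
print(len(docs))
```

For other kinds of text, the same choice has to be made with a different rule, for example one document per record or per paragraph.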
MG has also been used on large collections, such as Comact (Commonwealth
Acts of Australia), which is around 132 megabytes, and TREC (a mixture of
collections such as the Wall Street Journal and Associated Press), at sizes
up to around 2 gigabytes.
This document is modified from an original prepared
by Haiyan Wang and Jingchen Xu, the students in Dr. John Leggett's CPSC670
(Fall 1999) who selected this system as a project.