Isite, Release 2.05

GENERAL INSTRUCTIONS

Bringing up one or more databases under the FGDC Z39.50 server is a multi-step process:

  1. Prepare your metadata records
  2. Prepare the field definition files, if necessary
  3. Build the searchable indexes by running Iindex
  4. Test the searchable indexes by running sample searches with Isearch
  5. Configure the Z39.50 server
  6. Start the Z39.50 server by running zserver
  7. Test the Z39.50 server by running sample searches with zclient or izclient
  8. Notify the Clearinghouse that your server is ready to be accessed.

PREPARE THE METADATA RECORDS

Take your collection of plain-text metadata records and run them through mp to get the corresponding text, HTML and SGML tagged versions. Put all of them in the same directory (that is, make sure that the text, HTML and SGML files for each record are together - there can be several directories of them, if desired). The files can have either 4-letter filename extensions (*.text, *.html and *.sgml) or 3-letter filename extensions (*.txt, *.htm and *.sgm). The search engine will find whichever is present.
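
As a sketch, assuming your raw records use a hypothetical .met extension and that your version of mp supports the -t, -h, -s and -e output options (check mp's own documentation for your release), a batch conversion might look like:

for f in *.met
do
    base=`basename $f .met`
    mp -t $base.text -h $base.html -s $base.sgml -e $base.err $f
done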

PREPARE THE FIELD DEFINITIONS

Take the following list of field definitions (also found in the file fgdc.fields) and save it locally. You could even call it fgdc.fields, if you want:

----------------------------CUT HERE--------------------------------

attrmres=num
begdate=date
begtime=time
bounding=gpoly
caldate=date
denflat=num
eastbc=num
enddate=date
endtime=time
feast=num
fnorth=num
latprjo=num
longcm=num
northbc=num
numdata=num
numstop=num
procdate=date
proctime=time
pubdate=date
pubtime=time
rdommax=num
rdommin=num
rngdates=text
southbc=num
srcscale=num
stdparll=num
time=time
timeinfo=text
timeperd=date-range
westbc=num

----------------------------CUT HERE--------------------------------

Note that this file takes the 8-character SGML tag names and associates a data type with each one. And, yes, there are more fields here than we have actually made searchable (for now). Text fields do not need to be included here, but it doesn't hurt to list them.

The possible data types are "text" (the default), "num", "date", "date-range" and "gpoly". Descriptions of the new data types (other than "text") can be found in doc/NumericData.txt.

The fieldtype file will be read by Iindex when you index the files.

RUNNING THE INDEXER

Now, we're ready to build the index. Use a command like the following:

./Iindex -d <path>/<mydbname> -t fgdc -m 8 -o fieldtype=<field-type-path> \
-o mpcommand="path to mp" <list-of-sgmlfiles>

where you'll replace <path> and <mydbname> with the path and name of the searchable index you wish to create, and <field-type-path> with the path and name of the field type file you made above. You can generate the list of SGML files with find, or with just "*.sgml" if they're all in one directory. Run Iindex with no arguments to see all of the indexing options.
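
For example (all paths here are hypothetical): to build an index named POLY under /usr/local/isite/db, using a field type file in the current directory and a tree of SGML records under /data/metadata:

./Iindex -d /usr/local/isite/db/POLY -t fgdc -m 8 -o fieldtype=./fgdc.fields \
`find /data/metadata -name "*.sgml" -print`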

If you forget to specify the fieldtype file, the doctype will nag you about it and give you the opportunity to enter it.

You can have the search engine call a parser to convert the SGML files to text or HTML on the fly, instead of making static versions of all the files ahead of time. Just use the doctype option mpcommand and specify the command line to call the program. Note that this version expects it to be mp, and builds the appropriate command, so if you use another program, it had better have the same command line arguments as mp.

At this point, the index files will be created.

ABOUT THE FGDC DOCTYPE

The doctype assumes that either there exist 3 files for each metadata record (an SGML-tagged version to feed the indexer, and plain text and HTML versions, with the same root filename, for presentation), or that the forms for presentation can be generated dynamically with a filter program. Those files can be created any way you like, but Peter Schweitzer's mp program is the recommended tool in either case.

The actual code is in the Isite-fgdc distribution in the files fgdc.cxx and fgdc.hxx, in the subdirectory Isearch-geo-alpha/doctype.

The doctype contains three primary methods - one to parse the documents, one to parse the fields and one to service presentation requests. The field parser is of most direct interest. It uses a routine called usefulFGDCField, which contains a list of fields to be indexed and made searchable. All other fields are skipped. Here's the actual routine, so you can see what's currently being handled.

---------------------------------------------------------------------
GDT_BOOLEAN usefulFGDCField(const STRING& FieldName)
{
	if (FieldName.Search("title"))
		return GDT_TRUE;
	else if (FieldName.Search("pubdate"))
		return GDT_TRUE;
	else if (FieldName.Search("descript"))
		return GDT_TRUE;
	else if (FieldName.Search("abstract"))
		return GDT_TRUE;
	else if (FieldName.Search("edition"))
		return GDT_TRUE;
	else if (FieldName.Search("placekey"))
		return GDT_TRUE;
	else if (FieldName.Search("purpose"))
		return GDT_TRUE;
	else if (FieldName.Search("srcscale"))
		return GDT_TRUE;
	else if (FieldName.Search("lineage"))
		return GDT_TRUE;
	else if (FieldName.Search("themekey"))
		return GDT_TRUE;
	else if (FieldName.Search("themekt"))
		return GDT_TRUE;
	else if (FieldName.Search("bounding"))
		return GDT_TRUE;
	else if (FieldName.Search("westbc"))
		return GDT_TRUE;
	else if (FieldName.Search("eastbc"))
		return GDT_TRUE;
	else if (FieldName.Search("northbc"))
		return GDT_TRUE;
	else if (FieldName.Search("southbc"))
		return GDT_TRUE;
	else if (FieldName.Search("origin"))
		return GDT_TRUE;
	else if ((FieldName.Search("begdate")) && !(FieldName.Search("begdatea")))
		return GDT_TRUE;
	else if ((FieldName.Search("enddate")) && !(FieldName.Search("enddatea")))
		return GDT_TRUE;
	else if (FieldName.Search("caldate"))
		return GDT_TRUE;
	else if (FieldName.Search("geoform"))
		return GDT_TRUE;
	else if (FieldName.Search("browsed"))
		return GDT_TRUE;
	else if (FieldName.Search("browsen"))
		return GDT_TRUE;
	else if (FieldName.Search("direct"))
		return GDT_TRUE;
	else if (FieldName.Search("indspref"))
		return GDT_TRUE;
	else if (FieldName.Search("dsgpoly"))
		return GDT_TRUE;
	else if (FieldName.Search("dsgpolyx"))
		return GDT_TRUE;
	else if (FieldName.Search("cntorgp"))
		return GDT_TRUE;
	else if (FieldName.Search("timeinfo"))
		return GDT_TRUE;
	else if (FieldName.Search("timeperd"))
		return GDT_TRUE;
	else if (FieldName.Search("rngdates"))
		return GDT_TRUE;
	else if (FieldName.Search("progress"))
		return GDT_TRUE;
	else if (FieldName.Search("update"))
		return GDT_TRUE;
	else
		return GDT_FALSE;
}

----------------------------------------------------------------

This list can be expanded, but if you change it in the file fgdc.cxx, you will have to recompile the source, and there's no guarantee that other Clearinghouse sites will support the same additional fields for searching.

Any tag matching one of these, at any nesting level in the metadata record, is indexed. In addition, tags are created for the full nesting level for the tag, so that not only can one search on title, one can also search on a specific instance of title.
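
For example, using the fielded-search syntax shown under TESTING THE INDEX below, both of the following are valid; the second restricts the search to one specific instance of the title field:

./Isearch -d db/test title/water
./Isearch -d db/test METADATA_IDINFO_CITATION_CITEINFO_TITLE/water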

(Z39.50 geek.note - we need to define some way to handle this nesting. Jim and I have talked about a "nesting" relation attribute, but it might also be good to think about the simple answer - "title" would be distinct from "METADATA_IDINFO_CITATION_CITEINFO_TITLE" [which is what I use for the title in the search Brief records] by giving them separate use attributes - ugly but effective. Maybe too ugly...)

TESTING THE INDEX

Once Iindex has run to completion, you can verify that the index was built correctly and that the files are in the expected places with a command like the following:

./Isearch -d <path>/<mydbname> <search-terms>

where you'll replace <path> and <mydbname> with the path and name of the searchable index you created with Iindex, and where the search terms can be selected words or phrases from your documents. For example,

./Isearch -d db/test water

will search the test database (see the script build-testindex) for the word "water", while

./Isearch -d db/test title/water

will search the same database for the word water when it appears in a title field.

RUNNING THE Z39.50 SERVER

The next step is to edit zserver.ini and sapi.ini so that zserver can find the index you just created. zserver reads a field map file specified in sapi.ini. The right one to use is bin/geo.fgdcmap (you can also use bin/bib1.fgdcmap and bin/gils.fgdcmap) - it contains the mapping between the Z39.50 use attribute numbers and the actual names of the fields as they're expressed in the SGML metadata records and indexed. For now, I map specific instances of the fields, including the nesting level. For example, attribute #4 (TITLE) is actually mapped to METADATA_IDINFO_CITATION_CITEINFO_TITLE (which is one specific instance of TITLE in the record).

Note that whenever you run Iindex to reindex a document collection, you must kill the existing zserver process and start a new one. This will ensure that all of the correct links to the databases are found.
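
A minimal restart sketch (use whichever ps form your system supports - "ps -ef" on SysV-style systems, "ps aux" on BSD-style ones):

ps -ef | grep zserver
kill <zserver-pid>
cd bin
./zserver &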

Assuming you've indexed your files with Iindex and that testing with the command-line program Isearch works, here is a checklist for making sure zserver is configured:

1. Edit bin/zserver.ini to put your index's name in DBList. It won't hurt to also set DebugLevel=5 at this point. Don't set ServerMode to INETD unless you really know what you're doing - it's really a debugging mode, and zserver is designed to work best with ServerMode=STANDALONE. Check that Port is the right value for your installation (remember - you need system privileges to run zserver on a port number less than 1024). And check that the path to sapi.ini is correct.
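
With those settings, the relevant lines might look like this (a sketch - "mydbname" is a placeholder, and the stock zserver.ini may group these under a section header alongside other keys):

DBList=mydbname
DebugLevel=5
ServerMode=STANDALONE
Port=5555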

2. Edit bin/sapi.ini to add the descriptive information about your database to the list of available indexes. The entry should look like this:

[TESTHTML]
Type=ISEARCH
Location=/tmp
FieldMaps=my.map

Note that this says you've run Iindex with "-d /tmp/TESTHTML" and it's created a bunch of files in /tmp with names like TESTHTML.*. The field map file associates the field names created by Iindex with Z39.50 use attribute numbers; it is not mandatory if you only want the default BIB1 attribute set. There's one already set up to map the GEO profile use attributes to the field names created by Iindex with the FGDC doctype - it's bin/geo.fgdcmap (which contains specific instances of the field names, rather than the generic ones listed in the GEO profile - just in case you were wondering).
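
So, for an FGDC index built with "-d /usr/local/isite/db/POLY" as in the earlier Iindex example (hypothetical names and paths again), the corresponding entry would be:

[POLY]
Type=ISEARCH
Location=/usr/local/isite/db
FieldMaps=geo.fgdcmap

Depending on your layout, the FieldMaps value may need a path to geo.fgdcmap rather than the bare filename.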

3. Once you've got the field map files squared away (and you may not need to change anything because I've already set it up for our test installations), and done any necessary editing to zserver.ini and sapi.ini, you can just run zserver to test it.

It's best to cd into the bin directory and start zserver with the command "./zserver". If you've set DebugLevel to 5, you should see a bunch of messages about the indexes it found. If it complains, you've either got a path wrong in sapi.ini, or the database name is incorrect in the DBList in one of the two ini files.

Now, you can bang on the server with one of the clients.

4. Once you've established that it's working, you can run zserver in background. Change to the bin subdirectory and issue the command:

./zserver &

This will start the server and put it into the background. You can start it from any other directory with the command

<path-to-server>/zserver -i<path-to-inifile>/zserver.ini

but be careful about specifying relative paths in zserver.ini and sapi.ini.
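
For example, assuming a hypothetical installation under /usr/local/isite:

/usr/local/isite/bin/zserver -i/usr/local/isite/bin/zserver.ini &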

TESTING THE Z39.50 SERVER

Once the server is running, you can test it with the program zclient. The command is

./zclient <host> <port> <dbname> <search-terms>

To search the test database, assuming you've built it and left the specifications for it in zserver.ini and sapi.ini, issue the command

./zclient localhost 5555 test water

to duplicate the test search from above, only this time it goes through the Z39.50 server the same way queries through the Clearinghouse will.

Incidentally, to search for the word "water" in the title field, you have to use the Z39.50 syntax. It's arcane, but powerful, and looks like

./zclient localhost 5555 test "water[1,4]"

If you're interested, this says to search for the word "water" with attribute type 1 (the Z39.50 "use" attribute, or field name) set to the value 4 (which is the code in the GEO profile for the title field). The advantage of this approach is that the field doesn't actually have to be called "title" - it can be called anything (like "METADATA_IDINFO_CITATION_CITEINFO_TITLE"), and the user doesn't have to know that, because the field map handles the translation between the logical name (use attribute=4) and the local field name. Without this capability, we would not be able to search titles in multiple databases unless all the databases had the same field name.
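
As one more sketch: BIB1, which the GEO profile builds on, defines use attribute 1016 as "Any". Assuming your server's field map honors it, a search across all fields would look like:

./zclient localhost 5555 test "water[1,1016]"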

NOTIFY THE CLEARINGHOUSE

Send mail to Doug Nebert (ddnebert@fgdc.gov) to tell him the name of your server, the port your server is running on and the name of your database(s). Also include a short descriptive phrase for each database which will tell the user what the database contains. This short phrase will appear in the selection list in the Clearinghouse search form.

PREPARING THE GATEWAY

Very few sites will need to do this. If you think you need to set up a gateway, check with Doug Nebert to see if it's really necessary. For Clearinghouse purposes, it's sufficient to index your files and run the Z39.50 server. Once the server is running and tested, you can send the server information to Doug and/or me, and we'll add it to the master gateway.

That having been said, if your heart is set on running a gateway, here are the instructions.

Get the source code tar file Isite-2.XX.tar.gz (where XX is the release number, currently 05). Unpack the source and cd into the root Isite-geo-beta directory. There's some path configuration information in the file zdist/defines.hxx. Edit that file, then execute the command "make" from the Isite directory. If you get an error message about a missing Makefile, cd to the Isearch subdirectory, execute the command "./configure", then cd back to the Isite directory and issue the "make" command again. It _should_ proceed without errors (at least, it does on the Linux, Solaris, SunOS, Dec Unix, DG and SGI boxes I have access to).
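
Spelled out, the build sequence looks like this:

cd Isite-geo-beta
# edit zdist/defines.hxx first, then:
make
# if make complains about a missing Makefile:
cd Isearch
./configure
cd ..
make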

Change to the bin subdirectory. Put zgate and zcon into your cgi-bin directory. Edit gateway.ini to include the list of indexes that should appear in the search form. gateway.ini contains lines like this:

----------------------------CUT HERE--------------------------------

[sites]

# This is the list of databases to be presented in the search form, if
# you want the user to select from multiple databases. See the file
# html/numsearch.html for an example. Zgate will paste these into the
# search form when the user initiates a session.
#
# There are two possible forms - local databases and remote databases.
#
# To include a local database, enter two values - the database name and
# a short description to be displayed for the user. For example:
#
# test+Test Database
#
# includes the database called test, and puts the string "Test Database"
# into the selection list.
#
# To include a remote database, you must specify the host, port,
# database name and a short description. For example:
#
# kudzu.cnidr.org:5555/test+Global Test Database
#
# includes the database called test running on kudzu.cnidr.org on a
# Z39.50 server on port 5555, and puts the string "Global Test Database"
# into the selection list.
#
# See Isite.html for details.
location=test+Archie's Test Database,\
kudzu.cnidr.org:5555/test+Global Test Database

[config]
# This is the path to the gateway forms in the local file system
WEBFORMS=/usr/local/etc/httpd/htdocs
# This is used as a scratch directory to track named queries but
# it's not fully implemented yet
SPATH=/tmp/queries
# This is where the log files and temporary files get written
SCRATCH_DIR=/tmp
# This is the timeout value for zcon - the process will die after
# this many seconds if zcon doesn't hear from zgate. It overrides
# the value set in the source code file defines.hxx
TIMEOUT=300

[zgate]
# GATEWAY_PATH and GATEWAY_HTML define the URL to the gateway login form
#GATEWAY_PATH=/zgate
GATEWAY_HTML=gateway.html
# FILE_PATH and FILE_NAME define the URL to the gateway binary
FILE_PATH=cgi-bin
FILE_NAME=zgate
# TRACKER is a temp file to hold some state info
TRACKER_PATH=tmp
TRACKER_NAME=recent

[zcon]
# FILE_PATH and FILE_NAME define the name of the status temp file
FILE_PATH=cgi-bin
FILE_NAME=zcon
# HISTORY holds past queries
HISTORY_PATH=tmp
HISTORY_NAME=hx
# LOCK holds status info for the connection manager
LOCK_PATH=tmp
LOCK_NAME=lock
# PROG_NAME is the current name of the connection manager
PROG_NAME=zcon
----------------------------CUT HERE--------------------------------

The first entry is for a local database, that is, one on your local machine. All that's required is the name of the database and a short description. The other entry points to a remote database, so the entry contains the database host, port number, name and a description to appear in the form. Just make sure no "+" signs appear in the description. Put gateway.ini in the cgi-bin directory, too.

Next, install gateway.html and one of the search forms (search.html, numsearch.html or advsearch.html - they're profusely commented) in your http server directory. Edit the input field value in gateway.html to point to your local copy of a search form. Edit the form's action field in the search form to contain the URL of zgate on your server.

In order to be able to search multiple indexes on the gateway server, you'll need to run a second instance of zserver. What I do is create a second directory called "gateway". Copy the files zserver, zserver.ini, sapi.ini and bib1.map into the gateway subdirectory. Edit sapi.ini to point to the indexes you want to mount, and edit zserver.ini so that gateway/zserver runs on the next higher port number from the one bin/zserver is running on. That is, if bin/zserver is set up to run on port 5333, set the port number for gateway/zserver to 5334. Run both servers, and you're ready to go. The secondary server has to run on the same machine as the primary server.
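
A sketch of that setup, starting from the Isite directory and assuming bin/zserver runs on port 5333:

mkdir gateway
cp bin/zserver bin/zserver.ini bin/sapi.ini bin/bib1.map gateway
# edit gateway/sapi.ini to point at your indexes, and set Port=5334
# in gateway/zserver.ini
cd gateway
./zserver &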

Archie Warnock (warnock@awcubed.com)
A/WWW Enterprises
http://www.awcubed.com