What does the example do?
Overview
The example can do these things:
- It can load an Emdros database with a text, initializing the
database for use with the example.
- It can build HAL-spaces of the text in the database.
- It can load a previously generated HAL space from the database
- It can emit two kinds of output:
  - A complete HAL-space as a Comma-Separated-Value (CSV)
    text file suitable for loading into a spreadsheet.
  - Certain parts of a HAL-space, based on certain
    word-forms that are of interest.
What output does it give?
- For building the database:
  - A loaded Emdros database.
  - A list of word-forms found in the text (the Q word forms
    spoken of on the HAL space page).
- For querying the database:
  - Optionally, a Comma-Separated-Value (CSV) file
    containing the entire HAL-space, suitable for loading into a
    spreadsheet.
  - An output file with data for the words which you are
    especially interested in.
    For each word, a list is given of the words with which it
    cooccurs most frequently and closely.
    If the word form you are interested in is w1 and the word
    form with which it cooccurs is w2, then the score is
    calculated as Matrix[w1][w2] + Matrix[w2][w1].
    This score is printed twice. First, it is printed in a
    scaled form: the score is multiplied by a user-specified
    factor and divided by the text length. Second, it is
    printed in its raw form, as it came from the matrix.
    The list is sorted so that the "heaviest" words (those with
    the highest scores) come first. The user can specify how
    many words to put in the list.
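The scoring described above can be sketched in a few lines of Python. This is only an illustration, not the example's actual implementation: it assumes the HAL-space is held as a nested dictionary `matrix[w1][w2]`, and the function name `top_cooccurrences` and its parameters are hypothetical names chosen for this sketch.

```python
def top_cooccurrences(matrix, w1, text_length, factor, max_values):
    """Return up to max_values tuples (w2, scaled, raw), heaviest first.

    matrix is assumed to be a dict of dicts (a hypothetical layout);
    missing cells count as 0.
    """
    # Candidate partners: every word in w1's row, plus every word
    # whose own row has an entry for w1.
    partners = set(matrix.get(w1, {}))
    partners.update(w for w, row in matrix.items() if w1 in row)

    rows = []
    for w2 in partners:
        # Raw score: Matrix[w1][w2] + Matrix[w2][w1]
        raw = matrix.get(w1, {}).get(w2, 0) + matrix.get(w2, {}).get(w1, 0)
        # Scaled score: multiplied by the factor, divided by the text length
        scaled = raw * factor / text_length
        rows.append((w2, scaled, raw))

    # Heaviest words first; ties are broken alphabetically (an assumption).
    rows.sort(key=lambda r: (-r[2], r[0]))
    return rows[:max_values]


# Made-up toy data, purely for illustration.
matrix = {
    "sea": {"ship": 5, "salt": 2},
    "ship": {"sea": 3},
    "wind": {"sea": 4},
}
result = top_cooccurrences(matrix, "sea", text_length=1000,
                           factor=100, max_values=2)
for w2, scaled, raw in result:
    print(w2, scaled, raw)
```

With this toy data, "ship" has a raw score of 5 + 3 = 8 and "wind" a raw score of 0 + 4 = 4, so they are the two entries kept when the maximum is 2.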
What input does it need?
- For loading the database: Only a text file.
- For querying the database: A configuration file with a
special format.
What is a configuration file?
A configuration file is a plain text file which looks like a Unix
configuration file, and holds the information necessary for running
the HAL example. See the later page for its format.
What information does a configuration file contain?
A configuration file contains the following:
- The name of the database which holds the text to
analyze.
- The sliding window width, n.
- The name of the CSV file that will contain the
HAL-space, suitable for reading into a spreadsheet
("none" if you don't want a CSV file).
- The name of the output file containing the
output for the words you are interested in.
- The words you are interested in.
- The maximum number of values for each word you
are interested in.
- The factor by which to multiply each value for
a given word. Each value is also divided by the number of words in
the text. This scaling can come in handy if you wish to compare
texts of different lengths.
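To see why dividing by the text length helps when comparing texts, consider a small worked example in Python. The numbers are made up, and `scaled_score` is a hypothetical helper written for this illustration, not part of the example's code.

```python
def scaled_score(raw_score, factor, text_length):
    # Multiply by the user-specified factor, divide by the number of
    # words in the text.
    return raw_score * factor / text_length


# The same raw score of 30 in a 1,000-word text and a 10,000-word text.
short_text = scaled_score(raw_score=30, factor=1000, text_length=1_000)
long_text = scaled_score(raw_score=30, factor=1000, text_length=10_000)
print(short_text, long_text)  # prints "30.0 3.0"
```

The raw scores are identical, but relative to its length the cooccurrence is ten times as heavy in the shorter text, and the scaled values make that visible.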