Configuring the program

Format of the configuration file

The configuration file follows the conventions of many other Unix and Windows configuration files:

  • Comments are prefixed by #, and anything from the # to the end of the line is ignored.
  • Blank lines are ignored.
  • The rest is a number of "key = value" pairs.
  • The keys are pre-defined (see below).
  • The values are either "quote-enclosed strings" (e.g., "C:\Emdros\mymap.map") or consist of letters, numbers, underscores, and/or dots, optionally followed by a "quote-enclosed string" (e.g., 'word.surface', 'word.surface."C:\Documents and Settings\Administrator\teckitmap.map"').
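
The rules above can be sketched as a small line-oriented reader. This is an illustrative sketch only, not the tool's own parser: the function name parse_config is hypothetical, and the sketch assumes that a # always starts a comment, even inside a quoted string, which the real tool may handle differently.

```python
def parse_config(text):
    """Return the (key, value) pairs of a configuration text, in order.

    A list of pairs is used rather than a dict because keys such as
    data_unit may legitimately occur more than once.
    """
    pairs = []
    for raw_line in text.splitlines():
        # Assumption: everything from the first # onwards is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError("expected 'key = value': " + raw_line)
        pairs.append((key.strip(), value.strip()))
    return pairs


sample = """
# database
database = mydb

data_unit = clause
data_unit = word   # You can have more than one
"""
print(parse_config(sample))
```

Checking the keys against the pre-defined set is deliberately left out of this sketch.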

When a value contains dots that are not enclosed in "quotes", the strings on either side of the dots are interpreted as subkeys. For example, the value 'word.surface' represents the subkey "word" with the value "surface", and the value 'word.surface."/home/myname/Blah.map"' represents the subkey "word" with the subsubkey "surface", followed by the value "/home/myname/Blah.map".
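
As a rough illustration of this interpretation, the following sketch splits a value into its unquoted dot-separated parts and an optional trailing quoted string. The helper name split_value is hypothetical, and the sketch assumes that the quoted string, if present, is always the last component of the value.

```python
def split_value(value):
    """Split a value into (subkeys, quoted_string_or_None)."""
    value = value.strip()
    quoted = None
    quote_pos = value.find('"')
    if quote_pos != -1:
        # The quoted part must be closed by a second quote at the end.
        if not value.endswith('"') or quote_pos == len(value) - 1:
            raise ValueError("unterminated quoted string: " + value)
        quoted = value[quote_pos + 1:-1]
        value = value[:quote_pos]
    # Dots inside the quoted string were cut away above, so only the
    # unquoted dots act as separators here.
    subkeys = [part for part in value.split(".") if part]
    return subkeys, quoted


print(split_value('word.surface'))
print(split_value('word.surface."/home/myname/Blah.map"'))
```

Dots inside the quoted path (such as the one in "Blah.map") are left alone, because the quoted part is removed before the remainder is split.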

Here is a sample configuration file, explained bit by bit:

Database selection

# database
database = mydb

You can specify a database that is always to be used with this configuration file (unless overridden with the -d switch to eqtc).

If using SQLite 2 or SQLite 3, you may wish to specify a path. Do so in quotes:

# Database path. You can place it anywhere you want, so long
# as you abide by the rules of your operating system. For
# example, on Windows, do not place any "changing" data,
# such as an Emdros database, underneath 
# C:\Program Files\.
database = "C:\Users\yourusername\Documents\Emdros\mydb.sqlite3"

Rasterising unit

# rasterising unit
raster_unit          = clause

The Emdros Query Tool operates with a notion of "rasterising unit". That is the unit to be displayed on one line. For example, if your query returns a number of words, then, in the example above, all clauses that contain at least one of the words will be fetched and displayed.

There can only be one rasterising unit.

Raster context

# raster context
raster_context_before          = 10
raster_context_after           = 10

The "raster_unit" can be replaced with "so many monads of context" (before and after a hit). If a raster_unit is specified, it takes priority. If a raster_unit is not specified, then both raster_context_before and raster_context_after must be present.
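
The priority rule can be sketched as follows. This is not the tool's own code: the function name resolve_raster and the use of a plain dict for the settings are assumptions made for illustration.

```python
def resolve_raster(settings):
    """Decide how hits are rasterised, given a dict of settings.

    Returns ("unit", name) if raster_unit is given, otherwise
    ("context", before, after) in monads.
    """
    if "raster_unit" in settings:
        # raster_unit takes priority over the context keys.
        return ("unit", settings["raster_unit"])
    required = ("raster_context_before", "raster_context_after")
    missing = [key for key in required if key not in settings]
    if missing:
        raise ValueError("missing key(s): " + ", ".join(missing))
    before = int(settings["raster_context_before"])
    after = int(settings["raster_context_after"])
    return ("context", before, after)


print(resolve_raster({"raster_context_before": "10",
                      "raster_context_after": "10"}))
```

With both raster_unit and the context keys present, the sketch returns the unit and ignores the context values.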

Data units

# data units
data_unit            = clause
data_unit            = phrase
data_unit            = word
data_feature         = word.surface
data_feature         = word.psp
data_feature         = phrase.phrase_type
data_feature         = phrase.function        # You can have more than one
data_unit_name       = clause."Cl"
data_left_boundary   = phrase.OPEN_BRACKET    # Specifies left boundary marker
data_right_boundary  = phrase.CLOSE_BRACKET   # Specifies right boundary marker

The data units are the units to be displayed in each rasterising line. They can be anything, and need not be words.

You must specify which feature(s) to display for each data unit. The feature-names must be prefixed with the name of the data unit plus a dot, as in the example above.

The capitalisation must be exactly the same as the value for the "data_unit" key. For example, if you said "data_unit = phrase", then you must also say "data_feature = phrase.phrase_type", not "Phrase.phrase_type".

There can be more than one data unit. If so, they should be specified in the order from largest to smallest (e.g., clause, phrase, word). This will give the "output" output style (see below) a hint as to how to print things in the right order.

You can optionally specify "boundary markers" that will be printed at the left and right boundaries of a unit respectively. The strings to be printed can be taken from the following table:

This string...    Is replaced by... (without the quotes)
SPACE             " "
COMMA             ","
COMMA_SPACE       ", "
COLON             ":"
COLON_SPACE       ": "
OPEN_BRACE        "{"
CLOSE_BRACE       "}"
OPEN_BRACKET      "["
CLOSE_BRACKET     "]"
OPEN_PAREN        "("
CLOSE_PAREN       ")"
NEWLINE           newline
NIL               ""
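
To make the substitution concrete, here is a minimal sketch of the table as a lookup, together with a hypothetical helper wrap_unit that places the markers around a unit's text. The names BOUNDARY_STRINGS and wrap_unit are illustrative and do not come from the tool.

```python
# The replacement table above, expressed as a lookup.
BOUNDARY_STRINGS = {
    "SPACE": " ",
    "COMMA": ",",
    "COMMA_SPACE": ", ",
    "COLON": ":",
    "COLON_SPACE": ": ",
    "OPEN_BRACE": "{",
    "CLOSE_BRACE": "}",
    "OPEN_BRACKET": "[",
    "CLOSE_BRACKET": "]",
    "OPEN_PAREN": "(",
    "CLOSE_PAREN": ")",
    "NEWLINE": "\n",
    "NIL": "",
}


def wrap_unit(text, left="NIL", right="NIL"):
    """Surround text with the strings named by left and right."""
    return BOUNDARY_STRINGS[left] + text + BOUNDARY_STRINGS[right]


# Mirrors data_left_boundary = phrase.OPEN_BRACKET and
# data_right_boundary = phrase.CLOSE_BRACKET from the sample.
print(wrap_unit("the house", "OPEN_BRACKET", "CLOSE_BRACKET"))
```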

The "data_unit_name" key gives, for a given object type, a string which will appear above all the other data_features (if any). In the above example, the clause unit is given a "Cl" label.

Finally, in the graphical version of the Emdros Query Tool, it is possible to have an interlinear display. The order of the lines in the interlinear display is the same as the order of the data_feature keys. The number of lines is equal to the number of features for the data unit for which the most data_feature keys are given, plus the number of data_unit_name keys for that unit.

TECkit mappings

#surface
data_feature_teckit_mapping  = word.surface."e:\TECkit\mymap.map"
data_feature_teckit_in_encoding  = word.surface.bytes
data_feature_teckit_out_encoding = word.surface.unicode

# lemma
data_feature_teckit_mapping  = word.lemma."e:\TECkit\mymap.map"
data_feature_teckit_in_encoding  = word.lemma.bytes
data_feature_teckit_out_encoding = word.lemma.unicode

TECkit is a tool made by SIL International. It converts between encodings, in particular to and from Unicode. The Emdros Query Tool incorporates TECkit, and you can apply it to any textual feature of any object type.

TECkit works with a so-called "map file" -- a text file which you or someone else writes. More information about writing TECkit mappings can be found on SIL's website:

http://scripts.sil.org/TECkit/

The Emdros Query Tool needs three pieces of information in order for TECkit to work on a particular feature:

  1. The name of the file which holds the mapping. This is given with the key "data_feature_teckit_mapping".
  2. The input encoding (encoding of the feature-string): This is given with the key "data_feature_teckit_in_encoding". The value can be either "bytes" or "unicode" (without the quotes). "bytes" means that TECkit does not convert to UTF-8. "unicode" means it is converted to UTF-8 for display. You should use whatever is used in the map file for input encoding here.
  3. The output encoding (encoding to transform into): This is given with the key "data_feature_teckit_out_encoding". The same meanings and restrictions apply as for the input encoding.

TECkit can not only convert between encodings, but also remove characters from a string. This can come in handy when you have characters in your feature-strings which you do not wish to display. Again, see the TECkit pages on SIL's website for information on how to write a TECkit mapping.

You should give first the object type, then a dot, then the feature-name, then a dot, then the full path to the map file. You probably need to enclose the path in "double quotes".

You can only have one TECkit mapping per feature.
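
The requirements above lend themselves to a simple consistency check. The sketch below is illustrative only: the function check_teckit and its input format (tuples of key, object type, feature, and argument, as if the configuration lines had already been split) are assumptions, not part of the tool.

```python
# Keys that every TECkit-enabled feature needs, per the list above.
REQUIRED_KEYS = (
    "data_feature_teckit_mapping",
    "data_feature_teckit_in_encoding",
    "data_feature_teckit_out_encoding",
)
VALID_ENCODINGS = ("bytes", "unicode")


def check_teckit(entries):
    """Return a list of problems found in the TECkit settings."""
    per_feature = {}
    problems = []
    for key, object_type, feature, argument in entries:
        name = object_type + "." + feature
        settings = per_feature.setdefault(name, {})
        if key in settings:
            # Only one mapping (and one encoding pair) per feature.
            problems.append(name + ": " + key + " given more than once")
        settings[key] = argument
    for name in sorted(per_feature):
        settings = per_feature[name]
        for key in REQUIRED_KEYS:
            if key not in settings:
                problems.append(name + ": missing " + key)
        for key in REQUIRED_KEYS[1:]:
            if key in settings and settings[key] not in VALID_ENCODINGS:
                problems.append(name + ": " + key + " must be bytes or unicode")
    return problems


entries = [
    ("data_feature_teckit_mapping", "word", "surface", r"e:\TECkit\mymap.map"),
    ("data_feature_teckit_in_encoding", "word", "surface", "bytes"),
    ("data_feature_teckit_out_encoding", "word", "surface", "unicode"),
]
print(check_teckit(entries))
```

An empty list means that each feature has exactly one mapping file and a valid input and output encoding.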

Reference unit

# reference units
reference_unit      = verse
reference_feature   = verse.book
reference_feature   = verse.chapter
reference_feature   = verse.verse

reference_sep = SPACE # between book and chapter
reference_sep = COMMA # between chapter and verse

If you have a unit in your database which somehow identifies the position in the document, or an ID, you can display these units at the left of each line. The canonical example is the Biblical system of book-chapter-verse, but in many corpora, there will be a unit identifying, e.g., which newspaper article something came from.

In the above example, verse is the reference unit, and three features are fetched, namely book, chapter, verse. The order in which they are specified in the configuration file is the order in which they will be emitted.

If there is more than one reference unit feature, you must specify the separators to separate them. In the above example "SPACE" will be emitted between "book" and "chapter", and "COMMA" will be emitted between the chapter and the verse (again, the order matters). See the table above for some possibilities of using special characters.
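
The way features and separators interleave can be sketched like this. The function format_reference is a hypothetical illustration, and only two of the separator names from the table are included to keep it short.

```python
# A subset of the replacement table, enough for this example.
SEPARATOR_STRINGS = {"SPACE": " ", "COMMA": ","}


def format_reference(values, separators):
    """Join feature values with the named separators, in order.

    There must be exactly one separator fewer than there are values.
    """
    if not values:
        raise ValueError("at least one reference feature is needed")
    if len(separators) != len(values) - 1:
        raise ValueError("need one separator between each pair of features")
    text = str(values[0])
    for name, value in zip(separators, values[1:]):
        text += SEPARATOR_STRINGS[name] + str(value)
    return text


# book, chapter, verse with reference_sep = SPACE, then COMMA
print(format_reference(["Genesis", 1, 3], ["SPACE", "COMMA"]))
```

Swapping the two reference_sep lines would swap the separators, which is why their order in the configuration file matters.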

There can be only one reference unit.

Output style

#output_style = kwic
#output_style = tree
#output_style = xml
output_style = output

This key specifies which implementation to use for emitting solutions. Currently, four kinds of output style are implemented:

  • output: A "bracketed" view.
  • tree: A "tree" view.
  • kwic: A "key words in context" view.
  • xml: XML output view. For usage, please see the DTD that is emitted along with the output.

Data tree parent

# Tree parent feature.
# If output_style = tree, then it is assumed that
# there is a feature on all relevant data units which gives the
# id_d of the parent.  That is, each child node in the tree
# must have a feature which provides the id_d of its parent.
# If a data_unit is provided which does not have a data_tree_parent,
# then that data_unit *must* contain the top-most nodes in the tree.
data_tree_parent = clause.parent
data_tree_parent = phrase.parent
data_tree_parent = word.parent

If "output_style" is set to "tree", then this option specifies, for each terminal and non-terminal in the tree, what feature gives the parent of the node. Note that this feature must have type "id_d", and the value must point to the id_d of the parent node.
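
To show how a tree follows from parent pointers alone, here is a minimal sketch. It is not the tool's layout code: the function build_children is hypothetical, the id_d values are made up, and the sketch assumes that a top-most node is marked by having no parent (None here), whereas the real representation of "no parent" may differ.

```python
from collections import defaultdict


def build_children(parent_of):
    """Turn a child -> parent mapping into (roots, parent -> children)."""
    children = defaultdict(list)
    roots = []
    for node, parent in sorted(parent_of.items()):
        if parent is None:
            roots.append(node)  # top-most node in the tree
        else:
            children[parent].append(node)
    return roots, dict(children)


# Made-up id_d values: one clause (1), two phrases (2, 3), three words.
parent_of = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
roots, children = build_children(parent_of)
print(roots)
print(children)
```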

Tree terminal unit

# Tree terminal unit.
# If output_style = tree, then the Emdros Query Tool needs to know
# which object types are terminals (i.e., leaf nodes in the tree)
# and which object types are non-terminals.  This is done by
# designating *one* (1) data_unit to be the data_tree_terminal_unit.
# The rest of the data_units will then be non-terminals.
data_tree_terminal_unit = word

This option tells the tree layout code which data_unit contains the terminals. Note that the Emdros Query Tool assumes that terminals and nonterminals are different object types. There may be more than one nonterminal object type, but only one terminal object type. The nonterminal object types are determined based on the data_unit option.

Object Type Name as Tree node name

# Tree nonterminal unit name.
# If and only if this is set to "true" (without the quotes),
# use the object type name as the node name in the tree.
#
# otherwise, it is advisable to add data_feature entries for all
# nonterminal units, which will then be shown.
#
# You can set this to "true" and still use data_feature -- the features
# will then be added below the object type name.
#
data_tree_object_type_name_for_nonterminals = true

This option tells the tree layout code to add the object type name of each nonterminal as the first line in each node box. If set to "true", this is what is done. If set to anything else, or if not set, it is not done.

Hit type

# hit type
# hit_type must be one of:
#    focus
#    innermost
#    innermost_focus
#    outermost
hit_type    =  outermost

The hit type determines how the sheaf is interpreted. There are four available options:

  • focus: Means that an object originating in a block with the FOCUS keyword present will result in one "hit".
  • innermost: Means that only the innermost MatchedObjects will give rise to hits; one hit per string of blocks in which all matched objects have no descendants (i.e., no inner sheaf).
  • innermost_focus: Like innermost, but only those matched objects whose "focus" boolean is set will have their monads included.
  • outermost: Means that only the outermost MatchedObjects will give rise to hits; one hit per outermost MatchedObject.

If none of these are specified, then "outermost" is assumed as the default.
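
The default can be expressed in a few lines. The function resolve_hit_type and the dict of settings are assumptions for illustration, and whether the real tool rejects unknown values in this way is not stated here.

```python
HIT_TYPES = ("focus", "innermost", "innermost_focus", "outermost")


def resolve_hit_type(settings):
    """Return the configured hit type, defaulting to outermost."""
    hit_type = settings.get("hit_type", "outermost")
    if hit_type not in HIT_TYPES:
        raise ValueError("unknown hit_type: " + hit_type)
    return hit_type


print(resolve_hit_type({}))
print(resolve_hit_type({"hit_type": "innermost_focus"}))
```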

Options

# display options
option = apply_focus
option = break_after_raster
option = quiet
option = single_raster_units

You can have these options:

Option               Meaning
apply_focus          If set, then those data units which had the "focus" keyword in the original query will be surrounded by {braces} in the output.
break_after_raster   If set, then a newline is emitted after each raster-line. If not set, then the raster-lines are run together.
quiet                If set, then only results will be printed; nothing else. If not set, then things like progress and number of solutions will be printed. If an error occurs, then that will be printed regardless of the status of this option.
single_raster_units  If set, then each raster unit will only ever be printed once. This affects the number of solutions printed: If two solutions each contain the same raster unit, then only one of the solutions will be printed.

Display options

input_area_font_name  = "Arial Unicode MS"
input_area_font_size  = 11  # in points
output_area_font_name_1 = "SPIonic"
output_area_font_name_2 = "Courier New"
output_area_font_name_3 = "Times New Roman"
output_area_magnification  = 100  # in percent (%)

You can set the default font name and font size (in points) for the input area.

You cannot set the font size in points for the output area. Instead, you can set it to a percentage of 12 point. For example, setting output_area_magnification to 150 will select a font size of 18 points, and setting it to 200 will select a font size of 24 points.
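
The arithmetic is simply a percentage of the 12-point base size. A small sketch, with a hypothetical function name:

```python
BASE_POINT_SIZE = 12  # the base size that the percentage is applied to


def output_font_size(magnification_percent):
    """Return the output-area font size in points."""
    return BASE_POINT_SIZE * magnification_percent / 100


for percent in (100, 150, 200):
    print(percent, output_font_size(percent))
```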

