To populate an NLP dictionary you first create an item, then associate one or more terms with that item. Commonly a dictionary consists of multiple items, with each item associated with multiple terms. An item is a word or phrase that is a relevant tag for many entities in the source texts. When an entity in the source texts is determined to be a match, it is tagged with the item. For example, the item “ship” is a relevant tag for “ship”, “boat”, “sail”, “oars”, and so forth.
To perform this matching, you populate each item in the dictionary with match terms. A term can be a single entity (like “motor boat”) or a phrase or sentence (like “boats are rowed with oars or paddles”). NLP indexes each term in the dictionary using the same language model used for the source texts. NLP then matches each term against the same type of content unit in the source texts (a Concept term is matched against a Concept in source text; a CRC term is matched against a CRC in source text). If NLP identifies a match between a term and a unit of source text, NLP tags the source text passage with the associated dictionary item. A match is frequently not exact; in those cases NLP uses a scoring algorithm to determine whether the term and the source text are close enough to be tagged as a match.
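The idea of scored, non-exact matching can be illustrated with a toy sketch. This is not NLP's scoring algorithm, which is not described here; the word-overlap score, the `MATCH_THRESHOLD` value, and the function names are all assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical cutoff; the real scoring rules are not given in this text.
MATCH_THRESHOLD = 0.5


def match_score(term: str, passage: str) -> float:
    """Return the fraction of the term's words that also occur in the passage."""
    term_words = term.lower().split()
    passage_words = set(passage.lower().split())
    if not term_words:
        return 0.0
    found = sum(1 for word in term_words if word in passage_words)
    return found / len(term_words)


def tag_passage(passage: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Return every item whose best-scoring term reaches the threshold."""
    tags = []
    for item, terms in dictionary.items():
        best = max((match_score(t, passage) for t in terms), default=0.0)
        if best >= MATCH_THRESHOLD:
            tags.append(item)
    return tags


# One item ("ship") populated with several match terms.
dictionary = {"ship": ["ship", "boat", "sail", "oars", "motor boat"]}
print(tag_passage("The motor boat left the harbor", dictionary))
```

In this sketch the term “motor boat” scores 0.5 against the passage “a rowing boat”, which is a partial match that just reaches the assumed threshold.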
Terminology
A Dictionary is a logical grouping of related terms. A dictionary could, for example, be Cities, ICD10 codes, or French wines. Because the dictionary is the level of aggregation used within the matching APIs, the level of real-world grouping that should correspond to a dictionary depends on your use case. Grouping at a higher level (such as "all ICD10 codes") yields better performance and uses less disk space, but grouping at a lower level (such as "a separate dictionary for each ICD10 category") might offer grouped results with greater granularity. Each dictionary has a name and a description.
A Dictionary Item is a uniquely identifiable item in your dictionary. Examples of dictionary items are individual cities, the individual codes in ICD10, or individual chateaux. Each dictionary typically has many dictionary items (having many small dictionaries with only a few items each can decrease performance). A dictionary item has a URI, which should be unique within the domain and can be used as an external identifier, and an optional description. This URI can be used when building rules to interpret matching results later on.
A Dictionary Term is a string that could appear somewhere in a text and that represents the Dictionary Item it belongs to. For example, "Antwerp", "Anvers" and "Antwerpen" could be different terms associated with the same dictionary item representing the city of Antwerp. Dictionary terms are the free text strings on which the actual matching is based when doing string-based matching. They can be different spellings, translations, or synonyms of what your Dictionary Item stands for. These strings are passed through the engine. A term that contains more than a single entity is automatically transformed into a more complex structure so that it can match across the boundaries of a single concept (CRC or Path). A dictionary term should also have a language associated with it if it needs to be processed by the engine.
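The relationship between dictionaries, items, and terms described above can be pictured as a simple containment hierarchy. The following is a minimal sketch of that data model only, not the NLP API: the class names, the `add_item` helper, and the URI string `":city:antwerp"` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DictionaryTerm:
    text: str                       # free-text string to match, e.g. "Anvers"
    language: Optional[str] = None  # needed if the engine must process the term


@dataclass
class DictionaryItem:
    uri: str                        # should be unique within the domain
    description: str = ""           # optional
    terms: List[DictionaryTerm] = field(default_factory=list)


@dataclass
class Dictionary:
    name: str
    description: str = ""
    items: Dict[str, DictionaryItem] = field(default_factory=dict)

    def add_item(self, uri: str, description: str = "") -> DictionaryItem:
        """Create an item, rejecting a URI that is already in use."""
        if uri in self.items:
            raise ValueError(f"duplicate item URI: {uri}")
        item = DictionaryItem(uri, description)
        self.items[uri] = item
        return item


# First create the item, then associate terms with it.
cities = Dictionary("Cities", "European cities")
antwerp = cities.add_item(":city:antwerp", "City of Antwerp")  # hypothetical URI
for text, lang in [("Antwerp", "en"), ("Anvers", "fr"), ("Antwerpen", "nl")]:
    antwerp.terms.append(DictionaryTerm(text, lang))

print([t.text for t in cities.items[":city:antwerp"].terms])
```

Keying the items by URI mirrors the requirement that an item's URI be unique, so a second `add_item` call with the same URI fails rather than silently replacing the first item.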
When a new dictionary term is processed by passing it through the NLP engine, one or more Dictionary Elements are generated to represent the different entities identified within the term. For example, a dictionary term "failure of the liver" would be translated into the three elements "failure", "of" and "liver", with "the" being discarded as non-relevant. These elements are generated and managed automatically and appear only in some types of output, so you rarely need to work with them directly.
If you want to identify specifically formatted dates, numbers, or other formatted strings, you can use Dictionary Formats to specify them. A format can then be included in a Dictionary Term, either representing the complete term or just a single element within a more complex one. A format is a meaningful pattern of characters, such as a date format. For example, you could associate the formats “nn/nn/nnnn” and “nnnn-nn-nn” with the item named Date. NLP then tags any occurrence of these formats in the source texts with the Date item.
Note:
NLP provides semantic attributes that flag many common representations of date, time, duration, and measurement. Check the availability and specificity of these attributes in your national language before defining Dictionary Formats.
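To make the idea of a format pattern concrete, the two Date formats mentioned above can be approximated with regular expressions. This is only a sketch: it assumes that each "n" in a format stands for exactly one digit and that every other character is a literal, and the helper names `format_to_regex` and `find_dates` are hypothetical. It does not show how NLP defines or evaluates Dictionary Formats.

```python
import re


def format_to_regex(fmt: str):
    """Compile a format such as "nn/nn/nnnn" into a regular expression.

    Assumption: "n" means one digit; all other characters are literals.
    """
    parts = [r"\d" if ch == "n" else re.escape(ch) for ch in fmt]
    # The lookarounds stop a match from starting or ending inside a longer number.
    return re.compile(r"(?<!\d)" + "".join(parts) + r"(?!\d)")


DATE_FORMATS = ["nn/nn/nnnn", "nnnn-nn-nn"]
DATE_PATTERNS = [format_to_regex(fmt) for fmt in DATE_FORMATS]


def find_dates(text: str):
    """Return every substring of text that fits one of the Date formats."""
    hits = []
    for pattern in DATE_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits


print(find_dates("Shipped 03/15/2021, received 2021-03-18."))
# prints ['03/15/2021', '2021-03-18']
```

A pattern like this checks only the shape of the string, so “99/99/9999” would also be accepted; that is one reason to check the built-in semantic attributes first.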