Domain Architect

InterSystems IRIS® data platform provides the Domain Architect as an interactive interface for creating and populating NLP domains and performing analysis on the indexed data. Domain Architect is accessed using the InterSystems IRIS Management Portal.

It consists of three tools:

  • Domain Architect: for creating an NLP domain and populating it with source text data.

  • Domain Explorer: for analyzing the data in an NLP domain by looking at specific entities.

  • Indexing Results: for displaying how NLP analyzed the text data in a source, using highlighting to show different types of entities.

All functionality provided through the Domain Architect is also available by using ObjectScript to invoke NLP class methods and properties.

Accessing Architect

The starting point for accessing the Domain Architect is the Management Portal Analytics option. From there you select the Text Analytics option. This displays the Domain Architect option.

All NLP domains exist within a specific namespace. Therefore, you must specify which namespace you wish to use by selecting the Switch option at the top of any Management Portal interface page. This displays the list of available namespaces, from which you can make your selection.

A namespace must be enabled for NLP before it can be used. Selecting an enabled namespace displays the NLP Domain Architect option.

Note:

If selecting an enabled namespace does not display the Domain Architect option, you do not have a valid license for NLP. Look at Licensed to in the Management Portal header. Review or activate your license key.

Enabling a Namespace

A namespace must be enabled for NLP before it can be used with Domain Architect.

  • If no namespaces are enabled, the Text Analytics option does not display any options.

  • If the current namespace is not enabled, the Analytics option displays a list of analytics-enabled namespaces. Select one of these displayed namespaces.

To enable a namespace for NLP from the Management Portal, select System Administration, Security, Applications, Web Applications. This displays a list of web applications; the third column indicates if a listed item is a namespace (“Yes”) or not. Select the desired namespace name from the list. This displays the Edit Web Application page. In the Enable section of the page select the Analytics check box. Click the Save button.

You cannot enable the %SYS namespace. This is because you cannot create NLP domains in the %SYS namespace.

You can set your Management Portal default namespace. From the Management Portal select System Administration, Security, Users. Select the name of the desired user. This allows you to edit the user definition. From the General tab, select a Startup Namespace from the drop-down list. Click Save.

Creating a Domain

From the Domain Architect press the New button to define a domain. You specify the following domain values (in the specified order):

  • Domain name: The name you assign to a domain must be unique for the current namespace (not just unique within its package class). A domain name may be of any length and contain any typeable characters, including spaces (the % character is valid, but should be avoided). Domain names are not case-sensitive. However, because Domain Architect uses the domain name to generate a default domain definition class name, it is recommended that you follow class naming conventions when naming a domain, unless there are compelling reasons to do otherwise.

  • Definition class name: the domain definition package name and class name, separated by a period. If you first specified the domain name, clicking on the Definition class name generates default names for the domain definition package and class. The package name defaults to User. The class name defaults to the domain name, stripped of non-alphanumeric characters. You can accept or modify this default.

    The package name and the class name can contain only alphanumeric characters, and are case-sensitive. Specifying a package name that differs from an existing package name only in lettercase results in an error. Within a package, specifying a class name that differs from an existing class name only in lettercase results in an error.

  • Allow Custom Updates: optionally select this box if you wish to enable adding data or dictionaries to this domain manually; the default is to not allow custom updates.
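The default naming rule described above can be illustrated with a short Python sketch. This is not the Domain Architect implementation; the function name default_definition_class is hypothetical, and restricting the result to ASCII letters and digits is an assumption about what "alphanumeric" means here.

```python
def default_definition_class(domain_name, package="User"):
    """Sketch of the default 'Package.Class' name derived from a domain name."""
    # Keep only alphanumeric characters, as the text describes for the
    # default class name (ASCII-only is an assumption of this sketch).
    class_name = "".join(
        ch for ch in domain_name if ch.isascii() and ch.isalnum()
    )
    return f"{package}.{class_name}"


if __name__ == "__main__":
    print(default_definition_class("My Sales Domain"))
```

Under these assumptions, a domain named "My Sales Domain" would be offered the default definition class name User.MySalesDomain, which you could then accept or modify.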

Click the Finish button to create the domain. This displays the Model Elements selection screen.

You must Save and Compile a newly created domain before exiting that domain.

If you attempt to create a duplicate domain name, the Domain Architect issues a “Domain name already in use” error.

For other ways to create a domain, refer to NLP Domains. Note that Domain Architect is the only domain creation interface that allows you to define a domain definition package name and class name.

Opening a Domain

Creating a domain using the Management Portal interface immediately opens the domain, so you can begin managing the new domain right away.

To manage an existing domain, click the Open button to list all existing domains in the namespace. This display lists the packages that contain domains. Select a package to display its domains. Select an existing domain. This displays the Model Elements selection screen.

Changing the Domain Name and Check Boxes

Creating or opening a domain displays the Model Elements window. If you click on the domain name in this window, the Details tab displays the Domain Name field, the Domain Tables Package field, and the Allow Custom Updates and Disabled check boxes. You can modify these characteristics of the domain. Changing the Domain Name does not change the Definition class name.

Checking the Allow Custom Updates check box allows the manual loading of data sources and dictionaries into this domain using interfaces other than Domain Architect.

Checking the Disabled check box prevents the loading of all data (source data, metadata, dictionary matching data) during the Build operation. Each of these types of data also has its own Disabled check box that allows you to disable loading of each type of data separately.

You must Save and Compile a renamed domain before exiting that domain.

Deleting a Domain

To delete the current domain, click the Delete button. This displays the Drop domain data window. You can either delete just the domain contents or delete the domain definition. Click Drop domain & definition class to delete the domain and its associated class definition, including the specifications of data sources, blacklists, and other model elements.

Model Elements

After creating a domain, or opening an existing domain, you can define model elements for the domain. To add or modify model elements, click on the expansion triangle next to one of the headings. Initially, no expansion occurs. Once you have defined some model elements, clicking the expansion triangle shows the model elements you have defined.

To add a model element, click the heading. Then click the Add button shown in the Details tab on the right side. Specify the name and values. The model element is automatically generated when you leave the Details area. Model elements are listed in the order of their creation, with the most-recently-created element at the top of the list; modifying a model element does not change its position in the list.

To modify a model element, expand the heading, then click a defined model element. The current values are shown in the Details tab on the right side. Modify the name and/or values as desired. The model element is automatically re-generated when you leave the Details area.

Once you have created model elements, clicking on the Expand All button (or one of the expansion triangles) displays these defined values. The Element Type column shows the type of each model element. Clicking on the red “X” deletes that model element.

The Save button saves all changes. The Domain Architect page heading is followed by an asterisk (*) if there are unsaved changes. Click Save to save your changes.

The Undo button reverses the most recent unsaved change. You can click Undo repeatedly to reverse unsaved changes in the reverse order that they were made. Once changes are saved, this button disappears.

The following Model Elements are provided:

Domain Settings

This model element allows you to modify the characteristics of the domain. All Domain Settings are optional and take default values. Domain Settings provides the following options:

  • Languages: select one or more languages that you wish NLP to identify in the text data. If you check more than one language, automatic language identification is activated. This increases the processing required for texts. Therefore, you should not select multiple languages unless there is a real likelihood that texts in the selected language will be part of the data set. The default language is English.

  • Add Parameter: this button allows you to specify a domain parameter value. You can only add a domain parameter to an empty domain; this means that you must add all desired domain parameters before you Build the domain with Data Locations specified. Otherwise, the Compile operation that adds, modifies, or deletes domain parameters fails with an error message. In that case, you can use the Delete button to drop the domain contents, which then allows you to add, modify, or delete domain parameters.

    To add a parameter, specify the domain parameter name and the new value. Domain parameter names are case-sensitive. You can use either name form. For example, Name=SortField, Value=1 or Name=$$$IKPSORTFIELD, Value=1. No validation is performed. All unspecified domain parameters take their default values. To view the parameters that you have added, expand the Domain Settings heading.

  • Maximum Concept Length: the largest number of words that should be indexed as a concept. This option is provided to prevent a long sequence of words from being indexed as a concept. The default (0) uses the language-specific default for the maximum number of words. This default should be used unless there are compelling reasons to modify it.

  • Manage User Dictionary: this button displays a “Manage User Dictionary” box that allows you to add one or more strings to the user dictionary. Each entry specifies either a string that is rewritten to a new string, or a string to which you assign an attribute label from a drop-down list.

Metadata Fields

Add Metadata: this button allows you to specify a source metadata field. For each metadata field you specify the field name, the data type (String, Number, or Date), the supported operators, and the storage type. After creating a domain, you can optionally specify one or more metadata fields that you can use as criteria for filtering sources. A metadata field is data associated with an NLP data source that is not itself NLP indexed data. For example, the date and time that a text source was loaded is a metadata field for that source. Metadata fields must be defined before loading text data sources into a domain.

Case Sensitive check box: By default, a metadata field is not case-sensitive; you can select this check box to make it case-sensitive.

Disabled check box: You can select the Disabled check box to disable all metadata fields, or you can select the Disabled check box displayed with an individual metadata field to disable just that metadata field. A disabled field is not loaded during the Build operation.

The metadata fields that you specify here appear in the Data Locations Add data from table and Add data from query details under the title “Metadata mappings”.

Data Locations

Specifies the source for adding data. Options are Add data from table, Add data from query, Add data from files, Add RSS data, and Add data from global.

  • The Drop existing data before build check box allows you to specify whether source text data already indexed in this NLP domain should be deleted before adding the source text data specified here. To use this check box to drop data, data loading must not be disabled. To drop existing data without loading new data, use the Delete button Drop domain contents only option.

  • The Disabled check box allows you to disable source indexing; disabled source data is not loaded during the Build operation. If data loading is disabled, the Drop existing data before build check box is ignored.

    A Build operation for a large number of texts may take some time. If you have already loaded the data locations and wish to add or modify metadata or a matching dictionary you can click the Data Locations Disabled check box to index these model elements without reloading the data locations.

After specifying data locations, you must Save and Compile the domain, then select the Build button to build the data indices.

Add Data from Table

This option allows you to specify data stored in an existing SQL table in the current namespace. It provides the following fields:

  • Name: you can either specify a name or take the default name for the extracted result set table. Follows SQL table naming conventions. The default name is Table_1 (with the integer incrementing for each additional extracted result set table you define).

  • Batch Mode: a check box indicating whether or not to load source text data in batch mode.

  • Schema: from this drop-down list select an existing schema in the current namespace.

  • Table Name: from this drop-down list select an existing table in the selected schema.

  • ID Field: from this drop-down list select a field from the selected table to serve as the ID field (primary record identifier). An ID field must contain unique, non-null values.

    Selecting –custom– from the drop-down list allows you to input a field name; for example, a hidden RowId field or a field that does not (yet) exist. Field names are not case-sensitive. Selecting –custom– also displays the Show Default Options button. This button selects the first non-hidden field in the table from the drop-down list and also allows you to return to the drop-down list of fields.

  • Group Field: an SQL select-item expression that retrieves a secondary record identifier from the selected table. This field defaults to the initial ID Field selection.

    Selecting –custom– from the drop-down list allows you to input a field name; for example, a hidden RowId field or a field that does not (yet) exist. Field names are not case-sensitive. Selecting –custom– also displays the Show Default Options button. This button selects the first non-hidden field in the table from the drop-down list and also allows you to return to the drop-down list of fields.

  • Data Field: from this drop-down list select a field from the selected table to serve as the data field. The data field contains the text data loaded for NLP indexing. You can specify a field of data type %String or %Stream.GlobalCharacter (character stream data).

    Selecting –custom– from the drop-down list allows you to input a field name; for example, a hidden RowId field or a field that does not (yet) exist. Field names are not case-sensitive. Selecting –custom– also displays the Show Default Options button. This button selects the first non-hidden field in the table from the drop-down list and also allows you to return to the drop-down list of fields.

  • Where Clause: you can optionally specify an SQL WHERE clause to limit which records are included in the result set table. Do not include the WHERE keyword.
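One way to picture how these fields fit together is as the parts of a single SELECT statement. The Python sketch below is only a conceptual illustration: build_select and the schema, table, and field names are hypothetical, and the text does not state that Domain Architect assembles its query in exactly this way.

```python
def build_select(schema, table, id_field, group_field, data_field,
                 where_clause=""):
    """Compose an illustrative SELECT from the Add data from table fields."""
    sql = f"SELECT {id_field}, {group_field}, {data_field} FROM {schema}.{table}"
    clause = where_clause.strip()
    if clause:
        # The Where Clause field holds only the condition, not the keyword.
        if clause.upper().startswith("WHERE "):
            raise ValueError("omit the WHERE keyword from the Where Clause field")
        sql += f" WHERE {clause}"
    return sql


if __name__ == "__main__":
    # Hypothetical table and fields, for illustration only.
    print(build_select("Sample", "Report", "ID", "ReportYear",
                       "Narrative", "ReportYear > 2000"))
```

The point of the sketch is the Where Clause handling: the field supplies a bare condition such as ReportYear > 2000, and the WHERE keyword itself is left out.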

If you have defined one or more Metadata Fields for this domain, the Metadata mapping option allows you to specify a metadata field for this table. From the drop-down list you can select a field from the selected table, select – not mapped –, or select – custom –. If you select – custom – the Architect displays an empty field in which you can specify the custom mapping.

If you have not defined any Metadata Fields for this domain, the Metadata mapping option provides a Declare Metadata button that directs you to the Add Metadata domain option.

Add Data from Query

Add data from query is similar to Add data from table, but allows you to specify a fully-formed SQL query for an existing table (or tables). It provides the following fields:

  • Name: you can either specify a name or take the default name for the extracted result set table. Follows SQL table naming conventions. The default name is Query_1 (with the integer incrementing for each additional extracted result set table you define).

  • Batch Mode: a check box indicating whether or not to load source text data in batch mode.

  • SQL: the query text, an InterSystems SQL SELECT statement. Defining a query allows you to select fields from more than one table by using JOIN syntax. When specifying more than one table, assign column aliases to selected fields. Defining a query also allows you to specify an expression field that you can use as the Group field.

    The following field selection drop-down lists display the selected fields. They do not display table alias prefixes. If the field has a column alias, this alias is listed rather than the field name.

  • ID Field: from this drop-down list select a field from the selected table to serve as the ID field. An ID field must contain unique, non-null values.

  • Group Field: from this drop-down list select a select-item expression (such as an SQL function expression) from the query to serve as a secondary record identifier (group field). For example, YEAR(EventDate).

  • Data Field: from this drop-down list select a field from the selected table to serve as the data field. The data field contains the text data loaded for NLP indexing.

If you have defined one or more Metadata Fields for this domain, the Metadata mapping option allows you to select either – not mapped – or – custom – for each defined metadata field. The default is – not mapped –. If you select – custom – the Architect displays an empty field in which you can specify the custom mapping.

If you have not defined any Metadata Fields for this domain, the Metadata mapping option provides a Declare Metadata button that directs you to the Add Metadata domain option.

The Model Elements window Element Type column displays a truncated form of the query you defined; the query is truncated after the first table name in the FROM clause. The full query is shown in the Details window.

Add Data from File

This option allows you to specify data stored in files. It provides the following fields:

  • Name: you can either specify a name or take the default name for the extracted data file. The default name is File_1 (with the integer incrementing for each additional extracted data file you define).

  • Path: the complete directory path to the directory containing the desired files. The Path syntax is filesystem dependent; on a Windows system it might look like the following: C:\temp\NLPSources\

  • Extensions: the file extension, such as txt or xml. Do not include the dot prefix when specifying the file extension. Specify multiple extensions as a comma-separated list with no dots and no spaces; for example, txt,xml. If specified, only files with the specified extensions are included in the resulting extracted data. If the Extensions field is left blank (the default) all files are included, regardless of their extensions.

  • Filter Condition: a condition used to restrict which files are to be included in the resulting extracted data.

  • Recursive: a check box indicating whether to select files recursively. When checked, data can be extracted from the files in the specified directory and files in all of its subdirectories, and their sub-subdirectories, etc. When not checked, data can be extracted only from files in the specified directory. The default is non-recursive (check box not checked).

  • Batch Mode: a check box indicating whether or not to load source text data in batch mode.

  • Encoding: a drop-down list of the types of character set encoding to use to process the files.
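The interaction of the Path, Extensions, and Recursive fields can be sketched in Python. This is an illustration of the selection rules described above, not the product's loader; select_files and parse_extensions are hypothetical names, and case-sensitive extension matching is an assumption.

```python
from pathlib import Path


def parse_extensions(spec):
    """Split an Extensions value such as 'txt,xml' (no dots, no spaces)."""
    return {ext for ext in spec.split(",") if ext}


def select_files(path, extensions="", recursive=False):
    """Return the files under path that the fields above would select."""
    root = Path(path)
    # Recursive descends into all subdirectories; otherwise only the
    # specified directory itself is examined.
    candidates = root.rglob("*") if recursive else root.glob("*")
    wanted = parse_extensions(extensions)
    selected = []
    for p in candidates:
        if not p.is_file():
            continue
        # A blank Extensions value means every file is included.
        if not wanted or p.suffix.lstrip(".") in wanted:
            selected.append(p)
    return sorted(selected)
```

For example, select_files(some_dir, "txt,xml") would return only the .txt and .xml files directly inside some_dir, while adding recursive=True would also pick up matching files in its subdirectories.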

Add RSS Data

This option allows you to specify data from an RSS stream feed. It provides the following fields:

  • Name: you can either specify a name or take the default name for the extracted data. The default name is RSS_1 (with the integer incrementing for each additional RSS source you define).

  • Batch Mode: a check box indicating whether or not to load source text data in batch mode.

  • Server Name: the name of the host server on which the URL is found.

  • URL: the navigation path within the server address to the actual RSS feed.

  • Text Elements: a comma-separated list of text elements to load from the RSS feed; for example, title,description. Leave blank for defaults.

Add Data from Global

This option allows you to specify data from an InterSystems IRIS global. It provides the following fields:

  • Name: you can either specify a name or take the default name for the extracted data. The default name is Global_1 (with the integer incrementing for each additional global source you define).

  • Batch Mode: a check box indicating whether or not to load source text data in batch mode.

  • Global Reference: the global from which you wish to extract the source data.

  • Begin Subscript: the first global subscript in a range of subscripts to include.

  • End Subscript: the last global subscript in a range of subscripts to include.

  • Filter Condition: a condition used to restrict which data is to be included in the resulting extracted data.

Blacklists

Define blacklists: After creating a domain, you can optionally create one or more blacklists for that domain. A blacklist is a list of terms (words or phrases) that you do not want a query to return. Thus a blacklist allows you to perform NLP operations that ignore specific terms in data sources loaded in the domain.

  • Name: specify the name of a new blacklist, or take the default name. Blacklist names are not case-sensitive. Specifying a duplicate blacklist name results in a compile error. The default name is Blacklist_1 (with the integer incrementing for each additional blacklist you define).

  • Entries: specify terms to include in the blacklist, one term per line. Terms should be in lower case. Duplicate terms are permitted. You can copy/paste terms from one blacklist to another. You can include blank lines to separate groups of terms. A line return at the end of your list of terms is optional; blank lines are not counted as entries.
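The Entries format can be illustrated with a small Python sketch that turns the text of an Entries box into a list of terms. It is only a model of the rules above: parse_blacklist_entries is a hypothetical helper, and trimming surrounding whitespace and lowercasing each term are assumptions made for convenience, not documented behavior.

```python
def parse_blacklist_entries(text):
    """Turn Entries text (one term per line) into a list of terms."""
    terms = []
    for line in text.splitlines():
        term = line.strip()  # assumption: surrounding whitespace is ignored
        if not term:
            # Blank lines only separate groups; they are not entries.
            continue
        # Terms should be in lower case; this sketch normalizes them.
        terms.append(term.lower())
    return terms


if __name__ == "__main__":
    entries = "flight\naircraft\n\nnew york\n"
    print(parse_blacklist_entries(entries))
```

Duplicate terms are kept as typed, since the text permits them; the sketch makes no claim about how the compiled blacklist treats duplicates.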

If you add, modify, or delete a blacklist, you must Save and Compile the domain for this change to take effect.

Because defining blacklists has no effect on how data is loaded into a domain, changes to blacklists do not require re-building the domain.

Blacklists are compiled, then supplied to the Domain Explorer, which allows you to specify none, one, or multiple blacklists when performing analysis of source text data loaded into the domain. A blacklist is applied to some (but not all) Domain Explorer analytics.

Matching

The Matching option provides the Add Dictionary option to define a dictionary and specify its items and terms.

The Matching option provides four check box options, as follows:

  • Disabled: You can select the Disabled check box to disable building of all dictionaries, or you can select the Disabled check box displayed with an individual dictionary to disable the building of that dictionary. Selecting Disabled check boxes allows you to build only those dictionaries that you have changed. The default is off.

  • DropBeforeBuild: default on

  • AutoExecute: default on

  • IgnoreDictionaryErrors: default on

Add Dictionary

The Add Dictionary button displays the dictionary definition options: dictionary name (with a supplied default), an optional description, the dictionary language selected from a drop-down list of NLP supported languages, and the disabled check box. The default name is Dictionary_1 (with the integer incrementing for each additional dictionary you define).

The Add Item button displays the item definition options: item name (with a supplied default), a uri name (with a supplied default), the item language selected from a drop-down list of NLP supported languages, and the disabled check box. To define more items, select the dictionary name. Items are listed in order of creation, with the most recent at the top of the list. Within each item you can define one or more terms. The default name is Item_1, the default uri name is uri:1 (with the integer incrementing for each additional item you define for this dictionary).

The Add Term button displays the term definition options: a string specifying the term, the term language selected from a drop-down list of NLP supported languages, and the disabled check box. To define more terms, select the item name. Terms are listed in order of creation, with the most recent at the top of the list.

Save, Compile, and Build

You must save, compile, and build a domain (in that order) using the buttons provided. You must save and compile a domain after adding, modifying, or deleting any Model Elements.

The Save button saves the current domain definition. Architect greys out (disables) the Save button if no domain definition is open. Architect does not issue an error if you save a domain definition without changing it.

The Compile button compiles the current domain definition. It compiles all of the classes and routines that comprise the domain definition. If you have not saved changes that you made to the domain definition, the compile operation prompts you to save the domain definition before compiling.

The Build button loads the specified sources into the current domain. If you have made changes to the Data Locations, Metadata Fields, or Matching dictionaries, you must build the domain. The Build Domain window displays progress messages such as the following:

13:50:48: Loading data...
13:51:49: Finished loading 3 sources
13:51:49: Creating dictionaries and profiles...
13:51:49: Finished creating 1 dictionaries, 1 items, 3 terms and 0 formats
13:51:49: Matching sources...
13:51:50: Finished matching sources
13:51:50: Successfully built domain 'mydomain'

The build operation can be time-consuming. If a Disabled check box is checked for a model element, the Build operation does not load the corresponding sources. Selecting Disabled check boxes allows you to build only those model elements that you have changed.

Domain Explorer

There are two ways to access the Domain Explorer:

  • From the Management Portal Analytics option select the Text Analytics option. This displays the Domain Explorer option. When you select this option it prompts you to select an existing domain from a drop-down list.

  • From the Management Portal Analytics option select the Text Analytics option. Access the Domain Architect and create or access a domain. Once you have specified Data Locations and populated the domain with this data using the Build button, you can select Domain Explorer from the Tools tab. This displays the Domain Explorer as a separate browser tab with the current domain selected.

The Domain Explorer is a display interface with broad application. It shows a wealth of information about the source text data indexed in a domain. It initially displays a list of either the top (most-frequently-occurring) concepts, or the dominant (highest dominance) concepts. You can toggle between these two lists.

If you select an entity, the Domain Explorer provides analysis of similar entities and related concepts, and analysis of the appearance of the specified entity in larger text units (sources, paths, and CRCs). This provides a contextual at-a-glance view of what's in your data.

The Domain Explorer provides generic filters that support selecting subsets of the sources in a domain based on metadata criteria. This interface provides a sample of how NLP Smart Indexing can be used to quickly overview and navigate a large set of documents.

Domain Explorer Settings

By default, the Domain Explorer displays analysis of the domain that was current in Domain Architect or the domain you selected when you invoked the Domain Explorer.

To select another domain:

  1. Select the Gear icon at the upper right of the Domain Explorer. This displays the Settings box.

  2. The Settings box contains the Switch domain drop-down list. Select a domain from this list. By default, this list includes the domains defined in the current namespace. If you select the Include other namespaces check box, the drop-down list includes domains defined in all namespaces.

To apply blacklists:

  1. Select the Sunglasses icon at the upper right of the Domain Explorer. If the domain has no defined blacklists, this icon does not appear.

  2. The Blacklists box contains check boxes for each defined blacklist. Select one or more, then click the Apply button.

To use stemming:

  1. Select the Gear icon at the upper right of the Domain Explorer. This displays the Settings box.

  2. If the domain is configured for stemming, the Settings box also contains the Use stems instead of entities and Show representation form for stems check boxes. If Use stems instead of entities is checked, the Domain Explorer performs stemming analysis and changes the Domain Explorer headings as follows: Top Concepts/Dominant Concepts becomes Top Stems/Dominant Stems, Similar Entities becomes Similar Stems, Related Concepts disappears, leaving Proximity Profile, and the CRCs tab disappears. If Show representation form for stems is checked, each stem is displayed as a representative word; if not checked, the stem itself is displayed. Both boxes are checked by default.

The number at the top right of the Domain Explorer is the number of sources loaded in the selected domain that are available for data analysis. This number can be limited by applying filters.

Listing All Concepts

The Domain Explorer initially provides concept analysis of the data sources loaded in the domain. There are two ways to list concepts, by frequency or by dominance. You can toggle between these two by selecting the frequency or dominance button:

  • Top Concepts: selecting the frequency button lists all concepts in the sources in descending order of frequency. If multiple concepts have the same frequency, the concepts are listed in descending collation order. Each concept is listed with its frequency (total number of occurrences in all sources) and spread (number of sources containing that concept). To view frequency counts for a single source, use the Indexing Results tool.

  • Dominant Concepts: selecting the dominance button lists all concepts in the sources in descending order of dominance score. If multiple concepts have the same dominance score, the concepts are listed in descending collation order. The dominance score is calculated by taking the dominance values for each source and using an averaging algorithm to determine the dominance of a concept across all loaded sources. Dominance values in a single source are integer values, with the most dominant concept given a dominance of 1000. To view dominance values for a single source, use the Indexing Results tool.
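The difference between frequency and spread, and the Top Concepts ordering, can be shown with a toy Python sketch. The data and the top_concepts function are hypothetical, and plain string comparison stands in for the collation order; this is not how NLP computes its indices.

```python
from collections import Counter


def top_concepts(sources):
    """sources maps a source ID to the list of concept occurrences in it."""
    frequency = Counter()
    spread = Counter()
    for concepts in sources.values():
        frequency.update(concepts)     # every occurrence counts
        spread.update(set(concepts))   # each source counts once per concept
    rows = [(c, frequency[c], spread[c]) for c in frequency]
    # Descending frequency; ties fall back to descending string order,
    # used here as a stand-in for descending collation order.
    rows.sort(key=lambda r: (r[1], r[0]), reverse=True)
    return rows


if __name__ == "__main__":
    toy = {1: ["pilot", "runway", "pilot"], 2: ["runway", "engine"], 3: ["pilot"]}
    for concept, freq, spr in top_concepts(toy):
        print(concept, freq, spr)
```

In this toy data, "pilot" occurs three times but in only two sources, so its frequency is 3 and its spread is 2. The dominance score is not sketched because the text does not specify its averaging algorithm.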

Analyzing a Specified Entity

There are two ways to display analysis of a specific entity:

  • Select a concept from either the Top Concepts or Dominant Concepts listings.

  • In the entry field in the top left corner you can type the first few characters (minimum of 2, not case-sensitive) of a word found in an entity, and the Domain Explorer displays a drop-down list of all of the existing entities that contain a word beginning with those characters. Select an entity from this drop-down list, then press the Explore! button. You can use this option to display Relations or Concepts; both types of Entities are shown in the drop-down list.
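The lookup behavior of the entry field can be modeled with a few lines of Python. This is a sketch of the matching rule as described (at least two characters, not case-sensitive, matching the start of any word in an entity); suggest_entities and the sample entities are hypothetical.

```python
def suggest_entities(entities, typed, min_chars=2):
    """Return entities containing a word that begins with the typed text."""
    prefix = typed.strip().lower()
    if len(prefix) < min_chars:
        return []  # too few characters: no drop-down list yet
    return [
        entity
        for entity in entities
        if any(word.lower().startswith(prefix) for word in entity.split())
    ]


if __name__ == "__main__":
    sample = ["flight plan", "student pilot", "left engine", "pilot"]
    print(suggest_entities(sample, "PI"))
```

With the sample entities, typing "PI" would suggest both "student pilot" and "pilot", because the match is against any word in the entity rather than only its first word.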

Selecting an entity displays two kinds of analysis of that entity: associated entities and specified entity in context.

Associated Entities

Selecting an entity displays the following listings:

  • Similar Entities: a list of concepts and relations that are similar to the specified entity, with the frequency (total number of occurrences in all sources) and spread (number of sources containing that concept) of each concept or relation. The first similar entity listed is always the specified entity itself. For a concept, this first listed entity is the same as the Top Concepts listing for that concept.

  • Related Concepts: selecting the related button displays a list of concepts that are related to the specified concept, with the frequency (total number of occurrences in all sources) and spread (number of sources containing that concept) of each concept. A related concept is a concept that appears in a CRC with the specified concept.

  • Proximity Profile: selecting the proximity button displays the Proximity Profile table. This lists concepts associated by proximity to the specified concept, with a proximity score for each concept.
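
The definition of a related concept (a concept that appears in a CRC with the specified concept) can be sketched in Python. The tuple layout (first concept, relation, second concept), the use of None for a missing concept, and the function name related_concepts are assumptions for illustration only.

```python
def related_concepts(concept, crcs):
    """Return the concepts that share a CRC with the given concept.

    crcs: iterable of (first_concept, relation, second_concept) tuples.
    A CRC with only one concept (CR or RC) has None in the empty slot.
    """
    related = set()
    for first, _relation, second in crcs:
        if concept == first:
            other = second
        elif concept == second:
            other = first
        else:
            continue  # this CRC does not involve the specified concept
        if other is not None and other != concept:
            related.add(other)
    return sorted(related)

crcs = [
    ("pilot", "flew", "airplane"),
    ("airplane", "landed on", "runway"),
    (None, "according to", "pilot"),
    ("pilot", "reported", "engine failure"),
]
related = related_concepts("pilot", crcs)
```

In this sample, "runway" is not related to "pilot" because the two never appear in the same CRC.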

Selecting an entity from the Similar Entities, Related Concepts, or Proximity Profile listings changes all listings to analysis of that entity. It does not change the Top Concepts and Dominant Concepts listings.

Entity in Context

Selecting an entity also displays the following listings of that entity in context:

  • Sources: a list of source texts containing the specified entity (shown highlighted in green), along with the internal source ID (an integer) and external source ID. Sources are listed in descending order by internal source ID. The source text displays all sentences in the source that contain the entity. Intervening sentences that do not contain the entity are not displayed, but are indicated by an ellipsis (...). A leading ellipsis is not shown when the first displayed sentence is not the first sentence in the source. A trailing ellipsis is always shown after the final displayed sentence, even when that sentence is the last sentence in the source.

    Red text indicates negation, with the entities within the scope of the negation attribute in red letters. Negation scope is not necessarily the same as the corresponding path, sentence, or CRC.

    Selecting the Eye icon or clicking anywhere in the listing for a source displays the full text of the source. Each occurrence of the specified entity is highlighted and each negation scope text is shown in red letters in the full text. (The % option must be set to 100% to display all occurrences of the specified entity in this full text box.)

    Selecting the Arrow icon displays the Indexing Results tool.

  • Paths: a list of paths containing the specified entity. Paths are listed in descending order by ID. Note that because path IDs are assigned on a per-source basis, the same path text may be listed multiple times with different path IDs.

    The entities and attributes of the path are color coded and highlighted as described in the Indexing Results tool description of Indexed Sentences, with the addition of the Explore! entity appearing in yellow-orange.

    Selecting a path element changes all listings to analysis of that entity. It does not change the Top Concepts and Dominant Concepts listings.

    Selecting the Eye icon displays the full text of the source with the specified entity highlighted in green.

    Selecting the Arrow icon displays the Indexing Results tool.

  • CRCs: a list of Concept-Relation-Concept (CRC) sequences that contain the specified entity, with the frequency (total number of occurrences of that CRC in all sources) and spread (number of sources containing that CRC). Note that many CRCs contain only one concept: CR or RC. The entity type highlighting is the same as for Paths, except that Path-relevant Words are not part of CRCs and are therefore not displayed. Attributes are not highlighted in the CRCs listing.

    Selecting a CRC element changes all listings to analysis of that entity. It does not change the Top Concepts and Dominant Concepts listings.

    Selecting the Eye icon displays the Sources with selected CRCs box, listing each source that contains an instance of the CRC. The CRC is highlighted in green in the context of its sentence, and flagged with the Source ID of the source. A source ID listing can contain multiple sentences containing the specified CRC; intervening sentences that do not contain the CRC are indicated by ellipsis. From the Sources with selected CRCs box you can select the Eye icon for a source containing the CRC to display the full text of the source with the specified entity (not the CRC) highlighted in green.
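
The ellipsis rules described for the Sources listing (no leading ellipsis, an ellipsis for each gap of skipped sentences, and a trailing ellipsis in every case) can be sketched as follows. This is an illustration, not the Domain Explorer code: the function name source_excerpt is hypothetical, and a case-insensitive substring test stands in for the entity matching that NLP performs on indexed text.

```python
def source_excerpt(sentences, entity):
    """Build the excerpt shown for one source in a Sources-style listing."""
    needle = entity.lower()
    hits = [i for i, s in enumerate(sentences) if needle in s.lower()]
    if not hits:
        return ""
    parts = []
    previous = None
    for i in hits:
        # Mark a gap only between displayed sentences, never before the first.
        if previous is not None and i > previous + 1:
            parts.append("...")
        parts.append(sentences[i])
        previous = i
    parts.append("...")  # trailing ellipsis is always shown
    return " ".join(parts)

sentences = [
    "The weather was clear.",
    "The pilot took off at noon.",
    "Winds were light.",
    "The pilot landed safely.",
]
excerpt = source_excerpt(sentences, "pilot")
```

For the sample sentences, the excerpt is "The pilot took off at noon. ... The pilot landed safely. ...": the skipped first sentence produces no leading ellipsis, and the trailing ellipsis appears even though the last displayed sentence ends the source.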

Note:

If Japanese is the only language supported for the domain, the Domain Explorer display differs as follows: the Related Concepts and CRCs listings are not shown. An Entity Vectors listing is substituted for the Paths listing.

Full Text Box

The Eye icon displays the full text of a selected source. This text box is identified by the external ID of the source. For example, :SQL:1171:1171.

The source text is tagged as follows:

  • The specified entity is highlighted in green.

  • Red text indicates negation, with the entities within the scope of the negation attribute in red letters.

This full text box provides the following option buttons:

  • metadata: displays the metadata for the source. All sources are provided with a DateIndexed metadata field. This date stamp is represented as a UTC date and time in the Display format for your locale. It is truncated to whole seconds. To return to the source text, press the metadata button again.

  • highlight: performs no action.

  • indexing: displays the source text highlighted to indicate the types of entities, as follows:

    • Green: the specified entity (either a Concept or a Relation).

    • Blue: a Concept.

    • White: a Relation.

    • Light Blue: a Path-relevant Word.

    • Unmarked: a Non-relevant word.

    Negation scope text is displayed in red letters.

  • dictionaries: performs no action.

  • %: summarizes the source text. The default percentage is 100% (full text). Specifying an integer less than 100 and then pressing the % button summarizes the source text by reducing the text to (roughly) the specified size by eliminating sentences that have a low relevancy score when compared to the other sentences in the source. Summarization does not necessarily retain sentences that contain the specified entity.
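
The general idea behind the % option (keep the most relevant sentences, in their original order, until roughly the requested share of the text remains) can be sketched as below. How NLP computes relevancy scores is not described here, so the scores are supplied as hypothetical input, and measuring size by sentence count is an assumption. The function name summarize is also hypothetical.

```python
import math

def summarize(sentences, scores, percent):
    """Keep roughly percent% of the sentences, dropping the lowest scores.

    scores: one relevancy score per sentence (hypothetical values).
    percent: integer from 1 to 100; 100 returns the full text.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    keep = max(1, math.ceil(len(sentences) * percent / 100))
    # Rank sentence positions by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore original order so the summary still reads as running text.
    kept = sorted(ranked[:keep])
    return [sentences[i] for i in kept]

text = ["Intro.", "Key finding.", "Aside.", "Conclusion."]
scores = [0.2, 0.9, 0.1, 0.7]
summary = summarize(text, scores, 50)
```

Because selection depends only on the scores, a sentence containing the specified entity can be dropped, which matches the behavior noted for the % option.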

Limiting the Sources to Analyze

You can limit the scope of your data analysis by using filters. A filter includes or excludes data sources that are loaded in the domain from analysis. By default, the Domain Explorer analyzes all data sources loaded in the domain.

  • The Filter icon (funnel) button at the top right of the Domain Explorer applies a filter, which includes or excludes sources from analysis based on the criteria you specify. You can specify several types of filters, and can apply more than one filter. Multiple filters can be associated with AND, OR, NOT AND, or NOT OR logic.

    To add a filter, select the filter type from the drop-down list, specify the filter criteria, then select the add button, then the Apply button. When adding multiple filters, you select the AND/OR logic option associating the filters after the add button and before the Apply button.

    When one or more filters are in effect, the Filter icon displays in green.

    The number to the left of the Filter icon indicates the number of sources included after applying the filters. If no filters are applied, this number is the total number of sources in the domain.

  • To remove a single filter, select the Filter icon, then select the black X next to the filter description, then select the Apply button. To remove all filters, select the Filter icon, then the Clear button, then the Apply button.

    The following filter types are supported:

    • Metadata: used to exclude sources by their metadata values. By default, all sources have DateIndexed metadata. To apply DateIndexed metadata, select this field, select an operator, and select a date value by clicking on the calendar icon, then selecting the desired day.

    • Source IDs: used to select sources for inclusion by source ID. You can specify a single source ID or a comma-separated list of source IDs.

    • Source ID Range: used to select sources for inclusion by source ID. You can specify a range of source IDs by specifying the from and to range values. The range is inclusive of these values.

    • External IDs: used to select sources for inclusion by their external IDs. For example, :SQL:1171:1171. You can specify a single ID or a comma-separated list of IDs. External source IDs are listed in the Sources listing.

    • SQL: used to select sources for inclusion by specifying an SQL query.
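
The effect of combining filters can be sketched with Python sets of source IDs. This is only a model of the logic options, not the filter implementation: the function names are hypothetical, and treating NOT AND and NOT OR as the complement of the AND or OR result within the domain is an assumption.

```python
def source_id_range(all_ids, low, high):
    """Source ID Range filter: inclusive of both range values."""
    return {i for i in all_ids if low <= i <= high}

def combine(all_ids, logic, *filtered_sets):
    """Combine filter results using AND, OR, NOT AND, or NOT OR."""
    domain = set(all_ids)
    if logic in ("AND", "NOT AND"):
        result = domain.intersection(*filtered_sets)
    elif logic in ("OR", "NOT OR"):
        result = set().union(*filtered_sets) & domain
    else:
        raise ValueError("unsupported logic: " + logic)
    if logic.startswith("NOT"):
        # Assumption: negation selects every source the inner logic excludes.
        result = domain - result
    return result

all_ids = set(range(1, 11))                 # a domain with sources 1..10
by_range = source_id_range(all_ids, 3, 6)   # {3, 4, 5, 6}
by_list = {5, 6, 9}                         # a Source IDs filter
both = combine(all_ids, "AND", by_range, by_list)
either = combine(all_ids, "OR", by_range, by_list)
```

The size of the resulting set corresponds to the count shown to the left of the Filter icon; with no filters applied, it equals the total number of sources in the domain.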

Indexing Results

The Indexing Results tool enables you to view the NLP indexing of the contents of an individual data source. This displays three listings: Indexed sentences, Concepts, and CRCs. The Indexed sentences display includes both color-coded text that shows entity types (Concept, Relations, Non-relevants, Path-relevants) and color-coded highlighting that shows attributes and their scope.

You can access the Indexing Results tool from the InterSystems IRIS Management Portal by selecting Analytics, then Text Analytics. The Analytics options are not displayed unless you are in a namespace that has been enabled for Analytics. Select the desired namespace. This displays the Analytics tools. You can access the Indexing Results tool in either of two ways:

  • By selecting Text Analytics and then the Indexing Results option.

  • By selecting Text Analytics and then the Domain Architect option. In the Domain Architect you open an existing domain or define a new domain. Once you are in a compiled domain, you can use the Indexing Results button on the Tools tab to display how NLP has indexed the data. This displays the Indexing Results tool as a separate browser tab.

The Indexing Results tool enables you to display indexed results of either data in a specified domain or manual input data.

Note:

To display the Indexing Results options at the top right, you may have to scroll horizontally.

Domain Data

At the top right of the Indexing Results window is a drop-down list of defined domains. It defaults to the first defined domain. Select the desired domain.

Click the wide blank box across the top of the window to display a drop-down single-line listing of the contents of each indexed data source. Select one of these sources to display the indexing results for that source.

You can use the >> button to collapse (make disappear) the wide single-line source box. This enables you to view the indexing results without horizontal scrolling. You can use the << button to expand (make reappear) the wide single-line source box as a blank box that you can click to select another data source.

Manual Input Data

At the top right of the Indexing Results window select the manual input button to input text directly for NLP indexing results analysis. This opens the Real-time input box. Type or paste your input text in the blank box. Use the Configuration drop-down box to select an existing (or default) configuration, or select language —> and then use the second drop-down list to select a national language or Auto-detect.

Indexed Sentences

The sentences in the source are listed in order, one sentence per line. Entity types (Concept, Relations, Non-relevants, Path-relevants) and attributes are indicated by color-coding and highlighting.

At the top right of the Indexing Results window you can select the highlighting type, either light or full. The light option uses color-coding and underlining to indicate entity types and attributes; it is intended to be unobtrusive to allow for convenient reading of sentences. The full option displays boxes around each entity and uses thick lines for attributes to provide a clearer representation of the NLP indexed structures. The information content of both types of highlighting is the same. The default is full.

The sentence text is highlighted for entities as follows:

  • concept: blue, boxed

  • relation: light green, boxed

  • non-relevant: grey, not boxed

  • path-relevant: black, grey box

The sentence text is highlighted for attributes as follows:

  • A Negation attribute phrase has red text (with concepts in bold letters and relations in regular letters); the concepts and relations are further clarified in full highlighting, where the enclosing boxes are the entity type color: blue for concepts, light green for relations. The negation keywords are underlined in red; multi-word negation terms (such as “was not”) are shown with each word underlined in red.

  • A Time, Duration, or Frequency attribute phrase is underlined with an orange dotted line. Time attribute keywords are underlined in orange. Duration attribute keywords are underlined in bright green. Frequency attribute keywords are underlined in yellow.

  • A Measurement attribute is underlined with a magenta dotted line. The measurement keywords are underlined in magenta.

  • A Negative Sentiment attribute is underlined with a purple dotted line. The sentiment keywords are underlined in purple.

  • A Positive Sentiment attribute is underlined with a green dotted line. The sentiment keywords are underlined in green.

These conventions make it possible to highlight combinations of entities and attributes, for example, a Measurement attribute that is part of a Negation attribute phrase.

Concepts and CRCs

The Indexing Results tool displays two listings, one of all concepts in the source and one of all CRCs in the source:

  • Concepts in the source in descending order.

  • CRCs in the source highlighted (as above) to indicate concepts and relations, in descending order. Note that the CRCs listings do not include non-relevant or path-relevant words and do not indicate attributes.

At the top right of the Indexing Results window the sort by buttons allow you to toggle the Concepts and CRCs listings to display either frequency counts or dominance values in descending order.

In the Concepts listing, the most dominant concept(s) are given a dominance of 1000. Less dominant concepts are given smaller integer values, with larger sources tending to have lower least-dominant values. For example, a source containing 25 concepts might have a dominance range between 1000 and 83; a source containing 300 concepts might have a dominance range between 1000 and 2.

In the CRCs listing, the dominance score is arrived at by adding the dominance values of the concepts and relations.
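
As a worked illustration of that addition, the following Python sketch sums per-entity dominance values for a CRC. The dominance values, the tuple layout, and the function name crc_dominance are hypothetical; the sketch shows only the arithmetic, not how NLP derives the individual values.

```python
def crc_dominance(crc, dominance):
    """Sum the dominance values of the concepts and relation in a CRC.

    crc: (first_concept, relation, second_concept); a missing concept
    in a CR or RC sequence is represented by None.
    dominance: dict of per-entity dominance values for one source.
    """
    return sum(dominance[entity] for entity in crc if entity is not None)

# Hypothetical dominance values for a single source.
dominance = {"pilot": 1000, "airplane": 640, "flew": 210}

full_crc = crc_dominance(("pilot", "flew", "airplane"), dominance)
partial_crc = crc_dominance((None, "flew", "airplane"), dominance)
```

With these sample values, the full CRC scores 1000 + 210 + 640 = 1850, and the RC sequence scores 210 + 640 = 850.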

Note:

If Japanese is the only language supported for the domain, the Indexing Results display substitutes a single Entities listing for the Concepts and CRCs listings.