First Look: Text Analytics with InterSystems Products


This First Look guide introduces you to InterSystems IRIS™ support for Natural Language Processing (NLP) text analytics, which provides semantic analysis of unstructured text data in a variety of natural languages. This enables you to discover useful information about the contents of a large number of text documents without any prior knowledge of the contents of the texts.
It walks through the initial tasks associated with indexing text data for semantic text analysis. Once you’ve completed this exploration, you will have indexed a group of texts and analyzed them to determine the most common entities in those texts, metrics about those entities, and various kinds of associations between entities. You will also have viewed the appearances of an entity in the source texts. These activities are designed to use only the default settings and features, so that you can acquaint yourself with the fundamentals of NLP text analysis. For the full documentation on Text Analytics, see the InterSystems IRIS Natural Language Processing (NLP) Guide.
A related, but separate, tool for handling unstructured texts is InterSystems IRIS SQL Search. SQL Search allows you to search for these same entities, as well as single words, regular expressions, and other constructs in multiple texts. Inherently, a search solution presupposes that you know what you are looking for. NLP text analytics is designed to help you discover content and connections between content entities without necessarily knowing in advance what to look for.
Why NLP Text Analytics Is Important
Increasingly, organizations are amassing larger and larger quantities of unstructured text data, far in excess of their ability to read or catalog these texts. Frequently, an organization may have little or no idea what the contents of these text documents are. Conventional “top-down” text analysis based on pure search technologies makes assumptions about the contents of these texts, which may miss important content.
InterSystems IRIS Natural Language Processing (NLP) allows you to perform text analysis on these texts without any upfront knowledge of the subject matter. It does this by applying language-specific rules that identify semantic entities. Because these rules are specific to the language, not the content, NLP can provide insight into the contents of texts without the use of a dictionary or ontology.
How InterSystems IRIS Implements NLP Text Analytics
To prepare texts for NLP analytics you must load those texts into a domain, and then build the domain. Based on its analysis of the texts, NLP builds indices for the domain that NLP can use to rapidly analyze large quantities of text. Texts can be input from a variety of data locations, including SQL tables, text files, strings, globals, and RSS data.
NLP supports the following functionality:
Trying NLP Text Analytics for Yourself
It’s easy to use InterSystems IRIS Text Analytics. This simple procedure walks you through the basic steps of generating NLP metrics.
  1. Preliminaries
    You need to have an InterSystems IRIS instance that is up and running and has an active license key. (You can view the license key from the Management Portal: select System Administration > Licensing.)
    This documentation uses the Aviation.Event SQL table, which is available on GitHub at https://github.com/intersystems/Samples-Aviation. (You do not need to know anything about GitHub or have a GitHub account.) To install these samples, InterSystems recommends that you create a dedicated namespace called (for example) TESTSAMPLES and then load the samples into that namespace (or you can use an existing namespace; however, you cannot use the %SYS namespace). To create a namespace, use the Management Portal options System Administration > Configuration > System Configuration > Namespaces. For the general process of downloading from GitHub, see Downloading Samples for Use with InterSystems IRIS. After you download a sample, be sure to open the README file and follow the setup instructions.
  2. Enable the Namespace
    You must enable each namespace that you wish to use for NLP. To enable the TESTSAMPLES namespace for NLP, access the Management Portal from the InterSystems IRIS launcher. Select System Administration > Security > Applications > Web Applications. This displays a list of web applications. Select /csp/testsamples from the list. This displays the Edit Web Application page. In the Enable section of the page select the Analytics check box. Click the Save button.
  3. Create a Domain.
    All NLP analysis occurs within a domain. You associate multiple texts with a domain. You then build the domain, creating indices that are used by NLP queries.
    A domain is created within a namespace. You can create multiple domains within a namespace. You can associate a text with multiple domains.
    There are several ways to create, populate, and build a domain. The following example uses the Domain Architect interface.
  4. Add Data Locations.
    Within a domain you can define data locations and other model elements for the domain. To add or modify model elements, click on the expansion triangle next to one of the headings. Initially, no expansion occurs. Once you have defined some model elements, clicking the expansion triangle shows the model elements you have defined.
    Click the Data Locations triangle to display the Details tab on the right side of the screen. The Details tab shows five Add Data options. Select Add data from table.
    This option allows you to specify data stored in an SQL table. In this example we will specify the following fields:
    The Domain Architect page heading is followed by an asterisk (*) if there are unsaved changes to the current domain definition. Click Save to save your changes.
  5. Compile the Domain by pressing the Compile button.
    Then build the NLP indices for the data sources by pressing the Build button.
  6. Explore the data.
    Select the Tools tab on the right side of the screen. Select the Domain Explorer button.
    The Domain Explorer initially displays a list of the most significant concepts in the source texts:
    When you select one of these concepts the other Domain Explorer listings are displayed:
    Selecting a concept in any of these lists refreshes the listings based on that concept. Alternatively, you can type an entity (Concept or Relation) into the Domain Explorer Explore area and click the Explore! button.
    By using these listings, you can determine what concepts appear in the source documents, how significant they are, and what other concepts are associated with them.
    The lower portion of the Domain Explorer allows you to view how a selected concept appears in the source texts:
    By clicking the eye icon, you can display the complete text of the source, with the selected concept highlighted, and red text used to indicate negation.
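The kinds of questions the Domain Explorer answers (which concepts appear, how significant they are, and which other concepts accompany them) can be illustrated with a small, self-contained Python sketch. This is not how InterSystems IRIS computes its metrics. It assumes that concepts have already been extracted for each sentence, and the sample data, the function name `concept_metrics`, and the metric names `frequency`, `spread`, and `related` are all hypothetical, chosen only for this illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each source maps to a list of sentences, and each
# sentence is the list of concepts assumed to be already extracted from it.
sources = {
    "report-1": [["pilot", "engine failure"], ["pilot", "runway"]],
    "report-2": [["engine failure", "fuel"], ["runway"]],
}


def concept_metrics(sources):
    """Return (frequency, spread, related) counters for the given sources."""
    frequency = Counter()  # total occurrences of each concept
    spread = Counter()     # number of sources in which each concept appears
    related = Counter()    # concept pairs that share a sentence
    for sentences in sources.values():
        seen_in_source = set()
        for concepts in sentences:
            frequency.update(concepts)
            seen_in_source.update(concepts)
            # Count each unordered pair once per sentence.
            for pair in combinations(sorted(set(concepts)), 2):
                related[pair] += 1
        spread.update(seen_in_source)
    return frequency, spread, related


frequency, spread, related = concept_metrics(sources)
print(frequency.most_common(3))
print(spread["runway"], related[("engine failure", "pilot")])
```

In this toy model a concept can be frequent without being widespread: "pilot" occurs twice but in only one source, whereas "runway" occurs twice across two sources.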
  7. Add a Blacklist
    Often the list of top concepts begins with concepts that are too common or concepts that have little value in discovering useful information. These may be words or phrases that appear in all of the sources (such as “accident report” or “conclusions”), general concepts (such as “airplane” or “pilot”), or concepts not relevant to your use of the data (such as a list of cities). You can use a blacklist to prevent the display of these concepts. A blacklist only affects the display of concepts in certain query results; it has no effect on NLP indexing of concepts.
    1. In the Domain Architect click the Open button and select Samples >> then MyTest to open the existing domain Samples.MyTest.
    2. Click the Blacklists expansion triangle. This displays the Add blacklist button in the Details tab on the right side of the screen. Click Add blacklist to display the Name and Entries fields. Accept the default name for the blacklist (Blacklist_1). In the Entries box, list the entries (concepts), one concept per line; entries are not case-sensitive. In this example, list the following concepts: pilot, student pilot, co-pilot, passenger, instructor, flight instructor, certified flight instructor.
    3. Save and Compile the domain. (You do not need to Build the domain to add, modify, or remove blacklists.)
    4. In the Domain Explorer click the sunglasses icon in the upper right corner. This displays a list of the blacklists defined for this domain that you can apply. Select Blacklist_1. Note that the Top Concepts listing no longer lists the blacklist concepts.
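The behavior described in this step, where a blacklist changes what is displayed but leaves the index alone, can be sketched in a few lines of Python. This is only an illustration of the idea, not the InterSystems IRIS implementation. The `index` counts, the `top_concepts` function, and the `blacklist_1` entries are hypothetical stand-ins.

```python
from collections import Counter

# Hypothetical concept counts standing in for a built NLP index.
index = Counter({
    "pilot": 40,
    "Flight Instructor": 12,
    "engine failure": 9,
    "fuel": 5,
})


def top_concepts(index, n, blacklist=()):
    """Return the n most frequent concepts, hiding blacklisted entries.

    Matching is case-insensitive, and the index itself is never modified.
    """
    blocked = {entry.strip().lower() for entry in blacklist}
    visible = [
        (concept, count)
        for concept, count in index.most_common()
        if concept.lower() not in blocked
    ]
    return visible[:n]


blacklist_1 = ["Pilot", "flight instructor"]

unfiltered = top_concepts(index, 2)
filtered = top_concepts(index, 2, blacklist_1)
print(unfiltered)
print(filtered)
```

Because the filter is applied when the results are read rather than when the index is built, adding or removing a blacklist entry in this sketch never requires rebuilding `index`, which mirrors why the domain does not need to be rebuilt after a blacklist change.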
This example is provided to give you some initial experience with InterSystems IRIS Natural Language Processing. You should not use this example as the basis for developing a real application. To use NLP in a real situation you should fully research the available choices provided by the software, then develop your application to create robust and efficient code.
Learn More About NLP Text Analytics
InterSystems has other resources to help you learn more about NLP Text Analytics, including: