docs.intersystems.com
InterSystems IRIS Data Platform 2019.2

First Look: Text Analytics with InterSystems Products


This First Look guide introduces you to InterSystems IRIS™ support for Natural Language Processing (NLP) text analytics, which provides semantic analysis of unstructured text data in a variety of natural languages. This enables you to discover useful information about the contents of a large number of text documents without any prior knowledge of the contents of the texts.
This guide presents an introduction to InterSystems IRIS Natural Language Processing, and walks through some initial tasks associated with indexing text data for semantic text analysis. Once you’ve completed this exploration, you will have indexed a group of texts and analyzed them to determine the most common entities in those texts, metrics about those entities, and various kinds of associations between entities, and you will have viewed the appearances of an entity in the source texts. These activities are designed to use only the default settings and features, so that you can acquaint yourself with the fundamentals of NLP text analysis. For the full documentation on Text Analytics, see the InterSystems IRIS Natural Language Processing (NLP) Guide.
A related, but separate, tool for handling unstructured texts is InterSystems IRIS SQL Search. SQL Search allows you to search for these same entities, as well as single words, regular expressions and other constructs in multiple texts. Inherently, a search solution presupposes that you know what you are looking for. NLP text analytics is designed to help you discover content and connections between content entities without necessarily starting from an idea to look for.
To browse all of the First Looks, including those that can be performed on a free evaluation instance of InterSystems IRIS, see InterSystems First Looks.
Why NLP Text Analytics Is Important
Increasingly, organizations are amassing larger and larger quantities of unstructured text data, far in excess of their ability to read or catalog these texts. Frequently, an organization may have little or no idea what the contents of these text documents are. Conventional “top-down” text analysis based on pure search technologies makes assumptions about the contents of these texts, and may therefore miss important content.
InterSystems IRIS Natural Language Processing (NLP) allows you to perform text analysis on these texts without any upfront knowledge of the subject matter. It does this by applying language-specific rules that identify semantic entities. Because these rules are specific to the language, not the content, NLP can provide insight into the contents of texts without the use of a dictionary or ontology.
How InterSystems IRIS Implements NLP Text Analytics
To prepare texts for NLP analytics you must load those texts into a domain, and then build the domain. Based on its analysis of the texts, NLP builds indices for the domain that NLP can use to rapidly analyze large quantities of text. Texts can be input from a variety of data locations, including SQL tables, text files, strings, globals, and RSS data.
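To make the idea of building indices concrete, here is a minimal Python sketch of a toy index that maps each entity to the sources in which it appears. It is an illustration only, not how InterSystems IRIS stores or builds its indices: the function name `build_index`, the sample sources, and the hand-listed entities are all hypothetical, whereas NLP identifies entities itself by applying language-specific rules.

```python
from collections import defaultdict


def build_index(sources):
    """Map each entity (lowercased) to the set of source IDs that contain it."""
    index = defaultdict(set)
    for source_id, entities in sources.items():
        for entity in entities:
            index[entity.lower()].add(source_id)
    return dict(index)


# Hypothetical sources; the entities are listed by hand for illustration.
sources = {
    1: ["student pilot", "left wing", "runway"],
    2: ["pilot", "runway", "engine power"],
    3: ["engine power", "left wing"],
}

index = build_index(sources)

# Once the index exists, a lookup no longer needs to rescan the texts.
print(sorted(index["runway"]))
```

The point of the sketch is the trade-off it shows: the work of scanning every text is done once, at build time, so that later questions about an entity are answered from the index.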
NLP supports the following functionality:
Trying NLP Text Analytics for Yourself
It is easy to use InterSystems IRIS Text Analytics. This simple procedure walks you through the basic steps of generating NLP metrics.
This example is provided to give you some initial experience with InterSystems IRIS Natural Language Processing. You should not use this example as the basis for developing a real application. To use NLP in a real situation you should fully research the available choices provided by the software, then develop your application to create robust and efficient code.
Before You Begin
To use the procedure, you will need a running InterSystems IRIS instance. Your choices for InterSystems IRIS include several types of licensed and free evaluation instances; the instance need not be hosted by the system you are working on (although they must have network access to each other). For information on how to deploy each type of instance if you do not already have one to work with, see Deploying InterSystems IRIS in InterSystems IRIS Basics: Connecting an IDE.
You also need to obtain the Aviation.Event SQL table, which is available on GitHub at https://github.com/intersystems/Samples-Aviation. Follow the instructions provided in Downloading and Setting up the Sample Files in First Look: SQL Search with InterSystems Products to download and set up the files.
Create a Domain and Add Data Locations
All NLP analysis occurs within a domain. You associate multiple texts with a domain. You then build the domain, creating indices that are used by NLP queries.
A domain is created within a namespace, such as the SAMPLES namespace you created in the previous section by following the procedure in First Look: SQL Search with InterSystems Products. You can create multiple domains within a namespace. You can associate a text with multiple domains.
There are several ways to create, populate, and build a domain. The following example uses the Domain Architect interface.
  1. Open the Management Portal for your instance in your browser, using the URL described for your instance in InterSystems IRIS Basics: Connecting an IDE.
  2. Navigate to the Domain Architect page (Analytics > Text Analytics > Domain Architect). Before using the Analytics options, you may need to switch to the analytics-enabled SAMPLES namespace.
  3. Click the New button to define a domain. Specify the following domain values (in the order given):
    Click the Finish button to create the domain. This displays the Model Elements selection screen.
  4. Within a domain you can define data locations and other model elements for the domain. To add or modify model elements, click on the expansion triangle next to one of the headings. Initially, no expansion occurs. Once you have defined some model elements, clicking the expansion triangle shows the model elements you have defined.
    Click the Data Locations triangle to display the Details tab on the right side of the screen. The Details tab shows five Add Data options. Select Add data from table.
    This option allows you to specify data stored in an SQL table. In this example we will specify the following fields:
    The Domain Architect page heading is followed by an asterisk (*) if there are unsaved changes to the current domain definition. Click Save to save your changes.
  5. Compile the domain by pressing the Compile button.
  6. Build the NLP indices for the data sources by pressing the Build button.
Explore the Data
Explore the data using the procedure that follows:
  1. On the Domain Architect page, select the Tools tab on the right side of the screen, then click the Domain Explorer button.
  2. The Domain Explorer initially displays a list of the most significant concepts in the source texts:
  3. When you select one of these concepts the other Domain Explorer listings are displayed:
    When you select a concept in any of these lists, the listings are refreshed based on that concept. Alternatively, you can type an entity (Concept or Relation) into the Domain Explorer Explore area and click the Explore! button.
    By using these listings, you can determine what concepts appear in the source documents, how significant they are, and what other concepts are associated with them.
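The kind of information these listings provide can be approximated with a small Python sketch. It assumes two simple metrics, frequency (total occurrences of a concept) and spread (number of sources that contain it), and treats concepts that share a source as associated. The function names and sample data are hypothetical, and the sketch does not reproduce how the Domain Explorer actually computes or ranks its listings.

```python
from collections import Counter


def concept_metrics(sources):
    """Return (frequency, spread) counters for the concepts in all sources."""
    frequency = Counter()  # total occurrences across all sources
    spread = Counter()     # number of sources containing the concept
    for concepts in sources.values():
        frequency.update(concepts)
        spread.update(set(concepts))
    return frequency, spread


def related_concepts(sources, concept):
    """Count the sources in which each other concept appears with `concept`."""
    related = Counter()
    for concepts in sources.values():
        if concept in concepts:
            related.update(c for c in set(concepts) if c != concept)
    return related


# Hypothetical sources, each given as the list of concepts found in it.
sources = {
    "A": ["runway", "pilot", "runway", "left wing"],
    "B": ["pilot", "engine power"],
    "C": ["runway", "engine power", "pilot"],
}

frequency, spread = concept_metrics(sources)
print(frequency["runway"], spread["runway"])
print(related_concepts(sources, "runway").most_common())
```

In this sample, "runway" occurs three times but in only two sources, which is why the two metrics are tracked separately: a concept repeated often in one text is not necessarily widespread across the collection.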
    The lower portion of the Domain Explorer allows you to view how a selected concept appears in the source texts:
    By clicking the eye icon, you can display the complete text of the source, with the selected concept highlighted, and red text used to indicate negation.
  4. You can add a blacklist to exclude undesired concepts. Often the list of top concepts begins with concepts that are too common or have little value in discovering useful information. These may be words or phrases that appear in all of the sources (such as “accident report” or “conclusions”), general concepts (such as “airplane” or “pilot”), or concepts not relevant to your use of the data (such as a list of cities). You can use a blacklist to prevent the display of these concepts. A blacklist affects only the display of concepts in certain query results; it has no effect on NLP indexing of concepts.
    1. In the Domain Architect click the Open button and select Samples >> then MyTest to open the existing domain Samples.MyTest.
    2. Click the Blacklists expansion triangle. This displays the Add blacklist button in the Details tab on the right side of the screen. Click Add blacklist to display the Name and Entries fields. Accept the default name for the blacklist (Blacklist_1). In the Entries box, list the entries (concepts), one concept per line; entries are not case-sensitive. In this example, list the concepts: pilot, student pilot, co-pilot, passenger, instructor, flight instructor, certified flight instructor.
    3. Save and Compile the domain. (You do not need to Build the domain to add, modify, or remove blacklists.)
    4. In the Domain Explorer click the sunglasses icon in the upper right corner. This displays a list of the blacklists defined for this domain that you can apply. Select Blacklist_1. Note that the Top Concepts listing no longer lists the blacklist concepts.
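The behavior described in this step, a case-insensitive filter applied to displayed results while the indexed data stays untouched, can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the product's implementation: the function `apply_blacklist`, the concept list, and its counts are hypothetical, and only the blacklist entries are taken from the example above.

```python
def apply_blacklist(top_concepts, blacklist):
    """Return the concepts to display, omitting blacklisted entries.

    Matching is case-insensitive; the input list is not modified.
    """
    blocked = {entry.strip().lower() for entry in blacklist if entry.strip()}
    return [(concept, count) for concept, count in top_concepts
            if concept.lower() not in blocked]


# Hypothetical top concepts with made-up counts, standing in for indexed data.
top_concepts = [
    ("pilot", 120),
    ("runway", 85),
    ("Flight Instructor", 40),
    ("engine power", 33),
]

# Entries from the example, one concept per entry.
blacklist_1 = [
    "pilot", "student pilot", "co-pilot", "passenger",
    "instructor", "flight instructor", "certified flight instructor",
]

visible = apply_blacklist(top_concepts, blacklist_1)
print(visible)
print(len(top_concepts))  # the underlying list is unchanged
```

Because the filter runs only when results are displayed, changing the blacklist never requires re-indexing, which matches the note that you do not need to Build the domain after adding, modifying, or removing a blacklist.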
Learn More About NLP Text Analytics
InterSystems has other resources to help you learn more about NLP Text Analytics, including:


Copyright © 1997-2019 InterSystems Corporation, Cambridge, MA
Content Date/Time: 2019-08-23 06:47:59