

Using InterSystems UIMA
Overview


UIMA (Unstructured Information Management Architecture) is an industry standard for processing unstructured data. It is an open-source framework available from the Apache Software Foundation (Apache UIMA). It provides a contract with software implementors for a standardized representation of the results of unstructured data analysis. UIMA can be used for any type of unstructured data; it is most typically used for natural language text.
UIMA is used to generate annotations for a source text. These annotations reference the source text by start and end position in that text. The annotations are separate from the source text and do not alter the source text.
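The idea of annotations that point into a text without altering it can be illustrated with a small Python sketch. This is not InterSystems or Apache UIMA code; the Annotation class, the covered_text function, and the example.* type names are hypothetical, and the end offset is assumed to be exclusive (one past the last covered character).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type_name: str  # hypothetical type name, for illustration only
    begin: int      # 0-based character offset where the span starts
    end: int        # offset just past the last character (assumed exclusive)


def covered_text(source: str, ann: Annotation) -> str:
    """Return the slice of the source text that the annotation points at."""
    return source[ann.begin:ann.end]


# The source text is never modified; annotations only store offsets into it.
source = "InterSystems IRIS supports UIMA."
annotations = [
    Annotation("example.Product", 0, 17),
    Annotation("example.Standard", 27, 31),
]
```

Because each annotation holds only a type and two offsets, any number of annotations can be layered over the same text without conflicting with each other.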
InterSystems can perform the following operations:
You can use InterSystems IRIS Natural Language Processing (NLP) independently of UIMA. This can be done concurrently with using NLP with UIMA. UIMA is an additional technology that can interface with NLP. It does not change or supersede older NLP indexing and processing.
This chapter describes the following UIMA operations:
What is UIMA and What Can You Use It For
UIMA provides a standard for annotating unstructured data. By using this standard, different unstructured data analysis technologies, such as InterSystems IRIS Natural Language Processing (NLP), can annotate the same source text without the annotations interfering with each other or interfering with the ability to parse the source text. Thus the UIMA framework allows for the combining of annotations by different technologies that each focus on one data analysis task. For each technology, UIMA creates an analysis engine object as the interface to it; therefore, NLP interacts with UIMA through the NLP text analysis engine. Commonly, analysis engines work independently of one another, each producing its own annotations; however, UIMA provides the ability for an analysis engine to use the annotations created by another analysis engine.
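As a rough illustration of one analysis engine building on the annotations of another, the following Python sketch chains two toy engines. It does not use the UIMA framework; tokenizer, capitalized_finder, run_pipeline, and the example.* type names are all hypothetical, and annotations are modeled as plain (type, begin, end) tuples.

```python
import re


def tokenizer(text, annotations):
    """First engine: emit one Token annotation per run of word characters."""
    return [("example.Token", m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def capitalized_finder(text, annotations):
    """Second engine: reuse existing Token annotations and flag capitalized ones."""
    return [
        ("example.Capitalized", begin, end)
        for (type_name, begin, end) in annotations
        if type_name == "example.Token" and text[begin].isupper()
    ]


def run_pipeline(text, engines):
    """Run the engines in order; each one sees the annotations produced so far."""
    annotations = []
    for engine in engines:
        annotations.extend(engine(text, annotations))
    return annotations


result = run_pipeline("Paris is in France", [tokenizer, capitalized_finder])
```

Reversing the order of the two engines would leave capitalized_finder with no Token annotations to work from, which is why the order of engines matters when one depends on another.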
The UIMA standard provides a framework for the implementation of UIMA-compliant technologies dedicated to a particular task. These technologies can support operations such as tokenizing, semantic analysis (InterSystems IRIS NLP), named-entity recognition (NER) tools for identifying names of persons and places, rule-based information extraction, speech-to-text conversion, and others.
An analysis engine produces annotations. An annotation is an encoding associated with a fragment of the source text, identified by a beginning and end character position, with a UIMA type string, and (optionally) other annotation-specific features. An analysis engine does not change the original source text.
UIMA is easy to use because deployment and scaling are handled by the UIMA framework. It handles setting up and invoking instances of these components in a possibly distributed architecture. It provides interoperability through this common annotation format.
InterSystems’ implementation of UIMA is compliant with additional packages available from Apache UIMA, including Asynchronous Scaleout (UIMA-AS), Distributed UIMA Cluster Computing (UIMA-DUCC), and UIMA Ruta, a rule-based annotation workbench.
UIMA Glossary
Annotation: annotations within a source text inherit from the uima.tcas.AnnotationBase UIMA type and have three mandatory properties: an annotation type name (similar to a Java FQN), a start position property and an end position property, which are expressed as integer character positions from the beginning of the source text, counting from 0. Annotations that inherit from uima.tcas.TOP UIMA type are not associated with a particular section of the source text, but are instead associated with the source text itself. Annotations can have additional technology-specific properties, as defined by the implementor. Annotations can be nested and/or can refer to other annotations.
CAS: Common Analysis Structure, an in-memory object that provides cooperating UIMA components with a common representation and mechanism for shared access to the source text being analyzed. A CAS is a data structure that holds one or more Sofas; it always contains at least one Sofa.
Sofa: A CAS can have more than one “view” on the data to be processed, called a Sofa (Subject of Analysis). For example, a CAS for a web page can have one Sofa for the web page with HTML markup, and another Sofa for just the web page text. Commonly, a CAS has only one Sofa; if it has multiple Sofas, each presents a variant on the same data. For example, a source text translated into English, French, and Spanish would be one CAS with three Sofas. Annotations are always associated with an individual Sofa. A Sofa contains 1) the source text data being processed; 2) any annotations provided by Analysis Engines; 3) indices to the annotations; 4) the type system for the annotations.
AE: Analysis Engine, a technology that analyzes unstructured data. An analysis engine is a UIMA object that implements the UIMA interface which a technology (such as NLP) interacts with. Each technology has its own analysis engine. An analysis engine is configured as a UIMA component descriptor file in XML.
Pipeline: a linear series of Analysis Engines that are executed in the order specified. Analysis engines can create annotations independent of each other, or they can take as input annotations generated by an analysis engine earlier in the pipeline and generate annotations based on the work of another analysis engine. InterSystems supports a linear pipeline. Non-linear pipelines can be supported by an aggregate analysis engine; these are not directly supported by InterSystems but are compatible with the InterSystems UIMA implementation.
Type System: a uniform standard for specifying annotation type names. The XML descriptor of an analysis engine contains a section specifying the annotation vocabulary the engine uses internally and the types it emits. This XML descriptor is specified in the Functional Index AEDESCRIPTOR parameter. When invoking an analysis engine you can limit recognition of annotation types to a subset of the types supported by that analysis engine.
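The relationship between a CAS and its Sofas, as described in the glossary entries above, can be pictured with a short Python sketch. The Cas and Sofa classes here are hypothetical stand-ins, not the Apache UIMA classes, and they model only the source text and the annotations of each view.

```python
from dataclasses import dataclass, field


@dataclass
class Sofa:
    sofa_id: str
    text: str
    # Annotations belong to one Sofa; stored here as (type, begin, end) tuples.
    annotations: list = field(default_factory=list)


@dataclass
class Cas:
    sofas: dict = field(default_factory=dict)

    def add_sofa(self, sofa_id: str, text: str) -> Sofa:
        """Register a new view of the data under the given identifier."""
        sofa = Sofa(sofa_id, text)
        self.sofas[sofa_id] = sofa
        return sofa


# One CAS for a web page: a view with markup and a view with text only.
cas = Cas()
html = cas.add_sofa("html", "<p>Hello world</p>")
plain = cas.add_sofa("plain", "Hello world")
plain.annotations.append(("example.Greeting", 0, 5))
```

The annotation added to the plain view is not visible from the html view, since its offsets are meaningful only against the text of the Sofa it was created on.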
Online Resources for UIMA
The Apache UIMA home page: http://uima.apache.org/index.html
Apache UIMA reference, JSON serialization overview: http://uima.apache.org/d/uimaj-current/references.html#ugr.ref.json.overview
Apache UIMA reference, CAS: http://uima.apache.org/d/uimaj-current/references.html#ugr.ref.cas
UIMA-compliant annotators that can be downloaded from Apache Software Foundation: http://uima.apache.org/sandbox.html#UIMA%20Addons%20components
Apache cTAKES, a large set of annotators specifically designed for medical patient record unstructured data: http://ctakes.apache.org/
Overview of UIMA Support in the IRIS Data Platform
InterSystems IRIS Data Platform™ support of UIMA provides the following:
InterSystems provides support for invoking UIMA in the following two ways:
In either scenario, InterSystems provides support for persisting the output in a UIMA Annotation Store. Providing a persistent independent Annotation Store is an InterSystems extension to the Apache UIMA standard.
The UIMA Functional Index
The UIMA functional index is an InterSystems SQL table index. It indexes the contents of a single column in an SQL source table. This index automatically loads data into a UIMA workflow from this source text column.
The InterSystems UIMA functional index class, %UIMA.Index, implements the %Library.FunctionalIndex specification.
Once a functional index has been defined and compiled, adding records to the SQL source table results in the source text column data being processed by the specified UIMA analysis engine (or a linear series (pipeline) of analysis engines). An INSERT, UPDATE, DELETE, or a %BuildIndices() operation automatically revises the functional index, invoking the UIMA analysis engine (or engines).
Defining a Functional Index is described below.
The UIMA Annotation Store
Upon compiling a UIMA functional index, the system automatically generates several persistent classes to store UIMA annotations. This UIMA annotation store is an InterSystems extension to the UIMA framework. It stores UIMA output in persistent classes (SQL tables) that are separate from the source text tables, but linked to those tables. The InterSystems Annotation Store provides a flexible SQL-based annotation storage for later analysis of unstructured data across multiple source texts. This avoids the large XML annotation files created by standard UIMA.
The UIMA annotation store can be used as an annotation store for the annotations generated by InterSystems IRIS Natural Language Processing (NLP) and other analysis engines specified in the Functional Index.
The UIMA annotation store can also be used independent of the Functional Index to store additional manual annotations based on user review of the source text.
Using NLP as a UIMA Analysis Engine
You can use InterSystems IRIS Natural Language Processing (NLP) as a UIMA analysis engine, generating UIMA annotations for NLP Concepts and Relations. These annotations are fully compatible with UIMA annotations supplied by other UIMA technologies.
You can also use NLP independently of UIMA, generating InterSystems IRIS globals for NLP Concepts and Relations.
Because UIMA provides UIMA-compliant indexing for NLP results, but does not change the NLP engine itself, these two uses of NLP can be performed concurrently on the same unstructured data sources. Refer to Using InterSystems IRIS Natural Language Processing (NLP) for further details.
How to Use the UIMA Functional Index
Launching the Java Gateway
You must have the Java Gateway running on port 5555 to compile a UIMA functional index. This Java Gateway provides access to the InterSystems UIMA Java Gateway.
The Java Gateway is an InterSystems IRIS feature that allows invocation of Java classes from ObjectScript. It runs as a daemon process listening on a TCP/IP port. It spawns off threads to execute the Java code as requested through messages sent from ObjectScript through %Net.Remote.Gateway code.
The UIMA Java Gateway com.intersys.uima.Gateway is a Java class that holds the public methods our UIMA integration presents to ObjectScript. It is exposed to InterSystems IRIS through the Java Gateway.
The Java Gateway needs to be running on the same host as the InterSystems IRIS instance (localhost), listening on port 5555. No additional classpath settings should be provided when launching the JVM (Java Virtual Machine), but regular JVM parameters for setting minimal and maximum memory may be supplied.
On a Windows system you can start the Java Gateway by executing the following command from the Windows Run interface:
%JAVA_HOME%\bin\java -classpath "C:\InterSystems\UIMA\dev\java\lib\JDK18\*;C:\InterSystems\UIMA\dev\java\lib\jackson\*" com.intersys.gateway.JavaGateway 5555
For this command to function, you must have defined JAVA_HOME. If JAVA_HOME is not defined on your Windows system, go to the Control Panel, System option, Advanced system settings. Select the Environment Variables button. Define a new system variable named JAVA_HOME. Browse to your Java installation directory (the directory that contains the bin subdirectory, because the command appends \bin\java to this value) and assign this path as the JAVA_HOME value. For example, C:\Program Files\Java\jre1.8.0_141.
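Before compiling a functional index, it can be useful to confirm that something is accepting TCP connections on the gateway port. The following is a minimal Python sketch, not part of the InterSystems tooling; the function name gateway_is_listening is hypothetical. A successful connection shows only that some process is listening on the port, not that it is the Java Gateway.

```python
import socket


def gateway_is_listening(host: str = "localhost", port: int = 5555,
                         timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError if nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling gateway_is_listening() with no arguments checks localhost on port 5555, the host and port named in this section.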
Defining a Functional Index
You must define a functional index to allow users to use UIMA to process unstructured data stored in an SQL table column. A UIMA functional index is an index of type %UIMA.Index.
The following example defines the NYT.Articles SQL table as a persistent class (a table of article texts from the New York Times) and defines a UIMA index on the FullArticle column:
Class NYT.Articles Extends %Persistent
{
Property NYTID As %Integer;
Property PubDate As %Date;
Property FullArticle As %String(MAXLEN=32000);
Index IdxNYTArticles On (FullArticle) As %UIMA.Index(
    AEDESCRIPTOR = "classpath:/com/intersys/uima/annotator/iKnowEngine.xml"
    );
}
This Functional Index takes the following parameters:
The table on which the functional index is applied needs to be compiled. This generates a number of methods on the table itself that are specific to the %FunctionalIndex framework. It also validates the Annotation Store descriptor XML (ANNOTATIONSTOREDEF) and, if it passes, generates the appropriate ObjectScript classes based on it.
Compiling the table also tests the AEDESCRIPTOR referenced from the functional index. This test is performed through a call to the Java Gateway. Therefore, by default, the Java Gateway needs to be running when you compile a table with a UIMA functional index.
You can perform this table compile without an active Java Gateway by setting the index parameter TESTONCOMPILE=0. This suppresses AEDESCRIPTOR testing. However, an active Java Gateway is required when you insert data into the table.
How to Use the UIMA Annotation Store
Compiling a UIMA functional index automatically generates a UIMA Annotation Store. A UIMA Annotation Store is a package containing a set of persistent classes (SQL tables).
The Annotation Store is configured through a piece of XML that is supplied through the functional index or directly through the annotation filer. Refer to %UIMA.Model.annotationStore for configuration options for the annotation store.
Annotation Store classes inherit from %UIMA.AnnotationStore.* superclasses and are generated by %UIMA.AnnotationStore.ClassGenerator.
The Annotation Filer
The UIMA Annotation Filer (com.intersys.uima.filer.AnnotationFiler) is a Java class that implements the UIMA Analysis Engine interface and therefore acts as a regular UIMA component. The UIMA Annotation Filer is itself an analysis engine, the last analysis engine in a UIMA processing pipeline. Rather than adding annotations by itself, it reads all existing annotations from the supplied CAS (in-memory text object). It sends these annotations back to InterSystems IRIS for storing in the UIMA Annotation Store. This is sometimes referred to as a CAS Consumer in UIMA terminology.
Like any other analysis engine, the Annotation Filer is configured through a UIMA component descriptor file in XML, containing database connection parameters, an identifier for the Annotation Store to file the data into, and the XML description of the annotation store.
Note:
All of these Annotation Filer parameters are configured automatically by the UIMA Functional Index.
Annotation Store Tables
Compiling a UIMA functional index automatically generates a UIMA Annotation Store. A UIMA Annotation Store is a set of InterSystems IRIS tables that stores all the annotations for a given dataset processed by a UIMA pipeline. This pipeline can consist of one or more UIMA analysis engines, with the UIMA Annotation Filer as the last engine in the pipeline.
By default, the Annotation Store is given the same name as the persistent class for which the index is defined. Therefore, in our example, compiling a functional index for a column of the table NYT.Articles would produce a corresponding Annotation Store package called NYT_Articles. This naming default is modifiable.
The UIMA Annotation Store consists of (at minimum) the following set of tables:
Because these are SQL tables, you can access their contents with standard SQL queries.
Type Table
The Type table consists of the following fields:
Field | Data Type | Indexed? | Purpose
name | String | unique index | The type of annotation.
parent | Integer | | References Type table.
Sofa Table
The Sofa table consists of the following fields:
Field | Data Type | Indexed? | Purpose
docID | Integer | bitmap index | References source text table.
hasManualAnnotations | Boolean | bitmap index | Specifies if there are any manual annotations.
mimeType | String | | MIME type (also known as media type), an IETF standard description of the type of the data represented by the Sofa.
sofaID | String | bitmap index |
sofaString | String | | The source text for this Sofa.
Annotation Table
The Annotation table consists of the following fields:
Field | Data Type | Indexed? | Purpose
begin | Integer | %pos standard index | The beginning character position for the annotation text. Positions are a count of Unicode characters, counting from 0.
coveredText | String | | The annotation text. An exact copy of the section of source text identified by the begin and end character positions.
docID | Integer | bitmap index | References source text table.
end | Integer | %pos standard index | The ending character position for the annotation text.
isManual | Boolean | bitmap index | Flags a manual annotation.
sofaID | Integer | | References Sofa table.
typeID | Integer | bitmap index | References Type table.
A Top Annotation table does not include the begin, coveredText, and end fields.
An annotation table can receive additional field values generated by the analysis engine. If no Annotation fields have been defined for these additional field values, they are stored in a generic features field as a JSON array of key:value pairs.
You can access annotations using the %UIMA.AnnotationStore.Store methods GetAnnotations() and GetAnnotationsRS().
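As a rough illustration of querying these tables with SQL, the Python sketch below builds simplified stand-ins for the Type and Annotation tables in an in-memory SQLite database and joins them on typeID. The id columns, the example.* type names, the sample rows, and the annotations_of_type function are assumptions made for the example. A real Annotation Store lives in InterSystems IRIS, and its generated schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "begin" and "end" are quoted because they are SQL keywords in SQLite.
conn.executescript("""
CREATE TABLE Type (id INTEGER PRIMARY KEY, name TEXT UNIQUE, parent INTEGER);
CREATE TABLE Annotation (
    id INTEGER PRIMARY KEY,
    "begin" INTEGER, "end" INTEGER, coveredText TEXT,
    docID INTEGER, isManual INTEGER, sofaID INTEGER, typeID INTEGER
);
INSERT INTO Type (id, name) VALUES (1, 'example.Concept'), (2, 'example.Relation');
INSERT INTO Annotation ("begin", "end", coveredText, docID, isManual, sofaID, typeID)
VALUES (0, 5, 'Paris', 1, 0, 1, 1),
       (6, 8, 'is', 1, 0, 1, 2),
       (12, 18, 'France', 1, 1, 1, 1);
""")


def annotations_of_type(conn, type_name):
    """Return (begin, end, coveredText, isManual) rows for one annotation type."""
    rows = conn.execute(
        'SELECT a."begin", a."end", a.coveredText, a.isManual '
        'FROM Annotation a JOIN Type t ON a.typeID = t.id '
        'WHERE t.name = ? ORDER BY a."begin"',
        (type_name,),
    )
    return rows.fetchall()
```

The same join pattern, selecting from the Annotation table and resolving typeID against the Type table, is the natural starting point for most queries over stored annotations.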
Refining the Annotation Store
By default, the Annotation Store is given a package name that is the same as the persistent class for which the index is defined. You can optionally change the Annotation Store package name and add features to the annotation store by specifying an XData block. You can use this XData block to define additional annotation store features, including additional tables, columns, indices, and filters. You can do this in either of two ways:
The XData block contains XML-formatted data. For further details, refer to Defining and Using XData Blocks in Defining and Using Classes.
The UIMA Annotation Store may contain additional annotation tables, additional columns in those tables, and additional indices on them.
Changing Names
You can specify a different package name for the UIMA Annotation Store in the XData block, as follows:
XData IdxNYTArticles [ XMLNamespace = "http://www.intersystems.com/UIMA/annotationStore" ]
  {<store package="NYT.ArticleAS">
    </store>
  }
You can change the name of the Annotation Table in the XData block, as follows:
XData IdxNYTArticles [ XMLNamespace = "http://www.intersystems.com/UIMA/annotationStore" ]
  {<store package="NYT.Articles">
   <tables>
     <table name="FilteredAnnotation">
     </table>
   </tables>
   </store>
  }
Specifying Additional Tables
You can specify additional Annotation Store tables to store annotations. For example, you could create annotation tables to store the annotations from two different analysis engines in different annotation tables. The following XData block creates two Annotation tables; the second annotation table has an additional field named normalizedValue:
XData IdxNYTArticles [ XMLNamespace = "http://www.intersystems.com/UIMA/annotationStore" ]
  {<store package="NYT.Articles">
   <tables>
     <table name="Annotation1">
     </table>
     <table name="Annotation2">
       <features storeOther="json" >
         <feature name="normalizedValue" path="normalizedValue" >
         <parameter key="MAXLEN">300</parameter>
         </feature>
       </features>
     </table>
   </tables>
    </store>
   }
In %UIMA.Model.annotationStore, specify additional annotation tables in the tables property. You can define a table using %UIMA.Model.table.
You may need to create an additional table for top level annotations. The topLevel boolean property allows you to specify whether an annotation table contains annotations within the source text (annotations on a unit of text defined with a beginning and an end character position), or whether it is a table of “top” annotations that apply to the entire source text.
Specifying Additional Columns
You may wish to add columns to an Annotation Table. For example, in a Top Annotation Table you may wish to add a field for the NLP Dominance score.
If the analysis engine generates annotation fields that do not correspond to existing Annotation Table fields, these values are stored in a generic features field as a JSON array of key:value pairs.
To add columns, specify each column as a feature property in the XData block supplied to the functional index. A feature must be defined within features within a table. Defining one or more additional columns also automatically defines a generic features column. The following example defines three additional fields, plus the generic features field:
XData IdxNYTArticles [ XMLNamespace = "http://www.intersystems.com/UIMA/annotationStore" ]
 {
 <store package="NYT.Articles">
   <tables>
     <table>
       <features storeOther="json" >
         <feature name="normalizedValue" path="normalizedValue" >
           <parameter key="MAXLEN">300</parameter>
         </feature>
         <feature name="occurrences" path="occurrences" type=":annotationList:Annotation" />
         <feature name="parent" path="_parent" type=":annotation" />
       </features>
     </table>
   </tables>
 </store>
 }
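The split between dedicated columns and the generic features field can be pictured with the following Python sketch. It is only a model of the behavior described in this section, not the Annotation Filer's implementation. The PATH_TO_COLUMN mapping mirrors the three feature elements in the example XData block, the split_features function is hypothetical, and serializing the leftover features as a single JSON object is an assumption about the exact stored shape.

```python
import json

# Feature path (as emitted by the engine) -> column name, per the XData example.
PATH_TO_COLUMN = {
    "normalizedValue": "normalizedValue",
    "occurrences": "occurrences",
    "_parent": "parent",
}


def split_features(features: dict, path_to_column=PATH_TO_COLUMN):
    """Route known features to their columns; collect the rest as JSON text."""
    columns = {
        path_to_column[path]: value
        for path, value in features.items()
        if path in path_to_column
    }
    other = {
        path: value
        for path, value in features.items()
        if path not in path_to_column
    }
    return columns, json.dumps(other, sort_keys=True)


columns, other_json = split_features(
    {"normalizedValue": "new york", "_parent": 42, "confidence": 0.9}
)
```

In this model, confidence has no dedicated column and so ends up in the generic JSON field, while normalizedValue and _parent are stored in the normalizedValue and parent columns.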
Specifying Additional Indices
In %UIMA.Model.table, specify additional annotation table indices in the indices property. You can define an index using %UIMA.Model.index.
These additional indices must be supplied to an XData block that you specify in the %UIMA.Index ANNOTATIONSTOREDEF parameter. An index must be defined within a table. The following example defines an additional field and indexes that field:
XData IdxNYTArticles [ XMLNamespace = "http://www.intersystems.com/UIMA/annotationStore" ]
 {
 <store package="NYT.Articles">
 <tables>
  <table>
   <features storeOther="json" >
     <feature name="normalizedValue" path="normalizedValue" >
     <parameter key="MAXLEN">300</parameter>
     </feature>
   </features>
   <indices>
     <index properties="normalizedValue" name="NV" />
   </indices>
  </table>
 </tables>
 </store>
 }
Adding Annotation Filters
By default, all generated annotations are stored in the Annotation Store. You can specify one or more filters to apply to the annotations generated by the analysis engines. These filters instruct the Annotation Filer to store only annotations of the UIMA types specified in the filter. Note that by applying a filter you automatically exclude all annotations other than those explicitly specified in the filter.
A filter is supplied to an XData block that you specify in the %UIMA.Index ANNOTATIONSTOREDEF parameter. The following is an example XData block specifying a filter. Note that a filter is defined within store, but not within a table:
XData IdxNYTArticles [ XMLNamespace = "http://www.intersystems.com/UIMA/annotationStore" ]
 {
 <store package="NYT.Articles">
   <filters>
     <include>
     <exclude pattern="org.apache.uima.alchemy.ts.entity.AlchemyAnnotation" />
     </include>
   </filters>
   <tables>
     <table name="Annotation1">
     </table>
   </tables>
 </store>
 }
See %UIMA.Model.filters and %UIMA.Model.filterRule.
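The effect of a filter, storing only annotations of the listed types, can be sketched in Python as follows. This is a conceptual model only: apply_filter is a hypothetical function, annotations are plain (type, begin, end) tuples, and exact matching of type names is an assumption, since the matching rules for the pattern attribute are not described here.

```python
def apply_filter(annotations, included_types=None):
    """Keep only annotations whose type is listed; with no filter, keep all."""
    if included_types is None:
        # Default behavior: every generated annotation is stored.
        return list(annotations)
    allowed = set(included_types)
    return [ann for ann in annotations if ann[0] in allowed]


generated = [
    ("example.Concept", 0, 5),
    ("example.Relation", 6, 8),
    ("example.Concept", 12, 18),
]
stored = apply_filter(generated, ["example.Concept"])
```

Note that once any filter is supplied, types that are not listed are dropped, which matches the warning above that a filter excludes everything it does not explicitly specify.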
Manual Annotations
After running an analysis engine on a source text, you may discover that there are additional annotations that you wish to include for that source text. You can supply these annotations directly to the Annotation Store as manual annotations.
You can use the %UIMA.AnnotationStore.Store.FileAnnotation() method to manually insert a single annotation into the Annotation Store.
You can use the %UIMA.AnnotationStore.Store.FileAnnotations() method to manually insert an array of annotations into the Annotation Store.
UIMA flags manual annotations in the Annotation table using the isManual Boolean field.
By default, the Functional Index deletes all prior annotations for a source text before adding the annotations generated by the specified analysis engines. If you wish to preserve prior annotations, you must flag these as manual annotations.
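The following Python sketch models this replace-on-reindex behavior. It is an illustration under simplifying assumptions, not the functional index code: reindex is a hypothetical function, and annotations are plain dictionaries with an isManual key.

```python
def reindex(stored, generated):
    """Drop prior engine-generated annotations, keep manual ones, add new output."""
    kept = [ann for ann in stored if ann["isManual"]]
    fresh = [dict(ann, isManual=False) for ann in generated]
    return kept + fresh


stored = [
    {"type": "example.Concept", "begin": 0, "end": 5, "isManual": False},
    {"type": "example.Note", "begin": 12, "end": 18, "isManual": True},
]
generated = [{"type": "example.Concept", "begin": 0, "end": 5}]
after = reindex(stored, generated)
```

Only the annotation flagged as manual survives from the earlier state; everything else is replaced by the engines' new output.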
How to Use NLP as a UIMA Analysis Engine
The InterSystems IRIS Natural Language Processing (NLP) engine is exposed as a UIMA annotator and is already loaded on the classpath, accessible at "classpath:/com/intersys/uima/annotator/iKnowEngine.xml". You can use NLP as a UIMA annotator by defining a UIMA functional index. This does not limit the concurrent use of NLP independently of UIMA; that use defines NLP indices, which are stored as InterSystems IRIS globals.
When using NLP as a UIMA annotator, NLP places the results of its syntactic analysis into a compilable UIMACPP header file, then includes this data in an NLP UIMA wrapper, and sends it to the NLP engine. This allows NLP to perform processing on UIMA annotations as if they were InterSystems IRIS globals. The NLP engine itself is not changed.
NLP annotation types include Concepts, Relations, Negation, Positive Sentiment, Negative Sentiment. Top annotation types can include unique entities and dominance score.
The NLP UIMA wrapper code imports the UIMACPP (UIMA C++) library, which has dependencies on the Xerces and APR-1 libraries. Xerces is already part of the InterSystems IRIS code base. UIMACPP and APR-1 are provided as part of the InterSystems UIMA implementation. Both are maintained by the Apache Software Foundation.
The NLP Annotation Type System
NLP supports the Sofa Unicode text data format, which has a metadata parameter: language. NLP uses this language parameter to choose the corresponding language model. A language is specified using the two-letter ISO language codes; for example, en for English. Because a Sofa is identified as containing source text in a single language, NLP Automatic Language Identification (ALI) is not supported. The language identified for the Sofa is considered a constant for NLP UIMA indexing.
Modifying the Descriptor
Sample Annotation Store Definition
The UIMA REST API
Like its equivalents for Analytics and NLP, the UIMA REST API has a default generic web application, dispatched through %Api.UIMA, that forwards requests to the various namespaces. An individual version of the API can instead be tied to its own web application through the %UIMA.REST.v1 class.
REST API Basics
%UIMA.REST.v1 provides endpoints for accessing UIMA functionality over REST. You should set up a REST service at http://localhost:57772/api/uima/v1/[namespace]. Substitute your Web Server port number for 57772. To determine your Web Server port number, start the Management Portal. At the top of the page, click About. View the Web Server Port setting.
The following REST operations are supported:
Accessing the Swagger Reference Documentation
The UIMA REST API is fully documented using the OpenAPI Specification (also known as Swagger). The description in YAML is available from the "/swagger" endpoint and can be loaded directly into swagger-ui for convenient GUI capabilities on top of this API.
To use it, either install swagger-ui or go to http://petstore.swagger.io and point to this endpoint.
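The URL pattern described above can be assembled with a small Python helper. This is a sketch only: uima_api_base and swagger_url are hypothetical names, and placing "/swagger" directly under the namespace-level base URL is an assumption about where the endpoint sits.

```python
from urllib.parse import quote


def uima_api_base(namespace: str, port: int = 57772, host: str = "localhost") -> str:
    """Build the base URL of the UIMA REST service for one namespace."""
    # Percent-encode the namespace so characters such as "%" are URL-safe.
    return f"http://{host}:{port}/api/uima/v1/{quote(namespace, safe='')}"


def swagger_url(namespace: str, port: int = 57772, host: str = "localhost") -> str:
    """Build the assumed URL of the OpenAPI (Swagger) description."""
    return uima_api_base(namespace, port, host) + "/swagger"
```

Pass your own Web Server port in place of the default 57772, as described under REST API Basics.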
Reference Material
InterSystems UIMA Type System
InterSystems supports two annotation type systems: one for annotations within source text, and one for top annotations that apply to the entire source text:
Annotations within Text
Top Annotations
NLP Engine Deployment Descriptor
Annotation Filer Deployment Descriptor