Performance Considerations when Loading Texts

Important:

InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.

Because NLP typically handles large amounts of text data, keep the following InterSystems IRIS data platform performance considerations in mind when loading source text:

Before starting a batch load of a significant number of sources, stop database journaling. Once the batch load completes, make sure to restart journaling. Refer to the “Journaling” chapter of the Data Integrity Guide for information on stopping and restarting journaling.
Before starting a batch load of a significant number of sources (or a small number of very large sources), set the global buffer pool to a size large enough to handle this operation. NLP indexing creates a large number of temporary globals. If the global buffer pool is not large enough to handle these temporary globals in memory, they are written to disk. These disk I/O operations can significantly affect NLP performance. Refer to “Memory and Startup Settings” for more information.
NLP indexing requires substantially more disk space than the space occupied by the source texts. The approximate space requirements for temporary and permanent globals are described in the “Globals and Space Requirements” section of the “Implementation” chapter of this manual.
Do not configure more language support than is required for your sources. Your NLP Configuration should specify only those languages that are actually found in your sources. If all of your sources are in one language, do not specify automatic language identification. Unless n-grams are required for the language, do not set the EnableNgrams domain parameter.
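
Two of the considerations above lend themselves to a small sketch: restarting journaling reliably after a batch load, and choosing the minimal language settings for a set of sources. The Python below is illustrative only and is not an InterSystems API. The names journaling_suspended, plan_language_settings, stop, start, and ngram_languages are all hypothetical. The stop and start callables stand in for whatever mechanism your site uses to stop and restart journaling, and the returned dictionary is simply a summary of the settings you would then apply to your NLP Configuration and domain.

```python
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator


@contextmanager
def journaling_suspended(stop: Callable[[], None],
                         start: Callable[[], None]) -> Iterator[None]:
    """Stop journaling around a batch load and always restart it afterwards.

    `stop` and `start` are hypothetical, caller-supplied callables; this
    helper only guarantees the ordering, not how journaling is controlled.
    """
    stop()
    try:
        yield
    finally:
        # Runs even if the load raises, so journaling is never left off.
        start()


def plan_language_settings(source_languages: Iterable[str],
                           ngram_languages: Iterable[str] = ()) -> dict:
    """Derive minimal language settings from the languages found in the sources.

    `ngram_languages` is an assumption supplied by the caller: the set of
    languages for which n-grams are required.
    """
    languages = sorted(set(source_languages))
    if not languages:
        raise ValueError("at least one source language is required")
    needs_ngrams = set(ngram_languages)
    return {
        # Only the languages actually present in the sources.
        "languages": languages,
        # Automatic language identification is pointless for a single language.
        "auto_language_identification": len(languages) > 1,
        # Enable n-grams only if a language in the sources requires them.
        "enable_ngrams": any(lang in needs_ngrams for lang in languages),
    }
```

A batch load would then run inside a `with journaling_suspended(stop, start):` block, so that journaling is restarted whether the load succeeds or fails. For sources that are all in one language, plan_language_settings reports that automatic language identification and n-grams should stay off.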