Using iKnow
Language Identification

This chapter describes how to configure and use Automatic Language Identification (ALI), which is applied at the sentence level. It also describes a few language-specific issues.

Configuring Automatic Language Identification
An iKnow Configuration establishes the language environment for source document content. A Configuration is independent of any specific set of source data. You can either define a Configuration or use the default. If you do not specify a Configuration, the default is English-only, with no automatic language identification.
A Configuration defines which languages the source documents may contain and whether automatic language identification is performed.
The following example creates a configuration that assumes all source texts will be in English or French, and supports automatic language identification:
  SET myconfig="EnglishFrench"
  IF ##class(%iKnow.Configuration).Exists(myconfig) {
     SET cfg=##class(%iKnow.Configuration).Open(myconfig)
     WRITE "Opened existing configuration ",myconfig,!
  }
  ELSE {
     // Second argument (1) activates automatic language identification;
     // third argument lists the languages the source texts may use
     SET cfg=##class(%iKnow.Configuration).%New(myconfig,1,$LISTBUILD("en","fr"),"",1)
     DO cfg.%Save()
     IF ##class(%iKnow.Configuration).Exists(myconfig) {
        WRITE "Configuration ",myconfig," now exists",! }
     ELSE { WRITE "Configuration creation error" QUIT }
  }
  SET cfgId=cfg.Id
  WRITE "with configuration ID ",cfgId,!
  // Delete the configuration at random so that repeated runs exercise both branches above
  SET rnd=$RANDOM(2)
  IF rnd {
     SET stat=##class(%iKnow.Configuration).%DeleteId(cfgId)
     IF stat { WRITE "Deleted the ",myconfig," configuration" }
  }
  ELSE { WRITE "No delete this time",! }
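To confirm which language options a saved Configuration carries, you can read them back from the configuration object. The following is a minimal sketch rather than part of the original example. It assumes that %iKnow.Configuration exposes DetectLanguage and Languages properties corresponding to the second and third %New() arguments; check the class reference for your version before relying on these names.

```objectscript
  SET myconfig="EnglishFrench"
  IF '##class(%iKnow.Configuration).Exists(myconfig) {
     WRITE "Configuration ",myconfig," does not exist",!
     QUIT
  }
  SET cfg=##class(%iKnow.Configuration).Open(myconfig)
  // Assumed property names: DetectLanguage (boolean) and Languages ($LIST of language codes)
  WRITE "Automatic language identification: ",$SELECT(cfg.DetectLanguage:"on",1:"off"),!
  WRITE "Languages: ",$LISTTOSTRING(cfg.Languages,","),!
```

Because the preceding example deletes the configuration at random, this sketch checks that the configuration exists before opening it.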
 
Using Automatic Language Identification
iKnow performs automatic language identification on a per-sentence basis. When the current Configuration has activated automatic language identification, iKnow tests each sentence in each source text to determine which of the languages specified in the Configuration is used in that sentence. This identification is a statistical probability.
iKnow subsequently uses this language determination when determining CRCs and performing other iKnow analysis.
Thus, source texts, and sentences within a source text, can be in different languages. iKnow automatically determines which language model to apply. Automatic language identification also assigns a confidence level to each identification, expressed as an integer percentage. These values range from 100 (complete confidence) to 0 (indeterminate). If automatic language identification is not active, all sentences are assigned a confidence level of 0.
Language Identification Queries
The following example uses GetTopLanguage() to identify the language for a source and the degree of confidence in that identification. Because language identification is performed at the sentence level, the confidence for the source is the average of the language identification confidence of its component sentences. This method returns the language as a two-character abbreviation (in this case, “en”). Note that totlangconf (the total of the language confidence for the sentences) must be divided by numlangsent, not by numsent. These two sentence counts are usually, but not always, the same, because a source may contain sentences for which no language can be determined.
Configuration
  SET myconfig="EnFr"
  IF ##class(%iKnow.Configuration).Exists(myconfig)
       {SET cfg=##class(%iKnow.Configuration).Open(myconfig) }
  ELSE {SET cfg=##class(%iKnow.Configuration).%New(myconfig,1,$LISTBUILD("en","fr"),"",1)
        DO cfg.%Save() }
  SET cfgId=cfg.Id 
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).Exists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).Open(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error",!
            // DisplayError() writes the error text itself, so call it with DO
            DO $System.Status.DisplayError(stat)
            QUIT }
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET stat=flister.SetConfig(myconfig)
  IF stat '= 1 { WRITE "SetConfig error",!
                 DO $System.Status.DisplayError(stat)
                 QUIT }
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
  SET myquery="SELECT Top 10 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="UniqueVal"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF stat '= 1 { WRITE "The lister failed:",!
                 DO $System.Status.DisplayError(stat)
                 QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 { WRITE "The loader failed:",!
                 DO $System.Status.DisplayError(stat)
                 QUIT }
GetSources
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId)
  SET i=1
  WHILE $DATA(result(i)) {
     SET intId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET numsent = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,result(i))
     WRITE !,extId," has ",numsent," sentences",!
     SET srclang = ##class(%iKnow.Queries.SourceAPI).GetTopLanguage(domId,intId,.totlangconf,.numlangsent)
     // Average over the sentences that were assigned a language; skip the division if there are none
     IF $GET(numlangsent) > 0 {
        WRITE "Source language is ",srclang,!,"with a confidence % of ",totlangconf/numlangsent,!! }
     ELSE { WRITE "No language could be determined for this source",!! }
     SET i=i+1
  }
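
The choice of divisor matters whenever some sentences have no identified language. The following sketch uses made-up totals, not the output of any real source, to show the difference. The variable names mirror those in the example above.

```objectscript
  // Hypothetical values: 10 sentences, 8 of which were assigned a language
  SET numsent=10
  SET numlangsent=8
  SET totlangconf=600
  // Correct: average over the sentences that contributed to the total
  WRITE "Average confidence: ",totlangconf/numlangsent,!
  // Incorrect: dividing by all sentences understates the confidence
  WRITE "Understated value:  ",totlangconf/numsent,!
```

With these values the first line reports 75 and the second reports 60.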
 
The following example uses GetLanguage() to identify the language for each sentence in a source and the degree of confidence in that identification. This method returns the language as a two-character abbreviation (in this case, “en”) and the confidence level as a percentage between 0 and 100. Note that the confidence level is rarely (if ever) 100%.
Configuration
  SET myconfig="EnFr"
  IF ##class(%iKnow.Configuration).Exists(myconfig)
       {SET cfg=##class(%iKnow.Configuration).Open(myconfig) }
  ELSE {SET cfg=##class(%iKnow.Configuration).%New(myconfig,1,$LISTBUILD("en","fr"),"",1)
        DO cfg.%Save() }
  SET cfgId=cfg.Id 
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).Exists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).Open(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error",!
            // DisplayError() writes the error text itself, so call it with DO
            DO $System.Status.DisplayError(stat)
            QUIT }
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET stat=flister.SetConfig(myconfig)
  IF stat '= 1 { WRITE "SetConfig error",!
                 DO $System.Status.DisplayError(stat)
                 QUIT }
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
  SET myquery="SELECT Top 10 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="UniqueVal"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF stat '= 1 { WRITE "The lister failed:",!
                 DO $System.Status.DisplayError(stat)
                 QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 { WRITE "The loader failed:",!
                 DO $System.Status.DisplayError(stat)
                 QUIT }
GetOneSource
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId)
  FOR i=1:1:10 {
     IF $DATA(result(i)) {
        SET intId = $LISTGET(result(i),1)
        SET extId = $LISTGET(result(i),2)
        SET myconf=0
        SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,result(i))
        WRITE !,extId," has ",numSentS," sentences",!
        // GetSentencesInSource: j is a separate counter so that the outer loop variable i is not overwritten
        KILL sent
        SET sentStat=##class(%iKnow.Queries.SentenceAPI).GetBySource(.sent,domId,intId)
        IF sentStat=1 {
           SET j=1
           WHILE $DATA(sent(j)) {
              SET sentnum=$LISTGET(sent(j),1)
              WRITE "sentence:",sentnum
              SET lang = ##class(%iKnow.Queries.SentenceAPI).GetLanguage(domId,sentnum,.myconf)
              WRITE " language:",lang," confidence:",myconf,!
              SET j=j+1
           }
        }
     }
     ELSE { WRITE !,"That's all folks!" }
  }
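
Per-sentence results can also be summarized, for instance to see how many sentences were assigned to each language and how many fell below a chosen confidence level. The following is a sketch under stated assumptions: domId and intId are already set as in the example above, the threshold of 50 is an arbitrary illustrative value, and only the sentences returned by this single GetBySource() call are counted.

```objectscript
  // Assumes domId and intId are set as in the example above
  SET threshold=50          // arbitrary cutoff chosen for illustration
  KILL sent,tally
  SET lowcount=0
  SET sentStat=##class(%iKnow.Queries.SentenceAPI).GetBySource(.sent,domId,intId)
  IF sentStat=1 {
     SET j=1
     WHILE $DATA(sent(j)) {
        SET sentId=$LISTGET(sent(j),1)
        SET conf=0
        SET lang=##class(%iKnow.Queries.SentenceAPI).GetLanguage(domId,sentId,.conf)
        // An empty string cannot be used as a subscript, so give it a label
        SET key=$SELECT(lang="":"(none)",1:lang)
        SET tally(key)=$GET(tally(key))+1
        IF threshold > conf { SET lowcount=lowcount+1 }
        SET j=j+1
     }
  }
  // Report the count for each language found
  SET key=""
  FOR {
     SET key=$ORDER(tally(key))
     QUIT:key=""
     WRITE key,": ",tally(key)," sentences",!
  }
  WRITE lowcount," sentences below ",threshold,"% confidence",!
```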
 
Overriding Automatic Language Identification
You can use the LanguageFieldName domain parameter to override Automatic Language Identification. If activated, this parameter determines which language to apply by accessing a metadata field for each source. This metadata field contains the ISO language code. If the metadata field data is present, Automatic Language Identification is overridden for that source. If the metadata field is empty or invalid, Automatic Language Identification is used for that source. The LanguageFieldName domain parameter is inactive by default. For further details, refer to the Domain Parameters appendix of this manual.
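As an illustration, the parameter might be set on a domain object as sketched below. This assumes that %iKnow.Domain provides SetParameter() and GetParameter() instance methods and that domoref is an open domain object, as in the examples earlier in this chapter. The field name "SourceLang" is hypothetical; the metadata field itself must be defined and populated for each source separately. Some domain parameters may need to be set before data is loaded, so consult the Domain Parameters appendix first.

```objectscript
  // Assumes domoref is an open %iKnow.Domain object.
  // "SourceLang" is a hypothetical metadata field name used only for illustration.
  SET stat=domoref.SetParameter("LanguageFieldName","SourceLang")
  IF stat '= 1 { WRITE "SetParameter error",!
                 DO $System.Status.DisplayError(stat) }
  ELSE { WRITE "LanguageFieldName is now ",domoref.GetParameter("LanguageFieldName"),! }
```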
Language-Specific Issues
German: the German eszett (“ß”) character is normalized as “ss”. German commonly requires setting the EnableNgrams domain parameter.