Using Caché XML Tools
Customizing How the Caché SAX Parser Is Used

Whenever Caché reads an XML document, it uses the Caché SAX (Simple API for XML) Parser. This chapter describes your options for controlling the Caché SAX Parser.

About the Caché SAX Parser
The Caché SAX Parser is used whenever Caché reads an XML document.
It is an event-driven XML parser that reads an XML file and issues callbacks when it finds items of interest, such as the start of an XML element, start of a DTD, and so on.
(More accurately, the parser works in conjunction with a content handler, and the content handler issues the callbacks. This distinction is important only if you are customizing the SAX interface, as described in Creating a Custom Content Handler, later in this chapter.)
The parser uses the standard Xerces-C++ library, which complies with the XML 1.0 recommendation and many associated standards. For a list of these standards, see http://xml.apache.org/xerces-c/.
Available Parser Options
You can control the behavior of the SAX parser in several ways. The available options depend on how you are using the Caché SAX Parser, as summarized in the following table:
SAX Parser Options in %XML Classes
| Option | %XML.Reader | %XML.TextReader | %XML.XPATH.Document | %XML.SAX.Parser |
| --- | --- | --- | --- | --- |
| Specifying parser flags | supported | supported | supported | supported |
| Specifying which parsing events are interesting (for example, start of element, end of element, comments) | not supported | supported | not supported | supported |
| Specifying a schema specification | supported | supported | supported | supported |
| Disabling entity resolution or otherwise customizing entity resolution | supported | supported | supported | supported |
| Specifying a custom HTTP request (if parsing a URL) | not supported | supported | not supported | supported |
| Specifying the content handler | not supported | not supported | not supported | supported |
| Parse documents at HTTPS locations | supported | not supported | not supported | supported |
| Resolve entities at HTTPS locations | not supported | not supported | not supported | supported |
Specifying the Parser Options
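You specify the parser behavior differently depending on how you are using the Caché SAX Parser. In the examples in this chapter, an entity resolver is assigned to the EntityResolver property of a %XML.Reader instance, whereas for %XML.TextReader and %XML.SAX.Parser the options are passed as arguments of the parsing methods (for %XML.SAX.Parser, see Argument Lists for the SAX Parsing Methods, later in this chapter).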
Setting the Parser Flags
The %occSAX.inc include file lists the flags that you can use to control the validation performed by the Xerces parser. It defines basic flags (for example, $$$SAXVALIDATION, $$$SAXNAMESPACES, $$$SAXNAMESPACEPREFIXES, and $$$SAXVALIDATIONSCHEMA) as well as additional flags that provide useful combinations of the basic flags.
For the complete list, see %occSAX.inc, which also provides links to further details on these kinds of validation.
The following fragment shows how you can combine parser options:
...
#include %occInclude
#include %occSAX
...
 ;; set the parser options we want
 set opt = $$$SAXVALIDATION
               + $$$SAXNAMESPACES
               + $$$SAXNAMESPACEPREFIXES
               + $$$SAXVALIDATIONSCHEMA
...
  set status=##class(%XML.TextReader).ParseFile(myfile,.doc,,opt)
  //check status
  if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
Specifying the Event Mask
The %occSAX.inc include file also lists the flags that you use to specify which event callbacks to process. For performance reasons, it is desirable to process only the callbacks that you need. You may or may not need to specify the mask, depending on which class you use to call the Caché SAX Parser.
Basic Flags
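Each basic flag corresponds to one kind of event callback; examples are $$$SAXSTARTDOCUMENT, $$$SAXENDDOCUMENT, $$$SAXSTARTELEMENT, $$$SAXENDELEMENT, and $$$SAXCHARACTERS. For the complete list, see %occSAX.inc.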
Convenient Combination Flags
Additional flags, such as $$$SAXALLEVENTS, provide useful combinations of the basic flags. For the complete list, see %occSAX.inc.
Combining Flags into a Single Mask
The following fragment shows how you can combine multiple flags into a single mask:
...
#include %occInclude
#include %occSAX
...
 // set the mask options we want
 set mask = $$$SAXSTARTDOCUMENT
               + $$$SAXENDDOCUMENT
               + $$$SAXSTARTELEMENT
               + $$$SAXENDELEMENT
               + $$$SAXCHARACTERS
...
 // create a TextReader object (doc) by reference
 set status = ##class(%XML.TextReader).ParseFile(myfile,.doc,,,mask)

Specifying a Schema Document
You can specify a schema specification against which to validate the document source. Specify a string that contains a comma-separated list of namespace/URL pairs:
"namespace URL,namespace URL,namespace URL,..."
Here namespace is the XML namespace (not a namespace prefix) and URL is a URL that gives the location of the schema document for that namespace. There is a single space character between the namespace and URL values. For example, the following shows a schema specification with a single namespace:
"http://www.myapp.org http://localhost/myschemas/myapp.xsd"
The following shows a schema specification with two namespaces:
"http://www.myapp.org http://localhost/myschemas/myapp.xsd,http://www.other.org http://localhost/myschemas/other.xsd"
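The following fragment is a minimal sketch of how such a specification might be passed to the parser. It assumes the argument order given in Argument Lists for the SAX Parsing Methods, later in this chapter, where the schema specification is the sixth argument. The variable myfile and the handler class MyApp.MyHandler are placeholders; the namespaces and schema URLs are the ones used in the examples above.

```objectscript
...
#include %occInclude
#include %occSAX
...
 // namespace/URL pairs separated by commas; a single space inside each pair
 set spec = "http://www.myapp.org http://localhost/myschemas/myapp.xsd"
 set spec = spec_",http://www.other.org http://localhost/myschemas/other.xsd"

 // turn on schema validation along with namespace processing
 set opt = $$$SAXVALIDATION
               + $$$SAXNAMESPACES
               + $$$SAXNAMESPACEPREFIXES
               + $$$SAXVALIDATIONSCHEMA

 // MyApp.MyHandler stands for any subclass of %XML.SAX.ContentHandler
 set handler = ##class(MyApp.MyHandler).%New()

 // arguments: source, handler, resolver (omitted), flags, mask (omitted), schema spec
 set status = ##class(%XML.SAX.Parser).ParseFile(myfile,handler,,opt,,spec)
 if $$$ISERR(status) {do $System.Status.DisplayError(status)}
```

Building the string in two steps keeps each namespace/URL pair on its own line, which makes a missing space or comma easier to spot.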
Disabling Entity Resolution
Even when you set SAX flags to disable validation, the SAX parser still attempts to resolve external entities, which can be time-consuming, depending on their locations.
The class %XML.SAX.NullEntityResolver implements an entity resolver that always returns an empty stream. Use this class if you want to disable entity resolution. Specifically, when you read the XML document, use an instance of %XML.SAX.NullEntityResolver as the entity resolver. For example:
   Set resolver=##class(%XML.SAX.NullEntityResolver).%New()
   Set reader=##class(%XML.Reader).%New()
   Set reader.EntityResolver=resolver
   
   Set status=reader.OpenFile(myfile)
   ...
Important:
Because this change disables all resolution of external entities, this technique also disables all external DTD and schema references in your XML document.
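The same resolver can presumably be used with the other classes that accept an entity resolver. The following sketch assumes that the third argument of the ParseFile() method of %XML.TextReader is the entity resolver, as in the ParseFile() call shown in Example 1 later in this chapter; myfile is a placeholder for the path of the document.

```objectscript
 // an entity resolver that always returns an empty stream
 set resolver = ##class(%XML.SAX.NullEntityResolver).%New()

 // pass the resolver as the third argument; doc is returned by reference
 set status = ##class(%XML.TextReader).ParseFile(myfile,.doc,resolver)
 if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
```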
Performing Custom Entity Resolution
Your XML document may contain references to external DTDs or other entities. By default, Caché attempts to find the source documents for these entities and resolve them. To control how Caché resolves external entities, use the following procedure:
  1. Define an entity resolver class.
    This class must extend the %XML.SAX.EntityResolver class and must implement the resolveEntity() method, which has the following signature:
    method resolveEntity(publicID As %Library.String, systemID As %Library.String) as %Library.Integer
    This method is invoked each time the XML processor finds a reference to an external entity (such as a DTD); here publicID and systemID are the Public and System identifier strings for that entity.
    The method should fetch the entity or document as a stream, wrap the stream in an instance of %XML.SAX.StreamAdapter, and return that instance. This class provides the necessary methods that are used to determine characteristics of the stream.
    If the entity cannot be resolved, the method should return $$$NULLOREF to indicate this to the SAX parser.
    Important:
    Despite the fact that the method signature indicates that the return value is %Library.Integer, the method should return an instance of %XML.SAX.StreamAdapter or a subclass of that class.
    Also, identifiers that reference external entities are always passed to the resolveEntity() method exactly as specified in the document. In particular, if such an identifier is a relative URL, it is passed as a relative URL. The location of the referencing document is not passed to the resolveEntity() method, so the entity cannot be resolved. In such scenarios, use the default entity resolver rather than a custom one.
    For an example of an entity resolver class, see the source code for %XML.SAX.EntityResolver.
  2. When you read an XML document, do the following:
    1. Create an instance of your entity resolver class.
    2. Use that instance when you read the XML document, as described in Specifying the Parser Options, earlier in this chapter.
Also see the previous section, Disabling Entity Resolution; note that %XML.SAX.NullEntityResolver (discussed in that section) is a subclass of %XML.SAX.EntityResolver.
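As a rough illustration of step 1, the following sketch shows a resolver that serves one known DTD from a local file and reports every other entity as unresolvable. The class name MyApp.LocalDTDResolver, the public identifier, and the file path are all hypothetical; the sketch also assumes that %Stream.FileBinary is a suitable stream class to wrap.

```objectscript
Class MyApp.LocalDTDResolver Extends %XML.SAX.EntityResolver
{

Method resolveEntity(publicID As %Library.String, systemID As %Library.String) As %Library.Integer
{
    // hypothetical public identifier; anything else is reported as unresolvable
    If publicID'="-//MyApp//DTD Example//EN" {
        Quit $$$NULLOREF
    }

    // link a file stream to a local copy of the DTD (placeholder path)
    Set stream=##class(%Stream.FileBinary).%New()
    Set sc=stream.LinkToFile("c:/temp/example.dtd")
    If $$$ISERR(sc) {
        Quit $$$NULLOREF
    }

    // wrap the stream so that the parser can read it
    Quit ##class(%XML.SAX.StreamAdapter).%New(stream)
}

}
```

For step 2, an instance of this class could be assigned to the EntityResolver property of a %XML.Reader instance, in the same way that the previous section assigns an instance of %XML.SAX.NullEntityResolver.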
Example 1
For example, consider the following XML document:
<?xml version="1.0" ?>
<!DOCTYPE html SYSTEM  "c://temp/html.dtd">
<html>
<head><title></title></head>
<body>
<p>Some < xhtml-content > with custom entities &entity1; and &entity2;.</p>
<p>Here is another paragraph with &entity1; again.</p>
</body></html>
This document uses the following DTD:
<!ENTITY entity1
         PUBLIC "-//WRC//TEXT entity1//EN"
         "http://www.intersystems.com/xml/entities/entity1">
<!ENTITY entity2
         PUBLIC "-//WRC//TEXT entity2//EN"
         "http://www.intersystems.com/xml/entities/entity2">
<!ELEMENT html (head, body)>
<!ELEMENT head (title)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT body (p)>
<!ELEMENT p (#PCDATA)>
To read this document, you would need a custom entity resolver like the following:
Class CustomResolver.Resolver Extends %XML.SAX.EntityResolver
{

Method resolveEntity(publicID As %Library.String, systemID As %Library.String) As %Library.Integer
{
    Try {
        Set res=##class(%Stream.TmpBinary).%New()
        //check if we are here to resolve a custom entity
        If systemID="http://www.intersystems.com/xml/entities/entity1" 
        {
            Do res.Write("Value for entity1")
            Set return=##class(%XML.SAX.StreamAdapter).%New(res)
        }
        Elseif systemID="http://www.intersystems.com/xml/entities/entity2" 
        {
            Do res.Write("Value for entity2")
            Set return=##class(%XML.SAX.StreamAdapter).%New(res)
        }
        Else //otherwise call the default resolver
        {
            Set res=##class(%XML.SAX.EntityResolver).%New()
            Set return=res.resolveEntity(publicID,systemID)
        }
    }
    Catch 
    {
        Set return=$$$NULLOREF
    }
    Quit return
}

}
The following class contains a demo method that parses the file shown earlier and uses this custom resolver:
Include (%occInclude, %occSAX)

Class CustomResolver.ParseFileDemo
{

ClassMethod ParseFile() As %Status
{
    Set res= ##class(CustomResolver.Resolver).%New()  
    Set file="c:/temp/html.xml"
    Set parsemask=$$$SAXALLEVENTS+$$$SAXERROR
    Set status=##class(%XML.TextReader).ParseFile(file,.textreader,res,,parsemask,,0)
    If $$$ISERR(status) {Do $system.OBJ.DisplayError(status) Quit status}

    Write !,"Parsing the file ",file,! 
    Write "Custom entities in this file:"
    While textreader.Read()
    {
        If textreader.NodeType="entity"{
            Write !, "Node:", textreader.seq
            Write !,"    name: ", textreader.Name
            Write !,"    value: ", textreader.Value
        }
    }

    Quit $$$OK
}

}
The following shows the output of this method, in a Terminal session:
GXML>d ##class(CustomResolver.ParseFileDemo).ParseFile()
 
Parsing the file c:/temp/html.xml
Custom entities in this file:
Node:13
    name: entity1
    value: Value for entity1
Node:15
    name: entity2
    value: Value for entity2
Node:21
    name: entity1
    value: Value for entity1
Example 2
As another example, suppose that you need to read an XML document that contains the following (assuming that c:\cachesys is your Caché installation directory; see Default Caché Installation Directory in the Caché Installation Guide for the actual location on your system):
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
 "c:\cachesys\csp\docbook\doctypes\docbook\docbookx.dtd">
In this case, the resolveEntity() method would be invoked with publicID set to -//OASIS//DTD DocBook XML V4.1.2//EN and systemID set to c:\cachesys\csp\docbook\doctypes\docbook\docbookx.dtd.
The resolveEntity() method determines the correct source for the external entity, obtains it as a stream, wraps the stream in an instance of %XML.SAX.StreamAdapter, and returns that instance. The XML parser reads the entity definition from this specialized stream.
For an example, refer to the %XML.Catalog and %XML.CatalogResolver classes included in the Caché library. The %XML.Catalog class defines a simple database that associates public and system identifiers with URLs. The %XML.CatalogResolver class is an entity resolver class that uses this database to find the URL for a given identifier. The %XML.Catalog class can load its database from an SGML-style catalog file; this file maps identifiers to URLs in a standard format.
Creating a Custom Content Handler
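You can create a custom content handler for your own needs, if you call the Caché SAX Parser directly. This section gives an overview of creating custom content handlers, describes the customizable methods of the SAX content handler and the argument lists for the SAX parsing methods, and ends with a SAX handler example.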
Overview of Creating Custom Content Handlers
To customize how the Caché SAX Parser imports and handles XML, create and use a custom SAX content handler. Specifically, create a subclass of %XML.SAX.ContentHandler. Then, in the new class, override any of the default methods to perform the actions that are required. Use the new content handler as an argument when you parse an XML document; to do this, you use the parsing methods of the %XML.SAX.Parser class.
This operation is illustrated in the following diagram:
The process for creating and using a custom import mechanism is as follows:
  1. Create a class that extends %XML.SAX.ContentHandler.
  2. In that class, include the methods that you wish to override and provide new definitions as needed.
  3. Write a class method that reads an XML document by using one of the parsing methods of the %XML.SAX.Parser class, namely ParseFile(), ParseStream(), ParseString(), or ParseURL().
    When you call the parsing method, specify your custom content handler as an argument.
Customizable Methods of the SAX Content Handler
The %XML.SAX.ContentHandler class automatically executes certain methods at specific times. By overriding them, you can customize the behavior of your content handler.
Responding to Events
The %XML.SAX.ContentHandler class parses an XML file and generates events when it reaches particular points in the XML file. Depending on the event, a different method is executed. These methods include startDocument(), endDocument(), startElement(), endElement(), and characters().
These methods are empty by default, and you can override them in your custom content handler. For information on their expected argument lists and return values, see the class documentation for %XML.SAX.ContentHandler.
Handling Errors
The %XML.SAX.ContentHandler class also executes methods, such as error() and fatalError(), when it encounters certain errors.
These methods are empty by default, and you can override them in your custom content handler. For information on their expected argument lists and return values, see the class documentation for %XML.SAX.ContentHandler.
Computing the Event Mask
When you call the Caché SAX Parser (via the %XML.SAX.Parser class), you can specify a mask argument that indicates which callbacks are interesting. If you do not specify a mask argument, the parser calls the Mask() method of the content handler. This method returns an integer that specifies the composite mask that corresponds to your overridden methods of the content handler.
For example, suppose that you create a custom content handler that contains new versions of the startElement() and endElement() methods. In this case, the Mask() method returns a numeric value that is equivalent to the sum of $$$SAXSTARTELEMENT and $$$SAXENDELEMENT, the flags that correspond to these two events. If you do not specify a mask argument to the parsing method, the parser calls the Mask() method of your content handler and thus processes only those two events.
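If you prefer to pass the mask explicitly rather than rely on Mask(), the call might look like the following sketch. It assumes the argument order given in Argument Lists for the SAX Parsing Methods, later in this chapter, where the mask is the fifth argument. MyApp.MyHandler stands for a handler that overrides startElement() and endElement(), and myfile is a placeholder.

```objectscript
...
#include %occInclude
#include %occSAX
...
 // a handler that overrides startElement() and endElement()
 set handler = ##class(MyApp.MyHandler).%New()

 // process only the start-element and end-element callbacks
 set mask = $$$SAXSTARTELEMENT
               + $$$SAXENDELEMENT

 // arguments: source, handler, resolver (omitted), flags (omitted), mask
 set status = ##class(%XML.SAX.Parser).ParseFile(myfile,handler,,,mask)
 if $$$ISERR(status) {do $System.Status.DisplayError(status)}
```

In this case the explicit mask selects the same two events that Mask() would report, so it is mainly useful when you want to process fewer events than the handler overrides.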
Other Useful Methods
The %XML.SAX.ContentHandler class provides other methods that are useful in special situations; for the list, see the class documentation for %XML.SAX.ContentHandler.
These methods are final and cannot be overridden.
Argument Lists for the SAX Parsing Methods
To specify a document source, you use the ParseFile(), ParseStream(), ParseString(), or ParseURL() method of the %XML.SAX.Parser class. In any case, the source document must be a well-formed XML document; that is, it must obey the basic rules of XML syntax. The complete argument list is as follows, in order:
  1. pFilename, pStream, pString, or pURL — The document source.
  2. pHandler — A content handler, which is an instance of the %XML.SAX.ContentHandler class.
  3. pResolver — An entity resolver to use when parsing the source. See Performing Custom Entity Resolution, earlier in this chapter.
  4. pFlags — Flags to control the validation and processing performed by the SAX parser. See Setting the Parser Flags, earlier in this chapter.
  5. pMask — A mask to specify which items are of interest in the XML source. Usually you do not need to specify this argument, because for the parsing methods of %XML.SAX.Parser, the default mask is 0. This means that the parser calls the Mask() method of the content handler. That method computes the mask by detecting (during compilation) all the event callbacks that you customized in the content handler. Only those event callbacks are processed. However, if you want to specify the mask, see Specifying the Event Mask, earlier in this chapter.
  6. pSchemaSpec — A schema specification, against which to validate the document source. This argument is a string that contains a comma-separated list of namespace/URL pairs:
    "namespace URL,namespace URL"
    Here namespace is the XML namespace used for the schema and URL is a URL that gives the location of the schema document. There is a single space character between the namespace and URL values.
  7. pHttpRequest (For the ParseURL() method only) — The request to the web server, as an instance of %Net.HttpRequest.
    For details on %Net.HttpRequest, see the book Using Caché Internet Utilities. Or see the class documentation for %Net.HttpRequest.
  8. pSSLConfiguration — Configuration name of a client SSL/TLS configuration.
    See Using HTTPS, later in this chapter.
Note:
Notice that this argument list is slightly different from that of the parse methods of the %XML.TextReader class. For one difference, %XML.TextReader does not provide an option to specify a custom content handler.
A SAX Handler Example
Suppose you want a list of all the XML elements that appear in a file. To do this, you simply need to note every start element. The process is as follows:
  1. Create a class, here called MyApp.MyHandler, which extends %XML.SAX.ContentHandler:
    Class MyApp.MyHandler Extends %XML.SAX.ContentHandler
    {
    }
  2. Override the startElement() method with the following content:
    Class MyApp.MyHandler extends %XML.SAX.ContentHandler
    {
    // ...
    
    Method startElement(uri as %String, localname as %String, 
                 qname as %String, attrs as %List)
    {
        //we have found an element
        write !,"Element: ",localname
    }
    
    }
    
  3. Add a class method to the handler class that reads and parses an external file:
    Class MyApp.MyHandler extends %XML.SAX.ContentHandler
    {
    // ...
    ClassMethod ReadFile(file as %String) as %Status
    {
        //create an instance of this class
        set handler=..%New()
    
        //parse the given file using this instance
        set status=##class(%XML.SAX.Parser).ParseFile(file,handler)
    
        //quit with status
        quit status
    }
    }
    Note that this is a class method because it is invoked in an application to perform its processing. This method does the following:
    1. It creates an instance of a content handler object:
          set handler=..%New()
    2. It invokes the ParseFile() method of the %XML.SAX.Parser class. This validates and parses the document (specified by file) and invokes the various event handling methods of the content handler object:
          set status=##class(%XML.SAX.Parser).ParseFile(file,handler)
      Each time an event occurs while the parser parses the document (such as a start or end element), the parser invokes the appropriate method in the content handler object. In this example, the only overridden method is startElement(), which then writes out element names. For other events, such as reaching end elements, nothing happens (the default behavior).
    3. When the ParseFile() method reaches the end of the file, it returns. The handler object goes out of scope and is automatically removed from memory.
  4. At the appropriate point in the application, invoke the ReadFile() method, passing it the file to parse:
     Do ##class(MyApp.MyHandler).ReadFile(filename)
    Here filename is the path of the file being read.
For instance, if the content of the file is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<Root>
  <Person>
    <Name>Edwards,Angela U.</Name>
    <DOB>1980-04-19</DOB>
    <GroupID>K8134</GroupID>
    <HomeAddress>
      <City>Vail</City>
      <Zip>94059</Zip>
    </HomeAddress>
    <Doctors>
      <Doctor>
        <Name>Uberoth,Wilma I.</Name>
      </Doctor>
      <Doctor>
        <Name>Wells,George H.</Name>
      </Doctor>
    </Doctors>
  </Person>
</Root>
Then the output of this example is as follows:
Element: Root
Element: Person
Element: Name
Element: DOB
Element: GroupID
Element: HomeAddress
Element: City
Element: Zip
Element: Doctors
Element: Doctor
Element: Name
Element: Doctor
Element: Name
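A handler can also accumulate state instead of writing output. The following variation is a sketch only: the class name MyApp.CountingHandler, the ElementCount property, and the CountElements() method are hypothetical names introduced for illustration. It reuses the startElement() signature and the ParseFile() call shown above.

```objectscript
Class MyApp.CountingHandler Extends %XML.SAX.ContentHandler
{

/// Number of start-element events seen so far
Property ElementCount As %Integer [ InitialExpression = 0 ];

Method startElement(uri as %String, localname as %String, 
             qname as %String, attrs as %List)
{
    //count the element instead of writing its name
    set ..ElementCount = ..ElementCount + 1
}

ClassMethod CountElements(file as %String, Output count as %Integer) as %Status
{
    set count = 0

    //create an instance of this class and parse the file with it
    set handler = ..%New()
    set status = ##class(%XML.SAX.Parser).ParseFile(file,handler)

    //report the count only if parsing succeeded
    if $$$ISOK(status) {set count = handler.ElementCount}
    quit status
}

}
```

For the sample file above, this handler should count one element for each line of the output shown.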
Using HTTPS
%XML.SAX.Parser supports HTTPS. That is, you can use this class to parse documents that are served at HTTPS locations and to resolve entities that are served at HTTPS locations.
In all cases, if any of these items are served at an HTTPS location, do the following:
  1. Use the Management Portal to create an SSL/TLS configuration that contains the details of the needed connection. For information, see the chapter Using SSL/TLS with Caché in the Caché Security Administration Guide.
    This is a one-time step.
  2. When you invoke the applicable parsing method of %XML.SAX.Parser, specify the pSSLConfiguration argument.
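As a sketch of step 2, the following fragment parses a document at an HTTPS location. It assumes the argument order given in Argument Lists for the SAX Parsing Methods, earlier in this chapter, where pSSLConfiguration is the eighth argument of ParseURL(). The URL, the configuration name MySSLConfig, and the handler class MyApp.MyHandler are placeholders.

```objectscript
 // placeholder URL and SSL/TLS configuration name
 set url = "https://www.example.com/data/sample.xml"
 set sslconfig = "MySSLConfig"

 // any subclass of %XML.SAX.ContentHandler
 set handler = ##class(MyApp.MyHandler).%New()

 // arguments 3 through 7 (resolver, flags, mask, schema specification,
 // HTTP request) are omitted; the SSL/TLS configuration is the eighth argument
 set status = ##class(%XML.SAX.Parser).ParseURL(url,handler,,,,,,sslconfig)
 if $$$ISERR(status) {do $System.Status.DisplayError(status)}
```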
By default, Caché uses the Xerces entity resolution. %XML.SAX.Parser uses its own entity resolution only in the following cases: