Using the InterSystems Spark Connector
The InterSystems IRIS™ Spark Connector enables an InterSystems IRIS database to function as an Apache Spark data source. It implements a plug-compatible replacement for the standard Spark jdbc data source. This allows the results of a complex SQL query executed within the database to be retrieved by a Spark program as a Spark Dataset, and a Dataset to be written back into the database as a SQL table.
The Spark Connector has intimate knowledge of, and tight integration with, the underlying database server, which provides several advantages over the standard Spark jdbc data source:
The Connector recognizes a richer set of operators than the standard jdbc data source, allowing more operations to be 'pushed down' into the underlying database for execution.
An InterSystems IRIS database can be sharded, meaning that its tables are transparently partitioned across multiple servers running on different computers. The Connector allocates compute tasks so that each Spark executor is co-located with the server from which it draws its data. This not only reduces the movement of data across the network but, more importantly, allows the Spark program's performance to scale linearly with the number of shards, and so with the size of the data set, on which it operates.
The Connector exploits the server's innate ability to automatically parallelize certain queries, allowing large result sets to be returned quickly to the Spark driver program through multiple concurrent network connections. By contrast, the standard jdbc data source requires the user to explicitly specify how the result set is to be partitioned, which in practice is often very difficult to do well.
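To make the contrast concrete, the following is a minimal sketch of an explicitly partitioned read with the standard jdbc data source. The partitionColumn, lowerBound, upperBound, and numPartitions options belong to Spark's built-in JDBC source. The connection URL, table name, column name, and bounds are placeholder assumptions, and the sketch assumes that a suitable JDBC driver is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcPartitionedRead {
  // Placeholder values: the URL, table, column, and bounds are assumptions
  // chosen for illustration, not values taken from this document.
  val options: Map[String, String] = Map(
    "url"             -> "jdbc:IRIS://localhost:51773/USER",
    "dbtable"         -> "mytable",
    "partitionColumn" -> "id",        // must be a numeric, date, or timestamp column
    "lowerBound"      -> "1",         // used only to compute the partition stride
    "upperBound"      -> "1000000",
    "numPartitions"   -> "8"          // the user must pick this by hand
  )

  // Reads the table as eight partitions, each covering a range of id values.
  def read(spark: SparkSession): DataFrame =
    spark.read.format("jdbc").options(options).load()
}
```

Choosing a partition column and bounds that split the rows evenly requires knowing how the data is distributed, which is the difficulty described above.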
Data Source Provider Class Names
var df = spark.read
Since the full class name is very awkward to use, it is normally specified with the short alias jdbc:
var df = spark.read.format("jdbc").option("dbtable","mytable").load()
The InterSystems Spark Connector data source is referenced in exactly the same way, using either the full provider class name com.intersys.spark or the short alias iris:
var df = spark.read.format("iris").option("dbtable","mytable").load()
The terms jdbc and iris (lower case, in the same typography as other class names) are used frequently in this book, and always refer specifically to the data source provider class names, never to Java JDBC or InterSystems IRIS.
Requirements and Configuration
The Spark Connector requires the following:
Optional Configuration Settings
The Connector recognizes a number of configuration settings that parameterize its operation. These are parsed from the Apache Spark SparkConf configuration structure at startup and may be specified by:
values for the --conf option passed on the command line.
arguments to the SparkConf() constructor or its set() member functions, called from within the driver application itself.
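As a rough illustration of both mechanisms, the sketch below sets a value programmatically with SparkConf.set() and shows the equivalent --conf form in a comment. The key spark.example.connector.setting is a hypothetical placeholder, since this section does not list the Connector's actual setting names. The object name and application names are likewise invented for the example.

```scala
import org.apache.spark.SparkConf

object ConnectorConf {
  // Hypothetical placeholder key: substitute a real Connector setting name.
  val Key = "spark.example.connector.setting"

  // Mechanism 1: set the value from within the driver application.
  val conf: SparkConf = new SparkConf()
    .setAppName("connector-conf-example")
    .set(Key, "value")

  // Mechanism 2: pass the same key/value pair on the command line instead:
  //   spark-submit --conf spark.example.connector.setting=value my-app.jar
}
```

In Spark, properties set directly on the SparkConf take precedence over those passed with --conf, so a value given in both places resolves to the one set in code.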
, and password options specify connection string values for a read or write. The default values are automatically defined using information from the default InterSystems IRIS master instance specified in the SparkConf configuration. Connection options can be explicitly specified in a read or write operation (see Connection Options) to override the defaults.
Default values are also assigned to the following settings:
A positive integral number of seconds after which a connection to an instance is judged to have expired and is automatically closed and garbage collected. Default = 60.
is a regular expression (which may contain parenthesized groups)
is a replacement string (which may contain $ references to those groups)
If specified, this setting describes how to convert the host name of an InterSystems IRIS server into the host name of the preferred Spark worker that should handle requests to read and write dataset partitions to it.
The most common cluster configuration is to arrange that a Spark worker runs on each machine that hosts an InterSystems IRIS server, so that records need not travel across the network. It can happen, however, that the host name by which the master server knows its shard server differs from the host name by which the Spark master knows its worker, even though they are running on the very same machine. The host names could be aliased by a DNS server, for example, or the two processes could be running in separate Docker containers.
This setting offers a means of defining at install time a function that maps between the two host names.
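The kind of mapping this describes can be sketched with an ordinary regular-expression replacement. The example below is an illustration only: it uses java.lang.String.replaceAll, and the Connector's own matching rules may differ. The object, function, host names, pattern, and replacement string are all hypothetical.

```scala
object WorkerHostRewrite {
  // Maps the host name of a database server to the host name of the
  // preferred Spark worker. Hosts that do not match are returned unchanged.
  def preferredWorker(serverHost: String, pattern: String, rewrite: String): String =
    serverHost.replaceAll(pattern, rewrite)

  def main(args: Array[String]): Unit = {
    // Group 1 captures the shard number; $1 refers back to it.
    val pattern = "iris-(\\d+)\\.db\\.example\\.com"
    val rewrite = "spark-$1.compute.example.com"

    println(preferredWorker("iris-3.db.example.com", pattern, rewrite))
    // prints "spark-3.compute.example.com"
  }
}
```

A single pattern and replacement pair covers every shard whose names follow the same convention, so the mapping does not need to be listed host by host.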