Introduction to the InterSystems Spark Connector
Due to continued improvements in both Apache Spark and the InterSystems JDBC driver, the InterSystems Spark Connector no longer provides significant advantages over the standard Spark JDBC connector. The Spark Connector is deprecated, and will be discontinued in an upcoming 2022 release.
The InterSystems IRIS® Spark Connector enables an InterSystems IRIS database to function as an Apache Spark data sourceOpens in a new tab. It implements a plug-compatible replacement for the standard Spark jdbc data sourceOpens in a new tab. This allows the results of a complex SQL query executed within the database to be retrieved by the Spark program as a Spark Dataset, and for a Dataset to be written back into the database as a SQL table.
The Spark Connector has an intimate knowledge of —and tight integration with —the underlying database server that provides several advantages over the standard Spark jdbc data source:
Predicate Push Down
The Connector recognizes a richer set of operators than the standard jdbc data source, allowing more operations to be 'pushed down' into the underlying database for execution.
An InterSystems IRIS database can be sharded, meaning that tables that are transparently partitioned across multiple servers running on different computers. The Connector allocates compute tasks so that each Spark executor is co-located with the server from which it draws its data. This not only reduces the movement of data across the network, but, more importantly, allows the Spark program's performance to scale linearly with the number of shards, and so size of the data set, on which it operates.
The Connector exploits the server's innate ability to automatically parallelize certain queries and so allow large result sets to be returned quickly to the Spark driver program through multiple concurrent network connections. By contrast, the standard jdbc data source requires the user to explicitly specify how the result set is to be partitioned, which in practice is often very difficult to do well.
Data Source Provider Class Names
Spark data sources are accessed through provider class names. The standard Spark jdbc data source provider class is named org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider, and can be used by specifying it in a call to format() as demonstrated in the following example:
var df = spark.read .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider") .option("dbtable","mytable").load()
Since the full class name is very awkward to use, it is normally specified with the short alias jdbc:
var df = spark.read.format("jdbc").option("dbtable","mytable").load()
The InterSystems Spark Connector data source is referenced in exactly the same way, using the full provider class name com.intersystems.spark or the short alias iris:
var df = spark.read.format("iris").option("dbtable","mytable").load()
The terms jdbc and iris (lower case, in the same typography as other class names) are used frequently in this book, and always refer specifically to the data source provider class names, never to Java JDBC or InterSystems IRIS.
Requirements and Configuration
The Spark Connector requires the following:
InterSystems IRIS — for more information on configuration, see “Spark Connector Best Practices” later in this book.
Java 1.8 — InterSystems IRIS does not support earlier versions.
Optional Configuration Settings
The Connector recognizes a number of configuration settings that parameterize its operation. These are parsed from the Apache Spark SparkConfconfiguration structure at startup and may be specified by:
the file spark-defaults.conf associated with the Spark cluster.
values for the --conf option passed on the command line.
arguments to the SparkConf()Opens in a new tab constructor or its set() member functions, called from within the driver application itself.
The url, user, and password options specify connection string values for a read or write. The default values are automatically defined using information from the default InterSystems IRIS master instance specified in the SparkConf configuration. Connection options can be explicitly specified in a read or write operation (see “Connection Options”) to override the defaults.
A string of the form "IRIS://host:port/namespace" that specifies the address of the default Spark master server to connect to if none is.
The user account with which to connect to the master server.
The password for the user account with which to connect to the master server.
Default values are also assigned to the following settings:
A positive integral number of seconds after which a connection to an instance is judged to have expired and is automatically closed and garbage collected. Default = 60.
An optional string of the form pattern => rewrite where:
pattern is a regular expression (which may contain parenthesized groups)
rewrite is a replacement string (which may contain $ references to those groups)
If specified, this setting describes how to convert the host name of an InterSystems IRIS server into the host name of the preferred Spark worker that should handle requests to read and write dataset partitions to it.
The most common cluster configuration is to arrange that a Spark worker runs on each machine that hosts an InterSystems IRIS server, for then records need not travel across the network. It can happen, however, that the host name by which the master server knows its shard server differs from the host name by which the Spark master knows its worker, even though they are running on the very same machine. Their host names could be aliased by a DNS server, for example, or could be running in separate Docker containers.
This setting offers a means of defining at install time a function that maps between the two host names.
For more information, see the Apache Spark documentation on “Spark ConfigurationOpens in a new tab” and the org.apache.spark.SparkConfOpens in a new tab class.