InterSystems Implementation Reference for Java Third Party APIs
Apache Spark Support
The InterSystems IRIS™ Spark Connector is an implementation of the Data Source API for Apache Spark that allows the Spark data processing engine to make optimal use of the InterSystems IRIS Data Platform and its distributed data capabilities.
This chapter provides technical details about the InterSystems Spark Connector. The following topics are discussed:
Apache Spark is a high-performance Java analytics engine for use in clustered computing environments. At its heart is the Resilient Distributed Dataset (RDD), which represents a distributed, fault-tolerant collection of data that can be operated on in parallel. Spark includes libraries for SQL, machine learning, graph processing, stream processing, and many other functions.
Spark provides a jdbc data source that allows the results of a complex SQL query executed within the database to be retrieved by Spark as a Dataset, and allows a Dataset to be written back into the database as a SQL table.
The InterSystems IRIS data platform can connect to Spark using only the jdbc data source, but the InterSystems Spark Connector implements a custom iris data source that provides important enhancements for optimal performance.
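As a rough illustration of how the two data sources are selected, the sketch below builds the option maps that would be handed to Spark's reader for each provider name. It deliberately uses only the Java standard library, so it runs without Spark installed. The class name DataSourceOptions, the connection URL, and the table name are hypothetical. The keys "url", "dbtable", and "driver" are standard options of Spark's jdbc data source; that the iris data source accepts the same "url" and "dbtable" keys is an assumption here, not something this chapter states.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DataSourceOptions {

    // Provider names as written in this chapter.
    public static final String JDBC = "jdbc";
    public static final String IRIS = "iris";

    // Builds the options for reading one table through the given provider.
    // Assumption: both providers accept "url" and "dbtable".
    public static Map<String, String> readOptions(String format, String url, String table) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("url", url);
        opts.put("dbtable", table);
        if (JDBC.equals(format)) {
            // The generic jdbc source may need the JDBC driver class named explicitly.
            opts.put("driver", "com.intersystems.jdbc.IRISDriver");
        }
        return opts;
    }

    public static void main(String[] args) {
        String url = "jdbc:IRIS://localhost:1972/USER"; // hypothetical host, port, namespace
        String table = "Sample.Person";                 // hypothetical table name
        for (String format : new String[] {JDBC, IRIS}) {
            Map<String, String> opts = readOptions(format, url, table);
            System.out.println(format + " -> " + opts);
            // With Spark on the classpath, the map would be used roughly as:
            //   Dataset<Row> df = spark.read().format(format).options(opts).load();
        }
    }
}
```

Under these assumptions, switching from the generic provider to the connector amounts to changing the format name passed to the reader, while the rest of the Spark program stays the same.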
The terms jdbc and iris (lower case, in the same typography as other class names) are used frequently in this book, and always refer specifically to the data source provider class names, never to Java JDBC or InterSystems IRIS.
Also see the following related documents:
The InterSystems IRIS Spark Connector is a plug-compatible replacement for the Spark jdbc data source.
The InterSystems IRIS Spark Connector provides InterSystems-specific extension methods to improve usability.
Spark is implemented using a combination of Java and Scala, and can run on any JVM. The Data Source API and all extensions are implemented in Scala.