Apache Spark Support
The InterSystems IRIS® Spark Connector is an implementation of the Data Source API for Apache Spark that allows the Spark data processing engine to make optimal use of the InterSystems IRIS® data platform and its distributed data capabilities.
This chapter provides technical details about the InterSystems Spark Connector. The following topics are discussed:
Apache Spark and the InterSystems Spark Connector — provides an overview and resource links for the Spark Connector.
Spark Connector Extensions — lists and discusses InterSystems extensions to the Data Source API.
Apache Spark and the InterSystems Spark Connector
Apache Spark is a high-performance Java analytics engine for use in clustered computing environments. Its heart is the Resilient Distributed Dataset (RDD), which represents a distributed, fault-tolerant collection of data that can be operated on in parallel. Spark includes libraries for SQL, machine learning, graph processing, stream processing, and many other functions.
Spark provides a jdbc data source that allows the results of a complex SQL query executed within the database to be retrieved by Spark as a Dataset, and allows a Dataset to be written back into the database as a SQL table.
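As a rough illustration of that round trip, the following PySpark sketch reads the result of a SQL query through the jdbc data source and appends a Dataset (a DataFrame in Python) to a table. The helper names iris_jdbc_options, read_query, and append_to_table are hypothetical, and the connection URL layout and driver class name are assumptions to be checked against the installed JDBC driver. The query option also assumes Spark 2.4 or later.

```python
def iris_jdbc_options(host, port, namespace, user, password):
    """Build the connection options shared by reads and writes.

    Assumption: the URL layout and driver class name below match the
    JDBC driver installed on the Spark classpath.
    """
    return {
        "url": f"jdbc:IRIS://{host}:{port}/{namespace}",
        "driver": "com.intersystems.jdbc.IRISDriver",
        "user": user,
        "password": password,
    }


def read_query(spark, options, sql):
    """Run `sql` inside the database and return the result as a DataFrame."""
    return (
        spark.read.format("jdbc")
        .options(**options)
        .option("query", sql)  # the query itself executes in the database
        .load()
    )


def append_to_table(df, options, table):
    """Append the rows of `df` to the SQL table named `table`."""
    (
        df.write.format("jdbc")
        .options(**options)
        .option("dbtable", table)
        .mode("append")  # add rows instead of failing if the table exists
        .save()
    )
```

Given an existing SparkSession named spark, a caller would build the options once with iris_jdbc_options, pass them to read_query together with the SQL text, and later hand the transformed DataFrame to append_to_table. The host, port, namespace, and credentials are site-specific and are deliberately left as parameters.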
The InterSystems IRIS data platform can connect to Spark using only the jdbc data source, but the InterSystems Spark Connector implements a custom iris data source that provides important enhancements for optimal performance.
The terms jdbc and iris (lower case, in the same typography as other class names) are used frequently in this book, and always refer specifically to the data source provider class names, never to Java JDBC or InterSystems IRIS.
Installation and Configuration
See the following sections in Using the InterSystems Spark Connector for information on installation and configuration:
Requirements and Configuration provides InterSystems-specific configuration information.
Spark Connector Best Practices describes ways to optimize Spark Connector hardware and software.
Also see the following related documents:
Spark Connector Compliance and Compatibility
The InterSystems IRIS Spark Connector is a plug-compatible replacement for the Spark jdbc data source.
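Plug compatibility suggests that code written against the jdbc data source should need only its format name changed. The PySpark sketch below illustrates the idea under two assumptions: that the Spark Connector is on the Spark classpath so that the name iris resolves, and that it accepts the same dbtable and connection options as jdbc. The function load_table and its parameters are hypothetical.

```python
def load_table(spark, table, options=None, source="iris"):
    """Load `table` as a DataFrame through the named data source.

    `source` is "iris" (assumes the Spark Connector is installed) or
    "jdbc" (Spark's built-in data source). Everything else about the
    call is the same for both.
    """
    reader = spark.read.format(source).option("dbtable", table)
    if options:
        # Connection settings such as url, user, and password.
        reader = reader.options(**options)
    return reader.load()
```

Calling load_table with source="jdbc" and then with the default source="iris" against the same table is one way to compare the two providers without touching the rest of a program.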
Spark Connector Extensions
The InterSystems IRIS Spark Connector provides InterSystems-specific extension methods to improve usability.
Spark is implemented using a combination of Java and Scala, and can run on any JVM. The Data Source API and all extensions are implemented in Scala.