
Spark Connector Quick Reference

This chapter is a quick reference to the InterSystems Spark Connector API methods, which define a custom Scala interface for the Spark Connector. It extends the generic Spark interface with a set of implicit Scala classes and types that provide more convenience and better type safety.
Note:
This is not a definitive reference for this API. It is intended as a “cheat sheet” containing only short descriptions of members and parameters, plus links to detailed documentation and examples.
For the most complete and up-to-date information, see the Spark Connector online documentation.
The Spark Connector API provides the following extension methods to Spark Scala classes:

Spark Connector Method Reference

Each entry in this section provides a brief method overview including signatures and parameter descriptions. Most entries also include links to examples and more detailed information.

DataFrameReader Extension Methods

iris()
DataFrameReader.iris() executes a query on the cluster or loads the specified table, tunes partitioning, and returns results in a DataFrame. See “Using the iris() Read and Write Methods” for more information and examples.
def iris(text: String, mfpi: N = 1): DataFrame

def iris(text: String, column: String, lo: Long, hi: Long, partitions: N): DataFrame

  • text — text of a query to be executed, or name of a table to load.
  • column, lo, hi, partitions — allow you to explicitly specify partitioning parameters.
  • mfpi — allows the server to determine partitioning parameters within certain limits.
See “Partition Tuning Options” for detailed information and examples concerning partitioning parameters (column, lo, hi, partitions, and mfpi).
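For example, a minimal read sketch, assuming a SparkSession named spark with the Spark Connector on the classpath, and a hypothetical table Owls with a numeric ID column:

import org.apache.spark.sql.DataFrame

// Let the server choose partitioning, using at most 4 factors of parallelism.
val owls: DataFrame = spark.read.iris("SELECT * FROM Owls", mfpi = 4)

// Explicitly split the read into 8 partitions over ID values in [0, 10000).
val owlsById: DataFrame = spark.read.iris("Owls", "ID", 0L, 10000L, 8)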
address()
DataFrameReader.address() specifies the connection details of the cluster to read from. Overrides the default instance specified in the Spark configuration for the duration of this read operation.
def address(url: String, user: String = "", password: String = ""): DataFrameReader

  • url, user, password — String values used to define a JDBC connection to the master server.
See “Connection Options” for more information and examples.
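For example, a sketch that overrides the configured connection for a single read; the URL, user name, and password shown here are placeholders:

// Connect to a specific master instance for this read only.
val df = spark.read
  .address("IRIS://server.example.com:51773/USER", "myUser", "myPassword")
  .iris("SELECT name FROM Owls")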

DataFrameWriter Extension Methods

iris()
DataFrameWriter.iris() saves a DataFrame to the specified table on the cluster. See “Using the iris() Read and Write Methods” for detailed information and examples.
def iris(table: String): Unit
  • table — a String containing the name of the table to write to.
See “CREATE TABLE” in the InterSystems SQL Reference for more information on table creation options.
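For example, a minimal write sketch, assuming df is an existing DataFrame and Owls is a hypothetical target table:

// Save the contents of df as the Owls table on the cluster.
df.write.iris("Owls")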
address()
DataFrameWriter.address() specifies the connection details of the cluster to write to. Overrides the default instance specified in the Spark configuration for the duration of this write operation.
def address(url: String, user: String = "", password: String = ""): DataFrameWriter[T]

  • url, user, password — String values used to define a JDBC connection to the master server.
See “Connection Options” for more information and examples.
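A corresponding write sketch; the connection details are again placeholders:

// Write to a cluster other than the one in the Spark configuration.
df.write
  .address("IRIS://backup.example.com:51773/USER", "myUser", "myPassword")
  .iris("Owls")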
description()
DataFrameWriter.description() specifies a description to document the newly created table.
def description(value: String): DataFrameWriter[T]
  • value — a String containing an arbitrary description for the table.
See “InterSystems IRIS Save Options” for more information and examples.
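For example, a minimal sketch (the description text is hypothetical):

// Record a human-readable description with the newly created table.
df.write
  .description("Owl sightings from the 2019 field survey")
  .iris("Owls")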
publicRowID()
DataFrameWriter.publicRowID() specifies whether the master RowID field of the newly created table should be publicly visible.
def publicRowID(value: Boolean): DataFrameWriter[T]
  • value — a Boolean indicating whether the table should be publicly visible.
See “InterSystems IRIS Save Options” for more information and examples.
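For example:

// Make the master RowID field of the new table publicly visible.
df.write
  .publicRowID(true)
  .iris("Owls")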
shard()
DataFrameWriter.shard() specifies the user-defined shard key for the newly created table, or specifies whether the newly created table is to be sharded. Has no effect if the table already exists and the save mode is anything other than OVERWRITE.
def shard(fields: String*): DataFrameWriter[T]
def shard(value: Boolean): DataFrameWriter[T]

  • fields — a String sequence of field names (possibly empty) to be used as the user-defined shard key. If the sequence is empty, the table is sharded on the system-assigned key.
  • value — a Boolean value indicating whether the newly created table is to be sharded.
See “InterSystems IRIS Save Options” for more information and examples.
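For example, a sketch of both overloads (the field names are hypothetical):

// Shard the new table on a user-defined shard key ...
df.write.shard("Region", "Species").iris("Owls")

// ... or shard it on the system-assigned shard key.
df.write.shard(true).iris("Owls")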
autobalance()
DataFrameWriter.autobalance() specifies whether inserted records should be evenly distributed among the available shards of the cluster when saving to a table that is sharded on a system-assigned shard key. Has no effect if the table is not sharded, or is sharded using a custom shard key.
def autobalance(value: Boolean): DataFrameWriter[T]
  • value — a Boolean: true to evenly distribute records among the available shards of the cluster, or false to save records into the shards that are 'closest' to where the partitions of the dataset reside.
See “InterSystems IRIS Save Options” for more information and examples.
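For example, a sketch that combines autobalance() with a system-assigned shard key:

// Distribute inserted records evenly across the available shards.
df.write
  .shard(true)
  .autobalance(true)
  .iris("Owls")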

SparkSession and SparkContext Extension Methods

dataset[T]()
SparkSession.dataset[T]() executes a query on the cluster or loads the specified table, optionally tunes partitioning, and returns results in a Dataset[T]. See “Using the dataset[T]() and rdd[T]() Methods” for more information and examples.
def dataset[T](text: String, mfpi: N = 1)(implicit arg0: Encoder[T]): Dataset[T]
def dataset[T](text: String, column: String, lo: Long, hi: Long, partitions: N)(implicit arg0: Encoder[T]): Dataset[T]
  • text — text of a query to be executed, or name of a table to load.
  • column, lo, hi, partitions — allow you to explicitly specify partitioning parameters.
  • mfpi — allows the server to determine partitioning parameters within certain limits.
See “Partition Tuning Options” for detailed information and examples concerning partitioning options.
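For example, a minimal sketch, assuming a SparkSession named spark and a hypothetical Owls table (with a numeric ID column suitable for range partitioning) whose rows map onto the case class below:

import org.apache.spark.sql.Dataset

case class Owl(name: String, wingspan: Double)

import spark.implicits._   // supplies the implicit Encoder[Owl]

// Server-tuned partitioning ...
val owls: Dataset[Owl] =
  spark.dataset[Owl]("SELECT name, wingspan FROM Owls", mfpi = 4)

// ... or explicit partitioning on the ID column.
val owlsById: Dataset[Owl] =
  spark.dataset[Owl]("SELECT name, wingspan FROM Owls", "ID", 0L, 10000L, 8)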
rdd[T]()
SparkContext.rdd[T]() executes a query on the cluster or loads the specified table, optionally tunes partitioning, and computes a suitably partitioned RDD whose elements are formatted by the provided formatting function (an instance of Format[T]). See “Using the dataset[T]() and rdd[T]() Methods” for more information and examples.
def rdd[T](text: String, mfpi: N, format: Format[T])(implicit arg0: ClassTag[T]): RDD[T]

def rdd[T](text: String, column: String, lo: Long, hi: Long, partitions: N, format: Format[T])(implicit arg0: ClassTag[T]): RDD[T]
  • text — text of a query to be executed, or name of a table to load.
  • column, lo, hi, partitions — allow you to explicitly specify partitioning parameters.
  • mfpi — allows the server to determine partitioning parameters within certain limits.
  • format — an instance of Format[T] containing the function used to encode each row of the result set.
See “Partition Tuning Options” for detailed information and examples concerning partitioning parameters (column, lo, hi, partitions, and mfpi).
See “Format[T]” for more information on creating and using an instance of Format[T].
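For example, a sketch assuming the SparkContext is obtained from a SparkSession named spark, and that rowFormat is a Format[String] instance constructed as described in “Format[T]”:

import org.apache.spark.rdd.RDD

// rowFormat encodes each row of the result set as a String.
val names: RDD[String] =
  spark.sparkContext.rdd[String]("SELECT name FROM Owls", mfpi = 4, format = rowFormat)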

ml.PipelineModel Extension Method

iscSave()
ml.PipelineModel.iscSave() saves the PMML definition for the given model as a Class Definition on the master instance identified by address for subsequent execution on the cluster. See “Using PMML Models with InterSystems Products” for more information and examples.
def iscSave(klass: String, schema: StructType, address: Address = Address()): Unit
  • klass — name of the class definition to create. If this class already exists, it is overwritten.
  • schema — The schema of the model’s source Dataset, on which the PMML file's data dictionary will be based.
  • address — The master instance to which the class will be written.
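For example, a sketch assuming model is a fitted PipelineModel, trainingData is the Dataset it was fitted on, and User.OwlModel is a hypothetical class name; omitting the address argument targets the default master instance:

import org.apache.spark.ml.PipelineModel

// Save the model's PMML definition as a class on the default master instance.
model.iscSave("User.OwlModel", trainingData.schema)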