
Spark Connector Quick Reference

This chapter is a quick reference to the InterSystems Spark Connector API, which defines a custom Scala interface for the Spark Connector. It extends the generic Spark interface with a set of implicit Scala classes and types that provide more convenience and better type safety.

Note:

This is not a definitive reference for the com.intersystems.spark API. It is intended as a “cheat sheet” containing only short descriptions of members and parameters, plus links to detailed documentation and examples.

For the most complete and up-to-date information, see the online Spark Connector API Reference.

The Spark Connector API provides the following extension methods to Spark Scala classes:

  • DataFrameReader Extension Methods:

    • iris() — executes a query or loads a table, tunes partitioning, and returns results in a DataFrame.

    • address() — specifies the connection details of the cluster to read from.

  • DataFrameWriter Extension Methods:

    • iris() — saves a DataFrame to the given table on the cluster.

    • address() — specifies the connection details of the cluster to write to.

    • description() — specifies an arbitrary description for the newly created table.

    • publicRowID() — specifies whether the master RowID field of the newly created table should be publicly visible.

    • shard() — specifies the shard key for the newly created table.

    • autobalance() — specifies how to distribute records saved to a table using a system-assigned shard key.

  • SparkSession and SparkContext Extension Methods:

    • rdd[T]() — executes a query or loads a table, tunes partitioning, formats each row of the result set with a specified function, and returns an RDD containing the formatted data.

    • dataset[T]() — executes a query or loads a table, tunes partitioning, and returns results in a Dataset with elements of type T.

  • ml.PipelineModel Extension Method:

    • iscSave() — saves a model to an InterSystems IRIS PMML definition class.

Spark Connector Method Reference

Each entry in this section provides a brief method overview including signatures and parameter descriptions. Most entries also include links to examples and more detailed information.

DataFrameReader Extension Methods

iris()

DataFrameReader.iris() executes a query on the cluster or loads the specified table, tunes partitioning, and returns results in a DataFrame. See “Using the iris() Read and Write Methods” for more information and examples.

def iris(text: String, mfpi: N = 1): DataFrame
def iris(text: String, column: String, lo: Long, hi: Long, partitions: N): DataFrame
  • text — text of a query to be executed, or name of a table to load.

  • column, lo, hi, partitions — allow you to explicitly specify partitioning parameters.

  • mfpi — allows the server to determine partitioning parameters within certain limits.

See “Partition Tuning Options” for detailed information and examples concerning partitioning parameters (column, lo, hi, partitions, and mfpi).
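
For illustration, here is a minimal read sketch. It assumes a SparkSession named spark (as in the Spark shell), the import that brings the extension methods into scope, and a placeholder table Accounts.Customer:

import org.apache.spark.sql.SparkSession
import com.intersystems.spark._   // brings the iris() extension methods into scope

val spark = SparkSession.builder().appName("IrisRead").getOrCreate()

// Load a whole table, letting the server choose up to 4 partitions:
val customers = spark.read.iris("Accounts.Customer", mfpi = 4)

// Execute a query, explicitly partitioning on a numeric column:
val balances = spark.read.iris(
  "SELECT id, balance FROM Accounts.Customer",
  column = "id", lo = 0L, hi = 1000000L, partitions = 8)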

address()

DataFrameReader.address() specifies the connection details of the cluster to read from. Overrides the default instance specified in the Spark configuration for the duration of this read operation.

def address(url: String, user: String = "", password: String = ""): DataFrameReader
  • url, user, password — String values used to define a JDBC connection to the master server.

See “Connection Options” for more information and examples.
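
For example, a sketch that reads from a specific instance (the URL, username, and password are placeholders for your own master instance):

val df = spark.read
  .address("IRIS://localhost:51773/USER", user = "_SYSTEM", password = "SYS")
  .iris("Accounts.Customer")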

DataFrameWriter Extension Methods

iris()

DataFrameWriter.iris() saves a DataFrame to the specified table on the cluster. See “Using the iris() Read and Write Methods” for detailed information and examples.

def iris(table: String): Unit
  • table — a String containing the name of the table to write to.

See “CREATE TABLE” in the InterSystems SQL Reference for more information on table creation options.
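
A minimal write sketch, assuming a SparkSession spark with the connector extensions imported; the table names are placeholders:

import org.apache.spark.sql.SaveMode
import com.intersystems.spark._

val customers = spark.read.iris("Accounts.Customer")   // placeholder source table

// Save the DataFrame, replacing the target table if it already exists:
customers.write
  .mode(SaveMode.Overwrite)
  .iris("Accounts.CustomerCopy")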

address()

DataFrameWriter.address() specifies the connection details of the cluster to write to. Overrides the default instance specified in the Spark configuration for the duration of this write operation.

def address(url: String, user: String = "", password: String = ""): DataFrameWriter[T]
  • url, user, password — String values used to define a JDBC connection to the master server.

See “Connection Options” for more information and examples.
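
Continuing the earlier sketch (customers is the DataFrame read above; the connection details are placeholders):

customers.write
  .address("IRIS://remotehost:51773/USER", user = "_SYSTEM", password = "SYS")
  .iris("Accounts.CustomerCopy")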

description()

DataFrameWriter.description() specifies a description to document the newly created table.

def description(value: String): DataFrameWriter[T]
  • value — a String containing an arbitrary description for the table.

See “InterSystems IRIS Save Options” for more information and examples.
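
For example, a sketch that attaches a description when the table is created (the table name and description text are placeholders):

customers.write
  .description("Nightly snapshot of customer balances")
  .iris("Accounts.CustomerSnapshot")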

publicRowID()

DataFrameWriter.publicRowID() specifies whether the master RowID field of the newly created table should be publicly visible.

def publicRowID(value: Boolean): DataFrameWriter[T]
  • value — a Boolean indicating whether the master RowID field of the table should be publicly visible.

See “InterSystems IRIS Save Options” for more information and examples.
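
For example (the table name is a placeholder):

customers.write
  .publicRowID(true)   // make the master RowID field of the new table publicly visible
  .iris("Accounts.CustomerSnapshot")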

shard()

DataFrameWriter.shard() specifies the user-defined shard key for the newly created table, or specifies whether the newly created table is to be sharded. Has no effect if the table already exists and the save mode is anything other than OVERWRITE.

def shard(fields: String*): DataFrameWriter[T]
def shard(value: Boolean): DataFrameWriter[T]
  • fields — a String sequence of field names (possibly empty) to be used as the user-defined shard key. If the sequence is empty, the table will be sharded on the system-assigned key.

  • value — a Boolean indicating whether the newly created table is to be sharded.

See “InterSystems IRIS Save Options” for more information and examples.
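
For example, a sketch showing both overloads (the field and table names are placeholders):

// Shard the new table on a user-defined key:
customers.write.shard("id").iris("Accounts.ShardedCustomer")

// Or simply request sharding on the system-assigned key:
customers.write.shard(true).iris("Accounts.ShardedCustomer")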

autobalance()

DataFrameWriter.autobalance() specifies whether inserted records should be evenly distributed among the available shards of the cluster when saving to a table that is sharded on a system-assigned shard key. Has no effect if the table is not sharded, or is sharded using a custom shard key.

def autobalance(value: Boolean): DataFrameWriter[T]
  • value — a Boolean: true to evenly distribute records among the available shards of the cluster, or false to save records to the shards 'closest' to where the partitions of the dataset reside.

See “InterSystems IRIS Save Options” for more information and examples.
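
For example (the table name is a placeholder):

customers.write
  .autobalance(false)   // save each partition's records to the 'closest' shard
  .iris("Accounts.ShardedCustomer")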

SparkSession and SparkContext Extension Methods

dataset[T]()

SparkSession.dataset[T]() executes a query on the cluster or loads the specified table, optionally tunes partitioning, and returns results in a Dataset[T]. See “Using the dataset[T]() and rdd[T]() Methods” for more information and examples.

def dataset[T](text: String, mfpi: N = 1)(implicit arg0: Encoder[T]): Dataset[T]
def dataset[T](text: String, column: String, lo: Long, hi: Long, partitions: N)(implicit arg0: Encoder[T]): Dataset[T]
  • text — text of a query to be executed, or name of a table to load.

  • column, lo, hi, partitions — allow you to explicitly specify partitioning parameters.

  • mfpi — allows the server to determine partitioning parameters within certain limits.

See “Partition Tuning Options” for detailed information and examples concerning partitioning options.
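
A minimal sketch, assuming a SparkSession named spark (as in the Spark shell) and a placeholder table whose columns match the fields of the case class:

import com.intersystems.spark._

case class Customer(id: Long, name: String)

import spark.implicits._   // supplies the implicit Encoder[Customer]

val ds = spark.dataset[Customer]("SELECT id, name FROM Accounts.Customer", mfpi = 4)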

rdd[T]()

SparkContext.rdd[T]() executes a query on the cluster or loads the specified table, optionally tunes partitioning, and computes a suitably partitioned RDD whose elements are formatted by the provided formatting function (an instance of Format[T]). See “Using the dataset[T]() and rdd[T]() Methods” for more information and examples.

def rdd[T](text: String, mfpi: N, format: Format[T])(implicit arg0: ClassTag[T]): RDD[T]
def rdd[T](text: String, column: String, lo: Long, hi: Long, partitions: N, format: Format[T])(implicit arg0: ClassTag[T]): RDD[T]
  • text — text of a query to be executed, or name of a table to load.

  • column, lo, hi, partitions — allow you to explicitly specify partitioning parameters.

  • mfpi — allows the server to determine partitioning parameters within certain limits.

  • format — an instance of Format[T] containing the function used to encode each row of the result set.

See “Partition Tuning Options” for detailed information and examples concerning partitioning parameters (column, lo, hi, partitions, and mfpi).

See “Defining a Type Format[T] function” for more information on creating and using an instance of Format[T].
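
A sketch of a possible call, under the assumption that a Format[T] can be built from a function applied to each row of the underlying JDBC result set; see the topic above for the exact shape the connector expects. The query is a placeholder:

import java.sql.ResultSet
import com.intersystems.spark._

// Assumption: the formatting function reads columns by index from each row.
val toPair = (rs: ResultSet) => (rs.getLong(1), rs.getString(2))

val pairs = spark.sparkContext.rdd[(Long, String)](
  "SELECT id, name FROM Accounts.Customer", mfpi = 4, format = toPair)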

ml.PipelineModel Extension Method

iscSave()

ml.PipelineModel.iscSave() saves the PMML definition for the given model as a Class Definition on the master instance identified by address for subsequent execution on the cluster. See “Using PMML Models with InterSystems Products” for more information and examples.

def iscSave(klass: String, schema: StructType, address: Address = Address()): Unit
  • klass — name of the class definition to create. If this class already exists, it will be overwritten.

  • schema — The schema of the model’s source Dataset, on which the PMML file's data dictionary will be based.

  • address — The master instance to which the class will be written.
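
A minimal sketch, where model is a fitted PipelineModel, training is the Dataset it was fit on, and User.TitanicModel is a hypothetical class name. Omitting address targets the master instance from the Spark configuration:

import org.apache.spark.ml.PipelineModel
import com.intersystems.spark._

// 'model' and 'training' are placeholders for your fitted pipeline and its source data.
model.iscSave("User.TitanicModel", training.schema)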