docs.intersystems.com
Home  /  First Look: Scaling InterSystems IRIS for Data Volume with Sharding


Articles
First Look: Scaling InterSystems IRIS for Data Volume with Sharding
InterSystems: The power behind what matters   
Search:  


This First Look introduces you to the InterSystems IRIS™ sharding feature and its use in a sharded cluster to horizontally scale InterSystems IRIS Data Platform™ for data volume.
As part of this First Look, you will use ICM to provision a sharded cluster in a public cloud and see how sharding a table distributes its rows across the shards in a cluster.
How Can Sharding Help You?
Are you feeling the Big Data heat?
Ready or not, we are all managing more data than ever before and being asked to do more with it — and the response times demanded are growing ever shorter. Whether you care for ten million patients, process billions of financial orders a day, track a galaxy of stars, or monitor a thousand factory engines, your data platform must not only support your current data workload but scale to meet increasing demand while maintaining the performance standards you rely on and avoiding business disruption. Each business-specific workload presents different challenges to the data platform on which it operates — and as workloads grow, those challenges become even more acute.
InterSystems IRIS includes a comprehensive set of capabilities to scale your applications, which can be applied alone or in combination, depending on the nature of your workload and the specific performance challenges it faces. One of these, sharding, partitions both data and its associated cache across a number of servers, providing flexible, inexpensive performance scaling for queries and data ingestion while maximizing infrastructure value through highly efficient resource utilization. An InterSystems IRIS sharded cluster can provide significant performance benefits for a wide variety of applications, but especially for those with workloads that include one or more of the following:
Each of these factors on its own influences the potential gain from sharding, but the benefit may be enhanced where they combine. For example, a combination of all three factors — large amounts of data ingested quickly, large data sets, and complex queries that retrieve and process a lot of data — makes many of today’s analytic workloads very good candidates for sharding.
Note that these characteristics all have to do with data; the primary function of InterSystems IRIS sharding is to scale for data volume. But a sharded cluster can also include features that scale for user volume, when workloads involving some or all of these data-related factors also experience a very high query volume from large numbers of users. And sharding can be combined with vertical scaling as well. With InterSystems IRIS, you can create just the right overall scaling solution for your workload’s performance challenges.
How Does Sharding Work?
The heart of the sharded architecture is the partitioning of data and its associated cache across a number of systems. A sharded cluster partitions large database tables horizontally — that is, by row — across multiple InterSystems IRIS instances, called shard data servers, while allowing applications to transparently access these tables through a single instance, called the shard master data server (or just shard master). This architecture provides several advantages:
Nonsharded tables are stored on the shard master, while a federated software component called the sharding manager keeps track of which data is on which shard data servers and directs queries accordingly. From the perspective of the application SQL, the distinction between sharded and nonsharded tables is totally transparent.
A Basic Sharded Cluster
Try It! Deploy and Demo an InterSystems IRIS Sharded Cluster
In this exploration, you will
To introduce you to sharding without bogging you down in details, we’ve kept this simple. A sharded cluster in production requires planning and a number of decisions, so be sure not to confuse this exploration of sharding with the real thing! For example, when designing a production cluster, you would review your schema and tables to decide which tables are to be sharded, then decide how many shard data servers to deploy (typically on the order of 4 to 16) based on both the anticipated working set for the sharded data and the amount of memory available for the database caches on your servers. In this exploration, however, you’ll deploy a basic two-shard cluster. The sources listed at the end of this document will give you a good idea of what’s involved in using sharding in production. The chapter Horizontally Scaling InterSystems IRIS for Data Volume with Sharding in the Scalability Guide provides detailed information about sharding and sharded clusters.
These instructions assume you have an InterSystems IRIS sharding license and access to InterSystems software downloads.
Use ICM to Deploy the Cluster
You can find the procedure for using ICM to deploy the sharded cluster on the Amazon Web Services public cloud platform in First Look: InterSystems Cloud Manager; specifically in the section Try It! Deploy InterSystems IRIS in the Cloud with ICM. You can use the entire procedure as it is presented there. When you get to Customize definitions.json, be sure to make only one change to the definitions.json file — replace the value of the ISCLicense field with the location of your InterSystems IRIS sharding license within the ICM container. When you have finished the deployment stage (Deploy InterSystems IRIS), remain in the ICM container.
Note:
As an alternative, if you prefer, you can install the InterSystems IRIS instances on existing physical, virtual or cloud machines and deploy the sharded cluster using the Sharding API, as described in Deploy the Cluster Using the Sharding API in the Horizontally Scaling InterSystems IRIS for Data Volume with Sharding chapter of the Scalability Guide.
Use Different Shard Keys to Distribute Rows Differently, Then Query the Sharded Tables
To see how a sharded cluster partitions a sharded table across the shards based on the shard key you use, you are going to connect to the shard master and take the following steps:
Use the procedure that follows for this part of the exploration:
  1. Open a Terminal window on the shard master instance.
  2. In the Terminal window, switch to the cluster’s master namespace.
  3. Open the Terminal SQL shell:
    DB> do $SYSTEM.SQL.Shell()
    
    [SQL]DB>>
  4. Create and populate the nonsharded table, test.nonsharded, with the following SQL statements.
    CREATE TABLE test.nonsharded (field1 CHAR(5), field2 CHAR(5))
    INSERT INTO test.nonsharded (field1,field2) VALUES ('one','one')
    INSERT INTO test.nonsharded (field1,field2) VALUES ('one','two')
    INSERT INTO test.nonsharded (field1,field2) VALUES ('one','three')
    INSERT INTO test.nonsharded (field1,field2) VALUES ('two','one')
    INSERT INTO test.nonsharded (field1,field2) VALUES ('two','two')
    INSERT INTO test.nonsharded (field1,field2) VALUES ('two','three')
    INSERT INTO test.nonsharded (field1,field2) VALUES ('three','one')
    INSERT INTO test.nonsharded (field1,field2) VALUES ('three','two')
    INSERT INTO test.nonsharded (field1,field2) VALUES ('three','three')
    
    Use SELECT to see the contents of the table:
    SELECT * FROM test.nonsharded
    
    field1      field2
    one     one
    one     two
    one     three
    two     one
    two     two
    two     three
    three   one
    three   two
    three   three
    
  5. Create three sharded tables with the same fields that are in test.nonsharded, using the default shard key (RowID) for the first one, the field1 field for the second one, and the field2 field for the third, and populate each using an INSERT INTO statement selecting from test.nonsharded:
    CREATE TABLE test.rowid (field1 CHAR(5), field2 CHAR(5), SHARD)
    INSERT INTO test.rowid (field1,field2) SELECT field1,field2 FROM test.nonsharded
    
    CREATE TABLE test.field1 (field1 CHAR(5), field2 CHAR(5), SHARD KEY (field1))
    INSERT INTO test.field1 (field1,field2) SELECT field1,field2 FROM test.nonsharded
    
    CREATE TABLE test.field2 (field1 CHAR(5), field2 CHAR(5), SHARD KEY (field2))
    INSERT INTO test.field2 (field1,field2) SELECT field1,field2 FROM test.nonsharded
    
  6. Use SELECT *,%ID to display the contents of each sharded table with its RowID on the shard.
    SELECT *,%ID FROM test.rowid ORDER BY %ID
    
    field1      field2      ID
    one     one     1
    one     two     2
    two     one     3
    two     two     4
    three   three   5
    one     three   256000
    two     three   256001
    three   one     256002
    three   two     256003
    
    This distribution does not reflect the values of field1 or field2 (rows with all three values of each are located on both shards). If you delete, recreate, and reload test.rowid, the distribution may be different.
    SELECT *,%ID FROM test.field1 ORDER BY %ID
    
    field1      field2      ID
    one     one     1
    one     two     2
    one     three   3
    two     one     256000
    two     two     256001
    two     three   256002
    three   one     256003
    three   two     256004
    three   three   256005
    
    Sharding on the field1 field distributes the rows so that those with the same value of field1 are placed on the same shard. In this case, rows with value one are on one shard and those with values two and three on the other, but which values end up on which shard depends on how many shards and how many values there are.
    SELECT *,%ID FROM test.field2 ORDER BY %ID
    
    field1      field2      ID
    one     one     1
    two     one     2
    three   one     3
    one     two     256000
    one     three   256001
    two     two     256002
    two     three   256003
    three   two     256004
    three   three   256005
    
    Here, distribution is determined by the value of the field2 field.
  7. Finally, as an example of how sharding distributes work to the shard data servers, use the following SELECT statement with all three sharded tables:
    SELECT MAX(LENGTH(field2)) FROM <table>
    In each case, the result is 5, the length of the longest value, three, because the distribution of rows across the shards is entirely transparent to the query. The MAX(LENGTH(field2)) expression is computed independently on each shard, and the shard master chooses the MAX() of the results they return. For example, when the query is run on the test.field2 table, one shard returns 3, because it has only the value one in the field2 field, while the other shard returns 5; the shard master then chooses 5 as the larger of the two results.
    If you like, use EXPLAIN to show the query plan indicating explicitly how work is sent to the shards:
    EXPLAIN SELECT MAX(LENGTH(field2)) FROM <table>
More Sharded Cluster Options
Additional options for a sharded cluster include:
Learn More About Sharding
To learn more about sharding, see