Welcome, Data Engineers
As a data engineer for your InterSystems® Data Fabric Studio™ solution, your task is to set up the data pipeline, a term that refers to all the infrastructure needed to make data available within the solution. This page provides an overview of that infrastructure, from start to finish.
When data is available within the system, you and data analysts can create analysis cubes and reports based on the resulting tables.
Data Sources and Schemas
In the product, a data source is a named configuration item that provides all the information needed to retrieve data from an external source. A classic data source is a database, but another possibility is an external system that provides an API via which an authenticated user can retrieve data. Yet another option is delimited files, either retrieved from a cloud location or pushed to the local file system. In all cases, the data source definition, usually configured by an administrator, contains all the information needed to retrieve structural information as well as data.
A data source can provide multiple kinds of data, each with a specific structure. The classic case is a database that contains many tables, where each table has a specific structure. Similarly, a specific API call returns data in a specific format. In the product, each of these structures is a schema, and the goal is to create a Data Catalog that consists of the schemas that are relevant to the business (and that are ultimately needed for reports and analysis). As a data engineer, your task is to import schemas from the data sources into the Data Catalog and then refine them and categorize them in ways that are useful to your organization.
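For example, a relational data source might contain a table like the following (a hypothetical structure, shown only to illustrate what a schema captures); importing the schema brings the table's column names and types into the Data Catalog:

    CREATE TABLE Sales.Invoice (
        InvoiceID     INTEGER PRIMARY KEY,   -- unique identifier in the source system
        CustomerName  VARCHAR(100),          -- customer as recorded by the source
        InvoiceDate   DATE,                  -- business date of the invoice
        TotalAmount   NUMERIC(12,2)          -- invoice total
    )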
To get started defining schemas, see Importing Schemas.
Recipes
Recipes describe how to load data from external sources into Data Fabric Studio or into external tables. A recipe consists of some or all of the following steps, in order:
- Staging activities, each of which loads data into a staging table.
- Transformation activities, which can clean data in various ways.
  Transformation activities update the staging table, either by adding new fields or overwriting existing fields. You can directly examine the staging table at any time.
- Validation activities, which can, for example, compare data to desired ranges.
- Reconciliation activities, in which you specify comparisons that define a valid reconciliation of the data.
  Each validation and reconciliation activity writes a report file with any errors. For either kind of activity, an error can either halt processing or simply result in a warning message, as you choose. In all cases, an appropriate user role is alerted via the workflow module so that the data can be examined and corrected if needed, and the recipe can be rerun.
- Data promotion activities, which use SQL to update a final table based on the contents of one or more staging tables (see the sketch after this list). The final table can be in the native database or in an external database.
A recipe can also include custom steps at any stage in the processing.
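Purely as an illustration (the staging and final tables here are hypothetical, and in practice you define these activities in the product rather than writing SQL by hand), the kinds of work that individual activities perform might look like this:

    -- Transformation: clean a field in the staging table in place
    UPDATE Staging_Invoice
    SET CustomerName = UPPER(TRIM(CustomerName));

    -- Validation: list rows whose values fall outside the desired range
    SELECT InvoiceID, TotalAmount
    FROM Staging_Invoice
    WHERE TotalAmount < 0 OR TotalAmount > 1000000;

    -- Data promotion: update the final table from the staging table
    INSERT INTO Final_Invoice (InvoiceID, CustomerName, InvoiceDate, TotalAmount)
    SELECT InvoiceID, CustomerName, InvoiceDate, TotalAmount
    FROM Staging_Invoice;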
To get started with this task, see Defining Recipes.
Analytics Cubes
Data Fabric Studio includes InterSystems IRIS® Adaptive Analytics, a Business Intelligence tool co-developed with AtScale. This means that data engineers and data analysts can define cubes based on the tables in the solution and then use those cubes for analytics. See Defining Cubes.
Snapshots
Data Fabric Studio provides an additional mechanism to support your data needs: snapshots. With snapshots, you can easily save data for later inspection by regulators; a snapshot can pull data from multiple tables as needed, write the results to a snapshot table, and apply a tag to the records. The product automatically stores all snapshot runs; that is, a new snapshot run does not overwrite a previous snapshot run.
When you have run a snapshot multiple times, applying a different tag each time, you can examine how that data changes over time. In particular, you can build a cube on the snapshot data, using the tag values as a dimension.
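Conceptually (again with hypothetical table and column names; the product manages snapshot tables and tags for you), a snapshot run and a tag-based comparison resemble the following:

    -- One snapshot run: copy the current data into the snapshot table with a tag
    INSERT INTO Snapshot_Invoice (SnapshotTag, InvoiceID, CustomerName, InvoiceDate, TotalAmount)
    SELECT '2024-Q4', InvoiceID, CustomerName, InvoiceDate, TotalAmount
    FROM Final_Invoice;

    -- Later runs use a different tag, so earlier runs are preserved,
    -- and the tag values can serve as a dimension when analyzing change over time
    SELECT SnapshotTag, SUM(TotalAmount) AS TotalByRun
    FROM Snapshot_Invoice
    GROUP BY SnapshotTag;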
See Defining Snapshots.
Scheduling
The Business Scheduler provides an easy way to schedule the running of tasks: running recipes, building cubes, and performing snapshots. See Scheduling and Running Tasks.
In the initial phases of implementation, you can configure tasks to be run manually (from the Business Scheduler).
Later, when you want the tasks to run on a schedule, you must define the applicable calendar information and manage dependencies among tasks. This works as follows:
- To define calendar information, you define entities, each of which has its own calendar. To simplify scheduling, the product supports a hierarchical system of entities, each of which can have its own business calendar but can inherit calendar details from its parent. See Defining Entities and Calendars.
- To specify dependencies among tasks, you apply a tag to a scheduled task and define dependency expressions in other tasks, referring to that tag. Then a task can be run (on a given day) only after the dependencies are fulfilled. See Managing Task Dependencies.