Welcome, Data Engineers (2.9)

As a data engineer for your InterSystems® Data Fabric Studio™ solution, your task is to set up the data pipeline, a term that refers to the infrastructure needed to make data available within the solution (or to push the data downstream). This page provides an overview of that infrastructure, from start to finish.

When data is available within the system, you and the data analysts can create analytics cubes and reports based on the resulting tables.

Data Sources and Schemas

In the product, a data source is a named configuration item that provides all the information needed to retrieve data from an external source. A classic data source is a database, but another possibility is an external system that provides an API via which an authenticated user can retrieve data. Yet another option is delimited files, either retrieved from a cloud location or pushed to the local file system. In all cases, the data source definition, usually configured by an administrator, contains everything needed to retrieve both structural information and the data itself.
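
For illustration only, the sketch below shows the kind of information a data source definition typically captures in each case. The field names are hypothetical, not the product's actual configuration settings; an administrator normally supplies these details through the administration interface.

    # Hypothetical sketches of data source definitions; all names and fields are
    # illustrative, not the product's actual configuration schema.
    database_source = {
        "name": "SalesWarehouse",
        "type": "database",
        "connection": {
            "host": "warehouse.example.com",
            "port": 5432,
            "database": "sales",
            "user": "etl_reader",  # credentials configured by an administrator
        },
    }

    api_source = {
        "name": "CrmExport",
        "type": "api",
        "base_url": "https://crm.example.com/api/v2",
        "auth": {"method": "oauth2", "client_id": "data-fabric"},
    }

    file_source = {
        "name": "DailyFeeds",
        "type": "delimited_files",
        "location": "s3://feeds-bucket/daily/",  # or a local directory for pushed files
        "delimiter": ",",
    }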

A data source can provide multiple kinds of data, each with a specific structure. The classic case is a database that contains many tables, where each table has a specific structure. Similarly, a specific API call returns data in a specific format. In the product, each of these structures is a schema, and the goal is to create a Data Catalog that consists of the schemas that are relevant to the business (and that are ultimately needed for reports and analysis). As a data engineer, your task is to import schemas from the data sources into the Data Catalog and then refine them and categorize them in ways that are useful to your organization.
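
To make the idea concrete, the following is a purely hypothetical example of a schema: the field-level structure of one table (or one API response) as it might appear once imported into the Data Catalog. The names and types are invented for illustration.

    # Hypothetical schema for one table from the "SalesWarehouse" source sketched above;
    # a schema records the field names and types of one table or API response.
    orders_schema = {
        "source": "SalesWarehouse",
        "name": "Orders",
        "fields": [
            {"name": "OrderID",     "type": "INTEGER"},
            {"name": "CustomerID",  "type": "INTEGER"},
            {"name": "OrderDate",   "type": "DATE"},
            {"name": "TotalAmount", "type": "NUMERIC(12,2)"},
        ],
    }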

To get started defining schemas, see Importing Schemas.

Recipes

Recipes describe how to load data from external sources into Data Fabric Studio or into external tables or files. A recipe consists of some or all of the following steps, performed in order (a code sketch follows this list):

  1. Staging activities, each of which loads data into a staging table.

  2. Transformation activities, which can clean data in various ways.

    Transformation activities update the staging table, either by adding new fields or overwriting existing fields. You can directly examine the staging table at any time.

  3. Validation activities, which can, for example, compare data to expected ranges.

  4. Reconciliation activities, in which you specify comparisons that the data must satisfy in order to be considered successfully reconciled.

    Each validation and reconciliation activity writes a report file with any errors. For either kind of activity, an error can either halt processing or simply result in a warning message, as you choose. In all cases, an appropriate user role is alerted via the workflow module so that the data can be examined and corrected if needed, and the recipe can be rerun.

  5. Data promotion activities, which use SQL to extract data from the staging tables and update final tables or external files. Final tables can be in the native database or in an external database.

A recipe can also include custom steps at any stage in the processing.
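
The following is a minimal, self-contained sketch of how these steps fit together, written in Python with SQLite purely so that it runs anywhere. All table names, columns, and checks are invented, and in the product the steps are configured rather than hand-coded; the point is only to show the order of operations: stage, transform in place, validate, reconcile, then promote with SQL.

    import sqlite3

    # Illustrative recipe flow; every name here is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Source_Orders (OrderID INTEGER, TotalAmount REAL, Currency TEXT)")
    conn.executemany("INSERT INTO Source_Orders VALUES (?, ?, ?)",
                     [(1, 120.0, "USD"), (2, 75.5, None), (3, -10.0, "USD")])

    # 1. Staging: load raw data from the source into a staging table.
    conn.execute("CREATE TABLE Staging_Orders AS SELECT * FROM Source_Orders")

    # 2. Transformation: clean the staging table in place (add or overwrite fields).
    conn.execute("UPDATE Staging_Orders SET Currency = 'USD' WHERE Currency IS NULL")

    # 3. Validation: compare data to expected ranges and report any problems.
    bad = conn.execute("SELECT COUNT(*) FROM Staging_Orders WHERE TotalAmount < 0").fetchone()[0]
    if bad:
        print(f"validation warning: {bad} row(s) with a negative TotalAmount")  # warn or halt, as configured

    # 4. Reconciliation: compare figures that must agree, such as row counts.
    staged = conn.execute("SELECT COUNT(*) FROM Staging_Orders").fetchone()[0]
    source = conn.execute("SELECT COUNT(*) FROM Source_Orders").fetchone()[0]
    if staged != source:
        raise RuntimeError("reconciliation failed: row counts differ")

    # 5. Data promotion: use SQL to move the cleaned data into the final table.
    conn.execute("CREATE TABLE Final_Orders AS SELECT * FROM Staging_Orders")
    print(conn.execute("SELECT * FROM Final_Orders").fetchall())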

To get started with this task, see Defining Recipes.

Analytics Cubes

Data Fabric Studio includes InterSystems IRIS® Adaptive Analytics, a Business Intelligence tool co-developed with AtScale. Data engineers and data analysts can define cubes based on the final tables produced by recipes and then use those cubes for analytics. See Defining Cubes.

Snapshots

Data Fabric Studio provides an additional mechanism to support your data needs: snapshots. With snapshots, you can easily save data for later inspection by regulators. A snapshot can pull data from multiple tables as needed, write it to a snapshot table, and apply a tag to the records. The product automatically stores all snapshot runs; that is, a new snapshot run does not overwrite a previous one.

When you have run a snapshot multiple times, applying a different tag each time, you can examine how that data changes over time. In particular, you can build a cube on the snapshot data, using the tag values as a dimension.
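
As a rough illustration of the mechanism (table and column names are invented, and the product manages snapshot storage and tagging for you), each snapshot run appends rows labeled with its own tag, so earlier runs remain intact and the tag can later serve as a dimension:

    import sqlite3

    # Illustrative only: each snapshot run appends tagged rows; nothing is overwritten.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Snapshot_Orders (Tag TEXT, OrderID INTEGER, TotalAmount REAL)")

    def run_snapshot(tag, rows):
        # A new run inserts rows under its own tag and never touches earlier runs.
        conn.executemany("INSERT INTO Snapshot_Orders VALUES (?, ?, ?)",
                         [(tag, order_id, amount) for order_id, amount in rows])

    run_snapshot("2024-Q1", [(1, 120.0), (2, 75.5)])
    run_snapshot("2024-Q2", [(1, 130.0), (2, 80.0)])

    # Grouping by tag shows how the data changes from one snapshot run to the next;
    # in a cube, the tag values would play the same role as a dimension.
    for tag, total in conn.execute(
            "SELECT Tag, SUM(TotalAmount) FROM Snapshot_Orders GROUP BY Tag ORDER BY Tag"):
        print(tag, total)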

See Defining Snapshots.

Scheduling

The Business Scheduler provides an easy way to schedule the running of tasks: running recipes, building cubes, and performing snapshots. See Scheduling and Running Tasks.

In the initial phases of implementation, you can configure tasks to be run manually (from the Business Scheduler).

Later, when you want the tasks to be executed on a schedule, it is necessary to define the applicable calendar information and to manage dependencies among tasks. This works as follows:

  • To define calendar information, you define entities, each of which has its own calendar. To simplify scheduling, the product supports a hierarchy of entities, in which each entity can have its own business calendar or inherit calendar details from its parent. See Defining Entities and Calendars.

  • To specify dependencies among tasks, you apply a tag to a scheduled task and define dependency expressions in other tasks, referring to that tag. Then a task can be run only after the dependencies are fulfilled. See Managing Task Dependencies.
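
As a sketch of the idea only (the real dependency expressions and scheduling behavior are configured in the Business Scheduler, and the names below are invented), a task becomes runnable once every tag its dependency expression refers to has been fulfilled:

    # Hypothetical illustration of tag-based task dependencies.
    completed_tags = set()

    tasks = {
        "LoadOrdersRecipe":  {"tag": "orders-loaded", "depends_on": []},
        "BuildSalesCube":    {"tag": "cube-built",    "depends_on": ["orders-loaded"]},
        "QuarterlySnapshot": {"tag": "snapshot-done", "depends_on": ["orders-loaded", "cube-built"]},
    }

    def runnable(name):
        # A task may run only after every tag it depends on has been fulfilled.
        return all(dep in completed_tags for dep in tasks[name]["depends_on"])

    # One simple pass in dependency order; the scheduler handles this for you.
    for name in ["LoadOrdersRecipe", "BuildSalesCube", "QuarterlySnapshot"]:
        if runnable(name):
            print(f"running {name}")
            completed_tags.add(tasks[name]["tag"])
        else:
            print(f"waiting: {name}")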
