by CData Arc Marketing | December 17, 2018

Bulk/Batch Processing with CData Arc


If you're looking to integrate data from multiple systems into a data warehouse or data lake, you have several options. These include batch processing, triggering workflows in real time, and publish/subscribe. CData Arc gives you a single tool with a single user interface you can use for all these approaches.

This post will focus on when and why you'd want to employ batch processing and how you can use CData Arc to perform this method of data integration.

What is Batch Processing?

Historically, most data processing technologies, such as data warehousing, were designed for batch processing. It's true that real-time and streaming data integration technologies are currently getting more play in the press. But just as radio retains a place in the age of television, batch processing won't go away any time soon.

Batch jobs process large volumes of records all at once. Newly arriving data elements are collected into a group, which is then processed at a future time. Managers have complete control over when and how processing occurs. For example, they might schedule batch jobs to run at a regular interval (e.g., every 15 minutes, every hour, or nightly) or trigger jobs by a condition (e.g., when the batch contains more than 1 MB of data), with the batches incorporating all newly modified records or all records meeting a specified condition. Jobs often work offline, running at night to avoid disrupting daily activities on production systems.
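The collect-then-flush pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not CData Arc's implementation; the `BatchCollector` class, its parameter names, and the size/interval defaults are all hypothetical:

```python
import time

class BatchCollector:
    """Collect incoming records and flush them as one batch when either
    a size threshold or a time interval is reached (illustrative only)."""

    def __init__(self, process_batch, max_bytes=1_000_000, interval_s=900):
        self.process_batch = process_batch  # callback that handles one whole batch
        self.max_bytes = max_bytes          # e.g., flush once the batch exceeds 1 MB
        self.interval_s = interval_s        # e.g., flush every 15 minutes
        self.records = []
        self.size = 0
        self.last_flush = time.monotonic()

    def add(self, record: bytes):
        self.records.append(record)
        self.size += len(record)
        # Trigger by condition (size) or by elapsed time, as described above
        if (self.size >= self.max_bytes
                or time.monotonic() - self.last_flush >= self.interval_s):
            self.flush()

    def flush(self):
        if self.records:
            self.process_batch(self.records)  # process the whole group at once
        self.records, self.size = [], 0
        self.last_flush = time.monotonic()
```

In practice the flush callback would hand the batch to the next processing stage; here it could simply append to a list for inspection.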

Benefits: High Performance, Lower Costs

Optimized to perform high-volume, repetitive tasks, batch processing provides a fast way to process large amounts of data. For example, inserting 20,000 rows into your database is much quicker as a single batch than inserting each row as a separate transaction. Operational costs are also reduced because automated processing eliminates the need for specialized data entry clerks.
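The performance difference comes from paying transaction overhead once per batch rather than once per row. A minimal sketch using Python's built-in sqlite3 module (the `orders` table is hypothetical) shows the batch form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

rows = [(i, i * 1.5) for i in range(20_000)]

# Batch insert: all 20,000 rows in one transaction via a single executemany call,
# instead of 20,000 separate commits
with conn:
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 20000
```

The same principle applies to any RDBMS driver that supports parameterized batch execution.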

Use Cases

Batch processing is typically used to perform high-volume, repetitive tasks that don't require the most up-to-date data. A bank, eCommerce firm or manufacturer may employ a batch process to update a data warehouse at the end of the day with production data on loan applications, sales transactions, or inventory. A company may also turn to batch processes to generate reports, print documents, and perform other non-interactive tasks that must complete reliably within certain business deadlines.

Architecture

A batch processing architecture has the following components:

  • Data storage. You need a repository for high volumes of data in various formats. This can be a data warehouse or a data lake.
  • Batch processing jobs. These jobs read source data, process it, and write the output to new storage locations.
  • Analytical data store. Many batch jobs are designed to prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.
  • Analysis and reporting tools. The goal of many batch processing jobs is to provide insights into the data through analysis and reporting.
  • Orchestration. Typically, some orchestration is required to migrate or copy the data into your data storage, batch processing, analytical data store, and reporting layers.

Batch Processing with CData Arc

CData Arc offers several capabilities that make it easy for you to implement batch processing:

  • Batch Results - All the CData Arc data storage ports include a "Batch Results" toggle. When set to False, the port creates a separate message for each record processed. When enabled, CData Arc creates a single message containing all the records in the batch, so they're processed together and sent to the next port as one. The work is handled in a single transaction: if any record fails, all the messages are rolled back together.

  • Scheduled Jobs - CData Arc can schedule batch jobs to run at any interval. Users can configure CData Arc to process entire data sets or only the records that have changed since the last run. This incremental processing can be driven by a timestamp or by record flags that ensure rows are not processed twice.

  • Related or unrelated data - Users can create batch jobs that contain the same type of data, or unrelated data including a mix of structured and unstructured information. You can even specify related batches, such as one batch for invoices and another for their supporting line items.

  • Bulk CSV import/export - Our ports can directly import and export CSV files to improve performance.
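The incremental option described under Scheduled Jobs above is commonly implemented with a timestamp "watermark": each run selects only rows modified since the previous run, then advances the watermark. This sketch is illustrative, not CData Arc's internals; the `source` table and column names are hypothetical:

```python
import sqlite3

def fetch_changed_rows(conn, last_run):
    """Select only rows modified since the previous run, and return a new
    watermark so the same rows are not processed again next time."""
    rows = conn.execute(
        "SELECT id, payload, modified_at FROM source "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_run,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_run
    return rows, new_watermark

# Demo data: one row older than the watermark, one newer
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, payload TEXT, modified_at TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?, ?)", [
    (1, "old record", "2018-12-01T00:00:00"),
    (2, "new record", "2018-12-16T09:30:00"),
])

rows, watermark = fetch_changed_rows(conn, "2018-12-15T00:00:00")
# Only the row modified after the watermark is returned
```

The record-flag alternative mentioned above works similarly, except each processed row is marked (e.g., a boolean column) instead of comparing timestamps.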

If you need high-performance processing of large volumes of data that are not highly time sensitive, CData Arc's flexibility and optimized batch processing make it the way to go.
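The bulk CSV import/export capability listed above can be sketched with Python's standard csv module feeding a single batch insert. The `inventory` table and the in-memory buffer are hypothetical stand-ins for real files and ports:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT, qty INTEGER)")

# Export a record set to CSV in one pass (StringIO stands in for a file)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["sku", "qty"])
writer.writerows([("A-1", 5), ("B-2", 12)])

# Bulk import the CSV back with a single executemany call,
# rather than one INSERT per line
buf.seek(0)
reader = csv.reader(buf)
next(reader)  # skip the header row
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [(sku, int(qty)) for sku, qty in reader],
)

total = conn.execute("SELECT SUM(qty) FROM inventory").fetchone()[0]
```

Pairing a streaming CSV reader with batch inserts keeps memory use flat while preserving the performance benefit of batched writes.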

EDI and Bulk/Batch Processing

If you have a partner that transmits a high volume of files, you may want to consider processing EDI messages in batch. The EDI specifications define how batch processing should be handled and the CData Arc EDI ports adhere to those specifications.

In addition, CData Arc enables users to employ additional batch processing on incoming EDI messages. For example, the ports for X12, EDIFACT, and other EDI standards enable users to accept a batch of transactions, split the batch into separate transactions, and route each one to a different location. If a batch includes both invoices and shipping notifications, users can automatically route invoices to accounting and shipping notifications to logistics.
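The split-and-route step described above amounts to grouping transactions by type. A minimal sketch, assuming X12 transaction set codes 810 (invoice) and 856 (advance ship notice) and a hypothetical routing table:

```python
# Hypothetical routing table keyed by X12 transaction set code:
# 810 = invoice -> accounting, 856 = advance ship notice -> logistics
ROUTES = {"810": "accounting", "856": "logistics"}

def split_and_route(batch):
    """Split a batch of (code, payload) transactions and group the
    payloads by destination (illustrative, not CData Arc's internals)."""
    routed = {}
    for code, payload in batch:
        dest = ROUTES.get(code, "unrouted")
        routed.setdefault(dest, []).append(payload)
    return routed

batch = [("810", "invoice-001"), ("856", "asn-001"), ("810", "invoice-002")]
routed = split_and_route(batch)
```

In a real flow, each group would then be handed to the port configured for that destination.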

Selecting the Right Storage

The storage system you choose depends on the type, structure, model, and intended use of the data, as well as your requirements for schemas, consistency, and transaction speed. Whether you choose a Relational Database Management System (RDBMS), a non-relational NoSQL database, or even a distributed database (DDB), CData Arc allows for seamless automated connectivity with your choice of data storage system:

  • The relational database is the most prevalent data storage system in the world. It stores data according to a schema, presenting it as tables with rows and columns, and is queried and updated with SQL statements. Popular examples include MySQL, Microsoft SQL Server, PostgreSQL, and Oracle, among others. CData Arc has ports for each of these systems, as well as several other RDBMS applications, that let you connect to the database and insert the batch-processed data.

  • NoSQL, non-relational database systems are the preferred choice when much of the data being stored is not tabular. Unlike an RDBMS, NoSQL systems can be schema agnostic, which makes them ideal for storing unstructured or partially structured data. NoSQL systems include key-value stores like Redis and Amazon DynamoDB, wide column stores like Cassandra and Apache HBase, document stores like MongoDB and Couchbase, and even search engines like Elasticsearch. With several types of NoSQL systems, there is no dearth of options to choose from. As with RDBMSs, CData Arc supports ports for all of the aforementioned systems and many more.

  • The distributed database is a data warehousing solution that allows ever-increasing data volumes to be stored on multiple servers dispersed across a network, optimized by distributing data processing among several nodes. One such platform is Apache Hadoop, particularly its storage layer, the Hadoop Distributed File System (HDFS), a distributed file system that runs on commodity machines and provides high throughput across them. Amazon Redshift, a cloud-based data warehousing solution, is another. With such systems, you can route your data flow through an Extract, Transform, Load (ETL) process, or you can opt for the Extract, Load, Transform (ELT) variant. CData Arc supports both schools of thought, which means the application can house your data flow regardless of which process you choose. As you may have surmised, CData Arc has ports for both Redshift and HDFS, among other distributed database systems and distributed file systems.
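The ETL/ELT distinction mentioned above comes down to where the transformation runs: before the load, in the integration layer, or after the load, inside the warehouse itself. A minimal sketch using sqlite3 as a stand-in for a warehouse (the table names and sample rows are hypothetical):

```python
import sqlite3

# Raw source rows: untrimmed names, prices stored as text
raw = [("  Widget ", "19.99"), ("Gadget", "5.00")]
conn = sqlite3.connect(":memory:")

# ETL: transform in the integration layer, then load clean rows
conn.execute("CREATE TABLE etl_products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO etl_products VALUES (?, ?)",
    [(name.strip(), float(price)) for name, price in raw],
)

# ELT: load the raw rows into a staging table first,
# then transform inside the warehouse with SQL
conn.execute("CREATE TABLE staging (name TEXT, price TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", raw)
conn.execute("""
    CREATE TABLE elt_products AS
    SELECT TRIM(name) AS name, CAST(price AS REAL) AS price
    FROM staging
""")
```

Both paths end with the same clean rows; ELT simply defers the transformation to the storage engine, which distributed systems like Redshift can parallelize across nodes.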


All in all, CData Arc is a robust solution designed to process and integrate large amounts of data. You can batch process high volumes of data and connect to any number of database systems, ERP applications, and data stores thanks to a continually growing arsenal of connectivity ports. CData Arc gives you granular control over the structure of data going into and coming out of your storage systems and lets you automate and optimize your data flow, harnessing the power of your data in today's data-driven economy.

Download Now

Download CData Arc 2018, the fastest & easiest way to connect data and applications across your organization:
