
Building a Sharded Database Process

In a similar vein to my previous database shard article, I wanted to learn how to take advantage of Kettle’s Partitioning Schema and Database Cluster features. Database Clusters come in handy when your production system uses database shards but the database itself doesn’t necessarily know how to handle them (shards are also called nodes). Database shards are essentially the same database schema applied across many live database servers. Sharding spreads the data and the load across those servers, which improves database performance and in turn speeds up the applications running on top of it.
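To make the idea concrete, here is a minimal sketch of sharding in plain Python, nothing Kettle-specific; the shard names and the customer_id routing key are hypothetical examples, not anything from the tool:

```python
# Minimal sketch of the sharding idea: the same 'customer' schema lives on
# several servers (nodes), and each row belongs to exactly one of them.
# Shard names and the customer_id routing key are made up for illustration.

SHARDS = ["shard_01", "shard_02", "shard_03"]  # one entry per database node

def shard_for(customer_id: int) -> str:
    """Route a customer to the node that owns it, here by simple modulo."""
    return SHARDS[customer_id % len(SHARDS)]

if __name__ == "__main__":
    for cid in (7, 8, 9):
        print(cid, "->", shard_for(cid))
```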

Before anything else is done, a database connection needs to be created. This is done either through any step that uses a database connection (by clicking on ‘New’) or by going to the View tab of a transformation, right-clicking on ‘Database Connections’, and selecting either New or New Connection Wizard.

We still have to fill out the General tab with database connection information.

So far, setting up a clustered database connection is very much like creating a normal one. Notice in the previous image that there is an option called ‘Clustering’.

The first thing you’ll want to do is check the box enabling clustering. Then you can fill in the information for each shard just like you would in the General tab. It is important to note that encrypted passwords will work here, but parameters do not! Want to be able to use parameters here? Vote up my Jira ticket: http://jira.pentaho.com/browse/PDI-7456 ;) . Now with clustering enabled, when you click the ‘Test’ button Kettle should try to talk with all of the listed shards. The pop-up will let you know if there are any problems with any of the shards. Also note that if your cluster is sufficiently large, it may take a while to get the test results back.
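If it helps to picture what that Test button is doing, here is a rough sketch in plain Python with placeholder node IDs, hostnames, and ports; Kettle itself keeps these details in the Clustering tab rather than in code:

```python
# Rough equivalent of the 'Test' button with clustering enabled: try to
# reach every listed node and report any failures. Hosts/ports are placeholders.
import socket

CLUSTER_NODES = [
    {"id": "node1", "host": "db1.example.com", "port": 3306},
    {"id": "node2", "host": "db2.example.com", "port": 3306},
]

def test_cluster(nodes, timeout=3.0):
    for node in nodes:
        try:
            with socket.create_connection((node["host"], node["port"]), timeout):
                print(node["id"], "OK")
        except OSError as exc:
            print(node["id"], "FAILED:", exc)

test_cluster(CLUSTER_NODES)
```

As with the real Test button, a large cluster means many connection attempts, so the check takes longer the more nodes you list.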

Once that is done, you’ll want to create a partition schema:

Kettle makes it easy to import your cluster into a partition schema. All you have to do is click the ‘Import Schema’ button.

Then select which schema will make up the partition schema:

This will bring over the cluster IDs from the database connection into the partition schema.
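Conceptually, the import is just a copy of the node IDs from the clustered connection into the partition schema. A tiny sketch, with illustrative names rather than Kettle’s internal representation:

```python
# 'Import Schema', conceptually: one partition ID per cluster node ID.
cluster_node_ids = ["node1", "node2", "node3"]   # IDs from the clustered connection

partition_schema = {
    "name": "shard_partitions",                  # hypothetical schema name
    "partition_ids": list(cluster_node_ids),     # brought over from the cluster
}
print(partition_schema)
```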

Now that the schema is set up, we can have some real fun! Bringing a Table Input step over, we can run any query we want over the database shards. For this example, let’s say there is a ‘customer’ table:

By selecting the clustered database connection, we can run our query over the shards and get back the data. There is one ‘gotcha’ here (special thanks to Slawo for solving this one ;) ). When you right-click on the Table Input step, there is a Partitioning option.

This option tells Kettle how to handle the query to the different database shards. For this example, we want to run the query across all of the shards.

Now if you run this transformation you should get data from all of the shards. Note that previewing the data stream will more than likely not show rows from every shard if there is a significant amount of data, since a preview only pulls back a limited number of rows.
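To illustrate what running the query across all partitions means for the result set, here is a small self-contained sketch that uses sqlite3 in-memory databases to stand in for the shards; the real work, of course, happens inside Kettle against the clustered connection:

```python
# The same SQL runs on every shard and the row sets are concatenated into
# a single stream. sqlite3 in-memory databases stand in for the real nodes.
import sqlite3

def make_shard(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    return conn

shards = [
    make_shard([(1, "Alice"), (2, "Bob")]),     # contents of shard 1
    make_shard([(3, "Carol"), (4, "Dave")]),    # contents of shard 2
]

query = "SELECT id, name FROM customer"
all_rows = [row for shard in shards for row in shard.execute(query)]
print(all_rows)  # rows from every shard, merged into one result set
```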

Going Further

You can also make this dynamic by providing your query/parameters via a parent job. This way the same Table Input step for the shards can be reused across many jobs. You can use the same approach for output processes as well. If you wish to continue expanding your knowledge of Pentaho’s Kettle tool, I highly suggest grabbing a copy of any of the good books that have been written on the subject.

Pentaho Solutions

Pentaho 3.2 Data Integration Beginner’s Guide

Pentaho Data Integration 4 Cookbook

Pentaho Kettle Solutions
