Macro-level Scheduling of ETL Workflows

Anastasios Karagiannis, Panos Vassiliadis, Alkis Simitsis

9th International Workshop on Quality in Databases (QDB 2011). In conjunction with VLDB 2011, August 29th, 2011, Seattle, USA

Extract-Transform-Load (ETL) workflows (a) extract data from various sources, (b) transform, cleanse and homogenize these data, and (c) populate a target data store (e.g., a data warehouse). Typically, such processes must complete within strict time windows; thus, ETL workflow optimization is of significant interest. In this paper, we deal with the problem of scheduling the execution of ETL activities, with the goal of minimizing ETL execution time and allocated memory. Apart from a simple, fair scheduling policy, we also experiment with two other policies: the first aims to empty the largest input queue of the workflow, and the second activates the activity with the maximum tuple consumption rate. We experimentally show that the use of different scheduling policies can improve ETL performance in terms of memory consumption and execution time.
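To make the three policies in the abstract more concrete, the following Python sketch shows one way the choice of the next activity could be expressed. It is an illustration only, not the Arktos implementation: the Activity class, the fixed per-activation rate, the linear chain of activities and all function names are assumptions introduced here, and the fair policy is approximated by plain round-robin.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Activity:
    """Hypothetical ETL activity with one input queue."""
    name: str
    rate: int  # assumed: max tuples consumed per activation (>= 1)
    queue: deque = field(default_factory=deque)


def round_robin(activities, last_index):
    """Fair policy (assumed round-robin): next non-idle activity after the last one."""
    n = len(activities)
    for step in range(1, n + 1):
        i = (last_index + step) % n
        if activities[i].queue:
            return i
    return None  # every queue is empty


def largest_queue(activities, last_index):
    """Pick the activity whose input queue currently holds the most tuples."""
    ready = [i for i, a in enumerate(activities) if a.queue]
    return max(ready, key=lambda i: len(activities[i].queue), default=None)


def max_consumption_rate(activities, last_index):
    """Pick the non-idle activity with the highest tuple consumption rate."""
    ready = [i for i, a in enumerate(activities) if a.queue]
    return max(ready, key=lambda i: activities[i].rate, default=None)


def run(activities, policy):
    """Drain a linear chain of activities; return the number of activations."""
    last, activations = -1, 0
    while (i := policy(activities, last)) is not None:
        act = activities[i]
        for _ in range(min(act.rate, len(act.queue))):
            tup = act.queue.popleft()
            if i + 1 < len(activities):
                # Forward the tuple to the next activity in the chain.
                activities[i + 1].queue.append(tup)
        last, activations = i, activations + 1
    return activations


def make_chain():
    """Small toy chain used only for illustration."""
    return [
        Activity("extract", rate=1, queue=deque([1, 2, 3])),
        Activity("transform", rate=5, queue=deque([0])),
        Activity("load", rate=2),
    ]
```

Calling run(make_chain(), largest_queue) drains the toy chain under the largest-queue policy; swapping in round_robin or max_consumption_rate changes only the order in which activities are activated. The sketch does not model memory use or execution time, which are what the paper measures experimentally.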

Texts and Presentations

Paper (PDF)

Presentation (PPTx)

Anastasios Karagiannis MSc Thesis (PDF)

Anastasios' presentation (in Greek)

Long Version of the paper: Anastasios Karagiannis, Panos Vassiliadis, Alkis Simitsis. Scheduling strategies for efficient ETL execution. Information Systems, 38(6), pp. 927-945, 2013.

Experimental Resources

The following code is presented on-line to allow the reproduction of results by others. We would like to clearly state that we simply cannot support any requests for the maintenance of the code, or for clarifications, explanations, etc. Moreover, we do not assume any responsibility for any side effects of the code (although we cannot think of, or have ever encountered, any). You are free to reuse the following code for academic purposes, provided you give the appropriate citation:

Macro-level Scheduling of ETL Workflows. Anastasios Karagiannis, Panos Vassiliadis, Alkis Simitsis. 9th International Workshop on Quality in Databases (QDB 2011), in conjunction with VLDB 2011, August 29th, 2011, Seattle, USA. Source code, datasets, presentations available at http://www.cs.uoi.gr/~pvassil/publications/2011_QDB/

(and, yes, academic honesty rules impose that this includes student projects too ;) )

TPC-H based Datasets and Scenarios (500MB)

Source code for Arktos Engine. Requires Win32 and MS Visual Studio.

Similarly, but with tracing of the scheduling activity in a log file. Not for performance tests.