Low maintenance data integration (ETL)

By mtm from London.pm
Date: Wednesday, 13 August 2008 10:40
Duration: 30 minutes
Language:
Tags: dataprocessing etl sjerek

You can find more information on the speaker's site:


This is a tech talk about an existing ETL system used at Nestoria.co.uk (a vertical search engine operating in 4 countries). The system is the processing stage between the arrival of data and its insertion into the database.
http://en.wikipedia.org/wiki/Extract%2C_transform%2C_load

Lots of Perl folks have written ETL systems in the past, and lots more will have to write one in the future. There is often no way around a custom solution.

We will look at some best practices around 24/7 availability, monitoring, data cleansing, data quality, i18n, scaling, dealing with failures and changes ... and of course CPAN modules.

Nestoria had to integrate dozens of different formats (flat files, database dumps, XML, custom), delivery methods (fetch, crawl, FTP) and update methods (complete, incremental, partial, custom). We thought we were prepared for everything, but over the years we learned some valuable lessons about corrupt files, failing servers, data quality, i18n issues and performance.


Attended by: Leon Brocard (acme), Peter Makholm (brother), Chisel Wright, Gabor Szabo (szabgab), Imran Chaudhry (icjs), Morten Meyling, Paul-Christophe Varoutas, Henrik Hald Nørgaard, Thomas Klausner (domm), Sue Spence (virtualsue), R Geoffrey Avery (rGeoffrey), Søren Døygaard, Kristoffer Gleditsch (toffer), Nicholas Clark, Luis Motta Campos (LMC), Darius Jokilehto, Nicholas Oxhøj (noxhoej), Stan Sawa, David Leadbeater (dg), Nigel Metheringham (nigelm), Patrick Donelan (patspam), Sebastian Willert, Søren Lund (slu), Salvador Fandiño (salva), Sven Esbjerg, allan juul, Bern, Francoise Dehinbo (franky), Tobias Henoeckl (hoeni), Darko Obradovic, Kaare Rasmussen, Lars Jorgensen, Jos Boumans (kane), Hermen Lesscher (hermen), Bart Lateur, Sébastien Aperghis-Tramoni (maddingue)