Aug 18th, 2010, 01:10 AM
Spring-Batch for a massive nightly / hourly Hive / MySQL data processing
I posted this question on Stack Overflow, but got no replies.
I'm hoping to get some feedback here. Here's a copy of the question:
I'm looking into replacing a bunch of Python ETL scripts that perform a nightly / hourly data summary and statistics gathering on a massive amount of data.
What I'd like to achieve is
- Robustness - a failing job / step should be automatically restarted. In some cases I'd like to execute a recovery step instead.
- The framework must be able to recover from crashes. I guess some persistence would be needed here.
- Monitoring - I need to be able to monitor the progress of jobs / steps, and preferably see history and statistics on their performance.
- Traceability - I must be able to understand the state of the executions.
- Manual intervention - nice to have... being able to start / stop / pause a job from an API / UI / command line.
- Simplicity - I prefer not to get angry looks from my colleagues when I introduce the replacement... Having a simple and easy to understand API is a requirement.
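To make the robustness requirement concrete, here is a minimal plain-Java sketch of "retry a failing step, then fall back to a recovery step". It deliberately does not use the Spring Batch API; the class `RetryingStep`, its `run` method and the `maxAttempts` parameter are hypothetical names chosen for illustration only.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, NOT part of Spring Batch: retries a step a fixed
// number of times and falls back to a recovery step if every attempt fails.
final class RetryingStep {

    /** Runs step up to maxAttempts times; if all attempts throw, runs recovery. */
    static <T> T run(Callable<T> step, Callable<T> recovery, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return step.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        // Every attempt failed: hand over to the recovery step, if there is one.
        if (recovery == null) {
            throw last;
        }
        return recovery.call();
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated step that fails twice with a transient error, then succeeds.
        Callable<String> flaky = () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        };
        Callable<String> recovery = () -> "recovered";
        String result = run(flaky, recovery, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A framework would additionally persist the attempt count and outcome, so that a crash between attempts does not lose track of what has already been tried.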
The current scripts do the following:
- Collect text logs from many machines, and push them into Hadoop DFS. We may use Flume for this step in the future (see http://www.cloudera.com/blog/2010/07...-cdh3b2-flume/).
- Perform Hive summary queries on the data, and insert (overwrite) to new Hive tables / partitions.
- Extract the new summary data into files, and load (merge) it into MySQL tables. This data is needed later for online reports.
- Perform additional joins on the newly added MySQL data (from MySQL tables), and update the data.
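The four steps above, combined with the crash-recovery requirement, can be sketched as an ordered pipeline that records each completed step and skips it on restart. This is plain Java, not the Spring Batch API. The class `RestartablePipeline` and the step names are hypothetical, and the in-memory `Set` only stands in for whatever persistent store (a file or a database table) a real implementation would use.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, NOT the Spring Batch API: runs named steps in order
// and remembers which ones finished, so a rerun resumes at the failed step.
final class RestartablePipeline {
    private final Map<String, Runnable> steps = new LinkedHashMap<>();

    RestartablePipeline step(String name, Runnable body) {
        steps.put(name, body);
        return this;
    }

    /** Runs steps in insertion order, skipping those already in completed. */
    void run(Set<String> completed) {
        for (Map.Entry<String, Runnable> entry : steps.entrySet()) {
            if (completed.contains(entry.getKey())) {
                continue; // finished in an earlier run
            }
            entry.getValue().run();        // an exception here aborts the run
            completed.add(entry.getKey()); // recorded only after success
        }
    }

    public static void main(String[] args) {
        // Assumption: in a real job this set would be loaded from and saved
        // to persistent storage between runs.
        Set<String> completed = new LinkedHashSet<>();
        RestartablePipeline pipeline = new RestartablePipeline()
            .step("collect-logs", () -> System.out.println("push logs to HDFS"))
            .step("hive-summaries", () -> System.out.println("run Hive summary queries"))
            .step("export-to-mysql", () -> System.out.println("load summaries into MySQL"))
            .step("mysql-joins", () -> System.out.println("run follow-up joins in MySQL"));
        pipeline.run(completed);
        System.out.println("completed: " + completed);
    }
}
```

Skipping a step on restart is only safe if the step is idempotent or its output is overwritten on rerun, as with the insert-overwrite Hive step described above.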
My idea is to replace the scripts with spring-batch. I also looked into Scriptella, but I believe it is too 'simple' for this case.
Since I've seen some bad vibes about Spring Batch (mostly in old posts), I'm hoping to get some input here. I also haven't seen much about Spring Batch and Hive integration, which is worrying.
Your insights and help are much appreciated.
Last edited by eran_ha; Aug 18th, 2010 at 12:42 PM.
Aug 20th, 2010, 04:59 AM
You won't see much in the way of "negative vibes" here. I can't think why there would be any (except if it's FUD).
This looks like a good match for Spring Batch - all the features you say you need are there. Some are wrapped nicely in Spring Batch Admin as well. As far as Hive integration goes - I haven't seen anyone using it with Spring Batch, so there is no demand that I am aware of for native integration. It is a Java API though, so it won't be hard to drive from a Spring Batch job if that's what you need.
Aug 20th, 2010, 05:47 AM
Thanks Dave, your response is appreciated.
I saw some non-friendly blog posts (e.g. http://www.cforcoding.com/2009/07/sp...esign-api.html), and some answers at StackOverflow which were indeed negative.
I was mostly looking for input here, since I got none there, and hoped to hear from anyone with experience using Spring Batch with a NoSQL product like Hadoop.
Aug 20th, 2010, 05:57 PM
I've been using Spring Batch for a few months to process 20 different types of jobs. This is quite a big transactional system with complex business logic (some jobs have at least 50 steps). What I can say is that I'm happy with Spring Batch. All of your requirements can be met by the framework.
Additionally, as probably everybody here knows, in complex systems it is not so easy. It is not enough to write a job definition and expect the processing to fly automatically. Hard analytical and architectural work is needed to make it good and reliable. Spring Batch has a quite good architecture and well-specified interfaces. In our products we built many extensions to the framework: mostly different types of readers, writers and processors, but also listeners, skip/retry policies and recovery mechanisms.
Most problems show up at the performance level. Here it is important to understand the chunk processing model, how transactions are managed within that model, and how the data is partitioned. Good partitioning of the data enables concurrent processing, and concurrent processing in turn affects recovery from the point of failure.
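As a rough illustration of the chunk processing model, the plain-Java sketch below reads items one at a time, processes each, and hands them to the writer in groups of a fixed size. It is not Spring Batch code: `ChunkLoop`, `process` and `chunkSize` are hypothetical names, and there is no transaction handling here. In the framework each chunk is written inside a transaction, so a failure rolls back only the chunk in flight.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch of a read/process/write loop that writes in chunks.
final class ChunkLoop {

    /** Processes every item from reader and writes the results chunkSize at a time. */
    static <I, O> int process(Iterator<I> reader, Function<I, O> processor,
                              Consumer<List<O>> writer, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        int chunksWritten = 0;
        List<O> buffer = new ArrayList<>(chunkSize);
        while (reader.hasNext()) {
            buffer.add(processor.apply(reader.next()));
            if (buffer.size() == chunkSize) {
                writer.accept(new ArrayList<>(buffer)); // a commit would happen here
                buffer.clear();
                chunksWritten++;
            }
        }
        if (!buffer.isEmpty()) {
            writer.accept(new ArrayList<>(buffer)); // final, possibly smaller chunk
            chunksWritten++;
        }
        return chunksWritten;
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        Iterator<Integer> reader = Arrays.asList(1, 2, 3, 4, 5).iterator();
        Function<Integer, String> processor = n -> "row-" + n;
        Consumer<List<String>> writer = written::add;
        int chunks = process(reader, processor, writer, 2);
        System.out.println(chunks + " chunks: " + written);
    }
}
```

The chunk size is the main tuning knob: larger chunks mean fewer commits but more work to redo after a failure.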
These aspects are very sensitive and took us a few months of testing and analysis (many PoCs). I don't want to say it's very hard, but it's not always obvious. It is pretty similar to other frameworks in that respect.
Fortunately, Spring Batch has very good documentation, so if you approach it systematically you will achieve your goals. An ad hoc approach may cause your project to fail.
Ok, that's all from me. Oh. I'm not using NoSQL databases.
Aug 21st, 2010, 07:28 AM
I expected the performance aspect to be the trickiest one. Fortunately my current case seems to be much simpler than yours. I have 4-8 steps in a job (depending on how I look at it...). Our steps perform some bulk ops that I'm not sure I can optimize at all (Hadoop does the heavy lifting for me in that case). And finally, the current implementation is very rigid, while Spring Batch will offer me much more.
It also seems like spring-batch offers ways of scaling by distributing your jobs across many machines, though I'm assuming it gets sticky at this point...