While wandering on GitHub, I found the spring-hadoop project. SpringSource has done terrific work with it. What is its status?
I can't find any documentation. Is there any? The SpringSource section for spring-hadoop does not point to any (http://www.springsource.org/spring-data).
I have been using Hadoop with Spring for almost a year at my current company, a French startup called Kadeal. We built our own wrapper around it. It allows us to run jobs and configure them through a programmatic API. The input/output value types are checked at compile time using generics.
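To illustrate what compile-time checking of input/output types with generics can look like, here is a minimal sketch in plain Java. It is not our actual wrapper and has no Hadoop dependency; `Step`, `then`, `run` and `TypedPipelineDemo` are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical typed step: IN and OUT stand in for a job's input and output value types.
final class Step<IN, OUT> {
    private final Function<IN, OUT> fn;

    Step(Function<IN, OUT> fn) {
        this.fn = fn;
    }

    // Chaining compiles only if the next step's input type equals this step's output type.
    <NEXT> Step<IN, NEXT> then(Step<OUT, NEXT> next) {
        return new Step<>(fn.andThen(next.fn));
    }

    // Applies the step to every input, in order.
    List<OUT> run(List<IN> inputs) {
        List<OUT> out = new ArrayList<>();
        for (IN in : inputs) {
            out.add(fn.apply(in));
        }
        return out;
    }
}

final class TypedPipelineDemo {
    // The threshold is passed programmatically, as a plain parameter.
    static List<Boolean> lengthsAboveThreshold(List<String> lines, int threshold) {
        Step<String, Integer> length = new Step<>(String::length);
        Step<Integer, Boolean> above = new Step<>(n -> n > threshold);
        // length.then(length) would be rejected by the compiler:
        // the first step produces Integer, the second expects String.
        return length.then(above).run(lines);
    }

    public static void main(String[] args) {
        System.out.println(lengthsAboveThreshold(Arrays.asList("a", "hadoop"), 3));
    }
}
```

The point of the design is that a mismatch between one step's output type and the next step's input type is caught by the compiler rather than at job run time.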
Spring-hadoop promises a tighter integration of Hadoop with the Spring framework and could indeed be really useful.
Interesting points are:
* extension of Resource for HDFS
* custom NamespaceHandler for configuration
* support of the latest Hadoop API, i.e. o.a.h.mapreduce instead of o.a.h.mapred
* extension of ConversionService for mapping simple types to simple Writable types
* abstraction of the MapReduce framework, i.e. no direct Hadoop dependencies
* JobTemplate and GenericJobRunner
What I have not figured out yet:
* how would you configure a Hadoop job run by GenericJobRunner?
Let's say I want to change a business threshold: what would be the best way to do it? Of course, a custom configuration might not be just a single value but lots of properties...
* how is reporting handled?
It might be a bad decision, but we used Hadoop counters to report on the broad actions of our jobs. Let's say I process 300 000 items and 100 000 of them are discarded. I would like to know why, even though I would not want to read every reason for every item one by one.
With spring-hadoop, there is no longer direct access to the Hadoop API within the MapReduce functions. How would you build crude reporting? What are the best practices?
In fact, I have the same question for spring-batch. We built a crude reporting system based on a core class that resembles Hadoop's counters.
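To make the kind of counter-like reporting class I mean more concrete, here is a minimal sketch in plain Java. It is not our actual code and does not use the Hadoop or spring-batch APIs; `Outcome`, `OutcomeReport`, `OutcomeReportDemo` and the discard rule are all hypothetical, chosen only for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical outcomes; a real job would define its own reasons.
enum Outcome { KEPT, NULL_ITEM, TOO_SHORT }

// Counter-like report: one running total per outcome instead of one log line per item.
final class OutcomeReport {
    private final Map<Outcome, Long> counts = new EnumMap<>(Outcome.class);

    void increment(Outcome outcome) {
        counts.merge(outcome, 1L, Long::sum);
    }

    long get(Outcome outcome) {
        return counts.getOrDefault(outcome, 0L);
    }

    long total() {
        long sum = 0;
        for (long c : counts.values()) {
            sum += c;
        }
        return sum;
    }
}

final class OutcomeReportDemo {
    // Made-up rule: null items and items shorter than minLength are discarded.
    static OutcomeReport process(String[] items, int minLength) {
        OutcomeReport report = new OutcomeReport();
        for (String item : items) {
            if (item == null) {
                report.increment(Outcome.NULL_ITEM);
            } else if (item.length() < minLength) {
                report.increment(Outcome.TOO_SHORT);
            } else {
                report.increment(Outcome.KEPT);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        OutcomeReport report = process(new String[] {"ok-item", null, "x", "another", "y"}, 3);
        for (Outcome o : Outcome.values()) {
            System.out.println(o + " = " + report.get(o));
        }
    }
}
```

With something like this, the report answers "how many items were discarded, and for which reason" without listing every item, which is all we need from the framework.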
PS: Spring-hadoop is officially a subproject of spring-data. But at the same time, Hadoop is a nice tool for batches... So posting on this forum (spring-batch) makes sense from my point of view.