Mar 7th, 2011, 02:47 PM
Hadoop/Spring Batch Integration
Currently we use Spring Batch to run a periodic ETL process that takes relational data from Oracle and writes it out to delimited flat files. We then use the Hadoop API to copy these flat files to HDFS and add them as new partitions of external Hive tables via JDBC.
My question: has there been any interest in an HDFS ItemWriter or ItemReader? We may take this on, but wanted to make sure there isn't already something like this in the works in the Spring Batch or Spring Hadoop projects.
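For readers unfamiliar with the last step of that pipeline, here is a minimal sketch of building the Hive statement that registers an HDFS directory as a new partition of an external table. The class name `HivePartitionDdl`, the table name, the partition key, and the path are all hypothetical and chosen only for illustration; the thread does not show the actual statements used.

```java
// Hypothetical helper (not part of Spring Batch, Spring Hadoop, or Hive):
// builds the DDL that adds an HDFS directory as a partition of an external table.
final class HivePartitionDdl {

    private HivePartitionDdl() {}

    static String addPartition(String table, String partitionKey,
                               String partitionValue, String hdfsLocation) {
        requireIdentifier(table);
        requireIdentifier(partitionKey);
        return "ALTER TABLE " + table
                + " ADD PARTITION (" + partitionKey + "='" + escape(partitionValue) + "')"
                + " LOCATION '" + escape(hdfsLocation) + "'";
    }

    // Identifiers cannot be bound as parameters, so restrict them to a safe pattern.
    private static void requireIdentifier(String name) {
        if (name == null || !name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a simple identifier: " + name);
        }
    }

    // Backslash-escape backslashes and single quotes inside string literals.
    private static String escape(String literal) {
        return literal.replace("\\", "\\\\").replace("'", "\\'");
    }
}
```

The resulting string would then be passed to `java.sql.Statement.execute(...)` on a connection obtained from the Hive JDBC driver, once the flat file has been copied into the target directory.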
Mar 8th, 2011, 02:27 AM
There is a plan to add a spring-hadoop-batch module to Spring Hadoop, but don't let that stop you from writing some code and then contributing it to the project. Costin already added a Resource abstraction for HDFS to spring-hadoop-core, so you could probably use that already, but there might be some value in optimising reader / writer implementations to work with it.
Could you go into your use case in a bit more detail? Where would the ItemReader/Writer be used?
Mar 8th, 2011, 09:42 AM
Thanks for your response. In my particular use case I would only need the HDFS ItemWriter but would be willing to commit both ItemWriter/ItemReader if it makes sense.
I am currently using the FlatFileItemWriter to create tab-delimited Hive tables from relational data stored in Oracle. I write the flat files out to local disk first and then copy them to HDFS to be used by Hive. With an HDFS FlatFileItemWriter I could skip both the local write and the copy, and just write directly to HDFS.
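To illustrate the idea of writing directly rather than staging on local disk, here is a minimal sketch of a tab-delimited writer that targets any `java.io.OutputStream`. The class name `TabDelimitedStreamWriter` is hypothetical and this is not an existing Spring Batch or Spring Hadoop class. The assumption is that, in a real job, the stream would be the one returned by Hadoop's `FileSystem.create(Path)` (an `OutputStream` subclass) and that this logic would sit behind a Spring Batch `ItemWriter`; neither dependency is used here so the sketch stays self-contained.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch: writes rows as tab-delimited lines to an arbitrary stream.
// Assumes field values contain no tabs or newlines; no escaping is performed.
final class TabDelimitedStreamWriter implements AutoCloseable {

    private final Writer out;

    // In a real job the target could be an HDFS output stream instead of a local file.
    TabDelimitedStreamWriter(OutputStream target) {
        this.out = new BufferedWriter(new OutputStreamWriter(target, StandardCharsets.UTF_8));
    }

    // Writes one line per row and flushes after each chunk of rows.
    void write(List<? extends List<String>> rows) throws IOException {
        for (List<String> row : rows) {
            out.write(String.join("\t", row));
            out.write('\n');
        }
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

Because the writer depends only on `OutputStream`, the same code can be exercised against a `ByteArrayOutputStream` in a unit test and pointed at HDFS in production.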
I am new to the Spring Batch API so forgive me if my proposed approach isn't the optimal way to introduce HDFS capabilities. I will take a look at the HDFS Resource abstraction you mention in spring-hadoop-core later today.
Feb 15th, 2012, 01:55 PM
Even though the thread is old, I just wanted to let you know that we are now finally getting around to doing this. I plan to write some HDFS ItemReader/ItemWriter implementations, but would like to compare with what you are doing and learn more about it.
You can follow the development at http://www.springsource.org/spring-data/hadoop. We are making a release very shortly.
Feb 15th, 2012, 02:30 PM
Hadoop Based Tools
After further investigation, it made more sense for us to use the Sqoop tool for this particular use case. For other use cases of writing to HDFS, we have mainly used the various Hadoop APIs directly, e.g. Avro, HTable, and Flume.