What is the correct way to distribute third-party jar files to all nodes running a MapReduce job? Should I use the 'libs' attribute in the job configuration, or the distributed cache? Note that more than one jar needs to be distributed (xalan, serializer, etc.).
Also, is there a way to verify that either of these options has worked correctly, for example by checking a log or the job file? I have tried both and have not been able to run the job successfully, so I am not sure whether I am setting them up correctly.
I configured the distributed cache as below; the jars are available at the indicated HDFS paths.
<hdp:cache create-symlink="true">
    <hdp:classpath value="/svjain/lib/xalan-2.7.1.jar" />
    <hdp:classpath value="/svjain/lib/serializer-2.7.1.jar" />
</hdp:cache>
For the job 'libs' option, I tried the configuration below, where install.dir.win and library.path refer to a non-HDFS path (library.path=lib/*.jar):
<hdp:job id="tempjob"
    input-path="xxx" output-path="xxx"
    mapper="xxx"
    jar-by-class="xxx"
    libs="${install.dir.win}/${library.path}"/>
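
On the question of verifying either option, one possible check is to probe the classpath from inside the task itself. Below is a minimal sketch, not a definitive method: the helper class ClasspathCheck is hypothetical, and the two probed class names are assumptions about what the xalan and serializer jars provide. It uses only the JDK, so it can be run locally first and then called from the mapper's setup code, where the printed lines would be expected to end up in the task's stdout log.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop or Spring for Apache Hadoop).
public class ClasspathCheck {

    // Returns true if the class can be located by this class's class loader.
    // The class is looked up but not initialized.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Splits the JVM's java.class.path property into its individual entries.
    static List<String> classpathEntries() {
        List<String> entries = new ArrayList<>();
        String classpath = System.getProperty("java.class.path", "");
        for (String entry : classpath.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Assumed class names: one expected in the xalan jar, one in the serializer jar.
        String[] probes = {
            "org.apache.xalan.processor.TransformerFactoryImpl",
            "org.apache.xml.serializer.Serializer"
        };
        for (String probe : probes) {
            System.out.println(probe + " loadable: " + isLoadable(probe));
        }
        // Listing the entries shows whether the jars were added to the JVM classpath.
        for (String entry : classpathEntries()) {
            System.out.println("classpath entry: " + entry);
        }
    }
}
```

If a probe reports false inside the task while the same jar is visible on HDFS or in the local lib directory, that would suggest the jar was not added to the task classpath rather than a problem in the job code. Note that jars added through a separate class loader may not appear in java.class.path even when the class is loadable, so the isLoadable result is the more reliable of the two signals.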

