We're having an issue with getting dependencies deployed to the remote jobtracker running our Cascading job.
Our job is a pretty straightforward Cascading job:
Code:
<bean id="flowDef" class="my.flow.factory.Class" factory-method="flowDef" c:_0="/input/path" c:_1="/output/path"/>

<hdp:cascading-flow id="wc" definition-ref="flowDef" write-dot="dot/wc.dot" jar-setup="true" jar-by-class="my.flow.factory.Class" />

<hdp:cascading-cascade id="my-cascade" flow-ref="wc"/>

<hdp:cascading-tasklet id="my-cascade-tasklet" unit-of-work-ref="my-cascade" wait-for-completion="true"/>

<batch:job id="my-cascading-job">
    <batch:step id="cascade-step">
        <batch:tasklet ref="event-history-cascade-tasklet"/>
    </batch:step>
</batch:job>

We are using jar-by-class because we have some custom operations in our flow. Before we added jar-by-class, we were getting a ClassNotFound exception for those operations.
We no longer get a ClassNotFound exception for our custom operations, but we are now seeing ClassNotFound for a Cascading class:
Code:
java.io.IOException: Split class cascading.tap.hadoop.io.MultiInputSplit not found
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:392)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: cascading.tap.hadoop.io.MultiInputSplit
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:861)
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:390)
    ... 7 more

Is there a good way to ship multiple JARs and add them to the classpath? The relevant documentation doesn't mention this case:
Note that no jar needs to be setup - the Cascading namespace (in particular cascading-flow, backed by FlowFactoryBean) tries to automatically setup the resulting job classpath. By default, it will automatically add the Cascading library and its dependency to Hadoop DistributedCache so that when the job runs inside the Hadoop cluster, the jars are properly found. When using custom jars (for example to add custom Cascading functions) or when running against a cluster that is already provisioned, one can customize this behaviour through the jar-setup, jar and jar-by-class. For Cascading users, these settings are the equivalent of the AppProps.setApplicationJarClass().
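
In case it helps anyone diagnosing the same thing: since jar-by-class names a single class, one quick check is to see which jar our factory class and the missing Cascading class each resolve to on the submitting side. Below is a minimal sketch using only plain JDK calls. JarLocator is a hypothetical helper name, not part of Spring for Apache Hadoop or Cascading, and this is only an assumption about how to inspect the local classpath, not a description of what jar-by-class does internally.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Spring for Apache Hadoop or Cascading.
final class JarLocator {

    // Returns the jar or directory a class was loaded from,
    // or a marker string when the JVM reports no code source
    // (as it does for core JDK classes).
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass fully qualified class names as arguments, for example
        // my.flow.factory.Class and cascading.tap.hadoop.io.MultiInputSplit.
        for (String name : args) {
            System.out.println(name + " -> " + locate(Class.forName(name)));
        }
    }
}
```

If the two names print different locations, the jar selected through jar-by-class cannot contain the Cascading classes, which would be consistent with the MultiInputSplit error above.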