Thread: Distribute third-party libraries to all nodes

  1. #1

    What is the correct way to distribute third-party jar files to all nodes running a MapReduce job? Should I use the 'libs' attribute in the job configuration, or the distributed cache? More than one jar needs to be distributed (xalan, serializer, etc.).

    Also, is there a way to verify that either option has worked, such as a log or job file I can check? I have tried both and have not been able to run the job successfully, so I am not sure whether I am setting them up correctly.

    I configured the distributed cache as below; the jars are available at the indicated HDFS paths.
    <hdp:cache create-symlink="true">
    <hdp:classpath value="/svjain/lib/xalan-2.7.1.jar" />
    <hdp:classpath value="/svjain/lib/serializer-2.7.1.jar" />
    </hdp:cache>

    When trying the job 'libs' option, I used the configuration below, where install.dir.win and library.path refer to a non-HDFS path (library.path=lib/*.jar).
    <hdp:job id="tempjob"
    input-path="xxx" output-path="xxx"
    mapper="xxx"
    jar-by-class="xxx"
    libs="${install.dir.win}/${library.path}"/>

  2. #2
    Join Date
    Jan 2005
    Location
    Bucharest, Romania
    Posts
    5,403

    xalan and serializer should not be needed, as the JRE/JDK already provides such libraries.
    Your configuration looks correct; however, note that when running from a Windows client you need to change the Java path separator.
    See http://static.springsource.org/sprin...tributed-cache for more information.
    Costin Leau
    SpringSource - http://www.SpringSource.com- Spring Training, Consulting, and Support - "From the Source"
    http://twitter.com/costinl
    Please use [ c o d e ] [ / c o d e ] tags

  3. #3

    Thanks, Costin. I want to use Apache Xalan rather than the libraries bundled with Java (the XSL stylesheets I am using don't compile with the Sun version). So, is the distributed cache the right way, or the 'libs' attribute in the job configuration? I have a feeling that the jars are not getting copied to the nodes. Is there any way to verify this?

  4. #4

    The distributed cache helps with the classpath and, if you follow the link I mentioned, the jars will be copied over to the nodes; you can double-check this by looking at the Hadoop logs.
    However, the classpath does not take precedence over the libraries included with the JVM, so in this case you need to use the endorsed standards override mechanism [1].
    As far as I know, Hadoop doesn't provide any support for it out of the box, so you would have to manually copy the libraries to the dedicated JVM folder on each node.
    Basically, it's not so much a Hadoop problem as a JVM problem...

    [1] http://docs.oracle.com/javase/6/docs...des/standards/
    Costin Leau

  5. #5

    Thanks. I cannot control the Hadoop environment (the classpath in hadoop-env.sh) or the Java installation, since it is a shared Hadoop environment. However, to ensure that the Apache transformer is used rather than the Sun implementation, I am passing the following Java option to the map-reduce tasks. This seems to work well, but I am struggling to get the jars to the nodes.
    <hdp:configuration>
    fs.default.name=${fs.default.name}
    mapred.job.tracker=${mapred.job.tracker}
    mapred.map.child.java.opts=-Djavax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl
    </hdp:configuration>

  6. #6

    Then make sure the distributed cache actually works: if properly configured, the jars will be made available on your job classpath, at which point the mapred.map.child.java.opts setting should be picked up.

    There might be various reasons why the property is not picked up, so I recommend experimenting with it. For example, first do some small tests to see whether the Xalan processor is available on the classpath. Then check that the properties you are setting are not marked final and are actually passed to the JVM.
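
    A small classpath test along those lines might look like the sketch below. It relies only on the JDK; the class name XalanClasspathCheck and its helper methods are made up for illustration and are not part of Hadoop or Spring for Apache Hadoop. The idea is to check whether the Xalan factory class is loadable and to report which TransformerFactory the JVM actually selects.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

// Hypothetical helper, not part of Hadoop or Spring for Apache Hadoop.
class XalanClasspathCheck {

    // The factory class named in the -D option earlier in this thread.
    static final String XALAN_FACTORY =
            "org.apache.xalan.processor.TransformerFactoryImpl";

    // Returns true if the named class can be located by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                    XalanClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the class name of the factory the JVM selects, or a short
    // message if the configured factory cannot be instantiated.
    static String activeTransformerFactory() {
        try {
            return TransformerFactory.newInstance().getClass().getName();
        } catch (TransformerFactoryConfigurationError e) {
            return "unavailable: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("Xalan on classpath: " + isOnClasspath(XALAN_FACTORY));
        System.out.println("Active factory:     " + activeTransformerFactory());
    }
}
```

    Calling the same two helpers from a mapper and writing the result to the task log would show what the task JVM sees, which may differ from the client.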

    By the way, the distributed cache makes the jars available inside HDFS for Hadoop jobs. In your case, you want these jars to be available on the actual file system so the JVM process can find them when it starts. It might be that without actually having the jars there, there's not much you can do with DistributedCache...
    Costin Leau

  7. #7

    Costin, thanks a lot. It seems the issue was related to the path separator that you mentioned in your first reply. My Hadoop cluster runs on Linux, but I am triggering the job from the Spring Tool Suite IDE on Windows. I checked the job.xml configuration and could see ";" in the value of the mapred.job.classpath.files property.
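
    That inspection can also be scripted. The sketch below assumes the usual Hadoop configuration layout (property elements with name and value children) and uses only the JDK's DOM parser. The class name, helper names, and the sample XML string are hypothetical; the sample simply mirrors the symptom described above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for inspecting a job.xml file's contents.
class JobXmlClasspathCheck {

    // Returns the value of the named property, or null if it is absent.
    static String propertyValue(String jobXml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            jobXml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse job XML", e);
        }
    }

    // A ';' in the classpath value suggests it was built on a Windows client.
    static boolean hasWindowsSeparator(String classpathValue) {
        return classpathValue != null && classpathValue.contains(";");
    }

    public static void main(String[] args) {
        // Illustrative sample only, not taken from a real job.xml.
        String sample = "<configuration><property>"
                + "<name>mapred.job.classpath.files</name>"
                + "<value>/svjain/lib/xalan-2.7.1.jar;/svjain/lib/serializer-2.7.1.jar</value>"
                + "</property></configuration>";
        String value = propertyValue(sample, "mapred.job.classpath.files");
        System.out.println("classpath files: " + value);
        System.out.println("Windows separator present: " + hasWindowsSeparator(value));
    }
}
```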

    In the application context, I have now configured a script that sets ":" as the path separator just before the distributed-cache configuration, and the job appears to be running fine.
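
    As a rough illustration of why this helps: code that joins classpath entries using the client JVM's path.separator system property produces ";" on Windows and ":" on Linux. The sketch below uses a made-up joinClasspath helper as a stand-in for such code (it is not Hadoop's implementation) and shows the effect of forcing the property to ":" before the entries are joined.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; joinClasspath is a simplified stand-in, not Hadoop code.
class PathSeparatorDemo {

    // Joins entries with whatever separator the current JVM reports.
    static String joinClasspath(List<String> entries) {
        return String.join(System.getProperty("path.separator"), entries);
    }

    // Forces ':' while joining, then restores the original separator.
    static String joinForLinuxCluster(List<String> entries) {
        String original = System.getProperty("path.separator");
        System.setProperty("path.separator", ":");
        try {
            return joinClasspath(entries);
        } finally {
            System.setProperty("path.separator", original);
        }
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList(
                "/svjain/lib/xalan-2.7.1.jar",
                "/svjain/lib/serializer-2.7.1.jar");
        System.out.println("client default: " + joinClasspath(jars));
        System.out.println("forced ':'    : " + joinForLinuxCluster(jars));
    }
}
```

    In the thread the property is set once from a script in the application context rather than restored afterwards; the restore step here only keeps the demo free of side effects.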

  8. #8

    Great. I got bitten by that bug as well earlier in the process (hence the issue raised on the Hadoop tracker). The docs have also been updated to describe the workaround (not sure whether you've checked out the latest doc snapshot).
    Costin Leau
