Thread: Specifying a JobJar in the Tool Tasklet.

  1. #1

    Hi everyone,

    I have a use case where my project needs to configure several Hadoop Tool jobs. I currently do this with the following configuration in spring.cfg.xml:

    Code:
    <hdp:tool-tasklet id="testId" scope="step" configuration-ref="hadoop-configuration" tool-class="com.test.myClass">
        <!-- Some properties -->
    </hdp:tool-tasklet>

    The jar file that contains the Tool class is included as a dependency in my project, and this works fine. However, I am facing a problem: I have several job JAR files that bundle their own dependencies, each with its own versions of various libraries. Since I have included all of these job JARs as dependencies of my project, there are a bunch of duplicate classes and libraries, potentially in different versions.

    So here is my question: is there a way to run a Tool class by specifying the jar location, as is possible with Hadoop command-line arguments such as -files or -libjars?

    Can you suggest some other way of running Tool classes without loading the actual JAR file on the classpath and without using the tool-class attribute?

    P.S: I am using spring-data-hadoop version: 1.0.0.M1

    Thanks in advance.


    Sincerely,
    David Gevorkyan

  2. #2
    Join Date
    Jan 2005
    Location
    Bucharest, Romania
    Posts
    5,403

    Hi David,

    We currently don't expose these parameters on the Tool namespace (as we do with streaming or job) - this looks like an omission. Can you please raise an issue on our tracker? If you can, also indicate what the command line looks like or what you would like to see in the namespace.

    Cheers,
    Costin Leau
    SpringSource - http://www.SpringSource.com- Spring Training, Consulting, and Support - "From the Source"
    http://twitter.com/costinl
    Please use [ c o d e ] [ / c o d e ] tags

  3. #3

    Raised issue https://jira.springsource.org/browse/SHDP-49
    Feel free to use that to follow progress.
    Costin Leau

  4. #4

    Hi Costin,

    Thanks for the quick reply.

    Actually, besides exposing the JAR file in the Tool namespace, we also need the "-files" parameter, since we have some use cases where we need to provide a properties file dynamically, on the fly.

    So our command line looks like this:

    Code:
    hadoop jar fullpath:myJar_withDependencies.jar -files fullpath:myProp.properties -Dprop1=value1 -Dprop2=value2 -Dconfig=myProp.properties

    So ideally I want to be able to specify any file (such as the properties file in the above example) to be uploaded to the cluster, and also to specify the jar with dependencies to be uploaded to the server.

    So if you can expose the same parameters on the Tool namespace as you have done for the streaming job, namely "file", "archive" and "lib", that would be great.

    Sincerely,
    David

  5. #5

    Hi David,

    I'm almost done with exposing the params (file/archive/lib) but I'm not sure about the "jar" param. "hadoop jar" currently just calls the Main class of the jar as a way to pass in configuration (the command line arguments). That's not needed in a Spring app, since it throws out any existing configuration (including the Hadoop one).

    With the upcoming improvements the command above would look like this:

    Code:
    <hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" configuration-ref="hadoop-configuration"
        properties-location="myProp.properties" files="myProp.properties">
        <hdp:arg value="data/in.txt"/>
        <hdp:arg value="data/out.txt"/>
        prop1=value1
        prop2=value2
    </hdp:tool-runner>

    Note that the Tool instance (which can be configured) or class is still required, because the Tool (which is just a glorified Main) is executed in-process: we don't create a different JVM for it, so it needs to be available. If I understand your case correctly, you have a lot of dependencies, but that shouldn't be a problem since we only load the tool class. We disregard the rest of the classes, and as long as your tool does that as well, there shouldn't be a problem.
    Let me know if this solves your problem and, if not, why.
    Last edited by Costin Leau; Apr 12th, 2012 at 10:38 AM.
    Costin Leau

  6. #6

    Committed the updates to master - you can pick up the changes in the next snapshot.
    Costin Leau

  7. #7

    Hi Costin,

    The issue is that we are attempting to replace our current shell script with Spring Batch. The shell script looks something like:

    hadoop -jobjar job1.jar ...
    ...
    hadoop -jobjar job2.jar ...
    .......
    hadoop -jobjar job10.jar ...


    These job jars contain conflicting versions of libraries (for example jackson 1.4 and jackson 1.94), and even contain different versions of Spring.

    How would you propose handling this case? We cannot simply put all 10 jars on the classpath. Perhaps a classloader approach would work?
    Last edited by davidgevorkyan; Apr 12th, 2012 at 12:32 PM.

  8. #8

    Out of curiosity what does the jar contain? Do you specify a main class or use the MANIFEST.MF instead? And what does the "main" file do? Does it implement certain interfaces or contracts?

    Back to your use case, there are some problems here:

    a. each of your commands forks a separate VM. In each one the jars are put on the classpath, but since each sits in a separate VM, there are no conflicts.
    b. everything is command-line based. This means any configuration used needs to be passed through there (whether it's a bootstrapping property file or not).
    c. the Main class isn't application friendly - as far as it's concerned it's the only app running, so it tends to call System.exit() -> we might be able to bypass that (bytecode instrumentation) but I'd like to avoid that if possible since there are a lot of subtleties involved.

    Point b doesn't make a lot of sense in an app (whether it uses Spring or not) since it simply disregards its context and only looks at the command line. Points a and c might be addressed by using a dedicated classloader.

    I'll try to come up with something; in the meantime, you might be able to work around this by pointing directly to the job that the tool/main is setting up. It's not ideal but it's worth a try. This can work since SHDP doesn't need or use the job, as we're taking care of the Hadoop setup only.
    Costin Leau

  9. #9

    The jar file contains every dependency of the Hadoop job. This is standard for older versions of Hadoop (newer ones do support something closer to a classpath). It's basically equivalent to jar -xf *.jar ; jar -cf job1.jar * with a little cleanup.

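    An example job looks something like:

    Code:
    class Job implements Tool {
        ...
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            // pass the current time to the job through the configuration
            DateTime date = new DateTime();
            config.setLong(JobConfFactory.CURRENT_DATE_IN_MILLS, date.getMillis());
            // ToolRunner parses the generic options, then the exit code ends the JVM
            System.exit(ToolRunner.run(config, new Job(), args));
        }
    }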

    I'm pretty sure that, to support the Hadoop jobjar concept, we need a tasklet that uploads the jobjar to HDFS, creates a custom classloader, loads the local jobjar into it, and then uses the existing tasklet code. This would prevent the classpath pollution caused by placing multiple jobjars on the Spring Batch classpath.
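
    The classloader part of this idea can be sketched with nothing but the JDK. This is only a rough sketch under assumptions, not how SHDP does (or will do) it: the names IsolatedJarRunner, isolatedLoader and runMain are made up for illustration, it assumes Java 9+ for ClassLoader.getPlatformClassLoader(), and it assumes the jobjar bundles everything it needs, as described above. It does not upload anything to HDFS, and it does not deal with System.exit(): a main that calls it would still take the whole JVM down.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical helper, for illustration only.
final class IsolatedJarRunner {

    /** Creates a loader that sees only the given URLs plus the JDK's own classes. */
    static URLClassLoader isolatedLoader(URL... jarUrls) {
        // The parent is the platform loader (Java 9+), not the application loader,
        // so classes on the host application's classpath are invisible here.
        return new URLClassLoader(jarUrls, ClassLoader.getPlatformClassLoader());
    }

    /** Invokes mainClass.main(args) through an isolated loader built from the jar. */
    static void runMain(URL jarUrl, String mainClass, String... args) throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        try (URLClassLoader loader = isolatedLoader(jarUrl)) {
            // Many libraries look up resources through the context class loader.
            current.setContextClassLoader(loader);
            Method main = loader.loadClass(mainClass).getMethod("main", String[].class);
            main.setAccessible(true); // the declaring class may not be public
            // Cast so the String[] is passed as one argument, not expanded as varargs.
            main.invoke(null, (Object) args);
        } finally {
            current.setContextClassLoader(previous);
        }
    }
}
```

    Usage would be one call per jobjar, for example IsolatedJarRunner.runMain(new File("job1.jar").toURI().toURL(), "com.example.Job", "in", "out"), where the path and class name are placeholders. Since each jar gets its own loader whose parent is not the application loader, job1.jar and job2.jar could carry different Jackson or Spring versions without seeing each other's classes. The main method is called reflectively because a Tool type loaded from the jobjar would not be the same type as a Tool loaded by the host application.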
    Last edited by davidgevorkyan; Apr 12th, 2012 at 01:35 PM.

  10. #10

    Thanks - the information is useful. There might be an easier solution than the one you mentioned, but testing will tell whether it works or not.
    Out of curiosity, do you specify the classname or use the manifest.mf instead?
    Costin Leau
