Specifying a JobJar in the Tool Tasklet.



davidgevorkyan
Apr 9th, 2012, 08:38 PM
Hi everyone,

I have a use case in which my project needs to configure several Hadoop Tool jobs. I do this with the following configuration in spring.cfg.xml:


<hdp:tool-tasklet id="testId" scope="step" configuration-ref="hadoop-configuration" tool-class="com.test.myClass">
<!-- Some properties -->
</hdp:tool-tasklet>

The jar file that contains the Tool class is included as a dependency in my project, and that works fine. The problem is that I have several job JAR files that bundle their own dependencies, each with its own library versions. Since all of these job JARs are dependencies of my project, the classpath ends up with a bunch of duplicate classes and libraries, potentially in different versions.

So here is my question: is there a way to run a Tool class by specifying the jar location, as is possible with Hadoop command-line arguments such as -files or -libjars?

Can you suggest another way of running Tool classes without loading the actual JAR file on the classpath and without using the tool-class argument?

P.S: I am using spring-data-hadoop version: 1.0.0.M1

Thanks in advance.


Sincerely,
David Gevorkyan

Costin Leau
Apr 10th, 2012, 02:51 AM
Hi David,

We currently don't expose these parameters on the Tool namespace (as we do with streaming or job) - this looks like an omission. Can you please raise an issue on our tracker? If you can, also indicate what the command line looks like or what you would like to see in the namespace.

Cheers,

Costin Leau
Apr 10th, 2012, 12:27 PM
Raised issue https://jira.springsource.org/browse/SHDP-49
Feel free to use that to follow progress.

davidgevorkyan
Apr 10th, 2012, 12:56 PM
Hi Costin,

Thanks for the quick reply.

Actually, besides exposing the JAR file in the Tool namespace, we also need the "-files" parameter, since in some use cases we need to provide a properties file on the fly, dynamically.

So our command line looks like this:


hadoop jar fullpath:myJar_withDependencies.jar -files fullpath:myProp.properties -Dprop1=value1 -Dprop2=value2 -Dconfig=myProp.properties

So ideally I want to be able to specify any file (such as the properties file in the example above) to be uploaded to the cluster, and also to specify the jar with dependencies to be uploaded to the server.

So if you can expose the same parameters to the Tool namespace as you have done for the streaming job, that would be great, namely the "file", "archive" and "lib".

Sincerely,
David

Costin Leau
Apr 12th, 2012, 09:25 AM
Hi David,

I'm almost done with exposing the params (file/archive/lib) but I'm not sure about the "jar" param. Hadoop jar currently just calls the Main class of the jar as a way to pass in configuration (the command line arguments). That's not needed in a Spring app since it throws out any existing configuration (including the hadoop one).

With the upcoming improvements the command above would look like this:




<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" configuration-ref="hadoop-configuration"
properties-location="myProp.properties" files="myProp.properties">
<hdp:arg value="data/in.txt"/>
<hdp:arg value="data/out.txt"/>
prop1=value1
prop2=value2
</hdp:tool-runner>

Note that the Tool instance (which can be configured) or class is still required, because the Tool (which is just a glorified Main) is executed in-process - we don't create a different JVM for it, so it needs to be available. If my understanding of your case is correct, you have a lot of dependencies, but that shouldn't be a problem since we only load the tool class - we disregard the rest of the classes, and as long as your tool does that as well, there shouldn't be a problem.
Let me know whether this solves your problem and, if not, why.

Costin Leau
Apr 12th, 2012, 10:38 AM
Committed the updates to master - you can pick up the changes in the next snapshot.

davidgevorkyan
Apr 12th, 2012, 12:28 PM
Hi Costin,

The issue is we are attempting to replace our current shell script with spring batch. The shell script would look something like:

hadoop -jobjar job1.jar ...
...
hadoop -jobjar job2.jar ...
.......
hadoop -jobjar job10.jar ...


These job jars have conflicting versions of libraries in them (for example jackson 1.4 and jackson 1.94), and even have different versions of spring contained within them.

How would you propose handling this case? We cannot simply put all 10 jars on the classpath. Perhaps a classloader approach would work?

Costin Leau
Apr 12th, 2012, 12:53 PM
Out of curiosity what does the jar contain? Do you specify a main class or use the MANIFEST.MF instead? And what does the "main" file do? Does it implement certain interfaces or contracts?

Back to your use case, there are some problems here:

a. each of your commands forks a separate VM. In each one the jars are put on the classpath, but since each sits in a separate VM, there are no conflicts.
b. everything is command-line based. This means any configuration used needs to be passed through there (whether it's a bootstrapping property file or not).
c. the Main class isn't application friendly - as far as it's concerned it's the only app running so it tends to do System.exit() -> we might be able to bypass that (bytecode instrumentation) but I'd like to avoid that if possible since there are a lot of subtleties involved.

b doesn't make a lot of sense in an app (whether it uses Spring or not) since it simply disregards its context and only looks at the command line. a) and c) might be addressed by using a dedicated classloader.
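
Point c can be illustrated in plain Java. The following is a minimal sketch, not taken from SHDP or from the jobs discussed here: the class name ExitFriendlyJob and its argument check are hypothetical. It only shows the shape that makes a "main" embeddable: the work reports its status through a return value, and System.exit() is confined to the command-line wrapper.

```java
// Hypothetical job entry point; every name here is illustrative only.
final class ExitFriendlyJob {

    // Does the actual work and reports the outcome through the return value,
    // so an embedding application can call it without the JVM being terminated.
    static int run(String[] args) {
        for (String arg : args) {
            if (arg.trim().isEmpty()) {
                System.err.println("Blank argument rejected");
                return 2;
            }
        }
        System.out.println("Job would run with " + args.length + " argument(s)");
        return 0;
    }

    // Only the command-line wrapper converts the status into a process exit code.
    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```

An in-process caller would invoke ExitFriendlyJob.run(...) directly and inspect the returned status, which is the same contract that ToolRunner.run offers for a Tool.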

I'll try to come up with something; in the meantime, however, you might be able to work around this by pointing directly to the job that the tool/main is setting up. It's not ideal but it's worth a try. This can work since SHDP doesn't need or use the job, as we're taking care of the Hadoop setup only.

davidgevorkyan
Apr 12th, 2012, 01:29 PM
The jar file contains every dependency of the hadoop job. This is standard for older versions of hadoop (newer ones do support something closer to a classpath). It's basically equivalent to jar -xf *.jar ; jar -cf job1.jar * with a little cleanup.

An example job looks something like
class Job implements Tool {
    ...
    // ToolRunner.run declares "throws Exception", so main has to propagate it
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        DateTime date = new DateTime();
        config.setLong(JobConfFactory.CURRENT_DATE_IN_MILLIS, date.getMillis());
        System.exit(ToolRunner.run(config, new Job(), args));
    }
}

I'm pretty sure that to support the hadoop jobjar concept we need a tasklet that uploads the jobjar to HDFS, creates a custom classloader, loads the local jobjar in it and then uses the existing tasklet code. This would prevent the classpath pollution of placing multiple jobjars on the Spring Batch classpath.
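
The classloader part of that idea can be sketched with the JDK alone. This is a minimal sketch under stated assumptions, not SHDP's implementation: JobJarLoader and its methods are hypothetical names, "job1.jar" is a placeholder path, and the class is loaded by name through reflection so the example compiles without Hadoop. It uses the standard parent-first delegation of URLClassLoader, so anything the parent can load still wins over the copy inside the jar.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of SHDP): one classloader per job jar.
final class JobJarLoader {

    // Creates a loader whose own search path contains only the given job jar.
    // The parent should supply the types shared with the host application;
    // with parent-first delegation, classes visible to the parent take precedence.
    static URLClassLoader isolatedLoader(Path jobJar, ClassLoader parent) {
        try {
            URL[] urls = { jobJar.toUri().toURL() };
            return new URLClassLoader(urls, parent);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable jar path: " + jobJar, e);
        }
    }

    // Resolves the tool class by name through the given loader without initializing it.
    static Class<?> loadToolClass(ClassLoader loader, String className) {
        try {
            return Class.forName(className, false, loader);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(className + " is not visible to " + loader, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // "job1.jar" is a placeholder; a jar that does not exist contributes no classes.
        try (URLClassLoader loader = isolatedLoader(Paths.get("job1.jar"),
                JobJarLoader.class.getClassLoader())) {
            System.out.println(loadToolClass(loader, "java.lang.Runnable").getName());
        }
    }
}
```

Because each job jar gets its own loader, two jars can carry different versions of the same library, as long as that library is not also visible through the shared parent.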

Costin Leau
Apr 12th, 2012, 02:05 PM
Thanks - the information is useful. There might be an easier solution than the one you mentioned but testing will tell whether it works or not.
Out of curiosity, do you specify the classname or use the manifest.mf instead?

davidgevorkyan
Apr 12th, 2012, 04:11 PM
I am actually specifying the classname.

Costin Leau
Apr 13th, 2012, 12:59 PM
Hi David,

I've updated the tool support so now a jar file (not available on the classpath) can be specified - the loading process is done on a separate classloader so multiple versions and libraries can be used:



<hdp:tool-runner id="tool-jar" tool-class="test.SomeTool" jar="some-tool.jar"/>


Note that currently we don't do any copying or unpacking of the jar so things like nested /libs or /classes won't work - I'll add support for these (legacy) formats after (Orthodox) Easter. Feedback is welcome - also the structure of your jars is useful.

Cheers,

davidgevorkyan
Apr 13th, 2012, 04:55 PM
Hi Costin,

Thanks for looking into this.

Can I get the latest artifact from somewhere?

Regarding your question about our jar structure: it doesn't have any nested libs, so it only has META-INF directory and compiled classes in the corresponding package directories.

Sincerely,
David

Costin Leau
Apr 14th, 2012, 01:08 AM
Of course, see [1]. Simply add the snapshot repo to your gradle/maven build and the SHDP snapshot and all of its dependencies (including non-SpringSource ones) will be downloaded from there.

[1] http://www.springsource.org/spring-data/hadoop#maven

davidgevorkyan
Apr 16th, 2012, 02:39 PM
Thanks Costin,

I downloaded the latest snapshot version and found one issue.

It seems that you have removed the scope parameter from the latest version.
This parameter is required for our cases, since we construct the arguments based on the jobParameters, and these values are resolved only when the tasklet has scope="step".
Here is an example of a tasklet that uses "jobParameters":



<hdp:tool-tasklet id="taskletId" scope="step" configuration-ref="hadoop-configuration" tool-class="SomeClass">
<hdp:arg value="${propertyVal1}#{jobParameters['RUN_ID']}${propertyVal2}"/>
</hdp:tool-tasklet>


Sincerely,
David

Costin Leau
Apr 16th, 2012, 03:09 PM
That was probably an unintended modification (though I don't recall scope being exposed). You should however still be able to use it through the beans namespace (beans XML that is):

<bean class="org.springframework.data.hadoop.mapreduce.ToolTasklet" scope="step" p:tool-class="SomeClass" p:configuration-ref=""/>

I'll fix the scope tomorrow morning (my time) but I'm interested to see whether the classloader update fixes your core issue.

Cheers,

davidgevorkyan
Apr 16th, 2012, 07:53 PM
Thanks Costin,

I tried to specify both the jar's relative path and its full path, but neither worked.
Here is the bean definition:



<beans:bean id="test_hadoopTasklet" class="org.springframework.data.hadoop.mapreduce.ToolTasklet" scope="step"
p:tool-class="${test_tool_class}"
p:jar="test-jobjar.jar"
p:configuration-ref="hadoop-configuration">
<beans:property name="arguments">
<beans:list>
<beans:value>${value1}</beans:value>
<beans:value>${value2}</beans:value>
<beans:value>${part1}#{jobParameters['RUN_ID']}${part2}</beans:value>
<beans:value>${part3}#{jobParameters['RUN_ID']}${part4}</beans:value>
</beans:list>
</beans:property>
</beans:bean>


The following exception is being thrown:



org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'java.lang.Class' for property 'toolClass'; nested exception is java.lang.IllegalArgumentException: Cannot find class [package.ClassName]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:527)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean


Please let me know if I am doing something wrong.


Sincerely,
David

Costin Leau
Apr 17th, 2012, 12:38 AM
Are you sure you're using the latest 1.0.0.BUILD-SNAPSHOT? Can you post the name of the artifact? toolClass is not a class anymore but a String so there shouldn't be any conversion error.

Costin Leau
Apr 17th, 2012, 06:52 AM
Pushed an update which adds support for nested libraries (legacy jars). The latest snapshot is [1] 1.0.0.BUILD-20120417.114024-66.jar

[1] http://repo.springsource.org/webapp/search/artifact?1&q=spring-data-hadoop

Costin Leau
Apr 17th, 2012, 12:16 PM
And another update - the scope attribute is still there for tool-tasklet. That is, assuming you are using the correct SNAPSHOT (as mentioned above).

davidgevorkyan
Apr 17th, 2012, 12:47 PM
Hi Costin,

Thanks for the update.
Will try and let you know.

davidgevorkyan
Apr 17th, 2012, 02:32 PM
Hi Costin,

Did basic testing, everything works correctly. Haven't tested the case with specifying the property file, will work on that as well and give you an update.

Thanks for looking into this.

P.S: It seems that I had somehow downloaded the wrong snapshot version, from February, which is why I had issues before.

Sincerely,
David

Costin Leau
Apr 18th, 2012, 06:37 AM
That's great! Let me know how it goes and of course, if you have any suggestions - bring them on :).

Cheers,

davidgevorkyan
Apr 23rd, 2012, 12:43 PM
Hi Costin,

Is this change going to be available in the Milestone release?


Sincerely,
David

Costin Leau
Apr 23rd, 2012, 02:49 PM
Of course. This functionality (which has now been extended to hdp:job as well - meaning one can configure a Hadoop job (with all its dependencies) from an external jar, not on the classpath) will be available in the next release along with the HBase extensions and potentially some security improvements just to name a few.
The ETA is probably second half of May but don't quote me on that - keeping an eye on JIRA should help.

Hth,

davidgevorkyan
Apr 23rd, 2012, 07:57 PM
Nice to hear that :)

Btw, I was adding more jobs and I came across the following issue:

I have a my_job.jar, which has the following classes:


package test.inner.mypackage;

import test.inner.MultipleOutputNamingDecider;

public class MyJob extends Configured implements Tool {
public final JobConf createJobConf(String[] args) {
final JobConf conf = new JobConf(getConf(), MyJob.class);
conf.setJobName("My Job Name");
...
conf.setOutputFormat(MultipleOutputNamingDecider.class);
}

public static void main(String[] args) {
new MyJob().configuredBy(args).run();
System.exit(0);
}
}


My tasklet is in the following form:



<hdp:tool-tasklet id="MyJob_hadoopTasklet" scope="step" configuration-ref="hadoop-configuration"
tool-class="test.inner.mypackage.MyJob" jar="my_job.jar">
...
</hdp:tool-tasklet>


When I am running the tasklet, I am getting the following class not found exception:



java.lang.RuntimeException: java.lang.ClassNotFoundException: test.inner.MultipleOutputNamingDecider
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1028)
at org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:619)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:874)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
at test.inner.mypackage.MyJob.run(MyJob.java:57)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.springframework.data.hadoop.mapreduce.ToolExecutor.runTool(ToolExecutor.java:47)
at org.springframework.data.hadoop.mapreduce.ToolTasklet.execute(ToolTasklet.java:33)


I thought you had covered these cases, or am I mistaken?

P.S: I am using spring-data-hadoop-1.0.0.BUILD-20120423.231511-73 version.

Sincerely,
David

Costin Leau
Apr 24th, 2012, 02:10 AM
I'll take a look - I've probably missed a 'configuration' spot.

davidgevorkyan
May 1st, 2012, 03:10 PM
Hi Costin,

Did you have a chance to look into this?

Sincerely,
David

Costin Leau
May 3rd, 2012, 08:01 AM
Hi,

Sorry for the delay - I was on the road through EU for the SpringOne / CloudFoundry tour.
I managed to replicate your problem and applied a fix. It is available in master and I forced a nightly build, so please go ahead and try out the latest snapshot.

The issue was that, during tool execution (and unfortunately throughout its usage), the Hadoop configuration does not preserve or copy the classloader that was set, and also relies on the thread context classloader (which is a fragile mechanism at best). This is now handled by the tool support - let me know whether the latest update works for you.
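
The thread context classloader part of this can be illustrated with a small helper. This is a sketch only, assuming nothing about how SHDP actually does it: ContextClassLoaderSwap and callWith are hypothetical names. It installs a given loader as the context classloader for the duration of a call and always restores the previous one, so code that resolves classes through the thread (as described above for Hadoop) sees the job jar's loader.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not SHDP code): scopes a context classloader to one call.
final class ContextClassLoaderSwap {

    // Runs the action with 'loader' as the thread context classloader and
    // restores the previous loader afterwards, even if the action fails.
    static <T> T callWith(ClassLoader loader, Callable<T> action) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            return action.call();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Action failed under " + loader, e);
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        ClassLoader jobLoader = new ClassLoader(ContextClassLoaderSwap.class.getClassLoader()) { };
        boolean visible = callWith(jobLoader,
                () -> Thread.currentThread().getContextClassLoader() == jobLoader);
        System.out.println("job loader visible inside the call: " + visible);
    }
}
```

The try/finally is the important part: without the restore, a long-running process would keep the job's loader attached to the thread after the tool has finished.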

Cheers!

davidgevorkyan
May 4th, 2012, 07:50 PM
Hi Costin,

Thanks for the update. I downloaded latest Snapshot and it worked for me.
We still have one type of job that I haven't tested, namely the one where I need to provide a properties file on the fly.
Will test that on Monday and let you know if there are any issues.


Sincerely,
David

Costin Leau
May 8th, 2012, 04:42 AM
Great. Could you share more on how you pass the properties to your jobs and the use of Spring Batch?
Do the jobs pass information between each other - and if so, how? Anything that you think is missing?

Cheers,

davidgevorkyan
May 11th, 2012, 07:37 PM
Hi Costin,

I did extensive testing with property files and here are my findings:

Below is my tasklet configuration:


<hdp:tool-tasklet id="hadoopTasklet" scope="step" configuration-ref="hadoop-configuration"
tool-class="MyJob" jar="my_job.jar" properties-location="myProp1.properties" files="myProp1.properties">
...
</hdp:tool-tasklet>


Here is one of the spring configs in my_job.jar:


<bean id="propertyPlaceholderConfigurer"
class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="ignoreUnresolvablePlaceholders" value="true" />
<property name="ignoreResourceNotFound" value="true" />
<property name="locations">
<list>
<value>classpath:myProp1.properties</value>
</list>
</property>
</bean>


I can tell you for sure that the properties-location property is working, since all the values from the properties file were loaded and can be seen on the Job Configuration page in Hadoop.

At the same time the files property doesn't work, since I am getting the following errors:



attempt_201107061330_12011_r_000321_0: 2012-05-11 16:59:31,477 WARN [PropertyPlaceholderConfigurer]
Could not load properties from class path resource [myProp1.properties]: class path resource
[myProp1.properties] cannot be opened because it does not exist
...


I tried to change it to files="classpath:myProp1.properties" but it didn't solve the issue.

When invoking from command line we were using -files fullPath/myProp1.properties


Please let me know if I need to add something else.


Sincerely,
David

Costin Leau
May 11th, 2012, 08:50 PM
It looks like you ran into a bug [1] :(
The namespace parser currently doesn't properly parse the files/archives/libs properties (even though you did specify them).
I'll try to address that shortly - unfortunately I'm travelling (again) as we speak and will be in transit until next week; in the worst-case scenario I'll have an update by next weekend.

Thanks for your feedback and patience.

[1] https://jira.springsource.org/browse/SHDP-74

Cheers,

davidgevorkyan
May 18th, 2012, 12:51 PM
Hi Costin,

Please let me know as soon as you have an update on this.


Sincerely,
David

Costin Leau
May 18th, 2012, 08:32 PM
I haven't forgotten this. Due to some hectic problems (and some bad food) I didn't get a chance to take a look at it, but I plan to do so this weekend or early next week, as I'm currently on my way home.

Costin Leau
May 23rd, 2012, 08:44 AM
Hi David,

The issue has been fixed and the update pushed upstream and a nightly build published.

P.S. The namespace was actually working but the tool wasn't configured properly - which has now been addressed.

Feedback is welcome!

davidgevorkyan
May 25th, 2012, 01:27 PM
Hi Costin,

Thanks for the update. I haven't been able to test the changes yet, since the cluster is undergoing a complete rebuild.
I will test them and let you know as soon as possible.

Sincerely,
David

Costin Leau
Jun 4th, 2012, 03:43 AM
Hi David,

Any update?

davidgevorkyan
Jun 5th, 2012, 01:07 AM
Hi Costin,

Sorry for the late reply, the cluster was rebuilt on Friday and I only got a chance to test it today.
I just ran an end-to-end test of the project against spring-data-hadoop-1.0.0.BUILD-20120603.231511-119 and everything is working well. I will run another test tomorrow and will let you know if I find anything else.


Sincerely,
David

Costin Leau
Jun 5th, 2012, 05:29 AM
That's good to hear. By the way, the latest snapshot also supports security (the docs should get in there by the end of the day) meaning you can now specify a "user" as an impersonation for running Tool#run.
For what it's worth, this is supported across the board by all MR components (streaming, job & tool) and Pig as well.

davidgevorkyan
Jun 7th, 2012, 12:19 AM
Hi Costin,

Regarding the impersonation, can you elaborate a little on what you mean? Are you using some property in hadoop, such as "hadoop.job.ugi"?

Sincerely,
David

Costin Leau
Jun 7th, 2012, 03:23 AM
See the docs at [1] - code can be executed using a different entity (assuming the cluster has been configured accordingly).

[1] http://repo.springsource.org/libs-snapshot-local/org/springframework/data/spring-data-hadoop/1.0.0.BUILD-SNAPSHOT/spring-data-hadoop-1.0.0.BUILD-SNAPSHOT-docs.zip

Costin Leau
Jun 7th, 2012, 03:50 AM
By the way, another feature available in the latest nightly build is the fallback to mainClass for tool namespace, if a jar is mentioned and no tool* is specified. Basically if your main class is the Tool (applies to most cases I've seen so far), simply point the tool to the jar and everything will be scanned automatically.
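
The idea behind this fallback can be sketched with the JDK's manifest API. This is not SHDP's code: MainClassFallback and resolveToolClass are hypothetical names, and the sketch assumes the only rule is "use the explicit tool class if present, otherwise the jar's Main-Class entry".

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper (not SHDP code): picks the class to run for a job jar.
final class MainClassFallback {

    // Returns the explicitly configured tool class when there is one,
    // otherwise the Main-Class attribute from the jar's manifest.
    static String resolveToolClass(String explicitToolClass, Manifest manifest) {
        if (explicitToolClass != null && !explicitToolClass.trim().isEmpty()) {
            return explicitToolClass.trim();
        }
        String mainClass = manifest == null ? null
                : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        if (mainClass == null) {
            throw new IllegalArgumentException(
                    "No tool class given and no Main-Class entry in the manifest");
        }
        return mainClass.trim();
    }

    public static void main(String[] args) {
        // An in-memory manifest stands in for the one read from a real jar.
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, "test.SomeTool");
        System.out.println(resolveToolClass(null, manifest));
    }
}
```

For a real jar, the manifest can be obtained with new java.util.jar.JarFile(path).getManifest(), which returns null when the jar has no manifest; the sketch treats that the same as a missing Main-Class entry.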

Costin Leau
Jun 7th, 2012, 07:15 AM
By the way, the snapshot docs are now published online in browsable format as well:
http://static.springsource.org/spring-hadoop/docs/snapshot/reference/html/security.html

davidgevorkyan
Jun 10th, 2012, 11:38 PM
That's really cool, especially locating mainClass and considering that as Tool.
I have 2 questions:

1) Can I specify more than 1 property file in the files argument (are the file names comma separated)?
2) Is it possible to cut a milestone release with the current functionality, since we have been waiting for that and are very excited to use it in one of our applications?

Sincerely,
David

Costin Leau
Jun 11th, 2012, 03:04 AM
1. Yes, you can. I've just committed a fix for this in master (https://jira.springsource.org/browse/SHDP-79). This applies to files, archives and libs, and you can even specify patterns - files="dev/props.properties, cfg/*.properties"
2. M2 is going to be released this week. Watch the forum for the release announcement.
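
How a comma-separated value such as the files example in point 1 breaks down into entries can be sketched in plain Java. This is an illustration only, not the SHDP parser: ResourceListParser is a hypothetical name, and the JDK's "glob" matcher is used as a stand-in for whatever pattern syntax SHDP resolves.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not SHDP code): splits and matches a files/archives/libs value.
final class ResourceListParser {

    // Splits on commas, trims surrounding whitespace and drops empty entries.
    static List<String> split(String attributeValue) {
        List<String> entries = new ArrayList<>();
        if (attributeValue == null) {
            return entries;
        }
        for (String token : attributeValue.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    // Checks a relative path against one entry, treating the entry as a JDK glob.
    // Assumption: glob semantics, where '*' does not cross directory boundaries.
    static boolean matches(String entry, String relativePath) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + entry);
        return matcher.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        for (String entry : split("dev/props.properties, cfg/*.properties")) {
            System.out.println(entry + " matches cfg/a.properties: "
                    + matches(entry, "cfg/a.properties"));
        }
    }
}
```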

davidgevorkyan
Jun 11th, 2012, 12:00 PM
Sounds good, thanks.

Costin Leau
Jun 13th, 2012, 07:14 AM
1.0.0.M2 was just released. See http://forum.springsource.org/showthread.php?127401-Spring-for-Apache-Hadoop-1-0-0-M2-Released

davidgevorkyan
Jun 15th, 2012, 06:12 PM
Thanks Costin,

I just noticed some duplications in the dependencies. For example:


Found in:
commons-beanutils:commons-beanutils:jar:1.7.0:compile
commons-beanutils:commons-beanutils-core:jar:1.8.0:compile


I would advise using the Maven Enforcer plugin:



<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-enforcer-plugin</artifactId>
<version>1.1</version>
<dependencies>
<dependency>
<groupId>org.codehaus.mojo</groupId>
<artifactId>extra-enforcer-rules</artifactId>
<version>1.0-alpha-3</version>
</dependency>
</dependencies>
</plugin>

Costin Leau
Jun 16th, 2012, 02:26 AM
Those are not from our project - they are probably transitive dependencies pulled in by other dependencies. And we are not using Maven but Gradle.
We could try to find out who pulls them in, but even then excluding them will most likely cause issues - we could exclude 1.8 and leave 1.7 but still... It will end up as a messy situation, trying to 'fix' something that we don't control...

davidgevorkyan
Jun 16th, 2012, 02:47 AM
Ahh, I see. We are using Maven Enforcer, since having duplicate classes in the classpath can potentially cause unknown issues and that has been the case several times when we had transitive dependencies that would interfere with each other.
I have excluded 1.8 from our project, but agree with you that it's not worth spending too much effort on that, unless it is causing some weird behavior.

Sincerely,
David

davidgevorkyan
Jun 25th, 2012, 02:45 PM
Hi Costin,

Is there a way to output Job Counters of the executing Hadoop Tool Tasklet?


Sincerely,
David

Costin Leau
Jun 25th, 2012, 03:59 PM
Not out of the box. Care to give a code example of what you are looking for?

Cheers.

davidgevorkyan
Jun 25th, 2012, 05:02 PM
We are using org.apache.hadoop.mapred.Reporter for doing some reporting when running hadoop jobs.
We have an enumeration with the counters and we call reporter.incrCounter(Enum, 1); after a specific job finishes, all of the Job Counter statistics are shown, something like:

Counter   Map   Reduce       Total
Value1    0     932,644      932,644
Value2    0     46,125,154   46,125,154
Value3    0     932,644      932,644

As you can see, same counter can be incremented both on the mapper and reducer sides, if needed.

Let me know if you need more information.

P.S: Btw, tried to send a message to you through this forum, but seems that your inbox is full :)

Costin Leau
Jun 26th, 2012, 12:05 PM
Right, but I'm still missing what kind of work SHDP could do here. As far as I can tell you don't need any specific configuration for this to work? Or am I missing something?

P.S. Yeah, my inbox gets full every month or so, and I ended up cleaning it some years ago...

davidgevorkyan
Jun 26th, 2012, 12:27 PM
Ok, so usually when I am running the job from console after the job ends, all these statistics are printed out, so I can crawl the logs and extract important information for reporting purposes, such as sending email after each job run.
If SHDP can print out all these statistics after it executes the jobs, that would be great, what do you think?

Costin Leau
Jun 26th, 2012, 01:00 PM
I think I know what your issue is - the job tasklet runs the job in a non-verbose manner. I can make that configurable so the information shows up. Out of curiosity, how are you using this data - nobody really reads the logs.

davidgevorkyan
Jun 26th, 2012, 05:40 PM
Suppose that each job calculates some stats during execution, for example:


TOTAL_USERS
ACTIVE_USERS
SUBSCRIBER_USERS
...


These are incremented in the reducer, as it finds more Users of a specific type. Please note that these numbers might be different depending on when it is run and for which geographic location, so we need to know these numbers to understand for example, how did our Marketing Campaign X impact user growth, or decline, etc...

This is just a simple example, since the counters are of different types and can be used for many other purposes.

Costin Leau
Jun 27th, 2012, 01:30 AM
Right. But the counters are currently incremented right - you just can't see their output in the console, is that right?

davidgevorkyan
Jun 27th, 2012, 03:03 PM
When I am running the job from console, the counter values are printed after the job is done.
From the SHDP perspective, you are right, they are actually incremented, but I can't see them in console.

davidgevorkyan
Jul 2nd, 2012, 05:33 PM
I was also wondering, is it possible to add more logging when running hadoop tool tasklets, such as to show Jar file, Tool Class, Hadoop Arguments, Hadoop Job Configuration Properties (the ones that we are specifying/overriding from the tasklet).
They can be very helpful when having many hadoop jobs and doing manual inspection in the logs.

As you suggested, it can be controlled by specifying verbose flag.

Sincerely,
David

Costin Leau
Jul 9th, 2012, 01:17 PM
I've raised SHDP-89 and added the option to job-tasklet namespace. Note that the verbose option will only affect the job-tasklet - since a tool takes care of running its own job, it's out of our control.
However since you mentioned you wanted info about the tool execution, I've added various logging on trace and info level.

Try master and see whether it works for you. If not, please comment on SHDP-89.

https://jira.springsource.org/browse/SHDP-89

davidgevorkyan
Jul 27th, 2012, 04:36 AM
Hi Costin,

Thanks for taking a look at it, I will try it and let you know how it works for me.

One quick question: I just faced an issue in a long-running Spring Batch process that submits jobs using SHDP. I got a "java.lang.OutOfMemoryError: PermGen space" error, and from the monitoring data it seems that the loaded classes aren't being unloaded. Off the top of my head, that means the class loader that loaded those classes isn't being garbage collected.

Did you run any tests on that, do you think the ParentLastURLClassLoader is being properly garbage collected and classes that it loads, unloaded?

The issue might be in a totally different part, but the increase is high, and since SHDP is the only part that loads jar files with dependencies, my guess is that it might have caused that increase.
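
One way to probe this kind of suspicion in isolation is to hold a loader only through a WeakReference and see whether it disappears once nothing else refers to it. This is a minimal sketch with hypothetical names (LoaderLeakCheck, isCollected); it says nothing about ParentLastURLClassLoader itself, and since System.gc() is only a hint, a "false" result is suggestive rather than proof of a leak.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical diagnostic (not SHDP code): checks whether a loader became collectable.
final class LoaderLeakCheck {

    // Requests garbage collection up to 'attempts' times and reports whether the
    // weakly referenced loader has been cleared. A loader that is still strongly
    // reachable (for example through a thread or a static field) is never cleared.
    static boolean isCollected(WeakReference<ClassLoader> ref, int attempts) {
        for (int i = 0; i < attempts && ref.get() != null; i++) {
            System.gc(); // only a request; the JVM may ignore it
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }

    public static void main(String[] args) throws Exception {
        URLClassLoader loader = new URLClassLoader(new URL[0]);
        WeakReference<ClassLoader> ref = new WeakReference<>(loader);
        loader.close();
        loader = null; // drop the only strong reference held by this method
        System.out.println("collected: " + isCollected(ref, 10));
    }
}
```

Classes are unloaded only together with their defining loader, so a loader that never becomes collectable keeps every class it loaded alive.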

Please let me know what you think.

Sincerely,
David

Costin Leau
Jul 27th, 2012, 07:09 AM
Hi David,

I've done some debugging and found something. However since this thread is becoming epic (7 pages and 63 replies) I have created a new thread here to keep things clean:
http://forum.springsource.org/showthread.php?128787-ClassLoader-leak-when-using-jars
Let's follow the discussion there.