PDA

View Full Version : JdbcCursorInputSource and OutOfMemoryError



sandrine
Feb 19th, 2008, 04:27 AM
Hi,

I've made a batch using JdbcCursorInputSource (spring batch m3). The more lines I read, the more java heap is consumed and then the batch ends by OutOfMemoryError (250Mo used for 200.000 lines in my database).
I can execute my batch using the -Xmx option but I think that's not a good solution :
the consumed memory should not increase with the number of lines returned by the cursor.

JdbcCursorInputSource seems not to close the resultset at commit time.
Is there some foreseen evolution to answer to this problem?
I would think JdbcCursorInputSource should create Jdbc resources (statement & resultset) at the beginning of each commitInterval and close them at the end of the commitInterval. This would help manage database locks and resource usage.

Thanks beforehand.

lucasward
Feb 19th, 2008, 09:31 AM
The cursor doesn't close the connection at commit time because doing so would close the cursor, requiring you to open it again when starting the next transaction which would be very expensive.

It shouldn't be taking up any heap size, since the only thing it's holding onto is the resultset itself, are you holding the output of the resultset in some kind of collection that isn't being freed?

sandrine
Feb 20th, 2008, 01:55 AM
The only thing I do with the resultSet is to pass it to the mapRow method of my RowMapper in which I read it to create some domain Object.
Then I write domain Object in a file using a Writer that extends FlatFileItem Writer.
I have replaced my processor by a MockProcessor which do nothing and I've had the same result : an increase of consumed memory during the batch's execution.

Do you think there's a solution at my problem?
Do you want an extract of my code?

(sorry for my bad english)

lucasward
Feb 20th, 2008, 03:55 PM
I think posting your code would be helpful. I would imagine there's probably a pesky memory leak somewhere, potentially in the RowMapper, but if there is an issue with the InputSource seeing the code can help to narrow it down.

sandrine
Feb 21st, 2008, 10:46 AM
I have done many tests and I have found that the memory increases until 550 Mo when I use

stmt = con.createStatement(ResultSet.TYPE_SCROLL_INSENSIT IVE,ResultSet.CONCUR_READ_ONLY,ResultSet.HOLD_CURS ORS_OVER_COMMIT);
and until 50 Mo when I use

stmt = con.createStatement();
Then it seems to be a jdbc/Oracle problem.

I know that JdbcCursorInputSource needs to use TYPE_SCROLL_INSENSITIVE to use absolute method of ResultSet.
If you have any suggestion, I'm happy to read it.

If you always want my code with JdbcCursorInputSource or my test without JdbcCursorInputSource, say to me.

I have also test to close the ResultSet at commit time but you were right, the performances are very bad.

lucasward
Feb 21st, 2008, 02:09 PM
Hmmm....i would definitely like to look at your test code. Have you tried setting the fetch size and/or max rows?

sandrine
Feb 22nd, 2008, 09:36 AM
To detect that the problem comes from TYPE_SCROLL... choice at create statement time, I have created a batch having the following job.xml:

<bean id="step1" parent="simpleStep">
<property name="commitInterval" value="20000" />
<property name="saveRestartData" value="false" />
<property name="allowStartIfComplete" value="true"/>
<property name="tasklet">
<bean class="org.springframework.batch.execution.tasklet.Restar tableItemProviderTasklet">
<property name="itemProvider" ref="DBItemProvider" />
<property name="itemProcessor" ref="compositeItemProcessor"/>
</bean>

DBItemProvider:

<bean id="DBItemProvider" class="mypackage.VersementItemProvider">
<property name="dataSource" ref="dataSource" />
<property name="mapper">
<bean class="mypackage.VersementRowMapper" />
</property>
</bean>
The provider don't realy use the mapper in my test then I won't give you mapper's code.

Processor:

<bean id="compositeItemProcessor" class="org.springframework.batch.item.processor.Composite ItemProcessor">
<property name="itemProcessors">
<list>
<!--<bean id="fileItemProcessor" class="org.springframework.batch.item.processor.ItemWrite rItemProcessor">
<property name="itemWriter" ref="fileVersementItemWriter"/>
</bean>-->
<bean id="mockProcessor" class="com.natixis.sphinx.batch.modele.processor.MockProc essor"/>
<!--<bean id="DBItemProcessor" class="org.springframework.batch.item.processor.ItemWrite rItemProcessor">
<property name="itemWriter" ref="sqlSupportDao"/>
</bean>-->
</list>
</property>
</bean>
Only the MockProcessor is used in my test and it does nothing.
its code:

public class MockProcessor implements ItemProcessor {

public void process(Object data) throws Exception {
// TODO Raccord de méthode auto-généré

}

}

My VersementItemProvider is:

public class VersementItemProvider implements ItemProvider {


private Log log=LogFactory.getLog(VersementItemProvider.class) ;

private static final String SELECT_SUPPORT_RECORD = "SELECT c.ceseqo \"contrat\", m.ceseqo, s.ceseqo \"support\", m.amount \"amount\",s.cotaux \"taux\",t.valcot \"cote\",s.nbpart \"nbpart\" "+
" FROM sphmcontrat c, sphmmvt m, sphmsupport s,sphmtypesupport t " +
" WHERE m.dateff<= '01/02/2008' and m.coperi=1 and m.cocont=c.ceseqo and c.ceseqo=s.cocont and t.ceseqo=s.cosupp and (c.coetco=1 or c.coetco=2) order by c.ceseqo, m.ceseqo, s.ceseqo";

private static final String SELECT_CLIENT_RECORD = "select p.ceseqo, p.cocivi,p.nompat,p.nommar,p.prenom,p.preno2,p.dah eme,p.cosecu,p.cositf,a.librue,a.copost,a.libvil,a .copays,a.coposn,a.copayn,a.libvin,c.cochar,c.dahe m1,c.numrib from sphmclient c, sphmadresse a, sphmpersonne p where c.cochar=2 and c.copers=p.ceseqo and p.ceseqo=a.copers order by p.ceseqo";
private Connection con;

private Statement stmt;

protected ResultSet rs;

private DataSource dataSource;
private RowMapper mapper;
private boolean initialized = false;


public Object next() {
if (!initialized){
try {
con = dataSource.getConnection();

stmt = con.createStatement(ResultSet.TYPE_SCROLL_INSENSIT IVE,ResultSet.CONCUR_READ_ONLY, ResultSet.HOLD_CURSORS_OVER_COMMIT);
rs=stmt.executeQuery(SELECT_CLIENT_RECORD);
initialized=true;
} catch (SQLException e) {
log.error(e);
throw new RuntimeException(e);
}

}
Object result = null;
try {
if (rs.next()){
result = new Object();//mapper.mapRow(rs, rs.getRow());
System.out.println(rs.getRow());
} else{
con.commit();
rs.close();
stmt.close();
con.close();
}
} catch (SQLException e) {
log.error(e);
throw new RuntimeException(e);
}
return result;
}

public void setDataSource(DataSource dataSource) {
this.dataSource = dataSource;
}
public void setMapper(RowMapper mapper) {
this.mapper = mapper;
}
public Object getKey(Object item) {
return item;
}
public boolean recover(Object data, Throwable cause) {
return false;
}
}

With this code, my batch returns 199913 lines, consums 550 Mo of memory, the processor time is about 14 s and the duration is about 1minute 20 s.

When I change create statement like this in my provider:

stmt = con.createStatement();
my batch returns always 199913 lines but consums 55 Mo of memory; processor time and duration are similar than previously.

I have done some tests using setMaxRows(): the less number of rows is read, the less memory is consumed and the less is the duration but I don't have all the results.
If I close the ResultSet each 10000 records, 85 Mo of memory is consumed with the same time processor as previously but the duration is 5 minutes.

When I increase fetch size, the performance are better (30 s instead of 1minute 20s for fetch size=10000) and 550 Mo of memory are always consumed.

If you want some precisions about a test or if you want the code of the batch which uses JdbcCursorInputSource, say to me.

lucasward
Feb 22nd, 2008, 12:23 PM
So, it seems like tweaking the fetchsize and max rows gives you the result you're looking for?

550 mb is definitely too much, but you mentioned the tweaks got it down to 55 mb, which seems reasonable, unless I'm missing something.

Dave Syer
Feb 23rd, 2008, 03:31 PM
It looks to me like there was no improvement in the heap usage, but a large ( as expected) improvement in the throughput when the fetch size was tweaked. But it also looks like the problem with the heap size must be in the jdbc driver, if anything. I guess I can sort of understand why the driver might want to eat memory if a cursor is open with such lax policies (it might be called on to reset to any row at any time), and so it probably comes down to a performance tradeoff being made by the RDBMS vendor. Maybe you could try another driver or database platform?

gcollin
Feb 26th, 2008, 09:22 AM
I encounter the same problem:

Memory consumption is increasing alot with the data.
I understand it's not a spring-batch error, but I would like a work around.

I don't need, in my particuliar case, to have a resultset with random access to rows. (No restart in this case).

I would be interested in overriding executeQuery () or the createStatement in my own class, but it's a private method.

Gerard COLLIN

ndefreitas
Feb 27th, 2008, 10:43 PM
Did you guys try using Ecplise TPTP to help narrow down which objects are hanging around, and what's referencing them?

gcollin
Feb 28th, 2008, 03:09 AM
In fact, I could get eclipse profiler to run (timeout in running process) and the client don't want to buy a tool.


Anyway, if it's a normal behaviour of the oracle driver to keep data in memory when using SCROLLABLE resultset, I guess there's not much I can do.

I think I'm going to rewrite a custom JdbcCursorInputSource to override this problem.

Gerard COLLIN

ndefreitas
Feb 28th, 2008, 06:31 AM
You can send the profile data to a file, and then analyze it later - this way you shouldn't timeout. Also, if you want a better tool try JInsight from IBM: http://www.alphaworks.ibm.com/tech/jinsightlive

gcollin
Mar 5th, 2008, 02:24 AM
Howver, this kind of test is planned for later.

Gérard

gonzalad
Mar 25th, 2008, 05:31 AM
Hello,

Very interesting thread.

Just googled a bit looking at jdbc drivers implementation of scrollability.

It appears most jdbc drivers don't implement server side scrollability, but emulate it on the client side.
ResultSet scrollability is usefull only with Sql server on Jtds drivers.

So, I'm a bit sceptic about JdbcCursorItemReader usability for reading large volumes of data.
One way of using JdbcCursorItemReader is like gcollin told to remove TYPE_SCROLL_INSENSITIVE, and forget abour restartability (never call absolute()).
Another way is perhaps on restart to reread the data from the last line read (using a separate restartSql and storing the pk of the last processed item ?).

This link shows a list of drivers and whether they implement or not server side rs scrollability : http://www.hibernate.org/314.html.
Here's a summary (I've just copy/paste the info for quick reference) :
Oracle : [KO] Client side support
Sql server on Jtds : [OK] Server side support
Sybase on Jtds : [KO] Client side support
Mysql [KO] : Client side support.
Hsqldb [KO] : scrollable rs not supported.
Firebird JDBC driver : [KO] client side.
PostgreSQL JDBC driver ?
Informix JDBC driver : [OK] by creating a temporary table. [1] Reference documentation for Oracle (http://download.oracle.com/docs/cd/B28359_01/java.111/b31224/resltset.htm#i1022319).

Oracle reference documentation claims scrollability is implemented on the client side.
Because the underlying server does not support scrollable cursors, Oracle JDBC must implement scrollability in a separate layer.
It is important to be aware that this is accomplished by using a client-side memory cache to store rows of a scrollable result set.

Important:
Because all rows of any scrollable result set are stored in the client-side cache, a situation where the result set contains many rows, many columns, or very large columns might cause the client-side Java Virtual Machine (JVM) to fail. Do not specify scrollability for a large result set.In fact, Oracle jdbc driver 11.1.0.6.0 default implementation (oracle.jdbc.driver.OracleResultSetCacheImpl) just stores every line of the resultSet in a Vector. The lines are stored in the Vector when they are requested (for instance, if you call rs.last(), it just retrieves every line of the resultSet and store them in the Vector).
Oracle allows to override the default behaviour by implementing the OracleResultSetCache interface (for instance caching a number of items equal to the commit-intervall for Spring Batch). Not sure it's the way to go since it's really Oracle-specific.

Hope this search helped !

lucasward
Mar 25th, 2008, 11:08 AM
That's a really great post. I've known about issues such as this for awhile, which is why I also created the DrivingQueryItemReader. It's the approach used on another client to solve this issue. The problem is, there's really a limit to the number of keys you can store in memory within a jvm. At this client we solved it by 'partitioning' the driving query with a maximum key limit in mind, something like 100,000 keys per partition. You could then split the data up with a query such as : select count(ID) from T_FOOS order by ID. (The order by would be important here) Then you could kick off N jobs where N equals the number of partitions. The JobParameters would then contain the begin and end ID for that partition (giving you separate instances). There's a little work to be done to make sure the KeyCollector implementations can pull and set up a prepared statement from the JobParameters, which I'll be working through today and tomorrow. If you look at the latest trunk I've put something forward that works for JdbcCursorItemReader, but I agree there's some work that needs to be done there too. (I may move some implementation pieces into samples before the release)

One long term solution to the cursor problem might be to make it forward only and buffer up until mark. It would require holding what the RowMapper returned in a buffer until the next mark, in which case the buffer contents can be thrown away. The only reason the buffer is needed is so that rollback can be supported. I don't have time to implement that right now, but it's an enhancement I'm looking to make in the future.

gonzalad
Apr 1st, 2008, 02:16 PM
Thanks Lucas !
You fixed this issue with 1.0.0-FINAL.

(I was testing my own buffered implementation, I'll throw it away :eek:)

lucasward
Apr 1st, 2008, 02:19 PM
Yeah, sorry about that, I forgot to mention in this thread that I fixed in final. I wasn't going to touch it until 1.1, but then when I dug into it, i really started to think of it as a bug. It didn't take too long to fix, I'm glad it works for you.

Dave Syer
Apr 1st, 2008, 03:36 PM
I'm glad we fixed the issue. As a further improvement, why couldn't we use the RDBMS to do the scrolling? (It would be platform dependent SQL.) DrivingQueryItemReader would benefit from this as well.

lucasward
Apr 1st, 2008, 03:53 PM
I'm not sure I fully understand Dave? It seems like to do that you're talking about some form of pagination? Isn't that quiet a bit different from opening a database cursor? I think you're right that it might hold some promise for the driving query reader, but I think it would be fundamentally different from a cursor reader, unless I'm completely misunderstanding.

gonzalad
Apr 1st, 2008, 04:13 PM
As a further improvement, why couldn't we use the RDBMS to do the scrolling?Something like the following for Oracle ?
Taken from http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html, section 'Pagination with ROWNUM' :

select *
from ( select /*+ FIRST_ROWS(n) */
a.*, ROWNUM rnum
from ( your_query_goes_here,
with order by ) a
where ROWNUM <=
:MAX_ROW_TO_FETCH )
where rnum >= :MIN_ROW_TO_FETCH;

lucasward
Apr 1st, 2008, 04:21 PM
Right, but wouldn't that be something like a paginated item reader, rather than a cursor item reader? It seems like two completely different things.

Dave Syer
Apr 2nd, 2008, 02:26 AM
If the framework is smart enough to know about the different RDBMS flavours the interface for the user is identical to the existing readers, isnt it? So we could implement it as a new reader, but that would mean two choices to achieve the same goal with one clear winner (as long as you were on a supported platform).

gonzalad
Apr 2nd, 2008, 03:32 AM
Sorry to come in the middle of a very interesting conversation (this new jdbc reader can be really interesting)


I'm glad we fixed the issue.
We found a bug in 1.0.0-FINAL JdbcCursorItemReader (sorry !).

It happens on second or subsequent restarts.
This is because bufferredReader don't remember the previous processedRowCount.
bufferredReader#processedRowCount only contains the rows processed on the current restart.

When update is called, JdbcCursorItemReader stores in the execution context the rows during this restart. It should store the rows processed during this restart AND the previous ones.

This issue can be resolved changing the JdbcCursorItemReader#open code - see //CHANGE, //END CHANGE block at the end (note : a new constructor BufferredResultSetReader(ResultSet,RowMapper,long processedRowCount) would have been better than setting the field directly) :



public void open(ExecutionContext context) {
Assert.state(!initialized, "Stream is already initialized. Close before re-opening.");
Assert.isNull(rs);
Assert.notNull(context, "ExecutionContext must not be null");
executeQuery();
initialized = true;


long currentProcessedRow = 0;
if (context.containsKey(getKey(CURRENT_PROCESSED_ROW) )) {
try {
currentProcessedRow = context.getLong(getKey(CURRENT_PROCESSED_ROW));
while(rs.next()){
if(rs.getRow() == currentProcessedRow){
break;
}
}
} catch (SQLException se) {
throw getExceptionTranslator().translate("Attempted to move ResultSet to last committed row", sql, se);
}
}


//CHANGE
bufferredReader = new BufferredResultSetReader(rs, mapper);
bufferredReader.processedRowCount = currentProcessedRow;
//END CHANGE
}

lucasward
Apr 2nd, 2008, 10:30 AM
You're right, I'll add an extra constructor:

http://jira.springframework.org/browse/BATCH-549

lucasward
Apr 2nd, 2008, 10:52 AM
It's fixed in 1.0.1 dev.

gonzalad
Apr 2nd, 2008, 12:52 PM
Thanks Lucas !

hessenmob
May 31st, 2008, 03:56 PM
HI All,

actually I'am getting this problem again when using a HibernateCursoItemReader. The driver caches every Row internally which results into 500MB heap data. This constantly crashes all my bigger data dump tasks.
Is the bug that existed in the JdbcCursorItemReader not fixed in the hibernate one?
Any ideas?

I'am using the thin oracle jdbc type 4 driver on 10g.

ndefreitas
May 31st, 2008, 04:28 PM
Do you have the crash dump file? Are you using an IBM or Sun JVM?

hessenmob
May 31st, 2008, 06:54 PM
Hi,

i will get the crash dump as soon as i can, i found the "leak" by profiling through the application with jprofiler. in oracle.jdbc the number of byte[] is constantly rising.
The driver stores all the read rows in the OracleResultSetCacheImpl in the resultset. As this is a normal behaviour of the oracle driver, i wonder how i can change that behaviour. Any idea?

I'am using sun 1.6, attatched a screen shot.
Did a dump by specifiing: -XX:+HeapDumpOnOutOfMemoryError
Used the SAP Memory Analyzer, very good to dig through dumps.

h-t-t-p://img-up.net/?up=oracleProb1DC2Xi0.png
h-t-t-p://img-up.net/?up=dumpVisualX1oB1n.png
Sorry, new links, old ones didn't work.

Sorry, attatchments are resized to strong, can't link pictures as i'am a new user (need 15 posts).

ndefreitas
May 31st, 2008, 08:32 PM
You can email them to nigel dot defreitas at gmail dot com.

lucasward
Jun 1st, 2008, 12:21 PM
It seems like there has to be some kind of setting in hibernate for this. With the jdbc version is changed the cursor to forward only from scroll sensitive. I'm assuming that hibernate must be using a scroll sensitive one as well. I'm not a hibernate expert, but it seems like there should be a way to control that.

Dave Syer
Jun 1st, 2008, 03:38 PM
You could also try using the normal DrivingQueryItemReader or JdbcCursorItemReader to grab the primary keys and use Hibernate in the Writer/Tranformer to load the object and do whatever is necessary in the context of a single commit interval.

hessenmob
Jun 1st, 2008, 08:42 PM
Thanks for your answers, i will check if i can change the cursor to forward only.

To avoid those problems in future I implemented a DrivingHibernateCursorReader. The example the iBatis implemention gave to me was quite easy to transfer to hibernate. I'am now using a SingleColumn KeyGenerator that gets all keys via a JdbcTemplate. Then i chunk those in my DrivingItemReader and read with a statement like this: "select from Table where id < n+1000 and id > n" the read object chunks are then transfered into a queue from which the read method then reads the elements. An implementation where i just read one object for a key at a time performed to slow. Is this a good way to implement it? The performance decrease compared to the Cursor implementation is approx. at 15-20%.

hessenmob
Jun 2nd, 2008, 08:18 AM
This is how the statement is executed in JdbcCursorReader:


preparedStatement = this.con.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY, ResultSet.HOLD_CURSORS_OVER_COMMIT);

This is how the cursor is opened in the HibernateReader:


if (useStatelessSession) {
statelessSession = sessionFactory.openStatelessSession();
cursor = statelessSession.createQuery(queryString).scroll() ;
} else {
statefulSession = sessionFactory.openSession();
cursor = statefulSession.createQuery(queryString).scroll();
}

Maybe the Hibernate one should be opened by using this method

org.hibernate.ScrollableResults scroll(org.hibernate.ScrollMode scrollMode) throws org.hibernate.HibernateException;

with the following value:
FORWARD_ONLY

Should i file a bug report?

lucasward
Jun 2nd, 2008, 09:21 AM
That looks like it would be correct, although I would want to do some testing with it to ensure rollbacks were handled correctly.

Creating an issue with your full observations would be helpful.

hessenmob
Jun 3rd, 2008, 11:32 AM
Setting the cursor to FORWARD_ONLY does the trick. I will file a report.