May 1st, 2012, 05:07 AM
Neo4j batch processing
This might make more sense over in the Batch forum but the issue does seem Neo4j specific.
I am trying to run a batch process that reads in a million SQL records via JPA, processes them, and writes out about 20-30 million Neo4j nodes. Everything seems to be working well except that it gradually slows down and then runs out of memory.
I am using chained transactions.
I noticed in the Neo4j messages log the following which continues until it crashes.
2012-05-01 09:58:25.972+0000: GC Monitor: Application threads blocked for an additional 814ms [total block time: 604.551s]
2012-05-01 09:58:29.301+0000: GC Monitor: Application threads blocked for an additional 1720ms [total block time: 606.271s]
2012-05-01 09:58:33.129+0000: GC Monitor: Application threads blocked for an additional 2117ms [total block time: 608.388s]
How can I allow the GC to run? If I am reading the log correctly it isn't being allowed to run at all, ever.
I thought that it would run every time a page is read via JPA, but this appears not to happen.
May 1st, 2012, 12:54 PM
If you are using Spring Batch, what is your job/step's chunk size?
Based on what you posted, I am assuming it isn't releasing the references to your domain objects that represent your nodes/relationships.
The chunk size determines when a transaction is committed: after every N records.
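For reference, this is where that commit boundary is configured. A minimal sketch of a chunk-oriented step, using the builder-style Java config from later Spring Batch releases (the `sqlReader`/`nodeProcessor`/`neo4jWriter` bean names and the `SqlRecord`/`NodeBatch` types are hypothetical; the XML config of the day expresses the same thing via `commit-interval`):

```java
// Sketch only: assumes Spring Batch on the classpath; sqlReader,
// nodeProcessor and neo4jWriter are hypothetical beans defined elsewhere,
// and SqlRecord/NodeBatch are hypothetical domain types.
@Bean
public Step loadNodesStep(StepBuilderFactory steps,
                          ItemReader<SqlRecord> sqlReader,
                          ItemProcessor<SqlRecord, NodeBatch> nodeProcessor,
                          ItemWriter<NodeBatch> neo4jWriter) {
    return steps.get("loadNodesStep")
            .<SqlRecord, NodeBatch>chunk(500) // transaction commits every 500 items
            .reader(sqlReader)
            .processor(nodeProcessor)
            .writer(neo4jWriter)
            .build();
}
```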
You can also set your batch job/step to use a task executor so it runs on more than one thread.
There still might be something in Spring Data Neo4j that is also not releasing the domain objects. So you might have to call something like JPA/Hibernate's evict() on the Neo4j side, as well as in your JPA code.
Those are just guesses without seeing more information.
May 1st, 2012, 07:32 PM
Hi Mark thanks for your response.
For testing purposes I reduced the chunk and page-read size from 2000 down to 10. I can confirm the commits are taking place by checking the two data stores. I have several batch jobs that would compete, so I have used a single executing thread for some time.
I did a bit more research and I think that I have not implemented it correctly: I was saving the Neo4j objects in the processor stage. I think the correct way would be to implement a writer which performs the save, eviction and cleanup. I am however quite new to batch, so this may be way off.
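A hedged sketch of what such a writer could look like, in plain Java with stand-in interfaces so it is self-contained (in a real job you would implement Spring Batch's ItemWriter and clear the actual Neo4j/JPA session; the `Session` interface here is only a stand-in for that API):

```java
import java.util.List;

// Stand-in for the real persistence API (a JPA EntityManager or a
// Spring Data Neo4j template) so the sketch compiles without Spring.
interface Session {
    void save(Object entity);
    void clear();
}

// Writer-stage sketch: called once per chunk with the whole list of
// processed items. It saves everything, then clears the session so no
// references to the entities survive past the commit and the GC can
// reclaim them.
class EvictingNodeWriter {
    private final Session session;

    EvictingNodeWriter(Session session) {
        this.session = session;
    }

    public void write(List<?> chunk) {
        for (Object node : chunk) {
            session.save(node);
        }
        session.clear(); // eviction: release the chunk's objects for GC
    }
}
```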
I have switched from embedded to REST now and the problem has been resolved as have a few other issues.
Last edited by msduk; May 1st, 2012 at 07:37 PM.
May 1st, 2012, 09:54 PM
Yes, the processor stage is for processing, such as transformation.
Reading is for reading and writing is for writing.
In a chunk-oriented batch step, reading and processing operate on a single item at a time, while the writer gets the whole chunk as a List of those items.
So: read an item, process that item, read another, process it, and so on until the chunk size is reached; once it is, pass the list to the writer. At the end of the writer, commit the transaction.
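That loop can be sketched in plain Java (a simulation of the chunk cycle to show the control flow, not actual Spring Batch code):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Plain-Java simulation of a chunk-oriented loop: read one item,
// process it, buffer the result; once the buffer reaches the chunk
// size, hand the whole list to the writer and "commit".
public class ChunkLoop {
    interface Writer<T> { void write(List<T> chunk); }

    static <I, O> int run(Iterator<I> reader,
                          java.util.function.Function<I, O> processor,
                          Writer<O> writer,
                          int chunkSize) {
        List<O> buffer = new ArrayList<>();
        int commits = 0;
        while (reader.hasNext()) {
            buffer.add(processor.apply(reader.next())); // read + process one item
            if (buffer.size() == chunkSize) {
                writer.write(buffer);       // writer gets the chunk as a List
                buffer = new ArrayList<>(); // commit point: drop the references
                commits++;
            }
        }
        if (!buffer.isEmpty()) { // final partial chunk
            writer.write(buffer);
            commits++;
        }
        return commits;
    }
}
```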
Also, readers hold onto what they read in an in-memory "cache" until the writer is done.
I'd still make those changes regardless, even if switching to REST removed some of the issues.