Mar 10th, 2011, 06:35 PM
StaxEventItemReader ISO-8859-1 Character Normalization
Newbie here. I have a batch program that uses a StaxEventItemReader to read some XML. The XML is UTF-8 but contains some ISO-8859-1 (Latin-1) characters.
The StAX parser handles nextEvent calls fine until it reaches the XMLEvent that contains one of these characters, at which point it throws this exception:
com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 middle byte 0x6e (at char #35463, byte #31999)
Looking for ideas on how to either normalize the data at the point of retrieving the event, or somehow configure the StaxEventItemReader so that it normalizes the data. Any ideas?
Mar 15th, 2011, 07:55 AM
You probably need to consult the documentation for your XML library (Woodstox, by the looks of it). But it is telling you there is an invalid byte, so are you sure it is really ISO-8859-1?
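One way to check what is actually in the file: 0x6e is plain ASCII 'n', and a single Latin-1 byte such as 0xE9 (é) looks to a UTF-8 decoder like the start of a three-byte sequence, so a following 'n' would be rejected as an invalid middle byte. The sketch below uses only the JDK to test whether a byte array is well-formed UTF-8. The class name, method name, and sample string are made up for illustration; the actual character in the poster's file is not known.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    /** Returns true if the bytes decode cleanly as UTF-8, false otherwise. */
    public static boolean isValidUtf8(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample text: an accented letter followed by 'n' (0x6e).
        String sample = "S\u00e9n\u00e9gal";
        byte[] asLatin1 = sample.getBytes(StandardCharsets.ISO_8859_1);
        byte[] asUtf8 = sample.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(asLatin1)); // false
        System.out.println(isValidUtf8(asUtf8));   // true
    }
}
```

Running the same check over the real input file (read fully into a byte array) would show whether it is genuinely UTF-8 or only labelled as such.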
Mar 31st, 2011, 10:37 AM
Thanks for the reply, Dave. Yes, Woodstox is our XML library. I took your advice and checked the documentation, but unfortunately it did not provide any information or clues on how to handle this scenario. The links to their issues/bugs database and logs show as unavailable.
I am also pretty sure it is ISO-8859-1. Googling, I have found similar issues with the same character, and when I delete that character it runs fine. Posts seem to hint at differences between the encoding used by StAX reader implementations (e.g. StAXUtils.createXMLStreamReader(InputStream)) and the encoding the String uses (JVM default).
Was really hoping to see some properties/interface of the StaxEventItemReader available to normalize characters, set encoding options, alter the type of underlying reader it uses, etc. Any ideas are welcome. Thank you!
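For what it is worth, here is a minimal sketch of one possible workaround at the plain StAX level, assuming the whole file really is ISO-8859-1: decode the bytes yourself with an InputStreamReader and hand the parser characters rather than bytes, so it never attempts UTF-8 decoding. It uses only the JDK's javax.xml.stream API. The class name, method name, and sample XML are hypothetical, the sample deliberately has no XML declaration, and how (or whether) this can be wired into StaxEventItemReader depends on the Spring Batch version, so check its API docs.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class Latin1StaxDemo {

    /** Parses the stream as ISO-8859-1 XML and returns all character data. */
    public static String readText(InputStream in) {
        try {
            // Decode bytes explicitly; the parser then works on characters only.
            Reader chars = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
            XMLEventReader events =
                    XMLInputFactory.newInstance().createXMLEventReader(chars);
            StringBuilder text = new StringBuilder();
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                }
            }
            events.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical Latin-1 encoded input, standing in for the real file.
        byte[] latin1Xml = "<city>S\u00e9n\u00e9gal</city>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readText(new ByteArrayInputStream(latin1Xml)));
    }
}
```

If the file mixes genuine UTF-8 sequences with stray Latin-1 bytes, no single charset will decode it correctly and the data would need to be fixed at the source instead.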