Trying to read Japanese CSV file in Java - shift-jis

I am trying to read a CSV file with Japanese content, which is downloaded and extracted programmatically.
Code to read the CSV
String splitBy = ",";
BufferedReader br;// = new BufferedReader(new FileReader(pathOfExcel + "\\KEN_ALL.CSV "));
br = new BufferedReader(new InputStreamReader(new FileInputStream(pathOfExcel + "\\KEN_ALL1.CSV"),"SHIFT-JIS"));
String line = "";
int cnt = 0;
while((line = br.readLine()) != null){
//System.out.println("Count :: " + cnt++);
List<Object> excelList = new ArrayList<Object>();
if(line != null){
String[] splitCells = line.split(splitBy);
excelList.add(splitCells[0].replace("\"", ""));
excelList.add(splitCells[1].replace("\"", ""));
excelList.add(splitCells[2].replace("\"", ""));
excelList.add(splitCells[3].replace("\"", ""));
excelList.add(splitCells[4].replace("\"", ""));
excelList.add(splitCells[5].replace("\"", ""));
excelList.add(splitCells[6].replace("\"", ""));
excelList.add(splitCells[7].replace("\"", ""));
excelList.add(splitCells[8].replace("\"", ""));
returnList.add(excelList);
}
}
br.close();
I have tried both UTF-8 and SHIFT-JIS as shown in the following code.
br = new BufferedReader(new InputStreamReader(new FileInputStream(pathOfExcel + "\\KEN_ALL1.CSV"),"UTF-8"));
With each encoding, the line excelList.add(splitCells[3].replace("\"", "")); returns the output shown below, whereas the expected value is ホッカイドウ.
UTF-8 - ί¶²ÄÞ³
Shift-JIS - テ篠ッツカツイテ�楪ウ

The file KEN_ALL1.CSV is the file provided by JAPAN POST Co.,Ltd., right?
https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html
I could read the file correctly with your program, so I think the program has no problem.
I think your file might have a problem. Can you open the CSV file in a text editor that shows the file's character encoding (e.g. Notepad++)? Is the content displayed correctly, and is the character encoding really Shift-JIS?
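If no such editor is at hand, one way to check is to decode the raw bytes strictly and see which charset accepts them. The following is a minimal Java sketch, not part of the original program; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: checks whether a byte sequence is legal in a given charset.
final class EncodingCheck {
    // Returns the decoded text, or null if the bytes are malformed or unmappable
    // in this charset. Unlike InputStreamReader, which silently substitutes
    // replacement characters, this decoder reports the first bad byte.
    static String strictDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }
}
```

It could be called as EncodingCheck.strictDecode(Files.readAllBytes(path), Charset.forName("Shift_JIS")), assuming the JDK in use ships that charset. A non-null result only shows that the bytes are legal in that charset, not that it is the one the file was written in.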

Related

MalformedInputException: Input length = 1 while reading text file with Files.readAllLines(Paths.get("file")).get(0)

Why am I getting this error? I'm trying to extract information from a bank statement PDF and tally different bills for the month. I write the data from a PDF to a text file so I can get specific data from the file (e.g. ASPEN HOME IMPRO, then iterate down to what the dollar amount is, then read that text line to a string)
When the Files.readAllLines(Paths.get("bankData")).get(0) code is run, I get the error. Any thoughts why? Encoding issue?
Here is the code:
public static void main(String[] args) throws IOException {
File file = new File("C:\\Users\\wmsai\\Desktop\\BankStatement.pdf");
PDFTextStripper stripper = new PDFTextStripper();
BufferedWriter bw = new BufferedWriter(new FileWriter("bankData"));
BufferedReader br = new BufferedReader(new FileReader("bankData"));
String pdfText = stripper.getText(Loader.loadPDF(file)).toUpperCase();
bw.write(pdfText);
bw.flush();
bw.close();
LineNumberReader lineNum = new LineNumberReader(new FileReader("bankData"));
String aspenHomeImpro = "PAYMENT: ACH: ASPEN HOME IMPRO";
String line;
while ((line = lineNum.readLine()) != null) {
if (line.contains(aspenHomeImpro)) {
int lineNumber = lineNum.getLineNumber();
int newLineNumber = lineNumber + 4;
String aspenData = Files.readAllLines(Paths.get("bankData")).get(0); //This is the code with the error
System.out.println(newLineNumber);
break;
} else if (!line.contains(aspenHomeImpro)) {
continue;
}
}
}
So I figured it out. I had to check the properties of the text file in question (I'm using Eclipse) to find out what its actual encoding was.
Then, when creating the file in the program, I wrote it as UTF-8 so that Files.readAllLines, which decodes as UTF-8 by default, could read it and grab the data I wanted.
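A minimal sketch of that fix, assuming the goal is simply to make the writer and the reader agree on a charset. The class and method names are hypothetical. FileWriter, as used in the question, falls back to the platform default charset on Java versions before 18, which is how the mismatch can arise.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: writes and reads a text file with the same explicit charset.
final class Utf8RoundTrip {
    // Writes the text as UTF-8 regardless of the platform default.
    static void writeUtf8(Path path, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads all lines, naming the charset explicitly so the intent is visible.
    static List<String> readUtf8(Path path) throws IOException {
        return Files.readAllLines(path, StandardCharsets.UTF_8);
    }
}
```

In the program above, writeUtf8 would take the place of the FileWriter-based BufferedWriter, with Paths.get("bankData") as the path.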

Why won't BufferedWriter write URL content to text file?

I'm trying to write the text from the URL to a text file in batches of 35 lines, pressing enter to continue to the next batch of 35 lines. If I don't try to write to the file in batches of 35 lines, it works great and writes all of the content to the text file. But when I use the if statement to print in batches of 35, it won't write to the file unless I press enter around 15 times, and even then it doesn't write everything. It seems like it has something to do with the if statement, but I can't figure it out.
String urlString = "https://www.gutenberg.org/files/46768/46768-0.txt";
try {
URL url = new URL(urlString);
try(Scanner input = new Scanner(System.in);
InputStream stream = url.openStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
BufferedWriter writer = new BufferedWriter(new FileWriter("C:\\Users\\mattj\\Documents\\JuliusCeasar.txt"));) {
String line;
int PAGE_LENGTH = 35;
int lineCount = 0;
while ((line = reader.readLine()) != null) {
System.out.println(line);
writer.write(line + "\n");
lineCount++;
if (lineCount == PAGE_LENGTH){
System.out.println();
System.out.println("- - - Press enter to continue - - -");
input.nextLine();
lineCount = 0;
}
}
}
} catch (MalformedURLException e) {
System.out.println("We encountered a problem regarding the following URL:\n"
+ urlString + "\nEither no legal protocol could be found or the "
+ "string could not be parsed.");
e.printStackTrace();
} catch (IOException e) {
System.out.println("Attempting to open a stream from the following URL:\n"
+ urlString + "\ncaused a problem.");
e.printStackTrace();
}
I don't know Java, but there are very similar concepts in .NET. I think there are a couple of things to consider here.
BufferedWriter will not write to the file immediately. As the name suggests, it acts as a buffer, collecting write requests over time and then writing them in a batch. BufferedWriter has a flush method that writes the queued-up data to the file immediately, so I'd call it when you hit your 35 lines (never flush on every write).
Also, BufferedReader and BufferedWriter are closeable, so be sure to wrap them in a try statement so that resources are properly released.
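The flush-per-page idea could look like the following Java sketch. The class name, method name, and the afterEachPage callback are hypothetical; the callback stands in for the "press enter" prompt.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;

// Hypothetical helper: copies text line by line and flushes once per page.
final class PagedCopy {
    // Copies every line from reader to writer. After each full page of
    // pageLength lines the writer is flushed, then afterEachPage runs
    // (for example, a prompt that waits for the user).
    // Returns the total number of lines copied.
    static int copyInPages(BufferedReader reader, BufferedWriter writer,
                           int pageLength, Runnable afterEachPage) throws IOException {
        int total = 0;
        int lineCount = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.newLine();
            total++;
            if (++lineCount == pageLength) {
                writer.flush();       // push the finished page to the file
                afterEachPage.run();  // pause here, e.g. wait for enter
                lineCount = 0;
            }
        }
        writer.flush();               // write out the last, partial page
        return total;
    }
}
```

With the Scanner from the question, the call might be PagedCopy.copyInPages(reader, writer, 35, () -> input.nextLine()). Flushing before the pause means the file is up to date whenever the program is waiting for input.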

Change encoding of extracted text using PdfTextExtractor.GetTextFromPage?

I want to extract some text using iTextSharp, but instead of 'Köln' I get 'K\0ln' and instead of 'Währung' I get 'W\0hrung' (among other examples), which means both 'ä' and 'ö' get replaced by '\0'.
How do I set the encoding?
using (PdfReader reader = new PdfReader(sSourceFileName))
{
PdfReader.unethicalreading = true;
string sText = PdfTextExtractor.GetTextFromPage(reader, iPageFilter, new SimpleTextExtractionStrategy());
}

Need some eclipse search/replace regex help to speed things up

So I have had an issue for a while now and thought it was worth the time to ask the more experienced regex guys if there was a way to fix this issue with a quick search and replace.
So I use a tool that generates Java code (it is not written in Java, or I would fix the cause directly); however, it has an issue where it accesses fields on an object before the object is created.
This always occurs only once per object, but not for every object, the object name is unknown, and the error is always the line directly before the constructor is called. This is the format the error is always in:
this.unknownObjectName.mirror = true;
this.unknownObjectName = new Model(unknown, parameter, values);
I know there should be a trick to fix this, as a simple string replace simply will not work since 'unknownObjectName' is unknown.
Would this even be possible with regex, if so, please enlighten me :)
This is how the code SHOULD read:
this.unknownObjectName = new Model(unknown, parameter, values);
this.unknownObjectName.mirror = true;
For complex models, this error may happen hundreds of times, so this will indeed save a lot of time. That, and I would rather walk on hot coals than do mindless busy work like fixing all of these manually :)
Edit:
I threw together a Java app that does the job.
public static void main(String args[]){
File file = new File(args[0]);
File file2 = new File(file.getParentFile(), "fixed-" + file.getName());
try {
if(file2.exists()) {
file2 = new File(file.getParentFile(), "fixed-" + System.currentTimeMillis() + "-" + file.getName());
}
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file2)));
String line, savedline = null, lastInitVar = "";
while((line = br.readLine()) != null){
if(line.contains("= new ")){
String varname = line.substring(0, line.indexOf("=")).trim();
lastInitVar = varname;
}else if(line.contains(".mirror")){
String varname = line.substring(0, line.indexOf(".mirror")).trim();
if(!lastInitVar.equals(varname)){
savedline = line;
continue;
}
}else if(savedline != null && savedline.contains(lastInitVar)){
bw.write(savedline + "\n");
savedline = null;
}
bw.write(line + "\n");
}
bw.flush();
bw.close();
br.close();
} catch (Exception e) {
e.printStackTrace();
}
}
You are overthinking it.
Write a program that reads line by line; when you see an object access before its constructor call, hold that line back, write out the next line, and then write out the held line. Rinse and repeat.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
Regular expressions are for matching patterns, not state-based logic.
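The hold-and-swap approach can be sketched in Java as below. This is a simplified illustration, not the asker's app: the class and method names are hypothetical, and it assumes the misplaced ".mirror" line always sits directly above a line of the form "name = new ...", as described in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: swaps a premature ".mirror" line with the constructor line below it.
final class MirrorFix {
    static List<String> reorder(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String current = lines.get(i);
            int dot = current.indexOf(".mirror");
            if (dot >= 0 && i + 1 < lines.size()) {
                // Everything before ".mirror" is the object name, e.g. "this.foo".
                String name = current.substring(0, dot).trim();
                String next = lines.get(i + 1);
                if (next.trim().startsWith(name + " = new ")) {
                    out.add(next);     // constructor call first
                    out.add(current);  // then the held ".mirror" line
                    i++;               // the next line is already written
                    continue;
                }
            }
            out.add(current);
        }
        return out;
    }
}
```

Lines already in the right order pass through unchanged, because the swap only happens when the constructor line for the same name follows immediately.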

' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1

I am trying to read an XML file from the web and parse it using XDocument. It normally works fine, but on some days it gives me this error:
**' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1**
I have tried some solutions from Google but they aren't working for VS 2010 Express Windows Phone 7.
There is a solution that replaces the 0x1F character with string.Empty, but my code returns a stream, which doesn't have a Replace method.
s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
Here is my code:
void webClient_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
using (var reader = new StreamReader(e.Result))
{
int[] counter = { 1 };
string s = reader.ReadToEnd();
Stream str = e.Result;
// s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
// byte[] str = Convert.FromBase64String(s);
// Stream memStream = new MemoryStream(str);
str.Position = 0;
XDocument xdoc = XDocument.Load(str);
var data = from query in xdoc.Descendants("user")
select new mobion
{
index = counter[0]++,
avlink = (string)query.Element("user_info").Element("avlink"),
nickname = (string)query.Element("user_info").Element("nickname"),
track = (string)query.Element("track"),
artist = (string)query.Element("artist"),
};
listBox.ItemsSource = data;
}
}
XML file:
http://music.mobion.vn/api/v1/music/userstop?devid=
0x1F is the ASCII unit separator, a control character. It is not valid in XML 1.0. Your best bet is to replace it.
Instead of using reader.ReadToEnd() (which, for a large file, can use a lot of memory, though you can certainly use it), why not try something like:
var cleaned = new StringBuilder();
string input;
while ((input = reader.ReadLine()) != null)
{
// Replace the invalid 0x1F control character with a space.
cleaned.AppendLine(input.Replace((char)0x1F, ' '));
}
You can convert the result back into a stream if you'd like, to then use as you please.
// UTF-8 keeps non-ASCII characters that Encoding.ASCII would lose.
byte[] byteArray = Encoding.UTF8.GetBytes(cleaned.ToString());
MemoryStream stream = new MemoryStream(byteArray);
Or else you could keep using ReadToEnd(), clean that string of illegal characters, and convert it back to a stream.
Here's a good resource for cleaning illegal characters in your XML. Chances are, you'll have others as well.
https://seattlesoftware.wordpress.com/tag/hexadecimal-value-0x-is-an-invalid-character/
What could be happening is that the content is compressed, in which case you need to decompress it. A gzip stream begins with the bytes 0x1F 0x8B, which would explain an 0x1F at line 1, position 1.
With HttpClient you can do this the following way:
var client = new HttpClient(new HttpClientHandler
{
AutomaticDecompression = DecompressionMethods.GZip
| DecompressionMethods.Deflate
});
With the "old" WebClient you have to derive your own class to achieve the similar effect:
class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
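WebRequest request = base.GetWebRequest(address);
// Only HTTP requests support automatic decompression.
HttpWebRequest httpRequest = request as HttpWebRequest;
if (httpRequest != null)
{
httpRequest.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
}
return request;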
}
}
Above taken from here
To use the two you would do something like this:
HttpClient
using (var client = new HttpClient(new HttpClientHandler { AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate }))
{
using (var stream = client.GetStreamAsync(url))
{
using (var sr = new StreamReader(stream.Result))
{
using (var reader = XmlReader.Create(sr))
{
var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
Console.WriteLine(item.Title.Text);
}
}
}
}
}
WebClient
using (var stream = new MyWebClient().OpenRead("http://myrss.url"))
{
using (var sr = new StreamReader(stream))
{
using (var reader = XmlReader.Create(sr))
{
var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
Console.WriteLine(item.Title.Text);
}
}
}
}
This way you also receive the benefit of not having to call .ReadToEnd(), since you are working with the stream instead.
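Whether a response is compressed can also be checked directly by looking at its first two bytes (0x1F 0x8B for gzip). The following is a minimal sketch in Java rather than C#, for illustration only; the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: unwraps a stream only if it carries the gzip signature.
final class GzipSniffer {
    static InputStream decompressIfGzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);                 // remember the start so the peek can be undone
        int first = in.read();
        int second = in.read();
        in.reset();                 // rewind to the marked position
        boolean gzip = first == 0x1F && second == 0x8B;
        return gzip ? new GZIPInputStream(in) : in;
    }
}
```

The returned stream can then be handed to an XML parser either way, so uncompressed responses keep working.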
Consider using System.Web.HttpUtility.HtmlDecode if you're decoding content read from the web.
If you are having issues replacing the character
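I had issues when replacing with a string instead of a char. Convert.ToString((byte)0x1F) produces the string "31" (the number formatted as decimal), not the control character, so searching for it finds nothing. I suggest testing both forms to see what they turn up. How you reference the character also has some effect.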
var a = x.IndexOf('\u001f'); // 513
var b = x.IndexOf(Convert.ToString((byte)0x1F)); // -1
x = x.Replace(Convert.ToChar((byte)0x1F), ' '); // Works
x = x.Replace(Convert.ToString((byte)0x1F), " "); // Fails
I blagged this
I had the same issue and found that the problem was a &#x1F; embedded in the XML.
The solution was:
s = s.Replace("&#x1F;", " ")
I'd guess it's probably an encoding issue but without seeing the XML I can't say for sure.
In terms of your plan to simply replace the character but not being able to, because you have a stream rather than a text, simply read the stream into a string and then remove the characters you don't want.
Works for me:
string.Replace(Chr(31), "")
I used XmlSerializer to parse XML and faced the same exception.
The problem is that the XML string contains numeric character references (HTML codes) for invalid characters.
This method removes all such invalid references from the string (based on this thread - https://forums.asp.net/t/1483793.aspx?Need+a+method+that+removes+illegal+XML+characters+from+a+String):
public static string RemoveInvalidXmlSubstrs(string xmlStr)
{
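// Match decimal (&#31;) or hexadecimal (&#x1F;) references; hex digits only, so one match cannot span two references.
string pattern = "&#((\\d+)|(x[0-9a-f]+));";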
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(xmlStr))
{
xmlStr = regex.Replace(xmlStr, new MatchEvaluator(m =>
{
string s = m.Value;
string unicodeNumStr = s.Substring(2, s.Length - 3);
int unicodeNum = unicodeNumStr.StartsWith("x", StringComparison.OrdinalIgnoreCase) ?
Convert.ToInt32(unicodeNumStr.Substring(1), 16)
: Convert.ToInt32(unicodeNumStr);
//according to https://www.w3.org/TR/xml/#charsets
if ((unicodeNum == 0x9 || unicodeNum == 0xA || unicodeNum == 0xD) ||
((unicodeNum >= 0x20) && (unicodeNum <= 0xD7FF)) ||
((unicodeNum >= 0xE000) && (unicodeNum <= 0xFFFD)) ||
((unicodeNum >= 0x10000) && (unicodeNum <= 0x10FFFF)))
{
return s;
}
else
{
return String.Empty;
}
})
);
}
return xmlStr;
}
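The same character ranges can also be applied to raw characters instead of numeric references. Below is a minimal sketch in Java rather than C#, for illustration only; the class and method names are hypothetical, and the ranges are the XML 1.0 ones used in the method above.

```java
// Hypothetical helper: drops characters that XML 1.0 does not allow.
final class XmlSanitizer {
    // True if the code point is in one of the ranges XML 1.0 permits.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the text with every disallowed code point removed.
    static String stripInvalidXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
                .filter(XmlSanitizer::isValidXmlChar)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Working on code points rather than chars keeps valid surrogate pairs intact while unpaired surrogates are removed.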
Nobody can answer if you don't show the relevant information, meaning the XML content.
As general advice, I would put a breakpoint after the ReadToEnd() call. Then you can do a couple of things:
Post the XML content here.
Test it using the VS XML visualizer.
Copy and paste the string into a text file and investigate it offline.