Why am I getting this error? I'm trying to extract information from a bank statement PDF and tally different bills for the month. I write the data from a PDF to a text file so I can get specific data from the file (e.g. ASPEN HOME IMPRO, then iterate down to what the dollar amount is, then read that text line to a string)
When the Files.readAllLines(Path.get("bankData").get(0) code is run, I get the error. Any thoughts why? Encoding issue?
Here is the code:
public static void main(String[] args) throws IOException {
File file = new File("C:\\Users\\wmsai\\Desktop\\BankStatement.pdf");
PDFTextStripper stripper = new PDFTextStripper();
BufferedWriter bw = new BufferedWriter(new FileWriter("bankData"));
BufferedReader br = new BufferedReader(new FileReader("bankData"));
String pdfText = stripper.getText(Loader.loadPDF(file)).toUpperCase();
bw.write(pdfText);
bw.flush();
bw.close();
LineNumberReader lineNum = new LineNumberReader(new FileReader("bankData"));
String aspenHomeImpro = "PAYMENT: ACH: ASPEN HOME IMPRO";
String line;
while ((line = lineNum.readLine()) != null) {
if (line.contains(aspenHomeImpro)) {
int lineNumber = lineNum.getLineNumber();
int newLineNumber = lineNumber + 4;
String aspenData = Files.readAllLines(Paths.get("bankData")).get(0); //This is the code with the error
System.out.println(newLineNumber);
break;
} else if (!line.contains(aspenHomeImpro)) {
continue;
}
}
}
So I figured it out. I had to check the properties of the text file in question (I'm using Eclipse) to figure out what the actual encoding of the text file was.
Then, when creating the file in the program, encode the text file to UTF-8 so that Files.readAllLines could read and grab the data I wanted to get.
I open with epplus an excel file.
After reading some data, I would like to close the package:
pck.Stream.Close()
pck.Dispose()
Unfortunatelly the excel file is still blocked. I need to close the whole application to get the excel file unlocked.
I have googled, but found nothing useful except the above.
How are you opening the file? The following creates, saves, reopens, prints, and finally deletes all withing the same thread without issue. I can even set a breakpoint anywhere and delete the file. Reading the file this way should not lock the file since it is pulled into memory:
[TestMethod]
public void OpenReopenPrintDeleteTest()
{
//Create some data
var existingFile = new FileInfo(#"c:\temp\temp.xlsx");
if (existingFile.Exists)
existingFile.Delete();
using (var package = new ExcelPackage(existingFile))
{
var workbook = package.Workbook;
workbook.Worksheets.Add("newsheet");
package.Save();
}
using (var package = new ExcelPackage(existingFile))
{
var workbook = package.Workbook;
var worksheet = workbook.Worksheets.First();
//The data
worksheet.Cells["A1"].Value = "Col1";
worksheet.Cells["A2"].Value = "sdf";
worksheet.Cells["A3"].Value = "ghgh";
worksheet.Cells["B1"].Value = "Col2";
worksheet.Cells["B2"].Value = "Group B";
worksheet.Cells["B3"].Value = "Group A";
worksheet.Cells["C1"].Value = "Col3";
worksheet.Cells["C2"].Value = 634.5;
worksheet.Cells["C3"].Value = 274.5;
worksheet.Cells["D1"].Value = "Col4";
worksheet.Cells["D2"].Value = 996440;
worksheet.Cells["D3"].Value = 185780;
package.Save();
}
//Reopen the file
using (var package = new ExcelPackage(existingFile))
{
var workBook = package.Workbook;
if (workBook != null)
{
if (workBook.Worksheets.Count > 0)
{
var currentWorksheet = workBook.Worksheets.First();
var lastrow = currentWorksheet.Dimension.End.Row;
var lastcol = currentWorksheet.Dimension.End.Column;
for (var i = 1; i <= lastrow; i++)
for (var j = 1; j <= lastcol; j++)
Console.WriteLine(currentWorksheet.Cells[i, j].Value);
}
}
}
//Delete the file
existingFile.Delete();
}
I was having the same problem... But I found the solution:
During the "SaveAs" method, I was creating a FileStream that was not being disposed.
Before:
ExcelPackage excel_package = new ExcelPackage(new MemoryStream());
//...
//Do something here
//...
excel_package.SaveAs(new FileStream("filename.xlsx", FileMode.Create));
excel_package.Dispose();
After:
ExcelPackage excel_package = new ExcelPackage(new MemoryStream());
//...
//Do something here
//...
var file_stream = new FileStream("filename.xlsx", FileMode.Create);
excel_package.SaveAs(file_stream);
file_stream.Dispose();
excel_package.Dispose();
Note that the Memory Stream opened during the ExcelPackage declaration was not explicitly disposed, because the last command "excel_package.Dispose()" already does this internally.
Hope it help.
c# excel epplus
I am trying to use the iText library with c# to capture the text portion of pdf files.
I created a pdf from excel 2013 (exported) and then copied the sample from the web of how to use itext (added the lib ref to the project).
It reads perfectly the first page but it gets garbled info after that. It is keeping part of the first page and merging the info with the next page. The commented lines is when I was trying to solve the problem, the string "thePage" is recreated inside the for loop.
Here is the code. I can email the pdf to whoever can help with this issue.
Thanks in advance
public static string ExtractTextFromPdf(string path)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
//string[] theLines;
//theLines = new string[COLUMNS];
//string thePage;
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = "";
thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string [] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
// text.AppendLine(" ");
// Array.Clear(theLines, 0, theLines.Length);
// thePage = "";
}
return text.ToString();
}
}
A strategy object collects text data and does not know if a new page has started or not.
Thus, use a new strategy object for each page.
i am a new member an i really like this site because it help me always
my problem is
i want replace word document using openxml and add a page break
end then i want to write replaced text second page
here my codes
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(#"d:\a.docx", true))
{
using (StreamReader reader = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
text = reader.ReadToEnd();
}
Regex regexText = new Regex("#db#");
text = regexText.Replace(text, textBox4.Text.Trim());
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(text);
}
MainDocumentPart mainPart = wordDoc.MainDocumentPart;
Run r = new Run();
Paragraph para = new Paragraph(new Run(new Break() { Type = BreakValues.Page }));
using (StreamWriter sw1 = new StreamWriter(mainPart.GetStream(FileMode.Create)))
{
sw1.Write(text);
}
mainPart.Document.Body.InsertAfter(para, mainPart.Document.Body.LastChild);
mainPart.Document.Save();
}
}
I suggest you insert a page break in your a.docx in advance. Then, use MergeField to locate where you want to replace with.
Here is the example
I am trying to read a xml file from the web and parse it out using XDocument. It normally works fine but sometimes it gives me this error for day:
**' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1**
I have tried some solutions from Google but they aren't working for VS 2010 Express Windows Phone 7.
There is a solution which replace the 0x1F character to string.empty but my code return a stream which doesn't have replace method.
s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
Here is my code:
void webClient_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
using (var reader = new StreamReader(e.Result))
{
int[] counter = { 1 };
string s = reader.ReadToEnd();
Stream str = e.Result;
// s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
// byte[] str = Convert.FromBase64String(s);
// Stream memStream = new MemoryStream(str);
str.Position = 0;
XDocument xdoc = XDocument.Load(str);
var data = from query in xdoc.Descendants("user")
select new mobion
{
index = counter[0]++,
avlink = (string)query.Element("user_info").Element("avlink"),
nickname = (string)query.Element("user_info").Element("nickname"),
track = (string)query.Element("track"),
artist = (string)query.Element("artist"),
};
listBox.ItemsSource = data;
}
}
XML file:
http://music.mobion.vn/api/v1/music/userstop?devid=
0x1f is a Windows control character. It is not valid XML. Your best bet is to replace it.
Instead of using reader.ReadToEnd() (which by the way - for a large file - can use up a lot of memory.. though you can definitely use it) why not try something like:
string input;
while ((input = sr.ReadLine()) != null)
{
string = string + input.Replace((char)(0x1F), ' ');
}
you can re-convert into a stream if you'd like, to then use as you please.
byte[] byteArray = Encoding.ASCII.GetBytes( input );
MemoryStream stream = new MemoryStream( byteArray );
Or else you could keep doing readToEnd() and then clean that string of illegal characters, and convert back to a stream.
Here's a good resource for cleaning illegal characters in your xml - chances are, youll have others as well...
https://seattlesoftware.wordpress.com/tag/hexadecimal-value-0x-is-an-invalid-character/
What could be happening is that the content is compressed in which case you need to decompress it.
With HttpHandler you can do this the following way:
var client = new HttpClient(new HttpClientHandler
{
AutomaticDecompression = DecompressionMethods.GZip
| DecompressionMethods.Deflate
});
With the "old" WebClient you have to derive your own class to achieve the similar effect:
class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
return request;
}
}
Above taken from here
To use the two you would do something like this:
HttpClient
using (var client = new HttpClient(new HttpClientHandler { AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate }))
{
using (var stream = client.GetStreamAsync(url))
{
using (var sr = new StreamReader(stream.Result))
{
using (var reader = XmlReader.Create(sr))
{
var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
Console.WriteLine(item.Title.Text);
}
}
}
}
}
WebClient
using (var stream = new MyWebClient().OpenRead("http://myrss.url"))
{
using (var sr = new StreamReader(stream))
{
using (var reader = XmlReader.Create(sr))
{
var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
Console.WriteLine(item.Title.Text);
}
}
}
}
This way you also recieve the benefit of not having to .ReadToEnd() since you are working with the stream instead.
Consider using System.Web.HttpUtility.HtmlDecode if you're decoding content read from the web.
If you are having issues replacing the character
For me there were some issues if you try to replace using the string instead of the char. I suggest trying some testing values using both to see what they turn up. Also how you reference it has some effect.
var a = x.IndexOf('\u001f'); // 513
var b = x.IndexOf(Convert.ToString((byte)0x1F)); // -1
x = x.Replace(Convert.ToChar((byte)0x1F), ' '); // Works
x = x.Replace(Convert.ToString((byte)0x1F), " "); // Fails
I blagged this
I had the same issue and found that the problem was a embedded in the xml.
The solution was:
s = s.Replace("", " ")
I'd guess it's probably an encoding issue but without seeing the XML I can't say for sure.
In terms of your plan to simply replace the character but not being able to, because you have a stream rather than a text, simply read the stream into a string and then remove the characters you don't want.
Works for me.........
string.Replace(Chr(31), "")
I used XmlSerializer to parse XML and faced the same exception.
The problem is that the XML string contains HTML codes of invalid characters
This method removes all invalid HTML codes from string (based on this thread - https://forums.asp.net/t/1483793.aspx?Need+a+method+that+removes+illegal+XML+characters+from+a+String):
public static string RemoveInvalidXmlSubstrs(string xmlStr)
{
string pattern = "&#((\\d+)|(x\\S+));";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(xmlStr))
{
xmlStr = regex.Replace(xmlStr, new MatchEvaluator(m =>
{
string s = m.Value;
string unicodeNumStr = s.Substring(2, s.Length - 3);
int unicodeNum = unicodeNumStr.StartsWith("x") ?
Convert.ToInt32(unicodeNumStr.Substring(1), 16)
: Convert.ToInt32(unicodeNumStr);
//according to https://www.w3.org/TR/xml/#charsets
if ((unicodeNum == 0x9 || unicodeNum == 0xA || unicodeNum == 0xD) ||
((unicodeNum >= 0x20) && (unicodeNum <= 0xD7FF)) ||
((unicodeNum >= 0xE000) && (unicodeNum <= 0xFFFD)) ||
((unicodeNum >= 0x10000) && (unicodeNum <= 0x10FFFF)))
{
return s;
}
else
{
return String.Empty;
}
})
);
}
return xmlStr;
}
Nobody can answer if you don't show relevant info - I mean the Xml content.
As a general advice I would put a breakpoint after ReadToEnd() call. Now you can do a couple of things:
Reveal Xml content to this forum.
Test it using VS Xml visualizer.
Copy-paste the string into a txt file and investigate it offline.