Change encoding of extracted text using PdfTextExtractor.GetTextFromPage? - itext

I want to extract some text using itextsharp but instead of 'Köln' I get 'K\0ln', instead of 'Währung' I get 'W\0hrung' (and more examples), which means both 'ä' and 'ö' gets replaced by '\0'.
How do I set the encoding?
using (PdfReader reader = new PdfReader(sSourceFileName))
{
PdfReader.unethicalreading = true;
string sText = PdfTextExtractor.GetTextFromPage(reader, iPageFilter, new SimpleTextExtractionStrategy());
}

Related

MalformedInputException: Input length = 1 while reading text file with Files.readAllLines(Path.get("file").get(0);

Why am I getting this error? I'm trying to extract information from a bank statement PDF and tally different bills for the month. I write the data from a PDF to a text file so I can get specific data from the file (e.g. ASPEN HOME IMPRO, then iterate down to what the dollar amount is, then read that text line to a string)
When the Files.readAllLines(Path.get("bankData").get(0) code is run, I get the error. Any thoughts why? Encoding issue?
Here is the code:
public static void main(String[] args) throws IOException {
File file = new File("C:\\Users\\wmsai\\Desktop\\BankStatement.pdf");
PDFTextStripper stripper = new PDFTextStripper();
BufferedWriter bw = new BufferedWriter(new FileWriter("bankData"));
BufferedReader br = new BufferedReader(new FileReader("bankData"));
String pdfText = stripper.getText(Loader.loadPDF(file)).toUpperCase();
bw.write(pdfText);
bw.flush();
bw.close();
LineNumberReader lineNum = new LineNumberReader(new FileReader("bankData"));
String aspenHomeImpro = "PAYMENT: ACH: ASPEN HOME IMPRO";
String line;
while ((line = lineNum.readLine()) != null) {
if (line.contains(aspenHomeImpro)) {
int lineNumber = lineNum.getLineNumber();
int newLineNumber = lineNumber + 4;
String aspenData = Files.readAllLines(Paths.get("bankData")).get(0); //This is the code with the error
System.out.println(newLineNumber);
break;
} else if (!line.contains(aspenHomeImpro)) {
continue;
}
}
}
So I figured it out. I had to check the properties of the text file in question (I'm using Eclipse) to figure out what the actual encoding of the text file was.
Then, when creating the file in the program, encode the text file to UTF-8 so that Files.readAllLines could read and grab the data I wanted to get.

Insert multiple lines of text into a Rich Text content control with OpenXML

I'm having difficulty getting a content control to follow multi-line formatting. It seems to interpret everything I'm giving it literally. I am new to OpenXML and I feel like I must be missing something simple.
I am converting my multi-line string using this function.
private static void parseTextForOpenXML(Run run, string text)
{
string[] newLineArray = { Environment.NewLine, "<br/>", "<br />", "\r\n" };
string[] textArray = text.Split(newLineArray, StringSplitOptions.None);
bool first = true;
foreach (string line in textArray)
{
if (!first)
{
run.Append(new Break());
}
first = false;
Text txt = new Text { Text = line };
run.Append(txt);
}
}
I insert it into the control with this
public static WordprocessingDocument InsertText(this WordprocessingDocument doc, string contentControlTag, string text)
{
SdtElement element = doc.MainDocumentPart.Document.Body.Descendants<SdtElement>().FirstOrDefault(sdt => sdt.SdtProperties.GetFirstChild<Tag>().Val == contentControlTag);
if (element == null)
throw new ArgumentException("ContentControlTag " + contentControlTag + " doesn't exist.");
element.Descendants<Text>().First().Text = text;
element.Descendants<Text>().Skip(1).ToList().ForEach(t => t.Remove());
return doc;
}
I call it with something like...
doc.InsertText("Primary", primaryRun.InnerText);
Although I've tried InnerXML and OuterXML as well. The results look something like
Example AttnExample CompanyExample AddressNew York, NY 12345 or
<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:t>Example Attn</w:t><w:br /><w:t>Example Company</w:t><w:br /><w:t>Example Address</w:t><w:br /><w:t>New York, NY 12345</w:t></w:r>
The method works fine for simple text insertion. It's just when I need it to interpret the XML that it doesn't work for me.
I feel like I must be super close to getting what I need, but my fiddling is getting me nowhere. Any thoughts? Thank you.
I believe the way I was trying to do it was doomed to fail. Setting the Text attribute of an element is always going to be interpreted as text to be displayed it seems. I ended up having to take a slightly different tack. I created a new insert method.
public static WordprocessingDocument InsertText(this WordprocessingDocument doc, string contentControlTag, Paragraph paragraph)
{
SdtElement element = doc.MainDocumentPart.Document.Body.Descendants<SdtElement>().FirstOrDefault(sdt => sdt.SdtProperties.GetFirstChild<Tag>().Val == contentControlTag);
if (element == null)
throw new ArgumentException("ContentControlTag " + contentControlTag + " doesn't exist.");
OpenXmlElement cc = element.Descendants<Text>().First().Parent;
cc.RemoveAllChildren();
cc.Append(paragraph);
return doc;
}
It starts the same, and gets the Content Control by searching for it's Tag. But then I get it's parent, remove the Content Control elements that were there and just replace them with a paragraph element.
It's not exactly what I had envisioned, but it seems to work for my needs.

Insert HTML in docx file

I have made an application that fills wordfiles with customxmlparts now I am trying to put text into a textfield, but it has HTML in it and I want it to show the styling of it. I tried converting it to rich text format but that just gets pasted in the word file. Here is an example of the code:
var taskId = Guid.NewGuid();
var tempFilePath = $"{Path.GetTempPath()}/{taskId}";
using (var templateStream = new FileStream($"{tempFilePath}.docx", FileMode.CreateNew))
{
templateStream.Write(template, 0, template.Length);
// 1. Fill template.
using (WordprocessingDocument doc = WordprocessingDocument.Open(templateStream, true))
{
MainDocumentPart mainDocument = doc.MainDocumentPart;
if (mainDocument.CustomXmlParts != null)
{
mainDocument.DeleteParts<CustomXmlPart>(mainDocument.CustomXmlParts);
}
CustomXmlPart cxp = mainDocument.AddCustomXmlPart(CustomXmlPartType.CustomXml);
foreach (var line in data.Lines)
{
if (line.MoreInfo != null && line.MoreInfo != " ") {
}
}
var xmlData = ObjectToXml(data);
using (var stream = GenerateStreamFromString(tempFilePath, xmlData))
{
cxp.FeedData(stream);
}
mainDocument.Document.Save();
}
}
You can't just write the HTML formatted text into a DOCX field, you would need to convert it into a WordprocessingML format.
However, there is another way that you could try and that is to insert an "AltChunk" element. That element represents a sort of like a placeholder which can reference a HTML file and then when the DOCX file is opened in MS Word, it will make that HTML to WordprocessingML conversion for you. For details see: How to Use altChunk for Document Assembly
Alternatively you could use some third party, like GemBox.Document, which can make that HTML to WordprocessingML conversion for you.
For example check this Set Content example:
// Set content using HTML tags
document.Sections[0].Blocks[4].Content.LoadText(
"Paragraph 5 <b>(part of this paragraph is bold)</b>", LoadOptions.HtmlDefault);

Remove Content controls after adding text using open xml

By the help of some very kind community members here I managed to programatically create a function to replace text inside content controls in a Word document using open xml. After the document is generated it removes the formatting of the text after I replace the text.
Any ideas on how I can still keep the formatting in word and remove the content control tags ?
This is my code:
using (var wordDoc = WordprocessingDocument.Open(mem, true))
{
var mainPart = wordDoc.MainDocumentPart;
ReplaceTags(mainPart, "FirstName", _firstName);
ReplaceTags(mainPart, "LastName", _lastName);
ReplaceTags(mainPart, "WorkPhoe", _workPhone);
ReplaceTags(mainPart, "JobTitle", _jobTitle);
mainPart.Document.Save();
SaveFile(mem);
}
private static void ReplaceTags(MainDocumentPart mainPart, string tagName, string tagValue)
{
//grab all the tag fields
IEnumerable<SdtBlock> tagFields = mainPart.Document.Body.Descendants<SdtBlock>().Where
(r => r.SdtProperties.GetFirstChild<Tag>().Val == tagName);
foreach (var field in tagFields)
{
//remove all paragraphs from the content block
field.SdtContentBlock.RemoveAllChildren<Paragraph>();
//create a new paragraph containing a run and a text element
Paragraph newParagraph = new Paragraph();
Run newRun = new Run();
Text newText = new Text(tagValue);
newRun.Append(newText);
newParagraph.Append(newRun);
//add the new paragraph to the content block
field.SdtContentBlock.Append(newParagraph);
}
}
Keeping the style is a tricky problem as there could be more than one style applied to the text you are trying to replace. What should you do in that scenario?
Assuming a simple case of one style (but potentially over many Paragraphs, Runs and Texts) you could keep the first Text element you come across per SdtBlock and place your required value in that element then delete any further Text elements from the SdtBlock. The formatting from the first Text element will then be maintained. Obviously you can apply this theory to any of the Text blocks; you don't have to necessarily use the first. The following code should show what I mean:
private static void ReplaceTags(MainDocumentPart mainPart, string tagName, string tagValue)
{
IEnumerable<SdtBlock> tagFields = mainPart.Document.Body.Descendants<SdtBlock>().Where
(r => r.SdtProperties.GetFirstChild<Tag>().Val == tagName);
foreach (var field in tagFields)
{
IEnumerable<Text> texts = field.SdtContentBlock.Descendants<Text>();
for (int i = 0; i < texts.Count(); i++)
{
Text text = texts.ElementAt(i);
if (i == 0)
{
text.Text = tagValue;
}
else
{
text.Remove();
}
}
}
}

How to best detect encoding in XML file?

To load XML files with arbitrary encoding I have the following code:
Encoding encoding;
using (var reader = new XmlTextReader(filepath))
{
reader.MoveToContent();
encoding = reader.Encoding;
}
var settings = new XmlReaderSettings { NameTable = new NameTable() };
var xmlns = new XmlNamespaceManager(settings.NameTable);
var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default,
encoding);
using (var reader = XmlReader.Create(filepath, settings, context))
{
return XElement.Load(reader);
}
This works, but it seems a bit inefficient to open the file twice. Is there a better way to detect the encoding such that I can do:
Open file
Detect encoding
Read XML into an XElement
Close file
Ok, I should have thought of this earlier. Both XmlTextReader (which gives us the Encoding) and XmlReader.Create (which allows us to specify encoding) accepts a Stream. So how about first opening a FileStream and then use this with both XmlTextReader and XmlReader, like this:
using (var txtreader = new FileStream(filepath, FileMode.Open))
{
using (var xmlreader = new XmlTextReader(txtreader))
{
// Read in the encoding info
xmlreader.MoveToContent();
var encoding = xmlreader.Encoding;
// Rewind to the beginning
txtreader.Seek(0, SeekOrigin.Begin);
var settings = new XmlReaderSettings { NameTable = new NameTable() };
var xmlns = new XmlNamespaceManager(settings.NameTable);
var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default,
encoding);
using (var reader = XmlReader.Create(txtreader, settings, context))
{
return XElement.Load(reader);
}
}
}
This works like a charm. Reading XML files in an encoding independent way should have been more elegant but at least I'm getting away with only one file open.
Another option, quite simple, is to use Linq to XML. The Load method automatically reads the encoding from the xml file. You can then get the encoder value by using the XDeclaration.Encoding property.
An example from MSDN:
// Create the document
XDocument encodedDoc16 = new XDocument(
new XDeclaration("1.0", "utf-16", "yes"),
new XElement("Root", "Content")
);
encodedDoc16.Save("EncodedUtf16.xml");
Console.WriteLine("Encoding is:{0}", encodedDoc16.Declaration.Encoding);
Console.WriteLine();
// Read the document
XDocument newDoc16 = XDocument.Load("EncodedUtf16.xml");
Console.WriteLine("Encoded document:");
Console.WriteLine(File.ReadAllText("EncodedUtf16.xml"));
Console.WriteLine();
Console.WriteLine("Encoding of loaded document is:{0}", newDoc16.Declaration.Encoding);
While this may not server the original poster, as he would have to refactor a lot of code, it is useful for someone who has to write new code for their project, or if they think that refactoring is worth it.