How to read excel cell format (both .XLS & .XLSX) - openxml

I need to read the formatting information stored in an Excel cell. The file might contain a cell with text like "Some sample text", where each word has different formatting. For example, the word "Some" might be bold with its own font color and size, and the next word might be formatted differently.
Can we read the individual formatting applied to parts of a single cell value? If yes, please let me know how.
I have tried NPOI and the OpenXML SDK 2.5 with no luck. I would prefer not to use Excel interop, since this will run on a server.

Here is a complete "hacker's answer" to your question.
I made a small Excel file with a single cell like this:
I saved the file and closed it. Next, I changed the file extension from .xlsx to .zip, which lets you open it and look at the contents. In a subfolder named xl you will find a file called "sharedStrings.xml" that contains all the strings in the workbook, along with their formatting within the cell. It is one big array: each distinct string is stored only once, so a value entered in multiple cells is not duplicated, and the cells in the worksheet files refer to these entries by index. Work through this file to get all the strings in all the cells, and their format...
For this particular cell, the sharedStrings.xml file contents start out as follows:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="1">
  <si>
    <r>
      <t xml:space="preserve">This </t>
    </r>
    <r>
      <rPr>
        <b/>
        <sz val="11"/>
        <color theme="1"/>
        <rFont val="Calibri"/>
        <family val="2"/>
        <scheme val="minor"/>
      </rPr>
      <t>cell</t>
    </r>
    <r>
      <rPr>
        <sz val="11"/>
        <color theme="1"/>
... etc
By reading this file you will be able to reconstruct all the strings with their formatting. There MUST be a better way...
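For what it's worth, the manual walk through sharedStrings.xml can be scripted. Below is a minimal Python sketch (Python is only my choice for illustration; the question does not ask for it). It assumes the layout shown above: each <si> holds either a plain <t> or a series of <r> runs with an optional <rPr>. The names parse_shared_strings and read_shared_strings are hypothetical helpers, not part of any library, and this only covers .xlsx files.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace taken from the <sst> element shown above.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def parse_shared_strings(xml_bytes):
    """Return one list per <si> entry; each item is (text, format_dict)."""
    root = ET.fromstring(xml_bytes)
    strings = []
    for si in root.findall("m:si", NS):
        runs = []
        plain = si.find("m:t", NS)
        if plain is not None:
            # Unformatted string: a single <t> directly under <si>.
            runs.append((plain.text or "", {}))
        for r in si.findall("m:r", NS):
            fmt = {}
            rpr = r.find("m:rPr", NS)
            if rpr is not None:
                for prop in rpr:
                    name = prop.tag.split("}", 1)[-1]  # drop the namespace
                    # Flag elements such as <b/> carry no attributes.
                    fmt[name] = dict(prop.attrib) or True
            t = r.find("m:t", NS)
            text = t.text if t is not None and t.text else ""
            runs.append((text, fmt))
        strings.append(runs)
    return strings


def read_shared_strings(xlsx_path):
    """Hypothetical convenience wrapper: read the part straight from the zip."""
    with zipfile.ZipFile(xlsx_path) as archive:
        return parse_shared_strings(archive.read("xl/sharedStrings.xml"))
```

A run that has no <rPr> comes back with an empty format dict, as with the "This " run in the sample above.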

You are talking about reading a RichText value from a cell. This will be supported in the NPOI 2.0 final release, which is planned for the first quarter of 2014.

Related

Copying NatTable cells and pasting in Excel does not work properly

My NatTable looks like the one below.
When I copy the cells and paste them into Excel, the cells look distorted, as shown below.
The line break is not captured properly in the process.
The code used for copying is the same as in the NatTable example.
Is this a bug that is fixed in a later version, or am I missing something?
I would not say that this is a bug in NatTable. Pasting a line break into Excel inside a cell is not that simple. You can search for this topic and see the real issue. When you copy content from NatTable, the content includes the line breaks of your cell data. The paste operation in Excel takes those line breaks and interprets them as a new row, not as a new line inside a cell.
You can of course implement and register a custom CopyDataCommandHandler that replaces line breaks in the NatTable content with something that Excel handles as line breaks inside a cell.
The solution as of now is as follows:
If the content sent to the clipboard is wrapped in double quotes and the \n sits inside those quotes, Excel interprets it as the content of a single cell and adds the line break within the cell.
Alternatively, since this is a table, we can convert the content to HTML with the appropriate tags when copying to the clipboard; Excel reads the HTML format and converts it appropriately.
Refer to the image below.
However, copying between cells from NatTable to NatTable already handles this correctly.
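The double-quote approach can be sketched roughly as follows. This is a language-neutral illustration written in Python, not NatTable code; in practice the same logic would live in the custom CopyDataCommandHandler mentioned earlier. The function name excel_clipboard_text is hypothetical, and doubling embedded quotes is an assumption borrowed from the usual CSV convention.

```python
def excel_clipboard_text(rows):
    """Build tab-separated clipboard text that keeps line breaks inside a cell."""
    lines = []
    for row in rows:
        cells = []
        for value in row:
            text = str(value)
            if any(ch in text for ch in '\n\r\t"'):
                # Assumption: CSV-style escaping, i.e. double any embedded
                # quotes, then wrap the whole cell in double quotes.
                text = '"' + text.replace('"', '""') + '"'
            cells.append(text)
        # Tabs separate the cells of one row.
        lines.append("\t".join(cells))
    # Unquoted line breaks separate rows.
    return "\r\n".join(lines)
```

Only cells that actually contain a line break, tab, or quote get wrapped, so ordinary cells are pasted unchanged.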

Writing CR+LF into Open XML from a Database

I'm trying to take some data stored in a database and populate a Word template's Content Controls with it using the Open XML SDK. The data contains paragraphs and so there are carriage return and line feed characters in it. The data is stored in the database as nvarchar.
When I open the generated document, the CR+LF combination shows up as a question mark with a box around it (I'm not sure what this character is called). There are actually two sequences back to back, so CR+LF CR+LF produces two strange characters:
If I unzip the .docx, take the Custom XML part and do a hex dump, I can clearly see 0d0a 0d0a so the CR+LF is there. Word is just printing it weird.
I've tried enforcing UTF-8 encoding in my XmlWriter's settings, but that didn't seem to help:
Dim docStream As New MemoryStream
Dim settings As XmlWriterSettings = New XmlWriterSettings()
settings.Encoding = New UTF8Encoding(False)
Dim docWriter As XmlWriter = XmlWriter.Create(docStream, settings)
Does anyone know how I can get Word to render these characters correctly when written to a .docx through the Open XML SDK?
To bind to a Word 2013 rich text control, your XML element has to contain a complete docx. See [MS-DOCX]:
the data stored in the XML element will be an escaped string comprised of a flattened WordprocessingML document representing the formatted data in the structured document tag range.
Earlier versions couldn't bind a rich text control.
Things should work though (with CR/LF, not w:br), if you bind to a plain text control, with multiline set to true.

jasper-report : Tag to use in String to create an unnumbered list in word

Take a regular String in Java containing the following:
String test = "hello I'm a list [ul][li]item1[li]item2[/ul]";
Now suppose I want to print this string in an HTML document: all I have to do is replace the [] with <> and I'm in business. But I also want to use this string in a Jasper report that exports to docx format. If I want an unnumbered list to appear in Word, what kind of replacement should I use? Is that even possible?
I've been thinking of using the RTF tags I read about here:
https://metacpan.org/pod/distribution/RTF-Writer/lib/RTF/Cookbook.pod#RTF-Document-Formatting-Commands
This sort of approach is misguided. You should hold your data separate from how it is rendered and allow Jasper to render the HTML or RTF accordingly.
If you changed your data structure to something like the following, rendering it in different formats would be a lot easier:
<data>
  <heading>hello I'm a list</heading>
  <items>
    <item>item1</item>
    <item>item2</item>
  </items>
</data>
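To illustrate why the separation helps, here is a small Python sketch that renders the structure above in two different ways. It is not JasperReports code and says nothing about how Jasper itself would lay out the list; render_html and render_plain are hypothetical names used only to show that each output format becomes a separate, simple renderer over the same data.

```python
import xml.etree.ElementTree as ET
from html import escape


def _load(data_xml):
    """Pull the heading and the list items out of the <data> document."""
    root = ET.fromstring(data_xml)
    heading = root.findtext("heading", default="")
    items = [item.text or "" for item in root.findall("items/item")]
    return heading, items


def render_html(data_xml):
    """Render the data as a paragraph followed by an unordered list."""
    heading, items = _load(data_xml)
    list_items = "".join(
        "<li>" + escape(item, quote=False) + "</li>" for item in items
    )
    return "<p>" + escape(heading, quote=False) + "</p><ul>" + list_items + "</ul>"


def render_plain(data_xml):
    """Render the same data as plain text with dash bullets."""
    heading, items = _load(data_xml)
    return "\n".join([heading] + ["- " + item for item in items])
```

Adding another output format then means adding another renderer, without touching the stored data.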

How to parse .pdf files in Perl?

How to parse .pdf files in Perl?
Is Perl more efficient, or should I use another language?
When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) using the -xml output option. This produces an XML file which I parse using XML::Twig (or any other XML parser you like except XML::Simple).
The XML format is fairly simple. You get a <page> element for each page in the PDF, which contains <fontspec> elements describing the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple can't parse it properly).
You do need to use the top and left attributes of the <text> tags to get them in the right order, because they aren't necessarily emitted in top-to-bottom order. The coordinate system has 0,0 in the upper left corner of the page with down and right being positive. Dimensions are in PostScript points (72 points per inch).
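The ordering step can be sketched like this. I use XML::Twig in Perl myself; the example below is in Python purely for illustration, and page_lines is a hypothetical name. It assumes the XML has the shape described above (<page> elements containing <text> elements with top and left attributes) and sorts on exact top values, so text fragments whose baselines differ by a point or two may need extra grouping in real documents.

```python
import xml.etree.ElementTree as ET


def page_lines(xml_text):
    """Return, for each <page>, its text fragments in reading order."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        fragments = []
        for text in page.findall("text"):
            top = float(text.get("top", 0))
            left = float(text.get("left", 0))
            # itertext() flattens nested <b>/<i> markup into plain text.
            fragments.append((top, left, "".join(text.itertext())))
        # 0,0 is the upper left corner: sort top-to-bottom, then left-to-right.
        fragments.sort(key=lambda item: (item[0], item[1]))
        pages.append([content for _, _, content in fragments])
    return pages
```

Because itertext() flattens the nested bold and italic tags, the mixed content that trips up XML::Simple is not a problem here.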
I personally use CAM::PDF.
my $doc = CAM::PDF->new($fileName) || die "$CAM::PDF::errStr\n";
CAM::PDF->asciify(\$pdfString);
PDFs are not designed for parsing but for display and printing, so any extraction is a matter of trial and error, and parsing may be impossible if everything is stored as graphics. A good indicator is whether you can copy and paste the content from the PDF into an editor. If that works, you are in business.
Look at CPAN; specifically, if you want to do OCR, see PDF::OCR2.
I don't know of any module that parses them, that is, if you want to extract the text from them. There are a number of modules that let you manipulate them. Try PDF::API2.

How would I go about parsing this .xml for iPhone table

I am making a directory for an app and I need to parse the names, e-mail addresses, phone numbers, and offices for each item that I want to display in a UITableView. I have a class made, but I have never really dealt with parsing anything beyond simple txt files.
I need to load an XML file from a URL; it consists of the type of data shown at the bottom. It does not have XML tags, but it is saved as a .xml file.
I have read up on NSXMLParser, but I wasn't sure whether that would be the correct way to do this or whether there is an easier way.
Below is an example of part of the .xml file. It is just part of a few hundred lines organized in the same manner: by division, department, then person.
Thanks for any help!
http://cs.millersville.edu/School of Science and MathematicsDr.FirstH.LastRoddy Science CenterFirst.Last#millersville.edu872-3838Computer ScienceMrs.First.LastRoddy Science CenterFirst.Last#millersville.edu872-3858Computer ScienceDr.FirstH.LastRoddy Science CenterFirstH.LastRoddy#millersville.edu872-3470Computer ScienceDr.FirstH.LastRoddy Science CenterFirst.Last#millersville.edu872-3724Computer ScienceMs.FirstA.GilbertLast Science CenterFirst.Last#millersville.edu871-2214Computer ScienceDr.FirstH.LastRoddy Science CenterFirst.Last#millersville.edu872-3666
There's no way you can use an XML parser for this file.
Instead, you could use NSScanner to parse the text file. A couple of tutorials are listed here:
Parsing CSV Data
Writing a parser using NSScanner
Without the XML tags, your file is as good as a plain text file:
rows are separated by a newline character,
and each line contains data separated by a dot (.) or something similar. Figure out the pattern and parse it like you would parse a text file.
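As a rough illustration of "figure out the pattern", here is a Python sketch (not Objective-C; on the iPhone the same idea would be expressed with NSScanner or a regular expression). It assumes, based only on the sample in the question, that every person's record ends with a phone number of the form 872-3838. The name split_records is hypothetical, and splitting the remaining text into name, office, and e-mail depends on delimiters that the pasted sample does not show.

```python
import re

# Assumption: each record ends with a 7-digit phone number like 872-3838.
PHONE = re.compile(r"\d{3}-\d{4}")


def split_records(blob):
    """Cut the text into (raw_record, phone) pairs, one per phone number."""
    records = []
    start = 0
    for match in PHONE.finditer(blob):
        # Everything since the previous phone number belongs to this record.
        records.append((blob[start:match.start()], match.group()))
        start = match.end()
    return records
```

Each pair could then be turned into an object and handed to the table view's data source.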