How to best detect encoding in XML file? - encoding

To load XML files with arbitrary encoding I have the following code:
Encoding encoding;
using (var reader = new XmlTextReader(filepath))
{
reader.MoveToContent();
encoding = reader.Encoding;
}
var settings = new XmlReaderSettings { NameTable = new NameTable() };
var xmlns = new XmlNamespaceManager(settings.NameTable);
var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default,
encoding);
using (var reader = XmlReader.Create(filepath, settings, context))
{
return XElement.Load(reader);
}
This works, but it seems a bit inefficient to open the file twice. Is there a better way to detect the encoding such that I can do:
Open file
Detect encoding
Read XML into an XElement
Close file

Ok, I should have thought of this earlier. Both XmlTextReader (which gives us the Encoding) and XmlReader.Create (which allows us to specify encoding) accepts a Stream. So how about first opening a FileStream and then use this with both XmlTextReader and XmlReader, like this:
using (var txtreader = new FileStream(filepath, FileMode.Open))
{
using (var xmlreader = new XmlTextReader(txtreader))
{
// Read in the encoding info
xmlreader.MoveToContent();
var encoding = xmlreader.Encoding;
// Rewind to the beginning
txtreader.Seek(0, SeekOrigin.Begin);
var settings = new XmlReaderSettings { NameTable = new NameTable() };
var xmlns = new XmlNamespaceManager(settings.NameTable);
var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default,
encoding);
using (var reader = XmlReader.Create(txtreader, settings, context))
{
return XElement.Load(reader);
}
}
}
This works like a charm. Reading XML files in an encoding independent way should have been more elegant but at least I'm getting away with only one file open.

Another option, quite simple, is to use Linq to XML. The Load method automatically reads the encoding from the xml file. You can then get the encoder value by using the XDeclaration.Encoding property.
An example from MSDN:
// Create the document
XDocument encodedDoc16 = new XDocument(
new XDeclaration("1.0", "utf-16", "yes"),
new XElement("Root", "Content")
);
encodedDoc16.Save("EncodedUtf16.xml");
Console.WriteLine("Encoding is:{0}", encodedDoc16.Declaration.Encoding);
Console.WriteLine();
// Read the document
XDocument newDoc16 = XDocument.Load("EncodedUtf16.xml");
Console.WriteLine("Encoded document:");
Console.WriteLine(File.ReadAllText("EncodedUtf16.xml"));
Console.WriteLine();
Console.WriteLine("Encoding of loaded document is:{0}", newDoc16.Declaration.Encoding);
While this may not server the original poster, as he would have to refactor a lot of code, it is useful for someone who has to write new code for their project, or if they think that refactoring is worth it.

Related

Word found unreadable content in xxx.docx after split a docx using openxml

I have a full.docx which includes two math questions, the docx embeds some pictures and MathType equation (oleobject), I split the doc according to this, get two files (first.docx, second.docx) , first.docx works fine, the second.docx, however, pops up a warning dialog when I try to open it:
"Word found unreadable content in second.docx. Do you want to recover the contents of this document? If you trust the source of this document, click Yes."
After click "Yes", the doc can be opened, the content is also correct, I want to know what is wrong with the second.docx? I have checked it with "Open xml sdk 2.5 productivity tool", but found no reason. Very appreciated for any help. Thanks.
The three files have been uploaded to here.
Show some code:
byte[] templateBytes = System.IO.File.ReadAllBytes(TEMPLATE_YANG_FILE);
using (MemoryStream templateStream = new MemoryStream())
{
templateStream.Write(templateBytes, 0, (int)templateBytes.Length);
string guidStr = Guid.NewGuid().ToString();
using (WordprocessingDocument document = WordprocessingDocument.Open(templateStream, true))
{
document.ChangeDocumentType(DocumentFormat.OpenXml.WordprocessingDocumentType.Document);
MainDocumentPart mainPart = document.MainDocumentPart;
mainPart.Document = new Document();
Body bd = new Body();
foreach (DocumentFormat.OpenXml.Wordprocessing.Paragraph clonedParagrph in lst)
{
bd.AppendChild<DocumentFormat.OpenXml.Wordprocessing.Paragraph>(clonedParagrph);
clonedParagrph.Descendants<Blip>().ToList().ForEach(blip =>
{
var newRelation = document.CopyImage(blip.Embed, this.wordDocument);
blip.Embed = newRelation;
});
clonedParagrph.Descendants<DocumentFormat.OpenXml.Vml.ImageData>().ToList().ForEach(imageData =>
{
var newRelation = document.CopyImage(imageData.RelationshipId, this.wordDocument);
imageData.RelationshipId = newRelation;
});
}
mainPart.Document.Body = bd;
mainPart.Document.Save();
}
string subDocFile = System.IO.Path.Combine(this.outDir, guidStr + ".docx");
this.subWordFileLst.Add(subDocFile);
File.WriteAllBytes(subDocFile, templateStream.ToArray());
}
the lst contains Paragraph cloned from original docx using:
(DocumentFormat.OpenXml.Wordprocessing.Paragraph)p.Clone();
Using productivity tool, found oleobjectx.bin not copied, so I add below code after copy Blip and ImageData:
clonedParagrph.Descendants<OleObject>().ToList().ForEach(ole =>
{
var newRelation = document.CopyOleObject(ole.Id, this.wordDocument);
ole.Id = newRelation;
});
Solved the issue.

iText not returning text contents of a PDF after first page

I am trying to use the iText library with c# to capture the text portion of pdf files.
I created a pdf from excel 2013 (exported) and then copied the sample from the web of how to use itext (added the lib ref to the project).
It reads perfectly the first page but it gets garbled info after that. It is keeping part of the first page and merging the info with the next page. The commented lines is when I was trying to solve the problem, the string "thePage" is recreated inside the for loop.
Here is the code. I can email the pdf to whoever can help with this issue.
Thanks in advance
public static string ExtractTextFromPdf(string path)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
//string[] theLines;
//theLines = new string[COLUMNS];
//string thePage;
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = "";
thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string [] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
// text.AppendLine(" ");
// Array.Clear(theLines, 0, theLines.Length);
// thePage = "";
}
return text.ToString();
}
}
A strategy object collects text data and does not know if a new page has started or not.
Thus, use a new strategy object for each page.

Create doc file from template and adding data from database using open xml

I have a word template and I want to create doc file from that, also I want to replace add data in place of bookmarks present in the template.
I have been able to create a doc file, but I am not able to understand, how to add data in place of bookmarks?
My code till now:
private void CreateSampleWordDocument()
{
string sourceFile = Path.Combine(Environment.CurrentDirectory, "GeneralWelcomeLetter.dotx");
string destinationFile = Path.Combine(Environment.CurrentDirectory, "Sample.docx");
try
{
File.Copy(sourceFile, destinationFile, true);
WordprocessingDocument document = WordprocessingDocument.Open(destinationFile, true);
document.ChangeDocumentType(DocumentFormat.OpenXml.WordprocessingDocumentType.Document);
MainDocumentPart mainPart = document.MainDocumentPart;
DocumentSettingsPart documentSettingPart1 = mainPart.DocumentSettingsPart;
AttachedTemplate attachedTemplate1 = new AttachedTemplate() { Id = "relationId1" };
documentSettingPart1.Settings.Append(attachedTemplate1);
}
catch
{
}
}
Now to add data from database in place of bookmarks?

Streaming OpenXML result

By using OpenXML to manipulating a Word document (as a template), the server application saves the new content as a temporary file and then sends it to user to download.
The question is how to make these content ready to download without saving it on the server as a temporary file? Is it possible to save OpenXML result as a byte[] or Stream instead of saving it as a file?
Using this page:
OpenXML file download without temporary file
I changed my code to this one:
byte[] result = null;
byte[] templateBytes = System.IO.File.ReadAllBytes(wordTemplate);
using (MemoryStream templateStream = new MemoryStream())
{
templateStream.Write(templateBytes, 0, (int)templateBytes.Length);
using (WordprocessingDocument doc = WordprocessingDocument.Open(templateStream, true))
{
MainDocumentPart mainPart = doc.MainDocumentPart;
...
mainPart.Document.Save();
templateStream.Position = 0;
using (MemoryStream memoryStream = new MemoryStream())
{
templateStream.CopyTo(memoryStream);
result = memoryStream.ToArray();
}
}
}
You can create the WordprocessingDocument and then use the Save() method to save it to a Stream.
http://msdn.microsoft.com/en-us/library/cc882497
var memoryStream = new MemoryStream();
document.Clone(memoryStream);

how to programmatically access the builtin properties of an open xml worddoc file

i would like to access some built in properties(like author,last modified date,etc.) of an open xml word doc file. i would like to use open xml sdk2.0 for this purpose. so i wonder if there is any class or any way i could programmatically access these builtin properties.
An explanation of the following method can be found here, but pretty much you need to pass in the properties that you want to get out of the core.xml file to this method and it will return the value:
public static string WDRetrieveCoreProperty(string docName, string propertyName)
{
// Given a document name and a core property, retrieve the value of the property.
// Note that because this code uses the SelectSingleNode method,
// the search is case sensitive. That is, looking for "Author" is not
// the same as looking for "author".
const string corePropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
const string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
const string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";
string propertyValue = string.Empty;
using (WordprocessingDocument wdPackage = WordprocessingDocument.Open(docName, true))
{
// Get the core properties part (core.xml).
CoreFilePropertiesPart corePropertiesPart = wdPackage.CoreFilePropertiesPart;
// Manage namespaces to perform XML XPath queries.
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("cp", corePropertiesSchema);
nsManager.AddNamespace("dc", dcPropertiesSchema);
nsManager.AddNamespace("dcterms", dcTermsPropertiesSchema);
// Get the properties from the package.
XmlDocument xdoc = new XmlDocument(nt);
// Load the XML in the part into an XmlDocument instance.
xdoc.Load(corePropertiesPart.GetStream());
string searchString = string.Format("//cp:coreProperties/{0}", propertyName);
XmlNode xNode = xdoc.SelectSingleNode(searchString, nsManager);
if (!(xNode == null))
{
propertyValue = xNode.InnerText;
}
}
return propertyValue;
}
You can also use the packaging API:
using System.IO.Packaging.Package;
[...]
using (var package = Package.Open(path))
{
package.PackageProperties.Creator = Environment.UserName;
package.PackageProperties.LastModifiedBy = Environment.UserName;
}
That works also for other open XML formats like power point.
package.Save();
Then
package.closed;
I think that Is the best way.