Problems reading special characters from word document using office interop in C#

I'm trying to read text from a Word .doc file using Microsoft.Office.Interop.Word.
The text contains temperatures in degrees, e.g. 95°F.
When I get the string from Range.Text, it becomes 95(F.
Here is the C# code:
private Microsoft.Office.Interop.Word.Application _wordApp;
Microsoft.Office.Interop.Word.Document wordDocument
    = _wordApp.Documents.Open(fileName, false, true);
for (int tableCounter = 1; tableCounter <= wordDocument.Tables.Count; tableCounter++)
{
    var inputTable = wordDocument.Tables[tableCounter];
    for (int cellCounter = 1; cellCounter <= inputTable.Range.Cells.Count; cellCounter++)
    {
        var problemText = inputTable.Range.Cells[cellCounter].Range.Text;
    }
}

Related

itextsharp extract text pdf not working

I'm having trouble getting the text from a page.
I get an "Object reference not set to an instance of an object" error on this line:
String extractText = PdfTextExtractor.GetTextFromPage(pdfReader, i);
The full code is below:
var pdfText = new StringBuilder();
using (var pdfReader = new PdfReader(cbPdf.SelectedValue + ""))
{
    for (var i = 0; i <= pdfReader.NumberOfPages; i++)
    {
        String extractText = PdfTextExtractor.GetTextFromPage(pdfReader, i);
        extractText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
        pdfText.Append(extractText);
    }
}
rtxtTexto.Text = pdfText.ToString();

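iText numbers pages starting at 1, i.e. the first page has number 1.
You already took that into account at the end of your loop (by comparing with <=), just not at the start (where you begin at 0).
Thus, use:
for (var i = 1; i <= pdfReader.NumberOfPages; i++)
That being said, as far as I know your line
extractText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
is nonsense: it turns the string into bytes in the default code page, converts those bytes to UTF-8, and decodes them again, so at best you get the same string back, and any character the default code page cannot represent is lost on the way.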

Programmatically create Infopath form in SharePoint form library with dates

I am programmatically creating InfoPath forms in a form library within SharePoint 2010 from data in a CSV file. It all works fine apart from the date fields: the form refuses to open and reports a format error. I have tried multiple ways of formatting the date, but no luck so far. Code below...
If I format the date as 2016-10-10, it does show in the Forms Library view, but I still cannot open the form. It just shows a datatype error.
// Get the data from the CSV file.
string[,] values = LoadCsv("ImportTest.csv");
// Calculate how many columns and rows are in the dataset
int countCols = values.GetUpperBound(1) + 1;
int countRows = values.GetUpperBound(0) + 1;
string rFormSite = "siteurl";
// opens the site
SPWeb webSite = new SPSite(rFormSite).OpenWeb();
// gets the blank file to copy
SPFile BLANK = webSite.Folders["EventSubmissions"].Files["Blank.xml"];
// reads the blank file into an xml document
MemoryStream inStream = new MemoryStream(BLANK.OpenBinary());
XmlTextReader reader = new XmlTextReader(inStream);
XmlDocument xdBlank = new XmlDocument();
xdBlank.Load(reader);
reader.Close();
inStream.Close();
// Get the latest ID from the list
int itemID = GetNextID(webSite, "EventSubmissions");
if (itemID == -1) return;
// Iterate over each row of the dataset
for (int row = 1; row < countRows; row++)
{
    // display the current event name
    Console.WriteLine("Event name - " + values[row, 4]);
    XmlDocument xd = xdBlank;
    XmlElement root = xd.DocumentElement;
    // Cycle through all columns of the document
    for (int col = 0; col < countCols; col++)
    {
        string field = values[0, col];
        string value = values[row, col];
        switch (field)
        {
            case "startDate":
                value = // How do I format the date here? ;
                break;
            case "endDate":
                value = "";
                break;
            case "AutoFormID":
                value = itemID.ToString();
                break;
        }
        XmlNodeList nodes = xd.GetElementsByTagName("my:" + field);
        foreach (XmlNode node in nodes)
        {
            node.InnerText = value;
        }
    }
    // saves the XML document back as a file
    System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
    SPFile newFile = webSite.Folders["EventSubmissions"].Files.Add(itemID.ToString() + ".xml", (encoding.GetBytes(xd.OuterXml)), true);
    itemID++;
}
Console.WriteLine("Complete");
Console.ReadLine();
Thanks

This worked for me:
DateTime.Now.ToString("yyyy-MM-dd")

How to make 'ő' and 'ű' work in Java?

String word = inputField.getText();
int wordLength = word.length();
boolean backWord = false;
boolean longWord = false;
String backArray[] = new String[6];
backArray[0] = "a";
backArray[1] = "á";
backArray[2] = "ö";
backArray[3] = "ő";
backArray[4] = "ü";
backArray[5] = "ű";
for (int i = 0; i < wordLength; i++) {
    String character = word.substring(i, i + 1);
    for (int j = 0; j < 5; j++) {
        if (character.equals(backArray[j])) {
            backWord = true;
        }
    }
}
if (backWord) {
    outputField.setText(word + "ban");
}
else {
    outputField.setText(word + "ben");
}
This is the code I wrote for an applet that conjugates Hungarian nouns while taking vowel harmony into consideration. For the unaware, the TL;DR of vowel harmony is that Hungarian has lots of suffixes, and you can determine which suffix to use based on the vowels in a word.
This code works fine for all the vowels except ő and ű. So if my input is 'szálloda', the output will be 'szállodaban'. However, if my input is 'idő' (weather), the output will be 'időben', though it should be 'időban' according to the code.
I assumed this is because Java somehow doesn't recognize these two letters, since the code works fine for the other ones. Is that the problem? And if so, how do I solve it?
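Two things may be at play here, and the sketch below guards against both. First, the inner loop stops at j < 5, so the last array entry (ű) is never compared. Second, ő (U+0151) and ű (U+0171) are not part of ISO-8859-1 or Windows-1252, unlike á, ö and ü, so if the source file is saved or compiled in one of those encodings the two literals may not survive. This is a minimal sketch, not a confirmed diagnosis: the class and method names are made up for illustration, and the letter list simply mirrors the question's backArray.

```java
final class VowelHarmony {
    // The same six letters as the question's backArray, written as Unicode
    // escapes so they do not depend on the encoding of the source file:
    // a, á (U+00E1), ö (U+00F6), ő (U+0151), ü (U+00FC), ű (U+0171).
    private static final String LISTED_VOWELS = "a\u00e1\u00f6\u0151\u00fc\u0171";

    // Returns true if the word contains any of the listed vowels.
    // Every listed letter is checked, including the last one.
    static boolean hasListedVowel(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (LISTED_VOWELS.indexOf(word.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Appends "ban" when a listed vowel is present, otherwise "ben",
    // following the rule used in the question's code.
    static String addSuffix(String word) {
        return word + (hasListedVowel(word) ? "ban" : "ben");
    }

    public static void main(String[] args) {
        System.out.println(addSuffix("id\u0151"));
        System.out.println(addSuffix("kert"));
    }
}
```

If the literals are kept as plain characters instead, compiling with an explicit source encoding (for example javac -encoding UTF-8) is the other common way to rule out the encoding problem.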

Export MS Word Document pages to Images

I want to export the pages of an MS Word (docx/doc) document to images (jpeg/png).
I am doing the same for presentations (pptx/ppt) using the Office interop export API for each slide, but I didn't find a corresponding API for Word.
I need a suggestion for an API or an alternative approach to achieve this.

Based on this similar question: "Saving a word document as an image" you could do something like this:
const string basePath = @"C:\Users\SomeUser\SomePath\";
var docPath = Path.Combine(basePath, "documentA.docx");
var app = new Application()
{
    Visible = true
};
var doc = app.Documents.Open(docPath);
foreach (Window window in doc.Windows)
{
    foreach (Pane pane in window.Panes)
    {
        for (var i = 1; i <= pane.Pages.Count; i++)
        {
            var page = pane.Pages[i];
            // Enhanced metafile bytes describing how the page is rendered
            var bits = page.EnhMetaFileBits;
            var target = Path.Combine(basePath, string.Format("page-no-{0}", i));
            using (var ms = new MemoryStream(bits))
            {
                var image = Image.FromStream(ms);
                var pngTarget = Path.ChangeExtension(target, "png");
                image.Save(pngTarget, ImageFormat.Png);
            }
        }
    }
}
app.Quit();
Basically, I'm using the Page.EnhMetaFileBits property which, according to the documentation:
Returns a Object that represents a picture representation of how a page of text appears.
... and based on that, I create an image and save it to the disk.

' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1

I am trying to read an XML file from the web and parse it using XDocument. It normally works fine, but sometimes it gives me this error:
' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1
I have tried some solutions from Google, but they aren't working for VS 2010 Express for Windows Phone 7.
There is a solution that replaces the 0x1F character with string.Empty, but my code returns a stream, which doesn't have a Replace method:
s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
Here is my code:
void webClient_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
    using (var reader = new StreamReader(e.Result))
    {
        int[] counter = { 1 };
        string s = reader.ReadToEnd();
        Stream str = e.Result;
        // s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
        // byte[] str = Convert.FromBase64String(s);
        // Stream memStream = new MemoryStream(str);
        str.Position = 0;
        XDocument xdoc = XDocument.Load(str);
        var data = from query in xdoc.Descendants("user")
                   select new mobion
                   {
                       index = counter[0]++,
                       avlink = (string)query.Element("user_info").Element("avlink"),
                       nickname = (string)query.Element("user_info").Element("nickname"),
                       track = (string)query.Element("track"),
                       artist = (string)query.Element("artist"),
                   };
        listBox.ItemsSource = data;
    }
}
XML file:
http://music.mobion.vn/api/v1/music/userstop?devid=

0x1F is an ASCII control character (the unit separator). It is not valid in XML 1.0. Your best bet is to replace it.
Instead of using reader.ReadToEnd() (which, by the way, can use up a lot of memory for a large file, though you can definitely use it), why not try something like:
var sb = new StringBuilder();
string input;
while ((input = reader.ReadLine()) != null)
{
    // Swap the illegal character for a space, line by line.
    sb.AppendLine(input.Replace((char)0x1F, ' '));
}
string cleaned = sb.ToString();
You can convert the result back into a stream if you'd like, and then use it as you please:
// UTF-8 is an assumption here; use the encoding the XML declares.
byte[] byteArray = Encoding.UTF8.GetBytes(cleaned);
MemoryStream stream = new MemoryStream(byteArray);
Or else you could keep calling ReadToEnd(), then clean that string of illegal characters and convert it back to a stream.
Here's a good resource for cleaning illegal characters in your XML; chances are you'll have others as well:
https://seattlesoftware.wordpress.com/tag/hexadecimal-value-0x-is-an-invalid-character/
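The general form of that cleanup is to drop every character outside the ranges XML 1.0 allows, not just 0x1F. The sketch below shows the idea in Java rather than C#, purely as an illustration of the filter. The class and method names are made up, and UTF-8 is assumed when turning the cleaned text back into a stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class XmlCharFilter {
    // Character ranges permitted by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text with every disallowed code point removed.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
                .filter(XmlCharFilter::isValidXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    // Wraps the cleaned text in a stream again (UTF-8 is an assumption).
    static InputStream toUtf8Stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String dirty = "<a>x\u001Fy</a>";
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

Removing the character rather than replacing it with a space is a design choice; either keeps the document parseable.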

What could be happening is that the content is compressed, in which case you need to decompress it.
With HttpClient you can do this the following way:
var client = new HttpClient(new HttpClientHandler
{
    AutomaticDecompression = DecompressionMethods.GZip
                             | DecompressionMethods.Deflate
});
With the "old" WebClient you have to derive your own class to achieve a similar effect:
class MyWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        return request;
    }
}
Above taken from here
To use the two you would do something like this:
HttpClient
using (var client = new HttpClient(new HttpClientHandler { AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate }))
{
    using (var stream = client.GetStreamAsync(url))
    {
        using (var sr = new StreamReader(stream.Result))
        {
            using (var reader = XmlReader.Create(sr))
            {
                var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
                foreach (var item in feed.Items)
                {
                    Console.WriteLine(item.Title.Text);
                }
            }
        }
    }
}
WebClient
using (var stream = new MyWebClient().OpenRead("http://myrss.url"))
{
    using (var sr = new StreamReader(stream))
    {
        using (var reader = XmlReader.Create(sr))
        {
            var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
            foreach (var item in feed.Items)
            {
                Console.WriteLine(item.Title.Text);
            }
        }
    }
}
This way you also receive the benefit of not having to call .ReadToEnd(), since you are working with the stream instead.
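The compression theory fits the error message: gzip data always begins with the two bytes 0x1F 0x8B, so an XML parser fed a compressed response would complain about 0x1F at line 1, position 1. A quick way to check is to peek at the first two bytes before parsing. The sketch below shows that check in Java rather than C#, as an illustration only. The class and method names are made up, and it assumes gzip is the only compression you need to detect.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

final class GzipSniffer {
    // Peeks at the first two bytes; if they are the gzip magic number
    // (0x1F 0x8B) the stream is wrapped in a decompressor, otherwise it
    // is returned unchanged (apart from buffering).
    static InputStream maybeDecompress(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);
        int first = in.read();
        int second = in.read();
        in.reset(); // rewind so the caller sees the stream from the start
        boolean gzip = first == 0x1F && second == 0x8B;
        return gzip ? new GZIPInputStream(in) : in;
    }

    // Helper for the demo: reads a whole stream as UTF-8 text.
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

If the check comes back positive, stripping 0x1F from the text would only hide the real problem, since the rest of the payload is still compressed.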

Consider using System.Web.HttpUtility.HtmlDecode if you're decoding content read from the web.

If you are having trouble replacing the character: for me, there were problems when replacing with a string instead of a char. I suggest testing some values with both forms to see what they turn up. How you reference the character also has some effect.
var a = x.IndexOf('\u001f'); // 513
var b = x.IndexOf(Convert.ToString((byte)0x1F)); // -1: Convert.ToString((byte)0x1F) is the string "31", not the control character
x = x.Replace(Convert.ToChar((byte)0x1F), ' '); // Works
x = x.Replace(Convert.ToString((byte)0x1F), " "); // Fails for the same reason
I blagged this

I had the same issue and found that the problem was a  embedded in the xml.
The solution was:
s = s.Replace("", " ")

I'd guess it's probably an encoding issue but without seeing the XML I can't say for sure.
In terms of your plan to simply replace the character but not being able to, because you have a stream rather than a text, simply read the stream into a string and then remove the characters you don't want.

This works for me:
string.Replace(Chr(31), "")

I used XmlSerializer to parse XML and faced the same exception.
The problem is that the XML string contains numeric character references (HTML codes) for invalid characters.
This method removes all such invalid references from a string (based on this thread - https://forums.asp.net/t/1483793.aspx?Need+a+method+that+removes+illegal+XML+characters+from+a+String):
public static string RemoveInvalidXmlSubstrs(string xmlStr)
{
    // Match decimal (&#31;) and hexadecimal (&#x1F;) character references.
    // Hex digits only, so one match cannot run across several references.
    string pattern = "&#((\\d+)|(x[0-9a-f]+));";
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(xmlStr))
    {
        xmlStr = regex.Replace(xmlStr, new MatchEvaluator(m =>
        {
            string s = m.Value;
            // Strip the leading "&#" and the trailing ";".
            string unicodeNumStr = s.Substring(2, s.Length - 3);
            int unicodeNum = unicodeNumStr.StartsWith("x", StringComparison.OrdinalIgnoreCase) ?
                Convert.ToInt32(unicodeNumStr.Substring(1), 16)
                : Convert.ToInt32(unicodeNumStr);
            // according to https://www.w3.org/TR/xml/#charsets
            if ((unicodeNum == 0x9 || unicodeNum == 0xA || unicodeNum == 0xD) ||
                ((unicodeNum >= 0x20) && (unicodeNum <= 0xD7FF)) ||
                ((unicodeNum >= 0xE000) && (unicodeNum <= 0xFFFD)) ||
                ((unicodeNum >= 0x10000) && (unicodeNum <= 0x10FFFF)))
            {
                return s;
            }
            else
            {
                return String.Empty;
            }
        })
        );
    }
    return xmlStr;
}

Nobody can answer if you don't show the relevant information, meaning the XML content.
As general advice, I would put a breakpoint after the ReadToEnd() call. Then you can do a couple of things:
Post the XML content to this forum.
Test it using the VS XML visualizer.
Copy and paste the string into a text file and investigate it offline.
