itextsharp extract text pdf not working - itext

I'm having trouble extracting the text from a page.
I get an "Object reference not set to an instance of an object" error on this line:
String extractText = PdfTextExtractor.GetTextFromPage(pdfReader, i);
The full code is below:
var pdfText = new StringBuilder();
using (var pdfReader = new PdfReader(cbPdf.SelectedValue + ""))
{
for (var i = 0; i <= pdfReader.NumberOfPages; i++)
{
String extractText = PdfTextExtractor.GetTextFromPage(pdfReader, i);
extractText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
pdfText.Append(extractText);
}
}
rtxtTexto.Text = pdfText.ToString();

iText numbers pages starting at 1, i.e. the first page has number 1.
You already took that into account at the end of your loop (by comparing with <=), but not at the start (where you begin at 0).
Thus, use
for (var i = 1; i <= pdfReader.NumberOfPages; i++)
That being said, as far as I know your line
extractText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
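is nonsense: extractText is already a .NET string, so the round trip through Encoding.Default and UTF-8 adds nothing and can only lose characters that the default code page cannot represent.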
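To illustrate why the conversion line does no useful work, here is a minimal sketch. It assumes ISO-8859-1 as a stand-in for Encoding.Default (which depends on the machine and runtime); the class and method names are hypothetical and not part of iTextSharp.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Same shape as the line in the question, with the legacy encoding passed in
    // explicitly instead of relying on Encoding.Default.
    public static string RoundTrip(string text, Encoding legacy)
    {
        byte[] legacyBytes = legacy.GetBytes(text);                              // string -> legacy bytes (may lose characters)
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes); // legacy bytes -> UTF-8 bytes
        return Encoding.UTF8.GetString(utf8Bytes);                               // UTF-8 bytes -> string again
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 plays the role of the machine's default code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Text the code page can represent comes back unchanged (a no-op).
        Console.WriteLine(RoundTrip("caf\u00E9", latin1) == "caf\u00E9");

        // Text it cannot represent (Greek capital omega) comes back altered.
        Console.WriteLine(RoundTrip("\u03A9", latin1) == "\u03A9");
    }
}
```

In both cases the string returned by GetTextFromPage could simply have been appended as it is.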

Related

Programmatically create Infopath form in SharePoint form library with dates

I am programmatically creating InfoPath forms in a form library within SharePoint 2010 from data in a CSV file. It all works fine apart from the date fields. The form will refuse to open with a format error. I have tried multiple ways of formatting the date but no luck so far. Code below...
If I format 2016-10-10 then it does show in the Forms Library view but I still can not open the form. It just shows a datatype error.
// Get the data from CSV file.
string[,] values = LoadCsv("ImportTest.csv");
//Calculate how many columns and rows are in the dataset
int countCols = values.GetUpperBound(1) + 1;
int countRows = values.GetUpperBound(0) + 1;
string rFormSite = "siteurl";
// opens the site
SPWeb webSite = new SPSite(rFormSite).OpenWeb();
// gets the blank file to copy
SPFile BLANK = webSite.Folders["EventSubmissions"].Files["Blank.xml"];
// reads the blank file into an xml document
MemoryStream inStream = new MemoryStream(BLANK.OpenBinary());
XmlTextReader reader = new XmlTextReader(inStream);
XmlDocument xdBlank = new XmlDocument();
xdBlank.Load(reader);
reader.Close();
inStream.Close();
//Get latest ID from the list
int itemID = GetNextID(webSite, "EventSubmissions");
if (itemID == -1) return;
//Iterate each row of the dataset
for (int row = 1; row < countRows; row++)
{
//display current event name
Console.WriteLine("Event name - " + values[row, 4]);
XmlDocument xd = xdBlank;
XmlElement root = xd.DocumentElement;
//Cycling through all columns of the document//
for (int col = 0; col < countCols; col++)
{
string field = values[0, col];
string value = values[row, col];
switch (field)
{
case "startDate":
value = //How do format the date here ;
break;
case "endDate":
value = "";
break;
case "AutoFormID":
value = itemID.ToString();
break;
}
XmlNodeList nodes = xd.GetElementsByTagName("my:" + field);
foreach (XmlNode node in nodes)
{
node.InnerText = value;
}
}
// saves the XML Document back as a file
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
SPFile newFile = webSite.Folders["EventSubmissions"].Files.Add(itemID.ToString() + ".xml", (encoding.GetBytes(xd.OuterXml)), true);
itemID++;
}
Console.WriteLine("Complete");
Console.ReadLine();
Thanks
This worked for me:
DateTime.Now.ToString("yyyy-MM-dd")
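Applied to a value read from the CSV file, the same format string could be used as in the sketch below. The input format "dd/MM/yyyy" is an assumption (the question does not say how dates appear in the CSV), the helper name is hypothetical, and the sketch assumes the InfoPath field is a date-only field.

```csharp
using System;
using System.Globalization;

public static class InfoPathDates
{
    // Parses a CSV date in a known format and returns it as yyyy-MM-dd.
    // csvFormat is an assumption about the CSV; adjust it to the real data.
    public static string ToXmlDate(string csvValue, string csvFormat)
    {
        DateTime parsed = DateTime.ParseExact(csvValue, csvFormat, CultureInfo.InvariantCulture);
        return parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(ToXmlDate("25/12/2016", "dd/MM/yyyy")); // prints 2016-12-25
    }
}
```

Using CultureInfo.InvariantCulture for both parsing and formatting keeps the result independent of the server's regional settings.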

Reading PDF document with iTextSharp creates string with repeating first page

I currently use iTextSharp to read in some PDF files and parse them using the string I receive. I have encountered strange behavior with some PDF files. For example, when reading a 4-page PDF, the resulting string contains the pages in the following order:
1 2 1 3 1 4
My code for reading the files is as follows:
using (PdfReader reader = new PdfReader(fileStream))
{
StringBuilder sb = new StringBuilder();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
if (!string.IsNullOrWhiteSpace(text))
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
}
Debug.WriteLine(sb.ToString());
}
Here is a link to a file with which this behaviour occurs:
https://onedrive.live.com/redir?resid=D9FEFF3BF45E05FD!1536&authkey=!AFLRlskAvlg89yY&ithint=file%2cpdf
Hope you guys can help me out!
Thanks to Chris Haas I found out what was going wrong. The samples found online on how to use iTextSharp.Pdf are incorrect, or at least incorrect for my implementation.
The SimpleTextExtractionStrategy needs to be instantiated for every page you read. Otherwise the text of previously read pages is repeated in the resulting string.
Also, the line where the StringBuilder is appended to can be changed from:
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
to
sb.Append(text);
Thus the following code gives the correct result:
using (PdfReader reader = new PdfReader(fileStream))
{
StringBuilder sb = new StringBuilder();
for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
if (!string.IsNullOrWhiteSpace(text))
sb.Append(text);
}
Debug.WriteLine(sb.ToString());
}

Problems reading special characters from word document using office interop in C#

I'm trying to read text from a Word .doc file using Microsoft.Office.Interop.Word.
The text contains temperatures in degrees, e.g. 95°F.
When I get the string from Range.Text, it becomes 95(F.
Here is the C# code:
private Microsoft.Office.Interop.Word.Application _wordApp;
Microsoft.Office.Interop.Word.Document wordDocument
= _wordApp.Documents.Open(fileName, false, true);
for (int tableCounter = 1; tableCounter <= wordDocument.Tables.Count; tableCounter++)
{
var inputTable = wordDocument.Tables[tableCounter];
for (int cellCounter = 1; cellCounter <= inputTable.Range.Cells.Count; cellCounter++)
{
var problemText = inputTable.Range.Cells[cellCounter].Range.Text;
}
}

Replace the text in pdf document using itextSharp

I want to replace a particular piece of text in a PDF document. I am currently using the iTextSharp library to work with PDF documents.
I extracted the bytes from the PDF document, replaced the relevant bytes, and then wrote the document out again with those bytes, but it is not working. In the example below I am trying to replace the string 1234 with 5678.
Any advice on how to do this would be helpful.
PdfReader reader = new PdfReader(opf.FileNames[i]);
byte[] pdfbytes = reader.GetPageContent(1);
PdfString oldstring = new PdfString("1234");
PdfString newstring = new PdfString("5678");
byte[] byte1022 = oldstring.GetOriginalBytes();
byte[] byte1067 = newstring.GetOriginalBytes();
int position = 0;
for (int j = 0; j < pdfbytes.Length; j++)
{
if (pdfbytes[j] == byte1022[0])
{
if (pdfbytes[j+1] == byte1022[1])
{
if (pdfbytes[j+2] == byte1022[2])
{
if (pdfbytes[j+3] == byte1022[3])
{
position = j;
break;
}
}
}
}
}
pdfbytes[position] = byte1067[0];
pdfbytes[position + 1] = byte1067[1];
pdfbytes[position + 2] = byte1067[2];
pdfbytes[position + 3] = byte1067[3];
File.WriteAllBytes(opf.FileNames[i].Replace(".pdf","j.pdf"), pdfbytes);
What makes you think 1234 is part of the page's content stream and not of a form XObject? Your code is never going to work in general if you don't parse all the resources of a page.
Also: I see GetPageContent(), but I don't see you using SetPageContent() anywhere. How are the changes ever going to be stored in the PdfReader object?
Moreover, I don't see you using PdfStamper to write the altered PdfReader contents to a file.
Finally: I'm too shy to quote the words of Leonard Rosenthol, Adobe's PDF Architect, but ask him, and he'll tell you personally that you shouldn't do what you're trying to do. PDF is NOT a format for editing. Read the intro of chapter 6 of the book I wrote on iText: http://www.manning.com/lowagie2/samplechapter6.pdf

How to check whether each element of a string array contains data in C#

I have created a web application with a textbox that can contain multiple lines of data, because I have set its TextMode property to MultiLine.
My problem is that I want to check whether each line contains data, so I am using a count variable that counts how many lines contain data.
string[] data;
int cntindex;
data = txt_invoicenumber.Text.ToString().Split("\n".ToCharArray());
cntindex = data.Length;
for (j = 0; j < cntindex; j++)
{
if (data[j]!="")
{
inv_count++;
}
}
It's not working.
Please help me.
I guess this is because the new line is \r\n, so there is a '\r' left over even on empty lines.
Change the if statement to:
if (data[j].Trim().Length != 0)
First, you don't need to call ToString() on the .Text property, as it is already a string.
Try this:
string[] lines = txt_invoicenumber.Text.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
int lineCount = 0;
foreach(string line in lines)
{
if(!string.IsNullOrEmpty(line))
{
lineCount++;
this.ProcessLine(line);
}
}
var lb = new String[] { "\r\n" };
var lines = txt_invoicenumber.Text.Split(lb, StringSplitOptions.None).Length;
This will count empty lines too. If you don't want to count empty lines, use the StringSplitOptions.RemoveEmptyEntries value.
Don't count 100% on "\r\n" if you have little control over your environment though.
This is the answer I came up with.
String[] lines = TextBox1.Text.Split(new Char[] { '\r', '\n' },
StringSplitOptions.RemoveEmptyEntries);
Int32 validLineCount = lines.Length;
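The answers above can be combined into one small helper. This is only a sketch: the class and method names are hypothetical, and treating whitespace-only lines as empty is an assumption about what "contains data" means here.

```csharp
using System;
using System.Linq;

public static class LineCounter
{
    // Counts lines that contain at least one non-whitespace character.
    // Splitting on both '\r' and '\n' handles "\r\n" as well as "\n" line endings.
    public static int CountNonBlankLines(string text)
    {
        return text
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Count(line => !string.IsNullOrWhiteSpace(line));
    }

    public static void Main()
    {
        // Three invoice numbers, one empty line, and one line with only spaces.
        string sample = "INV-1\r\n\r\n   \r\nINV-2\nINV-3";
        Console.WriteLine(CountNonBlankLines(sample)); // prints 3
    }
}
```

In the web form from the question, the call would look like LineCounter.CountNonBlankLines(txt_invoicenumber.Text).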