Why can't Tessnet2 extract text?

I am using "tessnet2_64.dll".
This is my code to extract the text:
try
{
    var image = new Bitmap(@"D:\Tessnet2\C#\test2.jpg");
    var ocr = new Tesseract();
    // ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // Use for digits only
    // @"C:\OCRTest\tessdata" contains the language package; without it the method crashes and the app breaks
    ocr.Init(@"D:\Tessnet2\C#\tessdata", "eng", true);
    var result = ocr.DoOCR(image, Rectangle.Empty);
    foreach (Word word in result)
        Console.WriteLine("{0} : {1}", word.Confidence, word.Text);
    Console.ReadLine();
}
catch (Exception exception)
{
    // Report the failure instead of silently swallowing it
    Console.WriteLine(exception);
}
Result Output:
146: I-18110
47: 88
How can I extract the text "Hello"?
Thanks, all.

I resolved this by changing
ocr.Init(@"D:\Tessnet2\C#\tessdata", "eng", true);
to
ocr.Init(@"D:\Tessnet2\C#\tessdata", "eng", false);

Related

MarkLogic Java Client JSONDocumentManager word search returns incorrect total number of results

jsonDocumentManager.setPageLength(pagination.getItemsPerPage());
var offset = ((pagination.getPage() - 1) * pagination.getItemsPerPage()) + 1;
try (var documentPage = jsonDocumentManager.search(queryDefinition, offset)) {
var results = new ArrayList<TransportOrder>();
for (var documentRecord : documentPage) {
try {
var jsonParser = documentRecord.getContent(new JacksonParserHandle()).get();
...
} catch (IOException e) {
System.out.println("Something went wrong");
}
}
var totalNumberOfResults = documentPage.getTotalSize();
System.out.println("totalNumberOfResults : " + totalNumberOfResults);
}
}
where the queryDefinition contains multiple word queries:
qb.word(qb.jsonProperty(searchField), StructuredQueryBuilder.FragmentScope.DOCUMENTS,
new String[] {"wildcarded"}, 1.0, wildcardString);
The problem is that totalNumberOfResults is the total number of items in the database, not just the items that match the search criteria. The documentation for getTotalSize says: "The total count (most likely an estimate) of all possible items in the set."
I need to get the actual number of items found by the search. Any idea how I can achieve this?
https://docs.marklogic.com/javadoc/client/com/marklogic/client/Page.html#getTotalSize--

Counting the number of characters in a text using FileReader

I am new to this superb place, and I have gotten help from this site several times. I have seen many answers to similar questions that were discussed previously, but I am having trouble counting the number of characters using FileReader. It works using Scanner. This is what I tried:
class CountCharacter
{
public static void main(String args[]) throws IOException
{
File f = new File("hello.txt");
int charCount=0;
String c;
//int lineCount=0;
if(!f.exists())
{
f.createNewFile();
}
BufferedReader br = new BufferedReader(new FileReader(f));
while ( (c=br.readLine()) != null) {
String s = br.readLine();
charCount = s.length()-1;
charCount++;
}
System.out.println("NO OF LINE IN THE FILE, NAMED " +f.getName()+ " IS " +charCount);
}
}

It looks to me like each time you go through the loop, you assign charCount the length of the line that iteration is concerned with, instead of accumulating it. That is, instead of
charCount = s.length() - 1;
try
charCount = charCount + s.length();
EDIT:
Say you have a document with the contents "onlyOneLine".
When you first hit the while check, br.readLine() makes the BufferedReader read the first line. Inside the while block, however, br.readLine() is called again, which advances the BufferedReader to the second line of the document and returns null. Because null is assigned to s and you then call length() on it, a NullPointerException is thrown.
Try this for the while block:
while ((c = br.readLine()) != null) {
    charCount = charCount + c.length();
}
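
Putting the advice above together, here is a minimal, self-contained sketch of the corrected loop. The class and method names (CharCounter, countChars) are made up for illustration, and the method takes any Reader so that it can be tried without a file on disk. Note that readLine() strips line terminators, so they are not counted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class, not part of the original question.
class CharCounter {
    // Sums the length of every line; line terminators are not counted
    // because readLine() removes them.
    static int countChars(Reader source) {
        int charCount = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            // Call readLine() exactly once per iteration.
            while ((line = br.readLine()) != null) {
                charCount += line.length();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return charCount;
    }

    public static void main(String[] args) {
        // The single-line document from the explanation above.
        System.out.println(countChars(new StringReader("onlyOneLine"))); // prints 11
    }
}
```

To count the characters of a real file, pass new FileReader("hello.txt") instead of the StringReader and handle the FileNotFoundException that its constructor can throw.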

C# OpenXML: word replacement and page break

I am a new member, and I really like this site because it always helps me.
My problem is this:
I want to replace text in a Word document using OpenXML and add a page break,
and then I want to write the replaced text on the second page.
Here is my code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(@"d:\a.docx", true))
{
using (StreamReader reader = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
text = reader.ReadToEnd();
}
Regex regexText = new Regex("#db#");
text = regexText.Replace(text, textBox4.Text.Trim());
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(text);
}
MainDocumentPart mainPart = wordDoc.MainDocumentPart;
Run r = new Run();
Paragraph para = new Paragraph(new Run(new Break() { Type = BreakValues.Page }));
using (StreamWriter sw1 = new StreamWriter(mainPart.GetStream(FileMode.Create)))
{
sw1.Write(text);
}
mainPart.Document.Body.InsertAfter(para, mainPart.Document.Body.LastChild);
mainPart.Document.Save();
}
}

I suggest inserting a page break into a.docx in advance. Then use a MergeField to mark the location of the text you want to replace.
Here is the example

Bolding with Rich Text Values in iTextSharp

Is it possible to bold a single word within a sentence with iTextSharp? I'm working with large paragraphs of text coming from xml, and I am trying to bold several individual words without having to break the string into individual phrases.
Eg:
document.Add(new Paragraph("this is <b>bold</b> text"));
should output...
this is bold text

As @kuujinbo pointed out, there is the XMLWorker object, which is where most of the new HTML parsing work is being done. But if you've just got simple commands like bold or italic, you can use the native iTextSharp.text.html.simpleparser.HTMLWorker class. You could wrap it in a helper method such as:
private Paragraph CreateSimpleHtmlParagraph(String text) {
    //Our return object
    Paragraph p = new Paragraph();
    //ParseToList requires a TextReader instead of just text
    using (StringReader sr = new StringReader(text)) {
        //Parse and get a collection of elements
        List<IElement> elements = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, null);
        foreach (IElement e in elements) {
            //Add those elements to the paragraph
            p.Add(e);
        }
    }
    //Return the paragraph
    return p;
}
Then instead of this:
document.Add(new Paragraph("this is <b>bold</b> text"));
You could use this:
document.Add(CreateSimpleHtmlParagraph("this is <b>bold</b> text"));
document.Add(CreateSimpleHtmlParagraph("this is <i>italic</i> text"));
document.Add(CreateSimpleHtmlParagraph("this is <b><i>bold and italic</i></b> text"));

I know that this is an old question, but I could not get the other examples here to work for me. Adding the text in Chunks with different fonts did work.
//define a bold font to be used
Font boldFont = FontFactory.GetFont(FontFactory.HELVETICA_BOLD, 12);
//add a phrase and add Chunks to it
var phrase2 = new Phrase();
phrase2.Add(new Chunk("this is "));
phrase2.Add(new Chunk("bold", boldFont));
phrase2.Add(new Chunk(" text"));
document.Add(phrase2);

Not sure how complex your XML is, but try XMLWorker. Here's a working example with an ASP.NET HTTP handler:
<%@ WebHandler Language="C#" Class="boldText" %>
using System;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.xml;
using iTextSharp.tool.xml;
public class boldText : IHttpHandler {
public void ProcessRequest (HttpContext context) {
HttpResponse Response = context.Response;
Response.ContentType = "application/pdf";
StringReader xmlSnippet = new StringReader(
"<p>This is <b>bold</b> text</p>"
);
using (Document document = new Document()) {
PdfWriter writer = PdfWriter.GetInstance(
document, Response.OutputStream
);
document.Open();
XMLWorkerHelper.GetInstance().ParseXHtml(
writer, document, xmlSnippet
);
}
}
public bool IsReusable { get { return false; } }
}
You may have to pre-process your XML before sending it to XMLWorker (notice that the snippet is a bit different from yours). Support for parsing HTML/XML was released relatively recently, so your mileage may vary.

Here is another XMLWorker example that uses a different overload of ParseXHtml and returns a Phrase instead of writing directly to the document.
private static Phrase CreateSimpleHtmlParagraph(String text)
{
var p = new Phrase();
var mh = new MyElementHandler();
using (TextReader sr = new StringReader("<html><body><p>" + text + "</p></body></html>"))
{
XMLWorkerHelper.GetInstance().ParseXHtml(mh, sr);
}
foreach (var element in mh.elements)
{
foreach (var chunk in element.Chunks)
{
p.Add(chunk);
}
}
return p;
}
private class MyElementHandler : IElementHandler
{
public List<IElement> elements = new List<IElement>();
public void Add(IWritable w)
{
if (w is iTextSharp.tool.xml.pipeline.WritableElement)
{
elements.AddRange(((iTextSharp.tool.xml.pipeline.WritableElement)w).Elements());
}
}
}

' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1

I am trying to read an XML file from the web and parse it using XDocument. It normally works fine, but sometimes it gives me this error for a day:
**' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1**
I have tried some solutions from Google, but they don't work in VS 2010 Express for Windows Phone 7.
One solution replaces the 0x1F character with string.Empty, but my code returns a stream, which doesn't have a Replace method.
s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
Here is my code:
void webClient_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
using (var reader = new StreamReader(e.Result))
{
int[] counter = { 1 };
string s = reader.ReadToEnd();
Stream str = e.Result;
// s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
// byte[] str = Convert.FromBase64String(s);
// Stream memStream = new MemoryStream(str);
str.Position = 0;
XDocument xdoc = XDocument.Load(str);
var data = from query in xdoc.Descendants("user")
select new mobion
{
index = counter[0]++,
avlink = (string)query.Element("user_info").Element("avlink"),
nickname = (string)query.Element("user_info").Element("nickname"),
track = (string)query.Element("track"),
artist = (string)query.Element("artist"),
};
listBox.ItemsSource = data;
}
}
XML file:
http://music.mobion.vn/api/v1/music/userstop?devid=

0x1F is an ASCII control character (the unit separator). It is not valid in XML 1.0, so your best bet is to replace it.
Instead of using reader.ReadToEnd() (which, for a large file, can use up a lot of memory, though you can certainly use it), why not try something like:
var cleaned = new StringBuilder();
string input;
// reader is the StreamReader wrapping e.Result
while ((input = reader.ReadLine()) != null)
{
    cleaned.AppendLine(input.Replace((char)0x1F, ' '));
}
You can convert the result back into a stream if you'd like, and then use it as you please:
byte[] byteArray = Encoding.UTF8.GetBytes(cleaned.ToString());
MemoryStream stream = new MemoryStream(byteArray);
Alternatively, you could keep calling ReadToEnd(), clean the resulting string of illegal characters, and convert it back to a stream.
Here's a good resource for cleaning illegal characters in your XML (chances are you'll have others as well):
https://seattlesoftware.wordpress.com/tag/hexadecimal-value-0x-is-an-invalid-character/
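
Since other illegal characters may turn up as well, it can be simpler to keep only the characters that XML 1.0 allows (tab, line feed, carriage return, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF) rather than replacing 0x1F alone. The thread's code is C#; the following is a minimal sketch of the same idea in Java, offered as an illustration only. The names XmlCharFilter, isValidXmlChar and stripInvalidXmlChars are made up for this example.

```java
// Hypothetical helper, not taken from the thread: drops every code point
// that the XML 1.0 "Char" production does not allow.
class XmlCharFilter {
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        // Iterate by code point so that characters outside the BMP survive intact.
        text.codePoints()
                .filter(XmlCharFilter::isValidXmlChar)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidXmlChars("a\u001Fb")); // prints ab
    }
}
```

This removes the offending characters instead of replacing them with a space; whether removal or replacement is appropriate depends on the data.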

What could be happening is that the content is compressed, in which case you need to decompress it.
With HttpClient you can do this the following way:
var client = new HttpClient(new HttpClientHandler
{
AutomaticDecompression = DecompressionMethods.GZip
| DecompressionMethods.Deflate
});
With the "old" WebClient you have to derive your own class to achieve the similar effect:
class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
return request;
}
}
Above taken from here
To use the two you would do something like this:
HttpClient
using (var client = new HttpClient(new HttpClientHandler { AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate }))
{
using (var stream = client.GetStreamAsync(url))
{
using (var sr = new StreamReader(stream.Result))
{
using (var reader = XmlReader.Create(sr))
{
var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
Console.WriteLine(item.Title.Text);
}
}
}
}
}
WebClient
using (var stream = new MyWebClient().OpenRead("http://myrss.url"))
{
using (var sr = new StreamReader(stream))
{
using (var reader = XmlReader.Create(sr))
{
var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
Console.WriteLine(item.Title.Text);
}
}
}
}
This way you also receive the benefit of not having to call .ReadToEnd(), since you are working with the stream instead.
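
One hint that compression really is the cause: a gzip stream always starts with the two bytes 0x1F 0x8B, so an unexpected 0x1F at line 1, position 1 is consistent with a gzip-compressed body being parsed as plain text. The following is a minimal Java sketch (not the thread's C#) of checking for that signature and decompressing only when it is present. The names GzipSniffer, looksGzipped, decodeBody and gzip are made up for illustration, and UTF-8 is assumed as the text encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not taken from the thread.
class GzipSniffer {
    // True when the body starts with the gzip magic bytes 0x1F 0x8B.
    static boolean looksGzipped(byte[] body) {
        return body.length >= 2
                && (body[0] & 0xFF) == 0x1F
                && (body[1] & 0xFF) == 0x8B;
    }

    // Decompresses the body if it is gzipped, then decodes it as UTF-8 (an assumption).
    static String decodeBody(byte[] body) {
        try (InputStream in = looksGzipped(body)
                ? new GZIPInputStream(new ByteArrayInputStream(body))
                : new ByteArrayInputStream(body)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper that produces a gzip-compressed body.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<user><track>demo</track></user>");
        System.out.println(looksGzipped(compressed));
        System.out.println(decodeBody(compressed));
    }
}
```

If the check succeeds on the raw response, enabling automatic decompression as shown above is the cleaner fix; replacing the 0x1F character would only hide the problem.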

Consider using System.Web.HttpUtility.HtmlDecode if you're decoding content read from the web.

If you are having issues replacing the character:
For me there were some issues when replacing with a string instead of a char. I suggest testing with both to see what they return. How you reference the character also matters.
var a = x.IndexOf('\u001f'); // 513
var b = x.IndexOf(Convert.ToString((byte)0x1F)); // -1, because Convert.ToString(byte) yields the text "31"
x = x.Replace(Convert.ToChar((byte)0x1F), ' '); // Works
x = x.Replace(Convert.ToString((byte)0x1F), " "); // Fails for the same reason: it searches for "31"
I blagged this

I had the same issue and found that the problem was a &#x1F; embedded in the XML.
The solution was:
s = s.Replace("&#x1F;", " ")

I'd guess it's probably an encoding issue but without seeing the XML I can't say for sure.
In terms of your plan to simply replace the character but not being able to, because you have a stream rather than a text, simply read the stream into a string and then remove the characters you don't want.

Works for me.........
string.Replace(Chr(31), "")

I used XmlSerializer to parse XML and faced the same exception.
The problem is that the XML string contains HTML codes of invalid characters
This method removes all invalid HTML codes from string (based on this thread - https://forums.asp.net/t/1483793.aspx?Need+a+method+that+removes+illegal+XML+characters+from+a+String):
public static string RemoveInvalidXmlSubstrs(string xmlStr)
{
string pattern = "&#((\\d+)|(x\\S+));";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(xmlStr))
{
xmlStr = regex.Replace(xmlStr, new MatchEvaluator(m =>
{
string s = m.Value;
string unicodeNumStr = s.Substring(2, s.Length - 3);
int unicodeNum = unicodeNumStr.StartsWith("x") ?
Convert.ToInt32(unicodeNumStr.Substring(1), 16)
: Convert.ToInt32(unicodeNumStr);
//according to https://www.w3.org/TR/xml/#charsets
if ((unicodeNum == 0x9 || unicodeNum == 0xA || unicodeNum == 0xD) ||
((unicodeNum >= 0x20) && (unicodeNum <= 0xD7FF)) ||
((unicodeNum >= 0xE000) && (unicodeNum <= 0xFFFD)) ||
((unicodeNum >= 0x10000) && (unicodeNum <= 0x10FFFF)))
{
return s;
}
else
{
return String.Empty;
}
})
);
}
return xmlStr;
}

Nobody can answer if you don't show relevant info - I mean the Xml content.
As a general advice I would put a breakpoint after ReadToEnd() call. Now you can do a couple of things:
Reveal Xml content to this forum.
Test it using VS Xml visualizer.
Copy-paste the string into a txt file and investigate it offline.
