Forcing Tesseract to match pattern (four digits in a row) - tesseract

I'm trying to get Tesseract (using the Tess4J wrapper) to match only a specific pattern. The pattern is four digits in a row, which I think would be \d\d\d\d. Here is a VERY small subset of the image I'm feeding tesseract (the floorplans are restricted, so I'm cautious to post much more of it): http://mike724.com/view/a06771
I'm using the following Java code:
File imageFile = new File("/<redacted>/file.pdf");
Tesseract instance = Tesseract.getInstance();
instance.setTessVariable("load_system_dawg", "F");
instance.setTessVariable("load_freq_dawg", "F");
instance.setTessVariable("user_words_suffix", "");
instance.setTessVariable("user_patterns_suffix", "\\d\\d\\d\\d");
try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}
The problem I'm running into is that Tesseract does not seem to honor these configuration options: I still get text/words in the results. I expect to get only the room numbers (e.g. 2950).

You have not configured this correctly.
user_patterns_suffix is meant to indicate the file extension of a text file that contains your patterns, e.g.
user_patterns_suffix pats
would mean you need to put a file in the tesseract tessdata folder
tessdata/eng.pats
... assuming eng was the language you were using.
See more here:
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data
I do recall that user patterns may not be any shorter than 6 fixed chars before a pattern, so you may not be able to accomplish this in any case. Try the correct config first, though.

They look like init-only parameters; as such, they need to be in a configs file (for instance, one named bazaar, placed under the configs folder) whose name is passed into the setConfigs method.
instance.setConfigs(Arrays.asList("bazaar"));
References:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc
https://github.com/tesseract-ocr/tesseract/wiki/ControlParams
http://tess4j.sourceforge.net/docs/docs-1.4/
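Putting the two answers together, the setup can be sketched as writing two small files: a patterns file next to the language data and a config file under configs. This is only a sketch under assumptions: the suffix user-patterns and the class and method names are made up for illustration, and it only creates the files; it does not call Tess4J and cannot tell you whether Tesseract will accept a pattern this short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class TessPatternSetup {
    // Hypothetical suffix chosen for this sketch; any suffix works as long as
    // the config file and the patterns file name agree.
    static final String PATTERN_SUFFIX = "user-patterns";

    /** Writes tessdata/<lang>.<suffix> and tessdata/configs/<configName>. */
    static void writeFiles(Path tessdata, String lang, String configName) {
        try {
            Path configs = tessdata.resolve("configs");
            Files.createDirectories(configs);

            // One pattern per line; \d stands for a single digit.
            List<String> patterns = Arrays.asList("\\d\\d\\d\\d");
            Files.write(tessdata.resolve(lang + "." + PATTERN_SUFFIX),
                    patterns, StandardCharsets.UTF_8);

            // Init-only parameters go into the config file, one "name value" per line.
            List<String> config = Arrays.asList(
                    "load_system_dawg F",
                    "load_freq_dawg F",
                    "user_patterns_suffix " + PATTERN_SUFFIX);
            Files.write(configs.resolve(configName), config, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After calling TessPatternSetup.writeFiles(tessdataDir, "eng", "bazaar"), the config would be selected with instance.setConfigs(Arrays.asList("bazaar")) before doOCR, as in the answer above.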


Migrating from itext2 to itext7

Years ago, I wrote a small app in itext2 to gather reports on a weekly basis and concatenate them into one PDF. The app used com.lowagie.text.pdf.PdfCopy to copy and merge the PDFs. And it worked fine. Performed exactly as expected.
A few weeks ago I looked into migrating the application to iText 7. To that end, I used the copyPagesTo method of com.itextpdf.kernel.pdf.PdfDocument. When run on the same file set, this produces warnings like:
WARN PdfNameTree - Name "section.1" already exists in the name tree; old value will be replaced by the new one.
When I click on the link to "section.1" in the first document of the merged PDF, I am taken to "section.1" of the last document. That is not what I expected and not what happens with the itext2 app. In the PDFs produced by itext2, if I click on the link to "section.1" of the first document in the combined PDF, I am taken to section 1 of the first document.
There is a hint in Javadocs for copyPagesTo saying
If outlines destination names are the same in different documents, all
such outlines will lead to a single location in the resultant
document. In this case iText will log a warning. This can be avoided
by renaming destinations names in the source document.
There is, however, no explanation of how this should be done. I find it odd that this should be necessary in iText 7 when it wasn't in itext2.
Is there a simple way to get around this problem?
I've also tried the Sejda desktop app and it produces correct results, but I would prefer to automate the process through a batch script.
My guess is iText 2 didn't even know it might be a problem.
If iText can't deduplicate destination names, the procedure is roughly:
Follow /Catalog -> /Names -> /Dests in each document to find the destination name tree.
Deduplicate the names, by adding suffixes. Remember that a name with a suffix added might be equal to an existing name in the same or another document. Be careful!
Now you can rewrite the destination name trees. Since you have only used suffixes, you can do this in place - the lexicographic ordering of the names is unaltered so the search tree structure is not broken.
Now, rewrite destination links in each PDF for the new names. For example any dictionary entry with key /Dest, or any /D in a /GoTo action.
Now, after all this preprocessing, the files will merge without name clashes.
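The deduplication step can be sketched independently of any PDF library. The sketch below assumes the destination names of each document have already been read out of its name tree into a list; the class name, the method name, and the "-" separator are all hypothetical. It returns one old-to-new map per document, which the later steps would use to rewrite the name trees and the /Dest and /D entries.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DestinationRenamer {
    /**
     * For each document (given as the list of its destination names), returns
     * a map old name -> new name such that no name is shared between documents.
     */
    static List<Map<String, String>> planRenames(List<List<String>> docs) {
        // Reserve every original name up front, so a suffixed name can never
        // collide with a name that already exists in any document.
        Set<String> taken = new HashSet<>();
        for (List<String> names : docs) {
            taken.addAll(names);
        }
        // Names that have already been assigned to some document.
        Set<String> claimed = new HashSet<>();
        List<Map<String, String>> plan = new ArrayList<>();
        for (List<String> names : docs) {
            Map<String, String> renames = new LinkedHashMap<>();
            for (String name : names) {
                String candidate = name;
                int n = 1;
                // Keep the original name if it is still free; otherwise try
                // suffixes until one is neither claimed nor an existing name.
                while (claimed.contains(candidate)
                        || (!candidate.equals(name) && taken.contains(candidate))) {
                    candidate = name + "-" + n++;
                }
                claimed.add(candidate);
                taken.add(candidate);
                renames.put(name, candidate);
            }
            plan.add(renames);
        }
        return plan;
    }
}
```

The first document to use a name keeps it, so links in that document need no rewriting for that name; only later documents get suffixed names.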
(I know all this because I've just implemented it for my own PDF software. It's slightly hairy stuff, but not intractable.)
If you would like to test it, I can provide a devel version of cpdf with this functionality.

How to get the extension file of Input Stream

I have this code:
var test = Base64.getDecoder.decode(base64);
var input_stream = new ByteArrayInputStream(test);
Logger.debug(test.getClass.getSimpleName)
How do I get the file extension of the variable input_stream?
I believe you are asking how to determine the image file format from its bytes.
This has already been answered here: Java get image extension/type using BufferedImage from URL
I would not use the term "file extension" to refer to the file format -- while many systems (including Windows) conventionally use the file name extension to indicate the file format, you cannot always rely on this convention, and these are two separate concepts. I found the above question by Googling "java how to detect image type"
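For decoded bytes that are expected to be an image, a minimal Java sketch of this detection could use the standard javax.imageio readers. The class and method names are hypothetical, and the result is whatever format name the first matching reader reports, so it only covers formats that ImageIO has a reader for.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

final class ImageFormatSniffer {
    /** Returns the format name of the first ImageIO reader that accepts the data, or null. */
    static String detectFormat(byte[] data) {
        try (ImageInputStream in = ImageIO.createImageInputStream(new ByteArrayInputStream(data))) {
            if (in == null) {
                return null; // no stream provider available
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return null; // not a format ImageIO recognizes
            }
            ImageReader reader = readers.next();
            try {
                return reader.getFormatName();
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned name can then be mapped to whatever file name extension you want to use; treat a null result as "unknown" rather than guessing.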
Good luck!

Novacode LineChart type

I have code that uses a Novacode.LineChart. The chart type shown by default is this one:
But I don't want this type of chart; I want it without points, like this:
This is the code where I create the chart:
LineChart c = new LineChart();
c.AddLegend(ChartLegendPosition.Bottom, false);
c.Grouping = Grouping.Stacked;
Does anyone know how I can hide those points and show only the lines? Thanks to everyone!!
Your question showed up while I was searching for the exact same feature. It's probably a bit late, but I hope this is useful to other people who need it.
My so-called answer is no more than a few lines of dirty, unmaintainable hack, so unless you are in dire need, I do not recommend going this way.
I also do not know whether this is an approved approach here, but I prefer to write the solution step by step so that it helps you grasp the concept and find better methods.
After I realized that I could not use DocX to create a line chart without markers through the currently provided API, I wanted to know what the differences were between the actual and the desired output. So I manually edited the chart to the expected result and saved a copy of the .docx file.
Before and after the edit
As you may already know, a .docx is a container format that essentially consists of a few folders and files. You can open it with a .zip archive extractor. I used 7-Zip for this task and found the chart file at /word/charts/chart1.xml. The location may differ depending on the file, but you can easily figure it out.
I compared the two chart1.xml files, and the difference was that the file without the markers had an extra XML tag with an additional attribute:
<c:marker>
<c:symbol val="none" />
</c:marker>
I had to somehow add this segment to the chart. I built on the example code provided by DocX. You can follow along from: DocX/ChartSample.cs at master
This is where the fun begins. Easy part first.
using System.Xml;
using System.Xml.Linq;
using Xceed.Words.NET;
// Create a line chart.
var line_chart = new LineChart();
// Create the data.
var PlaceholderData = ChartData.GenerateRandomDataForLinechart();
// Create and add series
var Series_1 = new Series("Your random chart with placeholder data");
Series_1.Bind(PlaceholderData, "X-Axis", "Y-Axis");
line_chart.AddSeries(Series_1);
// Create a new XmlDocument object and clone the actual chart XML
XmlDocument XMLWithNewTags = new XmlDocument();
XMLWithNewTags.LoadXml(line_chart.Xml.ToString());
I used the XPath Visualizer Tool to determine the XPath query, which is important to know because you can't just add the marker tag anywhere and expect it to work. Why do I mention this? Because I appended the marker tag on a random line and expected it to work. Naive.
// Set a namespace manager with the proper XPath location and alias
XmlNamespaceManager NSMngr = new XmlNamespaceManager(XMLWithNewTags.NameTable);
string XPathQuery = "/c:chartSpace/c:chart/c:plotArea/c:lineChart/c:ser";
string xmlns = "http://schemas.openxmlformats.org/drawingml/2006/chart";
NSMngr.AddNamespace("c", xmlns);
XmlNode NewNode = XMLWithNewTags.SelectSingleNode(XPathQuery, NSMngr);
Now create the necessary tags on the newly created XmlDocument object with the specified namespace:
XmlElement Symbol = XMLWithNewTags.CreateElement("c", "symbol", xmlns);
Symbol.SetAttribute("val", "none");
XmlElement Marker = XMLWithNewTags.CreateElement("c", "marker", xmlns);
Marker.AppendChild(Symbol);
NewNode.AppendChild(Marker);
Next we should copy the latest changes back to the actual XML object. But oops: understandably, it is defined with a private setter, so it is read-only from the outside. This is where I thought, "Okay, I've fiddled enough with this. I'd better find another library," but then I decided to go on, because reasons.
I downloaded the DocX repo, changed this line to
get; set;
recompiled, copied Xceed.Words.NET.dll to both the projectfolder/packages and projectfolder/projectname/bin/Debug folders, and finally the last few lines were:
// Copy the contents of latest changes to actual XML object
line_chart.Xml = XDocument.Parse(XMLWithNewTags.InnerXml);
// Insert chart into document
document.InsertChart(line_chart);
// Save this document to disk.
document.Save();
Is it worth it? I'm not sure, but I learned a few things while working on it. There are probably lots of bad programming practices in this answer, so please tell me if you see one. Sorry for my meh English.

Search string formatting in Eloqua API

I'm using the Eloqua REST API in an integration with another product and I want to implement a file browser. As part of this I want to get a list of the folders inside another folder. The API documents here say that a search string can be appended but don't give any clues as to the format of the search string. I've tried various things but so far I'm just getting empty results. An example is here:
/API/rest/1.0/assets/email/folders?search=folderId+%3D+250
I've tried with and without +'s, with and without URL-encoding the = sign, and with various combinations of quote marks, but so far nothing works.
I believe what you want is a slightly different endpoint, e.g.:
/API/rest/1.0/assets/email/folder/250/contents
This would provide a list of the folders contained within folder 250.
If you wanted to search for a given folder name, you would use
/API/rest/1.0/assets/email/folders?search=foldername
Hope that helps!

generic text reading

I am working on a project where I need to read some generic text. I am looking for an API with which I can read generic text and also convert it to a .csv file.
Can anyone please help?
I am using Java on Windows.
--------------------------MORE Detail---------------------------------------------------------------------------------------
let me clarify:
Assume I have a PDF document or, for that matter, a document of any file type. I intend to use the "Print to Generic Text Printer" option and get the file in that format. Finally, I intend to use some API that enables me to programmatically read this Generic Text Format file and extract text from it.
So, for any file (.doc/.pdf/.xls, whatever), I intend to create a Generic Text Format file using the print option, then run my code to read those files and extract some information.
PS: Assume that I have a status report form with standard fields. Some people might submit it as .pdf, some as .doc, some in text format. Every document contains the same fields, but probably with different layouts.
Now, I am looking for a generic solution by which I can convert every file type into the generic text file format and then apply some logic to extract my status report fields.
In Java this is more or less what you need to read a text file, assuming it's comma separated (just change the string in the "line.split" method if you need something else). It also skips the header.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public void parse(String filename) throws IOException {
    // try-with-resources closes the reader even if parsing fails
    try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
        String line;
        boolean header = true;
        while ((line = br.readLine()) != null) {
            if (header) {
                header = false;
                continue; // skips header
            }
            String[] fields = line.split(",");
            // do whatever
            System.out.println(fields[0]);
        }
    }
}
CSV is a format for data in columns. It's not very useful for, say, a Wikipedia article.
The Apache Tika library will take all kinds of data and turn it into bland XML, from which you can make CSV as you like.
It would help if you edited your question to clarify 'generic' versus 'generated', and told us more about the data.
As for Windows printer drivers, are you looking to do something like 'print to pdf' as 'print to csv'? If so, I suspect that you need to start from MSDN samples of printer drivers and code this the hard way.
The so-called 'generic text file format' is not a structured format. It's completely unpredictable what you will find in there for any given input to the printer system.
A generic free book: Text Processing in Python
Just use the standard Java classes for I/O:
BufferedWriter, File, FileWriter, IOException, PrintWriter
.csv is simply a comma-separated values file. So just name your output file with a .csv extension.
You'll also need to figure out how you'd like to split your content.
Here are Java examples to get you going:
writing to a text file
how to read lines from a file
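As a minimal sketch of the writing side using only standard Java classes: the class and method names below are hypothetical, and the quoting rule (quote a field only when it contains a comma, a quote, or a line break, and double any embedded quotes) is the common CSV convention rather than anything specific to your data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

final class CsvSketch {
    /** Quotes a field only when it contains a comma, a quote, or a line break. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            // Embedded quotes are doubled inside a quoted field.
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Writes one CSV row, terminated by CRLF, to the given writer. */
    static void writeRow(Writer out, List<String> fields) {
        try {
            for (int i = 0; i < fields.size(); i++) {
                if (i > 0) {
                    out.write(',');
                }
                out.write(escape(fields.get(i)));
            }
            out.write("\r\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrapping a FileWriter in a BufferedWriter and passing it to writeRow once per extracted record would produce the .csv file; how you split the generic text into fields is still the hard part.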