generic text reading - text-processing

I am working on a project where I need to read some generic text...I am looking for any api by I can read generic text and also can convert it to .csv file...
Can any one plz help...
using java on windows os...
--------------------------MORE Detail---------------------------------------------------------------------------------------
let me clarify:
Assume I have a pdf document or for that matter any file type document. I intend to use Print to Generic text printer option and get the file in that format.Finally, I intend to use some API which shoudl enable me to programatically read this Generic Text Format file. I intend to extract text from this generic text file.
So, be it any file (.doc/.pdf/.xls etc wtatever), I intend to create a Generic Text Format file using print option. Then run my code to read those files and extract some information.
PS: Assume that I have a Status report form with standard fields. Ok. But, some people might submit in .pdf, some in .doc , some in text format. But, every document contains same fields, but probably with diferent layouts.
Now, I am looking for a generic solution, by which i shoudl be able to convert every file type in to generic text file format and then apply some logic to extract my Status report fields.

In Java this is more or less what you need to read a text file, assuming it's comma separated (just change the string in the "line.split" method if you need something else). It also skips the header.
public void parse(String filename) throws IOException {
File file = new File(filename);
FileInputStream fis = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(fis);
BufferedReader br = new BufferedReader(isr);
String line;
int header = 1;
while ((line = br.readLine()) != null) {
if (header == 1) {
header = 2;
continue; // skips header
}
String[] splitter = line.split(",");
// do whatever
System.out.println(splitter[0]);
}
}

CSV is a format for data in columns. It's not very useful for, say, a Wikipedia article.
The Apache Tika library will take all kinds of data and turn it into bland XML, from which you can make CSV as you like.
It would help if you would edit your question to clarify 'generic' versus' generated', and tell more about the data.
As for Windows printer drivers, are you looking to do something like 'print to pdf' as 'print to csv'? If so, I suspect that you need to start from MSDN samples of printer drivers and code this the hard way.
The so-called 'generic text file format' is not a structured format. It's completely unpredictable what you will find in there for any given input to the printer system.

A generic free book: Text Processing in Python

Just used the standard Java classes for I/O:
BufferedWriter, File, FileWriter, IOException, PrintWriter
.csv is simply a comma-separated values file. So just name your output file with a .csv extension.
You'll also need to figure out how you'd like to split your content.
Here are Java examples to get you going:
writing to a text file
how to read lines from a file

Related

How to get the extension file of Input Stream

I have a code
var test = Base64.getDecoder.decode(base64);
ByteArrayInputStream(test);
var input_stream = new ByteArrayInputStream(test);
Logger.debug(test.getClass.getSimpleName)
How do I get the file extension of the variable input_stream?
I believe you are asking how to determine the image file format from its bytes.
This has already been answered here: Java get image extension/type using BufferedImage from URL
I would not use the term "file extension" to refer to the file format -- while many systems (including Windows) conventionally use the file name extension to indicate the file format, you cannot always rely on this convention, and these are two separate concepts. I found the above question by Googling "java how to detect image type"
Good luck!

Novacode LineChart type

I have a code that implements a Novacode.LineChart. And the LineChart type which is shown by default is this one:
But I dont want this type of chart, I want it without points, like this:
This is the code where I create the chart:
LineChart c = new LineChart();
c.AddLegend(ChartLegendPosition.Bottom, false);
c.Grouping = Grouping.Stacked;
Anyone knows how can I hide thoose points and show only the lines? Thanks to everyone!!
Your question is shown up while I was searching for the exact same feature. It's probably a bit late but I hope it would be useful for other people in need of this feature.
My so called answer is not more than a few lines of dirty and unmanageable hack so unless you are not in dire need, I do not recommend to follow this way.
I also do not know if is it an approved approach here but I prefer to write the solution step by step so it may help you to grasp the concept and use better methods.
After I have realized that I was unable to use DocX to create a line chart without markers, using currently provided API, I wanted to know what were the differences between actual and desired output. So I saved a copy of .docx file with line chart after I manually edited the chart to expected result.
Before and after the edit
As you may already know, a .docx is a container format and essentially comprised of a few different folders and files. You can open it up with a .zip archive extractor. I used 7-Zip for this task and found chart file at location of /word/charts/chart1.xml but this may differ depending on the file, but you can easily figure it out.
Compared both of chart1.xml files and the difference was, the file without the markers had and extra XML tag with an additional attribute;
<c:marker>
<c:symbol val="none" />
</c:marker>
I had to somehow add this segment of code to chart. I added these up to example code provided by DocX. You can follow up from: DocX/ChartSample.cs at master
This is where the fun begins. Easy part first.
using System.Xml;
using System.Xml.Linq;
using Xceed.Words.NET;
// Create a line chart.
var line_chart = new LineChart();
// Create the data.
var PlaceholderData = ChartData.GenerateRandomDataForLinechart();
// Create and add series
var Series_1 = new Series("Your random chart with placeholder data");
Series_1.Bind(PlaceholderData, "X-Axis", "Y-Axis");
line_chart.AddSeries(Series_1);
// Create a new XmlDocument object and clone the actual chart XML
XmlDocument XMLWithNewTags = new XmlDocument();
XMLWithNewTags.LoadXml(line_chart.Xml.ToString());
I've used XPath Visualizer Tool to determine the XPath query, which is important to know because you can't just add the marker tag to somewhere and expect it to work. Why do I tell this? Because I appended marker tag on a random line and expected it to work. Naive.
// Set a namespace manager with the proper XPath location and alias
XmlNamespaceManager NSMngr = new XmlNamespaceManager(XMLWithNewTags.NameTable);
string XPathQuery = "/c:chartSpace/c:chart/c:plotArea/c:lineChart/c:ser";
string xmlns = "http://schemas.openxmlformats.org/drawingml/2006/chart";
NSMngr.AddNamespace("c", xmlns);
XmlNode NewNode = XMLWithNewTags.SelectSingleNode(XPathQuery, NSMngr);
Now create necessary tags on newly created XML Document object with specified namespace
XmlElement Symbol = XMLWithNewTags.CreateElement("c", "symbol", xmlns);
Symbol.SetAttribute("val", "none");
XmlElement Marker = XMLWithNewTags.CreateElement("c", "marker", xmlns);
Marker.AppendChild(Symbol);
NewNode.AppendChild(Marker);
And we should copy the contents of latest changes to actual XML object. But oops, understandably it is defined as private so it is a read-only object. This is where I thought like "Okay, I've fiddled enough with this. I better find another library" but then decided to go on because reasons.
Downloaded DocX repo, changed this line to
get; set;
recompiled, copied Xceed.Words.NET.dll to both projectfolder/packages and projectfolder/projectname/bin/Debug folder and finally last a few lines were
// Copy the contents of latest changes to actual XML object
line_chart.Xml = XDocument.Parse(XMLWithNewTags.InnerXml);
// Insert chart into document
document.InsertChart(line_chart);
// Save this document to disk.
document.Save();
Is it worth it? I'm not sure but I have learned a few things while working on it. There're probably lots of bad programming practises in this answer so please tell me if you see one. Sorry for meh English.

Forcing Tesseract to match pattern (four digits in a row)

I'm trying to get Tesseract (using the Tess4J wrapper) to match only a specific pattern. The pattern is four digits in a row, which I think would be \d\d\d\d. Here is a VERY small subset of the image I'm feeding tesseract (the floorplans are restricted, so I'm cautious to post much more of it): http://mike724.com/view/a06771
I'm using the following java code:
File imageFile = new File("/<redacted>/file.pdf");
Tesseract instance = Tesseract.getInstance();
instance.setTessVariable("load_system_dawg", "F");
instance.setTessVariable("load_freq_dawg", "F");
instance.setTessVariable("user_words_suffix", "");
instance.setTessVariable("user_patterns_suffix", "\\d\\d\\d\\d");
try {
String result = instance.doOCR(imageFile);
System.out.println(result);
} catch (TesseractException e) {
System.err.println(e.getMessage());
}
The problem I'm running into is that tesseract seems to not be honoring these configuration options, I still get text/words in the results. I expect to get only the room numbers (ex. 2950).
You have not configured this correctly.
user_patterns_suffix is meant to indicate the file extension of a text file that contains your patterns, e.g.
user_patterns_suffix pats
would mean you need to put a file in the tesseract tessdata folder
tessdata/eng.pats
... assuming eng was the language you were using.
See more here:
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data
I do recall that user patterns may not be any shorter than 6 fixed chars before a pattern so you may not be able to accomplish this in any case - but try the correct config first.
They look like init-only parameters; as such, they need to be in a configs file, for instance, named bazaar placed under configs folder, to be be passed into setConfigs method.
instance.setConfigs(Arrays.asList("bazaar");
References:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc
https://github.com/tesseract-ocr/tesseract/wiki/ControlParams
http://tess4j.sourceforge.net/docs/docs-1.4/

itextsharp PDF to text dump

I am looking for a way to actually get the contents of the file itself, in its text format, dumped. E.g.: i don't want a dictionary object, i don't want some sort of extractionstrategy option, i just want the same text document that itextsharp uses to parse... the WHOLE thing as a string or stringbuilder...
I have not yet found a way to do this using any tools what so ever... my problem is that i am trying to read a dynamic PDF into a C# application... and we all know that those darn dynamic PDFs can't be parsed by iTextSharp (AcroForm and AcroFields always comes up empty), so i figured that if i can get the actual text dump of the entire file, i can see what it looks like and parse it myself for this specific task (e.g.: make a class for each document i know i can received, and make a map there based on what i see).
If anyone can help me do that, or even better, find a way, in C#, to extract the XML Source for the PDF (kinda like clicking the XML Source tab in LiveCycle) instead, it would be greatly appreciated.
Thanks!
Matt
If you are looking for the actual operators and commands of each page in the raw text format, try the following code:
var reader = new PdfReader("test.pdf");
int intPageNum = reader.NumberOfPages;
for (int i = 1; i <= intPageNum; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
File.WriteAllBytes("page-" + i + ".txt", contentBytes);
}
reader.Close();
I am looking for a way to actually get the contents of the file
itself, in its text format, dumped. E.g.: i don't want a dictionary
object, i don't want some sort of extractionstrategy option, i just
want the same text document that itextsharp uses to parse... the WHOLE
thing as a string or stringbuilder...
Unfortunately the data that itextsharp uses to parse are not yet text; the operators in that data are given in some textual format but the actual glyphs may be given in a completely arbitrary ad-hoc encoding. That been said, often some standard encoding is used as it is the most simple solution for the components in use. You cannot in general count on that, though. The answer by VahidN shows you how to access the starting points for that content; not seldomly, though, that page content data he extracts only contain references to resources which are contained in different objects.
my problem is that i am trying to read a dynamic PDF into a C#
application... and we all know that those darn dynamic PDFs can't be
parsed by iTextSharp (AcroForm and AcroFields always comes up empty),
This sounds as if you actually have a completely different task at hand. Dynamic forms and their contents are not part of the page content but instead stored in a separate XML Forms Architecture stream.
iText in Action, 2nd edition, in chapter 8 gives you some information on how to access the XFA stream data, for a first glimps look at the sample XfaMovie.cs.
You might also want to look at the iText XML Worker project for easier manipulation of XFA streams.
if you just want to dump the text, try this:
PdfReader reader = new PdfReader(pdfFileName);
String text = "";
nPages = reader.NumberOfPages;
for (int i = 0; i < nPages; i++)
{
text += PdfTextExtractor.GetTextFromPage(reader, i + 1);
}

What is the best file parsing solution for converting files?

I am looking for the best solution for custom file parsing for our enterprise import routines. I want to basically change one file format into a standard file format and have one routine that imports that data into the database. I need to be able to create custom scripts for each client since its difficult to get the customer to comply with a standard or template format. I have looked at PowerShell and Iron Python to do this so far but I am not sure this is the route I want to go. I have also looked at some tools such as Talend which is a drag and drop style tool which may or may not give me what I want as far as flexibility. We are a .NET shop and have created custom code to do this in the past but I need something that is quicker to create then coding custom parsing functions each time we get a new file format in.
Depending on the complexity and variability of your work, you should consider an ETL tool like SSIS (SQL Server Integration Services).
Python is wonderful for this kind of thing. That's why we use. Each new customer transfer is a new adventure and Python gives us the flexibility to respond quickly.
Edit. All python scripts that read files are "custom file parsers". Without an actual example, it's not sensible to provide a detailed example.
with open( "some file", "r" ) as source:
for line in source:
process( line )
That's about all there is to a "custom file parser". If you're parsing .csv or .xml files, then Python has modules for that. If you're parsing fixed-format files, you'd use string slicing operations. If you're parsing other files (X12? JSON? YAML?) you'll need appropriate parsers.
Tab-Delim.
from collections import namedtuple
RecordLayout = namedtuple('RecordLayout',['field1','field2','field3',...])
def process( aLine ):
record = RecordLayout( aLine.split('\t') )
...
Fixed Layout.
from collections import namedtuple
RecordLayout = namedtuple('RecordLayout',['field1','field2','field3',...])
def process( aLine ):
fields = ( aLine[:10], aLine[10:20], aLine[20:30], ... )
record = RecordLayout( fields )
...