I am looking for the best way to read a CSV file line by line. I want to know the most efficient ways of doing this; I am particularly concerned about cases where the file is big. Are the file-reading utilities available in the .NET classes the most efficient?
(PS: I searched for the word 'Efficient' to check whether someone had already posted a similar question before posting this.)
You can try this: fast CSV Parser.
Here's the benchmark result:
To give more down-to-earth numbers, with a 45 MB CSV file containing 145 fields and 50,000 records, the reader was processing about 30 MB/sec. So all in all, it took 1.5 seconds! The machine specs were P4 3.0 GHz, 1024 MB.
The file handling in .NET is fine. If you want to read a line at a time in a comfortable way using foreach, I have a simple LineReader class you might want to look at (with a more general purpose version in MiscUtil).
With that you can use:
foreach (string line in new LineReader(file))
{
    // Do the conversion or whatever
}
It's also easily usable with LINQ. You may or may not find this helpful - if the accepted solution works for you, that may be a better "off the shelf" solution.
I am using:
importdata(fileName,'',headerLength)
to get data from a text file which is carriage-return/line-feed delimited. The problem I have is that the files are relatively large and there are several thousand of them, which makes the data loading slow. I only want a small part of each file, so I would like to know if I can use importdata to do this?
Something like this:
importdata(fileName,'',headerLength:dataEnd);
This does not work and I can't find any support for doing something like this in the importdata documentation.
Does anyone know of a more suitable function?
If you know the lines (the row numbers) in each file you wish to load, you can use a slower, more traditional way of reading in your data. The readline.m function allows you to do this:
http://uk.mathworks.com/matlabcentral/fileexchange/20026-readline-m-v3-0--jun--2009-
This allows you to read whichever lines you want from your data block. It is much slower than your normal csvread/textscan, but it could work out faster overall if you know exactly which lines you are looking for.
I need to replace a file inside a zip using iOS. I tried many libraries with no results. The only one that kind of did the trick was zipzap (https://github.com/pixelglow/zipzap), but it is no good for me: what it really does is re-zip the file again with the change, and besides that process being too slow for me, it also loads the whole file into memory, which makes my application crash.
PS: If this is not possible or too complicated, I can settle for renaming or deleting a specific file.
You need to find a framework where you can modify how data is read and written. You would then use some form of mmap to read and write small chunks. Searching on NSData and mmap turned up this post, but you can also use mmap at the POSIX level. PS: it will be slower than working purely in memory; there is no way around that.
Got it working! JXZip (https://github.com/JanX2/JXZip) does exactly what I need. It links to libzip (http://www.nih.at/libzip/), which is a fully equipped library for working with ZIP files, and JXZip has all the necessary Objective-C wrapper code. Thanks for all the replies.
For archive purposes, as the author of zipzap:
Actually zipzap does exactly what you want. If you replace an entry within a zip file, zipzap will do the minimum necessary to update it: it will skip writing all entries before the replaced entry, then write out the entry, then write out all entries after the replaced entry without recompressing. At the moment, it does require sufficient memory for the entries after the replaced entry though.
I want to extract all the keywords from a huge PDF file (50 MB).
Which module is good for parsing large PDF files?
I'm concerned about memory when parsing a huge file and extracting almost all of the keywords.
I want SAX-style parsing (a single pass), not DOM-style (by analogy with XML).
To read text out of a PDF, we use CAM::PDF, and it has worked just fine. It wasn't hugely fast on some larger files, but its ability to handle large files was decent. We certainly had a few that were ~100 MB, and they were handled OK. If I recall, we struggled with a few that were 130 MB on a 32-bit (Windows) Perl, but we had a whole lot of other stuff in memory at the time. We did look at PDF::API2, but it seemed more oriented towards generating PDFs than reading from them. We didn't throw large files into PDF::API2, so I can't give a real benchmark figure.
The only significant downside we found with using CAM::PDF is that PDF 1.6 is becoming more common, and that doesn't work at all in CAM::PDF yet. That might not be an issue for you, but it might be something to consider.
In answer to your question, I'm pretty sure both modules read the whole source PDF into memory in one form or another, but I don't think CAM::PDF builds as many complex structures out of it. So neither is really SAX-like, but CAM::PDF seemed to be lighter in general, and it can retrieve one page at a time, which might reduce the memory load when extracting text from very large files.
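To make this concrete, here is a minimal sketch of page-at-a-time keyword counting with CAM::PDF, using the same new()/numPages()/getPageText() calls shown elsewhere on this page. The tokenization regex, the minimum word length and the top-50 cut-off are just assumptions you would tune for your own definition of "keyword":

use strict;
use warnings;
use CAM::PDF;

my $filename = 'huge.pdf';    # hypothetical input file

my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";

my %count;    # token => number of occurrences
for my $pagenum (1 .. $doc->numPages()) {
    # getPageText() returns the (best-effort) text of one page at a time,
    # so the full extracted text of the document is never held at once.
    my $text = $doc->getPageText($pagenum);
    next unless defined $text;

    # Crude tokenization: lowercase, then split on non-word characters.
    for my $word (split /\W+/, lc $text) {
        next if length $word < 3;    # skip very short tokens
        $count{$word}++;
    }
}

# Print the 50 most frequent tokens as candidate keywords.
my @top = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 49];
printf "%-20s %d\n", $_, $count{$_} for grep { defined } @top;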
I'm trying to write a 3D OBJ file parser for the iPhone/iPad and am trying to figure out the absolute fastest method to do this. I'm not super familiar with C++ but have a few ideas. One idea was to read the whole file into a string and then parse from there. Another was to read the file line by line and put the data into vectors, but that sounds like it might be slower. I know there are a lot of tricks to make C++ extremely fast. Anyone want to take a stab at this? After I parse the file, I'm going to re-save it as a binary file for fast loading on subsequent startups. Thanks.
Check this source: http://code.google.com/p/iphonewavefrontloader/
It works pretty fast, so you can learn how it is implemented.
I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string.
To date I have been using this:
my @search_results = `grep -i -l \"$string\" *.pdf`;
where $string is the text to look for.
However, this fails for most PDFs because the file format is obviously not plain ASCII.
What can I do that's easiest?
Clarification:
There are about 300 PDFs whose names I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nicely with each other, but given that I don't know the names of the PDFs, I can't find the right syntax yet.
Final solution using Adam Bellaire's suggestion below:
@search_results = `for i in \$( ls ); do pdftotext \$i - | grep --label="\$i" -i -l "$search_string"; done`;
The PerlMonks thread here talks about this problem.
It seems that for your situation, it might be simplest to get pdftotext (the command line tool), then you can do something like:
my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;
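Since you don't know the filenames in advance, you can also do the looping from Perl rather than the shell. This is only a rough sketch; it assumes pdftotext is on your PATH and that extracting the whole text of each file and matching it in memory is acceptable for ~300 PDFs:

use strict;
use warnings;

my $string = 'your search string';    # whatever you are looking for
my @search_results;

for my $pdf (glob '*.pdf') {
    # "-" tells pdftotext to write the extracted text to stdout.
    my $text = `pdftotext "$pdf" - 2>/dev/null`;
    next unless defined $text && length $text;

    # Case-insensitive literal match, like grep -i -l.
    push @search_results, $pdf if $text =~ /\Q$string\E/i;
}

print "$_\n" for @search_results;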
My library, CAM::PDF, has support for extracting text, but it's an inherently hard problem given the graphical orientation of PDF syntax. So, the output is sometimes gibberish. CAM::PDF bundles a getpdftext.pl program, or you can invoke the functionality like so:
my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";
for my $pagenum (1 .. $doc->numPages()) {
    my $text = $doc->getPageText($pagenum);
    print $text;
}
I second Adam Bellaire's solution. I used the pdftotext utility to create a full-text index of my ebook library. It's somewhat slow, but it does its job. As for full-text search, try PLucene or KinoSearch to store the full-text index.
You may want to look at PDF::Core.
The easiest full-text index/search I've used is MySQL. You just insert into a table with the appropriate FULLTEXT index on it. You need to spend some time working out the relative weightings for fields (a match in the title might score higher than a match in the body), but this is all possible, albeit with some hairy SQL.
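For what it's worth, a rough sketch of that approach from Perl with DBI follows. The database, table and column names are made up, and the weighting factor is arbitrary; note that MySQL needs a FULLTEXT index whose column list exactly matches each MATCH() you use:

use strict;
use warnings;
use DBI;

# Illustrative connection details only.
my $dbh = DBI->connect('DBI:mysql:database=docs;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# One-time schema (shown as a comment), with a FULLTEXT index per MATCH():
#   CREATE TABLE pdf_text (
#       id       INT AUTO_INCREMENT PRIMARY KEY,
#       filename VARCHAR(255),
#       title    TEXT,
#       body     MEDIUMTEXT,
#       FULLTEXT (title, body),
#       FULLTEXT (title),
#       FULLTEXT (body)
#   );

# Indexing: insert the text you extracted with pdftotext (or CAM::PDF).
my $ins = $dbh->prepare(
    'INSERT INTO pdf_text (filename, title, body) VALUES (?, ?, ?)');
# $ins->execute($filename, $title, $extracted_text);

# Searching: MATCH ... AGAINST gives a relevance score you can sort on,
# and weighting title matches above body matches is plain arithmetic.
my $query = shift @ARGV or die "usage: $0 <search terms>\n";
my $sth = $dbh->prepare(q{
    SELECT filename,
           2 * MATCH (title) AGAINST (?) + MATCH (body) AGAINST (?) AS score
    FROM   pdf_text
    WHERE  MATCH (title, body) AGAINST (?)
    ORDER  BY score DESC
});
$sth->execute($query, $query, $query);

while (my ($filename, $score) = $sth->fetchrow_array) {
    printf "%-40s %.3f\n", $filename, $score;
}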
Plucene is deprecated (there hasn't been any active work on it in the last two years afaik) in favour of KinoSearch. KinoSearch grew, in part, out of understanding the architectural limitations of Plucene.
If you have ~300 PDFs, then once you've extracted the text from them (assuming the PDFs have text and not just images of text ;) and depending on your query volumes, you may find grep is sufficient.
However, I'd strongly suggest the MySQL/KinoSearch route, as they have already covered a lot of ground (stemming, stopwords, term weighting, token parsing) that there's no benefit in getting bogged down with yourself.
KinoSearch is probably faster than the MySQL route, but the MySQL route gives you more widely used, standard software/tools/developer experience. And you get the ability to use the power of SQL to augment your free-text search queries.
So unless you're talking HUGE data sets and insane query volumes, my money would be on MySQL.
You could try Lucene (the Perl port is called Plucene). The searches are incredibly fast and I know that PDFBox already knows how to index PDF files with Lucene. PDFBox is Java, but chances are there is something very similar somewhere in CPAN. Even if you can't find something that already adds PDF files to a Lucene index it shouldn't be more than a few lines of code to do it yourself. Lucene will give you quite a few more searching options than simply looking for a string in a file.
There's also a very quick and dirty way. Text in a PDF file is often stored as plain text: if you open a PDF in a text editor or use 'strings' on it, you can often see that text. The binary junk is usually embedded fonts, images and compressed streams.
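If you want to script that quick-and-dirty check instead of eyeballing it, a few lines of Perl approximate what 'strings' does. It will only ever see text that happens to be stored uncompressed, and the minimum run length of 4 printable characters is an arbitrary choice:

use strict;
use warnings;

my ($file, $string) = @ARGV;

# Slurp the raw bytes of the PDF.
open my $fh, '<:raw', $file or die "Can't open $file: $!";
my $raw = do { local $/; <$fh> };
close $fh;

# Pull out runs of printable ASCII, roughly what the 'strings' utility does.
my @runs = $raw =~ /([\x20-\x7E]{4,})/g;

# Count runs containing the search string, case-insensitively.
my $hits = grep { /\Q$string\E/i } @runs;
print $hits ? "$file: found ($hits matching runs)\n"
            : "$file: not found (text may be compressed or image-only)\n";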