Paginate a big text file - perl

I have a big text file. Each line in the file is a record. I need to parse the file and show only 20 records at a time in an HTML table. I will have to support sorting as well.
What I currently do is read the file line by line, based on the parameters start, stop, and page_size, which are provided in the query string. This works fine until I have to sort the records, because sorting requires processing every line in the file.
So is there a Unix command with which I can extract a range of lines and sort them? I tried grep, but I do not know it well enough to solve this problem.

Take a look at the pr command. This is what we used to use all the time to paginate big files. You can set the page length, headers, and footers, turn on line numbers, etc.
There's probably even a way to munge the output into HTML.

How big is the file?
See man sort.
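If the file fits in memory, sorting before paginating can be sketched in Perl roughly as follows. This is only a minimal sketch: sorted_page is a hypothetical helper name, the plain string sort stands in for whatever ordering the records really need, and the in-memory list stands in for the lines of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: sort all records, then return one page of them.
# $records is an array ref of lines; $page is 1-based.
sub sorted_page {
    my ($records, $page, $page_size) = @_;
    my @sorted = sort @$records;           # plain string sort; use a custom comparator if needed
    my $start  = ($page - 1) * $page_size;
    return () if $start > $#sorted;        # page is past the end of the data
    my $end = $start + $page_size - 1;
    $end = $#sorted if $end > $#sorted;    # last page may be shorter
    return @sorted[$start .. $end];
}

# Usage sketch: in-memory data standing in for the file's lines.
my @records = map { "record $_" } qw(d a c b e);
print "$_\n" for sorted_page(\@records, 2, 2);
```

On the command line, piping sort into something like sed -n '21,40p' should give roughly the same result for the second page of 20 records.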

Related

Extracting data from complex output text file using perl and placing into new text file

The complete output text file is hundreds of lines long, with relevant nuclear cross sections and a plethora of other data that I do not need for this particular problem. I am trying to extract the columns of data under "BURNUP" and the first "K-INF" from the file I attached. I am trying to extract this data and place it into a separate file. I am a newbie, and have a similar perl script from a professor. I have tried to adapt it to the information I am looking for but the only result I am receiving are the 2 print statements. Any suggestions?

en masse inline editing in an uncompressed PDF

I have a large PDF (~20 MB, 160 MB uncompressed).
I need to do a find and replace in the text in it, about 1000 times.
Here is what I tried.
1. Via SVG
   - Transform to SVG (Inkscape)
   - Read the SVG line by line and do the replacements in the file
   - Transform back to PDF
   => bad output: the text is not rendered well, probably due to some geometric transform matrix in the SVG
2. Creating ~1000 sed commands
   - Uncompress the PDF
   - Perform each replacement with a sed command
   - Recompress the PDF
   => way too slow: each sed command takes about 20 seconds, so the whole process takes several hours
3. Read line by line and replace
   - Uncompress the PDF
   - Read the PDF line by line
   - Find the text to be replaced
   - Replace it using Perl
   - Write the line to a new file
   - Compress the new file
   => the new file is apparently damaged, due to the data streams left in the uncompressed PDF (binary data written as lines of text)
I wonder whether it would be possible to read the uncompressed PDF line by line but do the editing directly in it. How could I do this?
I have searched for Perl in-place editing, but it applies the changes to the whole file at once, while I'd like to edit a single line.
Other ideas are more than welcome ;)
Following the advice given, I used CAM::PDF; this was the simplest and most efficient solution.

There is no difference between 2. and 3. Sed reads the input file line by line and writes the changed lines into the output file. If you pass it the -i switch, sed just opens the input file, unlinks it (which is what rm does), then opens the output file under the same name and writes into it. That's it. No magic involved. So if Perl damaged the content but sed did not, your Perl script must be doing something different from what sed does. The main difference is that a Perl script can be made far faster at replacing many strings. See Using sed on text files with a csv
The main trick is that you can compile a single regexp for search and replace, which works in linear time.
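my %replace = ( foo => 'bar' );    # search string => replacement
# Build one alternation of all search strings, escaped so they match literally
my $re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
while (<>) {
    s/$re/$replace{$1}/g;    # replace every match in a single pass over the line
    print;                   # write the (possibly changed) line to the output
}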
You can use it with your original approach, but I would recommend doing it in a Perl script, which lets you keep the regexp and the replacement hash between PDF files. You can also try combining it with CAM::PDF; its distribution includes the example script changepagestring.pl. You can also look at PDF::API2, which would require more work but may give better results. But remember, the PDF format is not intended for modification.

You can follow the pdftk steps as described in
How to find and replace text in a existing PDF file with PDFTK (or other command line application)
You can first split the PDF into smaller documents with a few pages each, replace the text and again merge them together - all using pdftk.
There is also the PDFEdit software (http://pdfedit.cz/en/index.html). It is a GUI app with a scripting interface. You can process individual pages and then do a find replace using scripting commands. See if it loads your PDF.

Processing text inside variable before writing it into file

I'm using Perl WWW::Mechanize package in order to fetch and process data from some websites. Usually my way of action is as follows:
Fetch a webpage
$mech->get("$url");
Save the webpage contents in a variable (BTW, I'm not sure whether it's right to store this amount of text in a scalar, which, as far as I know, is supposed to hold a single value)
my $list = $mech->content();
Use a subroutine that I've created to write the contents of the variable to a text file. (The writeToFile subroutine includes a few more features, such as path and existing-file validation.)
writeToFile("$filename.tmp","$path",$list);
Process the text in the file created in the previous step by creating an additional file and saving the processed content there (then deleting the initial temporary file).
What I wonder is whether it is possible to do the processing directly inside the $list variable, before storing the text in a file. The whole process works as expected, but I don't really like the logic behind it, and it seems a bit inefficient as well, since I have to rewrite the same file multiple times.
EDIT:
Just to give a bit more information about what I'm actually after when I process the variable contents. So the data I fetch from the website in this case is actually a list of items separated by a blank line and the first line is irrelevant to me. So what I'm doing while processing this data is 2 things:
Remove the empty (CRLF) lines
Remove the first line if it includes a particular text.
Ideally I want to save the processed list (no blank spaces and first line removed) in a file without creating any additional files on the way. In order to save the file I would like to use the writeToFile sub (I wrote) since it also performs validation on whether such file already exists (If a file will be saved before final processing - the writeToFile will always rewrite the existing file).
Hope it makes sense.

You're looking for split. The pattern depends on what you need: use (?<=\n) to split at a newline character and keep it. If keeping it doesn't matter, use \R to match all sorts of line breaks.
foreach my $line (split qr/\R/, $mech->content) {
…
}
Now the obligatory HTML-parsing-with-regex admonishment: if you get HTML source with Mechanize, parsing it line-by-line does not make much sense. You probably want to process the HTML-stripped text version of the document instead, or pass the HTML source to a parser such as Web::Query to declaratively get at the pieces you need.
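The two processing steps from the question (dropping blank lines, and dropping the first line when it contains a particular text) can be done on the string itself, roughly like this. It is a minimal sketch: clean_list is a hypothetical helper name, the sample text and header pattern are made up for illustration, and "first line" is taken to mean the first non-blank line.

```perl
use strict;
use warnings;

# Hypothetical helper: drop blank lines, then drop the first remaining line
# if it matches $header_re. Works on the string, with no temporary file.
sub clean_list {
    my ($content, $header_re) = @_;
    my @lines = grep { /\S/ } split /\R/, $content;   # \R matches LF, CRLF, and other line breaks
    shift @lines if @lines && $lines[0] =~ $header_re;
    return join "\n", @lines;
}

# Usage sketch: made-up content standing in for $mech->content.
my $raw = "Item list\r\n\r\napple\r\n\r\nbanana\r\n";
print clean_list($raw, qr/^Item list/), "\n";
```

The returned string could then be handed once to the asker's own writeToFile sub, so the file is written a single time.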

Reading large csv files with strings containing commas as one field

I have a large .csv file (~26000 rows) that I want to read into MATLAB. The problem is that one of the fields contains a collection of strings delimited by commas.
I'm having trouble reading it. I tried things like tdfread, which won't work here. Are there any tricks with textscan I should be aware of?
Is there any other way?

I'm not sure what is generating your CSV file but that is your problem.
The point of a CSV file is that the file itself designates the separation of fields. If the text of the CSV contains unquoted commas, then nothing you can do will help you. How would ANY program know whether a comma is part of the text in a single field or a field delimiter?
Proper CSV would have a text qualifier. Some generators/readers give you the option to use one. The standard text qualifier is a " (double quote). It's changeable, though, because your text may contain those, too.
Again, it's all about generating proper CSV content.
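To illustrate what a text qualifier buys you, here is a small Perl sketch that splits one record while treating commas inside double quotes as data. It uses parse_line from the core Text::ParseWords module; split_csv_line is a hypothetical wrapper name and the sample record is made up. Note that parse_line also treats single quotes and backslashes specially, so for real-world files a dedicated CSV parser such as Text::CSV from CPAN is likely the safer choice.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical wrapper: split one CSV record into fields.
# The second argument (0) tells parse_line not to keep the quotes in the result.
sub split_csv_line {
    my ($line) = @_;
    return parse_line(',', 0, $line);
}

# Usage sketch: the quoted field contains commas but stays one field.
my @fields = split_csv_line('42,"red,green,blue",2008-11-26');
print scalar(@fields), " fields; second is: $fields[1]\n";
```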

There's a chance that xlsread won't give you the answer you expect -- do the strings always appear in the same columns, for example? I think (as everyone else seems to :-) that it would be more robust to just use
fid = fopen('yourfile.csv');
and then either textscan
t = textscan(fid, '%s', 'delimiter', sprintf('\n'));
t = t{1};
or just fgetl (the example in the help is perfect).
After that you can do some line-by-line processing -- using textscan again on the text content of each line, for example, is a nice, quick way to get a cell-array that will allow fast analysis of each line.

You have a problem because you're reading it in as a .csv and you have commas within your data. You can open it in Excel and manipulate the data, possibly removing the unwanted commas with Excel formulas. I work with .csv files for DB imports quite a bit. I imagine MATLAB has a similar rule, which is: no commas in your data.
Can you tell us more about your data? Are there commas throughout, or just in one column? Maybe you can read it in as tab-delimited?

Are you using a Unix system? The reason I am asking is that you could use a command-line function such as sed and regular expressions to clean those data files before you pass them into Matlab. Here is a link that explains how to do exactly what you are looking for.

Since, as others have observed, your file is CSV with commas inside what you think of as a single field, it's going to be hard to persuade Matlab that that really is only one field. I think your best strategy is going to be to read one line at a time, into a string acting as a buffer, and to translate it, field-by-field, into the variables or other data structures that you want. Since Matlab has in-built regular expression capabilities this shouldn't be too hard.
And, as others have already suggested, posting a sample of your data would help us to help you.

One easy solution is:
folder = 'C:\folder1\folder2\';
file = 'data.csv';
data = dataset('xlsfile', fullfile(folder, file));
Of course you could also do the following:
[file, folder] = uigetfile('C:\folder1\folder2\*.csv');
data = dataset('xlsfile', fullfile(folder, file));
Now you will have loaded the data as a dataset. An easy way to get column 1, for example, is
double(data(:,1))

How can I make log4perl output easier to read?

When using log4perl, the debug log layout that I'm using is :
log4perl.appender.D10.layout=PatternLayout
log4perl.appender.D10.layout.ConversionPattern=%d [pid=%P] %p %F{1} (%L) %M %m%n
log4perl.appender.D10.Filter = DebugAndUp
This produces very verbose debug logs, for example:
2008/11/26 11:57:28 [pid=25485] DEBUG SomeModule.pm (331) functions::SomeModule::Test Test XXX was successfull
2008/11/26 11:57:29 [pid=25485] ERROR SomeOtherUnrelatedModule.pm (99999) functions::SomeModule::AnotherTest AnotherTest YYY has faled
This works great, and provides excellent debugging data.
However, each line of the debug log contains function names, PIDs, etc. of different lengths. As a result every line is laid out differently, which makes reading the debug logs much harder than it needs to be.
Is there a way in log4perl to format the line so that the debugging metadata (everything up to the actual log message) is padded at the end with spaces/tabs, so that the actual message always starts at the same column?

You can pad the single fields that make up your entries. For example [pid=%5P] will always give you at least 5 characters for the PID.
The "Quantify Placeholders" section in the docs for Log::Log4perl::Layout gives more details.
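The quantifiers follow printf conventions, so the effect of padding can be previewed with plain sprintf. The sketch below is only an illustration: format_prefix is a hypothetical helper, the widths are arbitrary, and the layout pattern shown in the comment is an untested assumption about how the same widths might be written in the ConversionPattern.

```perl
use strict;
use warnings;

# Assumed equivalent layout (untested): %d [pid=%5P] %-5p %-28F{1} %m%n
# %5s right-aligns to width 5; %-5s and %-28s left-align and pad with spaces.
sub format_prefix {
    my ($pid, $level, $file) = @_;
    return sprintf '[pid=%5s] %-5s %-28s', $pid, $level, $file;
}

# Usage sketch: both messages start at the same column.
print format_prefix(25485, 'DEBUG', 'SomeModule.pm'), "| message\n";
print format_prefix(312,   'ERROR', 'SomeOtherUnrelatedModule.pm'), "| message\n";
```

A value longer than its width is not truncated, so the widths need to cover the longest expected value.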

There are a couple of ways to go with this, although you have to figure out which one works better for your situation:
Use a different appender if you are working live. Have that appender use a pattern that shows only the information you want. If you're working in a single process, for instance, your alternate appender might leave off the PID and the timestamp. You might only need the file name and line number.
Use %n to put newlines in the right place. That makes it multi-line output that is slightly harder to parse later, but you can choose another sequence for the input record separator (say, a literal "[EOL]") to make it easy to read entry-by-entry.
Log to a database instead of a file. For your reports, select just the columns you want to inspect.
Log everything, but write a filter to go through the log file ad-hoc to display just the parts that you want to see, such as only the debugging messages, the entries between certain times, only the entries involving a file, and so on.
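For the multi-line option, reading the log back entry by entry could look like the sketch below. It assumes each entry ends with a literal "[EOL]" followed by a newline, as suggested above; read_entries is a hypothetical helper and the sample log text is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: read entries that span several lines and end with "[EOL]".
sub read_entries {
    my ($fh) = @_;
    local $/ = "[EOL]\n";      # input record separator: one read returns one whole entry
    my @entries;
    while (my $entry = <$fh>) {
        chomp $entry;          # chomp strips the current value of $/
        push @entries, $entry;
    }
    return @entries;
}

# Usage sketch: an in-memory filehandle stands in for the log file.
my $log = "DEBUG SomeModule.pm (331)\n  Test XXX passed[EOL]\n"
        . "ERROR Other.pm (99)\n  Test YYY failed[EOL]\n";
open my $fh, '<', \$log or die "cannot open in-memory log: $!";
print "entry: $_\n---\n" for read_entries($fh);
```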
