editing / splitting / saving data in a text file - perl

I have a dynamically created text file called playlist.pls, and in it I have thousands of lines that look like this:
File000001=/home/ubu32sc/Documents/octave/pre/wavefn_0001.wav
File000002=/home/ubu32sc/Documents/octave/pre/wavefn_0002.wav
File000003=/home/ubu32sc/Documents/octave/pre/wavefn_0003.wav
File000004=/home/ubu32sc/Documents/octave/pre/wavefn_0004.wav
File000005=/home/ubu32sc/Documents/octave/pre/wavefn_0005.wav
File000006=/home/ubu32sc/Documents/octave/pre/wavefn_0006.wav
File000007=/home/ubu32sc/Documents/octave/pre/wavefn_0007.wav
File000008=/home/ubu32sc/Documents/octave/pre/wavefn_0008.wav
File000009=/home/ubu32sc/Documents/octave/pre/wavefn_0009.wav
File000010=/home/ubu32sc/Documents/octave/pre/wavefn_0010.wav etc...
I need to have the data in the text file split into several different files.
example:
The play1.pls file would contain:
File000001=/home/ubu32sc/Documents/octave/pre/wavefn_0001.wav
File000002=/home/ubu32sc/Documents/octave/pre/wavefn_0002.wav
File000003=/home/ubu32sc/Documents/octave/pre/wavefn_0003.wav
The play2.pls file would contain:
File000004=/home/ubu32sc/Documents/octave/pre/wavefn_0004.wav
File000005=/home/ubu32sc/Documents/octave/pre/wavefn_0005.wav
File000006=/home/ubu32sc/Documents/octave/pre/wavefn_0006.wav
The play3.pls file would contain:
File000007=/home/ubu32sc/Documents/octave/pre/wavefn_0007.wav
File000008=/home/ubu32sc/Documents/octave/pre/wavefn_0008.wav
File000009=/home/ubu32sc/Documents/octave/pre/wavefn_0009.wav
The play4.pls file would contain:
File000010=/home/ubu32sc/Documents/octave/pre/wavefn_0010.wav etc...
What's the best way to go about doing this? I was thinking about using Octave/MATLAB, but that seems like overkill and resource-intensive for a for loop over a text file with tens of thousands of lines. Is grep or Perl the proper tool to use, or should I use another type of program, and if so, how could I do it?
I'm using 32-bit Ubuntu 10.04 with 6 GB of RAM.
Thanks

As you mentioned, MATLAB/Octave seems like overkill if you just want to split a text file into multiple files.
There are a thousand ways to do this (especially on a Unix system), so just pick yours.
One possibility is to use split, which goes like this:
split --lines=3 file prefix
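Note that split names the pieces prefixaa, prefixab, and so on. If you specifically want play1.pls, play2.pls, ... you can rename afterwards, or use a short Perl script instead. A minimal sketch, assuming the input is called playlist.pls and three lines per output file as in your example:
#!/usr/bin/perl
use strict;
use warnings;

my $lines_per_file = 3;   # lines per output playlist, as in the example
my $count = 0;            # lines read so far
my $out;                  # current output filehandle

open my $in, '<', 'playlist.pls' or die "Cannot open playlist.pls: $!";
while (my $line = <$in>) {
    if ($count % $lines_per_file == 0) {
        my $n = $count / $lines_per_file + 1;
        open $out, '>', "play$n.pls" or die "Cannot open play$n.pls: $!";   # reopening closes the previous handle
    }
    print {$out} $line;
    $count++;
}
close $in;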

Related

How to generate a 10000 lines test file from original file with 10 lines?

I want to test an application with a file containing 10000 lines of records (plus header and footer lines). I have a test file with 10 lines now, so I want to duplicate those lines 1000 times. I don't want to add C# code to my app to generate that file (it's only for testing), so I am looking for a different, simple way to do it.
What kind of tool can I use to do that? CMD? A Visual Studio/VS Code extension? Any thoughts?
If your data is textual, load the 10 records from your test file into an editor. Select all, copy, and paste at the end of the file. Repeat until the file is 10000+ lines long.
This procedure requires ceil(log_2(1000)) = 10 doubling cycles in your case; in general, ceil(log_2(<target_number_of_lines>/<base_number_of_lines>)).
Alternative (large files)
Modern editors should not have performance problems here. However, the same principle can be applied with the cat command. Assuming you copy the original file into a file named dup0.txt, proceed as follows:
cat dup0.txt dup0.txt >dup1.txt
cat dup1.txt dup1.txt >dup0.txt
leaving you with four times the original number of lines in dup0.txt. Keep alternating the two commands until you reach the target size.
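If you'd rather do it in one shot, a Perl one-liner can slurp the whole file and print it as many times as needed (a sketch; dup0.txt and big.txt are placeholder names):
perl -0777 -ne 'print $_ x 1000' dup0.txt > big.txt
Here -0777 makes Perl read the file as a single string, and the repetition operator x writes it out 1000 times.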

Extracting data from complex output text file using perl and placing into new text file

The complete output text file is hundreds of lines long, with relevant nuclear cross sections and a plethora of other data that I do not need for this particular problem. I am trying to extract the columns of data under "BURNUP" and the first "K-INF" from the file I attached and place them into a separate file. I am a newbie and have a similar Perl script from a professor. I have tried to adapt it to the information I am looking for, but the only output I get is the two print statements. Any suggestions?
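Without seeing the attached file it is hard to be specific, but the usual pattern is to flip a flag when the header line is found and then capture columns until the table ends. A rough sketch, assuming BURNUP and K-INF appear on the same header line and the values follow as whitespace-separated columns (output.txt, extracted.txt, and the column positions are assumptions to adjust to your actual layout):
#!/usr/bin/perl
use strict;
use warnings;

open my $in,  '<', 'output.txt'    or die "Cannot open output.txt: $!";
open my $out, '>', 'extracted.txt' or die "Cannot open extracted.txt: $!";

my $in_table = 0;
while (my $line = <$in>) {
    if ($line =~ /BURNUP/ && $line =~ /K-INF/) {    # header line of the table we want (assumption)
        $in_table = 1;
        next;
    }
    next unless $in_table;
    last if $line =~ /^\s*$/;                       # a blank line ends the table (assumption)
    my ($burnup, $kinf) = (split ' ', $line)[0, 1]; # take the first two columns (assumption)
    print {$out} "$burnup $kinf\n" if defined $kinf;
}
close $in;
close $out;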

en masse inline editing in an uncompressed PDF

I have a large PDF (~20 MB, 160 MB uncompressed).
I need to do a find and replace on the text in it, about 1000 times.
Here is what I tried.
1. Via SVG
Transform to SVG (Inkscape)
Read the SVG line by line and do the replacements in the file
Transform back to PDF
=> bad output, probably due to some geometric transform matrix in the SVG; the text is not well rendered
2. Creating ~1000 sed commands
Uncompress PDF
Perform each replace with a sed command
Recompress PDF
=> way too long: each sed command takes about 20 seconds, leading to several hours of processing
3. Read line by line and replace
Uncompress PDF
Read line by line the PDF
find text to be replaced
replace using perl
write line to a new file
Compress the new file
=> due to leftover data streams in the uncompressed PDF, the new file is apparently damaged (writing binary as lines of text)
I wonder if it would be possible to read line-by-line the uncompressed PDF, but do the editing directly in it. How could I do this?
I have searched for Perl in-place editing, but that performs the changes on the whole file at once, whereas I'd like to edit a single line at a time.
Other ideas are more than welcome ;)
Following advice, I used CAM::PDF; this was the most efficient and simple solution.
There is no difference between approaches 2 and 3. sed reads the input file line by line and writes changed lines to the output file. If you give it the -i switch, sed just opens the input file, unlinks it (which is what rm does), then opens an output file with the same name and writes into it. That's it, no magic involved. So if the content got damaged with Perl but not with sed, you were doing something different from what sed does. The main difference is that you can make the Perl script much faster at replacing many strings. See Using sed on text files with a csv.
The main trick is that you can compile one regexp for all the search-and-replace pairs, which then works in linear time:
my %replace = ( foo => 'bar' );
my $re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
while (<>) {
    s/$re/$replace{$1}/g;
    print;
}
You can use it with your original approach, but I would recommend doing it in a Perl script, which allows you to keep the regexp and the replace hash between PDF files. You can also try combining it with CAM::PDF; there is an example script, changepagestring.pl, in that distribution. You could also look at PDF::API2, which would require more work but may give better results. But remember, the PDF format is not intended for modification.
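For the CAM::PDF route, changepagestring.pl is the model to copy; a stripped-down sketch of the same idea might look like this (in.pdf, out.pdf, and the search/replace strings are placeholders):
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

my ($search, $replace) = ('oldtext', 'newtext');   # placeholders

my $pdf = CAM::PDF->new('in.pdf') or die "Cannot read in.pdf";
for my $page (1 .. $pdf->numPages()) {
    my $content = $pdf->getPageContent($page);
    next unless $content =~ s/\Q$search\E/$replace/g;   # only touch pages that matched
    $pdf->setPageContent($page, $content);
}
$pdf->cleanoutput('out.pdf');
Note that this only works when the text appears as a contiguous literal string in the page content stream; PDFs often split text into fragments, in which case a simple substitution will miss it.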
You can follow the pdftk steps as described in
How to find and replace text in a existing PDF file with PDFTK (or other command line application)
You can first split the PDF into smaller documents with a few pages each, replace the text, and then merge them back together, all using pdftk.
There is also the PDFEdit software (http://pdfedit.cz/en/index.html). It is a GUI app with a scripting interface; you can process individual pages and then do a find and replace using scripting commands. See if it loads your PDF.

Cleaning up text files with sed?

I have a bunch of text files that need cleaning up. Example
`E..4B?#.#...
..9J5.....P0.z.n9.9.. ........
.k#a..5
E...y^#.r...J5..
E...y_#.r...J5..
..9.P..n9..0.z............
….2..3..9…n7…..#.yr`
Is there any way sed can do this? Like notice weird patterns?
For this answer, I will assume that you have access to standard unix/linux tools.
Your file might be in some word-processor format. If so, the best way to get rid of the junk is to open it with that program. You may be able to find out which program with file:
$ file mysteryfile
mysteryfile: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1 ....
If that doesn't work, there is a standard Unix utility for extracting text from binary files. It is called strings:
$ strings mysteryfile
Some
Recovered Text
...
The behavior of strings can be fine-tuned with several options. See man strings.

Maximum number of file handles that can be opened in Perl

I am working on a Perl script that opens a huge file which has records in the format below. The script might run on Solaris 10 or HP-UX 11.0.
Filename1 , col1, col2
Filename1 , col1, col2
Filename2 , col1, col2
Filename3 , col1, col2
When I read the file name in the first field of the input file, I need to create a new file if it doesn't exist and print the rest of the fields to that file. There might be 13000 unique file names in the input file. What is the maximum number of file handles that I can open on Solaris 10 or HP-UX 11? Will I be able to open 13000 file handles? I am planning to use a hash to store the file handles for writing to the files and closing them. Also, how can I easily get the unique file names from the first field across the whole file? Is there an easy way to do it other than reading each line of the file?
The maximum number of file handles is OS-dependent (and is configurable).
See ulimit (manual page is here).
However, opening that many file handles is unreasonable. Have a rethink about your algorithm.
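One way to rethink it: instead of keeping thousands of handles open at once, collect the lines per filename first and write each output file in turn. A sketch, assuming the comma-separated layout from the question and that the whole input fits comfortably in memory (input.txt is a placeholder name):
#!/usr/bin/perl
use strict;
use warnings;

my %lines_for;    # filename => array of lines destined for that file

open my $in, '<', 'input.txt' or die "Cannot open input.txt: $!";
while (my $line = <$in>) {
    my ($file, $rest) = split /\s*,\s*/, $line, 2;   # first field is the output filename
    push @{ $lines_for{$file} }, $rest;
}
close $in;

# write each output file once, so only one handle is open at a time
for my $file (keys %lines_for) {
    open my $out, '>', $file or die "Cannot open $file: $!";
    print {$out} @{ $lines_for{$file} };
    close $out;
}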
No, there's no way to get all the unique filenames without reading the entire file. But you can generate this list as you're processing the file. When you read a line, add the filename as the key of a hash. At the end, print the keys of the hash.
I don't know what your system allows, but you can write to more files than your system lets you have open at once by using the FileCache module, which closes and re-opens handles behind the scenes as needed. It is a core Perl module, so you shouldn't even need to install it.
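A minimal sketch of the FileCache approach, following the module's documented interface (input.txt is a placeholder name, the maxopen value is arbitrary, and the field splitting assumes the comma-separated layout shown in the question):
#!/usr/bin/perl
use strict;
use warnings;
no strict 'refs';                # FileCache's classic interface uses the path string as the handle
use FileCache maxopen => 1000;   # keep at most ~1000 real descriptors open at a time

open my $in, '<', 'input.txt' or die "Cannot open input.txt: $!";
while (my $line = <$in>) {
    my ($file, $rest) = split /\s*,\s*/, $line, 2;   # first field is the output filename
    cacheout $file;              # opens with '>' on first use, '>>' if it has to reopen later
    print $file $rest;
}
close $in;
FileCache closes and re-opens handles when the limit is exceeded, re-opening in append mode, so nothing already written is lost and the script behaves as if all the files were open at once.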
There is no way to get the first column out of a text file without reading the whole file, because text files don't really have an internal structure of columns or even lines; they are just one long string of data. The only way to find each "line" is to go through the whole file and look for newline characters.
However, even huge files are generally processed quite quickly by Perl. This is unlikely to be a problem. Here is simple code to get the unique filenames (assuming your file is opened as FILE):
my %files;
while (<FILE>) { /^(\S+)/ and $files{$1}++; }
This ends up with a count of how many times each file occurs. It assumes that your filenames don't contain any spaces. I did a quick test of this with >30,000 lines, and it was instantaneous.