Splitting a concatenated file based on header text - perl

I have a few very large files which are basically a concatenation of several small files and I need to split them into their constituent files. I also need to name the files the same as the original files.
For example the files QMAX123 and QMAX124 have been concatenated to:
;QMAX123 - Student
... file content ...
;QMAX124 - Course
... file content ...
I need to recreate the file QMAX123 as
;QMAX123 - Student
... file content ...
And QMAX124 as
;QMAX124 - Course
... file content ...
The original file's header ;QMAX<some number> is unique and only appears as a header in the file.
I used the script below to split the content of the files, but I haven't been able to adapt it to get the file names right.
awk '/^;QMAX/{close("file"f);f++}{print $0 > "file"f}' <filename>
So I can either adapt that script to name the file correctly or I can rename the split files created using the script above based on the content of the file, whichever is easier.
I'm currently using cygwin bash (which has perl and awk) if that has any bearing on your answer.

The following Perl should do the trick
use strict;
use warnings;

my $F;    # will hold a filehandle
while (<>) {
    if (/^;(\S+)/) {
        my $filename = $1;
        open $F, '>', $filename or die "can't open $filename: $!";
        print $F $_;    # keep the header line in the new file, as in the desired output
    } else {
        next unless defined $F;
        print $F $_ or warn "can't write: $!";
    }
}
Note that it discards any input appearing before the first header line (the next unless defined $F;). You may prefer to raise an error instead, or to collect such content in a default file. Let me know and I can change it.
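If you would rather keep any leading content than drop it, a minimal variation (a sketch only; the catch-all file name UNSORTED is just an illustration) is to open a default file before the loop:

use strict;
use warnings;

# Catch-all handle so content before the first header still goes somewhere
# (the file name UNSORTED is illustrative).
open my $F, '>', 'UNSORTED' or die "can't open UNSORTED: $!";
while (<>) {
    if (/^;(\S+)/) {
        # re-opening $F implicitly closes the previous file
        open $F, '>', $1 or die "can't open $1: $!";
    }
    print $F $_ or warn "can't write: $!";
}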

With Awk, it's as simple as
awk '/^;QMAX/ {filename = substr($1,2)} {print >> filename}' input_file
Note that >> opens each output file once and appends to it, so delete any files left over from a previous run before re-running; if there are very many output files you may also need to close() them as you go, as in your original script.
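Since you have Perl available too, the same split works as a one-liner (a sketch in the spirit of the script above; untested against your real data):

perl -ne 'if (/^;(\S+)/) { open $fh, ">", $1 or die $! } print $fh $_ if $fh' <filename>

As in the longer Perl version, lines before the first header are silently dropped.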

Related

How to rename multiple files in a folder with a specific format?

I have many files in a folder with the format '{galaxyID}-cutout-HSC-I-{#}-pdr2_wide.fits', where {galaxyID} and {#} are different numbers for each file. Here are some examples:
2185-cutout-HSC-I-9330-pdr2_wide.fits
992-cutout-HSC-I-10106-pdr2_wide.fits
2186-cutout-HSC-I-9334-pdr2_wide.fits
I want to change the format of all files in this folder to match the following:
2185_HSC-I.fits
992_HSC-I.fits
2186_HSC-I.fits
namely, I want to take out "cutout", the second number, and "pdr2_wide" from each file name. I would prefer to do this in either Perl or Python. For my Perl script, so far I have the following:
rename [-n];
my @parts=split /-/;
my $this=$parts[0].$parts[1].$parts[2].$parts[3].$parts[4].$parts[5];
$_ = $parts[0]."_".$parts[2]."_".$parts[3];
*fits
which gives me the error message
Not enough arguments for rename at ./rename.sh line 3, near "];" Execution of ./rename.sh aborted due to compilation errors.
I included the [-n] because I want to make sure the changes are what I want before actually doing it; either way, this is in a duplicated directory just for safety.
It looks like you are using the rename you get on Ubuntu (it's not the one that's on my ArchLinux box), but there are other ones out there. But, you've presented it oddly. The brackets around -n shouldn't be there and the ; ends the command.
The syntax, if you are using what I think you are, is this:
% rename -n -e PERL_EXPR file1 file2 ...
The Perl expression is the argument to the -e switch, and can be a simple substitution. Note that this expression is a string that you give to -e, so that probably needs to be quoted:
% rename -n -e 's/-\d+-pdr2_wide//' *.fits
rename(2185-cutout-HSC-I-9330-pdr2_wide.fits, 2185-cutout-HSC-I.fits)
And, instead of doing this in one step, I'd do it in two:
% rename -n -e 's/-cutout-/-/; s/-\d+-pdr2_wide//' *.fits
rename(2185-cutout-HSC-I-9330-pdr2_wide.fits, 2185-HSC-I.fits)
There are other patterns that might make sense. Instead of taking away parts, you can keep parts:
% rename -n -e 's/\A(\d+).*(HSC-I).*/$1-$2.fits/' *.fits
rename(2185-cutout-HSC-I-9330-pdr2_wide.fits, 2185-HSC-I.fits)
I'd be inclined to use named captures so the next poor slob knows what you are doing:
% rename -n -e 's/\A(?<galaxy>\d+).*(HSC-I).*/$+{galaxy}-$2.fits/' *.fits
rename(2185-cutout-HSC-I-9330-pdr2_wide.fits, 2185-HSC-I.fits)
From your description {galaxyID}-cutout-HSC-I-{#}-pdr2_wide.fits, I assume that cutout-HSC-I is fixed.
Here's a script that will do the rename. It takes a list of files on stdin, but you could adapt it to take the output of readdir:
#!/usr/bin/perl

master(@ARGV);
exit(0);

sub master
{
    my ($oldname);

    while ($oldname = <STDIN>) {
        chomp($oldname);

        # find the file extension/suffix
        my ($ix) = rindex($oldname, ".");
        next if ($ix < 0);

        # get the suffix
        my ($suf) = substr($oldname, $ix);

        # only take filenames of the expected format
        next unless ($oldname =~ /^(\d+)-cutout-(HSC-I)/);

        # get the new name
        my ($newname) = $1 . "_" . $2 . $suf;
        printf("OLDNAME: %s NEWNAME: %s\n", $oldname, $newname);

        # rename the file
        # change to "if (1)" to actually do it
        if (0) {
            rename($oldname, $newname) or
                die("unable to rename '$oldname' to '$newname' -- $!\n");
        }
    }
}
For your sample input file, here's the program output:
OLDNAME: 2185-cutout-HSC-I-9330-pdr2_wide.fits NEWNAME: 2185_HSC-I.fits
OLDNAME: 992-cutout-HSC-I-10106-pdr2_wide.fits NEWNAME: 992_HSC-I.fits
OLDNAME: 2186-cutout-HSC-I-9334-pdr2_wide.fits NEWNAME: 2186_HSC-I.fits
The above is how I usually do things but here's one with just a regex. It's fairly strict in what it accepts [for safety], but you can adapt as desired:
#!/usr/bin/perl

master(@ARGV);
exit(0);

sub master
{
    my ($oldname);

    while ($oldname = <STDIN>) {
        chomp($oldname);

        # only take filenames of the expected format
        next unless ($oldname =~ /^(\d+)-cutout-(HSC-I)-\d+-pdr2_wide([.].+)$/);

        # get the new name
        my ($newname) = $1 . "_" . $2 . $3;
        printf("OLDNAME: %s NEWNAME: %s\n", $oldname, $newname);

        # rename the file
        # change to "if (1)" to actually do it
        if (0) {
            rename($oldname, $newname) or
                die("unable to rename '$oldname' to '$newname' -- $!\n");
        }
    }
}
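Either script reads the list of file names from stdin, so a typical invocation (assuming you saved it as rename_fits.pl, an illustrative name) would be:

ls *.fits | perl rename_fits.pl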

How to print result STDOUT to a temporary blank new file in the same directory in Perl?

I'm new to Perl, so this may be a very basic case that I still can't understand.
Case:
The program tells the user to type the file name.
The user types the file name (1 or more files).
The program reads the content of the input file(s).
If there is a single input file, it just prints that file's entire content.
If there are multiple input files, it combines their contents in sequence.
It then prints the result to a temporary new file, located in the same directory as program.pl.
file1.txt:
head
a
b
end
file2.txt:
head
c
d
e
f
end
SINGLE INPUT program ioSingle.pl:
#!/usr/bin/perl
print "File name: ";
$userinput = <STDIN>;
chomp($userinput);

# read content from input file
open("FILEINPUT", $userinput) or die("can't open file");

# print the content while there is more of the file to read
while (<FILEINPUT>) {
    print;
}
close FILEINPUT;
SINGLE RESULT in cmd:
>perl ioSingle.pl
File name: file1.txt
head
a
b
end
I found tutorial code that combines content from multiple input files, but I can't adapt its while loop to the code above:
while ($userinput = <>) {
    print($userinput);
}
I'm stuck on making it work for multiple input files. How am I supposed to rework the code so my program gives a result like this?
EXPECTED MULTI-FILE RESULT in cmd:
>perl ioMulti.pl
File name: file1.txt file2.txt
head
a
b
end
head
c
d
e
f
end
I appreciate your response :)
A good way to start working on a problem like this, is to break it down into smaller sections.
Your problem seems to break down to this:
get a list of filenames
for each file in the list
display the file contents
So think about writing subroutines that do each of these tasks. You already have something like a subroutine to display the contents of the file.
sub display_file_contents {
    # filename is the first (and only) argument to the sub
    my $filename = shift;

    # Use a lexical filehandle and three-arg open
    open my $filehandle, '<', $filename or die $!;

    # Shorter version of your code
    print while <$filehandle>;
}
The next task is to get our list of files. You already have some of that too.
sub get_list_of_files {
    print 'File name(s): ';
    my $files = <STDIN>;
    chomp $files;

    # We might have more than one filename. Need to split input.
    # Assume filenames are separated by whitespace
    # (Might need to revisit that assumption - filenames can contain spaces!)
    my @filenames = split /\s+/, $files;

    return @filenames;
}
We can then put all of that together in the main program.
#!/usr/bin/perl
use strict;
use warnings;

my @list_of_files = get_list_of_files();

foreach my $file (@list_of_files) {
    display_file_contents($file);
}
By breaking the task down into smaller tasks, each one becomes easier to deal with. And you don't need to carry the complexity of the whole program in your head at one time.
p.s. But like JRFerguson says, taking the list of files as command line parameters would make this far simpler.
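For instance, here is a sketch of that simpler approach, building on the subs above: use command-line arguments when they are given, and fall back to prompting otherwise.

# Prefer file names passed on the command line; prompt only if none given.
my @list_of_files = @ARGV ? @ARGV : get_list_of_files();

foreach my $file (@list_of_files) {
    display_file_contents($file);
}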
The easy way is to use the diamond operator <> to open and read the files specified on the command line. This would achieve your objective:
while (<>) {
    chomp;
    print "$_\n";
}
Thus: ioSingle.pl file1.txt file2.txt
If this is the sole objective, you can reduce this to a command line script using the -p or -n switch like:
perl -pe '1' file1.txt file2.txt
perl -ne 'print' file1.txt file2.txt
These switches create implicit loops around the -e commands. The -p switch prints $_ after every loop as if you had written:
LINE:
while (<>) {
# your code...
} continue {
print;
}
Using -n creates:
LINE:
while (<>) {
# your code...
}
Thus, -p adds an implicit print statement.
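If you want to see the loop Perl actually builds, the bundled B::Deparse module will print it out:

perl -MO=Deparse -pe '1'

The output shows the implicit while (<>) loop with the continue { print } block described above (exact output varies slightly between Perl versions).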

Globbing and regular expression problem with bash script

I have a problem with my regex. My script is written in Perl:
#!/usr/bin/perl
# Swap columns 1 and 2
while (<>) {
    my @cols = split(/\|/);
    print "$cols[-3]/$cols[-4]\n";
}
exit;
I create an alias using the command:
alias inverseur="perl /laboratoire10/inverseur_colonnes.pl"
I am hoping to accomplish the following:
Write a "bash" script that creates a file container for each movie title (.avi) in the file.
The original file is: http://www.genxvideo.com/genxinventory-current.xls
but I have since renamed it to liste_films.csv .
All quotation marks, spaces, dashes, and other strange characters must be replaced by an underscore, "_".
The group would become the directory name, and the title of the movie will form the file name with the suffix .avi. In order to do this, the code must process the fields "title" and "class" in reverse order; you can reverse those fields with the alias "inverseur" created earlier.
The script will obviously create each directory in "/laboratoire10" before creating the .avi files. There should be 253 valid directories in total. Directories are created by piping through "|" to the command "xargs mkdir -pv".
I need help augmenting my current code with a command to find .avi files whose names contain the string "wood" in any mix of upper and lower case.
It is very hard to understand what exactly you are trying to do. Under the assumption that you have a |-separated CSV and wish to have a directory tree with CATEGORY/TITLE and a file named "cans.avi" under each directory, here is a one-liner Perl script.
perl -mText::CSV -e '$csv = Text::CSV->new({ sep_char=>"|",binary=>1,auto_diag => 1 } ) || die; open my $fh, "<", $ARGV[0] or die; while (my $row = $csv->getline($fh)) { $file = cleaner($row->[1])."/".cleaner($row->[0]); print "mkdir $file; touch $file/cans.avi\n"; } sub cleaner($) { my($f) = @_; $f =~ s/\W/_/g; $f;}' ~/tmp/genxinventory-current.csv
I converted the XLS file to | separated CSV using libreoffice, so your conversion mileage (kilometerage?) may vary.
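For readability, here is the same one-liner spread out as a script (a functionally equivalent sketch; like the one-liner, it only prints the mkdir/touch commands rather than executing them):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Replace anything that isn't a word character with an underscore.
sub cleaner {
    my ($f) = @_;
    $f =~ s/\W/_/g;
    return $f;
}

my $csv = Text::CSV->new({ sep_char => '|', binary => 1, auto_diag => 1 })
    or die;
open my $fh, '<', $ARGV[0] or die;
while (my $row = $csv->getline($fh)) {
    # column 1 is the category, column 0 the title
    my $file = cleaner($row->[1]) . '/' . cleaner($row->[0]);
    print "mkdir $file; touch $file/cans.avi\n";
}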

How can I check if contents of one file exist in another in Perl?

Requirement:-
File1 has contents like -
ABCD00000001,\some\some1\ABCD00000001,Y,,5 (this indicates there are 5 files in total in the unit)
File2 has contents such as ABCD00000001.
So what I need to do is check whether ABCD00000001 from File2 exists in File1:
if yes {
print the output to Output.txt until it finds another ',Y,,X' }
else { no, keep checking }
Anyone? Any help is greatly appreciated.
Hi Arkadiy, the output should be: any filename from File2 (e.g. ABCD00000001) found in File1, plus everything from one 'Y' marker to the next.
For example, File1's structure will be:
ABCD00000001,\some\some1\ABCD00000001,Y,,5
ABCD00000001,\some\some1\ABCD00000002
ABCD00000001,\some\some1\ABCD00000003
ABCD00000001,\some\some1\ABCD00000004
ABCD00000001,\some\some1\ABCD00000005
ABCD00000001,\some\some1\ABCD00000006,Y,,2
so the output should contain all lines between
ABCD00000001,\some\some1\ABCD00000001,Y,,5 and
ABCD00000001,\some\some1\ABCD00000006,Y,,2
#!/usr/bin/perl -w
use strict;

my $optFile = "C:\\Documents and Settings\\rgolwalkar\\Desktop\\perl_scripts\\SampleOPT1.opt";
my $tifFile = "C:\\Documents and Settings\\rgolwalkar\\Desktop\\perl_scripts\\tif_to_stitch.txt";

print "Reading OPT file now\n";
open(OPT, $optFile);
my @opt_in_array = <OPT>;
close(OPT);
foreach (@opt_in_array) {
    print();
}

print "\nReading TIF file now\n";
open(TIF, $tifFile);
my @tif_in_array = <TIF>;
close(TIF);
foreach (@tif_in_array) {
    print();
}
So all it does is read the two files. (FYI, I am new to programming.)
Try breaking up your problem into discrete steps. It seems that you need to do this (although your question is not very clear):
open file1 for reading
open file2 for reading
read file1, line by line:
for each line in file1, check if there is particular content anywhere in file2
Which part are you having difficulty with? What code have you got so far? Once you have a line in memory, you can compare it to another string using a regular expression, or perhaps a simpler form of comparison.
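To make those steps concrete, here is a rough sketch using Perl's range (flip-flop) operator to print everything from the key's ',Y,,' line through the next ',Y,,' line. The file names File1, File2 and Output.txt are taken from the question; the exact markers are an assumption based on the sample data:

#!/usr/bin/perl
use strict;
use warnings;

# The key to look for (the content of File2), e.g. ABCD00000001.
open my $key_fh, '<', 'File2' or die "can't open File2: $!";
chomp(my $key = <$key_fh>);
close $key_fh;

open my $in,  '<', 'File1'      or die "can't open File1: $!";
open my $out, '>', 'Output.txt' or die "can't open Output.txt: $!";
while (my $line = <$in>) {
    # The '...' range operator is true from the line where the key
    # appears with a ',Y,,' marker until the next ',Y,,' line
    # (it does not test the end condition on the start line itself).
    print $out $line
        if $line =~ /^\Q$key\E,.*\Q$key\E,Y,,/ ... $line =~ /,Y,,/;
}
close $in;
close $out;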
OK, I'll bite (partially)...
First, some general comments. use strict and -w are good, but you are not checking the result of open or explicitly stating your desired read/write mode.
The contents of your OPT file kinda sorta looks like it is CSV and the second field looks like a Windows path, true? If so, use the appropriate library from CPAN to parse CSV and verify your file names. Misery and pain can be the result otherwise...
As Ether stated earlier, you need to read the file OPT then match the field you want. If the first file is CSV, first you need to parse it without destroying your file names.
Here is a small snippet that will parse your OPT file. At this point, all it does is print the fields, but you can add logic to match to the other file easily. Just read (slurp) the entire second file into a single string and match with your chosen field from the first:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new();
my @opt_fields;

while (<DATA>) {
    if ($csv->parse($_)) {
        push @opt_fields, [ $csv->fields() ];
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}

foreach my $ref (@opt_fields) {
    # foreach my $field (@$ref) { print "$field\n"; }
    print "The anon array: @$ref\n";
    print "Use to match?: $ref->[0]\n";
    print "File name?: $ref->[1]\n";
}
__DATA__
ABCD00000001,\some\some1\ABCD00000001,Y,,5
ABCD00000001,\some\some1\ABCD00000002
ABCD00000001,\some\some1\ABCD00000003
ABCD00000001,\some\some1\ABCD00000004
ABCD00000001,\some\some1\ABCD00000005
ABCD00000001,\some\some1\ABCD00000006,Y,,2
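Following the slurp-and-match suggestion above, the matching step might look roughly like this (a standalone sketch; File2 comes from the question, and the key would in practice be the first field of each parsed row):

use strict;
use warnings;

my $key = 'ABCD00000001';    # e.g. $ref->[0] from the loop above

# Slurp the second file into a single string.
my $file2_contents = do {
    open my $fh, '<', 'File2' or die $!;
    local $/;                # slurp mode
    <$fh>;
};

print "$key found in File2\n" if index($file2_contents, $key) >= 0;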

How can I remove non-unique lines from a large file with Perl?

Duplicate data removal using Perl, called via a batch file within Windows (a DOS window in Windows opened via a batch file).
The batch file calls the Perl script, which carries out the actions; I have the batch file.
The script I have removes duplicate data correctly so long as the data file is not too big.
The problem that needs resolving is with larger data files (2 GB or more); at this size a memory error occurs when trying to load the complete file into an array for duplicate data removal.
The memory error occurs in the subroutine at:-
@contents_of_the_file = <INFILE>;
(A completely different method is acceptable so long as it solves this issue, please suggest).
The subroutine is:-
sub remove_duplicate_data_and_file
{
    open(INFILE, "<" . $output_working_directory . $output_working_filename)
        or dienice("Can't open $output_working_filename : INFILE :$!");
    if ($test ne "YES") {
        flock(INFILE, 1);
    }
    @contents_of_the_file = <INFILE>;
    if ($test ne "YES") {
        flock(INFILE, 8);
    }
    close(INFILE);
    ### TEST print "$#contents_of_the_file\n\n";

    @unique_contents_of_the_file = grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);

    open(OUTFILE, ">" . $output_restore_split_filename)
        or dienice("Can't open $output_restore_split_filename : OUTFILE :$!");
    if ($test ne "YES") {
        flock(OUTFILE, 1);
    }
    for ($element_number = 0; $element_number <= $#unique_contents_of_the_file; $element_number++) {
        print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
    }
    if ($test ne "YES") {
        flock(OUTFILE, 8);
    }
}
You are unnecessarily storing a full copy of the original file in @contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.
Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.
use strict;
use warnings;
use Digest::MD5 qw(md5);

my (%seen, %keep_line_nums);
my $in_file  = 'data.dat';
my $out_file = 'data_no_dups.dat';

open(my $in_handle,  '<', $in_file)  or die $!;
open(my $out_handle, '>', $out_file) or die $!;

while ( defined(my $line = <$in_handle>) ) {
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}

seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ) {
    print $out_handle $line if $keep_line_nums{$.};
}

close $in_handle;
close $out_handle;
You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...
Don't slurp - Read one line at a time.
Hash the line.
Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
If the key already exists, append the duplicate line number to the list corresponding to that value.
At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
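As a sketch of that first pass (reusing Digest::MD5 from the answer above; the file name is illustrative), the hash of lists maps each line's digest to the line numbers where it occurs:

use strict;
use warnings;
use Digest::MD5 qw(md5);

my %lines_by_digest;
open my $in, '<', 'data.dat' or die $!;
while (my $line = <$in>) {
    # Record every line number at which this line's digest occurs.
    push @{ $lines_by_digest{ md5($line) } }, $.;
}
close $in;

# Any list with more than one entry marks duplicate lines; the second
# pass keeps only the first line number from each list.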
Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.
How much RAM do you have?
If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.
I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.
I'd rather not risk it. I hope the Perl cavalry will arrive soon.
In the meantime, here's a link.
Here's a solution that works no matter how big the file is. But it doesn't run entirely in RAM, so it's slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.
The solution uses a temporary file that the program treats as a database with SQLite.
#!/usr/bin/perl
use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;

my $input            = shift;
my $temp             = 'unique.tmp';
my $cache_size_in_mb = 100;

unlink $temp if -f $temp;
my $cx = DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");

my $find   = $cx->prepare("select line from x where id = ?");
my $list   = $cx->prepare("select line from x order by line");
my $insert = $cx->prepare("insert into x (id, line) values(?, ?)");

open(FILE, $input) or die $!;

# First pass: record the line number of each line's first occurrence.
my ($next_line_number, $line, $sha);
my $line_number = 1;
while ($line = <FILE>) {
    $line =~ s/\s+$//s;
    $sha = sha1_base64($line);
    unless ($cx->selectrow_array($find, undef, $sha)) {
        $insert->execute($sha, $line_number);
    }
    $line_number++;
}

# Second pass: print only the recorded first occurrences, in order.
seek FILE, 0, 0;
$list->execute;
$line_number      = 1;
$next_line_number = $list->fetchrow_array;
while ($line = <FILE>) {
    $line =~ s/\s+$//s;
    if ($next_line_number == $line_number) {
        say $line;
        $next_line_number = $list->fetchrow_array;
        last unless $next_line_number;
    }
    $line_number++;
}
close FILE;
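A typical run (the script name is illustrative) redirects the de-duplicated output to a new file:

perl dedup_sqlite.pl uberbigfilename > uberbigfilename.unique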
Well, you could use the in-place edit mode of command-line Perl:
perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
Note that %seen still keeps one key per unique line, so memory use grows with the number of unique lines.
In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):
cat infile | sort | uniq > outfile
(or, more succinctly, sort -u infile > outfile)
This ought to work - no need for Perl at all - which may or may not solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).
EDIT: An alternative solution that's better able to deal with large files may be to use the following algorithm (a sketch of steps 1-3 follows the list):
Read INFILE line-by-line
Hash each line to a small value (e.g. a hash value mod 10)
Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10)
Close INFILE
Open and sort each tmp-# to a new file sortedtmp-#
Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each iteration to the end output file
This will be safer, for very large files, than slurping.
Parts 2 & 3 could be changed to a random number instead of a hash value mod 10.
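For illustration, a minimal Perl sketch of steps 1-3 (ten buckets as in the list, numbered from zero here since that is simpler with a modulus; the checksum is just one cheap choice of hash):

use strict;
use warnings;

# Open ten temporary bucket files, tmp-0 .. tmp-9.
my @bucket_fh;
for my $i (0 .. 9) {
    open $bucket_fh[$i], '>', "tmp-$i" or die "can't open tmp-$i: $!";
}

open my $in, '<', 'INFILE' or die $!;
while (my $line = <$in>) {
    # unpack '%32C*' computes a 32-bit additive checksum of the line.
    my $bucket = unpack('%32C*', $line) % 10;
    print { $bucket_fh[$bucket] } $line;
}
close $_ for $in, @bucket_fh;

# Identical lines always land in the same bucket, so each tmp-N can
# now be sorted and de-duplicated independently, then merged.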
Here's a script BigSort that may help (though I haven't tested it):
# BigSort
#
# sort a big file
#
# $1 input file
# $2 output file
#
# equivalent to: sort -t";" -k 1,1 $1 > $2
BigSort()
{
    if [ -s $1 ]; then
        rm $1.split.* > /dev/null 2>&1
        split -l 2500 -a 5 $1 $1.split.
        rm $1.sort > /dev/null 2>&1
        touch $1.sort1
        for FILE in `ls $1.split.*`
        do
            echo "sort $FILE"
            sort -t";" -k 1,1 $FILE > $FILE.sort
            sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
            mv $1.sort2 $1.sort1
        done
        mv $1.sort1 $2
        rm $1.split.* > /dev/null 2>&1
    else
        # works for an empty file!
        cp $1 $2
    fi
}