I'm trying to write a Perl script to process some 3+ GB text files that are structured like this:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations:
Count the number of delimiters per line and compare it to a fixed number (e.g. 5); lines that exceed that number should be written to file.control.
Remove duplicates from the file, keyed on substr($line, 0, 7) (the first 7 digits), while preserving the original order. I want that output in file.output.
I coded this as a simple shell script (just bash), but it took too long to run. The same script calling Perl one-liners was quicker, but I'm interested in a way to do this purely in Perl.
The code I have so far is:
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
# Calculate n. of delimiters
my $delim_cur_line = $line_in =~ y/"$delimiter"//;
# print "$commas \n"
if ( $delim_cur_line != $delim_amnt_per_line )
{
print {$file_hndl_ot_control} "$line_in";
}
# Remove duplicates by substr(0,7) maintain order
my substr_in = substr $line_in, 0, 11;
print if not $lines{$substr_in}++;
}
I want file.output to look like:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like :
(assuming delimiter control number is 6)
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Edit: here is the code I tried:
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");
while ( <> ) {
my $count = ($_ =~ y/x//);
print "$count \n";
# print $_;
if ( $count != $delim_amnt_per_line )
{
print fh1 $_;
}
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print fh2;
}
I don't know if I'm supposed to post new code here, but I tried the above, based on your example. What baffles me (I'm still very new to Perl) is that it doesn't write to either filehandle, yet when I redirected from the command line as you suggested, it worked perfectly. The problem is that I need to write to two different files.
It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar.
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that will count the number of occurrences of $delimiter in $_.
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
eval "sub delimiters { tr/$delimiter// }";
while ( <> ) {
next if delimiters() == $delim_amnt_per_line;
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
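For the two-file requirement in the question's edit, here is a minimal sketch rather than a definitive implementation. It assumes the rule from the question's own code (a line goes to the control file when its delimiter count differs from the expected number) and the control number 6 from the expected output. The names count_delims and filter_stream are made up for illustration, and the demo uses in-memory filehandles where a real run would open the three files. Note that the edited code prints to the bareword handle fh1 instead of the lexical $fh1 it opened, which would explain why nothing reached the files.

```perl
use strict;
use warnings;

# Count occurrences of a literal delimiter without eval;
# \Q...\E quotes any regex metacharacters in $delimiter.
sub count_delims {
    my ($line, $delimiter) = @_;
    my $count = () = $line =~ /\Q$delimiter\E/g;
    return $count;
}

# Stream lines from $in. Lines whose delimiter count differs from
# $expected go to $control; the first line seen for each 7-character
# prefix goes to $out, in input order.
sub filter_stream {
    my ($in, $control, $out, $delimiter, $expected) = @_;
    my %seen;
    while (my $line = <$in>) {
        print {$control} $line
            if count_delims($line, $delimiter) != $expected;
        print {$out} $line
            unless $seen{ substr($line, 0, 7) }++;
    }
    return;
}

# Demo on in-memory filehandles (assumption: real use opens files instead).
my $input = join '', map { "$_\n" } qw(
    1212123x534534534534xx4545454x232322xx
    0901001x876879878787xx0909918x212245xx
    1212123x534534534534xx4545454x232323xx
    1212133x534534534534xx4549454x232322xx
    4352342xx23232xxx345545x45454x23232xxx
);
my ($control_text, $output_text) = ('', '');
open my $in_fh,      '<', \$input        or die "in-memory input: $!";
open my $control_fh, '>', \$control_text or die "in-memory control: $!";
open my $out_fh,     '>', \$output_text  or die "in-memory output: $!";
filter_stream($in_fh, $control_fh, $out_fh, 'x', 6);
close $_ for $in_fh, $control_fh, $out_fh;
print "control:\n$control_text", "output:\n$output_text";
```

The %seen hash still holds one key per distinct prefix, so the memory caveat from the answer applies here as well.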
I have multiple CSV files that I want to merge.
Some of my sample CSV files are shown below.
M1DL1_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,821
IPR014729,Rossmann,327
IPR013785,Aldolase,304
IPR015421,Pyridoxal,224
IPR003594,ATPase,179
IPR000531,TonB receptor,150
IPR018248,EF-hand,10
M1DL2_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,728
IPR013785,Aldolase,300
IPR014729,Rossmann,261
IPR015421,Pyridoxal,189
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase,111
M1DL3_Interpro_sum.csv
IPR017690,Outer membrane,905
IPR013785,Aldolase,367
IPR014729,Rossmann,338
IPR015421,Pyridoxal,271
IPR003594,ATPase,158
IPR018248,EF-hand,3
To merge these files, I have tried the following code:
@ARGV = <merge_csvfiles/*.csv>;
print @ARGV[0],"\n";
open(PAGE,">outfile.csv") || die"Can't open outfile.csv\n";
while($i<scalar(@ARGV))
{
    open(FILE,@ARGV[$i]) || die"Can't open ...@ARGV[$i]...\n";
    $data.=join("",<FILE>);
    close FILE;
    print"file completed...",$i+1,"\n";
    $i++;
}
@data=split("\n",$data);
@data2=@data;
print scalar(@data);
for($i=0;$i<scalar(@data);$i++)
{
    @id1=split(",",@data[$i]);
    $id_1=@id1[0];
    @data[$j]=~s/\n//;
    if(@data[$i] ne "")
    {
        print PAGE "\n@data[$i],";
        for($j=$i+1;$j<scalar(@data2);$j++)
        {
            @id2=split(",",@data2[$j]);
            $id_2=@id2[0];
            if($id_1 eq $id_2)
            {
                @data[$j]=~s/\n//;
                print PAGE "@data2[$j],";
                @data2[$j]="";
                @data[$j]="";
                print "match found at ",$i+1," and ",$j+1,"\n";
            }
        }
    }
    print $i+1,"\n";
}
merge_csvfiles is the folder that contains all the files.
The output of the above code is:
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,IPR003594,ATPase,158
IPR000531,TonB receptor,150
IPR018248,EF-hand,10,IPR018248,EF-hand,3
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase
But I want the output in the following format:
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
IPR000531,TonB receptor,150,0,0,0,0,0,0
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
0,0,0,IPR011991,Winged,113,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
Does anybody have an idea how I can do this?
Thank you for the help.
As mentioned in Miguel Prz's comment, you haven't explained how you want the merge to be performed, but, judging by the "desired output" sample, it appears that what you want is to concatenate lines with matching IDs from all three input files into a single line in the output file, with "0,0,0" taking the place of any lines which don't appear in a given file.
So, then:
#!/usr/bin/env perl
use strict;
use warnings;
my @input_files = glob 'merge_csvfiles/*.csv';
my %data;
for my $i (0 .. $#input_files) {
    open my $infh, '<', $input_files[$i]
        or die "Failed to open $input_files[$i]: $!";
    while (<$infh>) {
        chomp;
        my $id = (split ',', $_, 2)[0];
        $data{$id}[$i] = $_;
    }
    print "Input file read: $input_files[$i]\n";
}
open my $outfh, '>', 'outfile.csv' or die "Failed to open outfile.csv: $!";
for my $id (sort keys %data) {
    my @merge_data;
    for my $i (0 .. $#input_files) {
        push @merge_data, $data{$id}[$i] || '0,0,0';
    }
    print $outfh join(',', @merge_data) . "\n";
}
The first loop collects all the lines from each file into a hash of arrays. The hash keys are the IDs, so the lines for that ID from all files are kept together, and the value for each key is (a reference to) an array of the line associated with that ID in each file; using an array for this allows us to keep track of values which are missing as well as those which are present.
The second loop then takes the keys of that hash (in alphabetical order) and, for each one, creates a temporary array of the values associated with that ID, substituting "0,0,0" for missing values, joins them into a single string, and prints that to the output file.
The results, in outfile.csv, are:
IPR000531,TonB receptor,150,0,0,0,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
0,0,0,IPR011991,Winged,113,0,0,0
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR017690,Outer membrane, omp85 target,821,IPR017690,Outer membrane, omp85 target,728,IPR017690,Outer membrane,905
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
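The program above emits IDs in sorted order, whereas the desired output in the question lists them in the order they are first encountered. If that order matters, one possible variation is sketched below. It assumes Perl 5.10 or later (for the // operator) and takes each file's lines as an array reference instead of reading from disk; merge_rows is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;

# Merge rows from several "files" (given here as array refs of lines) by ID,
# keeping IDs in first-seen order and padding missing entries with '0,0,0'.
sub merge_rows {
    my @files = @_;
    my (%data, @order);
    for my $i (0 .. $#files) {
        for my $line (@{ $files[$i] }) {
            my $id = (split ',', $line, 2)[0];
            # Remember each ID the first time it appears.
            push @order, $id unless exists $data{$id};
            $data{$id}[$i] = $line;
        }
    }
    return map {
        my $id = $_;
        join ',', map { $data{$id}[$_] // '0,0,0' } 0 .. $#files;
    } @order;
}

# Small demo with a subset of the sample rows.
my @merged = merge_rows(
    [ 'IPR003594,ATPase,179', 'IPR000531,TonB receptor,150' ],
    [ 'IPR011991,Winged,113' ],
    [ 'IPR003594,ATPase,158' ],
);
print "$_\n" for @merged;
```

The only addition relative to the sorted version is the @order array, which records each ID once at the point it first appears.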
Edit: Added explanations requested by OP in comments
Can you explain how my $id = (split ',', $_, 2)[0]; and $# work in this program?
my $id = (split ',', $_, 2)[0]; gets the text prior to the first comma in the last line of text that was read:
Because I didn't specify what variable to put the data in, while (<$infh>) reads it into the default variable $_.
split ',', $_, 2 splits up the value of $_ into a list of comma-separated fields. The 2 at the end tells it to produce at most 2 fields; the code will work fine without the 2, but, since I only need the first field, splitting into more parts isn't necessary.
Putting (...)[0] around the split command takes a list slice: it selects the first element of the list of fields that split returns. It's the same as if I'd written my @fields = split ',', $_, 2; my $id = $fields[0];, but shorter and without the extra variable.
$#array returns the highest-numbered index in the array @array, so for my $i (0 .. $#array) just means "loop over the indexes for all elements in @array". (Note that, if I hadn't needed the value of the index counter, I would have instead looped over the array's data directly, by using for my $filename (@input_files), but it would have been less convenient to keep track of the missing values if I'd done it that way.)
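To see these pieces in isolation, here is a small standalone sketch. The sample line comes from the question's data; the file names in @files are placeholders invented for the example.

```perl
use strict;
use warnings;

my $line = 'IPR017690,Outer membrane, omp85 target,821';

# With a limit of 2, split stops after the first comma.
my @fields = split ',', $line, 2;

# List slice: take element 0 directly from the list split returns.
my $id = (split ',', $line, 2)[0];

my @files      = ('a.csv', 'b.csv', 'c.csv');   # hypothetical names
my $last_index = $#files;                       # highest index in @files

print "id: $id\n";
print "rest: $fields[1]\n";
print "last index: $last_index\n";
print "$_: $files[$_]\n" for 0 .. $#files;
```

Because of the limit, the second field keeps its embedded commas, which is why the whole remainder of the line survives in $fields[1].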
I have written a Perl script that splits 3 columns into scalars and replaces various values in the second column using regexes. This part works fine, as shown below. What I would like to do, though, is change the first column ($item_id) into a series of sequential numbers that restart when the original (numeric) value of $item_id changes.
For example:
123
123
123
123
2397
2397
2397
2397
8693
8693
8693
8693
would be changed to something like this (in a column):
1
2
3
4
1
2
3
4
1
2
3
4
This could either replace the first column or be a new fourth column.
I understand that I might do this through a series of if-else statements and tried this, but that doesn't seem to play well with the while procedure I've already got working for me. - Thanks, Thom Shepard
open(DATA,"< text_to_be_processed.txt");
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;
The basic steps are:
Outside of the loop declare the counter and a variable holding the previous $item_id.
Inside the loop you do three things:
reset the counter to 1 if the current $item_id differs from the previous one, otherwise increase it
use that counter, e.g. print it
remember the previous value
In code, this could look something like this (untested):
my ($counter, $prev_item_id) = (0, '');
while (<DATA>) {
# do your thing
$counter = $item_id eq $prev_item_id ? $counter + 1 : 1;
$prev_item_id = $item_id;
print "$item_id\t$counter\t...\n";
}
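A self-contained version of the same idea, as a sketch only: it works on a plain list of IDs instead of the pipe-delimited file, and number_runs is a hypothetical helper name chosen for this example.

```perl
use strict;
use warnings;

# Return one counter per ID; the counter restarts at 1 whenever
# the ID differs from the one on the previous row.
sub number_runs {
    my @ids = @_;
    my ($counter, $prev) = (0, undef);
    my @counts;
    for my $id (@ids) {
        $counter = (defined $prev && $id eq $prev) ? $counter + 1 : 1;
        $prev = $id;
        push @counts, $counter;
    }
    return @counts;
}

my @ids    = (123, 123, 123, 2397, 2397, 8693);
my @counts = number_runs(@ids);
print "$ids[$_]\t$counts[$_]\n" for 0 .. $#ids;
```

Starting $prev as undef rather than an empty string avoids treating a genuinely empty first ID as a continuation of a previous run.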
This goes a little further than just what you asked...
Use lexical filehandles
[autodie] makes open throw an error automatically
Replace the call nums using a table
Don't assume the data is sorted by item ID
Here's the code.
use strict;
use warnings;
use autodie;
open(my $fh, "<", "text_to_be_processed.txt");
my %Callnum_Map = (
110 => '%I',
245 => '%T',
260 => '%U',
);
my %item_id_count;
while (<$fh>) {
chomp;
my($item_id,$callnum,$data) = split m{\|};
for my $search (keys %Callnum_Map) {
my $replace = $Callnum_Map{$search};
$callnum =~ s{$search}{$replace}g;
}
my $item_count = ++$item_id_count{$item_id};
print "$item_id\t$callnum\t$data\t$item_count\n";
}
By using a hash, it does not presume the data is sorted by item ID. So if it sees...
123|foo|bar
456|up|down
123|left|right
789|this|that
456|black|white
123|what|huh
It will produce...
1
1
2
1
2
3
This is more robust, assuming you want a count of how many times you've seen an item ID in the whole file. If you want to know how many times it's been seen consecutively, use Mortiz's solution.
Is this what you are looking for?
open(DATA,"< text_to_be_processed.txt") or die "Can't open text_to_be_processed.txt: $!";
my $counter = 0;
my $prev = '';
while (<DATA>)
{
    chomp;
    my ($item_id,$callnum,$data)=split(/\|/);
    $callnum=~s/110/\%I/g;
    $callnum=~s/245/\%T/g;
    $callnum=~s/260/\%U/g;
    # reset counter if the original $item_id differs from the previous one
    $counter = 0 if ($prev ne $item_id);
    # remember the original ID before it is overwritten below
    $prev = $item_id;
    ++$counter;
    # replace the first column with the running counter
    $item_id = $counter;
    print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;