Separate and output numeric values to files based on similarity of the values - Perl

I think Perl can do this, but I am pretty new to Perl.
Hoping somebody can help me.
I have a file like this (the actual file is more than ten thousand lines; the values are in ascending order and some values are duplicated):
1
2
2
35
45
I want to separate those lines into separate files based on the similarity of the values (for example, lines whose values differ by less than 30 go in the same file).
outfile1
1
2
2
outfile2
35
45
Thanks

This is done very simply by just opening a new file every time it is necessary, i.e. for the first line of data and thereafter every time there is a gap of 30 or more.
This program expects the name of the input file as a parameter on the command line.
use strict;
use warnings;
use autodie;

my ($last, $fileno, $fh);

while (<>) {
    my ($this) = /(\d+)/;
    unless (defined $last and $this < $last + 30) {
        open $fh, '>', 'outfile' . ++$fileno;
    }
    print $fh $_;
    $last = $this;
}

It should really be easy. Just remember the previous value in a variable so you can see whether the difference is large enough. You also have to count the output files created so far so you can name a new file when needed.
#!/usr/bin/perl
use warnings;
use strict;

my $threshold = 30;
my $previous;
my $count_out = 0;
my $OUTPUT;

while (<>) {
    if (not defined $previous or $_ > $previous + $threshold) {
        open $OUTPUT, '>', "outfile" . $count_out++ or die $!;
    }
    print $OUTPUT $_;
    $previous = $_;
}

Related

Split file Perl

I want to split parts of a file. Here is what the start of the file looks like (it continues in same way):
Location Strand Length PID Gene
1..822 + 273 292571599 CDS001
906..1298 + 130 292571600 trxA
I want to split on the Location column, subtract 822-1, do the same for every row, and add the results together. So for these two rows the value would be: (822-1) + (1298-906) = 1213
How?
My code right now (I don't get any output at all in the terminal, it just continues to process forever):
use warnings;
use strict;

my $infile = $ARGV[0]; # Reading infile argument
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $line2 = <$IN>;
my $coding = 0; # Initialize coding variable
while (my $line = $line2) { # reading the file line by line
    # TODO Use split and do the calculations
    my @row = split(/\.\./, $line);
    my @row2 = split(/\D/, $row[1]);
    $coding += $row2[0] - $row[0];
}
print "total amount of protein coding DNA: $coding\n";
So what I get from my code if I put:
print "$coding \n";
at the end of the while loop just to test is:
821
1642
And so the first number is correct (822-1) but the next number doesn't make any sense to me, it should be (1298-906). What I want in the end outside the loop:
print "total amount of protein coding DNA: $coding\n";
is the sum of all the subtractions of every line i.e. 1213. But I don't get anything, just a terminal that works on forever.
As a one-liner:
perl -nE '$c += $2 - $1 if /^(\d+)\.\.(\d+)/; END { say $c }' input.txt
(Extracting the important part of that and putting it into your actual script should be easy to figure out).
Explicitly opening the file makes your code more complicated than it needs to be. Perl will automatically open any files passed on the command line and allow you to read from them using the empty file input operator, <>. So your code becomes as simple as this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $total;

while (<>) {
    my ($min, $max) = /(\d+)\.\.(\d+)/;
    next unless $min and $max;
    $total += $max - $min;
}

say $total;
If this code is in a file called adder and your input data is in add.dat, then you run it like this:
$ adder add.dat
1213
Update: And, to explain where you were going wrong...
You only ever read a single line from your file:
my $line2 = <$IN>;
And then you continually assign that same value to another variable:
while(my $line = $line2){ # reading the file line by line
The comment in this line is wrong. I'm not sure where you got that line from.
To fix your code, just remove the my $line2 = <$IN> line and replace your loop with:
while (my $line = <$IN>) {
    # your code here
}

How to avoid loading a file into memory in Perl

I have tried a script that sorts an input text file in descending order and prints only the top-usage customers.
The input text file contains:
NAME,USAGE,IP
For example:
Abc,556,10.2.3.5
bbc,126,14.2.5.6
and so on. This is a very large file and I am trying to avoid loading the file into memory.
I have tried the following script.
use warnings;
use strict;

my %hash = ();
my $file = $ARGV[0];
open (my $fh, "<", $file) or die "Can't open the file $file: ";
while (my $line = <$fh>)
{
    chomp ($line);
    my ($name, $key, $ip) = split /,/, $line;
    $hash{$key} = [ $name, $ip ];
}
my $count = 0;
foreach ( sort { $b <=> $a } keys %hash ) {
    my $value = $hash{$_};
    print "$_ @{$value} \n";
    last if (++$count == 5);
}
The output should be sorted based on usage and show the name and IP for each usage value.
I think you want to print the five lines of the file that have the highest value in the second column.
That can be done by a sort of insertion sort that checks each line of the file to see if it comes higher than the lowest of the five lines most recently found, but it's easier to just accumulate a sensible subset of the data, sort it, and discard all but the top five.
Here, I have the array @top containing lines from the file. When there are 100 lines in the array, it is sorted and reduced to the five maximal entries. Then the while loop continues to add lines to the array until it reaches the limit again or the end of the file has been reached, at which point the process is repeated. That way, no more than 100 lines from the file are ever held in memory.
I have generated a 1,000-line data file to test this, with random values between 100 and 2,000 in column 2. The output below is the result.
use strict;
use warnings 'all';

open my $fh, '<', 'usage.txt' or die $!;

my @top;

while ( <$fh> ) {
    push @top, $_;
    if ( @top >= 100 or eof ) {
        @top = sort {
            my ($aa, $bb) = map { (split /,/)[1] } ($a, $b);
            $bb <=> $aa;
        } @top;
        @top = @top[0..4];
    }
}

print @top;
output
qcmmt,2000,10.2.3.5
ciumt,1999,10.2.3.5
eweae,1998,10.2.3.5
gvhwv,1998,10.2.3.5
wonmd,1993,10.2.3.5
The standard way to do this is to create a priority queue that contains k items, where k is the number of items you want to return. So if you want the five lines that have the highest value, you'd do the following:
pq = new priority_queue
add the first five items in the file to the priority queue
for each remaining line in the file
    if value > lowest value on pq
        remove lowest value on the pq
        add new value to pq
When you're done going through the file, pq will contain the five items with the highest value.
To do this in Perl, use the Heap::Priority module.
This will be faster and use less memory than the other suggestions.
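Setting the Heap::Priority module aside, the same bounded top-k idea can be sketched in core Perl with a small array kept sorted in descending order. This is a minimal illustration, not the module's API; the data values below are invented for the demo.

```perl
#!/usr/bin/perl
# Sketch: keep only the k highest values seen so far in a tiny sorted array.
use strict;
use warnings;

my $k = 5;
my @top;    # kept sorted descending, at most $k elements

for my $value (556, 126, 90, 800, 42, 301, 777, 15, 660, 203) {
    # Skip cheaply if the value cannot make the top k
    next if @top == $k and $value <= $top[-1];
    # Insert, re-sort the (tiny) array, and trim back to k elements
    push @top, $value;
    @top = sort { $b <=> $a } @top;
    splice @top, $k if @top > $k;
}

print "@top\n";    # prints: 800 777 660 556 301
```

Because the array never holds more than k + 1 elements, the sort inside the loop is effectively constant-time, which is the same memory guarantee the priority-queue version gives.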
An algorithm that remembers the 5 biggest rows seen so far.
Each row is checked against the lowest remembered element. If it is bigger, it is stored in the array just before the next bigger item, and the lowest is shifted out.
use warnings;
use strict;

my $file = $ARGV[0];
my @keys = (0, 0, 0, 0, 0);
my @res;
open (my $fh, "<", $file) or die "Can't open the file $file: ";
while (<$fh>)
{
    my ($name, $key, $ip) = split /,/;
    next if ($key < $keys[0]);
    for (0..4) {
        if ($_ == 4 || $key < $keys[$_+1]) {
            @keys[0..$_-1] = @keys[1..$_] if ($_ > 0);
            $keys[$_] = $key;
            $res[$_] = [ $name, $ip ];
            last;
        }
    }
}
for (0..4) {
    print "$keys[4-$_] @{$res[4-$_]}";
}
Test on a file of 1M random rows (20 MB):
Last items (this algorithm):
Start 1472567980.91183
End 1472567981.94729 (duration 1.03546 seconds)
Full sort in memory (algorithm of @Rishi):
Start 1472568441.00438
End 1472568443.43829 (duration 2.43391 seconds)
Sort by parts of 100 rows (algorithm of @Borodin):
Start 1472568185.21896
End 1472568195.59322 (duration 10.37426 seconds)

How to find lines containing a match between two files in perl?

I'm a novice at using perl. What I want to do is compare two files. One is my index file that I am calling "temp." I am attempting to use this to search through a main file that I am calling "array." The index file has only numbers in it. There are lines in my array that have those numbers. I've been trying to find the intersection between those two files, but my code is not working. Here's what I've been trying to do.
#!/usr/bin/perl
print "Enter the input file:";
my $filename = <STDIN>;
open (FILE, "$filename") || die "Cannot open file: $!";
my @array = <FILE>;
close(FILE);
print "Enter the index file:";
my $temp = <STDIN>;
open (TEMP, "$temp") || die "Cannot open file: $!";
my @temp = <TEMP>;
close(TEMP);
my %seen = ();
foreach (@array) {
    $seen{$_} = 1;
}
my @intersection = grep($seen{$_}, @temp);
foreach (@intersection) {
    print "$_\n";
}
If I can't use intersection, then what else can I do to move each line that has a match between the two files?
For those of you asking for the main file and the index file:
Main file:
1 CP TRT
...
14 C1 MPE
15 C2 MPE
...
20 CA1 MPE
Index file
20
24
22
17
18
...
I want to put those lines that contain one of the numbers in my index file into a new array. So using this example, only
20 CA1 MPE would be placed into a new array.
My main file and index file are both longer than what I've shown, but that hopefully gives you an idea on what I'm trying to do.
I am assuming something like this?
use strict;
use warnings;
use Data::Dumper;

# creating arrays instead of reading from file just for demo
# based on the assumption that your files are 1 number per line
# and no need for any particular parsing
my @array = qw/1 2 3 20 60 50 4 5 6 7/;
my @index = qw/10 12 5 3 2/;
my @intersection = ();

my %hash1 = map { $_ => 1 } @array;
foreach (@index)
{
    if (defined $hash1{$_})
    {
        push @intersection, $_;
    }
}
print Dumper(\@intersection);
==== Out ====
$VAR1 = [
          '5',
          '3',
          '2'
        ];
A few things:
Always have use strict; and use warnings; in your program. This will catch a lot of possible errors.
Always chomp after reading input. Perl automatically adds \n to the end of lines read. chomp removes the \n.
Learn a more modern form of Perl.
Use mnemonic variable names. $temp doesn't cut it.
Use spaces to help make your code more readable.
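To see the chomp point concretely: a line read from a filehandle keeps its trailing newline, so an unchomped "20\n" never equals the string "20". A minimal demo:

```perl
#!/usr/bin/perl
# Why chomp matters for comparing input against plain strings.
use strict;
use warnings;

my $line = "20\n";    # what <STDIN> or <$fh> typically returns
print $line eq "20" ? "match\n" : "no match\n";    # prints "no match"
chomp $line;          # strip the trailing newline
print $line eq "20" ? "match\n" : "no match\n";    # prints "match"
```

This is exactly the mismatch that makes hash lookups between an index file and a data file silently fail.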
You never stated the errors you were getting. I assume it has to do with the fact that the input from your main file doesn't match your index file.
I use a hash to create an index that the index file can use via my ($index) = split /\s+/, $line;:
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);

print "Input file name: ";
my $input_file = <STDIN>;
chomp $input_file;    # Chomp Input!

print "Index file name: ";
my $index_file = <STDIN>;
chomp $index_file;    # Chomp Input!

open my $input_fh, "<", $input_file;
my %hash;
while ( my $line = <$input_fh> ) {
    chomp $line;
    #
    # Using split to find the item to index on
    #
    my ($index) = split /\s+/, $line;
    $hash{$index} = $line;
}
close $input_fh;

open my $index_fh, "<", $index_file;
while ( my $index = <$index_fh> ) {
    chomp $index;
    #
    # Now index can look up lines
    #
    if ( exists $hash{$index} ) {
        say qq(Index: $index Line: "$hash{$index}");
    }
    else {
        say qq(Index "$index" doesn't exist in file.);
    }
}
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

@ARGV = 'main_file';

open(my $fh_idx, '<', 'index_file');
chomp(my @idx = <$fh_idx>);
close($fh_idx);

while (defined(my $r = <>)) {
    print $r if grep { $r =~ /^[ \t]*$_/ } @idx;
}
You may wish to replace those hardcoded file names with names read from <STDIN>.
FYI: The defined call inside a while condition might be "optional".

Getting a unique random line (at each script run) from a text file with Perl

I have a text file called "input.txt" like this:
some field1a | field1b | field1c
...another approx 1000 lines....
fielaNa | field Nb | field Nc
I can choose any field delimiter.
I need a script that at each discrete run gets one unique (never repeated) random line from this file, until all lines are used.
My solution: I added one column to the file, so I have
0|some field1a | field1b | field1c
...another approx 1000 lines....
0|fielaNa | field Nb | field Nc
and processing it with the next code:
use 5.014;
use warnings;
use utf8;
use List::Util;
use open qw(:std :utf8);

my $file = "./input.txt";

# read all lines into array and shuffle them
open(my $fh, "<:utf8", $file);
my @lines = List::Util::shuffle map { chomp $_; $_ } <$fh>;
close $fh;

# search for the 1st line that has 0 at the start,
# change the 0 to 1,
# and rewrite the whole file
my $random_line;
for (my $i = 0; $i <= $#lines; $i++) {
    if ( $lines[$i] =~ /^0/ ) {
        $random_line = $lines[$i];
        $lines[$i] =~ s/^0/1/;
        open($fh, ">:utf8", $file);
        print $fh join("\n", @lines);
        close $fh;
        last;
    }
}
$random_line = "1|NO|more|lines" unless ( $random_line =~ /\w/ );
do_something_with_the_fields(split /\|/, $random_line);
exit;
It is a working solution, but not a very nice one, because:
the line order changes at each script run
it is not safe for concurrent script runs
How can I write it more effectively and more elegantly?
What about keeping a shuffled list of the line numbers in a different file, removing the first one each time you use it? Some locking might be needed to assure concurrent script-run safety.
From perlfaq5.
How do I select a random line from a file?
Short of loading the file into a database or pre-indexing the lines in
the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
srand;
rand($.) < 1 && ($line = $_) while <>;
This has a significant advantage in space over reading the whole file
in. You can find a proof of this method in The Art of Computer
Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
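For reference, the one-line reservoir sample above can be spelled out like this: the Nth item replaces the current pick with probability 1/N, which leaves every item equally likely once the input is exhausted. The list contents here are invented for the demo.

```perl
#!/usr/bin/perl
# Spelled-out reservoir sampling for a single random pick.
use strict;
use warnings;

my @lines = map { "line $_\n" } 1 .. 100;
my $pick;
my $n = 0;
for my $line (@lines) {
    ++$n;
    # The Nth line wins with probability 1/N, so all lines end up equally likely.
    $pick = $line if rand($n) < 1;
}
print $pick;
```

Only one line is held at a time, which is exactly the space advantage the FAQ describes.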
You can use the File::Random module which provides a function for that
algorithm:
use File::Random qw/random_line/;
my $line = random_line($filename);
Another way is to use the Tie::File module, which treats the entire
file as an array. Simply access a random array element.
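A minimal sketch of that Tie::File approach, building a throwaway demo file first (the file name and its contents are invented for illustration):

```perl
#!/usr/bin/perl
# Pick a random line by treating the file as an array via Tie::File.
use strict;
use warnings;
use Tie::File;

# Build a small demo file (in real use this file already exists)
my $file = 'demo_lines.txt';
open my $fh, '>', $file or die $!;
print $fh "line $_\n" for 1 .. 10;
close $fh;

tie my @lines, 'Tie::File', $file or die "Cannot tie $file: $!";
my $random = $lines[ rand @lines ];    # elements come back without the newline
print "$random\n";

untie @lines;
unlink $file;
```

Note that Tie::File only pages in the records it touches, so this also avoids slurping the whole file into memory.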
All Perl programmers should take the time to read the FAQ.
Update: To get a unique random line each time you're going to have to store state. The easiest way to store the state is to remove the lines that you've used from the file.
This program uses the Tie::File module to open your input.txt file as well as an indices.txt file.
If indices.txt is empty then it is initialised with the indices of all the records in input.txt in a shuffled order.
Each run, the index at the end of the list is removed and the corresponding input record displayed.
use strict;
use warnings;

use Tie::File;
use List::Util 'shuffle';

tie my @input, 'Tie::File', 'input.txt'
    or die qq(Unable to open "input.txt": $!);
tie my @indices, 'Tie::File', 'indices.txt'
    or die qq(Unable to open "indices.txt": $!);

@indices = shuffle(0..$#input) unless @indices;

my $index = pop @indices;
print $input[$index];
Update
I have modified this solution so that it populates a new indices.txt file only if it doesn't already exist and not, as before, simply when it is empty. That means a new sequence of records can be printed simply by deleting the indices.txt file.
use strict;
use warnings;

use Tie::File;
use List::Util 'shuffle';

my ($input_file, $indices_file) = qw( input.txt indices.txt );

tie my @input, 'Tie::File', $input_file
    or die qq(Unable to open "$input_file": $!);

my $first_run = not -f $indices_file;

tie my @indices, 'Tie::File', $indices_file
    or die qq(Unable to open "$indices_file": $!);

@indices = shuffle(0..$#input) if $first_run;

@indices or die "All records have been displayed";
my $index = pop @indices;
print $input[$index];

Calculate the average of small parts of a column, grouped by key, in Perl

This question is quite similar to this one How can I get the average and standard deviations grouped by key? but I don't manage to modify it to fit my problem.
I have a lot of files (.csv) with 7 columns, the last three columns look like this:
col5,col6,col7
1408,1,123
1408,2,234
1408,3,345
1408,4,456
1408,5,567
1408,6,678
1409,0,123
1409,1,234
1409,2,345
1409,3,456
1409,4,567
1409,5,678
1409,6,789
...
N,0,123
N,1,234
N,2,345
N,3,456
N,4,567
N,5,678
N,6,789
What I want to do is to calculate the average of the last column (col7) for all the values that have the same value in column 5 (col5), so 1408, 1409, 1410, ... until N and I don't know N. I want to print this average value next to the line (in col8) which contains a 3 in column 6 (col6). Do note that the value in column 6 (col6) goes from 0 to 6, but the first number of the file is not always 0. So what I want is:
col1,col2,col3,col4,col5,col6,col7,col8
bla,bla,bla,bla,1408,3,345,400.5
bla,bla,bla,bla,1409,3,456,456
...
bla,bla,bla,bla,N,3,456,456
I have some script I can use to calculate the average, but I have to be able to put my values into an array for that. Below is what I tried to do, but it doesn't work. Also, I'm just trying to learn Perl on my own, so if it looks like crap, I'm just trying!
open (FILE, "<", $dir.$file) or die;
my @lines = <FILE>;
foreach my $line (@lines) {
    my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split(/\,/, $line);
    push @arrays5, $col5;
}
foreach my $array5 (@arrays5) {
    foreach my $line (@lines) {
        my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split(/\,/, $line);
        if ($array5 == $col5) {
            push @arrays7, $col7;
        }
    }
}
close(FILE);
One way is to use the Text::CSV_XS module. It's not a built-in one, so it has to be installed from CPAN or a similar tool.
Content of script.pl:
use warnings;
use strict;
use Text::CSV_XS;

my ($offset, $col_total, $row3, $rows_processed);

## Check arguments to the script.
die qq[Usage: perl $0 <input-file>\n] unless @ARGV == 1;

## Open input file.
open my $fh, q[<], shift or die qq[Open error: $!\n];

## Create the CSV object.
my $csv = Text::CSV_XS->new or
    die qq[ERROR: ] . Text::CSV_XS->error_diag();

## Read file content seven lines at a time.
while ( my $rows = $csv->getline_all( $fh, $offset, 7 ) ) {

    ## End when there are no more rows.
    last unless @$rows;

    ## For each row in the group of seven...
    for my $row ( 0 .. $#{$rows} ) {

        ## Get value of last column.
        my $last_col_value = $rows->[ $row ][ $#{ $rows->[ $row ] } ];

        ## If the last column is not a number it is the header, so print it
        ## with the eighth column name appended and read the next one.
        unless ( $last_col_value =~ m/\A\d+\Z/ ) {
            $csv->print( \*STDOUT, $rows->[ $row ] );
            printf qq[,%s\n], q[col8];
            next;
        }

        ## Accumulate total amount for last column.
        $col_total += $last_col_value;

        ## Get the third row. The output will be this row with the
        ## average appended.
        if ( $rows->[ $row ][-2] == 3 ) {
            $row3 = [ @{ $rows->[ $row ] } ];
        }

        ## Count processed rows.
        ++$rows_processed;
    }

    ## Print row with its average.
    if ( $rows_processed > 0 && ref $row3 ) {
        $csv->print( \*STDOUT, $row3 );
        printf qq[,%g\n], $col_total / $rows_processed;
    }

    ## Reset variables for the next group.
    $col_total = $rows_processed = 0;
    undef $row3;
}
Content of infile:
col1,col2,col3,col4,col5,col6,col7
bla,bla,bla,bla,1408,1,123
bla,bla,bla,bla,1408,2,234
bla,bla,bla,bla,1408,3,345
bla,bla,bla,bla,1408,4,456
bla,bla,bla,bla,1408,5,567
bla,bla,bla,bla,1408,6,678
bla,bla,bla,bla,1409,0,123
bla,bla,bla,bla,1409,1,234
bla,bla,bla,bla,1409,2,345
bla,bla,bla,bla,1409,3,456
bla,bla,bla,bla,1409,4,567
bla,bla,bla,bla,1409,5,678
bla,bla,bla,bla,1409,6,789
Run it like:
perl script.pl infile
With following output:
col1,col2,col3,col4,col5,col6,col7,col8
bla,bla,bla,bla,1408,3,345,400.5
bla,bla,bla,bla,1409,3,456,456
Before we try to complete the answer, would you try this and tell me how close it comes to what you want?
#!/usr/bin/perl
use warnings;
use strict;

my $target = 3;
my %summary;

while (<>) {
    chomp;
    my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split /\,/;
    $summary{$col5}{total} += $col7;
    ++$summary{$col5}{count};
    $summary{$col5}{line} = $_ if $col6 == $target;
}

$summary{$_}{average} = $summary{$_}{total} / $summary{$_}{count}
    for keys %summary;

print "${summary{$_}{line}},${summary{$_}{average}}\n"
    for sort keys %summary;
If close enough, then you may wish to finish on your own. If not, then we can discuss the matter further.
Note that you can replace the <> with <FILE> if you prefer to read from your data file rather than from standard input.
IMPLEMENTATION NOTES
The code relies on Perl's autovivification feature. Observe for instance the line ++$summary{$col5}{count};, which seems initially to increment a nonexistent counter. However, this is actually standard Perl idiom. If you try to do something arithmetical (like incrementation) to an object that does not exist, Perl implicitly creates the object, initializes it to zero, and then does the thing you wanted (like incrementation) to it.
It would probably be unwise for a more sober programming language like C++ to autovivify, but years of experience suggest that autovivification strikes the right balance between order and convenience in a slightly less sober language like Perl.
On a more elementary level, the code will probably make sense only to those used to Perl's hashes. However, if you've not used Perl's hashes before, this would be as good a chance as any to learn them. The hash is a central pillar of the language, and the above makes a fairly typical example of its use.
In this case, we have a hash of hashes, which again is fairly typical.
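To see the autovivification and the hash-of-hashes in isolation, here is a stripped-down version of just the accumulation step, with invented key/value pairs standing in for col5 and col7:

```perl
#!/usr/bin/perl
# Autovivification demo: incrementing a counter inside a hash of hashes
# creates the intermediate structure automatically.
use strict;
use warnings;

my %summary;
for my $row ( [1408, 123], [1408, 234], [1409, 555] ) {
    my ($key, $value) = @$row;
    $summary{$key}{total} += $value;   # inner hash springs into existence
    ++$summary{$key}{count};           # counter starts at zero implicitly
}

for my $key ( sort keys %summary ) {
    my $avg = $summary{$key}{total} / $summary{$key}{count};
    print "$key: count=$summary{$key}{count} avg=$avg\n";
}
```

Note that `+=` and `++` on an undefined value are deliberately exempt from "uninitialized value" warnings, which is what makes this idiom clean under `use warnings`.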
This should do the trick. Replace Cols[index] appropriately.
use Data::Dumper;

open (FILE, "<", '/tmp/myfile') or die;
my @lines;
my (%Sum, %Count);
chomp(@lines = <FILE>);

foreach my $line (@lines) {
    next if $line =~ /col/;
    my @Cols = split /,/, $line;
    $Sum{$Cols[0]} += $Cols[2];
    $Count{$Cols[0]}++;
}

foreach my $line (@lines) {
    if ($line =~ /col/) {
        print "$line,colX\n";
        next;
    }
    my @Cols = split /,/, $line;
    if ($Cols[1] == 3) {
        print "$line,", $Sum{$Cols[0]}/$Count{$Cols[0]}, "\n";
    } else {
        print "$line,-1\n";
    }
}
Sample input /tmp/myfile
col5,col6,col7
1408,1,123
1408,2,234
1408,3,345
1408,4,456
1408,5,567
1408,6,678
1409,0,123
1409,1,234
Sample output
col5,col6,col7,colX
1408,1,123,-1
1408,2,234,-1
1408,3,345,400.5
1408,4,456,-1
1408,5,567,-1
1408,6,678,-1
1409,0,123,-1
1409,1,234,-1