I'm reading from a CSV file and populating a hash based on key-value pairs.
The first column of the record is the key, and the rest of the record is the value. However, for some files I need to use the first two columns as the key, with the rest of the record as the value. I have written it as below, branching on the number of key columns with an if statement, but I wanted to know if there is a better way to do this?
use strict;
use warnings;

open my $fh, '<:encoding(utf8)', 'Sample.csv'
    or die "Couldn't open Sample.csv";

my %hash;
my $KeyCols = 2;

while (<$fh>) {
    chomp;
    if ($KeyCols == 1) {
        next unless /^(.*?),(.*)$/;
        $hash{$1} = $2;
    }
    elsif ($KeyCols == 2) {
        next unless /^(.*?),(.*?),(.*)$/;
        $hash{$1.$2} = $3;
    }
}
Here is one way to allow for any number of key columns (not just 1 or 2), but it uses split instead of a regex:
use warnings;
use strict;

my %hash;
my $KeyCols = 2;

while (<DATA>) {
    chomp;
    my @cols = split /,/, $_, $KeyCols + 1;
    next unless @cols > $KeyCols;
    my $v = pop @cols;
    my $k = join '', @cols;
    $hash{$k} = $v;
}
__DATA__
a,b,c,d,e,f
q,w,e,r,t,y
This is a self-contained code example.
A big assumption is that your CSV file does not contain commas in the data itself. You should be using a CSV parser such as Text::CSV anyway.
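If you do switch to Text::CSV, the same key-building logic might look roughly like this. This is an untested sketch, not the one true way: it assumes the module is installed, reuses the file name from the question, and mirrors the join-with-no-separator key of the code above:
use strict;
use warnings;
use Text::CSV;

my $KeyCols = 2;
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, '<:encoding(utf8)', 'Sample.csv'
    or die "Couldn't open Sample.csv: $!";

my %hash;
while (my $row = $csv->getline($fh)) {
    next unless @$row > $KeyCols;
    my @key = splice @$row, 0, $KeyCols;        # first N columns form the key
    $hash{ join '', @key } = join ',', @$row;   # join '' matches $1.$2 above;
                                                # a separator would avoid key collisions
}
close $fh;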
It is often better to define variables in the first lines of the code; otherwise you have to jump all over the code to find them.
You can define the regex based on your $KeyCols, and the processing code stays the same as before.
use strict;
use warnings;
use feature 'say';

my $KeyCols = 2;
my $fname = 'Sample.csv';

my %hash;
my $re;
if ($KeyCols == 2) {
    $re = qr/^(.*?,.*?),(.*)$/;
} else {
    $re = qr/^(.*?),(.*)$/;
}

open my $fh, '<:encoding(utf8)', $fname
    or die "Couldn't open $fname";

while (<$fh>) {
    chomp;
    next unless /$re/;
    $hash{$1} = $2;
}
close $fh;
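If you want the regex variant to handle any number of key columns as well, the pattern can be built from $KeyCols up front. A sketch along the same lines (the while loop stays exactly as above):
my $KeyCols  = 2;                              # any value >= 1
my $key_part = join ',', ('.*?') x $KeyCols;   # e.g. '.*?,.*?' for 2 columns
my $re       = qr/^($key_part),(.*)$/;         # same shape as the hand-written patterns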
I am new to Perl and am trying to read a file with columns and create arrays.
I have a file with the following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I want to create an array for each unique item in the first column, holding that item's values from the second column.
e.g.:
@A = (15,20,33)
@B = (20,45)
@C = (32,78)
I tried the following code, which so far only prints the two columns:
use strict;
use warnings;

my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";

my %seen;
while (<FILE>)
{
    chomp;
    my $line = $_;
    my @elements = split (" ", $line);
    my $row_name = join "\t", @elements[0,1];
    print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks
Firstly some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use an empty file input operator (<>) which will return data from the files named in @ARGV without you having to open them. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. Far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';

my %data; # I'd give this a better name if I knew what your data was

while (<>) {
    chomp;
    my ($letter, $number) = split; # splits $_ on whitespace by default
    push @{ $data{$letter} }, $number;
}

# Walk the hash to see what we've got
for (sort keys %data) {
    say "$_ : @{ $data{$_} }";
}
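With the sample file.txt from the question, this prints:
A : 15 20 33
B : 20 45
C : 32 78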
Change the loop to be something like:
while (my $line = <FILE>)
{
    chomp($line);
    my @elements = split (" ", $line);
    push(@{ $seen{$elements[0]} }, $elements[1]);
}
This will create/append a list of each item as it is found, and result in a hash where the keys are the left items, and the values are lists of the right items. You can then process or reassign the values as you wish.
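If you still want a plain array for a given letter afterwards, you can dereference it out of the hash, e.g.:
my @A = @{ $seen{A} };   # (15, 20, 33) for the sample data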
I have a list of accession numbers that I want to pair randomly using the Perl script below:
#!/usr/bin/perl -w
use List::Util qw(shuffle);

my $file = 'randomseq_acc.txt';
my @identifiers = map { (split /\n/)[1] } <$file>;
chomp @identifiers;

#Shuffle them and put in a hash
@identifiers = shuffle @identifiers;
my %pairs = (@identifiers);

#print the pairs
for (keys %pairs) {
    print "$_ and $pairs{$_} are partners\n";
but I keep getting errors.
The accession numbers in the file randomseq_acc.txt are:
1094711
1586007
2XFX_C
Q27031.2
P22497.2
Q9TVU5.1
Q4N4N8.1
P28547.2
P15711.1
AAC46910.1
AAA98602.1
AAA98601.1
AAA98600.1
EAN33235.2
EAN34465.1
EAN34464.1
EAN34463.1
EAN34462.1
EAN34461.1
EAN34460.1
I needed to add the closing right curly brace to be able to compile the script.
As arrays are indexed from 0, (split /\n/)[1] returns the second field, i.e. whatever follows the newline on each line (i.e. nothing). Change it to [0] to make it work:
my @identifiers = map { (split /\n/)[0] } <$file>; # Still wrong.
The diamond operator needs a file handle, not a file name. Use open to associate the two:
open my $FH, '<', $file or die $!;
my @identifiers = map { (split /\n/)[0] } <$FH>;
Using split to remove a newline is not common. I'd probably use something else:
map { /(.*)/ } <$FH>
# or
map { chomp; $_ } <$FH>
# or, thanks to ikegami
chomp(my @identifiers = <$FH>);
So, the final result would be something like the following:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw(shuffle);

my $filename = '...';
open my $FH, '<', $filename or die $!;
chomp(my @identifiers = <$FH>);

my %pairs = shuffle(@identifiers);
print "$_ and $pairs{$_} are partners\n" for keys %pairs;
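A side note on how the pairing works: assigning a list to a hash takes its elements pairwise, so after the shuffle each identifier at an even index becomes the key for the one that follows it. A tiny illustration:
my %pairs = ('A', 'B', 'C', 'D');   # pairs A with B, and C with D
With an odd number of identifiers you would get an "Odd number of elements in hash assignment" warning, and the last key would have an undef partner.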
My first file looks like:
CHR id position
1 rs58108140 10583
1 rs189107123 10611
1 rs180734498 13302
1 rs144762171 13327
1 chr1:13957:D 13957
And my second file looks like:
CHR SNP POS RiskAl OTHER_ALLELE RAF logOR Pval
10 rs1999138 110140096 T C 0.449034245446375 0.0924443 1.09e-06
6 rs7741604 20839503 C A 0.138318264238111 0.127947 1.1e-06
8 rs1486006 82553172 G C 0.833130882716561 0.147456 1.12727730194884e-06
My script reads in the first file and stores it in an array, and then I would like to find rsIDs from column 2 of the first file that are in column 2 in the second file. I think I am having a problem with how I'm matching the expressions. Here is my script:
#! perl -w
use strict;
use warnings;

my $F = shift @ARGV;
my @snps;

open IN, "$F";
while (<IN>) {
    next if m/CHR/;
    my @L = split;
    push @snps, [$L[0], $L[1], $L[2]] if $L[0] !~ m/[XY]/;
}
close IN;

open IN, "DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt";
while (<IN>) {
    my @L = split;
    next if m/CHR/;
    foreach (@snps) {
        next if ($L[0] != ${$_}[0]);
        # if not on same chromosome
        if ($L[0] = ${$_}[0]) {
            # if on same chromosome
            if ($L[1] =~ ${$_}[1]) {
                print "$L[0] $L[1] ${$_}[2]\n";
                last;
            }
        }
    }
}
Your code doesn't seem to correspond to your description. You are comparing both the first and second columns of the file rather than just the second.
The main problems are:
You use $L[0] = ${$_}[0] to compare the first columns. This will do an assignment instead of a comparison. You should use $L[0] == ${$_}[0] instead or, better, $L[0] == $_->[0]
You use $L[1] =~ ${$_}[1] to compare the second columns. This will check whether ${$_}[1] is a substring of $L[1]. You could use anchors like $L[1] =~ /^${$_}[1]$/ but it's much better to just do a string comparison as $L[1] eq $_->[1]
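For example (illustrative values only), the pattern match succeeds on a mere substring where eq does not:
my ($col, $wanted) = ('rs1999138', 'rs199');
print "matches\n" if $col =~ /$wanted/;   # true: 'rs199' is a substring of 'rs1999138'
print "equal\n"   if $col eq $wanted;     # false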
The easiest way is to read the second file first so as to build a list of values that you want included from the first file. I have written it so that it does what your code looks like it's supposed to do, i.e. match the first two columns.
That would look like this
use strict;
use warnings;
use autodie;

my ($f1, $f2) = @ARGV;

my %include;

open my $fh2, '<', $f2;
while (<$fh2>) {
    my @fields = split;
    my $key = join '|', @fields[0,1];
    ++$include{$key};
}
close $fh2;

open my $fh1, '<', $f1;
while (<$fh1>) {
    my @fields = split;
    my $key = join '|', @fields[0,1];
    print "@fields[0,1,2]\n" if $include{$key};
}
close $fh1;
output
Unfortunately your choice of sample data doesn't include any records in the first file that have matching keys in the second, so there is no output!
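(To try it, pass both file names on the command line, first file then second, e.g. perl compare.pl file1.txt file2.txt, where compare.pl is whatever name you save the program under.)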
Update
This is a corrected version of your own program. It should work, but it is far more efficient and concise to use hashes, as above.
use strict;
use warnings;
use autodie;

my ($filename) = @ARGV;

my @snps;

open my $in_fh, '<', $filename;
<$in_fh>; # Discard header line

while (<$in_fh>) {
    my @fields = split;
    push @snps, \@fields unless $fields[0] =~ /[XY]/;
}
close $in_fh;

open $in_fh, '<', 'DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt';
<$in_fh>; # Discard header line

while (<$in_fh>) {
    my @fields = split;
    for my $snp (@snps) {
        next unless $fields[0] == $snp->[0] and $fields[1] eq $snp->[1];
        print "$fields[0] $fields[1] $snp->[2]\n";
        last;
    }
}
close $in_fh;
I have 3 or more files I need to merge; the data looks like this:
file 1
0334.45656
0334.45678
0335.67899
file 2
0334.89765
0335.12346
0335.56789
file 3
0334.12345
0335.45678
0335.98764
Expected output in file 4:
0334.45656
0334.45678
0334.89765
0334.12345
0335.67899
0335.12346
0335.56789
0335.45678
0335.98764
So far I have tried the following, but the data in the 4th file does not come out in sorted order:
#!/usr/bin/perl
my %hash;
my $outFile = "outFile.txt";

foreach $file (@ARGV)
{
    print "$file\n";
    open (IN, "$file") || die "cannot open file $!";
    open (OUT, ">>$outFile") || die "cannot open file $!";
    while ( <IN> )
    {
        chomp $_;
        ($timestamp, $data) = split (/\./, $_);
        $hash{$timeStamp}{'data'} = $data;
        if (defined $hash{$timeStamp})
        {
            print "$_\n";
            print OUT "$_\n";
        }
    }
}
close (IN);
close (OUT);
I wouldn't normally suggest this, but the unix utilities should be able to handle this just fine.
cat the 3 files together.
use sort to sort the merged file.
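For example, assuming the inputs are named file1, file2 and file3 (sort -m merges files that are each already sorted, without re-sorting):
sort -m file1 file2 file3 > file4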
However, using Perl, you could just do the following:
#!/usr/bin/perl
use strict;
use warnings;

my @data;
push @data, $_ while (<>);

# Because the numbers are all equal length, alpha sort will work here
print for sort @data;
However, as we've discussed, it's possible that the files will be extremely large. Therefore it will be more efficient, both in memory and speed, if you can take advantage of the fact that all the files are already sorted.
The following solution therefore streams the files, pulling out the next line in order on each pass of the while loop:
#!/usr/bin/perl
# Could name this catsort.pl
use strict;
use warnings;
use autodie;

# Initialize file handles
my @fhs = map {open my $fh, '<', $_; $fh} @ARGV;

# First line of each file
my @data = map {scalar <$_>} @fhs;

# Loop while a next line exists
while (@data) {
    # Pull out the next entry.
    my $index = (sort {$data[$a] cmp $data[$b]} (0..$#data))[0];
    print $data[$index];

    # Fill in next data at index.
    if (! defined($data[$index] = readline $fhs[$index])) {
        # End of that file
        splice @fhs, $index, 1;
        splice @data, $index, 1;
    }
}
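Saved as catsort.pl (the name suggested in the comment), it takes the input files on the command line and prints the merged result to standard output:
perl catsort.pl file1 file2 file3 > file4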
Using Miller's idea in a more reusable way,
use strict;
use warnings;

sub get_sort_iterator {
    my @fhs = map {open my $fh, '<', $_ or die $!; $fh} @_;
    my @d;

    return sub {
        for my $i (0 .. $#fhs) {
            # skip to next file handle if it doesn't exist or we already have a value in $d[$i]
            next if !$fhs[$i] or defined $d[$i];

            # was reading from the $fhs[$i] file handle a success?
            if ( defined($d[$i] = readline($fhs[$i])) ) { chomp($d[$i]) }
            # file handle at EOF, not needed any more
            else { undef $fhs[$i] }
        }
        # compare as numbers, return undef if no more data
        my ($index) = sort {$d[$a] <=> $d[$b]} grep { defined $d[$_] } 0..$#d
            or return;

        # return value from $d[$index], and set it to undef
        return delete $d[$index];
    };
}

my $iter = get_sort_iterator(@ARGV);

while (defined(my $x = $iter->())) {
    print "$x\n";
}
output
0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764
Supposing every input file is already in ascending order and has at least one line, this script could merge them in ascending order:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'reduce';

sub min_index {
    reduce { $_[$a] < $_[$b] ? $a : $b } 0 .. $#_;
}

my @fhs = map { open my $fh, '<', $_; $fh } @ARGV;
my @data = map { scalar <$_> } @fhs;

while (@data) {
    my $idx = min_index(@data);
    print "$data[$idx]";
    if (! defined($data[$idx] = readline $fhs[$idx])) {
        splice @data, $idx, 1;
        splice @fhs, $idx, 1;
    }
}
Note: this is basically the same as the second script offered by @Miller, but a bit clearer and more concise.
I suggest this solution, which uses a sorted array of hashes, each corresponding to an input file and containing a file handle (fh), the last line read (line), and the timestamp extracted from that line (timestamp).
The hash at the end of the array always corresponds to the input that has the smallest value for the timestamp, so all that is necessary is to repeatedly pop the next value from the array, print its data, read the next line and (if it hasn't reached eof) insert it back into the array in sorted order.
This could produce an appreciable increase in speed over the repeated sorting of all the data for each output line that other answers use.
Note that the program expects the list of input files as parameters on the command line, and sends its merged output to STDOUT. It also assumes that the input files are already sorted.
use strict;
use warnings;
use autodie;

my @data;

for my $file (@ARGV) {
    my $item;
    open $item->{fh}, '<', $file;
    insert_item($item, \@data);
}

while (@data) {
    my $item = pop @data;
    print $item->{line};
    insert_item($item, \@data);
}

sub insert_item {
    my ($item, $array) = @_;
    return if eof $item->{fh};
    $item->{line} = readline $item->{fh};
    ($item->{timestamp}) = $item->{line} =~ /^(\d+)/;
    my $i = 0;
    ++$i while $i < @$array and $item->{timestamp} < $array->[$i]{timestamp};
    splice @$array, $i, 0, $item;
}
output
0334.45656
0334.89765
0334.12345
0334.45678
0335.12346
0335.45678
0335.67899
0335.56789
0335.98764
I have a very simple Perl question regarding a pattern-matching problem.
I am reading a file with a list of names (fileA).
I would like to check if any of these names exist in another file (fileB).
if ($name -e $fileB){
do something
}else{
do something else
}
In other words, I want to check whether a pattern exists in a file.
I have tried
open(IN, $controls) or die "Can't open the control file\n";
while (my $line = <IN>) {
    if ($name =~ $line) {
        print "$name\tfound\n";
    } else {
        print "$name\tnotFound\n";
    }
}
This repeats itself: it checks and prints for every entry, rather than just reporting whether the name exists or not.
When you are comparing one list to another, you're interested in hashes. A hash is an array that is keyed, and the list itself has no order. A hash can only have a single instance of a particular key (but different keys can have the same data).
What you can do is go through the first file and create a hash keyed by each line. Then, you go through the second file and check to see if any of its lines match any keys in your hash:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;   # You don't have to check if "open" fails.

use constant {
    FIRST_FILE  => 'file1.txt',
    SECOND_FILE => 'file2.txt',
};

open my $first_fh, "<", FIRST_FILE;

# Get each line as a hash key
my %line_hash;
while ( my $line = <$first_fh> ) {
    chomp $line;
    $line_hash{$line} = 1;
}
close $first_fh;
Now each line is a key in your hash %line_hash. The data really doesn't matter; the important part is the key itself.
Now that I have my hash of the lines in the first file, I can read in the second file and see if that line exists in my hash:
open my $second_fh, "<", SECOND_FILE;
while ( my $line = <$second_fh> ) {
    chomp $line;
    if ( exists $line_hash{$line} ) {
        say qq(I found "$line" in both files);
    }
}
close $second_fh;
There's a map function too that can be used:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;   # You don't have to check if "open" fails.

use constant {
    FIRST_FILE  => 'file1.txt',
    SECOND_FILE => 'file2.txt',
};

open my $first_fh, "<", FIRST_FILE;
chomp ( my @lines = <$first_fh> );

# Get each line as a hash key
my %line_hash = map { $_ => 1 } @lines;
close $first_fh;

open my $second_fh, "<", SECOND_FILE;
while ( my $line = <$second_fh> ) {
    chomp $line;
    if ( exists $line_hash{$line} ) {
        say qq(I found "$line" in both files);
    }
}
close $second_fh;
I am not a great fan of map because I don't find it that much more efficient and it is harder to understand what is going on.
To check whether a pattern exists in a file, you have to open the file and read its content. The fastest way to check two lists for inclusion is to store the content in a hash:
#!/usr/bin/perl
use strict;
use warnings;

open my $LST, '<', 'fileA' or die "fileA: $!\n";
open my $FB,  '<', 'fileB' or die "fileB: $!\n";

my %hash;
while (<$FB>) {
    chomp;
    undef $hash{$_};
}

while (<$LST>) {
    chomp;
    if (exists $hash{$_}) {
        print "$_ exists in fileB.\n";
    }
}
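A side note on undef $hash{$_}: it creates the key with an undefined value, which is all that the exists test needs; storing a true value such as 1, as the other answers do, works just as well.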
I have just given algorithm-style code which is not tested, but I feel it does the job for you.
my @a;
my $matched;
my $line;

open(A, "fileA");
open(B, "fileB");

while (<A>)
{
    chomp;
    push @a, $_;
}

while (<B>)
{
    chomp;
    $line = $_;
    $matched = 0;
    for (@a) { if ($line =~ /$_/) { $matched = 1; last } }
    if ($matched)
    {
        # do something
    }
    else
    {
        # do something else
    }
}