Compare 3 tab-delimited files and print matches in Perl

I want to match column 1 of file 1 with column 1 of file 2 and then column 2 of file 1 with column 1 of file 3 and then print the matches. The columns in the files are separated by tabs. For example:
file 1:
fji01dde AIDJFMGKG
dlp02sle VMCFIJGM
cmr03lsp CKEIFJ
file 2:
fji01dde 25 30
dlp02sle 40 50
cmr03lsp 60 70
file 3:
AIDJFMGKG
CKEIFJ
The output needs to be:
fji01dde AIDJFMGKG 25 30
cmr03lsp CKEIFJ 60 70
I only want lines that are common in all three files.
The below code works well for the first two files, but I need to incorporate the third file. Any ideas?
#!/usr/bin/env perl
use strict;

my (%file1, %file2);

## Open the 1st file
open(A, "file1");
while (<A>) {
    chomp;
    ## Split the current line on tabs into the @F array.
    my @F = split(/\t/);
    push @{ $file1{$F[0]} }, @F[1..$#F];
}

## Open the 2nd file
open(B, "file2");
while (<B>) {
    chomp;
    ## Split the current line on tabs into the @F array.
    my @F = split(/\t/);
    if (defined($file1{$F[0]})) {
        foreach my $col (@{ $file1{$F[0]} }) {
            print "$F[0]\t$col\t@F[1..$#F]\n";
        }
    }
}

The algorithm seems to be...

    for each line in 1
        if 1.1 and 2.1 match AND
           1.2 appears in 3.1
        then
            combine 1.1, 1.2, 2.2 and 2.3
Because there are plenty of edge cases in parsing CSV files, don't do it by hand. Use Text::CSV_XS. It can also handle turning CSV files into hashes for us, and it's super efficient.
What we'll do is parse all the files. The first file is left as a list, but the other two are put into hashes keyed on the columns that we're going to search on.
NOTE: The names $data are horrible, but I don't know what sort of data these files represent.
use strict;
use warnings;

use Text::CSV_XS qw(csv);

my @csv_files = @ARGV;

# Parse the first CSV file into an array of arrays.
my $data1 = csv( in => $csv_files[0], sep_char => "\t" );

# Parse the other CSV files into hashes of rows keyed on the columns
# we're going to search on.
my $data2 = csv( in => $csv_files[1],
                 sep_char => "\t",
                 headers  => ["code", "num1", "num2"],
                 key      => "code"
);
my $data3 = csv( in => $csv_files[2],
                 sep_char => "\t",
                 headers  => ["CODE"],
                 key      => "CODE"
);

for my $row1 (@$data1) {
    my $row2 = $data2->{ $row1->[0] };
    my $row3 = $data3->{ $row1->[1] };
    if ( $row2 && $row3 ) {
        print join "\t", $row1->[0], $row1->[1], $row2->{num1}, $row2->{num2};
        print "\n";
    }
}
This reads all the files into memory. If the files are very large this can be a problem. You can reduce memory usage by iterating through file1 one row at a time instead of slurping it all in.
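For what it's worth, here is a minimal sketch of that streaming variant, keeping the same hashed lookups for file2 and file3 but reading file1 row by row with getline (file names and column labels as in the code above):

use strict;
use warnings;
use Text::CSV_XS qw(csv);

my @csv_files = @ARGV;

# Hash the two lookup files as before.
my $data2 = csv( in => $csv_files[1], sep_char => "\t",
                 headers => ["code", "num1", "num2"], key => "code" );
my $data3 = csv( in => $csv_files[2], sep_char => "\t",
                 headers => ["CODE"], key => "CODE" );

# Stream file1 one row at a time instead of slurping it all in.
my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 });
open my $fh, '<', $csv_files[0] or die "Can't open $csv_files[0]: $!";
while ( my $row1 = $csv->getline($fh) ) {
    my $row2 = $data2->{ $row1->[0] };
    my $row3 = $data3->{ $row1->[1] };
    if ( $row2 && $row3 ) {
        print join("\t", $row1->[0], $row1->[1], $row2->{num1}, $row2->{num2}), "\n";
    }
}
close $fh;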

Related

perl - fetch column names from file

I have the following command in my perl script:
my @files = `find $basedir/ -type f -iname '$sampleid*.summary.csv'`; # there are multiple summary.csv files in my basedir; I store them in an array
my $summary = `tail -n 1 $files[0]`; # each summary.csv contains a header line and a line with data; I fetch the last line here
chomp($summary);
my @sp = split(/,/, $summary); # I split based on ','
my $gender = $sp[11]; # the value from column 11 is stored in $gender
my $qc = $sp[2]; # the value from column 2 is stored in $qc
Now, I'm experiencing the situation where my *summary.csv files don't have the same number of columns. They do all have 2 lines, where the first line represents the header.
What I want now is not storing the values from column 11 in gender, but I want to store the values from the column 'Gender' in $gender.
How can I achieve this?
First try at a solution:
my %hash = ();
my $header = `head -n 1 $files[0]`; # reading the header
chomp($header);
my @colnames = split(/,/, $header);
my $keyfield = $colnames[ # here should be the column with the name 'Gender' ];
push @{ $hash{$keyfield} };
my $gender = $sp[$keyfield];
You will have to read the header line as well as the data to know which column holds which information. This is most easily done by writing actual Perl code instead of shelling out to various command-line utilities. See further below for that solution.
Fixing your solution also requires a hash. You need to read the header line first, store the header fields in an array (as you've already done), and then read the data line. The data needs to be a hash, not an array. A hash is a map of keys and values.
# read the header and create a list of header fields
my $header = `head -n 1 $files[0]`;
chomp($header);
my @colnames = split(/,/, $header);

# read the data line
my $summary = `tail -n 1 $files[0]`;
chomp($summary);

my %sp; # use a hash for the data, not an array
# use a hash slice to fill in the columns
@sp{@colnames} = split(/,/, $summary);

my $gender = $sp{Gender};
The tricky part here is this line.
@sp{@colnames} = split(/,/, $summary);
We have declared %sp as a hash, but we now access it with an @ sigil. That's because we are taking a hash slice, as indicated by the curly braces {}. The slice we take is all elements whose keys are the values in @colnames. There is more than one value, so the result is no longer a scalar (with a $) but a list, so the sigil turns to @. We then use that list on the left-hand side (that's called an lvalue) and assign the result of the split to it.
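As a stand-alone illustration of that slice assignment (the header names here are invented):

my @colnames = ('SampleID', 'QC', 'Gender');
my %sp;
@sp{@colnames} = ('S01', 'pass', 'F'); # hash slice as an lvalue
print $sp{Gender}, "\n";               # prints "F"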
Doing it with modern Perl
The following program will use File::Find::Rule to replace your find command, and Text::CSV to read the CSV file. It grabs all the files, then opens one at a time. The header line will be read first, and fed into the Text::CSV object, so that it can then give back a hash reference, which you can use to access every field by name.
I've written it in a way that it will only read one line for each file, as you said there are only two lines per file. You can easily extend that to be a loop.
use strict;
use warnings;

use File::Find::Rule;
use Text::CSV;

my $sampleid; # set these to your values
my $basedir;

my $csv = Text::CSV->new(
    {
        binary => 1,
        sep    => ',',
    }
) or die "Cannot use CSV: " . Text::CSV->error_diag;

my @files = File::Find::Rule->file()->name("$sampleid*.summary.csv")->in($basedir);

foreach my $file (@files) {
    open my $fh, '<', $file or die "Can't open $file: $!";

    # get the headers
    my @cols = @{ $csv->getline($fh) };
    $csv->column_names(@cols);

    # read the first data line
    my $row = $csv->getline_hr($fh);

    # do whatever you want with the row
    print "$file: ", $row->{Gender}, "\n";
}
Please note that I have not tested this program.
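For what it's worth, a sketch of the loop extension mentioned above, for files with more than one data row (equally untested; it assumes the header cell is literally "Gender"):

while ( my $row = $csv->getline_hr($fh) ) {
    print "$file: ", $row->{Gender}, "\n";
}
close $fh;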

Issues parsing a CSV file in perl using Text::CSV

I'm trying to use Text::CSV to parse this CSV file. Here is how I am doing it:
open my $fh, '<', 'test.csv' or die "can't open csv";
my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1, eol => "\n" });
$csv->column_names($csv->getline($fh));
while (my $row = $csv->getline_hr($fh)) {
    # use row
}
Because the file has 169,252 rows (not counting the headers line), I expect the loop to run that many times. However, it only runs 8 times and gives me 8 rows. I'm not sure what's happening, because the CSV just seems like a normal CSV file with \n as the line separator and \t as the field separator. If I loop through the file like this:
while (my $line = <$fh>) {
    my $fields = $csv->parse($line);
}
Then the loop goes through all rows.
Text::CSV_XS is silently failing with an error. If you put the following after your while loop:
my ($cde, $str, $pos) = $csv->error_diag ();
print "$cde, $str, $pos\n";
You can see if there were errors parsing the file and you get the output:
2034, EIF - Loose unescaped quote, 336
Which means the column:
GT New Coupe 5.0L CD Wheels: 18" x 8" Magnetic Painted/Machined 6 Speakers
contains unescaped quote characters (the " marks in 18" x 8" sit in an unquoted field with no escaping).
The Text::CSV perldoc states:
allow_loose_quotes
By default, parsing fields that have quote_char characters inside an unquoted field, like
1,foo "bar" baz,42
would result in a parse error. Though it is still bad practice to allow this format, we cannot help there are some vendors that make their applications spit out lines styled like this.
If you change your arguments to the creation of Text::CSV_XS to:
my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1,
                              eol => "\n", allow_loose_quotes => 1 });
The problem goes away, at least until row 105265, where error 2023 rears its head:
2023, EIQ - QUO character not allowed, 406
Details of this error in the perldoc:
2023 "EIQ - QUO character not allowed"
Sequences like "foo "bar" baz",qu and 2023,",2008-04-05,"Foo, Bar",\n will cause this error.
Setting your quote character to empty (passing quote_char => '' to Text::CSV_XS->new()) does seem to work around this and allows processing of the whole file. However, I would take the time to check whether that is a sane option for this CSV data.
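Putting that together, here is a sketch of the constructor with quote handling disabled, plus an explicit error check so the failure is no longer silent (quote_char => '' means quotes are treated as ordinary data, so double-check that none of your fields rely on quoting):

use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    sep_char   => "\t",
    binary     => 1,
    eol        => "\n",
    quote_char => '',   # treat " as an ordinary character
});

open my $fh, '<', 'test.csv' or die "can't open csv: $!";
$csv->column_names($csv->getline($fh));
while (my $row = $csv->getline_hr($fh)) {
    # use row
}

# report any parse error instead of failing silently
my ($cde, $str, $pos) = $csv->error_diag();
warn "$cde, $str, $pos\n" if $cde;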
TL;DR: your CSV is not in the greatest format, and you will have to work around it.

Parse tab delimited file into hash of array

I'm a perl novice attempting to perform the following:
1) Take a user input
2) Match the input against instances of that value in column 1 of file 1 and store the corresponding value from column 2 in a hash, hash of arrays, or hash of hashes (the code below stores it in a hash of arrays, but I'm not sure that is optimal for step 3).
3) Find all instances (if any exist) where the first column of file 2 equals column 2 of file 1.
For simplicity I've provided sample file below.
I'm attempting to take a user input of 'AAA', match it against column 1 of the input file, and use it as the key for all corresponding values in column 2.
My input file has multiple instances of 'AAA' in column 1 with different values for column 2, also there are multiple instances of 'AAA' and 'BBB' in columns 1 & 2. I believe in order to output this properly I need to use a hash of hash but I'm not sure syntactically how to approach it.
I've tried searching this site and found some examples but I'm afraid I'm only confusing myself more.
Example of input file.
AAA BBB
AAA CCC
AAA BBB
BBB DDD
CCC AAA
Example of my code
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use Data::Dumper;

# declare values
my %hash = ();

# get protein name from user
print "Get column 1 value: ";
my $value = <STDIN>;
chomp $value;

# open input file
open FILE, "file" or die("unable to open file\n");
while (my $line = <FILE>) {
    chomp($line);
    my ($column1, $column2) = split("\t", $line);
    if ($column1 eq $value) {
        push @{ $hash{$column1} }, $column2;
    }
}
close FILE;

print Dumper(\%hash);
Code output
$VAR1 = {
          'AAA' => [
                     'BBB',
                     'CCC'
                   ]
        };
My question is: will my current hash-of-arrays setup work best for reading column 1 of file 2 and comparing it with column 2 of file 1, or should I approach it differently?
Your current code overwrites the value of $hash{$column1} on each iteration. You can use push to add a new element to the array instead of overwriting by changing this line:
$hash{$column1} = [$column2];
to
push @{ $hash{$column1} }, $column2;
Note that the data structure you're creating is not a hash of hashes but a hash of arrays.
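To address the follow-up question about file 2: with a hash of arrays keyed on column 1, looking up by column 2 means scanning every array, so it may help to also build a set of the collected column-2 values. A minimal sketch under that assumption ('file2' and the variable names are placeholders):

# Build a set of the column-2 values collected from file 1...
my %wanted = map { $_ => 1 } map { @$_ } values %hash;

# ...then scan file 2 for lines whose first column is in that set.
open my $fh2, '<', 'file2' or die "unable to open file2: $!";
while (my $line = <$fh2>) {
    chomp $line;
    my ($column1, $column2) = split("\t", $line);
    print "$line\n" if $wanted{$column1};
}
close $fh2;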

Perl grep not returning expected value

I have the following code:
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;

my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;

while (<PIVOTFILE>) {   # loop through each line in file
    next if ($. == 1);  # skip first line (contains business segment code)
    next if ($. == 2);  # skip second line (contains transaction amount text)
    my @fields = split('\t', $_); # split fields for line into an array
    print scalar(grep $_, @fields), "\n";
}
Given that the data in the text file is this:
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
I would expect the output from the Perl script to be 2 3 3 4, given the number of defined elements in each line. The file is a tab-delimited text file with 8 columns.
Instead I get 3 4 3 4, and I have no idea why!
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
I suspect you have spaces mixed with the tabs in some places, and your grep test will consider " " true.
What does:
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper [<PIVOTFILE>];
show?
The problem should be in this line:
my @fields = split('\t', $_); # split fields for line into an array
(Note that split('\t', ...) does still split on tabs, because the string is compiled as a regex.) Your file doesn't seem to be tab-only separated, at least here on SO. I changed the split pattern to match arbitrary whitespace, ran the code on my machine, and got the "right" result:
my @fields = split(/\s+/, $_); # split fields for line into an array
Result:
2
3
3
4
As a side note:
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
Now I understand why you use grep to count array elements. That's important when your array contains undefined values, like here:
my @a;
$a[1] = 42;    # @a contains the list (undef, 42)
say scalar @a; # 2
or when you have manually deleted entries:
my @a = split /,/ => 'foo,bar'; # @a contains the list ('foo', 'bar')
delete $a[0];                   # @a contains the list (undef, 'bar')
say scalar @a;                  # 2
But in many cases, especially when you're using arrays to just store lists without operating on single array elements, scalar @a works perfectly fine.
my @a = (1 .. 17, 1 .. 25); # (1, 2, ..., 17, 1, 2, .., 25)
say scalar @a;              # 42
It's important to understand what grep does! In your case
print scalar(grep $_, @fields), "\n";
grep returns the list of true values from @fields, and then you print how many there are. But sometimes this isn't what you want/expect:
my @things = (17, 42, 'foo', '', 0); # even '' and 0 are things
say scalar grep $_ => @things;       # 3!
Because the empty string and the number 0 are false values in Perl, they won't get counted with that idiom. So if you want to know how long an array is, just use
say scalar @array; # number of array entries
If you want to count true values, use this:
say scalar grep $_ => @array; # number of true values
But if you want to count defined values, use this:
say scalar grep defined($_) => @array; # number of defined values
I'm pretty sure you already know this from the other answers on the linked page. In hashes the situation is a little more complex, because setting something to undef is not the same as deleting it:
my %h = (a => 0, b => 42, c => 17, d => 666);
$h{c} = undef; # still there, but undefined
delete $h{d};  # BAM! $h{d} is gone!
What happens when we try to count values?
say scalar grep $_ => values %h; # 1
because 42 is the only true value in %h.
say scalar grep defined $_ => values %h; # 2
because 0 is defined although it's false.
say scalar grep exists $h{$_} => qw(a b c d); # 3
because undefined values can exist. Conclusion:
know what you're doing instead of copy'n'pasting code snippets :)
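To make the distinction concrete, here is one throwaway array run through all three idioms:

use feature 'say';
my @fields = ('0000-13-I21', 600, '', 0, undef);
say scalar @fields;                   # 5: every slot, defined or not
say scalar grep defined($_), @fields; # 4: the undef is dropped
say scalar grep $_, @fields;          # 2: '' and 0 are false, dropped too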
There are not only tabs in the data but spaces as well, so splitting on whitespace works. Look below:
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;

while (<DATA>) {       # loop through each line in file
    next if ($. == 1); # skip first line (contains business segment code)
    next if ($. == 2); # skip second line (contains transaction amount text)
    my @fields = split(" ", $_); # split on whitespace (the " " pattern is special-cased)
    print scalar(@fields), "\n";
}

__DATA__
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
Output
2
3
3
4
Your code works for me. The problem may be that the input file contains some "hidden" whitespace fields (i.e. whitespace other than tabs). For instance:
A<tab><space><CR> gives two fields, A and <space><CR>
A<tab>B<tab><CR> gives three: A, B, and <CR> (remember, the end of line is part of the input!)
I suggest you chomp every line you use; other than that, you will have to clean the array of whitespace-only fields. E.g.
scalar(grep /\S/, @fields)
should do it.
A lot of great help on this question, and quickly too!
After a long, drawn-out learning process, this is what I came up with that worked quite well, with intended results.
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;

my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;

while (<PIVOTFILE>) {   # loop through each line in file
    next if ($. == 1);  # skip first line (contains business segment code)
    next if ($. == 2);  # skip second line (contains transaction amount text)
    chomp $_;           # remove the trailing newline
    my @fields = split(/\t/, $_); # split fields for line into an array
    print scalar(grep $_, @fields), "\n";
}

How can I correctly process this file containing tab separated values in Perl?

I am fairly new to Perl and know next to nothing about Perl's 'proper' syntax.
I have a text file that I use every day with a listing of names and other info for our users. This file changes daily; sometimes it has two rows in it (tab-delimited), and other times it has 100+ rows. The file also varies between 6 and 9 columns of data per row.
I have put together a Perl script that splits on tabs, but I run into trouble when one row has 5 populated columns and the next has 6. I cannot figure out how to get Perl to see that the first row only has 5 columns of data and to continue parsing the file from that point forward. It continues, but the output wraps lines strangely. How can I get around this issue? I hope that made sense.
You will have to post some code and possibly some sample data, but here is code that parses rows of different lengths without issue.
Script:
#!/usr/bin/perl
use strict;

while (<STDIN>)
{
    chomp;
    my @info = split("\t");
    print join(";", @info), "\n";
}
exit;
Test File:
jsmith 101 777-222-5555 Office 1 Building 1 Manager
aposse 104 777-222-5556 Office 2 Building 2 Stock Clerk
jbraza 105 777-222-5557 Office 3
mcuzui 102 777-222-5557 Office 3 Building 3 Cashier
ghines 107 777-222-5557 Office 3
Output:
%> test.pl < file.txt
jsmith;101;777-222-5555;Office 1;Building 1;Manager
aposse;104;777-222-5556;Office 2;Building 2;Stock Clerk
jbraza;105;777-222-5557;Office 3
mcuzui;102;777-222-5557;Office 3;Building 3;Cashier
ghines;107;777-222-5557;Office 3
You should post some sample data and code, and explain the desired behavior in terms of what the code currently does and what you want it to do. split will give you as many fields as there are in the input.
#!/usr/bin/perl
use strict; use warnings;

while ( my $row = <DATA> ) {
    last unless $row =~ /\S/;
    chomp $row;
    my @cells = split /\t/, $row;
    print "< @cells >\n";
}

__DATA__
1 2 3 4 5
a b c d e f
The Text::CSV module can be used for parsing tab-separated values as well. In fact, Text::CSV can parse values delimited by any character.
Relevant excerpt from its POD:
The module accepts either strings or files as input and can utilize any user-specified characters as delimiters, separators, and escapes so it is perhaps better called ASV (anything separated values) rather than just CSV.
#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new( { 'sep_char' => "\t" } );

open my $fh, '<', 'data.tsv' or die "Unable to open: $!";
my @rows;
while ( my $row_ref = $csv->getline($fh) ) {
    push @rows, $row_ref;
}
close $fh;

# switch the separator to '|' for output
$csv->sep_char('|');

for my $row_ref (@rows) {
    $csv->combine(@$row_ref);
    print $csv->string(), "\n";
}