I'm a perl novice attempting to perform the following:
1) Take a user input
2) Match the input against column 1 of file 1 and store the corresponding column 2 values in a hash, hash of arrays, or hash of hashes. (The code below uses a hash of arrays, but I'm not sure that's optimal for step 3.)
3) Find all instances (if they exist) where column 1 of file 2 equals column 2 of file 1.
For simplicity I've provided a sample file below.
I'm attempting to take a user input of 'AAA', match it against column 1 of the input file, and use it as the key for all corresponding values in column 2.
My input file has multiple instances of 'AAA' in column 1 with different values in column 2, and there are also duplicate rows with 'AAA' and 'BBB' in columns 1 and 2. I believe I need a hash of hashes to output this properly, but I'm not sure how to approach it syntactically.
I've tried searching this site and found some examples but I'm afraid I'm only confusing myself more.
Example of input file.
AAA BBB
AAA CCC
AAA BBB
BBB DDD
CCC AAA
Example of my code
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use Data::Dumper;
#declare values
my %hash = ();
#Get protein name from user
print "Get column 1 value: ";
my $value = <STDIN>;
chomp $value;
#open input file
open FILE, "file" or die("unable to open file\n");
while(my $line = <FILE>) {
chomp($line);
my($column1, $column2) = split("\t", $line);
if ($column1 eq $value) {
push @{ $hash{$column1} }, $column2;
}
}
close FILE;
print Dumper(\%hash);
Code output
$VAR1 = {
'AAA' => [
'BBB',
'CCC'
]
};
My question is: will my current hash-of-arrays setup work best for reading column 1 of file 2 and comparing it with column 2 of file 1? Or should I approach it differently?
Your current code overwrites the value of $hash{$column1} on each iteration. You can use push to add a new element to the array instead of overwriting by changing this line:
$hash{$column1} = [$column2];
to
push @{ $hash{$column1} }, $column2;
Note that the data structure you're creating is not a hash of hashes but a hash of arrays.
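To answer the follow-up question: the hash of arrays is sufficient for the cross-file comparison. Here is a hedged sketch (untested, and it assumes file 2 is also tab-delimited and named file2): turn the stored column-2 values into a lookup hash, then test column 1 of every line of file 2 against it.
my %wanted = map { $_ => 1 } @{ $hash{$value} || [] }; # column-2 values for the chosen key
open my $fh2, '<', 'file2' or die "unable to open file2: $!";
while (my $line = <$fh2>) {
chomp $line;
my ($col1) = split /\t/, $line;
print "$line\n" if $wanted{$col1}; # column 1 of file 2 matched column 2 of file 1
}
close $fh2;
The lookup hash makes each membership test constant time, so file 2 is read only once regardless of how many matches file 1 produced.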
Related
I have the following command in my perl script:
my @files = `find $basedir/ -type f -iname '$sampleid*.summary.csv'`; # there are multiple summary.csv files in my basedir; I store them in an array
my $summary = `tail -n 1 $files[0]`; # each summary.csv contains a header line and a line with data; I fetch the last line here
chomp($summary);
my @sp = split(/,/,$summary); # I split based on ','
my $gender = $sp[11]; # the value from column 11 is stored in $gender
my $qc = $sp[2]; # the value from column 2 is stored in $qc
Now, I'm experiencing the situation where my *summary.csv files don't have the same number of columns. They do all have 2 lines, where the first line represents the header.
What I want now is not to store the values from column 11 in $gender, but to store the values from the column named 'Gender' in $gender.
How can I achieve this?
First try at solution:
my %hash = ();
my $header = `head -n 1 $files[0]`; #reading the header
chomp ($header);
my @colnames = split (/,/,$header);
my $keyfield = $colnames[#here should be the column with the name 'Gender']
push @{ $hash{$keyfield} };
my $gender = $sp[$keyfield]
You will have to read the header line as well as the data to know which column holds which information. This is most easily done by writing actual Perl code instead of shelling out to various command-line utilities. See further below for that solution.
Fixing your solution also requires a hash. You need to read the header line first, store the header fields in an array (as you've already done), and then read the data line. The data needs to be a hash, not an array. A hash is a map of keys and values.
# read the header and create a list of header fields
my $header = `head -n 1 $files[0]`;
chomp ($header);
my @colnames = split (/,/,$header);
# read the data line
my $summary = `tail -n 1 $files[0]`;
chomp($summary);
my %sp; # use a hash for the data, not an array
# use a hash slice to fill in the columns
@sp{@colnames} = split(/,/,$summary);
my $gender = $sp{Gender};
The tricky part here is this line.
@sp{@colnames} = split(/,/,$summary);
We have declared %sp as a hash, but we now access it with an @ sigil. That's because we are taking a hash slice, as indicated by the curly braces {}. The slice we take is all elements whose keys are the values in @colnames. Since there is more than one value, the result is not a scalar (with a $) any more; it is a list of values, so the sigil turns to @. We then use that list on the left-hand side (as an lvalue) and assign the result of the split to it.
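As a tiny standalone illustration of the same idiom (with made-up column names and data):
my @colnames = ('Sample', 'QC', 'Gender');
my %sp;
@sp{@colnames} = split /,/, 'S01,PASS,F';
print $sp{Gender}; # prints "F"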
Doing it with modern Perl
The following program will use File::Find::Rule to replace your find command, and Text::CSV to read the CSV file. It grabs all the files, then opens one at a time. The header line will be read first, and fed into the Text::CSV object, so that it can then give back a hash reference, which you can use to access every field by name.
I've written it in a way that it will only read one line for each file, as you said there are only two lines per file. You can easily extend that to be a loop.
use strict;
use warnings;
use File::Find::Rule;
use Text::CSV;
my $sampleid; # assumed to be set elsewhere
my $basedir;  # assumed to be set elsewhere
my $csv = Text::CSV->new(
{
binary => 1,
sep_char => ',',
}
) or die "Cannot use CSV: " . Text::CSV->error_diag;
my @files = File::Find::Rule->file()->name("$sampleid*.summary.csv")->in($basedir);
foreach my $file (@files) {
open my $fh, '<', $file or die "Can't open $file: $!";
# get the headers
my @cols = @{ $csv->getline($fh) };
$csv->column_names(@cols);
# read the data line
my $row = $csv->getline_hr($fh);
# do whatever you want with the row
print "$file: $row->{Gender}\n";
}
Please note that I have not tested this program.
I have a file which stores the result of a query, with columns separated by | (pipe signs). I want each column to be referred to by a hash.
e.g. the contents of the f.txt file are:
aaa|bbb|ccc
ddd|eee|fff
ggg|hhh|iii
I need the output as:
a{a}=> {aaa,ddd,ggg}
a{b}=> {bbb,eee,hhh}
a{c}=> {ccc,fff,iii}
Please advise on the same.
I think you'd be better off representing your data as an Array of Arrays instead of a Hash of Hashes. This is because you don't have any idea what your keys are. An array is an ordered list of data, and you can at least refer to a particular row and column of your data this way without making up keys for it.
If you know the names of your columns, you might want an Array of Hashes instead. This way, you can refer to a particular row with its element number in the array, but refer to the column via a name.
This is using an Array of Arrays:
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
FILE_NAME => "...",
};
open my $fh, "<", FILE_NAME;
#
# This builds your Array of Arrays
#
my @file_contents;
while ( my $row = <$fh> ) {
chomp $row;
push @file_contents, [ split /\s*\|\s*/, $row ]; # store each row as an array reference
}
#
# We count from 1, but arrays count from zero. That's why array indexes
# are one less than the row and column I am referring to.
#
say "The first row and second column is " . $file_contents[0]->[1];
say "The third row and third column is " . $file_contents[2]->[2];
#
# This reprints entire file with the separators
#
for my $row ( @file_contents ) {
my @columns = @{ $row };
say join "|", @columns;
}
Addendum
I am aware of what my columns and my hash keys are. I need to pass this output to an inbuilt API which takes its parameters as a hash only, so I can't really store the data in an array.
Are you saying that your columns are a hash with the column names being the hash key? This makes sense. If you're saying that each ROW has its own key, you have to give me an idea what that is, and where it comes from.
Here's a solution that creates your file contents in an array called @file_contents. It contains a reference to a hash representing each row of data, with the key being the column name and the value being the data of that column. You can then use this hash to update via your API:
This is done in two loops: one filling up @file_contents and another using your API (however that's done). There's no reason why it can't be done in a single loop.
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
FILE_NAME => "...",
};
# Names of the columns
my @column_names = qw( foo bar barfu fubar foofoo barbar );
open my $fh, "<", FILE_NAME;
#
# This builds your Array of Column hashes
#
my @file_contents;
while ( my $row = <$fh> ) {
chomp $row;
my @cols = split /\s*\|\s*/, $row;
my %col_hash;
for my $col_num ( 0 .. $#cols ) {
$col_hash{ $column_names[ $col_num ] } = $cols[ $col_num ];
}
push @file_contents, \%col_hash;
}
for my $cols_ref ( @file_contents ) {
my %col_hash = %{ $cols_ref };
API_CALL (..., ..., %col_hash );
}
If this is truly a hash of hashes, that is, the rows of your table are hash entries, you've got to let me know where the keys come from. It's very possible that the first column of your table is the key to the rest of the data. For example, let's say your table looks like this:
Visitors in thousands
city |jan |apr |july|oct|
Duluth|0 |0 |0 |0
NYC |500 |1200|1500|600
Miami |1200|1600|2300|200
I can imagine the city being the key to each row, and the month being the key to each column. I could talk about this:
say "The number of people in NYC in July is " . $visitors{NYC}->{july};
Is this the case with your data? If not, what is the key to your hash? Certainly, I'm not supposed to make up random values for the hash keys.
You've got to give a clearer description what you need.
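For illustration only, if the first column really were the row key, the hash of hashes could be built with a sketch like this (the filehandle and the month names are assumed from the table above):
my %visitors;
while (my $line = <$fh>) {
chomp $line;
next if $line =~ /^city/; # skip the header row
my ($city, @counts) = split /\s*\|\s*/, $line;
@{ $visitors{$city} }{qw(jan apr july oct)} = @counts; # hash slice fills one row
}
print $visitors{NYC}{july}; # 1500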
I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match, as they contain the same data in a different order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparison for your fields, but only after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;
my $cache; # A hash of hashes. Will be autovivified later.
while (<DATA>) {
chomp;
my @fields = split;
# create the normalized representation by lowercasing and sorting the fields
my @normalized_fields = sort map lc, @fields;
# find or create the path in the lookup
my $pointer = \$cache;
$pointer = \${$pointer}->{$_} for @normalized_fields;
# if this is an unknown value, make it known, and output the line
unless (defined $$pointer) {
$$pointer = 1; # set some defined value
print "$_\n"; # emit the unique line
}
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as the value of the lookup data structure, but in more complex scenarios the original fields or the line number could be stored here. For the sake of the example, I used space-separated values here, but you could replace the split with a call to Text::CSV or something.
This hash-of-hashes approach has sublinear space complexity, and worst case linear space complexity. The lookup time only depends on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for @normalized_fields;
unless (defined $$pointer->[1]) {
$$pointer->[1] = 1; ...
}
or introduce a default value for nonexistent fields (e.g. the separator of the original file). Here is an example with the NUL character:
my $fields = 3;
...;
die "record too long" if #fields > $fields;
...; # make normalized fields
push @normalized_fields, ("\x00") x ($fields - @normalized_fields);
...; # do the lookup
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;
my %data;
while (<DATA>) {
chomp;
my $key = join ',', sort map lc, split /,/;
push @{$data{$key}}, $.;
}
foreach my $list (values %data) {
next unless @$list > 1;
print "Lines ", join(', ', @$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
output
Lines 1, 3 are equivalent
Make two hash tables A and B
Stream through your input one line at a time
For the first line pair x and y, use each as a key and the other as its value in the two hash tables (e.g., $A{$x} = $y; $B{$y} = $x;)
For the second and subsequent line pairs, test whether the second field's value already exists as a key in either A or B. If it does, you have a reverse match; if not, repeat the addition process from step 3 to add the pair to the hash tables (see the sketch after this list).
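A hedged sketch of those steps (untested, assuming comma-separated pairs as in the question):
my (%A, %B);
while (<DATA>) {
chomp;
my ($x, $y) = split /,/;
if ((defined $A{$y} && $A{$y} eq $x) || (defined $B{$y} && $B{$y} eq $x)) {
print "Reverse match at line $.: $_\n";
}
else {
$A{$x} = $y; # forward direction
$B{$y} = $x; # reverse direction
}
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
This reports line 3 (y1,x1) as the reverse of line 1.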
To do a version of amon's answer without a hash table, if your data are numerical, you could:
Stream through input line by line, sorting fields one and two by numerical ordering
Pipe result to UNIX sort on first and second fields
Stream through sorted output line by line, checking if current line matches the previous line (reporting a reverse match, if true)
This has the advantage of using less memory than hash tables, but may take more time to process.
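As a rough sketch of that pipeline (a string sort is used here, since sample fields like x1 are not purely numeric; input.csv is a stand-in name):
perl -F',' -lane 'print join ",", sort @F' input.csv | sort | uniq -d
Each line is rewritten with its fields in sorted order, so a pair and its reverse become identical, and uniq -d then prints one copy of every line that occurs more than once.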
amon already provided the answer I would've provided, so please enjoy this bad answer:
#! /usr/bin/perl
use common::sense;
my $re = qr/(?!)/; # always fails
while (<DATA>) {
warn "Found duplicate: $_" if $_ =~ $re;
next unless /^(.*),(.*)$/;
die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
$re = qr/^\Q$2,$1\E$|$re/
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
I have a tab-delimited text file like this:
contig11 GO:100 other columns of data
contig11 GO:289 other columns of data
contig11 GO:113 other columns of data
contig22 GO:388 other columns of data
contig22 GO:101 other columns of data
And another like this:
contig11 3 N
contig11 1 Y
contig22 1 Y
contig22 2 N
I need to combine them so that each 'multiple' entry of one of the files is duplicated and populated with its data in the other, so that I get:
contig11 3 N GO:100 other columns of data
contig11 3 N GO:289 other columns of data
contig11 3 N GO:113 other columns of data
contig11 1 Y GO:100 other columns of data
contig11 1 Y GO:289 other columns of data
contig11 1 Y GO:113 other columns of data
contig22 1 Y GO:388 other columns of data
contig22 1 Y GO:101 other columns of data
contig22 2 N GO:388 other columns of data
contig22 2 N GO:101 other columns of data
I have little scripting experience, but have done this before with hashes/keys where e.g. "contig11" occurs only once in one of the files. But I can't even begin to get my head around this! I'd really appreciate some help or hints on how to tackle this problem.
EDIT: I have tried ikegami's suggestion (see answers) with the code below. However, it has produced the output I needed except for the GO:100 column onwards (the $rest in the script???). Any ideas what I'm doing wrong?
#!/usr/bin/env perl
use warnings;
open (GOTERMS, "$ARGV[0]") or die "Error opening the input file with GO terms";
open (SNPS, "$ARGV[1]") or die "Error opening the input file with SNPs";
my %goterm;
while (<GOTERMS>)
{
my($id, $rest) = /^(\S++)(,*)/s;
push @{$goterm{$id}}, $rest;
}
while (my $row2 = <SNPS>)
{
chomp($row2);
my ($id) = $row2 =~ /^(\S+)/;
for my $rest (@{ $goterm{$id} })
{
print("$row2$rest\n");
}
}
close GOTERMS;
close SNPS;
Look at your output. It's clearly produced by
for each row of the second file,
for each row of the first file with the same id,
print out the combined rows
So the question is: how do you find the rows of the first file with the same id as a row of the second file?
The answer is: You store the rows of the first file in a hash indexed by the row's id.
my %file1;
while (<$file1_fh>) {
my ($id, $rest) = /^(\S++)(.*)/s;
push @{ $file1{$id} }, $rest;
}
So the earlier pseudocode resolves to
while (my $row2 = <$file2_fh>) {
chomp($row2);
my ($id) = $row2 =~ /^(\S+)/;
for my $rest (@{ $file1{$id} }) {
print("$row2$rest");
}
}
#!/usr/bin/env perl
use strict;
use warnings;
open(my $GOTERMS, $ARGV[0])
or die("Error opening GO terms file \"$ARGV[0]\": $!\n");
open(my $SNPS, $ARGV[1])
or die("Error opening SNP file \"$ARGV[1]\": $!\n");
my %goterm;
while (<$GOTERMS>) {
my ($id, $rest) = /^(\S++)(.*)/s;
push @{ $goterm{$id} }, $rest;
}
while (my $row2 = <$SNPS>) {
chomp($row2);
my ($id) = $row2 =~ /^(\S+)/;
for my $rest (@{ $goterm{$id} }) {
print("$row2$rest");
}
}
I will describe how you can do this. You need to put each file into an array (each line is an array item). Then you just need to compare these arrays in the needed way. You need two loops. The main loop runs over each record of the array/file containing the strings you will compare against (in your example that is the second file). Inside this loop you need another loop over each record of the array/file whose records you will compare with. Then just check each record of one array against each record of the other and process the results.
foreach my $record2 (@array2) {
foreach my $record1 (@array1) {
if ($record2->{field} eq $record1->{field}) {
# here you create the string which you will show
my $res_string = $record2->{field} . $record1->{field};
print "$res_string\n";
}
}
}
Or don't use arrays at all. Just read the files and compare each line with each line of the other file. The general idea is the same.
I've been trying to write a program to read columns of text-formatted numbers into Perl variables.
Basically, I have a file with descriptions and numbers:
ref 5.25676 0.526231 6.325135
ref 1.76234 12.62341 9.1612345
etc.
I'd like to put the numbers into variables with different names, e.g.
ref_1_x=5.25676
ref_1_y=0.526231
etc.
Here's what I've got so far:
print "Loading file ...";
open (FILE, "somefile.txt");
#text=<FILE>;
close FILE;
print "Done!\n";
my $count=0;
foreach $line (@text){
@coord[$count]=split(/ +/, $line);
}
I'm trying to compare the positions written in the file to each other, so will need another loop after this.
Sorry, you weren't terribly clear on what you're trying to do and what "ref" refers to. If I misunderstood your problem, please comment and clarify.
First of all, I would strongly recommend against using variable names to structure data (e.g. using $ref_1_x to store x coordinate for the first row with label "ref").
If you want to store x, y and z coordinates, you can do so as an array of 3 elements, pretty much like you did - the only difference is that you want to store an array reference (you can't store an array as a value in another array in Perl):
my ($first_column, @data) = split(/ +/, $line); # Remove first "ref" column
$coordinates[$count++] = \@data; # Store the reference to the coordinate array
Then, to access the x coordinate for row 2, you do:
$coordinates[1]->[0]; # index 1 for row 2; then sub-index 0 for x coordinate.
If you insist on storing the 3 coordinates in named data structure, because sub-index 0 for x coordinate looks less readable - which is a valid concern in general but not really an issue with 3 columns - use a hash instead of array:
my ($first_column, @data) = split(/ +/, $line); # Remove first "ref" column
$coordinates[$count++] = { x => $data[0], y => $data[1], z => $data[2] };
# curly braces - {} - to store hash reference again
Then, to access the x coordinate for row 2, you do:
$coordinates[1]->{x}; # index 1 for row 2
Now, if you ALSO want to store the rows that have a first column value "ref" in a separate "ref"-labelled data structure, you can do that by wrapping the original @coordinates array into being a value in a hash with a key of "ref".
my ($label, @data) = split(/ +/, $line); # Save first "ref" label
$coordinates{$label} ||= []; # Assign an empty array ref
#if we did not create the array for a given label yet.
push @{ $coordinates{$label} }, { x => $data[0], y => $data[1], z => $data[2] };
# Since we don't want to bother counting per individual label,
# Simply push the coordinate hash at the end of appropriate array.
# Since coordinate array is stored as an array reference,
# we must dereference for push() to work, using @{ MY_ARRAY_REF } syntax
Then, to access the x coordinate for row 2 for label "ref", you do:
$label = "ref";
$coordinates{$label}->[1]->{x}; # index 1 for row 2 for $label
Also, your original example code has a couple of outdated idioms that you may want to write in a better style: use the 3-argument form of open() and check for errors on I/O operations like open(); use lexical filehandles; and read line by line instead of storing the entire file in a big array.
Here's a slightly modified version:
use strict;
my %coordinates;
print "Loading file ...";
open (my $file, "<", "somefile.txt") || die "Can't read file somefile.txt: $!";
while (<$file>) {
chomp;
my ($label, @data) = split(/ +/); # Splitting $_, where while puts the next line
$coordinates{$label} ||= []; # Assign empty array ref if not yet assigned
push @{ $coordinates{$label} }
, { x => $data[0], y => $data[1], z => $data[2] };
}
close($file);
print "Done!\n";
It is not clear what you want to compare to what, so can't advise on that without further clarifications.
The problem is you likely need a double-array (or hash or ...). Instead of this:
@coord[$count]=split(/ +/, $line);
Use:
$coord[$count++] = [ split(/ +/, $line) ];
Which puts the entire results of the split into a sub array. Thus,
print $coord[0][1];
should output "5.25676".