perl match part of string in two files - perl

I'm using a perl script to look for matches between columns in two tab-delimited files. However for one column I only want to look for a partial match between two strings in two columns.
It concerns $row[4] of $table2 and $row{d} of $table1.
The values in $row[4] of $table2 look like this:
'xxxx'.
The values in $row{d} of $table1 look like this:
'xxxx.aaa'.
If the part before the '.' is the same, there is a match. If not, there is no match. I'm not sure how to implement this in my script. This is what I have so far. I only looks for complete matches between different columns. '...' denotes code that is not important for this question
#! /usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
use Getopt::Long qw(GetOptions);
...
...
chomp( my #header_table2 = split /\t/, <$table2> );
my %lookup;
while(<$table2>){
chomp;
my #row = split(/\t/);
$lookup{ $row[0] }{ $row[1] }{ $row[4] }{ $row[5] }{ $row[6] }{ $row[7] }{ $row[8] } = [ $row[9], $row[10] ];
}
my #header = do {
my $header = <$table1>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
print $table3 join ("\t", #header, qw/ name1 name2 /), "\n";
{
no warnings 'uninitialized';
while(<$table1>){
s/\t?\n\z//;
my %row;
#row{#header} = split /\t/;
print $table3 join ( "\t", #row{#header},
#{ $lookup{ $row{a} }{ $row{b} }{ $row{c} }{ $row{d} }{ $row{e} }{ $row{f} }{ $row{g} }
// [ "", "" ] }), "\n";
}
}

This is looking like a job for a database
The solution below isn't going to work, because you are building your %lookup hash with nine levels of keys ($row[0] .. $row[8]) , and accessing it with only seven levels ($row{a} .. $row{g}), so you're going to have to edit in the real situation
I see no reason to next your hashes so deeply. A single key formed by using join on the relevant fields will work fine and probably a little faster. I also see no reason to extract table2 fields into an array and table1 fields into a hash. An array seems fine in both cases
I've solved your immediate problem by copying each #row from table1 into array #key, and removing the last dot and anything following from the fourth element before building the $key string
In view of your history of adding a spare tab character before the newline at the end of each record, I've also added four die statements that verify the size of the header row and columns rows before continuing. You will probably need to tweak those values according to your real data
use strict;
use warnings 'all';
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
use Getopt::Long qw(GetOptions);
use constant TABLE1_COLUMNS => 9;
use constant TABLE2_COLUMNS => 11;
open my $table2, '<', 'table2.txt' or die $!;
my #header_table2 = do {
my $header = <$table2>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
die "Incorrect table 2 header count " . scalar #header_table2
unless #header_table2 == TABLE2_COLUMNS;
my %lookup;
while ( <$table2> ) {
chomp;
my #row = split /\t/;
die "Incorrect table 2 column count " . scalar #row
unless #row == TABLE2_COLUMNS;
my $key = do {
local $" = "\n";
"#row[0..8]";
};
$lookup{ $key } = [ #row[9,10] ];
}
open my $table1, '<', 'table1.txt' or die $!;
my #header = do {
my $header = <$table1>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
die "Incorrect table 1 header count " . scalar #header
unless #header == TABLE1_COLUMNS;
open my $table3, '>', 'table3.txt' or die $!;
print $table3 join ("\t", #header, qw/ name1 name2 /), "\n";
while ( <$table1> ) {
s/\t?\n\z//;
my #row = split /\t/;
die "Incorrect table 1 column count " . scalar #row
unless #row == TABLE1_COLUMNS;
my $key = do {
my #key = #row;
$key[3] =~ s/\.[^.]*\z//;
local $" = "\n";
"#key";
};
my $lookup = $lookup{ $key } // [ "", "" ];
print $table3 join("\t", #row, #$lookup), "\n";
}

You're going to have a scoping problem because your array #row and your hash %row both exist in completely different scopes.
But if you have variables (say, $foo and $bar) and you want to know if $foo starts with the contents of $bar followed by a dot, then you can do that using a regular expression check like this:
if ($foo =~ /^$bar\./) {
# match
} else {
# no match
}

Related

Best way to compare multi-dimension hash in perl

I want to write a script that can compare two multi-dimension hash to see whether they match or not. This two hash has same value of key, the comparison will done in a pair mean starting with $j=0 first compare $line[0] $line[1] then follow by $line[2] $line[3]
$hash{"key"}{$key}{"$row $col"}= "$line[$j] $line[$j+1]";
$hash1{"key"}{$key1}{"$row1 $col1"}="$line1[$j] $line1[$j+1]";
my $line and $line1 is take from a huge file and will process the huge file line by line. For example:
#line = 1 2 3 4 5 6 7 8 #fst line from file1
#line1 = 1 2 3 3 4 5 6 7 7 #fst line from file2
when $key and $key1 match the scripts will continue to compare the $row $col then lastly compare $line[$j] $line[$j+1]. I am trying to use one of the module Test::More that mention at previous similar question but it can only compare the 1st not matching and the output is in default format.
not ok 1 - data structures should be the same
# Failed test 'data structures should be the same'
# at hash_check1.pl line 80.
# Structures begin differing at:
# $got->{key}{100}{2 3} = '62 19'
# $expected->{key}{100}{2 3} = '12 24'
# Tests were run but no plan was declared and done_testing() was not seen.
Any best way to compare this type of multi-dimension hash?
8/22 Edited
If Test::More module is able to compare all the $key match for two hash i can take that as well but if i can output the matching and non matching in my own prefer format will be great. For example i would like to output like
$key at $row $col no match with value $line[$j] $line[$j+1] ( expected value $line1[$j] $line1[$j+1] )
8/22 Edited
Below is part of my code
use strict;
use warnings;
use Test::More;
open ( FILE1 , '<', "file.txt" ) or die $!;
open ( FILE2 , '<' , "file1.txt" ) or die $!;
chomp (my #file1 = <FILE1>);
chomp (my #file2 = <FILE2>);
my %hash=();
my %hash1=();
for ( $i =0 ; $i<=511 ; $i++ ) {
my #line = split(" ",$file1[$i]);
my #line1 = split(" ",$file2[$i]);
my $key = ($i+1)*10;
my $key1 = ($i+2)*10;
for ( $j=0; $j<=15 ; $j+=2){
my $col = hex($j);
my $col1 = hex($j+1);
$hash{"key"}{$key}{"$row $col1"}= "$line[$j] $line[$j+1]";
$hash1{"key"}{$key1}{"$row1 $col1"}= "$line1[$j] $line1[$j+1]";
}
}
##############comparison part start here###################
is_deeply(\%hash, \%hash1, 'data structures should be the same'); #can only print one mismatch
################Any better way?###########
Here is an example that will compare nested hashes with string values:
#! /usr/bin/env perl
use feature qw(say);
use warnings;
use strict;
my %hash;
my %hash1;
$hash{"key"}{A}{"2 3"}= "1 2";
$hash1{"key"}{A}{"2 3"}="2 3";
$hash{"key"}{B}{"2 3"}= "1 2";
$hash1{"key"}{C}{"2 3"}="2 3";
compare(\%hash, \%hash1);
sub compare {
return _compare( #_, "" );
}
sub get_key_str { return $_[0] . '{' . $_[1] . '}' }
sub _process_key {
my ( $h1, $h2, $info, $key, $str1, $str2 ) = #_;
if ( exists $h2->{$key} ) {
my $val1 = $h1->{$key};
my $val2 = $h2->{$key};
if ( ref $val1 eq "HASH" and ref $val2 eq "HASH" ) {
_compare( $val1, $val2, get_key_str( $info, $key ) );
}
else {
die "Expected string value" if ref $val1 or ref $val2;
if ( $val1 ne $val2 ) {
say "Value '$val1' in $str1 hash for key " . get_key_str( $info, $key )
. " does not match value '$val2' in $str2 hash";
}
}
}
else {
my $cur_key = get_key_str( $info, $key );
say "Got key $cur_key in $str1 hash, but missing in $str2 hash";
}
}
sub _compare {
my ( $h1, $h2, $info ) = #_;
my %processed_keys;
for (keys %$h1) {
_process_key( $h1, $h2, $info, $_, "first", "second" );
$processed_keys{$_}++;
}
for (keys %$h2) {
next if exists $processed_keys{$_};
_process_key( $h2, $h1, $info, $_, "second", "first" );
}
}
Output:
Got key {key}{B} in first hash, but missing in second hash
Value '1 2' in first hash for key {key}{A}{2 3} does not match value '2 3' in second hash
Got key {key}{C} in second hash, but missing in first hash

Handling Nested Delimiters in perl

use strict;
use warnings;
my %result_hash = ();
my %final_hash = ();
Compare_results();
foreach my $key (sort keys %result_hash ){
print "$key \n";
print "$result_hash{$key} \n";
}
sub Compare_results
{
while ( <DATA> )
{
my($instance,$values) = split /\:/, $_;
$result_hash{$instance} = $values;
}
}
__DATA__
1:7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\ d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2:7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Output
1
7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2
7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Iam trying to fetch value of each key and again trying to split the comma seperated value from result hash , if i find a semicolon in any value i would want to store the left and right values in separate hash keys.
Something like below
1.#split the value of result_hash{$key} again by , and see whether any chunk is seperated by ;
2. #every chunk without ; and value on left with ; should be stored in
#{$final_hash{"eto"}} = ['7802315095\d\d','7802315098\d\d','7802025002\d\d','7802025003\d\d','7802025004\d\d','7802025005\d\d','7802025006\d\d','7802025007\d\d'] ;
3.#Anything found on the right side of ; has to be stored in
#{$final_hash{"pro"}} = ['7802025001\d\d'] ;
Is there a way that i can handle everything in the subroutine? Can i make the code more simpler
Update :
I tried splitting the string in a single shot, but its just picking the values with semicolon and ignoring everything
foreach my $key (sort keys %result_hash ){
# print "$key \n";
# print "$result_hash{$key} \n";
my ($o,$t) = split(/,|;/, $result_hash{$key});
print "Left : $o \n";
print "Left : $t \n";
#push #{$final_hash{"eto"}}, $o;
#push #{$final_hash{"pro"}} ,$t;
}
}
My updated code after help
sub Compare_results
{
open my $fh, '<', 'Data_File.txt' or die $!;
# split by colon and further split by , and ; if any (done in insert_array)
my %result_hash = map { chomp; split ':', $_ } <$fh> ;
foreach ( sort { $a <=> $b } (keys %result_hash) )
{
($_ < 21)
? insert_array($result_hash{$_}, "west")
: insert_array($result_hash{$_}, "east");
}
}
sub insert_array()
{
my ($val,$key) = #_;
foreach my $field (split ',', $val)
{
$field =~ s/^\s+|\s+$//g; # / turn off editor coloring
if ($field !~ /;/) {
push #{ $file_data{"pto"}{$key} }, $field ;
}
else {
my ($left, $right) = split ';', $field;
push #{$file_data{"pto"}{$key}}, $left if($left ne '') ;
push #{$file_data{"ero"}{$key}}, $right if($right ne '') ;
}
}
}
Thanks
Update Added a two-pass regex, at the end
Just proceed systematically, analyze the string step by step. The fact that you need consecutive splits and a particular separation rule makes it unwieldy to do in one shot. Better have a clear method than a monster statement.
use warnings 'all';
use strict;
use feature 'say';
my (%result_hash, %final_hash);
Compare_results();
say "$_ => $result_hash{$_}" for sort keys %result_hash;
say '---';
say "$_ => [ #{$final_hash{$_}} ]" for sort keys %final_hash;
sub Compare_results
{
%result_hash = map { chomp; split ':', $_ } <DATA>;
my (#eto, #pro);
foreach my $val (values %result_hash)
{
foreach my $field (split ',', $val)
{
if ($field !~ /;/) { push #eto, $field }
else {
my ($left, $right) = split ';', $field;
push #eto, $left;
push #pro, $right;
}
}
}
$final_hash{eto} = \#eto;
$final_hash{pro} = \#pro;
return 1; # but add checks above
}
There are some inefficiencies here, and no error checking, but the method is straightforward. If your input is anything but smallish please change the above to process line by line, what you clearly know how to do. It prints
1 => ... (what you have in the question)
---
eto => [ 7802315095\d\d 7802315098\d\d 7802025002\d\d 7802025003\d\ d ...
pro => [ 7802025001\d\d ]
Note that your data does have one loose \d\ d.
We don't need to build the whole hash %result_hash for this but only need to pick the part of the line after :. I left the hash in since it is declared global so you may want to have it around. If it in fact isn't needed on its own this simplifies
sub Compare_results {
my (#eto, #pro);
while (<DATA>) {
my ($val) = /:(.*)/;
foreach my $field (split ',', $val)
# ... same
}
# assign to %final_hash, return from sub
}
Thanks to ikegami for comments.
Just for the curiosity's sake, here it is in two passes with regex
sub compare_rx {
my #data = map { (split ':', $_)[1] } <DATA>;
$final_hash{eto} = [ map { /([^,;]+)/g } #data ];
$final_hash{pro} = [ map { /;([^,;]+)/g } #data ];
return 1;
}
This picks all characters which are not , or ;, using the negated character class, [^,;]. So that is up to the first either of them, left to right. It does this globally, /g, so it keeps going through the string, collecting all fields that are "left of" , or ;. Then it cheats a bit, picking all [^,;] that are right of ;. The map is used to do this for all lines of data.
If %result_hash is needed build it instead of #data and then pull the values from it with my #values = values %hash_result and feed the map with #values.
Or, broken line by line (again, you can build %result_hash instead of taking $data directly)
my (#eto, #pro);
while (<DATA>) {
my ($data) = /:(.*)/;
push #eto, $data =~ /([^,;]+)/g;
push #pro, $data =~ /;([^,;]+)/g;
}

match columns on different lines and sum

I have a csv with about 160,000 lines, it looks like this:
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
I would like to pair lines where column 1 is the same, column 3 matches column 2 and where column 6 is a '+' while on the other line it is a '-'. If this is true I would like to sum column 4 and column 5.
My desired out put would be
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,-
the best solution I can think of is to duplicate the file and then match columns between the file and it's duplicate with perl:
#!/usr/bin/perl
use strict;
use warnings;
open my $firstfile, '<', $ARGV[0] or die "$!";
open my $secondfile, '<', $ARGV[1] or die "$!";
my ($chr_a, $chr_b,$start,$end,$begin,$finish, $sum_a, $sum_b, $total_a,
$total_b,$sign_a,$sign_b);
while (<$firstfile>) {
my #col = split /,/;
$chr_a = $col[0];
$start = $col[1];
$end = $col[2];
$sum_a = $col[3];
$total_a = $col[4];
$sign_a = $col[5];
seek($secondfile,0,0);
while (<$secondfile>) {
my #seccol = split /,/;
$chr_b = $seccol[0];
$begin = $seccol[1];
$finish = $seccol[2];
$sum_b = $seccol[3];
$total_b = $seccol[4];
$sign_b = $seccol[5];
print join ("\t", $col[0], $col[1], $col[2], $col[3]+=$seccol[3],
$col[4]+=$seccol[4], $col[5]),
"\n" if ($chr_a eq $chr_b and $end==$begin and $sign_a ne $sign_b);
}
}
And that works fine, but ideally I'd like to be able to do this within the file itself without having to duplicate it, because I have many files and so I would like to run a script over all of them that is less time-consuming.
Thanks.
In the absence of a response to my comment, this program will do as you ask with the data you provide.
use strict;
use warnings;
my #last;
while (<DATA>) {
s/\s+\z//;
my #line = split /,/;
if (#last
and $last[0] eq $line[0]
and $last[2] eq $line[1]
and $last[5] eq '+' and $line[5] eq '-') {
$last[3] += $line[3];
$last[4] += $line[4];
print join(',', #last), "\n";
#last = ()
}
else {
print join(',', #last), "\n" if #last;
#last = #line;
}
}
print join(',', #last), "\n" if #last;
__DATA__
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
output
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,+

rearrange data from one column to a row

I have the below data and I need to make the second column as the header. Any help is appreciated.
Data:
IBM,Voltality,7,73894756.93897434897
IBM,Market,100,983874.34324
GOOG,Sanity,15,8932748
GOOG,Rate Jump,25,873476378.234234
MBLY,Market,340,23423423432.6783
Output:
PRODUCT|Market|Rate Jump|Sanity|Voltality
IBM|100,983874.34324|||7,73894756.93897434897
GOOG||25,873476378.234234|15,8932748|||
MBLY|340,23423423432.6783|||
Code (incomplete / not sure hot to get to the end):
#!/usr/bin/perl
use strict;
use Getopt::Long;
use warnings;
use Data::Dumper;
my $valsep = ',';
my ( %type, %keys, %ccy, %cnt, %avg );
while (<>) {
chomp;
my ( $product, $reason, $count, $lat ) = split /,/;
my $key = "$product,$reason";
if ( not exists( $type{$reason} ) ) {
$type{$reason} = $reason;
}
$ccy{$key} = $product;
$cnt{$key} = $count;
$avg{$key} = $lat;
}
close(INPUT);
print Dumper ( \%ccy );
print Dumper ( \%type );
my ( %pair, %details );
foreach my $rows ( sort keys %ccy ) {
print "the key is : $rows and $ccy{$rows}\n";
foreach my $res ( sort keys %type ) {
print "The type is : $res and $type{$res}\n";
}
}
You just need to keep track of your columns and row data when parsing the data structure.
The following demonstrates:
#!/usr/bin/perl
use strict;
use warnings;
my $fh = \*DATA;
my %columns;
my %rows;
while (<$fh>) {
chomp;
my ( $company, $col, $vals ) = split ',', $_, 3;
# Track Columns for later labeling
$columns{$col}++;
$rows{$company}{$col} = $vals;
}
my #columns = sort keys %columns;
# Header
print join( '|', 'PRODUCT', #columns ), "\n";
for my $company ( sort keys %rows ) {
print join( '|', $company, map { $_ // '' } #{ $rows{$company} }{#columns} ), "\n";
}
__DATA__
IBM,Voltality,7,73894756.93897434897
IBM,Market,100,983874.34324
GOOG,Sanity,15,8932748
GOOG,Rate Jump,25,873476378.234234
MBLY,Market,340,23423423432.6783
Outputs:
PRODUCT|Market|Rate Jump|Sanity|Voltality
GOOG||25,873476378.234234|15,8932748|
IBM|100,983874.34324|||7,73894756.93897434897
MBLY|340,23423423432.6783|||
The following code will do the job; rather than using several hashes, I've put all the data in a hash of hashes. I've put comments in the script to explain what is happening in case you are not sure. You can, of course, delete them in your script.
#!/usr/bin/perl
use warnings;
use strict;
my %market;
while (<DATA>) {
next unless /\w/;
# remove line endings
chomp;
# split line by commas -- only split into three parts
my #col = split ",", $_, 3;
# save the data as $market{col0}{col1} = col2
$market{$col[0]}{$col[1]} = $col[2];
}
# create an output file
my $outfile = 'output.txt';
open( my $fh, ">", $outfile ) or die "Could not open $outfile: $!";
my #headers = ('Market','Rate Jump','Sanity','Volatility');
# print out the header line, joined by |
print { $fh } join('|', 'PRODUCT', #headers) . "\n";
# for each product in the market data
for my $p (sort keys %market) {
# print the product name
print { $fh } join('|', $p,
# go through the headers using map (map acts like a "for" loop)
# if the relevant property exists in the market data, print it;
# if not, print nothing
map { $market{$p}{$_} // '' } #headers) . "\n";
}
# this is the input data. You might be reading yours in from a file
__DATA__
IBM,Voltality,7,73894756.93897434897
IBM,Market,100,983874.34324
GOOG,Sanity,15,8932748
GOOG,Rate Jump,25,873476378.234234
MBLY,Market,340,23423423432.6783
Output:
PRODUCT|Market|Rate Jump|Sanity|Volatility
GOOG||25,873476378.234234|15,8932748|
IBM|100,983874.34324|||7,73894756.93897434897
MBLY|340,23423423432.6783|||

slicing and array into a hasn value

What I was trying to do was combine elements[1..3] into a single array, and then make the has out of that. Then sort by keys and print out the whole thing.
#!/usr/bin/perl
my %hash ;
while ( <> ) {
#elements = split /,/, $_;
#slice_elements = #elements[1..3] ;
if ($elements[0] ne '' ) {
$hash{ $elements[0] } = $slice_elements[0];
}
}
foreach $key (sort keys %hash ) {
print "$key; $hash{$key}\n";
}
This is what I get when I print this out -
casper_mint#casper-mint-dell /tmp $ /tmp/dke /tmp/File1.csv
060001.926941; TOT
060002.029434; RTP
060002.029568; RTP
060002.126895; UL
060002.229327; RDS/A
060002.312512; EON
060002.429382; RTP
060002.585408; BCS
060002.629333; LYG
060002.712240; HBC
This is waht I want the elements of the array - element[0] is the key and element[1..3] in the value
060001.926941,TOT,86.26,86.48
060002.029434,RTP,310.0,310.66
060002.029568,RTP,310.0,310.74
060002.126895,UL,34.06,34.14
060002.229327,RDS/A,84.47,84.72
060002.312512,EON,56.88,57.04
060002.429382,RTP,310.08,310.77
060002.585408,BCS,58.96,59.06
060002.629333,LYG,46.13,46.41
060002.712240,HBC,93.06,93.23
Always include use strict; and use warnings; at the top of EVERY perl script.
What you need is to create a new anonymous array [ ] as the value to your hash. Then join the values when displaying the results:
#!/usr/bin/perl
use strict;
use warnings;
my %hash;
while (<>) {
chomp;
my #elements = split /,/, $_;
if ($elements[0] ne '' ) {
$hash{ $elements[0] } = [#elements[1..3]];
}
}
foreach my $key (sort keys %hash ) {
print join(',', $key, #{$hash{$key}}) . "\n";
}
Of course, if your data really is fixed width like that, and you're not actually doing anything with the values, there actually is no need to split and join. The following would do the same thing:
use strict;
use warnings;
print sort <>;