Fix hash positions in hash - perl

I have the following code which passes data to a csv, however the column order in the CSV is seemingly random (I pass a hash to the CSV). I'm having real difficulty defining the order of the columns.
Is there a way I can fix the position of hashes in my %output_var hash so I can effectively define the column order in my CSV?
$common->print_header('text/csv','export_order_products.csv');
print common::to_csv(sort values %output_var)."\r\n";
my $cart_id;
foreach my $order (#data_to_export) {
$order->{'order-id'} =~ s/\.cgi//;
$order->{'order-item-id'} =~ s/\.cgi//;
$order->{'purchase-date'} =~ s/\.cgi//;
my #line;
for (sort {$output_var{$a} cmp $output_var{$b}; } keys %output_var) {
push #line, $order->{$_};
}
print common::to_csv(#line)."\r\n";
}
CSV sub:
sub to_csv {
my ($csv_line);
my (#fields) = #_;
my $i;
for ($i=0; $i<scalar(#fields); $i++) {
$fields[$i] =~ s/\r/\\r/gis;
$fields[$i] =~ s/\n/\\n/gis;
$fields[$i] =~ s/"/""/gis; # -- "
$fields[$i] =~ s/(.*,.*)/"$1"/gis;
$fields[$i] = '' if ($fields[$i] eq '0');
}
$csv_line = join ',', #fields;
return $csv_line;
}

If you're were using Text::CSV or Text::CSV_XS,
# Outside the loop
$csv->column_names([qw( ... )]);
# Inside the loop
$csv->print_hr($fh, $order);
But you don't appear to be using either of those modules(?!?!), so
# Outside the loop
my #headers = qw( ... );
# Inside the loop
print($fh to_csv(#$order{#headers}), "\r\n");

Related

Handling Nested Delimiters in perl

use strict;
use warnings;
my %result_hash = ();
my %final_hash = ();
Compare_results();
foreach my $key (sort keys %result_hash ){
print "$key \n";
print "$result_hash{$key} \n";
}
sub Compare_results
{
while ( <DATA> )
{
my($instance,$values) = split /\:/, $_;
$result_hash{$instance} = $values;
}
}
__DATA__
1:7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\ d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2:7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Output
1
7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2
7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Iam trying to fetch value of each key and again trying to split the comma seperated value from result hash , if i find a semicolon in any value i would want to store the left and right values in separate hash keys.
Something like below
1.#split the value of result_hash{$key} again by , and see whether any chunk is seperated by ;
2. #every chunk without ; and value on left with ; should be stored in
#{$final_hash{"eto"}} = ['7802315095\d\d','7802315098\d\d','7802025002\d\d','7802025003\d\d','7802025004\d\d','7802025005\d\d','7802025006\d\d','7802025007\d\d'] ;
3.#Anything found on the right side of ; has to be stored in
#{$final_hash{"pro"}} = ['7802025001\d\d'] ;
Is there a way that i can handle everything in the subroutine? Can i make the code more simpler
Update :
I tried splitting the string in a single shot, but its just picking the values with semicolon and ignoring everything
foreach my $key (sort keys %result_hash ){
# print "$key \n";
# print "$result_hash{$key} \n";
my ($o,$t) = split(/,|;/, $result_hash{$key});
print "Left : $o \n";
print "Left : $t \n";
#push #{$final_hash{"eto"}}, $o;
#push #{$final_hash{"pro"}} ,$t;
}
}
My updated code after help
sub Compare_results
{
open my $fh, '<', 'Data_File.txt' or die $!;
# split by colon and further split by , and ; if any (done in insert_array)
my %result_hash = map { chomp; split ':', $_ } <$fh> ;
foreach ( sort { $a <=> $b } (keys %result_hash) )
{
($_ < 21)
? insert_array($result_hash{$_}, "west")
: insert_array($result_hash{$_}, "east");
}
}
sub insert_array()
{
my ($val,$key) = #_;
foreach my $field (split ',', $val)
{
$field =~ s/^\s+|\s+$//g; # / turn off editor coloring
if ($field !~ /;/) {
push #{ $file_data{"pto"}{$key} }, $field ;
}
else {
my ($left, $right) = split ';', $field;
push #{$file_data{"pto"}{$key}}, $left if($left ne '') ;
push #{$file_data{"ero"}{$key}}, $right if($right ne '') ;
}
}
}
Thanks
Update Added a two-pass regex, at the end
Just proceed systematically, analyze the string step by step. The fact that you need consecutive splits and a particular separation rule makes it unwieldy to do in one shot. Better have a clear method than a monster statement.
use warnings 'all';
use strict;
use feature 'say';
my (%result_hash, %final_hash);
Compare_results();
say "$_ => $result_hash{$_}" for sort keys %result_hash;
say '---';
say "$_ => [ #{$final_hash{$_}} ]" for sort keys %final_hash;
sub Compare_results
{
%result_hash = map { chomp; split ':', $_ } <DATA>;
my (#eto, #pro);
foreach my $val (values %result_hash)
{
foreach my $field (split ',', $val)
{
if ($field !~ /;/) { push #eto, $field }
else {
my ($left, $right) = split ';', $field;
push #eto, $left;
push #pro, $right;
}
}
}
$final_hash{eto} = \#eto;
$final_hash{pro} = \#pro;
return 1; # but add checks above
}
There are some inefficiencies here, and no error checking, but the method is straightforward. If your input is anything but smallish please change the above to process line by line, what you clearly know how to do. It prints
1 => ... (what you have in the question)
---
eto => [ 7802315095\d\d 7802315098\d\d 7802025002\d\d 7802025003\d\ d ...
pro => [ 7802025001\d\d ]
Note that your data does have one loose \d\ d.
We don't need to build the whole hash %result_hash for this but only need to pick the part of the line after :. I left the hash in since it is declared global so you may want to have it around. If it in fact isn't needed on its own this simplifies
sub Compare_results {
my (#eto, #pro);
while (<DATA>) {
my ($val) = /:(.*)/;
foreach my $field (split ',', $val)
# ... same
}
# assign to %final_hash, return from sub
}
Thanks to ikegami for comments.
Just for the curiosity's sake, here it is in two passes with regex
sub compare_rx {
my #data = map { (split ':', $_)[1] } <DATA>;
$final_hash{eto} = [ map { /([^,;]+)/g } #data ];
$final_hash{pro} = [ map { /;([^,;]+)/g } #data ];
return 1;
}
This picks all characters which are not , or ;, using the negated character class, [^,;]. So that is up to the first either of them, left to right. It does this globally, /g, so it keeps going through the string, collecting all fields that are "left of" , or ;. Then it cheats a bit, picking all [^,;] that are right of ;. The map is used to do this for all lines of data.
If %result_hash is needed build it instead of #data and then pull the values from it with my #values = values %hash_result and feed the map with #values.
Or, broken line by line (again, you can build %result_hash instead of taking $data directly)
my (#eto, #pro);
while (<DATA>) {
my ($data) = /:(.*)/;
push #eto, $data =~ /([^,;]+)/g;
push #pro, $data =~ /;([^,;]+)/g;
}

Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

I dont know if this is just a quirk with Stawberry Perl, but I can't seem to get it to run. I just need to take a fasta and reverse every sequence in it.
-The problem-
I have a multifasta file:
>seq1
ABCDEFG
>seq2
HIJKLMN
and the expected output is:
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
The script is here:
$NUM_COL = 80; ## set the column width of output file
$infile = shift; ## grab input sequence file name from command line
$outfile = "test1.txt"; ## name output file, prepend with “REV”
open (my $IN, $infile);
open (my $OUT, '>', $outfile);
$/ = undef; ## allow entire input sequence file to be read into memory
my $text = <$IN>; ## read input sequence file into memory
print $text; ## output sequence file into new decoy sequence file
my #proteins = split (/>/, $text); ## put all input sequences into an array
for my $protein (#proteins) { ## evaluate each input sequence individually
$protein =~ s/(^.*)\n//m; ## match and remove the first descriptive line of
## the FATA-formatted protein
my $name = $1; ## remember the name of the input sequence
print $OUT ">REV$name\n"; ## prepend with #REV#; a # will help make the
## protein stand out in a list
$protein =~ s/\n//gm; ## remove newline characters from sequence
$protein = reverse($protein); ## reverse the sequence
while (length ($protein) > $NUM_C0L) { ## loop to print sequence with set number of cols
$protein =~ s/(.{$NUM_C0L})//;
my $line = $1;
print $OUT "$line\n";
}
print $OUT "$protein\n"; ## print last portion of reversed protein
}
close ($IN);
close ($OUT);
print "done\n";
This will do as you ask
It builds a hash %fasta out of the FASTA file, keeping array #keys to keep the sequences in order, and then prints out each element of the hash
Each line of the sequence is reversed using reverse before it is added to the hash, and using unshift adds the lines of the sequence in reverse order
The program expects the input file as a parameter on the command line, and prints the result to STDOUT, which may be redirected on the command line
use strict;
use warnings 'all';
my (%fasta, #keys);
{
my $key;
while ( <> ) {
chomp;
if ( s/^>\K/REV/ ) {
$key = $_;
push #keys, $key;
}
elsif ( $key ) {
unshift #{ $fasta{$key} }, scalar reverse;
}
}
}
for my $key ( #keys ) {
print $key, "\n";
print "$_\n" for #{ $fasta{$key} };
}
output
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
Update
If you prefer to rewrap the sequence so that short lines are at the end, then you just need to rewrite the code that dumps the hash
This alternative uses the length of the longest line in the original file as the limit, and rerwraps the reversed sequence to the same length. It's claer that it would be simple to specify an explicit length instead of calculating it
You will need to add use List::Util 'max' at the top of the program
my $len = max map length, map #$_, values %fasta;
for my $key ( #keys ) {
print $key, "\n";
my $seq = join '', #{ $fasta{$key} };
print "$_\n" for $seq =~ /.{1,$len}/g;
}
Given the original data the output is identical to that of the solution above. I used this as input
>seq1
ABCDEFGHI
JKLMNOPQRST
UVWXYZ
>seq2
HIJKLMN
OPQRSTU
VWXY
with this result. All lines have been wrapped to eleven characters - the length of the longest JKLMNOPQRST line in the original data
>REVseq1
ZYXWVUTSRQP
ONMLKJIHGFE
DCBA
>REVseq2
YXWVUTSRQPO
NMLKJIH
I don't know if this is just for a class that uses toy datasets or actual research FASTAs that can be gigabytes in size. If the latter, it would make sense not to keep the whole data set in memory as both your program and Borodin's do but read it one sequence at a time, print that out reversed and forget about it. The following code does that and also deals with FASTA files that may have asterisks as sequence-end markers as long as they start with >, not ;.
#!/usr/bin/perl
use strict;
use warnings;
my $COL_WIDTH = 80;
my $sequence = '';
my $seq_label;
sub print_reverse {
my $seq_label = shift;
my $sequence = reverse shift;
return unless $sequence;
print "$seq_label\n";
for(my $i=0; $i<length($sequence); $i += $COL_WIDTH) {
print substr($sequence, $i, $COL_WIDTH), "\n";
}
}
while(my $line = <>) {
chomp $line;
if($line =~ s/^>/>REV/) {
print_reverse($seq_label, $sequence);
$seq_label = $line;
$sequence = '';
next;
}
$line = substr($line, 0, -1) if substr($line, -1) eq '*';
$sequence .= $line;
}
print_reverse($seq_label, $sequence);

perl match part of string in two files

I'm using a perl script to look for matches between columns in two tab-delimited files. However for one column I only want to look for a partial match between two strings in two columns.
It concerns $row[4] of $table2 and $row{d} of $table1.
The values in $row[4] of $table2 look like this:
'xxxx'.
The values in $row{d} of $table1 look like this:
'xxxx.aaa'.
If the part before the '.' is the same, there is a match. If not, there is no match. I'm not sure how to implement this in my script. This is what I have so far. I only looks for complete matches between different columns. '...' denotes code that is not important for this question
#! /usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
use Getopt::Long qw(GetOptions);
...
...
chomp( my #header_table2 = split /\t/, <$table2> );
my %lookup;
while(<$table2>){
chomp;
my #row = split(/\t/);
$lookup{ $row[0] }{ $row[1] }{ $row[4] }{ $row[5] }{ $row[6] }{ $row[7] }{ $row[8] } = [ $row[9], $row[10] ];
}
my #header = do {
my $header = <$table1>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
print $table3 join ("\t", #header, qw/ name1 name2 /), "\n";
{
no warnings 'uninitialized';
while(<$table1>){
s/\t?\n\z//;
my %row;
#row{#header} = split /\t/;
print $table3 join ( "\t", #row{#header},
#{ $lookup{ $row{a} }{ $row{b} }{ $row{c} }{ $row{d} }{ $row{e} }{ $row{f} }{ $row{g} }
// [ "", "" ] }), "\n";
}
}
This is looking like a job for a database
The solution below isn't going to work, because you are building your %lookup hash with nine levels of keys ($row[0] .. $row[8]) , and accessing it with only seven levels ($row{a} .. $row{g}), so you're going to have to edit in the real situation
I see no reason to next your hashes so deeply. A single key formed by using join on the relevant fields will work fine and probably a little faster. I also see no reason to extract table2 fields into an array and table1 fields into a hash. An array seems fine in both cases
I've solved your immediate problem by copying each #row from table1 into array #key, and removing the last dot and anything following from the fourth element before building the $key string
In view of your history of adding a spare tab character before the newline at the end of each record, I've also added four die statements that verify the size of the header row and columns rows before continuing. You will probably need to tweak those values according to your real data
use strict;
use warnings 'all';
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
use Getopt::Long qw(GetOptions);
use constant TABLE1_COLUMNS => 9;
use constant TABLE2_COLUMNS => 11;
open my $table2, '<', 'table2.txt' or die $!;
my #header_table2 = do {
my $header = <$table2>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
die "Incorrect table 2 header count " . scalar #header_table2
unless #header_table2 == TABLE2_COLUMNS;
my %lookup;
while ( <$table2> ) {
chomp;
my #row = split /\t/;
die "Incorrect table 2 column count " . scalar #row
unless #row == TABLE2_COLUMNS;
my $key = do {
local $" = "\n";
"#row[0..8]";
};
$lookup{ $key } = [ #row[9,10] ];
}
open my $table1, '<', 'table1.txt' or die $!;
my #header = do {
my $header = <$table1>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
die "Incorrect table 1 header count " . scalar #header
unless #header == TABLE1_COLUMNS;
open my $table3, '>', 'table3.txt' or die $!;
print $table3 join ("\t", #header, qw/ name1 name2 /), "\n";
while ( <$table1> ) {
s/\t?\n\z//;
my #row = split /\t/;
die "Incorrect table 1 column count " . scalar #row
unless #row == TABLE1_COLUMNS;
my $key = do {
my #key = #row;
$key[3] =~ s/\.[^.]*\z//;
local $" = "\n";
"#key";
};
my $lookup = $lookup{ $key } // [ "", "" ];
print $table3 join("\t", #row, #$lookup), "\n";
}
You're going to have a scoping problem because your array #row and your hash %row both exist in completely different scopes.
But if you have variables (say, $foo and $bar) and you want to know if $foo starts with the contents of $bar followed by a dot, then you can do that using a regular expression check like this:
if ($foo =~ /^$bar\./) {
# match
} else {
# no match
}

Perl Hash of Hash Output

I'm reading a file. I want a hash that gives me the first number of a line as a key to a hash of all the numbers of the rest of the line to 1.
I believe I'm adding the hash correctly, because Dumper prints correctly.
However, print "$first $secondID\n" is not giving me any output.
while (<FILE>) {
chomp $_;
if (/(\d+)\t(.+)/) {
$firstNum = $1;
#seconds = split(/\,/,$2);
foreach $following (#seconds) {
$Pairs->{$firstNum}{$following} = 1;
}
foreach $first (sort {$a <=> $b} keys %Pairs) {
print "$first\n";
%second = {$Pairs{$first}};
foreach $secondID (sort {$a <=> $b} keys %second) {
print "$first $secondID\n";
}
}
print Dumper($Pairs);
}
else {
print "ERROR\n";
}
}
Later on, given a pair of numbers I would like to look up to see whether $Pairs{$num1}{$num2} is defined. would I write
if(defined $Pairs{$num1}{$num2})
Or should I check the first key first. Then check the second key
if (defined $Pairs{$num1}) {
$temp = $Pairs{$num1};
if (defined $temp{$num2}) {
print "true\n;
}
}
You have a couple of errors. Firstly you seem to be unsure whether you are using %Pairs or $Pairs to store your hash, and secondly you have %second = {$Pairs{$first}}, which tries to assign a hash reference to the hash %second. Presumably you want my %second = %{ $Pairs{$first} }.
You should always use strict and use warnings at the start of all your Perl programs, and declare all variables at the point of first use using my. This will alert you to simple mistakes you could otherwise easily overlook, and would have shown up your use of both %Pairs and $Pairs in this program, as well as your attempt to assign a single value (a hash reference) to a hash.
Rather than copying the entire hash, you should save a reference to it in $seconds. Then you can dereference it in the following for loop.
Experienced Perl programmers would also thank you for using lower-case plus underscore for local (my) variables, and reserving capitals for package and class names.
This program works as you intended, and expects the file name as a command-line parameter:
use strict;
use warnings;
my %pairs;
while (<>) {
unless ( /(\d+)\s+(.+)/ ) {
print "ERROR\n";
next;
}
my $first_num = $1;
my #seconds = split /,/, $2;
foreach my $following (#seconds) {
$pairs{$first_num}{$following} = 1;
}
foreach my $first (sort { $a <=> $b } keys %pairs) {
print "$first\n";
my $second = $pairs{$first};
foreach my $second_id (sort { $a <=> $b } keys %$second) {
print "$first $second_id\n";
}
}
}
my %hash;
while ( <> ) {
my #numbers = split /\D+/;
my $key = shift #numbers;
#{$hash{$key}}{ #numbers } = ( 1 ) x #numbers;
}
# test it this way...
if ( $hash{ $num1 }{ $num2 } ) {
}
Use:
%second = %{$Pairs->{$first}};

How can I do alpha numeric sort in Perl?

I have a file which looks like this:
80,1p21
81,19q13
82,6p12.3
83,Xp11.22
84,3pter-q21
86,3q26.33
87,14q24.1-q24.2|14q24|14q22-q24
88,1q42-q43
89,11q13.1
90,2q23-q24
91,12q13
92,2q22.3
93,3p22
94,12q11-q14
95,3p21.1
97,14q24.3
98,2p16.2
And I want to sort them based on the second column. And the first column should change accordingly too. When you use the 'sort' command in Perl, it doesn't do it because it says it's not numeric. Is there a way to sort things alpha numerically in Perl?
If you read the documentation for sort, you'll see that you don't need to do a numeric sort in Perl. You can do string comparisons too.
#sorted = sort { $a cmp $b } #unsorted;
But that still leaves you with a problem as, for example, 19q will sort before 6p. So you can write your own sort function which can make whatever transformations you want before doing the comparison.
#sorted = sort my_complex_sort #unsorted;
sub my_complex_sort {
# code that compares $a and $b and returns -1, 0 or 1 as appropriate
# It's probably best in most cases to do the actual comparison using cmp or <=>
# Extract the digits following the first comma
my ($number_a) = $a =~ /,(\d+)/;
my ($number_b) = $b =~ /,(\d+)/;
# Extract the letter following those digits
my ($letter_a) = $a =~ /,\d+(a-z)/;
my ($letter_b) = $b =~ /,\d+(a-z)/;
# Compare and return
return $number_a <=> $number_b or $letter_a cmp $letter_b;
}
#!/usr/bin/env perl
use strict;
use warnings;
my #datas = map { /^(\d+),(\d*)(.*)$/; [$1, $2, $3]; } <DATA>;
my #res = sort {$a->[1] <=> $b->[1] or $a->[2] cmp $b->[2]} #datas;
foreach my $data (#res) {
my ($x, $y, $z) = #{$data};
print "$x,$y$z\n";
}
__DATA__
80,1p21
81,19q13
82,6p12.3
83,Xp11.22
84,3pter-q21
86,3q26.33
87,14q24.1-q24.2|14q24|14q22-q24
88,1q42-q43
89,11q13.1
90,2q23-q24
91,12q13
92,2q22.3
93,3p22
94,12q11-q14
95,3p21.1
97,14q24.3
98,2p16.2
I actually found the answer to this. The code looks a bit complicated though.
#!/usr/bin/env perl
use strict;
use warnings;
sub main {
my $file;
if (#ARGV != 1) {
die "Usage: perl hashofhash_sort.pl <filename>\n";
}
else {
$file = $ARGV[0];
}
open(IN, $file) or die "Error!! Cannot open the $file file: $!\n";
my #file = <IN>;
chomp #file;
my ($entrez_gene, $loci, $chr, $band, $pq, $band_num);
my (%chromosome, %loci_entrez);
foreach my $line (#file) {
if ($line =~ /(\d+),(.+)/) {
# Entrez genes
$entrez_gene = $1;
# Locus like 12p23.4
$loci = $2;
if ($loci =~ /^(\d+)(.+)?/) {
# chromosome number alone (only numericals)
$chr = $1;
if ($2) {
# locus minus chromosome number. If 12p23.4, then $band is p23.4
$band = "$2";
if ($band =~ /^([pq])(.+)/) {
# either p or q
$pq = $1;
# stores the numericals. for p23.4, stores 23.4
$band_num = $2;
}
if (exists $chromosome{$chr}) {
if (exists $chromosome{$chr}{$pq}) {
push (#{$chromosome{$chr}{$pq}}, $band_num);
}
else {
$chromosome{$chr}{$pq} = [$band_num];
}
}
else {
$chromosome{$chr}{$pq} = [$band_num];
}
}
}
}
} # End of foreach loop
foreach my $key (sort {$a <=> $b} keys %chromosome) {
my %seen = ();
foreach my $key2 (sort {$a cmp $b } keys %{$chromosome{$key}}) {
my #unique = grep { ! $seen{$_}++ } #{$chromosome{$key}{$key2}};
my #sorted = sort #unique;
foreach my $element (#sorted) {
my $sorted_locus = "$key$key2$element";
if (exists $loci_entrez{$sorted_locus}) {
foreach my $element2 (#{$loci_entrez{$sorted_locus}}) {
print "$element2,$sorted_locus\n";
}
}
}
}
}
} # End of main
main();
In the very general case, the question is ambiguous on what to do with integers that are equal but written differently, because of the possibility of leading zeros. The following comparison function (for sort) allows one to consider the lexicographic order as soon as one doesn't have different integers. This is the same as zsh's numeric sort.
sub alphanumcmp ($$)
{
my (#u,#v);
if ((#u = $_[0] =~ /^(\d+)/) &&
(#v = $_[1] =~ /^(\d+)/))
{
my $c = $u[0] <=> $v[0];
return $c if $c;
}
if ((#u = $_[0] =~ /^(.)(.*)/) &&
(#v = $_[1] =~ /^(.)(.*)/))
{
return $u[0] cmp $v[0] || &alphanumcmp($u[1],$v[1]);
}
return $_[0] cmp $_[1];
}
For instance, one would get the following sorted elements:
a0. a00. a000b a00b a0b a001b a01. a01b a1. a1b a010b a10b a011b a11b
Note 1: The use of <=> assumes that the numbers are not too large.
Note 2: In the question, the user wants to do an alphanumeric sort on the second column (instead of the whole string). So, in this particular case, the comparison function could just be adapted to ignore the first column or a Schwartzian transform could be used.