Perl question: I have a colon-separated file containing paths that I'm using. I split each line using a regex, like this:
my %unique_import_hash;
while (my $line = <$log_fh>) {
my ($log_type, $log_import_filename, $log_object_filename)
= split /:/, $line;
$log_type =~ s/^\s+|\s+$//g; # trim whitespace
$log_import_filename =~ s/^\s+|\s+$//g; # trim whitespace
$log_object_filename =~ s/^\s+|\s+$//g; # trim whitespace
}
The exact file format is:
type : source-filename : import-filename
What I want is an index file that contains the last pushed $log_object_filename for each unique key $log_import_filename. So, in English/Perl pseudo-code, I'm going to push the $log_object_filename onto an array indexed by the hash %unique_import_hash. Then I want to iterate over the keys, pop the array referred to by %unique_import_hash, and store the result in an array of scalars.
My specific question is: what is the syntax for appending to an array that is the value of a hash?
You can use push, but you have to dereference the array referenced by the hash value:
push @{ $hash{$key} }, $filename;
See perlref for details.
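For illustration, a minimal self-contained sketch (the keys and values here are made up, not taken from your log): push autovivifies the array reference the first time a key is seen, and the last pushed element can be read back with [-1] or pop:

use strict;
use warnings;

my %unique_import_hash;

# push autovivifies an array reference at $unique_import_hash{$key}
push @{ $unique_import_hash{'widget.import'} }, 'widget_v1.o';
push @{ $unique_import_hash{'widget.import'} }, 'widget_v2.o';

for my $key (keys %unique_import_hash) {
    my $last = $unique_import_hash{$key}[-1];  # or: pop @{ $unique_import_hash{$key} }
    print "$key => $last\n";                   # prints "widget.import => widget_v2.o"
}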
If you only care about the last value for each key, you're over-thinking the problem. No need to fool around with arrays when a simple assignment will overwrite the previous value:
while (my $line = <$log_fh>) {
# ...
$unique_import_hash{$log_import_filename} = $log_object_filename;
}
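Once that loop has finished, writing the index file is just a matter of walking the hash; for example (the output filename here is only a placeholder):

open(my $index_fh, ">", "index.txt") or die "Cannot open index.txt: $!";
for my $import_filename (sort keys %unique_import_hash) {
    print $index_fh "$import_filename : $unique_import_hash{$import_filename}\n";
}
close $index_fh;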
use strict;
use warnings;
my %unique_import_hash;
my $log_filename = "file.log";
open(my $log_fh, "<", $log_filename) or die "Cannot open $log_filename: $!";
while (my $line = <$log_fh>) {
    chomp $line;
    $line =~ s/ *: */:/g; # normalize the delimiters
    my ($log_type, $log_import_filename, $log_object_filename) = split /:/, $line;
    push @{ $unique_import_hash{$log_import_filename} }, $log_object_filename;
}
Seek the wisdom of the Perl monks.
Related
Hello, I am trying to print the keys and values of a hash with one key/value pair per row, like this:
key:value
This is the code I am using to print my hash:
foreach (sort keys %hash) { print "$_:$hash{$_}\n"; }
And this is the output I get:
key
:value
Why is my script printing the value on a new row and what can I do to fix it?
The cursor is moving to the next line because your key contains a line feed. The solution is to remove the line feed from the key.
More specifically, you surely want to avoid creating a key with a line feed in the first place, so it should be removed from the key before you create the hash element.
You're presumably reading the key from a file handle. It's customary to use chomp (to remove any trailing line feed) or s/\s+\z// (to remove any trailing whitespace, including line feeds).
my @keys;
while (<>) {
chomp; # Or: s/\s+\z//;
push @keys, $_;
}
my %hash; @hash{@keys} = @values;
Try this version of the printing loop:
foreach (sort keys %hash) {
my $v = $hash{$_};
s/\s+$//;
print "$_:$v\n";
}
Keys in %hash definitely have some unwanted trailing characters, so it is better to filter them out when %hash is filled. For example, instead of this:
@hash{@keys} = @vals;
Write this:
@hash{map { s/\s+$//; $_ } @keys} = @vals;
Or this:
chomp(@keys);
@hash{@keys} = @vals;
But chomp will not help if there are other trailing characters besides the newline.
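A quick illustration of the difference, assuming a key that has a stray space before the newline:

my $key = "foo \n";

my $chomped = $key;
chomp $chomped;                      # "foo " - chomp removes only the trailing newline
(my $stripped = $key) =~ s/\s+$//;   # "foo"  - all trailing whitespace removed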
Hi there, I am struggling with a Perl script that parses an eight-column CSV line into another CSV line using the split command. I want to exclude all the text enclosed by square brackets []. The line looks like:
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
I used the following script, but when I print $fields[7] it gives me N, one of the fields inside the [] above. I want print "$fields[7]" to give 1399385680, which is the last field in the above line. The script I tried was:
while (my $line = <LOG>) {
chomp $line;
my @fields = grep { !/^[\[.*\]]$/ } split ",", $line;
my $timestamp=$fields[7];
print "$fields[7]";
}
Thanks for your time. I would appreciate your help.
Always include use strict; and use warnings; at the top of EVERY perl script.
Your "csv" file isn't proper csv. So the only thing I can suggest is to remove the contents in the brackets before you split:
use strict;
use warnings;
while (<DATA>) {
chomp;
s/\[.*?\]//g;
my @fields = split ',', $_;
my $timestamp = $fields[7];
print "$timestamp\n";
}
__DATA__
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
Outputs:
1399385680
Obviously it is possible to also capture the contents of the bracketed fields, but you didn't say that was a requirement or goal.
Update
If you want to capture the bracket delimited field, one method would be to use a regex for capturing instead.
Note, this current regex requires that each field has a value.
chomp;
my @fields = $_ =~ /(\[.*?\]|[^,]+)(?:,|$)/g;
my $timestamp = $fields[7];
print "$timestamp";
Well, if you want to actually ignore the text between square brackets, you might as well get rid of it:
while ( my $line = <LOG> ) {
chomp $line;
$line =~ s,\[.*?\],,g; # Delete all text between square brackets
my #fields = split ",", $line;
my $timestamp = $fields[7];
print $fields[7], "\n";
}
I've searched around the site and surprisingly I can't seem to find something that will work for my particular problem. So I figured I'd post it and see how some of you more experienced programmers can address this problem.
I have a spreadsheet-like text file (many lines with tab delimited columns), that I would like to search through for certain labels (e.g. scaffold1253.1_size81005.6.32799_7496) and replace them with more simplified labels (e.g. scaffold1253.1a). These labels are only in the first column of the text file. I've already written the script such that I have a hash with the old labels as keys corresponding to the new labels as their respective values. This hash has about 26000 lines. So essentially I'd like to take the hash keys 1 by 1, search for them in the text file, and replace them with their respective hash values.
I have a pretty good server available, so if it's too complicated to make it first-column specific to speed up the process, then that's OK.
This is what I have so far:
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
@gtfarray = <FASTAFILE2>;
#print @gtfarray;
my %hash;
while (<>)
{
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
#print %hash;
while (my ($find, $replace) = each %hash) {
foreach (@gtfarray){
$_ =~ s/$find/$replace/g;
push @newgtf, $_;
}
}
print @newgtf;
This code doesn't seem to work as it doesn't complete. I'm pretty sure it's a problem with the foreach loop structure. Sorry I don't know of any other way to do this. Does anyone have a better way to run through this file and conduct the replacement?
Any input would be greatly appreciated!
Thanks,
Andrew
@DVK
Here is the full script with your mods; it runs into syntax errors on your while loop. Any idea why it's not accepting it? Thanks again!
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
my %hash;
while (<>){
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
while $line (<FASTAFILE2>){
my @fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (@fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print trailing \t
}
print $outfile "\n"
}
__END__
Here is the syntax error:
perl gtf_mod2.pl <./Hc_genome/header_file.txt
syntax error at gtf_mod2.pl line 14, near "while $line "
syntax error at gtf_mod2.pl line 23, near "}"
Execution of gtf_mod2.pl aborted due to compilation errors.
You exhaust your file the first time through your loop using the initial $find and $replace key/value pair.
There are two potential solutions:
Open the file for reading during each iteration of your while loop (expensive)
Move the foreach loop to the outside of the while and iterate the hash each time (less expensive)
example:
REPLACE:
for my $line (@gtfarray) {
while(my ($find, $replace) = each %hash) {
if($line =~ s/$find/$replace/g) {
push @newgtf, $line;
next REPLACE; # skip to next iteration
}
}
# if there was no replacement, push the old line
push @newgtf, $line
}
How big is the file that you are replacing the first column in?
If it's >50,000 lines, you are better off doing the reverse:
Iterate through hash file once, and store that hash in memory
Iterate through main file once, and for every line, for every column, find that value in the memorized hash, replace with hash value if found, and write.
In other words, remove the first @gtfarray = <FASTAFILE2>; and replace your last while loop with:
while my $line (<FASTAFILE2>) {
my @fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (@fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print trailing \t
}
print $outfile "\n";
}
NOTE: I'm making an assumption that the fields contain FULL contents of your hash keys (e.g. your data file would contain a field with "scaffold1253.1_size81005.6.32799_7496" but NOT a field with "XYZscaffold1253.1_size81005.6.32799_7496___IOU").
If that assumption is wrong and you really DO need to run a regex because your scaffold strings may be contained in longer strings, there may still be a better solution aside from running O(N*M) regexes: if your scaffold strings are all of a certain well defined format (e.g. "scaffoldNNNNN.NNN_sizeNNNNN.NNN.NNNN_NNNN"), what you need to do then is:
For each line of data file, run a single regex finding that pattern, with the entire pattern inside a capture group parenthesis:
@matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+)/g);
Then, look up every value of the @matches array in the hash. If found, run ONLY the matches as a s/// regex.
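A hedged sketch of that lookup step, reusing FASTAFILE2, %hash, and the already-opened $outfile handle from the code above; \Q...\E is used so the dots in the labels are matched literally:

while (my $line = <FASTAFILE2>) {
    my @matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+)/g);
    for my $match (@matches) {
        # Substitute only labels that actually appear in the lookup hash.
        $line =~ s/\Q$match\E/$hash{$match}/g if exists $hash{$match};
    }
    print $outfile $line;
}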
Looking at your previous post, wouldn't it be simpler to create the shortened 'id' while reading the file? Then you would have no need of the other file where you get your hash.
Here is the (untested) code below. (You would need to direct the print statements to an output file on the command line, or open a file for writing in your script.)
#!/usr/bin/perl
use strict;
use warnings;
my $gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open my $FASTAFILE2, "<", $gtf or die "Unable to open '$gtf' for reading. $!";
my %seen;
while (<$FASTAFILE2>) {
chomp;
my ($id, $val) = split /\t/, $_, 2;
# copy $id to $prefix and
# remove everything after '.1' in $prefix
(my $prefix = $id) =~ s/\.1\K.*//;
if ($seen{$id}) {
++$seen{$id}; # string increment: 'a' -> 'b' -> 'c', ...
}
else {
$seen{$id} = 'a';
}
print "$prefix$seen{$id}\t$val\n";
}
close $FASTAFILE2 or die "Unable to close '$gtf' from reading. $!";
Could it be a job for Tie::File? Assuming, that is, the data file could be operated on as an array.
use Tie::File;
my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";
tie my @lines, 'Tie::File', $file or die "Cannot tie '$file': $!";
for (@lines) {
s/OldLabel/NewLabel/g; # Change this to fit
}
untie @lines;
Tie::File does a bunch of tricks to keep the "in place" changes to the file memory-efficient.
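If the replacement still needs to be keyed on the first column rather than a fixed pattern, a possible combination (an untested sketch; it assumes one new label per old label and the tab-separated layout from the question, with the label file read from standard input as before):

use strict;
use warnings;
use Tie::File;

# Build the old-label => new-label map, as in the question.
my %hash;
while (<>) {
    chomp;
    my ($key, $val) = split /\t/;
    $hash{$key} = $val;
}

my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";
tie my @lines, 'Tie::File', $file or die "Cannot tie '$file': $!";

for my $line (@lines) {
    my ($first, $rest) = split /\t/, $line, 2;
    # Rewrite the first column in place only when it is a known key.
    $line = "$hash{$first}\t$rest" if defined $rest && exists $hash{$first};
}

untie @lines;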
I am fairly new to Perl so hopefully this has a quick solution.
I have been trying to combine two files based on a key. The problem is that there are multiple values per key, but only one is being returned. Is there a way to loop through the hash to get the 1-10 more values it could be getting?
Example:
File Input 1:
12345|AA|BB|CC
23456|DD|EE|FF
File Input2:
12345|A|B|C
12345|D|E|F
12345|G|H|I
23456|J|K|L
23456|M|N|O
32342|P|Q|R
The reason I put that last one in is because the second file has a lot of values I don't want, but for file 1 I want all values. The result I want is something like this:
WANTED OUTPUT:
12345|AA|BB|CC|A|B|C
12345|AA|BB|CC|D|E|F
12345|AA|BB|CC|G|H|I
23456|DD|EE|FF|J|K|L
23456|DD|EE|FF|M|N|O
Attached is the code I am currently using. It gives an output like so:
OUTPUT I AM GETTING:
12345|AA|BB|CC|A|B|C
23456|DD|EE|FF|J|K|L
My code so far:
#use strict;
#use warnings;
open file1, "<FILE1.txt";
open file2, "<FILE2.txt";
while(<file2>){
my($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~ /(.+)\|(.+)\|(.+)\|(.+)/;
$value4 = "$value1|$value2|$value3";
$file2Hash{$key} = $value4;
}
while(<file1>){
my ($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~/(.+)\|(.+)\|(.+)\|(.+)/;
if (exists $file2Hash{$key}) {
print $line."|".$file2Hash{$key}."\n";
}
else {
print $line."\n";
}
}
Thank you for any help you may provide,
Your overall idea is sound. However, in file2, if you encounter a key you have already defined, you overwrite it with a new value. To work around that, we store an array(-ref) inside our hash.
So in your first loop, we do:
push @{$file2Hash{$key}}, $value4;
The @{...} is just array dereferencing syntax.
In your second loop, we do:
if (exists $file2Hash{$key}){
foreach my $second_value (@{$file2Hash{$key}}) {
print "$line|$second_value\n";
}
} else {
print $line."\n";
}
Beyond that, you might want to declare %file2Hash with my so you can reactivate strict.
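Put together, a minimal version of the whole script with strict re-enabled might look like this (file names as in the question; split is used here instead of the capturing regex):

use strict;
use warnings;

my %file2Hash;

open my $file2, "<", "FILE2.txt" or die "Cannot open FILE2.txt: $!";
while (my $line = <$file2>) {
    chomp $line;
    my ($key, $rest) = split /\|/, $line, 2;
    push @{ $file2Hash{$key} }, $rest;   # keep every value seen for this key
}
close $file2;

open my $file1, "<", "FILE1.txt" or die "Cannot open FILE1.txt: $!";
while (my $line = <$file1>) {
    chomp $line;
    my ($key) = split /\|/, $line, 2;
    if (exists $file2Hash{$key}) {
        print "$line|$_\n" for @{ $file2Hash{$key} };
    }
    else {
        print "$line\n";
    }
}
close $file1;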
Keys in a hash must be unique. If keys in file1 are unique, use file1 to create the hash. If keys are not unique in either file, you have to use a more complicated data structure: hash of arrays, i.e. store several values at each unique key.
I assume that each key in FILE1.txt is unique and that each unique key has at least one corresponding line in FILE2.txt.
Your approach is then quite close to what you need; you should just use FILE1.txt to create the hash (as already mentioned here).
The following should work:
#!/usr/bin/perl
use strict;
use warnings;
my %file1hash;
open file1, "<", "FILE1.txt" or die "$!\n";
while (<file1>) {
my ($key, $rest) = split /\|/, $_, 2;
chomp $rest;
$file1hash{$key} = $rest;
}
close file1;
open file2, "<", "FILE2.txt" or die "$!\n";
while (<file2>) {
my ($key, $rest) = split /\|/, $_, 2;
if (exists $file1hash{$key}) {
chomp $rest;
printf "%s|%s|%s\n", $key, $file1hash{$key}, $rest;
}
}
close file2;
exit 0;
I have a file that I need to parse in the following format. (All delimiters are spaces):
field name 1: Multiple word value.
field name 2: Multiple word value along
with multiple lines.
field name 3: Another multiple word
and multiple line value.
I am familiar with how to parse a single line fixed-width file, but am stumped with how to handle multiple lines.
#!/usr/bin/env perl
use strict; use warnings;
my (%fields, $current_field);
while (my $line = <DATA>) {
next unless $line =~ /\S/;
if ($line =~ /^ \s+ ( \S .+ )/x) {
if (defined $current_field) {
$fields{ $current_field } .= " $1"; # join continuation lines with a space
}
}
elsif ($line =~ /^(.+?) : \s+ (.+) \s+/x ) {
$current_field = $1;
$fields{ $current_field } = $2;
}
}
use Data::Dumper;
print Dumper \%fields;
__DATA__
field name 1: Multiple word value.
field name 2: Multiple word value along
with multiple lines.
field name 3: Another multiple word
and multiple line value.
Fixed-width says unpack to me. It is possible to parse with regexes and split, but unpack should be a safer choice, as it is the Right Tool for fixed width data.
I put the width of the first field to 12 and the empty space between to 13, which works for this data. You may need to change that. The template "A12A13A*" means "find 12 then 13 ascii characters, followed by any length of ascii characters". unpack will return a list of these matches. Also, unpack will use $_ if a string is not supplied, which is what we do here.
Note that if the first field is not fixed width up to the colon, as it appears to be in your sample data, you'll need to merge the fields in the template, e.g. "A25A*", and then strip the colon.
I chose an array as the storage device, as I do not know whether your field names are unique. A hash would overwrite fields with the same name. Another benefit of an array is that it preserves the order of the data as it appears in the file. If these things are irrelevant and quick lookup is more of a priority, use a hash instead.
Code:
use strict;
use warnings;
use Data::Dumper;
my $last_text;
my @array;
while (<DATA>) {
# unpack the fields and strip spaces
my ($field, undef, $text) = unpack "A12A13A*";
if ($field) { # If $field is empty, that means we have a multi-line value
$field =~ s/:$//; # strip the colon
$last_text = [ $field, $text ]; # store data in anonymous array
push @array, $last_text; # and store that array in @array
} else { # multi-line values get added to the previous lines data
$last_text->[1] .= " $text";
}
}
print Dumper \@array;
__DATA__
field name 1: Multiple word value.
field name 2: Multiple word value along
with multiple lines.
field name 3: Another multiple word
and multiple line value
with a third line
Output:
$VAR1 = [
[
'field name 1:',
'Multiple word value.'
],
[
'field name 2:',
'Multiple word value along with multiple lines.'
],
[
'field name 3:',
'Another multiple word and multiple line value with a third line'
]
];
You could do this:
#!/usr/bin/perl
use strict;
use warnings;
my @fields;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";
for (<$fh>) {
if (/^\s/) {
$fields[$#fields] .= $_;
} else {
push @fields, $_;
}
}
close $fh;
If the line starts with white space, append it to the last element in @fields, otherwise push it onto the end of the array.
Alternatively, slurp the entire file and split with look-around:
#!/usr/bin/perl
use strict;
use warnings;
$/=undef;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";
my @fields = split /(?<=\n)(?!\s)/, <$fh>;
close $fh;
It's not a recommended approach though.
You can change the input record separator, $/, so that each multi-line record is read in one go:
$/ = "\nfield name";
while (my $line = <FILE>) {
chomp $line; # strips the trailing "\nfield name"
if ($line =~ /(\d+):\s+(.+)/s) {
print "Record $1 is $2\n";
}
}