How to split each entry of an array on each whitespace? - perl

I have a file like this:
This is is my "test"
file with a lot
words separeted by whitespace.
Now I want to split this so that I get an array in which each element contains one word and all duplicate words are removed.
The desired array:
This
is
my
test
etc...
I read the file into an array, but I do not know how to split a whole array so that the result is a new array. And how can I remove the duplicate words?
#!/usr/bin/perl
package catalogs;
use Log::Log4perl;
Log::Log4perl->init("log4perl.properties");
open(FILE, "<Source.txt") || die "file Source.txt could not be opened";
my @fileContent = <FILE>;
close FILE;
my $log = Log::Log4perl->get_logger("catalogs");
@fileContent = split(" ");

To extract the words, you could use
my @words = $str =~ /\w+/g;
As for removing duplicates,
use List::MoreUtils qw( uniq );
my @uniq_words = uniq @words;
or
my %seen;
my @uniq_words = grep !$seen{$_}++, @words;
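The two steps above can be combined into one small subroutine. This is only a sketch: the name `unique_words` and the sample text are made up for illustration, and it uses the `%seen` idiom so that no extra module is required. Note that the comparison is case-sensitive, so "This" and "this" would count as different words.

```perl
use strict;
use warnings;

# Return the distinct words of $text, in the order they first appear.
# unique_words is a hypothetical helper name, not part of any module.
sub unique_words {
    my ($text) = @_;
    my %seen;
    # The match runs in list context, so it yields every \w+ run;
    # grep keeps a word only the first time it is seen.
    return grep { !$seen{$_}++ } $text =~ /\w+/g;
}

my $text = qq{This is is my "test"\nfile with a lot\n};
print join( ' ', unique_words($text) ), "\n";
```

Because `\w+` only matches word characters, the quotes around "test" are dropped without any extra work.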

You're loading the text of the file into an array, but it may make more sense to load the file into a single string. This would enable you to take advantage of the solution @ikegami provided. To bring it all together, try the following.
use List::MoreUtils qw( uniq );
my $filecontent = do
{
local $/ = undef;
<STDIN>;
};
my @words = $filecontent =~ /\w+/g;
my @uniqword = uniq(@words);

my $log = Log::Log4perl->get_logger("catalogs");
@fileContent = map { split ' ' } @fileContent; # split every line on whitespace
@fileContent = uniq(@fileContent);
To make the words unique, you can use the uniq subroutine or map the list into a hash. Since the keys of a hash are always unique, duplicates will be overwritten. Note that keys returns the words in no particular order.
use strict;
use warnings;
use Data::Dumper;
my @a = (1,1,1,2,3,4,4);
my %hash = ();
%hash = map { $_ => 1 } @a; # block form; "map $_ => 1, @a" would treat 1 as a list element
my @new = keys(%hash);
print Dumper(\@new);

Related

Extracting and storing the values in key value pair from a text in file in perl

I have a text file which contains information like this:
name=A
class=B
RollNo=C
I want to extract the values in a Perl script:
key(name) = value(A)
key(class) = value(B)
key(RollNo) = value(C)
The keys should be exported as variables that hold the values, so that whenever we type
print $name
the output is 'A'.
I have tried:
open my $fh, '<', $file_name
or die "Could not open sample.txt: $!";
my @lines = <$fh>;
my %hash;
while (<@lines>) {
chomp;
my ($key, $value) = split /=/;
next unless defined $value;
$hash{$key} = $value;
}
print %hash;
Your code looks pretty good and most of what you've done so far works.
At the end, you run print %hash and that doesn't give you what you expect. That will "unroll" the keys and values from the hash into a list and print that list. So you get all of the keys and values printed out.
If you just want one value (for example, the value associated with the "name" key), then just print that.
print $hash{name};
Is that what you were looking for?
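As a rough sketch of the same idea in a reusable form, the loop can be wrapped in a subroutine. The name `parse_pairs` is invented here, and the limit of 2 passed to `split` is an addition that the original code does not have: it keeps any further `=` characters inside the value.

```perl
use strict;
use warnings;

# Turn "key=value" lines into a hash; lines without '=' are skipped.
# parse_pairs is a hypothetical helper name.
sub parse_pairs {
    my @lines = @_;
    my %hash;
    for my $line (@lines) {
        chomp( my $copy = $line );
        # Limit of 2: split only on the first '=', keep the rest in the value.
        my ( $key, $value ) = split /=/, $copy, 2;
        next unless defined $value;
        $hash{$key} = $value;
    }
    return %hash;
}

my %hash = parse_pairs( "name=A\n", "class=B\n", "RollNo=C\n" );
print "$_ = $hash{$_}\n" for sort keys %hash;
```

Looking values up as `$hash{name}` is usually a safer design than creating a variable such as `$name` from file contents, because the file can never overwrite a variable the script already uses.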
You could try using one of the configuration modules that are available. Config::Tiny seems to fit your data:
use strict;
use warnings;
use Data::Dumper;
use Config::Tiny;
my $Config = Config::Tiny->new;
$Config = Config::Tiny->read( 'a.txt' ); # your text file name goes here
print $Config->{_}{name}; # print the name value
print Dumper $Config; # print all the values in perl variable format
You can store the data in a hash and retrieve it from there.
use strict;
use warnings;
use Data::Dumper;
my %hash = (
name => 'A',
class => 'B',
RollNo => 'C'
);
print Dumper(\%hash);
print $hash{'name'};

Perl grep through large file to match a string

I have an array (@array) which holds a list of elements. I need to check whether each of these elements exists in a master file. If the element exists in the master file, the string YES must also appear on the same line (in the 5th position). In that case the element should be stored in a different array.
My script currently uses two grep shell commands to achieve this. How can I write the same thing in Perl?
...
use Data::Dumper;
my @new_array;
my @array = ('RT0AC1', 'WG3RA3');
print Dumper(\@array);
foreach ( @array ){
my $line = `grep $_ "master_file.csv" | grep -i yes`;
next unless($line);
push( @new_array, $_ );
}
print Dumper(@new_array);
...
where master_file.csv looks like this:
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
So here I am getting the $line value 102,RT0AC1,CONNECTED,WORKING,YES, and the element RT0AC1 is stored in @new_array.
How can I avoid using backtick(`) and two greps to achieve this. I am trying to do this using pure Perl. Also the master_file.csv contains millions of records.
Since all the words you're looking for are in the same location, it's easy to just split up the current line on commas and see if the second column exists in a hash table, and if the fifth column is equal to "YES":
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
my $filename = shift // "master_file.csv"; # Default filename if not given on command line
my @array = qw/RT0AC1 WG3RA3/; # Words you're looking for
my %words = map { $_ => 1 } @array; # Store them in a hash for fast lookup
my @new_array;
# Use Text::CSV_XS for non-trivial CSV files
open my $csv, "<", $filename;
while (<$csv>) {
chomp;
my @F = split /,/;
push @new_array, $F[1] if exists $words{$F[1]} && $F[4] eq "YES";
}
print Dumper(\@new_array);
Form a regex to match the records of interest, split each matching line into fields, and compare field #5 to YES. If it matches, increase a count for field #2 in the %match hash.
Once the file has been processed, the %match hash will have field #2 of the matched records as its keys, and each value will reflect how many times that field appeared with YES in the file.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',@look_for);
while(<DATA>) {
chomp;
next unless /$re_filter/;
my @data = split(',',$_);
$match{$data[1]}++ if $data[4] eq 'YES';
}
say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
Output
$VAR1 = {
'RT0AC1' => 1
};
Remove the __DATA__ section and read from <> instead to get the final code, then give the filename on the command line to process the file with the data of interest:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',@look_for);
while(<>) {
chomp;
next unless /$re_filter/;
my @data = split(',',$_);
$match{$data[1]}++ if $data[4] eq 'YES';
}
say Dumper(\%match);
An alternative version based on a regular expression, without using split:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',@look_for);
my $regex = qr/^\d+,($re_filter),[^,]+,[^,]+,YES$/;
/$regex/ && $match{$1}++ for <DATA>;
say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
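One thing to keep in mind with the regex-based versions: an alternation built with a plain join matches its words anywhere in the line and treats any regex metacharacters in them literally only by luck. A possible hardening, sketched below, escapes each word with `quotemeta` and anchors it to the second field. The subroutine name `count_yes` and the way lines are passed in are assumptions made for this example, not part of the answers above.

```perl
use strict;
use warnings;

# Count lines whose 2nd field is one of @$look_for and whose 5th field is YES.
# count_yes is a hypothetical helper name.
sub count_yes {
    my ( $look_for, $lines ) = @_;
    # quotemeta escapes characters that would otherwise be regex syntax.
    my $alternation = join '|', map { quotemeta($_) } @$look_for;
    # Anchored: field 1, then the wanted word as the whole of field 2,
    # two more fields, and YES as the last field.
    my $regex = qr/^[^,]*,($alternation),[^,]*,[^,]*,YES$/;
    my %match;
    for my $line (@$lines) {
        chomp( my $copy = $line );
        $match{$1}++ if $copy =~ $regex;
    }
    return \%match;
}

my @lines = (
    "101,RT0AC1,CONNECTED,FAULTY,NO\n",
    "102,RT0AC1,CONNECTED,WORKING,YES\n",
    "105,WG3RA3,CONNECTED,WORKING,NO\n",
);
my $match = count_yes( [qw(RT0AC1 WG3RA3)], \@lines );
print "$_: $match->{$_}\n" for sort keys %$match;
```

For a file with millions of records, the same regex could be applied inside a `while (<$fh>)` loop so that only one line is held in memory at a time.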

How to fix the error of "Use of uninitialized value in addition..." in perl script?

Here is user Suic's script for calculating the molecular weight of FASTA sequences (from "calculating molecular weight in perl"):
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
for my $file (@ARGV) {
open my $fh, '<:encoding(UTF-8)', $file;
my $input = join q{}, <$fh>;
close $fh;
while ( $input =~ /^(>.*?)$([^>]*)/smxg ) {
my $name = $1;
my $seq = $2;
$seq =~ s/\n//smxg;
my $mass = calc_mass($seq);
print "$name has mass $mass\n";
}
}
sub calc_mass {
my $a = shift;
my @a = ();
my $x = length $a;
@a = split q{}, $a;
my $b = 0;
my %data = (
A=>71.09, R=>16.19, D=>114.11, N=>115.09,
C=>103.15, E=>129.12, Q=>128.14, G=>57.05,
H=>137.14, I=>113.16, L=>113.16, K=>128.17,
M=>131.19, F=>147.18, P=>97.12, S=>87.08,
T=>101.11, W=>186.12, Y=>163.18, V=>99.14
);
for my $i( @a ) {
$b += $data{$i};
}
my $c = $b - (18 * ($x - 1));
return $c;
}
and the protein.fasta file with n (here 2) sequences:
>seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
When I run: perl molecular_weight.pl protein.fasta > output.txt
in the terminal, it generates the correct results, but it also prints the warning "Use of uninitialized value in addition (+) at molecular_weight.pl line 36", which points to the line "$b += $data{$i};". How can I fix this? Thanks in advance!
You probably have an errant SPACE somewhere in your data file. Just change
$seq =~ s/\n//smxg;
into
$seq =~ s/\s//smxg;
EDIT:
Besides whitespace, there may be some non-whitespace invisible characters in the data, like WORD JOINER (U+2060).
If you want to be sure to be thorough and you know all the legal symbols, you can delete everything apart from them:
$seq =~ s/[^ARDNCEQGHILKMFPSTWYV]//smxg;
Or, to make sure you won't miss any (even if you later change the symbols), you can populate a filter regex dynamically from the hash keys.
You'd need to make %Data and the filter regex global, so the filter is available in the main loop. As a beneficial side effect, you don't need to re-initialize the data hash every time you enter calc_mass().
use strict;
use warnings;
my %Data = (A=>71.09,...);
my $Filter_regex = eval { my $x = '[^' . join('', keys %Data) . ']'; qr/$x/; };
...
$seq =~ s/$Filter_regex//smxg;
(This filter works as long as the symbols are single character. For more complicated ones, it may be preferable to match for the symbols and collect them from the sequence, instead of removing unwanted characters.)
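Put together, the dynamic filter might look like the sketch below. The mass table is deliberately shortened to four entries copied from the question, and the names `clean_seq` and the simplified `calc_mass` are illustrative only. It uses `do` rather than `eval` to build the regex, which is a stylistic choice and not required.

```perl
use strict;
use warnings;

# Shortened, illustrative mass table; the values are copied from the question.
my %Data = ( A => 71.09, S => 87.08, D => 114.11, G => 57.05 );

# Character class matching anything that is NOT a known one-letter symbol.
my $Filter_regex = do {
    my $class = join '', map { quotemeta($_) } keys %Data;
    qr/[^$class]/;
};

# Remove every character that has no entry in %Data (hypothetical helper).
sub clean_seq {
    my ($seq) = @_;
    $seq =~ s/$Filter_regex//g;
    return $seq;
}

# Sum the residue masses, then subtract 18 per peptide bond,
# following the arithmetic of the script in the question.
sub calc_mass {
    my ($seq) = @_;
    my $sum = 0;
    $sum += $Data{$_} for split //, $seq;
    return $sum - 18 * ( length($seq) - 1 );
}

# The input contains a space, a newline and a WORD JOINER (U+2060).
my $seq = clean_seq("AS D\nG\x{2060}");
printf "%s has mass %.2f\n", $seq, calc_mass($seq);
```

Silently deleting unknown characters also hides genuine typos in the sequence, so it may be worth logging what was removed before relying on the result.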

Perl: Printing out the file where a word occurs

I am trying to write a small program that takes file(s) from the command line and prints out the number of occurrences of each word across all files, and in which file it occurs. The first part, finding the number of occurrences of a word, seems to work well.
However, I am struggling with the second part, namely, finding in which file (i.e. file name) the word occurs. I am thinking of using an array that stores the word but don’t know if this is the best way, or what is the best way.
This is the code I have so far and seems to work well for the part that counts the number of times a word occurs in given file(s):
use strict;
use warnings;
my %count;
while (<>) {
my $casefoldstr = lc $_;
foreach my $str ($casefoldstr =~ /\w+/g) {
$count{$str}++;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}:\n";
}
The filename is accessible through $ARGV.
You can use this to build a nested hash with the filename and word as keys:
use strict;
use warnings;
use List::Util 'sum';
my %count; # $count{word}{filename} = number of occurrences
while (<>) {
$count{$_}{$ARGV}++ for map { lc } /\w+/g;
}
foreach my $word ( keys %count ) {
my @files = keys %{ $count{$word} }; # All files containing $word
print "Total word count for '$word': ", sum( @{ $count{$word} }{@files} ), "\n";
for my $file ( @files ) {
print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
}
}
Using an array seems reasonable, if you don't visit any file more than once - then you can always just check the last value stored in the array. Otherwise, use a hash.
#!/usr/bin/perl
use warnings;
use strict;
my %count;
my %in_file;
while (<>) {
my $casefoldstr = lc;
for my $str ($casefoldstr =~ /\w+/g) {
++$count{$str};
push @{ $in_file{$str} }, $ARGV
unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
}
}
foreach my $str (sort keys %count) {
print "$str $count{$str}: @{ $in_file{$str} }\n";
}
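For the "otherwise, use a hash" case, a minimal sketch could look like this. The subroutine `record_line` and the sample file names are invented for the example. In a real script it would be called as `record_line(\%count, \%in_file, $ARGV, $_)` inside `while (<>)`.

```perl
use strict;
use warnings;

# Record the words of one line that was read from $file.
#   $count->{$word}          = total occurrences of the word
#   $in_file->{$word}{$file} = occurrences of the word in that file
# record_line is a hypothetical helper name.
sub record_line {
    my ( $count, $in_file, $file, $line ) = @_;
    for my $word ( lc($line) =~ /\w+/g ) {
        $count->{$word}++;
        $in_file->{$word}{$file}++;
    }
    return;
}

# Made-up input; the same file is deliberately visited twice.
my ( %count, %in_file );
record_line( \%count, \%in_file, 'a.txt', "The cat sat\n" );
record_line( \%count, \%in_file, 'b.txt', "the dog\n" );
record_line( \%count, \%in_file, 'a.txt', "The end\n" );

for my $word ( sort keys %count ) {
    my @files = sort keys %{ $in_file{$word} };
    print "$word $count{$word}: @files\n";
}
```

Because file names are hash keys, each file is listed once per word no matter in which order the lines arrive.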

How to extract multiple columns from a CSV file using Perl

I'm pretty new to Perl and was hoping someone could help me with this issue. I need to extract several columns from a CSV file with embedded commas. This is what the format looks like:
"ID","URL","DATE","XXID","DATE-LONGFORMAT"
I need to extract the DATE column, the XXID column, and the column immediately after XXID. Note, each line doesn't necessarily follow the same number of columns.
The XXID column contains a 2-letter prefix and doesn't always start with the same letter. It can be pretty much any letter of the alphabet. The length is always the same.
Finally, once these three columns are extracted, I need to sort on the XXID column and get a count on duplicates.
I published a module called Tie::Array::CSV which lets Perl interact with your CSV as a native Perl nested array. If you use this, you can take your search logic and apply it just as if your data were already in an array of array-references. Take a look!
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp;
use Tie::Array::CSV;
use List::MoreUtils qw/first_index/;
use Data::Dumper;
# this builds a temporary file from DATA
# normally you would just make $file the filename
my $file = File::Temp->new;
print $file <DATA>;
#########
tie my @csv, 'Tie::Array::CSV', $file;
#find column from data in first row
my $colnum = first_index { /^\w.{6}$/ } @{$csv[0]};
print "Using column: $colnum\n";
#extract that column
my @column = map { $csv[$_][$colnum] } (0..$#csv);
#build a hash of repetitions
my %reps;
$reps{$_}++ for @column;
print Dumper \%reps;
Here's a sample script using the Text::CSV module to parse your csv data. Consult the documentation for the module to find the proper settings for your data.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1 });
while (my $row = $csv->getline(*DATA)) {
print "Date: $row->[2]\n";
print "Col#1: $row->[3]\n";
print "Col#2: $row->[4]\n";
}
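Once the rows are parsed, the remaining step from the question (sorting on XXID and counting duplicates) can be sketched independently of the CSV module. The subroutine name `count_column` and the sample rows below are invented for illustration. The rows are plain array references, which is the shape `getline` returns.

```perl
use strict;
use warnings;

# Count how often each value occurs in column $col of the parsed rows.
# count_column is a hypothetical helper name.
sub count_column {
    my ( $rows, $col ) = @_;
    my %count;
    for my $row (@$rows) {
        # Rows may have fewer columns than expected; skip those.
        next unless defined $row->[$col];
        $count{ $row->[$col] }++;
    }
    return \%count;
}

# Made-up rows in the column order ID, URL, DATE, XXID, DATE-LONGFORMAT.
my @rows = (
    [ '1', 'http://example.com/a,b', '2011-01-01', 'AB12345', 'January 1, 2011' ],
    [ '2', 'http://example.com/c',   '2011-01-02', 'CD67890', 'January 2, 2011' ],
    [ '3', 'http://example.com/d',   '2011-01-03', 'AB12345', 'January 3, 2011' ],
);

my $count = count_column( \@rows, 3 );    # index 3 is the XXID column here
print "$_: $count->{$_}\n" for sort keys %$count;
```

Sorting the hash keys gives the XXID values in string order, and the value stored for each key is the number of rows that share it.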
You definitely want to use a CPAN library for parsing CSV, as you will never account for all the quirks of the format.
Please see: How can I parse quoted CSV in Perl with a regex?
Please see: How do I efficiently parse a CSV file in Perl?
However, here is a very naive and non-idiomatic solution for that particular string you provided:
use strict;
use warnings;
my $string = '"ID","URL","DATE","XXID","DATE-LONGFORMAT"';
my @words = ();
my $word = "";
my $quotec = '"';
my $quoted = 0;
foreach my $c (split //, $string)
{
if ($quoted)
{
if ($c eq $quotec)
{
$quoted = 0;
push @words, $word;
$word = "";
}
else
{
$word .= $c;
}
}
elsif ($c eq $quotec)
{
$quoted = 1;
}
}
for (my $i = 0; $i < scalar @words; ++$i)
{
print "column " . ($i + 1) . " = $words[$i]\n";
}