Opening, splitting and sorting into an array in Perl


I am a beginner programmer who has been given a week-long assignment to build a complex program, and I am having a difficult time getting started. I have been given a set of data, and the goal is to separate it into two arrays by the second column, based on whether the letter is M or F.
This is the code I have so far:
#!/usr/local/bin/perl
open (FILE, "ssbn1898.txt");
$x=<FILE>;
split/[,]/$x;
@array1=$y;
if @array1[2]="M";
print @array2;
else;
print @array3;
close (FILE);
How do I fix this? Please use the simplest terms possible; I started coding last week!
Thank you.

First off - you split on comma, so I'm going to assume your data looks something like this:
one,M
two,F
three,M
four,M
five,F
six,M
There are a few problems with your code:
Turn on strict and warnings. They warn you about possible problems with your code.
open is better written as open ( my $input, "<", $filename ) or die $!;
You only read one line from <FILE>, because assigning it to a scalar $x reads a single line.
You never insert your values into either array.
So to do what you're basically trying to do:
#!/usr/local/bin/perl
use strict;
use warnings;
#define your arrays.
my @M_array;
my @F_array;
#open your file.
open (my $input, "<", 'ssbn1898.txt') or die $!;
#read the file one line at a time - this sets the implicit variable $_ each loop,
#which is what we use for the split.
while ( <$input> ) {
    #remove linefeeds
    chomp;
    #capture values from either side of the comma.
    my ( $name, $id ) = split ( /,/ );
    #test if id is M. We _assume_ that if it's not, it must be F.
    if ( $id eq "M" ) {
        #insert it into our list.
        push ( @M_array, $name );
    }
    else {
        push ( @F_array, $name );
    }
}
close ( $input );
#print the results
print "M: @M_array\n";
print "F: @F_array\n";
You could probably do this more concisely - I'd suggest perhaps looking at hashes next, because then you can associate key-value pairs.
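As a rough sketch of the hash idea mentioned above, the two arrays can live in a single hash keyed by the letter in the second column. The sub name group_by_letter and the sample lines are made up for this illustration, and the "name,letter" layout is the same assumption as in the answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group names by the letter in the second column.
# Takes a list of "name,letter" lines and returns a hash reference
# whose keys are the letters and whose values are array references.
sub group_by_letter {
    my @lines = @_;
    my %names_for;
    for my $line (@lines) {
        chomp $line;
        my ( $name, $id ) = split /,/, $line;
        next unless defined $id;    # skip blank or malformed lines
        push @{ $names_for{$id} }, $name;
    }
    return \%names_for;
}

# Sample data in the assumed "name,letter" format.
my $groups = group_by_letter( "one,M\n", "two,F\n", "three,M\n" );
for my $id ( sort keys %$groups ) {
    print "$id: @{ $groups->{$id} }\n";
}
```

With the sample data this prints "F: two" and then "M: one three". A hash also copes with unexpected letters without any extra if/else branches.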

There's a part function in List::MoreUtils that does exactly what you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use List::MoreUtils 'part';
my ($f, $m) = part { (split /,/)[1] eq 'M' } <DATA>;
say "M: @$m";
say "F: @$f";
__END__
one,M,foo
two,F,bar
three,M,baz
four,M,foo
five,F,bar
six,M,baz
The output is:
M: one,M,foo
three,M,baz
four,M,foo
six,M,baz
F: two,F,bar
five,F,bar

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my @boys=();
my @girls=();
my $fname="ssbn1898.txt"; # I keep stuff like this in a scalar
open (FIN,"< $fname")
    or die "$fname:$!";
while ( my $line=<FIN> ) {
    chomp $line;
    my @f=split(",",$line);
    # the name is in the first column, the M/F letter in the second
    push @boys,$f[0] if $f[1]=~ m/[mM]/;
    push @girls,$f[0] if $f[1]=~ m/[fF]/;
}
print Dumper(\@boys);
print Dumper(\@girls);
exit 0;
# Caveats:
# Code is not tested but should work and definitely shows the concepts
#

In fact the same thing...
#!/usr/bin/perl
use strict;
my (@m,@f);
while(<>){
    push (@m,$1) if(/(.*),M/);
    push (@f,$1) if(/(.*),F/);
}
print "M=@m\nF=@f\n";
Or a "perl -n" (=for all lines do) variant:
#!/usr/bin/perl -n
push (@m,$1) if(/(.*),M/);
push (@f,$1) if(/(.*),F/);
END { print "M=@m\nF=@f\n";}

Related

Parsing the large files in Perl

I need to compare a big file (2 GB, 22 million lines) with another file. It takes too long to process using Tie::File, so I rewrote it with a while loop, but the problem remains. See my code below:
use strict;
use Tie::File;
# use warnings;
my @arr;
# tie @arr, 'Tie::File', 'title_Nov19.txt';
# open(IT,"<title_Nov19.txt");
# my @arr=<IT>;
# close(IT);
open(RE,">>res.txt");
open(IN,"<input.txt");
while(my $data=<IN>){
chomp($data);
print"$data\n";
my $occ=0;
open(IT,"<title_Nov19.txt");
while(my $line2=<IT>){
my $line=$line2;
chomp($line);
if($line=~m/\b$data\b/is){
$occ++;
}
}
print RE"$data\t$occ\n";
}
close(IT);
close(IN);
close(RE);
Please help me reduce the processing time.

Lots of things wrong with this.
Aside from the usual (use warnings commented out, use of 2-argument open(), not checking the open() result, use of global filehandles), the specific problem in your case is that you are opening/reading/closing the second file once for every single line of the first. This is going to be very slow.
I suggest you open the file title_Nov19.txt once, read all the lines into an array or hash or something, then close it; and then you can open the first file, input.txt, and walk along that once, comparing to things in the array so you don't have to reopen that second file all the time.
Further, I suggest you read some basic articles on style etc., as your question is likely to gain more attention if it's actually written to vaguely modern standards.

I tried to build a small example script with a better structure, but I have to say, man, your problem description is really very unclear. It's important not to read the whole comparison file each time, as @LeoNerd explained in his answer. I then use a hash to keep track of the match count:
#!/usr/bin/env perl
use strict;
use warnings;
# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my @comparison = <$comp_file>);
close $comp_file;
# prepare comparison
open my $input, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();
# compare each line
while (my $title = <$input>) {
    chomp $title;
    # iterate comparison strings
    foreach my $comp (@comparison) {
        $count{$comp}++ if $title =~ /\b$comp\b/i;
    }
}
# done
close $input;
# output (in the order the strings appear in input.txt)
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (@comparison) {
    print $output "$comp\t$count{$comp}\n";
}
close $output;
Just to get you started... If someone wants to further work on this: these were my test files:
title_Nov19.txt
This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!
input.txt
foo
bar
And the result of the program was written to res.txt:
foo 3
bar 1

Here's another option using memowe's (thank you) data:
use strict;
use warnings;
use File::Slurp qw/read_file write_file/;
my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';
for ( read_file 'title_Nov19.txt' ) {
my %seen;
!$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}
write_file 'res.txt', map "$_\t$count{$_}\n",
sort { $count{$b} <=> $count{$a} } keys %count;
Numerically-sorted output to res.txt:
foo 3
bar 1
An alternation regex which quotes meta characters (\Q$_\E) is built and used, so only one pass against the large file's lines is needed. The hash %seen is used to ensure that each input word is counted only once per line.
Hope this helps!

Try this:
grep -i -c -w -f input.txt title_Nov19.txt > res.txt

Perl's Chomp: Chomp is removing the whole word instead of the newline

I am facing issues with Perl's chomp function.
I have a test.csv as below:
col1,col2
vm1,fd1
vm2,fd2
vm3,fd3
vm4,fd4
I want to print the 2nd field of this CSV. This is my code:
#!/usr/bin/perl -w
use strict;
my $file = "test.csv";
open (my $FH, '<', $file);
my @array = (<$FH>);
close $FH;
foreach (@array)
{
    my @row = split (/,/,$_);
    my $var = chomp ($row[1]); ### <<< this is the problem
    print $var;
}
The output of the above code is:
11111
I really don't know where the "1" is coming from. The last field can actually be printed as below:
foreach (@array)
{
    my @row = split (/,/,$_);
    print $row[1]; ### << Note that I am not printing "\n"
}
The output is:
vm_cluster
fd1
fd2
fd3
fd4
Now, I am using these field values as input to the DB, and the DB INSERT statement is failing due to this invisible newline. So I thought chomp would help me here. Instead of chomping, it gives me "11111".
Could you help me understand what I am doing wrong here?
Thanks.
Adding more information after reading loldop's response:
If I write it as below, it does not print anything (not even the "11111" output mentioned above):
foreach (@array)
{
    my @row = split (/,/,$_);
    chomp ($row[1]);
    my $var = $row[1];
    print $var;
}
Meaning, chomp is removing the last string as well as the trailing newline.

The reason you see only a string of 1s is that you are printing the value of $var, which is the value returned from chomp. chomp doesn't return the trimmed string; it modifies its parameter in place and returns the number of characters removed from the end. Since it removes exactly one "\n" character from each line here, you get a 1 output for each element of the array.
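A minimal sketch of that behaviour, using a made-up value in place of a real CSV field:

```perl
use strict;
use warnings;

my $field = "fd1\n";

# chomp edits $field in place and returns how many characters it removed.
my $removed = chomp $field;

print "removed: $removed\n";    # removed: 1
print "field: '$field'\n";      # field: 'fd1'
```

So the value to keep is the variable that was passed to chomp, not the value chomp returns.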
You really should use warnings instead of the -w command-line option, and there is no reason here to read the entire file into an array. But well done on using a lexical filehandle with the three-parameter form of open.
Here is a quick refactoring of your program that will do what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'test.csv';
open my $FH, '<', $file or die qq(Unable to open "$file": $!);
while (<$FH>) {
    chomp;
    my @row = split /,/;
    print $row[1], "\n";
}

It was my fault at the beginning.
The chomp function returns 1, which is the result of calling the function, not the chomped string.
You can find a bad example below; it only works if the values are numbers.
Sometimes I use this cheat (don't do that, it is bad hack code!):
map{/filter/ && $_;}@all_to_filter;
Instead of this, use
grep{/filter/}@all_to_filter;
The bad version:
foreach (@array)
{
    my @row = split (/,/,$_);
    my $var = chomp ($row[1]) * $row[1]; ### this is bad code!
    print $var;
}
The correct version:
foreach (@array)
{
    my @row = split (/,/,$_);
    chomp ($row[1]);
    my $var = $row[1];
    print $var;
}

If you simply want to get rid of new lines you can use a regex:
my $var = $row[1];
$var=~s/\n//g;
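If the file may have been saved with Windows line endings, a variation on the same idea is to strip an optional "\r" together with the "\n". This is only a sketch; the helper name strip_eol is made up for the example.

```perl
use strict;
use warnings;

# Strip one trailing line ending, whether "\n" or "\r\n".
sub strip_eol {
    my ($text) = @_;
    $text =~ s/\r?\n\z//;
    return $text;
}

print "[", strip_eol("fd1\r\n"), "]\n";    # [fd1]
print "[", strip_eol("fd2\n"),   "]\n";    # [fd2]
```

Unlike s/\n//g, this touches only the end of the string, so newlines embedded inside a field are left alone.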

So, I was quite frustrated with this easy-looking task bugging me all day long. I really appreciate everyone who responded.
Finally I ended up using the Text::CSV Perl module and accessing each CSV row through an array reference. There was no need to run chomp after switching to Text::CSV.
Here is the code:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<:encoding(utf8)", "vm.csv" or die "vm.csv: $!";
<$fh>; ## this is to remove the column headers.
while ( my $row = $csv->getline ($fh) )
{
print $row->[1];
}
And here is the output:
fd1fd2fd3fd4
Later I pulled out these individual values and inserted them into the DB.
Thanks, everyone.

Alternative to foreach loop with hashes in perl

I have two files, one with text and another with key / hash values. I want to replace occurrences of the key with the hash values. The following code does this, what I want to know is if there is a better way than the foreach loop I am using.
Thanks all
Edit: I know it is a bit strange using
s/\n//;
s/\r//;
instead of chomp, but this works on files with mixed end of line characters (edited both on windows and linux) and chomp (I think) does not.
File with key / hash values (hash.tsv):
strict $tr|ct
warnings w#rn|ng5
here h3r3
File with text (doc.txt):
Do you like use warnings and strict?
I do not like use warnings and strict.
Do you like them here or there?
I do not like them here or there?
I do not like them anywhere.
I do not like use warnings and strict.
I will not obey your good coding practice edict.
The perl script:
#!/usr/bin/perl
use strict;
use warnings;
open (fh_hash, "<", "hash.tsv") or die "could not open file $!";
my %hash =();
while (<fh_hash>)
{
s/\n//;
s/\r//;
my @tmp_hash = split(/\t/);
$hash{ @tmp_hash[0] } = @tmp_hash[1];
}
close (fh_hash);
open (fh_in, "<", "doc.txt") or die "could not open file $!";
open (fh_out, ">", "doc.out") or die "could not open file $!";
while (<fh_in>)
{
foreach my $key ( keys %hash )
{
s/$key/$hash{$key}/g;
}
print fh_out;
}
close (fh_in);
close (fh_out);

One problem with
for my $key (keys %hash) {
s/$key/$hash{$key}/g;
}
is it doesn't correctly handle
foo => bar
bar => foo
Instead of swapping, you end up with all "foo" or all "bar", and you can't even control which.
# Do once, not once per line
my $pat = join '|', map quotemeta, keys %hash;
s/($pat)/$hash{$1}/g;
You might also want to handle
foo => bar
food => baz
by taking the longest rather than possibly ending with "bard".
# Do once, not once per line
my $pat =
join '|',
map quotemeta,
sort { length($b) <=> length($a) }
keys %hash;
s/($pat)/$hash{$1}/g;
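To see both cases handled, here is a small self-contained sketch that wraps the single-pass substitution in a helper. The sub name replace_all and the sample strings are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical swap table: each key should become the other key.
my %hash = ( foo => 'bar', bar => 'foo' );

# Replace every key found in $text in one pass, preferring longer keys.
sub replace_all {
    my ( $text, $map ) = @_;
    my $pat = join '|', map quotemeta,
        sort { length($b) <=> length($a) } keys %$map;
    $text =~ s/($pat)/$map->{$1}/g;
    return $text;
}

print replace_all( "foo bar food", \%hash ), "\n";    # bar foo bard
```

Because each position in the text is matched only once, a replacement is never itself replaced, which is why the swap works. The "bard" in the output is expected here because this table has no "food" key.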

You can read the whole file into a variable and replace all occurrences at once for each key-value pair.
Something like:
use strict;
use warnings;
use YAML;
use File::Slurp;
my $href = YAML::LoadFile("hash.yaml");
my $text = read_file("text.txt");
foreach (keys %$href) {
$text =~ s/$_/$href->{$_}/g;
}
open (my $fh_out, ">", "doc.out") or die "could not open file $!";
print $fh_out $text;
close $fh_out;
produces:
Do you like use w#rn|ng5 and $tr|ct?
I do not like use w#rn|ng5 and $tr|ct.
Do you like them h3r3 or th3r3?
I do not like them h3r3 or th3r3?
I do not like them anywh3r3.
I do not like use w#rn|ng5 and $tr|ct.
I will not obey your good coding practice edict.
To shorten the code I used YAML and replaced your input file with:
strict: $tr|ct
warnings: w#rn|ng5
here: h3r3
and used File::Slurp for reading a whole file into a variable. Of course, you can "slurp" the file without File::Slurp, for example with:
my $text;
{
local($/); #or undef $/;
open(my $fh, "<", $file ) or die "problem $!\n";
$text = <$fh>;
close $fh;
}

Writing to a file in perl

I want to write the key-value pairs that I have populated in the hash to a file. I am using:
open(OUTFILE,">>output_file.txt");
{
foreach my $name(keys %HoH) {
my $values = $HoH{$name};
print "$name: $values\n";
}
}
close(OUTFILE);
Somehow it creates output_file.txt, but it does not write the data to it. What could be the reason?

Use:
print OUTFILE "$name: $values\n";
Without specifying the filehandle in the print statement, you are printing to STDOUT, which is by default the console.
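A quick way to see the difference is to open a filehandle on a string instead of a real file. This is a sketch with made-up data standing in for %HoH; the in-memory filehandle is only there so the example leaves no file behind.

```perl
use strict;
use warnings;

# Hypothetical data standing in for %HoH from the question.
my %HoH = ( alice => 'admin', bob => 'user' );

# Open a filehandle on a string so the example writes no file.
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory file: $!";

for my $name ( sort keys %HoH ) {
    print {$out} "$name: $HoH{$name}\n";    # goes to the filehandle
}
close $out;

print "captured:\n$buffer";                 # plain print goes to STDOUT
```

Only the lines printed with an explicit filehandle end up in $buffer; the final print, which names no filehandle, goes to the console.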

open my $outfile, '>>', "output_file.txt";
print $outfile map { "$_: $HoH{$_}\n" } keys %HoH;
close($outfile);
I cleaned up the code; using the map function here is more concise. I also used a lexical (my) variable for the filehandle, which is always good practice. There are still more ways to do this; you should check out the Perl Cookbook.

When you open OUTFILE you have a couple of choices for how to write to it. One, you can specify the filehandle in your print statements, or two, you can select the filehandle and then print normally (without specifying a filehandle). You're doing neither. I'll demonstrate:
use strict;
use warnings;
use autodie;
my $filename = 'somefile.txt';
open my( $filehandle ), '>>', $filename;
foreach my $name ( keys %HoH ) {
print $filehandle "$name: $HoH{$name}\n";
}
close $filehandle;
If you were to use select, you could do it this way:
use strict;
use warnings;
use autodie;
my $filename = 'somefile.txt';
open my( $filehandle ), '>>', $filename;
my $oldout = select $filehandle;
foreach my $name( keys %HoH ) {
print "$name: $HoH{$name}\n";
}
close $filehandle;
select $oldout;
Each method has its uses, but more often than not, in the interest of writing clear and easy to read/maintain code, you use the first approach unless you have a real good reason.
Just remember, whenever you're printing to a file, specify the filehandle in your print statement.

sergio's answer of specifying the filehandle is the best one.
Nonetheless there is another way: use select to change the default output filehandle. As a further alternative, using while ( each ) rather than foreach ( keys ) can be better in some cases (particularly when the hash is tied to a file somehow and it would take a lot of memory to get all the keys at once).
open(OUTFILE,">>output_file.txt");
select OUTFILE;
while (my ($name, $value) = each %HoH) {
print "$name: $value\n";
}
close(OUTFILE);

How do I push more than one matched groups as same element of array in Perl?

I need to push all the matched groups into an array.
#!/usr/bin/perl
use strict;
open (FILE, "/home/user/name") || die $!;
my @lines = <FILE>;
close (FILE);
open (FH, ">>/home/user/new") || die $!;
foreach $_(@lines){
if ($_ =~ /AB_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/){
print FH "$1 $2 $3 $4 $5 $6 $7\n"; #needs to be first element of array
}
elsif ($_ =~ /CD_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/){
print FH "$1 $2 $3 $4 $5 $6\n"; #needs to be second element of array
}
}close (FH);
_ INPUT _
AB_ first--2-45_ Name_ is34_ correct_ OR_ not_W3478.txt
CD_ second_ input_ 89-is_ diffErnt_ 76-from_Wfirst6.txt
Instead of writing the matched groups to FILE, I want to push them into an array. I can't think of any command other than push, but this function does not accept more than one argument. What is the best way to do this? After pushing the matched groups into the array, the output should look like the following:
_ OUTPUT _
$array[0] = first--2-45 Name is34 correct OR not
$array[1] = second input 89-is diffErnt 76-from

Use the same argument for push that you use for print: A string in double quotes.
push @array, "$1 $2 $3 $4 $5 $6 $7";

Take a look at perldoc -f grep, which returns a list of all elements of a list that match some criterion.
And incidentally, push does take more than one argument: see perldoc -f push.
push @matches, grep { /your regex here/ } @lines;
You didn't include the code leading up to this though.. some of it is a little odd, such as the use of $_ as a function call. Are you sure you want to do that?
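To illustrate both points, here is a short sketch showing push with several arguments and, separately, several captures stored as a single element. The file name and the shortened regex are made up for the example.

```perl
use strict;
use warnings;

my @array;

# push accepts a list, so several values can be appended in one call.
push @array, 'first', 'second';

# To keep several captures together as ONE element, join them first.
if ( 'AB_x_y_W1.txt' =~ /AB_(.+)_(.+)_W.+txt/ ) {
    push @array, "$1 $2";
}

print scalar(@array), " elements: ", join( ' | ', @array ), "\n";
```

This prints "3 elements: first | second | x y": the first push added two elements, while the interpolated string added just one.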

If you are using Perl 5.10.1 or later, this is how I would write it.
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1; # or use 5.010;
use autodie;
my @lines = do{
    # don't need to check for errors, because of autodie
    open( my $file, '<', '/home/user/name' );
    grep {chomp} <$file>;
    # $file is automatically closed
};
# use 3 arg form of open
open( my $file, '>>', '/home/user/new' );
my @matches;
for( @lines ){
    if( /(?:AB|CD)( (?:_[^_]+)+ )_W .+ txt/x ){
        my @match = "$1" =~ /_([^_]+)/g;
        say {$file} "@match";
        push @matches, \@match;
        # or
        # push @matches, [ "$1" =~ /_([^_]+)/g ];
        # if you don't need to print it in this loop.
    }
}
close $file;
This is a little bit more permissive of inputs, but the regex should be a little bit more "correct", than the original.

Remember that a capturing match in list context returns the captured fields, if any:
#!/usr/bin/perl
use strict; use warnings;
my $file = '/home/user/name';
open my $in, '<', $file
    or die "Cannot open '$file': $!";
my @matched;
while ( <$in> ) {
    my @fields;
    if (@fields = /AB_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/
            or @fields = /CD_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/)
    {
        push @matched, "@fields";
    }
}
use Data::Dumper;
print Dumper \@matched;
Of course, you could also do
push @matched, \@fields;
depending on what you intend to do with the matches.
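The difference between the two forms can be sketched as follows. The field values are made up, and the example pushes a copy with [@fields] rather than \@fields so that later changes to @fields cannot affect the stored element.

```perl
use strict;
use warnings;

my @fields = ( 'second', 'input', '89-is' );    # hypothetical captures

my @as_strings;
my @as_refs;
push @as_strings, "@fields";     # one string: "second input 89-is"
push @as_refs,    [@fields];     # a copy of the list, fields kept separate

print $as_strings[0], "\n";      # second input 89-is
print $as_refs[0][1], "\n";      # input
```

The string form is convenient for printing, while the reference form keeps each captured field individually addressable.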

I wonder if using push and giant regexes is really the right way to go.
The OP says he wants lines starting with AB at index 0, and those with CD at index 1.
Also, those regexes look like an inside-out split to me.
In the code below I have added some didactic comments that point out why I am doing things differently than the OP and the other solutions offered here.
#!/usr/bin/perl
use strict;
use warnings; # best use warnings too. strict doesn't catch everything
my $filename = "/home/user/name";
# Using 3 argument open addresses some security issues with 2 arg open.
# Lexical filehandles are better than global filehandles, they prevent
# most accidental filehandle name collisions, among other advantages.
# Low precedence or operator helps prevent incorrect binding of die
# with open's args
# Expanded error message is more helpful
open( my $inh, '<', $filename )
or die "Error opening input file '$filename': $!";
my #file_data;
# Process file with a while loop.
# This is VERY important when dealing with large files.
# for will read the whole file into RAM.
# for/foreach is fine for small files.
while( my $line = <$inh> ) {
chomp $line;
# Simple regex captures the data type indicator and the data.
if( $line =~ /(AB|CD)_(.*)_W.+txt/ ) {
# Based on the type indicator we set variables
# used for validation and data access.
my( $index, $required_fields ) = $1 eq 'AB' ? ( 0, 7 )
: $1 eq 'CD' ? ( 1, 6 )
: ();
next unless defined $index;
# Why use a complex regex when a simple split will do the same job?
my @matches = split /_/, $2;
# Here we validate the field count, since split won't check that for us.
unless( @matches == $required_fields ) {
warn "Incorrect field count found in line '$line'\n";
next;
}
# Warn if we have already seen a line with the same data type.
if( defined $file_data[$index] ) {
warn "Overwriting data at index $index: '@{$file_data[$index]}'\n";
}
# Store the data at the appropriate index.
$file_data[$index] = \@matches;
}
else {
warn "Found non-conformant line: $line\n";
}
}
Be forewarned, I just typed this into the browser window. So, while the code should be correct, there may be typos or missed semicolons lurking--it's untested, use it at your own peril.
