Loop through file in Perl and remove strings with less than 4 characters

I am trying to read in a file, loop through it, remove any strings that have fewer than four characters in them, and then print the list. I come from a JavaScript world and Perl is brand new to me.
use strict;
use warnings;

sub lessThan4 {
    open( FILE, "<names.txt" );
    my @LINES = <FILE>;
    close( FILE );

    open( FILE, ">names.txt" );
    foreach my $LINE ( @LINES ) {
        print FILE $LINE unless ( $LINE.length() < 4 );
    }
    close( FILE );
}

use strict;
use warnings;
# automatically throw exception if open() fails
use autodie;

sub lessThan4 {
    my @LINES = do {
        # modern perl uses lexical, and three arg open
        open(my $FILE, "<", "names.txt");
        <$FILE>;
    };
    # remove newlines
    chomp(@LINES);

    open(my $FILE, ">", "names.txt");
    foreach my $LINE ( @LINES ) {
        print $FILE "$LINE\n" unless length($LINE) < 4;
        # possible alternative to 'unless'
        # print $FILE "$LINE\n" if length($LINE) >= 4;
    }
    close($FILE);
}

You're basically there. I hope you'll find some comments on your code useful.
# Well done for including these. So many new Perl users don't.
use strict;
use warnings;

# Perl programs traditionally use all lower-case subroutine names
sub lessThan4 {
    # 1/ You should use lexical variables for filehandles
    # 2/ You should use the three-argument version of open()
    # 3/ You should always check the return value from open()
    open( FILE, "<names.txt" );

    # Upper-case variable names in Perl are assumed to be global variables.
    # This is a lexical variable, so name it using lower case.
    my @LINES = <FILE>;

    close( FILE );

    # Same problems with open() here.
    open( FILE, ">names.txt" );

    foreach my $LINE ( @LINES ) {
        # This is your biggest problem. Perl doesn't yet embrace the idea of
        # calling methods to get properties of a variable. You need to call
        # length() as a function.
        print FILE $LINE unless ( $LINE.length() < 4 );
    }

    close( FILE );
}
Rewriting to take all that into account, we get the following:
use strict;
use warnings;

sub less_than_4 {
    open( my $in_file_h, '<', 'names.txt' ) or die "Can't open file: $!";
    my @lines = <$in_file_h>;
    close( $in_file_h );

    open( my $out_file_h, '>', 'names.txt' ) or die "Can't open file: $!";
    foreach my $line ( @lines ) {
        # Note: $line will include the newline character, so you might need
        # to increase 4 to 5 here
        print $out_file_h $line unless length $line < 4;
    }
    close( $out_file_h );
}
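If you would rather not adjust the number, here is a small untested variation on that loop: chomp each line before testing it and add the newline back when you print, which is also what the autodie version above does.
    foreach my $line ( @lines ) {
        chomp $line;    # strip the newline so length() counts only visible characters
        print $out_file_h "$line\n" unless length $line < 4;
    }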

I am trying to bring a file loop through it and remove any strings that have less than four characters in it and then print the list.
I suppose you need to remove strings from the file which are less than 4 chars in length.
#!/usr/bin/perl
use strict;
use warnings;

open( my $FH, "<", "names.txt" ) or die "cannot open names.txt: $!";
my @final_list;
while (my $line = <$FH>) {
    map {
        length($_) > 4 and push (@final_list, $_);
    } split (/\s/, $line);
}
print "\nWords with more than 4 chars: @final_list\n";

Please try this one:
use strict;
use warnings;

my @new;
while (<DATA>) {
    # Push all the values less than 4 characters
    push(@new, $_) unless (length($_) > 4);
}
print @new;
__DATA__
Williams
John
Joe
Lee
Albert
Francis
Sun

perl array prints as GLOB(0x...)

I have a file which contains a list of email addresses which are separated by a semi-colon which is configured much like this (but much larger) :
$ cat email_casper.txt
casper1@foo.com; casper2@foo.com; casper3@foo.com; casper.casper4@foo.com;
#these throw outlook error :
#casper101@foo.com ; casper100@foo.com
#cat /tmp/emailist.txt | tr '\n' '; '
#cat /tmp/emallist.txt | perl -nle 'print /\<(.*)\>/' | sort
I want to break it up on the semicolon, so I suck the whole file into an array; supposedly the contents are then split on the semicolon.
#!/usr/bin/perl
use strict;
use warnings;

my $filename = shift @ARGV;
open(my $fh, '<', $filename) or die "Could not open file $filename $!";

my @values = split(';', $fh);
foreach my $val (@values) {
    print "$val\n";
}
exit 0;
But the file rewards me with a glob. I just don't know what is going on.
$ ./spliton_semi.pl email_casper.txt
GLOB(0x80070b90)
If I use Data::Dumper I get
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $filename = shift @ARGV;
open(my $fh, '<', $filename) or die "Could not open file $filename $!";

my @values = split(';', $fh);
print Dumper \@values;
This is what the Dumper returns :
$ ./spliton_semi.pl email_casper.txt
$VAR1 = [
          'GLOB(0x80070b90)'
        ];
You do not "suck the whole file into an array". You don't even attempt to read from the file handle. Instead, you pass the file handle to split. Expecting a string, it stringifies the file handle into GLOB(0x80070b90).
You could read the file into an array of lines as follows:
my @lines = <$fh>;
for my $line (@lines) {
    ...
}
But it's far simpler to read one line at a time.
while ( my $line = <$fh> ) {
...
}
In fact, there is no reason not to use ARGV here, simplifying your program to the following:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );
while (<>) {
    chomp;
    say for split /\s*;\s*/, $_;
}
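For example, you would run it as ./spliton_semi.pl email_casper.txt; the <> operator opens and reads each file named in @ARGV (or STDIN if there are none) without an explicit open.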
This line
my @values = split(';', $fh);
is not reading from the filehandle like you think it is. You're actually calling split on the filehandle object itself.
You want this:
my $line = <$fh>;
my @values = split(';', $line);
Starting point:
#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, '<', 'dummy.txt')
    or die "$!";

my @values;
while (<$fh>) {
    chomp;
    # original
    # push(@values, split(';', $_));
    # handle white space
    push(@values, split(/\s*;\s*/, $_));
}
close($fh);

foreach my $val (@values) {
    print "$val\n";
}
exit 0;
Output for your example:
$ perl dummy.pl
casper1@foo.com
casper2@foo.com
casper3@foo.com
casper.casper4@foo.com

how to assign data into hash from an input file

I am new to Perl.
Inside my input file is:
james1
84012345
aaron5
2332111 42332
2345112 18238
wayne[2]
3505554
Question: I am not sure what is the correct way to get the input and set the name as the key and the numbers as the values. For example, "james1" is the key and "84012345" is the value.
This is my code:
#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper;

my $input = $ARGV[0];
my %hash;
open my $data , '<', $input or die " cannot open file : $_\n";
my @names = split ' ', $data;
my @values = split ' ', $data;
@hash{@names} = @values;
print Dumper \%hash;
I'mma go over your code real quick:
#!/usr/bin/perl -w
-w is not recommended. You should use warnings; instead (which you're already doing, so just remove -w).
use strict;
use warnings;
Very good.
use Data::Dumper;
my $input= $ARGV[0];
OK.
my %hash;
Don't declare variables before you need them. Declare them in the smallest scope possible, usually right before their first use.
open my $data , '<', $input or die " cannot open file : $_\n";
You have a spurious space at the beginning of your error message and $_ is unset at this point. You should include $input (the name of the file that failed to open) and $! (the error reason) instead.
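For example, one possible wording (not from the original post):
    open my $data, '<', $input or die "cannot open file '$input': $!\n";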
my @names = split ' ', $data;
my @values = split ' ', $data;
Well, this doesn't make sense. $data is a filehandle, not a string. Even if it were a string, this code would assign the same list to both @names and @values.
@hash{@names} = @values;
print Dumper \%hash;
My version (untested):
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

@ARGV == 1
    or die "Usage: $0 FILE\n";

my $file = $ARGV[0];

my %hash;
{
    open my $fh, '<', $file or die "$0: can't open $file: $!\n";
    local $/ = '';
    while (my $paragraph = readline $fh) {
        my @words = split ' ', $paragraph;
        my $key = shift @words;
        $hash{$key} = \@words;
    }
}

print Dumper \%hash;
The idea is to set $/ (the input record separator) to "" for the duration of the input loop, which makes readline return whole paragraphs, not lines.
The first (whitespace separated) word of each paragraph is taken to be the key; the remaining words are the values.
You have opened a file with open() and attached the file handle to $data. The regular way of reading data from a file is to loop over each line, like so:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $input = $ARGV[0];
my %hash;
open my $data , '<', $input or die " cannot open file : $_\n";
while (my $line = <$data>) {
    chomp $line;  # Removes extra newlines (\n)
    if ($line) {  # Checks if line is empty
        my ($key, $value) = split ' ', $line;
        $hash{$key} = $value;
    }
}
print Dumper \%hash;
OK, +1 for using strict and warnings.
First, take a look at the $/ variable for controlling how a file is broken into records when it's read in.
$data is a file handle; you need to extract the data from the file. If it's not too big you can load it all into an array; if it's a large file you can loop over one record at a time. See the <> operator in perlop.
Looking at your code, it appears that you want to end up with the following data structure from your input file:
my %hash = (
    james1 => [
        84012345,
    ],
    aaron5 => [
        2332111,
        42332,
        2345112,
        18238,
    ],
    'wayne[2]' => [
        3505554,
    ],
);
See perldsc on how to do that.
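For instance, here is a minimal, untested sketch of one way to build that structure while reading line by line; it assumes any line starting with a letter is a name and every other non-empty line holds numbers:
    my %hash;
    my $key;
    while (my $line = <$data>) {
        chomp $line;
        next unless length $line;
        if ($line =~ /^[a-z]/i) {    # assumption: names start with a letter
            $key = $line;
        }
        else {
            # append this line's numbers to the current name's array
            push @{ $hash{$key} }, split ' ', $line;
        }
    }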
All the documentation can be read using the perldoc command which comes with Perl. Running perldoc on its own will give you some tips on how to use it and running perldoc perldoc will give you possibly far more info than you need at the moment.

What produces the white space in my Perl program?

As the title says, I have a program, or better two functions, to read a file into an array and to write an array back to a file. But now to the main reason why I am writing this: when running my test several times, my test program that tests my functions produces more and more white space. Is there somebody who could explain my mistake and correct me?
My code:
Helper.pm:
#!/usr/bin/env perl
package KconfCtl::Helper;

sub file_to_array($) {
    my $file = shift();
    my ( $filestream, $string );
    my @rray;
    open( $filestream, $file ) or die("cant open $file: $!");
    @rray = <$filestream>;
    close($filestream);
    return @rray;
}

sub array_to_file($$;$) {
    my @rray = @{ shift() };
    my $file = shift();
    my $mode = shift();
    $mode = '>' if not $mode;
    my $filestream;
    if ( not defined $file ) {
        $filestream = STDOUT;
    }
    else {
        open( $filestream, $mode, $file ) or die("cant open $file: $!");
    }
    my $l = @rray; print $l, "\n";
    foreach my $line (@rray) {
        print $filestream "$line\n";
    }
    close($filestream);
}
1;
test_helper.pl:
use KconfCtl::Helper;
use strict;

my @t;
@t = KconfCtl::Helper::file_to_array("kconf.test");
#print @t;
my $t_index = @t;
@t[$t_index] = "n";
KconfCtl::Helper::array_to_file(\@t, "kconf.test", ">");
The result after the first run:
n
and the 2nd run:
n
n
When you read from a file, the data includes the newline characters at the end of each line. You're not stripping those off, but you are adding an additional newline when you output your data again. That means your file gains additional blank lines each time you read and write it.
Also, you must always use strict and use warnings 'all' at the top of every Perl script; you should avoid using subroutine prototypes; and you should declare all of your variables as late as possible.
Here's a more idiomatic version of your module code which removes the newlines on input using chomp. Note that you don't need the #! line on the module file, as it won't be run from the command line, but you may want it on the program file. It's also more normal to export symbols from a module using the Exporter module so that you don't have to qualify the subroutine names by prefixing them with the full package name.
use strict;
use warnings 'all';

package KconfCtl::Helper;

sub file_to_array {
    my ($file) = @_;
    open my $fh, '<', $file or die qq{Can't open "$file" for input: $!}; #'
    chomp(my @array = <$fh>);
    return @array;
}

sub array_to_file {
    my ($array, $file, $mode) = @_;
    $mode //= '>';
    my $fh;
    if ( $file ) {
        open $fh, $mode, $file or die qq{Can't open "$file" for output: $!}; #'
    }
    else {
        $fh = \*STDOUT;
    }
    print $fh $_, "\n" for @$array;
}
1;
and your test program would be like this
#!/usr/bin/env perl
use strict;
use warnings 'all';

use KconfCtl::Helper;

use constant FILE => 'kconf.test';

my @t = KconfCtl::Helper::file_to_array(FILE);
push @t, 'n';
KconfCtl::Helper::array_to_file(\@t, FILE);
When you read in from your file, you need to chomp() the lines, or else the \n at the end of the line is included.
Try this and you'll see what's happening:
use Data::Dumper; ## add this line

sub file_to_array($) {
    my $file = shift();
    my ( $filestream, $string );
    my @rray;
    open( $filestream, '<', $file ) or die("cant open $file: $!");
    @rray = <$filestream>;
    close($filestream);
    print Dumper( \@rray ); ### add this line
    return @rray;
}
you can add
foreach (@rray) {
    chomp();
}
into your module to stop this happening.

Get the header lines of protein sequences that start with specific amino acid in FASTA

Hi guys, so I have been trying to use Perl to print only the headers (the entire >gi line) of protein sequences that start with "MAD" or "MAN" (as the first 3 aa) from a FASTA file, but I couldn't figure out which part went wrong.
Thanks in advance!
#!usr/bin/perl
use strict;

my $in_file = $ARGV[0];
open( my $FH_IN, "<", $in_file );    ###open to fileholder
my @lines = <$FH_IN>;
chomp @lines;

my $index = 0;
foreach my $line (@lines) {
    $index++;
    if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
        print "@lines [$index-1]\n\n";
    } else {
        next;
    }
}
This is a short part of the FASTA file; the header of the first sequence is what I am looking for:
>gi|16128078|ref|NP_414627.1| UDP-N-acetylmuramoyl-L-alanyl-D-glutamate:meso-diaminopimelate ligase [Escherichia coli str. K-12 substr. MG1655]
MADRNLRDLLAPWVPDAPSRALREMTLDSRVAAAGDLFVAVVGHQADGRRYIPQAIAQGVAAIIAEAKDE
ATDGEIREMHGVPVIYLSQLNERLSALAGRFYHEPSDNLRLVGVTGTNGKTTTTQLLAQWSQLLGEISAV
MGTVGNGLLGKVIPTENTTGSAVDVQHELAGLVDQGATFCAMEVSSHGLVQHRVAALKFAASVFTNLSRD
HLDYHGDMEHYEAAKWLLYSEHHCGQAIINADDEVGRRWLAKLPDAVAVSMEDHINPNCHGRWLKATEVN
Your print statement is buggy. Should probably be:
print "$lines[$index-1]\n\n";
However, it's typically better to just process a file line by line unless there is a specific reason you need to slurp the entire thing:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $file = shift;

#open my $fh, "<", $file;
my $fh = \*DATA;

while (<$fh>) {
    # when a line matches /^>/, the <$fh> in the condition reads the
    # following sequence line and tests whether it starts with MAD or MAN
    print if /^>/ && <$fh> =~ /^MA[DN]/;
}
__DATA__
>gi|16128078|ref|NP_414627.1| UDP-N-acetylmuramoyl-L-alanyl-D-glutamate:meso-diaminopimelate ligase [Escherichia coli str. K-12 substr. MG1655]
MADRNLRDLLAPWVPDAPSRALREMTLDSRVAAAGDLFVAVVGHQADGRRYIPQAIAQGVAAIIAEAKDE
ATDGEIREMHGVPVIYLSQLNERLSALAGRFYHEPSDNLRLVGVTGTNGKTTTTQLLAQWSQLLGEISAV
MGTVGNGLLGKVIPTENTTGSAVDVQHELAGLVDQGATFCAMEVSSHGLVQHRVAALKFAASVFTNLSRD
HLDYHGDMEHYEAAKWLLYSEHHCGQAIINADDEVGRRWLAKLPDAVAVSMEDHINPNCHGRWLKATEVN
Outputs:
>gi|16128078|ref|NP_414627.1| UDP-N-acetylmuramoyl-L-alanyl-D-glutamate:meso-diaminopimelate ligase [Escherichia coli str. K-12 substr. MG1655]
Since you want to know how to improve your code, here is a commented version of your program with some suggestions on how you could change it.
#!/usr/bin/perl
use strict;
You should also add the use warnings pragma, which enables warnings (as you might expect).
my $in_file = $ARGV[0];
It's a good idea to check that $ARGV[0] is defined, and to give an appropriate error message if it isn't, e.g.
my $in_file = $ARGV[0] or die "Please supply the name of the FASTA file to process";
If $ARGV[0] is not defined, Perl executes the die statement.
open( my $FH_IN, "<", $in_file ); # open to fileholder
You should check that the script is able to open the input file; you can use a similar structure to the previous statement, by adding a die statement:
open( my $FH_IN, "<", $in_file ) or die "Could not open $in_file: $!";
The special variable $! holds the error message as to why the file could not be opened (e.g. it doesn't exist, no permission to read it, etc.).
my @lines = <$FH_IN>;
chomp @lines;

my $index = 0;
foreach my $line (@lines) {
    $index++;
    if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
        print "@lines [$index-1]\n\n";
This is the problem point in the script. Firstly, the correct way to access an item in the array is using $lines[$index-1]. Secondly, the first item in an array is at index 0, so line 1 of the file will be at position 0 in #lines, line 4 at position 3, etc. Because you've already incremented the index, you're printing the line after the header line. The problem can easily be fixed by incrementing $index at the end of the loop.
    }
    else {
        next;
    }
Using next isn't really necessary here because there is no code following the else statement, so there's nothing to gain from telling Perl to skip the rest of the loop.
The fixed code would look like this:
#!/usr/bin/perl
use warnings;
use strict;

my $in_file = $ARGV[0] or die "Please supply the name of the FASTA file to be processed";
open( my $FH_IN, "<", $in_file ) or die "Could not open $in_file: $!";

my @lines = <$FH_IN>;
chomp @lines;

my $index = 0;
foreach my $line (@lines) {
    if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
        print "$lines[$index-1]\n\n";
    }
    $index++;
}
I hope that is helpful and clear!

Perl - empty rows while writing CSV from Excel

I want to convert Excel files to CSV files with Perl. For convenience I like to use the module File::Slurp for read/write operations. I need it in a subfunction.
While printing out to the screen, the program generates the desired output; the generated CSV files unfortunately just contain one row of semicolons, and the fields are empty.
Here is the code:
#!/usr/bin/perl
use File::Copy;
use v5.14;
use Cwd;
use File::Slurp;
use Spreadsheet::ParseExcel;

sub xls2csv {
    my $currentPath = getcwd();
    my @files = <$currentPath/stage0/*.xls>;

    for my $sourcename (@files) {
        print "Now working on $sourcename\n";

        my $outFile = $sourcename;
        $outFile =~ s/xls/csv/g;
        print "Output CSV-File: " . $outFile . "\n";

        my $source_excel = new Spreadsheet::ParseExcel;
        my $source_book  = $source_excel->Parse($sourcename)
            or die "Could not open source Excel file $sourcename: $!";

        foreach my $source_sheet_number ( 0 .. $source_book->{SheetCount} - 1 ) {
            my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];

            next unless defined $source_sheet->{MaxRow};
            next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
            next unless defined $source_sheet->{MaxCol};
            next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};

            foreach my $row_index ( $source_sheet->{MinRow} .. $source_sheet->{MaxRow} ) {
                foreach my $col_index ( $source_sheet->{MinCol} .. $source_sheet->{MaxCol} ) {
                    my $source_cell = $source_sheet->{Cells}[$row_index][$col_index];
                    if ($source_cell) {
                        print $source_cell->Value, ";"; # correct output!
                        write_file( $outFile, { binmode => ':utf8' }, $source_cell->Value, ";" ); # only one row of semicolons with empty fields!
                    }
                }
                print "\n";
            }
        }
    }
}
xls2csv();
I know it has something to do with the parameter passing in the write_file function, but couldn't manage to fix it.
Has anybody an idea?
Thank you very much in advance.
write_file will overwrite the file unless the append => 1 option is given. So this:
write_file( $outFile, { binmode => ':utf8' }, $source_cell->Value, ";" );
It will write a new file for each new cell value. It does not, however, match your description of "one row of semicolons with empty fields", as it should be only one semicolon and one value.
I am doubtful towards this sentiment from you: "For convenience I like to use the module File::Slurp". While the print statement works as it should, using File::Slurp does not. So how is that convenient?
What you should do, if you still want to use write_file, is to gather all the lines to print, and then print them all at once at the end of the loop. E.g.:
$line .= $source_cell->Value . ";"; # use concatenation to build the line
...
push @out, "$line\n"; # store in array
...
write_file(...., \@out); # print the array
Another simple option would be to use join, or to use the Text::CSV module.
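For illustration, here is a rough, untested sketch of the Text::CSV route. It assumes the cell values of each row have been collected into @rows as one array reference per spreadsheet row; the variable names are made up for the example:
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, sep_char => ';', eol => "\n" })
    or die Text::CSV->error_diag;

open my $out, '>:encoding(UTF-8)', $outFile or die "Can't open $outFile: $!";
for my $row (@rows) {
    $csv->print($out, $row);    # quotes and joins the fields for you
}
close $out;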
Well, in this particular case, File::Slurp was indeed complicating this for me. I just wanted to avoid repeating myself, which I did in the following clumsy working solution:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy;
use v5.14;
use Cwd;
use File::Basename;
use File::Slurp;
use Tie::File;
use Spreadsheet::ParseExcel;
use open qw/:std :utf8/;

# ... other functions

sub xls2csv {
    my $currentPath = getcwd();
    my @files = <$currentPath/stage0/*.xls>;
    my $fh;

    for my $sourcename (@files) {
        say "Now working on $sourcename";

        my $outFile = $sourcename;
        $outFile =~ s/xls/csv/gi;
        if ( -e $outFile ) {
            unlink($outFile) or die "Error: $!";
            print "Old $outFile deleted.";
        }

        my $source_excel = new Spreadsheet::ParseExcel;
        my $source_book  = $source_excel->Parse($sourcename)
            or die "Could not open source Excel file $sourcename: $!";

        foreach my $source_sheet_number ( 0 .. $source_book->{SheetCount} - 1 ) {
            my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];

            next unless defined $source_sheet->{MaxRow};
            next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
            next unless defined $source_sheet->{MaxCol};
            next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};

            foreach my $row_index ( $source_sheet->{MinRow} .. $source_sheet->{MaxRow} ) {
                foreach my $col_index ( $source_sheet->{MinCol} .. $source_sheet->{MaxCol} ) {
                    my $source_cell = $source_sheet->{Cells}[$row_index][$col_index];
                    if ($source_cell) {
                        print $source_cell->Value, ";";
                        open( $fh, '>>', $outFile ) or die "Error: $!";
                        print $fh $source_cell->Value, ";";
                        close $fh;
                    }
                }
                print "\n";
                open( $fh, '>>', $outFile ) or die "Error: $!";
                print $fh "\n";
                close $fh;
            }
        }
    }
}
xls2csv();
I'm actually NOT happy with it, since I'm opening and closing the files so often (I have many files with many lines). That's not very clever in terms of performance.
Currently I still don't know how to use join or Text::CSV in this case, in order to put everything into an array and to open, write, and close each file only once.
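A rough sketch of that accumulate-then-write idea (untested, named to match the loop above): build each line with join, collect the lines in an array, and open each output file only once.
my @lines;
foreach my $row_index ( $source_sheet->{MinRow} .. $source_sheet->{MaxRow} ) {
    my @fields;
    foreach my $col_index ( $source_sheet->{MinCol} .. $source_sheet->{MaxCol} ) {
        my $source_cell = $source_sheet->{Cells}[$row_index][$col_index];
        push @fields, $source_cell ? $source_cell->Value : '';    # keep empty cells as empty fields
    }
    push @lines, join( ';', @fields ) . "\n";
}
open my $fh, '>', $outFile or die "Error: $!";
print $fh @lines;    # one write per file
close $fh;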
Thank you for your answer TLP.