How to assign value to key hashmap Perl? - perl

I have the code below. I want to assign values to each version, but I don't know what I'm doing wrong.
%buildsMap = ();
#read schedule
opendir (DIR, $buildsDir) or die $!;
while ( my $file = readdir DIR ) {
if (($file =~ /(V\d\d\d.*)-DVD/) || ($file =~ /(V\d\d\d.*)/)) {
foreach $version(@versionS){
if ($1 eq $version ){
@temp = @{$buildsMap{$version}};
push @temp,$file;
@{$buildsMap{$version}} = @temp;
}
}
}
}
If I only want to use the keys from this hashmap, it works. Please advise me.

First order of business: turn on strict and warnings. This will uncover any typo'd variables and other mistakes. The major consequence is that you have to declare all your variables. Fixing all that up, and printing out the resulting %buildsMap, we have a minimal example.
use strict;
use warnings;
use v5.10;
my $buildsDir = shift;
my @versionS = qw(V111 V222 V333);
my %buildsMap = ();
#read schedule
opendir (DIR, $buildsDir) or die $!;
while ( my $file = readdir DIR ) {
if (($file =~ /(V\d\d\d.*)-DVD/) || ($file =~ /(V\d\d\d.*)/)) {
foreach my $version (@versionS) {
if ($1 eq $version ){
my @temp = @{$buildsMap{$version}};
push @temp,$file;
@{$buildsMap{$version}} = @temp;
}
}
}
}
for my $version (keys %buildsMap) {
my $files = $buildsMap{$version};
say "$version @$files";
}
Which gives the error Can't use an undefined value as an ARRAY reference at test.plx line 15. That's this line.
my @temp = @{$buildsMap{$version}};
The problem here is how you're working with array references. That specific line fails because if $buildsMap{$version} has no entry you're trying to dereference nothing, and Perl won't allow that, not for a normal dereference. We could fix that, but there are better ways to work with hashes of lists. Here's what you have.
my @temp = @{$buildsMap{$version}};
push @temp,$file;
@{$buildsMap{$version}} = @temp;
That copies all the filenames out into @temp, works with @temp and its more comfortable syntax, and copies them back in. It's inefficient in both memory and amount of code. Instead, we can do it in place: first by initializing the value to an empty array reference, if necessary, and then by pushing the file onto that array reference directly.
$buildsMap{$version} ||= [];
push @{$buildsMap{$version}}, $file;
||= is the or-equals operator. It will only do the assignment if the left-hand side is false. It's often used to set default values. You could also write $buildsMap{$version} = [] if !$buildsMap{$version} but that rapidly gets redundant.
But we don't even need to do that! push is a special case. For convenience, you can dereference an undefined value and pass it to push; Perl creates the array reference for you (this is called autovivification). So we don't need the initializer, just push.
push @{$buildsMap{$version}}, $file;
While the code works, it could be made more efficient. Instead of scanning @versionS for every filename, potentially wasteful if there are a lot of files or a lot of versions, you can use a hash. Now there's no inner loop needed.
my %versionS = ( 'V111' => 1, 'V222' => 2, 'V333' => 3 );
...
if (($file =~ /(V\d\d\d.*)-DVD/) || ($file =~ /(V\d\d\d.*)/)) {
my $version = $1;
if ($versionS{$version}){
push @{$buildsMap{$version}}, $file;
}
}
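To see the pieces above working together without a real directory, here is a minimal self-contained sketch. The file names in @files and the simplified pattern /^(V\d\d\d)/ are assumptions made up for illustration; they stand in for the readdir loop and the original regexes.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the directory listing.
my @files  = qw(V111-DVD.iso V222.zip V999.zip V111.tar);
my %wanted = map { $_ => 1 } qw(V111 V222 V333);   # lookup hash of known versions

my %buildsMap;
for my $file (@files) {
    next unless $file =~ /^(V\d\d\d)/;   # simplified pattern for this sketch
    my $version = $1;
    next unless $wanted{$version};       # hash lookup replaces the inner loop
    # No initializer needed: push creates the array reference on first use.
    push @{ $buildsMap{$version} }, $file;
}

for my $version (sort keys %buildsMap) {
    print "$version: @{ $buildsMap{$version} }\n";
}
```

Versions that never match a file (V333 here) simply get no key in %buildsMap, so the final loop only reports versions that actually have builds.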

Related

Perl - Could not open and read files

I've created a script that validates the XML files in a given input folder. It should pick the XML files out of the input directory, sort them, and check a condition. But it fails with the message not Open at line , <STDIN> line 1.
It does create the log file, but the file is empty.
Since I got a numeric error while sorting, I commented that line out.
So, given an input location, the script should check the XML files and write the errors to the named log file.
Can anyone help with this?
Script
#!/usr/bin/perl
# use strict;
use warnings;
use Cwd;
use File::Basename;
use File::Path;
use File::Copy;
use File::Find;
print "Enter the path: ";
my $filepath = <STDIN>;
chomp $filepath;
die "\n\tpleas give input folder \n" if(!defined $filepath or !-d $filepath);
my $Toolpath = dirname($0);
my $base = basename($filepath);
my $base_path = dirname($filepath);
my ($xmlF, @xmlF);
my @errors=();
my @warnings=();
my @checkings=();
my $ecount=0;
my $wcount=0;
my $ccount=0;
my ($x, $y);
my $z="0";
opendir(DIR,"$filepath");
my @xmlFiles = grep{/\.xml$/} readdir(DIR);
closedir(DIR);
my $logfile = "$base_path\\$base"."_Err.log";
# @xmlF=sort{$a <=> $b}@xmlFiles;
@xmlF=sort{$a cmp $b}@xmlFiles;
open(OUT, ">$logfile") || die ("\nLog file couldnt write $logfile :$!");
my $line;
my $flcnt = scalar (@xmlF);
for ($x=0; $x < $flcnt; $x++)
{
open IN, "$xmlF[$x]" or die "not Open";
print OUT "\n".$xmlF[$x]."\n==================\n";
print "\nProcessing File $xmlF[$x] .....\n";
local $/;
while ($line=<IN>)
{
while ($line=~m#(<res(?: [^>]+)? type="weblink"[^>]*>)((?:(?!</res>).)*)</res>#igs)
{
my $tmp1 = $1; my $tmp2 = $&; my $pre1 = $`;
if($tmp1 =~ m{ subgroup="Weblink"}i){
my $pre = $pre1.$`;
if($tmp2 !~ m{<tooltip><\!\[CDATA\[Weblink\]\]><\/tooltip>}ms){
my $pre = $pre1.$`;
push(@errors,lineno($pre),"\t<tooltip><\!\[CDATA\[Weblink\]\]></tooltip> is missing\n");
}
}
}
foreach my $warnings(@warnings)
{
$wcount = $wcount+1;
}
foreach my $checkings(@checkings)
{
$ccount = $ccount+1;
}
foreach my $errors(@errors)
{
$ecount = $ecount+1;
}
my $count_err = $ecount/2;
print OUT "".$count_err." Error(s) Found:-\n------------------------\n ";
print OUT "@errors\n";
$ecount = 0;
my $count_war = $wcount/2;
print OUT "$count_war Warning(s) Found:-\n-------------------------\n ";
print OUT "@warnings\n";
$wcount = 0;
my $count_check = $ccount/2;
print OUT "$count_check Checking(s) Found:-\n-------------------------\n ";
print OUT "@checkings\n";
$wcount = 0;
undef @errors;
undef @warnings;
undef @checkings;
close IN;
}
}
The readdir returns bare file names, without the path.
So when you go ahead to open those files you need to prepend the names returned by readdir with the name of the directory the readdir read them from, here $filepath. Or build the full path names right away
use warnings;
use strict;
use feature 'say';
use File::Spec;
print "Enter the path: ";
my $filepath = <STDIN>;
chomp $filepath;
die "\nPlease give input folder\n" if !defined $filepath or !-d $filepath;
opendir(my $fh_dir, $filepath) or die "Can't opendir $filepath: $!";
my @xml_files =
map { File::Spec->catfile($filepath, $_) }
grep { /\.xml$/ }
readdir $fh_dir;
closedir $fh_dir;
say for @xml_files;
where I used File::Spec to portably piece together the file name.
The map can be made to also do grep's job so to make only one pass over the file list
my @xml_files =
map { /\.xml$/ ? File::Spec->catfile($filepath, $_) : () }
readdir $fh_dir;
The empty list () gets flattened in the returned list, effectively disappearing altogether.
Here are some comments on the code. Note that this is normally done at Code Review but I feel that it is needed here.
First: a long list of variables is declared upfront. It is in fact important to declare in as small a scope as possible. It turns out that most of those variables can indeed be declared where they are used, as seen in comments below.
The location of the executable is best found using
use FindBin qw($RealBin);
where $RealBin also resolves links (as opposed to $Bin, also available)
Assigning () to an array at declaration doesn't do anything; it is exactly the same as a plain my @errors;. They can also be declared together, my (@errors, @warnings, @checks);. If the array already holds something then = () clears it, which is a good way to empty an array
Assigning a "0" makes the variable a string. While Perl normally converts between strings and numbers as needed, if a number is needed then use a number, my $z = 0;
Lexical filehandles (open my $fh, ...) are better than globs (open FH, ...)
I don't understand the comment about "numeric error" in sorting. The cmp operator sorts lexicographically, for numeric sort use <=>
When an array is used in scalar context – when assigned to a scalar for example – the number of elements is returned. So there is no need for scalar; just do my $flcnt = @xmlF;
For iteration over array indices use $#ary, the index of the last element of @ary, as in
foreach my $i (0..$#xmlF) { ... }
But if there aren't any uses of the index (I don't see any) then loop over elements
foreach my $file (@xmlF) { ... }
When you check the file open print the error $!, open ... or die "... : $!";. This is done elsewhere in the code, and it should be done always.
The local $/; unsets the input record separator, which makes the following read take the whole file. If that is intended then $line is not a good name. Also note that a variable can be declared inside the condition, while (my $line = <$fh>) { }
I can't comment on the regex as I don't know what it's supposed to accomplish, but it is complex; any chance to simplify all that?
The series of foreach loops only works out the number of elements of those arrays; there is no need for loops then, just my $ecount = @errors; (etc). This also allows you to keep the declaration of those counter variables in minimal scope.
The undef @errors; (etc) aren't needed since those arrays count for each file and so you can declare them inside the loops, anew at each iteration (and at smallest scope). When you wish to empty an array it is better to do @ary = (); than to undef it; that way it's not allocated all over again on the next use
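Several of the comments above can be seen together in a small sketch. It is not the original script: the scratch directory, the file names, and the simplified "tooltip is missing" check are all assumptions made up so the example runs on its own.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical setup: a scratch directory holding two XML files and one other file.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(b.xml a.xml notes.txt)) {
    my $path = File::Spec->catfile($dir, $name);
    open my $fh, '>', $path or die "Can't write $path: $!";
    print {$fh} "<res type=\"weblink\"></res>\n";
    close $fh;
}

# readdir returns bare names, so build full paths right away.
opendir(my $dh, $dir) or die "Can't opendir $dir: $!";
my @xml_files = map { /\.xml$/ ? File::Spec->catfile($dir, $_) : () } readdir $dh;
closedir $dh;
@xml_files = sort @xml_files;

my $file_count = @xml_files;   # array in scalar context: number of elements

my %errors_for;
for my $file (@xml_files) {    # loop over elements, no index needed
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $content = do { local $/; <$fh> };   # slurp the whole file
    close $fh;

    my @errors;   # declared per file, so nothing has to be reset afterwards
    push @errors, "tooltip is missing" if $content !~ /<tooltip>/;
    $errors_for{$file} = @errors;           # scalar assignment stores the count
}

print "$file_count file(s) checked\n";
```

Declaring @errors inside the loop is what removes the need for the counting loops and the undef calls in the original.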

Interpolating a non-interpolated passed string inside a subroutine in Perl

I am looking to parse a tab delimited text file into a nested hash with a subroutine. Each file row will be keyed by a unique id from a uid column(s), with the header row as nested keys. Which column(s) is(are) to become the uid changes (as sometimes there isn't a unique column, so the uid has to be a combination of columns). My issue is with the $uid variable, which I pass as a non-interpolated string. When I try to use it inside the subroutine in an interpolated way, it will only give me the non-interpolated value:
use strict;
use warnings;
my $lofrow = tablehash($lof_file, '$row{gene}', "transcript", "ENST");
##sub to generate table hash from file w/ headers
##input values are file, uid, header starter, row starter, max column number
##returns hash reference (deref it)
sub tablehash {
my ($file, $uid, $headstart, $rowstart, $colnum) = @_;
if (!$colnum){ # takes care of a unknown number of columns
$colnum = 0;
}
open(INA, $file) or die "failed to open $file, $!\n";
my %table; # permanent hash table
my %row; # hash of column values for each row
my @names = (); # column headers
my @values = (); # line/row values
while (chomp(my $line = <INA>)){ # reading lines for lof info
if ($line =~ /^$headstart/){
@names = split(/\t/, $line, $colnum);
} elsif ($line =~ /^$rowstart/){ # splitting lof info columns into variables
@values = split(/\t/, $line, $colnum);
@row{@names} = @values;
print qq($uid\t$row{gene}\n); # problem: prints "$row{gene} ACB1"
$table{"$uid"} = { %row }; # puts row hash into permanent hash, but with $row{gene} key)
}
}
close INA;
return \%table;
}
I am out of ideas. I could put $table{$row{$uid}} and simply pass "gene", but in a couple of instances I want to have a $uid of "$row{gene}|$row{rsid}" producing $table{ACB1|123456}
Interpolation is a feature of the Perl parser. When you write something like
"foo $bar baz"
, Perl compiles it into something like
'foo ' . $bar . ' baz'
It does not interpret data at runtime.
What you have is a string where one of the characters happens to be $ but that has no special effect.
There are at least two possible ways to do something like what you want. One of them is to use a function, not a string. (Which makes sense because interpolation really means concatenation at runtime, and the way to pass code around is to wrap it in a function.)
my $lofrow = tablehash($lof_file, sub { my ($row) = @_; $row->{gene} }, "transcript", "ENST");
sub tablehash {
my ($file, $mkuid, $headstart, $rowstart, $colnum) = @_;
...
my $uid = $mkuid->(\%row);
$table{$uid} = { %row };
Here $mkuid isn't a string but a reference to a function that (given a hash reference) returns a uid string. tablehash calls it, passing a reference to %row to it. You can then later change it to e.g.
my $lofrow = tablehash($lof_file, sub { my ($row) = @_; "$row->{gene}|$row->{rsid}" }, "transcript", "ENST");
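As a runnable illustration of the callback approach, here is a stripped-down sketch. build_table and the in-memory @rows are hypothetical stand-ins for tablehash and the parsed file; only the idea of passing a code reference that computes the uid is taken from the answer.

```perl
use strict;
use warnings;

# Hypothetical rows standing in for parsed lines of the tab-delimited file.
my @rows = (
    { gene => 'ACB1', rsid => '123456' },
    { gene => 'XYZ2', rsid => '987654' },
);

# build_table is an illustrative helper, not the original tablehash.
sub build_table {
    my ($rows, $mkuid) = @_;
    my %table;
    for my $row (@$rows) {
        my $uid = $mkuid->($row);   # evaluated at runtime, once per row
        $table{$uid} = { %$row };   # store a copy of the row under its uid
    }
    return \%table;
}

my $by_gene = build_table(\@rows, sub { $_[0]{gene} });
my $by_both = build_table(\@rows, sub { "$_[0]{gene}|$_[0]{rsid}" });

print join(', ', sort keys %$by_gene), "\n";
print join(', ', sort keys %$by_both), "\n";
```

The interpolation inside the anonymous sub happens each time the sub is called, which is exactly what a plain single-quoted string cannot do.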
Another solution is to use what amounts to a template string:
my $lofrow = tablehash($lof_file, "gene|rsid", "transcript", "ENST");
sub tablehash {
my ($file, $uid_template, $headstart, $rowstart, $colnum) = @_;
...
(my $uid = $uid_template) =~ s/(\w+)/$row{$1}/g;
$table{$uid} = { %row };
The s/// code goes through the template string and manually replaces every word by the corresponding value from %row.
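The template substitution can also be tried in isolation. In this sketch make_uid is a hypothetical helper name, and dying on an unknown column is an added assumption (the s/// shown above would silently substitute an undefined value instead).

```perl
use strict;
use warnings;

my %row = (gene => 'ACB1', rsid => '123456');

# make_uid is a hypothetical helper: it fills a template such as "gene|rsid".
sub make_uid {
    my ($template, $row) = @_;
    # Replace every word in the template by the matching column value.
    (my $uid = $template) =~ s{(\w+)}{ $row->{$1} // die "no column named '$1'\n" }ge;
    return $uid;
}

print make_uid('gene',      \%row), "\n";
print make_uid('gene|rsid', \%row), "\n";
```

Because only \w+ runs are replaced, separators such as | pass through untouched, so the same helper covers single-column and combined keys.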
Random notes:
Bonus points for using strict and warnings.
if (!$colnum) { $colnum = 0; } can be simplified to $colnum ||= 0;.
Use lexical variables instead of bareword filehandles. Barewords are effectively global variables (and syntactically awkward because they're not first-class citizens of the language).
Always use the 3-argument form of open to avoid unexpected interpretation of the second argument.
Include the name of your program in error messages (either explicitly with $0 or implicitly by omitting \n from die).
my @foo = (); my %bar = (); is redundant and can be simplified to my @foo; my %bar;. Arrays and hashes start out empty; overwriting them with an empty list is pointless.
chomp(my $line = <INA>) will throw a warning when you reach EOF (because you're trying to chomp a variable containing undef).
my %row; should probably be declared inside the loop. It looks like it's supposed to only contain values from the current line.
Suggestion:
open my $fh, '<', $file or die "$0: can't open $file: $!\n";
while (my $line = readline $fh) {
chomp $line;
...
}

Read a file into two hashes in order to retain the order

I am trying to read a file with user information categorized under a location. I want to fill in some of the fields using user input and output the file while keeping the fields under each location intact. For example, the file:
[California]
$;FIrst_Name =
$;Last_Name=
$;Age =
[NewYork]
$;FIrst_Name =
$;Last_Name=
$;Age =
[Washington]
$;FIrst_Name =
$;Last_Name=
$;Age =
Once the user provides input from the command line, it should look like this:
[California]
$;FIrst_Name = Jack
$;Last_Name= Daner
$;Age = 27
[NewYork]
$;FIrst_Name = Jill
$;Last_Name= XYZ
$;Age = 30
[Washington]
$;FIrst_Name = Kim
$;Last_Name= ABC
$;Age = 25
The order of First_Name, Last_Name and Age within each location can change, and even the order of locations can change, but each location section should remain separate and intact. I have written the following code so far, and some of it works for reading the whole file into one hash, but I am not able to preserve each location section within it. I tried using two hashes. Can someone please help me, as it is getting really complex for me? Thanks a lot. (I had another issue with a similar file as well, but unfortunately could not resolve it either.)
EDITED code
Open the file
use strict;
use warnings;
use Getopt::Long;
sub read_config {
my $phCmdLineOption = shift;
my $phConfig = shift;
my $sInputfile = shift;
open($input.file, "<$InputFile") or die "Error! Cannot open $InputFile for reading: $!";
while (<$input.file>) {
$_ =~ s/\s+$//;
next if ($_ =~ /^#/);
next if ($_ =~ /^$/);
if ($_ =~ m/^\[(\S+)\]$/) {
$sComponent = $1;
next;
}
elsif ($_ =~ m/^;;\s*(.*)/) {
$sDesc .= "$1.";
next;
}
elsif ($_ =~ m/\$;(\S+)\$;\s*=\s*(.*)/) {
$sParam = $1;
$sValue = $2;
if ((defined $sValue) && ($sValue !~ m/^\s*$/)) {
$phfield->{$sCategory}{$sParam} = ["$sValue", "$sDesc"];
}
else {
$field->{$sCategory}{$sParam} = [undef, "$sDesc"];
}
}
$sParam = $sValue = $sDesc = "";
next;
}
}
Write the new file -
sub write_config {
my $phCmdLineOption = shift;
my $phConfig = shift;
my $sOut = shift;
open(outfile, ">$sOut") or die " $!";
foreach $sCategory (sort {$a cmp $b} keys %{$fields}) {
print $outfile "[$sCategory]\n";
foreach $sParam (sort {$a cmp $b} keys %{$fields-{$sCategory}}) {
$sDesc = ((defined $phConfig->{$sCategory}{$sParam}[1]) $fields->{$sCategory}{$sParam}[1] : "");
print $outfile ";;$sDesc\n" if ((defined $sDesc) && ($sDesc !~ m/^$/));
$sValue = ((defined $fields->{$sCategory}{$sParam}[0]) ? $fields->{$sCategory}{$sParam}[0] : undef);
print $outfile "$sValue" if (defined $sValue);
print $outfile "\n";
}
print $outfile "\n";
}
close($outfile);
return;
Note - I have posted this question on PerlMonks forum as well. Thanks a lot!
I think you're getting lost in the detail and skipping over some basics, which is unnecessarily complicating the problem. Those basics are:
Indent your code properly (it's amazing the difference this makes)
Always use the /x modifier on regexes, with lots of whitespace, to increase readability
When using lots of regexes, use the "quote regex" operator, qr, to separate regex definition from regex use
Apart from that, you were headed in the right direction but there are a couple of insights on the algorithm you were missing which further increased the complexity.
Firstly, for small-time parsing of data, look out for the possibility that matching one type of line immediately disqualifies matching of other types of line. All the elsif's aren't necessary since a line that matches a category is never going to match a LastName or Age and vice versa.
Secondly, when you get a match, see if you can do what's needed immediately rather than storing the result of the match for processing later. In this case, instead of saving a "component" or "category" in a variable, put it immediately into the hash you're building.
Thirdly, if you're updating text files that are not huge, consider working on a new version of the file and then, at the end of the program, declaring the current version old and the new version current. This reduces the chances of unintentionally modifying something in place and allows comparison of the update with the original after execution. If necessary, a "rollback" of the change is trivially easy, which one of your users may be very grateful for one day.
Fourthly and most of all, you've only got a couple of attributes or components to worry about, so deal with them in the concrete rather than the abstract. You can see below that I've looped over qw( First_Name Last_Name Age) rather than all keys of the hash. Now obviously, if you have to deal with open-ended or unknown attributes you can't do it this way but in this case, AFAICT, your fields are fixed.
Here's a version that basically works given the above mentioned constraints.
#!/usr/bin/env perl
use v5.12 ;
use Getopt::Long ;
my %db ; # DB hash
my $dbf = "data.txt" ; # DB file name
my $dbu = "data.new" ; # updated DB file name
my $dbo = "data.old" ; # Old DB file name
my ($cat, $first, $last, $age) ; # Default is undef
GetOptions( 'cat=s' => \$cat ,
'first=s' => \$first ,
'last=s' => \$last ,
'age=i' => \$age
);
die "Category option (--cat=...) is compulsory\n" unless $cat ;
open my $dbh, '<', $dbf or die "$dbf: $!\n"; # DB Handle
open my $uph, '>', $dbu or die "$dbu: $!\n"; # UPdate Handle
# REs for blank line, category header and attribute specification
my $blank_re = qr/ ^ \s* $ /x ;
my $cat_re = qr/ ^ \[ (\w+) \] \s* $ /x ;
my $attr_re = qr/ ^ \$ ; (?<key>\w+) \s* = \s* (?<val>\N*) $ /x ;
while ( <$dbh> ) {
next unless /$cat_re/ ;
my %obj = ( cat => $1 ) ;
while ( <$dbh> ) {
$obj{ $+{key} } = $+{val} if /$attr_re/ ;
last if /$blank_re/
}
$db{ $obj{cat} } = \%obj
}
# Grab existing obj, otherwise presume we're adding a new one
my $obref = $db{ $cat } // { cat => $cat } ;
$obref->{ First_Name } = $first if defined $first ;
$obref->{ Last_Name } = $last if defined $last ;
$obref->{ Age } = $age if defined $age ;
# Update the DB with the modified/new one
$db{ $obref->{cat} } = $obref ;
for (sort keys %db) {
my $obref = $db{ $_ } ;
printf $uph "[%s]\n", $obref->{ cat } ;
for (qw( First_Name Last_Name Age )) {
printf $uph '$;' . "%s = %s\n", $_, $obref->{ $_ }
}
print $uph "\n"
}
close $dbh ;
close $uph ;
rename $dbf , $dbo ;
rename $dbu , $dbf ;
exit 0
User input here needs to be organized, and for this we can use named options for each field, plus one for the state. The Getopt option for reading into a hash is useful here. We also need to associate the names of these options with field names. With that in hand it is simple to process the file since we have a ready mechanism to identify lines of interest.
By putting lines on a ref-array we can keep the order as well, and that refarray is a value for the section-key in the hash. The hash is not necessary but adds flexibility for future development. Once we are at it we can also keep the order of sections by using a simple auxiliary array.
use warnings;
use strict;
use Getopt::Long;
use feature qw(say);
# Translate between user input and field name ($;) in file
my ($o1, $o2, $o3) = qw(first last age);
my @tags = ('FIrst_Name', 'Last_Name', 'Age');
my %desc = ($tags[0] => $o1, $tags[1] => $o2, $tags[2] => $o3);
my (%input, $state);
GetOptions(\%input, "$o1=s", "$o2=s", "$o3=i", 'state=s' => \$state);
my $locinfo = 'loc_info.txt';
open my $in_fh, '<', $locinfo or die "Can't open $locinfo: $!";
my (%conf, @sec_order, $section, $field);
while (my $line = <$in_fh>)
{
chomp($line);
next if $line =~ m/^\s*$/;
# New section ([]), for hash and order-array
if ($line =~ m/^\s*\[(.*)\]/) {
push @sec_order, $section = $1;
next;
}
# If we are in a wrong state just copy the line
if ($section ne $state) {
push @{$conf{$section}}, $line . "\n";
next;
}
if (($field) = $line =~ m/^\$;\s*(.*?)\s*=/ ) {
if (exists $input{$desc{$field}}) {
# Overwrite what is there or append
$line =~ s|^\s*(.*?=\s*)(.*)|$1 $input{$desc{$field}}|;
}
}
else { warn "Unexpected line: |$line| --" }
push @{$conf{$section}}, $line . "\n";
}
close $in_fh;
for (@sec_order) { say "[$_]"; say @{$conf{$_}}; }
Invocation
script.pl -state STATE -first FIRST_NAME -last LAST_NAME -age INT
Any option may be left out in which case that field is not touched. A field supplied on the command line will be overwritten if it has something. (This can be changed easily.) This works for a single-state entry as it stands but which is simple to modify if needed.
This is a basic solution. The next step would be to read the field names from the file itself, instead of having them hard-coded. (This would avoid the need to spot the typo FIrst and the inconsistent spacing before =, for one thing.) But the more refinements are added, the more one is getting into template development. At some point soon it will be a good idea to use a module.
Note The regex delimiter above is different than elsewhere (|) to avoid the editor coloring all red.
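The order-keeping idea (an auxiliary array for section order plus an array reference per section for line order) can be isolated in a few lines. This is only a sketch: the input is inlined as a heredoc instead of being read from loc_info.txt, and no command-line handling is included.

```perl
use strict;
use warnings;

# Sample input mirroring the question's layout, inlined for the sketch.
my $text = <<'END';
[NewYork]
$;FIrst_Name =
$;Age =
[California]
$;Last_Name=
END

my (%conf, @sec_order, $section);
for my $line (split /\n/, $text) {
    if ($line =~ /^\s*\[(.*)\]/) {
        $section = $1;
        push @sec_order, $section;      # remembers the order of sections
        next;
    }
    next unless defined $section;       # ignore lines before the first section
    push @{ $conf{$section} }, $line;   # remembers line order within a section
}

# Output follows the original order, not the hash's internal order.
for my $sec (@sec_order) {
    print "[$sec]\n";
    print "$_\n" for @{ $conf{$sec} };
}
```

The hash alone would lose the section order, which is why @sec_order is kept alongside it.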

How to use Lingua::EN::Ngram for multiple files

I am implementing a naive Bayesian classification algorithm. In my training set I have a number of abstracts in separate files. I want to use N-gram in order to get the term frequency weight, but the code is not taking multiple files.
I edited my code, and now the error I am getting is
Can't call method "tscore" on an undefined value. To check this, I printed @ngrams and it is showing me junk values like hash0*29G45 or something like that.
#!c:\perl\bin\perl.exe -w
use warnings;
use Algorithm::NaiveBayes;
use Lingua::EN::Splitter qw(words);
use Lingua::StopWords qw(getStopWords);
use Lingua::Stem;
use Algorithm::NaiveBayes;
use Lingua::EN::Ngram;
use Data::Dumper;
use Text::Ngram;
use PPI::Tokenizer;
use Text::English;
use Text::TFIDF;
use File::Slurp;
my $pos_file = 'D:\aminoacids';
my $neg_file = 'D:\others';
my $test_file = 'D:\testfiles';
my @vectors = ();
my $categorizer = Algorithm::NaiveBayes->new;
my @files = <$pos_file/*>;
my @ngrams;
for my $filename (@files) {
open(FH, $filename);
my $ngram = Lingua::EN::Ngram->new($filename);
my $tscore = $ngram->tscore;
foreach (sort { $$tscore{$b} <=> $$tscore{$a} } keys %$tscore) {
print "$$tscore{ $_ }\t" . "$_\n";
}
my $trigrams = $ngram->ngram(2);
foreach my $trigram (sort { $$trigrams{$b} <=> $$trigrams{$a} } keys %$trigrams) {
print $$trigrams{$trigram}, "\t$trigram\n";
}
my %positive;
$positive{$_}++ for @files;
$categorizer->add_instance(
attributes => \%positive,
label => 'positive'
);
}
close FH;
Your code <$pos_file/*> should work fine (thanks @borodir); still, here is an alternative so as to not mess up the history.
Try
opendir (DIR, $directory) or die $!;
and then
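while (my $filename = readdir(DIR)) {
next unless -f "$directory/$filename"; # skip ., .. and subdirectories
open ( my $fh, '<', "$directory/$filename" ) or die "$directory/$filename: $!";
# work with filehandle
close $fh;
}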
closedir DIR;
If called in list context, readdir should give you a list of files:
my @filenames = readdir(DIR);
# you could call that function you wanted to call with this list, file would need to be
# opened still, though
Another point:
If you want to pass a reference to an array, do it like so:
function( list => \@stems );
# thus, your ngram line should probably rather be
my $ngram = Lingua::EN::Ngram->new (file => \@stems );
However, the docs for Lingua::EN::Ngram only talk about scalar for file and so on, it does not seem to expect an array for input. ( Exception being the 'intersection' method )
So you would have to put it in a loop and cycle through, or use map
my @ngrams = map { Lingua::EN::Ngram->new( file => $_ ) } @filenames;
Seems unnecessary to open in filehandle first, Ngram does that by itself.
If you prefer a loop:
my @ngrams;
for my $filename ( @filenames ){
push #ngrams, Lingua::EN::Ngram->new( file => $filename );
}
I think now I got what you actually want to do.
get the tscore: you wrote $tscore = $ngram->tscore, but $ngram is not defined anymore.
Not sure how to get the tscore for a single word. ( "significance of word in text" ) kind of indicates a text.
Thus: make an ngram not for each word, but either for each sentence or each file.
Then you can determine the t-score of that word in that sentence or file ( text ).
for my $filename ( @files ){
my $ngram = Lingua::EN::Ngram->new( file => $filename );
my $tscore = $ngram->tscore();
# tscore returns a hash reference. Keys are bigrams, values are tscores
# now you can do with the tscore what you like. Note that for arbitrary length,
# tscore will not work. This you would have to do yourself.
}

perl creates unexplained file called "1"

I have been trying to get rid of a weird bug for hours, with no success. I have a subroutine that sorts a file. here is the code:
sub sort_file {
$filename = @_;
print @_;
print $filename;
open(SRTINFILE,"<$filename");
@lines=<SRTINFILE>;
close(SRTINFILE);
open(SRTOUTFILE,">$filename");
@sorted = sort { @aa=split(/ /,$a); @bb=split(/ /,$b); return ($aa[1] <=> $bb[1]); } @lines;
print SRTOUTFILE @sorted;
close(SRTOUTFILE);
}
any time this function is run, perl creates a file, called "1". i have no idea why. I am a complete perl noob and am just using it for quick and dirty text file processing. anyone know what's wrong?
An array in scalar context evaluates to the number of elements in the array. If you pass one argument to the function, the following assigns 1 to $filename.
$filename = @_;
You want any of the following:
$filename = $_[0];
$filename = shift;
($filename) = @_;
Furthermore, you want to limit the scope of the variable to the function, so you want
my $filename = $_[0];
my $filename = shift;
my ($filename) = @_;
(my $filename) = @_; # Exact same as previous.
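The difference between the two assignments is easy to check in isolation. The sub names count_args and first_arg below are made up for this sketch.

```perl
use strict;
use warnings;

# Scalar assignment: @_ is evaluated in scalar context, giving the argument count.
sub count_args { my $n = @_; return $n }

# List assignment (note the parentheses): the first argument is copied.
sub first_arg { my ($f) = @_; return $f }

print count_args('data.txt'), "\n";   # the count, which is why a file named "1" appeared
print first_arg('data.txt'),  "\n";   # the actual file name
```

The parentheses around the variable on the left-hand side are the only difference, and they are what switch the assignment to list context.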
The other answers are sufficient to tell you why you were getting strange errors.
I would like to show you how a more experienced Perl programmer might write this subroutine.
use warnings;
use strict;
use autodie;
sub sort_file {
my( $filename ) = @_;
my @lines;
{
# 3 arg open
open my $in_fh, '<', $filename;
@lines = <$in_fh>;
close $in_fh;
}
# Schwartzian transform
my @sorted = map{
$_->[0]
} sort {
$a->[2] <=> $b->[2]
} map {
[ $_, split ' ', $_ ]
} @lines;
{
open my $out_fh, '>', $filename;
print {$out_fh} @sorted;
close $out_fh;
}
}
use strict;
prevents you from using a variable without declaring it (among other things).
use warnings;
Informs you of some potential errors.
use autodie;
Now you don't need to write open .... or die ....
{ open ...; @lines = <$fh>; close $fh }
Limits the scope of the FileHandle.
@sorted = map { ... } sort { ... } map { ... } @list
This is an example of a Schwartzian transform, which reduces the number of times that the values are split. In this example, it may be overkill.
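For readers who have not met the idiom, here is the transform on its own with in-memory data. The sample lines are invented, and this variant stores only the second field as the sort key rather than all split fields.

```perl
use strict;
use warnings;

# Hypothetical input lines: a label followed by a numeric second field.
my @lines = ("pear 30\n", "apple 4\n", "fig 12\n");

# Read the pipeline bottom-up.
my @sorted =
    map  { $_->[0] }                       # 3. unwrap the original line
    sort { $a->[1] <=> $b->[1] }           # 2. compare the precomputed keys
    map  { [ $_, (split ' ', $_)[1] ] }    # 1. pair each line with its second field
    @lines;

print @sorted;
```

Each line is split exactly once, whereas splitting inside the sort block would repeat the work on every comparison.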
How confusing. Assigning $filename = @_ the way you are means that you are evaluating an array in scalar context, which means that $filename is assigned the number of elements in @_. Because you don't check to see if the first open call succeeds, reading the file 1 likely fails, but you continue anyway and open for writing a file named 1. The solution is to assign in list context and begin your subroutine with ($filename) = @_ or $filename = shift.
Why aren't you using use strict by the way?
Always use:
use strict;
use warnings;
Then Perl will tell you when you're off the mark.
As you've observed, the notation:
$filename = @_;
means that an unscoped variable is assigned the number of elements in the argument list to the function, and since you pass one file, the name of the created file will be '1'.
You meant to write:
my($filename) = @_;
This provides list context for the array, and assigns $_[0] to $filename, ignoring any extra arguments to the function.
OK... never mind. It just dawned on me. $filename = @_; makes no sense. It should be $filename = @_[0];. There go 2 hours of my life. Note to other Perl noobs: beware.