I want to print some dates from a site with a structure like this:
<tr><td><b>(.*?)</b></td>
<td align=".*?"/date/(\d+)-(\d+)/">.*?</a> (\d+)</td>
<td>(.*?)*</td></tr>
etc.
my $country = $1;
my $month = $2;
my $day = $3;
my $year = $4;
my $event = $5;
I need to extract only those where the $country is 'USA' but if I use the while statement the code loops endlessly through the first match. How do I rework the script to extract each found USA date?
sub getSpec {
my $line = shift;
my $site = getSite($line);
while ($site =~ s/.../) {
my $country = $1;
my $month = $2;
my $day = $3;
my $year = $4;
my $event = $5;
if ($country =~ /USA/i) {
print $month.$date.$year.$country.$event."\n";
}
}
}
A global match should make it for you:
while ($site =~ m/.../g) {
For details, look in the documentation.
It looks like you're not changing the string after the first match. Try reading $site (it's html of the whole site, right?) line by line, so the loop would look like this
(my Perl is a little bit rusty, this is only rough sketch, sorry for that):
while ( $_ = another_line_from_$site)
{
if($_ =~ s/.../) {
{variables}
if($country =~ /USA/i)
{ other_stuff }
}
}
Related
I am trying to solve the doublets puzzle problem using Perl. This is one of my first times using Perl so please excuse the messy code.
I have everything working, I believe, but am having an issue printing the shortest path. Using a queue and BFS I am able to find the target word but not the actual path taken.
Does anyone have any suggestions? I have been told to keep track of the parents of each element but it is not working.
#!/usr/bin/perl
use strict;
my $file = 'test';
#my $file = 'wordlist';
open(my $fh, $file);
my $len = length($ARGV[0]);
my $source = $ARGV[0];
my $target = $ARGV[1];
my #words;
# Creates new array of correct length words
while (my $row = <$fh>) {
chomp $row;
my $rowlen = length($row);
if ($rowlen == $len) {
push #words, $row;
}
}
my %wordHash;
# Creates graph for word variations using dictionary
foreach my $word (#words) {
my $wordArr = [];
for (my $i = 0; $i < $len; $i++) {
my $begin = substr($word, 0, $i);
my $end = substr($word, $i+1, $len);
my $key = "$begin" . "_" . "$end";
my $Arr = [];
my $regex = "$begin" . "[a-z]" . "$end";
foreach my $wordTest (#words) {
if ("$wordTest" =~ $regex && "$wordTest" ne "$word") {
push $wordArr, "$wordTest";
}
}
}
$wordHash{"$word"} = $wordArr;
}
my #queue;
push(#queue, "$source");
my $next = $source;
my %visited;
my %parents;
my #path;
# Finds path using BFS and Queue
while ("$next" ne "$target") {
print "$next: ";
foreach my $variation (#{$wordHash{$next}}) {
push(#queue, "$variation");
$parents{"$variation"} = $next;
print "$variation | ";
}
print "\n-----------------\n";
$visited{"$next"} = 1;
push(#path, "$next");
$next = shift(#queue);
while ($visited{"$next"} == 1) {
$next = shift(#queue);
}
}
print "FOUND: $next\n\n";
print "Path the BFS took: ";
print "#path\n\n";
print "Value -> Parent: \n";
for my $key (keys %parents) {
print "$key -> $parents{$key}\n";
}
Before you accept a word from the #queue to be $next, you test to ensure that it's not been %visited. By then, though, damage has been done. The test has ensured a visited word wont become the focus again and hence, will prevent loops but the earlier code updated %parents whether the word had been %visited or not.
If a word has been %visited, you not only want to avoid it becomming the $next candidate, you want to avoid it being a considered $variation as that will screw up %parents. I don't have a word dictionary to test with and you haven't given an example of the failure but I think you can fix this up by shifting the %visited guard into the inner loop where variations are considered;
foreach my $variation (#{$wordHash{$next}}) {
next if %visited{ $variation } ;
push(#queue, "$variation");
... etc ...
This will protect the integrity of your #parents array as well as stop loops. On a small note, you don't need use double quotes when indexing into a hash; as I've done above, just state the scalar variable - using quotes just interpolates the value of the variable which produces the same result.
Your code, IMHO, is excellent for a beginner, BTW.
Update
I've since got a word dictionary and the problem above does exists as well as one other. The code does move one letter at a time from the source but in a near random direction - not necessarily closer to the target. To correct that, I changed the regex you use to build your graph such that the corresponding letter from the target replaces the generic [a-z]. There are also a couple of minor changes - mostly style related. The updated code looks like this;
use v5.12;
my $file = 'wordlist.txt';
#my $file = 'wordlist';
open(my $fh, $file);
my $len = length($ARGV[0]);
my $source = $ARGV[0];
my $target = $ARGV[1];
chomp $target;
my #target = split('', $target);
my #words;
# Creates new array of correct length words
while (my $row = <$fh>) {
$row =~ s/[\r\n]+$//;
my $rowlen = length($row);
if ($rowlen == $len) {
push #words, $row;
}
}
my %wordHash;
# Creates graph for word variations using dictionary
foreach my $word (#words) {
my $wordArr = [];
for (my $i = 0; $i < $len; $i++) {
my $begin = substr($word, 0, $i);
my $end = substr($word, $i+1, $len);
my $key = "$begin" . "_" . "$end";
my $Arr = [];
# my $re_str = "$begin[a-z]$end";
my $regex = $begin . $target[$i] . $end ;
foreach my $wordTest (#words) {
if ($wordTest =~ / ^ $regex $ /x ) {
next if $wordTest eq $word ;
push $wordArr, "$wordTest";
}
}
}
$wordHash{"$word"} = $wordArr;
}
my #queue;
push(#queue, "$source");
my $next = $source;
my %visited;
my %parents;
my #path;
# Finds path using BFS and Queue
while ($next ne $target) {
print "$next: ";
$visited{$next} = 1;
foreach my $variation (#{$wordHash{$next}}) {
next if $visited{ $variation } ;
push(#queue, $variation);
$parents{$variation} = $next;
print "$variation | ";
}
print "\n-----------------\n";
push(#path, $next);
while ( $visited{$next} ) {
$next = shift #queue ;
}
}
push #path, $target ;
print "FOUND: $next\n\n";
print "Path the BFS took: #path\n\n";
print "Value -> Parent: \n";
for my $key (keys %parents) {
print "$key -> $parents{$key}\n";
}
and when ran produces;
./words.pl head tail | more
head: heal |
-----------------
heal: teal | heil |
-----------------
teal:
-----------------
heil: hail |
-----------------
hail: tail |
-----------------
FOUND: tail
Path the BFS took: head heal teal heil hail tail
Value -> Parent:
hail -> heil
heil -> heal
teal -> heal
tail -> hail
heal -> head
You could probably remove the printing of the %parents hash - as hash values come out randomly, it doesnt tell you much
I have made a perl code which is shown below. Here what I am trying to do is first get input from a text file consisting of a HTTP URL with a Title.
thus the first regex is the title and the second regex fetches the id from inside the URL.
All these values are inserted into the hash table %myfilenames().
So this hash table has key as the URL id, and value as the Title. Everything till here works fine, now I have a set of files on my computer which have the ID in their name which we extracted from the URL.
What I want to do is that if the ID is there in the hash table, then the files name should change to the value assigned to the ID. Now the output at the print statement in the last function is correct but I am unable to rename the files. I tried many things, but nothing works. Can someone help please.
example stuff:
URL: https://abc.com/789012 <--- ID
Value (new Title) : ABC
file name on computer = file-789012 <---- ID
new file name = ABC
My code:
use File::Slurp;
use File::Copy qw(move);
open( F, '<hadoop.txt' );
$key = '';
$value = '';
%myfilenames = ();
foreach (<F>) {
if ( $_ =~ /Lecture/ ) {
$value = $_;
}
if ( $_ =~ /https/ ) {
if ( $_ =~ /\d{6}/ ) {
$key = $&;
}
}
if ( !( $value eq '' || $key eq '' ) ) {
#print "$key\t\t$value";
$myfilenames{$key} = $value;
$key = '';
$value = '';
}
}
#while ( my ( $k, $v ) = each %myfilenames ) { print "$k $v\n"; }
my #files = read_dir 'C:\\inputfolder';
for (#files) {
if ( $_ =~ /\d{6}/ ) {
$oldval = $&;
}
$newval = $myfilenames{$oldval};
chomp($newval);
print $_ , "\t\t$newval" . "\n";
$key = '';
}
You probably didn't prepend the path to the file names. The following works for me (on a Linux box):
#!/usr/bin/perl
use warnings;
use strict;
use File::Slurp qw{ read_dir };
my $dir = 0;
mkdir $dir;
open my $FH, '>', "$dir/$_" for 123456, 234567;
my $key = my $value = q();
my %myfilenames = ();
for (<DATA>) {
chomp;
$value = $_ if /Lecture/;
$key = $1 if /https/ and /(\d{6})/;
if ($value ne q() and $key ne q()) {
$myfilenames{$key} = $value;
$key = $value = q();
}
}
my #files = read_dir($dir);
for (#files) {
if (/(\d{6})/) {
my $oldval = $1;
my $newval = $myfilenames{$oldval};
rename "$dir/$oldval", "$dir/$newval";
}
}
__DATA__
Lecture A1
https://123456
# Comment
Lecture A2
https://234567
I want to parse some information from the file.
Information in the file:
Rita_bike_house_Sha9
Rita_bike_house
I want to have output like dis
$a = Rita_bike_house and $b = Sha9,
$a = Rita_bike_house and $b = "original"
In order to get that I have used the below code:
$name = #_; # This #_ has all the information from the file that I have shown above.
#For matching pattern Rita_bike_house_Sha9
($a, $b) = $name =~ /\w\d+/;
if ($a ne "" and $b ne "" ) { return ($a,$b) }
# this statement doesnot work at all as its first condition
# before the end is not satisified.
Is there any way where I can store "Rita_bike_house" in $a and "Sha9" in $b? I think my regexp is missing with something. Can you suggest anything?
Please don't use the variables $a and $b in your code. There are used by sort and will confuse you.
Try:
while( my $line = <DATA> ){
chomp $line;
if( $line =~ m{ \A ( \w+ ) _ ( [^_]* \d [^_]* ) \z }msx ){
my $first = $1;
my $second = $2;
print "\$a = $first and \$b = $second\n";
}else{
print "\$a = $line and \$b = \"original\"\n";
}
}
__DATA__
Rita_bike_house_Sha9
Rita_bike_house
Not very nice, but the next:
use strict;
use warnings;
while(<DATA>) {
chomp;
next if /^\s*$/;
my #parts = split(/_/);
my $b = pop #parts if $parts[$#parts] =~ /\d/;
$b //= '"original"';
my $a = join('_', #parts);
print "\$a = $a and \$b = $b,\n";
}
__DATA__
Rita_bike_house_Sha9
Rita_bike_house
prints:
$a = Rita_bike_house and $b = Sha9,
$a = Rita_bike_house and $b = "original",
If you are sure that the pattern which is required will always be similar to 'Sha9' and also it will appear at the end then just do a greedy matching....
open FILE, "filename.txt" or die $!;
my #data = <FILE>;
close(<FILE>);
#my $line = "Rita_bike_house_Sha9";
foreach $line (#data)
{
chomp($line);
if ($line =~ m/(.*?)(_([a-zA-Z]+[0-9]+))?$/)
{
$a = $1;
$b = $3 ? $3 : "original";
}
}
I'm trying to save a reference to a string in a class variable.
I wish to access this variable by dereferencing it.
For example in the routine getHeaders instead of using:
my $fileContentsRef = $this->getFileContent;
my $fileContentsRef1 = $$fileContentsRef;
$fileContentsRef1 =~ /Spaltenname.*?Datentyp.*?---\n(.*?)\n\n/gsmi;
I would like to use:
my $fileContentsRef = $this->getFileContent;
$$fileContentsRef =~ /Spaltenname.*?Datentyp.*?---\n(.*?)\n\n/gsmi;
For more details you should see the code at the end.
My problem is, that the program doesn't work when I don't work with the copy( i.e when I don't use $fileContentsRef1). What am I doing / getting wrong? Is it possible to reach the goal in the way I described? Could some give me clues how?
open FILE, "a1.bad";
$file_contents .= do { local $/; <FILE> };
close FILE;
my $log = auswerter->new(\$file_contents);
#-----------------------------------------------------------------
# Subs
#-----------------------------------------------------------------
# CONSTRUCTOR
sub new
{
my $fileRef = $_[1];
my $self = {};
bless $self;
$self->initialize();
if($fileRef) { $self->{fileRef} = $fileRef; }
return $self;
}
sub initialize
{
#-----------------------------------------------------------------
# Configuration
#-----------------------------------------------------------------
my $this = shift;
}
sub setFile {
my $this = shift;
$this->{file} = shift;
}
sub getFileContent
{
my $this = shift;
return $this->{fileRef};
}
sub getHeaders
{
print "HEADERS...\n";
my $this = shift;
my #headers = ();
my $fileContentsRef = $this->getFileContent;
my $fileContentsRef1 = $$fileContentsRef;
$fileContentsRef1 =~ /Spaltenname.*?Datentyp.*?---\n(.*?)\n\n/gsmi;
#headers = split ("\n", $1 );
foreach (#headers)
{
$_ =~ s/^(.*?)\s.*/$1/;
}
return \#headers;
}
sub getErrList
{
print "ERR LIST...\n";
my $this = shift;
my #errors = ();
my $fileContentsRef = $this->getFileContent;
my $fileContentsRef1 = $$fileContentsRef;
$fileContentsRef1 =~ /Spaltenname.*?(Satz.*)ORA.*?^Tabelle/gsmi;
return \#errors if !$1;
#errors = split ("\n\n", $1 );
foreach (#errors)
{
$_ =~ s/.*Spalte (.*?)\..*/$1/msgi;
}
return \#errors;
}
sub getEntries
{
my $this = shift;
my #entries = ();
my $fileContentsRef = $this->getFileContent;
my $fileContentsRef1 = $$fileContentsRef;
$fileContentsRef1 =~ /.*==\n(.*)/gsmi;
#entries = split ("\n", $1 );
return \#entries;
}
sub sqlldrAnalyze
{
my $this = shift;
my $token = shift;
my $errRef =$this->getErrList();
return "" if $#$errRef < 0 ;
my $headersRef = $this->getHeaders();
my $entriesRef = $this->getEntries();
my $i = 0;
my $str = "";
$str = "<html>";
$str .= "<table rules=\"all\">";
$str .= "<tr>";
foreach ( #$headersRef)
{
$str .= "<th>".$_."</th>";
}
$str .= "</tr>";
foreach ( #$entriesRef)
{
my #errOffset = grep { $headersRef->[$_] =~ $errRef->[$i] }0..$#$headersRef ;
my #entries = split($token, $_);
$str .= "<tr>";
foreach (my $j =0; $j <= $#entries;$j++)
{
$str .= "<td nowrap";
$str .= " style=\"background-color: red\"" if $j == $errOffset[0];;
$str .= ">";
$str .= "<b>" if $j == $errOffset[0];
$str .= $entries[$j];
$str .= "</b>" if $j == $errOffset[0];
$str .= "</td>";
}
$str .= "</tr>\n";
$i++;
}
$str .= "</table>";
$str .= "</html>";
return $str;
}
return 1;
When you call your class->new(...) constructor with a filename argument, the new subroutine gets the class name as the first argument, and the filename as the second argument.
In your constructor, you are simply copying the value of $_[1] (the filename) into $self->{FileRef}, but that value is not a reference.
So when you access it, there is no need to use a doubled sigil to dereference the value.
You should run all of your code with the following two lines at the top, which will catch many errors for you (including trying to use strings as references when they are not references):
use strict;
use warnings;
These two lines basically move Perl out of quick one-liner mode, and into a mode more suitable for large development (improved type safety, static variable name checking, and others).
Per the update: If the code you have is working properly when copying the string, but not when dereferencing it directly, it sounds like you may be running into an issue of the string reference preserving the last match position (the g flag).
Try running the following:
my $fileContentsRef = $this->getFileContent;
pos($$fileContentsRef) = 0; # reset the match position
$$fileContentsRef =~ /Spaltenname.*?Datentyp.*?---\n(.*?)\n\n/gsmi;
I have a hash of lists that is not getting populated.
I checked that the block at the end that adds to the hash is in fact being called on input. It should either add a singleton list if the key doesn't exist, or else push to the back of the list (referenced under the right key) if it does.
I understand that the GOTO is ugly, but I've commented it out and it has no effect.
The problem is that when printhits is called, nothing is printed, as if there are no values in the hash. I also tried each (%genomehits), no dice.
THANKS!
#!/usr/bin/perl
use strict;
use warnings;
my $len = 11; # resolution of the peaks
#$ARGV[0] is input file
#$ARGV[1] is call number
# optional -s = spread number from call
# optional -o specify output file name
my $usage = "see arguments";
my $input = shift #ARGV or die $usage;
my $call = shift #ARGV or die $usage;
my $therest = join(" ",#ARGV) . " ";
print "the rest".$therest."\n";
my $spread = 1;
my $output = $input . ".out";
if ($therest =~ /-s\s+(\d+)\s/) {$spread = $1;}
if ($therest =~ /-o\s+(.+)\s/) {$output = $1;}
# initialize master hash
my %genomehits = ();
foreach (split ';', $input) {
my $mygenename = "err_naming";
if ($_ =~ /^(.+)-/) {$mygenename = $1;}
open (INPUT, $_);
my #wiggle = <INPUT>;
&singlegene(\%genomehits, \#wiggle, $mygenename);
close (INPUT);
}
&printhits;
#print %genomehits;
sub printhits {
foreach my $key (%genomehits) {
print "key: $key , values: ";
foreach (#{$genomehits{$key}}) {
print $_ . ";";
}
print "\n";
}
}
sub singlegene {
# let %hash be the mapping hash
# let #mygene be the gene to currently process
# let $mygenename be the name of the gene to currently process
my (%hash) = %{$_[0]};
my (#mygene) = #{$_[1]};
my $mygenename = $_[2];
my $chromosome;
my $leftbound = -2;
my $rightbound = -2;
foreach (#mygene) {
#print "Doing line ". $_ . "\n";
if ($_ =~ "track" or $_ =~ "output" or $_ =~ "#") {next;}
if ($_ =~ "Step") {
if ($_ =~ /chrom=(.+)\s/) {$chromosome = $1;}
if ($_ =~ /span=(\d+)/) {$1 == 1 or die ("don't support span not equal to one, see wig spec")};
$leftbound = -2;
$rightbound = -2;
next;
}
my #line = split /\t/, $_;
my $pos = $line[0];
my $val = $line[-1];
# above threshold for a call
if ($val >= $call) {
# start of range
if ($rightbound != ($pos - 1)) {
$leftbound = $pos;
$rightbound = $pos;
}
# middle of range, increment rightbound
else {
$rightbound = $pos;
}
if (\$_ =~ $mygene[-1]) {goto FORTHELASTONE;}
}
# else reinitialize: not a call
else {
FORTHELASTONE:
# typical case, in an ocean of OFFs
if ($rightbound != ($pos-1)) {
$leftbound = $pos;
}
else {
# register the range
my $range = $rightbound - $leftbound;
for ($spread) {
$leftbound -= $len;
$rightbound += $len;
}
#print $range . "\n";
foreach ($leftbound .. $rightbound) {
my $key = "$chromosome:$_";
if (not defined $hash{$key}) {
$hash{$key} = [$mygenename];
}
else { push #{$hash{$key}}, $mygenename; }
}
}
}
}
}
You are passing a reference to %genomehits to the function singlegene, and then copying it into a new hash when you do my (%hash) = %{$_[0]};. You then add values to %hash which goes away at the end of the function.
To fix it, use the reference directly with arrow notation. E.g.
my $hash = $_[0];
...
$hash->{$key} = yadda yadda;
I think it's this line:
my (%hash) = %{$_[0]};
You're passing in a reference, but this statement is making a copy of your hash. All additions you make in singlegene are then lost when you return.
Leave it as a hash reference and it should work.
PS - Data::Dumper is your friend when large data structures are not behaving as expected. I'd sprinkle a few of these in your code...
use Data::Dumper; print Dumper \%genomehash;