How to print out HTML::TableExtract results? - perl

A stupid question here. I am new to Perl and am trying to use HTML::TableExtract to extract some data online. I got the numbers from a webpage but do not know how to print them out to a txt file. I tried to open a file but did not succeed. Here is the code I use. Thanks.
#!/usr/bin/perl
use Encode qw(decode);
use Encode;
use Encode::HanExtra;
use Encode::HanConvert;
use strict;
use warnings;
chdir("C:/perlfiles/test") || die "cannot cd ($!)";
my $file = "tokyo.html";
use HTML::TableExtract;
open my $outfile, '>', "tokyo.txt" or die 'Unable to create file';
my $label = 'by headers';
my $te = HTML::TableExtract->new(headers => [qw(number city)]);
$te->parse_file($file);
foreach my $ts ($te->tables) {
print "Table (", join(',', $ts->coords), "):\n";
foreach my $row ($ts->rows) {
print $outfile join(",", @$row),"\n";
}
}
close $outfile;
What is wrong? Thanks.

Use >> instead of >. > will overwrite the file every time, so if your final loop iteration returns no value, you end up with a blank file. >> appends at the EOF of the existing file, thus retaining the previously written data.
open (OUT,'>>tokyo.txt') or die 'Unable to create file';
So something like this might work.
open (OUT,'>>tokyo.txt') or die 'Unable to create file';
....
....
foreach my $row ($ts->rows) {
print OUT join(",", @$row) . "\n";
}
....
close OUT;
Also, your file handle my $outfile is not quite right. The file handle is supposed to "label" the connection to the external file. In your case $outfile is a variable and contains no value, hence no label, and so the file won't open. You have to label the connection to the external file with something like OUT (as I did) or OUTFILE, and use this file handle throughout the code to write, read, and close the file.

Related

Perl - Compare two large txt files and return the required lines from the first

So I am quite new to perl programming. I have two txt files, combined_gff.txt and pegs.txt.
I would like to check if each line of pegs.txt is a substring of any of the lines in combined_gff.txt, and output only those lines from combined_gff.txt to a separate text file called output.txt.
However, my code returns an empty file. Any help, please?
P.S. I should have mentioned this: the contents of both combined_gff.txt and pegs.txt are arranged as rows, one string per row. I just wish to pick up the rows from combined_gff.txt that contain a string from pegs.txt as a substring.
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my @gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my @ext = <DATA>;
close DATA;
my $str = ''; #final string
foreach my $gffline (@gff) {
foreach my $extline (@ext) {
if ( index($gffline, $extline) != -1) {
$str=$str.$gffline;
$str=$str."\n";
exit;
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
The first problem is exit. The output file is never created if a substring is found.
The second problem is chomp: you don't remove newlines from the lines, so the only way how a substring can be found is when a string from pegs.txt is a suffix of a string from combined_gff.txt.
Even after fixing these two problems, the algorithm will be very slow, as you're comparing each line from one file to each line of the second file. It will also print a line multiple times if it contains several different substrings (not sure if that's what you want).
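The newline problem can be seen in a short sketch (the two strings are made-up stand-ins for a line from combined_gff.txt and a line from pegs.txt):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-ins for a line of combined_gff.txt and a line of pegs.txt.
my $gffline = "contig1\tpeg.17\tannotation\n";
my $extline = "peg.17\n";    # straight from <DATA>, trailing newline intact

# The embedded "\n" in the needle can never match in the middle of a line.
print index($gffline, $extline), "\n";   # -1: not found

chomp $extline;                          # drop the newline
print index($gffline, $extline), "\n";   # 8: found
```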
Here's a different approach: First, read all the lines from pegs.txt and assemble them into a regex (quotemeta is needed so that special characters in substrings are interpreted literally in the regex). Then, read combined_gff.txt line by line, if the regex matches the line, print it.
#!/usr/bin/perl
use warnings;
use strict;
open my $data, '<', 'pegs.txt' or die $!;
chomp( my @ext = <$data> );
my $regex = join '|', map quotemeta, @ext;
open my $file, '<', 'combined_gff.txt' or die $!;
open my $out, '>', 'output.txt' or die $!;
while (<$file>) {
print {$out} $_ if /$regex/;
}
close $out;
I also switched to 3 argument version of open with lexical filehandles as it's the canonical way (3 argument version is safe even for files named >file or rm *| and lexical filehandles aren't global and are easier to pass as arguments to subroutines). Also, showing the actual error is more helpful than just dying with "error".
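The danger of the two-argument form can be demonstrated with a file whose name begins with > (the filename is invented for the demo):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An invented filename that starts with '>'. With two-argument open the
# leading '>' would be parsed as part of the mode; the three-argument form
# takes the name literally because mode and name are separate arguments.
my $tricky = '>file.txt';

open my $out, '>', $tricky or die "Cannot create '$tricky': $!";
print {$out} "hello\n";
close $out;

open my $in, '<', $tricky or die "Cannot read '$tricky': $!";
print scalar <$in>;    # prints "hello"
close $in;

unlink $tricky or die "Cannot remove '$tricky': $!";
```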
As choroba says, you don't need the exit inside the loop, since it ends execution of the whole script, and you must remove the line feeds (LF), which you do by chomping the lines, to find the matches.
Following the logic of your script I made one with the corrections and it worked fine.
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my @gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my @ext = <DATA>;
close DATA;
my $str = ''; #final string
foreach my $gffline (@gff) {
chomp($gffline);
foreach my $extline (@ext) {
chomp($extline);
print $extline;
if ( index($gffline, $extline) > -1) {
$str .= $gffline ."\n";
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
Hope it works for you.
Welcho

Perl Get the web content then writing it as a text file

I'm trying to create a script which gets a log file (content) from a website and then writes it to a text file, but I am getting errors when use strict is present:
Can't use string ("/home/User/Downloads/text") as a symbol ref while "strict refs" in use at ./scriptname line 92.
Also, by removing use strict, I get another error, which is:
File name too long at ./scriptname line 91.
I tried the approach from Perl: Read web text file and "open" it, but it did not work for me. Plus, I am a newbie at Perl and confused by the Perl syntax.
Are there any suggestions or advices available?
Note: the code greps each entire line in which RoomOutProcessTT is present and displays it, together with how many times it appears.
Here is the code.
my $FOutput = get "http://website/Logs/Log_number.ini";
my $FInput = "/home/User/Downloads/text";
open $FInput, '<', $FOutput or die "could not open $FInput: $!";
my $ctr;
my @results;
my @words = <$FInput>;
@results = grep /RoomOutProcessTT/, @words;
print "@results\n";
close $FInput;
open $FInput, '<', $FOutput or die "could not open $FInput: $!";
while(<$FInput>){
$ctr = grep /RoomOutProcessTT/, split ' ' , $_;
$ctr += $ctr;
}
print "RoomOutProcessTT Count: $ctr\n";
close $FInput;
The first argument to open is the filehandle name, not the actual name of the file. That comes later in the open function.
Change your code to:
my $FOutput = get "http://website/Logs/Log_number.ini"; # your content should be stored in this
# variable, you need to write data to your output file.
my $FInput = "/home/User/Downloads/text";
open OUTPUT_FILEHANDLE, '>', $FInput or die "could not open $FInput: $!"; # give a name to the file
# handle, then supply the file name itself after the mode specifier.
# You want to WRITE data to this file, open it with '>'
my $ctr = 0;
my @results;
my @words = split /\r?\n/, $FOutput; # split the fetched content into an
# array of 'lines'; without capturing parentheses the line separators
# themselves are not kept in the list
# here, you want to print the results of your grep to the output file
@results = grep /RoomOutProcessTT/, @words;
print OUTPUT_FILEHANDLE "@results\n"; # print to your output file
# close the output file here, since you re-open it in the next few lines.
close OUTPUT_FILEHANDLE;
# not sure why you're re-opening the file here... but that's up to your design I suppose
open INPUT_FILEHANDLE, '<', $FInput or die "could not open $FInput: $!"; # open it for read
while(<INPUT_FILEHANDLE>){
$ctr += grep /RoomOutProcessTT/, split ' ' , $_;
}
print "RoomOutProcessTT Count: $ctr\n"; # print to stdout
close INPUT_FILEHANDLE; # close your file handle
I might suggest switching the terms you use to identify "input and output", as it's somewhat confusing. The input in this case is actually the file you pull from the web, output being your text file. At least that's how I interpret it. You may want to address that in your final design.
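As a sketch of just the counting logic, with a literal string standing in for the content that get() would return (the marker lines here are invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented stand-in for the content LWP::Simple's get() would return.
my $content = "start RoomOutProcessTT ok\n"
            . "nothing interesting here\n"
            . "RoomOutProcessTT again RoomOutProcessTT end\n";

# Lines that contain the marker at least once.
my @results = grep /RoomOutProcessTT/, split /\n/, $content;

# Total number of occurrences, counting repeats within a line.
my $ctr = () = $content =~ /RoomOutProcessTT/g;

print scalar(@results), " matching lines\n";   # 2 matching lines
print "RoomOutProcessTT Count: $ctr\n";        # RoomOutProcessTT Count: 3
```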

Reading and writing to the same file

I'm using this code I found online to read a properties file in my Perl script:
open (CONFIG, "myfile.properties");
while (<CONFIG>){
chomp; #no new line
s/#.*//; #no comments
s/^\s+//; #no leading white space
s/\s+$//; #no trailing white space
next unless length;
my ($var, $value) = split (/\s* = \s*/, $_, 2);
$$var = $value;
}
Is it posssible to also write to the text file inside this while loop? Let's say the text file looks like this:
#Some comments
a_variale = 5
a_path = /home/user/path
write_to_this_variable = ""
How can I put some text in write_to_this_variable?
It is not really practical to overwrite text files where you have variable length records (lines). It is normal to copy the file, something like this:
my $filename = 'myfile.properties';
open(my $in, '<', $filename) or die "Unable to open '$filename' for read: $!";
my $newfile = "$filename.new";
open(my $out, '>', $newfile) or die "Unable to open '$newfile' for write: $!";
while (<$in>) {
s/(write_to_this_variable =) ""/$1 "some text"/;
print $out $_;
}
close $in;
close $out;
rename $newfile,$filename or die "unable to rename '$newfile' to '$filename': $!";
You might have to sanitise the text you are writing with something like \Q if it contains non-alphanumerics.
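A quick sketch of what \Q buys you (the property value is invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An invented value full of regex metacharacters: ( ) $ . would all be
# treated as regex syntax if the variable were interpolated bare.
my $value = 'price (USD) = $5.00';

my $line = 'old: price (USD) = $5.00';

# \Q..\E quotes the metacharacters so the value matches literally.
print "literal match\n" if $line =~ /\Q$value\E/;
```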
This is an example of a program that uses the Config::Std module to read and write a simple config file like yours. As far as I know it is the only module that will preserve any comments in the original file.
There are two points to note:
The first hash key in $props{''}{write_to_this_variable} forms the name of the config file section that will contain the value. If there are no sections, as for your file, then you must use an empty string here
If you need quotes around a value then you must add these explicitly when you are assigning to the hash element, as I do here with '"Some text"'
I think the rest of the program is self-explanatory.
use strict;
use warnings;
use Config::Std { def_sep => ' = ' };
my %props;
read_config 'myfile.properties', %props;
$props{''}{write_to_this_variable} = '"Some text"';
write_config %props;
output
#Some comments
a_variale = 5
a_path = /home/user/path
write_to_this_variable = "Some text"

Perl IPC::Run appending output and parsing stderr while keeping it in a batch of files

I'm trying to wrap my head around IPC::Run to be able to do the following. For a list of files:
my @list = ('/my/file1.gz','/my/file2.gz','/my/file3.gz');
I want to execute a program that has built-in decompression, does some editing and filtering to them, and prints to stdout, giving some stats to stderr:
~/myprogram options $file
I want to append the stdout of the execution for all the files in the list to one single $out file, and be able to parse and store a couple of lines in each stderr as variables, while letting the rest be written out into separate fileN.log files for each input file.
I want stdout to all go into a ">>$all_into_one_single_out_file", it's the err that I want to keep in different logs.
After reading the manual, I've gone so far as to the code below, where the commented part I don't know how to do:
for $file in @list {
my @cmd;
push @cmd, "~/myprogram options $file";
IPC::Run::run \@cmd, \undef, ">>$out",
sub {
my $foo .= $_[0];
#check if I want to keep my line, save value to $mylog1 or $mylog2
#let $foo and all the other lines be written into $file.log
};
}
Any ideas?
First things first. my $foo .= $_[0] is not necessary. $foo is a new (empty) value, so appending to it via .= doesn't do anything. What you really want is a simple my ($foo) = @_;.
Next, you want to have output go to one specific file for each command while also (depending on some conditional) putting that same output to a common file.
Perl (among other languages) has a great facility to help in problems like this, and it is called closure. Whichever variables are in scope at the time of a subroutine definition, those variables are available for you to use.
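A minimal closure sketch (the names are invented): each anonymous sub keeps its own copy of the lexicals it closed over, which is exactly how the callback passed to run can keep using the log file handles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# make_counter returns an anonymous sub that closes over $count.
# Each returned sub keeps its own private $count alive between calls.
sub make_counter {
    my $count = 0;
    return sub { return ++$count };
}

my $c1 = make_counter();
my $c2 = make_counter();

print $c1->(), "\n";   # 1
print $c1->(), "\n";   # 2
print $c2->(), "\n";   # 1 -- each closure has its own $count
```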
use strict;
use warnings;
use IPC::Run qw(run new_chunker);
my @list = qw( /my/file1 /my/file2 /my/file3 );
open my $shared_fh, '>', '/my/all-stdout-goes-here' or die;
open my $log1_fh, '>', '/my/log1' or die "Cannot open /my/log1: $!\n";
open my $log2_fh, '>', '/my/log2' or die "Cannot open /my/log2: $!\n";
foreach my $file ( @list ) {
my @cmd = ( "~/myprogram", option1, option2, ..., $file );
open my $log_fh, '>', "$file.log"
or die "Cannot open $file.log: $!\n";
run \@cmd, '>', $shared_fh,
'2>', new_chunker, sub {
# $out contains each line of stderr from the command
my ($out) = @_;
if ( $out =~ /something interesting/ ) {
print $log1_fh $out;
}
if ( $out =~ /something else interesting/ ) {
print $log2_fh $out;
}
print $log_fh $out;
return 1;
};
}
Each of the output file handles will get closed when they're no longer referenced by anything -- in this case at the end of this snippet.
I fixed your @cmd, though I don't know what your option1, option2, ... will be.
I also changed the way you are calling run. You can call it with a simple > to tell it the next thing is for output, and the new_chunker (from IPC::Run) will break your output into one-line-at-a-time instead of getting all the output all-at-once.
I also skipped over the fact that you're outputting to .gz files. If you want to write to compressed files, instead of opening as:
open my $fh, '>', $file or die "Cannot open $file: $!\n";
Just open up:
open my $fh, '|-', "gzip -c > $file" or die "Cannot startup gzip: $!\n";
Be careful here, as this is a good place for command injection (e.g. let $file be /dev/null; /sbin/reboot). How to handle this is given in many, many other places and is beyond the scope of what you're actually asking.
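One way to sidestep the shell entirely is the core IO::Compress::Gzip module; a sketch (the filename and data are invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip     qw($GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Invented filename; because no shell is involved, even a name like
# 'x; rm -rf ~' would be treated as a literal filename, not executed.
my $file = 'out.gz';

my $z = IO::Compress::Gzip->new($file)
    or die "Cannot open gzip stream to $file: $GzipError";
$z->print("some program output\n");
$z->close;

# Round-trip check: read the compressed file back.
my $u = IO::Uncompress::Gunzip->new($file)
    or die "Cannot read $file: $GunzipError";
print $u->getline;    # prints "some program output"
$u->close;
unlink $file;
```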
EDIT: re-read problem a bit more, and changed answer to more closely reflect the actual problem.
EDIT2: Updated per your comment. All stdout goes to one file, and the stderr from the command is fed to the inline subroutine. Also fixed a stupid typo (the for syntax was pseudocode, not Perl).

update a column in input file by taking value from Database in perl

input file:
1,a,USA,,
2,b,UK,,
3,c,USA,,
I want to update the 4th column in the input file by taking values from one of the tables.
my code looks like this:
my $number_dbh = DBI->connect("DBI:Oracle:$INST", $USER, $PASS ) or die "Couldn't connect to database $INST";
my $num_smh;
print "connected \n ";
open FILE , "+>>$input_file" or die "can't open the input file";
print "echo \n";
while(my $line=<FILE>)
{
my @line_a=split(/\,/,$line);
$num_smh = $number_dbh->prepare("SELECT phone_no from book where number = $line_a[0]");
$num_smh->execute() or die "Couldn't execute stmt, error : $DBI::errstr";
my $number = $num_smh->fetchrow_array();
$line_a[3]=$number;
}
Looks like your data is in CSV format. You may want to use Parse::CSV.
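Since the sample rows contain no quoted fields, the column handling itself can be sketched with a plain split; Parse::CSV (or Text::CSV) becomes the right tool as soon as quoting or embedded commas appear. The phone number below is an invented stand-in for the database value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One row from the input file; split with a limit of -1 so the trailing
# empty fields are kept instead of being dropped.
my $line = '1,a,USA,,';
my @cols = split /,/, $line, -1;

printf "%d fields\n", scalar @cols;   # 5 fields

# '555-0100' is an invented stand-in for the phone_no value from the table.
$cols[3] = '555-0100';
print join(',', @cols), "\n";         # 1,a,USA,555-0100,
```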
+>> doesn't do what you think it does. In fact, in testing it doesn't seem to do anything at all. Further, +< does something very strange:
% cat file.txt
1,a,USA,,
2,b,UK,,
3,c,USA,,
% cat update.pl
#!perl
use strict;
use warnings;
open my $fh, '+<', 'file.txt' or die "$!";
while ( my $line = <$fh> ) {
$line .= "hello\n";
print $fh $line;
}
% perl update.pl
% cat file.txt
1,a,USA,,
1,a,USA,,
hello
,,
,,
hello
%
+> appears to truncate the file.
Really, what you want to do is to write to a new file, then copy that file over the old one. Opening a file for simultaneous read/write looks like you'd be entering a world of hurt.
As an aside, you should use the three-argument form of open() (safer for "weird" filenames) and use lexical filehandles (they're not global, and when they go out of scope your file automatically closes for you).
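A sketch of the write-to-a-new-file-then-rename approach for this task, using three-argument open and lexical filehandles; the %phone_no hash is an invented stand-in for the SELECT phone_no FROM book lookup, and the sample input is created inline so the sketch is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented stand-in for: SELECT phone_no FROM book WHERE number = ...
my %phone_no = ( 1 => '111-1111', 2 => '222-2222', 3 => '333-3333' );

my $input = 'input.txt';

# Create the sample input so the sketch runs on its own.
open my $seed, '>', $input or die "Cannot create $input: $!";
print {$seed} "1,a,USA,,\n2,b,UK,,\n3,c,USA,,\n";
close $seed;

# Read the original, write an updated copy, then rename over the original.
open my $in,  '<', $input       or die "Cannot read $input: $!";
open my $out, '>', "$input.new" or die "Cannot write $input.new: $!";
while ( my $line = <$in> ) {
    chomp $line;
    my @cols = split /,/, $line, -1;          # -1 keeps trailing empty fields
    $cols[3] = $phone_no{ $cols[0] } // '';   # fill the 4th column
    print {$out} join(',', @cols), "\n";
}
close $in;
close $out;
rename "$input.new", $input or die "Cannot rename: $!";
```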