How to select parts of a line in Perl? - perl

I have many long files, but I am only interested in part of the information in each one. So far I have code that trims the file and gives me the line containing the information I need, working on one file at a time.
This is the code I am using:
#!/usr/bin/perl
use strict;
use warnings;
my $data;
open FILE, "<$ARGV[0]" or die "cannot open file '$ARGV[0]'!\n\n";
while ($data= <FILE>){
chomp $data;
if( $data=~m/\<input type="hidden" name="description" value="454read"><input type="hidden" name="format" value="fasta"><input type="submit" name="submitbutton" value="FASTA"/)
{
$data=~s/[^ACTGN]//g;
print $data;
}
}
And this is the input I get:
<input type="hidden" name="sequence" value="TTGTTGAGCTCGACGGTCATGACCCAGCTGGAGTCGGCACGGGCACCCGCGCGCTTCTGCCAGACGCCAATGTGGGACTTCTCGGTGTCGAGGC"><input type="hidden" name="name" value="FUY784js_7HL"><input type="hidden" name="description" value="454read"><input type="hidden" name="format" value="fasta"><input type="submit" name="submitbutton" value="FASTA">
From this I only need two parts. The first is the TTGTT....AGGC sequence; it will always consist of uppercase A, T, C, G, or N, although its length may differ in each file. The second is the name, which in this case is FUY784js_7HL; this name will change every time.
The ideal output should look like this:
FUY784js_7HL
TTGTTGAGCTCGACGGTCATGACCCAGCTGGAGTCGGCACGGGCACCCGCGCGCTTCTGCCAGACGCCAATGTGGGACTTCTCGGTGTCGAGGC
Do you have any idea how I can do this? I have many files like this, so I would appreciate any help getting this to work for multiple files.
Thanks!

perl -pe 's/[^ACTGN]//g;'
As a proxy for the bit which appears to be problematic, the above command seems to work, at least with the input line starting with <input and the second output line.
If you don't have any other prints in your real program, I'm not sure how it could produce the line you said it did.
Actually, that was a lie. I got:
TTGTTGAGCTCGACGGTCATGACCCAGCTGGAGTCGGCACGGGCACCCGCGCGCTTCTGCCAGACGCCAATGTGGGACTTCTCGGTGTCGAGGCATA
back because of the FASTA value at the end. If you want to restrict to the main value:
perl -pe 's/.*"([ACTGN]+)".*<input\b[^>]*\bname="name"\s[^>]*\bvalue="([^"]+)".*/$2\n$1/;'
Please note that all of the standard disclaimers about the stupidity and fragility of parsing XML with a regex apply. Specifically, it is perfectly legal to reorder the name and value attributes and this example regex doesn't allow that.

If I understand the problem correctly, it looks like capturing groups address your need. Especially since you know the beginning and the end but don't know the middle, something like this should work:
$data =~ /TTGTT(.+)AGGC/;
print $1;
Check out the section on capture groups on perldoc:
http://perldoc.perl.org/perlre.html#Regular-Expressions

From what has been posted, I think this would return the sequence:
$data =~ /name="sequence" value="([ACGTN]+)".*name="name" value="([^"]+)"/;
print "$2\n$1\n";
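The question also asks about handling several files. The following is a minimal sketch of one way to do that, not a definitive implementation: it assumes each relevant line contains both a name="sequence" and a name="name" input with the value attribute directly after the name attribute, as in the sample line. The helper name extract_record is hypothetical. Each attribute is looked up with its own capture group, so the order of the two inputs on the line does not matter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (name, sequence) for a line, or an empty list
# when either value is missing.
sub extract_record {
    my ($line) = @_;
    # Match each attribute separately so the order of the two
    # <input> elements on the line does not matter.
    my ($seq)  = $line =~ /name="sequence"\s+value="([ACGTN]+)"/;
    my ($name) = $line =~ /name="name"\s+value="([^"]+)"/;
    return (defined $seq && defined $name) ? ($name, $seq) : ();
}

# Process every file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "cannot open file '$file': $!\n";
    while (my $line = <$fh>) {
        # Skip lines that do not contain both values.
        my ($name, $seq) = extract_record($line) or next;
        print "$name\n$seq\n";
    }
    close $fh;
}
```

Assuming the script is saved as extract.pl (a made-up name), it could be run as: perl extract.pl file1.html file2.html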

Related

Use of uninitialized value $key_value in print at

This is my first post, so please excuse any irregularities. I am a newbie to Perl and I have the following issue: I get the error "Use of uninitialized value ... at" when I use the param function in Perl. Here's the code.
use CGI qw(param);
print "Content-type: text/plain \n\n";
$key_value=param('sososo');
print $key_value;
and my html file is
<input type="radio" name="rate" id="sososo" value="1"/>
<label for="sososo">so</label> <br>
In other words I want the value 1 to be displayed. But obviously it does not assign the value to $key_value. I don't know why. Thank you in advance.
To get the value of an input field, you have to use the name of the element, not the id. The browser submits name=value pairs; the id is never sent to the server.
Use
$key_value=param('rate');
instead of
$key_value=param('sososo');
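To illustrate why the id cannot work, here is a rough sketch that splits a query string by hand. The helper parse_query is hypothetical and does no URL-decoding; it only stands in for what param() does internally, and the query string 'rate=1' is an assumption about what the browser would send when the radio button from the question is checked.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query string such as "rate=1&x=2" into a hash.
# Illustration only: no URL-decoding, no handling of repeated keys.
sub parse_query {
    my ($query) = @_;
    my %param;
    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        $param{$key} = $value;
    }
    return %param;
}

# Assumed submission for the checked radio button: the name, not the id.
my %param = parse_query('rate=1');

print defined $param{sososo} ? "sososo=$param{sososo}\n" : "sososo is undefined\n";
print "rate=$param{rate}\n";
```

The id "sososo" never appears in the submitted data, which is why looking it up yields an undefined value.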

how to search and take particular text in perl

I have one folder that contains 'n' HTML files. I read the files and take one line from each, i.e. I collect the <img /> tags in an array and then print the array. At the moment the array is not printed. Can you help me? My code is here.
use strict;
use File::Basename;
use File::Path;
use File::Copy;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Excel';
print "Welcome to PERL program\n";
#print "\n\tProcessing...\n";
my $foldername = $ARGV[0];
opendir(DIR,$foldername) or die("Cannot open the input folder for reading\n");
my (@htmlfiles) = grep/\.html?$/i, readdir(DIR);
closedir(DIR);
@htmlfiles = grep!/(?:index|chapdesc|listdesc|listreview|addform|addform_all|pattern)\.html?$/i,@htmlfiles;
# print "HTML file is @htmlfiles";
my %fileimages;
my $search_for = 'img';
my $htmlstr;
for my $files (@htmlfiles)
{
if(-e "$foldername\\$files")
{
open(HTML, "$foldername\\$files") or die("Cannot open the html files '$files' for reading");
local undef $/;my $htmlstr=<HTML>;
close(HTML);
$fileimages{uc($2)}=[$1,$files] while($htmlstr =~/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi);
}
}
In command prompt.
perl findtext.pl "C:\viji\htmlfiles"
regards, viji
I would like to point out that parsing HTML with regexes is futile. See the epic https://stackoverflow.com/a/1732454/1521179 for the answer.
Your regex to extract image tags is quite broken. Instead of using a HTML parser and walking the tree, you search for a string that…
/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi
begins with <img
after exactly one space, the sequence id=" is found. The contents of that attribute are captured if it is found, else the match fails. The closing " is consumed.
after exactly one space, the sequence src="./images/ is found,
followed by a character that is not t. (This allows for ", of course).
This is followed by any number of any characters that are not slashes or <> characters (This allows for ", again),
followed by a slash.
now capture this:
one or more characters that are not dots
followed by the suffix .jpg
after which " has to follow immediately.
false positives
Here is some data that your regex will match, where it shouldn't:
<ImG id="" src="./ImAgEs/s" alt="foo/bar.jpg"
So what is the image path you get? ./ImAgEs/s" alt="foo/bar.jpg may not be what you wanted.
<!-- <iMg id="" src="./images/./foobar.jpg" -->
Oops, I matched commented content. And the path does not contain a subfolder of ./images. The . folder is completely valid in your regex, but denotes the same folder. I could even use .., which would be the folder of your HTML file. Or I could use ./images/./t-rex/image.jpg, which would match a forbidden t-folder.
false negatives
Here is some data you would want, but that you won't get:
<img
id="you-cant-catch-me"
src='./images/x/awesome.jpg' />
Why? Newlines: you only allow single spaces between the attributes. Also, you don't allow single quotes (').
<img src="./images/x/awesome.jpg" id="you-cant-catch-me" />
Why? I now have single spaces, but swapped the arguments. But both these fragments denote the exact same DOM and therefore should be considered equivalent.
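The four cases above can be checked with a short script. This is only a sketch for experimenting: the regex is copied from the question, while the helper name captured_file and the sample labels are made up for illustration.

```perl
use strict;
use warnings;

# The regex from the question, unchanged.
my $re = qr/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/i;

# Hypothetical helper: returns the captured file name, or undef if no match.
sub captured_file {
    my ($html) = @_;
    return $html =~ $re ? $2 : undef;
}

# Label / HTML pairs taken from the examples discussed above.
my @samples = (
    [ 'false positive, attribute bleed' => '<ImG id="" src="./ImAgEs/s" alt="foo/bar.jpg"' ],
    [ 'false positive, inside comment'  => '<!-- <iMg id="" src="./images/./foobar.jpg" -->' ],
    [ 'false negative, newlines/quotes' => "<img\nid=\"you-cant-catch-me\"\nsrc='./images/x/awesome.jpg' />" ],
    [ 'false negative, swapped order'   => '<img src="./images/x/awesome.jpg" id="you-cant-catch-me" />' ],
);

for my $sample (@samples) {
    my ($label, $html) = @$sample;
    my $file = captured_file($html);
    printf "%-34s %s\n", $label, defined $file ? "matched: $file" : 'no match';
}
```

The first two samples match although they should not, and the last two are missed although they describe a perfectly ordinary image.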
Conclusion
Go to http://www.cpan.org/ and search for HTML and Tree. Use a module to parse your HTML, walk the tree, and extract all matching nodes.
Also, add a print statement somewhere. I found a
use Data::Dumper;
print Dumper \%fileimages;
quite enlightening for debug purposes.

Pass textbox input to perl script on server

Sorry in advance for a potentially dumb newbie question, but here goes.
I am learning web app programming and I would like to have an input textbox on my webpage where the user enters some text. Then I capture that text and pass to a perl script which generates some output. I then take this text output and pass it back to the webpage.
Can someone point me in the right direction on how to do this?
Can be a really simple example, where the user inputs some text. I take the text and pass to a perl script which turns everything to uppercase - uc() - and then passes back to the webpage.
Thanks
In your html body:
<FORM ACTION="/cgi-bin/results.pl">
<P>Enter a value: <INPUT NAME="value">
<P><INPUT TYPE="SUBMIT" VALUE="Next">
</FORM>
In your results.pl:
use CGI qw(:standard);
my $value = uc(param('value'));
print header;
print start_html;
print p($value);
print end_html;
The page needs to contain a form. The action attribute of the form needs to point to a URL that your webserver will process with the Perl program. The simplest way to achieve this is using CGI; a more modern approach uses PSGI. Most Perl form processing libraries use an interface similar to CGI.pm's:
use CGI;
my $q = CGI->new;
my $text_box_value = $q->param( 'my_text_box_name' );
This is a decent CGI tutorial: http://www.tutorialspoint.com/perl/perl_cgi.htm . Or there's this http://www.cgi101.com/book/ or this http://www.lies.com/begperl/ or this http://websitehelpers.com/perl/ all found here: http://www.google.com/search?q=perl+CGI+tutorial

I want to check if number starts with 4 or 5 in CGI

I need to write a CGI script, which is new to me. I am trying to write an if/else with a condition on numbers starting with 5 or 6: run one piece of code if the number starts with 5 and another if it starts with 6.
use 5.013;
use warnings;
use Scalar::Util qw( looks_like_number );
use CGI;
my $param = CGI->new()->param('some_example');
given (substr $param, 0, 1) {
when (! looks_like_number($_) ) { say 'Not a number' }
when (5) { say 'starts with 5' }
when (6) { say 'starts with 6' }
}
Alternatively, rather than using substr to get the first character, pass $param to given directly and replace (5) with a regex of your choice.
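For comparison, here is a minimal sketch of the regex variant written with plain if-style checks instead of given/when, which is an experimental feature. The helper name classify and the sample values are hypothetical, and the sketch assumes "number" means a string of digits only.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a value by its first digit.
# Assumption: only strings made up entirely of digits count as numbers.
sub classify {
    my ($value) = @_;
    return 'not a number'  unless defined $value && $value =~ /^\d+$/;
    return 'starts with 5' if $value =~ /^5/;
    return 'starts with 6' if $value =~ /^6/;
    return 'starts with something else';
}

# Sample values standing in for the CGI parameter.
print "$_: ", classify($_), "\n" for qw(512 64 7 abc);
```

In a CGI script the argument would be the value returned by param() rather than a hard-coded sample.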
I don't think you understand what CGI is. CGI is simply a set of environment variables that are set up by the webserver, and your program is executed with them. The output of the program becomes the webpage.
So if you want to write a CGI script in Python, PHP, C, Assembly, Whitespace... as long as it can be called and use environment variables, it's fine.
So this is really a language question. Which language are you using?
EDIT You specified Perl in a comment to this answer. I suggest you edit the question.
What's your input number? The Perl script will be run with a whole truckload of extra environment variables. Two of the most important are QUERY_STRING and REQUEST_METHOD. CGI consists of a specification of these environment variables, so any language can be used to write CGI.
Consider perl_cgi.cgi?something=else. The bit following the ? is the QUERY_STRING. You can specify this directly as part of an anchor:
<a href="perl_cgi.cgi?something=else">Run with something equals else</a>
or as part of a form (one of GET or POST, defaults to GET):
<form action="perl_cgi.cgi" method="[GET or POST]">
<input type="text" name="something" value="else"/>
<input type="submit" value="Submit!"/>
</form>
This will run your program with the same query string as above (or a different parameter, if the text box is changed) but REQUEST_METHOD will be either GET or POST depending.
So let's write a Perl CGI script to print the first number of the string we get (we're only passed strings):
use CGI;
$cgi=new CGI;
$x=$cgi->param('x');
$firstnum=substr($x, 0, 1);
print "Content-type: text/html\n\n";
print <<"EOF";
<html>
<head>
<title>My sample HTML page</title>
</head>
<body>
<p>The first number of $x is $firstnum</p>
</body>
</html>
EOF
This presupposes that this program is run as [program_name]?x=[some string]. It's up to you to make sure that's the case.
That should give you enough to go on. You can check $firstnum to see if it's 5 or 6, then do different things depending on the result.

SSI not producing output, not giving error either

in the html file:
<!--#exec cgi="/cgi-bin/test.pl"-->
the perl script:
#!/usr/bin/perl
print "Content-Type: text/html\n\n";
print "<input type=\"hidden\" name=\"aname\" value=\"avalue\">\n";
print "<img src=\"/cgi-bin/script.pl\" />";
This does not give me an 'error processing directive' error, nor does it output my HTML in place of the tag. I'll also add that the SSI tag gets replaced with nothing.
Are you sure the script is executing? If you print something to STDERR, does it show up in the error log?
Beyond that I have a few comments:
I'm pretty sure printing the Content-Type is redundant, you (well, Apache anyway) have already done that by serving the HTML file that contains the SSI.
reference
exec is really meant for running commands like 'ls -l'. You should use include virtual instead. It also allows you to add arguments to the URL, e.g.
<!--#include virtual="/cgi-bin/example.cgi?argument=value" -->
Do yourself a favor and use qq[] instead of the double quotes. You won't have to escape everything then, e.g.
print qq[<input type="hidden" name="aname" value="avalue">\n];
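As a quick check of the qq[] suggestion, the following sketch compares the escaped string from the question with a qq[] version. It assumes the intended markup is the hidden input printed by the original script.

```perl
use strict;
use warnings;

# The string as written in the question, with every quote escaped.
my $escaped = "<input type=\"hidden\" name=\"aname\" value=\"avalue\">\n";

# The same string using qq[], which still interpolates \n and variables.
my $quoted = qq[<input type="hidden" name="aname" value="avalue">\n];

print $escaped eq $quoted ? "identical\n" : "different\n";
```

Square brackets work well as delimiters here because the markup itself contains none.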