I have a Perl CGI script that finds a keyword in different types of texts and produces sorted output.
The present output looks like this
for a keyword "the".
But I would like to highlight the keyword "the" in bold in this output.
How is it possible to do this?
I tried using Term::ANSIColor but it prints the entire sentence in bold and not only the keyword.
My CGI script looks like this:
#!/usr/bin/perl
# require
use warnings;
use diagnostics;
use CGI;
use File::Basename;
my $q = new CGI;
print "Content-type: text/plain\n\n";
#initialize variables
my $target = $q->param( "keyword" );
my $radius = $q->param( "span" );
my $ordinal = $q->param( "ord" );
my $width = 2 * $radius;
#print standard output
print "****************************\n";
print "* SEARCH RESULTS *\n";
print "****************************\n";
print "Your Search word is: $target\n\n";
print "Your Radius is : $radius\n\n";
print "Your Ordinal is : $ordinal\n\n";
print "-----------------------------------------------------------\n\n";
#normal concordance for abstract text
my @files = glob( 'ABS/*.txt' );
for my $file ( @files ) {
my $path = glob( 'ABS/*.txt' );
my $file_name = basename( $path );
# initialize
my $count = 0;
my @lines = ();
$/ = ""; # Paragraph read mode
# open the file, and process each line in it
open( FILE, " < $file" ) or die( "Can not open $file ($!).\n" );
while ( <FILE> ) {
# re-initialize
my $extract = '';
# normalize the data
chomp;
s/\n/ /g; # Replace new lines with spaces
s/\b--\b/ -- /g; # Add spaces around dashes
# process each item if the target is found
while ( $_ =~ /\b$target\b/gi ) {
# find start position
my $match = $1;
my $pos = pos;
my $start = $pos - $radius - length( $match );
# extract the snippets
if ( $start < 0 ) {
$extract = substr( $_, 0, $width + $start + length( $match ) );
$extract = ( " " x -$start ) . $extract;
}
else {
$extract = substr( $_, $start, $width + length( $match ) );
my $deficit = $width + length( $match ) - length( $extract );
if ( $deficit > 0 ) {
$extract .= ( " " x $deficit );
}
}
# add the extracted text to the list of lines, and increment
$lines[$count] = $extract;
++$count;
}
}
sub removePunctuation {
my $string = $_[0];
$string = lc( $string ); # Convert to lowercase
$string =~ s/[^-a-z ]//g; # Remove non-alphabetic characters
$string =~ s/--+/ /g; # Replace 2+ hyphens with a space
$string =~ s/-//g; # Remove hyphens
$string =~ s/\s+/ /g; # Collapse runs of whitespace into one space
return ( $string );
}
sub onLeft {
#USAGE: $word = onLeft($string, $radius, $ordinal);
my $left = substr( $_[0], 0, $_[1] );
$left = removePunctuation( $left );
my @word = split( /\s+/, $left );
return ( $word[ -$_[2] ] );
}
sub byLeftWords {
my $left_a = onLeft( $a, $radius, $ordinal );
my $left_b = onLeft( $b, $radius, $ordinal );
lc( $left_a ) cmp lc( $left_b );
}
# process each line in the list of lines
my $line_number = 0;
print "File name: $file_name \n\n";
foreach my $x ( sort byLeftWords @lines ) {
++$line_number;
printf "%5d", $line_number;
print " $x\n\n";
}
print "------------------------------------------------------------\n\n";
}
# done
exit;
Thanks.
You can't control the style of characters displayed in a text/plain document. Term::ANSIColor certainly won't work, as the browser doesn't recognize the ANSI terminal escape sequences, which work only on your console.
If you write your output as HTML then you just need to put <strong>...</strong> around the word to be emboldened.
Or, better, you can write a CSS style like this
.bold {
font-weight: bold;
}
and wrap the words in <span class="bold">...</span>
Update
To convert your text document to HTML, the simplest way is to wrap it in a <pre>...</pre> ("preformatted") element and put it into the standard HTML structure. Less-than <, greater-than > and ampersand & characters must be replaced with their corresponding entities &lt;, &gt; and &amp; respectively, and you can wrap the words you want in bold in <strong> tags.
The minimum HTML5 document which is also friendly with the majority of modern browsers looks like this
<!doctype html>
<html lang=en>
<head>
<meta charset=UTF-8>
<title>Search Results</title>
</head>
<body>
<pre>
Content of my report
with words to be in bold wrapped in &lt;strong&gt; tags
and characters `&lt;`, `&gt;`, and `&amp;` replaced by
`&amp;lt;`, `&amp;gt;` and `&amp;amp;` respectively
</pre>
</body>
</html>
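As a rough sketch of how each output line could be prepared for that <pre> block: the subroutine names `escape_html` and `highlight` below are hypothetical and do not come from the question. The sketch splits the line on the keyword first and escapes the pieces afterwards, so that a keyword such as "amp" cannot accidentally match inside an entity. It assumes the keyword is a plain word, matched case-insensitively on word boundaries as in the question's script.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the three HTML-significant characters.
# The ampersand must be handled first, or the other entities get re-escaped.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: escape a line and wrap each whole-word,
# case-insensitive occurrence of $keyword in <strong> tags.
sub highlight {
    my ( $line, $keyword ) = @_;

    # With a capture group, split returns the matches as well:
    # even indexes are ordinary text, odd indexes are the keyword hits.
    my @parts = split /(\b\Q$keyword\E\b)/i, $line, -1;

    my $html = '';
    for my $i ( 0 .. $#parts ) {
        my $piece = escape_html( $parts[$i] );
        $html .= $i % 2 ? "<strong>$piece</strong>" : $piece;
    }
    return $html;
}

print highlight( 'Over the hill & <far> away, THE end', 'the' ), "\n";
```

For the browser to interpret the tags, the script would also have to send `Content-type: text/html` instead of `text/plain`.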
I have written the Perl code below, but the \n newline character in the regex of my if condition never matches.
#!/usr/bin/perl
use strict;
#use warnings;
use Cwd;
use File::Basename;
use File::Copy;
my $path = getcwd;
#print $path."\n";
opendir( INP, "$path\/" );
my @out = grep( /.(xml)$/, readdir(INP) );
close INP;
#print @out;
open( F6, ">Log.txt" );
foreach my $f1 (@out) {
open( FF, "<$path\/$f1" ) or die "Cannot open file: $out[0]";
my $data1 = join( "", <FF> );
my @FILE_KA_ARRAY = split( /\n/, $data1 );
my $file_ka_len = @FILE_KA_ARRAY;
#print F6 $file_ka_len."\n";
#print F6 $f."\t".$file_ka_len."\n";
print F6 $f1 . "\n";
for ( my $x = 1; $x < $file_ka_len; $x++ ) {
my $y = $x + 1;
my $temp_file_arr = "";
$temp_file_arr = $FILE_KA_ARRAY[$x];
#print F6 $temp_file_arr."\t$x\n";
my $temp1 = $temp_file_arr;
if ( $temp1
=~ m#(<list .*? depth="(\d+)">)\n?(<list .*? depth="(\d+)">)#gs )
{
my $list3 = $1;
print F6 "\t\t\t\t\t\t\t\t" . $y . "\t\t" . $list3 . "\n";
}
}
}
Assuming your problem line is this:
if($temp1=~m#(<list .*? depth="(\d+)">)\n?(<list .*? depth="(\d+)">)#gs)
Then the problem is here:
my @FILE_KA_ARRAY = split(/\n/, $data1);
Your split removes the linefeeds and puts each line into the array. And so when you do:
$temp_file_arr = $FILE_KA_ARRAY[$x];
my $temp1=$temp_file_arr;
You have no linefeeds in $temp1, because split has already removed them from every element of the array.
Additionally though:
Don't turn off warnings. If you have warnings, fix them.
This looks like XML. Use a parser. (Although I'd avoid XML::Simple - it's nasty.)
Indenting your code is a good thing, because it helps clarify your code.
If you use glob( "$path/*.xml" ) instead of readdir and grep, you get a list of paths built in.
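One way to keep the linefeeds available to the regex is to match against the whole file contents rather than against the split lines. The following is only a sketch: the sample data is invented, the `[^>]*?` tightening of the pattern is an assumption about what the tags look like, and a real XML parser would still be the more robust choice. The line number is recovered by counting the newlines that precede the match.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the contents of one .xml file
my $data = join "\n",
    '<doc>',
    '<list type="ordered" depth="1">',
    '<list type="bullet" depth="2">',
    '<item>text</item>',
    '</list>',
    '</list>',
    '</doc>',
    '';

my @hits;
while ( $data =~ m{(<list [^>]*? depth="(\d+)">)\n(<list [^>]*? depth="(\d+)">)}g ) {
    my ( $first, $second ) = ( $1, $3 );

    # $-[0] is the offset where the whole match starts;
    # the number of newlines before it gives the line number.
    my $before  = substr( $data, 0, $-[0] );
    my $line_no = 1 + ( $before =~ tr/\n// );

    push @hits, [ $line_no, $first, $second ];
}

printf "line %d: %s followed by %s\n", @$_ for @hits;
```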
I'm trying to extract a certain string of numbers from a text file using a regular expression, but when my code runs, it also grabs the numbers after the slash that separates date and time. Here is what I have so far.
while ( <INFILE> ) {
my @fields = split( /\ /, $_ );
my @output;
foreach my $field ( @fields ) {
if ( $field =~ /[0-9]{5}\// ) {
push @output, $field;
}
}
if ( @output ) {
my $line = join( ' ', @output );
print "$line\n";
print OUTFILE "$line\n";
}
}
The line I am trying to extract data from is
D2001235 9204 254/2004 254/1944 254/2041 15254/2011 ALL-V4YM 001 AUTO C-C0000
The data I need is the 15254 but when I run my code it returns 15254/2011 and my program is erroring out.
The problem is that you are storing the entire $field in the output array, but you only want the number to the left of the slash to be stored. You could use capturing parentheses in the regular expression and the $1 special variable. This outputs 15254:
use warnings;
use strict;
while (<DATA>) {
my @fields = split( /\ /, $_ );
my @output;
foreach my $field (@fields) {
if ( $field =~ /^([0-9]{5})\// ) {
push @output, $1;
}
}
if (@output) {
my $line = join( ' ', @output );
print "$line\n";
}
}
__DATA__
D2001235 9204 254/2004 254/1944 254/2041 15254/2011 ALL-V4YM 001 AUTO C-C0000
As explained, you are saving an entire field in @output if it matches the regex, instead of just the part before the slash.
Your split is also unnecessarily complicated, and join isn't needed.
All you need is this:
while ( <INFILE> ) {
my @output = map m{^([0-9]{5})/}, split;
if ( @output ) {
print "@output\n";
print OUTFILE "@output\n";
}
}
This is the program as it stands right now. It takes in a .fasta file (a file containing genetic code), creates a hash table with the data and prints it. However, it is quite slow: it splits a string and compares it against all other letters in the file.
use strict;
use warnings;
use Data::Dumper;
my $total = $#ARGV + 1;
my $row;
my $compare;
my %hash;
my $unique = 0;
open( my $f1, '<:encoding(UTF-8)', $ARGV[0] ) or die "Could not open file '$ARGV[0]' $!\n";
my $discard = <$f1>;
while ( $row = <$f1> ) {
chomp $row;
$compare .= $row;
}
my $size = length($compare);
close $f1;
for ( my $i = 0; $i < $size - 6; $i++ ) {
my $vs = ( substr( $compare, $i, 5 ) );
for ( my $j = 0; $j < $size - 6; $j++ ) {
foreach my $value ( substr( $compare, $j, 5 ) ) {
if ( $value eq $vs ) {
if ( exists $hash{$value} ) {
$hash{$value} += 1;
} else {
$hash{$value} = 1;
}
}
}
}
}
foreach my $val ( values %hash ) {
if ( $val == 1 ) {
$unique++;
}
}
my $OUTFILE;
open $OUTFILE, ">output.txt" or die "Error opening output.txt: $!\n";
print {$OUTFILE} "Number of unique keys: " . $unique . "\n";
print {$OUTFILE} Dumper( \%hash );
close $OUTFILE;
Thanks in advance for any help!
It is not clear from the description what is wanted from this script, but if you're looking for matching sets of 5 characters, you don't actually need to do any string matching: you can just run through the whole sequence and keep a tally of how many times each 5-letter sequence occurs.
use strict;
use warnings;
use Data::Dumper;
my $str; # store the sequence here
my %hash;
# slurp in the whole file
open(IN, '<:encoding(UTF-8)', $ARGV[0]) or die "Could not open file '$ARGV[0]' $!\n";
while (<IN>) {
chomp;
$str .= $_;
}
close(IN);
# not sure if you were deliberately omitting the last two letters of sequence
# this looks at all the sequence
my $l_size = length($str) - 4;
for (my $i = 0; $i < $l_size; $i++) {
$hash{ substr($str, $i, 5) }++;
}
# grep in a scalar context will count the values.
my $unique = grep { $_ == 1 } values %hash;
open OUT, ">output.txt" or die "Error opening output.txt: $!\n";
print OUT "Number of unique keys: ". $unique."\n";
print OUT Dumper(\%hash);
close OUT;
It might help to remove searching for information that you already have.
I don't see that $j depends upon $i so you're actually matching values to themselves.
So you're getting bad counts as well. It works for 1, because 1 is the square of 1.
But if for each five-character string you're counting strings that match, you're going
to get the square of the actual number.
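A small experiment makes the squaring visible. The sequence below is made up for illustration, and for simplicity the sketch looks at every 5-character window rather than stopping two short as the question's loops do. `%single` holds the true tally from one pass, while `%nested` is filled the way the question's double loop fills its hash.

```perl
use strict;
use warnings;

my $seq  = 'ABCDEABCDEXY';      # hypothetical sequence, not real FASTA data
my $last = length($seq) - 5;    # last valid start offset of a 5-character window

my ( %nested, %single );
for my $i ( 0 .. $last ) {
    my $vs = substr $seq, $i, 5;
    $single{$vs}++;             # one pass: the true number of occurrences

    # Question-style inner loop: compare against every offset, including $i
    for my $j ( 0 .. $last ) {
        $nested{$vs}++ if substr( $seq, $j, 5 ) eq $vs;
    }
}

printf "%s single=%d nested=%d\n", $_, $single{$_}, $nested{$_}
    for sort keys %single;
```

ABCDE occurs twice in this sequence, so the nested loops record it four times, while every substring that occurs once is still recorded once.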
You would actually get better results if you did it this way:
# compute it once.
my $lim = length( $compare ) - 6;
for ( my $i = 0; $i < $lim; $i++ ){
my $vs = substr( $compare, $i, 5 );
# count each unique identity *once*
# if it's in the table, we've already counted it.
next if $hash{ $vs };
$hash{ $vs }++; # we've found it, record it.
for ( my $j = $i + 1; $j < $lim; $j++ ) {
my $value = substr( $compare, $j, 5 );
$hash{ $value }++ if $value eq $vs;
}
}
However, it could be an improvement on this to do an index for your second loop
and let the c-level of perl do your matching for you.
my $pos = $i;
while ( $pos > -1 ) {
$pos = index( $compare, $vs, ++$pos );
$hash{ $vs }++ if $pos > -1;
}
Also, if you used index, and wanted to omit the last two characters--as you do, it might make sense to remove those from the characters you have to search:
substr( $compare, -2 ) = ''
But you could do all of this in one pass, as you loop through file. I believe the code
below is almost an equivalent.
my $last_4 = '';
my $last_row = '';
my $discard = <$f1>;
# each row in the file after the first...
while ( $row = <$f1> ) {
chomp $row;
$last_row = $row;
$row = $last_4 . $row;
my $lim = length( $row ) - 4;
for ( my $i = 0; $i < $lim; $i++ ) {
$hash{ substr( $row, $i, 5 ) }++;
}
# four is the maximum we can copy over to the new row and not
# double count a strand of characters at the end.
$last_4 = substr( $row, -4 );
}
# I'm not sure what you're getting by omitting the last two characters of
# the last row, but this would replicate it
# (assumes the last row holds at least six characters)
foreach my $bad_key ( map { substr( $last_row, $_, 5 ) } ( -5, -6 )) {
--$hash{ $bad_key };
delete $hash{ $bad_key } if $hash{ $bad_key } < 1;
}
# grep in a scalar context will count the values.
$unique = grep { $_ == 1 } values %hash;
You may be interested in this more concise version of your code that uses a global regex match to find all the subsequences of five characters. It also reads the entire input file in one go, and removes the newlines afterwards.
The path to the input file is expected as a parameter on the command line. The output is sent to STDOUT, and can be redirected to a file on the command line, like this
perl subseq5.pl input.txt > output.txt
I've also used Data::Dump instead of Data::Dumper because I believe it to be vastly superior. However it is not a core module, and so you will probably need to install it.
use strict;
use warnings;
use open qw/ :std :encoding(utf-8) /;
use Data::Dump;
my $str = do { local $/; <>; };
$str =~ tr/\n//d; # delete newlines (tr does not interpolate $/)
my %dups;
++$dups{$1} while $str =~ /(?=(.{5}))/g;
my $unique = grep $_ == 1, values %dups;
print "Number of unique keys: $unique\n";
dd \%dups;
I have a file from which I want to extract every line that starts with CDS, together with the line below it.
These lines look like this:
CDS 297300..298235
/gene="ENSBTAG00000035659"
I found this on your site:
open(FH,'FILE');
while ($line = <FH>) {
if ($line =~ /Pattern/) {
print "$line";
print scalar <FH>;
}
}
and it works great when the CDS entry is only one line.
Sometimes, though, the entry in my file looks like this:
CDS join(complement(416559..416614),complement(416381..416392),
complement(415781..416087))
/gene="ENSBTAG00000047603"
or with more lines in the CDS.
How can I take only the CDS lines and the following line with the ID?
Please, I need your help!
Thank you in advance.
Assuming the "next line" always contains /gene=, one can use the flip-flop operator.
while (<>) {
print if m{^CDS} ... m{/gene=};
}
Otherwise, you need to parse the CDS line. It might be sufficient to count parens.
my $depth = 0;
my $print_next = 0;
while (<>) {
if (/^CDS/) {
print;
$depth = tr/(// - tr/)//;
$print_next = 1;
}
elsif ($depth) {
print;
$depth += tr/(// - tr/)//;
}
elsif ($print_next) {
print;
$print_next = 0;
}
}
You need to break the input into outdented paragraphs. An outdented paragraph starts with a non-space character on its first line, and every following line starts with space characters.
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
my $input_file = shift @ARGV;
my $para = undef; # holds partial paragraphs
open my $in_fh, '<', $input_file or die "could not open $input_file: $!\n";
while( my $line = <$in_fh> ){
# paragraphs are outdented, that is, start with a non-space character
if( $line =~ m{ \A \S }msx ){
# don't do if very first line of file
if( defined $para ){
# If paragraph starts with CDS
if( $para =~ m{ \A CDS \b }msx ){
process_CDS( $para );
}
# delete the old paragraph
$para = undef;
}
}
# add the line to the paragraph,
$para .= $line;
}
close $in_fh or die "could not close $input_file: $!\n";
# the last paragraph is not handled inside the loop, so do it now
if( defined $para ){
# If paragraph starts with CDS
if( $para =~ m{ \A CDS \b }msx ){
process_CDS( $para );
}
}
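The code above calls process_CDS() without showing it. A minimal sketch of what it might do, assuming the goal is to print the CDS location lines up to and including the first /gene= line of the paragraph: the helper name `cds_lines` and the sample paragraph are hypothetical, and a paragraph without a /gene= line is returned in full.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of a CDS paragraph up to and
# including the first line that contains /gene=.
sub cds_lines {
    my ($para) = @_;
    my @keep;
    for my $line ( split /\n/, $para ) {
        push @keep, $line;
        last if $line =~ m{/gene=};
    }
    return join( "\n", @keep ) . "\n";
}

# One possible body for process_CDS(): print the selected lines.
sub process_CDS {
    my ($para) = @_;
    print cds_lines($para);
    return;
}

# Invented paragraph modelled on the multi-line example in the question
my $sample = join "\n",
    'CDS             join(complement(416559..416614),complement(416381..416392),',
    '                complement(415781..416087))',
    '                /gene="ENSBTAG00000047603"',
    '                /note="hypothetical extra qualifier"',
    '';

process_CDS($sample);
```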
Which version would you prefer?
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;
my $p = 7; # 33
my $prompt = ' : ';
my $key = 'very important text';
my $value = 'Hello, World!';
my $length = length $key . $prompt;
$p -= $length;
Option 1:
$key = $key . ' ' x $p . $prompt;
Option 2:
if ( $p > 0 ) {
$key = $key . ' ' x $p . $prompt;
}
else {
$key = $key . $prompt;
}
say "$key$value"
I would prefer
sprintf "%-7s : %s", $key, $value;
or
sprintf "%-*s : %s", $p, $key, $value;
instead of all this weird stuff.
From the sprintf documentation:
The flag characters
'-' The converted value is to be left adjusted on the field boundary. (The default is right justification.) The converted value is padded on the right with blanks, rather than on the left with blanks or zeros. A '-' overrides a 0 if both are given.
The field width
An optional decimal digit string (with nonzero first digit) specifying a minimum field width. If the converted value has fewer characters than the field width, it will be padded with spaces on the left (or right, if the left-adjustment flag has been given). Instead of a decimal digit string one may write '*' or '*m$' (for some decimal integer m) to specify that the field width is given in the next argument, or in the m-th argument, respectively, which must be of type int. A negative field width is taken as a '-' flag followed by a positive field width. In no case does a nonexistent or small field width cause truncation of a field; if the result of a conversion is wider than the field width, the field is expanded to contain the conversion result.
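A quick sketch of those rules in action, using values from the question plus an invented short key. It shows the left-justified padding, the fact that a key longer than the field width is not truncated, and the `*` form that takes the width from the argument list.

```perl
use strict;
use warnings;

# Key shorter than the field: padded on the right to 7 characters
my $padded = sprintf "%-7s : %s", 'key', 'v';

# Key longer than the field: the field expands, nothing is cut off
my $long = sprintf "%-7s : %s", 'very important text', 'Hello, World!';

# Width taken from the argument list (10 here) via '*'
my $star = sprintf "%-*s : %s", 10, 'key', 'v';

print "[$_]\n" for $padded, $long, $star;
```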
I don't like option 2 as it introduces an unnecessary special case.
I would refactor out the construction of the prompt suffix:
# Possibly at top of program
my $suffix = ( ' ' x $p ) . $prompt;
# Later...
$key .= $suffix ;
Call me old-school, but I'd use printf() or sprintf():
printf "%-33s%s%s\n", $key, $prompt, $value;
That left justifies the string $key into 33 spaces, adds $prompt and $value and a newline. If I wanted to calculate the length for the first part dynamically:
printf "%-*s%s%s\n", $len, $key, $prompt, $value;
Since it is one line instead of the question's 4 (option 1) or 6 (option 2), it scores favourably on the succinctness scale.
It looks a little weird, but this works (so far):
#!/usr/bin/env perl
use warnings; use strict;
use 5.010;
use utf8;
use Term::Size;
my $columns = ( Term::Size::chars *STDOUT{IO} )[0];
binmode STDOUT, ':encoding(UTF-8)';
use Text::Wrap;
use Term::ANSIColor;
sub my_print {
my( $key, $value, $prompt, $color, $p ) = @_;
my $length = length $key.$prompt;
$p -= $length;
my $suff = ( ' ' x $p ) . $prompt;
$key .= $suff;
$length = length $key;
my $col = $columns - $length;
$Text::Wrap::columns = $col;
my @array = split /\n/, wrap ( '','', $value ) ;
$array[0] = colored( $key, $color ) . $array[0];
for my $idx ( 1..$#array ) {
$array[$idx] = ( ' ' x $length ) . $array[$idx];
}
say for @array;
}
my $prompt = ' : ';
my $color = 'magenta';
my $p = 30;
my $key = 'very important text';
my $value = 'text ' x 40;
my_print( $key, $value, $prompt, $color, $p );