Strip paragraph tags from a block of text - preg-replace

I am just trying to get my head round preg_replace, but it is driving me round the bend. All the examples I seem to find are a million times more complicated than I need.
What I'm trying to do is just strip the paragraph tags from a block of text.
So...
$text = '<p>Some block of text</p>';
should become
$afterreplace = 'Some block of text';
So I'm wondering how on earth I do this with preg_replace($pattern, $replacement, $string);
I kind of get this far, but then I'm not sure how to tell it to strip paragraphs:
preg_replace($pattern, $replacement, $text);

For this specific example I would just use str_replace. It is way faster.
You just replace "<p>" and "</p>" with an empty string ("").

This worked in the end:
$text = '<p>Some Long Text String</p>';
$replaced = str_replace(array('<p>','</p>'),array('',''),$text);

Related

How do I select text between two specific symbols in Perl?

I'm very new to Perl. I'm currently going through this Perl file and I've got this variable that I was able to format down to all the text after the "<" symbol, using this line I found in another Stack Overflow question.
($tempVariable) = $Line =~ /(\<.*)\s*$/;
So currently whenever I print this variable, I get the output
$tempVariable = <some text here #typeOf and more text here after
I need to get everything between the "<" symbol and the "#" symbol.
I tried looking at other Stack Overflow questions and adapting their answers to mine, but I keep getting errors, so if anybody could help me out I would appreciate it.
my ($substr) = $str =~ /<([^<\#]*)\#/
or die "No match";
You'll need a regex that
- looks for the starting < character
- then (your question is unclear on this point) either
  - captures one-or-more non-# characters, or
  - captures zero-or-more non-# characters
- looks for the trailing # character
Also not specified in your question: do you need to strip leading and trailing white space from the match?
I.e.
#!/usr/bin/perl
use warnings;
use strict;
my $Line = '<some text here #typeOf and more text here after';
my $tempVariable;
# alternative 1: one-or-more characters
($tempVariable) = $Line =~ /<([^#]+)#/
or die "No match alternative 1";
print "Alternative 1: '${tempVariable}'\n";
# alternative 2: zero-or-more characters
($tempVariable) = $Line =~ /<([^#]*)#/
or die "No match alternative 2";
print "Alternative 2: '${tempVariable}'\n";
exit 0;
Test run (white space is not stripped):
$ perl dummy.pl
Alternative 1: 'some text here '
Alternative 2: 'some text here '
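If the white space around the match should be stripped as well, one option is to let the regex skip it. This is only a minimal sketch: the sub name between_markers is made up for illustration, and it assumes you want the text between the first "<" and the next "#".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): returns the text between the
# first '<' and the next '#', without surrounding white space, or undef if
# the line does not match.
sub between_markers {
    my ($line) = @_;
    # \s* after '<' skips leading white space; the non-greedy capture stops
    # before any white space that directly precedes '#'.
    my ($inner) = $line =~ /<\s*([^#]*?)\s*#/;
    return $inner;
}

my $Line = '<some text here #typeOf and more text here after';
my $trimmed = between_markers($Line);
print defined $trimmed ? "Trimmed: '${trimmed}'\n" : "No match\n";
```

Doing the trimming inside the pattern avoids a second substitution pass, at the cost of a slightly harder-to-read regex.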

simple preg_replace rule that I can't get to work

I can't work out how to write this preg_replace. I haven't tried anything yet because I don't know where to start; it is too hard to understand.
The string is index-D.html, where D is a number from 0 to 99999.
How do I replace every occurrence of that string, index-D.html, with an empty string?
The manual is pretty clear and provides examples as well:
$string = "index-123.html where the number is from 0-99999";
// The pattern needs delimiters; \d{1,5} matches 0-99999 and \. matches a literal dot
$pattern = '/index-\d{1,5}\.html/';
$new_string = preg_replace($pattern, "", $string);

How do I count the "real" words in a text with Perl?

I've run into a text processing problem. I've an article, and I'd like to find out how many "real" words there are.
Here is what I mean by "real". Articles usually contain various punctuation marks such as dashes, commas, dots, etc. What I'd like to find out is how many words there are, skipping things like "-" dashes and "," commas that are surrounded by spaces.
I tried doing this:
my @words = split ' ', $article;
print scalar @words, "\n";
But that counts punctuation marks surrounded by spaces as words.
So I'm thinking of using this:
my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";
This would match all words that have letters or digits in them. What do you think, would this be a good enough way to count words in an article?
Does anyone know of a module on CPAN that does this?
Try to use: \W - any non-word character, and also drop _
Solution
use strict;
my $article = 'abdc, dd_ff, 11i-11, ff44';
# handles possessives such as "David's", but it didn't work with "I'm" or "There's"
$article =~ s/\'//g;
my $number_words = scalar (split /[\W_]+/, $article);
print $number_words;
I think your solution is about as good as you're going to get without resorting to something elaborate.
You could also write it as
my @words = $article =~ /\S*\w\S*/g;
or count the words in a file by writing
use feature 'say'; # needed for say
my $n = 0;
while (<>) {
my @words = /\S*\w\S*/g;
$n += @words; # an array in numeric context gives its element count
}
say "$n words found";
Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that then your code works.
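To inspect what actually gets counted, you can print the list of "words" next to the count. A small sketch along those lines; the sub name real_words and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: in list context, returns every white-space-delimited
# token that contains at least one word character (letter, digit or underscore).
sub real_words {
    my ($text) = @_;
    return $text =~ /\S*\w\S*/g;
}

my $sample = "Well - this is, of course , a test ... isn't it?";
my @words  = real_words($sample);

# Print the count and the tokens themselves so they can be checked by eye.
print scalar(@words), " words: @words\n";
```

Note that attached punctuation stays on the token ("is," and "it?"), which does not affect the count but matters if you reuse the list for anything else.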

How can I wrap lines to 45 characters in a Perl program?

I have this text I am writing in a Perl CGI program:
$text = $message;
@lines = split(/\n/, $text);
$lCnt .= $#lines+1;
$lineStart = 80;
$lineHeight = 24;
I want to force a return after 45 characters. How do I do that here?
Thanks in advance for your help.
Look at the core Text::Wrap module:
use Text::Wrap;
my $longstring = "this is a long string that I want to wrap it goes on forever and ever and ever and ever and ever";
$Text::Wrap::columns = 45;
print wrap('', '', $longstring) . "\n";
Check out Text::Wrap. It will do exactly what you need.
Since Text::Wrap for some reason doesn't work for the OP, here is a solution using a regex:
my $longstring = "lots of text to wrap, and some more text, and more "
. "still. thats right, even more. lots of text to wrap, "
. "and some more text.";
my $wrap_at = 45;
(my $wrapped = $longstring) =~ s/(.{0,$wrap_at}(?:\s|$))/$1\n/g;
print $wrapped;
which prints:
lots of text to wrap, and some more text, and
more still. thats right, even more. lots of
text to wrap, and some more text.
The Unicode::LineBreak module can do more sophisticated wrapping of non-English text (especially East Asian scripts) than Text::Wrap, and has some nice features, such as optionally recognizing URIs and avoiding splitting them.
Example:
#!/usr/bin/env perl
use warnings;
use strict;
use Unicode::LineBreak;
my $longstring = "lots of text to wrap, and some more text, and more "
. "still. thats right, even more. lots of text to wrap, "
. "and some more text.";
my $wrapper = Unicode::LineBreak->new(ColMax => 45, Format => "NEWLINE");
print for $wrapper->break($longstring);

Using Perl, how do I show the context around a search term in the search results?

I am writing a Perl script that is searching for a term in large portions of text. What I would like to display back to the user is a small subset of the text around the search term, so the user can have context of where this search term is used. Google search results are a good example of what I am trying to accomplish, where the context of your search term is displayed under the title of the link.
My basic search is using this:
if ($text =~ /$search/i ) {
print "${title}:${text}\n";
}
($title contains the title of the item the search term was found in)
This is too much though, since sometimes $text will be holding hundreds of lines of text.
This is going to be displayed on the web, so I could just provide the title as a link to the actual text, but there is no context for the user.
I tried modifying my regex to capture 4 words before and 4 words after the search term, but ran into problems if the search term was at the very beginning or very end of $text.
What would be a good way to accomplish this? I tried searching CPAN because I'm sure someone has a module for this, but I can't think of the right terms to search for. I would like to do this without modules if possible, because getting modules installed here is a pain. Does anyone have any ideas?
You can use $` and $' to get the string before and after the match, then truncate those values appropriately. But as blixtor points out, shlomif is correct to suggest using @+ and @- to avoid the performance penalty imposed by $` and $':
$foo =~ /(match)/;
my $match = $1;
#my $before = $`;
#my $after = $';
my $before = substr($foo, 0, $-[0]);
my $after = substr($foo, $+[0]);
$after =~ s/((?:(?:\w+)(?:\W+)){4}).*/$1/;
$before = reverse $before; # reverse the string to limit backtracking.
$before =~ s/((?:(?:\W+)(?:\w+)){4}).*/$1/;
$before = reverse $before;
print "$before -> $match <- $after\n";
Your initial attempt at 4 words before/after wasn't too far off.
Try:
if ($text =~ /((\S+\s+){0,4})($search)((\s+\S+){0,4})/i) {
my ($pre, $match, $post) = ($1, $3, $4);
...
}
I would suggest using the match-position arrays @+ and @- (see perldoc perlvar) to find where in the string the match starts and how long it is.
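The match offsets stored in @- and @+ can be used roughly like this. It is only a sketch: the sub name context_snippet is hypothetical, and it cuts the context by characters rather than by words to keep the example short. Because the offsets are clamped, a match at the very beginning or end of the text is not a problem.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to $width characters of context on each
# side of the first case-insensitive match of $search in $text, with the
# match itself in square brackets. Returns undef if there is no match.
sub context_snippet {
    my ($text, $search, $width) = @_;
    # \Q...\E quotes any regex metacharacters in the search term.
    return undef unless $text =~ /\Q$search\E/i;
    my ($start, $end) = ($-[0], $+[0]);   # offsets of match start and end
    my $from   = $start > $width ? $start - $width : 0;
    my $before = substr($text, $from, $start - $from);
    my $match  = substr($text, $start, $end - $start);
    my $after  = substr($text, $end, $width);
    return $before . '[' . $match . ']' . $after;
}

my $snippet = context_snippet('The quick brown fox jumps', 'BROWN', 6);
print "$snippet\n" if defined $snippet;
```

To get whole words instead of a fixed number of characters, the $before and $after strings can be trimmed afterwards with a word-based regex.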
You could try the following:
if ($text =~ /(.*)$search(.*)/i ) {
my @before_words = split ' ', $1;
my @after_words = split ' ',$2;
my $before_str = get_last_x_words_from_array(@before_words);
my $after_str = get_first_x_words_from_array(@after_words);
print $before_str . ' ' . $search . ' ' . $after_str;
}
Some code obviously omitted, but this should give you an idea of the approach.
As far as extracting the title ... I think this approach does not lend itself to that very well.
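The two helpers left out of the snippet above could look something like this. This is a guess at what was intended, not the answerer's actual code: the $CONTEXT_WORDS variable and its value of 4 are assumptions taken from the "4 words before and 4 words after" in the question.

```perl
use strict;
use warnings;

my $CONTEXT_WORDS = 4;   # assumed number of context words on each side

# Returns the last $CONTEXT_WORDS words of the list, joined by single spaces.
sub get_last_x_words_from_array {
    my @words = @_;
    my $n = @words < $CONTEXT_WORDS ? scalar @words : $CONTEXT_WORDS;
    return join ' ', @words[@words - $n .. $#words];
}

# Returns the first $CONTEXT_WORDS words of the list, joined by single spaces.
sub get_first_x_words_from_array {
    my @words = @_;
    my $last = $#words < $CONTEXT_WORDS - 1 ? $#words : $CONTEXT_WORDS - 1;
    return join ' ', @words[0 .. $last];
}

print get_last_x_words_from_array(qw(one two three four five six)), "\n";
print get_first_x_words_from_array(qw(one two three four five six)), "\n";
```

Both helpers return an empty string for an empty list, so a search term at the very start or end of the text does not need special handling.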