Perl get text between tags - perl

I have tried many snippets that I found on the internet, but none of them worked.
I have some HTML that looks like this:
<div class="usernameHolder">Username: user123</div>
I want to extract the text user123 from this line, which is embedded in the rest of an HTML page. Can anyone point me in the right direction? This is what I tried:
$text = @source =~ /Username:\s+(.*)\s+</;
print $text;
but it doesn't return anything.

If the HTML is in a string:
my $source = '<div class="usernameHolder">Username: user123</div>';
# Allow optional whitespace before or after the username value.
# In list context the match returns the captured group.
my ($text) = $source =~ /Username:\s*(.*?)\s*</;
print "$text\n"; # user123
If the HTML is in an array:
my @source = (
'<p>Some text</p>',
'<div class="usernameHolder">Username: user123</div>',
'<p>More text</p>'
);
# Combine the matching array elements into a string.
my $matching_lines = join "", grep(/Username:\s*(.*?)\s*</, @source);
# Extract the username value (list context returns the capture).
my ($text) = $matching_lines =~ /Username:\s*(.*?)\s*</;
print "$text\n"; # user123
A more-compact version using an array:
my @source = (
'<p>Some text</p>',
'<div class="usernameHolder">Username: user123</div>',
'<p>More text</p>'
);
# Combine the matching array elements into a string, and extract the username value.
my ($text) = (join "", grep(/Username:\s*(.*?)\s*</, @source)) =~ /Username:\s*(.*?)\s*</;
print "$text\n"; # user123

Your second \s+ requires at least one whitespace character, but there is none between user123 and the following tag, so the match fails.
How about this?
/Username:\s*(.*?)\s*</
Here, \s* is discarding spaces if there are any, and .*? is there so that you don't grab most of the document in the process. (See greedy vs. non-greedy)
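To make the greedy vs. non-greedy difference concrete, here is a minimal sketch. The sample string with a second tag on the same line is made up for illustration; it is not from the question.

```perl
use strict;
use warnings;

# Hypothetical sample: a second tag on the same line makes the difference visible.
my $html = '<div class="usernameHolder">Username: user123</div><p>More text</p>';

# Greedy: .* runs to the end of the line, then backs off only as far as the LAST "<".
my ($greedy) = $html =~ /Username:\s*(.*)\s*</;

# Non-greedy: .*? stops at the FIRST "<" that lets the match succeed.
my ($lazy) = $html =~ /Username:\s*(.*?)\s*</;

print "greedy: $greedy\n";   # greedy: user123</div><p>More text
print "lazy:   $lazy\n";     # lazy:   user123
```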

Related

use perl to extract a substring between two delimiters and store as a new string

I am working on a Perl script, and I want to split a string between two different variables.
This is my string
<p>Hello my server number is 1221.899999 , please select an option</p>
I want to be able to extract the server number, so I want to split the string after <p>Hello my server number is and before the following space, so my end string would print as
1221.899999
Is regex the best solution for this, rather than using split?
I would just use a regex.
my $str = 'Hello my server number is 1221.899999 , please select an option';
my ($num) = $str =~ /Hello my server number is (\d+\.\d+) ,/;
$num will be undefined if the match didn't succeed.
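Since a failed match leaves $num undefined, it is worth checking before using it. A small sketch; the helper name server_number and the second input string are assumptions added for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the server number, or undef when the line does not match.
sub server_number {
    my ($str) = @_;
    my ($num) = $str =~ /Hello my server number is (\d+\.\d+) ,/;
    return $num;
}

for my $str ('Hello my server number is 1221.899999 , please select an option',
             'Hello, there is no number here') {
    my $num = server_number($str);
    print defined $num ? "server number: $num\n" : "no match\n";
}
```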
How about:
use feature 'say';
my $str = 'Hello my server number is 1221.899999 , please select an option';
$str =~ s/^.*\b(\d+\.\d+)\b.*$/$1/;
say $str;
or
$str =~ s/^Hello my server number is (\d+\.\d+)\s.*$/$1/;
If the beginning of the string is always the same.
output:
1221.899999
I would use regex. How about this:
$str = 'Hello my server number is 1221.899999 , please select an option';
print $1 if $str =~ /is (.*) ,/;
As long as you are sure that there is always a space before the comma, the proper answer is something similar to this
my $string = '<p>Hello my server number is 1221.899999 , please select an option</p>';
my ($server) = $string =~ /server number is (\S+)/;
print $server;
output
1221.899999
If the comma could appear immediately after the end of the server number, then you would need to modify it slightly to this:
my ($server) = $string =~ /server number is ([^\s,]+)/;

Extracting alphanumeric phrase from a string

Trying to extract the alphanumeric characters from this string:
A_phase_I-II,_open-req_project_id_PX15RAD001
The problem is: the term PX15RAD001 can occur anywhere in the string.
I am trying to extract the alphanumeric part using the expression below, but it returns the entire string. I thought Alnum was a valid keyword for alphanumerics. Is that not the case?
(my $string = $line ) =~ s/\P{Alnum}//g;
print $string;
How can I extract the alphanumeric part of the aforementioned string?
Thanks in advance.
-simak
At the end as per your input:
> echo "A_phase_I-II,_open-req_project_id_PX15RAD001"|perl -lne 'print $1 if(/id_([A-Z0-9]*)/)'
PX15RAD001
In the middle:
> echo "A_phase_I-II,_open-req_id_PX15RAD001_project" | perl -lne 'print $1 if(/id_([A-Z0-9]*)/)'
PX15RAD001
or in your terms:
$line=~m/id_([A-Z0-9]*)/g;
print $1;
Here are some test cases, based on the comments on @Vijay's answer:
my @line = (
'A_phase_I-II,_open-req_project_id_PX15RAD001',
'_PX15RAD001_A_phase_I-II,_open-req_project_id',
'A_pha3333se_I-II,_ope_PX15RAD001_n-req_project',
'A_phase_I-II,_PX15RAD001_open-req_projec123123123t_id',
'A_phase_I-II_PX15RAD001_roject_id'
);
foreach my $string ( @line ) {
    # Test the match itself, so a stale $1 is never printed.
    print "$1\n" if $string =~ m{_([^_]{10})_?};
}
These kinds of questions are hard to answer because there is not enough information. What information we have is:
You say your target string is "alphanumeric", but the entire input string is alphanumeric, except for some punctuation, so that really doesn't tell us anything.
You say it is 12 characters long, but the sample you show is 10 characters long.
You seem to think that "alphanumeric" does not include underscore.
So, the reliable information I can sense from you is:
Target string is always delimited by underscore _
Target string is 10-12 characters, all alphanumeric except underscore.
The "reliable" solution based on this rather skimpy information is:
my $str = "A_phase_I-II,_open-req_project_id_PX15RAD001";
for my $field (split /_/, $str) {
    if (length($field) <= 12 and
        length($field) >= 10 and   # field is 10-12 characters
        $field !~ /\W/) {          # and contains no non-alphanumerics
        # do something
    }
}
By splitting on underscore, we can easily isolate each field in the string and perform simpler tests on it, such as the ones above.
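As a rough sketch of what "do something" could look like, the same tests can be put in a grep that collects the matching fields. The helper name candidate_fields is made up for this example, and the 10-12 character rule is only the assumption stated above:

```perl
use strict;
use warnings;

# Hypothetical helper: return every underscore-delimited field that is
# 10-12 characters long and contains only word characters.
sub candidate_fields {
    my ($str) = @_;
    return grep { length($_) >= 10 && length($_) <= 12 && !/\W/ } split /_/, $str;
}

my @hits = candidate_fields('A_phase_I-II,_open-req_project_id_PX15RAD001');
print "@hits\n";   # PX15RAD001
```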

Better way to extract elements from a line using perl?

I want to extract some elements from each line of a file.
Below is the line:
# 1150 Reading location 09ef38 data = 00b5eda4
I would like to extract the address 09ef38 and the data 00b5eda4 from this line.
The way I use is the simple one like below:
while ($line = <INFILE>) {
    if ($line =~ /\#\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*=\s*(\S+)/) {
        $time = $1;
        $address = $4;
        $data = $6;
        printf(OUTFILE "%s,%s,%s \n", $time, $address, $data);
    }
}
Is there a better way to do this, something easier and cleaner?
Thanks a lot!
TCGG
Another option is to split the string on whitespace:
my ($time, $addr, $data) = (split / +/, $line)[1, 4, 7];
You could use a match with a list on the left-hand side, something like this:
echo '# 1150 Reading location 09ef38 data = 00b5eda4' |
perl -ne '
$,="\n";
($time, $addr, $data) = /#\s+(\w+).*?location\s+(\w+).*?data\s*=\s*(\w+)/;
print $time, $addr, $data'
Output:
1150
09ef38
00b5eda4
In Python, a suitable regex would be something like:
'[0-9]+[a-zA-Z ]*([0-9]+[a-z]+[0-9]+)[a-zA-Z ]*= ([0-9a-zA-Z]+)'
But I don't know exactly how to write it in Perl; you can search for that. If you need any explanation of this regex, I can edit this post with a more precise description.
I find it convenient to just split by one or more whitespaces of any kind, using \s+. This way you won't have any problems if the input string has any tab characters in it instead of spaces.
while ($line = <INFILE>)
{
    my ($time, $addr, $data) = (split /\s+/, $line)[1, 4, 7];
}
When splitting on ANY kind of whitespace, note that the newline at the end of the line is treated as a separator too. Perl's split drops trailing empty fields by default, so this does not add an empty element to the result (it would with a negative LIMIT), and the last field comes back without its newline.
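A quick way to see what split actually returns for the sample line. This is only a sketch; the line is hard-coded here instead of being read from INFILE. By default split removes trailing empty fields, so the newline only leaves an empty element if you pass a negative LIMIT:

```perl
use strict;
use warnings;

my $line = "# 1150 Reading location 09ef38 data = 00b5eda4\n";

# Default behaviour: the empty field after the trailing newline is dropped.
my @fields = split /\s+/, $line;
print scalar(@fields), " fields\n";              # 8 fields

my ($time, $addr, $data) = @fields[1, 4, 7];
print "$time,$addr,$data\n";                     # 1150,09ef38,00b5eda4

# A negative LIMIT keeps trailing empty fields.
my @all = split /\s+/, $line, -1;
print scalar(@all), " fields with LIMIT -1\n";   # 9 fields with LIMIT -1
```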

how to return the search results in perl

I would like to write a script that returns the matched text whenever a regex matches. I think I am having trouble writing the regex.
The content of my input file is as below:
Number a123;
Number b456789 vit;
alphabet fty;
I want it to return a123 and b456789, that is, the string after "Number " and before the next whitespace or ";".
I have tried the following:
my @result = grep /Number/, @input_file;
print "@result\n";
The result I obtained is shown below:
Number a123;
Number b456789 vit;
Whereas the expected result is:
a123
b456789
Can anyone help on this?
Perl's grep function selects all elements from a list that match a certain condition. In your case, you selected all elements of the @input_file array that match the regex /Number/.
To select the non-whitespace string after Number, use this regex:
my $regex = qr{
    Number     # match the literal string 'Number'
    \s+        # match one or more whitespace characters
    ([^\s;]+)  # capture the following non-spaces-or-semicolons into $1
               # using a negated character class
}x;            # use the /x modifier to allow whitespace in the pattern
               # for better formatting
My suggestion would be to loop directly over the input file handle:
while (defined(my $line = <$input>)) {
    # Test the match itself: $1 is not reset when a match fails.
    print "Found: $1\n" if $line =~ /$regex/;
}
If you have to use an array, a foreach-loop would be preferable:
foreach my $line (@input_lines) {
    # Test the match itself: $1 is not reset when a match fails.
    print "Found: $1\n" if $line =~ /$regex/;
}
If you don't want to print your matches directly but to store them in an array, push the values into the array inside your loop (both work) or use the map function. The map function replaces each input element by the value of the specified operation:
my @result = map { /$regex/ ? $1 : () } @input_file;
or
my @result = map { /$regex/ ? $1 : () } <$input>;
Inside the map block, we match the regex against the current array element. If we have a match, we return $1, else we return an empty list. This gets flattened into invisibility, so we don't create an entry in @result. This is different from returning undef, which would create an undef element in your array.
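Here is a small self-contained sketch of the empty-list vs. undef difference. The input lines are hard-coded from the question, and the regex is written without /x for brevity:

```perl
use strict;
use warnings;

my @input_file = ("Number a123;\n", "Number b456789 vit;\n", "alphabet fty;\n");
my $regex = qr/Number\s+([^\s;]+)/;

# Empty list for non-matching lines: they simply vanish from the result.
my @flattened = map { /$regex/ ? $1 : () } @input_file;

# undef for non-matching lines: the result keeps one slot per input line.
my @with_undef = map { /$regex/ ? $1 : undef } @input_file;

print scalar(@flattened), "\n";    # 2
print scalar(@with_undef), "\n";   # 3
print "$_\n" for @flattened;       # a123, then b456789
```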
If your script is intended as a simple filter, you can use
$ cat FILE | perl -nle 'print $1 if /Number\s+([^\s;]+)/'
or
$ cat FILE | perl -nle 'for (/Number\s+([^\s;]+)/g) { print }'
if there can be multiple occurrences on the same line.
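The second one-liner relies on a /g match in list context returning every captured value. As a script, that might look like the sketch below; the input line with two occurrences is invented for the example:

```perl
use strict;
use warnings;

# Hypothetical line with two occurrences of the pattern.
my $line = 'Number a123; Number b456789 vit;';

# With /g in list context, the match returns all captures at once.
my @ids = $line =~ /Number\s+([^\s;]+)/g;

print "$_\n" for @ids;   # a123, then b456789
```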
perl -lne 'if(/Number/){s/.*\s([a-zA-Z])([\d]+).*$/$1$2/g;print}' your_file
Tested below:
> cat temp
Number a123;
Number b456789 vit;
alphabet fty;
> perl -lne 'if(/Number/){s/.*\s([a-zA-Z])([\d]+).*$/$1$2/g;print}' temp
a123
b456789
>

How can I detect symbols using a regular expression in Perl?

How can I use a regular expression to check whether a word starts or ends with a symbol character, and how can I process the text within the symbols?
Example:
(text) or te-xt, or tex't. or text?
change it to
(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?
help me out?
Thanks
I assume that "word" means alphanumeric characters from your example? If you have a list of permitted characters which constitute a valid word, then this is enough:
my $string = "x1 .text1; 'text2 \"text3;\"";
$string =~ s/([a-zA-Z0-9]+)/<t>$1<\/t>/g;
# Add more to character class [a-zA-Z0-9] if needed
print "$string\n";
# OUTPUT: <t>x1</t> .<t>text1</t>; '<t>text2</t> "<t>text3</t>;"
UPDATE
Based on your example, you seem to want to DELETE dashes and apostrophes. If you want to delete them globally (whether they are inside a word or not), run this before the first regex:
$string =~ s/['-]//g;
I am using DVK's approach here, but with a slight modification. The difference is that her/his code would also put the tags around all words that don't contain/are next to a symbol, which (according to the example given in the question) is not desired.
#!/usr/bin/perl
use strict;
use warnings;
sub modify {
    my $input = shift;
    my $text_char = 'a-zA-Z0-9\-\''; # characters that are considered text
    # if there is no symbol, don't change anything
    if ($input =~ /^[a-zA-Z0-9]+$/) {
        return $input;
    }
    else {
        $input =~ s/([$text_char]+)/<t>$1<\/t>/g;
        return $input;
    }
}
my $initial_string = "(text) or te-xt, or tex't. or text?";
my $expected_string = "(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?";
# version BEFORE edit 1:
#my @aux;
# take the initial string apart and process it one word at a time
#my @string_list = split/\s+/, $initial_string;
#
#foreach my $string (@string_list) {
#    $string = modify($string);
#    push @aux, $string;
#}
#
# put the string together again
#my $final_string = join(' ', @aux);
# ************ EDIT 1 version ************
my $final_string = join ' ', map { modify($_) } split/\s+/, $initial_string;
if ($final_string eq $expected_string) {
    print "it worked\n";
}
This strikes me as a somewhat long-winded way of doing it, but it seemed quicker than drawing up a more sophisticated regex...
EDIT 1: I have incorporated the changes suggested by DVK (using map instead of foreach). Now the syntax highlighting is looking even worse than before; I hope it doesn't obscure anything...
This reads standard input, processes it, and prints the result to standard output.
while (<>) {
    s{
        ( [a-zA-Z]+ )   # word
        (?= [,.)?] )    # a symbol
    }
    {<t>$1</t>}gx;
    print;
}
You might need to change that part to match your concept of a word.
I have used the x modifier to allow the regex to be spread over more than one line.
If the input is in a Perl variable, try
$string =~ s{
    ( [a-zA-Z]+ )   # word
    (?= [,.)?] )    # a symbol
}
{<t>$1</t>}gx;