Matching an unknown number of occurrences on a page using Perl?

I am parsing an HTML page. Let's say this page lists all players on a football team, and those who are seniors are in bold. I can't parse the file line by line and look for the strong tag, because in my real example the pattern is much more complex and spans multiple lines.
Something like this:
<strong>Senior:</strong> John Smith
Junior: Joe Smith
<strong>Senior:</strong> Mike Johnson
and so on. How do I write a Perl regex to get the names of all seniors?
Thanks

The reason you're having difficulty writing a regex to do this is because it's the wrong tool for the job. You should use a real HTML parser like HTML::Parser, HTML::TokeParser, or HTML::TreeBuilder.
I can't give a specific example because I doubt that's exactly what your HTML looks like. Your sample appears to be missing some punctuation or additional tags.

You don't have to parse a file line by line -- you can read in the entire file at once, if it's small, or you can parse it paragraph by paragraph, using whatever separator you like.
The two magic things you need to do this are: 1. set the input record separator variable, $/ (see perldoc perlvar), to something other than a newline, and 2. use the /s modifier so that . also matches newlines and a match can span several lines (see perldoc perlre).
Alternatively, you should use an HTML parser, which is what you would have to do if you are attempting to find things like nested tags.

You have to provide a specific example.
Perl regular expressions can occasionally be used for HTML parsing, but only when you know specifically what the page looks like and that it's not too complex.
If you don't know exactly or it is too complex, use the parsers that cjm links.

It's not clear from your example how the end of the senior name is going to be determined, but something like this:
my @seniors = $filecontents =~ m!<strong>Senior:</strong>\s*([^<]+)!g;
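To make the whole-file approach concrete, here is a minimal sketch of the same pattern, written in Python for illustration rather than Perl. The sample markup is an assumption: it adds a hypothetical <br> after each name so that [^<]+ has something to stop at, which is exactly the open question about where a name ends.

```python
import re

# Hypothetical page content; the <br> tags are an assumption that gives
# each name a clear end. The real markup is probably more complex.
html = (
    "<strong>Senior:</strong> John Smith<br>\n"
    "Junior: Joe Smith<br>\n"
    "<strong>Senior:</strong> Mike\n"
    "Johnson<br>\n"
)

# [^<]+ matches newlines too, so a name broken across lines is still captured.
pattern = re.compile(r"<strong>Senior:</strong>\s*([^<]+)")

# findall scans the whole string, like the /g modifier in list context.
# Collapse internal whitespace so wrapped names come back on one line.
seniors = [" ".join(name.split()) for name in pattern.findall(html)]
print(seniors)
```

Because the match runs over the entire string at once, it does not matter how many seniors there are or where the line breaks fall.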

Related

Perl format: how to avoid truncation when I try to print strings of variable length

I'm trying to do debug port and muxing verification in the ASIC,
the signal hierarchy name can be fairly long, for example
top.eagleTop.ahb_top.btu.u_ble_core.u_ble_txrx_ctlr.rx_dmem_be_3
Right now I'm using pad character for left justification to print out the string
#<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
but it makes my code look messy. Is there a better way to print a variable-length string than this?
Perl's format construct isn't used very much any more so you are unlikely to get good advice, but have you read the documentation for format?
It looks to me like you want the ^* placeholder, documented here. It says it is "for Variable-Width One-line-at-a-time Text".

Underline in org-mode Literal Example

According to the explanation of Literal Examples, literal examples are not subject to mark-up. But is there any way to use mark-up in a literal example?
For example, consider the following literal example snippet.
#+BEGIN_EXAMPLE
Enter the city you're from: Chicago
#+END_EXAMPLE
I'd like to underline the word Chicago because I want to emphasize that it is typed by the user. How can I do that?
If all you want is the example to be typeset in the example fashion, you can probably use emphasis markup, at least for the lines that require mark-up. That would be
=Enter the city you're from:= _Chicago_
I don't think what you request is at all possible, possibly because such blocks seem to be by design not subject to mark-up.
For language-specific mark-up, use source code blocks (#+begin_src), maybe you can get nicely marked-up blocks using org-mode code blocks (#+begin_src org) (though I didn't manage to get "what it looks like" to be exported, I guess that's what in-line org is for).

Regex to match lines between two expressions

I am sorting some data and want to 'cut' out some rubbish between two bits of useful information.
Eg:
Useful one
rubbish
rubbish //rubbish here is covered by [.*], but the number of lines can be any number 1 or above
rubbish
useful two
I have successfully matched the useful parts of my information, I just need to know how to match the rubbish stuff. The pattern is as follows: useful, new line (no content), new line (no content), rubbish, new line (no content), new line (no content), useful.
The important part is that the rubbish section can vary in its number of lines, but always has at least one line. I'm not sure if I described this very well; any help is appreciated.
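The best way I know of is to capture all three parts:
(exp1)(.+?)(exp2)
and then keep only the two outer groups in the replacement or in your code:
$1 $3
where $n is the placeholder for group n. Note that . must be allowed to match newlines (for example with a "dot matches all" option) for the middle group to span several lines.
Leave a comment if you need more specific syntax.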
your regexp (rubbish\s+)(rubbish\s+)(rubbish)
Try a pattern like (useful\n\n\n(.*)\n\n\nuseful\n)+, which captures the rubbish in the inner parentheses. The . has to be able to match newlines for this to cover multi-line rubbish. How you refine and apply this pattern depends on your needs and your code.
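The two suggestions above can be combined into a small runnable sketch. This one is in Python, and the literal strings "Useful one" and "useful two" are stand-ins for whatever expressions already match your useful parts. It assumes the layout from the question: useful line, two empty lines, rubbish, two empty lines, useful line.

```python
import re

# Sample input following the layout described in the question.
text = "Useful one\n\n\nrubbish line 1\nrubbish line 2\n\n\nuseful two\n"

# DOTALL lets . match newlines; the lazy .+? stops at the first
# blank-line gap that is followed by the second useful expression.
pattern = re.compile(r"(Useful one)\n\n\n(.+?)\n\n\n(useful two)", re.DOTALL)

match = pattern.search(text)
rubbish = match.group(2)  # the part to throw away

# Keep groups 1 and 3, drop everything in between.
cleaned = pattern.sub(r"\1\n\3", text)
print(repr(rubbish))
print(repr(cleaned))
```

The lazy quantifier matters when the input contains several useful/rubbish/useful runs: a greedy .+ would swallow everything up to the last "useful two".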

Regex to remove HTML-head-tag

How can I remove the entire head tag from an HTML file with NSRegularExpression? Can someone give me a regex?
Thanks in advance,
Ph99Ph
There is none! HTML is a type-2 language and thus not parsable with a regular expression (type-3).
See this wiki article in case of doubt.
Lots of people use regex for parsing/editing HTML. This works quite well in simple cases but is utterly error prone.
This being said: You should have fairly reliable results with this regex:
<head>.+?</head>
This requires "." to also match line breaks. If it doesn't, then use this:
<head>(?:.|\n|\r)+?</head>
Again: This is error prone, don't do it.
What you should use is an XML parser such as NSXMLParser.
Please see the accepted answer at RegEx match open tags except XHTML self-contained tags. Or any version of this exact same question posted each day since the beginning of Stack Overflow.
In short, you cannot reliably parse HTML with Regular Expressions. RegEx is simply not advanced enough because of the complexities of HTML.
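Use something like this:
// Normalize any opening head tag (extra spaces, attributes) to <head>;
// requiring whitespace before the attributes avoids matching <header>
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*head(\s[^>]*)?>", "<head>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Normalize any closing head tag to </head>
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<( )*(/)( )*head( )*>)", "</head>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove everything from <head> to </head>; Singleline lets . match line breaks
result = System.Text.RegularExpressions.Regex.Replace(result,
"(<head>).*?(</head>)", " ",
System.Text.RegularExpressions.RegexOptions.IgnoreCase |
System.Text.RegularExpressions.RegexOptions.Singleline);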
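For comparison, the three steps can be folded into a single pattern. The sketch below is in Python rather than NSRegularExpression or C#, and the sample page is made up. It carries the same caveat as the answers above: it is fine for simple, well-behaved pages and will break on things like a literal "</head>" inside a script or comment.

```python
import re

# Hypothetical input with mixed case, an attribute, and a stray space.
html = '<html>\n<HEAD profile="x">\n<title>t</title>\n</head >\n<body>hi</body></html>'

# \b keeps <header> from matching; .*? is lazy so it stops at the first
# closing tag; DOTALL lets . cross line breaks.
head_re = re.compile(
    r"<\s*head\b[^>]*>.*?<\s*/\s*head\s*>",
    re.IGNORECASE | re.DOTALL,
)

stripped = head_re.sub("", html)
print(stripped)
```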

regex to get string within 2 strings

"artistName":"Travie McCoy", "collectionName":"Billionaire (feat. Bruno Mars) - Single", "trackName":"Billionaire (feat. Bruno Mars)",
I wish to get the artist name (here, Travie McCoy) from within that code using a regex. Please note I am using RegexKitLite for the iPhone SDK, in case this changes things.
Thanks
"?artistName"?\s*:\s*"([^"]*)("|$) should do the trick. It even handles some variations in the string:
White space before and after the :
artistName with and without the quotes
missing " at the end of the artist name if it is the last thing on the line
But there will be many more variations in the input you might encounter that this regex will not match.
Also you don’t want to use a regex for matching this for performance reasons. Right now you might only be interested in the artistName field. But some time later you will want information from the other fields. If you just change the field name in the regex you’ll have to match the whole string again. Much better to use a parser and transform the whole string into a dictionary where you can access the different fields easily. Parsing the whole string shouldn’t take much longer than matching the last key/value pair using a regex.
This looks like some kind of JSON, there are lots of good and complete parsers available. It isn’t hard to write one yourself though. You could write a simple recursive descent parser in a couple of hours. I think this is something every programmer should have done at least once.
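As a rough illustration of the parser route, here is a sketch in Python using its standard json module. It assumes the fragment in the question is part of a well-formed JSON object, so the braces are added here and the trailing comma is dropped. On the iPhone you would use a JSON library for Objective-C instead.

```python
import json

# Assumption: the fragment from the question, wrapped in braces so it
# forms a complete JSON object.
raw = (
    '{"artistName":"Travie McCoy", '
    '"collectionName":"Billionaire (feat. Bruno Mars) - Single", '
    '"trackName":"Billionaire (feat. Bruno Mars)"}'
)

# Parse once, then every field is a plain dictionary lookup.
track = json.loads(raw)
artist = track["artistName"]
print(artist)
print(track["trackName"])
```

Once the string is parsed, asking for another field costs nothing extra, which is the point made above about not re-matching the whole string for each key.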
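\"?artistName\"?\\s*:\\s*\"([^\"]*)(\"|$)
That's the same pattern escaped for an Objective-C string literal: both the quotes and the backslashes in \s need escaping.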