sed/awk: Capitalize everything between patterns and lowercase small words - sed

I found a way to capitalize the whole document with both sed and awk, but how do I convert only the text inside a pair of patterns from ALL CAPS to Title Case?
For example, I have an HTML file, and everything (multiple occurrences) between <b> and </b> has to be converted from TITLE to Title, if possible keeping small words (1 to 2 letters) in lowercase.
From This:
<div id="1">
<div class="p"><b>THIS IS A RANDOM TITLE</b></div>
<table class="hugetable">
...
</table>
<div class="p"><b>THIS IS ANOTHER RANDOM TITLE</b></div>
<table class="hugetable">
...
</table>
...
</div>
To this:
<div id="1">
<div class="p"><b>This is a Random Title</b></div>
<table class="hugetable">
...
</table>
<div class="p"><b>This is Another Random Title</b></div>
<table class="hugetable">
...
</table>
...
</div>

This is not the most beautiful solution, but I think it works (it relies on GNU sed for -r, \L and \u):
sed -r -e '/<b>/ {s/( .)([^ ]*)/\1\L\2/g}' -e 's/<b>(.)/<b>\u\1/' -e '/<b>/ {s/(\b.{1,2}\b)/\L\1/g}' data
Explanation:
1st expression (-e): If a line contains <b>:
Then for each word which has a space in front of it, keep the space and the first (already capitalized) character (\1) and then convert all the following characters of the word to lower case (\L\2)
2nd expression (-e): The first word after <b> is still uncapitalized, so select the first character after the bold tag <b>(.) and replace it uppercased <b>\u\1
3rd expression (-e): Again if a line contains <b>:
Then select words of 1 or 2 characters in length \b.{1,2}\b and replace them lowercased \L\1
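For comparison, the same three steps can be sketched in Python. This is only an illustration, not the sed answer itself: the function name title_case_bold is made up here, and unlike the sed one-liner it keeps the first word capitalized even when that word is only 1 or 2 letters long.

```python
import re

def title_case_bold(line):
    """Title-case the text inside each <b>...</b> pair on one line."""
    def fix(match):
        words = match.group(1).lower().split(" ")
        out = []
        for i, word in enumerate(words):
            # Keep 1-2 letter words lowercase, except the very first word.
            if i > 0 and len(word) <= 2:
                out.append(word)
            else:
                out.append(word.capitalize())
        return "<b>" + " ".join(out) + "</b>"
    # Non-greedy (.*?) so several <b> pairs on one line are handled separately.
    return re.sub(r"<b>(.*?)</b>", fix, line)

print(title_case_bold('<div class="p"><b>THIS IS A RANDOM TITLE</b></div>'))
```

Lines without a <b> pair pass through unchanged, which mirrors the /<b>/ address in the sed command.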

How to remove text between a string and a space using SED

I have a file with repeating lines like this:
<stack-block name="B" sub-type="SBL" type="ABM_BLOCK" level="2" parent-name="PBTYRD" geo-anchor-latitude="-34.96723069348281" geo-anchor-longitude="150.2157080161554" geo-anchor-orientation="72.35290364141252" z-index-min="1" />
<stack-block name="C" sub-type="SBL" type="ABM_BLOCK" level="2" parent-name="PBTYRD" geo-anchor-latitude="-34.967529872288864" geo-anchor-longitude="150.2145108805486" geo-anchor-orientation="72.35290364141252" z-index-min="1" />
...and so on...
I want to remove the geo-anchor-latitude="-34.96723069348281" section from the lines of a file including the geo-anchor-latitude phrase up to the second double quote.
I have tried sed -i 's/geo-anchor-latitude.*"//' filename with no luck as it strips everything from geo-anchor-latitude to the end of the line.
Any clues out there? Thanks.
Try the following:
sed -i 's/geo-anchor-latitude="[^"]*"//' filename
Output:
<stack-block name="B" sub-type="SBL" type="ABM_BLOCK" level="2" parent-name="PBTYRD" geo-anchor-longitude="150.2157080161554" geo-anchor-orientation="72.35290364141252" z-index-min="1" />
<stack-block name="C" sub-type="SBL" type="ABM_BLOCK" level="2" parent-name="PBTYRD" geo-anchor-longitude="150.2145108805486" geo-anchor-orientation="72.35290364141252" z-index-min="1" />
The regex geo-anchor-latitude="[^"]*" matches a substring made up of:
A literal string geo-anchor-latitude="
Followed by a sequence of any characters except for "
Followed by a double quote "
Then the matched substring above is removed by the s command.
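The difference between the greedy .*" from the question and the negated class [^"]*" can be checked with a small Python sketch. The sample line is a shortened version of the one in the question, and the helper names strip_greedy and strip_attribute are made up for this illustration.

```python
import re

# Shortened sample line; attribute values are taken from the question.
line = ('<stack-block name="B" sub-type="SBL" '
        'geo-anchor-latitude="-34.96723069348281" '
        'geo-anchor-longitude="150.2157080161554" z-index-min="1" />')

def strip_greedy(s):
    # .* is greedy: it runs to the LAST double quote on the line.
    return re.sub(r'geo-anchor-latitude.*"', '', s)

def strip_attribute(s):
    # [^"]* cannot cross a double quote, so the match ends at the
    # quote that closes the latitude value.
    return re.sub(r'geo-anchor-latitude="[^"]*"', '', s)

print(strip_greedy(line))
print(strip_attribute(line))
```

Both versions leave the space that preceded the attribute, so the result contains two consecutive spaces at that position.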
You can use extended regular expressions (-E) with sed to do this.
sed -Ei 's/geo-anchor-latitude="[-0-9]+[.][0-9]+"//' filename
This regex matches the latitude attribute when its value is a (possibly negative) decimal number with any number of digits.

Why does this line break Sphinx search?

I use the sanitizing from Barryhunter's example.
But when I use the line:
$q = preg_replace('/[^\w~\|\(\)\^\$\?"\/=-]+/',' ',trim(strtolower($q)));
then Russian search doesn't work! Only English does.
What is the reason? How should I sanitize the query?
This is my piece:
<HTML>
<BODY>
<form action="" method="get">
<input name="q" size="40" value="<?php echo @$_GET['q']; ?>" />
<input type="submit" value="Search" />
</form>
<?php
require ( 'sphinxapi.php' );
$sphinx = new SphinxClient;
$sphinx->SetServer('ununtu', 9312);
$sphinx->open();
$sphinx->SetMatchMode (SPH_MATCH_EXTENDED);
$sphinx->setFieldWeights(array(
'title' => 10,
'content' => 5
));
$sphinx->SetRankingMode(SPH_RANK_WORDCOUNT);
$sphinx->SetSortMode(SPH_SORT_RELEVANCE);
$sphinx->setLimits(0, 10, 200);
$sphinx->resetFilters();
$q = isset($_GET['q'])?$_GET['q']:'';
$q = preg_replace('/ OR /',' | ',$q);
// $q = preg_replace('/[^\w~\|\(\)\^\$\?"\/=-]+/',' ',trim(strtolower($q)));
if(isset($_GET['q']) and strlen($_GET['q']) > 1)
{
$result = $sphinx->query($sphinx->escapeString($q), '*');
...
Assuming your input string is UTF-8 encoded, you are calling preg_replace without Unicode mode. Add the 'u' modifier at the end of the pattern, e.g.:
$q = preg_replace('/[^\w~\|\(\)\^\$\?"\/=-]+/u',' ',trim(strtolower($q)));
Specifically, that regex strips anything that is not a 'word' character or one of a predefined list of syntax/punctuation characters.
The PCRE definition of a word character (the \w) is:
A "word" character is any letter or digit or the underscore character,
that is, any character which can be part of a Perl "word". The
definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place. For
example, in the "fr" (French) locale, some character codes greater
than 128 are used for accented letters, and these are matched by \w.
http://php.net/manual/en/regexp.reference.escape.php
So your script is possibly running in an English (or other Western European) locale, in which case many Russian characters are not considered word characters and are stripped.
(If your pages are in UTF-8, you may also need the /u modifier, as mentioned in the other answer.)
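The effect both answers describe can be reproduced outside PHP. The sketch below is a Python analogy, not the PHP code from the question: re.ASCII stands in for a \w that only knows ASCII letters, and the default mode stands in for the /u behaviour. The function names are made up for the illustration.

```python
import re

# Same character class as in the question, written for Python's re module.
PATTERN = r'[^\w~|()^$?"/=-]+'

def sanitize_ascii(q):
    # With re.ASCII, \w matches only [a-zA-Z0-9_], so Cyrillic letters
    # fall into the negated class and are replaced by a space.
    return re.sub(PATTERN, ' ', q.strip().lower(), flags=re.ASCII)

def sanitize_unicode(q):
    # By default, \w on str patterns matches Unicode word characters,
    # so Cyrillic letters survive.
    return re.sub(PATTERN, ' ', q.strip().lower())

print(repr(sanitize_ascii("Привет world")))
print(repr(sanitize_unicode("Привет world")))
```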

Google Forms regular expressions

I'm creating a survey in Google Forms and can't find a regular expression for a PIN code entry.
The user is asked a question and can enter 2 PIN codes in two text fields.
I need a regular expression that matches exactly 4 digits (0-9).
Example:
Textbox1: 1234
Textbox2: 4321
Any ideas?
Try \d{4}
Also set the regular expression rule to Matches.
[0-9]{4}
This should be your regular expression.
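function validate() {
  // Read the value typed into the PIN field.
  var textField = document.getElementById("textbox1").value;
  // ^ and $ anchor the pattern so that exactly four digits are accepted;
  // without them, test() also succeeds on longer input such as "12345".
  var regex = /^[0-9]{4}$/;
  alert("Valid input: " + regex.test(textField));
}
<input type="text" id="textbox1">
<input type="button" onclick="validate()" value="Validate">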
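Whether [0-9]{4} accepts only a 4-digit PIN depends on whether the pattern has to cover the whole input or may match anywhere inside it. The Python sketch below shows the two behaviours side by side; treating the form's Matches setting as a whole-input match is an assumption here, and the function names are made up for the illustration.

```python
import re

PIN = re.compile(r"[0-9]{4}")

def contains_pin(text):
    # search() succeeds if four consecutive digits appear anywhere.
    return PIN.search(text) is not None

def is_pin(text):
    # fullmatch() succeeds only if the entire input is four digits.
    return PIN.fullmatch(text) is not None

for candidate in ["1234", "4321", "12345", "12a4"]:
    print(candidate, contains_pin(candidate), is_pin(candidate))
```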

Hash Dereference in Template Toolkit

I've got a multi-dimensional hash that I'm trying to print out in a table. I can't get the referencing / dereferencing right.
I'm putting an Excel spreadsheet into the hash and I want to print out the corresponding rows and columns in HTML, matching the rows/columns of the spreadsheet (some of which are empty).
I'm using Perl Dancer and Template Toolkit. On the server side the hash works fine: print $big_table{$column}{$row}; prints the correct column and row with NO issues.
On the client side, the 0, 1, 2... are supposed to be the columns. Some columns are blank so I can't just print the contents.
The way it is now it prints ARRAY(0x3e5389c). I tried a different way and it printed HASH...
I know I've got some reference/dereference issues. Any advice would be welcome.
Server Side Code:
my %big_table = ();
# $cell->value() is the text ripped from the excel cell at that location
$big_table{$column}{$row} = $cell->value();
template 'index', { big_table => \%big_table };
Client Side:
<Table border="3">
<% FOREACH n IN big_table.0 %>
<TR><TD>&nbsp<% big_table.0.keys %>&nbsp<TD>&nbsp<% big_table.1.keys %>
&nbsp<TD>&nbsp<% big_table.2.keys %>&nbsp<TD>&nbsp<% big_table.3.keys %>
&nbsp <TD>&nbsp<% big_table.4.keys %>
&nbsp<TD>&nbsp<% big_table.5.keys %>&nbsp
<% END %>
</Table>
Thanks in advance!
Got it working.
I changed the hash to an array, '$big_table[$col][$row] = $cell->value();', and populated a second array with all the row numbers.
The client side now looks like this:
<% FOREACH r IN row_numbers %>
<TR><TD> &nbsp <% big_table.0.$r %> &nbsp <TD> &nbsp <% big_table.1.$r %>...
<% END %>
It works great, but it's probably crazy inefficient :(. The spreadsheet is 800 rows long, so there is a second array with 800 elements just to drive the 'FOREACH' loop.
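The shape of the data can be illustrated with a short Python sketch; this is not Template Toolkit code, and the names table_rows and n_columns are made up. It keeps the column-then-row layout from the question and derives the row numbers from the table itself, filling cells that are missing with an empty string.

```python
def table_rows(big_table, n_columns):
    """Yield one list of cell strings per spreadsheet row, blank where a cell is missing."""
    # Collect every row index that occurs in any column.
    rows = sorted({row for column in big_table.values() for row in column})
    for row in rows:
        # Look up column first, then row, as in $big_table{$column}{$row}.
        yield [big_table.get(col, {}).get(row, "") for col in range(n_columns)]

# Column 1 is empty on purpose.
big_table = {0: {1: "a", 2: "b"}, 2: {2: "c"}}
for cells in table_rows(big_table, 3):
    print(cells)
```

A list of 800 row numbers is small, so iterating over it once per page render is unlikely to be the bottleneck.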

How to replace text using greedy approach in sed?

I am parsing a file that contains some HTML tags and converting them into LaTeX commands.
cat text
<Text>A <strong>ASDFF</strong> is a <em>cerebrovafdfasfscular</em> condifasdftion caufadfsed fasdfby tfdashe l
ocfsdafalised <span style="text-decoration: underline;">ballooning</span> or difdaslation of an arfdatery in thdfe bfdasrai
n. Smadfsall aasdneurysms may dadisplay fdasno ofadsbvious sdfasigns (<span style="text-decoration: underline;"><em><str
ong>asymptomatic</strong></em></span>) bfdasut lfdsaarger afdasneurysms maydas besda asfdsasociated widfth sdsfudd
sed -e 's|<strong>\(.*\)</strong>|\\textbf{\1}|g' test
cat out
<Text>A \textbf{ASDFF</strong> is a <em>cerebrovafdfasfscular</em> condifasdftion caufadfsed fasdfby tfdashe locfsda
falised <span style="text-decoration: underline;">ballooning</span> or difdaslation of an arfdatery in thdfe bfdasrain. Sma
dfsall aasdneurysms may dadisplay fdasno ofadsbvious sdfasigns (<span style="text-decoration: underline;"><em><strong>
asymptomatic}</em></span>) bfdasut lfdsaarger afdasneurysms maydas besda asfdsasociated widfth sdsfudd
The expected output is \textbf{ASDFF}, but I get \textbf{ASDFF .........}. How can I get the expected result?
Regards
You may want to use a Perl regex instead, since it supports the non-greedy quantifier .*?:
perl -pe 's|<strong>(.*?)</strong>|\\textbf{$1}|g'
Your problem is similar to non-greedy-regex-matching-in-sed. Next time, you may also want to simplify your case to point out the real problem. For example, don't just paste the raw HTML code; use something like this instead:
fooTEXT1barfooTEXT2bar
Update
If you really do want the greedy approach, just ignore this.
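The simplified input fooTEXT1barfooTEXT2bar makes the difference easy to see. The Python sketch below is an illustration rather than the Perl one-liner itself: foo and bar stand in for <strong> and </strong>, and the function names are made up.

```python
import re

sample = "fooTEXT1barfooTEXT2bar"

def wrap(match):
    # Build \textbf{...} around the captured text.
    return "\\textbf{" + match.group(1) + "}"

def greedy(s):
    # (.*) grabs as much as possible, up to the LAST "bar".
    return re.sub(r"foo(.*)bar", wrap, s)

def lazy(s):
    # (.*?) grabs as little as possible, up to the NEXT "bar".
    return re.sub(r"foo(.*?)bar", wrap, s)

print(greedy(sample))
print(lazy(sample))
```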