Sed replace every nth occurrence - sed

I am trying to use sed to replace every other occurrence of an HTML element in a file so that I can make rows with alternating colors.
Here is what I have tried; it doesn't work:
sed 's/<tr valign=top>/<tr valign=top bgcolor='#E0E0E0'>/2' untitled.html

I'd solve it with awk:
awk '/<tr valign=top>/&&v++%2{sub(/<tr valign=top>/, "<tr valign=top bgcolor='#E0E0E0'>")}{print}' untitled.html
First, it checks whether the line contains <tr valign=top>
/<tr valign=top>/&&v++%2
and whether this is every other match, starting with the second. v counts the matches seen so far, so the expression below is non-zero on the 2nd, 4th, 6th, ... match:
v++%2
If so, it replaces the <tr valign=top> in the line:
{sub(/<tr valign=top>/, "<tr valign=top bgcolor='#E0E0E0'>")}
Since all lines are to be printed, a final block runs for every line and prints the current line:
{print}
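For comparison, the same counting idea can be sketched in Python. This is only an illustration, not part of the awk answer: the function name color_every_other and the sample string are made up, and the code assumes the tag text ends with ">". Unlike the awk version it counts occurrences rather than lines, so several matches on one line are counted separately.

```python
import re
from itertools import count

def color_every_other(html, tag="<tr valign=top>", color="#E0E0E0"):
    """Add a bgcolor attribute to the 2nd, 4th, 6th, ... occurrence of tag."""
    seen = count(1)  # running number of the current match

    def repl(match):
        if next(seen) % 2 == 0:
            # drop the closing ">" and re-add it after the new attribute
            return match.group(0)[:-1] + " bgcolor='" + color + "'>"
        return match.group(0)

    return re.sub(re.escape(tag), repl, html)

# hypothetical sample input
sample = "<tr valign=top>a\n<tr valign=top>b\n<tr valign=top>c\n<tr valign=top>d\n"
result = color_every_other(sample)
print(result)
```

Changing the `% 2` test to `% n` would give every nth occurrence.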

This works for me:
sed -e "s/<tr/<TR bgcolor='#E0E0E0'/g;n" simpletable.htm
sample input:
<table>
<tr><td>Row1 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row2 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row3 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row4 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row5 / col1</td><td>col2</td><td>col3</td></tr>
</table>
sample output:
<table>
<TR bgcolor='#E0E0E0'><td>Row1 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row2 / col1</td><td>col2</td><td>col3</td></tr>
<TR bgcolor='#E0E0E0'><td>Row3 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row4 / col1</td><td>col2</td><td>col3</td></tr>
<TR bgcolor='#E0E0E0'><td>Row5 / col1</td><td>col2</td><td>col3</td></tr>
</table>
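The key is the n command in sed, which prints the current line and advances to the next one without applying the substitution to it, so the substitution runs on every other line of the file, counting from line 1.
This works only if each TR occupies its own line, and which rows get coloured depends on the line number of the first row.
It will break with nested tables, or if there are multiple TRs on a single line.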
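A small Python sketch of what the s/.../;n script does may make the line-number dependence clearer. It is an approximation written for illustration (the function name and the shortened sample rows are assumptions), not a re-implementation of sed.

```python
def substitute_then_next(lines, old="<tr", new="<TR bgcolor='#E0E0E0'"):
    """Mimic sed 's/old/new/g;n': substitute on lines 1, 3, 5, ... only."""
    out = []
    for number, line in enumerate(lines, start=1):
        if number % 2 == 1:
            # odd-numbered lines get the substitution
            line = line.replace(old, new)
        # even-numbered lines are the ones fetched by n and pass through unchanged
        out.append(line)
    return out

rows_only = ["<tr>Row1", "<tr>Row2", "<tr>Row3"]
with_header = ["<table>"] + rows_only

first = substitute_then_next(rows_only)
second = substitute_then_next(with_header)
print(first)
print(second)
```

With one extra line in front of the rows, the coloured rows shift from the odd rows to the even rows.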

According to http://www.linuxquestions.org/questions/programming-9/replace-2nd-occurrence-of-a-string-in-a-file-sed-or-awk-800171/
Try this.
sed '0,/<tr/! s/<tr/<TR bgcolor='#E0E0E0'/' file.txt
The exclamation mark negates the address range 0,/<tr/ (everything from the beginning of the file up to and including the first line that matches "<tr"), so that the substitution operates on all the following lines. Note that the 0,/regexp/ address form is a GNU sed extension.
If you need to operate on only the second occurrence, and ignore any subsequent matches, you can use a nested expression.
sed '0,/<tr/! {0,/<tr/ s/<tr/<TR bgcolor='#E0E0E0'/}' file.txt
Here, the bracketed expression operates on the lines selected by the first part, but in this case it stops substituting after changing the first matching "<tr".
PS, I've found the sed faq to be very helpful in cases like this.

You can use a Python script to fix the HTML:
from bs4 import BeautifulSoup
html_doc = """
<table>
<tr><td>Row1 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row2 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row3 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row4 / col1</td><td>col2</td><td>col3</td></tr>
<tr><td>Row5 / col1</td><td>col2</td><td>col3</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
index = 0
for tr in soup.find_all('tr'):
    # Only count rows that contain data cells
    if tr.find('td'):
        # Note: this styles only the first <td> of each row
        if index % 2:
            tr.find('td').attrs['style'] = 'background-color: #ff0000;'
        else:
            tr.find('td').attrs['style'] = 'background-color: #00ff00;'
        index += 1
print(soup)

Related

How can I get the HTML inside a tag in PostgreSQL

I have some data like this, and I want to get the HTML out of it.
with t(x) as (values( XMLPARSE(DOCUMENT ('<root><NotificationServiceDetails NotificationNo="0" AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" ><NotificationServiceDetail Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
</head>
<body>
<table style="font-family: 굴림,맑은 고딕; font-size:12px;color:#333333;border-width: 1px;border-color: #ddd; border-collapse: collapse; margin:5px;width:auto; min-width:600px;">
<tbody>
<tr>
<td colspan="2" style="border-width: 1px;padding: 10px;border-style: solid;border-color: #ddd; background-color: #f5f5f5; text-align:left; font-weight:bold; font-size:13px;">aaaaaaaaaaaaa</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Writer</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Nguyen Ngo Giap (General Mgr.)</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Date</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">2022-01-04 10:00~11:00</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Schedule Div.</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">테스트함</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Content</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">aaaaaaaaaa</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Share</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;"></td>
</tr>
</tbody>
</table>
</body>
</html>
]]></ContentJson></NotificationServiceDetail></NotificationServiceDetails></root>'))))
select
unnest((xpath('//NotificationServiceDetails/NotificationServiceDetail/@Id',t.x)))::text::integer as Id,
unnest((xpath('//NotificationServiceDetails/NotificationServiceDetail/Title/text()',t.x))):: text::character varying as Title,
unnest(xpath('//NotificationServiceDetails/NotificationServiceDetail/ContentJson/text()',t.x))::xml as ContentJson,
t.x
from t;
But the ContentJson column gives me escaped special characters ("&lt;..."). I want the real HTML.
Expected result for column ContentJson:
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
</head>
<body>
<table style="font-family: 굴림,맑은 고딕; font-size:12px;color:#333333;border-width: 1px;border-color: #ddd; border-collapse: collapse; margin:5px;width:auto; min-width:600px;">
<tbody>
<tr>
<td colspan="2" style="border-width: 1px;padding: 10px;border-style: solid;border-color: #ddd; background-color: #f5f5f5; text-align:left; font-weight:bold; font-size:13px;">aaaaaaaaaaaaa</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Writer</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Nguyen Ngo Giap (General Mgr.)</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Date</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">2022-01-04 10:00~11:00</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Schedule Div.</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">테스트함</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Content</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">aaaaaaaaaa</td>
</tr>
<tr>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;">Share</td>
<td style="padding: 15px; background-color: #f9f9f9; text-align:left;"></td>
</tr>
</tbody>
</table>
</body>
</html>
How can I do that?
So far, I've provided 5 different solutions - all with their own advantages and problems - you'll have to test on your own data and hardware to ensure that it's working for you!
I did the following (all the relevant code is available on the fiddle here):
CREATE TABLE t (x TEXT);
and populated it with some text similar to yours, but shorter to make testing easier:
INSERT INTO t VALUES
($SOMETAG$with t(x) as (values( XMLPARSE(DOCUMENT ('<root><NotificationServiceDetails NotificationNo="0" AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" ><NotificationServiceDetail Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
<html lang="en">
<head>
<meta charset="utf-8"/>
more stuff
more stuff
</table>
</body>
</html>
]]></ContentJson></NotificationServiceDetail></NotificationServiceDetails></root>'))))
select
unnest((xpath('//NotificationServiceDetails/NotificationServiceDetail/@Id',t.x)))::text::integer as Id,
unnest((xpath('//NotificationServiceDetails/NotificationServiceDetail/Title/text()',t.x))):: text::character varying as Title,
unnest(xpath('//NotificationServiceDetails/NotificationServiceDetail/ContentJson/text()',t.x))::xml as ContentJson,
t.x
from t;$SOMETAG$);
I picked up the $SOMETAG$...$SOMETAG$ (dollar-quoting) technique here - very helpful for inserting characters like single quote (') and backslash (\).
There is a "one-pass" solution possible - it just takes time and patience to work it out. Oh, and BTW, there were slight errors in my original solution - since corrected.
Part 1:
First I remove all characters up to but not including <html lang="en"> as follows:
SELECT SUBSTRING(x, STRPOS (x,'<html lang="en"'));
SUBSTRING and STRPOS were taken from this snippet.
Part 2:
Then, I reverse that string using the REVERSE() function
Part 3:
Finally, I use the SUBSTRING/STRPOS "trick" to chop off the other end of the string at the reversed closing tag, >lmth/< (</html> backwards)
Part 4 is the pièce de résistance
I reREVERSE() the string to bring it back to its original state - minus the undesirable bits to give the required result.
1st Solution (pretty horrible):
SELECT REVERSE(SUBSTR( REVERSE(SUBSTR(x, strpos(x, '<html lang="en">'))), strpos( REVERSE(SUBSTR(x, strpos(x, '<html lang="en">'))), '>lmth/<'))) FROM t;
It looks a bit better (or is more legible at least) when laid out like this:
SELECT
REVERSE(
SUBSTR(
REVERSE(
SUBSTR(x, strpos(x, '<html lang="en">'))),
strpos(
REVERSE(
SUBSTR(x, strpos(x, '<html lang="en">'))), '>lmth/<'))) FROM t;
Instead of these horrible constructions, I used CTEs (Common Table Expressions, AKA the WITH clause) to do this as follows.
Code spends most of its life in maintenance, so easier-to-read code is easier to repair.
WITH cte1 AS
(
SELECT REVERSE(SUBSTR(x, strpos(x, '<html lang="en">'))) AS s1 FROM t
), cte2 AS
(
SELECT REVERSE(SUBSTR(s1, strpos(s1, '>lmth/<'))) AS s2 FROM cte1
)
SELECT * from cte2;
Result:
reverse
<html lang="en">
<head>
<meta charset="utf-8"/>
more stuff
more stuff
</table>
</body>
</html>
The answer is the same for all of them!
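The reverse-and-cut steps can also be sketched outside SQL. The Python below is only an illustration of the same idea; the function name between_tags and the shortened sample text are assumptions, not part of the fiddle.

```python
def between_tags(text, start='<html lang="en">', end="</html>"):
    """Keep everything from the start tag through the last end tag."""
    # Part 1: drop everything before the start tag
    tail = text[text.index(start):]
    # Part 2: reverse the remainder
    rev = tail[::-1]
    # Part 3: cut at the reversed end tag (">lmth/<" for "</html>")
    cut = rev[rev.index(end[::-1]):]
    # Part 4: reverse again to restore the original order
    return cut[::-1]

# hypothetical, shortened sample
sample = 'junk<![CDATA[\n<html lang="en">\nmore stuff\n</html>\n]]></ContentJson>'
extracted = between_tags(sample)
print(extracted)
```

Searching the reversed string for the reversed tag finds the last closing tag of the original text, which is why no trailing characters survive.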
2nd Solution (a bit more elegant - fiddle available here):
SELECT SPLIT_PART(x, '</html>', 1) from t;
Result:
split_part
with t(x) as (values( XMLPARSE(DOCUMENT ('<root><NotificationServiceDetails NotificationNo="0" AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" ><NotificationServiceDetail Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
<html lang="en">
<head>
<meta charset="utf-8"/>
more stuff
more stuff
</table>
</body>
<<=== Note there are 7 spaces here
So, SPLIT_PART() cuts the string up to, but not including, the delimiter - </html> in this case.
We combine two SPLIT_PARTs in a CTE as follows:
WITH cte AS
(
SELECT LENGTH(split_part(x, '<html lang="en">', 1)) AS beg,
LENGTH(split_part(x, '</html>', 1)) AS fin
FROM t
)
Result:
substring
<html lang="en">
<head>
<meta charset="utf-8"/>
more stuff
more stuff
</table>
</body>
</html>
-- I cannot understand why I have to add 8 characters to the 408 of the length?
--
-- 7 (the length of </html>) I could possibly get, but why 8?
Which is the desired result.
3rd Solution (also reasonably elegant - fiddle available here):
I didn't go through all of the steps this time - I combined them all in one query. The interested reader is invited to go through it line by line.
SELECT
strpos(x, '<html lang="en">'),
strpos(x, '</html>'),
strpos(x, '</html>') - strpos(x, '<html lang="en">'),
substring(x FROM strpos(x, '<html lang="en">')
for ((strpos(x, '</html>') + 8) - strpos(x, '<html lang="en">')) )
FROM t;
--
-- Again, I'm puzzled by the necessity to use 8 characters.
--
--
Result:
strpos strpos ?column? substring
265 409 144 <html lang="en">
<head>
<meta charset="utf-8"/>
more stuff
more stuff
</table>
</body>
</html>
Et voilà - the desired result!
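On the question of the extra characters, the index arithmetic can be checked with a small Python sketch. The helpers strpos and substring below are stand-ins written for this illustration (1-based, like the SQL functions, for start >= 1), and the sample string is made up, so this says nothing certain about the data in the fiddle.

```python
def strpos(haystack, needle):
    """1-based position of needle, 0 when it is absent."""
    return haystack.find(needle) + 1

def substring(text, start, length):
    """Characters start .. start+length-1, counting from 1 (start >= 1)."""
    return text[start - 1:start - 1 + length]

# hypothetical sample with a newline after the closing tag
doc = 'prefix\n<html lang="en">\n<head>\n</html>\n]]>suffix'

a = strpos(doc, '<html lang="en">')
b = strpos(doc, "</html>")

# b - a reaches only up to the character before "</html>",
# so the tag's own length (7) has to be added to include it
exact = substring(doc, a, b - a + len("</html>"))
# adding 8 instead also takes the single character after the tag
plus8 = substring(doc, a, b + 8 - a)

print(repr(exact))
print(repr(plus8))
```

By this arithmetic, adding 7 is enough to include the closing tag, and adding 8 also picks up whatever single character follows it (a newline in this sample).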
4th Solution (a bit convoluted, but may be instructive - fiddle available here):
1st Step:
I split the lines into records in a table as follows:
SELECT
x.idx,
LENGTH(x.string),
x.string
FROM t,
REGEXP_SPLIT_TO_TABLE(t.x, '\n') WITH ORDINALITY AS x(string, idx);
2nd step:
I pull out the records I want which correspond to the desired result as follows:
WITH cte1 AS
(
SELECT
x.idx,
LENGTH(x.string) AS ls,
x.string
FROM t,
REGEXP_SPLIT_TO_TABLE(t.x, '\n') WITH ORDINALITY AS x(string, idx)
), cte2 AS
(
SELECT idx, ls, string
FROM cte1
WHERE string ~ '<html lang="en">' OR string ~ '</html>'
ORDER BY idx
)
SELECT idx, ls, string
FROM cte1
WHERE idx BETWEEN
(SELECT MIN(idx) FROM cte2) AND (SELECT MAX(idx) FROM cte2);
Result:
idx ls string
2 22 <html lang="en">
3 12 <head>
4 33 <meta charset="utf-8"/>
5 20 more stuff
6 20 more stuff
7 16 </table>
8 13 </body>
9 14 </html>
As you can see, the string field contains the data we want!
Solution 5 - using a relatively simple regular expression - fiddle here
1st pass:
We check the output of the very handy PG_TYPEOF() function, which from the docco here does:
pg_typeof returns the OID of the data type of the value that is passed
to it. This can be helpful for troubleshooting or dynamically
constructing SQL queries. The function is declared as returning
regtype, which is an OID alias type (see Section 8.18); this means
that it is the same as an OID for comparison purposes but displays as
a type name.
So, our first query is:
SELECT
REGEXP_MATCH(x, '^.*(<head>.*</html>).*'),
PG_TYPEOF(REGEXP_MATCH(x, '^.*(<head>.*</html>).*'))
FROM t;
Result:
regexp_match pg_typeof
{"<head>
<meta charset=\"utf-8\"/>
more stuff
more stuff
</table>
</body>
</html>"} text[]
So, we have our data, but it's surrounded by braces (curly brackets). PG_TYPEOF() tells us that the result is a text array, and our data is the first (and only) element of that array, so we can use array element notation as follows:
SELECT
(REGEXP_MATCHES(x, '^.*(<head>.*</html>).*'))[1] -- <<-- Note [1]
FROM t;
Result:
regexp_matches
<head>
<meta charset="utf-8"/>
more stuff
more stuff
</table>
</body>
</html>
Same as the others, except that this pattern starts at <head>; put <html lang="en"> in the capture group to include the opening tag as well.
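The same capture-group idea, sketched in Python for comparison. The sample string is invented, and the pattern here starts at the opening html tag rather than at <head>. In Python, re.DOTALL is needed so that "." also matches line breaks.

```python
import re

# hypothetical, shortened sample
doc = 'junk<![CDATA[\n<html lang="en">\n<head>\nmore stuff\n</html>\n]]>tail'

# one capture group spanning the opening tag through the closing tag
match = re.search(r'(<html lang="en">.*</html>)', doc, flags=re.DOTALL)
html = match.group(1) if match else None
print(html)
```

group(1) plays the role of the [1] array subscript in the SQL version.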
Crude performance analysis
After 5 runs, the order of merit appears to be the following (a fiddle of the tests run may be found here). Times will vary according to other uses to which the server may be being put but, as I said, I found them to be fairly consistent and always in the same order on the 5 runs that I examined.
In ascending order of run time (fastest first):
1st) Method 3: STRPOS(), time ~ 0.045ms - let that be the baseline of 1 x the fastest execution
2nd) Method 1: SUBSTRING() & REVERSE(), 0.079ms: x 1.75
3rd) Method 2: SPLIT_PART(): x 2.25
4th) Method 4: REGEXP_SPLIT_TO_TABLE() with CTE: x 13.2
5th) Method 5: REGEXP_MATCH(): x 49.4
So, we can see that the most expensive algorithm is ~ 50 times more expensive than the most efficient one. The usual caveats about benchmarking with only one record and on an unknown system apply - although the results were fairly consistent over at least 5 runs. Always benchmark on your own system with your own data!

Matlab, Matrix-Division. Showing multiple results / non trivial result

I have a 5x5 Matrix A:
A = [-4  0  0  1  0;
      1 -5  0  0  4;
     -6 -6 -6  0  0;
      1  0  1  0  0;
      0  2  0  0  0]
and want to find a vector x such that:
A*x = 0.
The only way I can think of is left division in MATLAB. This gives the trivial result x = [0 0 0 0 0].
However, in this case I want the result:
x = [1 0 -1 4 -0.25]
Does someone know how I can get this?
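You're probably looking for the null function, which returns a basis for the null space of A (the 'r' option gives a "rational" basis obtained from the reduced row echelon form). Any scalar multiple of a null-space vector also solves A*x = 0, so the basis vector has to be rescaled to get exactly the x you want:
x = -null(A,'r')/4
Seems to work.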
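To see where the scaling comes from, here is a pure-Python sketch of a rational null-space computation using exact fractions. It is not MATLAB's implementation, only the same idea written out for illustration: reduce to row echelon form, then set each free variable to 1 in turn. The function name null_space_rational is an assumption.

```python
from fractions import Fraction

def null_space_rational(rows):
    """Basis of the null space via exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    pivots = []  # pivot column of each reduced row, in order
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        # look for a row at or below r with a non-zero entry in column c
        p = next((i for i in range(r, n_rows) if m[i][c] != 0), None)
        if p is None:
            continue  # no pivot here: column c is a free variable
        m[r], m[p] = m[p], m[r]
        lead = m[r][c]
        m[r] = [v / lead for v in m[r]]
        for i in range(n_rows):
            if i != r and m[i][c] != 0:
                factor = m[i][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
    basis = []
    for f in (c for c in range(n_cols) if c not in pivots):
        vec = [Fraction(0)] * n_cols
        vec[f] = Fraction(1)  # free variable set to 1
        for row_idx, p in enumerate(pivots):
            vec[p] = -m[row_idx][f]
        basis.append(vec)
    return basis

A = [[-4, 0, 0, 1, 0],
     [1, -5, 0, 0, 4],
     [-6, -6, -6, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 2, 0, 0, 0]]

basis = null_space_rational(A)
x = [-v / 4 for v in basis[0]]  # same rescaling as in the answer
residual = [sum(a * b for a, b in zip(row, x)) for row in A]
print(basis[0], x, residual)
```

Here the single basis vector has its last component equal to 1, so dividing by -4 gives the vector whose first component is 1.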

Sed replace sql

page.sql
Replace the following script
INSERT INTO `page`
by this:
INSERT INTO `page` (page_id, page_namespace, page_title, page_restrictions, page_counter, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len)
I am a beginner and I don't know what's wrong with this GNU sed command on Windows:
sed 's/INSERT INTO `page`/INSERT INTO `page` (page_id, page_namespace, page_title, page_restrictions, page_counter, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len)/g' pageC.sql > paged.sql
Changing the whole line to new content:
sed '/INSERT INTO `page`/ c\
INSERT INTO `page` (page_id, page_namespace, page_title, page_restrictions, page_counter, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len)
' pageC.sql > paged.sql
Substitution of text:
sed 's/INSERT INTO `page`/& (page_id, page_namespace, page_title, page_restrictions, page_counter, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len)/' pageC.sql > paged.sql
In the replacement, & stands for the text that matched. There is no need for g unless the pattern occurs several times on the SAME line (sed works line by line, and by default s replaces only the first match on each line).
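For readers who cannot get the shell quoting to work, the same substitution can be sketched in Python, where \g<0> plays the role of & in sed. The function name add_column_list and the sample INSERT line are assumptions made for this illustration.

```python
import re

COLUMNS = ("(page_id, page_namespace, page_title, page_restrictions, "
           "page_counter, page_is_redirect, page_is_new, page_random, "
           "page_touched, page_latest, page_len)")

def add_column_list(sql):
    """Append the column list after every INSERT INTO `page`."""
    # \g<0> re-inserts the whole match, like & in a sed replacement
    return re.sub(r"INSERT INTO `page`", r"\g<0> " + COLUMNS, sql)

# hypothetical sample line
line = "INSERT INTO `page` VALUES (1,0,'Main_Page');"
converted = add_column_list(line)
print(converted)
```

Note that re.sub replaces every occurrence, which corresponds to using the g flag in sed.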

If matched then print all using awk

I have a file which contains many sub-sections each starting with [begin] and ending with [end]:
[begin li1_1378184738754_91]
header=7075|lime|0|0|109582|0|1|2700073||0|0|0|[355]|1|0|ssb-li1-1378184738754-90||0||LIME |0|saved=true|0.002406508312038836|0|[ser=zu1:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=uzu6:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=xzs5:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=sv-stda-zu3:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=hzu8:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=lzu3:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=yzu2:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=xzu7:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer]|0|null|false|40||false|
attrs=0|0|0||0|
ptitle=690751404|1|1|1|Rest:1998636||||||2700401|175619|900.5636134725806|0.985486|39.166666666666664|$9.99|100.0|1|||
seller=1998636|1|9.99|1|-1||0|||||true||4.7937584|10412|false|
ptitle=5543369186|2|1|1|Rest:1533891||||||2700211|19615|886.8211044369053|0.776121|34.0|$119.99|100.0|1|||
seller=1533891|1|119.99|3|-1|1.0:text,In+size+6.0%2C7.0%2C8.0%2C8.5%2C9.0%2C9.5%2C10.0%2C...,0.0,,,,0,0,|2|||||true||2.95|20|true|
ptitle=622529158|3|1|1|||||||2700408|67402|796.5289827432475|0.893899|63.0|$5.27|100.0|1|||
seller=4281413|1|5.27|1|-1||0|||||true||4.695052|1769|true|
ptitle=5507199621|4|1|1|||||||2700220|56412|706.9031281251306|0.791171|45.0|$99.99|100.0|1|||
seller=4806107|1|-1.0|1|-1|1.0:sale,$,30.000000000000014,0.0,,,,0,0,:text,In+size+6.0%2C6.5%2C7.0%2C7.5%2C8.0%2C8.5%2C9.0%2C9...,0.0,,,,0,0,|2||||$130 $30.00 off|false||5.0|1|false|
ptitle=5502728013|5|1|1|||||||900000|0|698.7772340643119|0.836740|75.0|$40.95|100.0|1|||
seller=955448|1|40.95|1|-1||0|||||false||4.142857|7|false|
ptitle=840662011|6|1|1|Rest:265238||||||300233|62718|683.2927820751431|0.995513|52.0|$22.95|100.0|1|||
seller=265238|1|22.95|1|-1||0|||||false||4.478261|23|false|
ptitle=848084980|8|1|1|||||||2700073|145653|670.4809846773688|0.880587|60.0|$24.99|100.0|1|||
seller=5267046|1|24.99|1|-1||0|||||true||0.0|0|false|
ptitle=891200492|9|1|1|Rest:1030132||||||2701003|17215|668.8437575254773|0.825491|32.0|$519.99|100.0|1|||
seller=1030132|1|519.99|1|-1||0|||||false||4.7391305|23|false|
ptitle=641974054|10|1|1|||||||900000|69433|667.6678790058678|0.752129|57.0|$4.19|100.0|1|||
seller=3365158|1|4.19|1|-1||0|||||true||4.70907|4410|true|
ptitle=517591869|12|1|1|Rest:4802895||||||2700408|127644|643.0972570735605|0.893899|17.25|$23.95|100.0|1|||
seller=4318776|1|-1.0|3|-1||0|||||false||0.0|0|false|
ptitle=541549480|13|1|1|Rest:1180414||||||2702000|105832|597.4904572011968|0.752129|24.666666666666664|$8.27|100.0|1|||
seller=4636561|1|8.27|1|-1||0|||||false||4.8283377|734|true|
ptitle=1020561900|14|1|1|||||||2700063|159813|594.4717491579845|0.934869|75.0|$5.39|100.0|1|||
seller=4722645|1|5.39|1|-1|1.0:sale,$,0.6000000000000005,0.0,,,,0,0,:text,Free+Shipping+on+All+Orders%21,0.0,201301010000/,,,0,0,|2||||$5.99 $0.60 off|true||4.3942246|1593|true|
ptitle=507792308|15|1|1|Rest:4683455||||||2702000|105832|591.7739184402442|0.768311|22.5|$9.48|100.0|1|||
seller=4910651|1|-1.0|2|-1||0|||||false||5.0|1|false|
ptitle=1090571346|16|1|1|Rest:4452919||||||2700211|20824|776.4814913363535|0.776121|35.0|$59.99|100.0|1|||
seller=1533891|1|59.99|1|-1|1.0:sale,$,49.99999999999999,0.0,,,,0,0,:text,In+size+7.5%2C8.0%2C8.5%2C9.0%2C9.5%2C10.0%2C10.5...,0.0,,,,0,0,|2||||$110 $50.00 off|true||2.95|20|true|
ptitle=573017390|17|1|1|||||||2700073|91937|679.695660577044|0.880587|33.5|$14.85|100.0|1|||
seller=4281413|1|14.85|1|-1||0|||||true||4.695052|1769|true|
ptitle=5502723300|18|1|1|||||||900000|0|639.3095640940136|0.836740|75.0|$50.95|100.0|1|||
seller=955448|1|50.95|1|-1||0|||||false||4.142857|7|false|
ptitle=940022974|20|1|1|||||||2700600|58701|569.9503499778303|0.875839|59.0|$14.40|100.0|1|||
seller=4825227|1|14.4|1|12||0|||||true||4.0289855|276|true|
ptitle=5513277553|21|1|1|||||||2700220|56412|565.2712749001105|0.776121|44.33333333333333|$129.95|100.0|1|||
seller=4825252|1|129.95|1|23||0|||||true||4.0289855|276|true|
ptitle=890329961|22|1|1|||||||2700408|133796|564.7642425785796|0.837916|34.75|$61.95|100.0|1|||
seller=4825235|1|61.95|4|19||0|||||true||4.0289855|276|true|
ptitle=753852910|24|1|1|||||||2700073|146738|557.7419123688652|0.934869|47.69230769230769|$26.99|100.0|1|||
seller=4722645|1|26.99|10|-1|1.0:sale,$,3.0,0.0,,,,0,0,:text,Free+Shipping+on+All+Orders%21,0.0,201301010000/,,,0,0,|2||||$29.99 $3.00 off|true||4.3942246|1593|true|
ptitle=654738989|26|1|1|||||||900000|84012|554.7756559595525|0.752129|57.0|$3.19|100.0|1|||
seller=3365158|1|3.19|1|-1||0|||||true||4.70907|4410|true|
ptitle=707747307|27|1|1|Rest:4736009||||||2700063|76249|552.234395428327|0.889614|19.857142857142854|$6.39|100.0|1|||
seller=4736009|1|6.39|1|-1||0|||||false||4.8071113|15356|true|
ptitle=63531001|28|1|1|||||||2700408|82712|625.0421885589608|0.893899|47.166666666666664|$7.69|100.0|1|||
seller=4281413|1|7.69|3|-1||0|||||true||4.695052|1769|true|
ptitle=5502728016|29|1|1|||||||900000|0|605.9895507237038|0.836740|75.0|$503.00|100.0|1|||
seller=955448|1|503.0|1|-1||0|||||false||4.142857|7|false|
ptitle=507792308|31|1|1|Rest:4683455||||||2702000|105832|559.6902659046442|0.752129|22.5|$8.99|100.0|1|||
seller=5105812|1|-1.0|1|-1||0|||||false||0.0|0|false|
ptitle=753852910|32|1|1|||||||2700073|146738|545.9987095658629|0.870929|47.69230769230769|$22.49|100.0|1|||
seller=4143386|1|22.49|6|-1|1.0:sale,$,7.5,0.0,,,,0,0,:text,Free+Shipping+on+Orders+Over+%24100,0.0,201109010000/201409302359,,,0,0,|2||||$29.99 $7.50 off|false||4.7316346|2355|true|
ptitle=5513277553|33|1|1|Rest:1533891||||||2700220|56412|653.3133907916089|0.825491|44.33333333333333|$149.99|100.0|1|||
seller=1533891|1|149.99|3|-1|1.0:text,In+size+5.0%2C5.5%2C6.0%2C6.5%2C7.0%2C7.5%2C8.0%2C8...,0.0,,,,0,0,|2|||||true||2.95|20|true|
ptitle=63531001|34|1|1|||||||2700408|82712|541.8233547780552|0.893899|47.166666666666664|$7.72|100.0|1|||
seller=2370155|1|7.72|4|-1||0|||||false||4.85|40|false|
ptitle=1018957017|35|1|1|||||||2700073|145653|540.6093714604533|0.860614|56.0|$25.95|100.0|1|||
seller=5036683|1|25.95|1|-1||0|||||false||4.8405056|366|false|
ptitle=743682867|36|1|1|||||||2700073|63437|539.5985846455641|0.870929|58.0|$46.99|100.0|1|||
seller=193176|1|46.99|1|-1||0|||||true||4.8511987|1418|true|
ptitle=679858288|37|1|1|||||||2700063|188669|535.1360632897284|0.902031|30.0|$12.41|100.0|1|||
seller=4143386|1|12.41|2|-1|1.0:sale,$,1.379999999999999,0.0,,,,0,0,:text,Free+Shipping+on+Orders+Over+%24100,0.0,201109010000/201409302359,,,0,0,|2||||$13.79 $1.38 off|false||4.7316346|2355|true|
ptitle=994328713|38|1|1|||||||2700073|71463|534.7715925279717|0.870929|58.0|$1.29|100.0|1|||
seller=1787388|1|1.29|1|-1||0|||||false||4.680464|3624|false|
ptitle=886915818|40|1|1|||||||2700444|201835|529.7519801432289|0.934869|65.5|$44.99|100.0|1|||
seller=4561883|1|44.99|2|-1||0|||||true||4.7913384|508|false|
seller_hidden=227502|990765963|1147436601|-1
seller_hidden=5310958|622529158|5645627277|-1
seller_hidden=4825254|5543369186|5651114316|23
seller_hidden=5289138|5548930281|5653769481|-1
[end li1_1378184738754_91]
I am trying to run the command cat /home/nextag/logs/OutpdirImpressions.log.2013-09-02 | awk -F "$begin" '{print $0}' | awk '$0 ~ "header=7075" {print $0}'
With this command I want to split the entire file into sub-sections beginning with the word 'begin', and then pick those sub-sections which contain 'header=7075'.
The expected output is the entire sub-section (for each one that contains that string), but I am getting only this portion as output:
header=7075|lime|0|0|109582|0|1|2700073||0|0|0|[355]|1|0|ssb-li1-1378184738754-90||0||LIME
|0|saved=true|0.002406508312038836|0|[ser=zu1:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=uzu6:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=xzs5:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=sv-stda-zu3:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=hzu8:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=lzu3:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=yzu2:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer][ser=xzu7:mtu=model_other_20120806calibex.csv:mu=model_other_20120806calibex.csv:scorerClassUsed=LinearPersonalizedProductSearchScorer]|0|null|false|40||false|
I have tried using if in various ways, but it doesn't work. Can somebody please help me?
I tried awk -F "$begin" '{if($0 ~ "header=7075") {print $0}}' /home/nextag/logs/OutpdirImpressions.log.2013-09-02. It gave the same result.
Can somebody please suggest how I can get the complete sub-section in the result?
Try this awk one-liner:
awk '$1=="[end"{p=0}/^header=7075/{p=1}p' file
In parts:
$1=="[end"{p=0} if you reach a line whose first word is "[end", set the flag to zero
/^header=7075/{p=1} if you reach a line which begins with "header=7075", set the flag to one
p if the flag is non-zero, print the current line (equivalent to p{print} or p{print $0} or p!=0{print $0})
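The one-liner prints from the header line up to the line before [end. A variation that buffers each sub-section and emits it whole, including the [begin and [end lines, could look like the Python sketch below. The function name matching_sections and the shortened sample log are assumptions for illustration.

```python
def matching_sections(lines, needle="header=7075"):
    """Yield each [begin ...] .. [end ...] block that has a line starting with needle."""
    section = []
    inside = False
    for line in lines:
        if line.startswith("[begin"):
            # start buffering a new sub-section
            section, inside = [line], True
        elif inside:
            section.append(line)
            if line.startswith("[end"):
                # the sub-section is complete: emit it only if it matches
                if any(entry.startswith(needle) for entry in section):
                    yield section
                inside = False

# hypothetical, shortened log
log = [
    "[begin a]", "header=7075|lime", "ptitle=1", "[end a]",
    "[begin b]", "header=9999|other", "ptitle=2", "[end b]",
]
found = list(matching_sections(log))
for block in found:
    print("\n".join(block))
```

Buffering is needed because the decision to print can only be made once the header line has been seen, which comes after the [begin line.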

Importing tables in Mathematica from web - empty cell problem

I use:
data=Import["http://weburl/","Data"]
to import data from one site. On that page there are tables. This creates nested lists, and you can easily get the data in table form. For example:
Grid[data[[1]]]
would give something like this:
Player Age Shots Goals
P1 24 10 2
P2 22 5 0
P3 28 11 1
...
Now, here is the problem. If one cell in the html table is empty, for example an entry for "Age", then in html this would look like this: <td></td>. Mathematica doesn't include it in the list at all, not even as, for example, a "Null" value. Instead, this row would just be represented by a list of length 3 and the data would be shifted by one column, so you'd get "Shots" in place of "Age" and "Goals" in place of "Shots", and "Goals" would be empty.
For example, a "P4" whose age is unknown (empty cell in the html table), who had 10 shots and scored 0 goals, would be imported as a list of length 3, not 4, and shifted by one:
Player Age Shots Goals
P1 24 10 2
P2 22 5 0
P3 28 11 1
P4 10 0
...
This poses a difficult problem, because if you have a few empty fields then you can't tell from the list to which column it belongs. Is there a way to put a "Null" on an empty cell in html tables when importing in Mathematica? For example, P4 element in list would look like this:
data[[1,5]]
{"P4","Null",10,0}
instead of:
{"P4",10,0}
As lumeng points out, you can use FullData to get the HTML table element to fill out properly. Here's a simpler illustration of this.
in = ImportString["\<<html><table>
<tr>
<td>(1,1)</td>
<td>(1,2)</td>
<td>(1,3)</td>
</tr>
<tr>
<td>(2,1)</td>
<td></td>
<td>(2,3)</td>
</tr>
</table></html>\>",
{"HTML", "FullData"}];
Grid[in[[1, 1]]]
If you want more complete control of the output, I'd suggest that you Import the page as XML. Here's an example.
in = ImportString["\<<html><table>
<tr>
<td>(1,1)</td>
<td>(1,2)</td>
<td>(1,3)</td>
</tr>
<tr>
<td>(2,1)</td>
<td></td>
<td>(2,3)</td>
</tr>
</table></html>\>", "XML"];
Column[Last /@ Cases[in,
XMLElement["td", ___], Infinity]]
You'll need to read up a bit on XML in general and Mathematica's version, namely the XMLObject. It's a delight to work with, once you get the hang of it, though.
In[13]:= htmlcode = "<html><table border=\"1\">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
<td>row 1, cell 3</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td></td>
<td>row 2, cell 3</td>
</tr>
</table><html>";
In[14]:= file = ToFileName[{$TemporaryDirectory}, "tmp.html"]
Out[14]= "/tmp/tmp.html"
In[15]:= OpenWrite[file]
WriteString[file,htmlcode]
Close[file]
FilePrint[file]
Out[15]= OutputStream[/tmp/tmp.html,18]
Out[17]= /tmp/tmp.html
During evaluation of In[15]:=
<html><table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
<td>row 1, cell 3</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td></td>
<td>row 2, cell 3</td>
</tr>
</table><html>
In[23]:= Import[file,"Elements"]//InputForm
Out[23]//InputForm=
{"Data", "FullData", "Hyperlinks", "ImageLinks", "Images", "Plaintext", "Source", "Title", "XMLObject"}
In[22]:= Import[file,"FullData"]//InputForm
Out[22]//InputForm=
{{{{"row 1, cell 1", "row 1, cell 2", "row 1, cell 3"}, {"row 2, cell 1", "", "row 2, cell 3"}}}, {}}
Using Computist's sample, you could also do:
htmlcode = "<html><table border=\"1\">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
<td>row 1, cell 3</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td></td>
<td>row 2, cell 3</td>
</tr>
</table><html>";
StringReplace[htmlcode, "<td></td>" -> "<td>###</td>"];
ImportString[%, "Data"] /. "###" -> Null
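Outside Mathematica, the same "keep the empty cell" behaviour can be sketched with Python's standard html.parser module. The class name TableRows and the sample table are assumptions for this illustration, and the sketch assumes every td sits inside a tr.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect table rows, keeping empty cells as None."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the open cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            # an empty <td></td> still produces an entry, so columns stay aligned
            self.rows[-1].append(text if text else None)
            self._cell = None

# hypothetical sample table
html = """<html><table>
<tr><td>P3</td><td>28</td><td>11</td><td>1</td></tr>
<tr><td>P4</td><td></td><td>10</td><td>0</td></tr>
</table></html>"""

parser = TableRows()
parser.feed(html)
rows = parser.rows
print(rows)
```

Every row keeps its full length, so the "Shots" and "Goals" values stay in their own columns.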