Replacing everything between two strings in UNIX shell - sed

I want to write a shell script that gets all "a href" HTML tags from a provided link and prints them to the console. The problem I am facing right now is removing all of the text I don't need between them. After some googling I concluded that the "sed" command would be the best tool for this job; however, I cannot figure out how to write it correctly:
#!/bin/sh
wget -qO - $1 | grep -E "*<[Aa]([[:print:]])*( |'\n')[Hh][Rr][Ee][Ff]([[:print:]])*</a" | sed 's/<\/a>.*<a/<\/a>REPLACED\n<a/g'
What I am trying to do is replace EVERYTHING between the "</a>" closing tag and the next "<a" opening tag. (I don't know much about HTML; there may be other tags that start with "a", but that's a problem for later.) However, this command (and a few other variants I have tried) only works sometimes.
I am new to shell scripting, so any suggestions are welcome. Maybe "sed" is not the right command for the job. Thanks in advance.
Edit 1: from this:
Canonical</li></ul></li></ul></div></div> <script> $(function() { $(".nav-global .more > a").click(function(e){ $(this).closest(".more").toggleClass("open"); return false; }); $(document).click(function(){ $(".nav-global .more.open").removeClass("open"); }); }); </script></div>
<span></span>
to this:
CanonicalREPLACED<span></span>
Edit 2:
It seems I am bad at explaining exactly what I expect. For large-scale testing, I use the link https://askubuntu.com/questions/726076/whats-wrong-with-my-grep-command. What I am trying to achieve is to have ONLY the "a href" tags (or other HTML tags that start with "<a" and end with "</a>"), separated by "REPLACED" as shown in the previous edit.

1st solution: With your shown samples, please try the following awk code. Written and tested in GNU awk.
awk -v RS="" -v FS='<\\/a>.*<a href=' '{print $1"</a>REPLACED<a href="$2}' Input_file
2nd solution: Using RS and sub functions of awk, written and tested in GNU awk.
awk -v RS="" '{sub(/<\/a>.*<a href=/,"</a>REPLACED<a href=")} 1' Input_file

Using sed
$ sed -Ez 's~(<[^<]*)[^\n]*\n +~\1</a>REPLACED~' input_file
CanonicalREPLACED<span></span>

Output result to stdout:
sed -z 's/\(<\/a>\).*\(<a\)/\1REPLACED\2/g' inputfile

Related

Inserting the filename before the first line of a text file

I'm trying to add the filename of a text file as the first line of that same file. For example, if the file is called test1.txt, then the first line when you open the file should be test1.
Below is what I've done so far. The only problem I have is that the literal word "$file" is being written to the file instead of the file name. Any help is appreciated.
for file in *.txt; do
sed -i '1 i\$file' $file;
awk 'sub("$", "\r")' "$file" > "$file"1;
mv "$file"1 "$file";
done
Without concise, testable sample input and expected output it's an untested guess but it SOUNDS like all you need is:
awk -i inplace -v ORS='\r\n' 'FNR==1{print FILENAME}1' *
No shell loop or multiple commands required.
The above uses GNU awk for inplace editing and I'm assuming the sub() in your code was intended to add a \r at the end of every line.
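For what it's worth, the literal `$file` shows up because the shell does not expand variables inside single quotes. A minimal sketch of the same loop without sed, assuming the extension should be stripped (test1.txt becomes test1) and skipping the \r handling; the scratch directory and sample file are invented for the demonstration:

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'first line\nsecond line\n' > test1.txt

for file in *.txt; do
    # Double quotes let the shell expand the variable; ${file%.txt} drops the extension.
    { printf '%s\n' "${file%.txt}"; cat "$file"; } > "$file.tmp" && mv "$file.tmp" "$file"
done

cat test1.txt
```

Writing to a temporary file and renaming it avoids reading and truncating the same file in one command.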
I've just started learning more about sed and awk and put this into a file called insert.sed and sourced it and passed it a file name:
sed -i '1s/^./'$1'\'$'\n/g' $1
In trying it, it seems to work okay:
rent$ cat x.txt
<<< Who are you?
rent$ source insert.sed x.txt
rent$ cat x.txt
x.txt
<< Who are you?
It is cutting off the first character of the first line, so I'd have to fix that, but otherwise it does add the file name as the first line.
I'm sure there's better ways of doing it.
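The first character disappears because `^.` matches and consumes it; anchoring on `^` alone inserts text without consuming anything. A minimal sketch of that fix, using the sample line from above and a backslash-newline in the replacement; it assumes the name contains no `/`, `&` or backslash, which would need escaping:

```shell
# The file name is passed as a plain string here instead of $1.
name='x.txt'

# Inside double quotes \\ becomes a single backslash, which escapes the newline for sed.
result=$(printf '%s\n' '<<< Who are you?' | sed "1s/^/$name\\
/")

printf '%s\n' "$result"
```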
If you want test1 on the first line, with GNU sed:
sed -i '1{x;s/.*/fich=$(ps -p $PPID -o args=);fich=${fich##*\\} };echo ${fich%%.*}/e;G}' test1.txt

Inserting numbers with sed in Linux?

I have the following line in cmdline
sed -e '1s/^/\\documentstyle\[11pt\]\{article\}\n/' -e 's/[0-9]//g' test.txt
My desired output is something like this
\documentstyle[11pt]{article}
rest of the file
However I only get this
\documentstyle[pt]{article}
rest of the file
I can't seem to find a way to insert numbers. I tried backslashing. Solution might be simple, but I'm a newbie with sed.
Note that sed has more commands than just s///. To insert a line at the top of a file:
sed -e '1i\
\\\documentstyle[11pt]{article}' -e 's/[0-9]//g' file
(frustratingly, the number of backslashes to achieve a backslash in the output was found by trial and error)
The bonus is that the inserted line is not affected by the command that removes numbers.
My second command was removing numbers, working as intended indeed, but I was just trying to do it all at once. Credits to Jonathan Leffler.
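The digits vanish from the header because sed runs every expression on the current pattern space in order: by the time `s/[0-9]//g` runs on line 1, the header is already part of it. A minimal sketch that keeps `s///` for both steps and simply swaps their order; the two input lines are made up for illustration:

```shell
# Made-up sample input containing digits that should be stripped.
out=$(printf '%s\n' 'abc123' 'rest 45 of the file' |
  sed -e 's/[0-9]//g' -e '1s/^/\\documentstyle[11pt]{article}\
/')

printf '%s\n' "$out"
```

The digits are removed first, so the `11` in the header is added afterwards and survives.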

Parsing HTML on the command line; How to capture text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:
grep "/strong" output.html | awk '{print $1}'
Grep on "/strong" to get the lines with the targets; that works fine.
Pipe to awk '{print $1}'. That works in case #1, when the target has no spaces, but fails in case #2, when the target has spaces: only the first word is preserved, as below:
<strong>Target1NoSpaces</strong><span
<strong>Target2
Do you have any tips on hitting the target properly, either in my awk or in different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.
Try pup, a command line tool for processing HTML. For example:
$ pup 'strong text{}' < file.html
Target1NoSpaces
Target2 With Spaces
To search via XPath, try xpup.
Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
One way using mojolicious and its DOM parser:
perl -Mojo -E '
g("http://your.web")
->dom
->find("strong")
->each( sub { if ( $t = shift->text ) { say $t } } )'
Using Perl regex's look-behind and look-ahead feature in grep. It should be simpler than using awk.
grep -oP "(?<=<strong>).*?(?=</strong>)" file
Output:
Target1NoSpaces
Target2 With Spaces
Add:
This implementation of Perl's regex's multi-matching in Ruby could match values in multiple lines:
ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file
Input:
<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>
Output:
----------
Target
A
B
C
----------
Target D
----------
Target E
Here's a solution using xmlstarlet
xml sel -t -v //strong input.html
Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.
awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
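If several <strong> tags can share a line, a still quick-and-dirty variation is to let grep print each match on its own line first. A minimal sketch, assuming a grep that supports -o (GNU and BSD grep do); the sample line is hypothetical, and this still breaks on nested tags or tags that span lines:

```shell
# Hypothetical line with two <strong> elements, the case the awk one-liner misses.
line='<strong>Target1NoSpaces</strong><strong>Target2 With Spaces</strong><span class="creator">'

# grep -o prints each match on its own line; sed then strips the tags.
targets=$(printf '%s\n' "$line" | grep -o '<strong>[^<]*</strong>' | sed 's/<[^>]*>//g')

printf '%s\n' "$targets"
```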
You never need grep with awk and the field separator doesn't have to be whitespace:
$ awk -F'<|>' '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces
You should really use a proper parser for this however.
Since you tagged perl
perl -ne 'if(/(?:<strong>)(.*)(?:<\/strong>)/){print $1."\n";}' input.html
I am surprised no one mentions W3C HTML-XML-utils
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' strong
output:
<strong class="fc-black-750 mb6">Stack Overflow
for Teams</strong>
<strong>Teams</strong>
To capture only content:
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' -c strong
Stack Overflow
for Teams
Teams

Sed command to fetch particular string from full string

I've got a file which contains a lot of strings like the input below.
I need to extract the output below and process it further.
Input:
History={ExecAt=[2013-05-03 03:00:20,2013-05-03 03:00:23,2013-05-03 03:00:26],MId=["msgId3","msgId4","msgId5"]};
Output should be:
MId=["msgId3","msgId4","msgId5"]
Using the command sed 's/^.*,MId=/MId=/' I got output like MId=["msgId3","msgId4","msgId5"]};
but I still want the exact output (I need to remove the last 2 special characters, }; here).
This works for me:
sed 's/.*\(MId=.*\)\}.*/\1/'
If your grep supports the -o option, you can use it rather than sed:
grep -o 'MId=\[[^]]\+\]'
Using the same regex in sed works fine, just remove anything before and after:
sed -e 's/.*\(MId=\[[^]]\+\]\).*/\1/'
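Note that `\+` is a GNU extension in basic regular expressions. A minimal sketch of the same idea using only `*`, run on the sample input from the question (the variable names are just for the demonstration):

```shell
# Sample input line from the question.
line='History={ExecAt=[2013-05-03 03:00:20,2013-05-03 03:00:23,2013-05-03 03:00:26],MId=["msgId3","msgId4","msgId5"]};'

# Keep only MId=[...]; [^]]* matches everything up to the closing bracket.
mid=$(printf '%s\n' "$line" | sed 's/.*\(MId=\[[^]]*\]\).*/\1/')

printf '%s\n' "$mid"
```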

Perl command is not behaving as expected?

I have a file with below contents:
[TEMP.s_m_update_BUS_spec]
$$SRC_STAT_RA=WHG_STATUS_SITEENTSEQCHAIN_20110901094550.dat
$InputFile_RA_SPE=/edwload/rqt/workingdir/status_spe/WHG_STATUS_SITEENTSEQCHAIN_20110901094550.dat
[TEMP.s_m_upd_salions_rqthk]
$$SRC_STAT_RN=WHG_STATUS_SITEENTSEQCHAIN_20110901094550
$InputFile_RN_RQT=/edwload/rqt/workingdir/restriction/WHG_STATUS_SITEENTSEQCHAIN_20110901094550.dat
I am using the perl command below to replace WHG_STATUS_SITEENTSEQCHAIN_20110901094550 with WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat only in the section [TEMP.s_m_upd_salions_rqthk], but somehow it's not giving me the expected result. Even the WHG_STATUS_SITEENTSEQCHAIN_20110901094550 under the section [TEMP.s_m_update_BUS_spec] is getting replaced.
perl -p -i -e "s|\$\$SRC_STAT_RN=.*|\$\$SRC_STAT_RN=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat|g;s|\$InputFile_RN_RQT=\/edwload\/rqt\/workingdir\/restriction\/.*|\$InputFile_RN_RQT=\/edwload\/rqt\/workingdir\/restriction\/WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat|g" Input_File
Please let me know what modifications the command above requires. The same substitute commands work fine with sed, but I would like to use perl.
The program you run is
s|$$SRC_STAT_RN=.*|$$SRC_STAT_RN=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat|g; s|$InputFile_RN_RQT=\/edwload\/rqt\/workingdir\/restriction\/.*|$InputFile_RN_RQT=\/edwload\/rqt\/workingdir\/restriction\/WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat|g
There are a fair number of $ that should be escaped but aren't. It would be simpler if you used single quotes instead of double quotes. You were probably trying for:
perl -i -pe'
s{\$\$SRC_STAT_RN=.*}{\$\$SRC_STAT_RN=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat}g;
s{\$InputFile_RN_RQT=/edwload/rqt/workingdir/restriction/.*}{\$InputFile_RN_RQT=/edwload/rqt/workingdir/restriction/WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat}g;
' Input_File
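To see what the shell actually hands to perl, you can print the program text with both kinds of quoting. A minimal sketch, with the replacement shortened to the placeholder NEW:

```shell
# Double quotes: the shell turns \$ into $ before perl ever sees the program.
dq=$(printf '%s\n' "s|\$\$SRC_STAT_RN=.*|NEW|")

# Single quotes: the backslashes reach perl intact.
sq=$(printf '%s\n' 's|\$\$SRC_STAT_RN=.*|NEW|')

printf 'double-quoted: %s\nsingle-quoted: %s\n' "$dq" "$sq"
```

Only the single-quoted form still contains the backslashes that tell perl to treat each $ as a literal character.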
What exactly is not working as you want? On my machine, after running your perl code, the file looks like:
[TEMP.s_m_update_BUS_spec] $$SRC_STAT_RA=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat
[TEMP.s_m_upd_salions_rqthk] $$SRC_STAT_RN=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat
Ain't this what you expected?
Edit
Try modifying your command to:
perl -p -i -e "s|\$\$SRC_STAT_RN=.*?|\$\$SRC_STAT_RN=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat|gmx;s|\$InputFile_RN_RQT=/edwload/rqt/workingdir/restriction/.*?|\$InputFile_RN_RQT=/edwload/rqt/workingdir/restriction/WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat|gmx" Input_File
and see if the result is as expected:
[TEMP.s_m_update_BUS_spec]
$$SRC_STAT_RA=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.datWHG_STATUS_SITEENTSEQCHAIN_20110901094550.dat
$InputFile_RA_SPE=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat/edwload/rqt/workingdir/status_spe/WHG_STATUS_SITEENTSEQCHAIN_20110901094550.dat
[TEMP.s_m_upd_salions_rqthk]
$$SRC_STAT_RN=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.datWHG_STATUS_SITEENTSEQCHAIN_20110901094550
$InputFile_RN_RQT=WHG_STATUS_SITEENTSEQCHAIN_20110901999999.dat/edwload/rqt/workingdir/restriction/WHG_STATUS_SITEENTSEQCHAIN_20110901094550.dat