wget distinguish header from body in output - wget

In the output of
wget www.google.com --save-headers --output-document - --quiet
how can you tell which lines are the headers and where the body starts (e.g., to tee the different parts into different pipelines)?
Update
# r=$(wget www.google.com --save-headers --output-document - --quiet)
# status=$(echo $r | grep HTTP | awk '{ print $2 }')
# body=$(echo $r | awk '{ if( body ){ print $0 };if( $0 ~ /^$/ ){ body=1 } }')
However, $body is empty.
Update 2
body=$(echo "$r" | awk '{ if( $1 ~ /^[\s\r\n]*$/ ) { b=1 }; if( b ) { print $0 } }')
Quotes around $r. What a bugger.

how can you tell which lines are the headers and where the body starts
RFC1945 stipulates that
The entity body is separated from the headers by a null line (i.e., a
line with nothing preceding the CRLF).
So in an HTTP response the headers come before the first blank line and the body comes after it. The --save-headers option of GNU wget follows suit:
Save the headers sent by the HTTP server to the file, preceding the
actual contents, with an empty line as the separator.
Since CRLF line endings are used, the headers are everything before the first CRLFCRLF (\r\n\r\n) and the body is everything after it. I would use Python for this part, as follows. First download the response to a file named response:
wget www.example.com --save-headers --output-document response --quiet
then create splitter.py as follows
with open("response", "rb") as f:
    headers, body = f.read().split(b"\r\n\r\n", 1)
with open("headers", "wb") as f:
    f.write(headers)
    f.write(b"\r\n")
with open("body", "wb") as f:
    f.write(body)
and run it
python splitter.py
I use binary (b) mode so it works with any encoding, and I write \r\n after the headers because the split removes the CRLF that ends the last key-value pair. Feel free to use any other tool you are comfortable with to do the split.
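For the original goal of getting the status code and the body into separate variables, the same split can be done in memory. The following is only a sketch: the function name split_response and the sample bytes are made up for illustration, and it assumes the whole response fits in memory.

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (status_code, header_lines, body)."""
    # Headers end at the first empty line, i.e. the first CRLF CRLF.
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating headers from body")
    lines = head.split(b"\r\n")
    # The status line looks like: HTTP/1.1 200 OK
    status_code = int(lines[0].split()[1].decode("ascii"))
    return status_code, lines[1:], body


# Hypothetical sample response; note the body itself contains a blank line.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\r\n\r\n</html>"
status, headers, body = split_response(raw)
print(status, headers, body)
```

Using partition rather than an unbounded split means a blank line inside the body is left alone, which is the same reason the script above passes 1 as the split limit.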

r=$(wget www.example.com --save-headers --quiet --load-cookies /root/cookies.txt --save-cookies /root/cookies.txt --keep-session-cookies --output-document - 2>/dev/null )
status=$(echo "$r" | grep HTTP | awk '{ print $2 }')
if [ "$status" = "200" ]; then
body=$(echo "$r" | awk '{ if( body ){ print $0 };if( $0 ~ /^[\s\r\n]*$/ ){ body=1 } }')
else
exit 1
fi

Related

How to remove YAML frontmatter from markdown files?

I have markdown files that contain YAML frontmatter metadata, like this:
---
title: Something Somethingelse
author: Somebody Sometheson
---
But the YAML is of varying lengths. Can I use a POSIX command like sed to remove that frontmatter when it's at the beginning of a file? Something that just removes everything between --- and ---, inclusive, but also ignores the rest of the file, in case there are ---s elsewhere.
I understand your question to mean that you want to remove the first ----enclosed block if it starts at the first line. In that case,
sed '1 { /^---/ { :a N; /\n---/! ba; d} }' filename
This is:
1 { # in the first line
/^---/ { # if it starts with ---
:a # jump label for looping
N # fetch the next line, append to pattern space
/\n---/! ba; # if the result does not contain \n--- (that is, if the last
# fetched line does not begin with ---), go back to :a
d # then delete the whole thing.
}
}
# otherwise drop off the end here and do the default (print
# the line)
Depending on how you want to handle lines that begin with ---abc or so, you may have to change the patterns a little (perhaps add $ at the end to only match when the whole line is ---). I'm a bit unclear on your precise requirements there.
If you want to remove only the front matter, you could simply run:
sed '1{/^---$/!q;};1,/^---$/d' infile
If the first line doesn't match ---, sed prints that line and quits, so use this only on files that are known to start with front matter; otherwise it deletes everything from the 1st line up to (and including) the next line matching --- (i.e. the entire front matter).
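The first-line-anchored removal described in the sed answers can also be sketched in Python. This is an illustration only: strip_front_matter is a hypothetical helper, and returning the text unchanged when no closing --- is found is an assumption of this sketch, not something the sed commands do.

```python
def strip_front_matter(text: str) -> str:
    """Remove a leading ----delimited block; leave everything else alone."""
    lines = text.splitlines(keepends=True)
    # Front matter only counts if the very first line is exactly ---
    if not lines or lines[0].rstrip("\r\n") != "---":
        return text
    for i in range(1, len(lines)):
        if lines[i].rstrip("\r\n") == "---":
            # Drop everything up to and including the closing delimiter.
            return "".join(lines[i + 1:])
    # Assumption: with no closing delimiter, return the text unchanged.
    return text


doc = "---\ntitle: Something\nauthor: Somebody\n---\nbody\n---\nmore\n"
print(strip_front_matter(doc), end="")
```

Later --- lines in the body are kept because the loop returns at the first closing delimiter.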
If you don't mind the "or something" being perl.
Simply print after two instances of "---" have been found:
perl -ne 'if ($i > 1) { print } else { /^---/ && $i++ }' yaml
or a bit shorter if you don't mind abusing ?: for flow control:
perl -ne '$i > 1 ? print : /^---/ && $i++' yaml
Be sure to include -i if you want to replace inline.
You can use a bash script: create script.sh, make it executable with chmod +x script.sh, and run it with ./script.sh.
#!/bin/bash
#folder articles contains a lot of markdown files
for f in ./articles/*.md
do
    #filename
    echo "${f##*/}"
    #replace frontmatter title attribute to "title"
    #("$f" is quoted so file names with spaces survive)
    sed -i -r 's/^title: (.*)$/title: "\1"/' "$f"
    #...
done
This AWK-based solution works for files with and without FrontMatter, doing nothing in the latter case.
#!/bin/sh
# Strips YAML FrontMatter from a file (usually Markdown).
# Exit immediately on each error;
# see: https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -Ee
print_help() {
echo "Strips YAML FrontMatter from a file (usually Markdown)."
echo
echo "Usage:"
echo " `basename $0` -h"
echo " `basename $0` --help"
echo " `basename $0` -i <file-with-front-matter>"
echo " `basename $0` --in-place <file-with-front-matter>"
echo " `basename $0` <file-with-front-matter> <file-to-be-without-front-matter>"
}
replace=false
in_file="-"
out_file="/dev/stdout"
if [ -n "$1" ]
then
if [ "$1" = "-h" ] || [ "$1" = "--help" ]
then
print_help
exit 0
elif [ "$1" = "-i" ] || [ "$1" = "--in-place" ]
then
replace=true
in_file="$2"
out_file="$in_file"
else
in_file="$1"
if [ -n "$2" ]
then
out_file="$2"
fi
fi
fi
tmp_out_file="$out_file"
if $replace
then
tmp_out_file="${in_file}_tmp"
fi
awk -e '
BEGIN {
is_first_line=1;
in_fm=0;
}
/^---$/ {
if (is_first_line) {
in_fm=1;
}
}
{
if (! in_fm) {
print $0;
}
}
/^(---|\.\.\.)$/ {
if (! is_first_line) {
in_fm=0;
}
is_first_line=0;
}
' "$in_file" >> "$tmp_out_file"
if $replace
then
mv "$tmp_out_file" "$out_file"
fi

hash using sha1sum using awk

I have a "pipe-separated" file that has about 20 columns. I want to just hash the first column which is a number like account number using sha1sum and return the rest of the columns as is.
What's the best way to do this using awk or sed?
Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...
Above is an example of the text file showing just 3 columns. Only the first column should have the hash function applied to it. The result should look like:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
What the Best Way™ is is up for debate. One way to do it with awk is
awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\ -f 1"); command | getline hash; close(command); $1 = hash; print }' filename
That is
BEGIN {
OFS = FS # set output field separator to field separator; we will use
# it because we meddle with the fields.
}
NR == 1 { # first line: just print headers.
print
}
NR != 1 { # from there on do the hash/replace
# this constructs a shell command (and runs it) that echoes the field
# (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
# and gets it back into awk with getline (into the variable hash)
# the gsub bit is to prevent the shell from barfing if there's an apostrophe
# in one of the fields.
gsub(/'/, "'\\''", $1);
command = ("echo '" $1 "' | sha1sum -b | cut -d\\ -f 1")
command | getline hash
close(command)
# then replace the field and print the result.
$1 = hash
print
}
You will notice the differences between the shell command at the top and the awk code at the bottom; that is all due to shell expansion. Because I put the awk code in single quotes in the shell commands (double quotes are not up for debate in that context, what with $1 and all), and because the code contains single quotes, making it work inline leads to a nightmare of backslashes. Because of this, my advice is to put the awk code into a file, say foo.awk, and run
awk -F'|' -f foo.awk filename
instead.
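If leaving awk is an option, the same transformation can be sketched with Python's hashlib, which avoids starting a shell and a sha1sum process for every row. This is a sketch under assumptions: hash_first_column is a made-up name, and the value is hashed with a trailing newline to mirror what echo VALUE | sha1sum sees.

```python
import hashlib


def hash_first_column(lines, sep="|"):
    """Return the lines with column 1 replaced by its SHA-1, header untouched."""
    out = []
    for n, line in enumerate(lines):
        line = line.rstrip("\n")
        if n == 0:
            # First line: just keep the headers.
            out.append(line)
            continue
        first, found_sep, rest = line.partition(sep)
        # echo appends a newline, so hash the value plus "\n" to mirror the pipeline.
        digest = hashlib.sha1((first + "\n").encode()).hexdigest()
        out.append(digest + found_sep + rest)
    return out


rows = ["Accountid|Time|Category", "8238438|20140101021301|sub1"]
for row in hash_first_column(rows):
    print(row)
```

Because nothing is passed through a shell, apostrophes or other special characters in the field need no escaping here.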
Here's an awk executable script that does what you want:
#!/usr/bin/awk -f
BEGIN { FS=OFS="|" }
FNR != 1 { $1 = encodeData( $1 ) }
47
function encodeData( fld ) {
cmd = sprintf( "echo %s | sha1sum", fld )
cmd | getline output
close( cmd )
split( output, arr, " " )
return arr[1]
}
Here's the flow break down:
Set the input and output field separators to |
When the row isn't the first (header) row, re-assign $1 to an encoded value
Print the entire row when 47 is true (always)
Here's the encodeData function break down:
Create a cmd to feed data to sha1sum
Feed it to getline
Close the cmd
On my system, there's extra info after sha1sum, so I discard it by splitting the output
Return the first field of the sha1sum output.
With your data, I get the following:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
Run it by calling awk.script data (or ./awk.script data from bash).
EDIT by EdMorton:
Sorry for the edit. Your script above is the right approach but needs some tweaks to make it more robust, and showing them is much easier than trying to describe them in a comment:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }
function encodeData( fld, cmd, output ) {
cmd = "echo \047" fld "\047 | sha1sum"
if ( (cmd | getline output) > 0 ) {
sub(/ .*/,"",output)
}
else {
print "failed to hash " fld | "cat>&2"
output = fld
}
close( cmd )
return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
The f[] array decouples your script from hard-coding the number of the field that needs to be hashed. The additional args for your function make those variables local, and so always null/zero on each invocation. The if on getline means you won't return the previous success value if it fails (see http://awk.info/?tip/getline). The rest is maybe more style/preference with a bit of a performance improvement.

Finding multiple strings on multiple lines in file and manipulating output with bash/perl

I am trying to get the version numbers for content management systems being hosted on my server. I can do this fairly simply if the version number is stored on one line with something like this:
grep -r "\$wp_version = '" /home/
Which returns exactly what I want to stdout:
/home/$RANDOMDOMAIN/wp-includes/version.php:$wp_version = '3.7.1';
The issue I run into is when I start looking for version numbers that are stored on two or more lines, like Joomla! or Magento which use the following formats respectively:
Joomla!:
/** @var string Release version. */
public $RELEASE = '3.2';
/** @var string Maintenance version. */
public $DEV_LEVEL = '3';
Magento:
'major' => '1',
'minor' => '8',
'revision' => '1',
'patch' => '0',
I have gotten it to 'work', in a way, using the following (With this method if, for whatever reason, one of the strings I am looking for is missing the whole command becomes useless since xargs -l3 is expecting 2 rows above the path provided by -print):
find /home/ -type f -name version.php -exec grep " \$RELEASE " '{}' \; -exec grep " \$DEV_LEVEL " '{}' \; -print | xargs -l3 | sed 's/\<var\>\s//g;s/\<public\>\s//g' | awk -F\; '{print $3":"$1""$2}' | sed 's/ $DEV_LEVEL = /./g'
Which gets me output like this:
/home/$RANDOMDOMAIN/version.php:$RELEASE = 3.2.3
/home/$RANDOMDOMAIN/anotherfolder/version.php:$RELEASE = 1.5.0
I also have a working for loop that WILL exclude any file that does not contain both strings, but depending how much it has to sift through, can take significantly longer than the find one liner above:
for path in $(grep -rl " \$RELEASE " /home/ 2> /dev/null | xargs grep -rl " \$DEV_LEVEL ")
do
joomlaver="$path"
joomlaver+=$(grep " \$RELEASE " $path)
joomlaver+=$(echo " \$DEV_LEVEL = '$(grep " \$DEV_LEVEL " $path | cut -d\' -f2)';")
echo "$joomlaver" | sed 's/\<var\>\s//g;s/\<public\>\s//g;s/;//g' | awk -F\' '{ print $1""$2"."$4 }' | sed 's/\s\+//g'
unset joomlaver
done
Which gets me output like this:
/home/$RANDOMDOMAIN/version.php$RELEASE=3.2.3
/home/$RANDOMDOMAIN/anotherfolder/version.php$RELEASE=1.5.0
But I have to believe there is a simpler, shorter, more elegant way. Bash is preferred or if it can somehow be done with a perl one liner, that would work as well. Any and all help would be much appreciated. Thanks in advance. (Sorry for all the edits, but I am trying to figure this out myself as well!)
Here is a perl one-liner that will extract the $RELEASE and $DEV_LEVEL from the php file format you showed:
perl -ne '$v=$1 if /\$RELEASE\s*=\s*\047([0-9.]+)\047/; $devlevel=$1 if /\$DEV_LEVEL\s*=\s*\047([0-9.]+)\047/; if (defined $v && defined $devlevel) { print "$ARGV: Release=$v Devlevel=$devlevel\n"; last; }'
The -n makes perl effectively wrap the whole thing inside a while (<>) { } loop. Each line is checked against two regexes. If both of them have matched then it will print the result and exit.
The \047 is used to match single quotes, otherwise the shell would get confused.
If it does not find a match, it does not print anything. Otherwise it prints something like this:
sample.php: Release=3.2 Devlevel=3
You would use it in combination with find to traverse down a directory structure. Because the one-liner stops reading at the first complete match (last), it should be run once per file, perhaps like this:
find . -name "*.php" -exec perl -ne '$v=$1 if /\$RELEASE\s*=\s*\047([0-9.]+)\047/; $devlevel=$1 if /\$DEV_LEVEL\s*=\s*\047([0-9.]+)\047/; if (defined $v && defined $devlevel) { print "$ARGV: Release=$v Devlevel=$devlevel\n"; last; }' {} \;
You could make a similar version for the other file format (Magento?) you mentioned.
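The two-regex idea can also be sketched in Python, which sidesteps the shell quoting of single quotes. The function name joomla_version and the joined "release.dev_level" output format are assumptions chosen to resemble the 3.2.3 output the question asks for.

```python
import re

# Same patterns as the Perl one-liner, with literal single quotes.
RELEASE_RE = re.compile(r"\$RELEASE\s*=\s*'([0-9.]+)'")
DEV_LEVEL_RE = re.compile(r"\$DEV_LEVEL\s*=\s*'([0-9.]+)'")


def joomla_version(source: str):
    """Return 'RELEASE.DEV_LEVEL' or None if either value is missing."""
    release = RELEASE_RE.search(source)
    dev_level = DEV_LEVEL_RE.search(source)
    if release is None or dev_level is None:
        return None
    return f"{release.group(1)}.{dev_level.group(1)}"


sample = """/** @var string Release version. */
public $RELEASE = '3.2';
/** @var string Maintenance version. */
public $DEV_LEVEL = '3';
"""
print(joomla_version(sample))
```

Returning None when one of the two strings is missing addresses the case the question describes, where a missing line made the xargs -l3 pipeline useless.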

Line after match of two files

I have a problem similar to last time.
This time I have a header file that looks like:
>random header 2
>random header name1
and my basefile
>random header name1
wonderfulstringwhatsoevergoeson
>random header 2
someotherline
Now the aim is to have the following output:
someotherline
wonderfulstringwhatsoevergoeson
So I want the line after the match from the basefile (and only that line, not the header).
Importantly, the output must keep the order of the header file.
Sort won't work, since it would produce alphabetical order, which must not happen.
I couldn't figure out how grep could compare two files and give just the line after the match.
This will do the job for you:
awk 'FNR==NR { a[$0]=FNR; i=FNR; next }
($0 in a) { t=$0; getline; b[a[t]]=$0 }
END { for(k=1;k<=i;k++) print b[k] }' head base
This should do it:
awk '
{ recs[NR] = $0 } # store the header lines in 1->(NR-FNR) and the basefile lines in ((NR-FNR)+1)->NR
END {
for (hdrNr=1; hdrNr<=(NR-FNR); hdrNr++) {
hdr = recs[hdrNr]
for (lineNr=(NR-FNR)+1; lineNr<=NR; lineNr++) {
line = recs[lineNr]
if (line == hdr) {
print recs[lineNr+1]
}
}
}
}
' header basefile
Following up on @Vijay's idea of just storing the matching lines in an array indexed by the order the headers are read in, here's how you'd do that without getline, without unnecessary variables, with meaningful variable names, and without printing blank lines for every unmatched header:
awk '
NR==FNR { hdr2nr[$0] = FNR; next }
hdrNr { hdrNr2line[hdrNr] = $0 }
{ hdrNr = hdr2nr[$0] }
END {
for(hdrNr=1; hdrNr<=(NR-FNR); hdrNr++)
if (hdrNr in hdrNr2line)
print hdrNr2line[hdrNr]
}
' header basefile
That assumes a given header can only appear once in basefile.
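The lookup-then-replay idea behind the awk answers can be sketched in Python as well. The name lines_after_headers is hypothetical, and the sketch makes the same assumption as above: each header appears once in basefile and is followed by exactly one line of interest.

```python
def lines_after_headers(header_lines, base_lines):
    """For each header, in header-file order, return the line that follows it in base_lines."""
    following = {}
    # Map every line of the basefile to the line directly after it.
    for current, nxt in zip(base_lines, base_lines[1:]):
        following.setdefault(current, nxt)  # keep the first occurrence only
    # Replay in header order; headers missing from the basefile are skipped.
    return [following[h] for h in header_lines if h in following]


header = [">random header 2", ">random header name1"]
basefile = [
    ">random header name1",
    "wonderfulstringwhatsoevergoeson",
    ">random header 2",
    "someotherline",
]
for line in lines_after_headers(header, basefile):
    print(line)
```

The output order comes from the header list, not from the basefile, so no sorting is involved.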
Reads basefile into %h hash, and later follows key order specified in header file,
perl -ne 'BEGIN{ open $F,pop or die $!; %h=<$F> } print $h{$_}' header basefile
Try this bash one-liner:
while read line; do match=$(sed -n "/$line/{ n;p}" basefile); echo $match; done < 'header'
This will work when your basefile always has a one-line definition for the corresponding header.
header:
sat:~# cat header
>random header 2
>random header name1
basefile:
sat:~# cat basefile
>random header name1
wonderfulstringwhatsoevergoeson
>random header 2
someotherline
Output:
sat:~# while read line; do match=$(sed -n "/$line/{ n;p}" basefile);echo $match; done < 'header'
someotherline
wonderfulstringwhatsoevergoeson
This might work for you (GNU sed):
sed -r 'N;s/^(.*)\n(.*)/s|^\1$|\2|/' base_file | sed -f - header_file
Turn the base_file into a sed script and run it against the header_file.

How to delete multiple empty lines with SED?

I'm trying to compress a text document by deleting duplicated empty lines with sed. This is what I'm doing (to no avail):
sed -i -E 's/\n{3,}/\n/g' file.txt
I understand that it's not correct, according to this manual, but I can't figure out how to do it correctly. Thanks.
I think you want to replace spans of multiple blank lines with a single blank line, even though your example replaces multiple runs of \n with a single \n instead of \n\n. With that in mind, here are two solutions:
sed '/^$/{ :l
N; s/^\n$//; t l
p; d; }' input
In many implementations of sed, that can be all on one line, with the embedded newlines replaced by ;.
awk 't || !/^$/; { t = !/^$/ }'
As tripleee suggested above, I'm using Perl instead of sed:
perl -0777pi -e 's/\n{3,}/\n\n/g'
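The whole-file substitution used by the Perl command works the same way in Python, because the text is handled as one string rather than line by line (which is why the original sed attempt never sees \n). A minimal sketch, with squeeze_blank_lines as a made-up name:

```python
import re


def squeeze_blank_lines(text: str) -> str:
    """Collapse runs of 3+ newlines (2+ blank lines) into exactly one blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


sample = "first\n\n\n\nsecond\n\nthird\n"
print(squeeze_blank_lines(sample), end="")
```

Single blank lines are left as they are, since they contain only two consecutive newlines.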
Use the translate function
tr -s '\n'
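The -s or --squeeze-repeats option reduces a sequence of a repeated character to a single instance. Note that a run of newlines collapses to a single newline, so this removes every empty line rather than leaving one.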
This is much better handled by tr -s '\n' or cat -s, but if you insist on sed, here's an example from section 4.17 of the GNU sed manual:
#!/usr/bin/sed -f
# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ {
N
bx
}
# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
I am not sure this is what the OP wanted but using the awk solution by William Pursell here is the approach if you want to delete ALL empty lines in the file:
awk '!/^$/' file.txt
Explanation:
The awk pattern
'!/^$/'
The inner regex /^$/ matches a line that consists only of the beginning of a line (symbolised by '^') and the end of a line (symbolised by '$'), in other words, an empty line. The leading ! negates the match, so the pattern is true for every non-empty line.
If this pattern is true awk applies its default and prints the current line.
HTH
I think OP wants to compress empty lines, e.g. where there are 9 consecutive empty lines, he wants to have just three.
I have written a little bash script that does just that:
#! /bin/bash
TOTALLINES="$(cat file.txt|wc -l)"
CURRENTLINE=1
while [ $CURRENTLINE -le $TOTALLINES ]
do
L1=$CURRENTLINE
L2=$(($L1 + 1))
L3=$(($L1 +2))
if [[ $(cat file.txt|head -n $L1|tail -n +$L1) == "" ]]||[[ $(cat file.txt|head -n $L1|tail -n +$L1) == " " ]]
then
L1EMPTY=true
else
L1EMPTY=false
fi
if [[ $(cat file.txt|head -n $L2|tail -n +$L2) == "" ]]||[[ $(cat file.txt|head -n $L2|tail -n +$L2) == " " ]]
then
L2EMPTY=true
else
L2EMPTY=false
fi
if [[ $(cat file.txt|head -n $L3|tail -n +$L3) == "" ]]||[[ $(cat file.txt|head -n $L3|tail -n +$L3) == " " ]]
then
L3EMPTY=true
else
L3EMPTY=false
fi
if [ $L1EMPTY = true ]&&[ $L2EMPTY = true ]&&[ $L3EMPTY = true ]
then
#do not cat line to temp file
echo "Skipping line "$CURRENTLINE
else
echo "$(cat file.txt|head -n $CURRENTLINE|tail -n +$CURRENTLINE)">>temp.txt
echo "Writing line " $CURRENTLINE
fi
((CURRENTLINE++))
done
cat temp.txt>file.txt
rm -r temp.txt
FINALTOTALLINES="$(cat file.txt|wc -l)"
EMPTYLINELINT=$(( $TOTALLINES - $FINALTOTALLINES ))
echo "Deleted " $EMPTYLINELINT " empty lines."