wget - work with arguments - sed

I have a list of URI: uri.txt with
category1/image1.jpeg
category1/image32.jpeg
category2/image1.jpeg
and so on. I need to download them from the domain example.com with wget, changing the saved filename to categoryX-imageY.jpeg.
I understand that I should read uri.txt line by line, add "http://example.com/" in front of each line, and change "/" to "-" in each line.
What I have now:
Reading from uri.txt [work]
Adding domain name in front of each URI [work]
Change filename to save [fail]
I'm trying to do this with:
wget 'http://www.example.com/{}' -O '`sed "s/\//-/" {}`' < uri.txt
but wget fails (it depends what type of quotation sign I'm using: ` or ') with:
wget: option requires an argument -- 'O'
or
sed `s/\//-/` category1/image1.jpeg: No such file or directory
sed `s/\//-/` category1/image32.jpeg: No such file or directory
Could you tell me what I'm doing wrong?

Here is how I would do that:
while IFS= read -r LINE ; do
    wget "http://example.com/$LINE" -O "$(echo "$LINE" | sed 's=/=-=')"
done < uri.txt
In other words, read uri.txt line by line (each line is placed in the bash variable $LINE), then run wget and save the result under the modified name. I use a different sed delimiter (=) to avoid escaping / and to make the expression more readable.
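Before downloading anything, it can help to preview the commands the loop would run. The following is a minimal sketch, not the answerer's code: the helper names uri_to_filename and preview_downloads are made up for illustration, and tr replaces every "/" (the sed expression above replaces only the first one).

```shell
# Hypothetical helper: "category1/image1.jpeg" -> "category1-image1.jpeg".
uri_to_filename() {
    printf '%s\n' "$1" | tr '/' '-'
}

# Dry run: print one wget command per input line instead of executing it.
preview_downloads() {
    while IFS= read -r line; do
        [ -n "$line" ] || continue   # skip empty lines
        printf 'wget %s -O %s\n' "http://example.com/$line" "$(uri_to_filename "$line")"
    done
}

printf 'category1/image1.jpeg\ncategory2/image1.jpeg\n' | preview_downloads
```

Once the printed commands look right, the printf inside the loop can be swapped for the real wget call.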

When I want to construct a list of args to be executed, I like to use xargs:
sed 's#\(.*\)/\(.*\)#http://example.com/\1/\2 -O \1-\2#' uri.txt | xargs -L 1 wget
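To check the argument lists before any download happens, the same sed expression can be fed to echo instead of wget. This is only a sketch: build_wget_args is a hypothetical wrapper name, and xargs -L 1 is used so that each input line becomes the arguments of one command.

```shell
# Hypothetical wrapper around the sed expression: one line of wget arguments per URI.
build_wget_args() {
    sed 's#\(.*\)/\(.*\)#http://example.com/\1/\2 -O \1-\2#'
}

# Dry run: xargs hands the words of each line to "echo wget" as separate arguments.
printf 'category1/image1.jpeg\ncategory2/image1.jpeg\n' | build_wget_args | xargs -L 1 echo wget
```

This approach assumes the URIs contain no whitespace or quote characters, since xargs would split or interpret them.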

Related

Remove space between a keyword and parenthesis

I have nearly 300 files in 60 folders.
As per the C++ coding guidelines, I need to replace the lines below in *.cpp and *.cl files (that is, remove the extra space after for and if):
for (* .....)
with
for(* .....)
and also
if (* .....)
with
if(* .....)
Can anyone suggest a grep command to do this search and replace in all the files?
Edited:
I tried with below commands:
sed -i 's/for (/for(/g' *.cpp
But got error like below:
sed: can't read *.cpp: No such file or directory
I think you need the sed command (stream editor; see man sed on your machine). It is more suitable for editing files.
sed -i -E 's/(for|if)[ ]+(\(.*\))/\1\2/g'
Let me explain:
-i stands for in-place: all changes are made and saved in the file itself
-E is needed to use extended regular expressions with sed
s/(for|if)[ ]+(\(.*\))/\1\2/g
s stands for substitute
/ is a separator between the parts of the command. Between the first and second / is the pattern to find; between the second and third / is the replacement text. Here \1 and \2 refer to the first and second parenthesized groups, so the match is written back without the spaces in between.
g at the end stands for global: every match on a line is replaced, not just the first one.
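The effect of the substitution can be tried on a few sample lines without touching any file. A small sketch, where strip_keyword_space is a made-up wrapper name and the sample lines are invented for illustration:

```shell
# Hypothetical wrapper: apply the substitution to standard input instead of a file.
strip_keyword_space() {
    sed -E 's/(for|if)[ ]+(\(.*\))/\1\2/g'
}

# Lines with "for (" or "if (" lose the space; other lines pass through unchanged.
printf '%s\n' 'for (int i = 0; i < n; ++i)' 'if  (ready)' 'while (true)' | strip_keyword_space
```

Note that the pattern matches any word ending in for or if, so it is worth reviewing the result before running it with -i on real sources.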
How to apply to every file that you need?
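This has already been asked elsewhere. Your error occurred because the shell found no *.cpp files in the current directory, so sed received the literal name *.cpp; find walks the subfolders instead. In the end, run the following command in the directory where your files are stored:
find ./ -type f \( -name '*.cpp' -o -name '*.cl' \) -exec sed -i -E 's/(for|if)[ ]+(\(.*\))/\1\2/g' {} \;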
I hope, this will help:)
I have created the file "brol.txt", with following content:
for (correct
for(wrong
if (correct
if(wrong
I have launched following grep command:
grep -E "for \(|if \(" brol.txt
With following result:
for (correct
if (correct
Explanation:
grep -E means extended grep (it allows searching for expression1 OR expression2,
separated by a pipe character)
\( means a search for a literal round bracket. The backslash is an escape character.

Search and replace with grep and perl

I need to replace "vi_chiron" with "vi_huracan" from all the files below. I am using the following command line and also have changed the permission to full access for all the underlying files:
grep -ri "vi_chiron" ./ | grep -v Header | xargs perl -e "s/vi_chiron/vi_huracan/" -pi
I am getting the error:
"Can't open ./build/drivers_file.min:#: No such file or directory."
and many other similar errors. Any idea why? Below is the permission for the file:
ll build/drivers_file.min
-rwxrwxrwx 1 ask vi 5860 Mar 13 12:07 build/drivers_file.min
You can change your command in one of the following ways:
grep -ril "vi_chiron" . | xargs grep -L Header | xargs perl -e "s/vi_chiron/vi_huracan/" -pi
or
find . -type f -exec grep -il "vi_chiron" {} + | xargs grep -L Header | xargs perl -e "s/vi_chiron/vi_huracan/" -pi
Both commands edit only the files that contain vi_chiron but do not contain Header, changing vi_chiron into vi_huracan. Your original command failed because grep without -l prints file:matching-line pairs, and xargs passed those pieces to perl as if they were file names.
If I were you, I would simplify the chain and do it the other way around:
grep -rL Header . | xargs sed -i.bak 's/vi_chiron/vi_huracan/'
If you are confident enough, change -i.bak to -i so that sed does not keep any backups.
Also note that this substitution is not global; to replace every occurrence on a line, use sed -i.bak 's/vi_chiron/vi_huracan/g' instead.
'-i[SUFFIX]' '--in-place[=SUFFIX]'
This option specifies that files are to be edited in-place. GNU
'sed' does this by creating a temporary file and sending output to
this file rather than to the standard output.
This option implies '-s'.
When the end of the file is reached, the temporary file is renamed
to the output file's original name. The extension, if supplied, is
used to modify the name of the old file before renaming the
temporary file, thereby making a backup copy.
This rule is followed: if the extension doesn't contain a '*', then
it is appended to the end of the current filename as a suffix; if
the extension does contain one or more '*' characters, then _each_
asterisk is replaced with the current filename. This allows you to
add a prefix to the backup file, instead of (or in addition to) a
suffix, or even to place backup copies of the original files into
another directory (provided the directory already exists).
If no extension is supplied, the original file is overwritten
without making a backup.
source: https://www.gnu.org/software/sed/manual/sed.txt
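The backup behaviour can be tried safely in a scratch directory. The sketch below is an illustration under assumptions: the file names a.txt and b.txt and their contents are invented, and grep -L (list files with no matching line) and sed -i with an attached suffix are assumed to be available, as they are in the GNU tools.

```shell
# Scratch directory with one file that lacks "Header" and one that has it.
demo_dir=$(mktemp -d)
printf 'name = vi_chiron\n' > "$demo_dir/a.txt"
printf 'Header\nname = vi_chiron\n' > "$demo_dir/b.txt"

# grep -rL lists files with no line matching "Header";
# sed -i.bak edits each of them in place and keeps the original as FILE.bak.
grep -rL Header "$demo_dir" | xargs sed -i.bak 's/vi_chiron/vi_huracan/g'

cat "$demo_dir/a.txt"       # edited copy
cat "$demo_dir/a.txt.bak"   # backup with the original content
cat "$demo_dir/b.txt"       # not edited, because it contains Header
```

Remove the scratch directory with rm -r "$demo_dir" when done.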

Wget: Filenames without the query string

I want to download a list of webpages from a file. How can I stop Wget appending the query strings on to the saved files?
wget http://www.example.com/index.html?querystring
I need this to be downloaded as index.html, not index.html?querystring
There is the -O option:
wget -O file.html http://www.example.com/index.html?querystring
so you can alter your script a little to pass the right file name to the -O argument.
I've finally resigned myself to using -O and just wrapped it in a bash function to make it easier. I put this in my ~/.bashrc file:
wget-rmq ()
{
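# Stop early when no URL is given.
[ -z "$1" ] && { echo 'error: wget-rmq requires a URL to retrieve as the first arg' >&2; return 1; }
# Drop the query string, then everything up to the last slash (works for http and https).
local output_filename="$(echo "$1" | sed -e 's/?.*//' -e 's|.*/||')"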
wget -O "${output_filename}" "${1}"
}
Then when I want to download a file:
wget-rmq http://www.example.com/index.html?querystring
The replacement regex is fairly simple: if a ? appears in the URL before the query string begins, it will break. In practice that hasn't happened, since a literal ? elsewhere in a URL has to be percent-encoded as %3F, but I wanted to note the possibility.
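The same file name can also be derived without sed, using only shell parameter expansion. This is a sketch of an alternative, not the function above; url_to_filename is a hypothetical name.

```shell
# Hypothetical helper: print the file name part of a URL, without the query string.
url_to_filename() {
    url_no_query=${1%%\?*}                  # drop everything from the first "?" onward
    printf '%s\n' "${url_no_query##*/}"     # drop everything up to the last "/"
}

url_to_filename 'http://www.example.com/index.html?querystring'
```

A URL that ends in a slash yields an empty name here, so a caller would need a fallback such as index.html for that case.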

Extracting the contents between two different strings using bash or perl

I have tried to scan through the other posts on Stack Overflow for this, but couldn't get my code to work, hence I am posting a new question.
Below is the content of file temp.
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/<env:Body><dp:response xmlns:dp="http://www.datapower.com/schemas/management"><dp:timestamp>2015-01-
22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>
This file contains the base64-encoded contents of two files named test.txt and test1.txt. I want to extract the base64-encoded content of each file into separate files, test.txt and test1.txt respectively.
To achieve this, I have to remove the xml tags around the base64 contents. I am trying below commands to achieve this. However, it is not working as expected.
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g' > test.txt
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g' > test1.txt
Below command:
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g'
produces output:
XJzLXJlc3VsdHMtYWN0aW9uX18i
<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:response> </env:Body></env:Envelope>
However, in the output I am expecting only the first line, XJzLXJlc3VsdHMtYWN0aW9uX18i. Where am I making a mistake?
When I run the command below, I get the expected output:
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g'
It produces below string
lc3VsdHMtYWN0aW9uX18i
I can then easily route this to test1.txt file.
UPDATE
I have edited the question to update the source file content. The source file doesn't contain any newline character, and the current solution does not work in that case; I have tried it and it failed. wc -l temp must output 1.
OS: solaris 10
Shell: bash
sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' temp
I added \1 -> to show the link from file name to content; for the content only, just remove this part.
This is the POSIX version, so on GNU sed use --posix.
It assumes that the base64-encoded content is on the same line as the surrounding tag (not spread over several lines, which would need some modification).
Thanks to JID for the full explanation below.
How it works
sed -n
The -n means no printing so unless explicitly told to print, then there will be no output from sed
's_
This is to substitute the following regex using _ to separate regex from the replacement.
<dp:file name=
Regular text
"\([^"]*\)"
The escaped parentheses form a capture group; they must be escaped unless the -r option is used (-r is not available in POSIX sed). Everything inside the parentheses is captured. [^"]* means 0 or more occurrences of any character that is not a quote. So really this just captures anything between the two quotes.
>\([^<]*\)<
Again uses the capture group this time to capture everything between the > and <
.*
Everything else on the line
_\1 -> \2
This is the replacement, so replace everything in the regex before with the first capture group then a -> and then the second capture group.
_p
Means print the line
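Because the substitution consumes the rest of the line with .*, it reports only the first dp:file element when the whole document sits on one line. One way around that, sketched here under assumptions, is to split the stream at every < with tr first, so each element lands on its own line. The name extract_files and the shortened sample string are invented for illustration.

```shell
# Hypothetical helper: print "name -> content" for every dp:file element on stdin.
extract_files() {
    # After tr, each tag starts a new line, e.g. dp:file name="...">CONTENT
    tr '<' '\n' | sed -n 's_^dp:file name="\([^"]*\)">\(.*\)_\1 -> \2_p'
}

sample='<dp:response><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response>'
printf '%s\n' "$sample" | extract_files
```

This relies on base64 text never containing < and is a text-level shortcut, not a real XML parser.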
Resources
http://unixhelp.ed.ac.uk/CGI/man-cgi?sed
http://www.grymoire.com/Unix/Sed.html
/usr/xpg4/bin/sed works well here.
/usr/bin/sed does not work as expected if the file contains just one line.
The command below works for a file containing only a single line.
/usr/xpg4/bin/sed -n 's_<env:Envelope\(.*\)<dp:file name="temporary://BackUpDir/backupmanifest.xml">\([^>]*\)</dp:file>\(.*\)_\2_p' securebackup.xml 2>/dev/null
Without 2>/dev/null, this sed command outputs the warning sed: Missing newline at end of file.
This is because of the following:
Solaris's default sed ignores the last line so as not to break existing scripts, because a line was required to be terminated by a newline in the original Unix implementation.
GNU sed has a more relaxed behavior, and the POSIX implementation accepts the fact but outputs a warning.
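A possible workaround, given here only as an untested-on-Solaris sketch, is to make sure the input ends with a newline before it reaches sed. The helper name with_final_newline and the scratch file are assumptions made for this example.

```shell
# Hypothetical helper: print a file, adding a final newline only if it is missing.
with_final_newline() {
    cat "$1"
    # $(...) strips a trailing newline, so the result is non-empty
    # only when the last byte of the file is some other character.
    if [ -n "$(tail -c 1 "$1")" ]; then
        echo
    fi
}

# Scratch file without a terminating newline.
nl_demo=$(mktemp)
printf 'single line without newline' > "$nl_demo"
with_final_newline "$nl_demo" | sed -n 's/single/one/p'
```

With real data the output would be piped into the sed command above instead, for example with_final_newline securebackup.xml | sed -n '...'.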

Appending URL to output file with wget

I'm using wget to read a batch of URLs from an input file and download everything to a single output file, and I'd like to append each URL before its downloaded content. Does anyone know how to do that?
Thanks!
afaik wget does not directly support the use case you are envisioning. however, using standard tools, you can emulate this feature.
we will proceed as follows:
call wget with logging enabled
let sed process the log executing the script detailed below
execute the transformation result as a shell/batch script
conventions: use the following filenames:
wgetin.txt: the file with the urls to fetch using wget
wgetout.sed: sed script
wgetout.final: the final result
wgetass.sh/.cmd: shell/batch script to assemble the downloaded files weaving in the url data
wget.log: the log file of the wget call
Linux
the sed script (linux):
# delete lines _not_ matching the regex
/^\(Saving to: .\|--[0-9: \-]\+-- \)/! { d; }
# turn remaining content into something else
s/^--[0-9: \-]\+-- \(.*\)$/echo '\1\n' >>wgetout.final/
s/^Saving to: .\(.*\).$/cat '\1' >>wgetout.final/
the command line (linux):
rm -f wgetout.final wgetass.sh; wget -i wgetin.txt -o wget.log; sed -f wgetout.sed wget.log >wgetass.sh && chmod 755 wgetass.sh && ./wgetass.sh
Windows
the syntax for windows batch scripts is slightly different. of course, the windows ports of wget and sed have to be installed first.
the sed script (windows):
# delete lines _not_ matching the regex
/^\(Saving to: .\|--[0-9: \-]\+-- \)/! { d; }
# turn remaining content into something else
s/^--[0-9: \-]\+-- \(.*\)$/echo "\1" >>wgetout.final/
s/^Saving to: .\(.*\).$/type "\1" >>wgetout.final/
the command line (windows):
del wgetout.final && del wgetass.cmd && wget -i wgetin.txt -o wget.log && sed -f wgetout.sed wget.log >wgetass.cmd && wgetass.cmd
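on a system with a posix shell, a simpler route that skips the log parsing is to fetch the urls one at a time and write each url before its content. this is a different technique from the log-based one above, shown only as a sketch: the function names fetch and append_with_urls are made up, and wget -q -O - is used to send each download to standard output.

```shell
# fetch: write the body of one URL to standard output.
fetch() {
    wget -q -O - "$1"
}

# append_with_urls URL_LIST OUTPUT: for each URL, write the URL line, then its content.
append_with_urls() {
    : > "$2"                         # start with an empty output file
    while IFS= read -r url; do
        [ -n "$url" ] || continue    # skip blank lines
        printf '%s\n' "$url" >> "$2"
        fetch "$url" >> "$2"
    done < "$1"
}

# Usage with the file names from the conventions above:
#   append_with_urls wgetin.txt wgetout.final
```

keeping the download in its own fetch function makes it easy to swap in another tool or a stub for testing.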