Appending URL to output file with wget

I'm using wget to read a batch of URLs from an input file and download everything to a single output file, and I'd like to write each URL just before its downloaded content. Does anyone know how to do that?
Thanks!

AFAIK, wget does not directly support the use case you are envisioning. However, you can emulate this feature using standard tools.
We will proceed as follows:
call wget with logging enabled
let sed process the log, executing the script detailed below
execute the transformation result as a shell/batch script
Conventions: use the following filenames:
wgetin.txt: the file with the URLs to fetch using wget
wgetout.sed: the sed script
wgetout.final: the final result
wgetass.sh/.cmd: the shell/batch script that assembles the downloaded files, weaving in the URL data
wget.log: the log file of the wget call
Linux
The sed script (Linux):
# delete lines _not_ matching the regex
/^\(Saving to: .\|--[0-9: \-]\+-- \)/! { d; }
# turn remaining content into something else
s/^--[0-9: \-]\+-- \(.*\)$/echo '\1\n' >>wgetout.final/
s/^Saving to: .\(.*\).$/cat '\1' >>wgetout.final/
The command line (Linux):
rm -f wgetout.final wgetass.sh; wget -i wgetin.txt -o wget.log; sed -f wgetout.sed wget.log >wgetass.sh && chmod 755 wgetass.sh && ./wgetass.sh
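If you don't need the log-parsing machinery, a plain shell loop achieves the same weaving in one pass. This is a minimal sketch, assuming wgetin.txt holds one URL per line:
while read -r url; do
    printf '%s\n' "$url" >>wgetout.final   # write the URL first
    wget -q -O - "$url" >>wgetout.final    # then append its downloaded content
done <wgetin.txt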
Windows
The syntax for Windows batch scripts is slightly different. Of course, the Windows ports of wget and sed have to be installed first.
The sed script (Windows):
# delete lines _not_ matching the regex
/^\(Saving to: .\|--[0-9: \-]\+-- \)/! { d; }
# turn remaining content into something else
s/^--[0-9: \-]\+-- \(.*\)$/echo "\1" >>wgetout.final/
s/^Saving to: .\(.*\).$/type "\1" >>wgetout.final/
The command line (Windows):
del wgetout.final & del wgetass.cmd & wget -i wgetin.txt -o wget.log && sed -f wgetout.sed wget.log >wgetass.cmd && wgetass.cmd

Related

How to use 'sed' to find and replace values within a tsv file?

I am currently working with a large .tsv.gz file that contains two columns and looks something like this:
xxxyyy 408261
yzlsdf 408260null408261
zlkajd 408258null408259null408260
asfzns 408260
What I'd like to do is find all the rows that contain "null" and replace it with a comma ",". So that the result would look like this:
xxxyyy 408261
yzlsdf 408260,408261
zlkajd 408258,408259,408260
asfzns 408260
I have tried using the following command, but it did not work:
sed -i 's/null/,/g' 46536657_1748327588_combined_copy.tsv.gz
Unzipping the file and trying the command again on the plain .tsv did not work either.
I've also tried opening the unzipped file in a text editor to find and replace manually, but the file is so huge that the editor crashes.
Try:
zcat comb.tsv.gz | sed 's/null/,/g' | gzip >new_comb.tsv.gz && mv new_comb.tsv.gz comb.tsv.gz
sed cannot edit the .gz in place because it would see only compressed bytes, in which the literal text "null" never appears. zcat decompresses on the fly and streams plain text into sed, and because the data flows through the pipe a piece at a time, the file is never unzipped all at once; this should save on memory.
Example
Let's start with this sample file:
$ zcat comb.tsv.gz
xxxyyy 408261
yzlsdf 408260null408261
zlkajd 408258null408259null408260
asfzns 408260
Next, we run our command:
$ zcat comb.tsv.gz | sed 's/null/,/g' | gzip >new_comb.tsv.gz && mv new_comb.tsv.gz comb.tsv.gz
By looking at the output file, we can see that the substitutions were made:
$ zcat comb.tsv.gz
xxxyyy 408261
yzlsdf 408260,408261
zlkajd 408258,408259,408260
asfzns 408260
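If you would rather keep the original until you have checked the result, write to a differently named file and rename it only after verifying. A sketch, with comb_fixed.tsv.gz as a hypothetical temporary name:
zcat comb.tsv.gz | sed 's/null/,/g' | gzip >comb_fixed.tsv.gz
zcat comb_fixed.tsv.gz | grep -c null   # prints 0 once no 'null' remains
mv comb_fixed.tsv.gz comb.tsv.gz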

Using results from grep to write results line by line with sed

I am trying to take every file name in a directory that has the extension .text
and write it to a file in that same directory, line by line, starting at line number 14.
This is what I have so far, but it doesn't work:
cp workDir | grep -r --include *.text | sed -i '14i' home.text
Any assistance is appreciated. Note: I am on Unix.
You can do the above task with the following shell command:
find workDir -name "*.text" >> home.text
This appends every matching file name to home.text, one per line.
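If the list really must start at line 14 of home.text, you can splice it in after line 13 instead of appending. A sketch, assuming GNU sed and that home.text already has at least 13 lines; names.tmp is a hypothetical scratch file:
find workDir -name "*.text" > names.tmp   # collect the file names
sed -i '13r names.tmp' home.text          # insert them after line 13
rm names.tmp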
cp workDir doesn't work because cp is for copying, as in cp source destination. Further explanations about cp can be read with man cp.
To achieve your goal, go to your directory with cd ~/path/to/workDir. There you can use the ls command and redirect its output, appending to your existing file with >>, for all files with the .text extension.
For example like this:
ls *.text >> home.text
This will append only the file names, line by line, to your home.text, without the leading ./ that the find command in the answer above produces.
Let me know if you would like a different format for the appended file names.

How to rename files downloaded with wget -r

I want to download an entire website using the wget -r command and change the names of the files it saves.
I have tried with:
wget -r -o doc.txt "http....
hoping that files would automatically be created in order, like doc1.txt, doc2.txt, but it actually saves the log stream in that file.
Is there any way to do this with just one command?
Thanks!
-r tells wget to recursively get resources from a host.
-o file saves log messages to file instead of the standard error. I think that is not what you are looking for; I think what you want is -O file.
-O file stores the resource(s) in the given file, instead of creating a file in the current directory with the name of the resource. If used in conjunction with -r, it causes wget to store all resources concatenated into that file.
Since wget -r downloads and stores more than one file, recreating the server's file tree on the local system, it makes no sense to give the name of a single file to store everything in.
If what you want is to rename all downloaded files to match the pattern docX.txt, you can do it with a separate command after wget has finished:
wget -r http....
# Rename every downloaded file, in the order find lists them,
# to docN.txt inside its own directory.
i=1
while read -r file; do
    mv "$file" "$(dirname "$file")/doc$i.txt"
    i=$(( i + 1 ))
done < <(find . -type f)
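Note that find lists files in no particular order, so the numbering will not follow download order. Piping the list through sort, as in this small variation, at least makes the numbering deterministic:
i=1
while read -r file; do
    mv "$file" "$(dirname "$file")/doc$i.txt"
    i=$(( i + 1 ))
done < <(find . -type f | sort)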

wget - work with arguments

I have a list of URIs in uri.txt:
category1/image1.jpeg
category1/image32.jpeg
category2/image1.jpeg
and so on, and I need to download them from the domain example.com with wget, changing each filename on save to categoryX-imageY.jpeg.
I understand that I should read uri.txt line by line, add "http://example.com/" in front of each line, and change "/" to "-" in each line.
What I have now:
Reading from uri.txt [works]
Adding the domain name in front of each URI [works]
Changing the filename for saving [fails]
I'm trying to do this with:
wget 'http://www.example.com/{}' -O '`sed "s/\//-/" {}`' < uri.txt
but wget fails (depending on which kind of quotation mark I use: ` or ') with:
wget: option requires an argument -- 'O'
or
sed `s/\//-/` category1/image1.jpeg: No such file or directory
sed `s/\//-/` category1/image32.jpeg: No such file or directory
Could you tell me what I'm doing wrong?
Here is how I would do that:
while read -r LINE; do
    # Save under the modified name: the "/" becomes a "-".
    wget "http://example.com/$LINE" -O "$(echo "$LINE" | sed 's=/=-=')"
done < uri.txt
In other words: read uri.txt line by line (the text being placed in the $LINE bash variable), then perform the wget, saving under the modified name. (I use a different sed delimiter to avoid escaping the / and to make it more readable.)
When I want to construct a list of args to be executed, I like to use xargs:
sed "s#\(.*\)/\(.*\)#http://example.com/\1/\2 -O \1-\2#" uri.txt | xargs -L 1 wget
(-L 1 makes xargs run one wget per input line, splitting the line into separate arguments.)
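To preview what the sed stage hands to xargs, run it on its own; for the three URIs above it emits:
http://example.com/category1/image1.jpeg -O category1-image1.jpeg
http://example.com/category1/image32.jpeg -O category1-image32.jpeg
http://example.com/category2/image1.jpeg -O category2-image1.jpeg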

Wget: Filenames without the query string

I want to download a list of webpages from a file. How can I stop Wget from appending the query strings to the saved file names?
wget http://www.example.com/index.html?querystring
I need this to be downloaded as index.html, not index.html?querystring
There is the -O option:
wget -O file.html http://www.example.com/index.html?querystring
so you can alter your script a little to pass the right file name to the -O argument.
I've finally resigned myself to using -O and just wrapped it in a bash function to make it easier. I put this in my ~/.bashrc file:
wget-rmq ()
{
    # Require a URL argument.
    [ -z "$1" ] && { echo 'error: wget-rmq requires a URL to retrieve as the first arg' >&2; return 1; }
    # Strip the query string, then everything up to the last "/".
    local output_filename="$(echo "$1" | sed 's/?.*//' | sed 's|.*/||')"
    wget -O "${output_filename}" "${1}"
}
Then when I want to download a file:
wget-rmq http://www.example.com/index.html?querystring
The replacement regex is fairly simple: if any ? appears in the URL before the query string begins, it will break. In practice that hasn't happened, though, since URL encoding requires a literal ? in a URL path to be written as %3F, but I wanted to note the possibility.
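Since the original goal was downloading a list of pages from a file, the function combines naturally with a read loop; a sketch, where urls.txt is a hypothetical name for your list file:
while read -r url; do
    wget-rmq "$url"
done < urls.txt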