wget many long URLs from a .txt file - wget

I have a couple hundred 10-second mp4s to download. The URLs to these files are listed in a file called urls.txt and they look like
http://v16.muscdn.com/thirty_two_alphanumeric_characters/5cf790de/video/tos/maliva/tos-maliva-v-0068/thirty_two_alphanumeric_characters/?rc=ang7cmg8OmZtaTMzZzczM0ApQHRAbzVHOjYzMzM0NTQ2ODMzMzQ1b0BoNXYpQGczdyl2KUBmamxmc3JneXcxcHpAKTY0ZHEzY2otcTZyb18tLWIxNnNzLW8jbyM2QS8wLS00LTQtLzYzMjYtOiNvIzphLW8jOmA6YC1vI2toXitiZmBjYmJeYDAvOg%3D%3D
so the total length of each URL is 329 characters.
When I try wget -i urls.txt, I get Error 414 URI Too Long.
But when I wget a random URL from the file by copy/pasting it into my terminal, it works fine and downloads that one file.
So then I tried the following bash script to wget each URL in the file, but it gave me the same error.
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
echo "Text read from file: $line"
wget $line --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 Chrome/25.0.1364.160 Safari/537.22"
done < "urls.txt"
I also tried changing the line-ending characters by running dos2unix on the file, but it made no difference.
What else can I try?

If all your URLs are already in a single file, why don't you simply invoke wget as:
$ wget --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 Chrome/25.0.1364.160 Safari/537.22" -i urls.txt
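If the -i invocation still returns 414, it may be worth double-checking that urls.txt really has one URL per line with no stray characters, and quoting the variable if you stay with the loop. A minimal sketch of the loop with the URL quoted (same user agent as above; it assumes urls.txt contains one URL per line):
#!/bin/bash
# Read one URL per line; quoting "$line" keeps the shell from
# word-splitting or glob-expanding the URL before wget sees it.
while IFS='' read -r line || [[ -n "$line" ]]; do
    wget --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 Chrome/25.0.1364.160 Safari/537.22" "$line"
done < urls.txt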

Related

wget downloads html file, but I need a fasta file

I have to search NCBI for the ID CAA37914, download the FASTA file using wget on Ubuntu 18.04, and rename the file to CAA37914.fa.
I looked up the ID and got the following URL: https://www.ncbi.nlm.nih.gov/protein/CAA37914.1/?report=fasta
I tried the following:
wget https://www.ncbi.nlm.nih.gov/protein/CAA37914.1/?report=fasta -O CAA37914.fa
But that didn't work. What am I doing wrong?
I get a file with HTML output.
Edit:
I think I have to do it something like this:
wget "link/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_208885&retype=fasta" -O NP_983532_dna.fa
I figured it out.
This is the answer:
wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=CAA37914&rettype=fasta" -O CAA37914.fa

How to ignore specific type of files to download in wget?

How do I ignore .jpg and .png files in wget, as I want to include only .html files?
I am trying:
wget -R index.html,*tiff,*pdf,*jpg -m http://example.com/
but it's not working.
Use the
--reject jpg,png --accept html
options to exclude/include files with certain extensions; see http://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options.
Put patterns containing wildcard characters in quotes, otherwise your shell will expand them; see http://www.gnu.org/software/wget/manual/wget.html#Types-of-Files
# -r : recursive
# -nH : Disable generation of host-prefixed directories
# -nd : all files will get saved to the current directory
# -np : Do not ever ascend to the parent directory when retrieving recursively.
# -R : don't download files matching these patterns
# -A : get only *.html files (for this case)
For instance:
wget -r -nH -nd -np -A "*.html" -R "*.gz,*.tar" http://www1.ncdc.noaa.gov/pub/data/noaa/1990/
Worked example to download all files excluding archives:
wget -r -k -l 7 -E -nc \
-R "*.gz, *.tar, *.tgz, *.zip, *.pdf, *.tif, *.bz, *.bz2, *.rar, *.7z" \
-erobots=off \
--user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" \
http://misis.ru/
This is what I get from wget --help:
Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions.
-R, --reject=LIST comma-separated list of rejected extensions.
--accept-regex=REGEX regex matching accepted URLs.
--reject-regex=REGEX regex matching rejected URLs.
--regex-type=TYPE regex type (posix|pcre).
-D, --domains=LIST comma-separated list of accepted domains.
--exclude-domains=LIST comma-separated list of rejected domains.
--follow-ftp follow FTP links from HTML documents.
--follow-tags=LIST comma-separated list of followed HTML tags.
--ignore-tags=LIST comma-separated list of ignored HTML tags.
-H, --span-hosts go to foreign hosts when recursive.
-L, --relative follow relative links only.
-I, --include-directories=LIST list of allowed directories.
--trust-server-names use the name specified by the redirection
url last component.
-X, --exclude-directories=LIST list of excluded directories.
-np, --no-parent don't ascend to the parent directory.
So you can use -R or --reject to reject extensions this way:
wget --reject="index.html,*.tiff,*.pdf,*.jpg" http://example.com/
And in my case, here is the final command I used to recursively download/update non-HTML files from an indexed website directory:
wget -N -r -np -nH --cut-dirs=3 -nv --reject="*.htm*,*.html" http://example.com/1/2/3/
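If the unwanted files are reached through URLs that carry query strings, the regex-based options listed in the help output above can be easier to work with than extension lists, since --reject-regex is matched against the complete URL. A hedged sketch (example.com is a placeholder; the default regex type is posix):
wget -r -np --reject-regex '.*\.(jpg|png|gif)([?#].*)?$' http://example.com/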

sed regexp pattern misunderstanding

I am trying to parse a log line with sed:
echo 195.236.222.1 - - [24/Jul/2012:07:35:25 +0300] "GET / HTTP/1.1" 200 387 "http://www.google.fi/url?sa=t&rct=j&q=tarinat&source=web&cd=9&ved=0CGoQFjAI&url=http%3A%2F%2Fwww.suomi24.fi%2F&ei=XyQOUKi0CeWA4gTjz4D4Cg&usg=AFQjCNE6wg5zPXup3d3PRoqU-BtpiNCccw" "Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1" |
sed -r 's/.*(\&q=.*)\&.*/\1/'
I would like to get "&q=tarinat" but unfortunately I get:
\&q=tarinat&source=web&cd=9&ved=0CGoQFjAI&url=http%3A%2F%2Fwww.suomi24.fi%2F&ei=XyQOUKi0CeWA4gTjz4D4Cg
I don't understand why I get the whole string through to the end. Any assistance or hints would be highly appreciated.
The .* is greedy. You could replace it with a negated character class, [^&]*, which matches anything except an & character:
echo 195.236.222.1 - - [24/Jul/2012:07:35:25 +0300] "GET / HTTP/1.1" 200 387 "http://www.google.fi/url?sa=t&rct=j&q=tarinat&source=web&cd=9&ved=0CGoQFjAI&url=http%3A%2F%2Fwww.suomi24.fi%2F&ei=XyQOUKi0CeWA4gTjz4D4Cg&usg=AFQjCNE6wg5zPXup3d3PRoqU-BtpiNCccw" "Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1" |
sed -r 's/.*(\&q=[^&]*)\&.*/\1/'
The regex .* is greedy. You don't want it to be greedy, so you should probably write:
sed -r 's/.*(\&q=[^&]*)\&.*/\1/'
A simple way using grep:
grep -o "&q=[^&]*"
Result:
&q=tarinat
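If you only need the value itself, without the &q= prefix, the same [^&]* idea works with the capture group placed after the parameter name (a sketch; it assumes a single q= parameter introduced by ? or &):
sed -r 's/.*[?&]q=([^&]*).*/\1/'
For the log line above this prints just tarinat.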

Grep/Find/Xargs: Search between two strings in folder or result of Wget

I have a folder full of html files, some of which have the following line:
var topicName = "website/something/something_else/1234/12345678_.*, website/something/something_else/1234/12345678_.*//";
I need to get all instances of the text between inverted commas into a text file. I've been trying to combine FIND.exe and XARGS.exe to do this, but have not been successful.
I've been looking at things like the following, but don't know where to start to combine all three to get the output I want.
grep -rI "var topicName = " *
Ideally, I want to combine this with a call to wget as well, so that I can (a) do a recursive mirror of a website (maybe limiting the results to HTML files), i.e.:
wget -mr -k -e robots=off --user-agent="Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" --level=1 http://www.website.com/someUrl
(b) go through the HTML in each result and check whether it contains the text 'var topicName', and (c) if so, get the text between 'var topicName =' and '"' and write all the values to a text file at the end.
I'd appreciate any help at all with this.
Thanks.
For grabbing the text from the HTML into a file:
If your version of grep supports it, the -o switch tells it to only print the matched portion of the line.
With this in mind, two grep invocations should sort you out (provided you can uniquely identify ONLY the lines you wish to grab the text from); something like this:
grep -Rn "var topicName =" html/ | grep -o '"[^"]*"' > topicNames.dat
If it's unacceptable to leave the " characters in there, you can pipe the result through sed after the second grep:
grep -Rn "var topicName =" html/ | grep -o '"[^"]*"' | sed 's/"//g' > topicNames.dat

scripting with sed and wget in Windows

I have an issue with Internet connectivity on a LAN. Some users are happy and some complain about the Internet speed, so I came up with the idea of installing software on three different PCs, downloading/uploading a file on all of them at the same time, and recording the speed. Then I will be able to create a graph from the data I acquire.
I am looking for a way to download several files and check the speed. I found How to grep download speed from wget output? for wget and sed. How do I use wget -O /dev/null http://example.com/index.html 2>&1 | sed -e 's|^.*(\([0-9.]\+ [KM]B/s\)).*$|\1|' on Windows? I already installed wget and sed on Windows.
All PCs are running Windows XP or 7.
Sed isn't different on Windows. The only difference is that /dev/null doesn't exist on Windows; the equivalent device is NUL.
So:
wget -O NUL http://example.com/index.html 2>&1 | sed -e 's|^.*(\([0-9.]\+ [KM]B/s\)).*$|\1|'
should work on Windows. I'm not 100% sure about 2>&1 - maybe there is some other syntax to use.
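For what it's worth, cmd.exe does understand 2>&1. The part that genuinely differs is quoting: cmd.exe does not treat single quotes specially, so the sed expression needs double quotes instead. A sketch for the Windows command prompt (example.com is a placeholder):
wget -O NUL http://example.com/index.html 2>&1 | sed -e "s|^.*(\([0-9.]\+ [KM]B/s\)).*$|\1|"
Apart from the quoting and NUL, this behaves the same way as the Linux version.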