How to wget a file without getting the HTML instead? - wget

I'm trying to download a file using:
wget https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
I'm expecting to get the .txt file; however, I get the page HTML instead.
I tried wget --max-redirect=2 --trust-server-names <url> based on the suggestions here, and wget -m <url>, which downloads the entire website, along with a few other variations that also don't work.

wget https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
This points wget at an HTML page even though the URL has a .txt suffix. After visiting it I found there is a link to the text file itself under "raw", which you should be able to use with wget in the following way:
wget https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt
If you need to reveal the true type of a file without downloading it, you can use the --spider option. In this case
wget --spider https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
gives output containing
Length: 7889527 (7,5M) [text/html]
and
wget --spider https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt
gives output containing
Length: 231508 (226K) [text/plain]
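If you want to combine the check and the download, a minimal sketch along these lines should work (the output filename vocab.txt is just an example):
# Sketch: confirm the raw URL really serves plain text, then fetch it.
# --spider requests without saving; -S prints the response headers,
# which wget writes to stderr (hence the 2>&1).
url='https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt'
if wget --spider -S "$url" 2>&1 | grep -q 'Content-Type: text/plain'; then
    wget -O vocab.txt "$url"
else
    echo "URL does not serve plain text" >&2
fi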

Related

Recursive file download with wget not working

From what I can tell by studying the wget manual, the following should work:
wget -r -l1 -np -nd -A.* -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/
However, instead of getting all of the files in the blat folder without the apparently auto-generated index.html file, I get a 404 not found error on this and several dozen variations that I've tried.
I can easily download any of the 4 files but trying to do it recursively fails.
Any pointers would be greatly appreciated.
Try replacing -r -l1 with -r -l 1. You need a space between the l and the 1. Also, try adding -k with your options. This will convert the links to point to the corresponding files on your computer.
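Putting those suggestions together, the invocation would look something like the sketch below (I've dropped the question's -A.* pattern; with -R "index.html*" excluding the index pages, an accept list isn't needed to grab the remaining files):
wget -r -l 1 -np -nd -k -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/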

wget downloads html file, but I need a fasta file

I have to search NCBI for the ID CAA37914, download the FASTA file using wget on Ubuntu 18.04, and rename the file to CAA37914.fa.
I looked up the ID and got the following url: https://www.ncbi.nlm.nih.gov/protein/CAA37914.1/?report=fasta
I tried the following:
wget https://www.ncbi.nlm.nih.gov/protein/CAA37914.1/?report=fasta -O CAA37914.fa
But that didn't work: I get a file with HTML output instead. What am I doing wrong?
Edit:
I think I have to do something like this:
wget "link/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_208885&retype=fasta" -O NP_983532_dna.fa
I figured it out.
This is the answer:
wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=CAA37914&rettype=fasta" -O CAA37914.fa

How to print wget output to a file

I'm trying to print HTTP headers to a text file. I tried:
wget -S --spider -O SESSIONS.txt 'mysite.com'
wget -S --spider 'mysite.com' > SESSIONS.txt
In both cases SESSIONS.txt remains empty. Why?
"--spider" option does not download anything.
You can try this -
wget -S --spider -q mysite.com 2>Sessions.txt
This will save only the headers to "Sessions.txt"
However, you will have to use echo and other commands to figure out which request generated which headers.
Or, you can remove the -q option and then parse the file to remove unnecessary lines.
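To label each block of headers, a small loop along these lines should do it (the hostnames are just placeholders):
for site in example.com example.org; do
    echo "== $site ==" >> Sessions.txt
    wget -S --spider -q "$site" 2>> Sessions.txt   # headers go to stderr
done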
Another way is to use curl -I. However, this sends a HEAD request instead of a GET request, so it will only work if the server supports and responds to HEAD requests.

how to use -o flag in wget with -i?

I understand that the -i flag takes a file (which may contain a list of URLs), and I know that -o followed by a name can be specified to rename an item being downloaded using wget.
example:
wget -i list_of_urls.txt
wget -o my_custom_name.mp3 http://example.com/some_file.mp3
I have a file that looks like this:
file name: list_of_urls.txt
http://example.com/some_file.mp3
http://example.com/another_file.mp3
http://example.com/yet_another_file.mp3
I want to use wget to download these files with the -i flag but also save each file as 1.mp3, 2.mp3 and so on.
Can this be done?
You can use any scripting language (PHP or Python) to generate a batch file in which each line runs wget with the URL and the -O option.
Or you can write a loop in a bash script, as sketched below.
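A minimal sketch of such a loop, assuming the list_of_urls.txt layout shown in the question (the counter produces the 1.mp3, 2.mp3, ... names):
n=1
while IFS= read -r url; do
    wget -O "${n}.mp3" "$url"   # -O sets the output filename
    n=$((n + 1))
done < list_of_urls.txt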
I ran a web search again and found https://superuser.com/questions/336669/downloading-multiple-files-and-specifying-output-filenames-with-wget
Wget can't seem to do it, but curl can with the -K flag; the file supplied can contain a URL and an output name for each download. See http://curl.haxx.se/docs/manpage.html#-K
If you are willing to use some shell scripting then https://unix.stackexchange.com/questions/61132/how-do-i-use-wget-with-a-list-of-urls-and-their-corresponding-output-files has the answer.
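For the curl route, the file passed with -K pairs each url entry with an output entry; a sketch based on the question's filenames (save it as, say, downloads.conf and run curl -K downloads.conf):
url = "http://example.com/some_file.mp3"
output = "1.mp3"
url = "http://example.com/another_file.mp3"
output = "2.mp3"
url = "http://example.com/yet_another_file.mp3"
output = "3.mp3"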

Download all pdf files using wget

I have the following site http://www.asd.com.tr. I want to download all PDF files into one directory. I've tried a couple of commands but am not having much luck.
$ wget --random-wait -r -l inf -nd -A pdf http://www.asd.com.tr/
With this command only four PDF files were downloaded. Check this link; there are several thousand PDFs available:
http://www.asd.com.tr/Default.aspx
For instance, hundreds of files are in the following folder:
http://www.asd.com.tr/Folders/asd/…
But I can't figure out how to access them all in order to see and download them. There are a number of folders under http://www.asd.com.tr/Folders/, and thousands of PDFs in those folders.
I've also tried to mirror the site using the -m option, but that failed too.
Any more suggestions?
First, verify that the site's terms of service permit crawling it. Then, one solution is:
mech-dump --links 'http://domain.com' |
grep pdf$ |
sed -E 's/[[:space:]]+/%20/g' |
xargs -I% wget http://domain.com/%
The mech-dump command comes with Perl's WWW::Mechanize module (the libwww-mechanize-perl package on Debian and Debian-like distros).
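If mech-dump isn't available, a rough wget/grep equivalent can scrape a listing page instead. This is only a sketch; it assumes the PDF links appear as plain href attributes with site-relative paths, so adjust the prefix if the links are absolute:
# Fetch the listing page, extract href values ending in .pdf,
# strip the attribute wrapper and any leading slash, then download each one.
wget -qO- 'http://www.asd.com.tr/Default.aspx' |
grep -oE 'href="[^"]*\.pdf"' |
sed 's/^href="//; s/"$//; s|^/||' |
xargs -I% wget 'http://www.asd.com.tr/%'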