I have to search NCBI for the ID CAA37914, download the FASTA file using wget on Ubuntu 18.04, and rename the file to CAA37914.fa.
I looked up the ID and got the following url: https://www.ncbi.nlm.nih.gov/protein/CAA37914.1/?report=fasta
I tried the following:
wget https://www.ncbi.nlm.nih.gov/protein/CAA37914.1/?report=fasta -O CAA37914.fa
But that didn't work; I get a file with HTML output instead. What am I doing wrong?
edit:
I think I have to do something like this:
wget "link/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_208885&rettype=fasta" -O NP_983532_dna.fa
I figured it out.
This is the answer:
wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=CAA37914&rettype=fasta" -O CAA37914.fa
Related
I'm trying to download a file using:
wget https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
I'm expecting to get the .txt file; however, I get the page HTML instead.
I tried wget --max-redirect=2 --trust-server-names <url> based on the suggestions here and wget -m <url> which downloads the entire website, and a few other variations that also don't work.
wget https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
This points wget at an HTML page even though the URL has a .txt suffix. After visiting it, I found that there is a link to the text file itself under "raw", which you should be able to use with wget in the following way:
wget https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt
If you need to reveal the true type of a file without downloading it, you can use the --spider option. In this case,
wget --spider https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
gives output containing
Length: 7889527 (7,5M) [text/html]
and
wget --spider https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt
gives output containing
Length: 231508 (226K) [text/plain]
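If curl is available, a similar check can be done with a HEAD request (a hedged alternative to --spider, not from the original answer; -I asks for headers only and -s suppresses progress output):
curl -sI https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt | grep -i '^content-type'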
From what I can tell by studying the wget manual, the following should work:
wget -r -l1 -np -nd -A.* -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/
However, instead of getting all of the files in the blat folder without the apparently auto-generated index.html file, I get a 404 not found error on this and several dozen variations that I've tried.
I can easily download any of the 4 files but trying to do it recursively fails.
Any pointers would be greatly appreciated.
Try replacing -r -l1 with -r -l 1. You need a space between the l and the 1. Also, try adding -k with your options. This will convert the links to point to the corresponding files on your computer.
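Putting those suggestions together, the full command would look something like this (an untested sketch combining the answer's changes with the asker's original URL and filters):
wget -r -l 1 -np -nd -k -A.* -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/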
I have a download link to a large file.
You need to be logged in to the site, so a cookie is used.
The download link redirects to another URL.
I'm able to download the file with wget but I only want the output of the "real" direct download link.
wget prints exactly this before starting the download:
Location: https://foo.com/bar.zip [following]
Is there a way to make wget stop and not actually downloading the file?
The solutions I found recommend redirecting the output to /dev/null, but that would still download the file. What I want is for wget to follow the redirects without actually starting the download.
I couldn't find a way to do it with wget, but I found a way to do it with curl:
curl https://openlibrary.org/data/ol_dump_latest.txt.gz -s -L -I -o /dev/null -w '%{url_effective}'
This only fetches the headers of the page via a HEAD request (and sends them to /dev/null), so the file itself is never downloaded.
(src: https://stackoverflow.com/a/5300429/2317712 )
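To use the result in a script, the effective URL can be captured in a variable (a small usage sketch built from the same flags; the variable name is mine):
real_url=$(curl -s -L -I -o /dev/null -w '%{url_effective}' https://openlibrary.org/data/ol_dump_latest.txt.gz)
echo "$real_url"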
Going off of qqilihq's comment on the curl answer, this first pulls out the line starting with "Location:" and then strips the leading "Location: " and the trailing " [following]" using awk. I'm not sure I would use this, since it looks like a small change in the wget output could make it blow up; I would use the curl answer myself.
wget --max-redirect=0 http://example.com/link-to-get-redirec-url-from 2>&1 | awk '/Location: /,// { print }' | awk '{print $2}'
I am aware that you can download from multiple URLs using:
wget "url1" "url2" "url3"
Renaming the output file can be done via:
wget "url1" -O "new_name1"
But when I tried
wget "url1" "url2" "url3" -O "name1" "name2" "name3"
all the files are using name1.
What is the proper way to do this in a single command?
Yes, something like this: you can add a file name next to each URL in the file, then do:
while read -r url fileName; do
    wget -O "$fileName" "$url"
done < list
where it is assumed you have added a (unique) file name after each URL in the file (separated by a space).
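For example, the list file might look like this (hypothetical URLs and names):
http://example.com/some_file.mp3 name1.mp3
http://example.com/another_file.mp3 name2.mp3
http://example.com/yet_another_file.mp3 name3.mp3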
The -O option allows you to specify the destination file name, but if you're downloading multiple files at once, wget will concatenate them all into the single file you specify via -O. Note that in either case, the file will be truncated if it already exists. See the man page for more info.
You can exploit this option by telling wget to download the links one-by-one:
while IFS= read -r url; do
    fileName="blah" # Add a rule to define a new name for each file here
    wget -O "$fileName" "$url"
done < list
Hope it's useful.
I understand that the -i flag takes a file (which may contain a list of URLs), and I know that -O followed by a name can be specified to rename the item being downloaded using wget.
example:
wget -i list_of_urls.txt
wget -O my_custom_name.mp3 http://example.com/some_file.mp3
I have a file that looks like this:
file name: list_of_urls.txt
http://example.com/some_file.mp3
http://example.com/another_file.mp3
http://example.com/yet_another_file.mp3
I want to use wget to download these files with the -i flag but also save each file as 1.mp3, 2.mp3 and so on.
Can this be done?
You can use any scripting language (PHP or Python) to generate a batch file, in which each line runs wget with a URL and the -O option. Or you can write a loop in a bash script.
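For the 1.mp3, 2.mp3 naming asked about here, a minimal bash sketch could look like this (the counter logic is mine, not from the original answer; it assumes the URLs are in list_of_urls.txt):
n=1
while IFS= read -r url; do
    wget -O "${n}.mp3" "$url"
    n=$((n+1))
done < list_of_urls.txt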
I ran a web search again and found https://superuser.com/questions/336669/downloading-multiple-files-and-specifying-output-filenames-with-wget
wget can't seem to do it, but curl can with the -K flag; the file supplied can contain a URL and an output name. See http://curl.haxx.se/docs/manpage.html#-K
If you are willing to use some shell scripting then https://unix.stackexchange.com/questions/61132/how-do-i-use-wget-with-a-list-of-urls-and-their-corresponding-output-files has the answer.
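For reference, a curl config file for -K pairs each url entry with an output entry (a hedged sketch based on the curl man page; the numbering follows the question):
# save as download_list.txt and run: curl -K download_list.txt
url = "http://example.com/some_file.mp3"
output = "1.mp3"
url = "http://example.com/another_file.mp3"
output = "2.mp3"
url = "http://example.com/yet_another_file.mp3"
output = "3.mp3"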