wget not saving multiple files as html - wget

I'm using
wget -i urllist.txt
All the urls are similarly named, so I get
index
index.1
index.2
index.3
Now, the names don't really matter to me, but is there some way I can get them to save as html? Instead, the extensions are 1, 2, 3, etc. Thanks.
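If it helps: wget has an -E / --adjust-extension option (called --html-extension in older versions) that appends .html to downloaded files served as text/html whose names don't already end in .html or .htm. A sketch, assuming urllist.txt is the list you are already using:

wget -E -i urllist.txt

Whether the duplicate counter then ends up before or after the .html suffix may depend on the wget version, so it's worth testing on a couple of URLs first.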

Related

Is there a way to catenate specific pages from multiple pdfs using pdftk?

So say I have a few pdfs, some with a single page, some with 3, 4, 5 pages and so on…
I want to extract, say, the 4th page from each pdf (ignoring ones without a 4th page) and merge those pages into a single pdf.
Tried something like
$ pdftk *.pdf cat 4 output merged.pdf
Going by this, pdftk gives me a single-page pdf containing only the 4th page of the first input pdf.
Wondering whether I should write an elaborate script or if there’s an easier way
I do have a few workarounds where I burst the pdfs and then merge the pages I need but looking for something simpler.
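One way to avoid an elaborate script (a sketch, assuming pdftk is installed; the page4_parts directory name is just a hypothetical choice): extract the 4th page from every pdf that has one, then concatenate the pieces.

mkdir -p page4_parts
for f in *.pdf; do
    # pdftk's dump_data report includes a "NumberOfPages: N" line
    pages=$(pdftk "$f" dump_data | awk '/NumberOfPages/ {print $2}')
    if [ "$pages" -ge 4 ]; then
        pdftk "$f" cat 4 output "page4_parts/${f%.pdf}-p4.pdf"
    fi
done
pdftk page4_parts/*.pdf cat output merged.pdf

This is essentially the burst-and-merge workaround without the burst step, so treat it as a starting point rather than a tested recipe.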

How to make wget *not* overwrite/ignore files?

I have a list of 400 websites from which I'm downloading PDFs. 100 of these websites share the same pdf name: templates.pdf
When running wget, it either ignores the pdfs named templates.pdf or overwrites them. I spent 2 hours searching for an option that would create a new templates2.pdf instead, but I couldn't find anything.
The default behavior of wget is to add .1, .2 suffixes when a file with the same name is downloaded multiple times into the same target directory. This appears to be what you are asking for. (The poorly named -nc option causes subsequent downloads of files with the same name to be ignored, which you don't want.)
If the default behavior is not what you want, the -O option looks promising, as it allows you to choose the output file name. Here is a brief article explaining its use.
Of course, if you go the -O route, you'd need to ensure the output file does not exist, and do the suffix incrementing on your own.
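A minimal sketch of that -O route, assuming the URLs are listed one per line in a hypothetical urls.txt and everything lands in the current directory:

while read -r url; do
    name=$(basename "$url")          # e.g. templates.pdf
    out="$name"
    n=2
    while [ -e "$out" ]; do          # do the suffix incrementing ourselves
        out="${name%.pdf}$n.pdf"     # templates2.pdf, templates3.pdf, ...
        n=$((n + 1))
    done
    wget -O "$out" "$url"
done < urls.txt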

How to rename partly the downloaded file using wget?

I'd like to download many files (about 10000) from an FTP server. The names of the files are too long; I'd like to save them with only the date in the name. For example, I'd prefer ABCDE201604120000-abcde.nc to be saved as 20160412.nc.
Is it possible?
I am not sure whether wget provides similar functionality; with curl, however, one can take advantage of the relatively rich syntax it provides for specifying the URL of interest. For example:
curl \
"https://ftp5.gwdg.de/pub/misc/openstreetmap/SOTMEU2014/[53-54].{mp3,mp4}" \
-o "file_#1.#2"
will download the files 53.mp3, 53.mp4, 54.mp3, and 54.mp4. The output file is specified as file_#1.#2 - here, #1 is replaced by curl with the value of the sequence [53-54] corresponding to the file being downloaded. Similarly, #2 is replaced with either mp3 or mp4. Thus, e.g., 53.mp3 will be saved as file_53.mp3.
ewcz's answer works fine if you can enumerate the file names as shown in the post. However, if the filenames are difficult to enumerate, for example, because the integers are sparsely populated, this solution would result in a lot of 404 Not Found requests.
If this is the case, then it is probably better to download all the files recursively, as you have shown, and rename them afterwards. If the file names follow a fixed pattern, you can select the substring from the original name and use it as the new name. In the given example, the new file names start at offset 5 (counting from 0) and are 8 characters long. The following bash command renames all *.nc files in the current directory:
for f in *.nc; do mv "$f" "${f:5:8}.nc" ; done
If the filenames do not follow a fixed pattern and might vary in length, you can do more complex pattern substitution with sed; see this SO post for an example.
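For instance, a sketch that extracts the date from each name instead of relying on a fixed offset (it assumes the date is the start of the first numeric run in the name, and that no two files share a date, since duplicates would overwrite each other):

for f in *.nc; do
    # keep the first 8 digits of the first numeric run, e.g. 20160412
    date=$(printf '%s\n' "$f" | sed -E 's/^[^0-9]*([0-9]{8}).*/\1/')
    mv "$f" "$date.nc"
done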

grabbing all .nc files from URL to get data using matlab

I'd like to get all the .nc files from a URL and read the data using matlab. However, the filenames are always very long and vary among the files.
For instance, I have
url = 'http://sourcename/filename.nc'
the sourcename is always the same, but the filename is very long and varies, so I would like to just use * to be able to grab whatever .nc files are at the URL
e.g.
url = 'http://sourcename/*.nc'
but this does not work and I am guessing I need to get the exact name - so I am not sure what to do here?
On the other hand, it could also be interesting for me to get the name of each file and record it, but I'm not sure how to do that either.
Thanks a lot in advance!!
HTTP does not implement a filesystem abstraction. This means that each of those URLs that you request could be handled in a completely different way. In many cases there is also no way to get a list of allowable URLs from a parent (a directory listing, in other words).
It may be the case for you that http://sourcename/ actually returns an index document containing a list of the files. In that case, first fetch that document. Then you'll have to parse the contents to extract the list of files. Then you can loop over those files, form new URLs for each one, and fetch them in sequence.
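A rough sketch of that approach with wget and grep, assuming http://sourcename/ returns an HTML index whose links look like href="something.nc" (adjust the pattern to the actual markup; relative vs. absolute links would need different handling):

base='http://sourcename/'
wget -q -O index.html "$base"
# pull the .nc link targets out of the index page and fetch them one at a time
grep -o 'href="[^"]*\.nc"' index.html | sed 's/^href="//; s/"$//' | while read -r name; do
    wget "$base$name"
done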
If you have a list of the file names in a text file, you can use the wget utility to process the file and fetch all the listed files. This file would be formatted as follows:
http://url.com/file1.nc
http://url.com/file2.nc
(etc)
You would then invoke wget as follows:
$ wget -i url-file.txt
Alternatively, you may be able to use wget to fetch the files recursively, if they are all located in the same directory on the web server, e.g.:
$ wget -r -l1 http://url.com/directory
The -r flag says to recurse, the -l1 flag says to go no deeper than 1 level when recursing.
This solution is external to Matlab, but once you have all of the files downloaded, you can work with them all locally.
wget is a fairly standard utility available on Linux systems; it is also available for OSX and Windows. The wget homepage is here: https://www.gnu.org/software/wget/

how do I get a profile's "link" from Facebook's graph in a bash script?

I'm trying to write a script that extracts a list of public facebook page URLs and stores them in a flat text file. I've already downloaded a few http://graph.facebook.com/$NUMBER files with wget, but I'm having trouble separating out the URL because of the weird delimiting that they use. Here's the general format (I'll use a fictitious example):
{"id":"4","name":"John Smith","first_name":"John","last_name":"Smith","link":"http:\/\/www.facebook.com\/john.smith","username":"john.smith","gender":"male","locale":"en_US"
That's JSON, and so your best bet is to use a tool that actually understands JSON. If you have python installed, it comes stock with json support, and so it's easy to do something like:
$ echo '{"id":"4","name":"John Smith","first_name":"John","last_name":"Smith","link":"http:\/\/www.facebook.com\/john.smith","username":"john.smith","gender":"male","locale":"en_US"}' | python -c 'import json,sys; print json.load(sys.stdin)["link"]'
http://www.facebook.com/john.smith
Not a pure bash solution, but parsing JSON in bash seems like a lot of hard and unnecessary work, imo.
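Putting the pieces together, a sketch of the wget loop (the IDs 4 5 6 and the output file links.txt are hypothetical, and it assumes the Graph endpoint still returns the JSON shown above without authentication):

for id in 4 5 6; do
    # fetch the profile JSON to stdout and append just the "link" field to the list
    wget -q -O - "http://graph.facebook.com/$id" \
        | python -c 'import json,sys; print(json.load(sys.stdin)["link"])' >> links.txt
done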