Grabbing all .nc files from a URL to read the data using MATLAB - matlab

I'd like to get all .nc files from a URL and read the data using MATLAB. However, the filenames are always very long and vary from file to file.
For instance, I have
url = 'http://sourcename/filename.nc'
The sourcename part is always the same, but the filename is very long and varies, so I would like to just use * to grab whatever .nc files are at the URL,
e.g.
url = 'http://sourcename/*.nc'
but this does not work, and I am guessing I need the exact file name, so I am not sure what to do here.
On the other hand, it would also be useful to get the name of each file and record it, but I am not sure how to do that either.
Thanks a lot in advance!!

HTTP does not implement a filesystem abstraction. This means that each of the URLs you request could be handled in a completely different way. There is also, in many cases, no way to get a list of allowable URLs from a parent (a directory listing, in other words).
It may be the case for you that http://sourcename/ actually returns an index document containing a list of the files. In that case, first fetch that document. Then you'll have to parse the contents to extract the list of files. Then you can loop over those files, form new URLs for each one, and fetch them in sequence.
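Here is a minimal MATLAB sketch of that approach. It assumes http://sourcename/ returns an HTML index whose links end in .nc; the regular expression is a guess and will need to be adjusted to the actual markup of your server.
% Sketch: fetch the index document, extract the .nc links, download each one.
% The base URL and the href pattern are assumptions - adjust to your server.
baseUrl = 'http://sourcename/';
html    = urlread(baseUrl);                          % fetch the index page
names   = regexp(html, 'href="([^"]*\.nc)"', 'tokens');

for k = 1:numel(names)
    fname = names{k}{1};                             % record/keep the file name
    urlwrite([baseUrl fname], fname);                % save it under the same name
end
On newer MATLAB releases you may prefer webread/websave over urlread/urlwrite, but the idea is the same.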

If you have a list of the file names in a text file, you can use the wget utility to process the file and fetch all the listed files. This file would be formatted as follows:
http://url.com/file1.nc
http://url.com/file2.nc
(etc)
You would then invoke wget as follows:
$ wget -i url-file.txt
Alternatively, you may be able to use wget to fetch the files recursively, if they are all located in the same directory on the web server, e.g.:
$ wget -r -l1 http://url.com/directory
The -r flag says to recurse, the -l1 flag says to go no deeper than 1 level when recursing.
This solution is external to Matlab, but once you have all of the files downloaded, you can work with them all locally.
wget is a fairly standard utility available on Linux systems; it is also available for OS X and Windows. The wget homepage is here: https://www.gnu.org/software/wget/
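Once all of the files are downloaded, reading them locally from MATLAB is straightforward. A minimal sketch, assuming the files sit in one directory; the directory path and the variable name 'temperature' are placeholders, since the actual variable names depend on your .nc files:
% Sketch: loop over the downloaded .nc files and read them with ncread.
% 'downloadDir' and the variable name 'temperature' are placeholders.
downloadDir = 'C:\data\nc';
files = dir(fullfile(downloadDir, '*.nc'));

for k = 1:numel(files)
    fpath = fullfile(downloadDir, files(k).name);
    info  = ncinfo(fpath);                  % lists the variables in the file
    data  = ncread(fpath, 'temperature');   % read one variable by name
    % ... process data ...
end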

Related

How to make wget *not* overwrite/ignore files?

I have a list of 400 websites from which I'm downloading PDFs. 100 of these websites share the same PDF name: templates.pdf
When running wget, it either ignores or overwrites the PDFs named templates.pdf. I spent two hours searching for an option that would create a new templates2.pdf, but I couldn't find anything.
The default behavior of wget is to append the suffixes .1, .2, and so on when a file is downloaded multiple times into the same target directory. This appears to be what you are asking for. (The poorly named -nc option causes subsequent downloads of files with the same name to be ignored, which you don't want.)
If the default behavior is not what you want, the -O option looks promising, as it allows you to choose the output file name. Here is a brief article explaining its use.
Of course, if you go the -O route, you'd need to ensure the output file does not exist, and do the suffix incrementing on your own.
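If you end up scripting the downloads yourself rather than relying on wget's default naming, the incrementing logic is simple. Here is a minimal sketch in MATLAB, in keeping with the rest of this thread; the URL and file names are hypothetical:
% Sketch: pick the first free name (templates.pdf, templates2.pdf, ...) before
% downloading. The URL and base name are hypothetical.
url  = 'http://url.com/templates.pdf';
stem = 'templates'; ext = '.pdf';

outName = [stem ext];
n = 1;
while exist(outName, 'file')
    n = n + 1;
    outName = sprintf('%s%d%s', stem, n, ext);
end
urlwrite(url, outName);    % saves to the first name that does not exist yet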

How to partly rename downloaded files using wget?

I'd like to download many files (about 10,000) from an FTP server. The file names are too long, and I'd like to save them with only the date in the name. For example, ABCDE201604120000-abcde.nc should become 20160412.nc.
Is it possible?
I am not sure whether wget provides similar functionality; with curl, however, one can take advantage of the relatively rich syntax it provides for specifying the URL of interest. For example:
curl \
"https://ftp5.gwdg.de/pub/misc/openstreetmap/SOTMEU2014/[53-54].{mp3,mp4}" \
-o "file_#1.#2"
will download the files 53.mp3, 53.mp4, 54.mp3, 54.mp4. The output file is specified as file_#1.#2 - here, #1 is replaced by curl with the value of the sequence [53-54] corresponding to the file being downloaded. Similarly, #2 is replaced with either mp3 or mp4. Thus, e.g., 53.mp3 will be saved as file_53.mp3.
ewcz's answer works fine if you can enumerate the file names as shown in the post. However, if the filenames are difficult to enumerate, for example, because the integers are sparsely populated, this solution would result in a lot of 404 Not Found requests.
If this is the case, then it is probably better to download all the files recursively, as you have shown, and rename them afterwards. If the file names follow a fixed pattern, you can select the relevant substring from the original name and use it as the new name. In the given example, the date starts at offset 5 of the original name and is 8 characters long. The following bash command renames all *.nc files in the current directory accordingly:
for f in *.nc; do mv "$f" "${f:5:8}.nc" ; done
If the filenames do not follow a fixed pattern and might vary in length, you can use more complex pattern substitution with sed; see this SO post for an example.

Get a WARC archive file with all files from a given domain, using data from commoncrawl.org

Common Crawl datasets are split into segments.
How can I extract a subset of the Common Crawl dataset? I need a WARC archive file (or several archive files) containing all the files from a given domain, such as example.com.
Note: common_crawl_index allows doing this by running bin/remote_copy copy "com.ipc.www" --bucket commoncrawl_sample --key common_crawl/ipc_crawl, but the project is outdated: it only works with the 2012 datasets, and it does not accept WARC, WAT or WET files.
Note: http://index.commoncrawl.org/ also makes it possible to find the segments for a given URL prefix, but there is no utility to download only those pages, like the remote_copy command above.
PS: I am aware that I can implement a program to do this. Here I am asking whether Common Crawl (or someone else) has already thought of and implemented this feature.

SAS - Reading multiple compressed data files

I hope you are all well.
My question is about the procedure for opening multiple raw data files that are compressed.
My file names are ordered, so I have, for example: o_equities_20080528.tas.zip, o_equities_20080529.tas.zip, o_equities_20080530.tas.zip, ...
Thank you all in advance.
How much work this will be depends on whether:
You have enough space to extract all the files simultaneously into one folder
You need to be able to keep track of which file each record has come from (i.e. you can't tell just from looking at a particular record).
If you have enough space to extract everything and you don't need to track which records came from which file, then the simplest option is to use a wildcard infile statement, allowing you to import the records from all of your files in one data step:
infile "c:\yourdir\o_equities_*.tas" <other infile options as per individual files>;
This syntax works regardless of OS - it's a SAS feature, not shell expansion.
If you have enough space to extract everything in advance but you need to keep track of which records came from each file, then please refer to this page for an example of how to do this using the filevar option on the infile statement:
http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm
If you don't have enough space to extract everything in advance, but you have access to 7-zip or another archive utility, and you don't need to keep track of which records came from each file, you can use a pipe filename and extract to standard output. If you're on a Linux platform then this is very simple, as you can take advantage of shell expansion:
filename cmd pipe "nice -n 19 gunzip -c /yourdir/o_equities_*.tas.zip";
infile cmd <other infile options as per individual files>;
On Windows it's the same sort of idea, but as you can't use shell expansion, you have to construct a separate filename for each zip file, or use some of 7-Zip's more arcane command-line options, e.g.:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y";
This will extract all files from all of the matching archives to standard output. You can narrow this down further via the 7-zip command if necessary. You will have multiple header lines mixed in with the data - you can use findstr to filter these out in the pipe before SAS sees them, or you can just choose to tolerate the odd error message here and there.
Here, the -an tells 7-zip not to read the zip file name from the command line, and the -ai tells it to expand the wildcard.
If you need to keep track of what came from where and you can't extract everything at once, your best bet (as far as I know) is to write a macro that processes one file at a time, using the above techniques, and adds this information while you're importing each dataset.

Extracting file names from an online data server in Matlab

I am trying to write a script that will allow me to download numerous (1000s of) data files from a data server (e.g., http://hydro1.sci.gsfc.nasa.gov/thredds/catalog/GLDAS_NOAH10SUBP_3H/2011/345/). Unfortunately, the names of the files in each directory are not formatted consistently (the time that they were created was appended to the end of the file name). I need to be able to specify the file name to subset the data (I have a special tool for these data types) and download it. I cannot find a function in MATLAB that will extract the file names.
I have looked at URLREAD, but it downloads everything, including the HTML code.
Thanks for your help!
You can easily parse the page for links:
x = urlread(url);                                     % fetch the page source
links = regexp(x, '<a href=''([^>]+)''>', 'tokens');  % extract every link target
This reads every link; you then have to filter out the unwanted ones.
For example, this gets all .grb files:
a = regexp(x, '<a href=''([^>]+\.grb)''>', 'tokens');
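Putting it together, a hedged sketch of the full loop: record the matched names and download each file with urlwrite. The href quoting style and forming the download URL as base URL plus file name are assumptions about this particular server; inspect the catalog's HTML and adjust if needed.
% Sketch: list, record, and download every .grb file linked from the page.
% The direct base-URL + name download scheme is an assumption about the server.
url = 'http://hydro1.sci.gsfc.nasa.gov/thredds/catalog/GLDAS_NOAH10SUBP_3H/2011/345/';
x   = urlread(url);
a   = regexp(x, '<a href=''([^>]+\.grb)''>', 'tokens');

names = cellfun(@(t) t{1}, a, 'UniformOutput', false);   % record the file names
for k = 1:numel(names)
    urlwrite([url names{k}], names{k});                   % fetch each file
end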