How to partially rename downloaded files using wget? - wget

I'd like to download many files (about 10,000) from an FTP server. The file names are too long, and I'd like to save them with only the date in the name. For example, ABCDE201604120000-abcde.nc should instead be saved as 20160412.nc.
Is it possible?

I am not sure whether wget provides similar functionality; with curl, however, one can take advantage of the relatively rich syntax it offers for specifying the URL of interest. For example:
curl \
"https://ftp5.gwdg.de/pub/misc/openstreetmap/SOTMEU2014/[53-54].{mp3,mp4}" \
-o "file_#1.#2"
will download the files 53.mp3, 53.mp4, 54.mp3, and 54.mp4. The output file is specified as file_#1.#2 - here, #1 is replaced by curl with the value of the sequence [53-54] corresponding to the file being downloaded. Similarly, #2 is replaced with either mp3 or mp4. Thus, e.g., 53.mp3 will be saved as file_53.mp3.
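Applied to the file names from the question, the same idea might look like the following. This is only a sketch: the host and path are placeholders, the range covers a single month, and it assumes the time portion of every name is a fixed 0000.
curl "ftp://ftp.example.com/data/ABCDE[20160401-20160430]0000-abcde.nc" -o "#1.nc"
Here #1 expands to the current value of the date range, so each file is saved directly as 20160401.nc, 20160402.nc, and so on.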

ewcz's answer works fine if you can enumerate the file names as shown in the post. However, if the filenames are difficult to enumerate, for example, because the integers are sparsely populated, this solution would result in a lot of 404 Not Found requests.
If this is the case, then it is probably better to download all the files recursively, as you have shown, and rename them afterwards. If the file names follow a fixed pattern, you can select the substring from the original name and use it as the new name. In the given example, the date starts at character offset 5 (zero-based) and is 8 characters long. The following bash command renames all *.nc files in the current directory.
for f in *.nc; do mv "$f" "${f:5:8}.nc" ; done
If the file names do not follow a fixed pattern and vary in length, you can use more complex pattern substitution with sed; see this SO post for an example.
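For instance, here is a rough sketch that extracts the date with sed, assuming each name contains an 8-digit date followed by a 4-digit time and a dash before the .nc extension (as in the example above):
for f in *.nc; do
  # pull out the 8-digit date that precedes the 4-digit time and the "-..." suffix
  d=$(printf '%s\n' "$f" | sed -E 's/^.*([0-9]{8})[0-9]{4}-.*\.nc$/\1/')
  mv "$f" "$d.nc"
done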

Related

How to make wget *not* overwrite/ignore files?

I have a list of 400 websites from which I'm downloading PDFs. 100 of these websites share the same PDF name: templates.pdf
When running wget, it either ignores the PDFs named templates.pdf or overwrites them. I spent two hours searching for an option that would create a new templates2.pdf, but I couldn't find anything.
The default behavior of wget is to append .1, .2 suffixes when a file with the same name is downloaded multiple times into the same target directory. This appears to be what you are asking for. (The poorly named -nc option causes subsequent downloads of files with the same name to be ignored, which you don't want.)
If the default behavior is not what you want, the -O option looks promising, as it allows you to choose the output file name. Here is a brief article explaining its use.
Of course, if you go the -O route, you'd need to ensure the output file does not exist, and do the suffix incrementing on your own.
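A minimal sketch of that approach, assuming the 400 URLs are listed one per line in a file (urls.txt is a placeholder name):
#!/bin/bash
# download each URL, appending 2, 3, ... to the name if it is already taken
while read -r url; do
  name=$(basename "$url")
  out=$name
  i=2
  while [ -e "$out" ]; do
    out="${name%.*}$i.${name##*.}"
    i=$((i + 1))
  done
  wget -O "$out" "$url"
done < urls.txt
With this, the second templates.pdf is saved as templates2.pdf, the third as templates3.pdf, and so on.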

Using diff3 where filenames contain a dash (-)

I'm trying to use diff3 in this way
diff3 options... mine older yours
My problem is that I probably can't use it, since all three of my file names contain a dash.
The manual mentions:
At most one of these three file names may be `-', which tells diff3 to read the standard input for that file.
so I probably have to rename filenames before running diff3.
If you know of a better solution or a workaround, please let me know. Thank you!
At most one of these three file names may be `-', which tells diff3 to read the standard input for that file.
It does not state that your file names must not contain dash symbols. It simply says that, if you want to, you can put - in place of one of the names, in which case standard input will be read instead of that file.
So, you can have as many dashes in your filenames as you like and diff3 should work just fine.
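For example, with three hypothetical file names that all contain dashes:
diff3 mine-edits.txt older-base.txt yours-edits.txt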
However, on Windows, wrapping file names in "" to escape space characters did not work for me, and I failed to find a suitable workaround. You can, however, automate the process of copying the files to temporary names (if the files are relatively small, this is not even too inefficient):
@echo off
rem copy the three input files to temporary names without spaces
copy %1 tempfile_1.txt
copy %2 tempfile_2.txt
copy %3 tempfile_3.txt
rem run diff3 on the temporary copies, then clean them up
"C:\Program Files (x86)\KDiff3\bin\diff3.exe" -E tempfile_1.txt tempfile_2.txt tempfile_3.txt
del tempfile_1.txt tempfile_2.txt tempfile_3.txt
Put this in a file like diff3.cmd, then run diff3.cmd "first file.txt" "second file.txt" "third file.txt".
P.S. Moving the files would be more efficient (if they are on the same disk volume as the script, which they are not in your case); you could even move them back to where they were initially, but for some time they would not be present in their original folder.

grabbing all .nc files from URL to get data using matlab

I'd like to get all .nc files from a URL and read the data using matlab. However, the file names are always very long and vary from file to file.
For instance, I have
url = 'http://sourcename/filename.nc'
The sourcename part is always the same, but the filename is very long and varies, so I would like to just use * to grab whatever .nc files are at the URL,
e.g.
url = 'http://sourcename/*.nc'
but this does not work and I am guessing I need to get the exact name - so I am not sure what to do here?
On the other hand, it would also be interesting to get the name of each file and record it, but I am not sure how to do that either.
Thanks a lot in advance!!
HTTP does not implement a filesystem abstraction. This means that each of the URLs you request could be handled in a completely different way. In many cases there is also no way to get a list of the available URLs under a parent (a directory listing, in other words).
It may be the case for you that http://sourcename/ actually returns an index document containing a list of the files. In that case, first fetch that document. Then you'll have to parse the contents to extract the list of files. Then you can loop over those files, form new URLs for each one, and fetch them in sequence.
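A rough sketch of that approach with shell tools, assuming the index page at http://sourcename/ links to the .nc files with relative href attributes (both assumptions need checking against the real server):
wget -qO- "http://sourcename/" \
  | grep -oE 'href="[^"]*\.nc"' \
  | sed 's/^href="//; s/"$//' \
  | tee nc_files.txt \
  | while read -r f; do
      wget "http://sourcename/$f"   # fetch each listed .nc file
    done
The tee step also records the extracted file names in nc_files.txt, which covers the second part of the question.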
If you have a list of the file names in a text file, you can use the wget utility to process the file and fetch all the listed files. This file would be formatted as follows:
http://url.com/file1.nc
http://url.com/file2.nc
(etc)
You would then invoke wget as follows:
$ wget -i url-file.txt
Alternatively, you may be able to use wget to fetch the files recursively, if they are all located in the same directory on the web server, e.g.:
$ wget -r -l1 http://url.com/directory
The -r flag says to recurse, the -l1 flag says to go no deeper than 1 level when recursing.
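If you only want the .nc files and don't need the server's directory structure recreated locally, wget's accept list and no-directories flags can narrow this down (the URL below is a placeholder):
$ wget -r -l1 -A "*.nc" -nd http://url.com/directory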
This solution is external to Matlab, but once you have all of the files downloaded, you can work with them all locally.
wget is a fairly standard utility available on Linux systems, and it is also available for OSX and Windows. The wget homepage is here: https://www.gnu.org/software/wget/

SAS- Reading multiple compressed data files

I hope you are all well.
So my question is about the procedure to open multiple raw data files that are compressed.
My file names are ordered, so I have for example: o_equities_20080528.tas.zip o_equities_20080529.tas.zip o_equities_20080530.tas.zip ...
Thank you all in advance.
How much work this will be depends on whether:
You have enough space to extract all the files simultaneously into one folder
You need to be able to keep track of which file each record has come from (i.e. you can't tell just from looking at a particular record).
If you have enough space to extract everything and you don't need to track which records came from which file, then the simplest option is to use a wildcard infile statement, allowing you to import the records from all of your files in one data step:
infile "c:\yourdir\o_equities_*.tas" <other infile options as per individual files>;
This syntax works regardless of OS - it's a SAS feature, not shell expansion.
If you have enough space to extract everything in advance but you need to keep track of which records came from each file, then please refer to this page for an example of how to do this using the filevar option on the infile statement:
http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm
If you don't have enough space to extract everything in advance, but you have access to 7-zip or another archive utility, and you don't need to keep track of which records came from each file, you can use a pipe filename and extract to standard output. If you're on a Linux platform then this is very simple, as you can take advantage of shell expansion:
filename cmd pipe "nice -n 19 gunzip -c /yourdir/o_equities_*.tas.zip";
infile cmd <other infile options as per individual files>;
On Windows it's the same sort of idea, but as you can't use shell expansion, you have to construct a separate filename for each zip file, or use some of 7-zip's more arcane command-line options, e.g.:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y";
This will extract all files from all of the matching archives to standard output. You can narrow this down further via the 7-zip command if necessary. You will have multiple header lines mixed in with the data - you can use findstr to filter these out in the pipe before SAS sees them, or you can just choose to tolerate the odd error message here and there.
Here, the -an tells 7-zip not to read the zip file name from the command line, and the -ai tells it to expand the wildcard.
If you need to keep track of what came from where and you can't extract everything at once, your best bet (as far as I know) is to write a macro that processes one file at a time using the above techniques, adding this information while you import each dataset.

zip recursively each file in a dir, where the name of the file has spaces in it

I am quite stuck; I need to compress the content of a folder, where I have multiple files (extension .dat). I went for shell scripting.
So far I told myself that it's not that hard: I just need to recursively read the content of the dir, get the name of each file, and zip it, using the name of the file itself.
This is what I wrote:
for i in *.dat; do zip $i".zip" $i; done
Now when I try it I get a weird behavior: each file is named something like "12/23/2012 data102 test1.dat", and when I run this sequence of commands, I see that zip, instead of recognizing the whole file name, treats each part of the string as a separate entity, causing the whole operation to fail.
I told myself that I was doing something wrong and that the i variable was wrong, so I replaced the zip command with echo (to see what the output of the i variable was), and the $i output is the full name of the file, not part of it.
I am totally clueless at this point about what is going on: when the variable i is read by zip, it sees each piece of the string separately instead of the whole thing, while if I use echo to see the content of that variable, I get the correct output.
Do I have to pass the value of the filename to zip in a different way? Since it is the content of a variable passed as a parameter, I assumed it wouldn't matter whether the string contains spaces, and I can't find the answer in the man page (if there is one in there).
Anyone knows why do I get this behavior and how to fix it? Thanks!
You need to quote anything with spaces in it.
zip "$i.zip" "$i"
Generally speaking, any variable interpolation should have double quotes unless you specifically require the shell to split it into multiple tokens. The internal field separator $IFS defaults to space, tab, and newline, but you can change it to make the shell do word splitting on arbitrary separators. See any decent beginners' shell tutorial for a detailed account of the shell's quoting mechanisms.
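Applied to the loop from the question, the quoted version is:
for i in *.dat; do zip "$i.zip" "$i"; done
echo appeared to work because the already-split words were simply printed back with single spaces between them, so the output happened to look like the original file name.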