Make wget download a file directly to disk from bash

On a website, after logging in with my credentials, I am able to download data by changing the URL to variations of this:
https://data.somewhere.com/DataDownload/getfile.jsp?ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download
This puts a zip file in my download directory.
If I try to automate it with wget using:
wget "https://data.somewhere.com/DataDownload/getfile.jsp?ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download" --no-check-certificate --ignore-length
$ ~/dnloadHotSpot.sh
--2014-03-22 16:05:16-- https://data.somewhere.com/DataDownload/getfile.jsp?ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download
Resolving data.somewhere.com (data.somewhere.com)... 209.191.250.173
Connecting to data.somewhere.com (data.somewhere.com)|209.191.250.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `getfile.jsp#ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download'
[ <=> ] 8,925 --.-K/s in 0.001s
2014-03-22 16:05:18 (14.4 MB/s) - `getfile.jsp#ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download' saved [8925]
What else do I need to add to make wget actually download the file?

If you want to specify the name of the output file into which wget writes the contents of the file it is downloading, use the capital -O parameter, something like:
wget -O myfilename ......
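As a rough illustration of picking the -O name, the sketch below derives a readable filename from the query string of the URL in the question and assembles the wget argument list. The function name build_wget_command and the CCY_DF_YEAR_MONTH.zip naming scheme are assumptions made up for this example. It only controls the output name; it does not supply any login cookies or credentials the site may require.

```python
from urllib.parse import urlsplit, parse_qs


def build_wget_command(url):
    """Return a wget argument list that saves `url` under a readable name.

    The naming scheme (ccy_df_year_month.zip) is an assumption for this sketch.
    """
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; take the first of each.
    parts = [query[key][0] for key in ("ccy", "df", "year", "month")]
    filename = "_".join(parts) + ".zip"
    return ["wget", "-O", filename, "--no-check-certificate", url]


url = ("https://data.somewhere.com/DataDownload/getfile.jsp"
       "?ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download")
command = build_wget_command(url)
print(command[2])  # AUDUSD_BBO_2014_02.zip
```

The resulting list can be passed to subprocess.run(command, check=True) from a script, which avoids shell quoting problems with the & characters in the URL.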

Cannot apply count() or collect() on RDD from textFile() (Spark)

I am new to Spark and I have a Databricks Community Edition account. I'm working through a lab and ran into the following error:
!rm README.md* -f
!wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md
textfile_rdd = sc.textFile("README.md")
textfile_rdd.count()
Output:
IllegalArgumentException: Path must be absolute: dbfs:/../dbfs/README.md
By default, wget downloads the file to /databricks/driver.
To be able to read it from Spark, you have to store it in the Databricks File System (DBFS), which you can do with wget's -P option. See the wget manual for reference.
It also seems that the !wget magic creates a file that is not available under the dbfs:/ path. On Databricks Community Edition, !wget leads to a file-not-found error, as you mentioned.
You can do the following in a %sh cell first:
%sh
rm README.md* -f
wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md -P /dbfs/downloads/
And then, in a second Python cell, you can access the file through the Files API (note the path starting with file:/):
textfile_rdd = sc.textFile("file:/dbfs/downloads/README.md")
textfile_rdd.count()
--2022-02-11 13:48:19-- https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3624 (3.5K) [text/plain]
Saving to: ‘/dbfs/FileStore/README.md.1’
README.md.1 100%[===================>] 3.54K --.-KB/s in 0.001s
2022-02-11 13:48:19 (4.10 MB/s) - ‘/dbfs/FileStore/README.md.1’ saved [3624/3624]
Out[25]: 98
This solution has been tested on Databricks Community Edition with the 7.1 LTS ML and 9.1 LTS ML Databricks Runtimes.
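To make the path convention explicit, here is a small sketch of a helper that turns the local path written by wget into the file:/ form used above. The helper name to_file_uri is hypothetical and not part of any Databricks API. It rejects relative paths up front, since a relative "README.md" is what led to the "Path must be absolute" error in the question.

```python
import posixpath


def to_file_uri(local_path):
    """Build a file:/ URI for sc.textFile() from an absolute driver-local path."""
    if not posixpath.isabs(local_path):
        raise ValueError("path must be absolute: " + local_path)
    # normpath collapses things like "/dbfs/downloads/../downloads".
    return "file:" + posixpath.normpath(local_path)


uri = to_file_uri("/dbfs/downloads/README.md")
print(uri)  # file:/dbfs/downloads/README.md
```

In a notebook cell you would then call sc.textFile(uri) instead of hard-coding the string.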

Getting HASH of individual files within folder uploaded to IPFS

When I upload a folder of .jpg files to IPFS, I get the HASH of that folder - which is cool.
But is each individual file in that folder also getting hashed?
And if so, how do I get the hash of each file?
I basically want to be able to upload a whole bunch of files - like 500 images - and do it all at once, or programmatically, and have the hash of each file be returned to me.
Any way to do this?
Yes! From the command line you get back the CID (Content IDentifier, aka IPFS hash) of each file added when you run ipfs add -r <path to directory>:
$ ipfs add -r gifs
added QmfBAEYhJp9ZjGvv8utB3Yv8uuuxsDKjv9rurkHRsYU3ih gifs/martian-iron-man.gif
added QmRBHTH3p4W2xAzgLxvdh8VJvAmWBgchwCr9G98EprwetE gifs/needs-more-dogs.gif
added QmZbffnCcV598QxsUy7WphXCAMZJULZAzy94tuFZzbFcdK gifs/satisfied-with-your-care.gif
added QmTxnmk85ESr97j2xLNFeVZW2Kk9FquhdswofchF8iDGFg gifs/stone-of-triumph.gif
added QmcN71Qh56oSg2YXsEXuf8o6u5CrBXbyYYzgMyAkdkcxxK gifs/thanks-dog.gif
added QmTnuLaivKc1Aj8LBf2iWBHDXsmedip3zSPbQcGi6BFwTC gifs
The root CID for the directory is always the last item in the list.
You can limit the output of that command to just the CIDs using the --quiet flag:
$ ipfs add -r gifs --quiet
QmfBAEYhJp9ZjGvv8utB3Yv8uuuxsDKjv9rurkHRsYU3ih
QmRBHTH3p4W2xAzgLxvdh8VJvAmWBgchwCr9G98EprwetE
QmZbffnCcV598QxsUy7WphXCAMZJULZAzy94tuFZzbFcdK
QmTxnmk85ESr97j2xLNFeVZW2Kk9FquhdswofchF8iDGFg
QmcN71Qh56oSg2YXsEXuf8o6u5CrBXbyYYzgMyAkdkcxxK
QmTnuLaivKc1Aj8LBf2iWBHDXsmedip3zSPbQcGi6BFwTC
Or, if you know the CID for a directory, you can list the files it contains and their individual CIDs with ipfs ls. Here I list the contents of the gifs directory from the previous example:
$ ipfs ls QmTnuLaivKc1Aj8LBf2iWBHDXsmedip3zSPbQcGi6BFwTC
QmfBAEYhJp9ZjGvv8utB3Yv8uuuxsDKjv9rurkHRsYU3ih 2252675 martian-iron-man.gif
QmRBHTH3p4W2xAzgLxvdh8VJvAmWBgchwCr9G98EprwetE 1233669 needs-more-dogs.gif
QmZbffnCcV598QxsUy7WphXCAMZJULZAzy94tuFZzbFcdK 1395067 satisfied-with-your-care.gif
QmTxnmk85ESr97j2xLNFeVZW2Kk9FquhdswofchF8iDGFg 1154617 stone-of-triumph.gif
QmcN71Qh56oSg2YXsEXuf8o6u5CrBXbyYYzgMyAkdkcxxK 2322454 thanks-dog.gif
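If you capture the text printed by ipfs add -r from a script, you can turn it into a path-to-CID mapping. The following is a minimal sketch that assumes the "added <CID> <path>" line shape shown in the output above; the function name parse_ipfs_add_output is made up for this example, and the sample text is a shortened copy of that output.

```python
def parse_ipfs_add_output(text):
    """Map each added path to its CID, given `ipfs add -r` output lines."""
    cids = {}
    for line in text.splitlines():
        # maxsplit=2 keeps paths that contain spaces in one piece.
        fields = line.split(maxsplit=2)
        # Expected shape: "added <CID> <path>"; skip anything else.
        if len(fields) == 3 and fields[0] == "added":
            cids[fields[2]] = fields[1]
    return cids


sample = """\
added QmfBAEYhJp9ZjGvv8utB3Yv8uuuxsDKjv9rurkHRsYU3ih gifs/martian-iron-man.gif
added QmRBHTH3p4W2xAzgLxvdh8VJvAmWBgchwCr9G98EprwetE gifs/needs-more-dogs.gif
added QmTnuLaivKc1Aj8LBf2iWBHDXsmedip3zSPbQcGi6BFwTC gifs
"""
cids = parse_ipfs_add_output(sample)
print(cids["gifs/needs-more-dogs.gif"])
```

The entry keyed by the bare directory name ("gifs" here) holds the root CID.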
You can do it programmatically with the core API in js-ipfs or go-ipfs. Here is an example of adding files from the local file system in Node.js using js-ipfs, from the docs for ipfs.addAll(files): https://github.com/ipfs/js-ipfs/blob/master/docs/core-api/FILES.md#importing-files-from-the-file-system
There is a super helpful video on how adding files to IPFS works over at https://www.youtube.com/watch?v=Z5zNPwMDYGg
And a walk through of js-ipfs here https://github.com/ipfs/js-ipfs/tree/master/examples/ipfs-101

opkg install error - wfopen no such file or directory

I have followed instructions to create an .ipk file, the Packages.gz and host them on a web server as a repo. I have set the opkg.conf in my other VM to point to this repo. The other VM is able to update and list the contents of repositories successfully.
But, when I try to install, I get this message. Can you please describe why I am getting this and what needs to be changed?
Collected errors:
* wfopen: /etc/repo/d1/something.py: No such file or directory
* wfopen: /etc/repo/d1/something-else.py: No such file or directory
While creating the .ipk, I created a folder named data containing the directory structure /etc/repo/d1/, with the file something.py stored in d1. I compressed that folder into data.tar.gz. Then, together with control.tar.gz and debian-binary, I created the .ipk.
I followed instructions from here:
http://bitsum.com/creating_ipk_packages.htm
http://www.jumpnowtek.com/yocto/Managing-a-private-opkg-repository.html
http://www.jumpnowtek.com/yocto/Using-your-build-workstation-as-a-remote-package-repository.html
It is very likely that the directory called /etc/repo/d1/ does not exist on the target system. If you create the folder manually, and try installing again, it probably will not fail. I'm not sure how to force opkg to create the empty directory by itself :/
Update:
You can solve this problem using a preinst script. Just create the missing directories in it, like this:
#!/bin/sh
mkdir -p /etc/repo/d1/
# always return 0 on success
exit 0
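If the package installs files into several directories, you could generate that preinst script from the list of installed paths rather than writing it by hand. This is only a sketch: make_preinst is a hypothetical helper, not part of opkg, and the path list is taken from the error messages in the question.

```python
import posixpath
import shlex


def make_preinst(file_paths):
    """Return preinst script text that creates every parent directory needed."""
    dirs = sorted({posixpath.dirname(path) for path in file_paths})
    lines = ["#!/bin/sh"]
    # shlex.quote protects directory names containing spaces or shell characters.
    lines += ["mkdir -p " + shlex.quote(d) for d in dirs]
    lines.append("exit 0")
    return "\n".join(lines) + "\n"


script = make_preinst([
    "/etc/repo/d1/something.py",
    "/etc/repo/d1/something-else.py",
])
print(script)
```

Write the returned text to a file named preinst, make it executable, and place it in the control archive as described above.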

wget files from FTP-like listings

So, a site that used to offer FTP now has an HTTP front end and won't allow FTP connections. The site in question (for an example directory) shows a page with links to different dates. Inside each of these dates there are many files, and I typically just need the files matching some clear pattern, e.g. *h17v04*.hdf. I thought this could work:
wget -I "${PLATFORM}/${PRODUCT}/${YEAR}.*" -r -l 4 \
--user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
--verbose -c -np -nc -nd \
-A "*h17v04*.hdf" http://e4ftl01.cr.usgs.gov/$PLATFORM/$PRODUCT/
where PLATFORM=MOLT, PRODUCT=MOD09GA.005 and YEAR=2004, for example. This seems to start looking into all the useful dates, finds the index.html, and then just skips to the next directory, without downloading the relevant hdf file:
--2013-06-14 13:09:18-- http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/
Reusing existing connection to e4ftl01.cr.usgs.gov:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html'
[ <=> ] 174,182 134K/s in 1.3s
2013-06-14 13:09:20 (134 KB/s) - `e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html' saved [174182]
Removing e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html since it should be rejected.
--2013-06-14 13:09:20-- http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.02/
[...]
If I ignore the -A option, only the index.html file is downloaded to my system, but it appears it's not parsed and the links are not followed. I don't really know what more is required to make this work, as I can't see why it doesn't!!!
SOLUTION
In the end, the problem was due to an old bug in the local version of wget. However, I ended up writing my own script for downloading MODIS data from the server above. The script is pure Python, and is available from here.
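For anyone going the script route, the core step is small: parse the directory's index page, keep the links that match the pattern, and download those. Below is a minimal sketch of the filtering step using only the standard library. It is not the script mentioned above; the class and function names are made up, and the file names in the sample HTML are invented to show the matching.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def matching_urls(index_html, base_url, pattern):
    """Return absolute URLs of links whose href matches a glob pattern."""
    parser = LinkCollector()
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.links
            if fnmatchcase(href, pattern)]


# Invented listing; a real index page would be fetched from the server.
sample_index = """
<a href="MOD09GA.A2004001.h17v04.005.hdf">tile</a>
<a href="MOD09GA.A2004001.h18v04.005.hdf">other tile</a>
<a href="MOD09GA.A2004001.h17v04.005.hdf.xml">metadata</a>
"""
base = "http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/"
urls = matching_urls(sample_index, base, "*h17v04*.hdf")
print(urls)
```

Each returned URL can then be fetched with urllib.request.urlretrieve or handed to wget one at a time.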
Consider using pyModis instead of wget. It is a free and open source Python library for working with MODIS data. It offers bulk download for user-selected time ranges, mosaicking of MODIS tiles, reprojection from Sinusoidal to other projections, and conversion from HDF to other formats. See
http://www.pymodis.org/

Why does my REST request return garbage data?

I am trying to use LWP::Simple to make a GET request to a REST service. Here's the simple code:
use LWP::Simple;
$uri = "http://api.stackoverflow.com/0.8/questions/tagged/php";
$jsonresponse= get $uri;
print $jsonresponse;
On my local machine, running Ubuntu 10.4, and Perl version 5.10.1:
farhan#farhan-lnx:~$ perl --version
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
I can get the correct response and have it printed on the screen. E.g.:
farhan#farhan-lnx:~$ head -10 output.txt
{
"total": 1000,
"page": 1,
"pagesize": 30,
"questions": [
{
"tags": [
"php",
"arrays",
"coding-style"
(... snipped ...)
But on my host's machine, which I SSH into, I get garbage printed on the screen for the exact same code. I am assuming it has something to do with the encoding, but the REST service does not return the character set in the response, so how do I force LWP::Simple to use the correct encoding? Any ideas what may be going on here?
Here's the version of Perl on my host's machine:
[dredd]$ perl --version
This is perl, v5.8.8 built for x86_64-linux-gnu-thread-multi
I happen to have a 64 bit RHEL 5.4 box which has Perl 5.8.8 on it. I took your code and got the exact same result. I tried using Data::Dumper to dump the data, but that didn't change anything. I then went to the command line and did this:
wget -O jsonfile http://api.stackoverflow.com/0.8/questions/tagged/php
--2010-05-26 11:42:41-- http://api.stackoverflow.com/0.8/questions/tagged/php
Resolving api.stackoverflow.com... 69.59.196.211
Connecting to api.stackoverflow.com|69.59.196.211|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5430 (5.3K) [application/json]
Saving to: `jsonfile'
2010-05-26 11:42:42 (56.9 KB/s) - `jsonfile' saved [5430/5430]
When I did this:
file jsonfile
I got:
jsonfile: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT), max speed
So, the JSON data was gzipped by the web server. I tried this:
gzip -dc jsonfile
and lo and behold the results are the JSON data as you would expect.
What you can do now is either use another module to gunzip the data, or check out this other thread, which shows how to accept gzip using LWP::UserAgent and handle the request that way.
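The check that file performed above is easy to do in code: gzip data starts with the two magic bytes 0x1f 0x8b. Here is a language-neutral sketch of the idea, written in Python rather than Perl; the function name decode_json_body and the sample payload are made up for illustration.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def decode_json_body(body):
    """Parse a JSON response body, gunzipping it first if it is compressed."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


# Simulate a server that sends the same JSON compressed and uncompressed.
payload = {"total": 1000, "page": 1, "pagesize": 30}
raw = json.dumps(payload).encode("utf-8")
print(decode_json_body(gzip.compress(raw)) == payload)  # True
print(decode_json_body(raw) == payload)  # True
```

Sniffing the magic bytes is a fallback for servers that compress without being asked; when the response carries a Content-Encoding: gzip header, checking that header is the more direct test.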
This is bug 44435. Upgrade libwww-perl to version 5.827 or better.