wget: how to download a link and its sub-links?

I use wget to download data from a website and save it as an HTML file. The data is presented in tabular form; the table consists of three columns: id_sales, sales_name, and number_of_buyers. Clicking the number in the number_of_buyers column displays detailed data. I want to download both the table and the detail data. To be able to see the data, I have to log in first.
My script:
@echo off
set office_id=613
set userid=123456
set password=p#ssw0rd
set save_cookies=cookies\cookies.txt
wget --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" --post-data="username=%userid%&password=%password%&sublogin=Login" --save-cookies=%save_cookies% --keep-session-cookies http://app/login/login/loging_simpel
wget --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" -r -E -nd --load-cookies=%save_cookies% --save-cookies=%save_cookies% --keep-session-cookies "http://app/portal/credit/result.php?office_id=%office_id%&years=2013"
pause
The script above downloads only the data table; the detail pages are not downloaded. Please help me correct this script. Thank you very much.

Lynx or links could be used to transform the HTML table into text, but ideally I would suggest investing the time to write this with Scrapy.
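If you want to stay with wget, one rough approach is to fetch the results page, pull the detail links out of its HTML, and fetch each one with the saved session cookies. Below is a minimal sketch in POSIX shell; the grep pattern and the detail-link layout are assumptions, so take the real link format from the page source:
# Sketch only: log in, save the session cookie, fetch the results page,
# then fetch every detail link found in it.
wget --post-data="username=$USERID&password=$PASSWORD&sublogin=Login" \
     --save-cookies=cookies.txt --keep-session-cookies \
     -O /dev/null http://app/login/login/loging_simpel
wget --load-cookies=cookies.txt -O result.html \
     "http://app/portal/credit/result.php?office_id=613&years=2013"
# The href pattern below is hypothetical; inspect result.html first.
grep -o 'href="[^"]*detail[^"]*"' result.html | sed 's/^href="//;s/"$//' |
while read -r link; do
    wget --load-cookies=cookies.txt "http://app/portal/credit/$link"
done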


How to load multiple OSM files into Nominatim

I need to figure out the process for loading multiple OSM files into a Nominatim database. I have everything set up and can load a single file with no issues.
Basically, what I'm trying to do is load some of the Geofabrik OSM extracts for only part of the world, so I'm grabbing, say, the North America and South America files, or any two on their site.
For the first load I use the setup.php:
./utils/setup.php --osm-file file.osm --all --osm2pgsql-cache 4000
I'm not sure, if I have another file (file2.osm), how to load it into the database while keeping the original data.
Basically, I just want pieces of the world, and I only need to load data every six months or so. I don't need daily updates, etc.
I need to split the files up because a full load just takes too long, and I want to manage it better.
Can I use update.php? If so, I'm not sure what parameters to use.
I thought about loading all the data with update and the no-index clause, then maybe building the index afterwards?
I did try re-running setup.php for the second file, but it just hung for a long time.
For the second file:
./utils/setup.php --import-data --osm-file file2.osm --osm2pgsql-cache 4000
But this just hangs on "Setting up table: planet_osm_ways". (I tested very small OSM files that should finish within minutes, but it still hangs.)
The files I'm using are all non-intersecting, so they're not truly updates. So I have a North America file and a South America file: how do I load both into Nominatim separately?
Thanks
The answer can be found at help.openstreetmap.org.
First you need to import the new file via the update script, then trigger a re-indexing of the data:
./utils/update.php --import-file <yourfile>
./utils/update.php --index
But according to lonvia (one of the Nominatim developers), this will be very slow, and it is better to merge all your files first and then import them as one large file.
Sample merging code, combining Andorra, Malta, and Liechtenstein:
curl -L 'http://download.geofabrik.de/europe/andorra-latest.osm.pbf' --create-dirs -o /srv/nominatim/src/andorra.osm.pbf
curl -L 'http://download.geofabrik.de/europe/malta-latest.osm.pbf' --create-dirs -o /srv/nominatim/src/malta.osm.pbf
curl -L 'http://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf' --create-dirs -o /srv/nominatim/src/liechtenstein.osm.pbf
osmconvert /srv/nominatim/src/andorra.osm.pbf -o=/srv/nominatim/src/andorra.o5m
osmconvert /srv/nominatim/src/malta.osm.pbf -o=/srv/nominatim/src/malta.o5m
osmconvert /srv/nominatim/src/liechtenstein.osm.pbf -o=/srv/nominatim/src/liechtenstein.o5m
osmconvert /srv/nominatim/src/andorra.o5m /srv/nominatim/src/malta.o5m /srv/nominatim/src/liechtenstein.o5m -o=/srv/nominatim/src/data.o5m
osmconvert /srv/nominatim/src/data.o5m -o=/srv/nominatim/src/data.osm.pbf
More about osmconvert: https://wiki.openstreetmap.org/wiki/Osmconvert
Once merged, you can run:
sudo -u nominatim /srv/Nominatim/build/utils/setup.php \
    --osm-file /srv/nominatim/src/data.osm.pbf \
    --all \
    --threads ${BUILD_THREADS} \
    --osm2pgsql-cache ${OSM2PGSQL_CACHE}
# e.g. BUILD_THREADS=16 and OSM2PGSQL_CACHE=24000

make wget download a file directly to disk from bash

On a website, after logging in with my credentials, I am able to download data by changing the URL to variations of this:
https://data.somewhere.com/DataDownload/getfile.jsp?ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download
This puts a zip file in my download directory.
If I try to automate it with wget using:
wget "https://data.somewhere.com/DataDownload/getfile.jsp?ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download" --no-check-certificate --ignore-length
$ ~/dnloadHotSpot.sh
--2014-03-22 16:05:16-- https://data.somewhere.com/DataDownload/getfile.jsp?ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download
Resolving data.somewhere.com (data.somewhere.com)... 209.191.250.173
Connecting to data.somewhere.com (data.somewhere.com)|209.191.250.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `getfile.jsp#ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download'
[ <=> ] 8,925 --.-K/s in 0.001s
2014-03-22 16:05:18 (14.4 MB/s) - `getfile.jsp#ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download' saved [8925]
What else do I need to add to make wget actually download the file?
If you want to specify the name of the output file into which wget places the contents of the file it is downloading, use the capital -O option, something like:
wget -O myfilename ......
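Applied to the URL from the question, that would look something like the following; the output filename here is just an example, and the URL must stay quoted so the shell doesn't interpret the & characters:
wget -O AUDUSD_2014_02.zip --no-check-certificate \
    "https://data.somewhere.com/DataDownload/getfile.jsp?ccy=AUDUSD&df=BBO&year=2014&month=02&dllater=Download"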

wget files from FTP-like listings

So, a site that used to offer FTP now has an HTTP front-end and won't allow FTP connections. The site in question (for an example directory) shows a page with links to different dates. Inside each of these date directories there are many files, and I typically just need some files matching a clear pattern, e.g. *h17v04*.hdf. I thought this could work:
wget -I "${PLATFORM}/${PRODUCT}/${YEAR}.*" -r -l 4 \
--user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
--verbose -c -np -nc -nd \
-A "*h17v04*.hdf" http://e4ftl01.cr.usgs.gov/$PLATFORM/$PRODUCT/
where PLATFORM=MOLT, PRODUCT=MOD09GA.005, and YEAR=2004, for example. This seems to start looking into all the relevant dates, finds each index.html, and then just skips to the next directory without downloading the relevant HDF file:
--2013-06-14 13:09:18-- http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/
Reusing existing connection to e4ftl01.cr.usgs.gov:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html'
[ <=> ] 174,182 134K/s in 1.3s
2013-06-14 13:09:20 (134 KB/s) - `e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html' saved [174182]
Removing e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html since it should be rejected.
--2013-06-14 13:09:20-- http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.02/
[...]
If I leave out the -A option, only the index.html file is downloaded to my system, but it appears it's not parsed and the links are not followed. I don't really know what more is required to make this work, as I can't see why it doesn't!
SOLUTION
In the end, the problem was due to an old bug in the local version of wget. However, I ended up writing my own script for downloading MODIS data from the server above. The script is pure Python, and is available from here.
Consider using pyModis instead of wget. It is a free and open-source Python library for working with MODIS data: it offers bulk download for user-selected time ranges, mosaicking of MODIS tiles, reprojection from Sinusoidal to other projections, and conversion from HDF to other formats. See
http://www.pymodis.org/
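pyModis also installs command-line tools, so a bulk download can stay in the shell. Here is a sketch using the values from the question; the flag spellings are taken from the pyModis documentation as I remember it, so verify them with modis_download.py --help before relying on them:
# Hypothetical invocation: fetch all MOD09GA.005 granules for tile h17v04
# over 2004 into ./modis_data (-s server path, -p product, -t tiles,
# -f/-e first/last date).
modis_download.py -s MOLT -p MOD09GA.005 -t h17v04 \
    -f 2004-01-01 -e 2004-12-31 ./modis_data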

Uploading zip file works in curl, but not in PowerShell - why?

I'm switching from regular files to zip files for an upload, and was told I'd need to use a header in this format: Content-Type: application/zip.
So, I can get my file to upload properly via curl with the following:
curl --verbose --header "Content-Type: application/zip" --data-binary @C:\Junk\test.zip "http://client.xyz.com/submit?username=test@test.com&password=testpassword&job=test"
However, when I write a simple PowerShell script to do the same thing, I run into problems: the data isn't loaded. I don't know how to get a good error message returned, so I don't know the details, but the bottom line is that the data isn't getting in.
$FullFileName = "C:\Junk\test.zip"
$wc = new-object System.Net.WebClient -Verbose
$wc.Headers.Add("Content-Type: application/zip")
$URL = "http://client.xyz.com/submit?username=test@test.com&password=testpassword&job=test"
$wc.UploadFile( $URL, $FullFileName )
# $wc.UploadData( $URL, $FullFileName )
I've tried using UploadData instead of UploadFile, but that doesn't appear to work either.
Thanks,
Sylvia
I don't necessarily have a solution, but I think the issue is that you are trying to upload a binary file using the WebClient object. You most likely need the UploadData method, but I think you are going to have to read the zip file into an array of bytes to upload. That I'm not sure of off the top of my head.
If you haven't, be sure to look at the MSDN docs for this class and its methods: http://msdn.microsoft.com/en-us/library/system.net.webclient_methods.aspx
Now that I look at it again, I think you need $wc.Headers.Add("Content-Type", "application/zip"), because the collection is key/value paired. Check out this SO question:
WebClient set headers
Also, if you're still having issues, you might need to add a user-agent header. I think curl has its own.
$userAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2;)"
$wc.Headers.Add("user-agent", $userAgent)

Why doesn't rrdtool generate any PNG output in my Perl CGI program?

I'm trying to output an image from RRD Tool using Perl. I've posted the relevant part of the CGI script below:
sub graph {
    my $rrd_path = $co->param('rrd_path');
    my $RRD_DIR  = "../data/";

    # generate a PNG from the RRD; '-' as the filename sends the PNG to stdout
    my $png_filename = "-";
    my $rrd = "$RRD_DIR/$rrd_path";
    my $png = `rrdtool graph $png_filename -a PNG -r -l 0 --base 1024 --start -151200 --vertical-label 'bits per second' --width 500 --height 200 DEF:bytesInPerSec=$rrd:bytesInPerSec:AVERAGE DEF:bytesOutPerSec=$rrd:bytesOutPerSec:AVERAGE CDEF:sbytesInPerSec=bytesInPerSec,8,* CDEF:sbytesOutPerSec=bytesOutPerSec,8,* AREA:sbytesInPerSec#00cf00:AvgIn LINE1:sbytesOutPerSec#002a97:AvgOut VRULE:1246428000#ff0000:`;

    # print the image header, then the raw PNG bytes
    use bytes;
    print $co->header(-type => "image/png", -Content_length => length($png));
    binmode STDOUT;
    print $png;
} # end graph
This works fine on the command line (perl graph.cgi > test.png, commenting out the header, of course), as well as on my Ubuntu 10.04 development machine. However, when I move it to the CentOS 5 production server, it doesn't, and the browser receives a content length of 0:
Ubuntu 10.04/Apache:
Request URL:http://noc-student.nmsu.edu/grasshopper/web/graph.cgi
Request Method:GET
Status Code:200 OK
Request Headers
Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Cache-Control:max-age=0
User-Agent:Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.36 Safari/534.7
Response Headers
Connection:Keep-Alive
Content-Type:image/png
Content-length:12319
Date:Fri, 08 Oct 2010 21:40:05 GMT
Keep-Alive:timeout=15, max=97
Server:Apache/2.2.14 (Ubuntu)
And from the CentOS 5/Apache server:
Request URL:http://grasshopper-new.nmsu.edu/grasshopper/branches/michael_dev/web/graph.cgi
Request Method:GET
Status Code:200 OK
Request Headers
Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Cache-Control:max-age=0
User-Agent:Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.36 Safari/534.7
Response Headers
Connection:close
Content-Type:image/png
Content-length:0
Date:Fri, 08 Oct 2010 21:40:32 GMT
Server:Apache/2.2.3 (CentOS)
The use bytes and the manual setting of the content length are in there to try to fix the problem, but it's the same without them. Same with setting binmode on STDOUT. The script works fine from the command line on both machines.
See my answer to "How can I troubleshoot my Perl CGI program?". Typically, the difference between running your program on the command line and from the web server is a matter of different environments. In this case I'd expect that either rrdtool is not in the path or the web server user can't run it.
The backticks only capture standard output. There is probably some standard error output in the web server error log.
Are you sure your web user has access to your data? Try having the CGI write the PNG to the filesystem, so you can make sure it's generated properly. If it is, the problem is in the transmission (headers, encodings, etc.); if not, it's unrelated to the web server, and probably related to permissions.
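A quick shell-level way to test both hypotheses, assuming the web server runs as user apache (typical on CentOS; the user name and log path are assumptions for illustration):
# Re-run the same graph command as the web server's user, writing to a file
# instead of stdout (use the same arguments as in the script):
sudo -u apache rrdtool graph /tmp/test.png -a PNG ...
# Backticks capture only stdout; any stderr from rrdtool lands in the error log:
tail -n 50 /var/log/httpd/error_log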