wget download for offline viewing including absolute references - wget

I'm trying to download an entire webpage using the following command
wget -p -k www.myspace.com/
This does download the page and any images or scripts under that directory, but I'm trying to figure out how to download that page for completely offline viewing. How would I get every image, script, and style sheet linked within the source for www.myspace.com including external links?

wget -e robots=off -H -p -k http://www.myspace.com/
The -H or --span-hosts flag is necessary for a complete mirror, as the page is likely to include content on hosts outside the www.myspace.com domain. Ignore robots for good measure.

wget -mk http://www.myspace.com/
works for me. I am not sure about myspace or whatever site you are trying to mirror specifically, but sometimes you have to pass in some other options to get around the no-robots policy. I am not going to say how to do that, because it means you are doing something you shouldn't be doing, although it is definitely possible.
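For reference, the flags from the two answers can be combined. A sketch (not tested against myspace specifically):
wget -e robots=off -H -m -k -p http://www.myspace.com/
Here -m turns on mirroring (recursive download with timestamping), -p grabs page requisites such as images and stylesheets, -k rewrites links for local viewing, and -H allows spanning to other hosts.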


wget does not use option in .wgetrc

I'm trying to specify a specific CA file for use with a proxy. When I use wget --ca-certificate=file.cer, it works fine. But when I try to put ca_certificate = file.cer in $HOME/.wgetrc, it doesn't work and I get the following error:
Unable to locally verify the issuer's authority.
The docs say that these should both do the same thing, so I don't know what is causing the difference.
I'm on SLES 15 SP1 and using GNU Wget 1.20.3.
According to the Wgetrc Location section of the manual:
If the environmental variable WGETRC is set, Wget will try to load
that file. Failing that, no further attempts will be made.
If WGETRC is not set, Wget will try to load $HOME/.wgetrc.
So the first thing to check is whether WGETRC is set. If it is set to something other than $HOME/.wgetrc, then wget will not load the latter.
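A quick check in the shell (assuming bash or a similar shell):
echo "$WGETRC"
If that prints a path, that file is the one wget reads and $HOME/.wgetrc is ignored.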
As for "what is causing the difference": I am not sure relative to which directory wget looks for the file, so I would try using an absolute path rather than a relative one.
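For example, a minimal $HOME/.wgetrc with an absolute path (the path below is only a placeholder) would be:
ca_certificate = /home/youruser/certs/file.cer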

How do I set a default cache-control for new images uploaded to buckets on google storage

I know that you can run a command at upload to set the cache-control of the image being uploaded
gsutil -h "Cache-Control:public,max-age=2628000" cp -a public-read \
-r html gs://bucket
But I'm using CarrierWave in Rails and don't think it's possible to set it up to run this command each time an image is uploaded.
I was looking around to see if you can change the default cache-control value but can't find any solutions. Currently I run gsutil -m setmeta -h "Cache-Control:public, max-age=2628000" gs://bucket/*.png every now and then to update new images, but this is a horrible solution.
Any ideas on how to set the default cache-control for files uploaded to a bucket?
There's no way to set a default Cache-Control header on newly uploaded files. It either needs to be set explicitly (by setting the header) at the time the object is written, or after the upload by updating the object's metadata using something like the gsutil command you noted.
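If you can control the upload command at all, setting the header at write time looks like this (bucket and object names here are placeholders):
gsutil -h "Cache-Control:public, max-age=2628000" cp -a public-read image.png gs://your-bucket/image.png
Otherwise, a periodic setmeta pass like the one in the question remains the fallback.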

osm tile server renderd command not found

I'm using this set of instructions to build an OSM tile server (on Ubuntu 14.04).
When I run sudo -u my_username renderd -f -c /usr/local/etc/renderd.conf, the terminal reports renderd: command not found.
Any ideas why this would be? I have everything in the instructions up to this point working, and I don't see a note on how exactly to install renderd; it's just part of mod_tile. I thought about working around the issue by running the renderd.py file and supplying the path to my renderd.conf file, but then I hit more problems, as the OSMBright.xml file references fonts Mapnik can't find, despite setting all the font dirs correctly... maybe more on that later.
For now I'd be grateful if anyone can shed light on why my install can't find the renderd command.
Solved it. The instructions are missing a line after the make step for mod_tile: there should be a make renderd command too. That way the binary for renderd is actually generated and will respond.
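For reference, the build sequence from the mod_tile source directory would then look roughly like this (paths depend on where you checked out mod_tile):
cd mod_tile
./autogen.sh
./configure
make
make renderd
sudo make install
After that the renderd binary should be installed (typically under /usr/local/bin).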
Try to find the correct path to renderd, e.g. /usr/local/bin/renderd.
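For example:
which renderd
or, if it is not on your PATH:
sudo find / -name renderd -type f 2>/dev/null
and then invoke it by its full path.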

Increase image upload limit

Whenever I try to upload an image larger than 125 KB, I receive an Upload HTTP Error. How can I increase this limit so I can upload hi-res images?
Thank you,
FD
This has nothing to do with Magento and everything to do with your server settings.
You will likely have to bump up post_max_size and upload_max_filesize in your php.ini
If you're running NGINX, you may also have to increase client_max_body_size.
Please note, however, that settings and restrictions can vary greatly from one hosting environment to the next. If you're not sure how to alter the config files properly or do not have the necessary access to do so - then you may have to contact your hosting provider and ask them to do it for you.
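As a rough sketch (the values here are only examples, pick whatever limit you need), the php.ini entries would be:
post_max_size = 20M
upload_max_filesize = 20M
and, for NGINX, in the relevant server or http block:
client_max_body_size 20M;
Reload PHP/Apache (and NGINX, if used) afterwards so the new limits take effect.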
First of all, make sure that you have the correct permissions for your media dir, using the command line:
sudo chmod -R 775 [magento_root]/media
If it doesn't help, try to check your php config:
php -i | egrep 'upload_max_filesize|post_max_size|memory_limit'
If you see small values there, you probably need to change these limits by editing your php.ini file. You can find this file by running the following command:
php -i | grep php.ini
Also, do not forget to restart your Apache/PHP servers after making config changes. Usually you can do it by running:
sudo /etc/init.d/apache2 restart
or
sudo service apache2 restart
Also, I have noticed that sometimes mod_security can cause this kind of issue. Check your [magento_root]/.htaccess file for the following configuration and add it if it's absent:
<IfModule mod_security.c>
SecFilterEngine Off
SecFilterScanPOST Off
</IfModule>
And the last thing: try to upload images from another browser/computer. Magento has a Flash uploader for product images, and there have been cases where the Flash player caused similar issues on some computers.
You have to change both post_max_size and upload_max_filesize in the php.ini
And don’t forget to restart your server afterwards.

wget giving error when downloading certain files

I am facing a problem in downloading some documents programmatically.
For example this link
https://www-950.ibm.com/events/wwe/grp/grp019.nsf/vLookupPDFs/Introduction_to_Storwize_V7000_Unified_T3/$file/Introduction_to_Storwize_V7000_Unified_T3.pdf
can be downloaded from browser, but when I try to get it from wget it doesn't work.
I have tried
wget https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/3-Mobile%20Platform%20--%20Truty%20--%20March%208%202012/\$file/3-Mobile%20Platform%20--%20Truty%20--%20March%208%202012.pdf
It gave me this output
--2012-04-18 17:09:42--
https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/3-Mobile%20Platform%20--%20Truty%20--%20March%208%202012/$file/3-Mobile%20Platform%20--%20Truty%20--%20March%208%202012.pdf
Resolving www-950.ibm.com... 216.208.176.98
Connecting to www-950.ibm.com|216.208.176.98|:443... connected.
Unable to establish SSL connection.
Can anyone help me solve this problem? Thanks in advance.
Add the --no-check-certificate to your original wget command.
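That is, reusing the URL from the question:
wget --no-check-certificate https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/3-Mobile%20Platform%20--%20Truty%20--%20March%208%202012/\$file/3-Mobile%20Platform%20--%20Truty%20--%20March%208%202012.pdf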
Plus, if your network sits behind a proxy, you need to ensure wget is using it.
On Linux:
export http_proxy=http://myproxyserver.com:8080
On Windows:
set http_proxy=http://myproxyserver.com:8080
I also found that on Windows, because this is an https request, I had to set https_proxy as well in order to make it work. So:
set https_proxy=http://myproxyserver.com:8080
Obviously, change the proxy settings to suit your particular situation.
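So on Linux the whole sequence would look something like this (the proxy host and port are placeholders, and the proxy exports are only needed if your network goes through one):
export http_proxy=http://myproxyserver.com:8080
export https_proxy=http://myproxyserver.com:8080
wget --no-check-certificate <url-from-the-question>
where <url-from-the-question> stands for the full PDF URL shown above.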