How do I copy symbolic links between servers? - webserver

We are moving web servers. (LAMP)
Webserver 1 has hundreds of symbolic links pointing to files in a different directory (eg. ../../files/001.png). When we move over to the new server (have downloaded the site to my computer then reuploading to Webserver 2 using SFTP client Transmit). It does not copy the symbolic links...
Is there a better way to get the symbolic links from one server to another apart from recreating them on the new server?

rsync -a from one server to another will preserve file attributes and symlinks.
rsync -av user#server:/path/to/source user#server2:/path/to/target

Something like the following? On Webserver 1:
tar czf - the_directory | (ssh Webserver2 "cd /path/to/wherever && tar xzf -")
This creates a tar of the stuff to copy to STDOUT, and pipes it to an untar on the other server over ssh. It can be faster than a recursive ssh copy too.

Related

Copying from Windows to Linux without copying the entire path

I am using Cygwin to copy all the files in a local windows folder to an EC2 linux instance inside of powershell. When I attempt to copy all the files in a folder, it copies the pathname as a folder:
\cygwin64\bin\scp.exe -i "C:\cygwin64\home\Ken\ken-key-pair.pem" -vr \git\configs\configs_test\ ec2-user#ec2-22-75-189-18.compute-1.amazonaws.com:/var/www/html/temp4configs/
will copy the correct files, but incorrectly include the path in a Windows format a path like:
/var/www/html/temp4configs/\git\configs\configs_test/file.php
I have tried an asterisk after the folder without the -r, such as:
\cygwin64\bin\scp.exe -i "C:\cygwin64\home\Ken\ken-key-pair.pem" -v \git\configs\configs_test\* ec2-user#ec2-22-75-189-18.compute-1.amazonaws.com:/var/www/html/temp4configs/
but that will return an error such as
"gitconfigsconfigs_test*: No such file or directory"
What can I do to copy the files without copying the path?
Thanks
When using cygwin programs is safer to use POSIX Path, and most of the time is the only way. To covert from windows to posix PATH use cygpath
$ cygpath -u "C:\cygwin64\home\Ken\ken-key-pair.pem"
/home/Ken/ken-key-pair.pem
$ cygpath -u "C:\git\configs\configs_test\ "
/cygdrive/c/git/configs/configs_test/
Using the windows one, will cause the server to misunderstand the client request

How to download multiple files in Google Cloud Storage

Scenario: there are multiple folders and many files stored in storage bucket that is accessible by project team members. Instead of downloading individual files one at a time (which is very slow and time consuming), is there a way to download entire folders? Or at least multiple files at once? Is this possible without having to use one of the command consoles? Some of the team members are not tech savvy and need to access these files as simple as possible. Thank you for any help!
I would suggest downloading the files with gsutil. However if you have a large number of files to transfer you might want to use the gsutil -m option, to perform a parallel (multi-threaded/multi-processing) copy:
gsutil -m cp -R gs://your-bucket .
The time reduction for downloading the files can be quite significant. See this Cloud Storage documentation for complete information on the GCS cp command.
If you want to copy into a particular directory, note that the directory must exist first, as gsutils won't create it automatically. (e.g: mkdir my-bucket-local-copy && gsutil -m cp -r gs://your-bucket my-bucket-local-copy)
I recommend they use gsutil. GCS's API deals with only one object at a time. However, its command-line utility, gsutil, is more than happy to download a bunch of objects in parallel, though. Downloading an entire GCS "folder" with gsutil is pretty simple:
$> gsutil cp -r gs://my-bucket/remoteDirectory localDirectory
To download files to local machine need to:
install gsutil to local machine
run Google Cloud SDK Shell
run the command like this (example, for Windows-platform):
gsutil -m cp -r gs://source_folder_path "%userprofile%/Downloads"
gsutil rsync -d -r gs://bucketName .
works for me

gsutil rsync with gzip compression

I'm hosting publicly available static resources in a google storage bucket, and I want to use the gsutil rsync command to sync our local version to the bucket, saving bandwidth and time. Part of our build process is to pre-gzip these resources, but gsutil rsync has no way to set the Content-Encoding header. This means we must run gsutil rsync, then immediately run gsutil setmeta to set headers on all the of gzipped file types. This leaves the bucket in a BAD state until that header is set. Another option is to use gsutil cp, passing the -z option, but this requires us to re-upload the entire directory structure every time, and this includes a LOT of image files and other non-gzipped resources that wastes time and bandwidth.
Is there an atomic way to accomplish the rsync and set proper Content-Encoding headers?
Assuming you're starting with gzipped source files in source-dir you can do:
gsutil -h content-encoding:gzip rsync -r source-dir gs://your-bucket
Note: If you do this and then run rsync in the reverse direction it will decompress and copy all the objects back down:
gsutil rsync -r gs://your-bucket source-dir
which may not be what you want to happen. Basically, the safest way to use rsync is to simply synchronize objects as-is between source and destination, and not try to set content encodings on the objects.
I'm not completely answering the question but I came here as I was wondering the same thing trying to achieve the following:
how to deploy efficiently a static website to google cloud storage
I was able to find an optimized way for deploying my static web site from a local folder to a gs bucket
Split my local folder into 2 folders with the same hierarchy, one containing the content to be gzip (html,css,js...), the other the other files
Gzip each file in my gzip folder (in place)
Call gsutil rsync in for each folder to the same gs destination
Of course, it is only a one way synchronization and deleted local files are not deleted remotely
For the gzip folder the command is
gsutil -m -h Content-Encoding:gzip rsync -c -r src/gzip gs://dst
forcing the content encoding to be gzippped
For the other folder the command is
gsutil -m rsync -c -r src/none gs://dst
the -m option is used for parallel optimization. The -c option is needed to force using checksum validation (Why is gsutil rsync re-downloading all our files?) as I was touching each local file in my build process. the -r option is used for recursivity.
I even wrote a script for it (in dart): http://tekhoow.blogspot.fr/2016/10/deploying-static-website-efficiently-on.html

Why does wget only download the index.html for some websites?

I'm trying to use wget command:
wget -p http://www.example.com
to fetch all the files on the main page. For some websites it works but in most of the cases, it only download the index.html. I've tried the wget -r command but it doesn't work. Any one knows how to fetch all the files on a page, or just give me a list of files and corresponding urls on the page?
Wget is also able to download an entire website. But because this can put a heavy load upon the server, wget will obey the robots.txt file.
wget -r -p http://www.example.com
The -p parameter tells wget to include all files, including images. This will mean that all of the HTML files will look how they should do.
So what if you don't want wget to obey by the robots.txt file? You can simply add -e robots=off to the command like this:
wget -r -p -e robots=off http://www.example.com
As many sites will not let you download the entire site, they will check your browsers identity. To get around this, use -U mozilla as I explained above.
wget -r -p -e robots=off -U mozilla http://www.example.com
A lot of the website owners will not like the fact that you are downloading their entire site. If the server sees that you are downloading a large amount of files, it may automatically add you to it's black list. The way around this is to wait a few seconds after every download. The way to do this using wget is by including --wait=X (where X is the amount of seconds.)
you can also use the parameter: --random-wait to let wget chose a random number of seconds to wait. To include this into the command:
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
Firstly, to clarify the question, the aim is to download index.html plus all the requisite parts of that page (images, etc). The -p option is equivalent to --page-requisites.
The reason the page requisites are not always downloaded is that they are often hosted on a different domain from the original page (a CDN, for example). By default, wget refuses to visit other hosts, so you need to enable host spanning with the --span-hosts option.
wget --page-requisites --span-hosts 'http://www.amazon.com/'
If you need to be able to load index.html and have all the page requisites load from the local version, you'll need to add the --convert-links option, so that URLs in img src attributes (for example) are rewritten to relative URLs pointing to the local versions.
Optionally, you might also want to save all the files under a single "host" directory by adding the --no-host-directories option, or save all the files in a single, flat directory by adding the --no-directories option.
Using --no-directories will result in lots of files being downloaded to the current directory, so you probably want to specify a folder name for the output files, using --directory-prefix.
wget --page-requisites --span-hosts --convert-links --no-directories --directory-prefix=output 'http://www.amazon.com/'
The link you have provided is the homepage or /index.html, Therefore it's clear that you are getting only a index.html page. For an actual download, for example, for "test.zip" file, you need to add the exact file name at the end. For example use the following link to download test.zip file:
wget -p domainname.com/test.zip
Download a Full Website Using wget --mirror
Following is the command line which you want to execute when you want to download a full website and made available for local viewing.
wget --mirror -p --convert-links -P ./LOCAL-DIR
http://www.example.com
–mirror: turn on options suitable for mirroring.
-p: download all files that are necessary to properly display a given HTML page.
–convert-links: after the download, convert the links in document
for local viewing.
-P ./LOCAL-DIR: save all the files and directories to the specified directory
Download Only Certain File Types Using wget -r -A
You can use this under following situations:
Download all images from a website,
Download all videos from a website,
Download all PDF files from a website
wget -r -A.pdf http://example.com/test.pdf
Another problem might be that the site you're mirroring uses links without www. So if you specify
wget -p -r http://www.example.com
it won't download any linked (intern) pages because they are from a "different" domain. If this is the case then use
wget -p -r http://example.com
instead (without www).
I had the same problem downloading files of CFSv2 model. I solved it using mixing of the above answers, but adding the parameter --no-check-certificate
wget -nH --cut-dirs=2 -p -e robots=off --random-wait -c -r -l 1 -A "flxf*.grb2" -U Mozilla --no-check-certificate https://nomads.ncdc.noaa.gov/modeldata/cfsv2_forecast_6-hourly_9mon_flxf/2018/201801/20180101/2018010100/
Here a brief explanation of every parameter used, for a further explanation go to the GNU wget 1.2 Manual
-nH equivalent to --no-host-directories: Disable generation of host-prefixed directories. In this case, avoid the generation of the directory ./https://nomads.ncdc.noaa.gov/
--cut-dirs=<number>: Ignore directory components. In this case, avoid the generation of the directories ./modeldata/cfsv2_forecast_6-hourly_9mon_flxf/
-p equivalent to --page-requisites: This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
-e robots=off: avoid download robots.txt file
-random-wait: Causes the time between the request to vary between 0.5 and 1.5 * seconds, where was specified using the --wait option.
-c equivalent to --continue: continue getting a partially-downloaded file.
-r equivalent to --recursive: Turn on recursive retrieving. The default maximum depth is 5
-l <depth> equivalent to --level <depth>: Specify recursion maximum depth level
-A <acclist> equivalent to --accept <acclist>: specify a comma-separated list of the name suffixes or patterns to accept.
-U <agent-string> equivalent to --user-agent=<agent-string>: The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as ‘Wget/version’, the version being the current version number of Wget.
--no-check-certificate: Don't check the server certificate against the available certificate authorities.
I know that this thread is old, but try what is mentioned by Ritesh with:
--no-cookies
It worked for me!
If you look for index.html in the wget manual you can find an option --default-page=name which is index.html by default. You can change to index.php for example.
--default-page=index.php
If you only get the index.html and that file looks like it only contains binary data (i.e. no readable text, only control characters), then the site is probably sending the data using gzip compression.
You can confirm this by running cat index.html | gunzip to see if it outputs readable HTML.
If this is the case, then wget's recursive feature (-r) won't work. There is a patch for wget to work with gzip compressed data, but it doesn't seem to be in the standard release yet.

How to specify the download location with wget?

I need files to be downloaded to /tmp/cron_test/. My wget code is
wget --random-wait -r -p -nd -e robots=off -A".pdf" -U mozilla http://math.stanford.edu/undergrad/
So is there some parameter to specify the directory?
From the manual page:
-P prefix
--directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the
directory where all other files and sub-directories will be
saved to, i.e. the top of the retrieval tree. The default
is . (the current directory).
So you need to add -P /tmp/cron_test/ (short form) or --directory-prefix=/tmp/cron_test/ (long form) to your command. Also note that if the directory does not exist it will get created.
-O is the option to specify the path of the file you want to download to:
wget <uri> -O /path/to/file.ext
-P is prefix where it will download the file in the directory:
wget <uri> -P /path/to/folder
Make sure you have the URL correct for whatever you are downloading. First of all, URLs with characters like ? and such cannot be parsed and resolved. This will confuse the cmd line and accept any characters that aren't resolved into the source URL name as the file name you are downloading into.
For example:
wget "sourceforge.net/projects/ebosse/files/latest/download?source=typ_redirect"
will download into a file named, ?source=typ_redirect.
As you can see, knowing a thing or two about URLs helps to understand wget.
I am booting from a hirens disk and only had Linux 2.6.1 as a resource (import os is unavailable). The correct syntax that solved my problem downloading an ISO onto the physical hard drive was:
wget "(source url)" -O (directory where HD was mounted)/isofile.iso"
One could figure the correct URL by finding at what point wget downloads into a file named index.html (the default file), and has the correct size/other attributes of the file you need shown by the following command:
wget "(source url)"
Once that URL and source file is correct and it is downloading into index.html, you can stop the download (ctrl + z) and change the output file by using:
-O "<specified download directory>/filename.extension"
after the source url.
In my case this results in downloading an ISO and storing it as a binary file under isofile.iso, which hopefully mounts.
"-P" is the right option, please read on for more related information:
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Relevant snippets from man pages for convenience:
-P prefix
--directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
man wget:
-O file
--output-document=file
wget "url" -O /tmp/cron_test/<file>