Deploy gzipped content to Amazon S3

I use gzip compression for Amazon S3. I gzip HTML, JS, and CSS files and keep images unchanged.
I sync everything using s3cmd:
s3cmd sync --cf-invalidate ./deploy s3://n12v.com
Unfortunately, this doesn't set Content-Encoding: gzip on all the necessary files.
I need to find all updated gzipped files and set Content-Encoding: gzip on each one of them. The best solution I could come up with:
find all gzipped files by running gzip --test filepath on every one
s3cmd put --add-header='Content-Encoding: gzip' filepath s3://n12v.com/filepath, i.e. upload each file again just to add a header.
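Put together, a rough sketch of that workaround might look like this (the ./deploy path and bucket name are taken from the sync command above; treat it as a sketch, not a tested script):
# Re-upload every file that gzip recognizes as compressed, adding the header
find ./deploy -type f | while read -r f; do
  if gzip --test "$f" 2>/dev/null; then
    s3cmd put --add-header='Content-Encoding: gzip' "$f" "s3://n12v.com/${f#./deploy/}"
  fi
done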
This is a very ad hoc solution; is there a better way of doing this?

I was having the same type of issue, so I wrote s3tup. Unfortunately it doesn't offer anything like s3cmd's --cf-invalidate flag (yet), but it makes configuring while rsyncing a lot easier. You'd write a config file that looks something like this:
# config.yml
---
bucket: n12v.com
rsync: ./deploy
key_config:
  - patterns: ['*.html', '*.css', '*.js']
    content_encoding: gzip
And then rsync from the command line like this:
$ s3tup config.yml --rsync_only
It's still early in development but maybe you'll find it useful.

Related

Downloading a public data directory from Google Cloud Storage with command-line utilities like wget

I would like to download publicly available data from Google Cloud Storage. However, because I need to be in a Python 3.x environment, it is not possible to use gsutil. I can download individual files with wget as
wget http://storage.googleapis.com/path-to-file/output_filename -O output_filename
However, commands like
wget -r --no-parent https://console.cloud.google.com/path_to_directory/output_directoryname -O output_directoryname
do not seem to work; they just download an index file for the directory. Initial attempts with rsync and curl have not worked either. Any idea how to download publicly available data on Google Cloud Storage as a directory?
The approach you mentioned above does not work because Google Cloud Storage doesn't have real "directories". As an example, "path/to/some/files/file.txt" is the entire name of that object. A similarly named object, "path/to/some/files/file2.txt", just happens to share the same naming prefix.
As for how you could fetch these files: The GCS APIs (both XML and JSON) allow you to do an object listing against the parent bucket, specifying a prefix; in this case, you'd want all objects starting with the prefix "path/to/some/files/". You could then make individual HTTP requests for each of the objects specified in the response body. That being said, you'd probably find this much easier to do via one of the GCS client libraries, such as the Python library.
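As a sketch of that raw-API route (the bucket name and prefix here are placeholders, and the bucket is assumed to be publicly readable):
$ # List objects whose names start with the given prefix (JSON API)
$ curl "https://storage.googleapis.com/storage/v1/b/example-bucket/o?prefix=path/to/some/files/"
$ # Then request each object named in the response individually
$ curl -o file.txt "https://storage.googleapis.com/example-bucket/path/to/some/files/file.txt"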
Also, gsutil currently has a GitHub issue open to track adding support for Python 3.

GCS encryption always fails on big files

I'm trying to encrypt a file on GCS with my own key using the gsutil rewrite command (following https://cloud.google.com/storage/docs/using-encryption-keys).
As instructed I'm using a boto file including
[GSUtil]
encryption_key = p9syBNA0ycKxGotK3XinNZC6aCpdn3ZQ7WWOhKNgBaY=
It is working without a problem on small files but fails constantly on big ones.
I'm running the command:
gsutil rewrite -k -O gs://ywz-tmp/bigfile.txt
Is that a known issue?
Any workaround?
Feel free to use the file and key (both were generated for this post)
The fix for this issue should be in production now.

Error 403 with CKAN 2.6.2 - Datapusher

In order to process some big data, I have set up CKAN on a local machine. I've set up the whole system following this guide: http://docs.ckan.org/en/latest/maintaining/installing/install-from-source.html
I wanted to display a preview of a locally uploaded file, so the user can actually see it before downloading it. This doesn't work, because previews only work for online files. For instance, it DOES work with an online file but NOT with a file I upload myself.
So I've been looking into the Datastore and Datapusher. I've followed every part of the guide, and it appears in my CKAN. However, I get an error. Specifically this one:
Upload error: An Error occurred while sending the job: 403 Client Error: Forbidden for url: http://127.0.0.1:8800/job
Here are the most important parts of my production.ini file (copying the whole thing would be very long):
ckan.site_url = http://localhost
ckan.plugins = datastore datapusher stats text_view image_view
    recline_view recline_graph_view recline_map_view webpage_view
ckan.datapusher.formats = csv xls xlsx tsv application/csv
    application/vnd.ms-excel
    application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
ckan.datapusher.url = http://127.0.0.1:8800/
I truly have no idea what my problem could be. I tried to change ckan.datapusher.url to 0.0.0.0 as the guide suggested, but that doesn't work either.
If the data to be added to CKAN is in a file on your computer, select the “Upload a file” option; CKAN will give you a file browser to select it. You should use the “Link to a file” option only for publicly available resources.
Have you installed the Datapusher as well? It's a separate process running on port 8800. CKAN uses the Datastore to provide a grid view of tabular data, and data needs to be pushed through the Datapusher to be usable by the Datastore.
Yes, you need to set up the Datapusher. It's not activated by default.
Pull the datapusher code, install the dependencies and run it using:
python datapusher/main.py deployment/settings.py
The instructions to configure the settings are on the repository.
Here's the datapusher manual: http://docs.ckan.org/projects/datapusher/en/latest/
Here's the repository: https://github.com/ckan/datapusher
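Put together, the setup might look roughly like this (installing into a virtualenv and the exact dependency step are assumptions; the settings path is the one shown above):
$ git clone https://github.com/ckan/datapusher
$ cd datapusher
$ pip install -r requirements.txt
$ python datapusher/main.py deployment/settings.py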
Had the exact same error message.
This post solved my issue, though.
In short: insert/check the following in your virtual host in /etc/apache2/sites-enabled/datapusher.conf:
<Directory /etc/ckan>
Options All
AllowOverride All
Require all granted
</Directory>
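After editing the virtual host, reload Apache so the change takes effect:
$ sudo service apache2 reload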

How to compress a list of files into a single gzip file using Elasticluster, grid-engine-tools, and Google Cloud

I want to start by thanking you all for your help ahead of time, as this will help clear up a detail left out of the readthedocs.io guide. What I need is to compress several files into a single gzip file; however, the guide only shows how to compress a list of files as individual gzipped files. Again, I appreciate any help, as there are very few resources and little documentation for this setup. (If there is extra info, please include links to sources.)
After I had set up the grid engine, I ran through the samples in the guide.
Am I right in assuming there is not a script for combining multiple files into one gzip using grid-computing-tools?
Are there any solutions on the Elasticluster Grid Engine setup to compress multiple files into 1 gzip?
What changes can be made to the grid-engine-tools to make it work?
EDIT
The reason we are considering a cluster is that we expect multiple operations to occur simultaneously, with files zipped per order; this will happen systematically, so that a vendor can download a single compressed file per order.
May I state the definition of the problem, so you can let me know if I understood it correctly? Both Matt and I provided the exact same solution, and somehow it doesn't seem sufficient.
Problem Definition
You have an Order defining the start of a task to process some data.
The processing of data would be split among several compute nodes, each producing a resulting file stored on GS directories.
The goal is:
Collect the files from GS bucket (that were produced by each of the nodes),
Archive the collection of files as one file,
Then compress that archive, and
Push it back to a different GS location.
Let me know if I summarized it properly,
Thanks,
Paul
Are the files in question in Cloud Storage?
Are the files in question on a local or network drive?
In your description, you indicate "What I need is to compress several files into a single gzip". It isn't clear to me that a cluster of computers is needed for this. It sounds more like you just want to use tar along with gzip.
The tar utility will create an archive file, and it can compress it as well. For example:
$ # Create a directory with a few input files
$ mkdir myfiles
$ echo "This is file1" > myfiles/file1.txt
$ echo "This is file2" > myfiles/file2.txt
$ # (C)reate a compressed archive
$ tar cvfz archive.tgz myfiles/*
a myfiles/file1.txt
a myfiles/file2.txt
$ # (V)erify the archive
$ tar tvfz archive.tgz
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file1.txt
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file2.txt
To extract the contents use:
$ # E(x)tract the archive contents
$ tar xvfz archive.tgz
x myfiles/file1.txt
x myfiles/file2.txt
UPDATE:
In your updated problem description, you have indicated that you may have multiple orders processed simultaneously. If the frequency with which results need to be tarred is low, and providing the tarred results is not extremely time-sensitive, then you could likely do this with a single node.
However, as the scale of the problem ramps up, you might take a look at using the Pipelines API.
Rather than keeping a fixed cluster running, you could initiate a "pipeline" (in this case a single task) when a customer's order completes.
A call to the Pipelines API would start a VM whose sole purpose is to download the customer's files, tar them up, and push the resulting tar file into Cloud Storage. The Pipelines API infrastructure does the copying from and to Cloud Storage for you. You would effectively just need to supply the tar command line.
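Under that model, the command line you supply might be as simple as this (the mounted input/output paths are placeholders for whatever the pipeline makes available):
$ tar cvfz /mnt/output/order.tgz /mnt/input/*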
There is an example that does something similar here:
https://github.com/googlegenomics/pipelines-api-examples/tree/master/compress
This example will download a list of files and compress each of them independently. It could be easily modified to tar the list of input files.
Take a look at the https://github.com/googlegenomics/pipelines-api-examples github repository for more information and examples.
-Matt
There are many ways to do it, but the thing is that you cannot directly compress a collection of files (or a directory) into one file on Google Storage; you need to perform the tar/gzip combination locally before transferring it.
If you want you can have the data compressed automatically via:
gsutil cp -Z
Which is detailed at the following link:
https://cloud.google.com/storage/docs/gsutil/commands/cp#changing-temp-directories
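For example (the bucket name is a placeholder):
$ # gzip-compress the file on upload and set Content-Encoding: gzip
$ gsutil cp -Z myfile.txt gs://my-bucket/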
And the nice thing is that you retrieve uncompressed results from compressed data on Google Storage, because it has the ability to perform Decompressive Transcoding:
https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
You will notice the last line in the following script:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
This line copies the compressed output files to Google Cloud Storage:
gcs_util::upload "${WS_OUT_DIR}/*" "${OUTPUT_PATH}/"
What you will need is to first perform the tar/gzip on the files in the local scratch directory, and then gsutil copy the compressed file over to Google Storage, making sure that all the files that need to be compressed are in the scratch directory before starting to compress them. Most likely you would need to SSH copy (scp) them to one of the nodes (i.e. the master), and then have the master tar/gzip the whole directory before sending it over to Google Storage. I am assuming each GCE instance has its own scratch disk, but "gsutil cp" transfers are very fast when working on GCE.
Since Google Storage is fast at data transfers with Google Compute instances, the easier second option to pursue is to comment out lines 66-69 in the do_compress.sh file:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
This way no compression happens, but the copy still happens on the last line via gcs_util::upload, so all the uncompressed files are transferred to the same Google Storage bucket. Then, using "gsutil cp" from the master node, you would copy them back locally, compress them locally via tar/gzip, and then copy the compressed archive back to the bucket using "gsutil cp".
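A rough sketch of that copy-down/compress/copy-up flow, run from the master node (bucket and path names are placeholders):
$ # Pull the uncompressed per-node results down from the bucket
$ mkdir -p ./scratch/order-123
$ gsutil cp "gs://my-bucket/results/order-123/*" ./scratch/order-123/
$ # Archive and compress them locally
$ tar cvfz order-123.tgz ./scratch/order-123/
$ # Push the single compressed file back to a different location
$ gsutil cp order-123.tgz gs://my-bucket/compressed/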
Hope it helps but it's tricky,
Paul

Are you able to create clean URLs with Wget?

I'm attempting to create a mirror of a WordPress site with clean URLs (i.e. http://example.org/foo not http://example.org/foo.php). When Wget mirrors the site, it gives all pages and links a ".html" extension (i.e. http://example.org/foo.html).
Is it possible to set options for Wget to create a clean URL structure, so that the mirrored file corresponding to the page "http://example.org/foo" would be "/foo/index.html" and the link to that page would be "http://example.org/foo"? If so, how?
If I understand your question correctly, you're asking for what is the default behaviour of Wget.
Wget will only add the extension to the local copy if the --adjust-extension option has been passed to it. Quoting the man page for Wget:
--adjust-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://example.com/article.cgi?25 will be saved as article.cgi?25.html.
However, what you seem to be asking for, that Wget save example.org/foo as /foo/index.html, is actually the default behaviour. If you're seeing some other output, you should post the complete output of Wget with the --debug switch.
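For reference, a plain mirroring invocation under those defaults might look like this (the URL is a placeholder); note the absence of --adjust-extension:
$ wget --mirror --page-requisites --convert-links http://example.org/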