How to set charset to UTF8 for text files with gsutil when uploading to Google Cloud Storage bucket?

We have a (public) Google Cloud Storage bucket that hosts a simple website, i.e. both HTML files and images.
Our build process uses Google Cloud Build; however, the question is not tied to Cloud Build but specifically about how to use gsutil properly.
This is our current gsutil task:
# Upload it to the bucket
- name: gcr.io/cloud-builders/gsutil
  dir: "public/"
  args: [
    "-m",                                       # perform operations in parallel
    "-h", "Cache-Control: public, max-age=0",   # custom cache control header
    "cp",                                       # copy command
    "-r",                                       # recursively
    ".",                                        # source folder
    "gs://mybucket/"                            # the target bucket and folder
  ]
As you can see, this copies everything in the local public/ folder to the bucket and applies the Cache-Control header on all objects.
According to this:
https://cloud.google.com/storage/docs/gsutil/addlhelp/WorkingWithObjectMetadata
You can specify the content type with
-h "Content-Type:text/html; charset=utf-8"
However, this makes all objects (not only .html files, but also images, etc.) get the content type text/html; charset=utf-8.
(I have even tried -h "Content-Type:; charset=utf-8", but then gsutil fails saying it's an invalid content-type value.)
Is there a way to tell gsutil to apply charset=utf-8 on all objects, without actually overwriting the main content type?
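One approach that should avoid overwriting the main content type is to keep the bulk cp step as it is and then patch only the text objects in a second step with gsutil setmeta, which can target specific extensions. A sketch, assuming only the HTML files need the charset and using the bucket name from the question:
# After the cp step, update only the HTML objects:
gsutil -m setmeta -h "Content-Type:text/html; charset=utf-8" gs://mybucket/**.html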

Related

Getting HASH of individual files within folder uploaded to IPFS

When I upload a folder of .jpg files to IPFS, I get the HASH of that folder - which is cool.
But is each individual file in that folder also getting hashed?
And if so, how do I get the hash of each file?
I basically want to be able to upload a whole bunch of files - like 500 images - and do it all at once, or programmatically, and have the hash of each file be returned to me.
Any way to do this?
Yes! From the command line you get back the CIDs (the Content IDentifiers, a.k.a. IPFS hashes) for each file added when you run ipfs add -r <path to directory>:
$ ipfs add -r gifs
added QmfBAEYhJp9ZjGvv8utB3Yv8uuuxsDKjv9rurkHRsYU3ih gifs/martian-iron-man.gif
added QmRBHTH3p4W2xAzgLxvdh8VJvAmWBgchwCr9G98EprwetE gifs/needs-more-dogs.gif
added QmZbffnCcV598QxsUy7WphXCAMZJULZAzy94tuFZzbFcdK gifs/satisfied-with-your-care.gif
added QmTxnmk85ESr97j2xLNFeVZW2Kk9FquhdswofchF8iDGFg gifs/stone-of-triumph.gif
added QmcN71Qh56oSg2YXsEXuf8o6u5CrBXbyYYzgMyAkdkcxxK gifs/thanks-dog.gif
added QmTnuLaivKc1Aj8LBf2iWBHDXsmedip3zSPbQcGi6BFwTC gifs
The root CID for the directory is always the last item in the list.
You can limit the output of that command to just the CIDs using the --quiet flag:
$ ipfs add -r gifs --quiet
QmfBAEYhJp9ZjGvv8utB3Yv8uuuxsDKjv9rurkHRsYU3ih
QmRBHTH3p4W2xAzgLxvdh8VJvAmWBgchwCr9G98EprwetE
QmZbffnCcV598QxsUy7WphXCAMZJULZAzy94tuFZzbFcdK
QmTxnmk85ESr97j2xLNFeVZW2Kk9FquhdswofchF8iDGFg
QmcN71Qh56oSg2YXsEXuf8o6u5CrBXbyYYzgMyAkdkcxxK
QmTnuLaivKc1Aj8LBf2iWBHDXsmedip3zSPbQcGi6BFwTC
Or, if you know the CID for a directory, you can list out the files it contains and their individual CIDs with ipfs ls. Here I list out the contents of the gifs dir from the previous example
$ ipfs ls QmTnuLaivKc1Aj8LBf2iWBHDXsmedip3zSPbQcGi6BFwTC
QmfBAEYhJp9ZjGvv8utB3Yv8uuuxsDKjv9rurkHRsYU3ih 2252675 martian-iron-man.gif
QmRBHTH3p4W2xAzgLxvdh8VJvAmWBgchwCr9G98EprwetE 1233669 needs-more-dogs.gif
QmZbffnCcV598QxsUy7WphXCAMZJULZAzy94tuFZzbFcdK 1395067 satisfied-with-your-care.gif
QmTxnmk85ESr97j2xLNFeVZW2Kk9FquhdswofchF8iDGFg 1154617 stone-of-triumph.gif
QmcN71Qh56oSg2YXsEXuf8o6u5CrBXbyYYzgMyAkdkcxxK 2322454 thanks-dog.gif
You can do it programmatically with the core API in js-ipfs or go-ipfs. Here is an example of adding files from the local file system in Node.js using js-ipfs, from the docs for ipfs.addAll(files): https://github.com/ipfs/js-ipfs/blob/master/docs/core-api/FILES.md#importing-files-from-the-file-system
There is a super helpful video on how adding files to IPFS works over at https://www.youtube.com/watch?v=Z5zNPwMDYGg
And there is a walk-through of js-ipfs here: https://github.com/ipfs/js-ipfs/tree/master/examples/ipfs-101

How to change the metadata of all existing objects of a specific file type in Google Cloud Storage?

I have uploaded thousands of files to Google Storage, and I found out all the files are missing a content-type, so my website cannot serve them correctly.
I wonder if I can set some kind of policy to change all the files' content-type at the same time. For example, I have a bunch of .html files inside the bucket:
a/b/index.html
a/c/a.html
a/c/a/b.html
a/a.html
.
.
.
Is it possible to set the content-type of all the .html files in these different places with one command?
You could do:
gsutil -m setmeta -h Content-Type:text/html gs://your-bucket/**.html
There's no single command to achieve exactly the behavior you are looking for (one command to edit all the objects' metadata); however, there's a gsutil command to edit metadata which you can use in a bash script to loop through all the objects inside the bucket.
1. One option is to use the gsutil command "setmeta" in a bash script:
# Get the list of all your objects' names and apply the metadata change to each one.
for OBJECT in $(gsutil ls gs://[BUCKET_NAME]/**)
do
  gsutil setmeta -h "[METADATA_KEY]:[METADATA_VALUE]" "$OBJECT"
done
2. You could also create a small C++ program to achieve the same thing:
namespace gcs = google::cloud::storage;
using ::google::cloud::StatusOr;
[](gcs::Client client, std::string bucket_name, std::string key,
   std::string value) {
  // List all the objects in the bucket and, inside the loop, edit each
  // object's metadata.
  for (auto&& object_metadata : client.ListObjects(bucket_name)) {
    if (!object_metadata) continue;  // skip entries that failed to list
    std::string object_name = object_metadata->name();
    gcs::ObjectMetadata desired = *object_metadata;
    desired.mutable_metadata().emplace(key, value);
    StatusOr<gcs::ObjectMetadata> updated = client.UpdateObject(
        bucket_name, object_name, desired,
        gcs::Generation(object_metadata->generation()));
  }
}

ERROR: The specifed resource name contains invalid characters. ErrorCode: InvalidResourceName

ERROR: The specifed resource name contains invalid characters. ErrorCode: InvalidResourceName
2019-10-31T10:28:17.4678189Z <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidResourceName</Code><Message>The specifed resource name contains invalid characters.
2019-10-31T10:28:17.4678695Z RequestId:
2019-10-31T10:28:17.4679207Z Time:2019-10-31T10:28:17.4598301Z</Message></Error>
I am trying to deploy my static website to blob storage in Azure with Azure DevOps, but I am getting this error. In my pipeline I use grunt build to build the site, archive it to a zip, and publish the artifact; then in the release I extract the files and try to upload them with an Azure CLI task.
I am using following command
az storage blob upload-batch --account-name something --account-key something --destination ‘$web’ --source ./
My Container name is $web
Permitted characters are lowercase a-z 0-9 and single infix hyphens
[a-z0-9\-]
https://learn.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata
I solved this problem by removing apostrophes around container name:
az storage blob upload-batch --account-name something --account-key something --destination $web --source ./
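Note that the quotes around $web in the failing command above are curly typographic quotes rather than plain ASCII quotes, which is likely where the invalid characters came from. If quoting is needed at all (for example so that a bash shell does not expand $web as an empty variable), straight single quotes or an escaped dollar sign should be safe; a sketch assuming a bash-based Azure CLI task:
az storage blob upload-batch --account-name something --account-key something --destination '$web' --source ./
# or, equivalently:
az storage blob upload-batch --account-name something --account-key something --destination \$web --source ./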
This will probably not solve your problem, but it will solve a related problem for other people:
If the aim is to simply download a file from Azure File Storage using a link, after generating a SAS token, as shown here: Azure File Storage URL in browser showing InvalidHeaderValue
If you remove the slash after the name of the file in the generated link, the file will download!
Verify whether the container name or connection string contains extra or non-allowed symbols. In my case it was extra spaces in the container name.

Remove /index.html from URL for static site

I have a static site on google-cloud-storage bucket.
I rsync my site to the storage bucket with:
args: ["-m", "-h", "Content-Encoding:gzip", "rsync", "-c", "-r", "./folder", "gs://mysite.com"]
In my cloud bucket's website config I have set:
/index.html
This results in:
mysite.com/category/index.html
I want to remove the index.html from that URL, so in addition to the args above I tried the following in a second step:
args: ["-h", "Content-Type:text/html", "cp", "./folder/*/index.html", "gs://mysite.com/*"]
But this second set of args did not work.
How do I write the second args so that index.html is removed from the URL mysite.com/category/index.html?
The second args are probably working; the thing is that you are using cp, which copies files, so you are just uploading the index.html file again.
If you want to remove the index.html object you have to use rm:
args: ["-h", "Content-Type:text/html", "rm", "gs://mysite.com/category/index.html"]

wget files from FTP-like listings

So, a site that used to use FTP now has an HTTP front-end and won't allow FTP connections. The site in question (for an example directory) shows a page with links to different dates. Inside each of these dates there are many files, and I typically just need some file matching a clear pattern, e.g. *h17v04*.hdf. I thought this could work:
wget -I "${PLATFORM}/${PRODUCT}/${YEAR}.*" -r -l 4 \
--user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
--verbose -c -np -nc -nd \
-A "*h17v04*.hdf" http://e4ftl01.cr.usgs.gov/$PLATFORM/$PRODUCT/
where PLATFORM=MOLT, PRODUCT=MOD09GA.005 and YEAR=2004, for example. This seems to start looking into all the useful dates, find the index.html, and then just skip to the next directory without downloading the relevant HDF file:
--2013-06-14 13:09:18-- http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/
Reusing existing connection to e4ftl01.cr.usgs.gov:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html'
[ <=> ] 174,182 134K/s in 1.3s
2013-06-14 13:09:20 (134 KB/s) - `e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html' saved [174182]
Removing e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html since it should be rejected.
--2013-06-14 13:09:20-- http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.02/
[...]
If I omit the -A option, only the index.html file is downloaded to my system, but it appears it is not parsed and the links are not followed. I don't really know what more is required to make this work, as I can't see why it doesn't!
SOLUTION
In the end, the problem was due to an old bug in the local version of wget. However, I ended up writing my own script for downloading MODIS data from the server above. The script is pure Python, and is available from here.
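For reference, with a recent wget the same kind of recursive, pattern-filtered crawl should work. A minimal sketch using only the host and pattern from the question (the flags may need adjusting, and the server may also block crawlers via robots.txt):
wget -r -np -nd -l 2 -e robots=off \
  -A "*h17v04*.hdf" \
  --user-agent="Mozilla/5.0" \
  http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/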
Consider using pyModis instead of wget; it is a free and open-source Python-based library for working with MODIS data. It offers bulk download for user-selected time ranges, mosaicking of MODIS tiles, reprojection from Sinusoidal to other projections, and conversion of HDF files to other formats. See
http://www.pymodis.org/