Google Cloud Storage Share Publicly - google-cloud-storage

My ACL permissions on my bucket are as follows:
<?xml version="1.0" ?>
<AccessControlList>
  <Owner>
    <ID>00b4903a97dfaa16aff41eeb91e90b5fb524f1daf0d88fceca29b6f647412e8d</ID>
  </Owner>
  <Entries>
    <Entry>
      <Scope type="GroupById">
        <ID>00b4903a97dfaa16aff41eeb91e90b5fb524f1daf0d88fceca29b6f647412e8d</ID>
      </Scope>
      <Permission>FULL_CONTROL</Permission>
    </Entry>
    <Entry>
      <Scope type="AllUsers"/>
      <Permission>READ</Permission>
    </Entry>
    <Entry>
      <Scope type="UserByEmail">
        <EmailAddress>my_app@appspot.gserviceaccount.com</EmailAddress>
      </Scope>
      <Permission>WRITE</Permission>
    </Entry>
  </Entries>
</AccessControlList>
But when I upload a new file to this bucket, it is not shared publicly by default.
I think it should be, because of the AllUsers permission being set to READ.

I think you are confusing bucket permissions and object permissions. The bucket is publicly readable, so everyone can list the contents of the bucket, but each object you upload has its own set of permissions. If you want an uploaded object to be publicly readable, you need to enable that explicitly. You could use the following command to do that:
gsutil setacl public-read gs://bucket/object
Alternatively, you could set the default object ACL for the containing bucket to be publicly readable, using this command:
gsutil setdefacl public-read gs://bucket
The advantage of the latter is that every object uploaded to that bucket will automatically inherit public readability from the containing bucket.

If you get: You are using a deprecated alias, "setdefacl", for the "defacl" ...
Use gsutil defacl set public-read gs://bucketname
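If you prefer to do this programmatically rather than with gsutil, here is a minimal sketch using the google-cloud-storage Python client library (the bucket and object names are placeholders):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")          # placeholder bucket name

# Make one existing object publicly readable.
bucket.blob("path/to/object").make_public()

# Or make every future upload publicly readable by default,
# which mirrors "gsutil defacl set public-read".
bucket.default_object_acl.reload()
bucket.default_object_acl.all().grant_read()
bucket.default_object_acl.save()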

Related

gsutil - what are the storage class options for cp?

I'm using the gsutil CLI to copy files to Google Cloud buckets, but I couldn't find the options for specifying a storage class. I read the documentation and the options are not listed there; it just says to use -s <class>. What are the options for <class>?
These are the storage classes you can define:
STANDARD
NEARLINE
COLDLINE
ARCHIVE
Additional note: you should only use cp to copy between buckets in the same location and storage class.
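If you are setting the class from code rather than with gsutil -s, here is a rough sketch using the google-cloud-storage Python client (bucket, object, and file names are placeholders; I'm assuming the storage_class attribute can be set before the upload, while update_storage_class rewrites an existing object):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")

# Upload a new object with an explicit storage class (assumption: set before upload).
blob = bucket.blob("reports/2023.csv")
blob.storage_class = "NEARLINE"              # STANDARD, NEARLINE, COLDLINE, or ARCHIVE
blob.upload_from_filename("2023.csv")

# Change the class of an object that already exists (server-side rewrite).
blob.update_storage_class("COLDLINE")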

How to set Hadoop fs.s3a.acl.default on AWS EMR?

I have a map-reduce application running on AWS EMR that writes some output to a different S3 bucket (in another AWS account). I have the permissions set up and the job can write to the external bucket, but the owner is still the root of the account where the Hadoop job is running. I would like to change this to the external account that owns the bucket.
I found that I can set fs.s3a.acl.default to bucket-owner-full-control, however that doesn't seem to work. This is what I am doing:
conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
FileSystem fileSystem = FileSystem.get(URI.create(s3Path), conf);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(filePath));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write(contentAsString);
writer.close();
fsDataOutputStream.close();
Any help is appreciated.
conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
is the right property you are setting.
As this the property in core-site.xml to give full control to bucket owner.
<property>
  <name>fs.s3a.acl.default</name>
  <description>Set a canned ACL for newly created and copied objects. Value may be private,
    public-read, public-read-write, authenticated-read, log-delivery-write,
    bucket-owner-read, or bucket-owner-full-control.</description>
</property>
BucketOwnerFullControl: Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.
I recommend also setting fs.s3.canned.acl to the value BucketOwnerFullControl.
For debugging, you can use the snippet below to see which parameters are actually being passed:
import java.util.Map.Entry;

// Configuration is iterable, so this dumps every key/value pair it holds.
for (Entry<String, String> entry : conf) {
    System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
}
For testing purposes, run this command from the command line:
aws s3 cp s3://bucket/source/dummyfile.txt s3://bucket/target/dummyfile.txt --sse --acl bucket-owner-full-control
If this works, it will also work through the API.
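For example, a rough boto3 equivalent of that CLI test (bucket and key names are placeholders, and I'm assuming default credentials are configured):
import boto3

s3 = boto3.client("s3")
s3.copy_object(
    Bucket="bucket",
    Key="target/dummyfile.txt",
    CopySource={"Bucket": "bucket", "Key": "source/dummyfile.txt"},
    ACL="bucket-owner-full-control",   # the canned ACL under test
    ServerSideEncryption="AES256",     # the --sse flag from the CLI command
)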
Bonus point with Spark (useful for Spark Scala users):
for Spark to access the S3 file system, set the proper configuration as in the example below:
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.fast.upload","true")
hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version","2")
hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
hadoopConf.set("fs.s3a.canned.acl","BucketOwnerFullControl")
hadoopConf.set("fs.s3a.acl.default","BucketOwnerFullControl")
If you are using EMR then you have to use the AWS team's S3 connector, with "s3://" URLs, and use their documented configuration options. They don't support the Apache one, so any option beginning with "fs.s3a" isn't going to have any effect whatsoever.
As mentioned in the answer by Stevel, for EMR with PySpark use this:
sc=spark.sparkContext
hadoop_conf=sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.canned.acl","BucketOwnerFullControl")
Canned ACL: BucketOwnerFullControl
Description: Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-s3-acls.html

How to download multiple objects from IBM Cloud Object Storage?

I am trying to use IBM Cloud Object Storage to store images uploaded to my site by users. I have this functionality working just fine.
However, based on the documentation here (link) it appears as though only one object can be downloaded from a bucket at a time.
Is there any way a list of objects could all be downloaded from the bucket? Is there a different approach to requesting multiple objects from a COS bucket?
Via the REST API, no, you can only download a single object at a time. But most tools (like the AWS CLI or the MinIO Client) allow downloading all objects that share a prefix (e.g. foo/bar and foo/bas). The IBM forks of the S3 libraries are now also integrated with Aspera and can transfer large directories all at once. What are you trying to do?
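To sketch what prefix-based download looks like programmatically, here is a rough example with boto3 against the S3-compatible COS endpoint (the endpoint URL, credentials, bucket name, and prefix are all placeholders):
import os
import boto3

cos = boto3.client(
    "s3",
    endpoint_url="https://s3.us-east.cloud-object-storage.appdomain.cloud",
    aws_access_key_id="ACCESS_KEY_ID",
    aws_secret_access_key="SECRET_ACCESS_KEY",
)

paginator = cos.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="yourBucketName", Prefix="images/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):           # skip "directory" placeholder objects
            continue
        local_path = os.path.join("downloads", key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        cos.download_file("yourBucketName", key, local_path)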
According to S3 spec (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html), you can only download one object at a time.
There are various tools which can help download multiple objects at a time from COS. I used the AWS CLI tool to download and upload objects from/to COS.
So install the aws-cli tool and configure it by supplying your access_key_id and secret_access_key.
Recursively copying S3 objects to a local directory: When passed with the parameter --recursive, the following cp command recursively copies all objects under a specified prefix and bucket to a specified directory.
C:\Users\Shashank>aws s3 cp s3://yourBucketName . --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp s3://yourBucketName D:\s3\ --recursive
In my case the endpoint is based on the us-east region, and I am copying objects into the D:\s3 directory.
Recursively copying local files to S3: When passed with the parameter --recursive, the following cp command recursively copies all files under a specified directory to a specified bucket.
C:\Users\Shashank>aws s3 cp myDir s3://yourBucketName/ --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp D:\s3 s3://yourBucketName/ --recursive
Here I am copying objects from the D:\s3 directory to COS.
For more reference, you can see the link here.
I hope it works for you.

Google Cloud Storage ACL troubles

I have a bucket on Google Cloud Storage that I created. I wanted to test some of the built-in ACLs like public-read, public-read-write, etc. But once I changed the ACL using the gsutil setacl command like so:
gsutil setacl public-read-write gs://mybucket
I seem to have lost the ability to set the ACL to anything else, and I also cannot get the current ACL. When I attempt either, I get the following message:
GSResponseError: status=403, code=AccessDenied, reason=Forbidden, detail=mybucket.
Not sure if this is a bug or if I am just missing something obvious. How do I regain the ability to set the ACLs?
Go to https://code.google.com/apis/console/#:team and see whether you are a viewer, editor or owner of the project.
Contact one of the owners of the project.
Ask the owner to run gsutil chacl -u you@gmail.com:FC gs://mybucket, or else make you an owner of the project.
Project owners always have full control of buckets.
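On newer gsutil versions the chacl command is deprecated; something like gsutil acl ch -u you@gmail.com:O gs://mybucket is the equivalent. The owner could also re-grant full control through the google-cloud-storage Python client, roughly like this (the email address and bucket name are placeholders):
from google.cloud import storage

client = storage.Client()                       # run as a project owner / someone who still has access
bucket = client.bucket("mybucket")
bucket.acl.reload()                             # fetch the current bucket ACL
bucket.acl.user("you@gmail.com").grant_owner()  # grant FULL_CONTROL to the locked-out user
bucket.acl.save()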

How does the web server locate a file on server through URL?

Has anyone ever tried to implement a web server? Or does anyone know something about the internals of a working web server program? I am wondering what exactly happens from the moment a URL is received by the web server until a file on the server is located and sent back as the response.
Does the server just keep an internal table to remember the mapping between the URLs it supports and the corresponding local paths? Or is there anything more tricky?
Thanks!
Update
Thanks for your replies. Here's my understanding for now.
I checked with Microsoft IIS (Internet Information Services) and noticed that IIS can host multiple sites, and for each site IIS memorizes its root path on the local file system. Different sites on the same host share the same host name or IP, and they are differentiated by separate ports. For example:
http://www.myServer.com:1111/folderA/pageA.htm
The web server will use the www.myServer.com:1111 part of the URL string to determine which path on its local file system to use, and then within that local path it searches for the subfolder folderA and then the file pageA.htm.
The web server only needs to memorize the following mapping between two plain strings:
"http://www.myServer.com:1111/" <---> "D:\myWebRoot"
I don't know where this kind of mapping info is stored; maybe in some config files for the web server program in question.
But the result of this mapping granularity is that we can only access content within that mapped local folder. We couldn't do arbitrary mapping.
Update - 2 -
I found where IIS keeps the mapping; here are some excerpts from applicationHost.config:
<sites>
  <site name="Default Web Site" id="1" serverAutoStart="false">
    <application path="/">
      <virtualDirectory path="/" physicalPath="%SystemDrive%\inetpub\wwwroot" />
    </application>
    <bindings>
      <binding protocol="http" bindingInformation="*:80:" />
      <binding protocol="net.tcp" bindingInformation="808:*" />
      <binding protocol="net.pipe" bindingInformation="*" />
      <binding protocol="net.msmq" bindingInformation="localhost" />
      <binding protocol="msmq.formatname" bindingInformation="localhost" />
    </bindings>
  </site>
  <site name="myIISService" id="2" serverAutoStart="true">
    <application path="/" applicationPool="myIISService">
      <virtualDirectory path="/" physicalPath="D:\MySites\MyIISService" />
    </application>
    <bindings>
      <binding protocol="http" bindingInformation="*:8022:" />
    </bindings>
  </site>
  <siteDefaults>
    <logFile logFormat="W3C" directory="%SystemDrive%\inetpub\logs\LogFiles" />
    <traceFailedRequestsLogging directory="%SystemDrive%\inetpub\logs\FailedReqLogFiles" />
  </siteDefaults>
  <applicationDefaults applicationPool="DefaultAppPool" />
  <virtualDirectoryDefaults allowSubDirConfig="true" />
</sites>
Update - 3 -
After reading foo's reply, my understanding of a "server" has broadened. I want to make some comments based on my recent learning of WCF.
No matter what kind of server it is, we can always send messages to it by specifying the protocol, URL, and port. For example:
[http://www.myserver.com:1111/]page.htm
[net.tcp://www.myserver.com/]someService.svc/someMethod
[net.msmq://www.myserver.com/]someService.svc
[net.pipe://localhost/]
After a message arrives at the server program via the parts in square brackets of the above URLs, the rest of the URL is passed to the server program as input for further processing. The resulting behaviour could be as simple as serving static content or as complex as generating dynamic content.
Depends on the webserver and what its focus is.
(For all items, checking access rights, remapping and such steps apply of course.)
General-purpose webservers like Apache start out with files and directories, so they split up the URL into a hierarchical path description, try to find a file at the given location, and serve it if it exists. (This gets more complex with modules and filetypes; some filetypes imply processing the file as a script and returning the script output rather than just piping out the file contents, and so on).
Application servers like Tomcat do a mapping to servlets; if they have found a servlet that will handle the URL, they call it and pass any leftover URL parts/parameters to it for further handling.
Embedded webservers may even use hardcoded lookup tables for available URL patterns, directly mapping to functions to be called (a toy sketch of this approach follows at the end of this answer).
Special-purpose webservers will do whatever is required; some won't even parse the URL but just the other headers (like some streaming servers do).
It all depends on what you want to achieve. In most cases, you will be best off with nginx or Apache and maybe some modules and/or finetuning.
Be aware that any HTTP header can be used for mapping the request to whatever means of producing output you have. Hostname, port and URL are used most often, but you may as well take language or client IP or other header data and use them in the mapping.
So for your question: Yes, it can be as simple as that; and yes, it can be substantially more tricky (with mapping, rewriting, and complex processing).
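To make the hardcoded-lookup-table idea above concrete, here is a toy Python sketch (not production code; the route paths, handler functions, and port are made up) where each URL path maps directly to a handler function:
from http.server import BaseHTTPRequestHandler, HTTPServer

def home():
    return "<h1>Home</h1>"

def status():
    return "OK"

# The hardcoded lookup table: URL path -> function producing the response body.
ROUTES = {"/": home, "/status": status}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        handler = ROUTES.get(self.path)
        if handler is None:
            self.send_error(404)
            return
        body = handler().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()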
For servers that serve "files", a typical approach is to treat the path portion of the URL as a relative path starting at a "web root" directory defined in the server's configuration. However, a URL doesn't have to correspond to a file on disk at all; it could correspond to an object or method in a running web application, or a database record, or anything else.
For static files there's usually no mapping as such. The only thing the webserver needs to know is the absolute disk file system path to the public web document root, which is usually defined somewhere in a deployment configuration file (httpd.conf for Apache HTTPD, server.xml and/or context.xml for Apache Tomcat, etc). The webserver extracts the relevant part from the URL, converts it to an absolute disk file system path based on the path to the web document root, locates the file on disk and streams it.
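As a rough illustration of that last step, here is a minimal Python sketch of resolving a URL against a configured document root (the root path and URL are placeholders), including the usual check that the resolved path cannot escape the root:
import os
from urllib.parse import urlparse, unquote

WEB_ROOT = "/var/www/html"   # hypothetical document root, normally read from a config file

def resolve(url):
    path = unquote(urlparse(url).path)                               # e.g. "/folderA/pageA.htm"
    candidate = os.path.realpath(os.path.join(WEB_ROOT, path.lstrip("/")))
    root = os.path.realpath(WEB_ROOT)
    if candidate != root and not candidate.startswith(root + os.sep):
        return None                                                  # "../" escape attempt -> reject
    return candidate if os.path.isfile(candidate) else None          # the file the server would stream

print(resolve("http://www.myServer.com:1111/folderA/pageA.htm"))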