Accessing Google Cloud Storage using the Hadoop FileSystem API - google-cloud-dataproc

From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the equivalent from Java using:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] status = fs.listStatus(new Path("gs://mybucket/"));
I get the files under root in my local HDFS instead of in gs://mybucket/, but with gs://mybucket prepended to those paths. If I modify the conf with conf.set("fs.default.name", "gs://mybucket"); before obtaining the fs, then I can see the files on GCS.
My question is:
1. Is this expected behavior?
2. Is there a disadvantage to using the Hadoop FileSystem API as opposed to the Google Cloud Storage client API?

As to your first question, "expected" is questionable, but I think I can at least explain. When FileSystem.get() is used, the default FileSystem is returned, and by default that is HDFS. My guess is that the HDFS client (DistributedFileSystem) has code to prepend scheme + authority automatically to all files in the filesystem.
Instead of using FileSystem.get(conf), try
FileSystem gcsFs = new Path("gs://mybucket/").getFileSystem(conf);
On disadvantages, I could probably argue that if you end up needing to access the object store directly then you'll end up writing code to interact with the storage APIs directly anyway (and there are things that do not translate very well to the Hadoop FS API, e.g., object composition, complex object write preconditions other than simple object overwrite protection, etc.).
I am admittedly biased (working on the team), but if you're intending to use GCS from Hadoop Map/Reduce, from Spark, etc, the GCS connector for Hadoop should be a fairly safe bet.
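As a rough illustration of that last point, here is a minimal PySpark sketch of reading a gs:// path through the connector. This is only a sketch: it assumes the gcs-connector jar and credentials are already configured on the cluster (as they are on Dataproc), and the bucket and path below are placeholders.
from pyspark.sql import SparkSession

# Assumes the GCS connector and credentials are already set up on the cluster.
spark = SparkSession.builder.appName("gcs-read-sketch").getOrCreate()
lines = spark.sparkContext.textFile("gs://mybucket/some/prefix/*.txt")  # placeholder path
print(lines.count())
spark.stop()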

Related

Reading a JSON file from Azure Data Lake using json.load in Azure Databricks/Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My approach is to pass a filename and use open(file_name) to load the data. When I provide a Data Lake path, it throws an error that the file path is not found. I am able to read the data into DataFrames, but how can I read a file from the Data Lake as a plain file and open it, without converting it to a DataFrame?
Current approach, which works on my local machine:
f = open("File_Name.Json")
data = json.load(f)
Failing scenario when providing the Data Lake path:
f = open("<Datalake path>/File_Name.Json")
data = json.load(f)
You need to mount the Data Lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to dbfs: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
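For reference, a minimal sketch of such a mount, assuming an ADLS Gen2 account accessed with a service principal; every account, container, scope, and key name below is a placeholder, and dbutils is the object Databricks provides in notebooks.
import json

# Placeholder service-principal credentials; keep the secret in a Databricks secret scope.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the container once; it then appears under /dbfs/mnt/datalake on every node.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)

# A mounted file can be opened like a local file.
with open("/dbfs/mnt/datalake/File_Name.Json") as f:
    data = json.load(f)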
The open function works only with local files; it does not understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as ARCrow mentioned, it would be a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you're able to read the file into a DataFrame, it means the cluster has all the settings necessary to access the cloud storage. In that case you can just use the dbutils.fs.cp command to copy the file from cloud storage to the local disk, and then open it with the open function. Something like this:
dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
    data = json.load(f)

How to set Hadoop fs.s3a.acl.default on AWS EMR?

I have a map-reduce application running on AWS EMR that writes some output to an S3 bucket in a different AWS account. I have the permissions set up and the job can write to the external bucket, but the owner of the objects is still root of the account where the Hadoop job is running. I would like to change this so the external account that owns the bucket also owns the objects.
I found I can set fs.s3a.acl.default to bucket-owner-full-control; however, that doesn't seem to work. This is what I am doing:
conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
FileSystem fileSystem = FileSystem.get(URI.create(s3Path), conf);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(filePath));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write(contentAsString);
writer.close();
fsDataOutputStream.close();
Any help is appreciated.
conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
is the right property to set. It corresponds to this core-site.xml property, which gives full control to the bucket owner:
<property>
  <name>fs.s3a.acl.default</name>
  <value>bucket-owner-full-control</value>
  <description>Set a canned ACL for newly created and copied objects. Value may be private,
    public-read, public-read-write, authenticated-read, log-delivery-write,
    bucket-owner-read, or bucket-owner-full-control.</description>
</property>
BucketOwnerFullControl
Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.
I recommend also setting fs.s3.canned.acl to the value BucketOwnerFullControl.
For debugging, you can use the snippet below to see which parameters are actually being passed (Entry here is java.util.Map.Entry):
for (Entry<String, String> entry : conf) {
    System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
}
For testing purposes, run this command from the command line:
aws s3 cp s3://bucket/source/dummyfile.txt s3://bucket/target/dummyfile.txt --sse --acl bucket-owner-full-control
If this works from the command line, it will also work through the API.
Bonus point with Spark, useful for Spark Scala users:
For Spark to access the S3 file system, set the proper configuration as in the example below:
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.fast.upload","true")
hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version","2")
hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
hadoopConf.set("fs.s3a.canned.acl","BucketOwnerFullControl")
hadoopConf.set("fs.s3a.acl.default","BucketOwnerFullControl")
If you are using EMR then you have to use the AWS team's S3 connector, with "s3://" URLs, and use their documented configuration options. They don't support the Apache one, so any option beginning with "fs.s3a" isn't going to have any effect whatsoever.
As mentioned in the answer by Stevel, for EMR with PySpark use this:
sc=spark.sparkContext
hadoop_conf=sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.canned.acl","BucketOwnerFullControl")
Canned ACL: BucketOwnerFullControl
Description: Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-s3-acls.html

Spark Scala S3 storage: permission denied

I've read a lot of topics on the Internet about getting Spark to work with S3, and still nothing works properly.
I've downloaded Spark 2.3.2 built for Hadoop 2.7 and later.
I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) to the Spark jars folder:
hadoop-aws-2.7.7.jar
hadoop-auth-2.7.7.jar
aws-java-sdk-1.7.4.jar
Still, I can't use either S3N or S3A to get my file read by Spark.
For S3A I have this exception:
sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden
Using this piece of Python, and some more code, I can list my buckets, list my files, download files, read files from my computer, and get a file URL.
This code gives me the following file URL:
https://my.domain:8080/test_bucket/test_file.txt?Signature=%2Fg3jv96Hdmq2450VTrl4M%2Be%2FI%3D&Expires=1539595614&AWSAccessKeyId=myaccesskey
What should I install, set up, or download to get Spark able to read from and write to my S3 server?
Edit 3:
Using the debug tool mentioned in the comments, here's the result.
It seems like the issue is something to do with the signature, but I'm not sure what that means.
First you will need to download the hadoop-aws.jar and aws-java-sdk.jar that match your Spark/Hadoop release and add them to the jars folder inside the Spark folder.
Then you will need to specify the server you will use and enable path-style access if your S3 server does not support dynamic DNS:
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
// I had to change the signature version because I have an old S3 API implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
Here's my final code:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
tmp.count()
I would recommend putting most of the settings inside spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint mydomain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm S3SignerType
One of the issues I had was setting spark.hadoop.fs.s3a.connection.timeout to 10: prior to Hadoop 3 this value is in milliseconds, and the net effect was a very long wait, with the error message only appearing about 1.5 minutes after the attempt to read a file.
PS:
Special thanks to Steve Loughran.
Thank you a lot for the precious help.

Is it possible to use "Custom Sources and Sinks" to write/append a file during Dataflow pipeline execution?

My program relies on local system storage to write a file that is generated by the program itself, hence I am executing the job in "DirectPipelineRunner" mode. Below is the flow:
One of my functions makes multiple REST API requests and creates/appends to a file (Output.txt) in local system storage.
Pipeline: a) Upload the generated file to GCS, b) Read the file from GCS, c) Perform transformations, d) Write to BigQuery.
Since my program writes/appends the API responses to local system storage, I'm executing the pipeline in DirectPipelineRunner mode.
Is it possible to have temporary space in the cloud to remove the dependency on the local file system, so that I can execute the pipeline in DataflowPipelineRunner mode?
I guess Custom Sources and Sinks could be used here. Can someone shed some light on this problem statement?
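As a hedged illustration of step a) above (staging the locally generated file in GCS, not a custom source or sink), here is a minimal sketch using the google-cloud-storage Python client; the bucket name and object path are placeholders, and credentials are assumed to already be configured.
from google.cloud import storage

# Placeholder bucket and object names; assumes application-default credentials.
client = storage.Client()
bucket = client.bucket("my-temp-bucket")
blob = bucket.blob("staging/Output.txt")
blob.upload_from_filename("Output.txt")  # copy the locally generated file to GCS
print("Uploaded to gs://my-temp-bucket/staging/Output.txt")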

DSX writing to Bluemix Object Storage

Will Bluemix Object Storage ever have folder capability inside a container, like Amazon S3? I am not sure about other folks, but when writing from DSX it quickly becomes such a mess inside a container. It's like a computer with no ability to create folders under the C:\ drive. It's a complete mess.
Since it's DSX's primary storage, is DSX pushing for this capability? (Screenshot: Bluemix Object Storage with no folder capability.)
Here's an S3 container and how beautifully you can organize everything. (Screenshot: S3 container.)
I believe what you are looking for is something like subcontainers to organize your files.
I think the Object Storage service is based on OpenStack Object Storage (Swift), and according to the OpenStack docs it is not possible to create nested directories.
https://docs.openstack.org/user-guide/cli-swift-pseudo-hierarchical-folders-directories.html
You can simulate subdirectories by putting the path in the object name, separated with /. When writing or reading a file you can use something like 'swift://containername.' + name + '/foldername/filename.csv'.
So anything you write with /foldername/filename.csv will be organized under foldername.
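A minimal sketch of that idea from a notebook, assuming a SparkSession named spark and an already-configured Swift connection; the container and connection names below are placeholders.
# Build the object name with an embedded "folder" path; Object Storage stores it as one flat key.
name = "myconnection"  # placeholder for your configured connection name
path = "swift://containername." + name + "/foldername/filename.csv"

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.csv(path)             # everything written here is grouped under foldername/
df_back = spark.read.csv(path)
df_back.show()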
Thanks,
Charles.