I've read a lot of topics on the Internet about getting Spark to work with S3, but still nothing works properly.
I've downloaded Spark 2.3.2 with Hadoop 2.7 and above.
I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) to the Spark jars folder:
hadoop-aws-2.7.7.jar
hadoop-auth-2.7.7.jar
aws-java-sdk-1.7.4.jar
Still, I can use neither S3N nor S3A to get my file read by Spark.
For S3A, the code below fails with this exception:
sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden
Using this piece of Python, and some more code, I can list my buckets, list my files, download files, read files from my computer and get file url.
This code gives me the following file url:
https://my.domain:8080/test_bucket/test_file.txt?Signature=%2Fg3jv96Hdmq2450VTrl4M%2Be%2FI%3D&Expires=1539595614&AWSAccessKeyId=myaccesskey
What should I install / set up / download to get Spark able to read from and write to my S3 server?
Edit 3:
Using the debug tool suggested in the comments, here's the result.
It seems the issue is related to the signature, but I'm not sure what that means.
First you will need to download the hadoop-aws.jar and aws-java-sdk.jar versions that match your Spark/Hadoop release and add them to the jars folder inside the Spark folder.
Then you will need to specify the server to use, and enable path-style access if your S3 server does not support DNS-based (virtual-hosted-style) bucket addressing:
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
I also had to change the signature version because my server has an old S3 API implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
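To illustrate how the endpoint and the bucket are separated under path-style access, here is a minimal Python sketch. It takes an HTTPS URL like the one shown in the question and splits it into the value for fs.s3a.endpoint and the s3a:// path that Spark should be given. The helper name split_path_style_url is hypothetical and only serves the illustration; it assumes the first path segment is the bucket.

```python
from urllib.parse import urlsplit


def split_path_style_url(url):
    """Split a path-style S3 URL into (endpoint, s3a_uri).

    Assumes the first path segment is the bucket name; any query string
    (for example a presigned signature) is ignored.
    """
    parts = urlsplit(url)
    bucket, _, key = parts.path.lstrip("/").partition("/")
    if not parts.netloc or not bucket:
        raise ValueError(f"not a path-style S3 URL: {url!r}")
    return parts.netloc, f"s3a://{bucket}/{key}"


endpoint, s3a_uri = split_path_style_url(
    "https://my.domain:8080/test_bucket/test_file.txt"
)
print(endpoint)  # my.domain:8080
print(s3a_uri)   # s3a://test_bucket/test_file.txt
```

The host and port go into fs.s3a.endpoint, while the s3a:// URI starts directly with the bucket name.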
Here's my final code:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
// Create the RDD only after all S3A settings are in place
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
tmp.count()
I would recommend putting most of the settings inside spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint my.domain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm S3SignerType
One of the issues I had came from setting spark.hadoop.fs.s3a.connection.timeout to 10: prior to Hadoop 3 this value is expressed in milliseconds, and it gave me a very long timeout; the error message would only appear 1.5 minutes after the attempt to read a file.
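For PySpark users, the same settings can be assembled programmatically. The sketch below only builds a dictionary of spark.hadoop.* properties matching the ones discussed above; the function name s3a_spark_conf and its parameters are hypothetical. Credentials are deliberately left out.

```python
def s3a_spark_conf(endpoint, path_style=True, ssl=True, signer=None):
    """Return S3A settings as Spark properties (prefixed with 'spark.hadoop.')."""
    settings = {
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.path.style.access": str(path_style).lower(),
        "fs.s3a.connection.ssl.enabled": str(ssl).lower(),
    }
    if signer:
        # Only needed for servers with an older S3 API implementation
        settings["fs.s3a.signing-algorithm"] = signer
    return {f"spark.hadoop.{key}": value for key, value in settings.items()}


conf = s3a_spark_conf("my.domain:8080", signer="S3SignerType")
for key, value in conf.items():
    # Same "key value" layout as a spark-defaults.conf line
    print(key, value)
```

Each pair could then be passed to SparkSession.builder.config(key, value) before the session is created, which should be equivalent to listing it in spark-defaults.conf.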
PS:
Special thanks to Steve Loughran.
Thank you a lot for the precious help.
Looking for Databricks Python/PySpark code to copy Azure blobs older than 30 days from one container to another
The copy code is simple as follows.
dbutils.fs.cp("/mnt/xxx/file_A", "/mnt/yyy/file_A", True)
The difficult part is checking blob modification time. According to the doc, the modification time will only get returned by using dbutils.fs.ls command on Databricks Runtime 10.2 or above. You may check the Runtime version using the command below.
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
The returned value will be Databricks Runtime followed by Scala versions.
If you get lucky with the version, you can do something like:
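import time
# modificationTime is reported in milliseconds since the epoch, so compare in milliseconds
ts_now_ms = time.time() * 1000
for file in dbutils.fs.ls('/mnt/xxx'):
    if ts_now_ms - file.modificationTime > 30 * 86400 * 1000:
        dbutils.fs.cp(f'/mnt/xxx/{file.name}', f'/mnt/yyy/{file.name}', True)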
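The age check is easy to get wrong because of units, so here is a small standalone sketch of it. It assumes that the modificationTime returned by dbutils.fs.ls is expressed in milliseconds since the epoch, while time.time() returns seconds. The helper name is_older_than is hypothetical and it can be tried outside Databricks.

```python
import time

MS_PER_DAY = 86_400 * 1000


def is_older_than(modification_time_ms, days, now_ms=None):
    """Return True if a timestamp (ms since epoch) is more than `days` days old."""
    if now_ms is None:
        now_ms = time.time() * 1000  # time.time() is in seconds
    return now_ms - modification_time_ms > days * MS_PER_DAY


now_ms = 1_700_000_000_000  # fixed "now" so the example is reproducible
print(is_older_than(now_ms - 31 * MS_PER_DAY, 30, now_ms))  # True
print(is_older_than(now_ms - 29 * MS_PER_DAY, 30, now_ms))  # False
```

In a notebook, the condition inside the loop would become is_older_than(file.modificationTime, 30).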
Could you please help me with the question below? The image with the error is available in the question.
I use Azure Databricks for data engineering. The same code runs without error in Databricks Community Edition, but in Azure it returns an error saying that the path was not found. Has anyone been through this situation?
I'm using SparkFiles.
cnae = 'https://servicodados.ibge.gov.br/api/v2/cnae/subclasses'
from pyspark import SparkFiles
spark.sparkContext.addFile(cnae)
cnaeDF = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("file://"+SparkFiles.get("subclasses"))
[Image: screenshot of the error message]
It seems like a bug in Runtime 10: spark.sparkContext.addFile(cnae) adds the file to local storage:
/local_disk0/spark-f1411c54-0a2e-4138-a0ed-c2e6bbfe5ca4/userFiles-7616de8f-3e03-493c-89e6-50fa1f7324ca/subclasses
but SparkFiles.get("subclasses") tries to read it from DBFS storage (I tried adding it in every possible way).
However, when this magic command is run:
%sh
cp -r /local_disk0/spark-f1411c54-0a2e-4138-a0ed-c2e6bbfe5ca4 /dbfs/local_disk0/
it becomes possible to read the file without any problem.
sup y'all
in python, this executes no problem:
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "...")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "...")
sc.textFile("s3a://path").count()
someBigNumber
in scala, i get a 403:
sc.hadoopConfiguration.set("fs.s3a.access.key", "...")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")
sc.textFile("s3a://path").count()
StackTrace: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: ...)
why?
this is all spark 2.0.
thanks
Try setting the properties before you create the SparkContext, e.g. set "spark.hadoop.fs.s3a..." = value in the SparkConf.
Spark tries to be clever in spark submit and copy in the AWS_ env vars into the s3a and s3n properties prior to submission, even if the properties are set. This can stamp on your settings. Look at them, verify their correctness, maybe unset them (or try setting them).
And S3a goes through the auth sequence of: attempt hadoop props; attempt env vars in destination process; attempt EC2 IAM role (exact checks & ordering is Hadoop JAR dependent). It may be something at the far end is causing fun here.
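As a quick check for the environment-variable case, the sketch below reports whether the two standard AWS credential variables are set in the current process, without printing their values. The helper name aws_env_report is hypothetical, and checking only AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY is an assumption about which variables matter for your Spark and Hadoop versions.

```python
import os

AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def aws_env_report(environ=None):
    """Report which AWS credential variables are set, never revealing the values."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in AWS_ENV_VARS:
        value = environ.get(name)
        # Only the length is shown so secrets never end up in logs
        report[name] = None if value is None else f"set ({len(value)} chars)"
    return report


for name, status in aws_env_report().items():
    print(name, "->", status or "not set")
```

Running it in the shell that launches spark-submit, and again inside the job, shows whether stale variables could be overriding the keys set in code.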
There's another emergency option, which is pretty insecure, of putting the username:pass in the URL, such as s3a://AAID43ss:1356@bucket/path. This doesn't work on Hadoop < 2.8 if there is a / in the secret, and your secrets get logged to the console. Use carefully. Update: this was cut from Hadoop 3.2 after many years of warning users to stop doing it.
Trying to debug auth problems is a PITA as the code deliberately avoids having useful debug statements: we don't dare log the properties.
You may find something useful in the Troubleshooting S3A section of the Hadoop docs. Do bear in mind that this covers later versions of Hadoop; some things mentioned there won't be valid.
Enjoy
Steve L (currently working on the S3A code)
It means in this case, Python and Scala are "incompatible" and Scala doesn't have access to the amazonaws. Maybe the key is different and you have a typo on the Scala code, or maybe Scala doesn't work with amazonaws anymore due to amazonaws changing.
I'm trying to monitor a directory in HDFS to read and process data from files copied into it (to copy files from the local system to HDFS I use hdfs dfs -put). Sometimes this fails with: Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_, so I read about the problem in forums and in the question "Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_".
According to what I read, the problem is that Spark Streaming reads the file before it has finished being copied into HDFS. On GitHub:
https://github.com/maji2014/spark/blob/b5af1bdc3e35c53564926dcbc5c06217884598bb/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala , they say the problem was corrected, but as far as I can see only for FileInputDStream, whereas I'm using textFileStream.
When I tried to use FileInputDStream, the IDE reported an error: the symbol is not accessible from this place.
Does anyone know how to filter out the files that are still being copied? I tried:
var lines = ssc.textFileStream(arg(0)).filter(!_.contains("_COPYING_"))
but that didn't work, which is expected: the filter would need to apply to the name of the file being processed, which I guess I can't access.
As you can see, I did plenty of research before asking the question but didn't get lucky.
Any help, please?
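So I had a look: writing directly into the monitored directory with -put is the wrong method. Look at the final comment: in your shell script, upload the file to a location outside the monitored directory first and then move it in with a rename (hdfs dfs -mv), because a rename is an atomic operation on HDFS.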
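A minimal sketch of that upload-then-rename pattern, written in Python around the hdfs command line. The function names and the example directories are hypothetical; the sketch assumes the staging directory lives on the same HDFS filesystem as the monitored one but outside it, and that local paths use POSIX separators.

```python
import posixpath
import subprocess


def staged_upload_commands(local_path, staging_dir, watched_dir):
    """Build the two hdfs commands: upload to staging, then rename into place."""
    name = posixpath.basename(local_path)
    staged = posixpath.join(staging_dir, name)
    final = posixpath.join(watched_dir, name)
    return [
        ["hdfs", "dfs", "-put", local_path, staged],  # slow copy, invisible to the stream
        ["hdfs", "dfs", "-mv", staged, final],        # rename: the file appears complete
    ]


def staged_upload(local_path, staging_dir, watched_dir):
    """Run the commands in order, stopping at the first failure."""
    for command in staged_upload_commands(local_path, staging_dir, watched_dir):
        subprocess.run(command, check=True)


for command in staged_upload_commands("/tmp/data.txt", "/staging", "/incoming"):
    print(" ".join(command))
```

Because the stream only ever sees the file after the rename, it should never pick up a partially written ._COPYING_ file.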
From my machine, I've configured the hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the analogue from java using:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] status = fs.listStatus(new Path("gs://mybucket/"));
I get the files under root in my local HDFS instead of in gs://mybucket/, but with those files prepended with gs://mybucket. If I modify the conf with conf.set("fs.default.name", "gs://mybucket"); before obtaining the fs, then I can see the files on GCS.
My question is:
1. Is this expected behavior?
2. Is there a disadvantage to using this hadoop FileSystem api as opposed to the google cloud storage client api?
As to your first question, "expected" is questionable, but I think I can at least explain. When FileSystem.get() is used the default FileSystem is returned and by default that is HDFS. My guess is that the HDFS client (DistributedFileSystem) has code to prepend scheme + authority automatically to all files in the filesystem.
Instead of using FileSystem.get(conf), try
FileSystem gcsFs = new Path("gs://mybucket/").getFileSystem(conf);
On disadvantages, I could probably argue that if you end up needing to access the object-store directly then you'll end up writing code to interact with the storage APIs directly anyways (and there are things that do not translate very well to the Hadoop FS API, e.g., object composition, complex object write preconditions other than simple object overwrite protection, etc).
I am admittedly biased (working on the team), but if you're intending to use GCS from Hadoop Map/Reduce, from Spark, etc, the GCS connector for Hadoop should be a fairly safe bet.