Unable to read from s3 bucket using spark

Unable to read from s3 bucket using spark - scala

val spark = SparkSession
.builder()
.appName("try1")
.master("local")
.getOrCreate()
val df = spark.read
.json("s3n://BUCKET-NAME/FOLDER/FILE.json")
.select($"uid").show(5)
I have set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables. I get the error below when trying to read from S3.
Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/FOLDER%2FFILE.json' - ResponseCode=400, ResponseMessage=Bad Request
I suspect the error is caused by "/" being converted to "%2F" by some internal function, since the error shows '/FOLDER%2FFILE.json' instead of '/FOLDER/FILE.json'.

Your Spark (JVM) application may not pick up the environment variables unless you pass them on explicitly, so a quick workaround is to set the credentials in the Hadoop configuration:
spark.sparkContext
.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext
.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
You'll also need to specify the S3 endpoint. Note that this property belongs to the s3a connector, so it only takes effect for paths that use the s3a:// scheme:
spark.sparkContext
.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>")
To learn more about AWS S3 endpoints, refer to the following documentation:
AWS Regions and Endpoints.
Working with Amazon S3 Buckets.
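The endpoint property above is an s3a setting, while the credential keys are s3n ones, so it may help to keep everything on one connector. Below is a minimal sketch, assuming the hadoop-aws module is on the classpath and the bucket is read through s3a://. The object name S3aConfig, the helper s3aSettings, and the example endpoint are illustrative only and do not come from the original answer.

```scala
object S3aConfig {
  // Hypothetical helper: builds s3a connector settings from an environment map.
  // Throws NoSuchElementException if either AWS variable is missing.
  def s3aSettings(env: Map[String, String], endpoint: String): Map[String, String] =
    Map(
      "fs.s3a.access.key" -> env("AWS_ACCESS_KEY_ID"),
      "fs.s3a.secret.key" -> env("AWS_SECRET_ACCESS_KEY"),
      "fs.s3a.endpoint"   -> endpoint
    )
}

// Usage with the session from the question (endpoint value is only an example):
// S3aConfig.s3aSettings(sys.env, "s3.eu-central-1.amazonaws.com")
//   .foreach { case (k, v) => spark.sparkContext.hadoopConfiguration.set(k, v) }
// spark.read.json("s3a://BUCKET-NAME/FOLDER/FILE.json")
```

Keeping the settings in a plain map makes it easy to check what will be applied before touching the Spark session.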

Related

Hdinsight Spark Session issue with Parquet

I'm using HDInsight to run Spark with a Scala script.
I'm using the example scripts provided by the Azure plugin for IntelliJ.
It provides me with the following code:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
Fair enough. And I can do things like:
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
and I can save files:
rdd1.saveAsTextFile("wasb:///HVACout2")
However, I am looking to load in a parquet file. The code I have found (elsewhere) for parquet files coming in is:
val df = spark.read.parquet("resources/Parquet/MyFile.parquet/")
The line above gives an error in HDInsight (when I submit the JAR via IntelliJ).
Why don't you use this instead?
val spark = SparkSession.builder
.master("local[*]") // adjust accordingly
.config("spark.sql.warehouse.dir", "E:/Exp/") //change accordingly
.appName("MySparkSession") //change accordingly
.getOrCreate()
When I add the SparkSession and remove the SparkContext, HDInsight breaks.
What am I doing wrong?
How do I create a Spark session or context in HDInsight that lets me read text files, Parquet, and everything else? How do I get the best of both worlds?
My understanding is that SparkSession is the newer, preferred API and the one we should be using. So how do I get it running in HDInsight?
Thanks in advance.

It turns out that if I add
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()
after the SparkContext line and before the Parquet read, it works.
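A SparkSession also exposes its underlying SparkContext, so one way to get both APIs is to build only the session and take the context from it. The sketch below assumes the job is submitted to the cluster (so no master is set in code). The object name and the Parquet path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object HdiSessionAndContext {
  def main(args: Array[String]): Unit = {
    // The master is expected to come from the cluster submission, so none is set here.
    val spark = SparkSession.builder().appName("MyApp").getOrCreate()
    // Reuse the context owned by the session instead of constructing a second one.
    val sc = spark.sparkContext

    // RDD API through the SparkContext, as in the question.
    val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
    println(rdd.count())

    // DataFrame API through the SparkSession; this path is a hypothetical location.
    val df = spark.read.parquet("wasb:///example/MyFile.parquet")
    df.printSchema()

    spark.stop()
  }
}
```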

register hive udf in scala - java.net.MalformedURLException: unknown protocol: s3

I am trying to register a UDF in Scala Spark. Registering it in Hive with the following statement works:
create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'
In Spark I do this:
val sparkSess = SparkSession.builder()
.appName("Opens")
.enableHiveSupport()
.config("set hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
sparkSess.sql("""create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'""");
I get an error saying:
Exception in thread "main" java.net.MalformedURLException: unknown protocol: s3
I would like to know whether I have to set something in the config or do anything else. I have just started learning.
Any help with this is appreciated.

Why not add gdpr-hive-udfs-hadoop.jar as an external JAR to your project and then register the UDF like this:
val sqlContext = sparkSess.sqlContext
val udf_parallax = sqlContext.udf.register("udf_parallax", com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash _)
Update:
1. If your Hive is running on a remote server:
val sparkSession= SparkSession.builder()
.appName("Opens")
.config("hive.metastore.uris", "thrift://METASTORE:9083")
.config("hive.exec.dynamic.partition.mode", "nonstrict") // the key must not include the "set " prefix
.enableHiveSupport()
.getOrCreate()
sparkSession.sql("""create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'""");
2. If Hive is not running on a remote server:
Copy hive-site.xml from your /hive/conf/ directory to your /spark/conf/ directory and create the SparkSession as shown in the question.

Spark 2.3 dynamic partitionBy not working on S3 AWS EMR 5.13.0

Dynamic partitioning introduced by Spark 2.3 doesn't seem to work on AWS's EMR 5.13.0 when writing to S3
When executing, a temporary directory is created in S3 but it disappears once the process is completed without writing the new data to the final folder structure.
The issue was found when executing a Scala/Spark 2.3 application on EMR 5.13.0.
The configuration is as follows:
var spark = SparkSession
.builder
.appName(MyClass.getClass.getSimpleName)
.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC") // also tried "dynamic"
The code that writes to S3:
val myDataset : Dataset[MyType] = ...
val w = myDataset
.coalesce(10)
.write
.option("encoding", "UTF-8")
.option("compression", "snappy")
.mode("overwrite")
.partitionBy("col_1","col_2")
w.parquet(s"$destinationPath/" + Constants.MyTypeTableName)
Here destinationPath is an S3 bucket/folder.
Has anyone else experienced this issue?

Upgrading to EMR 5.19 fixes the problem. However, my previous answer (kept below) is incorrect: the EMRFS S3-optimized Committer has nothing to do with it. That committer is silently skipped when spark.sql.sources.partitionOverwriteMode is set to dynamic: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html
Previous answer: If you can upgrade to at least EMR 5.19.0, AWS's EMRFS S3-optimized Committer solves these issues.
--conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
See: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html

Error instantiating 'org.apache.spark.sql.hive.HiveSessionState': on Linux server

I have a Scala Spark application that I'm trying to run on a Linux server using a shell script. I am getting the error:
Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
However, I don't understand what is wrong. I am doing this to instantiate Spark:
val sparkConf = new SparkConf().setAppName("HDFStoES").setMaster("local")
val spark: SparkSession = SparkSession.builder.enableHiveSupport().config(sparkConf).getOrCreate()
Am I doing this correctly? If so, what could be causing the error?

sparkSession = SparkSession.builder().appName("Test App").master("local[*]")
.config("hive.metastore.warehouse.dir", hiveWareHouseDir)
.config("spark.sql.warehouse.dir", hiveWareHouseDir).enableHiveSupport().getOrCreate()
Use the above. You need to specify the "hive.metastore.warehouse.dir" directory to enable Hive support in the Spark session.

Spark AWS emr checkpoint location

I'm running a Spark job on EMR but need to create a checkpoint. I tried using S3 but got this error message:
17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
Here is my sample code
...
val sparkConf = new SparkConf().setAppName("spark-job")
.set("spark.default.parallelism", (CPU * 3).toString)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(Array(classOf[Member], classOf[GraphVertex], classOf[GraphEdge]))
.set("spark.dynamicAllocation.enabled", "true")
implicit val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
sparkSession.sparkContext.setCheckpointDir("s3://spark-jobs/checkpoint")
....
How can I checkpoint on AWS EMR?

There was a bug in Spark, now fixed, which meant you could only checkpoint to the default FS and not to any other one (like S3). It's fixed in master; I don't know about backports.
If it makes you feel any better: the way checkpointing works (write, then rename()) is slow enough on an object store that you may be better off checkpointing locally and then uploading to S3 yourself.
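The "checkpoint locally, then upload" idea could look roughly like the sketch below. It assumes the cluster's default file system is HDFS (as the error message in the question suggests) and that the S3 file system is already configured on the cluster. The object name and the HDFS path are hypothetical, and the job body is left out.

```scala
import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.sql.SparkSession

object CheckpointThenUpload {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-job").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical checkpoint directory on the default file system (HDFS).
    val hdfsCheckpoint = new Path("hdfs:///tmp/spark-job/checkpoint")
    // Destination bucket/prefix taken from the question.
    val s3Copy = new Path("s3://spark-jobs/checkpoint")

    sc.setCheckpointDir(hdfsCheckpoint.toString)

    // ... run the job that calls checkpoint() here ...

    // Copy the finished checkpoint files to S3 without deleting the source.
    val conf = sc.hadoopConfiguration
    FileUtil.copy(hdfsCheckpoint.getFileSystem(conf), hdfsCheckpoint,
      s3Copy.getFileSystem(conf), s3Copy, false, conf)

    spark.stop()
  }
}
```

The copy runs once at the end, so the slow rename() on the object store is not paid for every checkpoint write.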

There is a fix in the master branch that allows checkpointing to S3 as well. I was able to build against it and it worked, so this should be part of the next release.

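Try something with AWS authentication, like:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.streaming.StreamingContext
val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")
// getOrCreate is defined on the StreamingContext companion object, not on SparkContext.
// checkPointDir, config and createStreamingContext are your own definitions.
val ssc = StreamingContext.getOrCreate(checkPointDir, () =>
{ createStreamingContext(checkPointDir, config) }, hadoopConf)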