Spark 2.3 dynamic partitionBy not working on S3 AWS EMR 5.13.0 - scala

Dynamic partition overwrite, introduced in Spark 2.3, doesn't seem to work on AWS EMR 5.13.0 when writing to S3.
When executing, a temporary directory is created in S3, but it disappears once the process completes, without the new data being written to the final folder structure.
The issue was found when executing a Scala/Spark 2.3 application on EMR 5.13.0.
The configuration is as follows:
var spark = SparkSession
  .builder
  .appName(MyClass.getClass.getSimpleName)
  .getOrCreate()

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC") // also tried "dynamic"
The code that writes to S3:
val myDataset: Dataset[MyType] = ...

val w = myDataset
  .coalesce(10)
  .write
  .option("encoding", "UTF-8")
  .option("compression", "snappy")
  .mode("overwrite")
  .partitionBy("col_1", "col_2")

w.parquet(s"$destinationPath/" + Constants.MyTypeTableName)
With destinationPath being an S3 bucket/folder.
Has anyone else experienced this issue?

Upgrading to EMR 5.19 fixes the problem. However, my previous answer is incorrect: using the EMRFS S3-optimized Committer has nothing to do with it. The EMRFS S3-optimized Committer is silently skipped when spark.sql.sources.partitionOverwriteMode is set to dynamic: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html
(Previous answer) If you can upgrade to at least EMR 5.19.0, AWS's EMRFS S3-optimized Committer solves these issues. Enable it with:
--conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
See: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
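As an illustration only, that flag could also be set when building the SparkSession in Scala; this is a sketch assuming EMR 5.19.0+ and a non-dynamic overwrite mode (since, per the correction above, the committer is skipped when dynamic mode is enabled):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp") // placeholder name
  // EMRFS S3-optimized committer flag from the linked EMR docs (EMR 5.19.0+)
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()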

Related

Hdinsight Spark Session issue with Parquet

I'm using HDInsight to run Spark with a Scala script.
I'm using the example scripts provided by the Azure plugin in IntelliJ.
It provides me with the following code:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
Fair enough. And I can do things like:
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
and I can save files:
rdd1.saveAsTextFile("wasb:///HVACout2")
However, I am looking to load in a parquet file. The code I have found (elsewhere) for reading in parquet files is:
val df = spark.read.parquet("resources/Parquet/MyFile.parquet/")
The line above gives an error in HDInsight (when I submit the jar via IntelliJ).
Why don't you use the following?
val spark = SparkSession.builder
.master("local[*]") // adjust accordingly
.config("spark.sql.warehouse.dir", "E:/Exp/") //change accordingly
.appName("MySparkSession") //change accordingly
.getOrCreate()
When I put in a SparkSession and get rid of the SparkContext, HDInsight breaks.
What am I doing wrong?
Using HDInsight, how do I go about creating either a SparkSession or a SparkContext that allows me to read in text files, parquet, and all the rest? How do I get the best of both worlds?
My understanding is that SparkSession is the better and more recent way, and what we should be using. So how do I get it running in HDInsight?
Thanks in advance
Turns out if I add
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()
after the SparkContext line and before the parquet read, it works.
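For clarity, a minimal sketch of that working combination, keeping the existing SparkContext and adding a SparkSession before the parquet read (the paths are the question's own examples):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)

// RDD-style access still works through the SparkContext
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

// A SparkSession created after the SparkContext reuses the same underlying context
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()

// DataFrame-style access (parquet, csv, ...) goes through the SparkSession
val df = spark.read.parquet("resources/Parquet/MyFile.parquet/")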

How to load a csv/txt file into AWS Glue job

I have the two questions below about AWS Glue; could you please clarify? I need to use Glue as part of my project.
I would like to load a csv/txt file into a Glue job to process it (like we do in Spark with dataframes). Is this possible in Glue? Or do we have to use only Crawlers to crawl the data into Glue tables and then use them like below for further processing?
empdf = glueContext.create_dynamic_frame.from_catalog(
    database="emp",
    table_name="emp_json")
Below I used Spark code to load a file into Glue, but I'm getting lengthy error logs. Can we directly run Spark or PySpark code as it is without any changes in Glue?
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

# Standard Glue job boilerplate: resolve the job arguments and initialize the job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

dfnew = spark.read.option("header", "true").option("delimiter", ",").csv("C:\inputs\TEST.txt")
dfnew.show(2)
It's possible to load data directly from s3 using Glue:
sourceDyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://bucket/folder"]
    },
    format_options={
        "withHeader": True,
        "separator": ","
    })
You can also do that just with Spark (as you already tried):
sourceDf = (spark.read
    .option("header", "true")
    .option("delimiter", ",")
    .csv("C:\inputs\TEST.txt"))
However, in this case Glue doesn't guarantee that an appropriate Spark reader is available. So if your error is related to a missing CSV data source, you should add the spark-csv lib to the Glue job by providing the S3 path to its location via the --extra-jars parameter.
I tested the two cases below and both work fine:
1) To load a file from S3 into Glue:
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://MyBucket/path/"] }, format="csv" )
dfnew.show(2)
2) To load data from a Glue database and tables that were already generated through Glue Crawlers:
DynFr = glueContext.create_dynamic_frame.from_catalog(database="test_db", table_name="test_table")
DynFr is a DynamicFrame, so if we want to work with Spark code in Glue, we need to convert it into a normal DataFrame, like below.
df1 = DynFr.toDF()

Unable to create dataframe using SQLContext object in spark2.2

I am using Spark version 2.2 on Microsoft Windows 7. I want to load a csv file into a variable to perform SQL-related actions on it later, but I am unable to do so. I referred to the accepted answer from this link, but it was of no use. I followed the steps below for creating the SparkContext object and the SQLContext object:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc=SparkContext.getOrCreate() // Creating spark context object
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Creating SQL object for query related tasks
The objects are created successfully, but when I execute the code below it throws an error which I can't post here.
val df = sqlContext.read.format("csv").option("header", "true").load("D://ResourceData.csv")
And when I try something like df.show(2), it says that df was not found. I tried the Databricks solution for loading CSV from the attached link. It downloads the packages but doesn't load the csv file. So how can I rectify my problem? Thanks in advance :)
I solved my problem of loading a local file into a dataframe using Spark 1.6 on the Cloudera VM with the help of the code below:
1) sudo spark-shell --jars /usr/lib/spark/lib/spark-csv_2.10-1.5.0.jar,/usr/lib/spark/lib/commons-csv-1.5.jar,/usr/lib/spark/lib/univocity-parsers-1.5.1.jar
2) val df1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("treatEmptyValuesAsNulls", "true" ).option("parserLib", "univocity").load("file:///home/cloudera/Desktop/ResourceData.csv")
NOTE: sc and sqlContext variables are automatically created
But there are many improvements in the latest version, i.e. 2.2.1, which I am unable to use because metastore_db doesn't get created on Windows 7. I'll post a new question about that.
In reference to your comment that you are able to access the SparkSession variable, follow the steps below to process your csv file using Spark SQL.
Spark SQL is a Spark module for structured data processing.
There are mainly two abstractions, Dataset and DataFrame:
A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns.
In the Scala API, DataFrame is simply a type alias of Dataset[Row].
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
You have a csv file and you can simply create a dataframe by doing one of the following:
From your spark-shell using the SparkSession variable spark:
val df = spark.read
.format("csv")
.option("header", "true")
.load("sample.csv")
After reading the file into a dataframe, you can register it as a temporary view.
df.createOrReplaceTempView("foo")
SQL statements can be run by using the sql method provided by Spark:
val fooDF = spark.sql("SELECT name, age FROM foo WHERE age BETWEEN 13 AND 19")
You can also query that file directly with SQL:
val df = spark.sql("SELECT * FROM csv.`file:///path to the file/`")
Make sure that you run Spark in local mode when you load data from the local filesystem, or else you will get an error. The error occurs when you have already set the HADOOP_CONF_DIR environment variable, which makes Spark expect "hdfs://..." paths instead of "file://..." paths.
Set your spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse).
.config("spark.sql.warehouse.dir", "file:///C:/path/to/my/")
This is the default location of the Hive warehouse directory (using Derby) for managed databases and tables. Once you set the warehouse directory, Spark will be able to locate your files, and you can load the csv.
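Putting those pieces together, a minimal sketch for running in local mode on Windows (the master, warehouse directory, and file paths are placeholders you would adjust):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")                                                  // run locally so file:// paths resolve
  .appName("CsvLocalExample")                                          // placeholder app name
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse") // placeholder path
  .getOrCreate()

val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("file:///C:/path/to/ResourceData.csv")                         // placeholder path

df.createOrReplaceTempView("resources")
spark.sql("SELECT * FROM resources").show(2)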
Reference : Spark SQL Programming Guide
Spark version 2.2.0 has built-in support for csv.
In your spark-shell run the following code
val df= spark.read
.option("header","true")
.csv("D:/abc.csv")
df: org.apache.spark.sql.DataFrame = [Team_Id: string, Team_Name: string ... 1 more field]

Unable to read from s3 bucket using spark

val spark = SparkSession
.builder()
.appName("try1")
.master("local")
.getOrCreate()
val df = spark.read
.json("s3n://BUCKET-NAME/FOLDER/FILE.json")
.select($"uid").show(5)
I have provided AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables. I get the error below while trying to read from S3.
Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/FOLDER%2FFILE.json' - ResponseCode=400, ResponseMessage=Bad Request
I suspect the error is caused by "/" being converted to "%2F" by some internal function, as the error shows '/FOLDER%2FFILE.json' instead of '/FOLDER/FILE.json'.
Your Spark (JVM) application cannot read the environment variables unless you tell it to, so a quick workaround:
spark.sparkContext
.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext
.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
You'll also need to specify the S3 endpoint:
spark.sparkContext
.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>");
To learn more about AWS S3 endpoints, refer to the following documentation:
AWS Regions and Endpoints.
Working with Amazon S3 Buckets.
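As an illustration, a minimal sketch combining the credential and endpoint settings above with the read from the question (the <<ENDPOINT>> and BUCKET-NAME placeholders are left as in the original, and reading the keys from the environment in the driver is just one option):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("try1")
  .master("local")
  .getOrCreate()

// Pass the credentials explicitly to the Hadoop configuration (here taken from the environment)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
// Regional S3 endpoint (placeholder, as in the answer above)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>")

val df = spark.read.json("s3n://BUCKET-NAME/FOLDER/FILE.json") // placeholder path from the question
df.select("uid").show(5)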

Spark AWS emr checkpoint location

I'm running a Spark job on EMR and need to create a checkpoint. I tried using S3 but got this error message:
17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception:
java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
Here is my sample code
...
val sparkConf = new SparkConf().setAppName("spark-job")
.set("spark.default.parallelism", (CPU * 3).toString)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(Array(classOf[Member], classOf[GraphVertex], classOf[GraphEdge]))
.set("spark.dynamicAllocation.enabled", "true")
implicit val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
sparkSession.sparkContext.setCheckpointDir("s3://spark-jobs/checkpoint")
....
How can I checkpoint on AWS EMR?
There's a now-fixed bug in Spark which meant you could only checkpoint to the default FS, not any other one (like S3). It's fixed in master; I don't know about backports.
If it makes you feel any better, given the way checkpointing works (write, then rename(), which is slow enough on an object store), you may find yourself better off checkpointing locally and then doing the upload to S3 yourself.
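Building on that suggestion, a minimal sketch of checkpointing to the cluster's default FS (HDFS, which is what the "expected: hdfs://..." error points to) and copying to S3 yourself afterwards; the HDFS path is an assumed example:

// Checkpoint to HDFS (the cluster's default FS) instead of S3; the path is an assumed example
sparkSession.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
// ...run the job, then copy the checkpoint output to S3 yourself if it needs to be kept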
There is a fix in the master branch for this to allow checkpointing to S3 too. I was able to build against it and it worked, so this should be part of the next release.
Try something with AWS authentication like:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.streaming.StreamingContext

val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")

// Recover the StreamingContext from the checkpoint, or create it, passing the
// Hadoop configuration that carries the S3 credentials
val ssc = StreamingContext.getOrCreate(checkPointDir,
  () => createStreamingContext(checkPointDir, config), hadoopConf)
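The createStreamingContext helper and the config value are not defined in that answer; purely as an assumption to make the sketch self-contained, a hypothetical minimal version could look like:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical helper assumed by the snippet above: builds a StreamingContext
// and registers the checkpoint directory so getOrCreate can recover from it.
def createStreamingContext(checkPointDir: String, config: SparkConf): StreamingContext = {
  val ssc = new StreamingContext(config, Seconds(30)) // assumed 30-second batch interval
  ssc.checkpoint(checkPointDir)
  // define your input streams and transformations here before returning
  ssc
}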