Getting the below error while executing a Glue job:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext
Not sure how to stop the existing SparkContext; any suggestion would be helpful.
If it were plain Spark code, we could do it with
sc = SparkContext.getOrCreate()
or
sc.stop()
but I'm not sure how to stop it in Glue.
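For reference, this is the pattern I'd expect to work inside the Glue script as well (a minimal sketch, assuming Glue behaves like plain PySpark here and the existing context can simply be reused rather than stopped):
from awsglue.context import GlueContext
from pyspark.context import SparkContext
# reuse the context the Glue runtime has already started instead of creating
# a second one, which is what triggers the "multiple SparkContexts" error
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session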
Related
I have a bunch of existing PySpark scripts that I want to execute using AWS Glue. The scripts use APIs like SparkSession.read and various transformations on PySpark DataFrames.
I wasn't able to find docs outlining how to convert such a script. Do you have a hint / examples where I could find more info? Thanks :)
One approach: use the source/sink read/write APIs from AWS Glue and keep the DataFrame transformations as PySpark code. This enables "easy" integration with AWS services (e.g. S3, Glue Catalog) and makes unit testing the DataFrame transformations simple (since this is well-known ground in PySpark).
Example:
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame, DynamicFrameWriter
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
# init Glue context (and Spark context)
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
# init Glue job
args = getResolvedOptions(sys.argv, ["JOB_NAME", "PARAM_1"])
job = Job(glue_context)
job.init(args["JOB_NAME"], args)
# read from source (use Glue APIs)
dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={},
    format="json",
    format_options={},
)
# convert DynamicFrame to DataFrame
df = dynamic_frame.toDF()
# do DataFrame transformations (use PySpark API)
# convert DataFrame back to DynamicFrame
dynamic_frame = DynamicFrame.fromDF(
    dataframe=df,
    glue_ctx=glue_context,
    name="dynamic_frame",  # fromDF also expects a name for the resulting frame
)
# write to sink (use Glue APIs)
DynamicFrameWriter(glue_context).from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={},
    format="json",
    format_options={},
)
# commit job
job.commit()
There are different ways to organize this example code into classes and functions, etc. Do whatever is appropriate for your existing script.
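For example, the transformation step can be pulled out into a pure function so it can be unit tested against a plain local SparkSession, with no Glue dependency at all. A minimal sketch (the column names and the transform itself are hypothetical):
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

def transform(df: DataFrame) -> DataFrame:
    # hypothetical transformation, kept free of any Glue-specific code
    return df.withColumn("amount", F.col("amount").cast("double"))

def test_transform():
    # plain local SparkSession; no Glue runtime needed for the unit test
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([("a", "1.5")], ["id", "amount"])
    assert transform(df).schema["amount"].dataType.typeName() == "double"
In the Glue job itself you would then call transform(df) between the read and the write shown above.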
References:
GlueContext class (AWS)
DynamicFrame class (AWS)
DynamicFrameWriter class (AWS)
A PySpark script should run as-is on AWS Glue, since Glue is basically Spark with some custom AWS libraries added. To start, I would just paste it into Glue and try to run it.
If you need some Glue functionality such as dynamic frames or bookmarks, then you will need to modify the scripts to get a GlueContext and work with that. The basic initialization is:
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()
glueContext = GlueContext(spark_session.sparkContext)
From here onwards, you can use glueContext for Glue features or spark_session for plain Spark functionality.
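For example (a sketch; the S3 path and the catalog database/table names below are placeholders):
# plain Spark read
df = spark_session.read.json("s3://my-bucket/input/")
# Glue-specific read from the Glue Data Catalog (this is what enables bookmarks)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    transformation_ctx="read_my_table",
)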
I would however avoid using Glue-specific stuff just for the sake of it, because:
it will reduce portability
there is much better community support for Spark than for Glue
I develop Spark code using the Scala APIs in IntelliJ, and when I run it I get the error below, although it runs fine in a Databricks notebook.
I am using Databricks Connect to connect from a local installation of IntelliJ to the Databricks Spark cluster. I am connected to the cluster and was able to submit a job from IntelliJ to the cluster too. As a matter of fact, everything else works except the piece below.
DBConnect is 6.1, Databricks Runtime is 6.2.
I imported the jar file from the cluster (using databricks-connect get-jar-dir) and set up the SBT project with the jar in the project library.
source code:
val sparkSession = SparkSession.builder.getOrCreate()
val sparkContext = sparkSession.sparkContext
import sparkSession.implicits._
val v_textFile_read = sparkContext.textFile(v_filename_path)
v_textFile_read.take(2).foreach(println)
Error:
cannot assign instance of scala.Some to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of org.apache.spark.rdd.HadoopRDD
The reason I use an RDD reader for text is so I can pass the output to the createDataFrame API. As you know, the createDataFrame API takes an RDD and a schema as input parameters.
step-1: val v_RDD_textFile_read = sparkContext.textFile(v_filename_path).map(x => MMRSplitRowIntoStrings(x))
step-2: val v_DF_textFile_read = sparkSession.sqlContext.createDataFrame(v_RDD_textFile_read, v_schema)
I'm new to Apache Spark and I'm trying to load some Elasticsearch data from a Scala script I'm running on it.
Here is my script:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
val options = Map("es.nodes" -> "x.x.x.x:9200", "pushdown" -> "true")
import sparkSession.implicits._
val df = sparkSession.read.format("org.elasticsearch.spark.sql").options(options).load("my_index-07.05.2018/_doc").limit(5).select("SomeField", "AnotherField", "AnotherOne")
df.cache()
df.show()
And it works, but it's terribly slow. Am I doing anything wrong here?
Connectivity shouldn't be an issue at all; the index I'm trying to query has around 200k documents, but I'm limiting the query to 5 results.
Btw, I had to run spark-shell (or spark-submit) by passing the elasticsearch-hadoop dependency as a parameter on the command line (--packages org.elasticsearch:elasticsearch-hadoop:6.3.0). Is that the right way to do it? Is there any way to just build an sbt package that includes all the dependencies?
Thanks a lot
Are you running this locally on a single machine? If so, it could be normal... You will have to check your network, your Spark web UI, etc.
As for submitting all the dependencies without specifying them on the spark-submit command line: what we usually do is create a fat JAR using sbt assembly.
http://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin
I run Spark code on my local machine and on a cluster.
I create the SparkContext object for the local machine with the following code:
val sc = new SparkContext("local[*]", "Trial")
I create the SparkContext object for the cluster with the following code:
val spark = SparkSession.builder.appName(args(0)+" "+args(1)).getOrCreate()
val sc = spark.sparkContext
and I set the number of partitions to 4 for both the local machine and the cluster with the following code:
val dataset = sc.textFile("Dataset.txt", 4)
In my cluster, I created 5 workers. One of them is the driver node; the rest run as workers.
I expect the results to be the same. However, the local and cluster results are different. What are the reasons for this?
I create the SparkContext object for the local machine with the following code
and
I create the SparkContext object for the cluster with the following code:
It appears that you may have defined two different environments for sc and spark, as you define local[*] explicitly for sc while taking some default value for spark (which may read external configuration files or take the so-called master URL from spark-submit).
These may be different, which may affect what you get.
I expect the results to be the same. However, the local and cluster results are different. What are the reasons for this?
The Dataset.txt you process in the local and cluster environments is different, hence the difference in the results. I'd strongly recommend using HDFS or some other shared file system to avoid such "surprises" in the future.
I have an app that reads data from MongoDB.
If I use local mode it runs well; however, it throws java.lang.IllegalStateException when I use standalone cluster mode.
In local mode, the SparkContext is val sc = new SparkContext("local", "Scala Word Count")
In standalone cluster mode, the SparkContext is val sc = new SparkContext() and the submit command is ./spark-submit --class "xxxMain" /usr/local/jarfile/xxx.jar --master spark://master:7077
It tries 4 times and then throws the error when it reaches the first action.
My code:
configOriginal.set("mongo.input.uri","mongodb://172.16.xxx.xxx:20000/xxx.Original")
configOriginal.set("mongo.output.uri","mongodb://172.16.xxx.xxx:20000/xxx.sfeature")
mongoRDDOriginal =sc.newAPIHadoopRDD(configOriginal,classOf[com.mongodb.hadoop.MongoInputFormat],classOf[Object], classOf[BSONObject])
I learned from this example
mongo-spark
I searched and someone said it was because of mongo-hadoop-core-1.3.2, but whether I upgraded to mongo-hadoop-core-1.4.0 or downgraded to mongo-hadoop-core-1.3.1, it didn't work.
Please help me!
Finally, I found the solution.
Each of my workers has many cores, and mongo-hadoop-core-1.3.2 doesn't support multiple threads; this was fixed in mongo-hadoop-core-1.4.0. The reason my app still got the error was the IntelliJ IDEA cache. You should also add the mongo-java-driver dependency.