How run glue job locally? - scala

I have setup project as described here. But code:
import com.amazonaws.services.glue.{AWSGlueClientBuilder, GlueContext}
import org.apache.spark.SparkContext
import org.slf4j.LoggerFactory
object MyGlueJob {
private val logger = LoggerFactory.getLogger(getClass)
def main(sysArgs: Array[String]) {
val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
val awsGlueClient = AWSGlueClientBuilder.defaultClient
}
}
fails with error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/21 15:40:32 INFO SparkContext: Running Spark version 2.4.3
19/11/21 15:40:33 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$postApplicationEnd(SparkContext.scala:2416)
at org.apache.spark.SparkContext$$anonfun$stop$1.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
... 5 more
It is obvious that master url should be set but how to this from commandline or system variables? (E.g. without touching the code)
Also I have [read] that --master argument can fix problem, but adding it to args do nothing (here is Intellij Idea run configuration):
The key question is to run glue job locally and be able to run it in aws without code touching, is it possible?

You can created a spark session explicitly and set any parameters you want. But I cannot say that this will work eventually in Glue. The following is a local session that I use to test Spark jobs locally even though I do run them eventually in Glue. I test only pure spark code.
lazy val spark: SparkSession = {
UserGroupInformation.setLoginUser(UserGroupInformation.createRemoteUser("hduser"))
SparkSession
.builder()
.master("local")
.appName("spark unit test")
.getOrCreate()
}
The key question is to run glue job locally and be able to run it in aws without code touching, is it possible?
It's possible to run any code with a dev endpoint and Zeppelin. See aws docs.

Related

java.lang.NoSuchMethodError: shapeless.DefaultSymbolicLabelling$.instance(Lshapeless/HList;)Lshapeless/DefaultSymbolicLabelling;

I am trying to run the spark job using the spark-submit command and getting this error-
Exception in thread "main" java.lang.NoSuchMethodError:shapeless.DefaultSymbolicLabelling$.instance(Lshapeless/HList;)Lshapeless/DefaultSymbolicLabelling;
at io.home.windowGateway.ConfigLoader$anon$exportedReader$macro$268$1.inst$macro$1$lzycompute(ConfigLoader.scala:24)
at io.home.windowGateway.ConfigLoader$anon$exportedReader$macro$268$1.inst$macro$1(ConfigLoader.scala:24)
at io.home.windowGateway.ConfigLoader$.<init>(ConfigLoader.scala:24)
at io.home.windowGateway.ConfigLoader$.<clinit>(ConfigLoader.scala)
at io.home.windowGateway$.main(test.scala:23)
at io.home.windowGateway.test.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info
The control is not even going into the spark. My spark session looks like this-
val sparkSession = SparkSession
.builder()
.appName("test")
.master("spark://dheerajHome:7077")
.getOrCreate()
The URL in the master is the spark master URL. And the server is up and running.
And my spark-submit command is
spark-submit --name test --class io.home.windowGateway ~/Desktop/jar/spark-data-export-0.1.0.jar
Any help would be appreciated. Thanks in advance.
I ran into this issue when running spark job in google dataproc
Adding this to the sbt file worked for me:
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("shapeless.**" -> "repackaged.com.google.common.#1").inAll
)
https://cloud.google.com/blog/products/data-analytics/managing-java-dependencies-apache-spark-applications-cloud-dataproc

I couldn't do spark-submit using mobaxterm on Apache Ambari using intellij, are there any suggestions?

I have written the code in intellij and created a jar file. Then i have used mobaxterm to upload the jar file onto apache ambari files under /user/maria_dev. Finally i have written spark submit command on mobaxterm as "spark-submit --class csv_parquet /user/maria_dev/first_2.12-0.1.jar yarn /user/maria_dev/fileupload/business-operations-survey-2020-covid-19-csv.csv /user/maria_dev/output" and im getting the following error
SPARK_MAJOR_VERSION is set to 2, using Spark2
Warning: Local jar /user/maria_dev/first_2.12-0.1.jar does not exist, skipping.
java.lang.ClassNotFoundException: csv_parquet
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:239)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:861)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/06/28 19:51:42 INFO ShutdownHookManager: Shutdown hook called
21/06/28 19:51:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-02f3e56f-af15-46ab-
b43a-54d17d88f5cc
and my intellij code is:
import org.apache.spark.sql.SparkSession
object csv_parquet {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("csv_parquet").master("yarn").getOrCreate()
val df = spark.read.format("csv")
.option("delimiter",",")
.option("header", "true")
.option("path", "C:\\Users\\kiran\\business-operations-survey-2020-covid-19-
csv.csv").load()
df.show(20)
df.write.mode("overwrite")
.parquet("\\user\\maria_dev\\output")
val par= spark.read.format("parquet").load("\\user\\maria_dev\\output")
par.show()
spark.close()
}
}
I'm getting the output and was able to spark-submit locally but unable to do it on apache ambari.

Connecting SQLserver jdbc driver to Dataproc cluster

I am working on PySpark application on analyzing Aviation Data. The Database is a MS SQLServer DB. While connecting to the database on the server. I get an error of "No suitable driver". However when I run on local machine with CLI and add JDBC driver jar file to driver-class-path, it runs and connects with DB. But when I try to run on Dataproc cluster, it throws an error of "No suitable driver".
The code snippet is as follows:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import *
spark = SparkSession.builder
.appName('Test')
.getOrCreate()
df = spark.read.format("jdbc").options(
url="jdbc:sqlserver:XYXYXY",
database="data1",
user="YYYY", password="XXXX",
dbtable="db")
.load()
The Error was:
Py4JJavaError: An error occurred while calling o209.load.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there other way to add JDBC jar files to the Dataproc cluster?
Here is a very similar question and answer to it that shows how to add JDBC driver to Spark Driver classpath using gcloud command:
$ gcloud dataproc jobs submit spark ... \
--jars=gs://<BUCKET>/<DIRECTORIES>/<JAR_NAME> \
--properties=spark.driver.extraClassPath=<JAR_NAME>

Error initializing SparkContext: A master URL must be set in your configuration

I used this code
My error is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/03 20:39:24 INFO SparkContext: Running Spark version 2.1.0
17/02/03 20:39:25 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/02/03 20:39:25 WARN SparkConf: Detected deprecated memory fraction
settings: [spark.storage.memoryFraction]. As of Spark 1.6, execution and
storage memory management are unified. All memory fractions used in the old
model are now deprecated and no longer read. If you wish to use the old
memory management, you may explicitly enable `spark.memory.useLegacyMode`
(not recommended).
17/02/03 20:39:25 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your
configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at PCA$.main(PCA.scala:26)
at PCA.main(PCA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
17/02/03 20:39:25 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at PCA$.main(PCA.scala:26)
at PCA.main(PCA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Process finished with exit code 1
If you are running spark stand alone then
val conf = new SparkConf().setMaster("spark://master") //missing
and you can pass parameter while submit job
spark-submit --master spark://master
If you are running spark local then
val conf = new SparkConf().setMaster("local[2]") //missing
you can pass parameter while submit job
spark-submit --master local
if you are running spark on yarn then
spark-submit --master yarn
Error message is pretty clear, you have to provide the address of the Spark Master node, either via the SparkContext or via spark-submit:
val conf =
new SparkConf()
.setAppName("ClusterScore")
.setMaster("spark://172.1.1.1:7077") // <--- This is what's missing
.set("spark.storage.memoryFraction", "1")
val sc = new SparkContext(conf)
SparkConf configuration = new SparkConf()
.setAppName("Your Application Name")
.setMaster("local");
val sc = new SparkContext(conf);
It will work...
Most probably you are using Spark 2.x API in Java.
Use code snippet like this to avoid this error. This is true when you are running Spark standalone on your computer using Shade plug-in which will import all the runtime libraries on your computer.
SparkSession spark = SparkSession.builder()
.appName("Spark-Demo")//assign a name to the spark application
.master("local[*]") //utilize all the available cores on local
.getOrCreate();

ERROR SparkContext: Error initializing SparkContext

I am using spark-1.5.0-cdh5.6.0. tried the sample application (scala)
command is:
> spark-submit --class com.cloudera.spark.simbox.sparksimbox.WordCount --master local /home/hadoop/work/testspark.jar
Got the following error:
ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/user/spark/applicationHistory does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:424)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
at com.cloudera.spark.simbox.sparksimbox.WordCount$.main(WordCount.scala:12)
at com.cloudera.spark.simbox.sparksimbox.WordCount.main(WordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark has a feature called "history server" which allows you to browse historical events after the SparkContext dies. This property is set via setting spark.eventLog.enabled to true.
You have two options, either specify a valid directory to store the event log via the spark.eventLog.dir config value, or simply set spark.eventLog.enabled to false if you don't need it.
You can read more on that in the Spark Configuration page.
I got the same error which working with nltk in spark, To fix this I just removed all the nltk related properties from spark-conf.default.