I am new to machine learning and to Python. I am working on a task using logistic regression. While trying to find out how to get the evaluation metrics using PySpark in a Jupyter notebook, I get the following error.
from pyspark.mllib.util import MLUtils
from pyspark.mllib.evaluation import MulticlassMetrics
# Compute raw scores on the test set
predictionAndLabels = test.rdd.map(lambda lp: (float(lr_model.predict(lp.features)), lp.label))
metrics = MulticlassMetrics(predictionAndLabels)
The error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o123.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
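This PicklingError typically comes from calling a JVM-backed pyspark.ml model (lr_model) inside an RDD map: the lambda tries to pickle the Java model object, which cannot be serialized from Python. A minimal sketch of one workaround, assuming lr_model is a pyspark.ml LogisticRegressionModel and test is a DataFrame with "features" and "label" columns, is to let the model score the whole DataFrame first and only then build the (prediction, label) pairs:
from pyspark.mllib.evaluation import MulticlassMetrics

# Score the test set with the DataFrame API instead of calling the model per row
predictions = lr_model.transform(test)

# Build (prediction, label) pairs from plain Python floats
predictionAndLabels = predictions.select("prediction", "label") \
    .rdd.map(lambda row: (float(row.prediction), float(row.label)))

metrics = MulticlassMetrics(predictionAndLabels)
print(metrics.accuracy)
Alternatively, pyspark.ml.evaluation.MulticlassClassificationEvaluator can be pointed at the same predictions DataFrame directly, without going through RDDs at all.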
I have set up the project as described here, but the following code:
import com.amazonaws.services.glue.{AWSGlueClientBuilder, GlueContext}
import org.apache.spark.SparkContext
import org.slf4j.LoggerFactory

object MyGlueJob {
  private val logger = LoggerFactory.getLogger(getClass)

  def main(sysArgs: Array[String]) {
    val spark: SparkContext = SparkContext.getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark)
    val awsGlueClient = AWSGlueClientBuilder.defaultClient
  }
}
fails with error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/21 15:40:32 INFO SparkContext: Running Spark version 2.4.3
19/11/21 15:40:33 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$postApplicationEnd(SparkContext.scala:2416)
at org.apache.spark.SparkContext$$anonfun$stop$1.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
... 5 more
It is obvious that the master URL should be set, but how do I do this from the command line or via system variables (i.e. without touching the code)?
I have also read that the --master argument can fix the problem, but adding it to the program arguments in the IntelliJ IDEA run configuration does nothing.
The key question: is it possible to run the Glue job locally and still run it in AWS without touching the code?
You can create a Spark session explicitly and set any parameters you want, though I cannot say whether this will ultimately work in Glue. The following is the local session I use to test Spark jobs locally, even though I eventually run them in Glue; I test only pure Spark code this way.
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

lazy val spark: SparkSession = {
  // Log in as a fixed remote user to avoid Hadoop security lookups on a dev machine
  UserGroupInformation.setLoginUser(UserGroupInformation.createRemoteUser("hduser"))
  SparkSession
    .builder()
    .master("local")
    .appName("spark unit test")
    .getOrCreate()
}
As for the key question (running the Glue job locally and being able to run it in AWS without touching the code): it is possible to run arbitrary code with a dev endpoint and Zeppelin. See the AWS docs.
A newbie in Apache Spark here! I am using Spark 2.4.0 and Scala version 2.11.12, and I'm trying to run the following code in my Spark shell:
import org.apache.spark.sql.SparkSession
import spark.implicits._
var df = spark.read.json("storesales.json")
df.createOrReplaceTempView("storesales")
spark.sql("SELECT * FROM storesales")
And I get the following error:
2018-12-18 07:05:03 WARN Hive:168 - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
I also saw this question: Issues trying out example in Spark-shell, and as per the accepted answer I tried to start my Spark shell like so:
~/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --conf spark.sql.warehouse.dir=file:///tmp/spark-warehouse
However, it did not help and the issue persists.
I am working on a PySpark application for analyzing aviation data. The database is an MS SQL Server DB. While connecting to the database on the server, I get a "No suitable driver" error. However, when I run it on my local machine from the CLI and add the JDBC driver jar to driver-class-path, it runs and connects to the DB. But when I try to run it on a Dataproc cluster, it throws the "No suitable driver" error.
The code snippet is as follows:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .appName('Test') \
    .getOrCreate()

df = spark.read.format("jdbc").options(
    url="jdbc:sqlserver:XYXYXY",
    database="data1",
    user="YYYY", password="XXXX",
    dbtable="db").load()
The error was:
Py4JJavaError: An error occurred while calling o209.load.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there another way to add JDBC jar files to the Dataproc cluster?
Here is a very similar question, and an answer to it, that shows how to add a JDBC driver to the Spark driver classpath using the gcloud command:
$ gcloud dataproc jobs submit spark ... \
--jars=gs://<BUCKET>/<DIRECTORIES>/<JAR_NAME> \
--properties=spark.driver.extraClassPath=<JAR_NAME>
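Once the jar is available to the job, it can also help to name the driver class explicitly in the read options, so that DriverManager does not have to discover it on its own. A minimal sketch, assuming Microsoft's SQL Server JDBC driver and placeholder connection details in place of the question's elided URL:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test").getOrCreate()

df = (spark.read.format("jdbc")
      # Placeholder JDBC URL; use the real host and database name
      .option("url", "jdbc:sqlserver://<HOST>:1433;databaseName=data1")
      .option("dbtable", "db")
      .option("user", "YYYY")
      .option("password", "XXXX")
      # Explicit driver class from the jar supplied via --jars above
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())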
I have configured Spark 2.1 on my remote Linux server (IBM RHEL on Z Systems). I am trying to create a SparkContext and am getting the error below.
from pyspark.context import SparkContext, SparkConf
master_url="spark://<IP>:7077"
conf = SparkConf()
conf.setMaster(master_url)
conf.setAppName("App1")
sc = SparkContext.getOrCreate(conf)
I am getting the error below. When I run the same code on the remote server in the PySpark shell, it works without error.
The currently active SparkContext was created at:
(No active SparkContext.)
at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:100)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1768)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2411)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:563)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
It sounds like you haven't set Jupyter to be the PySpark driver. Before controlling PySpark from Jupyter, you must first set PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS='notebook'. If I am not mistaken, if you look at the code in libexec/bin/pyspark (on OS X) you will find instructions for setting up the Jupyter notebook.
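An alternative that sidesteps the launcher environment variables entirely is to build the session directly inside the notebook or script, assuming pyspark is importable from that Python environment (for example via pip or findspark). This is only a sketch, reusing the question's placeholder master URL:
from pyspark.sql import SparkSession

# <IP> is the question's placeholder for the standalone master's address
spark = (SparkSession.builder
         .master("spark://<IP>:7077")
         .appName("App1")
         .getOrCreate())
sc = spark.sparkContext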
I need to read Redshift data into dataframes in Zeppelin. For the last several months I've been using Spark 2.0 via Zeppelin on AWS to open csv and json S3 files successfully.
I used to be able to connect to Redshift from Zeppelin on AWS EMR with Spark 1.6.2 (maybe 1.6.1), using this code:
%pyspark
from pyspark.sql import SQLContext, Row
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
#Load the data
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = sqlContext.read.format('jdbc').options(url='jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password', dbtable=aquery).load()
dfMinDates.show()
and it worked. That was the summer of 2016.
I haven't needed it since then, and now AWS has Spark 2.0.
The new syntax is myDF = spark.read.jdbc(...), so I tried this:
%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = spark.read.jdbc("jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password", dbtable=aquery).load()
dfMinDates.show()
but I get this error:
Py4JJavaError: An error occurred while calling o119.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
I researched the Spark 2.0 documentation, and found this:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
I don't know how to implement this, so I did more reading in various blog posts and Stack Overflow posts and found this:
spark.driver.extraClassPath = org.postgresql.Driver
I did this in the Interpreter settings page of Zeppelin, but still I get the same error.
I tried to add a Postgres Interpreter, and I'm not sure I did it right (because I wasn't sure whether to put it in the Spark interpreter or Python interpreter), and I chose the Spark interpreter. Now the Postgres interpreter also has all the same settings as the Spark interpreter, which might not matter, but still I get the same error.
In Spark 1.6, I just don't remember going through all this trouble.
As an experiment, I spun up an EMR cluster with Spark 1.6.2 and tried the old code that used to work, and got the same error as above!
The Zeppelin site has Postgres covered but their information looks like code rather than how to set up the interpreters, so I don't know how to use it.
I'm out of ideas and references.
Any suggestions are much appreciated!
You need to use Amazon's Redshift-specific driver. You can download it from here: http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html.
However, if you're using EMR it's already in place (at /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar) and you can just tell Zeppelin where it is.
Here's how to declare it: AWS Redshift driver in Zeppelin
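For reference, a minimal sketch of what the Spark 2.0 read might look like once that Redshift jar is on the interpreter's classpath. The com.amazon.redshift.jdbc41.Driver class name corresponds to the RedshiftJDBC41 jar mentioned above; the endpoint, query, and credentials below are the question's own placeholders:
%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time FROM schema.table WHERE serial_number IN ('abcdefg','1234567') GROUP BY serial_number) AS minDates"

dfMinDates = (spark.read.format("jdbc")
              # Placeholder Redshift endpoint from the question
              .option("url", "jdbc:redshift://dadadadaaaredshift.amazonaws.com:5439/idw?ssl=true")
              .option("dbtable", aquery)
              .option("user", "user")
              .option("password", "password")
              # Explicit driver class so DriverManager can find it
              .option("driver", "com.amazon.redshift.jdbc41.Driver")
              .load())
dfMinDates.show()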