Pyspark command on jupyter: Connecting spark on remote server - pyspark

I have configured Spark 2.1 on my remote linux server (IBM RHEL Z systems). I am trying to create a SparkContext and getting the below error
from pyspark.context import SparkContext, SparkConf
master_url="spark://<IP>:7077"
conf = SparkConf()
conf.setMaster(master_url)
conf.setAppName("App1")
sc = SparkContext.getOrCreate(conf)
I am getting the below error. when i run the same code on the remote server in pyspark shell it works without error.
The currently active SparkContext was created at:
(No active SparkContext.)
at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:100)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1768)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2411)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:563)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

It sounds like you haven't set jupyter to be the pyspark driver. Before controlling pyspark from jupyter you must first set PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS='notebook'. If I am not mistaken if you look at the code in libexec/bin/pyspark (on OSX) you will find instructions for setting up the jupyter notebook.

Related

How run glue job locally?

I have setup project as described here. But code:
import com.amazonaws.services.glue.{AWSGlueClientBuilder, GlueContext}
import org.apache.spark.SparkContext
import org.slf4j.LoggerFactory
object MyGlueJob {
private val logger = LoggerFactory.getLogger(getClass)
def main(sysArgs: Array[String]) {
val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
val awsGlueClient = AWSGlueClientBuilder.defaultClient
}
}
fails with error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/21 15:40:32 INFO SparkContext: Running Spark version 2.4.3
19/11/21 15:40:33 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$postApplicationEnd(SparkContext.scala:2416)
at org.apache.spark.SparkContext$$anonfun$stop$1.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
at MyGlueJob$.main(MyGlueJob.scala:13)
at MyGlueJob.main(MyGlueJob.scala)
... 5 more
It is obvious that master url should be set but how to this from commandline or system variables? (E.g. without touching the code)
Also I have [read] that --master argument can fix problem, but adding it to args do nothing (here is Intellij Idea run configuration):
The key question is to run glue job locally and be able to run it in aws without code touching, is it possible?
You can created a spark session explicitly and set any parameters you want. But I cannot say that this will work eventually in Glue. The following is a local session that I use to test Spark jobs locally even though I do run them eventually in Glue. I test only pure spark code.
lazy val spark: SparkSession = {
UserGroupInformation.setLoginUser(UserGroupInformation.createRemoteUser("hduser"))
SparkSession
.builder()
.master("local")
.appName("spark unit test")
.getOrCreate()
}
The key question is to run glue job locally and be able to run it in aws without code touching, is it possible?
It's possible to run any code with a dev endpoint and Zeppelin. See aws docs.

Connecting SQLserver jdbc driver to Dataproc cluster

I am working on PySpark application on analyzing Aviation Data. The Database is a MS SQLServer DB. While connecting to the database on the server. I get an error of "No suitable driver". However when I run on local machine with CLI and add JDBC driver jar file to driver-class-path, it runs and connects with DB. But when I try to run on Dataproc cluster, it throws an error of "No suitable driver".
The code snippet is as follows:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import *
spark = SparkSession.builder
.appName('Test')
.getOrCreate()
df = spark.read.format("jdbc").options(
url="jdbc:sqlserver:XYXYXY",
database="data1",
user="YYYY", password="XXXX",
dbtable="db")
.load()
The Error was:
Py4JJavaError: An error occurred while calling o209.load.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there other way to add JDBC jar files to the Dataproc cluster?
Here is a very similar question and answer to it that shows how to add JDBC driver to Spark Driver classpath using gcloud command:
$ gcloud dataproc jobs submit spark ... \
--jars=gs://<BUCKET>/<DIRECTORIES>/<JAR_NAME> \
--properties=spark.driver.extraClassPath=<JAR_NAME>

Connecting to AWS Redshift with Zeppelin Spark 2.0 and Pyspark

I need to read Redshift data into dataframes in Zeppelin. For the last several months I've been using Spark 2.0 via Zeppelin on AWS to open csv and json S3 files successfully.
I used to be able to connect to Redshift from Zeppelin on AWS EMR with Spark 1.6.2 (maybe 1.6.1), using this code:
%pyspark
from pyspark.sql import SQLContext, Row
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
#Load the data
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = sqlContext.read.format('jdbc').options(url='jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password', dbtable=aquery).load()
dfMinDates.show()
and it worked. That was summer of 2016.
I haven't had need of it since then and now AWS has Spark 2.0.
The new syntax is
myDF = spark.read.jdbc like this:
%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = spark.read.jdbc("jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password", dbtable=aquery).load()
dfMinDates.show()
but I get this error:
Py4JJavaError: An error occurred while calling o119.jdbc. :
java.sql.SQLException: No suitable driver at
java.sql.DriverManager.getDriver(DriverManager.java:315) at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
at scala.Option.getOrElse(Option.scala:121) at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:117)
at
org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
at
org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:280) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:211) at
java.lang.Thread.run(Thread.java:745) (, Py4JJavaError(u'An error occurred
while calling o119.jdbc.\n', JavaObject id=o121), )
I researched the Spark 2.0 documentation, and found this:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
I don't know how to implement this and did more reading from various posts, some blogs and some posts in stackoverflow and found this:
spark.driver.extraClassPath = org.postgresql.Driver
I did this in the Interpreter settings page of Zeppelin, but still I get the same error.
I tried to add a Postgres Interpreter, and I'm not sure I did it right (because I wasn't sure whether to put it in the Spark interpreter or Python interpreter), and I chose the Spark interpreter. Now the Postgres interpreter also has all the same settings as the Spark interpreter, which might not matter, but still I get the same error.
In Spark 1.6, I just don't remember going through all this trouble.
As an experiment, I spun up an EMR cluster with Spark 1.6.2 and tried the old code that used to work, and got the same error as above!
The Zeppelin site has Postgres covered but their information looks like code rather than how to set up the interpreters, so I don't know how to use it.
I'm out of ideas and references.
Any suggestions are much appreciated!
You need to use Amazon's Redshift specific driver. You can download it from here: http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html.
However, if you're using EMR it's already in place (at /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar) and you can just tell Zeppelin where it is.
Here's how to declare it: AWS Redshift driver in Zeppelin

ERROR SparkContext: Error initializing SparkContext

I am using spark-1.5.0-cdh5.6.0. tried the sample application (scala)
command is:
> spark-submit --class com.cloudera.spark.simbox.sparksimbox.WordCount --master local /home/hadoop/work/testspark.jar
Got the following error:
ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/user/spark/applicationHistory does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:424)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
at com.cloudera.spark.simbox.sparksimbox.WordCount$.main(WordCount.scala:12)
at com.cloudera.spark.simbox.sparksimbox.WordCount.main(WordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark has a feature called "history server" which allows you to browse historical events after the SparkContext dies. This property is set via setting spark.eventLog.enabled to true.
You have two options, either specify a valid directory to store the event log via the spark.eventLog.dir config value, or simply set spark.eventLog.enabled to false if you don't need it.
You can read more on that in the Spark Configuration page.
I got the same error which working with nltk in spark, To fix this I just removed all the nltk related properties from spark-conf.default.

Connecting from Spark/pyspark to PostgreSQL

I've installed Spark on a Windows machine and want to use it via Spyder. After some troubleshooting the basics seems to work:
import os
os.environ["SPARK_HOME"] = "D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
textFile = sc.textFile("D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6\\README.md")
textFile.count()
textFile.filter(lambda line: "Spark" in line).count()
sc.stop()
This runs as expected. I now want to connect to a Postgres9.3 database running on the same server. I have downloaded the JDBC driver from here here and have put it in the folder D:\Analytics\Spark\spark_jars. I've then created a new file D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\conf\spark-defaults.conf containing this line:
spark.driver.extraClassPath 'D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar'
I've ran the following code to test the connection
import os
os.environ["SPARK_HOME"] = "D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df = (sqlContext
.load(source="jdbc",
url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
dbtable="pubs")
)
sc.stop()
But am getting the following error:
Py4JJavaError: An error occurred while calling o22.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://uklonana01/stonegate?user=analytics&password=pMOe8jyd
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
How can I check whether I've downloaded the right .jar file or where else the error might come from?
I have tried SPARK_CLASSPATH environment variable but it doesn't work with Spark 1.6.
Other answers from posts like below suggested adding pyspark command arguments and it works.
Not able to connect to postgres using jdbc in pyspark shell
Apache Spark : JDBC connection not working
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
Remove spark-defaults.conf and add the SPARK_CLASSPATH to the system environment in python like this:
os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'
Another way to connect pyspark with your postrgresql db.
1) Install spark with pip: pip install pyspark
2) Download last version of jdbc postgresql connector in:
https://jdbc.postgresql.org/download.html
3) Complete this code with your db credentials:
from __future__ import print_function
from pyspark.sql import SparkSession
def jdbc_dataset_example(spark):
df = spark.read \
.jdbc("jdbc:postgresql://[your_db_host]:[your_db_port]/[your_db_name]",
"com_dim_city",
properties={"user": "[your_user]", "password": "[your_password]"})
df.createOrReplaceTempView("[your_table]")
sqlDF = spark.sql("SELECT * FROM [your_table] LIMIT 10")
sqlDF.show()
if __name__ == "__main__":
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.getOrCreate()
jdbc_dataset_example(spark)
spark.stop()
Finally run your aplication with:
spark-submit --driver-class-path /path/to/your_jdbc_jar/postgresql-42.2.6.jar --jars postgresql-42.2.6.jar /path/to/your_jdbc_jar/test_pyspark_to_postgresql.py