Pyspark running error - pyspark

The Spark instance I connect to runs on a remote machine, not on my local computer. Every time I connect to it at http://xx.xxx.xxx.xxx:10000/, I get these messages:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /usr/local/spark/python/pyspark/shell.py:
18/03/07 08:52:53 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Anyway, I still tried to run the following in a Jupyter notebook:
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
dir(spark)
When I ran it yesterday, it listed the object's attributes. When I ran it today, it said:
NameError: name 'spark' is not defined
Any suggestion is appreciated!

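You're missing the spark variable. The session has to be assigned to a name before you can use it; you also need to import SparkSession and call getOrCreate(), otherwise you only get a builder object rather than a session.
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf=SparkConf()).getOrCreate()
dir(spark)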
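Since the PYTHONSTARTUP warning suggests that shell.py may not have created `spark` for you, a notebook cell can guard against that case. This is only a sketch, assuming pyspark is importable in the notebook kernel; the helper name `ensure_spark` and the app name are made up for illustration.

```python
def ensure_spark(namespace, app_name="notebook-session"):
    """Return namespace['spark'], creating a SparkSession only if it is missing.

    `namespace` is typically globals() in a notebook cell.
    `ensure_spark` and `app_name` are hypothetical names for this sketch.
    """
    if "spark" not in namespace:
        # Imported lazily so the helper can be defined even where
        # pyspark is not installed; the import only runs when needed.
        from pyspark.sql import SparkSession

        # getOrCreate() reuses an already running session if there is one.
        namespace["spark"] = (
            SparkSession.builder.appName(app_name).getOrCreate()
        )
    return namespace["spark"]
```

In a notebook you would call `spark = ensure_spark(globals())` once at the top, so later cells do not depend on whether the startup script succeeded that day.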

Related

Pyspark error: Cannot load class when registering a function, make sure it is on the classpath

I am trying to run the code below in a Python notebook on Anaconda, but I am getting an error:
spark = SparkSession.builder.enableHiveSupport().appName('test').getOrCreate()
spark.sql("SET spark.hadoop.hive.mapred.supports.subdirectories=true")
spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
spark.sql("create temporary function ptyUnprotectStr as 'com.protegrity.hive.udf.ptyUnprotectStr'")
Running the code above gives:
AnalysisException: "Can not load class 'com.protegrity.hive.udf.ptyUnprotectStr' when regisitering the function 'ptyUnprotectStr' please make sure it is on the classpath"
How can I resolve it?

Zeppelin fails after loading DeltaTable with "Could not find active SparkSession"

I'm running into an issue when trying to do repeat queries with Delta Lake tables in Zeppelin. This code snippet runs without any problems the first time through:
import io.delta.tables._
val deltaTable = DeltaTable.forPath("s3://bucket/path")
deltaTable.toDF.show()
But when I try to run it a second time, it fails with this error:
java.lang.IllegalArgumentException: Could not find active SparkSession
at io.delta.tables.DeltaTable$$anonfun$1.apply(DeltaTable.scala:620)
at io.delta.tables.DeltaTable$$anonfun$1.apply(DeltaTable.scala:620)
at scala.Option.getOrElse(Option.scala:121)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:619)
... 51 elided
I can restart the Spark interpreter and run the query again, but this is a huge impediment to development. Does anyone know why this is happening and whether there is a workaround that doesn't involve restarting the interpreter every time I want to run a new query?

Spark ElasticSearch Configuration - Reading Elastic Search from Spark

I am trying to read data from Elasticsearch via Spark (Scala). I have seen a lot of posts addressing this question and have tried all the options they mention, but nothing seems to work for me.
JAR Used - elasticsearch-hadoop-5.6.8.jar (Used elasticsearch-spark-5.6.8.jar too without any success)
Elastic Search Version - 5.6.8
Spark - 2.3.0
Scala - 2.11
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").getOrCreate()
val reader = spark.read.format("org.elasticsearch.spark.sql").option("es.index.auto.create", "true").option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").option("es.port", "9200").option("es.nodes", "xxxxxxxxx").option("es.nodes.wan.only", "true").option("es.net.http.auth.user","xxxxxx").option("es.net.http.auth.pass", "xxxxxxxx")
val read = reader.load("index/type")
Error:
ERROR rest.NetworkClient: Node [xxxxxxxxx:9200] failed (The server xxxxxxxxxxxxx failed to respond); no other nodes left - aborting...
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMappingAndGeoFields(SchemaUtils.scala:98)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at scala.Option.getOrElse(Option.scala:121)
at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:133)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 53 elided
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[xxxxxxxxxxx:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:429)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:155)
at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:655)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:287)
... 65 more
Apart from this, I have also tried the properties below, without success:
option("es.net.ssl.cert.allow.self.signed", "true")
option("es.net.ssl.truststore.location", "<path for elasticsearch cert file>")
option("es.net.ssl.truststore.pass", "xxxxxx")
Please note that the Elasticsearch node is on a Unix edge node and is reachable at http://xxxxxx:9200 (mentioning it in case it makes any difference to the code).
What am I missing here? Are any other properties needed? Please help.
Use the JAR below, which supports Spark 2+, instead of the elasticsearch-hadoop or elasticsearch-spark JAR.
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-20_2.11/5.6.8
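For reference, the same read can be sketched from PySpark. This is an illustration under assumptions, not a verified fix: the Maven coordinate is simply read off the URL above, the option names are the ones used in the question, and the helper names `es_read_options` and `read_index` are hypothetical.

```python
# Maven coordinate taken from the mvnrepository URL in the answer above.
ES_SPARK_PACKAGE = "org.elasticsearch:elasticsearch-spark-20_2.11:5.6.8"


def es_read_options(nodes, port=9200, user=None, password=None):
    """Build the connector options used in the question as a plain dict."""
    opts = {
        "es.nodes": nodes,
        "es.port": str(port),
        # Talk only to the listed node instead of discovering the cluster.
        "es.nodes.wan.only": "true",
    }
    if user is not None:
        opts["es.net.http.auth.user"] = user
        opts["es.net.http.auth.pass"] = password or ""
    return opts


def read_index(spark, resource, **connection):
    """Load an index/type resource into a DataFrame (needs the JAR on the classpath)."""
    reader = spark.read.format("org.elasticsearch.spark.sql")
    for key, value in es_read_options(**connection).items():
        reader = reader.option(key, value)
    return reader.load(resource)
```

The package can be pulled in when starting the shell with `--packages` followed by the coordinate in `ES_SPARK_PACKAGE`, after which something like `read_index(spark, "index/type", nodes="xxxxxxxxx")` mirrors the Scala code in the question.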

Scala Spark : (org.apache.spark.repl.ExecutorClassLoader) Failed to check existence of class org on REPL class server at path

Running a basic df.show() after installing spark-notebook
I am getting the following error when running Scala Spark code on spark-notebook. Any idea when this occurs and how to avoid it?
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org.apache.spark.sql.catalyst.expressions.Object on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class
I installed Spark locally, and the following code gave me the same error:
spark.read.format("json").load("Downloads/test.json")
I think the issue was that Spark was looking for a master node and picking up some random or default IP address. I specified the master mode explicitly and set the driver host to 127.0.0.1, and that resolved my issue.
Solution
Run Spark with a local master:
/usr/local/bin/spark-shell --master "local[4]" --conf spark.driver.host=127.0.0.1
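If the session is created from code rather than from spark-shell, the same two settings can be passed to the session builder. A rough PySpark sketch, assuming the same values as the command above; `local_settings` and `build_local_session` are hypothetical helper names.

```python
def local_settings(cores=4, driver_host="127.0.0.1"):
    """Mirror `--master local[N] --conf spark.driver.host=...` as config keys."""
    return {
        "spark.master": "local[{}]".format(cores),
        "spark.driver.host": driver_host,
    }


def build_local_session(cores=4, driver_host="127.0.0.1"):
    """Create (or reuse) a session pinned to a local master and loopback driver."""
    # Lazy import: only needed when a session is actually requested.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in local_settings(cores, driver_host).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Note that getOrCreate() returns an already running session if one exists, so these settings only take effect when the session is created fresh.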

How to set checkpoint dir in PySpark on Data Science Experience

Could you help me with instructions on how to set the checkpoint directory for a PySpark session on IBM's Data Science Experience?
The need arose because I have to run connectedComponents() from GraphFrames, and it raises the following error:
Py4JJavaError: An error occurred while calling o221.run.
: java.io.IOException: Checkpoint directory is not set. Please set it first using sc.setCheckpointDir().
The main issue is finding the notebook's working directory so that a path under it can be passed to sc.setCheckpointDir(). This can be done easily with
!pwd
Then create a directory for checkpoints under that path:
!mkdir <pwd_output>/checkpoints
Finally, set the checkpoint directory:
spark.sparkContext.setCheckpointDir('<pwd_output>/checkpoints')
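The three steps above can also be done from Python without shell escapes. A minimal sketch, assuming the notebook's working directory is writable; `make_checkpoint_dir` and `set_checkpoint_dir` are hypothetical helper names, and `spark` is the session already available in the notebook.

```python
import os


def make_checkpoint_dir(base=None, name="checkpoints"):
    """Create <base>/<name> if needed and return its path.

    With base=None the current working directory is used, which is
    what `!pwd` reports in the notebook.
    """
    base = os.getcwd() if base is None else base
    path = os.path.join(base, name)
    # exist_ok=True makes the cell safe to re-run.
    os.makedirs(path, exist_ok=True)
    return path


def set_checkpoint_dir(spark, path):
    """Point the session's SparkContext at the checkpoint directory."""
    spark.sparkContext.setCheckpointDir(path)
```

Usage would be `set_checkpoint_dir(spark, make_checkpoint_dir())` before calling connectedComponents(). On a multi-node cluster the directory would need to live on storage that all executors can reach, so a local path like this is only suitable when that holds.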