PySpark, GraphFrames, exception Caused by: java.lang.ClassNotFoundException: com.typesafe.scalalogging.slf4j.LazyLogging - pyspark

I am trying to run the following code which leverages graphframes, and I am getting an error now which, to the best of my knowledge and after some hours of Googling, I cannot resolve. It seems like a class cannot be loaded, but I don't really know what else I should be doing.
Can someone please have another look at the code and error below? I have followed the instructions from here, and in case you want to quickly give it a try, you can find my dataset here.
"""
Program: RUNNING GRAPH ANALYTICS WITH SPARK GRAPH-FRAMES:
Author: Dr. C. Hadjinikolis
Date: 14/09/2016
Description: This is the application's core module from where everything is executed.
The module is responsible for:
1. Loading Spark
2. Loading GraphFrames
3. Running analytics by leveraging other modules in the package.
"""
# IMPORT OTHER LIBS -------------------------------------------------------------------------------#
import os
import sys
import pandas as pd
# IMPORT SPARK ------------------------------------------------------------------------------------#
# Path to Spark source folder
USER_FILE_PATH = "/Users/christoshadjinikolis"
SPARK_PATH = "/PycharmProjects/GenesAssociation"
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7"
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE
os.environ['SPARK_HOME'] = SPARK_HOME
# Append pySpark to Python Path
sys.path.append(SPARK_HOME + "/python")
sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip")
try:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from graphframes import *
except ImportError as ex:
print "Can not import Spark Modules", ex
sys.exit(1)
# GLOBAL VARIABLES --------------------------------------------------------------------------------#
# Configure spark properties
CONF = (SparkConf()
.setMaster("local")
.setAppName("My app")
.set("spark.executor.memory", "10g")
.set("spark.executor.instances", "4"))
# Instantiate SparkContext object
SC = SparkContext(conf=CONF)
# Instantiate SQL_SparkContext object
SQL_CONTEXT = SQLContext(SC)
# MAIN CODE ---------------------------------------------------------------------------------------#
if __name__ == "__main__":
# Main Path to CSV files
DATA_PATH = '/PycharmProjects/GenesAssociation/data/'
FILE_NAME = 'gene_gene_associations_50k.csv'
# LOAD DATA CSV USING PANDAS -----------------------------------------------------------------#
print "STEP 1: Loading Gene Nodes -------------------------------------------------------------"
# Read csv file and load as df
GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
usecols=['OFFICIAL_SYMBOL_A'],
low_memory=True,
iterator=True,
chunksize=1000)
# Concatenate chunks into list & convert to dataFrame
GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True))
# Remove duplicates
GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first')
# Name Columns
GENES_DF_CLEAN.columns = ['id']
# Output dataFrame
print GENES_DF_CLEAN
# Create vertices
VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN)
# Show some vertices
print VERTICES.take(5)
print "STEP 2: Loading Gene Edges -------------------------------------------------------------"
# Read csv file and load as df
EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'],
low_memory=True,
iterator=True,
chunksize=1000)
# Concatenate chunks into list & convert to dataFrame
EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True))
# Name Columns
EDGES_DF.columns = ["src", "dst", "rel_type"]
# Output dataFrame
print EDGES_DF
# Create vertices
EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF)
# Show some edges
print EDGES.take(5)
print "STEP 3: Generating the Graph -----------------------------------------------------------"
GENES_GRAPH = GraphFrame(VERTICES, EDGES)
print "STEP 4: Running Various Basic Analytics ------------------------------------------------"
print "Vertex in-Degree -----------------------------------------------------------------------"
GENES_GRAPH.inDegrees.sort('inDegree', ascending=False).show()
print "Vertex out-Degree ----------------------------------------------------------------------"
GENES_GRAPH.outDegrees.sort('outDegree', ascending=False).show()
print "Vertex degree --------------------------------------------------------------------------"
GENES_GRAPH.degrees.sort('degree', ascending=False).show()
print "Triangle Count -------------------------------------------------------------------------"
RESULTS = GENES_GRAPH.triangleCount()
RESULTS.select("id", "count").show()
print "Label Propagation ----------------------------------------------------------------------"
GENES_GRAPH.labelPropagation(maxIter=10).show() # Convergence is not guaranteed
print "PageRank -------------------------------------------------------------------------------"
GENES_GRAPH.pageRank(resetProbability=0.15, tol=0.01)\
.vertices.sort('pagerank', ascending=False).show()
print "STEP 5: Find Shortest Paths w.r.t. Landmarks -------------------------------------------"
# Shortest paths
SHORTEST_PATH = GENES_GRAPH.shortestPaths(landmarks=["ARF3", "MAP2K4"])
SHORTEST_PATH.select("id", "distances").show()
print "STEP 6: Save Vertices and Edges --------------------------------------------------------"
# Save vertices and edges as Parquet to some location.
# Note: You can't overwrite existing vertices and edges directories.
GENES_GRAPH.vertices.write.parquet("vertices")
GENES_GRAPH.edges.write.parquet("edges")
print "STEP 7: Load "
# Load the vertices and edges back.
SAME_VERTICES = GENES_GRAPH.read.parquet("vertices")
SAME_EDGES = GENES_GRAPH.read.parquet("edges")
# Create an identical GraphFrame.
SAME_GENES_GRAPH = GF.GraphFrame(SAME_VERTICES, SAME_EDGES)
# END OF FILE -------------------------------------------------------------------------------------#
This is the output:
Ivy Default Cache set to: /Users/username/.ivy2/cache
The jars for the packages stored in: /Users/username/.ivy2/jars
:: loading settings :: url = jar:file:/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.2.0-spark2.0-s_2.11 in list
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in list
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in list
found org.scala-lang#scala-reflect;2.11.0 in list
[2.11.0] org.scala-lang#scala-reflect;2.11.0
found org.slf4j#slf4j-api;1.7.7 in list
:: resolution report :: resolve 391ms :: artifacts dl 14ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from list in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from list in [default]
graphframes#graphframes;0.2.0-spark2.0-s_2.11 from list in [default]
org.scala-lang#scala-reflect;2.11.0 from list in [default]
org.slf4j#slf4j-api;1.7.7 from list in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 0 | 0 | 0 || 5 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 5 already retrieved (0kB/11ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/20 11:00:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
OK1
Traceback (most recent call last):
File "/Users/username/PycharmProjects/GenesAssociation/main.py", line 32, in <module>
g = GraphFrame(v, e)
File "/Users/tjhunter/work/graphframes/python/graphframes/graphframe.py", line 62, in __init__
File "/Users/tjhunter/work/graphframes/python/graphframes/graphframe.py", line 34, in _java_api
File "/Users/christoshadjinikolis/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/Users/christoshadjinikolis/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib" \
"/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o53.newInstance.
: java.lang.NoClassDefFoundError: com/typesafe/scalalogging/slf4j/LazyLogging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.graphframes.GraphFrame$.<init>(GraphFrame.scala:677)
at org.graphframes.GraphFrame$.<clinit>(GraphFrame.scala)
at org.graphframes.GraphFramePythonAPI.<init>(GraphFramePythonAPI.scala:11)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.typesafe.scalalogging.slf4j.LazyLogging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 43 more
Process finished with exit code 1

I've had the same issue with spark/scala I've resolved it by adding the fowlowing jar to classpath:
spark-shell --jars scala-logging_2.12-3.5.0.jar
You can find the jar here:
https://mvnrepository.com/artifact/com.typesafe.scala-logging/scala-logging_2.12/3.5.0
Source: https://github.com/graphframes/graphframes/issues/113

Related

org.jpmml.sparkml.PMMLBuilder does not exist in the JVM

Thanks a lot for any help.
My goal is to save a trained model in XML format and Im really stragling with this error and warnings
---------------------------------------------------------------------------
Exception in thread "Thread-4" java.lang.ExceptionInInitializerError
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at py4j.reflection.CurrentThreadClassLoadingStrategy.classForName(CurrentThreadClassLoadingStrategy.java:40)
at py4j.reflection.ReflectionUtil.classForName(ReflectionUtil.java:51)
at py4j.reflection.TypeUtil.forName(TypeUtil.java:243)
at py4j.commands.ReflectionCommand.getUnknownMember(ReflectionCommand.java:175)
at py4j.commands.ReflectionCommand.execute(ReflectionCommand.java:87)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Expected Apache Spark ML version 3.1, got version 3.2 (3.2.0)
at org.jpmml.sparkml.ConverterFactory.checkVersion(ConverterFactory.java:114)
at org.jpmml.sparkml.PMMLBuilder.init(PMMLBuilder.java:481)
at org.jpmml.sparkml.PMMLBuilder.<clinit>(PMMLBuilder.java:545)
... 10 more
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 480, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 503, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
/tmp/ipykernel_20251/3496938591.py in <module>
----> 1 pmmlBuilder = PMMLBuilder(sc, df_train, rfModel)
~/.local/lib/python3.8/site-packages/pyspark2pmml/__init__.py in __init__(self, sc, df, pipelineModel)
10 javaSchema = javaDf.schema.__call__()
11 javaPipelineModel = pipelineModel._to_java()
---> 12 javaPmmlBuilderClass = sc._jvm.org.jpmml.sparkml.PMMLBuilder
13 if(not isinstance(javaPmmlBuilderClass, JavaClass)):
14 raise RuntimeError("JPMML-SparkML not found on classpath")
~/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __getattr__(self, name)
1647 answer[proto.CLASS_FQN_START:], self._gateway_client)
1648 else:
-> 1649 raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
1650
1651
Py4JError: org.jpmml.sparkml.PMMLBuilder does not exist in the JVM
My code is the folowing:
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("SparkApp_ETL_ML").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()
import pandas as pd
df=pd.read_parquet("https://s3.eu-de.cloud-object-storage.appdomain.cloud/cloud-object-storage-yy-cos-standard-js4/data.parquet")
sdf = spark.createDataFrame(df)
from pyspark.sql.types import DoubleType
sdf = sdf.withColumn("x", sdf.x.cast(DoubleType()))
sdf = sdf.withColumn("y", sdf.y.cast(DoubleType()))
sdf = sdf.withColumn("z", sdf.z.cast(DoubleType()))
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
input_columns = ["x", "y", "z"] # input columns to consider
train, test = sdf.randomSplit([0.8, 0.2], seed=1)
indexer = StringIndexer(inputCol="class", outputCol="label")
vectorAssembler = VectorAssembler(inputCols=input_columns, outputCol="features")
normalizer = MinMaxScaler(inputCol="features", outputCol="features_norm")
pipeline = Pipeline(stages=[indexer, vectorAssembler, normalizer])
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction"). \
setLabelCol("label")
df_train = pipeline.fit(train).transform(train)
df_test = pipeline.fit(test).transform(test)
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol='features_norm', labelCol='label', maxDepth=20, numTrees=7, seed=1)
rfModel = rf.fit(df_train)
from pyspark2pmml import PMMLBuilder
model_target = "HMP_frModel.xml"
pmmlBuilder = PMMLBuilder(sc, df_train, rfModel)
All works well till the last line in code.
I tried all solutions i found on the internet but unfortunatly without success.
I am working with jupyter notebook not anaconda and installed pyspark with pip and I added those variables in the .bashrc
export PATH=$PATH:~/.local/bin
export SPARK_HOME=~/.local/lib/python3.8/site-packages/pyspark
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.2-src.zip
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
I also downloaded those jar files jpmml-sparkml-executable-1.7.2.jar jpmml-sparkml-executable-1.8.0.jar and put them in this directory ~/.local/lib/python3.8/site-packages/pyspark/jars

pyspark 3.0 - PythonUDF Runner error whilst applying OneVsRest model

I've been able to train a OneVsRest model, and save it.The model has 100 classes. I've managed to load the model, and currently trying to apply the model on some test data. The test data has 45 million rows.
test_data = spark \
.read \
.parquet(TEST_DATA_LOC)
ovrModel = OneVsRestModel.load(MODEL_LOCATION)
logger.warn("loading the model from {}".format(MODEL_LOCATION))
predictions = ovrModel.transform(test_data)
#udf("double")
def getRawPredictions(vectors:Vectors):
try:
vectors = vectors.toArray().tolist()
probabilities = np.exp(vectors)/(1+np.exp(vectors))
probabilities = np.round(probabilities, decimals=3)
probability = np.max(probabilities)
return probability
except:
pass
predictions = predictions \
.withColumn('rawPrediction', getRawPredictions('rawPrediction'))
However, I'm getting a PythonUDFRunner error. I'm unable to fix the issue.
ERROR PythonUDFRunner: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:292)
at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:351)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:200)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:170)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:74)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage24.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1209)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1215)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.writeIteratorToStream(PythonUDFRunner.scala:50)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
The traceback error is as follows
Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 186, in manager
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 74, in worker
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 642, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 595, in read_int
raise EOFError

PySpark java.io.IOException: No FileSystem for scheme: https

I am using local windows and trying to load the XML file with the following code on python, and i am having this error, do anyone knows how to resolve it,
this is the code
df1 = sqlContext.read.format("xml").options(rowTag="IRS990EZ").load("https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml")
and this is the error
Py4JJavaError Traceback (most recent call last)
<ipython-input-7-4832eb48a4aa> in <module>()
----> 1 df1 = sqlContext.read.format("xml").options(rowTag="IRS990EZ").load("https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml")
C:\SPARK_HOME\spark-2.2.0-bin-hadoop2.7\python\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
157 self.options(**options)
158 if isinstance(path, basestring):
--> 159 return self._df(self._jreader.load(path))
160 elif path is not None:
161 if type(path) != list:
C:\SPARK_HOME\spark-2.2.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
C:\SPARK_HOME\spark-2.2.0-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
C:\SPARK_HOME\spark-2.2.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o38.load.
: java.io.IOException: No FileSystem for scheme: https
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1160)
at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1148)
at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:62)
at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:62)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:47)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
at scala.Option.getOrElse(Option.scala:121)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:45)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:65)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:43)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Unknown Source)
Somehow pyspark is unable to load the http or https, one of my colleague found the answer for this so here is the solution,
before creating the spark context and sql context we need to load this two line of code
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.4.1 pyspark-shell'
after creating the sparkcontext and sqlcontext from sc = pyspark.SparkContext.getOrCreate and sqlContext = SQLContext(sc)
add the http or https url into the sc by using sc.addFile(url)
Data_XMLFile = sqlContext.read.format("xml").options(rowTag="anytaghere").load(pyspark.SparkFiles.get("*_public.xml")).coalesce(10).cache()
this solution worked for me
The error message says it all: you cannot use dataframe reader & load to access files on the web (http or htpps). I suggest you first download the file locally.
See the pyspark.sql.DataFrameReader docs for more on the available sources (in general, local file system, HDFS, and databases via JDBC).
Irrelevantly to the error, notice that you seem to use the format part of the command incorrectly: assuming that you use the XML Data Source for Apache Spark package, the correct usage should be format('com.databricks.spark.xml') (see the example).
I've commit a similar but slightly different error: forgot the "s3://" prefix to file path. After adding this prefix to form "s3://path/to/object" the following code works:
my_data = spark.read.format("com.databricks.spark.csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.option("delimiter", ",")\
.load("s3://path/to/object")
I was also having a similar issue with the CSV file basically we were trying to load a CSV file into spark.
We were able to load the file successfully by making use of the pandas' library, first we loaded the file into the pandas data frame, and then by using the pandas we were able to load the data into the spark data frame.
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('appName').getOrCreate()
pdf = pd.read_csv('file patth with https')
sdf = spark.createDataFrame(pdf)

FileTailSource throws null pointer exception

I wrote the following Alpakka code
val fs = FileSystems.getDefault
val resource = getClass.getResource("countrycapital.csv")
val source = FileTailSource.lines(Paths.get(resource.toURI), maxLineSize = 8092, pollingInterval = 10000 seconds)
I have a file called "countrycapital.csv" in the src/main/resources folder.
My objective is to build a Akka Streams Source for this file and process all the records of this file.
when I run this code. the list line throws an exception
[error] (run-main-0) java.lang.NullPointerException
java.lang.NullPointerException
at com.abhi.ElasticAkkaStreams$.delayedEndpoint$com$abhi$ElasticAkkaStreams$1(ElasticAkkaStreams.scala:37)
at com.abhi.ElasticAkkaStreams$delayedInit$body.apply(ElasticAkkaStreams.scala:28)
at scala.Function0.apply$mcV$sp(Function0.scala:34)
at scala.Function0.apply$mcV$sp$(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:389)
I have imported the following libraries in my build.sbt
"com.lightbend.akka" %% "akka-stream-alpakka-elasticsearch" % "0.13",
"com.lightbend.akka" %% "akka-stream-alpakka-csv" % "0.13",
"com.lightbend.akka" %% "akka-stream-alpakka-file" % "0.13"
OK. I solved this myself
I didn't have a / when picking the file path.
This works without the null pointer exception
val resource = getClass.getResource("/countrycapital.csv")

IOException When Running Process

Given:
$pwd
/home/kmeredith/src/linux_sandbox
$ls a.txt
a.txt
$cat a.txt
$
I then tried to run a scala.sys.process.Process that would append 'hi' to a.txt:
import scala.sys.process._
import java.io.File
scala> Process( List("echo 'hi' >> a.txt"), new File(".") )
res3: scala.sys.process.ProcessBuilder = [echo 'hi' > a.txt]
scala> res3.!
java.io.IOException: Cannot run program "echo 'hi' > a.txt" (in directory "."): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:69)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:98)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:112)
... 32 elided
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 35 more
Note that it also failed when I gave the full path:
scala> Process( List("echo 'hi' > a.txt"), new File("/home/kmeredith/src/linux_sandbox") )
res0: scala.sys.process.ProcessBuilder = [echo 'hi' > a.txt]
Why am I seeing this error?
scala> res0.!
java.io.IOException: Cannot run program "echo 'hi' > a.txt" (in directory "/home/kmeredith/src/linux_sandbox"): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:69)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:98)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:112)
... 32 elided
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 35 more
You can use ProcessBuilder DSL for redirecting the output:
Seq("echo", "some text") #>> new File("a.txt")
Here Seq[String] will be implicitly converted to ProcessBuilder. Each element of the Seq will be treated as an argument of the command (first element echo) and properly escaped, so you don't need any extra quoting (and shouldn't put >> there).
The file that you pass as the second argument is the cwd (current working directory), so in this particular case it doesn't change anything. See "Handling Input and Output" section of the sys.process docs.