I have a Dataproc cluster with Anaconda. I've created a virtual environment, my-env, inside Anaconda because I need to install the open-source RDKit there, and as a result I've installed PySpark again (instead of using the pre-installed one). With the code below I get an error inside my-env, but not outside of it.
Code:
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql import SparkSession
from py4j.protocol import Py4JJavaError
spark = SparkSession.builder.appName("test").getOrCreate()
fields = [StructField("col0", StringType(), True),
StructField("col1", StringType(), True),
StructField("col2", StringType(), True),
StructField("col3", StringType(), True)]
schema = StructType(fields)
chem_info = spark.createDataFrame([], schema)
This is the error I'm getting:
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/sql/session.py", line 749, in createDataFrame
    jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2297, in _to_java_object_rdd
    rdd = self._pickled()
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 196, in _pickled
    return self._reserialize(AutoBatchedSerializer(PickleSerializer()))
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 594, in _reserialize
    self = self.map(lambda x: x, preservesPartitioning=True)
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 325, in map
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 365, in mapPartitionsWithIndex
    return PipelinedRDD(self, f, preservesPartitioning)
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2514, in __init__
    self.is_barrier = prev._is_barrier() or isFromBarrier
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2414, in _is_barrier
    return self._jrdd.rdd().isBarrier()
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/.conda/envs/my-env/lib/python3.6/site-packages/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o57.isBarrier. Trace:
py4j.Py4JException: Method isBarrier([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Can you help me resolve it?
As mentioned in the "pyspark: Method isBarrier([]) does not exist" question, this error is caused by an incompatibility between the Spark version installed on the Dataproc cluster and the PySpark version you manually installed in your conda environment.
To solve this issue, check the Spark version on the cluster and install the matching version of PySpark:
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_232
$ conda install pyspark==2.4.4
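As a quick sanity check, you can run the following from inside the my-env environment; the installed PySpark version should match the Spark version reported by the cluster:
$ python -c "import pyspark; print(pyspark.__version__)"
# expected to print 2.4.4, the same version shown by spark-submit --version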
Related
I am using PySpark to process a text file in HDFS. If I use a simple HDFS command like "hdfs dfs -cat hdfs:///data/msd/tasteprofile/mismatches/sid_matches_manually_accepted.txt", it works.
But if I use PySpark code like the following, it keeps returning "[Errno 2] No such file or directory":
schemaMismatches = StructType([
StructField("song_id", StringType(), True),
StructField("song_artist", StringType(), True),
StructField("song_title", StringType(), True),
StructField("track_id", StringType(), True),
StructField("track_artist", StringType(), True),
StructField("track_title", StringType(), True)
])
with open("hdfs:///data/msd/tasteprofile/mismatches/sid_matches_manually_accepted.txt", "r") as f:
    lines = f.readlines()

sid_matches_manually_accepted = []
for line in lines:
    if line.startswith("< ERROR: "):
        a = line[10:28]
        b = line[29:47]
        c, d = line[49:-1].split(" != ")
        e, f = c.split(" - ")
        g, h = d.split(" - ")
        sid_matches_manually_accepted.append((a, e, f, b, g, h))

matches_manually_accepted = spark.createDataFrame(sc.parallelize(sid_matches_manually_accepted, 8), schema=schemaMismatches)
matches_manually_accepted.cache()
matches_manually_accepted.show(10, 20)
I suppose the file path I used for PySpark is wrong, but I am not sure how to fix it.
You should read that text file the Spark way, not the normal Python way; open() only works with the local filesystem and cannot resolve an hdfs:// path.
df = spark.read.text('hdfs:///data/msd/tasteprofile/mismatches/sid_matches_manually_accepted.txt')
df.show()
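If you still need the fixed-offset parsing from your original code, here is a minimal sketch under the assumption that the same slicing applies (spark.read.text strips the trailing newline, so the last offset is adjusted compared to readlines()):
raw = spark.read.text("hdfs:///data/msd/tasteprofile/mismatches/sid_matches_manually_accepted.txt")

def parse(line):
    # Fixed offsets copied from the question's code
    a = line[10:28]
    b = line[29:47]
    c, d = line[49:].split(" != ")
    e, f = c.split(" - ")
    g, h = d.split(" - ")
    return (a, e, f, b, g, h)

parsed = (raw.rdd
             .map(lambda row: row.value)
             .filter(lambda line: line.startswith("< ERROR: "))
             .map(parse))

matches_manually_accepted = spark.createDataFrame(parsed, schema=schemaMismatches)
matches_manually_accepted.show(10, 20)
This reuses the spark session and the schemaMismatches schema defined in the question.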
I want to be able to select multiple columns of an RDD while applying transformations to one of the values. I am able to:
- select specific columns
- apply transformations on one of the columns
I am unable to apply both of them together
1) Selecting specific columns
from pyspark import SparkContext
logFile = "/FileStore/tables/tendulkar.csv"
rdd = sc.textFile(logFile)
rdd.map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2])).take(4)
[('Runs', 'Mins', 'BF'),
('15', '28', '24'),
('DNB', '-', '-'),
('59', '254', '172')]
2) Apply transformations to the 1st column
df=(rdd.map(lambda line: line.split(",")[0])
.filter(lambda x: x !="DNB")
.filter(lambda x: x!= "TDNB")
.filter(lambda x: x!="absent")
.map(lambda x: x.replace("*","")))
df.take(4)
['Runs', '15', '59', '8']
I tried to do them together as follows
rdd.map(lambda line: ((line.split(",")[0]).filter(lambda x: x != "DNB"), line.split(",")[1], line.split(",")[2])).count()
I get an error
Py4JJavaError Traceback (most recent call last)
<command-2766458519992264> in <module>()
10 .map(lambda x: x.replace("*","")))
11
---> 12 rdd.map(lambda line: ( (line.split(",")[0]).filter(lambda x:x!="DNB"),line.split(",")[1],line.split(",")[2])).count()
/databricks/spark/python/pyspark/rdd.py in count(self)
1067 3
1068 """
-> 1069 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
1070
1071 def stats(self):
Please help
Just apply the filter on the first element of each row, after a map in which you select all the columns you want:
rdd.map(lambda line: line.split(",")[:3]) \
.filter(lambda x: x[0] not in ["DNB", "TDNB", "absent"])
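If you also want the "*" cleanup from your step 2, a minimal sketch that combines the selection, the filtering and that replacement (reusing the same rdd from the question) could look like this:
(rdd.map(lambda line: line.split(",")[:3])
    .filter(lambda x: x[0] not in ["DNB", "TDNB", "absent"])
    .map(lambda x: (x[0].replace("*", ""), x[1], x[2]))
    .take(4))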
I want to read the orders data, which is stored as a sequence file in Hadoop FS on the Cloudera VM, and create an RDD out of it. Below are my steps:
1) Importing orders data as sequence file:
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba --password cloudera --table orders -m 1 --target-dir /ordersDataSet --as-sequencefile
2) Reading file in spark scala:
Spark 1.6
val sequenceData=sc.sequenceFile("/ordersDataSet",classOf[org.apache.hadoop.io.Text],classOf[org.apache.hadoop.io.Text]).map(rec => rec.toString())
3) When I try to read data from the above RDD, it throws the error below:
Caused by: java.io.IOException: WritableName can't load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2108)
... 17 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
... 18 more
I don't know why it says that it can't find orders. Where am I going wrong?
I referred to the code from these two links as well, but no luck:
1) Refer sequence part
2) Refer step no. 8
Sqoop has little to do with it. Here is an example of a more realistic scenario, where saveAsSequenceFile always assumes (k, v) pairs; this may help you:
import org.apache.hadoop.io._
val RDD = sc.parallelize( List( (1, List("A", "B")) , (2, List("B", "C")) , (3, List("C", "D", "E")) ) )
val RDD2 = RDD.map(x => (x._1, x._2.mkString("/")))
RDD2.saveAsSequenceFile("/rushhour/seq-directory/2")
val sequence_data = sc.sequenceFile("/rushhour/seq-directory/*", classOf[IntWritable], classOf[Text])
.map{case (x, y) => (x.get(), y.toString().split("/")(0), y.toString().split("/")(1))}
sequence_data.collect
returns:
res20: Array[(Int, String, String)] = Array((1,A,B), (2,B,C), (3,C,D), (1,A,B), (2,B,C), (3,C,D))
I am not sure if you want an RDD or DF, but converting RDD to DF is of course trivial.
I figured out the solution to my own problem. It is going to be a lengthy answer, but I hope it makes some sense.
1) When I tried to read the data that had been imported into HDFS using Sqoop, it gave an error for the following reasons:
A) A sequence file is all about key-value pairs. The data imported by Sqoop is not in key-value pairs, which is why reading it throws an error.
B) If you try to read the first few characters, from which you can figure out the two classes you need to pass when reading the sequence file, you get data like this:
[cloudera#quickstart ~]$ hadoop fs -cat /user/cloudera/problem5/sequence/pa* | head -c 300
SEQ!org.apache.hadoop.io.LongWritableorders�;�M��c�K�����#���-OCLOSED#���PENDING_PAYMENT#���/COMPLETE#���"{CLOSED#���cat: Unable to write to output stream.
Above you can see only one class, i.e. org.apache.hadoop.io.LongWritable, and when I pass it while reading the sequence data it throws the error mentioned in the post.
val sequenceData=sc.sequenceFile("/ordersDataSet",classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.LongWritable]).map(rec => rec.toString())
I don't think point B is the main reason for the error, but I am quite sure point A is the real culprit.
2) Below is how I solved my problem.
I imported the data as an Avro data file into another destination using Sqoop. Then I created a DataFrame from the Avro data as follows:
scala> import com.databricks.spark.avro._;
scala> val avroData=sqlContext.read.avro("path")
Then I created key-value pairs and saved them as a sequence file:
avroData.map(p=>(p(0).toString,(p(0)+"\t"+p(1)+"\t"+p(2)+"\t"+p(3)))).saveAsSequenceFile("/user/cloudera/problem5/sequence")
Now when I read the first few characters of the file written above, it gives me the two classes that I need when reading the file:
[cloudera#quickstart ~]$ hadoop fs -cat /user/cloudera/problem5/sequence/part-00000 | head -c 300
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text^#%���8P���11 1374735600000 11599 CLOSED&2#2 1374735600000 256 PENDING_PAYMENT!33 1374735600000 12111 COMPLETE44 1374735600000 8827 CLOSED!55 1374735600000 11318 COMPLETE 66 1374cat: Unable to write to output stream.
scala> val sequenceData=sc.sequenceFile("/user/cloudera/problem5/sequence",classOf[org.apache.hadoop.io.Text],classOf[org.apache.hadoop.io.Text]).map(rec=>rec.toString)
sequenceData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at map at <console>:30
Now when I print the data, it displays as below:
scala> sequenceData.take(4).foreach(println)
(1,1 1374735600000 11599 CLOSED)
(2,2 1374735600000 256 PENDING_PAYMENT)
(3,3 1374735600000 12111 COMPLETE)
(4,4 1374735600000 8827 CLOSED)
Last but not least, thank you everyone for your much appreciated efforts. Cheers!
I wrote this code for streaming iris classification on PySpark, but I get the error "'RDD' object has no attribute '_jdf'". I changed the RDD to a DataFrame, but then it says "RDD is not iterable". Please help me solve it!
Many thanks.
Here is my code:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.ml import PipelineModel, Pipeline
from pyspark.sql import Row, DataFrame
from pyspark.sql.types import *
from pyspark.sql.functions import *
conf = SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
sc = SparkContext.getOrCreate(conf = conf)
ssc = StreamingContext(sc,1)
lines = ssc.socketTextStream("localhost", 8889)
#Load ML model
sameModel = PipelineModel.load("g:/Demo/DecisionTree_Model1")
#Predict the type of iris from features
result = lines.foreachRDD(lambda rdd: sameModel.transform(rdd))
ssc.start()
ssc.awaitTermination()
AND THE ERROR: 'RDD' object has no attribute '_jdf'
Py4JJavaError Traceback (most recent call last)
<ipython-input-6-18f3db416f1c> in <module>()
1 ssc.start()
----> 2 ssc.awaitTermination()
E:\Spark\spark\python\pyspark\streaming\context.py in awaitTermination(self,
timeout)
204 """
205 if timeout is None:
--> 206 self._jssc.awaitTermination()
207 else:
208 self._jssc.awaitTerminationOrTimeout(int(timeout * 1000))
E:\Spark\spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in
__call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
E:\Spark\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
E:\Spark\spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in
get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o35.awaitTermination.
: org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "E:\Spark\spark\python\pyspark\streaming\util.py", line 65, in call
r = self.func(t, *rdds)
File "E:\Spark\spark\python\pyspark\streaming\dstream.py", line 159, in
<lambda>
func = lambda t, rdd: old_func(rdd)
File "<ipython-input-5-64e27204db5a>", line 1, in <lambda>
result = lines.foreachRDD(lambda rdd: sameModel.transform(rdd))
File "E:\Spark\spark\python\pyspark\ml\base.py", line 173, in transform
return self._transform(dataset)
File "E:\Spark\spark\python\pyspark\ml\pipeline.py", line 262, in _transform
dataset = t.transform(dataset)
File "E:\Spark\spark\python\pyspark\ml\base.py", line 173, in transform
return self._transform(dataset)
File "E:\Spark\spark\python\pyspark\ml\wrapper.py", line 305, in _transform
return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
AttributeError: 'RDD' object has no attribute '_jdf'
at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95)
at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78)
at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The code below shows how to load a pre-trained model, start a Structured Streaming query with a socket source, apply the model's transform on the stream, and then sink it to the console.
from random import Random

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.ml import PipelineModel

spark = SparkSession \
.builder \
.appName("transform ml") \
.getOrCreate()
model = PipelineModel.load("./model")
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
random = Random()
words = lines.select(f.lit(random.randint(1, 10000))
.alias("id"),
lines.value.alias("text")
)
prediction = model.transform(words)
query = prediction \
.writeStream \
.outputMode("append") \
.format("console") \
.start()
query.awaitTermination()
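Alternatively, if you prefer to stay with the DStream API from the question, a minimal sketch (an illustration, not the original author's code) is to turn each micro-batch RDD into a DataFrame before calling transform; the column name text below is a placeholder for whatever input column your pipeline's first stage actually expects:
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def predict(time, rdd):
    # transform() needs a DataFrame, not an RDD, which is what caused the original error
    if not rdd.isEmpty():
        df = spark.createDataFrame(rdd.map(lambda line: Row(text=line)))
        sameModel.transform(df).show()

lines.foreachRDD(predict)
Here sameModel and lines refer to the variables already defined in the question's code.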
I have a CSV file with a text field, in 2 languages (French and English). I'm attempting to perform a cluster analysis and somewhat expecting the texts to be grouped in 2 clusters due to the language difference.
I came up with the following piece of code, which doesn't work as intended:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.clustering.KMeans
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("id_suivi", StringType, true),
StructField("id_ticket", StringType, true),
StructField("id_affectation", StringType, true),
StructField("id_contact", StringType, true),
StructField("d_date", StringType, true),
StructField("n_duree_passe", StringType, true),
StructField("isPublic", StringType, true),
StructField("Ticket_Request_Id", StringType, true),
StructField("IsDoneInHNO", StringType, true),
StructField("commments", StringType, true),
StructField("reponse", StringType, true)))
val tokenizer = new Tokenizer().setInputCol("reponse").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(32768)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("delimiter", ";").
schema(customSchema).
load("C:/noSave/tmp/22/tickets1.csv").
select("id_suivi", "reponse")
val tokenizedDF = tokenizer.transform(df)
val hashedDF = hashingTF.transform(tokenizedDF).cache()
val idfModel = idf.fit(hashedDF)
val rescaledDF = idfModel.transform(hashedDF).cache()
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(rescaledDF)
val clusteredDF = model.transform(rescaledDF)
I believe this code is correct, or at least I don't see where the bug is. However, something is really wrong, because when I compute the cost, it's really big:
scala> model.computeCost(rescaledDF)
res0: Double = 3.1555983509935196E7
I have also tried different values for K (I thought 2 was a good value because my texts are in 2 languages (French, English)), such as 10, 100 or even bigger, looking for the "elbow" value, but no luck.
Can anyone point me in the right direction?
Many thanks in advance !
I'll answer my own question (hopefully this is acceptable by SO's etiquette) in case it is one day of any use to someone else.
An easier way to differentiate the two languages is to consider their use of stop words (i.e. words that are extremely common in each language).
Using TF-IDF was a bad idea to start with, because it nullifies the contribution of the stop words (its purpose is to put the focus on the "uncommonly common" terms in a document).
I managed to get closer to my goal of clustering by language by using CountVectorizer, which creates a dictionary of the most frequently used terms and counts them for each document.
Since the most common terms are the stop words, we end up clustering the documents by their use of stop words, which form different sets in the two languages, and therefore by language.