Error when calling UDF using broadcasted objects in PySpark - pyspark

I am trying to invoke a UDF that uses a broadcasted object in PySpark.
Here is a minimal example that reproduces the situation and error:
import pyspark.sql.functions as sf
from pyspark.sql.types import LongType

class SquareClass:
    def compute(self, n):
        return n ** 2

square = SquareClass()
square_sc = sc.broadcast(square)

def f(n):
    return square_sc.value.compute(n)

numbers = sc.parallelize([{'id': i} for i in range(10)]).toDF()
f_udf = sf.udf(f, LongType())
numbers.select(f_udf(numbers.id)).show(10)
The stacktrace and error message that this snippet produces:
Traceback (most recent call last)
<ipython-input-75-6e38c014e4b2> in <module>()
13 f_udf = sf.udf(f, LongType())
14
---> 15 numbers.select(f_udf(numbers.id)).show(10)
/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.py in show(self, n, truncate)
255 +---+-----+
256 """
--> 257 print(self._jdf.showString(n, truncate))
258
259 def __repr__(self):
/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id,
<snip>
An error occurred while calling o938.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 49.0 failed 1 times, most recent failure: Lost task 1.0 in stage 49.0 (TID 587, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

When the UDF accesses square_sc.value, the workers have to unpickle the broadcast object, and that requires the SquareClass definition, which is not present on the workers.
If you want to use a Python package, class, or function in a UDF, the workers need to have access to it. You can achieve this by putting the code in a Python script and deploying it with --py-files when running spark-submit or pyspark.

One thing you can do is keep the class in a separate module and add that module to the SparkContext.
class_module.py
class SquareClass:
    def compute(self, n):
        return n ** 2
pyspark-shell
import pyspark.sql.functions as sf
from pyspark.sql.types import LongType
from class_module import SquareClass

sc.addFile('class_module.py')

square = SquareClass()
square_sc = sc.broadcast(square)

def f(n):
    return square_sc.value.compute(n)

f_udf = sf.udf(f, LongType())
numbers = sc.parallelize([{'id': i} for i in range(10)]).toDF()
numbers.select(f_udf(numbers.id)).show(10)
+-----+
|f(id)|
+-----+
| 0|
| 1|
| 4|
| 9|
| 16|
| 25|
| 36|
| 49|
| 64|
| 81|
+-----+

Related

How to find the number of vertices that are reachable from a given vertex in Spark GraphX

I want to find out the number of reachable vertices from a given vertex in a directed graph. E.g. for id=0L: since 0L connects to 1L and 2L, 1L connects to 3L, and 2L connects to 4L, the output should be 4. Following is the graph relationship data:
edgeid from to distance
0 0 1 10.0
1 0 2 5.0
2 1 2 2.0
3 1 3 1.0
4 2 1 3.0
5 2 3 9.0
6 2 4 2.0
7 3 4 4.0
8 4 0 7.0
9 4 3 5.0
I was able to set up the graph, but I am not sure how to use graph.edges.filter to get the output:
val vertexRDD: RDD[(Long, (Double))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(Double), Int] = Graph(vertexRDD, edgeRDD)
In your example all vertices are connected with a directed path so each vertex should result in a value of 4.
But if you were to remove the 4->0 (id=8) connection there would be a different number of course.
Since your problem relies on (recursively) traversing the graph in parallel the Graphx Pregel API is probably the best approach.
The pregel call takes 3 functions:
vprog to initialize each vertex with a message (in your case an empty List[VertexId])
sendMsg an update step that is applied on each iteration (in your case accumulating the neighboring VertexIds and returning an Iterator with messages to send out to the next iteration)
mergeMsg to merge two messages (two List[VertexId]s into one)
In code it would look like:
def vprog(id: VertexId, orig: List[VertexId], newly: List[VertexId]): List[VertexId] = newly

def mergeMsg(a: List[VertexId], b: List[VertexId]): List[VertexId] = (a ++ b).distinct

def sendMsg(trip: EdgeTriplet[List[VertexId], Double]): Iterator[(VertexId, List[VertexId])] = {
  val recursivelyConnectedNeighbors = (trip.dstId :: trip.dstAttr).filterNot(_ == trip.srcId)
  if (trip.srcAttr.intersect(recursivelyConnectedNeighbors).length != recursivelyConnectedNeighbors.length)
    Iterator((trip.srcId, recursivelyConnectedNeighbors))
  else
    Iterator.empty
}

val initList = List.empty[VertexId]

val result = graph
  .mapVertices((_, _) => initList)
  .pregel(
    initialMsg = initList,
    activeDirection = EdgeDirection.Out
  )(vprog, sendMsg, mergeMsg)
  .mapVertices((_, neighbors) => neighbors.length)

result.vertices.toDF("vertex", "value").show()
Output (for the graph with the 4->0 connection, id=8, removed, as discussed above):
+------+-----+
|vertex|value|
+------+-----+
| 0| 4|
| 1| 3|
| 2| 3|
| 3| 1|
| 4| 1|
+------+-----+
Make sure to experiment with spark.graphx.pregel.checkpointInterval if you are getting OOMs traversing large graphs (or configure maxIterations in the pregel call); a sketch of both follows below.
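Here is a rough sketch of both knobs, reusing the definitions above (assuming Spark 2.2+ for the checkpoint setting, which also needs a checkpoint directory):
val bounded = graph
  .mapVertices((_, _) => initList)
  .pregel(
    initialMsg = initList,
    maxIterations = 10,                 // stop after at most 10 supersteps
    activeDirection = EdgeDirection.Out
  )(vprog, sendMsg, mergeMsg)
  .mapVertices((_, neighbors) => neighbors.length)

// Periodic checkpointing truncates the RDD lineage on large graphs, e.g.
//   spark-submit --conf spark.graphx.pregel.checkpointInterval=2 ...
// and requires sc.setCheckpointDir(...) to point at reliable storage.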

UDF to randomly assign values based on different probabilities

I would like to create a UDF to randomly assign values based on different probabilities.
In the following example, depending on the value returned by rand:
0 to 0.5: the value should be A (50% probability)
0.8 to 1: the value should be B (20% probability)
anything else: the value should be C (30% probability)
val names = Array("A", "B", "C")

val allocate = udf((p: Double) => {
  if (p < 0.5) names(0)
  else if (p > 0.8) names(1)
  else names(2)
})

val test = sqlContext.range(0, 100).select(
  $"id",
  round(abs(rand), 2).alias("val"),
  allocate(abs(rand)).alias("name")
)
However when I print the result the names are not assigned based on the rules defined in the UDF.
+---+----+----+
| id| val|name|
+---+----+----+
| 0|0.17| C| => should be A
| 1|0.12| A|
| 2|0.36| A|
| 3|0.56| B|
| 4|0.82| A|=> should be C
There is nothing unexpected going on here. You call the rand function twice, so you get two different random values.
Either provide the same seed for both calls:
sqlContext.range(0, 100)
  .select(
    $"id",
    abs(rand(1)).alias("val"),
    allocate(abs(rand(1))).alias("name")
  )
or reuse the value:
sqlContext.range(0, 100)
  .withColumn("val", abs(rand))
  .withColumn("name", allocate($"val"))

spark UDF Java Error: Method col([class java.util.ArrayList]) does not exist

I have a python dict as:
fileClass = {'a1' : ['a','b','c','d'], 'b1':['a','e','d'], 'c1': ['a','c','d','f','g']}
and a list of tuples as:
C = [('a','b'), ('c','d'),('e')]
I want to finally create a spark dataframe as:
Name (a,b) (c,d) (e)
a1 2 2 0
b1 1 1 1
c1 1 2 0
which simply contains, for each key in fileClass, the counts of the elements of each tuple in C that appear in that key's list.
To do this I create a dict mapping each element to a column index:
classLoc = {'a':0,'b':0,'c':1,'d':1,'e':2}
Then I use udf to define:
import numpy as np

def convertDictToDF(v, classLoc, length):
    R = np.zeros((1, length))
    for c in v:
        try:
            loc = classLoc[c]
            R[loc] += 1
        except:
            pass
    return R

udfConvertDictToDF = udf(convertDictToDF, ArrayType(IntegerType()))

df = sc.parallelize([
    [k] + list(udfConvertDictToDF(v, classLoc, len(C)))
    for k, v in fileClass.items()]).toDF(['Name'] + C)
Then I got this error message:
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
<ipython-input-40-ab668a12838a> in <module>()
1 df = sc.parallelize([
2 [k] + list(udfConvertDictToDF(v,classLoc, len(C)))
----> 3 for k, v in fileClass.items()]).toDF(['Name'] + C)
4
5 df.show()
/home/yizhng/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/functions.pyc in __call__(self, *cols)
1582 def __call__(self, *cols):
1583 sc = SparkContext._active_spark_context
-> 1584 jc = self._judf.apply(_to_seq(sc, cols, _to_java_column))
1585 return Column(jc)
1586
/home/yizhng/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/column.pyc in _to_seq(sc, cols, converter)
58 """
59 if converter:
---> 60 cols = [converter(c) for c in cols]
61 return sc._jvm.PythonUtils.toSeq(cols)
62
/home/yizhng/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/column.pyc in _to_java_column(col)
46 jcol = col._jc
47 else:
---> 48 jcol = _create_column_from_name(col)
49 return jcol
50
/home/yizhng/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/column.pyc in _create_column_from_name(name)
39 def _create_column_from_name(name):
40 sc = SparkContext._active_spark_context
---> 41 return sc._jvm.functions.col(name)
42
43
/home/yizhng/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/home/yizhng/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/home/yizhng/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JError(
311 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
--> 312 format(target_id, ".", name, value))
313 else:
314 raise Py4JError(
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:360)
at py4j.Gateway.invoke(Gateway.java:254)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
I don't understand what is wrong with my UDF that leads to that error message. Please help.
I think it has to do with the way you are using this line at the bottom:
[k] + list(udfConvertDictToDF(v, classLoc, len(C)))
A udf returns a Column expression meant to be used inside DataFrame operations; calling it directly on plain Python objects makes PySpark try to convert its arguments (here a Python list) into columns, which is where the Method col([class java.util.ArrayList]) does not exist error comes from.
When I do a simple Python version of it I get an error as well.
import numpy as np

C = [('a','b'), ('c','d'),('e')]
classLoc = {'a':0,'b':0,'c':1,'d':1,'e':2}

def convertDictToDF(v, classLoc, length):
    # I also got rid of (1,length) for (length)
    # b/c pandas .from_dict() method handles this for me
    R = np.zeros(length)
    for c in v:
        try:
            loc = classLoc[c]
            R[loc] += 1
        except:
            pass
    return R

[[k] + convertDictToDF(v, classLoc, len(C))
 for k, v in fileClass.items()]
which produces this error:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
If you were to change the list comprehension to a dict comprehension, you could get it to work.
dict = {k: convertDictToDF(v, classLoc, len(C))
        for k, v in fileClass.items()}
the output of which looks like this
> {'a1': array([ 2., 2., 0.]), 'c1': array([ 1., 2., 0.]), 'b1': array([ 1., 1., 1.])}
Without knowing what your end use case is, I'm going to get you to the output you requested, but in a slightly different way, which may not scale how you'd like, so I'm sure there's a better way.
The following code will get you the rest of the way to a dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(data=dict,orient='index').sort_index()
df.columns=C
which produces your desired output
(a, b) (c, d) e
a1 2.0 2.0 0.0
b1 1.0 1.0 1.0
c1 1.0 2.0 0.0
And this will get you a Spark dataframe
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_s = sqlContext.createDataFrame(df)
df_s.show()
+----------+----------+---+
|('a', 'b')|('c', 'd')| e|
+----------+----------+---+
| 2.0| 2.0|0.0|
| 1.0| 1.0|1.0|
| 1.0| 2.0|0.0|
+----------+----------+---+

Prepare data for MultilayerPerceptronClassifier in scala

Please keep in mind I'm new to scala.
This is the example I am trying to follow:
https://spark.apache.org/docs/1.5.1/ml-ann.html
It uses this dataset:
https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt
I have prepared my .csv using the code below to get a data frame for classification in Scala.
//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")
//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");
scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])
//define double conversion udf
val toDouble = udf[Double, String](_.toDouble)

//Convert all to double
val featureDf = DF2
  .withColumn("gst_id_matched", toDouble(DF2("gst_id_matched")))
  .withColumn("ip_crowding", toDouble(DF2("ip_crowding")))
  .withColumn("lat_long_dist", toDouble(DF2("lat_long_dist")))
  .select("gst_id_matched", "ip_crowding", "lat_long_dist")

//Define the format
val toVec4 = udf[Vector, Double, Double] { (v1, v2) => Vectors.dense(v1, v2) }

//Format for the label, which is gst_id_matched
val encodeLabel = udf[Double, String](_ match { case "0.0" => 0.0 case "1.0" => 1.0 })

//Transformed dataset
val df = featureDf
  .withColumn("features", toVec4(featureDf("ip_crowding"), featureDf("lat_long_dist")))
  .withColumn("label", encodeLabel(featureDf("gst_id_matched")))
  .select("label", "features")

val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)

// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)

// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(12)
  .setSeed(1234L)
  .setMaxIter(10)

// train the model
val model = trainer.fit(train)
The last line generates this error
15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0
My suspicions:
When I examine the dataset, it looks fine for classification:
scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])
But the Apache example dataset is different, and my transformation does not give me what I need. Can someone please help me with the dataset transformation, or help me understand the root cause of the problem?
This is what the apache dataset looks like:
scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])
The source of your problems is a wrong definition of layers. When you use
val layers = Array[Int](0, 0, 0, 0)
it means you want a network with zero nodes in each layer, which simply doesn't make sense. Generally speaking, the number of neurons in the input layer should be equal to the number of features, and each hidden layer should contain at least one neuron.
Let's recreate your data, simplifying your code on the way:
import org.apache.spark.sql.functions.col
val df = sc.parallelize(Seq(
  ("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")
Convert all columns to doubles:
val numeric = df
  .select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
  .withColumnRenamed("gst_id_matched", "label")
Assemble features:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
  .setInputCols(Array("ip_crowding", "lat_long_dist"))
  .setOutputCol("features")

val data = assembler.transform(numeric)
data.show
// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist| features|
// +-----+-----------+-------------+-----------------+
// | 0.0| 0.0| 0.0| (2,[],[])|
// | 0.0| 0.0| 1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
Train and test network:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
val model = trainer.fit(data)
model.transform(data).show
// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist| features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// | 0.0| 0.0| 0.0| (2,[],[])| 0.0|
// | 0.0| 0.0| 1628859.542|[0.0,1628859.542]| 0.0|
// +-----+-----------+-------------+-----------------+----------+
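To close the loop with the linked example, here is a sketch of evaluating the predictions with the MulticlassClassificationEvaluator already imported in the question ("precision" is the metric name used in Spark 1.5/1.6; newer versions use "accuracy"). With real data you would evaluate on a held-out test split rather than this toy frame:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val predictionAndLabels = model.transform(data).select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision: " + evaluator.evaluate(predictionAndLabels))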

JAAD cannot read m4a

What I wish to do is to generate a preview for every m4a file. I am trying to do this with Java Sound and JAAD.
Here is my attempt in Scala
import java.io.{File, FileOutputStream}
import javax.sound.sampled.AudioSystem

/**
 * Created by khanguyen on 7/21/15.
 */
object Main extends App {
  val filePath = "audio.m4a"
  val file = new File(filePath)
  val audio = AudioSystem.getAudioInputStream(file)

  println(audio.getFrameLength) // return -1
  println(audio.getFormat) // return PCM_SIGNED 0.0 Hz, 0 bit, 0 channels, 0 bytes/frame,

  val output = new FileOutputStream("outputaudio.m4a")
  var buffer = Array.fill[Byte](1024)(0)

  for (i <- 0 to 1024) {
    audio.read(buffer, i * 1024, 1024)
    buffer.take(10).map(println)
    output.write(buffer)
  }

  audio.close()
  output.flush()
  output.close()
}
I cannot read anything from the audio input stream. The frameLength is said to be -1. After a read pass, all the bytes in Array[Byte] are still 0. Am I missing anything?
If you are getting a -1 from audio.getFrameLength, it is because the file format is not supported.
val audio = AudioSystem.getAudioInputStream(file)
//> javax.sound.sampled.UnsupportedAudioFileException: could not get audio input
//| stream from input file
//| at javax.sound.sampled.AudioSystem.getAudioInputStream(AudioSystem.java:
//| 1187)
//| at forcomp.wc$$anonfun$main$1.apply$mcV$sp(forcomp.wc.scala:15)
//| at org.scalaide.worksheet.runtime.library.WorksheetSupport$$anonfun$$exe
//| cute$1.apply$mcV$sp(WorksheetSupport.scala:76)
//| at org.scalaide.worksheet.runtime.library.WorksheetSupport$.redirected(W
//| orksheetSupport.scala:65)
//| at org.scalaide.worksheet.runtime.library.WorksheetSupport$.$execute(Wor
//| ksheetSupport.scala:75)
//| at forcomp.wc$.main(forcomp.wc.scala:5)
//| at forcomp.wc.main(forcomp.wc.scala)
println(audio.getFrameLength) // return -1
Anyway, I've tested it with a pet m4a example file. In the javax.sound.sampled.AudioSystem API (http://docs.oracle.com/javase/7/docs/api/javax/sound/sampled/AudioSystem.html) you will find more information about the supported file types; AAC (the m4a extension is part of the AAC coding family) should be supported as long as a provider for it, such as JAAD, is available on the classpath.
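As a small sketch of how to probe support explicitly (assuming the JAAD jar, which registers a javax.sound AudioFileReader service provider, is on the classpath):
import java.io.File
import javax.sound.sampled.{AudioSystem, UnsupportedAudioFileException}

val file = new File("audio.m4a")
try {
  // Throws UnsupportedAudioFileException when no installed provider
  // understands the container, which matches the worksheet trace above.
  val fileFormat = AudioSystem.getAudioFileFormat(file)
  println(s"Type: ${fileFormat.getType}, format: ${fileFormat.getFormat}")
} catch {
  case _: UnsupportedAudioFileException =>
    println("No installed provider can read this file; add an AAC/M4A SPI such as JAAD to the classpath")
}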