pyspark FPGrowth doesn't work with RDD - pyspark

I am trying to use the FPGrowth function on some data in Spark. I tested the example here with no problems:
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
However, my dataset comes from Hive:
data = hiveContext.sql('select transactionid, itemid from transactions')
model = FPGrowth.train(data, minSupport=0.1, numPartitions=100)
This failed with a "Method does not exist" error:
py4j.protocol.Py4JError: An error occurred while calling o764.trainFPGrowthModel. Trace:
py4j.Py4JException: Method trainFPGrowthModel([class org.apache.spark.sql.DataFrame, class java.lang.Double, class java.lang.Integer]) does not exist
So, I converted it to an RDD:
data=data.rdd
Now I started getting some strange pickle serializer errors:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
Then I started looking at the types. In the example, the data is run through a flatMap, which returns a different type than the RDD from hiveContext.
RDD type returned by flatMap: pyspark.rdd.PipelinedRDD
RDD type returned by hiveContext: pyspark.rdd.RDD
FPGrowth only seems to work with the PipelinedRDD. Is there some way I can convert a regular RDD to a PipelinedRDD?
Thanks!

Well, my query was wrong, but I changed it to use collect_set, and then I managed to get around the type error by doing:
data=data.map(lambda row: row[0])

Related

org.mockito.exceptions.misusing.WrongTypeOfReturnValue on spark test cases

I'm currently writing test cases for Spark with Mockito, and I'm mocking a SparkContext that gets wholeTextFiles called on it. I have something like this:
val rdd = sparkSession.sparkContext.makeRDD(Seq(("Java", 20000),
("Python", 100000), ("Scala", 3000)))
doReturn(rdd).when(mockContext).wholeTextFiles(testPath)
However, I keep getting an error saying wholeTextFiles is supposed to output an int:
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: ParallelCollectionRDD cannot be returned by wholeTextFiles$default$2()
wholeTextFiles$default$2() should return int
I know this isn't the case; the Spark docs say that wholeTextFiles returns an RDD. Any tips on how I can fix this? I can't have my doReturn be of type int, because then the rest of my function fails, since I turn the wholeTextFiles output into a DataFrame.
Resolved this with an alternative approach. It works just fine if I use the Mockito.when().thenReturn pattern, but I decided to instead change the entire test case to start a local SparkSession and load some sample files into an RDD, because that's a more in-depth test of my function, in my opinion.
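For reference, here is a minimal sketch of that when().thenReturn pattern, assuming mockito-core is on the test classpath; the path and the sample file contents are placeholders:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.mockito.Mockito.{mock, when}
val sparkSession = SparkSession.builder().master("local[*]").appName("mock-test").getOrCreate()
val testPath = "/some/test/path"
// wholeTextFiles yields (path, content) pairs, so the stubbed value is an RDD[(String, String)]
val rdd = sparkSession.sparkContext.makeRDD(Seq(("file1.txt", "Java"), ("file2.txt", "Python"), ("file3.txt", "Scala")))
val mockContext = mock(classOf[SparkContext])
when(mockContext.wholeTextFiles(testPath)).thenReturn(rdd)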

What are Untyped Scala UDF and Typed Scala UDF? What are their differences?

I've been using Spark 2.4 for a while and just started switching to Spark 3.0 in the last few days. After switching to Spark 3.0, I got this error when running udf((x: Int) => x, IntegerType):
Caused by: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;
The solutions are proposed by Spark itself, and after googling for a while I got to the Spark Migration Guide page:
In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Remove the return type parameter to automatically switch to typed Scala udf is recommended, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
source: Spark Migration Guide
I notice that my usual way of using the functions.udf API, udf(AnyRef, DataType), is called an untyped Scala UDF, and the proposed solution, udf(AnyRef), is called a typed Scala UDF.
To my understanding, the first one looks more strictly typed than the second one, since the first has its output type explicitly defined and the second does not, hence my confusion about why it's called untyped.
Also, the function passed to udf, (x: Int) => x, clearly has its input type defined, yet Spark claims "You're using untyped Scala UDF, which does not have the input type information"?
Is my understanding correct? Even after more intensive searching, I still can't find any material explaining what an untyped Scala UDF is and what a typed Scala UDF is.
So my questions are: What are they? What are their differences?
With a typed Scala UDF, the UDF knows the types of the columns passed as arguments, whereas with an untyped Scala UDF it does not.
When creating a typed Scala UDF, the argument and output types of the UDF are inferred from the function's argument and return types, whereas when creating an untyped Scala UDF there is no type inference at all, either for the arguments or for the output.
What can be confusing is that when creating a typed UDF, the types are inferred from the function and not explicitly passed as arguments. To be more explicit, you can write the typed UDF creation as follows:
val my_typed_udf = udf[Int, Int]((x: Int) => x)
Now, let's look at the two points you raised.
To my understanding, the first one (i.e. udf(AnyRef, DataType)) looks more strictly typed than the second one (i.e. udf(AnyRef)), since the first has its output type explicitly defined and the second does not, hence my confusion about why it's called untyped.
According to the Spark functions Scaladoc, the signatures of the udf functions that turn a function into a UDF are actually, for the first one:
def udf(f: AnyRef, dataType: DataType): UserDefinedFunction
And for the second one:
def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction
So the second one is actually more typed than the first one, as the second one takes into account the type of the function passed as argument, whereas the first one erases the type of the function.
That's why with the first one you need to define the return type: Spark needs this information but can't infer it from the function passed as argument, since its return type is erased. With the second one, the return type is inferred from the function passed as argument.
Also, the function passed to udf, (x: Int) => x, clearly has its input type defined, yet Spark claims "You're using untyped Scala UDF, which does not have the input type information"?
What is important here is not the function, but how Spark creates a UDF from this function.
In both cases, the function to be transformed into a UDF has its input and return types defined, but those types are erased and not taken into account when creating the UDF with udf(AnyRef, DataType).
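To make the difference concrete, here is a small sketch comparing the two APIs on a nullable integer column; the column names and sample data are made up, and the legacy flag is set only so the untyped call does not throw in Spark 3.x:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq((1, Some(10)), (2, None: Option[Int])).toDF("id", "x") // "x" is a nullable int column
// Typed: argument and return types are inferred from the closure, so a null in "x" comes back as null
val typedUdf = udf((x: Int) => x)
// Untyped: the closure's types are erased, so the return type must be supplied explicitly
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
val untypedUdf = udf((x: Int) => x, IntegerType)
// Per the migration guide quoted above, the untyped UDF returns 0 for the null row in Spark 3.0
df.select(typedUdf($"x"), untypedUdf($"x")).show()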
This doesn't answer your original question about what the different UDFs are, but if you want to get rid of the error, in Python you can include this line in your script: spark.sql("set spark.sql.legacy.allowUntypedScalaUDF=true").

How to resolve error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]?

I am learning Apache Spark and trying to execute a small program in the Scala terminal.
I have started DFS, YARN and the history server using the following commands:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
Then, in the Scala terminal, I ran the following commands:
var lines = sc.textFile("/Users/****/Documents/backups/h/*****/input/ncdc/micro-tab/sample.txt");
val records = lines.map(_.split("\t"));
val filters = records.filter(rec => (rec(1) != "9999" && rec(2).matches("[01459]")));
val tuples = filters.map(rec => (rec(0).toInt, rec(1).toInt));
val maxTemps = tuples.reduceByKey((a,b) => Math.max(a,b));
All commands execute successfully except the last one, which throws the following error:
error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]
I found some explanations like:
This comes from using a pair RDD function generically. The reduceByKey method is actually a method of the PairRDDFunctions class, which has an implicit conversion from RDD, so it requires several implicit typeclasses. Normally, when working with simple concrete types, those are already in scope, but you should be able to amend your method to also require those same implicits.
But I am not sure how to achieve this.
Any help on how to resolve this issue?
It seems you are missing an import. Try writing this in the console:
import org.apache.spark.SparkContext._
Then run the above commands again. This import brings in the implicit conversion that lets you use the reduceByKey method.
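For a quick check, here is a minimal snippet you can paste into the same shell, assuming sc is the active SparkContext; the sample key/value pairs are made up:
import org.apache.spark.SparkContext._
val pairs = sc.parallelize(Seq((1950, 22), (1950, 0), (1949, 111)))
// reduceByKey comes from PairRDDFunctions via the implicit conversion brought in above
val maxByKey = pairs.reduceByKey((a, b) => Math.max(a, b))
maxByKey.collect().foreach(println)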

What's the difference between Dataset.map(r=>xx) and Dataframe.map(r=>xx) in Spark2.0?

Somehow in Spark 2.0, I can use DataFrame.map(r => r.getAs[String]("field")) without problems,
but Dataset.map(r => r.getAs[String]("field")) gives an error that r doesn't have the getAs method.
What's the difference between r in a Dataset and r in a DataFrame, and why does r.getAs only work with a DataFrame?
After doing some research on StackOverflow, I found a helpful answer here:
Encoder error while trying to map dataframe row to updated row
Hope it's helpful.
Dataset has a type parameter: class Dataset[T]. T is the type of each record in the Dataset. That T might be anything (well, anything for which you can provide an implicit Encoder[T], but that's beside the point).
A map operation on a Dataset applies the provided function to each record, so the r in the map operations you showed will have the type T.
Lastly, DataFrame is actually just an alias for Dataset[Row], which means each record has the type Row. And Row has a method named getAs that takes a type parameter and a String argument, hence you can call getAs[String]("field") on any Row. For any T that doesn't have this method, this will fail to compile.
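Here is a small sketch that illustrates the difference; the Person case class and the column name "field" are made up for the example:
import org.apache.spark.sql.SparkSession
case class Person(field: String, age: Int)
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val ds = Seq(Person("a", 1), Person("b", 2)).toDS() // Dataset[Person]
val df = ds.toDF() // DataFrame, i.e. Dataset[Row]
// Each DataFrame record is a Row, so getAs is available
val fromDf = df.map(r => r.getAs[String]("field"))
// Each Dataset record is a Person, so you use its fields directly; Person has no getAs method
val fromDs = ds.map(p => p.field)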

How to replace specific columns multiple value in Spark Dataframe?

I am trying to replace or update some specific column values in a DataFrame. Since a DataFrame is immutable, I am trying to transform it into a new DataFrame instead of updating or replacing values in place.
I tried dataframe.replace as explained in the Spark docs, but it gives me the error: value replace is not a member of org.apache.spark.sql.DataFrame
I tried the option below. For passing multiple values, I am passing an array:
val new_df= df.replace("Stringcolumn", Map(array("11","17","18","10"->"12")))
but I am getting the error:
error: overloaded method value array with alternatives
Help is really appreciated!!
To access org.apache.spark.sql.DataFrameNaFunctions such as replace, you have to go through .na. So your code should look something like this:
df.na.replace("Stringcolumn", Map("11" -> "12", "17" -> "12"))
See the DataFrameNaFunctions API documentation for the full list of these functions and how to use them.
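For completeness, here is a self-contained sketch that applies this to the question's case, mapping "11", "17", "18" and "10" all to "12"; the sample data is made up:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq("10", "11", "17", "18", "99").toDF("Stringcolumn")
// na.replace takes the column name and a Map of old -> new values
val new_df = df.na.replace("Stringcolumn", Map("11" -> "12", "17" -> "12", "18" -> "12", "10" -> "12"))
new_df.show()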