What's the difference between Dataset.map(r=>xx) and Dataframe.map(r=>xx) in Spark2.0?

Somehow in Spark2.0, I can use Dataframe.map(r => r.getAs[String]("field")) without problems.
But DataSet.map(r => r.getAs[String]("field")) gives an error saying that r doesn't have the "getAs" method.
What's the difference between r in DataSet and r in DataFrame, and why does r.getAs only work with DataFrame?
After doing some research in StackOverflow, I found a helpful answer here
Encoder error while trying to map dataframe row to updated row
Hope it's helpful

Dataset has a type parameter: class Dataset[T]. T is the type of each record in the Dataset. That T might be anything (well, anything for which you can provide an implicit Encoder[T], but that's beside the point).
A map operation on a Dataset applies the provided function to each record, so the r in the map operations you showed will have the type T.
Lastly, DataFrame is actually just an alias for Dataset[Row], which means each record has the type Row. And Row has a method named getAs that takes a type parameter and a String argument, hence you can call getAs[String]("field") on any Row. For any T that doesn't have this method - this will fail to compile.
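To make the point about T concrete, here is a minimal sketch in plain Scala. Row, Dataset, DataFrame and Person below are simplified stand-ins written for this illustration, not Spark's real classes (the real map also needs an implicit Encoder, which is left out here):

```scala
object DatasetRowSketch {
  // Stand-in for Spark's Row: fields are looked up by name and cast by the caller.
  final case class Row(fields: Map[String, Any]) {
    def getAs[T](name: String): T = fields(name).asInstanceOf[T]
  }

  // Stand-in for Dataset[T]: map hands each record to f with static type T.
  final case class Dataset[T](records: Seq[T]) {
    def map[U](f: T => U): Dataset[U] = Dataset(records.map(f))
  }

  // Mirrors the alias described above: a DataFrame is just a Dataset of Row.
  type DataFrame = Dataset[Row]

  // Hypothetical record type for the typed case.
  final case class Person(field: String)

  val df: DataFrame = Dataset(Seq(Row(Map("field" -> "a")), Row(Map("field" -> "b"))))
  val ds: Dataset[Person] = Dataset(Seq(Person("a"), Person("b")))

  // r is a Row here, so getAs is available.
  val fromDf: Dataset[String] = df.map(r => r.getAs[String]("field"))

  // r is a Person here, so you use its own members instead.
  val fromDs: Dataset[String] = ds.map(r => r.field)

  // ds.map(r => r.getAs[String]("field"))  // does not compile: Person has no getAs
}
```

The commented-out last line is the situation from the question: r has whatever type T is, and only Row brings getAs with it.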


What are Untyped Scala UDF and Typed Scala UDF? What are their differences?

I've been using Spark 2.4 for a while and just started switching to Spark 3.0 in these last few days. I got this error after switching to Spark 3.0 for running udf((x: Int) => x, IntegerType):
Caused by: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;
The solutions are proposed by Spark itself and after googling for a while I got to Spark Migration guide page:
In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Remove the return type parameter to automatically switch to typed Scala udf is recommended, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
source: Spark Migration Guide
I notice that my usual way of using the functions.udf API, which is udf(AnyRef, DataType), is called UnTyped Scala UDF and the proposed solution, which is udf(AnyRef), is called Typed Scala UDF.
To my understanding, the first one looks more strictly typed than the second one where the first one has its output type explicitly defined and the second one does not, hence my confusion on why it's called UnTyped.
Also the function got passed to udf, which is (x:Int) => x, clearly has its input type defined but Spark claiming You're using untyped Scala UDF, which does not have the input type information?
Is my understanding correct? Even after more intensive searching I still can't find any material explaining what is UnTyped Scala UDF and what is Typed Scala UDF.
So my questions are: What are they? What are their differences?
In a typed Scala UDF, the UDF knows the types of the columns passed as arguments, whereas in an untyped Scala UDF it does not.
When you create a typed Scala UDF, the column types for the arguments and the output are inferred from the function's argument and return types, whereas when you create an untyped Scala UDF there is no type inference at all, for either the arguments or the output.
What can be confusing is that, when creating a typed UDF, the types are inferred from the function rather than passed explicitly as arguments. To be more explicit, you can write the typed UDF creation as follows:
val my_typed_udf = udf[Int, Int]((x: Int) => x)
Now, let's look at the two points you raised.
To my understanding, the first one (eg udf(AnyRef, DataType)) looks more strictly typed than the second one (eg udf(AnyRef)) where the first one has its output type explicitly defined and the second one does not, hence my confusion on why it's called UnTyped.
According to the Spark functions scaladoc, the signatures of the udf functions that turn a function into a UDF are actually, for the first one:
def udf(f: AnyRef, dataType: DataType): UserDefinedFunction
And for the second one:
def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction
So the second one is actually more typed than the first one, as the second one takes into account the type of the function passed as argument, whereas the first one erases the type of the function.
That's why on the first one you need to define return type, because spark needs this information but can't infer it from function passed as argument as its return type is erased, whereas in the second one the return type is inferred from function passed as argument.
Also the function got passed to udf, which is (x:Int) => x, clearly has its input type defined but Spark claiming You're using untyped Scala UDF, which does not have the input type information?
What is important here is not the function, but how Spark creates an UDF from this function.
In both cases, the function to be transformed to UDF has its input and return types defined, but those types are erased and not taken into account when creating UDF using udf(AnyRef, DataType).
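The erasure argument can be sketched without Spark. The two methods below are hypothetical stand-ins that only mirror the shape of the two udf signatures quoted above; they use ClassTag from the standard library where Spark uses TypeTag, and they return a description string instead of a UserDefinedFunction:

```scala
import scala.reflect.ClassTag

object UdfTypingSketch {
  // "Untyped" shape: the function arrives as AnyRef, so its input and return
  // types are not visible here. The caller has to name the return type.
  def untyped(f: AnyRef, returnType: String): String =
    s"input: unknown, output: $returnType"

  // "Typed" shape: the compiler supplies evidence for A1 and RT,
  // so both types can be recovered from the function itself.
  def typed[RT: ClassTag, A1: ClassTag](f: A1 => RT): String = {
    val in = implicitly[ClassTag[A1]]
    val out = implicitly[ClassTag[RT]]
    s"input: $in, output: $out"
  }

  // The same function value is passed to both.
  val identityFn = (x: Int) => x

  val untypedInfo: String = untyped(identityFn, "IntegerType")
  val typedInfo: String = typed(identityFn)
}
```

The function is identical in both calls; what differs is how much of its type the receiving method is allowed to see.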
This doesn't answer your original question about what the different UDFs are, but if you want to get rid of the error, in Python you can include this line in your script: spark.sql("set spark.sql.legacy.allowUntypedScalaUDF=true").

Attempting to figure out the correct type for this

Apologies, this is going to be a somewhat noob-ish question. I have an object from the slick library that has a type like this:
Query[(Rep[String], Rep[String]), (String, String), Seq]
I'm trying to write a function which accepts queries as arguments, though the sequences in them are of uncertain length - ie, it could equally well be:
Query[(Rep[String], Rep[String], Rep[String]), (String, String, String), Seq]
So the first two components have three elements rather than two. I cannot figure out how this is done. I have tried various erroneous permutations, like Query[Product[Rep[String]], Product[String], Seq], to no avail, and even what I assumed would be the nuclear option of just using Any doesn't work. My error messages are along the lines of
[error] found : Option[slick.driver.H2Driver.api.Query[(slick.driver.H2Driver.api.Rep[String], slick.driver.H2Driver.api.Rep[String]),(String, St
[error] (which expands to) Option[slick.lifted.Query[(slick.lifted.Rep[String], slick.lifted.Rep[String]),(String, String),Seq]]
[error] required: Option[slick.driver.H2Driver.api.Rep[scala.concurrent.Future[List[String]]]]
[error] (which expands to) Option[slick.lifted.Rep[scala.concurrent.Future[List[String]]]]
[error] ReturnFunctions.completeQuery(db, query, serialize_and_send)
I think my inability to solve this may reflect some fundamental lack of understanding about scala, strongly typed languages in general and possibly also computing as a whole. Should I be resolving this Query to some more definite form before I try to even pass it into a function? I also suspect I'm not interpreting the original type correctly - what do the parentheses mean in this context? Is it that Query is expecting to receive three sets of parameters, one after the other, like when you do fn(arg1)(arg2)(arg3) = ...?
Any help with this troubling dilemma gratefully received.
I also suspect I'm not interpreting the original type correctly - what do the parentheses mean in this context?
You're looking at a reasonably advanced area, but let's try to help.
The Query type always has three type parameters. You'll see them written as Query[M, U, C].
The first parameter, M, is a tuple. That's what the parentheses mean in this context.
In your first example, M is a tuple of two elements; and in the second it's three. The same situation exists for the second parameter, U. There's a bit more detail on this in Essential Slick.
In Scala, you can have generic parameters. That means you can say something along the lines of:
def foo[M, U, C[_]](q: Query[M,U,C]) = ???
We've defined a method with:
three type parameters; and
taking an argument of a query that has those types.
We've not said anything about M, U, or much about C (other than that it's a type that takes a type as an argument). That means there's not a lot we can do with them, but you may not need to.
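As a rough sketch of that idea in plain Scala: Rep and Query below are empty stand-ins with the same type-parameter shape as Slick's, written only for this illustration (label and describe are made-up names), so that the example compiles without Slick on the classpath:

```scala
object QueryShapeSketch {
  // Stand-ins that only copy the type-parameter shape, not Slick's behaviour.
  final class Rep[T]
  final class Query[+E, U, C[_]](val label: String)

  // One method, generic in all three parameters, accepts any tuple size.
  def describe[M, U, C[_]](q: Query[M, U, C]): String = s"query: ${q.label}"

  val two =
    new Query[(Rep[String], Rep[String]), (String, String), Seq]("two columns")
  val three =
    new Query[(Rep[String], Rep[String], Rep[String]), (String, String, String), Seq]("three columns")

  // M and U are inferred as a 2-tuple in the first call and a 3-tuple in the second.
  val describedTwo: String = describe(two)
  val describedThree: String = describe(three)
}
```

Inside describe nothing is known about M or U, which is exactly the limitation mentioned above: the method can pass the query along, but it cannot look inside the tuples.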
A post on query enrichment in Slick gives a longer (related) example which may be of use.
As Dmytro suggests, a better route would be to create a concrete example of what you'd like to achieve and work from there.
Consider the shape of Query type constructor
Query[+E, U, C[_]]
We say Query is a type constructor because it constructs a concrete type out of given type arguments E, U, and C[_], similarly to how a function constructs a concrete value out of given function arguments.
Now let's try to deconstruct the concrete type
Query[(Rep[String], Rep[String]), (String, String), Seq]
into its constituent type parameters. We have
E = (Rep[String], Rep[String])
U = (String, String)
C[_] = Seq
Note (A, B) is just syntactic sugar for Tuple2[A, B] thus
E = (Rep[String], Rep[String]) = Tuple2[Rep[String], Rep[String]]
U = (String, String) = Tuple2[String, String]
C[_] = Seq = Seq
You might be wondering about that underscore in C[_]. This specifies that the type parameter C must be a type constructor as opposed to a concrete type. For example, Seq is a type constructor whilst Seq[Int] is not. Furthermore, you might be wondering about that + in +E. This specifies the inheritance relationship of parameterised types, or in other words, variance. For example, it specifies whether Seq[Dog] is a subtype of Seq[Animal].
Lastly, let's write the resulting concrete type in its full verbosity
Query[Tuple2[Rep[String], Rep[String]], Tuple2[String, String], Seq]
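The variance remark can be checked with a small sketch. Animal, Dog, CovariantBox and InvariantBox are made-up names for this example only:

```scala
object VarianceSketch {
  class Animal
  class Dog extends Animal

  // Hypothetical containers: one covariant (+A), one invariant (A).
  final class CovariantBox[+A](val value: A)
  final class InvariantBox[A](val value: A)

  val dogBox = new CovariantBox[Dog](new Dog)

  // Compiles because of the +: CovariantBox[Dog] is a subtype of CovariantBox[Animal].
  val animalBox: CovariantBox[Animal] = dogBox

  // Without the + the same assignment is rejected by the compiler:
  // val bad: InvariantBox[Animal] = new InvariantBox[Dog](new Dog)

  // Seq is declared covariant in the standard library, so this compiles too.
  val dogs: Seq[Dog] = Seq(new Dog)
  val animals: Seq[Animal] = dogs
}
```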

Dataframe: Adding prefix to all columns in Scala

val prefix = "ABC"
val renamedColumns = df.columns.map(c=> df(c).as(s"$prefix$c"))
val dfNew = df.select(renamedColumns: _*)
I am fairly new to scala and the code above works perfectly to add a prefix to all columns. Can someone please explain the breakdown of how it works?
The second line above will return a map of col1 as ABCcol1, col2 as ABCcol2.... etc
I have trouble understanding what the third line is doing, especially the ": _*" at the end.
thanks for your help in advance.
The third line is an example of Scala's syntactic sugar. Essentially, Scala has ways to shorten just exactly what you are typing, and you have discovered the dreaded :_*.
There are two parts to this small bit, and the : and the _* serve different purposes. The : is a type ascription, which tells the compiler "treat this expression as this type". The _* is the type being ascribed: it marks the value as a sequence to be expanded into varargs. A varargs parameter accepts an arbitrary number of arguments (good resource here), so it lets you pass a method a collection whose number of elements you do not know in advance.
In your example, you are creating a variable called renamedColumns from the columns of your original dataframe, each aliased to its name with the prefix prepended. Although you may know just how many columns are in your df, the compiler does not. When you create dfNew, you are running a select on df and passing in your renamed columns, of which there could be an arbitrary number.
Essentially, you do not know how many columns you may have, so you pass the array in as varargs; the number of arguments is then simply the length of the array at runtime.
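Here is a small sketch of the `: _*` ascription in plain Scala, without Spark. The select method below is a hypothetical stand-in that takes String varargs; the real DataFrame.select takes Column varargs, but the expansion works the same way:

```scala
object VarargsSketch {
  // A varargs method: callers may pass zero or more String arguments.
  def select(cols: String*): Seq[String] = cols.toList

  val prefix = "ABC"
  val columns = Array("col1", "col2", "col3")

  // Same shape as the second line of the question: one new name per column.
  val renamed: Array[String] = columns.map(c => s"$prefix$c")

  // `renamed: _*` expands the array into individual arguments...
  val expanded: Seq[String] = select(renamed: _*)

  // ...so it is equivalent to writing each argument out by hand.
  val explicit: Seq[String] = select("ABCcol1", "ABCcol2", "ABCcol3")

  // select(renamed)  // does not compile: an Array[String] is not a String
}
```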

pyspark FPGrowth doesn't work with RDD

I am trying to use the FPGrowth function on some data in Spark. I tested the example here with no problems:
However, my dataset is coming from hive
data = hiveContext.sql('select transactionid, itemid from transactions')
model = FPGrowth.train(data, minSupport=0.1, numPartitions=100)
This failed with Method does not exist:
py4j.protocol.Py4JError: An error occurred while calling o764.trainFPGrowthModel. Trace:
py4j.Py4JException: Method trainFPGrowthModel([class org.apache.spark.sql.DataFrame, class java.lang.Double, class java.lang.Integer]) does not exist
So, I converted it to an RDD:
Now I start getting some strange pickle serializer errors.
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
Then I start looking at the types. In the example, the data is run through a flatmap. This returns a different type than the RDD.
RDD Type returned by flatmap: pyspark.rdd.PipelinedRDD
RDD Type returned by hiveContext: pyspark.rdd.RDD
FPGrowth only seems to work with the PipelinedRDD. Is there some way I can convert a regular RDD to a PipelinedRDD?
Well, my query was wrong. I changed it to use collect_set, and then
I managed to get around the type error by doing:
data=data.map(lambda row: row[0])

takeSample() function in Spark

I'm trying to use the takeSample() function in Spark and the parameters are - data, number of samples to be taken and the seed. But I don't want to use the seed. I want to have a different answer every time. I'm not able to figure out how I can do that. I tried using System.nanoTime as the seed value but it gave an error since I think the data type didn't match. Is there any other function similar to takeSample() that can be used without the seed? Or is there any other implementation I can use with takeSample() so that I get a different output every time?
System.nanoTime is of type Long, the seed expected by takeSample is of type Int. Hence, takeSample(..., System.nanoTime.toInt) should work.
System.nanoTime returns Long, whereas takeSample expects an Int.
You can feed scala.util.Random.nextInt as a seed value to the takeSample function.
As of Spark version 1.0.0, the seed parameter is optional. See https://issues.apache.org/jira/browse/SPARK-1438.
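The effect of the seed can be seen in a plain-Scala sketch. The sample method below is a hypothetical stand-in built on scala.util.Random, not Spark's takeSample; it only illustrates why a fixed seed repeats the result and a random or defaulted seed does not have to:

```scala
import scala.util.Random

object SeededSampleSketch {
  // Stand-in sampler: shuffles with a generator built from the seed and takes
  // the first num elements. The seed defaults to a fresh random Int.
  def sample[T](data: List[T], num: Int, seed: Int = Random.nextInt()): List[T] =
    new Random(seed).shuffle(data).take(num)

  val data: List[Int] = (1 to 100).toList

  // Same seed, same sample.
  val first: List[Int] = sample(data, 5, seed = 42)
  val second: List[Int] = sample(data, 5, seed = 42)

  // A Long such as System.nanoTime has to be narrowed when the seed is an Int.
  val fromClock: List[Int] = sample(data, 5, seed = System.nanoTime.toInt)

  // No seed given: the default picks a new one on every call.
  val unseeded: List[Int] = sample(data, 5)
}
```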