I have tried these two variables:
val km = (1,2,4.3,false)
val klpd = (1,2)
In the second case I see Tuple2[Int,Int], but the first case shows Tuple4[Integer,Integer,Double,Boolean] in memory, i.e. when inspecting the variable type in IntelliJ/Eclipse.
So Scala is dropping the primitive type Int and storing it as Integer.
The same happens if I add an Int to an Array[AnyVal].
PS: I am using Scala 2.10.4, and my REPL output doesn't match that of Eclipse.
In Scala, tuples are represented by classes taking generic type parameters. There are 22 such classes, but only Tuple2 is annotated to specialize (optimize) for primitive types. Anything from Tuple3 onwards will box the primitives.
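A quick way to see this (a minimal sketch of my own, assuming a Scala 2 REPL; the class names in the comments are what I would expect, not quoted output) is to look at the runtime classes of the two tuples:

val pair = (1, 2)              // Tuple2 is @specialized for Int, so the Ints stay unboxed
val quad = (1, 2, 4.3, false)  // Tuple4 is not specialized, so its elements are boxed

println(pair.getClass)  // expected: class scala.Tuple2$mcII$sp (specialized subclass)
println(quad.getClass)  // expected: class scala.Tuple4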
In Scala I am making use of the Cats library and its mapN() function. Let's say I have created the following case class with 22 elements. The following code works without any problem.
case class Example1(element1: String, element2: String, ..., element22: String)
("string1", ..., "string22").mapN(Example1.apply)
Let's say I want an element23. Unfortunately the compiler complains that arguments are missing because the mapping no longer lines up. Is there any way of adding the 23rd element, given that mapN() appears to handle tuples of at most 22 elements?
To my knowledge, there is no support for mapN for tuples of arity > 22. One could probably change that in cats by fiddling around with maxArity here (though that would require doing that for Scala 3 only).
However: for your specific use case (converting tuples to a case class), Scala 3 (which I assume you are using because Scala 2 does not support tuples of arity > 22) would offer a simple solution: see Converting between tuples and case classes in Scala 3 and this runnable scastie example.
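For completeness, here is a minimal sketch of that conversion in Scala 3 (my own illustration with a small case class, not the linked scastie code):

import scala.deriving.Mirror

case class Example1(element1: String, element2: String, element3: String)

val m: Mirror.ProductOf[Example1] = summon[Mirror.ProductOf[Example1]]

// Tuple -> case class: fromProduct accepts any Product, including tuples of any arity
val fromTuple: Example1 = m.fromProduct(("a", "b", "c"))

// Case class -> tuple
val asTuple: (String, String, String) = Tuple.fromProductTyped(fromTuple)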
I’m new to using Scala and am trying to see if a list contains any objects of a certain type.
When I make a method to do this, I get the following results:
var l = List("Some string", 3)
def containsType[T] = l.exists(_.isInstanceOf[T])
containsType[Boolean] // val res0: Boolean = true
l.exists(_.isInstanceOf[Boolean]) // val res1: Boolean = false
Could someone please help me understand why my method doesn’t return the same results as the expression on the last line?
Thank you,
Johan
Alin's answer details perfectly why the generic isn't available at runtime. You can get a bit closer to what you want with the magic of ClassTag, but you still have to be conscious of some issues with Java generics.
import scala.reflect.ClassTag
var l = List("Some string", 3)
def containsType[T](implicit cls: ClassTag[T]): Boolean = {
  l.exists(cls.runtimeClass.isInstance(_))
}
Now, whenever you call containsType, a hidden extra argument of type ClassTag[T] gets passed to it. So when you write, for instance, println(containsType[String]), this gets compiled to
scala.this.Predef.println($anon.this.containsType[String](ClassTag.apply[String](classOf[java.lang.String])))
An extra argument gets passed to containsType, namely ClassTag.apply[String](classOf[java.lang.String]). That's a really long winded way of explicitly passing a Class<String>, which is what you'd have to do in Java manually. And java.lang.Class has an isInstance function.
Now, this will mostly work, but there are still major caveats. Generics arguments are completely erased at runtime, so this won't help you distinguish between an Option[Int] and an Option[String] in your list, for instance. As far as the JVM is concerned, they're both Option.
Second, Java has an unfortunate history with primitive types, so containsType[Int] will actually be false in your case, despite the fact that the 3 in your list is actually an Int. This is because, in Java, generics can only be class types, not primitives, so a generic List can never contain int (note the lowercase 'i', this is considered a fundamentally different thing in Java than a class).
Scala paints over a lot of these low-level details, but the cracks show through in situations like this. Scala sees that you're constructing a list of Strings and Ints, so it wants to construct a list of the common supertype of the two, which is Any (strings and ints have no common supertype more specific than Any). At runtime, Scala Int can translate to either int (the primitive) or Integer (the object). Scala will favor the former for efficiency, but when storing in generic containers, it can't use a primitive type. So while Scala thinks that your list l contains a String and an Int, Java thinks that it contains a String and a java.lang.Integer. And to make things even crazier, both int and java.lang.Integer have distinct Class instances.
So the runtimeClass of summon[ClassTag[Int]] in Scala is java.lang.Integer.TYPE, a Class<Integer> instance representing the primitive type int (yes, the non-class type int has a Class instance representing it), while the runtimeClass of summon[ClassTag[java.lang.Integer]] is classOf[java.lang.Integer], a distinct Class<Integer> representing the non-primitive type Integer. And at runtime, your list contains the latter.
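Here is a small sketch of my own (using implicitly so it compiles on both Scala 2 and 3) that makes the two distinct Class instances visible:

import scala.reflect.ClassTag

val primitive = implicitly[ClassTag[Int]].runtimeClass               // int (Integer.TYPE)
val boxed     = implicitly[ClassTag[java.lang.Integer]].runtimeClass // class java.lang.Integer

println(primitive == boxed)           // false: two different Class instances
println(primitive.isInstance(3: Any)) // false: nothing is an instance of a primitive class at runtime
println(boxed.isInstance(3: Any))     // true: 3 is boxed to java.lang.Integer when widened to Any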
In summary, generics in Java are a hot mess. Scala does its best to work with what it has, but when you start playing with reflection (which ClassTag does), you have to start thinking about these problems.
println(containsType[Boolean]) // false
println(containsType[Double]) // false
println(containsType[Int]) // false (list can't contain primitive type)
println(containsType[Integer]) // true (3 is converted to an Integer)
println(containsType[String]) // true (class type so it works the way you expect)
println(containsType[Unit]) // false
println(containsType[Long]) // false
Scala uses the type erasure model of generics. This means that no information about type arguments is kept at runtime, so there's no way to determine at runtime the specific type arguments of the given List object. All the system can do is determine that a value is a List of some arbitrary type parameters.
You can verify this behavior by trying any concrete List type:
val l = List("Some string", 3)
println(l.isInstanceOf[List[Int]]) // true
println(l.isInstanceOf[List[String]]) // true
println(l.isInstanceOf[List[Boolean]]) // also true
println(l.isInstanceOf[List[Unit]]) // also true
Now regarding your example:
def containsType[T] = l.exists(_.isInstanceOf[T])
println(containsType[Int]) // true
println(containsType[Boolean]) // also true
println(containsType[Unit]) // also true
println(containsType[Double]) // also true
isInstanceOf is a synthetic function (a function generated by the Scala compiler at compile time, usually to work around underlying JVM limitations), and it does not work the way you would expect with generic type arguments like T, because after compilation this would normally be equivalent to Java's instanceof T, which, by the way, is illegal in Java.
Why is it illegal? Because of type erasure. Type erasure means all your generic code (generic classes, generic methods, etc.) is converted to non-generic code. This usually means three things (a rough sketch of the effect follows the list below):
all type parameters in generic types are replaced with their bounds or Object if they are unbounded;
wherever necessary the compiler inserts type casts to preserve type-safety;
bridge methods are generated if needed to preserve polymorphism of all generic methods.
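As a rough illustration of the first two points (my own sketch with a hypothetical firstOf method; the erased form is shown in comments, not real compiler output):

def firstOf[T](xs: List[T]): T = xs.head

// After erasure this is roughly:
//   def firstOf(xs: List): Object = xs.head   // T replaced by its bound, Object
// and a call site such as
//   val s: String = firstOf(List("a", "b"))
// gets a compiler-inserted cast to preserve type safety:
//   val s: String = firstOf(List("a", "b")).asInstanceOf[String]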
However, in the case of instanceof T, the JVM cannot differentiate between types of T at execution time, so this makes no sense. The type used with instanceof has to be reifiable, meaning that all information about the type needs to be available at runtime. This property does not apply to generic types.
So if Java forbids this because it can't work, why does Scala even allow it? The Scala compiler is indeed more permissive here, but for a good reason: it treats it differently. Like the Java compiler, the Scala compiler also erases all generic code at compile time, but since isInstanceOf is a synthetic function in Scala, calls to it using generic type arguments such as isInstanceOf[T] are replaced during compilation with instanceof Object.
Here's a sample of your code decompiled:
public <T> boolean containsType() {
    return this.l().exists(x$1 -> BoxesRunTime.boxToBoolean(x$1 instanceof Object));
}
Main$.l = (List<Object>)package$.MODULE$.List().apply((Seq)ScalaRunTime$.MODULE$.wrapIntArray(new int[] { 1, 2, 3 }));
Predef$.MODULE$.println((Object)BoxesRunTime.boxToBoolean(this.containsType()));
Predef$.MODULE$.println((Object)BoxesRunTime.boxToBoolean(this.containsType()));
This is why no matter what type you give to the polymorphic function containsType, it will always result in true. Basically, containsType[T] is equivalent to containsType[_] from Scala's perspective, which actually makes sense because a generic type T, without any upper bounds, is just a placeholder for type Any in Scala. Because Scala cannot have raw types, you cannot, for example, create a List without providing a type parameter, so every List must be a List of "something", and that "something" is at least an Any, if not given a more specific type.
Therefore, isInstanceOf can only be called with specific (concrete) type arguments like Boolean, Double, String, etc. That is why this works as expected:
println(l.exists(_.isInstanceOf[Boolean])) // false
We said that Scala is more permissive, but that does not mean you get away without a warning.
To alert you of the possibly non-intuitive runtime behavior, the Scala compiler usually emits unchecked warnings. For example, if you had run your code in the Scala interpreter (or compiled it using scalac), you would have received an unchecked warning about the type T being eliminated by erasure.
I've been using Spark 2.4 for a while and just started switching to Spark 3.0 in the last few days. I got this error after switching to Spark 3.0 when running udf((x: Int) => x, IntegerType):
Caused by: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;
These solutions are proposed by Spark itself, and after googling for a while I got to the Spark migration guide page:
In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Remove the return type parameter to automatically switch to typed Scala udf is recommended, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
source: Spark Migration Guide
I notice that my usual way of using the functions.udf API, which is udf(AnyRef, DataType), is called an UnTyped Scala UDF, and the proposed solution, which is udf(AnyRef), is called a Typed Scala UDF.
To my understanding, the first one looks more strictly typed than the second one, since the first has its output type explicitly defined and the second does not, hence my confusion about why it's called UnTyped.
Also, the function passed to udf, which is (x: Int) => x, clearly has its input type defined, yet Spark claims "You're using untyped Scala UDF, which does not have the input type information"?
Is my understanding correct? Even after more intensive searching I still can't find any material explaining what an UnTyped Scala UDF and a Typed Scala UDF actually are.
So my questions are: what are they, and what are their differences?
With a typed Scala UDF, the UDF knows the types of the columns passed as arguments, whereas with an untyped Scala UDF it does not.
When creating a typed Scala UDF, the argument and output types of the UDF are inferred from the function's argument and output types, whereas when creating an untyped Scala UDF there is no type inference at all, either for the arguments or for the output.
What can be confusing is that when creating a typed UDF the types are inferred from the function and not explicitly passed as arguments. To be more explicit, you can write the typed UDF creation as follows:
val my_typed_udf = udf[Int, Int]((x: Int) => x)
Now, let's look at the two points you raised.
To my understanding, the first one (e.g. udf(AnyRef, DataType)) looks more strictly typed than the second one (e.g. udf(AnyRef)), because the first has its output type explicitly defined and the second does not, hence my confusion about why it's called UnTyped.
According to the Spark functions Scaladoc, the signatures of the udf functions that turn a function into a UDF are, for the first one:
def udf(f: AnyRef, dataType: DataType): UserDefinedFunction
And for the second one:
def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction
So the second one is actually more typed than the first one, as it takes into account the type of the function passed as argument, whereas the first one erases it.
That's why with the first one you need to define the return type: Spark needs this information but can't infer it from the function passed as argument, since its type is erased. With the second one, the return type is inferred from the function passed as argument.
Also, the function passed to udf, which is (x: Int) => x, clearly has its input type defined, yet Spark claims "You're using untyped Scala UDF, which does not have the input type information"?
What is important here is not the function itself, but how Spark creates a UDF from it.
In both cases, the function to be turned into a UDF has its input and return types defined, but those types are erased and not taken into account when creating the UDF with udf(AnyRef, DataType).
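To make the difference concrete, here is a small sketch of my own (assuming Spark 3.x, with the legacy flag enabled for the untyped call) showing both creation styles side by side:

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Untyped: the function is passed as AnyRef, so its Int => Int type is erased;
// only the return DataType you supply is known. Disallowed by default in Spark 3.0
// unless spark.sql.legacy.allowUntypedScalaUDF is set to true.
val untypedIdentity = udf((x: Int) => x, IntegerType)

// Typed: resolves to udf[RT: TypeTag, A1: TypeTag](f: A1 => RT), so both the input
// and the return types are captured from the function itself.
val typedIdentity = udf((x: Int) => x)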
This doesn't answer your original question about what the different UDFs are, but if you want to get rid of the error, in Python you can include this line in your script: spark.sql("set spark.sql.legacy.allowUntypedScalaUDF=true").
Can anyone tell me why this limitation exists? Is it related to the JVM or to the Scala compiler?
$ scala
Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79).
Type in expressions to have them evaluated.
Type :help for more information.
scala> def toText(ints: Int*, strings: String*) = ints.mkString("") + strings.mkString("")
<console>:7: error: *-parameter must come last
def toText(ints: Int*, strings: String*) = ints.mkString("") + strings.mkString("")
A method in Scala can have multiple varargs if you use multiple parameter lists (curried syntax):
scala> def toText(ints: Int*)(strings: String*) =
ints.mkString("") + strings.mkString("")
scala> toText(1,2,3)("a", "b")
res0: String = 123ab
Update: Multiple varargs in a single parameter list would create a syntax problem: how would the compiler know which parameter a given argument belongs to (where does one list of arguments end and the next begin, especially if they are of compatible types)?
In theory, if the syntax of the language were modified so that one could distinguish the first and second argument lists (without currying), there's no reason this wouldn't work at the JVM level, because varargs are just compiled into an array anyway.
But I very much doubt that it's a common enough case to justify complicating the language further, especially when a solution already exists.
See also this related Java question and answer.
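If you would rather keep a single parameter list, one workaround (my own sketch with a hypothetical toText2, not part of the answer above) is to take explicit Seqs, so the boundary between the two groups of arguments is unambiguous:

def toText2(ints: Seq[Int], strings: Seq[String]): String =
  ints.mkString("") + strings.mkString("")

toText2(Seq(1, 2, 3), Seq("a", "b"))  // "123ab"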
Scala runs on the Java Virtual Machine, which accepts varargs only as the last parameter, and only one per parameter list. There is no working around this; it is how the compiler works.
To put it into perspective, imagine a method signature like this:
someMethod(strings1: String*, strings2: String*)
Let's say you pass 4 separate Strings when calling it. The compiler would not know which String object belongs to which vararg.
I'm porting some java code across and have the following
val overnightChanges: java.util.Hashtable[String, Double] = ...
When I try
if (null != overnightChanges.get(...))
I get the following warning
warning: comparing values of types Null and Double using `!=' will always yield true
Primitive and reference types are much less distinct in Scala than they are in Java, so the convention is that names start with an uppercase letter for all of them. Double is scala.Double, which is the primitive Java double, not the reference java.lang.Double.
When you need "a double or no value" in Scala, you would use Option[Double] most of the time. Option has strong library support, and the type system will not let you ignore that there might be no value. However, when you need to interact closely with Java, as in your example, your table does contain java.lang.Double and you should say so.
val a = new java.util.HashMap[String, java.lang.Double]
If java.lang.Double starts to appear everywhere in your code, you can alias it to JDouble, either by importing
import java.lang.{Double => JDouble}
or by defining
type JDouble = java.lang.Double
There are implicit conversions between scala.Double and java.lang.Double, so interaction should be reasonably smooth. However, java.lang.Double should probably be confined to the Scala/Java interaction layer; it would be confusing to have it go deep into Scala code.
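Here is a minimal sketch of my own (the change helper is hypothetical) that keeps java.lang.Double at the Java boundary and exposes Option[Double] to the rest of the Scala code:

import java.lang.{Double => JDouble}

val overnightChanges = new java.util.Hashtable[String, JDouble]()
overnightChanges.put("ABC", JDouble.valueOf(1.25))

// Hashtable.get returns null for a missing key; Option(...) turns that into None.
def change(key: String): Option[Double] =
  Option(overnightChanges.get(key)).map(_.doubleValue)

change("ABC")  // Some(1.25)
change("XYZ")  // None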
In Scala, Double is a primitive and thus cannot be null. That's annoying when using Java maps directly, because when a key is not defined you get the default primitive value (here 0.0):
scala> val a = new java.util.Hashtable[String,Double]()
a: java.util.Hashtable[String,Double] = {}
scala> a.get("Foo")
res9: Double = 0.0
If the value is an object like String or List, your code should work as expected.
So, to solve the problem, you can:
Use contains in an outer if condition.
Use one of the Scala maps (many conversions are defined in scala.collection.JavaConversions); see the sketch after this list.
Use Scala "options", also known as "maybe" in Haskell:
http://blog.danielwellman.com/2008/03/using-scalas-op.html
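A short sketch of the second option above (my own, using the newer scala.jdk.CollectionConverters API from Scala 2.13+ rather than the older JavaConversions):

import scala.jdk.CollectionConverters._

val table = new java.util.Hashtable[String, java.lang.Double]()
table.put("ABC", java.lang.Double.valueOf(1.25))

// Viewing the table as a Scala mutable.Map makes get return an Option.
val view = table.asScala
view.get("ABC").map(_.doubleValue)  // Some(1.25)
view.get("XYZ")                     // None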