How is new object instantiation handled in case of Datasets? - scala

I have the following scenario:
case class A(name:String)
class Eq { def isMe(s:String) = s == "ME" }
val a = List(A("ME")).toDS
a.filter(l => new Eq().isMe(l.name))
Does this create a new Eq object for each data point on each executor?

Nice one! I didn't know there is a different filter method for a typed dataset.
In order to answer your question, I will do some deep dive into Spark internals.
filter on a typed Dataset has the following signature:
def filter(func: T => Boolean): Dataset[T]
Note that func is parameterized with T, hence Spark needs to deserialize your object A as well as the function. In the logical plan the filter shows up as a TypedFilter node:
TypedFilter Main$$$Lambda$, class A, [StructField(name,StringType,true)], newInstance(class A)
where Main$$$Lambda$ is the generated class name of the lambda.
During the optimization phase, TypedFilter might be eliminated by the EliminateSerialization rule if the following condition is met:
ds.map(...).filter(...) can be optimized by this rule to save extra deserialization, but ds.map(...).as[AnotherType].filter(...) can not be optimized.
If the rule is applicable, TypedFilter is replaced by Filter.
The catch here is the Filter's condition. It is a special expression named Invoke, where:
targetObject is the filter function Main$$$Lambda$
functionName is apply since it is a regular Scala function.
Spark eventually runs in one of two modes: code generation or interpreted evaluation. Let's concentrate on the first one, as it is the default.
Here is a simplified chain of the method invocations that generate the code:
SparkPlan.execute
//https://github.com/apache/spark/blob/03e30063127fd71bef8a14553381e805fe5b6679/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L596
-> WholeStageCodegenExec.execute
[child: Filter]
-> child.execute
[condition Invoke]
-> Invoke.genCode
//https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L345
-> doGenCode
Simplified code after the generation phase:
final class GeneratedIteratorForCodegenStage1 extends BufferedRowIterator {
    private Object[] references;
    private scala.collection.Iterator input;
    private UnsafeRowWriter writer = new UnsafeRowWriter();
    public GeneratedIteratorForCodegenStage1(Object[] references) {
        this.references = references;
    }
    public void init(scala.collection.Iterator input) {
        this.input = input;
    }
    protected void processNext() throws IOException {
        while (input.hasNext() && !stopEarly()) {
            InternalRow row = (InternalRow) input.next();
            do {
                // Create the A object for this row
                UTF8String value = row.getUTF8String(0);
                A a = new A(value.toString());
                // Filter by calling the user function (references[0]) on A
                boolean result = (Boolean) ((scala.Function1) references[0]).apply(a);
                if (!result) continue;
                writer.write(0, value);
                append(writer.getRow());
            } while (false);
            if (shouldStop()) return;
        }
    }
}
We can see that the projection is constructed with an array of objects passed in the references variable. But where, and how many times, is the references variable instantiated?
It is created in WholeStageCodegenExec and instantiated only once per partition.
This leads us to the answer: although the filter function is created only once per partition and not per data point, the Eq and A objects are created for every data point.
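The per-data-point part of that answer can be checked without Spark. The following is a minimal plain-Java analogy, not Spark code: the class name EqInstantiationDemo and its counter are made up for illustration, and a java.util.stream pipeline stands in for the generated iterator. It only shows that a `new` inside the lambda body runs once per element, while an object built outside the lambda is created once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Plain-Java analogy (no Spark involved): counts how often Eq is constructed.
class EqInstantiationDemo {
    static final AtomicInteger CREATED = new AtomicInteger();

    static class Eq {
        Eq() { CREATED.incrementAndGet(); }
        boolean isMe(String s) { return "ME".equals(s); }
    }

    // `new Eq()` sits inside the lambda body, so it runs for every element.
    static int createdPerElement(List<String> names) {
        CREATED.set(0);
        names.stream().filter(n -> new Eq().isMe(n)).collect(Collectors.toList());
        return CREATED.get();
    }

    // Eq is built once outside the lambda and captured by it.
    static int createdWhenHoisted(List<String> names) {
        CREATED.set(0);
        Eq eq = new Eq();
        names.stream().filter(eq::isMe).collect(Collectors.toList());
        return CREATED.get();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("ME", "YOU", "ME");
        System.out.println(createdPerElement(names));  // prints 3
        System.out.println(createdWhenHoisted(names)); // prints 1
    }
}
```

In Spark, hoisting Eq out of the lambda would additionally require the captured object to be serializable, since the closure is shipped to the executors.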
If you are curious about where the function is added to the code context: at that point javaType is scala.Function1 and value is the implementation, Main$$$Lambda$.

Related

Using Class<T> as a Map key in Haxe

I'd like to store instances of models in a common provider, using their classes or interfaces as keys, and then look them up by class reference. I have written some code:
class Provider {
public function new() { }
public function set<T:Any>(instance:T, ?type:Class<T>) {
if (type == null)
type = Type.getClass(instance);
if (type != null && instance != null)
map.set(type, instance);
}
public function get<T:Any>(type:Class<T>):Null<T> {
return cast map.get(type);
}
var map = new Map<Class<Any>, Any>();
}
...alas, it doesn't even compile.
Should I use the qualified class name as a key rather than the class/interface reference? I'd like to keep the neat design of the get function, which takes a type as its argument and returns an object of exactly that type, without additional casting.
Is it possible or should I change my approach to this problem?
The issue of using Class<T> as a Map key comes up every so often; here is a related discussion. The naive approach of Map<Class<T>, T> fails to compile with something like this:
Abstract haxe.ds.Map has no @:to function that accepts haxe.IMap<Class<Main.T>, Main.T>
There are several different approaches to this problem:
One can use Type reflection to obtain the fully qualified name of a class instance, and then use that as a key in a Map<String, T>:
var map = new Map<String, Any>();
var name = Type.getClassName(Main);
map[name] = value;
For convenience, you would probably want to have a wrapper that does this for you, such as this ClassMap implementation.
A simpler solution is to "trick" Haxe into compiling it by using an empty structure type ({}) as the key type. This causes ObjectMap to be chosen as the underlying map implementation.
var map = new Map<{}, Any>();
map[Main] = value;
However, that allows you to use things as keys that are not of type Class<T>, such as:
map[{foo: "bar"}] = value;
The type safety issues of the previous approach can be remedied by using this ClassKey abstract:
@:coreType abstract ClassKey from Class<Dynamic> to {} {}
This still uses ObjectMap as the underlying map implementation due to the to {} implicit cast. However, using a structure as a key now fails at compile time:
var map = new Map<ClassKey, Any>();
map[{foo: "bar"}] = value; // No @:arrayAccess function accepts arguments [...]
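For comparison, here is how the same provider idea looks in Java, where class objects can be used as map keys directly. This is only a sketch of the design the question asks for (a get that takes a type and returns an instance of that type without a cast at the call site); it is not Haxe code, and the Provider class below is a hypothetical counterpart of the one in the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Java counterpart of the Provider from the question.
class Provider {
    // Keys are class or interface objects; values are the registered instances.
    private final Map<Class<?>, Object> map = new HashMap<>();

    // Register an instance under an explicit class or interface.
    <T> void set(Class<T> type, T instance) {
        if (type != null && instance != null) {
            map.put(type, instance);
        }
    }

    // Look an instance up by type; returns null when nothing is registered.
    <T> T get(Class<T> type) {
        // Class.cast performs a checked cast and passes null through unchanged.
        return type.cast(map.get(type));
    }

    public static void main(String[] args) {
        Provider provider = new Provider();
        provider.set(CharSequence.class, "model");      // keyed by an interface
        CharSequence found = provider.get(CharSequence.class);
        System.out.println(found);                      // prints model
        System.out.println(provider.get(Integer.class)); // prints null
    }
}
```

The generic signature of set ties the key to the value type at compile time, which is the same guarantee the ClassKey abstract is after.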

init block position in class in Kotlin

I recently came across a situation where a property's value is replaced by its default even though I assigned a value through the constructor using an init block.
What I tried was:
class Example(function: Example.() -> Unit) {
    init {
        function()
    }
    var name = "default name"
}
// assigning it like this:
val example = Example { name = "new name" }
// print value
print(example.name) // prints "default name"
After struggling a bit, I found that the position of the init block matters. If I put the init block last in the class, name is first initialized with the default and then function() is called, which replaces the value with "new name".
If I put it first, function() runs before name is initialized, and the value is then replaced by "default name" when the properties are initialized.
This is strange to me. Can anyone explain why this happens?
The reason is that Kotlin runs initializers top to bottom.
From the article "An in-depth look at Kotlin’s initializers": Initializers (property initializers and init blocks) are executed in the order that they are defined in the class, top-to-bottom.
You can define multiple secondary constructors, but only one will be called when you create a class instance unless the constructor explicitly calls another one.
Constructors can also have default argument values which are evaluated each time the constructor is called. Like property initializers, these can be function calls or other expressions that will run arbitrary code.
Initializers are run top to bottom at the beginning of a class’ primary constructor.
This is the correct way:
class Example(function: Example.() -> Unit) {
    var name = "default name"
    init {
        function()
    }
}
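Java behaves the same way for field initializers and instance initializer blocks: they run in textual order. The sketch below mirrors the two Example variants in Java; the names InitOrderDemo, InitFirst and InitLast are invented for this illustration.

```java
// Java analogue of the two Kotlin variants: field initializers and
// instance initializer blocks run in the order they appear in the class.
class InitOrderDemo {
    static class InitFirst {
        { this.name = "new name"; }   // runs first
        String name = "default name"; // runs second and overwrites the value
    }

    static class InitLast {
        String name = "default name"; // runs first
        { this.name = "new name"; }   // runs second, so this value survives
    }

    public static void main(String[] args) {
        System.out.println(new InitFirst().name); // prints default name
        System.out.println(new InitLast().name);  // prints new name
    }
}
```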
A Java constructor is essentially a method that runs after the object has been allocated. The field initializers run before the constructor body.
In Kotlin there are two kinds of constructors: the primary constructor and secondary constructors. I see the primary constructor as a regular Java constructor with built-in support for field encapsulation. After compilation, primary constructor parameters declared as properties (visible to the whole class) become fields at the top of the class.
In Java and in Kotlin, the constructor body is invoked after the class fields are initialized. But in a primary constructor we cannot write any statements. If we want statements to be executed when the object is created, we have to put them in initialization blocks. Init blocks are executed in the order they appear in the class body. We can define multiple init blocks in a class; they are executed from top to bottom.
Let's run an experiment with init blocks.
Test.kt
fun main() {
    Subject("a1")
}
class Element {
    init {
        println("Element init block 1")
    }
    constructor(message: String) {
        println(message)
    }
    init {
        println("Element init block 2")
    }
}
class Subject(private val name: String, e: Element = Element("$name: first element")) {
    private val field1: Int = 1
    init {
        println("$name: first init")
    }
    val e2 = Element("$name: second element")
    init {
        println("$name: second init")
    }
    val e3 = Element("$name: third element")
}
Let's compile the above and run it:
kotlinc Test.kt -include-runtime -d Test.jar
java -jar Test.jar
The output of the above program is:
Element init block 1
Element init block 2
a1: first element
a1: first init
Element init block 1
Element init block 2
a1: second element
a1: second init
Element init block 1
Element init block 2
a1: third element
As you can see, the default argument of the primary constructor was evaluated first, and within Element all the init blocks ran before the body of the secondary constructor. This is because init blocks become part of the constructor in the order they appear in the class body.
Let's compile the Kotlin code to Java bytecode and decompile it back to Java. I used jd-gui to decompile the classes. You can install it with yay -S jd-gui-bin on Arch Linux based distributions.
Here is the output I got after decompiling the Subject.class file:
import kotlin.Metadata;
import kotlin.jvm.internal.DefaultConstructorMarker;
import kotlin.jvm.internal.Intrinsics;
import org.jetbrains.annotations.NotNull;
@Metadata(mv = {1, 6, 0}, k = 1, xi = 48, d1 = {"\000\034\n\002\030\002\n\002\020\000\n\000\n\002\020\016\n\000\n\002\030\002\n\002\b\007\n\002\020\b\030\0002\0020\001B\027\022\006\020\002\032\0020\003\022\b\b\002\020\004\032\0020\005\006\002\020\006R\021\020\007\032\0020\005\006\b\n\000\032\004\b\b\020\tR\021\020\n\032\0020\005\006\b\n\000\032\004\b\013\020\tR\016\020\f\032\0020\rX\006\002\n\000R\016\020\002\032\0020\003X\004\006\002\n\000"}, d2 = {"LSubject;", "", "name", "", "e", "LElement;", "(Ljava/lang/String;LElement;)V", "e2", "getE2", "()LElement;", "e3", "getE3", "field1", ""})
public final class Subject {
    @NotNull
    private final String name;
    private final int field1;
    @NotNull
    private final Element e2;
    @NotNull
    private final Element e3;
    public Subject(@NotNull String name, @NotNull Element e) {
        this.name = name;
        this.field1 = 1;
        System.out
            .println(Intrinsics.stringPlus(this.name, ": first init"));
        this.e2 = new Element(Intrinsics.stringPlus(this.name, ": second element"));
        System.out
            .println(Intrinsics.stringPlus(this.name, ": second init"));
        this.e3 = new Element(Intrinsics.stringPlus(this.name, ": third element"));
    }
    @NotNull
    public final Element getE2() {
        return this.e2;
    }
    @NotNull
    public final Element getE3() {
        return this.e3;
    }
}
As you can see, all the init blocks have become part of the constructor in the order they appear in the class body. I noticed one difference from Java source code: the class fields are initialized inside the constructor. Property initializers and init blocks run in the order they appear in the class body, so declaration order matters in Kotlin.
As stated in the Kotlin docs:
During an instance initialization, the initializer blocks are executed in the same order as they appear in the class body, interleaved with the property initializers: ...
https://kotlinlang.org/docs/classes.html#constructors

Filter object by its members

I'm trying to filter an object in Guava. For example I have a class Team and would like to get all the teams with position below 5.
Iterable<Team> test = Iterables.filter(teams, new Predicate<Team>(){
public boolean apply(Team p) {
return p.getPosition() <= 5;
}
});
I'm getting 2 errors, Predicate cannot be resolved to a type and The method filter(Iterable, Predicate) in the type Iterables is not applicable for the arguments (List <'Team'>, new Predicate<'Team'>(){}).
I'm able to filter Iterables of type Integer.
Iterable<Integer> t6 = Iterables.filter(set1, Range.open(0, 3));
How do I filter an object based on its members in Guava? I want to use this library in my Android project and have many filtering conditions. Can it be used for class objects, or is it only for simple data types?
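You need a final variable, like range in this example.
This is the way to filter with external parameters: the Predicate is an anonymous inner class, so it can only capture local variables that are final (or, since Java 8, effectively final).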
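// Assumed imports: Predicate and Iterables are Guava; Range and IntRange
// appear to be the Apache Commons Lang 2.x classes.
import com.google.common.base.Predicate;
import com.google.common.collect.Iterables;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
final Range range = new IntRange(0, 3);
Iterable<Team> test = Iterables.filter(teams, new Predicate<Team>() {
    public boolean apply(Team p) {
        return range.containsInteger(p.getPosition());
    }
});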
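To show that filtering on a member works for ordinary classes and not only for simple data types, here is a small self-contained sketch. It deliberately uses the JDK's java.util.stream API instead of Guava so that it runs without extra dependencies. The Team class, its fields and the topTeams helper are assumptions made up for the example, and it presumes Java 8 streams are available on your target platform.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TeamFilterDemo {
    // Hypothetical Team class; the question only tells us it has getPosition().
    static class Team {
        private final String name;
        private final int position;

        Team(String name, int position) {
            this.name = name;
            this.position = position;
        }

        String getName() { return name; }
        int getPosition() { return position; }
    }

    // Keeps the teams whose position is at most maxPosition.
    // maxPosition is captured by the lambda, so it must be effectively final.
    static List<Team> topTeams(List<Team> teams, int maxPosition) {
        return teams.stream()
                .filter(t -> t.getPosition() <= maxPosition)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Team> teams = Arrays.asList(
                new Team("A", 1), new Team("B", 5), new Team("C", 9));
        for (Team t : topTeams(teams, 5)) {
            System.out.println(t.getName()); // prints A, then B
        }
    }
}
```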

Define a static variable in a function like c++

In my function, can I have a variable that
Retains its value between function calls.
Is only visible inside that function
Is unique for each thread i.e. if I'm calling the function from two threads then there are two variables that are static with regard to each thread.
Why I want that:
I have a function in which I fill in a list and return that list. The problem is that if I declare a variable normally, then I will have to allocate memory for it every time I call the function. I want to avoid that and allocate only once then every time I call the function it would fill in that variable with the proper values then return it.
I can do the following inside a class:
class MyClass {
val __readLineTemp = mutable.IndexedSeq.fill[Int](5)(-1)
def readLine() = {
var i = 0
while (i < __readLineTemp.length)
{
__readLineTemp(i) = Random.nextInt()
i += 1
}
__readLineTemp
}
}
My problem with this approach is that it doesn't satisfy points 2 and 3, namely visibility only inside the method and being unique for each thread. However, for point 3 I can simply make each thread initialise its own object of MyClass.
I understand there is probably no way of achieving exactly what I want, but sometimes people come up with clever ideas to overcome this, especially since Scala seems quite deep and there are a lot of tricks you can do.
You can use a closure to satisfy 1 and 2:
def foo = {
var a = 5
() => {
a = a + 1
a
}
}
i.e. create a closure that will contain the static variable (in your case, this is __readLineTemp) and return a function that's the only thing with access to the variable.
Then use it like this to satisfy the thread requirement:
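import scala.util.DynamicVariable
val t1 = new Thread(new Runnable {
  def run(): Unit = {
    // each thread calls foo itself, so it gets its own counter
    val f = new DynamicVariable(foo)
    println(f.value())
    println(f.value())
  }
})
t1.start()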
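On the JVM, the per-thread requirement is what java.lang.ThreadLocal is for. The following Java sketch shows that alternative rather than the closure approach above: each thread lazily allocates one buffer and reuses it on every call. The names ReadLineDemo, BUFFER and readLineOnNewThread are invented for the example, and the buffer is a private static field, so it is hidden from other classes but not scoped to the method itself (point 2 is only approximated).

```java
import java.util.concurrent.ThreadLocalRandom;

class ReadLineDemo {
    // One buffer per thread, allocated on that thread's first call only.
    private static final ThreadLocal<int[]> BUFFER =
            ThreadLocal.withInitial(() -> new int[5]);

    // Refills and returns the calling thread's buffer; no allocation per call.
    static int[] readLine() {
        int[] buf = BUFFER.get();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = ThreadLocalRandom.current().nextInt();
        }
        return buf;
    }

    // Helper for the demonstration: calls readLine() on a fresh thread.
    static int[] readLineOnNewThread() {
        int[][] out = new int[1][];
        Thread t = new Thread(() -> out[0] = readLine());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return out[0];
    }

    public static void main(String[] args) {
        System.out.println(readLine() == readLine());            // prints true
        System.out.println(readLine() == readLineOnNewThread()); // prints false
    }
}
```

Because the same array is handed out on every call, callers must finish with the previous result before calling readLine() again on the same thread.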

C# - Why can I not cast a List<MyObject> to a class that inherits from List<MyObject>?

I've got an object, which I'll call MyObject. It's a class that controls a particular data row.
I've then got a collection class, called MyObjectCollection:
public class MyObjectCollection : List<MyObject> {}
Why can I not do the following:
List<MyObject> list = this.DoSomethingHere();
MyObjectCollection collection = (MyObjectCollection)list;
Thanks in advance.
Edit: The error is InvalidCastException
My guess is that DoSomethingHere doesn't return an instance of MyObjectCollection.
Let's get rid of all the generics etc here, as they're not relevant. Here's what I suspect you're trying to do:
public static object CreateAnObject()
{
return new object();
}
object o = CreateAnObject();
string s = (string) o;
That will fail (at execution time) and quite rightly so.
To bring it back to your code, unless DoSomethingHere actually returns a MyObjectCollection at execution time, the cast will fail.
Because a List<MyObject> is not a MyObjectCollection. The reverse is true: you could cast a MyObjectCollection to a List<MyObject>, because MyObjectCollection inherits from List<MyObject> and thus, for all intents and purposes, IS A List<MyObject>.
The only thing you can do is define a constructor on MyObjectCollection that takes an IEnumerable<MyObject> as a parameter and initializes itself with the data in the other collection, but that creates a new object containing the same data:
public class MyObjectCollection : List<MyObject>
{
    public MyObjectCollection(IEnumerable<MyObject> items)
    {
        AddRange(items);
    }
}
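The same rule holds on the JVM, so the behaviour can be reproduced in Java as a rough analogy of the C# case. In this sketch ArrayList stands in for List<MyObject>, ClassCastException plays the role of InvalidCastException, and the names CastDemo and castSucceeds are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class CastDemo {
    static class MyObject {}

    // Subclass of the list type, like MyObjectCollection : List<MyObject>.
    static class MyObjectCollection extends ArrayList<MyObject> {
        MyObjectCollection() {}

        // Copy constructor: builds a NEW collection holding the same items.
        MyObjectCollection(Collection<MyObject> items) {
            super(items);
        }
    }

    // The downcast only works if the runtime object really is the subclass.
    static boolean castSucceeds(List<MyObject> list) {
        try {
            MyObjectCollection ignored = (MyObjectCollection) list;
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<MyObject> plain = new ArrayList<>();
        plain.add(new MyObject());
        System.out.println(castSucceeds(plain));                         // prints false
        System.out.println(castSucceeds(new MyObjectCollection(plain))); // prints true
    }
}
```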
UPDATE:
As noted in the comment, you COULD have the cast succeed at runtime, provided that DoSomethingHere actually returns an instance of MyObjectCollection. If it does, the object effectively is a MyObjectCollection, and the cast is completely legal.
I'd have to say, it is bad practice in my view to downcast something like that. If the function returns a List, you should not rely on a specific implementation of List. Either modify the return type of DoSomethingHere, if you own that function, and return a MyObjectCollection, or deal with the result as a list.
Without knowing exactly what is created inside DoSomethingHere(), we have to assume one of two cases.
Either you have a misunderstanding about inheritance in .NET,
and you have
A : B
B DoSomething()
{
return new B();
}
// then this is
B b = new B();
A a = (A)b;
Clearly b is a B but not an A. B might look much like A, but it is not one (if you traverse the parentage of b you won't find A anywhere).
This is true irrespective of the generics involved (though generics can sometimes cause situations where something that could work doesn't; see co- and contravariance in C# 4.0).
or
A : B
B DoSomething()
{
return new A();
}
// then this is
B b = new A();
A a = (A)b;
which, in the absence of generics, will work.
You can't do it because (I'm guessing) the list instance returned from DoSomethingHere isn't derived from MyObjectCollection.
You could create implicit operators that convert between your object and the list. You would need a constructor that takes a list and a property that returns the underlying list (BaseList below). Note that C# does not allow user-defined conversions to or from a base class, so this only compiles if MyObjectCollection wraps a list rather than inheriting from List<MyObject>.
public static implicit operator List<MyObject>(MyObjectCollection oCollection)
{
    // Convert here: return the wrapped list of this instance
    return oCollection.BaseList;
}
public static implicit operator MyObjectCollection(List<MyObject> oList)
{
    // Convert here: wrap the list in a new collection
    return new MyObjectCollection(oList);
}