value option is not a member of org.apache.spark.sql.DataFrame - scala

I'm trying to create a DataFrame in Scala as below:
var olympics = spark.read.csv("/FileStore/tables/Soccer_Data_Set_c46d1.txt").option("inferSchema","true").option("header","true").option("delimiter",",")
When I submit the code, it throws a "value option is not a member of org.apache.spark.sql.DataFrame" error.
However, when I modify the code as below:
var olympics = spark.read.option("inferSchema","true").option("header","true").option("delimiter",",").csv("/FileStore/tables/Soccer_Data_Set_-c46d1.txt")
the olympics DataFrame is successfully created.
Can someone please help me understand the difference between these two code snippets?

Once you've called the csv method, you already have a DataFrame, and the data has already been read "into" Spark, so it doesn't make sense to set options at that point.
In the second example, you're calling read to "say" that you want Spark to read a file, setting the properties of that read, and then actually reading the file.

In the first snippet: invoking read.csv("/FileStore/tables/Soccer_Data_Set_c46d1.txt") returns an org.apache.spark.sql.Dataset object. This class does not define any option() method, which is what you try to invoke afterwards (csv(..).option("inferSchema", "true")). So the compiler restricts you and throws the error.
Please refer to the Dataset class API, where you will find no definition of an option() method.
In the second snippet: invoking spark.read returns an org.apache.spark.sql.DataFrameReader object. This class defines several overloaded option methods, and since you are using one of the valid ones, you get no error from the compiler.
Please refer to the DataFrameReader class API, where you will find the overloaded option() methods defined.
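To make the difference concrete, here is a minimal sketch of the types involved (reusing the path and options from the question; the explicit type annotations are only for illustration):
val reader: org.apache.spark.sql.DataFrameReader = spark.read
  .option("inferSchema", "true")    // option() is defined on DataFrameReader
  .option("header", "true")
  .option("delimiter", ",")
val olympics: org.apache.spark.sql.DataFrame =
  reader.csv("/FileStore/tables/Soccer_Data_Set_c46d1.txt")  // returns a DataFrame (Dataset[Row])
// olympics.option("inferSchema", "true")  // would not compile: Dataset defines no option()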

Related

Using Flutter/Dart, VS Code shows "method isn't defined" when I check for a specific mixin before calling it

This is my first time messing with Dart, and I'm stuck with this silly error. I'm 90% confident that the error lies on the VS Code side of things, because no errors show up when running the app.
Maybe I'm approaching the problem in the wrong way; I simply want to call a mixin function on objects that implement the mixin. In Java, for example, it would be required to cast the component variable, but I couldn't get a cast to work in this situation.
Code
Error
The method 'onPanUpdate' isn't defined for the class 'Component'.
Try correcting the name to the name of an existing method, or defining a method named 'onPanUpdate'.
Source Code
Repo link
Component class source
For this to work, it must be done with the following code:
for (var component in this.components) {
  if (component is PanDetector) {
    // the `is` check already promotes `component` to PanDetector here;
    // the explicit cast is kept for clarity
    (component as PanDetector).onPanUpdate(details);
  }
}
Special thanks to @Moqbel

Is it bad to put an RDD inside a Serializable Class?

According to this article, when you use an object inside an RDD.map, for example, Spark will serialize the whole object first. Now, let's say I have an RDD defined as a member of that serializable class. What would Spark do with that RDD? Would it try to serialize it as well, and if so, how?
Following is some example code:
class SomeClass extends Serializable {
  var a: String
  var b: Int
  var rdd: RDD[...]
  ....
}
val objectOfSomeClass = new SomeClass(...)
...
someRDD.map(x => someFunc(objectOfSomeClass))
Re:
"I am just worried if serialization of the whole class also involves serialization of the RDD inside it."
The code that you have shown does not need the whole object to be serialized, hence you are not facing any serialization issues so far. Instead of passing a and b separately, if you pass objectOfSomeClass, then I believe you would face a serialization issue.
In one of your comments you have also mentioned:
"I am just worried if it affects the performance."
This too does not come into the picture unless you perform an action on that RDD. RDDs are evaluated lazily: transformations only run when an action is called on the RDD, and that is when Spark actually reads the data and does the work. In your example I do not see any action, so it should not affect the performance of your application.
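To illustrate the laziness, here is a minimal sketch (assuming an existing SparkContext sc; the names are made up):
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)       // transformation: nothing executes yet
val total = doubled.reduce(_ + _)   // action: only here does Spark actually compute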
Hope this clarifies a couple of your doubts.
-Amit

Is it a getter in Scala? (from the RDD class source of Spark)

When we do checkpointing in Spark, we come across this statement:
checkpointData.get.doCheckPoint()
Why not instead use checkpointData.doCheckPoint()?
Is the get in that statement something like a getter? I know that a Scala class will automatically generate getters and setters.
Or is it some other syntax I do not know?
If you are talking about the source code of the RDD class (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala), then it's because checkpointData has the type Option[RDDCheckpointData[T]].
See the declaration in the source code:
private[spark] var checkpointData: Option[RDDCheckpointData[T]] = ...
So to call a method of RDDCheckpointData, we need to get it out of the Option (after being sure it isDefined, as you can see in the code).
Read more about the Scala Option class:
http://www.scala-lang.org/api/current/index.html#scala.Option
http://danielwestheide.com/blog/2012/12/19/the-neophytes-guide-to-scala-part-5-the-option-type.html
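For illustration, a small sketch of the same Option pattern (the wrapped value here is made up; only the isDefined/get usage mirrors the RDD source):
val checkpoint: Option[String] = Some("some checkpoint state")
if (checkpoint.isDefined) {
  println(checkpoint.get)          // safe: isDefined was checked first
}
checkpoint.foreach(println)        // more idiomatic: runs only if a value is present
println(checkpoint.getOrElse("no checkpoint"))  // with a fallback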

What is the input type of classOf?

I am wondering what type to put in place of XXX:
def registerClass(cl:XXX) = kryo.register(classOf[cl])
EDIT: why I want to do this.
I have to register many classes using the above code. I wanted to remove the duplication of calling kryo.register several times, hoping to write code like this:
Seq(com.mypackage.class1,com.mypackage.class2,com.mypackage.class3).foreach(registerClass)
Another question: can I pass a String instead, and somehow convert it to a class in registerClass?
Seq("com.mypackage.class1","com.mypackage.class2").foreach(registerClass)
EDIT 2:
When I write com.mypackage.class1, I mean any class defined in my source. So if I create a class
package com.mypackage.model

class Dummy(val ids: Seq[Int], val name: String)
I would provide com.mypackage.model.Dummy as input
So,
kryo.register(classOf[com.mypackage.model.Dummy])
Kryo is a Java serialization library. The signature of the register method is:
register(Class type)
You could do it like this:
def registerClass(cl: Class[_]) = kryo.register(cl)
And then call it like this:
registerClass(classOf[Int])
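As for the follow-up about passing a String: something like the sketch below should work, using Class.forName and assuming each named class is on the classpath (the class name reuses the Dummy example from the question):
def registerClassByName(name: String): Unit =
  kryo.register(Class.forName(name))
Seq("com.mypackage.model.Dummy").foreach(registerClassByName)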
The type parameter to classOf needs to be known at compile time. Without knowing more about what you're trying to do, is there any reason you can't pass an instance and use its runtime class:
def registerClass(cl: AnyRef) = kryo.register(cl.getClass)

Eclipse JDT AST: how to find out whether a called method returns the value of an instance variable?

I'm using the Eclipse JDT AST to parse given Java source code. While parsing, when it hits a method invocation, I want to find out whether that particular method returns or sets the value of an instance variable (basically, to find out whether the callee is a getter/setter in the same class as the caller method).
E.g.:
public void test(){
    // when parsing the following line I want to check whether "getName"
    // returns a value of an instance variable
    String x = getName();
    // when parsing the following line I want to check whether "setName"
    // sets the value of an instance variable
    setName("some-name");
}
I've also used the AST plugin to find a possible path that would help me get this from the API, but couldn't.
Please let me know whether this is possible, and if so, which approach would get me the required information.
I don't think there is an API that tells you whether a method is a getter or a setter.
You will have to write code to do this. For a getter, you can probably simply check whether the last statement in the method is a return statement that returns an instance variable.
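A rough sketch of that idea (in Scala over the org.eclipse.jdt.core.dom API, to match the other examples on this page; it assumes the AST was parsed with binding resolution enabled via ASTParser.setResolveBindings(true)):
import org.eclipse.jdt.core.dom._

def looksLikeGetter(m: MethodDeclaration): Boolean = {
  val body = m.getBody
  if (body == null || body.statements().isEmpty) return false
  // inspect the last statement of the method body
  body.statements().get(body.statements().size - 1) match {
    case ret: ReturnStatement =>
      ret.getExpression match {
        case name: SimpleName =>
          name.resolveBinding() match {
            case v: IVariableBinding => v.isField  // true when the returned name is a field
            case _ => false
          }
        case _ => false
      }
    case _ => false
  }
}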