Scala dataframe - where is the Spark/Scala dataframe source code for explode on GitHub?

As explained in this article, explode is slow in Scala 2.11.8 and Spark 2.0.2. Without moving to a higher Spark version, the alternate methods to improve it are also slow. Since the issue has been fixed in later versions of Spark, one approach would be to copy the fixed source code. While looking for the source code, I found a reference to explode in functions, but I do not know how to track the function further. How would I find the source code for the working explode in the newer Spark source code, so that I can use it instead of the current version of explode?

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala is the link I think you're looking for.
I was able to find it by expanding all the import org.apache._ imports within https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala, after seeing that the explode function there is just def explode(e: Column): Column = withExpr { Explode(e.expr) }.
If you want to import the underlying Explode expression directly, I believe the import would be import org.apache.spark.sql.catalyst.expressions.Explode.
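If you go that route, a minimal sketch of wiring the catalyst expression back into a regular Column yourself could look like the following. Note that withExpr is private to the functions object, so this sketch uses the public Column constructor; treat it as untested and dependent on your Spark version.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Explode

// Rough re-creation of the public explode(): wrap the catalyst Explode
// expression in a Column so it can be used in select(...)
def myExplode(e: Column): Column = new Column(Explode(e.expr))

// usage: df.select(myExplode($"arrayColumn"))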

Related

Example of zero-copy share of a Polars dataframe between Python and Rust?

I have a Python function such as
def add_data(input_df):
    """
    some manipulation of input_df (Polars dataframe) such as filling some columns with new values
    """
I would like to use this function from a Rust function. input_df can be tens of megabytes big, so I'd like to use zero-copy share between Python and Rust. Is there any example code on this available?
I found Is it possible to access underlying data from Polars in cython? but this seems to be Cython. I am looking for a pure Python way.
I made a crate to make this easy: https://github.com/pola-rs/pyo3-polars
See the examples to get started: https://github.com/pola-rs/pyo3-polars/tree/main/example

Using the Dataset API with Spark Scala

Hi, I am very new to Spark/Scala and trying to implement some functionality. My requirement is simple: I have to perform all the operations using the Dataset API.
Question 1:
I converted the CSV into a case class. Is this the correct way of converting a DataFrame to a Dataset? Am I doing it correctly?
Also, when I do transformations on orderItemFile1, for filter/map operations I am able to access fields with _.order_id, but the same does not work with groupBy.
case class orderItemDetails(order_id_order_item: Int, item_desc: String, qty: Int, sale_value: Int)

val orderItemFile1 = ss.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("src/main/resources/Order_ItemData.csv")
  .as[orderItemDetails]

orderItemFile1.filter(_.order_id_order_item > 100) // works fine
orderItemFile1.map(_.order_id_order_item.toInt)    // works fine

// Error: inside groupBy I am unable to access it as _.order_id_order_item. Why so?
orderItemFile1.groupBy(_.order_id_order_item)

// The line below works, but how does this provide the compile-time safety promised
// by the Dataset API? I can pass any wrong column name here and it will only be
// caught at run time.
orderItemFile1.groupBy(orderItemFile1("order_id_order_item")).agg(sum(orderItemFile1("item_desc")))
Perhaps the functionality you're looking for is #groupByKey. See example here.
As for your first question, basically yes, you're reading a CSV into a Dataset[A] where A is a case class you've declared.
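As a rough sketch of what that looks like for the Dataset in the question (assuming ss is the SparkSession from your snippet with its implicits imported; the aggregated column is just an example):
import org.apache.spark.sql.functions.sum
import ss.implicits._

// groupByKey keeps the typed API: the key is extracted with a plain function
// on the case class, so a wrong field name fails at compile time, not at run time
val qtyPerOrder = orderItemFile1
  .groupByKey(_.order_id_order_item)
  .agg(sum($"qty").as[Long]) // typed aggregation; result is a Dataset[(Int, Long)]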

How to use Window aggregates on strongly typed Spark Datasets?

I'm slowly trying to adapt to the new (strongly typed) Dataset[U] from Spark 2.x, but I'm struggling to maintain the type info when using Window functions.
case class Measurement(nb: Long, x: Double)
With ds being a Dataset[Measurement], I would like to do something like
ds.map { m => (m, sum($"x").over(Window.orderBy($"nb"))) }
But this will not work, as it gives me a Dataset[(Measurement, Column)] instead of a Dataset[(Measurement, Double)].
Using withColumn gives me a Dataset[Row], so I'm losing the type info:
ds.withColumn("cumsum", sum($"x").over(Window.orderBy($"nb")))
So, is there a better way to use Window functions on strongly typed Datasets?
As you are adding a new column to your Dataset, I guess there is no choice but to use the dataframe.as[NewType] method.
More information can be found here: How to add a column to Dataset without converting from a DataFrame and accessing it?
More information on Window functions can be found in the blog article Window Functions in Spark SQL by Databricks.
You can use the as[U] method to convert a DataFrame (or Dataset[Row]) to a Dataset[U], as long as the fields of U match the columns of the result.
For your specific case, withColumn flattens the schema to the columns nb, x and cumsum, so the target type needs a field for each of them, for example (the case class name here is just illustrative):
case class MeasurementWithCumSum(nb: Long, x: Double, cumsum: Double)
ds.withColumn("cumsum", sum($"x").over(Window.orderBy($"nb"))).as[MeasurementWithCumSum]
Hope it helps
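Pulling the pieces together, a self-contained sketch under the assumption of a local SparkSession (case class names and sample data are only illustrative):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("cumsum").getOrCreate()
import spark.implicits._

case class Measurement(nb: Long, x: Double)
case class MeasurementWithCumSum(nb: Long, x: Double, cumsum: Double)

val ds = Seq(Measurement(1L, 1.0), Measurement(2L, 2.5), Measurement(3L, 4.0)).toDS()

// running sum of x ordered by nb; as[...] restores the strong typing afterwards
val withCumSum = ds
  .withColumn("cumsum", sum($"x").over(Window.orderBy($"nb")))
  .as[MeasurementWithCumSum]

withCumSum.show()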

Scala Unit Testing - Mocking an implicitly wrapped function

I have a question concerning unit tests that I'm trying to write using Mockito in Scala. I've also looked at ScalaMock, but it sounds like the feature is not provided there either. I suppose I may be looking at the solution too narrowly, and there might be a different perspective or approach to what I'm doing, so all your opinions are welcome.
Basically, I want to mock a function that is made available on an object through an implicit conversion, and I have no control over how that is done, since I'm just a user of the library. The concrete example is similar to the following scenario:
val rdd: RDD[T] = ??? // existing RDD
val sqlContext: SQLContext = ??? // existing SQLContext
import sqlContext.implicits._
rdd.toDF()
// toDF() does not exist on RDD itself; it is added implicitly by importing sqlContext.implicits._
Now, in the tests, I'm mocking the rdd and the sqlContext, and I want to mock the toDF() function. I can't mock toDF() directly since it doesn't exist at the RDD level. Even if I use a simple trick and import the mocked sqlContext.implicits._, I get an error that any function that is not publicly available on the object can't be mocked. I even tried to mock the code that is implicitly executed up to toDF(), but I get stuck with final/private (inaccessible) classes that I also can't mock. Your suggestions are more than welcome. Thanks in advance :)
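To illustrate the mechanism being described, here is a simplified analogue (not Spark's actual code): the method lives on a wrapper produced by an implicit conversion, which is why mocking it on the original object has no effect.
import scala.language.implicitConversions

class Enriched[T](underlying: Seq[T]) {
  // stand-in for toDF(): the method exists on the wrapper, not on Seq itself
  def describe(): String = s"${underlying.size} rows"
}

object MyImplicits {
  implicit def enrich[T](xs: Seq[T]): Enriched[T] = new Enriched(xs)
}

import MyImplicits._
// the call below is rewritten by the compiler to enrich(Seq(1, 2, 3)).describe(),
// so a mock of the Seq itself never sees a describe()/toDF() call
Seq(1, 2, 3).describe()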

Error with map & flatMap on RDDs in Eclipse with Spark

I finally have Eclipse set up to be able to use Spark in a worksheet. I have the Scala 2.10.5 library in my build path & also included this jar: spark-assembly-1.4.1-hadoop2.6.0.jar
I can do most things on RDDs except map and flatMap. For example, given this data (sampleData.txt):
0,1 0 0
0,2 0 0
1,0 1 0
1,0 2 0
2,0 0 1
2,0 0 2
The following code gives a "macro has not been expanded" error.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD._

object sparkTestWS {
  val conf = new SparkConf().setMaster("local[*]").setAppName("My App")
  val sc = new SparkContext(conf)

  // start model section
  val data = sc.textFile("sampleData.txt")
  val dataM = data.map(x => x)
}
I looked up this macro error, and there's a post saying that it has to do with implicit types and that it will be (or now is) fixed in Scala 2.11, but Spark is on Scala 2.10...
I also wondered if I might need to explicitly import the classes that provide these functions, since there was a post saying that some implicit imports need to be made explicit, but so far I haven't been able to figure out what to import. I've tried scala.Array, scala.immutable._, org.apache.spark.rdd._, etc.
Any ideas? There are other posts stating that people are using Spark with Eclipse, so there must be a way to make Spark work in Eclipse (though the posts don't note whether or not they are using Scala worksheets). I'm pretty new to Spark and only slightly less new to Scala, so any advice would be greatly appreciated. I really like Scala worksheets, so I'd like to get all this working if possible. Thx!
Your code looks good to me.
Your problem is likely to be with the worksheets themselves. They are nice, but since they are based on the REPL they are not exactly the same as compiled classes: they do a bunch of extra things to allow the code to flow (like redefining the same variable), and each REPL command is wrapped in its own scope, which can mess with implicits, imports, etc. in subtle ways.
If you are new to both Scala and Spark, I would recommend using compiled classes for the time being and postponing worksheets until you get a better grasp of the fundamentals.
That said, have you tried spark-shell?
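For reference, the same logic as a small compiled application, which is roughly what is being suggested here (names and the sample file are carried over from the question; an untested sketch for that Spark 1.4 setup):
import org.apache.spark.{SparkConf, SparkContext}

object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("My App")
    val sc = new SparkContext(conf)

    val data = sc.textFile("sampleData.txt") // same sample file as in the question
    val dataM = data.map(x => x)             // map expands fine outside the worksheet

    dataM.collect().foreach(println)
    sc.stop()
  }
}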