I can read in a Spark dataframe as a custom object like this:
spark.read.csv("path/to/file").as[Gizmo]
But how can I convert a single Spark Row object to its equivalent case class? (If you're worried about why I want to do this, please consult this question.) Clearly Spark knows how to do this, but I don't see any straightforward way of accomplishing it (short of converting the Row into an RDD of length 1 and then converting back).
row.as[Gizmo] // doesn't work. What goes here?
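For reference, the length-1 round trip mentioned above looks roughly like this (a minimal sketch, assuming row carries its schema, Gizmo is a case class matching that schema, and spark is an active SparkSession):
import spark.implicits._
// wrap the single Row in an RDD, rebuild a DataFrame using the Row's own schema,
// then convert to a typed Dataset and take its only element
val gizmo: Gizmo = spark
  .createDataFrame(spark.sparkContext.parallelize(Seq(row)), row.schema)
  .as[Gizmo]
  .head()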
According to this
Spark Catalyst is "an implementation-agnostic framework for manipulating trees of relational operators and expressions."
I want to use Spark Catalyst to parse SQL DMLs and DDLs in order to generate custom Scala code. However, it is not clear to me from reading the code whether there is any wrapper class around Catalyst that I can use. The ideal wrapper would receive a SQL statement and produce the equivalent Scala code. For my use case it would look like this:
// desired API (pseudocode)
def generate(sql: String): String = {
  // e.g. generate("select substring(s, 1, 3) as from t1")
  // returns custom Scala code which is executable given s as a List[String]
}
This is a simple example, but the idea is that I don't want to write another parser: I need to parse many SQL constructs from a legacy system and write a custom Scala implementation for each of them.
As a more general question: given the lack of class-level design documentation, how can someone learn the code base and make contributions?
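One hedged starting point: Catalyst's parser can be invoked directly to obtain an expression tree or a logical plan to walk, although these classes are internal and the exact entry points vary between Spark versions.
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
// parse a single SQL expression into a Catalyst expression tree
val expr = CatalystSqlParser.parseExpression("substring(s, 1, 3)")
// parse a full statement into an (unresolved) logical plan and print its tree
val plan = CatalystSqlParser.parsePlan("select substring(s, 1, 3) from t1")
println(plan.treeString)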
Spark accepts SQL queries via spark.sql. For example, you can pass the string SELECT * FROM table as an argument, as in spark.sql("SELECT * FROM table"), after having registered your DataFrame as "table". To register your DataFrame as "table" for use in SQL queries, create a temporary view using:
DataFrame.createOrReplaceTempView("table")
You can see examples here:
https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#running-sql-queries-programmatically
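A minimal end-to-end sketch, assuming df is a DataFrame that has already been loaded and spark is the active SparkSession:
// register the DataFrame under the name "table" and query it with SQL
df.createOrReplaceTempView("table")
val result = spark.sql("SELECT * FROM table")
result.show()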
A DataFrame is automatically translated into RDD operations and the code is optimised along the way; this optimisation is done by Catalyst. When a programmer writes code against the DataFrame API, it is optimised internally before execution. For more detail visit
Catalyst optimisation in Spark
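To see Catalyst at work, you can print the plans it produces for a DataFrame query (a small sketch; df and myColumn are placeholders):
import org.apache.spark.sql.functions.col
// explain(true) prints the parsed, analysed and optimised logical plans
// produced by Catalyst, followed by the physical plan
df.filter(col("myColumn") > 100).select(col("myColumn")).explain(true)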
I am trying to write a user-defined scalar function in Flink which takes in multiple expressions (an arbitrary number of expressions) and combines them into a single expression.
Coming from the Spark world, I could achieve this by using struct, which returns a Row type, and passing it to a UDF, like:
val structCol = org.apache.spark.sql.functions.struct(cols: _*)
vecUdf(structCol)
I am not able to find an equivalent in Flink. I am also trying to see if I can write a ScalarFunction that takes in an arbitrary number of expressions, but I am not able to find any examples.
Can anyone help guide me to either of the above two approaches? Thanks!
Note: I can't make it an Array, since each expression can be of a different type (actually, the same value type, but some could be arrays and some scalars).
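For reference, the Spark pattern described above looks roughly like this (a minimal sketch; the column names and the vecUdf body are assumptions for illustration):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}
val cols = Seq(col("a"), col("b"), col("c"))            // columns of mixed types
val vecUdf = udf { (r: Row) => r.toSeq.mkString("|") }  // receives the struct as a Row
val combined = df.withColumn("combined", vecUdf(struct(cols: _*)))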
So I have this code
val expandedDf = io.readInputs().mapPartitions { (iter: Iterator[Row]) =>
  iter.map { (item: Row) =>
    val myNewColumn = getUdf($"someColumnOriginal")
    Row.fromSeq(item.toSeq :+ myNewColumn)
  }
}
I am getting an exception: "Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases."
My imports are:
import spark.implicits._
import org.apache.spark.sql._
I have to use the UDF because the function is very complex, making some REST calls. Basically the code tries to add a new column to a Row using a particular column's value, and then returns a dataframe. I have tried using withColumn, but since I am dealing with petabytes of data here it is extremely slow. I am a newbie to Spark and Scala, so I apologise in advance if my question is extremely lame.
First of all, withColumn is the way to go, and if it's slow, it's probably because your job needs tuning, and I think switching to RDDs won't make it any faster.
But anyway...you are not supposed to refer to a DataFrame within the function that is called on every single row of an RDD.
To better understand what's happening: when running a Spark program, there's a Driver, which is the master, and there are Executors, which are the slaves.
The slaves don't know about DataFrames, only the driver does.
There is another important point: when you're writing code that runs on the Executors, you must be careful when referencing variables that are in the Driver's scope. If you do, Spark will try to serialize them and send them to the Executors. That's ok if it's what you want, AND if those objects are small, AND if Spark knows how to serialize them.
In this case, Spark is trying to serialize $"someColumnOriginal", which is an object of class Column, but it doesn't know how and it fails.
To make it work in this case, you have to know the position of the field you want. Let's say it's in position 2; you would write
Row.fromSeq(item.toSeq :+ item.get(2))
You can get the position by looking at the schema, if it's available (item.schema, rdd.schema); since it's just an Int, the lookup can be done outside the loops and Spark will be able to serialize it.
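A minimal sketch of that approach, assuming the relevant column is named "someColumnOriginal" and callRestService stands in for the complex function mentioned in the question:
val df = io.readInputs()
val idx = df.schema.fieldIndex("someColumnOriginal") // a plain Int, resolved once on the driver
val expandedRdd = df.rdd.mapPartitions { iter =>
  iter.map(item => Row.fromSeq(item.toSeq :+ callRestService(item.getString(idx))))
}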
You can read this article http://www.cakesolutions.net/teamblogs/demystifying-spark-serialisation-error for more about serialization.
I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd like not to do two separate .filter calls.
Unfortunately, the DataFrame API doesn't have such a method; to split by a condition you'll have to perform two separate filter transformations:
myDataFrame.cache() // recommended to prevent repeating the calculation
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence". (EDIT: I just noticed that there is a similar question for the RDD API, with an answer that gives similar reasoning.)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing the two filter calls directly in your business code, by writing an implicit class with a booleanSplit utility method that does that part in a similar way to Tzach Zohar's answer, perhaps using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so that the value of the condition is not calculated twice.
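A minimal sketch of such a utility (booleanSplit is not part of the Spark API; the name and the implicit class are assumptions):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, not}
implicit class DataFrameSplitOps(df: DataFrame) {
  // caches the parent DataFrame so its computation is not repeated for the two filters
  def booleanSplit(condition: Column): (DataFrame, DataFrame) = {
    val cached = df.cache()
    (cached.filter(condition), cached.filter(not(condition)))
  }
}
val (df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)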
I want to read in a CSV log which has as its first column a timestamp of the form hh:mm:ss. I would like to partition the entries into buckets, say hourly. I'm curious what the best approach is that adheres to Scala's semantics, i.e., reading the file as a stream, parsing it (maybe by a match predicate?) and emitting the CSV entries as tuples.
It's been a couple of years since I looked at Scala but this problem seems particularly well suited to the language.
log format example:
[time],[string],[int],[int],[int],[int],[string]
The last field in the input could be mapped to an enum in the output tuple, but I'm not sure there's value in that.
I'd be happy with a general recipe that I could use, with suggestions for certain built-in functions that are well suited to the problem.
The overall goal is a map-reduce, where I want to count elements in a time window but those elements first need to be preprocessed by a regex replace, before sorting and counting.
I've tried to keep the problem abstract, so the problem can be approached as a pattern to follow.
Thanks.
Perhaps, as a first pass, a simple groupBy would do the trick?
logLines.groupBy(line => line.timestamp.hours)
Using the groupBy idiom, and some filtering, my solution looks like
val lines: Traversable[String] = source.getLines.map(_.trim).toTraversable
val events: List[String] = lines.filter(line => line.matches("[\\d]+:.*")).toList
val buckets: Map[String, List[String]] = events.groupBy { line => line.substring(0, line.indexOf(":")) }
This gives me 24 buckets, one for each hour. Now I can process each bucket, perform the regex replace that I need to de-parameterize the URIs, and finally map-reduce those to find how often each route occurs.
Important note: I learned that groupBy doesn't work as desired without first creating a List from the Traversable stream. Without that step, the end result is a single-valued map for each hour. This is possibly not the most performant solution, since it requires all events to be loaded into memory before partitioning. Is there a better solution that can partition a stream? Perhaps something that adds events to a mutable Set as the stream is processed?
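A hedged sketch of that incremental alternative (using a mutable Map of buffers rather than a Set, since the goal is buckets keyed by hour): walk the line iterator once and append each event to its bucket, so nothing needs to be materialised as a List first.
import scala.collection.mutable
val buckets = mutable.Map.empty[String, mutable.ListBuffer[String]]
source.getLines
  .map(_.trim)
  .filter(_.matches("[\\d]+:.*"))
  .foreach { line =>
    // the hour prefix (everything before the first ':') is the bucket key
    val hour = line.substring(0, line.indexOf(":"))
    buckets.getOrElseUpdate(hour, mutable.ListBuffer.empty[String]) += line
  }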