Understanding some basics of Spark SQL - Scala

I'm following http://spark.apache.org/docs/latest/sql-programming-guide.html
After typing:
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
I have some questions that I didn't see the answers to.
First, what is the $-notation?
As in
df.select($"name", $"age" + 1).show()
Second, can I get the data from just the second row (without knowing in advance what the data in the second row is)?
Third, how would you read in a color image with Spark SQL?
Fourth, I'm still not sure what the difference is between a Dataset and a DataFrame in Spark. The variable df is a DataFrame, so could I change "Michael" to the integer 5? Could I do that in a Dataset?

$ is not an annotation. It is a method call (a shortcut for new ColumnName("name")); see the sketch after these answers.
You wouldn't. Spark SQL has no notion of row indexing.
You wouldn't. You can use the low-level RDD API with specific input formats (like the ones from the HIPI project) and then convert.
Difference between DataSet API and DataFrame
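To make the first answer concrete, here is a minimal sketch (assuming the SparkSession is named spark and df is the people.json DataFrame from the question) showing three interchangeable ways to refer to the same column:
import org.apache.spark.sql.functions.col
// The $ interpolator becomes available after importing the session's implicits.
import spark.implicits._
df.select($"name").show()      // $-shortcut, i.e. new ColumnName("name")
df.select(col("name")).show()  // the col() function from org.apache.spark.sql.functions
df.select(df("name")).show()   // apply() on the DataFrame itself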

1) For question 1, the $ sign is used as a shortcut for selecting a column and applying functions on top of it. For example:
df.select($"id".isNull).show
which can otherwise be written as
df.select(col("id").isNull)
2) Spark does not have indexing, but for prototyping you can use df.take(10)(i), where i is the index of the row you want. Note: the behaviour could be different each time, as the underlying data is partitioned.
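For example, a rough sketch against the people.json DataFrame above (take(2) ships the first two rows to the driver, and without an explicit orderBy the "second" row is not guaranteed to be the same on every run):
val secondRow = df.take(2)(1)
val name = secondRow.getAs[String]("name")
// age was inferred from JSON as a long and may be null, so check before reading it
val age = if (secondRow.isNullAt(secondRow.fieldIndex("age"))) None
          else Some(secondRow.getAs[Long]("age"))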


Question regarding the best choice for specifying a schema in Spark, and how to delete/remove columns from an RDD in Spark

I have two datasets: NYC taxi data and weather data. The weather data has a huge number of columns, around 100, of which I need only 5-10. And I want them to be typed instead of strings, hence I need a schema. I know two ways to do this:
RDD -> Rows -> give schema and then convert to DF
DataFrame inferSchema (the documentation says it makes two passes over the data)
Is inferSchema a good choice for the 100-column dataset, so that I don't have to write a StructType or case class for 100 columns?
The taxi data has a billion+ records and also around 60 columns, but I need only around 10. What would be a suitable choice for this dataset? Should I write a schema for all 60 columns?
Second question: as I mentioned, I don't need all the columns, so I'm dropping columns. From the documentation and the internet, I learned how to do this with a DF, just with the select function.
But in case I have to write a schema, which to my knowledge is only possible using an RDD: how do I remove/drop columns in an RDD?
Ideally, using either an RDD or a DF, I'd like to drop the columns first and then specify a schema. Is this possible?
I know it's a lot of questions, but I'm a newbie with Spark; all of this popped up in my mind and I want to do it the right way.
Thanks
You don't need an RDD to achieve this; it's really simple. Just load your data into a DF, then select and cast the wanted columns.
scala> val df = Seq("1","2","3").toDF("c1")
scala> df.show()
+---+
| c1|
+---+
| 1|
| 2|
| 3|
+---+
scala> df.printSchema()
root
|-- c1: string (nullable = true)
scala> val newDF = df.select('c1.cast("int"))
scala> newDF.printSchema()
root
|-- c1: integer (nullable = true)
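Applied to the question's wide datasets, the same idea looks roughly like the sketch below, assuming the files are CSVs with a header (the file path and column names are made up for illustration and are not from the original post):
import org.apache.spark.sql.functions.col
// Read everything as strings (or add .option("inferSchema", "true") and accept the
// extra pass over the data), then keep and type only the columns you actually need.
val weather = spark.read
  .option("header", "true")
  .csv("/path/to/weather.csv")
  .select(
    col("station_id"),
    col("date").cast("date"),
    col("temperature").cast("double"),
    col("precipitation").cast("double")
  )
weather.printSchema()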

Should I choose RDD over DataSet/DataFrame if I intend to perform a lot of aggregations by key?

I have a use case where I intend to group by key(s) while aggregating over column(s). I am using Dataset and tried to achieve these operations using groupBy and agg. For example, take the following scenario:
case class Result(deptId:String,locations:Seq[String])
case class Department(deptId:String,location:String)
// using spark 2.0.2
// I have a Dataset `ds` of type Department
+-------+--------------------+
|deptId | location |
+-------+--------------------+
| d1|delhi |
| d1|mumbai |
| dp2|calcutta |
| dp2|hyderabad |
+-------+--------------------+
I intended to convert it to
// Dataset `result` of type Result
+-------+--------------------+
|deptId | locations |
+-------+--------------------+
| d1|[delhi,mumbai] |
| dp2|[calcutta,hyderabad]|
+-------+--------------------+
For this I searched on Stack Overflow and found the following:
val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten)
val result = ds.groupBy("deptId")
  .agg(flatten(collect_list("location")).as("locations"))
The above seemed pretty neat to me.
Before searching for the above, I first checked whether Dataset has a built-in reduceByKey like an RDD does, but I couldn't find one, so I opted for the above. However, I read the article groupByKey vs reduceByKey and learned that reduceByKey has fewer shuffles and is more efficient. This is my first reason to ask the question: should I opt for an RDD in my scenario?
The reason I initially went for Dataset was solely the enforcement of types, i.e. each row being of type Department. But as my result has an entirely different schema, should I bother with type safety? I tried doing result.as[Result], but that doesn't seem to do any compile-time type check. Another reason I chose Dataset is that I'll pass the result Dataset to some other function; having a structure makes the code easy to maintain. Also, the case class can be highly nested, and I cannot imagine maintaining that nesting in a pair RDD while writing reduce/map operations.
Another thing I'm unsure about is using UDFs. I came across posts where people said they would prefer converting a Dataset to an RDD rather than using UDFs for complex aggregations/groupBy.
I also googled around a bit and saw posts/articles saying that a Dataset has the overhead of type checking, but that in newer Spark versions it performs better compared to an RDD. Again, I'm not sure whether I should switch back to an RDD.
PS: please forgive me if I used some terms incorrectly.
To answer some of your questions:
groupBy + agg is not groupByKey - DataFrame / Dataset groupBy behaviour/optimization - in the general case. There are specific cases where it might behave like one; this includes collect_list.
reduceByKey is not better than RDD-style groupByKey when groupByKey-like logic is required - Be Smart About groupByKey - and in fact it is almost always worse.
There is an important trade-off between static type checking and performance in Spark's Dataset - Spark 2.0 Dataset vs DataFrame.
The linked post specifically advises against using UserDefinedAggregateFunction (not UserDefinedFunction) because of excessive copying of data - Spark UDAF with ArrayType as bufferSchema performance issues.
You don't even need a UserDefinedFunction, as flattening is not required in your case:
val df = Seq[Department]().toDF
df.groupBy("deptId").agg(collect_list("location").as("locations"))
And this is what you should go for.
A statically typed equivalent would be
val ds = Seq[Department]().toDS
ds
  .groupByKey(_.deptId)
  .mapGroups { case (deptId, xs) => Result(deptId, xs.map(_.location).toSeq) }
It is considerably more expensive than the DataFrame option.
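If the typed Result is still wanted at the boundary, one sketch (reusing the question's case classes; note that .as[Result] is only checked at analysis time, not at compile time) is to aggregate with the untyped API and convert at the end:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._
val ds = Seq(
  Department("d1", "delhi"),
  Department("d1", "mumbai"),
  Department("dp2", "calcutta"),
  Department("dp2", "hyderabad")
).toDS()
// Untyped aggregation, then a typed view over the result.
val result = ds
  .groupBy("deptId")
  .agg(collect_list("location").as("locations"))
  .as[Result]
result.show(truncate = false)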

Iterate over a data frame and use those values in a Spark SQL statement

I have a data frame, say:
DF
Animal
======
Cat
Dog
Horse
I want to iterate over these values and use them in a Spark SQL statement.
Can someone please help me with this?
The Spark Dataset/DataFrame APIs are declarative rather than imperative (like SQL), which means you describe what you want the end data to be and let the Spark engine figure out the exact transformations.
What you're describing doesn't make sense as a use case for Spark.
It's a weird use case, but you can iterate over your values and do whatever you want with a foreach.
INPUT
df.show
+------+
|animal|
+------+
| cat|
| dog|
| horse|
+------+
STATEMENT
Here I used a print, but you can apply any other function; as noted in the comments, it's a bit weird.
df.foreach(row => println(row.getAs[String](0)))
With this piece you get the actual value:
row.getAs[String](0)
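If the end goal really is to feed those values into a SQL statement, one rough approach for a small data frame (other_table below is a hypothetical registered view, not something from the question) is to collect the column to the driver and build the query from it:
// Only reasonable when the data frame is small enough to collect to the driver.
val animals = df.select("animal").collect().map(_.getString(0))
// Use the collected values in a Spark SQL statement, e.g. as an IN list.
val inList = animals.map(a => s"'$a'").mkString(", ")
val result = spark.sql(s"SELECT * FROM other_table WHERE animal IN ($inList)")
result.show()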

How to parse a csv string into a Spark dataframe using scala?

I would like to convert an RDD containing records of strings, like the ones below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is, how do I use the above two, to create a dataframe in Spark? I am using Spark version 2.2.
I did search and saw a post: Can I read a CSV represented as a string into Apache Spark using spark-csv.
However, it's not exactly what I need and I can't figure out a way to modify that piece of code to work in my case.
Your help is greatly appreciated.
The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
When the data already exists in an RDD, you can use toDF() to convert it to a dataframe. This function also accepts column names as input. To use this functionality, first import the Spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings, it first needs to be converted to tuples representing the columns of the dataframe. In this case, this will be an RDD[(String, String, String, Int)], since there are four columns (the last column, age, is changed to Int to illustrate how it can be done).
Assuming the input data is in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
  .map { case Array(name, account, state, age) => (name, account, state, age.toInt) }
  .toDF(header.split(","): _*)
Resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+
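As an alternative, Spark 2.2 added (if I recall the API correctly) a csv overload on DataFrameReader that accepts a Dataset[String], so the RDD of CSV lines can be parsed directly, with the header variable supplying the column names:
import spark.implicits._
val header = "name,account,state,age"
// Turn the RDD[String] into a Dataset[String] and let the CSV reader parse it.
val df = spark.read
  .option("inferSchema", "true")   // infers the age column as an integer type
  .csv(rdd.toDS())
  .toDF(header.split(","): _*)
df.printSchema()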

How to filter jbehave examples table rows based on scenario meta data

Is there a way to filter JBehave examples table rows at runtime using the scenario meta data? For example:
Scenario: my scenario title
Meta:
#id 1
Examples:
|Meta:|col1|col2|
|id 1 |val1|val2|
|id 2| val |val |
|id 1| val |val |
When we run this scenario, it should iterate only over the 1st and 3rd rows, based on the meta data set on the scenario.
What I am trying to do is externalize data across scenarios/stories and use filtered data rows applicable to a particular scenario.
I found some similar topics on meta filtering, but nothing specific to this.
Appreciate any help. Thanks
The meta character # must be used in the examples table, in this way:
Scenario: some scenario
Meta: #id
Given I pass value '1'
Examples:
|Meta:|col1|col2|
|#id 1|val1|val2|
|#id 2| val|val |
|#id 1| val|val |
Then you need to define the filter in the configuration, for example:
configuredEmbedder().useMetaFilters(Arrays.asList("+id 1"));
More on this topic can be found here:
http://jbehave.org/reference/stable/meta-filtering.html