Iterate over a DataFrame and use those values in a Spark SQL statement - Scala

I have a data frame, say DF:
Animal
======
Cat
Dog
Horse
I want to iterate over these values and use them in a Spark SQL statement.
Can someone please help me with this?

Spark's Dataset/DataFrame APIs are declarative rather than imperative (like SQL): you describe what you want the final data to look like and let the Spark engine figure out the exact transformations.
What you're describing doesn't really fit Spark as a use case.

It's a weird use case, but you can iterate over your values and do whatever you want with a foreach.
INPUT
df.show
+------+
|animal|
+------+
|   cat|
|   dog|
| horse|
+------+
SENTENCE
Here I just print the value, but you can apply any other function; as said in the comments, though, it's a bit weird:
df.foreach(row => println(row.getAs[String](0)))
This is the piece that gets the actual value:
row.getAs[String](0)
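If the values really do need to end up inside SQL text, one common pattern is to collect them to the driver first and then issue a statement per value. This is only a minimal sketch, assuming the list of values is small; the table name my_table and the query itself are hypothetical:
import spark.implicits._

// Collect the animal values to the driver (only safe for small result sets).
val animals: Array[String] = df.select("animal").as[String].collect()

// Run one Spark SQL statement per collected value.
animals.foreach { animal =>
  spark.sql(s"SELECT * FROM my_table WHERE animal = '$animal'").show()
}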

Related

Scala: best way to update a deltatable after filling missing values

I have the following delta table
+-+----+
|A|B |
+-+----+
|1|10 |
|1|null|
|2|20 |
|2|null|
+-+----+
I want to fill the null values in column B based on the A column.
I came up with the following to do so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

var df = spark.sql("select * from MyDeltaTable")
val w = Window.partitionBy("A")
df = df.withColumn("B", last("B", true).over(w))
Which gives me the desired output:
+-+----+
|A|B |
+-+----+
|1|10 |
|1|10 |
|2|20 |
|2|20 |
+-+----+
Now, my question is:
What is the best way to write the result in my delta table correctly ?
Should I merge ? Re-write with overwrite option ?
My delta table is huge and it will keep on growing; I am looking for the best possible method to achieve this.
Thank you
It depends on how the rows containing the null values you'd like to fill are distributed (i.e. are they all in one file or spread across many?).
MERGE will rewrite entire files, so you may end up rewriting enough of the table to justify simply overwriting it instead. You'll have to test this to determine what's best for your use case.
Also, to use MERGE, you need to filter the dataset down to only the changes. Your example "desired output" table has all the data, which would fail to MERGE in its current state because there are duplicate keys.
Check the Important! section in the docs for more details.
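As a rough illustration of what the MERGE route could look like, here is a minimal sketch. It assumes each A has a single non-null B to propagate, reuses the table name from the question, and only touches target rows whose B is still null; it is not a drop-in solution:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Compute the filled values, then keep one (A, B) pair per key so the
// merge source contains no duplicate keys.
val w = Window.partitionBy("A")
val updates = spark.table("MyDeltaTable")
  .withColumn("B", last("B", ignoreNulls = true).over(w))
  .select("A", "B")
  .distinct()

// Update only the target rows that still have a null B.
DeltaTable.forName(spark, "MyDeltaTable").as("t")
  .merge(updates.as("u"), "t.A = u.A AND t.B IS NULL")
  .whenMatched()
  .updateExpr(Map("B" -> "u.B"))
  .execute()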

Should I choose RDD over DataSet/DataFrame if I intend to perform a lot of aggregations by key?

I have a use case where I intend to group by key(s) while aggregating over column(s). I am using Dataset and tried to achieve these operations by using groupBy and agg. For example take the following scenario
case class Result(deptId:String,locations:Seq[String])
case class Department(deptId:String,location:String)
// using spark 2.0.2
// I have a Dataset `ds` of type Department
+-------+--------------------+
| deptId|            location|
+-------+--------------------+
|     d1|               delhi|
|     d1|              mumbai|
|    dp2|            calcutta|
|    dp2|           hyderabad|
+-------+--------------------+
I intended to convert it to
// Dataset `result` of type Result
+-------+--------------------+
| deptId|           locations|
+-------+--------------------+
|     d1|      [delhi,mumbai]|
|    dp2|[calcutta,hyderabad]|
+-------+--------------------+
For this I searched on stack and found the following:
val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten)

val result = ds.groupBy("deptId")
  .agg(flatten(collect_list("location")).as("locations"))
The above seemed pretty neat to me.
But before searching for the above, I first searched whether Dataset has a built-in reduceByKey like an RDD does. I couldn't find one, so I opted for the above. But then I read this article, groupByKey vs reduceByKey, and learned that reduceByKey has fewer shuffles and is more efficient, which is my first reason for asking this question: should I opt for an RDD in my scenario?
The reason I initially went for Dataset was solely the enforcement of types, i.e. each row being of type Department. But as my result has an entirely different schema, should I bother with type safety? So I tried doing result.as[Result], but that doesn't seem to do any compile-time type check. Another reason I chose Dataset is that I'll pass the result Dataset to some other function, and having a structure makes the code easier to maintain. Also, the case class can be highly nested; I cannot imagine maintaining that nesting in a pair RDD while writing reduce/map operations.
Another thing I'm unsure about is the use of udf. I came across a post where people said they would prefer converting the Dataset to an RDD rather than using a UDF for complex aggregations/groupBy.
I also googled around a bit and saw posts/articles where people said Dataset has the overhead of type checking, but that in newer Spark versions it performs better than RDD. Again, I'm not sure whether I should switch back to RDD.
PS: please forgive me if I used some terms incorrectly.
To answer some of your questions:
groupBy + agg is not groupByKey - DataFrame / Dataset groupBy behaviour/optimization - in the general case. There are specific cases where it might behave like one; this includes collect_list.
reduceByKey is not better than RDD-style groupByKey when groupByKey-like logic is required - Be Smart About groupByKey - and in fact it is almost always worse.
There is an important trade-off between static type checking and performance in Spark's Dataset - Spark 2.0 Dataset vs DataFrame
The linked post specifically advises against using UserDefinedAggregateFunction (not UserDefinedFunction) because of excessive copying of data - Spark UDAF with ArrayType as bufferSchema performance issues
You don't even need UserDefinedFunction as flattening is not required in your case:
import spark.implicits._
import org.apache.spark.sql.functions.collect_list

val df = Seq[Department]().toDF
df.groupBy("deptId").agg(collect_list("location").as("locations"))
And this is what you should go for.
A statically typed equivalent would be
val ds = Seq[Department]().toDS
ds
.groupByKey(_.deptId)
.mapGroups { case (deptId, xs) => Result(deptId, xs.map(_.location).toSeq) }
but it is considerably more expensive than the DataFrame option.
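If you still want downstream code to see a Dataset[Result], one option is to cast the untyped result afterwards. This is only a sketch of that idea; as the question already observed, .as[Result] checks the schema at analysis time rather than at compile time:
import spark.implicits._
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.collect_list

val result: Dataset[Result] = ds.toDF()
  .groupBy("deptId")
  .agg(collect_list("location").as("locations"))
  .as[Result]   // schema is verified when the plan is analysed, not at compile time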

Understanding some basics of Spark SQL

I'm following http://spark.apache.org/docs/latest/sql-programming-guide.html
After typing:
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
I have some questions that I didn't see the answers to.
First, what is the $-notation?
As in
df.select($"name", $"age" + 1).show()
Second, can I get the data from just the 2nd row (given that I don't know what the data in the second row is)?
Third, how would you read in a color image with spark sql?
Fourth, I'm still not sure what the difference is between a Dataset and a DataFrame in Spark. The variable df is a DataFrame, so could I change "Michael" to the integer 5? Could I do that in a Dataset?
$ is not an annotation. It is a method call (a shortcut for new ColumnName("name")).
You wouldn't. Spark SQL has no notion of row indexing.
You wouldn't. You can use the low-level RDD API with specific input formats (like the ones from the HIPI project) and then convert.
Difference between DataSet API and DataFrame
1) For question 1, the $ sign is used as a shortcut for selecting a column and applying functions on top of it. For example:
df.select($"id".isNull).show
which can otherwise be written as
df.select(col("id").isNull)
2) Spark does not have indexing, but for prototyping you can use df.take(10)(i), where i is the index of the element you want within the first 10 rows. Note: the behaviour could be different each time, as the underlying data is partitioned.
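For completeness, the $"..." shorthand comes from the SparkSession implicits, so it only works after importing them; the three selects below are equivalent (a small sketch, not from the original answers):
import org.apache.spark.sql.functions.col
import spark.implicits._   // brings the $"..." string interpolator into scope

df.select($"name", $"age" + 1).show()
df.select(col("name"), col("age") + 1).show()
df.select(df("name"), df("age") + 1).show()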

Can I use SELECT from dataframe instead of creating this temp table?

I am currently producing this DataFrame:
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |sen |attributes |
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |Stanford is good college.|[[Stanford,ORGANIZATION,NNP], [is,O,VBZ], [good,O,JJ], [college,O,NN], [.,O,.], [Stanford,ORGANIZATION,NNP], [is,O,VBZ], [good,O,JJ], [college,O,NN], [.,O,.]]|
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
I want to get the above df from:
+----------+--------+--------------------+
|article_id| sen| attribute|
+----------+--------+--------------------+
| 1|example1|[Standford,Organi...|
| 1|example1| [is,O,VP]|
| 1|example1| [good,LOCATION,ADP]|
+----------+--------+--------------------+
using:
df3.registerTempTable("d1")
val df4 = sqlContext.sql("select article_id,sen,collect(attribute) as attributes from d1 group by article_id,sen")
Is there any way that I don't have to register the temp table? While saving the dataframe it is giving a lot of garbage!! Something like df3.select("")??
The only way Spark currently has to run SQL against a dataframe is via a temporary table. However, you can add implicit methods to DataFrame to automate this, as we have done at Swoop. I can't share all the code as it uses a number of our internal utilities & implicits but the core is in the following gist. The importance of using unique temporary tables is that (at least until Spark 2.0) temporary tables are cluster global.
We use this approach regularly in our work, especially since there are many situations in which SQL is much simpler/easier to write and understand than the Scala DSL.
Hope this helps!
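The gist itself isn't reproduced here, but the core idea can be sketched roughly as follows: an implicit class that registers the DataFrame under a unique temporary name, runs the SQL, and returns the result. The method name sqlOver and the {df} placeholder are inventions for this sketch, not Swoop's actual API:
import java.util.UUID
import org.apache.spark.sql.DataFrame

object DataFrameSqlSyntax {
  implicit class DataFrameWithSql(val df: DataFrame) {
    // Register the DataFrame under a unique temp table name, substitute that
    // name for the {df} placeholder, and run the query against it.
    def sqlOver(queryTemplate: String): DataFrame = {
      val name = s"tmp_${UUID.randomUUID().toString.replaceAll("-", "")}"
      df.registerTempTable(name)
      df.sqlContext.sql(queryTemplate.replace("{df}", name))
    }
  }
}

import DataFrameSqlSyntax._

// Usage:
// val df4 = df3.sqlOver(
//   "select article_id, sen, collect_list(attribute) as attributes from {df} group by article_id, sen")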

Using groupBy in Spark and getting back to a DataFrame

I'm having difficulty working with data frames in Spark with Scala. If I have a data frame from which I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is "thing" to be stored in a new DataFrame, to allow me to do some further filtering. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a GroupedData value back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique ids from a log table.
Use the unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects, or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply different aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or equivalent method.
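For example, finishing the grouping with an aggregate returns a DataFrame again (a sketch using count purely for illustration; any aggregate would do):
import org.apache.spark.sql.functions.count

val perMachine = logs
  .where($"event" === "thing")
  .groupBy("machine_id")
  .agg(count("*").as("n_events"))   // agg turns the GroupedData back into a DataFrame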
A groupBy in Spark followed by an aggregation and then a select will return a data frame. For your example it should be something like:
import org.apache.spark.sql.functions.max

val machineId = logs
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .where($"event" === "thing")
  .select($"machine_id")