Selecting a column from a dataset's row - scala

I'd like to loop on a Spark dataset and save specific values in a Map depending on the characteristics of each row. I'm new to Spark and Scala so I joined a simple example of what I'm trying to do in python.
Minimal working example in python:
mydict = dict()
for row in data:
if row['name'] == "Yan":
mydict[row['id']] = row['surname']
else:
mydict[row['id']] = "Random lad"
Where data is a (big) spark dataset, of type org.apache.spark.sql.Dataset[org.apache.spark.sql.Row].
Do you know the Spark or Scala way of doing it?

You can not loop over the contents of a Dataset because they are not accessible on the machine running this code but instead are scattered over (possibly many) different worker nodes. That is a fundamental concept of distributed execution engines like spark.
Instead you have to manipulate your data in a functional (where map, filter, reduce, ... operations are spread to the workers) or declarative (sql queries that are performed on the workers) way.
To achieve your goal you could run a map over you data which checks whether the name equals "Yan" and go on from there. After this transformation you can collect your dataframe and transform it into a dict.
You should also check your approach on using Spark and the map: it seems you want to create an entry in mydict for each element of data. This means your data is either small enough that you don't really have to use Spark or it will probably fail because it does not fit in your drivers memory.

I think you are looking for something like that. If your final df is not big you can collect it and store as map.
scala> df.show()
+---+----+--------+
| id|name|surrname|
+---+----+--------+
| 1| Yan| abc123|
| 2| Abc| def123|
+---+----+--------+
scala> df.select('id, when('name === "Yan", 'surrname).otherwise("Random lad")).toDF("K","V").show()
+---+----------+
| K| V|
+---+----------+
| 1| abc123|
| 2|Random lad|
+---+----------+

Here is a simple way to do it, but be careful with collect(), since it collects the data in driver. The data should be able to fit in driver.
I don't recommend you to do this.
var df: DataFrame = Seq(
("1", "Yan", "surname1"),
("2", "Yan1", "surname2"),
("3", "Yan", "surname3"),
("4", "Yan2", "surname4")
).toDF("id", "name", "surname")
val myDict = df.withColumn("newName", when($"name" === "Yan", $"surname").otherwise("RandomeName"))
.rdd.map(row => (row.getAs[String]("id"), row.getAs[String]("newName")))
.collectAsMap()
myDict.foreach(println)
Output:
(2,RandomeName)
(1,surname1)
(4,RandomeName)
(3,surname3)

Related

Spark Column merging all list into 1 single list

I want the below column to merge into a single list for n-gram calculation. I am not sure how can I merge all the lists in a column into a single one.
+--------------------+
| author|
+--------------------+
| [Justin, Lee]|
|[Chatbots, were, ...|
|[Our, hopes, were...|
|[And, why, wouldn...|
|[At, the, Mobile,...|
+--------------------+
(Edit)Some more info:
I would like this as a spark df column and all the words including the repeated ones in a single list. The data is kind of big so I want to try avoiding methods like collect
OP wants to aggregate all the arrays/lists into the top row.
values = [(['Justin','Lee'],),(['Chatbots','were'],),(['Our','hopes','were'],),
(['And','why','wouldn'],),(['At','the','Mobile'],)]
df = sqlContext.createDataFrame(values,['author',])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
This step suffices.
from pyspark.sql import functions as F
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
DataFrames, same as other distributed data structures, are not iterable and by only using dedicated higher order function and / or SQL methods can be accessed
Suppose your Dataframe is DF1 and Output is DF2
You need something like :
values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
(['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = spark.createDataFrame(values, ['author', ])
df.agg(F.collect_list('author').alias('author')).show(truncate=False)
Upvote if works

Collecting two values from a DataFrame, and using them as parameters for a case class; looking for less verbose solution

I've got some data in spark, result: DataFrame = ..., where two integer columns are of interest; week and year. The values of these columns are identical for all rows.
I want to extract these two integer values, and pass them as parameters to create a WeekYear:
case class WeekYear(week: Int, year: Int)
Below is my current solution, but I'm thinking there must be a more elegant way to do this. How can this be done without the intermediate step of creating temp?
val temp = result
.select("week", "year")
.first
.toSeq
.map(_.toString.toInt)
val resultWeekYear = WeekYear(temp(0), temp(1))
The best way to utilize a case class with dataframes is to allow spark to convert it to a dataset with the .as() method. As long as your case class has attributes which match all of the column names, it should work very easily.
case class WeekYear(week: Int, year: Int)
val df = spark.createDataset(Seq((1, 1), (2, 2), (3, 3))).toDF("week", "year")
val ds = df.as[WeekYear]
ds.show()
Which provides a Dataset[WeekYear] that looks like this:
+----+----+
|week|year|
+----+----+
| 1| 1|
| 2| 2|
| 3| 3|
+----+----+
You can utilize some more complicated nested classes, but you have to start working with Encoders for that, so that spark knows how to convert back and forth.
Spark does some implicit conversions, so ds may still look like a Dataframe, but it is actually a strongly typed Dataset[WeekYear], instead of a Dataset[Row] that has arbitrary columns. You operate on it similarly to an RDD. Then just grab the .first() one of those and you'll already have the type you need.
val resultWeekYear = ds.first

Spark: Is "count" on Grouped Data a Transformation or an Action?

I know that count called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following
scala> val empDF = Seq((1,"James Gordon", 30, "Homicide"),(2,"Harvey Bullock", 35, "Homicide"),(3,"Kristen Kringle", 28, "Records"),(4,"Edward Nygma", 30, "Forensics"),(5,"Leslie Thompkins", 31, "Forensics")).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]
scala> empDF.show
+---+----------------+---+----------+
| id| name|age|department|
+---+----------------+---+----------+
| 1| James Gordon| 30| Homicide|
| 2| Harvey Bullock| 35| Homicide|
| 3| Kristen Kringle| 28| Records|
| 4| Edward Nygma| 30| Forensics|
| 5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+
scala> empDF.groupBy("department").count //count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]
scala> res1.show
+----------+-----+
|department|count|
+----------+-----+
| Homicide| 2|
| Records| 1|
| Forensics| 2|
+----------+-----+
When I called count on GroupedData (empDF.groupBy("department")), I got another DataFrame as the result (res1). This leads me to believe that count in this case was a transformation. It is further supported by the fact that no computations were triggered when I called count, instead, they started when I ran res1.show.
I haven't been able to find any documentation that suggests count could be a transformation as well. Could someone please shed some light on this?
The .count() what you have used in your code is over RelationalGroupedDataset, which creates a new column with count of elements in the grouped dataset. This is a transformation. Refer:
https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset
The .count() that you use normally over RDD/DataFrame/Dataset is completely different from the above and this .count() is an Action. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD
EDIT:
always use .count() with .agg() while operating on groupedDataSet in order to avoid confusion in future:
empDF.groupBy($"department").agg(count($"department") as "countDepartment").show
Case 1:
You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the data to the driver, its an action for RDD.
for ex: rdd.count // it returns a Long value
Case 2:
If you call count on Dataframe, it initiates the DAG execution and returns the data to the driver, its an action for Dataframe.
for ex: df.count // it returns a Long value
Case 3:
In your case you are calling groupBy on dataframe which returns RelationalGroupedDataset object, and you are calling count on grouped Dataset which returns a Dataframe, so its a transformation since it doesn't gets the data to the driver and initiates the DAG execution.
for ex:
df.groupBy("department") // returns RelationalGroupedDataset
.count // returns a Dataframe so a transformation
.count // returns a Long value since called on DF so an action
As you've already figure out - if method returns a distributed object (Dataset or RDD) it can be qualified as a transformations.
However these distinctions are much better suited for RDDs than Datasets. The latter ones features an optimizer, including recently added cost based optimizer, and might be much less lazy the old API, blurring differences between transformation and action in some case.
Here however it is safe to say count is a transformation.

How to flatten a nested field in a Spark Dataset?

I have nested field like below. I want to call flatmap (I think) to produce a flattened row.
My dataset has
A,B,[[x,y,z]],C
I want to convert it to produce output like
A,B,X,Y,Z,C
This is for Spark 2.0+
Thanks!
Apache DataFu has a generic explodeArray method that will do
exactly what you need.
import datafu.spark.DataFrameOps._
val df = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C"))).toDF
df.explodeArray(col("_3"), "token").show
This will produce:
+---+---+---------+---+------+------+------+
| _1| _2| _3| _4|token0|token1|token2|
+---+---+---------+---+------+------+------+
| A| B|[X, Y, Z]| C| X| Y| Z|
+---+---+---------+---+------+------+------+
One thing to consider is that this method evaluates the data frame in order to determine how many columns to create - if it's expensive to compute it should be cached.
Full disclosure - I am a member of Apache DataFu.
Try this for RDD:
val rdd = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C")))
rdd.flatMap(x => (Option(x._3).map(y => (x._1,x._2,y(0),y(1),y(2),x._4 )))).collect.foreach(println)
Output:
(A,B,X,Y,Z,C)

spark (Scala) dataframe filtering (FIR)

Let say I have a dataframe ( stored in scala val as df) which contains the data from a csv:
time,temperature
0,65
1,67
2,62
3,59
which I have no problem reading this from file as a spark dataframe in scala language.
I would like to add a filtered column (by filter I meant signal processing moving average filtering), (say I want to do (T[n]+T[n-1])/2.0):
time,temperature,temperatureAvg
0,65,(65+0)/2.0
1,67,(67+65)/2.0
2,62,(62+67)/2.0
3,59,(59+62)/2.0
(Actually, say for the first row, I want 32.5 instead of (65+0)/2.0. I wrote it to clarify the expected 2-time-step filtering operation output)
So how to achieve this? I am not familiar with spark dataframe operation which combine rows iteratively along column...
Spark 3.1+
Replace
$"time".cast("timestamp")
with
import org.apache.spark.sql.functions.timestamp_seconds
timestamp_seconds($"time")
Spark 2.0+
In Spark 2.0 and later it is possible to use window function as a input for groupBy. It allows you to specify windowDuration, slideDuration and startTime (offset). It works only with TimestampType column but it is not that hard to find a workaround for that. In your case it will require some additional steps to correct for boundaries but general solution can expressed as shown below:
import org.apache.spark.sql.functions.{window, avg}
df
.withColumn("ts", $"time".cast("timestamp"))
.groupBy(window($"ts", windowDuration="2 seconds", slideDuration="1 second"))
.avg("temperature")
Spark < 2.0
If there is a natural way to partition your data you can use window functions as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean
val w = Window.partitionBy($"id").orderBy($"time").rowsBetween(-1, 0)
val df = sc.parallelize(Seq(
(1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
)).toDF("id", "time", "temperature")
df.select($"*", mean($"temperature").over(w).alias("temperatureAvg")).show
// +---+----+-----------+--------------+
// | id|time|temperature|temperatureAvg|
// +---+----+-----------+--------------+
// | 1| 0| 65| 65.0|
// | 1| 1| 67| 66.0|
// | 1| 2| 62| 64.5|
// | 1| 3| 59| 60.5|
// +---+----+-----------+--------------+
You can create windows with arbitrary weights using lead / lag functions:
lit(0.6) * $"temperature" +
lit(0.3) * lag($"temperature", 1) +
lit(0.2) * lag($"temperature", 2)
It is still possible without partitionBy clause but will be extremely inefficient. If this is the case you won't be able to use DataFrames. Instead you can use sliding over RDD (see for example Operate on neighbor elements in RDD in Spark). There is also spark-timeseries package you may find useful.