Return combined Dataset after joinWith in Spark Scala - scala

Given the below two Spark Datasets, flights and capitals, what would be the most efficient way to return the combined (i.e. "joined") result without first converting to a DataFrame or writing all the columns out by name in a .select() method? I know, for example, that I can access either side of the tuple (e.g. .map(x => x._1)) or use the * operator with:
result.select("_1.*","_2.*")
But the latter may result in duplicate column names and I'm hoping for a cleaner solution.
Thank you for your help.
case class Flights(tripNumber: Int, destination: String)
case class Capitals(state: String, capital: String)
val flights = Seq(
(55, "New York"),
(3, "Georgia"),
(12, "Oregon")
).toDF("tripNumber","destination").as[Flights]
val capitals = Seq(
("New York", "Albany"),
("Georgia", "Atlanta"),
("Oregon", "Salem")
).toDF("state","capital").as[Capitals]
val result = flights.joinWith(capitals,flights.col("destination")===capitals.col("state"))

There are two options, but you will have to use join instead of joinWith:
1. Drop one of the join columns. That is one of the nicer parts of the Dataset API: there is no need to repeat the projection columns in a select:
val result = flights.join(capitals, flights("destination") === capitals("state")).drop(capitals("state"))
2. Rename the join column so it is the same in both datasets and use a slightly different way of specifying the join:
val result = flights.join(capitals.withColumnRenamed("state", "destination"), Seq("destination"))
Output (shown for the second option; the first yields the same columns in a slightly different order):
result.show
+-----------+----------+-------+
|destination|tripNumber|capital|
+-----------+----------+-------+
|   New York|        55| Albany|
|    Georgia|         3|Atlanta|
|     Oregon|        12|  Salem|
+-----------+----------+-------+
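If you specifically want to keep joinWith and stay with a typed Dataset (without converting to a DataFrame), one option is to map the (Flights, Capitals) tuple into a combined case class. This is only a sketch; the FlightCapital class is a name introduced here for illustration, spark.implicits._ is assumed to be in scope, and it does mean spelling the fields out once, but it avoids duplicate column names entirely:
case class FlightCapital(tripNumber: Int, destination: String, state: String, capital: String)
val typedResult = result.map { case (f, c) => FlightCapital(f.tripNumber, f.destination, c.state, c.capital) }
typedResult.show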

Related

Capitalize the first letter of each word | Spark Scala

I have a table as below -

| ID | City | Country |
| 1 | Frankfurt am main | Germany |

The dataframe needs to be displayed by capitalizing the first letter of each word in the city, i.e. the output should look like this ->

| ID | City | Country |
| 1 | Frankfurt Am Main | Germany |
The solution I worked with is as below ->
df.map(x => x.getString(1).trim().split(' ').map(_.capitalize).mkString(" ")).show()
This only provides the City column aliased as "value".
How can I get all the columns with the above-mentioned transformation implemented?
You can use the initcap function (docs):
public static Column initcap(Column e)
Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace.
For example, "hello world" will become "Hello World".
Parameters: e - (undocumented)
Returns: (undocumented)
Since: 1.5.0
Sample code
import org.apache.spark.sql.functions._
val data = Seq(("1", "Frankfurt am main", "Germany"))
val df = data.toDF("Id", "City", "Country")
df.withColumn("City", initcap(col("City"))).show
And the output is:
+---+-----------------+-------+
| Id|             City|Country|
+---+-----------------+-------+
|  1|Frankfurt Am Main|Germany|
+---+-----------------+-------+
Your sample code was returning only one column because that's exactly what you coded in your map: take x (your df), get the column at index 1 from it, do some transformations and return it.
You could do what you wanted with map, as you can see in the other answers, but the output of your map needs to include all columns.
Why am I not using map in my answer? The general rule is: when there is a built-in SQL function, use it instead of a custom map/UDF. Most of the time the SQL function will be better in terms of performance, as it is easier for Catalyst to optimize.
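If several string columns need the same treatment, a minimal sketch (the targetCols list below is an assumption; adjust it to your own string columns) combines initcap with a foldLeft over the column names, so no UDF is required:
import org.apache.spark.sql.functions._
// apply initcap to each listed column, keeping the same column name
val targetCols = Seq("City", "Country")
val capitalized = targetCols.foldLeft(df) { (acc, c) => acc.withColumn(c, initcap(col(c))) }
capitalized.show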
You can call a UDF and loop over all columns:
import spark.implicits._
val data = Seq(
(1, "Frankfurt am main", "just test", "Germany"),
(2, "should do this also", "test", "France")
)
val df = spark.sparkContext.parallelize(data).toDF("ID", "City", "test", "Country")
val convertUDF = udf((value: String) => value.split(' ').map(_.capitalize).mkString(" "))
val dfCapitalized = df.columns.foldLeft(df) {
(df, column) => df.withColumn(column, convertUDF(col(column)))
}
dfCapitalized.show(false)
+---+-------------------+---------+-------+
|ID |City               |test     |Country|
+---+-------------------+---------+-------+
|1  |Frankfurt Am Main  |Just Test|Germany|
|2  |Should Do This Also|Test     |France |
+---+-------------------+---------+-------+
You could map over your Dataframe and then simply use normal Scala functions to capitalize. This gives you quite some flexibility in exactly which transformations you want to do, putting the Scala language at your disposal.
Something like this:
import spark.implicits._
val df = Seq(
(1, "Frankfurt am main", "Germany")
).toDF("ID", "City", "Country")
val output = df.map{
row => (
row.getInt(0),
row.getString(1).split(' ').map(_.capitalize).mkString(" "),
row.getString(2)
)
}
output.show
+---+-----------------+-------+
| _1|               _2|     _3|
+---+-----------------+-------+
|  1|Frankfurt Am Main|Germany|
+---+-----------------+-------+
Inside the map function, we output a tuple with the same number of elements as the number of columns you want to end up with.
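If you would rather keep the original column names than end up with _1, _2, _3, a small addition (not part of the answer above, just standard API) is to rename the tuple columns afterwards with toDF:
output.toDF("ID", "City", "Country").show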
Hope this helps!

How to convert spark dataframe array to tuple

How can I convert a Spark dataframe array column to tuples of 2 in Scala?
I tried to explode the array and create a new column with the help of the lead function, so that I could use two columns to create the tuples.
In order to use the lead function, I need a column to sort by, and I don't have one.
Please suggest the best way to solve this.
Note: I need to retain the same order in the array.
For example:
Input
For example, input looks something like this,
id1 | [text1, text2, text3, text4]
id2 | [txt, txt2, txt4, txt5, txt6, txt7, txt8, txt9]
expected o/p:
I need to get an output of tuples of length 2
id1 | [(text1, text2), (text2, text3), (text3,text4)]
id2 | [(txt, txt2), (txt2, txt4), (txt4, txt5), (txt5, txt6), (txt6, txt7), (txt7, txt8), (txt8, txt9)]
You can create a UDF that builds the list of tuples using the sliding collection method:
val df = Seq(
("id1", List("text1", "text2", "text3", "text4")),
("id2", List("txt", "txt2", "txt4", "txt5", "txt6", "txt7", "txt8", "txt9"))
).toDF("id", "text")
val sliding = udf((value: Seq[String]) => {
value.toList.sliding(2).map { case List(a, b) => (a, b) }.toList
})
val result = df.withColumn("text", sliding($"text"))
Output:
+---+-------------------------------------------------------------------------------------------------+
|id |text |
+---+-------------------------------------------------------------------------------------------------+
|id1|[[text1, text2], [text2, text3], [text3, text4]] |
|id2|[[txt, txt2], [txt2, txt4], [txt4, txt5], [txt5, txt6], [txt6, txt7], [txt7, txt8], [txt8, txt9]]|
+---+-------------------------------------------------------------------------------------------------+
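One caveat: sliding(2) on a single-element array yields a one-element list, so the pattern match in the UDF above would throw a MatchError for such rows. A slightly more defensive sketch of the same idea pairs each element with its successor via zip, which simply returns an empty list when there are fewer than two elements and preserves the original order:
val sliding = udf((value: Seq[String]) => value.zip(value.drop(1)))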
Hope this helps!

Collecting two values from a DataFrame, and using them as parameters for a case class; looking for less verbose solution

I've got some data in spark, result: DataFrame = ..., where two integer columns are of interest: week and year. The values of these columns are identical for all rows.
I want to extract these two integer values, and pass them as parameters to create a WeekYear:
case class WeekYear(week: Int, year: Int)
Below is my current solution, but I'm thinking there must be a more elegant way to do this. How can this be done without the intermediate step of creating temp?
val temp = result
.select("week", "year")
.first
.toSeq
.map(_.toString.toInt)
val resultWeekYear = WeekYear(temp(0), temp(1))
The best way to utilize a case class with dataframes is to allow spark to convert it to a dataset with the .as() method. As long as your case class has attributes which match all of the column names, it should work very easily.
case class WeekYear(week: Int, year: Int)
val df = spark.createDataset(Seq((1, 1), (2, 2), (3, 3))).toDF("week", "year")
val ds = df.as[WeekYear]
ds.show()
Which provides a Dataset[WeekYear] that looks like this:
+----+----+
|week|year|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
+----+----+
You can utilize some more complicated nested classes, but you would have to start working with Encoders for that, so that Spark knows how to convert back and forth.
Spark does some implicit conversions, so ds may still look like a Dataframe, but it is actually a strongly typed Dataset[WeekYear] instead of a Dataset[Row] with arbitrary columns. You operate on it similarly to an RDD. Then just grab the .first() one of those and you'll already have the type you need.
val resultWeekYear = ds.first
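Applied back to the original result dataframe, the whole thing becomes a one-liner. This is a sketch assuming spark.implicits._ is in scope and that week and year really are integer columns in result:
val resultWeekYear = result.select("week", "year").as[WeekYear].first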

Spark generate a list of column names that contains(SQL LIKE) a string

The snippet below is a simple syntax for searching for a string in a particular column using the SQL LIKE functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The question is: how do I grab each and every column NAME whose VALUES contain that particular string, and generate a new column with a list of those "column names" for every row?
So far this is the approach I took, but I am stuck because I can't use the Spark SQL "like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample of the desired output. Note that here there are only 3 columns, but in the real job I'll be reading multiple tables which can contain a dynamic number of columns.

id | col1  | col2  | col3 | merged_cols
0  | mango | man   | dit  | col1, col2
1  | i-man | man2  | mane | col1, col2, col3
2  | iman  | mango | ho   | col1, col2
3  | dim   | kim   | sim  |
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df1.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfy the condition e will have their name appended to the string in the merged_cols column. Note that this column must exist for the first append to work, so it is added (containing an empty string) to the dataframe before it is passed into the foldLeft.
The last line in the code simply removes the extra , that is added at the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative approach is to use a UDF instead. This could be preferred when dealing with many columns, since the foldLeft creates many intermediate dataframes. Here a regex is used (not the SQL like, since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df1.withColumn("merged_cols", concat_cols(array(df1.columns.map(col(_)): _*), typedLit(df1.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+; when using older versions, use array(df1.columns.map(lit(_)): _*) instead.
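For completeness, on Spark 2.4+ the same merged_cols column can be built without a UDF by using the filter higher-order function inside a SQL expression. This is only a sketch; it assumes the search pattern and the column names contain nothing that needs quote escaping:
// build one IF(...) per column that yields the column name or NULL, then drop the NULLs
val colsExpr = df1.columns.map(c => s"IF($c LIKE '%man%', '$c', NULL)").mkString(", ")
val df2 = df1.withColumn("merged_cols", expr(s"filter(array($colsExpr), x -> x IS NOT NULL)"))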

Scala spark Select as not working as expected

Hope someone can help. Fairly certain this is something I'm doing wrong.
I have a dataframe called uuidvar with 1 column called 'uuid' and another dataframe, df1, with a number of columns, one of which is also 'uuid'. I would like to select from df1 all of the rows which have a uuid that appears in uuidvar. Now, having the same column names is not ideal, so I tried to do it with
val uuidselection=df1.join(uuidvar, df1("uuid") === uuidvar("uuid").as("another_uuid"), "right_outer").select("*")
However when I show uuidselection I have 2 columns called "uuid". Furthermore, if I try and select the specific columns I want, I am told
cannot resolve 'uuidvar' given input columns
or similar depending on what I try and select.
I have tried to make it simpler and just do
val uuidvar2=uuidvar.select("uuid").as("uuidvar")
and this doesn't rename the column in uuidvar.
Does 'as' not operate as I am expecting it to, am I making some other fundamental error, or is it broken?
I'm using Spark 1.5.1 and Scala 2.10.
Answer
First, you can't use as when specifying the join criterion. Use withColumnRenamed to modify the column before the join.
Second, use the generic col function for accessing columns via name (instead of using the dataframe's apply method, e.g. df1(<columnname>)).
case class UUID1 (uuid: String)
case class UUID2 (uuid: String, b:Int)
class UnsortedTestSuite2 extends SparkFunSuite {
configuredUnitTest("SO - uuid") { sc =>
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val uuidvar = sc.parallelize( Seq(
UUID1("cafe-babe-001"),
UUID1("cafe-babe-002"),
UUID1("cafe-babe-003"),
UUID1("cafe-babe-004")
)).toDF()
val df1 = sc.parallelize( Seq(
UUID2("cafe-babe-001", 1),
UUID2("cafe-babe-002", 2),
UUID2("cafe-babe-003", 3)
)).toDF()
val uuidselection=df1.join(uuidvar.withColumnRenamed("uuid", "another_uuid"), col("uuid") === col("another_uuid"), "right_outer")
uuidselection.show()
}
}
delivers
+-------------+----+-------------+
|         uuid|   b| another_uuid|
+-------------+----+-------------+
|cafe-babe-001|   1|cafe-babe-001|
|cafe-babe-002|   2|cafe-babe-002|
|cafe-babe-003|   3|cafe-babe-003|
|         null|null|cafe-babe-004|
+-------------+----+-------------+
Comment
.select("*") does not have any effect. So
df.select("*") =^= df
I've always used the withColumnRenamed api to rename columns:
Take this table as an example:
| Name | Age |
df.withColumnRenamed("Age", "newAge").show()
| Name | newAge |
So to make it work with your code, something like this should work:
val uuidvar_another = uuidvar.withColumnRenamed("uuid", "another_uuid")
val uuidselection = df1.join(uuidvar_another, df1("uuid") === uuidvar_another("another_uuid"), "right_outer").select("*")
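As a final note on the original goal (keeping only the df1 rows whose uuid appears in uuidvar): a left semi join sidesteps the duplicate-column problem entirely, because only df1's columns are returned. A sketch, assuming the right-hand side is only needed as a filter (the "leftsemi" join type has been available since early Spark versions, but verify it on yours):
val uuidselection = df1.join(uuidvar, df1("uuid") === uuidvar("uuid"), "leftsemi")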