I have a column that is a list of identifiers (in this case runways). It could be an array or a comma-separated list; in this example I'm converting it to an array. I'm trying to figure out the idiomatic/programmatic way to update a set of columns based on the contents of that array.
Working example that uses an anti-pattern:
import spark.implicits._
import org.apache.spark.sql.functions.{array_contains, split, when}

val data = Seq("08L,08R,09")
val df = data.toDF("runways")
  .withColumn("runway_set", split('runways, ","))
  .withColumn("runway_in_use_08L", when(array_contains('runway_set, "08L"), 1).otherwise(0))
  .withColumn("runway_in_use_26R", when(array_contains('runway_set, "26R"), 1).otherwise(0))
  .withColumn("runway_in_use_08R", when(array_contains('runway_set, "08R"), 1).otherwise(0))
  .withColumn("runway_in_use_26L", when(array_contains('runway_set, "26L"), 1).otherwise(0))
  .withColumn("runway_in_use_09", when(array_contains('runway_set, "09"), 1).otherwise(0))
  .withColumn("runway_in_use_27", when(array_contains('runway_set, "27"), 1).otherwise(0))
  .withColumn("runway_in_use_15L", when(array_contains('runway_set, "15L"), 1).otherwise(0))
  .withColumn("runway_in_use_33R", when(array_contains('runway_set, "33R"), 1).otherwise(0))
  .withColumn("runway_in_use_15R", when(array_contains('runway_set, "15R"), 1).otherwise(0))
  .withColumn("runway_in_use_33L", when(array_contains('runway_set, "33L"), 1).otherwise(0))
This produces essentially one-hot encoded values like so:
+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+
| runways| runway_set|runway_in_use_08L|runway_in_use_26R|runway_in_use_08R|runway_in_use_26L|runway_in_use_09|runway_in_use_27|runway_in_use_15L|runway_in_use_33R|runway_in_use_15R|runway_in_use_33L|
+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+
|08L,08R,09|[08L, 08R, 09]| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0|
+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+
Feels like I should be able to take a static sequence of all identifiers and perform some programmatic operation to do all of the above in a loop/map/foreach type of expression, but I am not sure how to formulate it.
E.g.:
val all_runways = Seq("08L","26R","08R","26L","09","27","15L","33R","15R","33L")
// iterate through and update each column, e.g. 'runway_in_use_$i'
Any pointers? Thanks in advance.
This is a typical use case for foldLeft.
val df = data.toDF("runways")
  .withColumn("runway_set", split('runways, ","))

val df2 = all_runways.foldLeft(df) { (acc, x) =>
  acc.withColumn(s"runway_in_use_$x", when(array_contains('runway_set, x), 1).otherwise(0))
}
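If you would rather avoid building the chain at all, here is a minimal sketch (assuming the same df and all_runways as above) that creates every indicator column once and adds them with a single select, which keeps the query plan flatter than many chained withColumn calls:

import org.apache.spark.sql.functions.col

// build all indicator columns in one pass and add them in a single select
val flagCols = all_runways.map { r =>
  when(array_contains('runway_set, r), 1).otherwise(0).as(s"runway_in_use_$r")
}
val df3 = df.select(col("*") +: flagCols: _*)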
Related
I'm comparing two dataframes in Spark using except().
For example: df.except(df2)
This gives me all the records from df that are not present in df2. However, I would also like to list the field details that do not match.
For example:
df:
------------------
id,name,age,city
101,kp,28,CHN
------------------
df2:
-----------------
id,name,age,city
101,kp,28,HYD
----------------
Expected output:
df3
--------------------------
id,name,age,city,diff
101,kp,28,CHN,City is not matching
--------------------------------
How can I achieve this?
Use intersect to get the values common to both DataFrames, then build your not-matching logic.
intersect - returns a new Dataset containing rows only in both this Dataset and another Dataset.
df.intersect(df2)
The RDD counterpart, intersection(anotherRDD), returns a new RDD that contains the intersection of elements in the source RDD and the argument, i.e. the elements present in both. It also removes duplicates, including duplicates within a single RDD.
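For illustration, a minimal sketch of that idea, assuming df and df2 share the schema id, name, age, city from the example above:

val common = df.intersect(df2)     // rows identical in both DataFrames
val mismatched = df.except(common) // rows of df with at least one non-matching field
mismatched.show()

From here you can join mismatched back against df2 on id to work out which individual fields differ.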
Here is a newer attempt at the above. It cannot be done very elegantly, but it can be done with a JOIN as opposed to except. Best I can do.
I believe it does what you need and takes into account the fact that some records exist in only one of the two data sets.
Run under Databricks.
case class Person(personid: Int, personname: String, cityid: Int)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions._
val df1 = Seq(
Person(0, "AgataZ", 0),
Person(1, "Iweta", 0),
Person(2, "Patryk", 2),
Person(9999, "Maria", 2),
Person(5, "John", 2),
Person(6, "Patsy", 2),
Person(7, "Gloria", 222),
Person(3333, "Maksym", 0)).toDF
val df2 = Seq(
Person(0, "Agata", 0),
Person(1, "Iweta", 0),
Person(2, "Patryk", 2),
Person(5, "John", 2),
Person(6, "Patsy", 333),
Person(7, "Gloria", 2),
Person(4444, "Hans", 3)).toDF
val joined = df1.join(df2, df1("personid") === df2("personid"), "outer")
val newNames = Seq("personId1", "personName1", "personCity1", "personId2", "personName2", "personCity2")
val df_Renamed = joined.toDF(newNames: _*)
// Some deliberate variation shown in approach for learning
val df_temp = df_Renamed
  .filter($"personCity1" =!= $"personCity2" || $"personName1" =!= $"personName2" ||
          $"personName1".isNull || $"personName2".isNull ||
          $"personCity1".isNull || $"personCity2".isNull)
  .select($"personId1", $"personName1".alias("Name"), $"personCity1",
          $"personId2", $"personName2".alias("Name2"), $"personCity2")
  .withColumn("PersonID", when($"personId1".isNotNull, $"personId1").otherwise($"personId2"))

val df_final = df_temp
  .withColumn("nameChange ?",
    when($"Name".isNull or $"Name2".isNull or $"Name" =!= $"Name2", "Yes").otherwise("No"))
  .withColumn("cityChange ?",
    when($"personCity1".isNull or $"personCity2".isNull or $"personCity1" =!= $"personCity2", "Yes").otherwise("No"))
  .drop("PersonId1")
  .drop("PersonId2")
df_final.show()
gives:
+------+-----------+------+-----------+--------+------------+------------+
| Name|personCity1| Name2|personCity2|PersonID|nameChange ?|cityChange ?|
+------+-----------+------+-----------+--------+------------+------------+
| Patsy| 2| Patsy| 333| 6| No| Yes|
|Maksym| 0| null| null| 3333| Yes| Yes|
| null| null| Hans| 3| 4444| Yes| Yes|
|Gloria| 222|Gloria| 2| 7| No| Yes|
| Maria| 2| null| null| 9999| Yes| Yes|
|AgataZ| 0| Agata| 0| 0| Yes| No|
+------+-----------+------+-----------+--------+------------+------------+
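The outer join is what keeps the rows that exist in only one of the two DataFrames (Maria, Maksym, Hans), which is why the null checks are needed both in the filter and in the change flags.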
I have a dataframe
df
+----------+----+----+----+---+---+----+---+---+-------+-------+
| WEEK|DIM1|DIM2| T1| T2| T3| T1| T2| T3|T1_diff|T2_diff|
+----------+----+----+----+---+---+----+---+---+-------+-------+
|2016-04-02| 14|NULL|9874|880| 23|9879|820| 45| -5| 60|
|2016-04-30| 14| FR|9875| 13| 34|9785| 9| 67| 90| 4|
+----------+----+----+----+---+---+----+---+---+-------+-------+
I want to do two things on this data frame:
Select only WEEK, DIM1, DIM2, T1_diff, T2_diff
Filter rows where T1_diff > 3 or T2_diff > 3
I am currently doing it like this -
val selectColumns = Seq("WEEK", "DIM1", "DIM2","T1_diff","T2_diff")
df.select(selectColumns.head, selectColumns.tail: _*).filter($"T1_diff" > 3 or $"T2_diff" > 3).show()
I have a use case where I have my targetColumns defined like this -
val targetColumns = Seq("T1_diff", "T2_diff")
I need to use the above sequence to apply the filter. It is a sequence because more columns can be added to the targetColumns list later.
I tried something like this -
df.filter(r => !targetColumns.map(x => col(x) > 3).isEmpty).show()
This doesn't seem to work. Can anyone tell me what is the best way to do this?
You can use reduce on the sequence of target columns after you've mapped each of them into a condition (col(name) > 3), using or to "merge" them together into one condition:
import org.apache.spark.sql.functions._
val selectColumns = Seq("id", "type", "DIM2","T1_diff","T2_diff")
val targetColumns = Seq("T1_diff", "T2_diff")
df.select(selectColumns.head, selectColumns.tail: _*)
.filter(targetColumns.map(name => col(name) > 3).reduce(_ or _))
.show()
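With the two target columns above, the reduced filter is simply col("T1_diff") > 3 or col("T2_diff") > 3, and it grows automatically as more names are added to targetColumns.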
You can create a condition String from the targetColumns list and then pass that String to the where function.
val targetColumns = List("T1_diff", "T2_diff")
val selectColumns = Seq("WEEK", "DIM1", "DIM2", "T1_diff", "T2_diff")
//create the where condition to filter the columns
val condition = targetColumns.map(c => s"$c>3").mkString(" OR ")
//select the columns and apply filter using where function.
df.select(selectColumns.head, selectColumns.tail: _*).where(condition).show(false)
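With the two target columns above, condition evaluates to the string "T1_diff>3 OR T2_diff>3", which where interprets as a SQL expression.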
You can simply do the following, building a string query:
val targetColumns = Seq("T1_diff", "T2_diff")
df.filter(targetColumns.map(x => s"$x > 3").mkString(" or ")).show()
and you can add as many columns to targetColumns as you want.
test = "a1-b1,a2-b2"
I want this string to be converted to a dataframe with columns A and B holding a1, a2 and b1, b2 respectively.
You can convert the string into an RDD, which is then converted into a DataFrame:
val s = "a1-b1,a2-b2"
val df = sc.parallelize(
  s.split(",").map(_.split("-")).map { case Array(a, b) => (a, b) }
).toDF("A", "B")
df.show
+---+---+
| A| B|
+---+---+
| a1| b1|
| a2| b2|
+---+---+
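As a side note, the intermediate RDD is not strictly required; here is a sketch of the same conversion built directly from a local Seq (assuming the same s and spark.implicits._ in scope):

val df2 = s.split(",").toSeq             // Seq("a1-b1", "a2-b2")
  .map(_.split("-"))                     // Seq(Array("a1","b1"), Array("a2","b2"))
  .map { case Array(a, b) => (a, b) }    // Seq(("a1","b1"), ("a2","b2"))
  .toDF("A", "B")
df2.show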
I have a dataframe in Spark using Scala that has a column that I need to split.
scala> test.show
+-------------+
|columnToSplit|
+-------------+
| a.b.c|
| d.e.f|
+-------------+
I need this column split out to look like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   a|   b|   c|
|   d|   e|   f|
+----+----+----+
I'm using Spark 2.0.0
Thanks
Try:
import sparkObject.spark.implicits._
import org.apache.spark.sql.functions.split
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2"),
$"_tmp".getItem(2).as("col3")
)
The important point to note here is that sparkObject is the SparkSession object you have already initialized, so the import statement has to be placed inline within the code, where that session is in scope, not before the class definition.
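For reference, a self-contained sketch of the same approach with the implicits imported from a locally created SparkSession (the session and app names here are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().appName("split-column-example").getOrCreate()
import spark.implicits._  // imported where `spark` is already a stable value

val df = Seq("a.b.c", "d.e.f").toDF("columnToSplit")

df.withColumn("_tmp", split($"columnToSplit", "\\."))
  .select(
    $"_tmp".getItem(0).as("col1"),
    $"_tmp".getItem(1).as("col2"),
    $"_tmp".getItem(2).as("col3"))
  .show()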
To do this programmatically, you can create a sequence of expressions with (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")) (assume you need 3 columns as result) and then apply it to select with : _* syntax:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
(0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
To keep all columns:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+-------------+---------+----+----+----+
|columnToSplit| temp|col0|col1|col2|
+-------------+---------+----+----+----+
| a.b.c|[a, b, c]| a| b| c|
| d.e.f|[d, e, f]| d| e| f|
+-------------+---------+----+----+----+
If you are using PySpark, use a list comprehension to replace the map in Scala:
df = spark.createDataFrame([['a.b.c'], ['d.e.f']], ['columnToSplit'])
from pyspark.sql.functions import col, split
(df.withColumn('temp', split('columnToSplit', '\\.'))
   .select(*(col('temp').getItem(i).alias(f'col{i}') for i in range(3)))
   .show())
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
A solution which avoids the select part. This is helpful when you just want to append the new columns:
import org.apache.spark.sql.functions.{col, split}

case class Message(others: String, text: String)
val r1 = Message("foo1", "a.b.c")
val r2 = Message("foo2", "d.e.f")
val records = Seq(r1, r2)
val df = spark.createDataFrame(records)
df.withColumn("col1", split(col("text"), "\\.").getItem(0))
.withColumn("col2", split(col("text"), "\\.").getItem(1))
.withColumn("col3", split(col("text"), "\\.").getItem(2))
.show(false)
+------+-----+----+----+----+
|others|text |col1|col2|col3|
+------+-----+----+----+----+
|foo1 |a.b.c|a |b |c |
|foo2 |d.e.f|d |e |f |
+------+-----+----+----+----+
Update: I highly recommend using Psidom's implementation to avoid splitting three times.
This appends columns to the original DataFrame, doesn't use select, and only splits once by using a temporary column:
import spark.implicits._
df.withColumn("_tmp", split($"columnToSplit", "\\."))
.withColumn("col1", $"_tmp".getItem(0))
.withColumn("col2", $"_tmp".getItem(1))
.withColumn("col3", $"_tmp".getItem(2))
.drop("_tmp")
This expands on Psidom's answer and shows how to do the split dynamically, without hardcoding the number of columns. This answer runs a query to calculate the number of columns.
import org.apache.spark.sql.functions.{col, max, size, split}
import spark.implicits._

val df = Seq(
"a.b.c",
"d.e.f"
).toDF("my_str")
.withColumn("letters", split(col("my_str"), "\\."))
val numCols = df
.withColumn("letters_size", size($"letters"))
.agg(max($"letters_size"))
.head()
.getInt(0)
df
.select(
(0 until numCols).map(i => $"letters".getItem(i).as(s"col$i")): _*
)
.show()
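The up-front aggregation costs one extra Spark job, but in return the code keeps working when the arrays contain a varying or unknown number of elements.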
We can write this using for with yield in Scala:
If your number of columns grows, just add the new name to desiredColumn and play with it. :)
val aDF = Seq("Deepak.Singh.Delhi").toDF("name")
val desiredColumn = Seq("name", "Lname", "City")
val colsize = desiredColumn.size
val columList = for (i <- 0 until colsize)
  yield split(col("name"), "\\.").getItem(i).alias(desiredColumn(i))
aDF.select(columList: _*).show(false)
Output:
+------+-----+-----+
|name  |Lname|City |
+------+-----+-----+
|Deepak|Singh|Delhi|
+------+-----+-----+
If you don't need the name column, then drop it and just use withColumn.
Example, without using the select statement:
Let's assume we have a dataframe with a set of columns and we want to split the column whose name is name.
import spark.implicits._

val columns = Seq("name", "age", "address")
val data = Seq(("Amit.Mehta", 25, "1 Main st, Newark, NJ, 92537"),
               ("Rituraj.Mehta", 28, "3456 Walnut st, Newark, NJ, 94732"))

val dfFromData = spark.createDataFrame(data).toDF(columns: _*)
dfFromData.printSchema()

// split the "name" field into first and last name while carrying the other fields along
val newDF = dfFromData.map(f => {
  val nameSplit = f.getAs[String](0).split("\\.").map(_.trim)
  (nameSplit(0), nameSplit(1), f.getAs[Int](1), f.getAs[String](2))
})

val finalDF = newDF.toDF("First Name", "Last Name", "Age", "Address")
finalDF.printSchema()
finalDF.show(false)
output:
I need to write one scenario in Spark using the Scala API.
I am passing a user-defined function over a DataFrame which processes each row of the data frame one by one and returns a tuple (Row, Row). How can I change an RDD[(Row, Row)] to a DataFrame of Row? See the code sample below.
Calling the map function:
val df_temp = df_outPut.map { x => AddUDF.add(x,date1,date2)}
UDF definition:
def add(x: Row,dates: String*): (Row,Row) = {
......................
........................
var result1,result2:Row = Row()
..........
return (result1,result2)
}
Now df_temp is an RDD of (Row, Row) tuples. My requirement is to turn it into a single RDD or DataFrame by breaking each tuple into individual records, i.e. an RDD[Row]. Appreciate your help.
You can use flatMap to flatten your Row tuples. Say we start from this example RDD:
rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))
val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
To convert it to a data frame:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(StructField("x", IntegerType, true) ::
                        StructField("y", IntegerType, true) :: Nil)
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
| x| y|
+---+---+
| 1| 2|
| 3| 4|
| 2| 1|
| 4| 2|
+---+---+
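Note that createDataFrame(flatRdd, schema) does not validate the rows up front; each Row must actually contain two integer values matching the declared schema, and a mismatch only surfaces when an action such as show runs.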