I have a dataframe like below:
Rowkey  timestamp  col_1  col_2  col_3  ...  col_n
1234    165789     20     null   30     ...  null
1234    155789     20     20     null   ...  40
1234    145789     20     10     30     ...  50
and expect to transform it into the following dataframe:
Rowkey  timestamp  col_1  col_2  col_3  ...  col_n
1234    165789     20     20     30     ...  40
I want the latest timestamp per Rowkey. Also, if a cell is null and a following row with the same Rowkey has a value in that column, that value should be used.
I am using Spark with Scala.
Here's my take:
Use a window function to pick the first non-null value of each column within each Rowkey partition, ordered by timestamp descending, then drop duplicates so only one row per Rowkey remains.
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val simpleData: Seq[(String, Integer,Integer,Integer,Integer,Integer)] = Seq(
("1234",165789,20,null,30, null),
("1234",155789,10,20,null, 40),
("1234",145789,2,10,30, 50),
("123e4",145789,2,10,30, 50)
)
val someDF = simpleData.toDF("Rowkey","timestamp","col_1","col_2","col_3","col_4")
someDF.show()
val listCols= List("Rowkey","timestamp","col_1","col_2","col_3","col_4")
val windowSpec = Window.partitionBy("Rowkey").orderBy($"timestamp".desc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

someDF.select(
    listCols.map(m => first(m, ignoreNulls = true).over(windowSpec).alias(m)): _*
  )
  .dropDuplicates()
  .show()
Output (first the original someDF, then the deduplicated result):
+------+---------+-----+-----+-----+-----+
|Rowkey|timestamp|col_1|col_2|col_3|col_4|
+------+---------+-----+-----+-----+-----+
| 1234| 165789| 20| null| 30| null|
| 1234| 155789| 10| 20| null| 40|
| 1234| 145789| 2| 10| 30| 50|
| 123e4| 145789| 2| 10| 30| 50|
+------+---------+-----+-----+-----+-----+
+------+---------+-----+-----+-----+-----+
|Rowkey|timestamp|col_1|col_2|col_3|col_4|
+------+---------+-----+-----+-----+-----+
| 1234| 165789| 20| 20| 30| 40|
| 123e4| 145789| 2| 10| 30| 50|
+------+---------+-----+-----+-----+-----+
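A small optional variant of the select above, offered as a stylistic note: after the window projection every row in a Rowkey partition carries identical values, so dropDuplicates() is enough, but naming the key makes the intent explicit.
// Same result as above; deduplicate explicitly on the key column.
someDF.select(listCols.map(m => first(m, ignoreNulls = true).over(windowSpec).alias(m)): _*)
  .dropDuplicates("Rowkey")
  .show()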
This is not about unique IDs, so I don't mean the increasing-unique-number API; I'm trying to solve it with a customized query.
Consider a given value like 30. The current dataframe df needs a new column called hop_number, where each field in that column, from top to bottom, starts from 30 and increments by 2, controlled by 2 parameters:
x -> the start number, here 30
y -> the step (offset), here 2
The expected column looks like this:
hop_number
---------------
30
32
34
36
38
40
......
I know that with an RDD we can use a map to handle the job, but how do I do the same with a DataFrame at minimal cost?
df.column("hop_number", 30 + map(x => x + 2)) // pseudo code
Check the code below.
scala> import org.apache.spark.sql.expressions._
scala> import org.apache.spark.sql.functions._
scala> val x = lit(30)
x: org.apache.spark.sql.Column = 30
scala> val y = lit(2)
y: org.apache.spark.sql.Column = 2
scala> df.withColumn("hop_number",(x + (row_number().over(Window.orderBy(lit(1)))-1) * y)).show(false)
+----------+
|hop_number|
+----------+
|30 |
|32 |
|34 |
|36 |
|38 |
+----------+
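One caveat, added here as a note: row_number() over Window.orderBy(lit(1)) defines no partitioning, so Spark moves all rows into a single partition to compute it, which can get expensive. For large data, a rough sketch using the RDD's zipWithIndex (assuming the same df, with x = 30 and y = 2) avoids that single-partition shuffle:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val start = 30L // x: the start number
val step  = 2L  // y: the step

// zipWithIndex assigns consecutive 0-based indices without collapsing the data to one partition.
val withHop = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (start + idx * step))
}
val hopDf = spark.createDataFrame(
  withHop,
  StructType(df.schema.fields :+ StructField("hop_number", LongType, nullable = false))
)
hopDf.show(false)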
Assuming you have a grouping and ordering column, you can use the window function.
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from pyspark.sql import Window
tst = sqlContext.createDataFrame([(1,1,14), (1,2,4), (1,3,10), (2,1,90), (7,2,30), (2,3,11)],
                                 schema=['group', 'order', 'value'])
w = Window.partitionBy('group').orderBy('order')
tst_hop = tst.withColumn("temp", F.sum(F.lit(2)).over(w)).withColumn("hop_number", F.col('temp') + 28)
The results:
tst_hop.show()
+-----+-----+-----+----+----------+
|group|order|value|temp|hop_number|
+-----+-----+-----+----+----------+
| 1| 1| 14| 2| 30|
| 1| 2| 4| 4| 32|
| 1| 3| 10| 6| 34|
| 2| 1| 90| 2| 30|
| 2| 3| 11| 4| 32|
| 7| 2| 30| 2| 30|
+-----+-----+-----+----+----------+
If you need a different approach, please provide a sample data of the dataframe.
I have to fill the leading null values of a column with the first non-null value that follows them. This logic applies only to the first run of consecutive null values in the column.
I have a dataframe similar to the one below.
// I replaced null with 0 in the value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this dataframe I am expecting the output below:
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16 |exA |30 | // Change the value 0 to 16 at value column
|16 |exB |22 | // Change the value 0 to 16 at value column
|16 |exC |19 | // Change the value 0 to 16 at value column
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0    |exG |12  | // value should not be changed here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.
You can use a window function for this purpose:
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.orderBy($"col2")
.show(10)
Will result in
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
The df.orderBy($"col2") expression is needed only to display the final results in the right order. You can skip it if you don't care about the final order.
UPDATE
To get exactly what you need, you should use slightly more complicated code:
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)

// IntermediateResult holds the first non-zero value seen so far; it stays null for the leading zeros.
// Result then borrows the first non-zero value found later for those leading rows and keeps the
// original value everywhere else.
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
  .orderBy($"col2")
  .show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+
I think you need to take the first non-null or non-zero value based on col2's order. Please find the script below; I have registered a temporary view in Spark's memory to write SQL against.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
df.createOrReplaceTempView("table_df")

spark.sql("""
  with cte as (
    select *, row_number() over (order by col2) as rno from table_df
  )
  select
    case
      when value = 0 and rno < (select min(rno) from cte where value != 0)
        then (select value from cte where rno = (select min(rno) from cte where value != 0))
      else value
    end as value,
    col2,
    col3
  from cte
""").show(df.count.toInt, false)
Please let me know if you have any questions.
I added a new column with an incremental ID to your DF:
import org.apache.spark.sql.functions._
import spark.implicits._

val df_1 = Seq((0,"exA",30),
               (0,"exB",22),
               (0,"exC",19),
               (16,"exD",13),
               (5,"exE",28),
               (6,"exF",26),
               (0,"exG",12),
               (13,"exH",53))
  .toDF("value", "col2", "col3")
  .withColumn("UniqueID", monotonically_increasing_id())
Filter the DF to keep only non-zero values:
val df_2 = df_1.filter("value != 0")
create a variable "limit" to limit first N row that we need and variable Nvar for the first non-zero value
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
Create a DF with a column of the same name ("value"), populated conditionally:
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))
df2000.drop('jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec').show()
Now it shows the dataframe without the deleted columns.
df2000.show()
But when I run the show command alone to check the table, it still comes back with the deleted columns.
drop is not a side-effecting function. It returns a new DataFrame with the specified columns removed, so you have to assign the new DataFrame to a variable to reference it later, as shown below.
>>> df2000 = spark.createDataFrame([('a',10,20,30),('a',10,20,30),('a',10,20,30),('a',10,20,30)],['key', 'jan', 'feb', 'mar'])
>>> cols = ['jan', 'feb', 'mar']
>>> df2000.show()
+---+---+---+---+
|key|jan|feb|mar|
+---+---+---+---+
| a| 10| 20| 30|
| a| 10| 20| 30|
| a| 10| 20| 30|
| a| 10| 20| 30|
+---+---+---+---+
>>> from functools import reduce
>>> df2000_dropped_col = reduce(lambda x, y: x.drop(y), cols, df2000)
>>> df2000_dropped_col.show()
+---+
|key|
+---+
| a|
| a|
| a|
| a|
+---+
now doing a show on the new dataframe will yield the desired result with all the month columns dropped.
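As a side note (not part of the answer above): drop accepts several column names at once in recent Spark versions, so the reduce is optional. A rough Scala sketch, assuming an equivalent df2000 and the same list of month columns:
// drop takes varargs, so all month columns can be removed in a single call.
val monthCols = Seq("jan", "feb", "mar")
val df2000Dropped = df2000.drop(monthCols: _*)
df2000Dropped.show()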
Dear all,
I'm new to Spark Scala, and I have a DF of two columns, "UG" and "Counts", and I would like to obtain the third, as shown in the list below.
DF columns: UG, Counts, CUG

UG    Counts  CUG
of    12      4
of    23      4
the   134     3
love  68      2
pain  3       1
the   18      3
love  100     2
of    23      4
the   12      3
of    11      4
I need to add a new column called "CUG", the third one shown above, where CUG(i) is the number of times that the string in UG(i) appears in the whole UG column.
I tried with the following scheme:
With the DF as in the previous table, I wrote a SQL UDF to count the number of times a string appears in the column "UG", that is:
val NW1 = (w1: String) => {
  df.filter($"UG".like(w1.substring(1, w1.length - 1))).count()
}: Long
val sqlfunc = udf(NW1)
val df2= df.withColumn("CUG",sqlfunc(col("UG")))
But when I tried it, it didn't work: I got a NullPointerException. The UDF worked in isolation, but not inside the DataFrame operation.
What can I do to obtain the desired result using DataFrames?
Thanks In advance.
jm3
So what you can do is first count the number of rows grouped by the UG column, which gives the third column you need, and then join the result with the original data frame. You can rename the column with the withColumnRenamed function if you want.
scala> import org.apache.spark.sql.functions._
scala> myDf.show()
+----+------+
| UG|Counts|
+----+------+
| of| 12|
| of| 23|
| the| 134|
|love| 68|
|pain| 3|
| the| 18|
|love| 100|
| of| 23|
| the| 12|
| of| 11|
+----+------+
scala> myDf.join(myDf.groupBy("UG").count().withColumnRenamed("count", "CUG"), "UG").show()
+----+------+---+
| UG|Counts|CUG|
+----+------+---+
| of| 12| 4|
| of| 23| 4|
| the| 134| 3|
|love| 68| 2|
|pain| 3| 1|
| the| 18| 3|
|love| 100| 2|
| of| 23| 4|
| the| 12| 3|
| of| 11| 4|
+----+------+---+
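As an alternative sketch (not part of the answer above), the same CUG column can be computed without a join by counting over a window partitioned by UG:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

// Count the rows of each UG partition directly, avoiding the groupBy + join round trip.
myDf.withColumn("CUG", count(lit(1)).over(Window.partitionBy("UG"))).show()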
I have the following DataFrame:
January | February | March
-----------------------------
10 | 10 | 10
20 | 20 | 20
50 | 50 | 50
I'm trying to add a column to this which is the sum of the values of each row.
January | February | March | TOTAL
----------------------------------
10 | 10 | 10 | 30
20 | 20 | 20 | 60
50 | 50 | 50 | 150
As far as I can see, all the built in aggregate functions seem to be for calculating values in single columns. How do I go about using values across columns on a per row basis (using Scala)?
I've gotten as far as
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
You were very close with this:
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
Instead, try this:
val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")
I think this is the best of the answers, because it is as fast as the answer with the hard-coded SQL query and as convenient as the one that uses the UDF. It's the best of both worlds, and I didn't even add a full line of code!
Alternatively, using Hugo's approach and example, you can create a UDF that receives any number of columns and sums them all.
from functools import reduce
from pyspark.sql.functions import udf

def superSum(*cols):
    return reduce(lambda a, b: a + b, cols)

add = udf(superSum)
df.withColumn('total', add(*[df[x] for x in df.columns])).show()
+-------+--------+-----+-----+
|January|February|March|total|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
This code is in Python, but it can be easily translated:
# First we create a RDD in order to create a dataFrame:
rdd = sc.parallelize([(10, 10,10), (20, 20,20)])
df = rdd.toDF(['January', 'February', 'March'])
df.show()
# Here, we create a new column called 'TOTAL' which has results
# from add operation of columns df.January, df.February and df.March
df.withColumn('TOTAL', df.January + df.February + df.March).show()
Output:
+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
| 10| 10| 10|
| 20| 20| 20|
+-------+--------+-----+
+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
You can also create a User Defined Function if you want; here is a link to the Spark documentation:
UserDefinedFunction (udf)
Working Scala example with dynamic column selection:
import sqlContext.implicits._
import org.apache.spark.sql.functions.col
val rdd = sc.parallelize(Seq((10, 10, 10), (20, 20, 20)))
val df = rdd.toDF("January", "February", "March")
df.show()
+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
| 10| 10| 10|
| 20| 20| 20|
+-------+--------+-----+
val sumDF = df.withColumn("TOTAL", df.columns.map(c => col(c)).reduce((c1, c2) => c1 + c2))
sumDF.show()
+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
You can use expr() for this. In Scala use:
df.withColumn("TOTAL", expr("January+February+March"))