Converting dataframe column into onehotencoder like columns - scala

I am trying to convert a specific column into one-hot-encoded columns. For example:
-------------
Content|type|
-------------
alpha | A |
beta | B |
gamma | C |
theta | A |
zeta | C |
neta | B |
-------------
What I am trying to get is the following:
----------------------------
Content|type_A|type_B|type_C|
----------------------------
alpha | 1 | 0 | 0 |
beta | 0 | 1 | 0 |
gamma | 0 | 0 | 1 |
theta | 1 | 0 | 0 |
zeta | 0 | 0 | 1 |
neta | 0 | 1 | 0 |
-----------------------------

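I think pivot is what you are looking for:
import org.apache.spark.sql.functions.count
import spark.implicits._ // for toDF; assumes a SparkSession named spark
val df = Seq(
  ("alpha", "A"),
  ("beta", "B"),
  ("gamma", "C"),
  ("theta", "A"),
  ("zeta", "C"),
  ("neta", "B")
).toDF("Content", "type")
// one row per Content, one column per distinct value of type
val result = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0) // combinations that never occur come out as null, so replace them with 0
result.show(false)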
Output:
+-------+---+---+---+
|Content|A |B |C |
+-------+---+---+---+
|neta |0 |1 |0 |
|beta |0 |1 |0 |
|gamma |0 |0 |1 |
|theta |1 |0 |0 |
|zeta |0 |0 |1 |
|alpha |1 |0 |0 |
+-------+---+---+---+
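Note that the pivoted columns are named A, B and C rather than type_A, type_B and type_C, and that the row order is not preserved. To make the intended result concrete, here is a minimal plain-Scala sketch (standard library only, no Spark) of the same one-hot logic with prefixed column names. The object and method names are hypothetical and only serve the illustration.

```scala
object OneHotSketch {
  // rows are (Content, type) pairs; returns the header plus one indicator row per input row
  def oneHot(rows: Seq[(String, String)], prefix: String): (Seq[String], Seq[(String, Seq[Int])]) = {
    // distinct categories, sorted so the column order is stable
    val categories = rows.map(_._2).distinct.sorted
    val header = "Content" +: categories.map(c => s"${prefix}_$c")
    // 1 in the column matching the row's category, 0 everywhere else
    val encoded = rows.map { case (content, tpe) =>
      (content, categories.map(c => if (c == tpe) 1 else 0))
    }
    (header, encoded)
  }
}
```

In Spark itself, the same naming could be obtained by renaming the pivoted columns after the fact.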

Related

How to add some values in a dataframe in Scala Spark?

Here is the dataframe I have so far. Suppose there are 4 days in total, {1, 2, 3, 4}:
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 4 | 3 |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
+-------------+----------+------+
And what I want is
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | null |
| 1 | 4 | 3 |
| 2 | 1 | null |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | null |
+-------------+----------+------+
Is there a way to get this?
Say df1 is our main table:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |4 |3 |
|2 |2 |4 |
|2 |3 |5 |
+---+----+-----+
We can use the following transformations (sequence needs Spark 2.4 or later):
import org.apache.spark.sql.functions.{col, explode, lit, sequence}
val data = df1
  // we first group by key and attach the sequence 1 to 4 (your number of days) to each key
  .groupBy("key")
  .agg(sequence(lit(1), lit(4)).as("Time"))
  // we explode the sequence, thus creating every 'Time' for each 'key'
  .withColumn("Time", explode(col("Time")))
  // finally, we left join with our main table on 'key' and 'Time';
  // combinations missing from df1 get a null Value
  .join(df1, Seq("key", "Time"), "left")
To get this output:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |3 |null |
|1 |4 |3 |
|2 |1 |null |
|2 |2 |4 |
|2 |3 |5 |
|2 |4 |null |
+---+----+-----+
Which should be what you are looking for, good luck!
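For readers who want to check the logic without a cluster, the group, explode and left-join steps can be modelled in plain Scala. This is only a sketch of the semantics under the assumption that rows are (key, Time, Value) triples; the names FillGapsSketch and fillGaps are made up for the example, and None stands in for Spark's null.

```scala
object FillGapsSketch {
  // rows are (key, Time, Value); times is the full range every key should cover
  def fillGaps(rows: Seq[(Int, Int, Int)], times: Range): Seq[(Int, Int, Option[Int])] = {
    // index the existing values by (key, Time), which plays the role of the right side of the join
    val lookup = rows.map { case (k, t, v) => (k, t) -> v }.toMap
    val keys = rows.map(_._1).distinct.sorted
    // every key paired with every time (the "explode" step), then a left-join style lookup
    for (k <- keys; t <- times) yield (k, t, lookup.get((k, t)))
  }
}
```

Calling FillGapsSketch.fillGaps with the five sample rows and the range 1 to 4 yields eight rows, with None wherever the original table had no entry.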

Spark Scala input empty values according result from self joined dataframe query

I struggle to write my spark scala code to fill rows for which the coverage is empty using self join with conditions.
This is the data :
+----+--------------+----------+--------+
| ID | date_in_days | coverage | values |
+----+--------------+----------+--------+
| 1 | 2020-09-01 | | 0.128 |
| 1 | 2020-09-03 | 0 | 0.358 |
| 1 | 2020-09-04 | 0 | 0.035 |
| 1 | 2020-09-05 | | |
| 1 | 2020-09-06 | | |
| 1 | 2020-09-19 | | |
| 1 | 2020-09-12 | | |
| 1 | 2020-09-18 | | |
| 1 | 2020-09-11 | | |
| 1 | 2020-09-16 | | |
| 1 | 2020-09-21 | 13 | 0.554 |
| 1 | 2020-09-23 | | |
| 1 | 2020-09-30 | | |
+----+--------------+----------+--------+
Expected result :
+----+--------------+----------+--------+
| ID | date_in_days | coverage | values |
+----+--------------+----------+--------+
| 1 | 2020-09-01 | -1 | 0.128 |
| 1 | 2020-09-03 | 0 | 0.358 |
| 1 | 2020-09-04 | 0 | 0.035 |
| 1 | 2020-09-05 | 0 | |
| 1 | 2020-09-06 | 0 | |
| 1 | 2020-09-19 | 0 | |
| 1 | 2020-09-12 | 0 | |
| 1 | 2020-09-18 | 0 | |
| 1 | 2020-09-11 | 0 | |
| 1 | 2020-09-16 | 0 | |
| 1 | 2020-09-21 | 13 | 0.554 |
| 1 | 2020-09-23 | -1 | |
| 1 | 2020-09-30 | -1 | |
+----+--------------+----------+--------+
What I am trying to do:
For each distinct ID (dataframe partitioned by ID), with rows sorted by date:
Use case: the coverage column of a row is null. Let's call that row rowEmptycoverage.
Find the first row in the DF with date_in_days > rowEmptycoverage.date_in_days and with coverage >= 0. Let's call it rowFirstDateGreater.
Then, if rowFirstDateGreater.values > 500, set rowEmptycoverage.coverage to 0. Otherwise set it to -1.
I am lost on how to combine when, join and where to do this.
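I am assuming that you mean values > 0.500 and not values > 500, since every value in your sample is below 1. The rest of the logic is not fully specified, so I am also assuming that you search in the order of the column date_in_days and not in the order of the dataframe.
In any case, we can refine the solution to match your exact need. The overall idea is to use a Window to fetch the next date for which the coverage is not null, check whether its values meet the desired criterion, and update coverage accordingly. The window starts at the current row, but a row with null coverage never matches itself, so for those rows the search effectively starts at the following date.
It goes as follows: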
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'col and $"col" syntax; assumes a SparkSession named spark
val win = Window.partitionBy("ID").orderBy("date_in_days")
  .rangeBetween(Window.currentRow, Window.unboundedFollowing)
df
  // creating a struct binding coverage and values
  .withColumn("cov_str", when('coverage.isNull, lit(null))
    .otherwise(struct('coverage, 'values)))
  // finding the first row (starting from the current date, in order of
  // date_in_days) for which the coverage is not null
  .withColumn("next_cov_str", first('cov_str, ignoreNulls = true) over win)
  // updating coverage. We keep the original value if not null, put 0 if values
  // meets the criterion (that you can change) and -1 otherwise.
  .withColumn("coverage", coalesce(
    'coverage,
    when($"next_cov_str.values" > 0.500, lit(0)),
    lit(-1)
  ))
  .show(false)
+---+-------------------+--------+------+-----------+------------+
|ID |date_in_days |coverage|values|cov_str |next_cov_str|
+---+-------------------+--------+------+-----------+------------+
|1 |2020-09-01 00:00:00|-1 |0.128 |null |[0, 0.358] |
|1 |2020-09-03 00:00:00|0 |0.358 |[0, 0.358] |[0, 0.358] |
|1 |2020-09-04 00:00:00|0 |0.035 |[0, 0.035] |[0, 0.035] |
|1 |2020-09-05 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-06 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-11 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-12 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-16 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-18 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-19 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-21 00:00:00|13 |0.554 |[13, 0.554]|[13, 0.554] |
|1 |2020-09-23 00:00:00|-1 |null |null |null |
|1 |2020-09-30 00:00:00|-1 |null |null |null |
+---+-------------------+--------+------+-----------+------------+
You may then use drop("cov_str", "next_cov_str") but I leave them here for clarity.
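To make the rule easier to reason about, here is a rough plain-Scala model of what the window computes for a single ID. It is a sketch under the same assumptions as above (threshold on values, search in date order), not Spark code. Rec, fill and threshold are hypothetical names, and Option stands in for nullable columns.

```scala
object CoverageFillSketch {
  final case class Rec(date: String, coverage: Option[Int], values: Option[Double])

  def fill(rows: Seq[Rec], threshold: Double): Seq[Rec] = {
    // ISO-formatted dates sort correctly as plain strings
    val sorted = rows.sortBy(_.date)
    // walk from the latest date backwards, carrying the nearest later row that has a coverage
    val (_, out) = sorted.foldRight((Option.empty[Rec], List.empty[Rec])) {
      case (rec, (next, acc)) =>
        if (rec.coverage.isDefined) {
          // rows with a coverage are kept and become the candidate for earlier rows
          (Some(rec), rec :: acc)
        } else {
          // 0 if the next covered row has values above the threshold, -1 otherwise
          // (including when there is no such row or its values are missing)
          val cov = if (next.flatMap(_.values).exists(_ > threshold)) 0 else -1
          (next, rec.copy(coverage = Some(cov)) :: acc)
        }
    }
    out
  }
}
```

With threshold 0.5, the row for 2020-09-01 gets -1 (the next covered row has values 0.358), the rows between 2020-09-05 and 2020-09-19 get 0 (next covered values 0.554), and the rows after 2020-09-21 get -1 because nothing follows them.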

Creating a new column based on a window and a condition in Spark

INITIAL DATA FRAME:
+------------------------------+----------+-------+
| Timestamp | Property | Value |
+------------------------------+----------+-------+
| 2019-09-01T01:36:57.000+0000 | X | N |
| 2019-09-01T01:37:39.000+0000 | A | 3 |
| 2019-09-01T01:42:55.000+0000 | X | Y |
| 2019-09-01T01:53:44.000+0000 | A | 17 |
| 2019-09-01T01:55:34.000+0000 | A | 9 |
| 2019-09-01T01:57:32.000+0000 | X | N |
| 2019-09-01T02:59:40.000+0000 | A | 2 |
| 2019-09-01T02:00:03.000+0000 | A | 16 |
| 2019-09-01T02:01:40.000+0000 | X | Y |
| 2019-09-01T02:04:03.000+0000 | A | 21 |
+------------------------------+----------+-------+
FINAL DATA FRAME:
+------------------------------+----------+-------+---+
| Timestamp | Property | Value | X |
+------------------------------+----------+-------+---+
| 2019-09-01T01:37:39.000+0000 | A | 3 | N |
| 2019-09-01T01:53:44.000+0000 | A | 17 | Y |
| 2019-09-01T01:55:34.000+0000 | A | 9 | Y |
| 2019-09-01T02:00:03.000+0000 | A | 16 | N |
| 2019-09-01T02:04:03.000+0000 | A | 21 | Y |
| 2019-09-01T02:59:40.000+0000 | A | 2 | Y |
+------------------------------+----------+-------+---+
Basically, I have a Timestamp, a Property, and a Value field. The Property is either A or X, and each row has a value. I would like to have a new DataFrame with a fourth column named X, based on the values of the X-property rows.
I go through the rows from the earliest to the latest.
When I encounter a row with the X property, I store its value and insert it into the X column.
IF I encounter an A-property row: I insert the stored value from the previous step into the X column.
ELSE (meaning I encounter an X-property row): I update the stored value (since it is more recent) and insert the new stored value into the X column.
I keep doing so until I have gone through the whole dataframe.
Finally, I remove the rows with the X property to get the final dataframe shown above.
I am sure there is an efficient way to do this with a Window function.
Create a Temp column that holds Value for X rows and null for A rows. Then use a window to get the last non-null Temp value up to the current row. Finally, keep only the rows whose Property is "A".
scala> val df = Seq(
| ("2019-09-01T01:36:57.000+0000", "X", "N"),
| ("2019-09-01T01:37:39.000+0000", "A", "3"),
| ("2019-09-01T01:42:55.000+0000", "X", "Y"),
| ("2019-09-01T01:53:44.000+0000", "A", "17"),
| ("2019-09-01T01:55:34.000+0000", "A", "9"),
| ("2019-09-01T01:57:32.000+0000", "X", "N"),
| ("2019-09-01T02:59:40.000+0000", "A", "2"),
| ("2019-09-01T02:00:03.000+0000", "A", "16"),
| ("2019-09-01T02:01:40.000+0000", "X", "Y"),
| ("2019-09-01T02:04:03.000+0000", "A", "21")
| ).toDF("Timestamp", "Property", "Value").withColumn("Temp", when($"Property" === "X", $"Value").otherwise(null))
df: org.apache.spark.sql.DataFrame = [Timestamp: string, Property: string ... 2 more fields]
scala> df.show(false)
+----------------------------+--------+-----+----+
|Timestamp |Property|Value|Temp|
+----------------------------+--------+-----+----+
|2019-09-01T01:36:57.000+0000|X |N |N |
|2019-09-01T01:37:39.000+0000|A |3 |null|
|2019-09-01T01:42:55.000+0000|X |Y |Y |
|2019-09-01T01:53:44.000+0000|A |17 |null|
|2019-09-01T01:55:34.000+0000|A |9 |null|
|2019-09-01T01:57:32.000+0000|X |N |N |
|2019-09-01T02:59:40.000+0000|A |2 |null|
|2019-09-01T02:00:03.000+0000|A |16 |null|
|2019-09-01T02:01:40.000+0000|X |Y |Y |
|2019-09-01T02:04:03.000+0000|A |21 |null|
+----------------------------+--------+-----+----+
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val overColumns = Window.orderBy("Timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@1b759662
scala> df.withColumn("X", last($"Temp",true).over(overColumns)).show(false)
+----------------------------+--------+-----+----+---+
|Timestamp |Property|Value|Temp|X |
+----------------------------+--------+-----+----+---+
|2019-09-01T01:36:57.000+0000|X |N |N |N |
|2019-09-01T01:37:39.000+0000|A |3 |null|N |
|2019-09-01T01:42:55.000+0000|X |Y |Y |Y |
|2019-09-01T01:53:44.000+0000|A |17 |null|Y |
|2019-09-01T01:55:34.000+0000|A |9 |null|Y |
|2019-09-01T01:57:32.000+0000|X |N |N |N |
|2019-09-01T02:00:03.000+0000|A |16 |null|N |
|2019-09-01T02:01:40.000+0000|X |Y |Y |Y |
|2019-09-01T02:04:03.000+0000|A |21 |null|Y |
|2019-09-01T02:59:40.000+0000|A |2 |null|Y |
+----------------------------+--------+-----+----+---+
scala> df.withColumn("X", last($"Temp",true).over(overColumns)).filter($"Property" === "A").show(false)
+----------------------------+--------+-----+----+---+
|Timestamp |Property|Value|Temp|X |
+----------------------------+--------+-----+----+---+
|2019-09-01T01:37:39.000+0000|A |3 |null|N |
|2019-09-01T01:53:44.000+0000|A |17 |null|Y |
|2019-09-01T01:55:34.000+0000|A |9 |null|Y |
|2019-09-01T02:00:03.000+0000|A |16 |null|N |
|2019-09-01T02:04:03.000+0000|A |21 |null|Y |
|2019-09-01T02:59:40.000+0000|A |2 |null|Y |
+----------------------------+--------+-----+----+---+
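The "last non-null value so far" idea is a forward fill. As a cross-check of the logic, here is a small plain-Scala sketch of it using scanLeft, with no Spark involved. The names ForwardFillSketch and attachX are invented for the example, and rows are assumed to be (Timestamp, Property, Value) triples whose timestamps sort correctly as strings.

```scala
object ForwardFillSketch {
  // returns the "A" rows, each with the most recent X value seen at or before it
  def attachX(rows: Seq[(String, String, String)]): Seq[(String, String, String, Option[String])] = {
    val sorted = rows.sortBy(_._1)
    // running state: the latest X value, updated on X rows and carried over on A rows
    val withInitial = sorted.scanLeft(Option.empty[String]) {
      case (lastX, (_, prop, value)) => if (prop == "X") Some(value) else lastX
    }
    val running = withInitial.tail // drop the initial empty state so it lines up with sorted
    // keep only the A rows, mirroring the final filter
    sorted.zip(running).collect {
      case ((ts, "A", v), x) => (ts, "A", v, x)
    }
  }
}
```

An A row that appears before any X row gets None here, which corresponds to a null X in the Spark version.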

How to count the number of missing values in each row of a data frame -spark scala?

I want to count the number of missing values in each row of a data frame in spark scala.
Code:
val samplesqlDF = spark.sql("SELECT * FROM sampletable")
samplesqlDF.show()
Input Dataframe:
------------------------------------------------------------------
| name | age | degree | Place |
| -----------------------------------------------------------------|
| Ram | | MCA | Bangalore |
| | 25 | | |
| | 26 | BE | |
| Raju | 21 | Btech | Chennai |
-----------------------------------------------------------------
The Output Data frame (Row Level Count) as follows:
-----------------------------------------------------------------
| name | age | degree | Place | rowcount |
| ----------------------------------------------------------------|
| Ram | | MCA | Bangalore | 1 |
| | 25 | | | 3 |
| | 26 | BE | | 2 |
| Raju | 21 | Btech | Chennai | 0 |
-----------------------------------------------------------------
I am a beginner to scala and spark. Thanks in advance.
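Looks like you want to get the null count dynamically. Check this out:
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._ // for toDF; assumes a SparkSession named spark
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","Place")
df.show(false)
// add one 0/1 flag column per original column, named <column>_null
val df2 = df.columns.foldLeft(df)( (df,c) => df.withColumn(c+"_null", when(col(c).isNull,1).otherwise(0) ) )
df2.createOrReplaceTempView("student")
// builds " name_null+age_null+degree_null+Place_null as null_count "
val sql_str_null = df.columns.map( x => x+"_null").mkString(" ","+"," as null_count ")
// builds "select name,age,degree,Place , <sum of flags> from student"
val sql_str_full = df.columns.mkString( "select ", ",", " , " + sql_str_null + " from student")
spark.sql(sql_str_full).show(false)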
Output:
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+
Another possibility, which also counts "" as missing and does not use foldLeft, just to demonstrate the point:
import org.apache.spark.sql.functions._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,""),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","place")
// Count per row the null or "" columns!
val null_counter = Seq("name", "age", "degree", "place").map(x => when(col(x) === "" || col(x).isNull , 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("nulls_cnt", null_counter)
df2.show(false)
returns:
+----+----+------+---------+---------+
|name|age |degree|place |nulls_cnt|
+----+----+------+---------+---------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null | |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+---------+
A simplified version of the one suggested by @stack0114106 is:
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),
(null,"26","BE",null),("Raju","21","Btech","Chennai"))
.toDF("name","age","degree","Place")
.withColumn("null_count", lit(0))
val df2 = df.columns.foldLeft(df)((df,c) =>
df.withColumn("null_count",
when(col(c).isNull,$"null_count" + 1).otherwise($"null_count")
)
)
df2.show(false)
the output is
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+
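All three answers boil down to the same per-row rule: count the cells that are null (or, in the second answer, also empty). As a tiny plain-Scala illustration of that rule, with None standing in for null and hypothetical names throughout:

```scala
object MissingCountSketch {
  // a cell counts as missing when it is None or holds an empty string
  def missingCount(row: Seq[Option[String]]): Int =
    row.count(cell => cell.forall(_.isEmpty))
}
```

For instance, a row of None, Some("25"), None, Some("") has three missing cells under this rule.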

very specific requirement for outlier treatment in Spark Dataframe

I have a very specific requirement for outlier treatment in a Spark DataFrame (Scala).
I want to treat just the first outlier and make it equal in size to the second group.
Input:
+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A |r1 | da |
|A |r1 | ds |
|A |r1 | s |
|A |r1 | f |
|A |r1 | v |
|A |r2 | s |
|A |r2 | s |
|A |r2 | c |
|A |r3 | s |
|A |r3 | s |
|A |r4 | s |
|A |r5 | c |
|A |r6 | s |
|A |r7 | s |
|A |r8 | s |
+------+-----------------+------+
Now, per market and responseVariable, I want to treat just the first outlier.
Group per market and responseVariable:
+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A |r1 | 5 |
|A |r2 | 3 |
|A |r3 | 2 |
|A |r4 | 1 |
|A |r5 | 1 |
|A |r6 | 1 |
|A |r7 | 1 |
|A |r8 | 1 |
+------+-----------------+------+
I want to treat the outlier for the group market=A and responseVariable=r1 in the actual dataset: randomly remove records from group 1 until its count equals that of group 2.
Expected output:
+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A |r1 | da |
|A |r1 | s |
|A |r1 | v |
|A |r2 | s |
|A |r2 | s |
|A |r2 | c |
|A |r3 | s |
|A |r3 | s |
|A |r4 | s |
|A |r5 | c |
|A |r6 | s |
|A |r7 | s |
|A |r8 | s |
+------+-----------------+------+
group:
+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A |r1 | 3 |
|A |r2 | 3 |
|A |r3 | 2 |
|A |r4 | 1 |
|A |r5 | 1 |
|A |r6 | 1 |
|A |r7 | 1 |
|A |r8 | 1 |
+------+-----------------+------+
I want to repeat this for multiple markets.
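You will have to know the names and counts of the first and second groups, which can be done as below:
import org.apache.spark.sql.functions._
import spark.implicits._ // for $"count"; assumes a SparkSession named spark
// the two largest (responseVariable, count) pairs, largest first
val first_two_values = df.groupBy("market", "responseVariable").agg(count("blabla").as("count")).orderBy($"count".desc).take(2).map(row => (row.getString(1), row.getLong(2))).toList
val rowsToFilter = first_two_values(0)._1 // name of the largest group
val countsToFilter = first_two_values(1)._2 // size of the second-largest group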
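Once you know the first two groups, you need to filter the extra rows out of the first group. This can be done by generating a row_number and dropping the rows numbered above the second group's count, as below. Note that ordering the window by blabla makes the choice of removed rows deterministic rather than random, and that the ranking above is done over the whole dataframe, so it fits the single-market sample.
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("market","responseVariable").orderBy("blabla")
df.withColumn("rank", row_number().over(windowSpec))
  // false for rows of the largest group that exceed the second group's count
  .withColumn("rank", when(col("rank") > countsToFilter && col("responseVariable") === rowsToFilter, false).otherwise(true))
  .filter(col("rank"))
  .drop("rank")
  .show(false)
This should fulfil your requirement.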
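Since the question asks for random removal and for several markets, here is a plain-Scala sketch of that variant of the logic, applied per market. It is only a model of the intended behaviour on in-memory rows, not Spark code. TrimLargestGroupSketch and trim are hypothetical names, rows are assumed to be (market, responseVariable, blabla) triples, and the original row order is not preserved.

```scala
import scala.util.Random

object TrimLargestGroupSketch {
  // for each market, randomly shrink the largest responseVariable group to the size of the second largest
  def trim(rows: Seq[(String, String, String)], rng: Random): Seq[(String, String, String)] =
    rows.groupBy(_._1).toSeq.flatMap { case (_, marketRows) =>
      // groups of this market, largest first
      val bySize = marketRows.groupBy(_._2).values.toSeq.sortBy(g => -g.size)
      bySize match {
        case largest +: second +: rest =>
          // keep a random subset of the largest group, everything else unchanged
          rng.shuffle(largest).take(second.size) ++ second ++ rest.flatten
        case other =>
          other.flatten // fewer than two groups: nothing to trim
      }
    }
}
```

Passing a seeded Random, for example new Random(42), makes the selection reproducible between runs.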