How to transform this dataset to the following dataset - scala

Input
+------+------+------+------+
|emp_name|emp_area| dept|zip|
+------+------+------+------+
|ram|USA|"Sales"|805912|
|sham|USA|"Sales"|805912|
|ram|Canada|"Marketing"|805912|
|ram|USA|"Sales"|805912|
|sham|USA|"Marketing"|805912|
+------+------+------+------
Desired output
feature |Top1 name |Top 1 value1|Top2 name|top 2 value|
emp_name ram |3|sham |2
emp_area Usa |4|canada |1
dept sales|3|Marketing|3
zip 805912|5|NA|NA
I started with dynamically generating the count for each one of them but unable to store them in a dataset
val features=ds.columns.toList
for (e <- features) {
val ds1=ds.groupBy(e).count().sort(desc("count")).limit(5).withColumnRenamed("count", e+"_count")
}
Now how to collect all the values into one dataframe and transform to the output?

Here's a slightly verbose approach. You can map each column to a dataframe with one row, which corresponds to the row in the desired output. Add NA columns if necessary. Convert the column names to the desired ones, and finally do a unionAll to combine the dataframes (one row each).
import org.apache.spark.sql.expressions.Window
val top = 2
val result = ds.columns.map(
c => ds.groupBy(c).count()
.withColumn("rn", row_number().over(Window.orderBy(desc("count"))))
.filter(s"rn <= $top")
.groupBy().pivot("rn")
.agg(first(col(c)), first(col("count")))
.select(lit(c), col("*"))
).map(df =>
if (df.columns.size != 1 + top*2)
df.select(List(col("*")) ::: (1 to (top*2+1 - df.columns.size)).toList.map(x => lit("NA")): _*)
else df
).map(df =>
df.toDF(List("feature") ::: (1 to top).toList.flatMap(x => Seq(s"top$x name", s"top$x value")): _*)
).reduce(_ unionAll _)
result.show
+--------+---------+----------+---------+----------+
| feature|top1 name|top1 value|top2 name|top2 value|
+--------+---------+----------+---------+----------+
|emp_name| ram| 3| sham| 2|
|emp_area| USA| 4| Canada| 1|
| dept| Sales| 3|Marketing| 2|
| zip| 805912| 5| NA| NA|
+--------+---------+----------+---------+----------+

Related

how to find total count of nulls present in a DataFrame using Scala

I'm looking for a command in Scala to find the sum of nulls in all the columns present in a DataFrame.
In python I can get it using the command:
Df.isnull().sum()
Can you let me know the scala command for the same?
AFAIK, there is no spark function to do that but you can count the number of null values in each column and then compute the sum of all these values.
Here is sample data:
val df = spark.range(5).select(
when('id % 2 === 0, 'id) as "mod2",
when('id %3 === 0, 'id) as "mod3"
)
df.show
+----+----+
|mod2|mod3|
+----+----+
| 0| 0|
|null|null|
| 2|null|
|null| 3|
| 4|null|
+----+----+
And here is a solution:
val result = df.select(
df.columns
// null count per column
.map(c => sum(isnull(col(c)) cast "int"))
// computing the sum
.reduce(_+_) as "total_nulls"
)
result.show
+-----------+
|total_nulls|
+-----------+
| 5|
+-----------+
NB: use result.head.getAs[Long](0) to get the total out of the result dataframe.

Maximum of some specific columns in a spark scala dataframe

I have a dataframe like this.
+---+---+---+---+
| M| c2| c3| d1|
+---+---+---+---+
| 1|2_1|4_3|1_2|
| 2|3_4|4_5|1_2|
+---+---+---+---+
I have to transform this df should look like below. Here, c_max = max(c2,c3) after splitting with _.ie, all the columns (c2 and c3) have to be splitted with _ and then getting the max.
In the actual scenario, I have 50 columns ie, c2,c3....c50 and need to take the max from this.
+---+---+---+---+------+
| M| c2| c3| d1|c_Max |
+---+---+---+---+------+
| 1|2_1|4_3|1_2| 4 |
| 2|3_4|4_5|1_2| 5 |
+---+---+---+---+------+
Here is one way using expr and build-in array functions for Spark >= 2.4.0:
import org.apache.spark.sql.functions.{expr, array_max, array}
val df = Seq(
(1, "2_1", "3_4", "1_2"),
(2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")
// get max c for each c column
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
expr(s"array_max(cast(split(${c}, '_') as array<int>))")
}
df.withColumn("max_c", array_max(array(c_cols:_*))).show
Output:
+---+---+---+---+-----+
| M| c2| c3| d1|max_c|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
For older versions use the next code:
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
val c_ar = split(col(c), "_").cast("array<int>")
when(c_ar.getItem(0) > c_ar.getItem(1), c_ar.getItem(0)).otherwise(c_ar.getItem(1))
}
df.withColumn("max_c", greatest(c_cols:_*)).show
Use greatest function:
val df = Seq((1, "2_1", "3_4", "1_2"),(2, "3_4", "4_5", "1_2"),
).toDF("M", "c2", "c3", "d1")
// get all `c` columns and split by `_` to get the values after the underscore
val c_cols = df.columns.filter(_.startsWith("c"))
.flatMap{
c => Seq(split(col(c), "_").getItem(0).cast("int"),
split(col(c), "_").getItem(1).cast("int")
)
}
// apply greatest func
val c_max = greatest(c_cols: _*)
// add new column
df.withColumn("c_Max", c_max).show()
Gives:
+---+---+---+---+-----+
| M| c2| c3| d1|c_Max|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
In spark >= 2.4.0, you can use the array_max function and get some code that would work even with columns containing more than 2 values. The idea is to start by concatenating all the columns (concat column). For that, I use concat_ws on an array of all the columns I want to concat, that I obtain with array(cols.map(col) :_*). Then I split the resulting string to get a big array of strings containing all the values of all the columns. I cast it to an array of ints and I call array_max on it.
val cols = (2 to 50).map("c"+_)
val result = df
.withColumn("concat", concat_ws("_", array(cols.map(col) :_*)))
.withColumn("array_of_ints", split('concat, "_").cast(ArrayType(IntegerType)))
.withColumn("c_max", array_max('array_of_ints))
.drop("concat", "array_of_ints")
In spark < 2.4, you can define array_max yourself like this:
val array_max = udf((s : Seq[Int]) => s.max)
The previous code does not need to be modified. Note however that UDFs can be slower than predefined spark SQL functions.

Using a Map to rename and select columns on an Apache Spark Dataframe (Scala) [duplicate]

This question already has answers here:
Renaming column names of a DataFrame in Spark Scala
(6 answers)
Closed 4 years ago.
Starting with a dataframe:
val someDF = Seq(
(8, "bat", "h"),
(64, "mouse", "t"),
(-27, "horse", "x")
).toDF("number", "thing", "letter")
someDF.show()
+------+-----+------+
|number|thing|letter|
+------+-----+------+
| 8| bat| h|
| 64|mouse| t|
| -27|horse| x|
+------+-----+------+
and a Map:
val lookup = Map(
"number" -> "id",
"thing" -> "animal"
)
I'd like to select and rename the columns such that number becomes id, thing becomes animal and so on.
The renaming is covered in another Stack Overflow question: Renaming column names of a DataFrame in Spark Scala, I'm sure there is a straightforward way to do the select at the same time that I'm not seeing.
I thought something along these lines would work, but get lots of type mismatches despite the input is a string and it works with a Seq instead of map:
val renamed_selected = someDF.select(
lookup.map(m => col(m._1).as(m._2))
):_*
So the desired output is:
+------+------+
|id |animal|
+------+------+
| 8| bat |
| 64|mouse |
| -27|horse |
+------+------+
Thanks 👍🏻
Clarification on duplicate question flag: The question Renaming column names of a DataFrame in Spark Scala does not cover how to rename and select columns at the same time.
Here is one way; Use pattern matching to check whether the name exists in the lookup, and give the column an alias if it does otherwise use the original name:
val cols = someDF.columns.map(name => lookup.get(name) match {
case Some(newname) => col(name).as(newname)
case None => col(name)
})
someDF.select(cols: _*).show
+---+------+------+
| id|animal|letter|
+---+------+------+
| 8| bat| h|
| 64| mouse| t|
|-27| horse| x|
+---+------+------+
If you only need columns in the lookup:
val cols = someDF.columns.collect(name => lookup.get(name) match {
case Some(newname) => col(name).as(newname)
})
someDF.select(cols: _*).show
+---+------+
| id|animal|
+---+------+
| 8| bat|
| 64| mouse|
|-27| horse|
+---+------+

Pyspark: Delete rows on column condition after groupBy

This is my input dataframe:
id val
1 Y
1 N
2 a
2 b
3 N
Result should be:
id val
1 Y
2 a
2 b
3 N
I want to group by on col id which has both Y and N in the val and then remove the row where the column val contains "N".
Please help me resolve this issue as i am beginner to pyspark
you can first identify the problematic rows with a filter for val=="Y" and then join this dataframe back to the original one. Finally you can filter for Null values and for the rows you want to keep, e.g. val==Y. Pyspark should be able to handle the self-join even if there are a lot of rows.
The example is shown below:
df_new = spark.createDataFrame([
(1, "Y"), (1, "N"), (1,"X"), (1,"Z"),
(2,"a"), (2,"b"), (3,"N")
], ("id", "val"))
df_Y = df_new.filter(col("val")=="Y").withColumnRenamed("val","val_Y").withColumnRenamed("id","id_Y")
df_new = df_new.join(df_Y, df_new["id"]==df_Y["id_Y"],how="left")
df_new.filter((col("val_Y").isNull()) | ((col("val_Y")=="Y") & ~(col("val")=="N"))).select("id","val").show()
The result would be your preferred:
+---+---+
| id|val|
+---+---+
| 1| X|
| 1| Y|
| 1| Z|
| 3| N|
| 2| a|
| 2| b|
+---+---+

Split 1 column into 3 columns in spark scala

I have a dataframe in Spark using scala that has a column that I need split.
scala> test.show
+-------------+
|columnToSplit|
+-------------+
| a.b.c|
| d.e.f|
+-------------+
I need this column split out to look like this:
+--------------+
|col1|col2|col3|
| a| b| c|
| d| e| f|
+--------------+
I'm using Spark 2.0.0
Thanks
Try:
import sparkObject.spark.implicits._
import org.apache.spark.sql.functions.split
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2"),
$"_tmp".getItem(2).as("col3")
)
The important point to note here is that the sparkObject is the SparkSession object you might have already initialized. So, the (1) import statement has to be compulsorily put inline within the code, not before the class definition.
To do this programmatically, you can create a sequence of expressions with (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")) (assume you need 3 columns as result) and then apply it to select with : _* syntax:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
(0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
To keep all columns:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+-------------+---------+----+----+----+
|columnToSplit| temp|col0|col1|col2|
+-------------+---------+----+----+----+
| a.b.c|[a, b, c]| a| b| c|
| d.e.f|[d, e, f]| d| e| f|
+-------------+---------+----+----+----+
If you are using pyspark, use a list comprehension to replace the map in scala:
df = spark.createDataFrame([['a.b.c'], ['d.e.f']], ['columnToSplit'])
from pyspark.sql.functions import col, split
(df.withColumn('temp', split('columnToSplit', '\\.'))
.select(*(col('temp').getItem(i).alias(f'col{i}') for i in range(3))
).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
A solution which avoids the select part. This is helpful when you just want to append the new columns:
case class Message(others: String, text: String)
val r1 = Message("foo1", "a.b.c")
val r2 = Message("foo2", "d.e.f")
val records = Seq(r1, r2)
val df = spark.createDataFrame(records)
df.withColumn("col1", split(col("text"), "\\.").getItem(0))
.withColumn("col2", split(col("text"), "\\.").getItem(1))
.withColumn("col3", split(col("text"), "\\.").getItem(2))
.show(false)
+------+-----+----+----+----+
|others|text |col1|col2|col3|
+------+-----+----+----+----+
|foo1 |a.b.c|a |b |c |
|foo2 |d.e.f|d |e |f |
+------+-----+----+----+----+
Update: I highly recommend to use Psidom's implementation to avoid splitting three times.
This appends columns to the original DataFrame and doesn't use select, and only splits once using a temporary column:
import spark.implicits._
df.withColumn("_tmp", split($"columnToSplit", "\\."))
.withColumn("col1", $"_tmp".getItem(0))
.withColumn("col2", $"_tmp".getItem(1))
.withColumn("col3", $"_tmp".getItem(2))
.drop("_tmp")
This expands on Psidom's answer and shows how to do the split dynamically, without hardcoding the number of columns. This answer runs a query to calculate the number of columns.
val df = Seq(
"a.b.c",
"d.e.f"
).toDF("my_str")
.withColumn("letters", split(col("my_str"), "\\."))
val numCols = df
.withColumn("letters_size", size($"letters"))
.agg(max($"letters_size"))
.head()
.getInt(0)
df
.select(
(0 until numCols).map(i => $"letters".getItem(i).as(s"col$i")): _*
)
.show()
We can write using for with yield in Scala :-
If your number of columns exceeds just add it to desired column and play with it. :)
val aDF = Seq("Deepak.Singh.Delhi").toDF("name")
val desiredColumn = Seq("name","Lname","City")
val colsize = desiredColumn.size
val columList = for (i <- 0 until colsize) yield split(col("name"),".").getItem(i).alias(desiredColumn(i))
aDF.select(columList: _ *).show(false)
Output:-
+------+------+-----+--+
|name |Lname |city |
+-----+------+-----+---+
|Deepak|Singh |Delhi|
+---+------+-----+-----+
If you don't need name column then, drop the column and just use withColumn.
Example:
Without using the select statement.
Lets assume we have a dataframe having a set of columns and we want to split a column having column name as name
import spark.implicits._
val columns = Seq("name","age","address")
val data = Seq(("Amit.Mehta", 25, "1 Main st, Newark, NJ, 92537"),
("Rituraj.Mehta", 28,"3456 Walnut st, Newark, NJ, 94732"))
var dfFromData = spark.createDataFrame(data).toDF(columns:_*)
dfFromData.printSchema()
val newDF = dfFromData.map(f=>{
val nameSplit = f.getAs[String](0).split("\\.").map(_.trim)
(nameSplit(0),nameSplit(1),f.getAs[Int](1),f.getAs[String](2))
})
val finalDF = newDF.toDF("First Name","Last Name", "Age","Address")
finalDF.printSchema()
finalDF.show(false)
output: