Scala list columns dynamically drop the columns from the list - scala

I have a spark scala code which performs as below:
val ua_list = List()
for (a <- a_col_names)
if (some condition ) {
ua_list :+ (a)
Now i am calling the list in dataframe to drop all the columns from the list
val df_d = df_p.drop(ua_list.map(name => col(name)): _*)
The error i am facing is no `: _*' annotation allowed here (such annotations are only allowed in arguments to *-parameters)
Not sure what exactly the issue is ? Any suggestions and ideas.

if you are using _* means all columns in the list, no need to map and get each column.
simply you can do like below.
df_p.drop(ua_list : _*)
Full example :
import spark.implicits._
val df = Seq(
(123, "ITA", 1475600500, 18.0),
(123, "ITA", 1475600500, 18.0),
(123, "ITA", 1475600516, 19.0)
).toDF("Value", "Country", "Timestamp", "Sum")
df.show
val ua_list = List("Value", "Timestamp")
df.drop(ua_list: _*).show
Result :
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
| 123| ITA|1475600500|18.0|
| 123| ITA|1475600500|18.0|
| 123| ITA|1475600516|19.0|
+-----+-------+----------+----+
+-------+----+
|Country| Sum|
+-------+----+
| ITA|18.0|
| ITA|18.0|
| ITA|19.0|
+-------+----+

Related

Dynamic dataframe with n columns and m rows

Reading data from json(dynamic schema) and i'm loading that to dataframe.
Example Dataframe:
scala> import spark.implicits._
import spark.implicits._
scala> val DF = Seq(
(1, "ABC"),
(2, "DEF"),
(3, "GHIJ")
).toDF("id", "word")
someDF: org.apache.spark.sql.DataFrame = [number: int, word: string]
scala> DF.show
+------+-----+
|id | word|
+------+-----+
| 1| ABC|
| 2| DEF|
| 3| GHIJ|
+------+-----+
Requirement:
Column count and names can be anything. I want to read rows in loop to fetch each column one by one. Need to process that value in subsequent flows. Need both column name and value. I'm using scala.
Python:
for i, j in df.iterrows():
print(i, j)
Need the same functionality in scala and it column name and value should be fetched separtely.
Kindly help.
df.iterrows is not from pyspark, but from pandas. In Spark, you can use foreach :
DF
.foreach{_ match {case Row(id:Int,word:String) => println(id,word)}}
Result :
(2,DEF)
(3,GHIJ)
(1,ABC)
I you don't know the number of columns, you cannot use unapply on Row, then just do :
DF
.foreach(row => println(row))
Result :
[1,ABC]
[2,DEF]
[3,GHIJ]
And operate with row using its methods getAs etc

How to Transform a Spark Scala Nested Map within a Map Data Structure?

I want to write a nested data structure consisting of a Map inside another Map using an array of a Scala case class.
The result should transform this dataframe:
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
| 123| ITA|1475600500|18.0|
| 123| ITA|1475600516|19.0|
+-----+-------+----------+----+
into:
+--------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600500":18,"1475600516":19}}}]
+--------------------------------------------------------------------+
The actualResult dataset below gets me close but the structure isn't quite the same as my expected dataframe.
case class Record(value: Integer, attributes: Map[String, Map[String, BigDecimal]])
val actualResult = df
.map(r =>
Array(
Record(
r.getAs[Int]("Value"),
Map(
r.getAs[String]("Country") ->
Map(
r.getAs[String]("Timestamp") -> new BigDecimal(
r.getAs[Double]("Sum").toString
)
)
)
)
)
)
The Timestamp column in the actualResult dataset doesn't get combined together into the same Record row but rather creates two separate rows instead.
+----------------------------------------------------+
|value |
+----------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600516":19}}}]
[{"value":123,"attributes":{"ITA":{"1475600500":18}}}]
+----------------------------------------------------+
With the use of groupBy and collect_list by creatng combined column using struct I was able to get single row as below output.
val mycsv =
"""
|Value|Country|Timestamp|Sum
| 123|ITA|1475600500|18.0
| 123|ITA|1475600516|19.0
""".stripMargin('|').lines.toList.toDS()
val df: DataFrame = spark.read.option("header", true)
.option("sep", "|")
.option("inferSchema", true)
.csv(mycsv)
df.show
val df1 = df.
groupBy("Value","Country")
.agg( collect_list(struct(col("Country"), col("Timestamp"), col("Sum"))).alias("attributes")).drop("Country")
val json = df1.toJSON // you can save in to file
json.show(false)
Result combined 2 rows
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|123.0|ITA |1475600500|18.0|
|123.0|ITA |1475600516|19.0|
+-----+-------+----------+----+
+----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|{"Value":123.0,"attributes":[{"Country":"ITA","Timestamp":1475600500,"Sum":18.0},{"Country":"ITA","Timestamp":1475600516,"Sum":19.0}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------+

append multiple columns to existing dataframe in spark

I need to append multiple columns to the existing spark dataframe where column names are given in List
assuming values for new columns are constant, for example given input columns and dataframe are
val columnsNames=List("col1","col2")
val data = Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4))
and after appending both columns, assuming constant values are "val1" for col1 and "val2" for col2,output data frame should be
+-----+---+-------+------+
| _1| _2|col1 |col2|
+-----+---+-------+------+
| one| 1|val1 |val2|
| two| 2|val1 |val2|
|three| 3|val1 |val2|
| four| 4|val1 |val2|
+-----+---+-------+------+
i have written a function to append columns
def appendColumns (cols: List[String], ds: DataFrame): DataFrame = {
cols match {
case Nil => ds
case h :: Nil => appendColumns(Nil, ds.withColumn(h, lit(h)))
case h :: tail => appendColumns(tail, ds.withColumn(h, lit(h)))
}
}
Is there any better way and more functional way to do it.
thanks
Yes, there is a better and simpler way. Basically, you make as many calls to withColumn as you have columns. With lots of columns, catalyst, the engine that optimizes spark queries may feel a bit overwhelmed (I've had the experience in the past with a similar use case). I've even seen it cause an OOM on the driver when experimenting with thousands of columns. To avoid stressing catalyst (and write less code ;-) ), you can simply use select like this below to get this done in one spark command:
val data = Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF
// let's assume that we have a map that associates column names to their values
val columnMap = Map("col1" -> "val1", "col2" -> "val2")
// Let's create the new columns from the map
val newCols = columnMap.keys.map(k => lit(columnMap(k)) as k)
// selecting the old columns + the new ones
data.select(data.columns.map(col) ++ newCols : _*).show
+-----+---+----+----+
| _1| _2|col1|col2|
+-----+---+----+----+
| one| 1|val1|val2|
| two| 2|val1|val2|
|three| 3|val1|val2|
| four| 4|val1|val2|
+-----+---+----+----+
As opposed to recursion the more general approach using a foldLeft would I think be more general, for a limited number of columns. Using Databricks Notebook:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._
val columnNames = Seq("c3","c4")
val df = Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("c1", "c2")
def addCols(df: DataFrame, columns: Seq[String]): DataFrame = {
columns.foldLeft(df)((acc, col) => {
acc.withColumn(col, lit(col)) })
}
val df2 = addCols(df, columnNames)
df2.show(false)
returns:
+-----+---+---+---+
|c1 |c2 |c3 |c4 |
+-----+---+---+---+
|one |1 |c3 |c4 |
|two |2 |c3 |c4 |
|three|3 |c3 |c4 |
|four |4 |c3 |c4 |
+-----+---+---+---+
Please beware of the following: https://medium.com/#manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 albeit in a slightly different context and the other answer alludes to this via the select approach.

Scala - Fill "null" column with another column

I want to replicate the problem mentioned here in Scala DataFrames. I have tried using the following approaches, to no success so far.
Input
Col1 Col2
A M
B K
null S
Expected Output
Col1 Col2
A M
B K
S <---- S
Approach 1
val output = df.na.fill("A", Seq("col1"))
The fill method does not take a column as the (first) input.
Approach 2
val output = df.where(df.col("col1").isNull)
I cannot find a suitable method to call after I have identified the null values.
Approach 3
val output = df.dtypes.map(column =>
column._2 match {
case "null" => (column._2 -> 0)
}).toMap
I get a StringType error.
I'd use when/otherwise, as shown below:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
("A", "M"), ("B", "K"), (null, "S")
).toDF("Col1", "Col2")
df.withColumn("Col1", when($"Col1".isNull, $"Col2").otherwise($"Col1")).show
// +----+----+
// |Col1|Col2|
// +----+----+
// | A| M|
// | B| K|
// | S| S|
// +----+----+

spark expression rename the column list after aggregation

I have written below code to group and aggregate the columns
val gmList = List("gc1","gc2","gc3")
val aList = List("val1","val2","val3","val4","val5")
val cype = "first"
val exprs = aList.map((_ -> cype )).toMap
dfgroupBy(gmList.map (col): _*).agg (exprs).show
but this create a columns with appending aggregation name in all column as shown below
so I want to alias that name first(val1) -> val1, I want to make this code generic as part of exprs
+----------+----------+-------------+-------------------------+------------------+---------------------------+------------------------+-------------------+
| gc1 | gc2 | gc3 | first(val1) | first(val2)| first(val3) | first(val4) | first(val5) |
+----------+----------+-------------+-------------------------+------------------+---------------------------+------------------------+-------------------+
One approach would be to alias the aggregated columns to the original column names in a subsequent select. I would also suggest generalizing the single aggregate function (i.e. first) to a list of functions, as shown below:
import org.apache.spark.sql.functions._
val df = Seq(
(1, 10, "a1", "a2", "a3"),
(1, 10, "b1", "b2", "b3"),
(2, 20, "c1", "c2", "c3"),
(2, 30, "d1", "d2", "d3"),
(2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")
val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")
// Populate with different aggregate methods for individual columns if necessary
val fList = List.fill(aList.size)("first")
val afPairs = aList.zip(fList)
// afPairs: List[(String, String)] = List((val1,first), (val2,first), (val3,first))
df.
groupBy(gmList.map(col): _*).agg(afPairs.toMap).
select(gmList.map(col) ::: afPairs.map{ case (v, f) => col(s"$f($v)").as(v) }: _*).
show
// +---+---+----+----+----+
// |gc1|gc2|val1|val2|val3|
// +---+---+----+----+----+
// | 2| 20| c1| c2| c3|
// | 1| 10| a1| a2| a3|
// | 2| 30| d1| d2| d3|
// +---+---+----+----+----+
You can slightly change the way you are generating the expression and use the function alias in there:
import org.apache.spark.sql.functions.col
val aList = List("val1","val2","val3","val4","val5")
val exprs = aList.map(c => first(col(c)).alias(c) )
dfgroupBy( gmList.map(col) : _*).agg(exprs.head , exprs.tail: _*).show
Here's a more generic version that will work with any aggregate functions and doesn't require naming your aggregate columns up front. Build your grouped df as you normally would, then use:
val colRegex = raw"^.+\((.*?)\)".r
val newCols = df.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
df.select(newCols: _*)
This will extract out only what is inside the parentheses, regardless of what aggregate function is called (e.g. first(val) -> val, sum(val) -> val, count(val) -> val, etc.).