Using a Map to rename and select columns on an Apache Spark Dataframe (Scala) [duplicate] - scala

This question already has answers here:
Renaming column names of a DataFrame in Spark Scala
Starting with a dataframe:
val someDF = Seq(
  (8, "bat", "h"),
  (64, "mouse", "t"),
  (-27, "horse", "x")
).toDF("number", "thing", "letter")
someDF.show()
+------+-----+------+
|number|thing|letter|
+------+-----+------+
|     8|  bat|     h|
|    64|mouse|     t|
|   -27|horse|     x|
+------+-----+------+
and a Map:
val lookup = Map(
  "number" -> "id",
  "thing" -> "animal"
)
I'd like to select and rename the columns such that number becomes id, thing becomes animal and so on.
The renaming is covered in another Stack Overflow question, Renaming column names of a DataFrame in Spark Scala, but I'm sure there is a straightforward way to do the select at the same time that I'm not seeing.
I thought something along these lines would work, but I get lots of type mismatches, even though the input is a string and the same approach works with a Seq instead of a Map:
val renamed_selected = someDF.select(
  lookup.map(m => col(m._1).as(m._2)): _*
)
So the desired output is:
+---+------+
| id|animal|
+---+------+
|  8|   bat|
| 64| mouse|
|-27| horse|
+---+------+
Thanks 👍🏻
Clarification on duplicate question flag: The question Renaming column names of a DataFrame in Spark Scala does not cover how to rename and select columns at the same time.

Here is one way: use pattern matching to check whether the name exists in the lookup, and give the column an alias if it does; otherwise use the original name:
val cols = someDF.columns.map(name => lookup.get(name) match {
  case Some(newname) => col(name).as(newname)
  case None => col(name)
})
someDF.select(cols: _*).show
+---+------+------+
| id|animal|letter|
+---+------+------+
|  8|   bat|     h|
| 64| mouse|     t|
|-27| horse|     x|
+---+------+------+
If you only need the columns in the lookup:
val cols = someDF.columns.collect {
  case name if lookup.contains(name) => col(name).as(lookup(name))
}
someDF.select(cols: _*).show
+---+------+
| id|animal|
+---+------+
|  8|   bat|
| 64| mouse|
|-27| horse|
+---+------+
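For completeness, the Map-based attempt from the question only fails because Map#map returns an Iterable, while select's varargs expect a Seq[Column]. A minimal sketch that materializes the columns with .toSeq before splatting them (using the someDF and lookup from the question):
val renamed_selected = someDF.select(
  lookup.map { case (oldName, newName) => col(oldName).as(newName) }.toSeq: _*
)
renamed_selected.show()
This selects only the keys of the Map and renames them in one pass, which matches the desired two-column output.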

Related

How to efficiently select dataframe columns containing a certain value in Spark?

Suppose you have a dataframe in Spark (string type) and you want to drop any column that contains "foo". In the example dataframe below, you would drop columns "c2" and "c3" but keep "c1". However, I'd like the solution to generalize to large numbers of columns and rows.
+-----+---+------+
|   c1| c2|    c3|
+-----+---+------+
| this|foo| hello|
| that|bar| world|
|other|baz|foobar|
+-----+---+------+
My solution is to scan every column in the dataframe, then aggregate the results using the dataframe API and built-in functions.
So, scanning each column could be done like this (I'm new to Scala, please excuse syntax mistakes):
val boolDF = df.select(df.columns.map(c => col(c).like("foo").as(c)): _*)
Logically, I would have an intermediate dataframe like this:
+-----+-----+-----+
|   c1|   c2|   c3|
+-----+-----+-----+
|false| true|false|
|false|false|false|
|false|false| true|
+-----+-----+-----+
Which would then be aggregated into a single row to read off which columns need to be dropped.
val exprs = boolDF.columns.map(c => max(c).alias(c))
val drop = boolDF.agg(exprs.head, exprs.tail: _*)
+-----+----+----+
|   c1|  c2|  c3|
+-----+----+----+
|false|true|true|
+-----+----+----+
Now any column containing true can be dropped.
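For completeness, that last step could be sketched like this, reusing boolDF and drop from the snippets above and assuming the like flags come back non-null:
// read the single aggregated row and drop every column whose flag is true
val flagsRow = drop.head()
val colsToDrop = drop.columns.filter(c => flagsRow.getAs[Boolean](c))
val cleaned = df.drop(colsToDrop: _*)
cleaned.show()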
My question is: is there a better way to do this, performance-wise? In this case, does Spark stop scanning a column once it finds "foo"? Does it matter how the data is stored (would Parquet help)?
Thanks, I'm new here so please tell me how the question can be improved.
Depending on your data (for example, if you have a lot of foo values), the code below may perform more efficiently:
val colsToDrop = df.columns.filter { c =>
  !df.where(col(c).like("foo")).limit(1).isEmpty
}
df.drop(colsToDrop: _*)
UPDATE: Removed redundant .limit(1):
val colsToDrop = df.columns.filter { c =>
  !df.where(col(c).like("foo")).isEmpty
}
df.drop(colsToDrop: _*)
Here is an answer following your logic (worked out correctly), but I think the other answer is better, more so for posterity and for improving your Scala. I am not sure the other answer is actually more performant, but neither is this one. Whether Parquet would help is difficult to gauge.
The other option is to write a loop on the driver and access every column; then Parquet would be of use due to its columnar storage, statistics and predicate push-down.
import org.apache.spark.sql.functions._

// UDF that flags, for each element of the array, whether it equals the comparison value
def myUDF = udf((cols: Seq[String], cmp: String) => cols.map(code => code == cmp))

val df = sc.parallelize(Seq(
  ("foo", "abc", "sss"),
  ("bar", "fff", "sss"),
  ("foo", "foo", "ddd"),
  ("bar", "ddd", "ddd")
)).toDF("a", "b", "c")

// Gather all columns into an array column, then flag which entries equal "foo"
val res = df.select($"*", array(df.columns.map(col): _*).as("colN"))
  .withColumn("colres", myUDF(col("colN"), lit("foo")))
res.show()
res.printSchema()

// Split the boolean array back into one column per original column
val n = 3
val res2 = res.select((0 until n).map(i => col("colres")(i).alias(s"c${i+1}")): _*)
res2.show(false)

// Aggregate with max to see which columns contain at least one "foo"
val exprs = res2.columns.map(c => max(c).alias(c))
val drop = res2.agg(exprs.head, exprs.tail: _*)
drop.show(false)

Spark GroupBy and Aggregate Strings to Produce a Map of Counts of the Strings Based on a Condition

I have a dataframe with multiple columns, two of which are id and label, as shown below.
+---+------+
| id| label|
+---+------+
|  1| "abc"|
|  1| "abc"|
|  1| "def"|
|  2| "def"|
|  2| "def"|
+---+------+
I want to groupBy "id" and aggregate the label column into a map from each label to its count (ignoring nulls); the expected result is shown below:
+---+-------------------+
| id|              label|
+---+-------------------+
|  1| {"abc":2, "def":1}|
|  2|          {"def":2}|
+---+-------------------+
Is it possible to do this without using user-defined aggregate functions? I saw a similar answer here, but it doesn't aggregate based on the count of each item.
I apologize if this question is silly, I am new to both Scala and Spark.
Thanks
Without Custom UDFs
import org.apache.spark.sql.functions.{map, collect_list}

df.groupBy("id", "label")
  .count
  .select($"id", map($"label", $"count").as("map"))
  .groupBy("id")
  .agg(collect_list("map"))
  .show(false)
+---+------------------------+
|id |collect_list(map) |
+---+------------------------+
|1 |[[def -> 1], [abc -> 2]]|
|2 |[[def -> 2]] |
+---+------------------------+
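If you are on Spark 2.4 or later, you can also get a single map column without any UDF by collecting (label, count) structs and applying map_from_entries; a sketch:
import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

df.groupBy("id", "label")
  .count
  .groupBy("id")
  .agg(map_from_entries(collect_list(struct($"label", $"count"))).as("label_counts"))
  .show(false)
This yields one row per id with a map such as [abc -> 2, def -> 1].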
Using a custom UDF:
import org.apache.spark.sql.functions.{udf, collect_list}

val customUdf = udf((seq: Seq[String]) => {
  seq.groupBy(x => x).map(x => x._1 -> x._2.size)
})

df.groupBy("id")
  .agg(collect_list("label").as("list"))
  .select($"id", customUdf($"list").as("map"))
  .show(false)
+---+--------------------+
|id |map |
+---+--------------------+
|1 |[abc -> 2, def -> 1]|
|2 |[def -> 2] |
+---+--------------------+

Map a multimap to columns of dataframe

Simply, I want to convert a multimap like this:
val input = Map("rownum"-> List("1", "2", "3") , "plant"-> List( "Melfi", "Pomigliano", "Torino" ), "tipo"-> List("gomme", "telaio")).toArray
in the following Spark dataframe:
+------+----------+------+
|rownum|     plant|  tipo|
+------+----------+------+
|     1|     Melfi| gomme|
|     2|Pomigliano|telaio|
|     3|    Torino|  null|
+------+----------+------+
replacing missing values with "null" values. My issue is applying a map function to the RDD:
val inputRdd = sc.parallelize(input)
inputRdd.map(..).toDF()
Any suggestions? Thanks in advance
Although (see my comments) I'm really not sure the multimap format is well suited to your problem (did you have a look at the Spark XML parsing modules?), here is one way to do it.
The pivot table solution
The idea is to flatten your input table into an (elementPosition, columnName, columnValue) format:
// The max size of the multimap lists
val numberOfRows = input.map(_._2.size).max

// For each index in the list, emit a tuple of (index, multimap key, multimap value at index)
val flatRows = (0 until numberOfRows).flatMap(rowIdx =>
  input.map { case (colName, allColValues) =>
    (rowIdx, colName, if (allColValues.size > rowIdx) allColValues(rowIdx) else null)
  })

// Probably faster at runtime to write it this way (fewer iterations):
// val flatRows = input.flatMap { case (colName, existingValues) =>
//   (0 until numberOfRows).zipAll(existingValues, null, null)
//     .map(t => (t._1.asInstanceOf[Int], colName, t._2))
// }

// To a dataframe
val flatDF = sc.parallelize(flatRows).toDF("elementIndex", "colName", "colValue")
flatDF.show
This will output:
+------------+-------+----------+
|elementIndex|colName| colValue|
+------------+-------+----------+
| 0| rownum| 1|
| 0| plant| Melfi|
| 0| tipo| gomme|
| 1| rownum| 2|
| 1| plant|Pomigliano|
| 1| tipo| telaio|
| 2| rownum| 3|
| 2| plant| Torino|
| 2| tipo| null|
+------------+-------+----------+
Now this is a pivot table problem:
flatDF.groupBy("elementIndex").pivot("colName").agg(expr("first(colValue)")).drop("elementIndex").show
+----------+------+------+
| plant|rownum| tipo|
+----------+------+------+
|Pomigliano| 2|telaio|
| Torino| 3| null|
| Melfi| 1| gomme|
+----------+------+------+
This might not be the best looking solution, but it is fully scalable to any number of columns.
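Alternatively, since the multimap already lives on the driver, a hedged sketch that transposes it into Rows directly and attaches an explicit schema (assuming spark is your SparkSession):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Pad the shorter lists with null while transposing the multimap into rows
val colNames = input.map(_._1)
val numRows = input.map(_._2.size).max
val rows = (0 until numRows).map { i =>
  Row(input.map { case (_, values) => if (i < values.size) values(i) else null }: _*)
}
val schema = StructType(colNames.map(name => StructField(name, StringType, nullable = true)))
val multimapDF = spark.createDataFrame(sc.parallelize(rows), schema)
multimapDF.show
Unlike the pivot route, this keeps the original row and column order, at the cost of building everything on the driver.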

Pass Array[seq[String]] to UDF in spark scala

I am new to UDFs in Spark. I have also read the answer here.
Problem statement: I'm trying to do pattern matching on a dataframe column.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id| text|
+---+--------------------+
| 1| z|
| 2| abs,abc,dfg|
| 3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
//returns 2, as "abs,abc,dfg" exists in the 2nd and 3rd rows
Now I want to do this pattern matching for every row in the text column and add a new column called count.
Result:
+---+--------------------+-----+
| id| text|count|
+---+--------------------+-----+
| 1| z| 1|
| 2| abs,abc,dfg| 2|
| 3|a,b,c,d,e,f,abs,a...| 1|
+---+--------------------+-----+
I tried to define a UDF, passing the text column as Array[Seq[String]], but I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) // convert column to Array[Seq[String]]
val valsum = udf((txt: Array[Seq[String]], pattern: String) => txt.count(_ == pattern))
df.withColumn("newCol", valsum(lit(txt), df("text"))).show()
Any help would be appreciated
You will have to know all the elements of the text column, which can be done using collect_list by grouping all the rows of your dataframe into one group. Then just check how many elements of the collected array contain the text of the current row and count the matches, as in the following code.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import scala.collection.mutable

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")), (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")

val valsum = udf((txt: String, array: mutable.WrappedArray[String]) => array.filter(element => element.contains(txt)).size)

df.withColumn("grouping", lit("g"))
  .withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
  .withColumn("count", valsum($"text", $"array"))
  .drop("grouping", "array")
  .show(false)
You should get the following output:
+---+-----------------------+-----+
|id |text |count|
+---+-----------------------+-----+
|1 |z |1 |
|2 |abs,abc,dfg |2 |
|3 |a,b,c,d,e,f,abs,abc,dfg|1 |
+---+-----------------------+-----+
I hope this is helpful.
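A hedged alternative, if you would rather avoid both the UDF and the single-partition window: cross-join each row's text against the full set of texts and count the substring matches (only sensible for small data):
import org.apache.spark.sql.functions.{count, lit}

val counts = df.as("l")
  .crossJoin(df.select($"text".as("other")))
  .where($"other".contains($"l.text"))
  .groupBy($"l.id", $"l.text")
  .agg(count(lit(1)).as("count"))
counts.show(false)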

How to pivot dataset?

I use Spark 2.1.
I have some data in a Spark Dataframe, which looks like below:
**ID** **type** **val**
  1      t1       v1
  1      t11      v11
  2      t2       v2
I want to pivot this data using either Spark Scala (preferably) or Spark SQL so that the final output looks like below:
**ID** **t1** **t11** **t2**
  1     v1     v11
  2                     v2
You can use groupBy.pivot:
import org.apache.spark.sql.functions.first
df.groupBy("ID").pivot("type").agg(first($"val")).na.fill("").show
+---+---+---+---+
| ID| t1|t11| t2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note: depending on the actual data, i.e. how many values there are for each combination of ID and type, you might choose a different aggregation function.
Here's one way to do it:
import org.apache.spark.sql.functions.{concat_ws, collect_list}

val df = Seq(
  (1, "T1", "v1"),
  (1, "T11", "v11"),
  (2, "T2", "v2")
).toDF(
  "id", "type", "val"
).as[(Int, String, String)]
val df2 = df.groupBy("id").pivot("type").agg(concat_ws(",", collect_list("val")))
df2.show
+---+---+---+---+
| id| T1|T11| T2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note that if there are different vals associated with a given type, they will be grouped (comma-delimited) under the type in df2.
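As a quick illustration of that note, with a hypothetical extra (1, "T1", "v1b") row:
val dfDup = Seq(
  (1, "T1", "v1"),
  (1, "T1", "v1b"),
  (1, "T11", "v11"),
  (2, "T2", "v2")
).toDF("id", "type", "val")
dfDup.groupBy("id").pivot("type").agg(concat_ws(",", collect_list("val"))).show
// the (1, T1) cell now contains "v1,v1b"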
This one should work:
val seq = Seq((1,"t1","v1"),(1,"t11","v11"),(2,"t2","v2"))
val df = seq.toDF("id","type","val")
val pivotedDF = df.groupBy("id").pivot("type").agg(first("val"))
pivotedDF.show
Output:
+---+----+----+----+
| id| t1| t11| t2|
+---+----+----+----+
| 1| v1| v11|null|
| 2|null|null| v2|
+---+----+----+----+
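Since the question also mentions Spark SQL: the SQL PIVOT syntax only arrived in Spark 2.4, so on 2.1 stick with the groupBy.pivot approaches above. For reference, a sketch of the 2.4+ SQL form (assuming spark is your SparkSession):
df.createOrReplaceTempView("t")
spark.sql(
  """SELECT *
    |FROM t
    |PIVOT (first(val) FOR type IN ('t1', 't11', 't2'))
  """.stripMargin).show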