I have a Spark Dataframe with the following contents:
+----+---+---+---+
|Name|E1 |E2 |E3 |
+----+---+---+---+
|abc |4  |5  |6  |
+----+---+---+---+
I need the various E columns to become rows in a new column as shown below:
+----+-----+-----+
|Name|value|EType|
+----+-----+-----+
|abc |4    |E1   |
|abc |5    |E2   |
|abc |6    |E3   |
+----+-----+-----+
This answer gave me the idea of using explode and I now have the following code:
df.select($"Name", explode(array("E1", "E2", "E3")).as("value"))
The above code gives me the Name and value columns I need, but I still need a way to add in the EType column based on which value in the array passed to explode is being used to populate that particular row.
Output of the above code:
+----+-----+
|Name|value|
+----+-----+
|abc |4    |
|abc |5    |
|abc |6    |
+----+-----+
How can I add the EType column?
(I am using Spark 2.2 with Scala)
Thanks!
You can use the stack function for this particular case. Since you are on Scala:
df.selectExpr("Name", "stack(3, E1, 'E1', E2, 'E2', E3, 'E3')").toDF("Name", "value", "EType").show()
+----+-----+-----+
|Name|value|EType|
+----+-----+-----+
| abc| 4| E1|
| abc| 5| E2|
| abc| 6| E3|
+----+-----+-----+
Instead of exploding just value, you can explode a struct that contains the name of the column and its content, as follows:
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
val result = df
  .select(
    col("name"),
    explode(array(
      df.columns.filterNot(_ == "name").map(c => struct(lit(c).as("EType"), col(c).alias("value"))): _*
    ))
  )
  .select("name", "col.*")
With your input you will get the following result dataframe:
+----+-----+-----+
|name|EType|value|
+----+-----+-----+
|abc |E1 |4 |
|abc |E2 |5 |
|abc |E3 |6 |
+----+-----+-----+
You need to use a melt operation here.
Note: melt functionality is not built into Spark 2.2, so you need to write that util function yourself.
You can go through this answer on how to implement a melt function: How to melt Spark DataFrame? A rough Scala sketch is also shown below.
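For reference, a minimal melt-style sketch in Scala (not the util from the linked answer; it assumes the value columns share a compatible type, as E1, E2 and E3 do here):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Keeps `idCols` and turns each column in `valueCols` into its own row,
// with the column name in `EType` and its content in `value`.
def melt(df: DataFrame, idCols: Seq[String], valueCols: Seq[String]): DataFrame = {
  val kvs = explode(array(
    valueCols.map(c => struct(lit(c).as("EType"), col(c).as("value"))): _*
  ))
  df.select((idCols.map(col) :+ kvs.as("kv")): _*)
    .select((idCols.map(col) :+ col("kv.EType") :+ col("kv.value")): _*)
}

// melt(df, Seq("Name"), Seq("E1", "E2", "E3")).show()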
I have a toy data set for which I need to compute the list of cities in each state and the population of that state (the sum of the populations of all the cities in that state).
I want to do it using RDDs without using groupByKey and joins. My approach so far:
In this approach I used 2 separate key-value pairs and joined them.
val rdd1 = inputRdd.map(x => (x._1, x._3.toInt))
val rdd2 = inputRdd.map(x => (x._1, x._2))
val popn_sum = rdd1.reduceByKey(_ + _)
val list_cities = rdd2.reduceByKey(_ ++ _)
popn_sum.join(list_cities).collect()
Is it possible to get the same output with just 1 key-value pair and without any joins?
I have created a new key-value pair, but I do not know how to proceed to get the same output using aggregateByKey or reduceByKey with this RDD:
val rdd3 = inputRdd.map(x => (x._1, (List(x._2), x._3)))
I am new to Spark and want to learn the best way to get this output.
Array((B,(12,List(B1, B2))), (A,(6,List(A1, A2, A3))), (C,(8,List(C1, C2))))
Thanks in advance
If your inputRdd is of type
inputRdd: org.apache.spark.rdd.RDD[(String, String, Int)]
then you can achieve your desired result by simply using one reduceByKey as
inputRdd.map(x => (x._1, (List(x._2), x._3.toInt))).reduceByKey((x, y) => (x._1 ++ y._1, x._2 + y._2))
and you can do it with aggregateByKey as
inputRdd.map(x => (x._1, (List(x._2), x._3.toInt))).aggregateByKey((List.empty[String], 0))((x, y) => (x._1 ++ y._1, x._2 + y._2), (x, y) => (x._1 ++ y._1, x._2 + y._2))
DataFrame way
An even better approach would be to use the DataFrame API. You can convert your RDD to a DataFrame simply by applying .toDF("state", "city", "population"), which should give you
+-----+----+----------+
|state|city|population|
+-----+----+----------+
|A |A1 |1 |
|B |B1 |2 |
|C |C1 |3 |
|A |A2 |2 |
|A |A3 |3 |
|B |B2 |10 |
|C |C2 |5 |
+-----+----+----------+
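If it helps, a minimal sketch of that conversion, assuming spark is your active SparkSession and inputRdd has the RDD[(String, String, Int)] type shown above:
// Sketch only: bring the tuple-to-DataFrame implicits into scope and name the columns.
import spark.implicits._

val inputDf = inputRdd.toDF("state", "city", "population")
inputDf.show(false)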
After that you can just use groupBy and apply the collect_list and sum built-in aggregation functions as
import org.apache.spark.sql.functions._
inputDf.groupBy("state").agg(collect_list(col("city")).as("cityList"), sum("population").as("sumPopulation"))
which should give you
+-----+------------+-------------+
|state|cityList |sumPopulation|
+-----+------------+-------------+
|B |[B1, B2] |12 |
|C |[C1, C2] |8 |
|A |[A1, A2, A3]|6 |
+-----+------------+-------------+
The Dataset way is almost the same but comes with additional type-safety.
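For illustration, a hedged sketch of the typed variant (the CityRow case class name is just an assumption):
// Sketch only: the case class gives the Dataset its compile-time schema.
case class CityRow(state: String, city: String, population: Int)

import spark.implicits._

val typedResult = inputDf.as[CityRow]
  .groupByKey(_.state)
  .mapGroups { (state, rows) =>
    val list = rows.toList
    (state, list.map(_.city), list.map(_.population).sum)
  }
  .toDF("state", "cityList", "sumPopulation")

typedResult.show(false)
Note that mapGroups materialises each group and skips partial aggregation, so for large data the groupBy/agg version above is usually the better choice.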
Say I have a Spark SQL DataFrame like so:
name gender grade
-----------------
Joe M 3
Sue F 2
Pam F 3
Gil M 2
Lon F 3
Kim F 3
Zoe F 2
I want to create a report of single values like so:
numMales numFemales numGrade2 numGrade3
---------------------------------------
2 5 3 4
What is the best way to do this? I know how to get one of these individually like so:
val numMales = dataDF.where($"gender" === "M").count
But I don't really know how to put this into a DataFrame, or how to combine all the results.
Using the when, sum and struct built-in functions should give you your desired result:
import org.apache.spark.sql.functions._
dataDF
  .select(
    struct(
      sum(when(col("gender") === "M", 1)).as("numMales"),
      sum(when(col("gender") === "F", 1)).as("numFemales")
    ).as("genderCounts"),
    struct(
      sum(when(col("grade") === 2, 1)).as("numGrade2"),
      sum(when(col("grade") === 3, 1)).as("numGrade3")
    ).as("gradeCounts")
  )
  .select(col("genderCounts.*"), col("gradeCounts.*"))
  .show(false)
which should give you
+--------+----------+---------+---------+
|numMales|numFemales|numGrade2|numGrade3|
+--------+----------+---------+---------+
|2 |5 |3 |4 |
+--------+----------+---------+---------+
You can explode and pivot:
import org.apache.spark.sql.functions._
val cols = Seq("gender", "grade")
df
  .select(explode(array(cols map (c => concat(lit(c), col(c))): _*)))
  .groupBy().pivot("col").count.show
// +-------+-------+------+------+
// |genderF|genderM|grade2|grade3|
// +-------+-------+------+------+
// | 5| 2| 3| 4|
// +-------+-------+------+------+
I'd say that you need to .groupBy().count() your dataframe separately by each column, then combine the answers into a new dataframe; a rough sketch of that idea follows.
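One hedged way to read that suggestion, pivoting each per-column count into a single row and stitching the rows together with a crossJoin (the output column names are just illustrative):
import org.apache.spark.sql.functions.col

// One pivoted count per column; each produces a single-row DataFrame.
val genderCounts = dataDF.groupBy().pivot("gender").count()
  .select(col("M").as("numMales"), col("F").as("numFemales"))

val gradeCounts = dataDF.groupBy().pivot("grade").count()
  .select(col("2").as("numGrade2"), col("3").as("numGrade3"))

// Both sides have exactly one row, so the crossJoin simply concatenates the columns.
genderCounts.crossJoin(gradeCounts).show(false)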
I am new to Scala programming. I have worked with R very extensively, but while working with Scala it has become tough to loop over specific columns and perform computations on the column values.
Let me explain with the help of an example:
I have a final dataframe arrived at after joining the 2 dataframes,
and now I need to perform a calculation like
Above is the computation with reference to the columns, so after the computation we'll get the below Spark dataframe.
How do I refer to the column index in a for-loop to compute the new column values in a Spark dataframe in Scala?
Here is one solution:
Input Data:
+---+---+---+---+---+---+---+---+---+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |
+---+---+---+---+---+---+---+---+---+
|24 |74 |74 |21 |66 |65 |100|27 |19 |
+---+---+---+---+---+---+---+---+---+
Zip the columns to pair up the matching ones and drop the non-matching columns:
val oneCols = data.schema.filter(_.name.contains("1")).map(x => x.name).sorted
val twoCols = data.schema.filter(_.name.contains("2")).map(x => x.name).sorted
val cols = oneCols.zip(twoCols)
//cols: Seq[(String, String)] = List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
Use foldLeft function to dynamically add columns:
import org.apache.spark.sql.functions._
val result = cols.foldLeft(data)((data, c) =>
  data.withColumn(s"Diff_${c._1}", (col(c._2) - col(c._1)) / col(c._2)))
Here is the result:
result.show(false)
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |Diff_a1 |Diff_b1|Diff_c1 |Diff_d1 |
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|24 |74 |74 |21 |66 |65 |100|27 |19 |0.6307692307692307|0.26 |-1.7407407407407407|-0.10526315789473684|
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
I created a PySpark dataframe using the following code
testlist = [
    {"category": "A", "name": "A1"},
    {"category": "A", "name": "A2"},
    {"category": "B", "name": "B1"},
    {"category": "B", "name": "B2"}
]
spark_df = spark.createDataFrame(testlist)
Result:
category name
A A1
A A2
B B1
B B2
I want to make it appear as follows:
category name
A A1, A2
B B1, B2
I tried the following code which does not work
spark_df.groupby('category').agg('name', lambda x:x + ', ')
Can anyone help identify what I am doing wrong and the best way to make this happen?
One option is to use pyspark.sql.functions.collect_list() as the aggregate function.
from pyspark.sql.functions import collect_list
grouped_df = spark_df.groupby('category').agg(collect_list('name').alias("name"))
This will collect the values for name into a list and the resultant output will look like:
grouped_df.show()
#+---------+---------+
#|category |name |
#+---------+---------+
#|A |[A1, A2] |
#|B |[B1, B2] |
#+---------+---------+
Update 2019-06-10:
If you wanted your output as a concatenated string, you can use pyspark.sql.functions.concat_ws to concatenate the values of the collected list, which will be better than using a udf:
from pyspark.sql.functions import concat_ws
grouped_df.withColumn("name", concat_ws(", ", "name")).show()
#+---------+-------+
#|category |name |
#+---------+-------+
#|A |A1, A2 |
#|B |B1, B2 |
#+---------+-------+
Original Answer: If you wanted your output as a concatenated string, you'd have to use a udf. For example, you can first do the groupBy() as above and then apply a udf to join the collected list:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

concat_list = udf(lambda lst: ", ".join(lst), StringType())
grouped_df.withColumn("name", concat_list("name")).show()
#+---------+-------+
#|category |name |
#+---------+-------+
#|A |A1, A2 |
#|B |B1, B2 |
#+---------+-------+
UNIQUE values
If you want unique values then use collect_set instead of collect_list
from pyspark.sql.functions import collect_set
grouped_df = spark_df.groupby('category').agg(collect_set('name').alias("unique_name"))
grouped_df.show(5)
Another option is this
>>> df.rdd.reduceByKey(lambda x,y: x+','+y).toDF().show()
+---+-----+
| _1| _2|
+---+-----+
| A|A1,A2|
| B|B1,B2|
+---+-----+
I am new to UDFs in Spark. I have also read the answer here.
Problem statement: I'm trying to find pattern matching from a dataframe col.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id| text|
+---+--------------------+
| 1| z|
| 2| abs,abc,dfg|
| 3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
//returns 2 as abs exits in 2nd row and 3rd row
Now I want to do this pattern matching for every row in column $text and add new column called count.
Result:
+---+--------------------+-----+
| id| text|count|
+---+--------------------+-----+
| 1| z| 1|
| 2| abs,abc,dfg| 2|
| 3|a,b,c,d,e,f,abs,a...| 1|
+---+--------------------+-----+
I tried to define a udf passing the $text column as Array[Seq[String]], but I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) // convert the column to Array[Seq[String]]
val valsum = udf((txt: Array[Seq[String]], pattern: String) => { txt.count(_ == pattern) })
df.withColumn("newCol", valsum(lit(txt), df("text"))).show()
Any help would be appreciated
You will have to know all the elements of the text column, which can be done using collect_list after grouping all the rows of your dataframe into one group. Then just count how many elements of the collected array contain the text of the current row, as in the following code.
import sqlContext.implicits._
import scala.collection.mutable
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")), (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
val valsum = udf((txt: String, array: mutable.WrappedArray[String]) => array.filter(element => element.contains(txt)).size)
df.withColumn("grouping", lit("g"))
.withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
.withColumn("count", valsum($"text", $"array"))
.drop("grouping", "array")
.show(false)
You should get the following output:
+---+-----------------------+-----+
|id |text |count|
+---+-----------------------+-----+
|1 |z |1 |
|2 |abs,abc,dfg |2 |
|3 |a,b,c,d,e,f,abs,abc,dfg|1 |
+---+-----------------------+-----+
I hope this is helpful.