I am new to Spark's RDD API. Thanks to the question "Spark migrate sql window function to RDD for better performance", I managed to generate the following table:
+-----------------+---+
| _1| _2|
+-----------------+---+
| [col3TooMany,C]| 0|
| [col1,A]| 0|
| [col2,B]| 0|
| [col3TooMany,C]| 1|
| [col1,A]| 1|
| [col2,B]| 1|
|[col3TooMany,jkl]| 0|
| [col1,d]| 0|
| [col2,a]| 0|
| [col3TooMany,C]| 0|
| [col1,d]| 0|
| [col2,g]| 0|
| [col3TooMany,t]| 1|
| [col1,A]| 1|
| [col2,d]| 1|
| [col3TooMany,C]| 1|
| [col1,d]| 1|
| [col2,c]| 1|
| [col3TooMany,C]| 1|
| [col1,c]| 1|
+-----------------+---+
with an initial input of
val df = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val columnsToDrop = Seq("col3TooMany")
val columnsToCode = Seq("col1", "col2")
val target = "TARGET"
import org.apache.spark.sql.functions._
val exploded = explode(array(
(columnsToDrop ++ columnsToCode).map(c =>
struct(lit(c).alias("k"), col(c).alias("v"))): _*
)).alias("level")
val long = df.select(exploded, $"TARGET")
import org.apache.spark.util.StatCounter
then
long.as[((String, String), Int)].rdd.aggregateByKey(StatCounter())(_ merge _, _ merge _).collect.head
res71: ((String, String), org.apache.spark.util.StatCounter) = ((col2,B),(count: 3, mean: 0,666667, stdev: 0,471405, max: 1,000000, min: 0,000000))
aggregates statistics for every unique value of each column.
How can I add to the count (which is 3 for B in col2) a second count (maybe as a tuple) that represents the number of B in col2 where TARGET == 1? In this case, it should be 2.
You shouldn't need an additional aggregate here. With a binary target column, the mean is just the empirical probability of the target being equal to 1, so:
number of 1s = count * mean
number of 0s = count * (1 - mean)
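As a quick check of that arithmetic, here is a minimal plain-Scala sketch (no Spark required) using the three TARGET values observed for (col2, B) in the question. The names targets, ones and zeros are made up for illustration; StatCounter also exposes a sum, which for a 0/1 column should be the same quantity.

```scala
// TARGET values of the three rows where col2 == "B" (taken from the question's input)
val targets = Seq(0, 1, 1)

val count = targets.size.toLong          // corresponds to StatCounter's count
val mean  = targets.sum.toDouble / count // corresponds to StatCounter's mean

// For a 0/1 target, count * mean is the number of ones.
// Round because the mean is a floating-point value.
val ones  = math.round(count * mean)
val zeros = count - ones

println(s"count=$count ones=$ones zeros=$zeros")
```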
Related
I have the following dataframe of two columns of string type A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between the distinct elements of each column and a set of integers:
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create new columns in the dataframe by applying those maps to their corresponding columns. For one map only this would be
df.select("A").map(x=>mapColA.get(x)).show()
However, I don't understand how to apply each map to its corresponding column and create two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
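import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
// Note: a window without partitionBy moves all rows to a single partition
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))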
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+
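Note that the ids above start at 0 and follow whatever order distinct happened to return, so they differ from the expected result. A minimal plain-Scala sketch of one way to get stable, 1-based ids is to sort the distinct values before numbering them. The helper denseIds is hypothetical, and local collections stand in for the collected column values.

```scala
// Local stand-ins for the values of columns A and B
val colA = Seq("a1", "a1", "a1", "a2")
val colB = Seq("b1", "b2", "b2", "b3")

// Hypothetical helper: sort the distinct values, then number them from 1
def denseIds(values: Seq[String]): Map[String, Int] =
  values.distinct.sorted.zipWithIndex.map { case (v, i) => v -> (i + 1) }.toMap

val idA = denseIds(colA)
val idB = denseIds(colB)

// Attach both ids to every (A, B) pair
val rows = colA.zip(colB).map { case (a, b) => (a, b, idA(a), idB(b)) }
rows.foreach(println)
```

With sorted keys the result no longer depends on partition order, which is what makes it agree with the dense_rank version.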
I have a dataframe as below,
val x = Seq(("A", "B", "C", "D")).toDF("DOC", "A1", "A2", "A3")
+---+---+---+---+
|DOC| A1| A2| A3|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
Here the A columns can go up to A100, so I want to loop over all of them and nest them under a common structure, as below:
+---+---------+
|DOC|   A LIST|
+---+---------+
|  A|[B, C, D]|
+---+---------+
I want to create a dataframe by creating dynamic column names like A1, A2.. by looping from 1 to 100 and do a select.
How can I do this?
Cheers!
Simply assemble the list of columns to be combined, convert the column names into Columns via col, and apply the array function to the resulting list:
val df = Seq(
(1, "a", "b", "c", 10.0),
(2, "d", "e", "f", 20.0)
).toDF("id", "a1", "a2", "a3", "b")
val selectedCols = df.columns.filter(_.startsWith("a")).map(col)
val otherCols = df.columns.map(col) diff selectedCols
df.select((otherCols :+ array(selectedCols: _*).as("a_list")): _*).show
// +---+----+---------+
// | id| b| a_list|
// +---+----+---------+
// | 1|10.0|[a, b, c]|
// | 2|20.0|[d, e, f]|
// +---+----+---------+
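If you would rather build the names A1 to A100 explicitly than rely on a prefix filter, the name list can be generated with a range. This is a minimal plain-Scala sketch; existing is a stand-in for df.columns, and aCols and present are names made up here. In Spark the resulting names could then be mapped with col as above.

```scala
// Candidate names A1, A2, ..., A100
val aCols = (1 to 100).map(i => s"A$i")

// Stand-in for df.columns of the question's dataframe
val existing = Seq("DOC", "A1", "A2", "A3")

// Keep only the candidates that actually exist, in A1..A100 order
val present = aCols.filter(existing.toSet)

println(present.mkString(", "))
```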
I am trying to filter a dataframe in Scala by comparing two of its columns (subject and stream in this case) to a list of tuples. A row should be kept only if its (subject, stream) pair equals one of the tuples.
val df = Seq(
(0, "Mark", "Maths", "Science"),
(1, "Tyson", "History", "Commerce"),
(2, "Gerald", "Maths", "Science"),
(3, "Katie", "Maths", "Commerce"),
(4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")
Sample input:
+---+------+-------+--------+
| id| name|subject| stream|
+---+------+-------+--------+
| 0| Mark| Maths| Science|
| 1| Tyson|History|Commerce|
| 2|Gerald| Maths| Science|
| 3| Katie| Maths|Commerce|
| 4| Linda|History| Science|
+---+------+-------+--------+
List of tuples on which the above df needs to be filtered:
val listOfTuples = List[(String, String)] (
("Maths" , "Science"),
("History" , "Commerce")
)
Expected result :
+---+------+-------+--------+
| id| name|subject| stream|
+---+------+-------+--------+
| 0| Mark| Maths| Science|
| 1| Tyson|History|Commerce|
| 2|Gerald| Maths| Science|
+---+------+-------+--------+
You can do it either with isin on structs (needs Spark 2.2+):
val df_filtered = df
.where(struct($"subject",$"stream").isin(listOfTuples.map(typedLit(_)):_*))
or with leftsemi join:
val df_filtered = df
.join(listOfTuples.toDF("subject","stream"),Seq("subject","stream"),"leftsemi")
You can simply filter as
val resultDF = df.filter(row => {
List(
("Maths", "Science"),
("History", "Commerce")
).contains(
(row.getAs[String]("subject"), row.getAs[String]("stream")))
})
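The lambda above rebuilds the list for every row and scans it linearly. A minimal plain-Scala sketch of the same check with a Set built once is shown below; the Student case class and the names wanted and kept are made up for illustration, and a local Seq stands in for the dataframe.

```scala
// Hypothetical row type mirroring the question's columns
case class Student(id: Int, name: String, subject: String, stream: String)

val students = Seq(
  Student(0, "Mark", "Maths", "Science"),
  Student(1, "Tyson", "History", "Commerce"),
  Student(2, "Gerald", "Maths", "Science"),
  Student(3, "Katie", "Maths", "Commerce"),
  Student(4, "Linda", "History", "Science")
)

// Build the lookup once; membership tests on a Set are constant time on average
val wanted = Set(("Maths", "Science"), ("History", "Commerce"))

// Keep only the rows whose (subject, stream) pair is in the set
val kept = students.filter(s => wanted((s.subject, s.stream)))
kept.foreach(println)
```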
Hope this helps!
It seems this question has been asked a couple of times, but the solutions suggested in previous questions are not working for me.
I have two dataframes with different dimensions, as shown in the picture below. The second table was originally part of the first table, but after some processing I added one more column, column4, to it. Now I want to join these two tables so that I get the third table, Required.
Things I tried
I tried a couple of different solutions, but none of them worked for me.
I tried
val required =first.join(second, first("PDE_HDR_CMS_RCD_NUM") === second("PDE_HDR_CMS_RCD_NUM") , "left_outer")
Also I tried
val required = first.withColumn("SEQ", when(second.col("PDE_HDR_FILE_ID") === (first.col("PDE_HDR_FILE_ID").alias("PDE_HDR_FILE_ID1")), second.col("uniqueID")).otherwise(lit(0)))
In the second attempt I added .alias after getting an error that says
Error occured during extract process. Error:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) uniqueID#775L missing from.
Thanks for taking the time to read my question.
To generate the wanted result, you should join the two tables on column(s) that are row-identifying in your first table. Assuming c1 + c2 + c3 uniquely identifies each row in the first table, here's an example using a partial set of your sample data:
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq(
(1, "e", "o"),
(4, "d", "t"),
(3, "f", "e"),
(2, "r", "r"),
(6, "y", "f"),
(5, "t", "g"),
(1, "g", "h"),
(4, "f", "j"),
(6, "d", "k"),
(7, "s", "o")
).toDF("c1", "c2", "c3")
val df2 = Seq(
(3, "f", "e", 444),
(5, "t", "g", 555),
(7, "s", "o", 666)
).toDF("c1", "c2", "c3", "c4")
df1.join(df2, Seq("c1", "c2", "c3"), "left_outer").show
// +---+---+---+----+
// | c1| c2| c3| c4|
// +---+---+---+----+
// | 1| e| o|null|
// | 4| d| t|null|
// | 3| f| e| 444|
// | 2| r| r|null|
// | 6| y| f|null|
// | 5| t| g| 555|
// | 1| g| h|null|
// | 4| f| j|null|
// | 6| d| k|null|
// | 7| s| o| 666|
// +---+---+---+----+
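To make the left_outer semantics concrete, here is a minimal plain-Scala sketch that models the second table as a Map from the composite key (c1, c2, c3) to c4. The names left, right and joined are made up, and a Map only models the case where each key appears at most once on the right side.

```scala
// A few rows of the first table; the composite key is the whole tuple
val left = Seq((3, "f", "e"), (2, "r", "r"), (5, "t", "g"))

// Second table: composite key -> c4
val right = Map((3, "f", "e") -> 444, (5, "t", "g") -> 555)

// Left outer join: every left row survives; a missing match becomes None (null in Spark)
val joined = left.map(k => (k, right.get(k)))
joined.foreach(println)
```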
Here are two RDDs.
Table1-pair(key,value)
val table1 = sc.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c")))
//RDD[(String, String)]
Table2-Arrays
val table2 = sc.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
//RDD[Array[String]]
I am trying to change elements of table2 such as "1" to "a" using the keys and values in table1. My expected result is as follows:
RDD[Array[String]] = (Array(Array("a", "b", "d"), Array("a", "c", "e")))
Is there a way to make this possible?
If so, would it be efficient using a huge dataset?
I think we can do this better with DataFrames while avoiding joins, since a join might involve shuffling data.
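import org.apache.spark.sql.functions.{collect_list, explode, udf}
import spark.implicits._
// Collect the small lookup table to the driver as a Map
val table1 = spark.sparkContext.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c"))).collectAsMap()
// Broadcast it so that the mapping is available on all nodes
val brodcastedMapping = spark.sparkContext.broadcast(table1)
val table2 = spark.sparkContext.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
// Replace a value by its mapping, or keep it unchanged when there is none
def changeMapping(value: String): String = {
brodcastedMapping.value.getOrElse(value, value)
}
val changeMappingUDF = udf(changeMapping(_:String))
// Caveat: grouping by value merges identical input arrays, and collect_list does not guarantee element order
table2.toDF.withColumn("exploded", explode($"value"))
.withColumn("new", changeMappingUDF($"exploded"))
.groupBy("value")
.agg(collect_list("new").as("mappedCol"))
// mappedCol is an array column, so read it back as a Seq[String]
.select("mappedCol").rdd.map(r => r.getSeq[String](0).toArray)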
Let me know if it suits your requirement otherwise I can modify it as needed.
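If the lookup table is small enough to broadcast, the replacement can also be written as a plain element-wise map, with no explode or groupBy. The sketch below uses local Scala collections so it runs without Spark; the names mapping and replaced are made up. On an RDD the same lambda would presumably be passed to table2.map, reading the map from the broadcast variable's value.

```scala
// Lookup table collected as a local Map (this is what would be broadcast)
val mapping = Map("1" -> "a", "2" -> "b", "3" -> "c")

// Local stand-in for the RDD[Array[String]]
val table2 = Seq(Array("1", "2", "d"), Array("1", "3", "e"))

// Replace each element that has a mapping; leave the others untouched
val replaced = table2.map(_.map(v => mapping.getOrElse(v, v)))

replaced.foreach(arr => println(arr.mkString("[", ", ", "]")))
```

This keeps the element order inside each array and does not merge duplicate rows.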
You can do that with a Dataset:
package dataframe
import org.apache.spark.sql.SparkSession
/**
* @author vaquar.khan@gmail.com
*/
object Test {
case class table1Class(key: String, value: String)
case class table2Class(key: String, value: String, value1: String)
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-Basic")
.master("local[4]")
.getOrCreate()
import spark.implicits._
//
val table1 = Seq(
table1Class("1", "a"), table1Class("2", "b"), table1Class("3", "c"))
val df1 = spark.sparkContext.parallelize(table1, 4).toDF()
df1.show()
val table2 = Seq(
table2Class("1", "2", "d"), table2Class("1", "3", "e"))
val df2 = spark.sparkContext.parallelize(table2, 4).toDF()
df2.show()
//
df1.createOrReplaceTempView("A")
df2.createOrReplaceTempView("B")
spark.sql("select d1.key,d1.value,d2.value1 from A d1 inner join B d2 on d1.key = d2.key").show()
//TODO
/* need to fix query
spark.sql( "select * from ( "+ //B1.value,B1.value1,A.value
" select A.value,B.value,B.value1 "+
" from B "+
" left join A "+
" on B.key = A.key ) B2 "+
" left join A " +
" on B2.value = A.key" ).show()
*/
}
}
Results :
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
+---+-----+------+
|key|value|value1|
+---+-----+------+
| 1| 2| d|
| 1| 3| e|
+---+-----+------+
+-----+-----+------+
|value|value|value1|
+-----+-----+------+
| 1| a| d|
| 1| a| e|
+-----+-----+------+
Is there a way to make this possible?
Yes. Use Datasets (rather than RDDs, which are less efficient and less expressive), join them together, and select the fields you need.
val table1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("key", "value")
scala> table1.show
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
val table2 = sc.parallelize(
Array(Array("1", "2", "d"), Array("1", "3", "e"))).
toDF("a").
select($"a"(0) as "a0", $"a"(1) as "a1", $"a"(2) as "a2")
scala> table2.show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| 1| 2| d|
| 1| 3| e|
+---+---+---+
scala> table2.join(table1, $"key" === $"a0").select($"value" as "a0", $"a1", $"a2").show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| a| 2| d|
| a| 3| e|
+---+---+---+
Repeat for the other a columns and union together. While repeating the code, you'll notice the pattern that will make the code generic.
If so, would it be efficient using a huge dataset?
Yes (again). We're talking Spark here and a huge dataset is exactly why you chose Spark, isn't it?