My base dataframe looks like this:
HeroNamesDF
id gen name surname supername
1 1 Clarc Kent BATMAN
2 1 Bruce Smith BATMAN
3 2 Clark Kent SUPERMAN
And then I have another one with the corrections: CorrectionsDF
id gen attribute value
1 1 supername SUPERMAN
1 1 name Clark
2 1 surname Wayne
My approach to the problem was to do this:
CorrectionsDF.select("id", "gen").distinct().collect().map(r => {
  val id = r(0)
  val gen = r(1)
  val corrections = CorrectionsDF.filter(col("id") === lit(id) and col("gen") === lit(gen))
  val candidates = HeroNamesDF.filter(col("id") === lit(id) and col("gen") === lit(gen))
  candidates.columns.map(column => {
    val change = corrections.where(col("attribute") === lit(column)).select("id", "gen", "value")
    candidates.select("id", "gen", column)
      .join(change, Seq("id", "gen"), "full")
      .withColumn(column, when(col("value").isNotNull, col("value")).otherwise(col(column)))
      .drop("value")
  }).reduce((df1, df2) => df1.join(df2, Seq("id", "gen")))
})
Expected output:
id gen name surname supername
1 1 Clark Kent SUPERMAN
2 1 Bruce Wayne BATMAN
3 2 Clark Kent SUPERMAN
And I would like to get rid of the .collect() but I can't make it work.
If I understood the example correctly, one join combined with a group by should be sufficient in your case. With the group by we generate a map, using
collect_list and map_from_arrays, which contains the corrections for every id/gen pair, e.g. for id 1 / gen 1: {"name" : "Clark", "supername" : "SUPERMAN"}:
import org.apache.spark.sql.functions.{coalesce, collect_list, first, map_from_arrays}
import spark.implicits._

val hdf = ??? // load hero df (HeroNamesDF)
val cdf = ??? // load corrections df (CorrectionsDF)

hdf.join(cdf, Seq("id", "gen"), "left")
  .groupBy("id", "gen")
  .agg(
    map_from_arrays(
      collect_list("attribute"), // the keys
      collect_list("value")      // the values
    ).as("m"),
    first("name").as("name"),
    first("surname").as("surname"),
    first("supername").as("supername")
  )
  .select(
    $"id",
    $"gen",
    coalesce($"m".getItem("name"), $"name").as("name"),
    coalesce($"m".getItem("surname"), $"surname").as("surname"),
    coalesce($"m".getItem("supername"), $"supername").as("supername")
  )
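Note that map_from_arrays requires Spark 2.4+. If you want to try the snippet locally, the two sample DataFrames from the question can be built like this (a minimal sketch, assuming a SparkSession named spark with spark.implicits._ in scope):

val hdf = Seq(
  (1, 1, "Clarc", "Kent", "BATMAN"),
  (2, 1, "Bruce", "Smith", "BATMAN"),
  (3, 2, "Clark", "Kent", "SUPERMAN")
).toDF("id", "gen", "name", "surname", "supername")

val cdf = Seq(
  (1, 1, "supername", "SUPERMAN"),
  (1, 1, "name", "Clark"),
  (2, 1, "surname", "Wayne")
).toDF("id", "gen", "attribute", "value")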
I am a newbie in Apache Spark and recently started coding in Scala.
I have an RDD with 4 columns that looks like this:
(Columns: 1 - name, 2 - title, 3 - views, 4 - size)
aa File:Sleeping_lion.jpg 1 8030
aa Main_Page 1 78261
aa Special:Statistics 1 20493
aa.b User:5.34.97.97 1 4749
aa.b User:80.63.79.2 1 4751
af Blowback 2 16896
af Bluff 2 21442
en Huntingtown,_Maryland 1 0
I want to group by the name column and get the sum of the views column.
It should be like this:
aa 3
aa.b 2
af 2
en 1
I have tried to use groupByKey and reduceByKey but I am stuck and unable to proceed further.
This should work: you read the text file, split each line by the separator, map to key/value pairs with the appropriate fields, and use countByKey:
sc.textFile("path to the text file")
  .map(x => x.split(" ", -1))
  .map(x => (x(0), x(3)))
  .countByKey
To complete my answer, you can also approach the problem using the DataFrame API (if this is possible for you, depending on your Spark version), for example:
val result = df.groupBy("column to Group on").agg(count("column to count on"))
Another possibility is to use the SQL approach:
val df = spark.read.csv("csv path")
df.createOrReplaceTempView("temp_table")
val result = spark.sql("SELECT <col to group on>, count(<col to count on>) FROM temp_table GROUP BY <col to group on>")
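For instance, with the column names from the question (a sketch that assumes the input is a single-space-separated text file and that the columns are named right after loading):

val df = spark.read
  .option("sep", " ")
  .csv("path to the text file")
  .toDF("name", "title", "views", "size")
df.createOrReplaceTempView("temp_table")
val result = spark.sql("SELECT name, count(title) AS cnt FROM temp_table GROUP BY name")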
I assume that you already have your RDD populated.
//For simplicity, I build RDD this way
val data = Seq(("aa", "File:Sleeping_lion.jpg", 1, 8030),
("aa", "Main_Page", 1, 78261),
("aa", "Special:Statistics", 1, 20493),
("aa.b", "User:5.34.97.97", 1, 4749),
("aa.b", "User:80.63.79.2", 1, 4751),
("af", "Blowback", 2, 16896),
("af", "Bluff", 2, 21442),
("en", "Huntingtown,_Maryland", 1, 0))
Dataframe approach
val sql = new SQLContext(sc)
import sql.implicits._
import org.apache.spark.sql.functions._
val df = data.toDF("name", "title", "views", "size")
df.groupBy($"name").agg(count($"name") as "count").show()
**Result**
+----+-----+
|name|count|
+----+-----+
| aa| 3|
| af| 2|
|aa.b| 2|
| en| 1|
+----+-----+
RDD Approach (countByKey(...))
val rdd = sc.parallelize(data)
rdd.keyBy(f => f._1).countByKey().foreach(println)
RDD Approach (reduceByKey(...))
rdd.map(f => (f._1, 1)).reduceByKey((accum, curr) => accum + curr).foreach(println(_))
If any of this does not solve your problem, please share where exactly you are stuck.
I have a Spark dataframe that looks like:
id DataArray
a array(3,2,1)
b array(4,2,1)
c array(8,6,1)
d array(8,2,4)
I want to transform this dataframe into:
id col1 col2 col3
a 3 2 1
b 4 2 1
c 8 6 1
d 8 2 4
What function should I use?
Use apply:
import org.apache.spark.sql.functions.col

df.select(
  col("id") +: (0 until 3).map(i => col("DataArray")(i).alias(s"col${i + 1}")): _*
)
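For a quick local check, the sample DataFrame from the question could be built like this (a sketch, assuming spark.implicits._ is in scope):

val df = Seq(
  ("a", Seq(3, 2, 1)),
  ("b", Seq(4, 2, 1)),
  ("c", Seq(8, 6, 1)),
  ("d", Seq(8, 2, 4))
).toDF("id", "DataArray")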
You can use foldLeft to add each column from DataArray.
Make a list of the column names that you want to add:
val columns = List("col1", "col2", "col3")

import org.apache.spark.sql.functions.col

columns.zipWithIndex
  .foldLeft(df) { (memoDF, column) =>
    memoDF.withColumn(column._1, col("DataArray")(column._2))
  }
  .drop("DataArray")
Hope this helps!
I have the data below and would like to match the ID column of df1 against df2.
df1:
ID key
1
2
3
4
5
Df2:
first second third key
--------------------------
1 9 9 777
9 8 8 878
8 1 10 765
10 12 19 909
11 2 20 708
Code:
val finalDF = Df1.join(DF2.withColumnRenamed("key", "key2"),
    $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
  .select($"ID", $"key2".as("key"))
val notMatchingDF = finalDF.filter($"key" === "")
val matchingDF = finalDF.except(notMatchingDF)

val columnsToCheck = (DF2.columns.toSet - "key").toList
import org.apache.spark.sql.functions._
val tempSelectedDetailsDF = DF2.select(array(columnsToCheck.map(col): _*).as("array"), col("key").as("key2"))
val arrayContains = udf((array: collection.mutable.WrappedArray[String], value: String) => array.contains(value))
val finalDF = df1.join(tempSelectedDetailsDF, arrayContains($"array", $"ID"), "left")
  .select($"ID", $"key2".as("key"))
  .na.fill("")
I am getting the output below:
ID key
1 777
1 765
2 708
3
4
5
However, I am expecting the output below. The logic is: in df1 we have the ID value 1, and in df2 the value 1 matches more than once, hence I am getting the output above; but it should not match the second occurrence once it has matched the first occurrence.
Expected output:
ID key
1 777
2 708
3
4
5
It should not match the second occurrence once it has matched the first occurrence.
I would suggest you create an increasing id for df2 to identify the order of matches when joined with df1, so that it is easy later on to keep only the first matches. For that you can use monotonically_increasing_id():
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val finalDF = Df1.join(DF2.withColumnRenamed("key", "key2").withColumn("order", monotonically_increasing_id()),
    $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
  .select($"ID", $"key2".as("key").cast(StringType), $"order")
Then you separate the dataframe into matching and non-matching dataframes
val notMatchingDF = finalDF.filter($"key".isNull || $"key" === "")
val matchingDF = finalDF.except(notMatchingDF)
After that, on matchingDF, generate row numbers over a window partitioned by ID and ordered by the increasing id generated above. Then keep only the first matching rows, union in the non-matching dataframe, drop the newly created column, and fill all nulls with an empty string:
import org.apache.spark.sql.expressions._

def windowSpec = Window.partitionBy("ID").orderBy("order")

matchingDF.withColumn("order", row_number().over(windowSpec))
  .filter($"order" === 1)
  .union(notMatchingDF)
  .drop("order")
  .na.fill("")
This should fulfill your requirement:
+---+---+
|ID |key|
+---+---+
|1 |777|
|2 |708|
|3 | |
|4 | |
|5 | |
+---+---+
I have a dataframe that contains 7 days of data with 24 hours per day, so it has 168 hourly columns:
id d1h1 d1h2 d1h3 ..... d7h24
aaa 21 24 8 ..... 14
bbb 16 12 2 ..... 4
ccc 21 2 7 ..... 6
What I want to do is find the 3 largest values for each day:
id d1 d2 d3 .... d7
aaa [22,2,2] [17,2,2] [21,8,3] [32,11,2]
bbb [32,22,12] [47,22,2] [31,14,3] [32,11,2]
ccc [12,7,4] [28,14,7] [11,2,1] [19,14,7]
import org.apache.spark.sql.functions._

var df = ...
val first3 = udf((list: Seq[Double]) => list.slice(0, 3))

for (i <- 1 to 7) {
  val columns = (1 to 24).map(h => "d" + i + "h" + h)
  df = df
    .withColumn("d" + i, first3(sort_array(array(columns.head, columns.tail: _*), false)))
    .drop(columns: _*)
}
This should give you what you want. For each day I aggregate the 24 hours into an array column, sort it in descending order, and select the first 3 elements.
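As a side note, on Spark 2.4+ the udf can be avoided by using the built-in slice function instead (a sketch of the same loop, under the same assumptions about the column names):

import org.apache.spark.sql.functions.{array, col, slice, sort_array}

var result = df
for (i <- 1 to 7) {
  val columns = (1 to 24).map(h => s"d${i}h$h")
  result = result
    .withColumn(s"d$i", slice(sort_array(array(columns.map(col): _*), false), 1, 3))
    .drop(columns: _*)
}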
Define pattern:
val p = "^(d[1-7])h[0-9]{1,2}$".r
Group columns:
import org.apache.spark.sql.functions._

val cols = df.columns.tail
  .groupBy { case p(d) => d }
  .map { case (c, cs) =>
    val sorted = sort_array(array(cs.map(col): _*), false)
    array(sorted(0), sorted(1), sorted(2)).as(c)
  }
And select:
df.select($"id" +: cols.toSeq: _*)
I have a dataframe df, which contains the data below:
**customers** **product** **Val_id**
1 A 1
2 B X
3 C
4 D Z
I have been provided 2 rules, which are as below:
**rule_id** **rule_name** **product value** **priority**
123 ABC A,B 1
456 DEF A,B,D 2
The requirement is to apply these rules on dataframe df in priority order: customers who have passed rule 1 should not be considered for rule 2, and in the final dataframe two more columns, rule_id and rule_name, should be added. I have written the code below to achieve it:
val rule_name = when(col("product").isin("A","B"), "ABC").otherwise(when(col("product").isin("A","B","D"), "DEF").otherwise(""))
val rule_id = when(col("product").isin("A","B"), "123").otherwise(when(col("product").isin("A","B","D"), "456").otherwise(""))
val df1 = df_customers.withColumn("rule_name" , rule_name).withColumn("rule_id" , rule_id)
df1.show()
Final output looks like below:
**customers** **product** **Val_id** **rule_name** **rule_id**
1 A 1 ABC 123
2 B X ABC 123
3 C
4 D Z DEF 456
Is there any better way to achieve it, adding both columns by going through the entire dataset only once instead of twice?
Question: Is there any better way to achieve it, adding both columns by going through the entire dataset only once instead of twice?
Answer: you can have a Map return type in Scala. The limitation is that this works when the udf is used with withColumn to produce a single column (for example, one named ruleIDandRuleName); then a single function with a Map (or any data type acceptable as a Spark SQL column) as its return type is enough. Otherwise you cannot use the approach shown in the example snippet below.
def ruleNameAndruleId = udf((product: String) => {
  if (Seq("A", "B").contains(product)) Map("ruleName" -> "ABC", "ruleId" -> "123")
  else if (Seq("A", "B", "D").contains(product)) Map("ruleName" -> "DEF", "ruleId" -> "456")
  else Map("ruleName" -> "", "ruleId" -> "")
})
The caller will be:
df.withColumn("ruleIDandRuleName", ruleNameAndruleId(col("product"))) // ruleNameAndruleId will return a map containing the rule name and rule id
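If you then want the two separate columns from your expected output, the map entries can be extracted afterwards (a sketch; getItem works on a MapType column):

import org.apache.spark.sql.functions.col

val result = df_customers
  .withColumn("rules", ruleNameAndruleId(col("product")))
  .withColumn("rule_name", col("rules").getItem("ruleName"))
  .withColumn("rule_id", col("rules").getItem("ruleId"))
  .drop("rules")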
An alternative to your solution would be to use udf functions. It is almost similar to the when function, as both require serialization and deserialization. It is up to you to test which is faster and more efficient.
def rule_name = udf((product: String) => {
  if (Seq("A", "B").contains(product)) "ABC"
  else if (Seq("A", "B", "D").contains(product)) "DEF"
  else ""
})

def rule_id = udf((product: String) => {
  if (Seq("A", "B").contains(product)) "123"
  else if (Seq("A", "B", "D").contains(product)) "456"
  else ""
})

val df1 = df_customers
  .withColumn("rule_name", rule_name(col("product")))
  .withColumn("rule_id", rule_id(col("product")))
df1.show()