Env: Spark 1.6, Scala
My dataframe is like below:
DF=
DT col1 col2
----------|---|----
2017011011| AA| BB
2017011011| CC| DD
2017011015| PP| BB
2017011015| QQ| DD
2017011016| AA| BB
2017011016| CC| DD
2017011017| PP| BB
2017011017| QQ| DD
How can I filter to get a result like this SQL: select * from DF where DT in (select distinct DT from DF order by DT desc limit 3)?
The output should contain only the rows for the last 3 dates:
2017011015 |PP |BB
2017011015 |QQ |DD
2017011016 |AA |BB
2017011016 |CC |DD
2017011017 |PP |BB
2017011017 |QQ |DD
Tested on Spark 1.6.1
import sqlContext.implicits._

val df = sqlContext.createDataFrame(Seq(
  (2017011011, "AA", "BB"),
  (2017011011, "CC", "DD"),
  (2017011015, "PP", "BB"),
  (2017011015, "QQ", "DD"),
  (2017011016, "AA", "BB"),
  (2017011016, "CC", "DD"),
  (2017011017, "PP", "BB"),
  (2017011017, "QQ", "DD")
)).select(
  $"_1".as("DT"),
  $"_2".as("col1"),
  $"_3".as("col2")
)

val dates = df.select($"DT")
  .distinct()
  .orderBy(-$"DT")
  .map(_.getInt(0))
  .take(3)

val result = df.filter(dates.map($"DT" === _).reduce(_ || _))
result.show()
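For what it's worth, the same predicate can be written more compactly with isin (a minimal, untested sketch reusing the dates array from above; isin exists on Column since Spark 1.5):

// Sketch: build the filter with isin instead of reducing equality checks
val result2 = df.filter($"DT".isin(dates: _*))
result2.show()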
Summary: Combining multiple rows to columns for a user
Input DF:

Id  group     A1    A2    B1    B2
1   Alpha     1     2     null  null
1   AlphaNew  6     8     null  null
2   Alpha     7     4     null  null
2   Beta      null  null  3     9
Note: The group values are dynamic
Expected Output DF:

Id  Alpha_A1  Alpha_A2  AlphaNew_A1  AlphaNew_A2  Beta_B1  Beta_B2
1   1         2         6            8            null     null
2   7         4         null         null         3        9
Attempted Solution:
I thought of making a JSON of the non-null columns for each row, then doing a group by with a collect_list of maps. Then I could explode the JSON to get the expected output.
But I am stuck at the stage of a nested JSON. Here is my code:
vcols = df.columns[2:]

df\
    .withColumn('json', F.to_json(F.struct(*vcols)))\
    .groupby('id')\
    .agg(
        F.to_json(
            F.collect_list(
                F.create_map('group', 'json')
            )
        ).alias('json')
    )
Id  json
1   [{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}]
2   [{Alpha: {A1:7, A2:4}}, {Beta: {B1:3, B2:9}}]
What I am trying to get:
Id  json
1   [{Alpha_A1:1, Alpha_A2:2, AlphaNew_A1:6, AlphaNew_A2:8}]
2   [{Alpha_A1:7, Alpha_A2:4, Beta_B1:3, Beta_B2:9}]
I'd appreciate any help. I'm also trying to avoid UDFs, as my actual dataframe is quite large.
There's definitely a better way to do this, but I continued your to_json experiment.
Using UDFs:
After you get something like [{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}] you could create a UDF to flatten the dict. But since it's a JSON string, you have to parse it into a dict and then serialize it back to JSON.
After that you would like to explode and pivot the table, but that's not possible with JSON strings, so you have to use F.from_json with a defined schema. That gives you a MapType, which you can explode and pivot.
Here's an example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from collections.abc import MutableMapping
import json
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    MapType,
    StringType,
)
def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

def flatten_groups(data):
    result = []
    for item in json.loads(data):
        result.append(flatten_dict(item))
    return json.dumps(result)
if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

    data = [
        (1, "Alpha", 1, 2, None, None),
        (1, "AlphaNew", 6, 8, None, None),
        (2, "Alpha", 7, 4, None, None),
        (2, "Beta", None, None, 3, 9),
    ]
    columns = ["Id", "group", "A1", "A2", "B1", "B2"]
    df = spark.createDataFrame(data, columns)

    vcols = df.columns[2:]
    df = (
        df.withColumn("json", F.struct(*vcols))
        .groupby("id")
        .agg(F.to_json(F.collect_list(F.create_map("group", "json"))).alias("json"))
    )

    # Flatten groups
    flatten_groups_udf = F.udf(lambda x: flatten_groups(x))
    schema = ArrayType(MapType(StringType(), IntegerType()))
    df = df.withColumn("json", F.from_json(flatten_groups_udf(F.col("json")), schema))

    # Explode and pivot
    df = df.select(F.col("id"), F.explode(F.col("json")).alias("json"))
    df = (
        df.select("id", F.explode("json"))
        .groupby("id")
        .pivot("key")
        .agg(F.first("value"))
    )
At the end the dataframe looks like this:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
Without UDFs:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, MapType, StringType

vcols = df.columns[2:]
df = (
    df.withColumn("json", F.to_json(F.struct(*vcols)))
    .groupby("id")
    .agg(
        F.collect_list(
            F.create_map(
                "group", F.from_json("json", MapType(StringType(), IntegerType()))
            )
        ).alias("json")
    )
)

df = df.withColumn("json", F.explode(F.col("json")).alias("json"))
df = df.select("id", F.explode(F.col("json")).alias("root", "value"))
df = df.select("id", "root", F.explode(F.col("value")).alias("sub", "value"))
df = df.select(
    "id", F.concat(F.col("root"), F.lit("_"), F.col("sub")).alias("name"), "value"
)
df = df.groupBy(F.col("id")).pivot("name").agg(F.first("value"))
Result:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
I found a slightly better way than the JSON approach:
Stack the input dataframe's value columns A1, A2, B1, B2, ... as rows.
The structure then looks like id, group, sub, value, where sub holds the column name (A1, A2, B1, B2) and value holds the associated value.
Filter out the rows whose value is null.
Now we can pivot by group. Since the null-value rows are removed, we won't have the initial issue of the pivot creating extra columns.
import pyspark.sql.functions as F
data = [
(1, "Alpha", 1, 2, None, None),
(1, "AlphaNew", 6, 8, None, None),
(2, "Alpha", 7, 4, None, None),
(2, "Beta", None, None, 3, 9),
]
columns = ["id", "group", "A1", "A2", "B1", "B2"]
df = spark.createDataFrame(data, columns)
# Value columns that need to be stacked
vcols = df.columns[2:]
expr_str = ', '.join([f"'{i}', {i}" for i in vcols])
expr_str = f"stack({len(vcols)}, {expr_str}) as (sub, value)"
df = df\
.selectExpr("id", "group", expr_str)\
.filter(F.col("value").isNotNull())\
.select("id", F.concat("group", F.lit("_"), "sub").alias("group"), "value")\
.groupBy("id")\
.pivot("group")\
.agg(F.first("value"))
df.show()
Result:
+---+-----------+-----------+--------+--------+-------+-------+
| id|AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
| 1| 6| 8| 1| 2| null| null|
| 2| null| null| 7| 4| 3| 9|
+---+-----------+-----------+--------+--------+-------+-------+
Input data:
val inputDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF
println("Input:")
inputDf.show(false)
Here is how the input looks:
+---------+
|value |
+---------+
|[a, b, c]|
|[X, Y, Z]|
+---------+
Here is how the expected output looks:
+---+---+---+
|0 |1 |2 |
+---+---+---+
|a |b |c |
|X |Y |Z |
+---+---+---+
I tried to use code like this:
val ncols = 3
val selectCols = (0 until ncols).map(i => $"arr"(i).as(s"col_$i"))
inputDf
.select(selectCols:_*)
.show()
But I get errors (the compiler complains it needs a Unit).
Another way to create such a dataframe:
df1 = spark.createDataFrame([(1,[4,2, 1]),(4,[3,2])], [ "col2","col4"])
Output:
+----+---------+
|col2| col4|
+----+---------+
| 1|[4, 2, 1]|
| 4| [3, 2]|
+----+---------+
package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
object ArrayToCol extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val inptDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF("value")

  val d = inptDf
    .withColumn("0", col("value").getItem(0))
    .withColumn("1", col("value").getItem(1))
    .withColumn("2", col("value").getItem(2))
    .drop("value")

  d.show(false)
}
// Variant 2
val res = inptDf.select(
$"value".getItem(0).as("col0"),
$"value".getItem(1).as("col1"),
$"value".getItem(2).as("col2")
)
// Variant 3
val res1 = inptDf.select(
col("*") +: (0 until 3).map(i => col("value").getItem(i).as(s"$i")): _*
)
.drop("value")
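If the array length isn't known up front, a minimal sketch (my own addition, not part of the original answer) could derive the width from the data first and then build the same projection:

// Variant 4 (sketch): compute the number of columns instead of hard-coding 3
import org.apache.spark.sql.functions.{col, max, size}

val ncols = inptDf.agg(max(size(col("value")))).head().getInt(0)
val res2 = inptDf.select(
  (0 until ncols).map(i => col("value").getItem(i).as(s"$i")): _*
)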
How can we exclude all alphabetic characters from a string, keeping only the numeric value in a separate column, using Spark 2.0 with Scala?
Input
"ActivalteTime": "PT5M",
"ReActivalteTime": "xy20$",
Output
"NewActivalteTime": "5",
"NewReActivalteTime": "20",
Please help
Here's a slightly generalized approach to handle an arbitrary list of columns to be extracted for numeric content using regexp_extract:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "A", "PT5M", "xy20$", "M100.1!"),
(2, "B", "QU6N", "uv%", "N200.2&")
).toDF("C1", "C2", "C3", "C4", "C5")
val colsToExtract = Seq("C3", "C4", "C5")
val colsRemained = df.columns diff colsToExtract
val prefix = "New"
df.select(colsRemained.map(col) ++ colsToExtract.map(c =>
regexp_extract(col(c), "([0-9.]+)", 1).as(s"${prefix}$c")): _*
).show
// +---+---+-----+-----+-----+
// | C1| C2|NewC3|NewC4|NewC5|
// +---+---+-----+-----+-----+
// | 1| A| 5| 20|100.1|
// | 2| B| 6| |200.2|
// +---+---+-----+-----+-----+
Use the regexp_extract function to extract only the digits from the string.
val df=Seq((""""ActivalteTime": "PT5M","""),(""""ReActivalteTime": "xy20$",""")).toDF("text")
df.show(false)
Result:
+---------------------------+
|text |
+---------------------------+
|"ActivalteTime": "PT5M", |
|"ReActivalteTime": "xy20$",|
+---------------------------+
Using Regexp_extract:
df.withColumn("num",regexp_extract($"text","(\\d+)",1)).show(false)
+---------------------------+---+
|text |num|
+---------------------------+---+
|"ActivalteTime": "PT5M", |5 |
|"ReActivalteTime": "xy20$",|20 |
+---------------------------+---+
I have a requirement to filter a list column using another column in the same dataframe.
Below is my DataFrame. Here, I want to filter the col3 list using col1 and keep only the active children for each parent.
Df.show(10, false):

Col1  Col2    col3           flag
P1    Parent  [c1,c2,c3,c4]  Active
c1    Child   []             InActive
c2    Child   []             Active
c3    Child   []             Active
Expected output (Df.show(10, false)):

Col1  Col2    col3      flag
P1    Parent  [c2,c3]   Active
c2    Child   []        Active
c3    Child   []        Active
Can someone help me get the above result?
I generated your dataframe like this:
val df = Seq(("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
("c1", "Child", Seq(), "Inactive"),
("c2", "Child", Seq(), "Active"),
("c3", "Child", Seq(), "Active"))
.toDF("Col1", "Col2", "col3", "flag")
Then I filter only the active children in one dataframe which is one part of your output:
val active_children = df.where('flag === "Active").where('Col2 === "Child")
I also generate a flatten dataframe of parent/child relationships with explode:
val rels = df.withColumn("child", explode('col3))
.select("Col1", "Col2", "flag", "child")
scala> rels.show
+----+------+------+-----+
|Col1| Col2| flag|child|
+----+------+------+-----+
| p1|Parent|Active| c1|
| p1|Parent|Active| c2|
| p1|Parent|Active| c3|
| p1|Parent|Active| c4|
+----+------+------+-----+
and a dataframe with only one column corresponding to active children like this:
val child_filter = active_children.select('Col1 as "child")
and use this child_filter dataframe to filter (with a join) the parents you are interested in and use a groupBy to aggregate the lines back to your output format:
val parents = rels
.join(child_filter, "child")
.groupBy("Col1")
.agg(first('Col2) as "Col2",
collect_list('child) as "col3",
first('flag) as "flag")
scala> parents.show
+----+------+--------+------+
|Col1| Col2| col3| flag|
+----+------+--------+------+
| p1|Parent|[c2, c3]|Active|
+----+------+--------+------+
Finally, a union yields the expected output:
scala> parents.union(active_children).show
+----+------+--------+------+
|Col1| Col2| col3| flag|
+----+------+--------+------+
| p1|Parent|[c2, c3]|Active|
| c2| Child| []|Active|
| c3| Child| []|Active|
+----+------+--------+------+
I have a spark data frame with columns like so:
df
--------------------------
A B C D E F amt
"A1" "B1" "C1" "D1" "E1" "F1" 1
"A2" "B2" "C2" "D2" "E2" "F2" 2
I would like to perform groupBy with column combinations
(A, B, sum(amt))
(A, C, sum(amt))
(A, D, sum(amt))
(A, E, sum(amt))
(A, F, sum(amt))
such that the resulting data frame looks like:
df_grouped
----------------------
A field value amt
"A1" "B" "B1" 1
"A2" "B" "B2" 2
"A1" "C" "C1" 1
"A2" "C" "C2" 2
"A1" "D" "D1" 1
"A2" "D" "D2" 2
My attempt at this was the following:
val cols = Vector("B", "C", "D", "E", "F")
// code for creating an empty data frame with the columns A, field, value and amt
for (c <- cols) {
  empty_df = empty_df.union(
    df.groupBy($"A", df(c))
      .agg(sum("amt").as("amt"))
      .withColumn("field", lit(c))
      .withColumnRenamed(c, "value")
  )
}
I feel that using "for" or "foreach" may be clumsy in a distributed environment such as Spark. Are there any alternatives with map-style functionality for what I am doing? In my mind, aggregateByKey and collect_list may work; however, I am unable to imagine a complete solution. Please advise.
foldLeft is a very powerful function in Scala if you know how to play with it. I suggest you use foldLeft (I have added comments in the code for clarity and explanation):
// selecting the columns without A and amt
val columnsForAggregation = df.columns.tail.toSet - "amt"

// creating an empty dataframe (format of the final output)
val finalDF = Seq(("empty", "empty", "empty", 0.0)).toDF("A", "field", "value", "amt")

// using foldLeft for the aggregation and merging each aggregated result
import org.apache.spark.sql.functions._

val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)) { (tempdf, column) =>
  // aggregate on A and one of the columns, then select the shape required in the output
  val aggregatedf = tempdf._1.groupBy("A", column).agg(sum("amt").as("amt"))
    .select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))
  // union the aggregated result and carry the dataframes forward for the next iteration
  (df, tempdf._2.union(aggregatedf))
}

// finally removing the dummy row created above
transformeddf.filter(col("A") =!= "empty")
  .show(false)
You should get the dataframe you desire:
+---+-----+-----+---+
|A |field|value|amt|
+---+-----+-----+---+
|A1 |E |E1 |1.0|
|A2 |E |E2 |2.0|
|A1 |F |F1 |1.0|
|A2 |F |F2 |2.0|
|A2 |B |B2 |2.0|
|A1 |B |B1 |1.0|
|A2 |C |C2 |2.0|
|A1 |C |C1 |1.0|
|A1 |D |D1 |1.0|
|A2 |D |D2 |2.0|
+---+-----+-----+---+
I hope the answer is helpful
A more concise form of the above foldLeft is:
import org.apache.spark.sql.functions._

val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)) { (tempdf, column) =>
  (df, tempdf._2.union(
    tempdf._1.groupBy("A", column).agg(sum("amt").as("amt"))
      .select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))
  ))
}
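As a side note, a minimal sketch of an equivalent approach (assuming the same df and columnsForAggregation as above) maps every candidate column to its aggregated frame and reduces with union, which avoids seeding the fold with a dummy row:

import org.apache.spark.sql.functions._

// Sketch only: one aggregated frame per column, then union them all
val transformed = columnsForAggregation.toSeq.map { c =>
  df.groupBy("A", c).agg(sum("amt").as("amt"))
    .select(col("A"), lit(c).as("field"), col(c).as("value"), col("amt"))
}.reduce(_ union _)

transformed.show(false)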