Parse json with same key to different columns - pyspark

This is the schema of my json. The field Reports_Rows_Rows_Cells, when it is not null, looks like this:
[Row(Attributes=[Row(Id='account', Value='bd9e85e0-0478-433d-ae9f-0b3c4f04bfe4')], Value='Business Bank Account'),
Row(Attributes=[Row(Id='account', Value='bd9e85e0-0478-433d-ae9f-0b3c4f04bfe4')], Value='10105.54'),
Row(Attributes=[Row(Id='account', Value='bd9e85e0-0478-433d-ae9f-0b3c4f04bfe4')], Value='4938.48')]
What I want is to create a table that has all the above columns, with Reports_Rows_Rows_Cells split into one column per cell value, like this:
| ... | Reports_Rows_Rows_Cells_Value | Reports_Rows_Rows_Cells_Value | Reports_Rows_Rows_Cells_Value |
| --- | ----------------------------- | ----------------------------- | ----------------------------- |
| ... | Business Bank Account         | 10105.54                      | 4938.48                       |
Now, after parsing the json, my table instead looks like this:
| ... | Reports_Rows_Rows_Cells_Value |
| --- | ----------------------------- |
| ... | Business Bank Account         |
| ... | 10105.54                      |
| ... | 4938.48                       |
Here is the code I use to parse the json:
from pyspark.sql import functions as F
from pyspark.sql.functions import explode_outer

def flatten_df(nested_df):
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col in array_cols:
        nested_df = nested_df.withColumn(col, explode_outer(nested_df[col]))
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flatten_df(flat_df)
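One way to spread the cell values into separate columns is to keep the position of each cell and pivot on it. This is only a minimal sketch, not the asker's code: it assumes Spark 2.3+ for posexplode_outer, it is applied before Reports_Rows_Rows_Cells has been exploded, and it uses a generated row id since the full schema isn't shown (df stands for the parsed json dataframe).

from pyspark.sql import functions as F

# Tag each row, explode the cells together with their position,
# then pivot the positions (0, 1, 2, ...) into separate columns.
with_id = df.withColumn("_row_id", F.monotonically_increasing_id())

cells = with_id.select(
    "_row_id",
    F.posexplode_outer("Reports_Rows_Rows_Cells").alias("pos", "cell"))

pivoted = (cells
           .groupBy("_row_id")
           .pivot("pos")
           .agg(F.first(F.col("cell.Value"))))

result = (with_id
          .drop("Reports_Rows_Rows_Cells")
          .join(pivoted, "_row_id")
          .drop("_row_id"))

The pivoted columns come out named 0, 1, 2, ... and can be renamed to Reports_Rows_Rows_Cells_Value_0 and so on afterwards.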

Related

How to rename column of a DataFrame from a list

I have a dataframe like this:
| dog    | cat    |
| ------ | ------ |
| Cell 1 | Cell 2 |
| Cell 3 | Cell 4 |
And a list like this:
dog, bulldog
cat, persian
I would like to create a function that finds the name of the column in the list and substitutes it with the second element (bulldog, persian).
So the final result should be:
| bulldog | persian |
| -------- | -------- |
| Cell 1 | Cell 2 |
| Cell 3 | Cell 4 |
You need to perform a look-up for your original column in the pre-defined list that you have shown. It's easier to create a Map out of it so lookups can be performed:
val list: List[(String, String)] = List(("dog", "bulldog"), ("cat", "persian"))
val columnMap = list.toMap
// columnMap: scala.collection.immutable.Map[String,String] = Map(dog -> bulldog, cat -> persian)

val originalCols = df.columns

val renamedCols = originalCols.map { c =>
  if (columnMap.contains(c)) s"${c} as ${columnMap(c)}" else c
}
// renamedCols: Array[String] = Array(dog as bulldog, cat as persian)
df.selectExpr(renamedCols: _*).show(false)
// +-------+-------+
// |bulldog|persian|
// +-------+-------+
// |Cell 1 |Cell 2 |
// |Cell 3 |Cell 4 |
// +-------+-------+
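For comparison, here is a minimal PySpark sketch of the same lookup idea, using the toy frame from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Cell 1", "Cell 2"), ("Cell 3", "Cell 4")], ["dog", "cat"])

column_map = {"dog": "bulldog", "cat": "persian"}

# Look each existing column name up in the map, falling back to the original name.
renamed = df.toDF(*[column_map.get(c, c) for c in df.columns])
renamed.show()
# +-------+-------+
# |bulldog|persian|
# +-------+-------+
# | Cell 1| Cell 2|
# | Cell 3| Cell 4|
# +-------+-------+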

Fill null values in a row with frequency of other column

In a spark structured streaming context, I have this dataframe :
+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1 |1632899456|4 |
|BR1 |1632901256|4 |
|BR300 |1632901796|null |
|BR300 |1632899155|null |
|BR90 |1632901743|1 |
|BR1 |1632899933|4 |
|BR1 |1632899756|4 |
|BR22 |1632900776|null |
|BR22 |1632900176|null |
+------+----------+---------+
I would like to replace the null values with the frequency of the brand in the batch, in order to obtain a dataframe like this one:
+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1 |1632899456|4 |
|BR1 |1632901256|4 |
|BR300 |1632901796|2 |
|BR300 |1632899155|2 |
|BR90 |1632901743|1 |
|BR1 |1632899933|4 |
|BR1 |1632899756|4 |
|BR22 |1632900776|2 |
|BR22 |1632900176|2 |
+------+----------+---------+
I am using Spark version 2.4.3 and SQLContext, with the Scala language.
With "count" over window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, when}

val df = Seq(
  ("BR1", 1632899456, Some(4)),
  ("BR1", 1632901256, Some(4)),
  ("BR300", 1632901796, None),
  ("BR300", 1632899155, None),
  ("BR90", 1632901743, Some(1)),
  ("BR1", 1632899933, Some(4)),
  ("BR1", 1632899756, Some(4)),
  ("BR22", 1632900776, None),
  ("BR22", 1632900176, None)
).toDF("brand", "Timestamp", "frequency")

val brandWindow = Window.partitionBy("brand")
val result = df.withColumn(
  "frequency",
  when($"frequency".isNotNull, $"frequency").otherwise(count($"brand").over(brandWindow)))
Result:
+-----+----------+---------+
|brand|Timestamp |frequency|
+-----+----------+---------+
|BR1  |1632899456|4        |
|BR1  |1632901256|4        |
|BR1  |1632899933|4        |
|BR1  |1632899756|4        |
|BR22 |1632900776|2        |
|BR22 |1632900176|2        |
|BR300|1632901796|2        |
|BR300|1632899155|2        |
|BR90 |1632901743|1        |
+-----+----------+---------+
Solution with GroupBy:
val countDF = df.select("brand").groupBy("brand").count()

df.alias("df")
  .join(countDF.alias("cnt"), Seq("brand"))
  .withColumn("frequency", when($"df.frequency".isNotNull, $"df.frequency").otherwise($"cnt.count"))
  .select("df.brand", "df.Timestamp", "frequency")
I'm a Java programmer, so here is a plain-Java take: loop through the frequency column, find the first null and its brand, count how many rows of that brand there are, write that count into the null values for that brand, then move on to the next brand with nulls. I wrote this in a text editor and haven't tested it, but I hope it works:
// This is your table plus its dimensions: column 0 = brand, column 1 = timestamp, column 2 = frequency.
String[][] table = {
    {"BR1",   "1632899456", "4"},
    {"BR1",   "1632901256", "4"},
    {"BR300", "1632901796", null},
    {"BR300", "1632899155", null},
    {"BR90",  "1632901743", "1"},
    {"BR1",   "1632899933", "4"},
    {"BR1",   "1632899756", "4"},
    {"BR22",  "1632900776", null},
    {"BR22",  "1632900176", null}
};

for (int i = 0; i < table.length; i++) {
    if (table[i][2] == null) {
        String brand = table[i][0];

        // Count how many times this brand appears in the whole table.
        int repeatCounter = 0;
        for (int n = 0; n < table.length; n++) {
            if (brand.equals(table[n][0])) {
                repeatCounter++;
            }
        }

        // Change the null values of this brand to the number of repeats.
        for (int n = 0; n < table.length; n++) {
            if (brand.equals(table[n][0]) && table[n][2] == null) {
                table[n][2] = String.valueOf(repeatCounter);
            }
        }
    }
}

How to apply an empty condition to sql select by using "and" in Spark?

I have a UuidConditionSet; when the if condition is not met, I want to apply an empty string to my select statement (or just ignore this UuidConditionSet), but I get this error. How can I solve this problem?
mismatched input 'FROM' expecting <EOF>(line 10, pos 3)
This is the select
(SELECT
item,
amount,
date
from my_table
where record_type = 'myType'
and ( date_format(date, "yyyy-MM-dd") >= '2020-02-27'
and date_format(date, "yyyy-MM-dd") <= '2020-02-28' )
and ()
var UuidConditionSet = ""
var UuidCondition = Seq.empty[String]
if(!UuidList.mkString.isEmpty) {
UuidCondition = for {
Uuid <- UuidList
UuidConditionSet = s"${SQLColumnHelper.EVENT_INFO_STRUCT_NAME}.${SQLColumnHelper.UUID} = '".concat(eventUuid).concat("'")
} yield UuidConditionSet
UuidConditionSet = UuidCondition.reduce(_.concat(" or ").concat(_))
}
s"""SELECT
| ${SQLColumnHelper.STRUCT_NAME_ITEM},
| ${SQLColumnHelper.STRUCT_NAME_AMOUNT},
| ${SQLColumnHelper.DATE}
| from ${sqlTableHelper.TABLE}
| where ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME} = '${RECORD_TYPE}'
| and ( date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") >= '${stayDateRangeTuple._1}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") <= '${stayDateRangeTuple._2}' )
| and ($UuidConditionSet)
You can use pattern matching on the list UuidList to check the size and return an empty string if the list is empty. Also, you can use IN instead of multiple ORs here.
Try this:
val UuidCondition = UuidList match {
  case l if (l.size > 0) => {
    l.map(u => s"'$u'").mkString(
      s"and ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME}.${SQLColumnHelper.UUID} in (",
      ",",
      ")"
    )
  }
  case _ => ""
}
s"""SELECT
| ${SQLColumnHelper.STRUCT_NAME_ITEM},
| ${SQLColumnHelper.STRUCT_NAME_AMOUNT},
| ${SQLColumnHelper.DATE}
| from ${sqlTableHelper.TABLE}
| where ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME} = '${RECORD_TYPE}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") >= '${stayDateRangeTuple._1}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") <= '${stayDateRangeTuple._2}'
| $UuidCondition
"""

array[array["string"]] with explode option dropping null rows in spark/scala [duplicate]

I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For instance,
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
should become
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
This is my code
private DataFrame explodeDataFrame(DataFrame df) {
    DataFrame resultDf = df;
    for (StructField field : df.schema().fields()) {
        if (field.dataType() instanceof ArrayType) {
            resultDf = resultDf.withColumn(field.name(), org.apache.spark.sql.functions.explode(resultDf.col(field.name())));
            resultDf.show();
        }
    }
    return resultDf;
}
The problem is that in my data, some of the array columns have nulls. In that case, the entire row is deleted. So this dataframe:
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
2 | Lucy | null
becomes
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
instead of
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
How can I explode my arrays so that I don't lose the null rows?
I am using Spark 1.5.2 and Java 8
Spark 2.2+
You can use explode_outer function:
import org.apache.spark.sql.functions.explode_outer
df.withColumn("likes", explode_outer($"likes")).show
// +---+----+--------+
// | id|name| likes|
// +---+----+--------+
// | 1|Luke|baseball|
// | 1|Luke| soccer|
// | 2|Lucy| null|
// +---+----+--------+
Spark <= 2.1
In Scala, but the Java equivalent should be almost identical (to import individual functions use import static).
import org.apache.spark.sql.functions.{array, col, explode, lit, when}

val df = Seq(
  (1, "Luke", Some(Array("baseball", "soccer"))),
  (2, "Lucy", None)
).toDF("id", "name", "likes")

df.withColumn("likes", explode(
  when(col("likes").isNotNull, col("likes"))
    // If null, explode an array<string> with a single null
    .otherwise(array(lit(null).cast("string")))))
The idea here is basically to replace NULL with an array(NULL) of the desired type. For complex types (a.k.a. structs) you have to provide the full schema:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val dfStruct = Seq((1L, Some(Array((1, "a")))), (2L, None)).toDF("x", "y")

val st = StructType(Seq(
  StructField("_1", IntegerType, false),
  StructField("_2", StringType, true)
))

dfStruct.withColumn("y", explode(
  when(col("y").isNotNull, col("y"))
    .otherwise(array(lit(null).cast(st)))))
or
dfStruct.withColumn("y", explode(
when(col("y").isNotNull, col("y"))
.otherwise(array(lit(null).cast("struct<_1:int,_2:string>")))))
Note:
If the array column has been created with containsNull set to false, you should change this first (tested with Spark 2.1):
df.withColumn("array_column", $"array_column".cast(ArrayType(SomeType, true)))
You can use the explode_outer() function.
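In PySpark (Spark 2.2+) that one-liner looks like this, as a quick sketch:

from pyspark.sql import functions as F

# explode_outer keeps rows whose array is null (or empty),
# producing a single null value instead of dropping the row.
df.withColumn("likes", F.explode_outer("likes")).show()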
Following up on the accepted answer: when the array elements are a complex type, it can be difficult to define the schema by hand (e.g. with large structs).
To do it automatically I wrote the following helper method:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{array, col, explode, lit, size, when}
import org.apache.spark.sql.types.ArrayType

def explodeOuter(df: Dataset[Row], columnsToExplode: List[String]) = {
  // Collect each array column together with its ArrayType, so the null placeholder can be cast correctly.
  val arrayFields = df.schema.fields
    .map(field => field.name -> field.dataType)
    .collect { case (name, dataType: ArrayType) => (name, dataType) }
    .toMap

  columnsToExplode.foldLeft(df) { (dataFrame, arrayCol) =>
    dataFrame.withColumn(arrayCol, explode(
      when(size(col(arrayCol)) =!= 0, col(arrayCol))
        .otherwise(array(lit(null).cast(arrayFields(arrayCol).elementType)))))
  }
}
Edit: it seems that Spark 2.2 and newer have this built in.
To handle an empty map type column (for Spark <= 2.1):
import org.apache.spark.sql.functions.{map, map_keys}

val df = List(
  (1, Array(2, 3, 4), Map(1 -> "a")),
  (2, Array(5, 6, 7), Map(2 -> "b")),
  (3, Array[Int](), Map[Int, String]())
).toDF("col1", "col2", "col3")
df.show()

df.select('col1, explode(when(size(map_keys('col3)) === 0, map(lit("null"), lit("null")))
  .otherwise('col3))).show()
from pyspark.sql.functions import *

def flatten_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    flat_df = nested_df.select(flat_cols +
                               [col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    print("flatten_df_count :", flat_df.count())
    return flat_df

def explode_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct' and c[1][:5] != 'array']
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    # Replace null arrays with an array holding a single null of the right type,
    # so the explode below does not drop those rows.
    for array_col in array_cols:
        schema = nested_df.select(array_col).dtypes[0][1]
        nested_df = nested_df.withColumn(array_col,
                                         when(col(array_col).isNotNull(), col(array_col))
                                         .otherwise(array(lit(None)).cast(schema)))
    nested_df = (nested_df
                 .withColumn("tmp", arrays_zip(*array_cols))
                 .withColumn("tmp", explode("tmp"))
                 .select([col("tmp." + c).alias(c) for c in array_cols] + flat_cols))
    print("explode_dfs_count :", nested_df.count())
    return nested_df

new_df = flatten_df(myDf)
while True:
    array_cols = [c[0] for c in new_df.dtypes if c[1][:5] == 'array']
    if len(array_cols):
        new_df = flatten_df(explode_df(new_df))
    else:
        break
new_df.printSchema()
Used arrays_zip and explode to do it faster and address the null issue.

Linq: dynamic Where clause inside a nested subquery get latest records of each group

I currently have a problem with the dynamic LINQ expression below.
My Models
public class Orders
{
    public int OrderId;
    public ICollection<OrderStatuses> Statuses;
}

public class Statuses
{
    public int StatusId;
    public int OrderId;
    public string Note;
    public DateTime Created;
}
My Sample data :
Orders
| ID | Name |
----------------------
| 1 | Order 01 |
| 2 | Order 02 |
| 3 | Order 03 |
Statuses
|ID | OrderId | Note | Created |
---------------------------------------
| 1 | 1 | Ordered | 2016-03-01|
| 2 | 1 | Pending | 2016-04-02|
| 3 | 1 | Completed | 2016-05-19|
| 4 | 1 | Ordered | 2015-05-19|
| 5 | 2 | Ordered | 2016-05-20|
| 6 | 2 | Completed | 2016-05-19|
| 7 | 3 | Completed | 2016-05-19|
I'd like to get the orders that have a status with Note equal to 'Ordered', together with the latest Created time of that status.
Below is a sample of the result I expect from the query:
| Name     | Note    | Last Created |
| -------- | ------- | ------------ |
| Order 01 | Ordered | 2016-03-01   |
| Order 02 | Ordered | 2016-05-20   |
Here is my idea, but it seems to be the wrong way:
var outer = PredicateBuilder.True<Order>();
var orders = _entities.Orders
    .GroupBy(x => x.OrderId)
    .Select(x => new { x.Key, Created = x.Max(g => g.Created) })
    .ToArray();

var predicateStatuses = PredicateBuilder.False<Status>();
foreach (var item in orders)
{
    predicateStatuses = predicateStatuses.Or(x => x.OrderId == item.Key && x.Created == item.Created);
}

var predicateOrders = PredicateBuilder.False<JobOrder>();
// I don't know how to pass an expression of a different object type (Order vs Status) here,
// or whether I have to write an extension method or something.
predicateOrders = predicateOrders.Or(predicateStatuses);
outer = outer.And(predicateOrders);
Please suggest how to solve this dynamic LINQ expression in this case.
Thanks in advance.
There's nothing dynamic about your query, at least, it doesn't need to be. You can express it as a regular query.
var query =
    from o in db.Orders
    join s in db.Statuses on o.Id equals s.OrderId
    where s.Note == "Ordered"
    orderby s.Created descending
    group new { o.Name, s.Note, LastCreated = s.Created } by o.Id into g
    select g.First();
P.S. your models don't seem to match the data at all, so I'm ignoring that. Adjust as necessary.
Thanks so much for Jeff Mercado's answer. I customized it to solve my problem as below:
var predicateStatuses = PredicateBuilder.False<Order>();
predicateStatuses = predicateStatuses.Or(p => (
    from j in db.Statuses
    where j.OrderId == p.ID
    group j by j.OrderId into g
    select g.OrderByDescending(t => t.Created)
            .FirstOrDefault()
    ).FirstOrDefault().Note == "Ordered"
);