Need help to append dataframe in for loop in pyspark - pyspark

We have a list of conditions that needs to be applied to a query in the WHERE clause:
Conditions = [
    Condition-1
    Condition-2
    ...
    Condition-n
]
and we have a query like:
for condition in Conditions:
    df = spark.sql("SELECT col1, col2 from table where " + condition)
But we want one final dataframe with the results of all the conditions. How can we do that? Our requirement is something like that.

If your conditions are meant to be combined with AND, you can simply join them into a single WHERE clause.
Example:
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [
        {"a": 1, "b": 2, "c": 3},
        {"a": 3, "b": 3, "c": 7},
        {"a": 2, "b": 3, "c": 5},
    ]
    conditions = [
        "a > 2",
        "b < 4",
        "c > 5",
    ]
    df = spark.createDataFrame(data)
    df.createOrReplaceTempView("table")
    df = spark.sql("SELECT a, b from table where {}".format(" AND ".join(conditions)))
    df.show()
Result:
+---+---+
| a| b|
+---+---+
| 3| 3|
+---+---+
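If instead every condition should contribute its own rows to one final DataFrame (the OR-style result the question's loop seems to be after), you can union the per-condition results. A minimal sketch, reusing the table view and conditions list from the example above:

from functools import reduce
from pyspark.sql import DataFrame

# One DataFrame per condition, then union them into a single result.
# distinct() removes rows that satisfy more than one condition.
dfs = [
    spark.sql("SELECT a, b FROM table WHERE {}".format(condition))
    for condition in conditions
]
final_df = reduce(DataFrame.unionByName, dfs).distinct()
final_df.show()

For this particular case, joining the conditions with " OR " in a single query would return the same rows without the loop.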

Related

pyspark select/filter from a dataframe multiple conditions

I have two dataframes
d1:
item  c1  c2  value
A     1   2   3
B     1   2   4
A     1   3   5
B     1   3   8
d2:
c1  c2  value
1   3   8
1   2   4
I want to use some function to get this:
item  c1  c2  value
B     1   2   4
B     1   3   8
That is, check whether d1.c1 = d2.c1, d1.c2 = d2.c2 and d1.value = d2.value, keep those rows and drop the others.
Given your comment, one way to solve this without a join is to use a window function: partition by c1 and c2, order by value descending, apply row_number, and keep the first row in each partition to get the row with the maximum value for each (c1, c2) pair.
If you expect multiple rows to contain the maximum value and want to select all of them, use rank instead of row_number.
Working Example (row_number)
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import desc
df = spark.createDataFrame([{"item": "A", "c1": 1, "c2": 2, "value": 3}, {"item": "B", "c1": 1, "c2": 2, "value": 4}, {"item": "A", "c1": 1, "c2": 3, "value": 5}, {"item": "B", "c1": 1, "c2": 3, "value": 8}])
window_spec = Window.partitionBy("c1", "c2").orderBy(desc("value"))
df.withColumn("row_number", row_number().over(window_spec)).where("row_number = 1").drop("row_number").show()
Output
+---+---+----+-----+
| c1| c2|item|value|
+---+---+----+-----+
| 1| 2| B| 4|
| 1| 3| B| 8|
+---+---+----+-----+
Working Example (rank)
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
from pyspark.sql.functions import desc
df = spark.createDataFrame([{"item": "A", "c1": 1, "c2": 2, "value": 4}, {"item": "B", "c1": 1, "c2": 2, "value": 4}, {"item": "A", "c1": 1, "c2": 3, "value": 5}, {"item": "B", "c1": 1, "c2": 3, "value": 8}])
window_spec = Window.partitionBy("c1", "c2").orderBy(desc("value"))
df.withColumn("rank", rank().over(window_spec)).where("rank = 1").drop("rank").show()
Output
+---+---+----+-----+
| c1| c2|item|value|
+---+---+----+-----+
| 1| 2| A| 4|
| 1| 2| B| 4|
| 1| 3| B| 8|
+---+---+----+-----+
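For reference, the most direct way to get exactly what the question asks for (keep only the d1 rows whose c1, c2 and value all match a row in d2) is a left semi join; a minimal sketch, assuming d1 and d2 are already DataFrames:

# Keep only d1 rows whose (c1, c2, value) combination also appears in d2.
result = d1.join(d2, on=["c1", "c2", "value"], how="left_semi")
result.show()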

How to change the df column name in a struct to a column value

df.withColumn("storeInfo", struct($"store", struct($"inhand", $"storeQuantity")))
.groupBy("sku").agg(collect_list("storeInfo").as("info"))
.show(false)
+---+---------------------------------------------------+
|sku|info |
+---+---------------------------------------------------+
|1 |[{2222, {3, 34}}, {3333, {5, 45}}] |
|2 |[{4444, {5, 56}}, {5555, {6, 67}}, {6666, {7, 67}}]|
+---+---------------------------------------------------+
When I am sending it to Couchbase, I get:
{
  "SKU": "1",
  "info": [
    {
      "col2": {
        "inhand": "3",
        "storeQuantity": "34"
      },
      "Store": "2222"
    },
    {
      "col2": {
        "inhand": "5",
        "storeQuantity": "45"
      },
      "Store": "3333"
    }
  ]
}
Can we rename "col2" to the value of "Store"? I want it to look something like the example below, so that the key of every struct is the store value.
{
  "SKU": "1",
  "info": [
    {
      "2222": {
        "inhand": "3",
        "storeQuantity": "34"
      },
      "Store": "2222"
    },
    {
      "3333": {
        "inhand": "5",
        "storeQuantity": "45"
      },
      "Store": "3333"
    }
  ]
}
Simply put, we can't construct a column exactly as you want. There are two limitations:
The field names of a struct type must be fixed. We can change 'col2' to another name (e.g. 'fixedFieldName' in demo 1), but it can't be dynamic (similar to a Java class field name).
The keys of a map type can be dynamic, but all values of a map must have the same type; see the exception in demo 2.
Maybe you should change the schema; see the outputs of demos 1 and 3.
demo 1
df.withColumn(
    "storeInfo", struct($"store", struct($"inhand", $"storeQuantity").as("fixedFieldName"))).
  groupBy("sku").agg(collect_list("storeInfo").as("info")).
  toJSON.show(false)
// output:
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|value |
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"sku":1,"info":[{"store":2222,"fixedFieldName":{"inhand":3,"storeQuantity":34}},{"store":3333,"fixedFieldName":{"inhand":5,"storeQuantity":45}}]} |
//|{"sku":2,"info":[{"store":4444,"fixedFieldName":{"inhand":5,"storeQuantity":56}},{"store":5555,"fixedFieldName":{"inhand":6,"storeQuantity":67}},{"store":6666,"fixedFieldName":{"inhand":7,"storeQuantity":67}}]}|
//+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
demo 2
df.withColumn(
    "storeInfo",
    map($"store", struct($"inhand", $"storeQuantity"), lit("Store"), $"store")).
  groupBy("sku").agg(collect_list("storeInfo").as("info")).
  toJSON.show(false)
// output exception:
// The given values of function map should all be the same type, but they are [struct<inhand:int,storeQuantity:int>, int]
demo 3
df.withColumn(
    "storeInfo",
    map($"store", struct($"inhand", $"storeQuantity"))).
  groupBy("sku").agg(collect_list("storeInfo").as("info")).
  toJSON.show(false)
//+---------------------------------------------------------------------------------------------------------------------------------------------+
//|value |
//+---------------------------------------------------------------------------------------------------------------------------------------------+
//|{"sku":1,"info":[{"2222":{"inhand":3,"storeQuantity":34}},{"3333":{"inhand":5,"storeQuantity":45}}]} |
//|{"sku":2,"info":[{"4444":{"inhand":5,"storeQuantity":56}},{"5555":{"inhand":6,"storeQuantity":67}},{"6666":{"inhand":7,"storeQuantity":67}}]}|
//+---------------------------------------------------------------------------------------------------------------------------------------------+
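For readers working in PySpark rather than Scala, a rough equivalent of demo 3 (keying each entry by the store value through a map column) could look like the sketch below; it assumes a DataFrame df with the same sku, store, inhand and storeQuantity columns as above.

from pyspark.sql import functions as F

# Build a map keyed by the store value so the JSON key becomes dynamic,
# then collect one map per store into an array per sku.
out = (
    df.withColumn("storeInfo", F.create_map(F.col("store"), F.struct("inhand", "storeQuantity")))
      .groupBy("sku")
      .agg(F.collect_list("storeInfo").alias("info"))
)
for js in out.toJSON().collect():
    print(js)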

Scala/Spark: flattening multiple JSON in an RDD using Scala Spark, but getting invalid data

My code (Scala preferred) for flattening multiple JSON objects:
val data = sc.textFile("/user/cloudera/spark/sample.json")
val nospace = data.map(x => x.trim())
val nospaces = nospace.filter(x => x != "")
val local = nospaces.collect
var vline = ""
var eline: List[String] = List()
var lcnt = 0
var rcnt = 0
local.map { x =>
  vline += x
  if (x == "[") lcnt += 1   // count opening brackets
  if (x == "]") rcnt += 1   // count closing brackets
  if (lcnt == rcnt) {       // a complete top-level JSON array has been accumulated
    eline ++= List(vline)
    lcnt = 0
    rcnt = 0
    vline = ""
  }
}
My input (a multi-object JSON file):
[
  {
    "Year": "2013",
    "First Name": "JANE",
    "County": "A",
    "Sex": "F",
    "Count": "27"
  },{
    "Year": "2013",
    "First Name": "JADE",
    "County": "B",
    "Sex": "M",
    "Count": "26"
  },{
    "Year": "2013",
    "First Name": "JAMES",
    "County": "C",
    "Sex": "M",
    "Count": "21"
  }
]
Input JSON taken:
root#ubuntu:/home/sathya/Desktop/stackoverflo/data# cat /home/sathya/Desktop/stackoverflo/data/sample.json
[
  {
    "Year": "2013",
    "First Name": "JANE",
    "County": "A",
    "Sex": "F",
    "Count": "27"
  },{
    "Year": "2013",
    "First Name": "JADE",
    "County": "B",
    "Sex": "M",
    "Count": "26"
  },{
    "Year": "2013",
    "First Name": "JAMES",
    "County": "C",
    "Sex": "M",
    "Count": "21"
  }
]
Code to read the JSON and flatten it into DataFrame columns:
spark.read.option("multiline","true").json("file:////home/sathya/Desktop/stackoverflo/data/sample.json").show()
+-----+------+----------+---+----+
|Count|County|First Name|Sex|Year|
+-----+------+----------+---+----+
| 27| A| JANE| F|2013|
| 26| B| JADE| M|2013|
| 21| C| JAMES| M|2013|
+-----+------+----------+---+----+
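If you then need to work with the column whose name contains a space, backticks let you reference it in SQL expressions; a small PySpark sketch, assuming the same file path as above:

df = spark.read.option("multiline", "true").json(
    "file:////home/sathya/Desktop/stackoverflo/data/sample.json")
# Column names containing spaces can be referenced with backticks in SQL expressions.
df.selectExpr("`First Name` as first_name", "Year", "Sex").show()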

Spark - nested columns merge

I'm working with Scala and Spark 2.1.1. This is a part of schema of my DataFrame:
root
|-- id: string (nullable = true)
|-- op: string (nullable = true)
|-- before: struct (containsNull = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
|-- after: struct (containsNull = true)
| |-- a: string (nullable = true)
| |-- c: string (nullable = true)
It describes operations on an external database. In case of update operation the dataframe contains both 'before' and 'after' columns representing row state before and after update operation. 'Before' always contains full row, while 'after' only the updated fields. I need to merge them into a column (and DataFrame eventually) containing the full row after update (simply by taking 'before' and changing values of its fields to the ones taken from 'after', if present). I tried different ways to achieve that (mostly by creating a new column by performing UDF on 'before' and 'after'), but I couldn't accomplish it.
An example with 3 rows (I'll use JSON notation for convenience):
{... "before": {"a": "1", "b": "2", "c": "test", "d": true}, "after": {"b": "3"} ...}
{... "before": {"a": "2", "b": null, "c": "test2", "d": false}, "after": {"c": "test4", "d": true} ...}
{... "before": {"other": "4", "other2": "5"}, "after": {"other": "5"} ...}
What I need:
{... "fullAfter": {"a": "1", "b": "3", "c": "test", "d": true} ...}
{... "fullAfter": {"a": "2", "b": null, "c": "test4", "d": true} ...}
{... "fullAfter": {"other": "5", "other2": "5"} ...}
The problem is the DataFrame contains operations from different tables, so 'before' and 'after' may have different schemas in each row.
I tried doing some operations in a UDF by converting 'before' and 'after' to JSON (to_json) and creating a new JSON based on them. Unfortunately the to_json method drops fields with null values, so I'm not able to create the full row without the full, original schema:
{... "before": {"a": "2", "b": null, "c": "test2", "d": false}, "after": {"c": "test4", "d": true} ...}
{... "fullAfter": {"a": "2", "c": "test4", "d": true} ...} - "b" is missing
Is there any working, simple/efficient way to do it?
I would suggest using Spark SQL to solve the problem.
Let's call your table table_with_updates(id, op, before[a,b,c,d], after[a,c]):
select id, op, struct(
coalesce(after.a,before.a) as a,
after.b as b,
coalesce(after.c,before.c) as c,
before.d as d) as fullAfter
from table_with_updates
Obviously you can use coalesce for b and d too, but I assumed they are missing from your after schema. coalesce simply takes the first non-null value.
With a UDF you have the issue that it must be typed: you would need case class Before(a: String, b: String, c: String, d: String) and an After equivalent, which would not work well with null values and would require lots of logic and coding. Also, even Scala UDFs are much slower than Spark SQL functions.
If you need it to be more dynamic, what I usually do is write some code to generate the SQL from the column names. Scala is great at this:
val colsAfter = spark.table("table_with_updates").select($"after.*").columns
val colsBefore = spark.table("table_with_updates").select($"before.*").columns
  .map(c => if (colsAfter.contains(c)) s"coalesce(after.$c, before.$c) as $c" else s"before.$c as $c")
val structFields = colsBefore.mkString(", ")
spark.sql(s"select id, op, struct($structFields) as fullAfter from table_with_updates")

Querying JSON in PostgreSQL

CREATE TABLE tableTestJSON (
  id serial primary key,
  data jsonb
);
INSERT INTO tableTestJSON (data) VALUES
  ('{}'),
  ('{"a": 1}'),
  ('{"a": 2, "b": ["c", "d"]}'),
  ('{"a": 1, "b": {"c": "d", "e": true}}'),
  ('{"b": 2}');
I can select the values. There is no problem with this:
SELECT * FROM tableTestJSON;
I can test that two JSON objects are identical with this query:
SELECT * FROM tableTestJSON WHERE data = '{"a":1}';
This query's output is:
 id | data
----+----------
  2 | {"a": 1}
(1 row)
But I have a problem. Let's say I have a column:
{a: 30}
{a: 40}
{a: 50}
In this case, how can I query for all the elements containing a = 30 or a = 40? I was not able to find any solution for 'or', e.g.
select * from table where a in (10, 20); -- ??
How can I query on such a condition?
Extract the value of a JSON field as text using the ->> operator and cast it:
select *
from tabletestjson
where (data->>'a')::int in (1, 2)
id | data
----+--------------------------------------
2 | {"a": 1}
3 | {"a": 2, "b": ["c", "d"]}
4 | {"a": 1, "b": {"c": "d", "e": true}}
(3 rows)