PySpark: convert df to array of objects

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

simpleData = [
    ("James", "Sales", "NY", 90000, 34, 10000),
    ("Michael", "Sales", "NY", 86000, 56, 20000),
    ("Robert", "Sales", "CA", 81000, 30, 23000),
    ("Maria", "Finance", "CA", 90000, 24, 23000),
    ("Raman", "Finance", "CA", 99000, 40, 24000),
    ("Scott", "Finance", "NY", 83000, 36, 19000),
    ("Jen", "Finance", "NY", 79000, 53, 15000),
    ("Jeff", "Marketing", "CA", 80000, 25, 18000),
    ("Kumar", "Marketing", "NY", 91000, 50, 21000),
]
schema = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=simpleData, schema=schema)

data = df.groupBy("department").count() \
    .select(col("department").alias("name"), col("count").alias("value")) \
    .toJSON().collect()
print(data)
spark.stop()
When I run the code, it gives an array of JSON strings:
[
'{"name":"Sales","value":3}',
'{"name":"Finance","value":4}',
'{"name":"Marketing","value":2}'
]
but I don't want an array of strings; I want an array of objects to send to the frontend:
[
{"name":"Sales","value":3},
{"name":"Finance","value":4},
{"name":"Marketing","value":2}
]
Can anyone help me?

Remove the toJSON() call and use the following list comprehension instead:
[d.asDict() for d in data]
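For reference, a minimal sketch of the whole pipeline without toJSON(), reusing the df from the question: collect() returns Row objects, and asDict() turns each Row into a plain Python dict you can serialize for the frontend.
from pyspark.sql.functions import col

# collect() without toJSON() returns a list of pyspark.sql.Row objects
rows = df.groupBy("department").count() \
    .select(col("department").alias("name"), col("count").alias("value")) \
    .collect()

data = [row.asDict() for row in rows]  # list of plain dicts, e.g. {'name': 'Sales', 'value': 3}
print(data)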

You need to deserialize the json on the driver side:
import json
deserialized_data = [json.loads(s) for s in data]
print(deserialized_data)
print(deserialized_data[0]["name"])
If you add the above lines after the collect, then deserialized_data will be a list of dictionary objects.
Output of the two prints with your provided data:
[{'name': 'Finance', 'value': 4}, {'name': 'Sales', 'value': 3}, {'name': 'Marketing', 'value': 2}]
Finance

Related

Spark (Scala) filter array of structs without explode

I have a dataframe with a key column and a column containing an array of structs. Each row's array column looks something like this:
[
{"id" : 1, "someProperty" : "xxx", "someOtherProperty" : "1", "propertyToFilterOn" : 1},
{"id" : 2, "someProperty" : "yyy", "someOtherProperty" : "223", "propertyToFilterOn" : 0},
{"id" : 3, "someProperty" : "zzz", "someOtherProperty" : "345", "propertyToFilterOn" : 1}
]
Now I would like to do two things:
1. Filter on "propertyToFilterOn" = 1
2. Apply some logic to the other properties - for example, concatenate them
So that the result is:
[
{"id" : 1, "newProperty" : "xxx_1"},
{"id" : 3, "newProperty" : "zzz_345"}
]
I know how to do it with explode, but explode also requires a groupBy on the key when putting the array back together. And since this is a streaming DataFrame, I would also have to put a watermark on it, which I am trying to avoid.
Is there any other way to achieve this without using explode? I am sure there is some Scala magic that can achieve this!
Thanks!
With Spark 2.4+ came many higher-order functions for arrays (see https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html).
val dataframe = Seq(
  ("a", 1, "xxx", "1", 1),
  ("a", 2, "yyy", "223", 0),
  ("a", 3, "zzz", "345", 1)
).toDF("grouping_key", "id", "someProperty", "someOtherProperty", "propertyToFilterOn")
  .groupBy("grouping_key")
  .agg(collect_list(struct("id", "someProperty", "someOtherProperty", "propertyToFilterOn")).as("your_array"))

dataframe.select("your_array").show(false)
+----------------------------------------------------+
|your_array |
+----------------------------------------------------+
|[[1, xxx, 1, 1], [2, yyy, 223, 0], [3, zzz, 345, 1]]|
+----------------------------------------------------+
You can filter elements within an array using the array filter higher order function like this:
val filteredDataframe = dataframe.select(expr("filter(your_array, your_struct -> your_struct.propertyToFilterOn == 1)").as("filtered_arrays"))
filteredDataframe.show(false)
+----------------------------------+
|filtered_arrays |
+----------------------------------+
|[[1, xxx, 1, 1], [3, zzz, 345, 1]]|
+----------------------------------+
for the "other logic" your talking about you should be able to use the transform higher order array function like so:
val transformedDataframe = filteredDataframe
  .select(expr("transform(filtered_arrays, your_struct -> struct(concat(your_struct.someProperty, '_', your_struct.someOtherProperty)))"))
but there are issues with returning structs from the transform function as described in this post:
http://mail-archives.apache.org/mod_mbox/spark-user/201811.mbox/%3CCALZs8eBgWqntiPGU8N=ENW2Qvu8XJMhnViKy-225ktW+_c0czA#mail.gmail.com%3E
so you are best off using the Dataset API for the transform, like so:
case class YourStruct(id: Int, someProperty: String, someOtherProperty: String)
case class YourArray(filtered_arrays: Seq[YourStruct])
case class YourNewStruct(id: Int, newProperty: String)
val transformedDataset = filteredDataframe.as[YourArray]
  .map(_.filtered_arrays.map(ys => YourNewStruct(ys.id, ys.someProperty + "_" + ys.someOtherProperty)))
transformedDataset.show(false)
+--------------------------+
|value |
+--------------------------+
|[[1, xxx_1], [3, zzz_345]]|
+--------------------------+

spark: how to merge rows to array of jsons

Input:
id1 id2 name value epid
"xxx" "yyy" "EAN" "5057723043" "1299"
"xxx" "yyy" "MPN" "EVBD" "1299"
I want:
{ "id1": "xxx",
"id2": "yyy",
"item_specifics": [
{
"name": "EAN",
"value": "5057723043"
},
{
"name": "MPN",
"value": "EVBD"
},
{
"name": "EPID",
"value": "1299"
}
]
}
I tried the following two solutions, from "How to aggregate columns into json array?" and "how to merge rows into column of spark dataframe as vaild json to write it in mysql":
pi_df.groupBy(col("id1"), col("id2"))
//.agg(collect_list(to_json(struct(col("name"), col("value"))).alias("item_specifics"))) // => not working
.agg(collect_list(struct(col("name"),col("value"))).alias("item_specifics"))
But I got:
{ "name":"EAN","value":"5057723043", "EPID": "1299", "id1": "xxx", "id2": "yyy" }
How to fix this? Thanks
For Spark < 2.4
You can create two dataframes, one with name and value and the other with "EPID" as the name and the epid value as the value, and union them together. Then aggregate with collect_set and create a JSON. The code should look like this:
//Creating Test Data
val df = Seq(("xxx","yyy" ,"EAN" ,"5057723043","1299"), ("xxx","yyy" ,"MPN" ,"EVBD", "1299") )
.toDF("id1", "id2", "name", "value", "epid")
df.show(false)
+---+---+----+----------+----+
|id1|id2|name|value |epid|
+---+---+----+----------+----+
|xxx|yyy|EAN |5057723043|1299|
|xxx|yyy|MPN |EVBD |1299|
+---+---+----+----------+----+
val df1 = df.withColumn("map", struct(col("name"), col("value")))
.select("id1", "id2", "map")
val df2 = df.withColumn("map", struct(lit("EPID").as("name"), col("epid").as("value")))
.select("id1", "id2", "map")
val jsonDF = df1.union(df2).groupBy("id1", "id2")
.agg(collect_set("map").as("item_specifics"))
.withColumn("json", to_json(struct("id1", "id2", "item_specifics")))
jsonDF.select("json").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------------+
|json |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|{"id1":"xxx","id2":"yyy","item_specifics":[{"name":"MPN","value":"EVBD"},{"name":"EAN","value":"5057723043"},{"name":"EPID","value":"1299"}]}|
+---------------------------------------------------------------------------------------------------------------------------------------------+
For Spark >= 2.4
It provides an array_union method. It might be helpful for doing this without the union. I haven't tried it, though.
val jsonDF = df.withColumn("map1", struct(col("name"), col("value")))
  .withColumn("map2", struct(lit("epid").as("name"), col("epid").as("value")))
  .groupBy("id1", "id2")
  .agg(collect_set("map1").as("item_specifics1"),
       collect_set("map2").as("item_specifics2"))
  .withColumn("item_specifics", array_union(col("item_specifics1"), col("item_specifics2")))
  .withColumn("json", to_json(struct("id1", "id2", "item_specifics")))
You're pretty close. I believe you're looking for something like this:
val pi_df2 = pi_df.withColumn("name", lit("EPID")).
withColumnRenamed("epid", "value").
select("id1", "id2", "name","value")
pi_df.select("id1", "id2", "name","value").
union(pi_df2).withColumn("item_specific", struct(col("name"), col("value"))).
groupBy(col("id1"), col("id2")).
agg(collect_list(col("item_specific")).alias("item_specifics")).
write.json(...)
The union should bring epid back into item_specifics.
Here is what you need to do
import scala.util.parsing.json.JSONObject
import scala.collection.mutable.WrappedArray
//Define udf
val jsonFun = udf((id1 : String, id2 : String, item_specifics: WrappedArray[Map[String, String]], epid: String)=> {
//Add epid to item_specifics json
val item_withEPID = item_specifics :+ Map("epid" -> epid)
val item_specificsArray = item_withEPID.map(m => ( Array(Map("name" -> m.keys.toSeq(0), "value" -> m.values.toSeq(0))))).map(m => m.map( mi => JSONObject(mi).toString().replace("\\",""))).flatten.mkString("[",",","]")
//Add id1 and id2 to output json
val m = Map("id1"-> id1, "id2"-> id2, "item_specifics" -> item_specificsArray.toSeq )
JSONObject(m).toString().replace("\\","")
})
val pi_df = Seq( ("xxx","yyy","EAN","5057723043","1299"), ("xxx","yyy","MPN","EVBD","1299")).toDF("id1","id2","name","value","epid")
//Add epid as part of group by column else the column will not be available after group by and aggregation
val df = pi_df.groupBy(col("id1"), col("id2"), col("epid")).agg(collect_list(map(col("name"), col("value")) as "map").as("item_specifics")).withColumn("item_specifics",jsonFun($"id1",$"id2",$"item_specifics",$"epid"))
df.show(false)
+---+---+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id1|id2|epid|item_specifics |
+---+---+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|xxx|yyy|1299|{"id1" : "xxx", "id2" : "yyy", "item_specifics" : [{"name" : "MPN", "value" : "EVBD"},{"name" : "EAN", "value" : "5057723043"},{"name" : "epid", "value" : "1299"}]}|
+---+---+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Content of the item_specifics column (output):
{
"id1": "xxx",
"id2": "yyy",
"item_specifics": [{
"name": "MPN",
"value": "EVBD"
}, {
"name": "EAN",
"value": "5057723043"
}, {
"name": "epid",
"value": "1299"
}]
}

reactivex repeated skip between

Suppose I have the following stream of data:
1, 2, 3, a, 5, 6, b, 7, 8, a, 10, 11, b, 12, 13, ...
I want to filter out everything between 'a' and 'b' (inclusive), no matter how many times they appear. So the result of the above would be:
1, 2, 3, 7, 8, 12, 13, ...
How can I do this with ReactiveX?
Use scan with initial value b to turn
1, 2, 3, a, 5, 6, b, 7, 8, a, 10, 11, b, 12, 13, ...
into
b, 1, 2, 3, a, a, a, b, 7, 8, a, a, a, b, 12, 13, ...
and then filter out a and b to get
1, 2, 3, 7, 8, 12, 13, ...
In pseudo code
values.scan('b', (s, v) -> if (v == 'a' || v == 'b' || s != 'a') v else s).
filter(v -> v != 'a' && v != 'b');
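As a concrete illustration of that scan-and-filter idea (not tied to any particular Rx flavor used in this thread), here is a minimal sketch in Python with RxPY, assuming RxPY 3.x and string values for simplicity:
import rx
from rx import operators as ops

values = rx.of("1", "2", "3", "a", "5", "6", "b",
               "7", "8", "a", "10", "11", "b", "12", "13")

# scan carries the last sigil forward: once an 'a' is seen, keep emitting 'a'
# until a 'b' arrives, then let values pass through again.
result = values.pipe(
    ops.scan(lambda s, v: v if v in ("a", "b") or s != "a" else s, seed="b"),
    ops.filter(lambda v: v not in ("a", "b")),
)

result.subscribe(print)  # 1, 2, 3, 7, 8, 12, 13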
OK. I'm posting this in case anyone else needs an answer to it. A slightly different setup than I described above just to make it easier to understand.
List<String> values = new List<string>()
{
"1", "2", "3",
"a", "5", "6", "b",
"8", "9", "10", "11",
"a", "13", "14", "b",
"16", "17", "18", "19",
"a", "21", "22", "b",
"24"
};
var aa =
// Create an array of CSV strings split on the terminal sigil value
String.Join(",", values.ToArray())
.Split(new String[] { "b," }, StringSplitOptions.None)
// Create the observable from this array of CSV strings
.ToObservable()
// Now create an Observable from each element, splitting it up again
// It is no longer a CSV string but the original elements up to each terminal value
.Select(s => s.Split(',').ToObservable()
// From each value in each observable take those elements
// up to the initial sigil
.TakeWhile(s1 => !s1.Equals("a")))
// Concat the output of each individual Observable - in order
// SelectMany won't work here since it could interleave the
// output of the different Observables created above.
.Concat();
aa.Subscribe(s => Console.WriteLine(s));
This prints out:
1
2
3
8
9
10
11
16
17
18
19
24
It is a bit more convoluted than I wanted but it works.
Edit 6/3/17:
I actually found a cleaner solution for my case.
List<String> values = new List<string>()
{
"1", "2", "3",
"a", "5", "6", "b",
"8", "9", "10", "11",
"a", "13", "14", "b",
"16", "17", "18", "19",
"a", "21", "22", "b",
"24"
};
string lazyABPattern = @"a.*?b";
Regex abRegex = new Regex(lazyABPattern);
var bb = values.ToObservable()
.Aggregate((s1, s2) => s1 + "," + s2)
.Select(s => abRegex.Replace(s, ""))
.Select(s => s.Split(',').ToObservable())
.Concat();
bb.Subscribe(s => Console.WriteLine(s));
The code is simpler which makes it easier to follow (even though it uses regexes).
The problem here is that it still isn't really a general solution to the problem of removing 'repeated regions' of a data stream. It relies on converting the stream to a single string, operating on the string, then converting it back to some other form. If anyone has any ideas on how to approach this in a general way I would love to hear about it.

Get values from an RDD

I created an RDD with the following format using Scala:
Array[(String, (Array[String], Array[String]))]
How can I get the first of the two arrays from this RDD?
The data for the first line is:
// Array[(String, (Array[String], Array[String]))]
Array(
(
966515171418,
(
Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0),
Array(4579866236, 4579866226, 2015-07-29 04:16:22, 37, 1, 1, 966515171418, 966515183264, 420500052424347, 0, 3083, 9, 5072, 5084, 2, 1, 0, 0)
)
)
)
Assuming you have something like this (just paste into a spark-shell):
val a = Array(
("966515171418",
(Array("4579848447", "4579848453", "2015-07-29 03:27:28", "44", "1", "1", "966515171418", "966515183263", "420500052424347", "0", "52643", "9", "5067", "5084", "2", "1", "0", "0"),
Array("4579866236", "4579866226", "2015-07-29 04:16:22", "37", "1", "1", "966515171418", "966515183264", "420500052424347", "0", "3083", "9", "5072", "5084", "2", "1", "0", "0")))
)
val rdd = sc.makeRDD(a)
then you get the first array using
scala> rdd.first._2._1
res9: Array[String] = Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0)
which means the first row (which is a Tuple2), then the 2nd element of the tuple (which is again a Tuple2), then the 1st element.
Using pattern matching
scala> rdd.first match { case (_, (array1, _)) => array1 }
res30: Array[String] = Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0)
If you want to get it for all rows, just use map():
scala> rdd.map(_._2._1).collect()
which puts the results of all rows into an array.
Another option is to use pattern matching in map():
scala> rdd.map { case (_, (array1, _)) => array1 }.collect()

Remove an array element from an array of sub documents

{
  "_id" : ObjectId("5488303649f2012be0901e97"),
  "user_id" : 3,
  "my_shopping_list" : {
    "books" : [ ]
  },
  "my_library" : {
    "books" : [
      {
        "date_added" : ISODate("2014-12-10T12:03:04.062Z"),
        "tag_text" : [
          "english"
        ],
        "bdata_product_identifier" : "a1",
        "tag_id" : [
          "fa7ec571-4903-4aed-892a-011a8a411471"
        ]
      },
      {
        "date_added" : ISODate("2014-12-10T12:03:08.708Z"),
        "tag_text" : [
          "english",
          "hindi"
        ],
        "bdata_product_identifier" : "a2",
        "tag_id" : [
          "fa7ec571-4903-4aed-892a-011a8a411471",
          "60733993-6b54-420c-8bc6-e876c0e196d6"
        ]
      }
    ]
  },
  "my_wishlist" : {
    "books" : [ ]
  }
}
Here I would like to remove only "english" from every tag_text array of my_library, using only user_id and tag_text. This document belongs to user_id: 3. I have tried some queries, but they delete an entire book sub-document. Thank you.
Well, since you are using pymongo, and MongoDB doesn't provide a nice way of doing this (using the $ positional operator will only pull "english" from the first matching subdocument), why not write a script that removes "english" from every tag_text and then updates your document?
Demo:
>>> doc = yourcollection.find_one(
...     {'user_id': 3, "my_library.books": {"$exists": True}},
...     {"_id": 0, 'user_id': 0}
... )
>>> books = doc['my_library']['books'] #books field in your doc
>>> new_books = []
>>> for k in books:
...     for x, y in k.items():
...         if x == 'tag_text' and 'english' in y:
...             y.remove('english')
...         new_books.append({x: y})
...
>>> new_books
[{'tag_text': []}, {'tag_id': ['fa7ec571-4903-4aed-892a-011a8a411471']}, {'bdata_product_identifier': 'a1'}, {'date_added': datetime.datetime(2014, 12, 10, 12, 3, 4, 62000)}, {'tag_text': ['hindi']}, {'tag_id': ['fa7ec571-4903-4aed-892a-011a8a411471', '60733993-6b54-420c-8bc6-e876c0e196d6']}, {'bdata_product_identifier': 'a2'}, {'date_added': datetime.datetime(2014, 12, 10, 12, 3, 8, 708000)}]
>>> yourcollection.update({'user_id' : 3}, {"$set" : {'my_library.books' : new_books}})
Check that everything works fine:
>>> yourcollection.find_one({'user_id' : 3})
{'user_id': 3.0, '_id': ObjectId('5488303649f2012be0901e97'), 'my_library': {'books': [{'tag_text': []}, {'tag_id': ['fa7ec571-4903-4aed-892a-011a8a411471']}, {'bdata_product_identifier': 'a1'}, {'date_added': datetime.datetime(2014, 12, 10, 12, 3, 4, 62000)}, {'tag_text': ['hindi']}, {'tag_id': ['fa7ec571-4903-4aed-892a-011a8a411471', '60733993-6b54-420c-8bc6-e876c0e196d6']}, {'bdata_product_identifier': 'a2'}, {'date_added': datetime.datetime(2014, 12, 10, 12, 3, 8, 708000)}]}, 'my_shopping_list': {'books': []}, 'my_wishlist': {'books': []}}
One possible solution could be to repeat
db.collection.update({user_id: 3, "my_library.books.tag_text": "english"}, {$pull: {"my_library.books.$.tag_text": "english"}})
until MongoDB can no longer match a document to update.
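For illustration, here is a minimal pymongo sketch of that repeat-until-nothing-matches loop. The database and collection names are hypothetical; each update_one pulls "english" from the first matching book, and modified_count drops to 0 once there is nothing left to pull.
from pymongo import MongoClient

# Hypothetical connection details for this sketch
yourcollection = MongoClient()["yourdb"]["yourcollection"]

# Each pass pulls "english" from the first matching book's tag_text;
# keep going until no document is modified any more.
while True:
    result = yourcollection.update_one(
        {"user_id": 3, "my_library.books.tag_text": "english"},
        {"$pull": {"my_library.books.$.tag_text": "english"}},
    )
    if result.modified_count == 0:
        break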