Convert list to a dataframe column in PySpark

I have a dataframe in which one of the string-type columns contains a JSON list of items that I want to explode and make part of the parent dataframe. How can I do it?
Here is the code to create a sample dataframe:
from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    return Row(**OrderedDict(sorted(d.items())))

df = sc.parallelize([
    {"arg1": "first", "arg2": "John", "arg3": '[{"name" : "click", "datetime" : "1570103345039", "event" : "entry" }, {"name" : "drag", "datetime" : "1580133345039", "event" : "exit" }]'},
    {"arg1": "second", "arg2": "Joe", "arg3": '[{"name" : "click", "datetime" : "1670105345039", "event" : "entry" }, {"name" : "drop", "datetime" : "1750134345039", "event" : "exit" }]'},
    {"arg1": "third", "arg2": "Jane", "arg3": '[{"name" : "click", "datetime" : "1580105245039", "event" : "entry" }, {"name" : "drop", "datetime" : "1650134345039", "event" : "exit" }]'}
]).map(convert_to_row).toDF()
Running this code will create a dataframe as shown below:
+------+----+--------------------+
| arg1|arg2| arg3|
+------+----+--------------------+
| first|John|[{"name" : "click...|
|second| Joe|[{"name" : "click...|
| third|Jane|[{"name" : "click...|
+------+----+--------------------+
The arg3 column contains a JSON-encoded list that I want to explode into separate columns. I want the dataframe as follows:
arg1 | arg2 | arg3 | name | datetime | event
How can I achieve that?

You need to specify an array type as the schema in the from_json function: from_json parses each arg3 string into an array of structs, explode creates one row per array element, and selecting 'data.*' flattens the struct fields into top-level columns:
from pyspark.sql.functions import explode, from_json
schema = 'array<struct<name:string,datetime:string,event:string>>'
df.withColumn('data', explode(from_json('arg3', schema))) \
  .select(*df.columns, 'data.*') \
  .show()
+------+----+--------------------+-----+-------------+-----+
| arg1|arg2| arg3| name| datetime|event|
+------+----+--------------------+-----+-------------+-----+
| first|John|[{"name" : "click...|click|1570103345039|entry|
| first|John|[{"name" : "click...| drag|1580133345039| exit|
|second| Joe|[{"name" : "click...|click|1670105345039|entry|
|second| Joe|[{"name" : "click...| drop|1750134345039| exit|
| third|Jane|[{"name" : "click...|click|1580105245039|entry|
| third|Jane|[{"name" : "click...| drop|1650134345039| exit|
+------+----+--------------------+-----+-------------+-----+
Note: if your Spark version does not support the simpleString (DDL) format for schemas, try the following:
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

schema = ArrayType(
    StructType([
        StructField('name', StringType()),
        StructField('datetime', StringType()),
        StructField('event', StringType())
    ])
)
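Either schema definition works: this ArrayType object is a drop-in replacement for the DDL string passed to from_json above.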

Related

Join data-frame based on value in list of WrappedArray

I have to join two Spark dataframes in Scala based on a custom function. Both dataframes have the same schema.
Sample Row of data in DF1:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 50.0,
      "sf1" : "val_1",
      "sf2" : "val_2"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 100.0,
      "sf1" : "val_3",
      "sf2" : "val_4"
    }
  ]
}
Sample Row of data in DF2:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 80.0,
      "sf1" : "val_5",
      "sf2" : "val_6"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 90.0,
      "sf1" : "val_7",
      "sf2" : "val_8"
    },
    {
      "name" : "N3",
      "unit" : "none",
      "count" : 99.0,
      "sf1" : "val_9",
      "sf2" : "val_10"
    }
  ]
}
RESULT of Joining these sample rows:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 80.0,
      "sf1" : "val_5",
      "sf2" : "val_6"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 100.0,
      "sf1" : "val_3",
      "sf2" : "val_4"
    },
    {
      "name" : "N3",
      "unit" : "none",
      "count" : 99.0,
      "sf1" : "val_9",
      "sf2" : "val_10"
    }
  ]
}
The result is: a full outer join based on the values of "F1", "F2" and "F3", plus a merge of "F4" that keeps unique nodes (using "name" as the id), taking the node with the maximum value of "count" from each pair.
I am not very familiar with Scala and have been struggling with this for more than a day now. Here is what I have so far:
import org.apache.spark.sql.Row

val df1 = sqlContext.read.parquet("stack_a.parquet")
val df2 = sqlContext.read.parquet("stack_b.parquet")

val df4 = df1.toDF(df1.columns.map(_ + "_A"): _*)
val df5 = df2.toDF(df1.columns.map(_ + "_B"): _*)
val df6 = df4.join(df5, df4("F1_A") === df5("F1_B") && df4("F2_A") === df5("F2_B") && df4("F3_A") === df5("F3_B"), "outer")

def joinFunction(r: Row) = {
  //Need the real deal here!
  //print(r(3)) //--> Any = WrappedArray([..])
  //also considering parsing as JSON to do the processing, but not sure about the performance impact
  //val parsed = JSON.parseFull(r.json) //then play with parsed
  r.toSeq
}

val finalResult = df6.rdd.map(joinFunction)
finalResult.collect
I was planning to add the custom merge logic in joinFunction, but I am struggling to convert the WrappedArray/Any value into something I can work with.
Any input on how to do the conversion, or on a better way to do the join, would be very helpful.
Thanks!
Edit (7 Mar, 2021)
The full-outer join actually has to be performed only on "F1".
Hence, using @werner's answer, I am doing:
import org.apache.spark.sql.functions.{col, when}

val df1_a = df1.toDF(df1.columns.map(_ + "_A"): _*)
val df2_b = df2.toDF(df2.columns.map(_ + "_B"): _*)

val finalResult = df1_a.join(df2_b, df1_a("F1_A") === df2_b("F1_B"), "full_outer")
  .drop("F1_B")
  .withColumn("F4", joinFunction(col("F4_A"), col("F4_B")))
  .drop("F4_A", "F4_B")
  .withColumn("F2", when(col("F2_A").isNull, col("F2_B")).otherwise(col("F2_A")))
  .drop("F2_A", "F2_B")
  .withColumn("F3", when(col("F3_A").isNull, col("F3_B")).otherwise(col("F3_A")))
  .drop("F3_A", "F3_B")
But I am getting an error. What am I missing?
You can implement the merge logic with the help of a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import spark.implicits._

//case class to define the schema of the UDF's return value
case class F4(name: String, unit: String, count: Double, sf1: String, sf2: String)

val joinFunction = udf((a: Seq[Row], b: Seq[Row]) =>
  //guard against nulls: with a full outer join, a row that exists in only one
  //dataframe carries a null array on the other side
  (Option(a).getOrElse(Seq.empty) ++ Option(b).getOrElse(Seq.empty))
    .map(r => F4(r.getAs[String]("name"),
      r.getAs[String]("unit"),
      r.getAs[Double]("count"),
      r.getAs[String]("sf1"),
      r.getAs[String]("sf2")))
    //group the elements from both arrays by name
    .groupBy(_.name)
    //take the element with the max count from each group
    .map { case (_, d) => d.maxBy(_.count) }
    .toSeq)

//join the two dataframes
val finalResult = df1.withColumnRenamed("F4", "F4_A").join(
    df2.withColumnRenamed("F4", "F4_B"), Seq("F1", "F2", "F3"), "full_outer")
  //call the merge function
  .withColumn("F4", joinFunction('F4_A, 'F4_B))
  //drop the intermediate columns
  .drop("F4_A", "F4_B")
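Note: with a full_outer join, a row present in only one of the dataframes carries a null array on the other side. The Option(...).getOrElse(Seq.empty) guard inside the UDF above absorbs that case; without it the UDF throws a NullPointerException, which may well be the error mentioned in the edit to the question.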

Assign SQL schema to Spark DataFrame

I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.
This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?
create_table_sql = '''
CREATE TABLE public.example (
    id LONG,
    example VARCHAR(80)
)'''
spark.sql(create_table_sql)
schema = spark.table("public.example").schema

s3_data = spark.read \
    .option("delimiter", "|") \
    .csv(
        path="s3a://" + s3_bucket_path,
        schema=schema
    )
s3_data.write.mode("overwrite").saveAsTable('public.example')
Yes, there is a way to create a schema from a string, although I am not sure it really looks like SQL! You can use:
from pyspark.sql.types import _parse_datatype_string
_parse_datatype_string("id: long, example: string")
This will create the following schema:
StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))
It also works for complex schemas:
schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")
StructType(
    List(
        StructField(customers, ArrayType(
            StructType(
                List(
                    StructField(id,LongType,true),
                    StructField(name,StringType,true),
                    StructField(address,StringType,true)
                )
            ), true), true)
    )
)
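Note: since Spark 2.3, DataFrameReader.schema also accepts a DDL-formatted string directly, e.g. spark.read.schema("id long, example string").csv(...), so you can often skip the internal parser function entirely.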
Adding to what has already been said: making a schema (e.g. StructType-based or JSON) is more straightforward in Scala Spark than in PySpark:
> import org.apache.spark.sql.types.StructType
> val s = StructType.fromDDL("customers array<struct<id: long, name: string, address: string>>")
> s
res3: org.apache.spark.sql.types.StructType = StructType(StructField(customers,ArrayType(StructType(StructField(id,LongType,true),StructField(name,StringType,true),StructField(address,StringType,true)),true),true))
> s.prettyJson
res9: String =
{
  "type" : "struct",
  "fields" : [ {
    "name" : "customers",
    "type" : {
      "type" : "array",
      "elementType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "id",
          "type" : "long",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "name",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "address",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        } ]
      },
      "containsNull" : true
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
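The JSON form can also be parsed back into a schema object. A minimal sketch, assuming s is the StructType built above:

import org.apache.spark.sql.types.{DataType, StructType}

// round-trip: parse the prettyJson output back into a StructType
val restored = DataType.fromJson(s.prettyJson).asInstanceOf[StructType]
assert(restored == s)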

How to transform a nested MongoDB table into a Spark dataframe

I have a nested MongoDB table whose document structure looks like this:
{
  "_id" : "35228334dbd1090f6117c5a0011b56b0",
  "brasidas" : [
    {
      "key" : "buy",
      "value" : 859193
    }
  ],
  "crawl_time" : NumberLong(1526296211997),
  "date" : "2018-05-11",
  "id" : "44874f4c8c677087bcd5f829b2843e66",
  "initNumber" : 0,
  "repurchase" : 0,
  "source_url" : "http://query.sse.com.cn/commonQuery.do?jsonCallBack=jQuery11120015170331124618408_1526262411932&isPagination=true&sqlId=COMMON_SSE_SCSJ_CJGK_ZQZYSHG_JYSLMX_L&beginDate&endDate&securityCode&pageHelp.pageNo=1&pageHelp.beginPage=1&pageHelp.cacheSize=1&pageHelp.endPage=1&pageHelp.pageSize=25",
  "stockCode" : "600020",
  "stockName" : "ZYGS",
  "type" : "SSE"
}
I want to transform it into a Spark dataframe and extract the "key" and "value" fields of "brasidas" as separate columns, like this:
initNumber | repurchase | key | value  | stockName | type | date
50000      | 50000      | buy | 286698 | shgf      | SSE  | 2015/3/30
But there is a problem with the shape of "brasidas": it comes in three forms:
[{ "key" : "buy", "value" : 286698 }]
[{ "value" : 15311500, "key" : "buy_free" }, { "value" : 0, "key" : "buy_limited" }]
[{ "key" : ""buy_free" " }, { "key" : "buy_limited" }]
So when I use Scala to define a StructType, it does not fit every document; I can only keep "brasidas" as a single column and cannot split it by "key". This is what I get:
initNumber | repurchase | brasidas                              | stockName | type | date
50000      | 50000      | [{ "key" : "buy", "value" : 286698 }] | shgf      | SSE  | 2015/3/30
This is the code for reading the MongoDB documents:
val readpledge = ReadConfig(Map("uri" -> (mongouri_beehive + ".pledge")))
val pledge = getMongoDB.readCollection(sc, readpledge, "initNumber", "repurchase", "brasidas", "stockName", "type", "date")
  .selectExpr("cast(initNumber as int) initNumber", "cast(repurchase as int) repurchase", "brasidas", "stockName", "type", "date")
If you call df.printSchema(), you'll probably observe that brasidas has an ArrayType, most likely an array of maps. So I'd suggest implementing some sort of UDF that takes the array as a parameter and transforms it the way you need:
def arrayProcess(arr: Seq[AnyRef]): Seq[AnyRef] = ???
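Instead of a UDF, the flattening can also be sketched with built-ins. A hedged sketch, assuming the connector infers brasidas as array<struct<key:string, value:long>> (a nullable value field covers the documents where it is missing) and pledge is the dataframe from the question:

import org.apache.spark.sql.functions.{col, explode_outer}

// one output row per brasidas entry; entries without a "value" field yield null,
// and rows with an empty brasidas array are kept with null key/value
val flattened = pledge
  .withColumn("entry", explode_outer(col("brasidas")))
  .select(
    col("initNumber"), col("repurchase"),
    col("entry.key").as("key"), col("entry.value").as("value"),
    col("stockName"), col("type"), col("date"))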

Spark - after a withColumn("newCol", collect_list(...)) select rows with more than one element

I am working with a DataFrame created from this json:
{"id" : "1201", "name" : "satish", "age" : "25"},
{"id" : "1202", "name" : "krishna", "age" : "28"},
{"id" : "1203", "name" : "amith", "age" : "39"},
{"id" : "1204", "name" : "javed", "age" : "23"},
{"id" : "1205", "name" : "mendy", "age" : "25"},
{"id" : "1206", "name" : "rob", "age" : "24"},
{"id" : "1207", "name" : "prudvi", "age" : "23"}
Initially the Dataframe looks like this:
+---+----+-------+
|age| id| name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203| amith|
| 23|1204| javed|
| 25|1205| mendy|
| 24|1206| rob|
| 23|1207| prudvi|
+---+----+-------+
What I need is to group all students with the same age, ordering them depending on their id. This is how I'm approaching this so far:
*Note: I'm pretty sure there are more efficient ways than adding a new column using withColumn("newCol", ..) and then using select("newCol"), but I don't know how to solve it better
val conf = new SparkConf().setAppName("SimpleApp").set("spark.driver.allowMultipleContexts", "true").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.read.json("students.json")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id"))).select("newCol")
The output I am getting is this:
[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([24,1206,rob])]
[WrappedArray([23,1204,javed])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([28,1202,krishna])]
[WrappedArray([39,1203,amith])]
Now, how can I filter the rows that have more than one element? That is, I want my final dataframe to be:
[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
My best approach so far is:
val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id")))
val filterd = mergedDF.withColumn("count", count("age").over(Window.partitionBy("age"))).filter($"count" > 1).select("newCol")
But I must be missing something, because the result is not the expected one:
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([25,1201,satish])]
[WrappedArray([25,1201,satish], [25,1205,mendy])]
The stray rows appear because collect_list over a window with an orderBy uses a running frame that ends at the current row, so each row carries only the list accumulated so far, and filtering on a separate window count keeps those partial rows too. You can instead use size() to filter on the collected list itself:
import org.apache.spark.sql.functions.{col,size}
mergedDF.filter(size(col("newCol"))>1).show(false)
+---+----+------+-----------------------------------+
|age|id |name |newCol |
+---+----+------+-----------------------------------+
|23 |1207|prudvi|[[23,1204,javed], [23,1207,prudvi]]|
|25 |1205|mendy |[[25,1201,satish], [25,1205,mendy]]|
+---+----+------+-----------------------------------+
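As the note in the question suspects, the window-plus-select approach can also be replaced with a plain groupBy. A sketch against the same df, where sort_array stands in for the per-id ordering (structs sort by their fields in order, and age is constant within a group, so this orders by id):

import org.apache.spark.sql.functions.{col, collect_list, size, sort_array, struct}

// one row per age, keeping only groups with more than one student
val grouped = df.groupBy("age")
  .agg(sort_array(collect_list(struct("age", "id", "name"))).as("newCol"))
  .filter(size(col("newCol")) > 1)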

SCALA: Reading JSON file with the path provided

I have a scenario where I will be getting different JSON results from multiple APIs, and I need to read a specific value from each response.
For instance, my JSON response is as below. I need a format the user can provide by which I can read the value of lat; I don't want a hard-coded approach, so the user can supply the node to read in some other JSON or text file:
{
  "name" : "Watership Down",
  "location" : {
    "lat" : 51.235685,
    "long" : -1.309197
  },
  "residents" : [ {
    "name" : "Fiver",
    "age" : 4,
    "role" : null
  }, {
    "name" : "Bigwig",
    "age" : 6,
    "role" : "Owsla"
  } ]
}
You can get a key out of the JSON using the Scala JSON parser as below. I'm defining a function to get the lat, which you can make generic to your needs so that you only have to change the function.
import scala.util.parsing.json.JSON

val json =
  """
    |{
    |  "name" : "Watership Down",
    |  "location" : {
    |    "lat" : 51.235685,
    |    "long" : -1.309197
    |  },
    |  "residents" : [ {
    |    "name" : "Fiver",
    |    "age" : 4,
    |    "role" : null
    |  }, {
    |    "name" : "Bigwig",
    |    "age" : 6,
    |    "role" : "Owsla"
    |  } ]
    |}
  """.stripMargin

val jsonObject = JSON.parseFull(json).get.asInstanceOf[Map[String, Any]]

val latLambda: Map[String, Any] => Option[Double] = _.get("location")
  .map(_.asInstanceOf[Map[String, Any]]("lat").toString.toDouble)

assert(latLambda(jsonObject) == Some(51.235685))
The expanded version of the function:
val latitudeLambda = new Function1[Map[String, Any], Double] {
  override def apply(input: Map[String, Any]): Double = {
    input("location").asInstanceOf[Map[String, Any]]("lat").toString.toDouble
  }
}
Make the function generic so that, once you know what key you want from the JSON, you just change the function and apply it to the JSON.
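One hedged sketch of that generic version: a helper (hypothetical name extract) that walks a dot-separated path such as "location.lat" through the parsed map, so the node name can come from a config or text file as the question asks:

// walks a dot-separated path through the nested Maps produced by JSON.parseFull;
// returns None if any segment is missing or is not an object
def extract(json: Map[String, Any], path: String): Option[Any] =
  path.split('.').foldLeft(Option(json: Any)) {
    case (Some(m: Map[_, _]), key) => m.asInstanceOf[Map[String, Any]].get(key)
    case _ => None
  }

assert(extract(jsonObject, "location.lat").map(_.toString.toDouble) == Some(51.235685))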
Hope it helps. But there are nicer APIs out there, like the Play JSON library. You can simply use:
import play.api.libs.json._
val jsonVal = Json.parse(json)
val lat = (jsonVal \ "location" \ "lat").get
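The .get above returns a raw JsValue; if you want the number itself, Play can decode it directly. A small follow-up using the same jsonVal:

// .as[Double] converts the looked-up JsValue into a Double;
// it throws JsResultException if the path is missing or not numeric
val lat: Double = (jsonVal \ "location" \ "lat").as[Double]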