Convert Spark DataFrame into array of JSON objects - Scala

I have the following DataFrame as an example:
+--------------------------------------+------------+------------+------------------+
| user_id | city | user_name | facebook_id |
+--------------------------------------+------------+------------+------------------+
| 55c3c59d-0163-46a2-b495-bc352a8de883 | Toronto | username_x | 0123482174440907 |
| e2ddv22d-4132-c211-4425-9933aa8de454 | Washington | username_y | 0432982476780234 |
+--------------------------------------+------------+------------+------------------+
How can I convert it to an array of JSON objects like:
[{
"user_id": "55c3c59d-0163-46a2-b495-bc352a8de883",
"facebook_id": "0123482174440907"
},
{
"user_id": "e2ddv22d-4132-c211-4425-9933aa8de454",
"facebook_id": "0432982476780234"
}]

Assuming you have already loaded the given data into a DataFrame, you can use the toJSON function on it.
scala> sc.parallelize(Seq(("55c3c59d-0163-46a2-b495-bc352a8de883","Toronto","username_x","0123482174440907"))).toDF("user_id","city","user_name","facebook_id")
res2: org.apache.spark.sql.DataFrame = [user_id: string, city: string, user_name: string, facebook_id: string]

scala> res2.toJSON.take(1)
res3: Array[String] = Array({"user_id":"55c3c59d-0163-46a2-b495-bc352a8de883","city":"Toronto","user_name":"username_x","facebook_id":"0123482174440907"})
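If only user_id and facebook_id are wanted in the JSON, as in the expected output, a minimal sketch (reusing res2 from the snippet above) is to select those two columns first and then collect the JSON strings to the driver:

// Keep only the two requested columns, serialize each row to a JSON string,
// and bring the result back as an Array[String].
// Note: collect() pulls all rows to the driver, so this suits small results.
val jsonArray: Array[String] = res2
  .select("user_id", "facebook_id")
  .toJSON
  .collect()
// jsonArray(0): {"user_id":"55c3c59d-0163-46a2-b495-bc352a8de883","facebook_id":"0123482174440907"}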

Related

Spark dataframe map root key with elements of array of another column of string type

I am stuck on a problem where I have a DataFrame with 2 columns having this schema:
scala> df1.printSchema
root
|-- actions: string (nullable = true)
|-- event_id: string (nullable = true)
The actions column actually contains an array of objects, but its type is string, and hence I can't use explode here directly.
Sample data :
------------------------------------------------------------------------------------------------------------------
| event_id | actions |
------------------------------------------------------------------------------------------------------------------
| 1 | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}] |
------------------------------------------------------------------------------------------------------------------
There are some other keys present in each object of actions, but for simplicity I have taken 2 here.
I want to convert this to the format below.
OUTPUT:
---------------------------------------
| event_id | name | score |
---------------------------------------
| 1 | Vijay | 843 |
---------------------------------------
| 1 | Manish | 840 |
---------------------------------------
| 1 | Mayur | 930 |
---------------------------------------
How can I do this with a Spark DataFrame?
I have tried to read the actions column using
val df2 = spark.read.option("multiline", true).json(df1.rdd.map(row => row.getAs[String]("actions")))
but this way I am not able to map event_id to each line.
You can do this by using the from_json function.
This function takes 2 inputs:
- the column containing the JSON string
- the schema with which to parse the JSON string
That would look something like this:
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, explode, from_json}
// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")
df.show(false)
+--------+---------------------------------------------------------------------------------------------------+
|event_id|actions |
+--------+---------------------------------------------------------------------------------------------------+
|1 |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|
+--------+---------------------------------------------------------------------------------------------------+
// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
new StructType()
.add("name", StringType)
.add("score", IntegerType)
)
// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
.withColumn("parsedActions",explode(from_json(col("actions"), actionsSchema)))
.drop("actions")
.select("event_id", "parsedActions.*")
parsedDf.show(false)
+--------+------+-----+
|event_id| name|score|
+--------+------+-----+
| 1| Vijay| 843|
| 1|Manish| 840|
| 1| Mayur| 930|
+--------+------+-----+
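If the full set of keys inside actions is not known up front (the question notes there are more keys than name and score), one possible variant, sketched here assuming Spark 2.4+ and that the first row is representative of the structure, is to let Spark infer the schema from a sample string with schema_of_json:

import org.apache.spark.sql.functions.{col, explode, from_json, lit, schema_of_json}

// Take one actions value as a sample and let Spark derive the element schema from it,
// then parse, explode and flatten exactly as above.
val sample = df.select("actions").as[String].head()

val inferredDf = df
  .withColumn("parsedActions", explode(from_json(col("actions"), schema_of_json(lit(sample)))))
  .drop("actions")
  .select("event_id", "parsedActions.*")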
Hope this helps!

How to ignore empty/null fields when collecting to a list with PySpark

I have these two tables:
author_df
+---------+------+
|AUTHOR_ID| NAME |
+---------+------+
|      102| Camus|
|      103| Hugo |
+---------+------+

book_df
+---------+--------+----------+
|AUTHOR_ID| BOOK_ID| BOOK_NAME|
+---------+--------+----------+
|        1|   Camus| Etranger |
|        1|    Hugo| Mesirable|
+---------+--------+----------+
I did some joins and aggregations as follows:
result_df = author_df.join(book_df, 'AUTHOR_ID', 'left')\
.groupby('AUTHOR_ID', 'NAME')\
.agg(f.collect_list(f.struct('BOOK_ID', 'BOOK_NAME')).alias('BOOK_LIST'))
And I get a dataframe with this structure
root
|-- AUTHOR_ID: integer
|-- NAME: string
|-- BOOK_LIST: array
| |-- BOOK_ID: integer
| |-- BOOK_NAME: string
The problem is that when I write result_df to JSON, I get empty objects {} in the list, like:
{
...
"BOOK_LIST": [
{}
]
}
Even when I tried the ignoreNullFields option:
(result_df
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("ignoreNullFields", "true")  # this option
    .format("json")
    .save(path))
But I still get objects with null fields that I want to remove:
{
...
"BOOK_LIST": [
{
"BOOK_ID": null,
"BOOK_NAME": null
}
]
}
In the resulting DataFrame, result_df.show(truncate=False) gives:
+---------+---------------+
|         | BOOK_LIST     |
+---------+---------------+
|         | [{null, null}]|
+---------+---------------+
So how do I remove the objects with null values and get an empty array instead? Thanks
Simply add a condition with f.when().otherwise() to the collect_list; in our case:
.agg(
    f.collect_list(
        f.when(f.col("BOOK_ID").isNotNull(), f.struct('BOOK_ID', 'BOOK_NAME'))
         .otherwise(f.lit(None))
    ).alias("BOOK_LIST")
)
collect_list skips null elements, so authors with no book end up with an empty BOOK_LIST array instead of a list holding an empty struct.

How to use a dataframe inside a UDF and parse the data in Spark Scala

I am new to Scala and Spark. I have a requirement to create a new DataFrame by using a UDF.
I have 2 dataframes: df1 contains 3 columns, namely company, id, and name.
df2 contains 2 columns, namely company and message.
df2 JSON will be like this
{"company": "Honda", "message": ["19:[\"cost 500k\"],[\"colour blue\"]","20:[\"cost 600k\"],[\"colour white\"]"]}
{"company": "BMW", "message": ["19:[\"cost 1500k\"],[\"colour blue\"]"]}
df2 will be like this:
+-------+--------------------+
|company| message|
+-------+--------------------+
| Honda|[19:["cost 500k"]...|
| BMW|[19:["cost 1500k"...|
+-------+--------------------+
|-- company: string (nullable = true)
|-- message: array (nullable = true)
| |-- element: string (containsNull = true)
df1 will be like this:
+----------+---+-------+
|company | id| name|
+----------+---+-------+
| Honda | 19| city |
| Honda | 20| amaze |
| BMW | 19| x1 |
+----------+---+-------+
I want to create a new data frame by replacing the id in df2 with the name in df1.
["city:[\"cost 500k\"],[\"colour blue\"]","amaze:[\"cost 600k\"],[\"colour white\"]"]
I tried with a UDF by passing message as Seq[String] along with company, but I was not able to select the data from df1 inside the UDF.
I want the output like this:
+-------+----------------------+
|company| message |
+-------+----------------------+
| Honda|[city:["cost 500k"]...|
| BMW|[x1:["cost 1500k"... |
+-------+----------------------+
I tried using the following UDF but I was facing errors while selecting the name:
def asdf(categories: Seq[String]): String = {
  var data = ""
  for (w <- categories) {
    if (w != null) {
      var id = w.toString().indexOf(":")
      var namea = df1.select("name").where($"id" === 20).map(_.getString(0)).collect()
      var name = namea(0)
      println(name)
      var ids = w.toString().substring(0, id)
      var li = w.toString().replace(ids, name)
      println(li)
      data = data + li
    }
  }
  data
}
Please check the code below.
scala> df1.show(false)
+-------+---------------------------------------------------------------------+
|company|message |
+-------+---------------------------------------------------------------------+
|Honda |[19:["cost 500k"],["colour blue"], 20:["cost 600k"],["colour white"]]|
|BMW |[19:["cost 1500k"],["colour blue"]] |
+-------+---------------------------------------------------------------------+
scala> df2.show(false)
+-------+---+-----+
|company|id |name |
+-------+---+-----+
|Honda | 19|city |
|Honda | 20|amaze|
|BMW | 19|x1 |
+-------+---+-----+
val replaceFirst = udf((message: String,id:String,name:String) =>
if(message.contains(s"""${id}:""")) message.replaceFirst(s"""${id}:""",s"${name}:") else ""
)
val jdf =
df1
.withColumn("message",explode($"message"))
.join(df2,df1("company") === df2("company"),"inner")
.withColumn(
"message_data",
replaceFirst($"message",trim($"id"),$"name")
)
.filter($"message_data" =!= "")
scala> jdf.show(false)
+-------+---------------------------------+-------+---+-----+------------------------------------+
|company|message |company|id |name |message_data |
+-------+---------------------------------+-------+---+-----+------------------------------------+
|Honda |19:["cost 500k"],["colour blue"] |Honda | 19|city |city:["cost 500k"],["colour blue"] |
|Honda |20:["cost 600k"],["colour white"]|Honda | 20|amaze|amaze:["cost 600k"],["colour white"]|
|BMW |19:["cost 1500k"],["colour blue"]|BMW | 19|x1 |x1:["cost 1500k"],["colour blue"] |
+-------+---------------------------------+-------+---+-----+------------------------------------+
df1.join(df2, df1("company") === df2("company"), "inner")
  .select(df1("company"), df1("message"), df2("id"), df2("name"))
  .withColumn("message", explode($"message"))
  .withColumn("message", replaceFirst($"message", trim($"id"), $"name"))
  .filter($"message" =!= "")
  .groupBy($"company")
  .agg(collect_list($"message").cast("string").as("message"))
  .show(false)
+-------+--------------------------------------------------------------------------+
|company|message |
+-------+--------------------------------------------------------------------------+
|Honda |[amaze:["cost 600k"],["colour white"], city:["cost 500k"],["colour blue"]]|
|BMW |[x1:["cost 1500k"],["colour blue"]] |
+-------+--------------------------------------------------------------------------+
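If a UDF is to be avoided altogether, a rough alternative sketch (assuming Spark 3.0+, where regexp_replace also accepts Column arguments, and reusing the df1/df2 names from above) is:

import org.apache.spark.sql.functions.{collect_list, concat, explode, lit, regexp_replace, trim}

// Explode the messages, join on company, keep only the message/id pairs that
// actually match, then rewrite the leading "<id>:" prefix to "<name>:".
val noUdfDf = df1
  .withColumn("message", explode($"message"))
  .join(df2, Seq("company"), "inner")
  .filter($"message".startsWith(concat(trim($"id"), lit(":"))))
  .withColumn("message",
    regexp_replace($"message", concat(lit("^"), trim($"id"), lit(":")), concat($"name", lit(":"))))
  .groupBy($"company")
  .agg(collect_list($"message").as("message"))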

How to add a Seq[T] column to a Dataset that contains elements of two Datasets?

I have two Datasets AccountData and CustomerData, with the corresponding case classes:
case class AccountData(customerId: String, accountId: String, balance: Long)
+----------+---------+-------+
|customerId|accountId|balance|
+----------+---------+-------+
| IND0002| ACC0002| 200|
| IND0002| ACC0022| 300|
| IND0003| ACC0003| 400|
+----------+---------+-------+
case class CustomerData(customerId: String, forename: String, surname: String)
+----------+-----------+--------+
|customerId| forename| surname|
+----------+-----------+--------+
| IND0001|Christopher| Black|
| IND0002| Madeleine| Kerr|
| IND0003| Sarah| Skinner|
+----------+-----------+--------+
How do I derive the following Dataset, which adds an accounts column containing the Seq[AccountData] for each customerId?
+----------+-----------+-------+----------------------------------------------+
|customerId|forename   |surname|accounts                                      |
+----------+-----------+-------+----------------------------------------------+
|IND0001   |Christopher|Black  |[]                                            |
|IND0002   |Madeleine  |Kerr   |[[IND0002,ACC0002,200], [IND0002,ACC0022,300]]|
|IND0003   |Sarah      |Skinner|[[IND0003,ACC0003,400]]                       |
+----------+-----------+-------+----------------------------------------------+
I've tried:
val joinCustomerAndAccount = accountDS.joinWith(customerDS, customerDS("customerId") === accountDS("customerId")).drop(col("_2"))
which gives me the following Dataframe:
+---------------------+
|_1 |
+---------------------+
|[IND0002,ACC0002,200]|
|[IND0002,ACC0022,300]|
|[IND0003,ACC0003,400]|
+---------------------+
If I then do:
val result = customerDS.withColumn("accounts", joinCustomerAndAccount("_1")(0))
I get the following Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Field name should be String Literal, but it's 0;
Accounts can be grouped by "customerId" and joined with Customers:
import spark.implicits._
import org.apache.spark.sql.functions.{collect_set, struct}

// data
val accountDS = Seq(
AccountData("IND0002", "ACC0002", 200),
AccountData("IND0002", "ACC0022", 300),
AccountData("IND0003", "ACC0003", 400)
).toDS()
val customerDS = Seq(
CustomerData("IND0001", "Christopher", "Black"),
CustomerData("IND0002", "Madeleine", "Kerr"),
CustomerData("IND0003", "Sarah", "Skinner")
).toDS()
// action
val accountsGroupedDF = accountDS.toDF
.groupBy("customerId")
.agg(collect_set(struct("accountId", "balance")).as("accounts"))
val result = customerDS.toDF.alias("c")
.join(accountsGroupedDF.alias("a"), $"c.customerId" === $"a.customerId", "left")
.select("c.*","accounts")
result.show(false)
Output:
+----------+-----------+-------+--------------------------------+
|customerId|forename |surname|accounts |
+----------+-----------+-------+--------------------------------+
|IND0001 |Christopher|Black |null |
|IND0002 |Madeleine |Kerr |[[ACC0002, 200], [ACC0022, 300]]|
|IND0003 |Sarah |Skinner|[[ACC0003, 400]] |
+----------+-----------+-------+--------------------------------+
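If an empty array is preferred over null for customers without accounts, as in the expected output in the question, one option (a sketch, not part of the original answer) is to left-join first and collect with a when guard, since collect_list skips null elements:

import org.apache.spark.sql.functions.{col, collect_list, struct, when}

// Left join customers to accounts, then collect the account structs per customer.
// when(...) yields null on the all-null rows the left join produces for customers
// without accounts, and collect_list drops nulls, so IND0001 ends up with [].
val resultWithEmptyArrays = customerDS.toDF
  .join(accountDS.toDF, Seq("customerId"), "left")
  .groupBy("customerId", "forename", "surname")
  .agg(
    collect_list(
      when(col("accountId").isNotNull, struct(col("accountId"), col("balance")))
    ).as("accounts")
  )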

I've a table with Map as the column data type; how can I explode it to generate 2 columns, one for the key and one for the value?

Hive Table: (Name_Age: Map[String, Int] and ID: Int)
+---------------------------------------------------------+----+
| Name_Age                                                | ID |
+---------------------------------------------------------+----+
| "SUBHAJIT SEN":28,"BINOY MONDAL":26,"SHANTANU DUTTA":35 | 15 |
| "GOBINATHAN SP":35,"HARSH GUPTA":27,"RAHUL ANAND":26    | 16 |
+---------------------------------------------------------+----+
I've exploded the Name_Age column into multiple Rows:
def toUpper(name: Seq[String]) = (name.map(a => a.toUpperCase)).toSeq
sqlContext.udf.register("toUpper",toUpper _)
var df = sqlContext.sql("SELECT toUpper(name) FROM namelist").toDF("Name_Age")
df.explode(df("Name_Age")) {
  case org.apache.spark.sql.Row(arr: Seq[String]) => arr.toSeq.map(v => Tuple1(v))
}.drop(df("Name_Age")).withColumnRenamed("_1", "Name_Age")
+-------------------+
| Name_Age |
+-------------------+
| [SUBHAJIT SEN,28]|
| [BINOY MONDAL,26]|
|[SHANTANU DUTTA,35]|
| [GOBINATHAN SP,35]|
| [HARSH GUPTA,27]|
| [RAHUL ANAND,26]|
+-------------------+
But I want to explode and create 2 columns: Name and Age
+-------------------+-------+
| Name | Age |
+-------------------+-------+
| SUBHAJIT SEN | 28 |
| BINOY MONDAL | 26 |
|SHANTANU DUTTA | 35 |
| GOBINATHAN SP | 35 |
| HARSH GUPTA | 27 |
| RAHUL ANAND | 26 |
+-------------------+-------+
Could anyone please help with the explode code modification?
All you need here is to drop the toUpper call and use the explode function:
import org.apache.spark.sql.functions.explode
val df = Seq((Map("foo" -> 1, "bar" -> 2), 1)).toDF("name_age", "id")
val exploded = df.select($"id", explode($"name_age")).toDF("id", "name", "age")
exploded.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = false)
// |-- age: integer (nullable = false)
You can convert to upper case using built-in functions afterwards:
import org.apache.spark.sql.functions.upper
exploded.withColumn("name", upper($"name"))
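Applied back to the original Hive table, a rough sketch (assuming the table is registered as namelist with columns Name_Age and ID, as described in the question) would be:

import org.apache.spark.sql.functions.{col, explode, upper}

// Exploding a map column produces two columns named "key" and "value";
// rename them to Name/Age and upper-case the name afterwards.
val result = sqlContext.table("namelist")
  .select(col("ID"), explode(col("Name_Age")))
  .toDF("ID", "Name", "Age")
  .withColumn("Name", upper(col("Name")))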