Convert Spark DataFrame into array of JSON objects - Scala

I have the following DataFrame as an example:
+--------------------------------------+------------+------------+------------------+
| user_id | city | user_name | facebook_id |
+--------------------------------------+------------+------------+------------------+
| 55c3c59d-0163-46a2-b495-bc352a8de883 | Toronto | username_x | 0123482174440907 |
| e2ddv22d-4132-c211-4425-9933aa8de454 | Washington | username_y | 0432982476780234 |
+--------------------------------------+------------+------------+------------------+
How can I convert it to an array of JSON objects like:
[{
"user_id": "55c3c59d-0163-46a2-b495-bc352a8de883",
"facebook_id": "0123482174440907"
},
{
"user_id": "e2ddv22d-4132-c211-4425-9933aa8de454",
"facebook_id": "0432982476780234"
}]

Assuming you have already loaded the given data into a DataFrame, you can use the toJSON function on it.
scala> sc.parallelize(Seq(("55c3c59d-0163-46a2-b495-bc352a8de883","Toronto","username_x","0123482174440907"))).toDF("user_id","city","user_name","facebook_id")
res2: org.apache.spark.sql.DataFrame = [user_id: string, city: string, user_name: string, facebook_id: string]

scala> res2.toJSON.take(1)
res3: Array[String] = Array({"user_id":"55c3c59d-0163-46a2-b495-bc352a8de883","city":"Toronto","user_name":"username_x","facebook_id":"0123482174440907"})
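If only user_id and facebook_id are wanted in the JSON, as in the expected output, a minimal sketch (reusing res2 from the snippet above) is to select those two columns first and then collect the JSON strings to the driver:

// Keep only the two requested columns, serialize each row to a JSON string,
// and bring the result back as an Array[String].
// Note: collect() pulls all rows to the driver, so this suits small results.
val jsonArray: Array[String] = res2
  .select("user_id", "facebook_id")
  .toJSON
  .collect()
// jsonArray(0): {"user_id":"55c3c59d-0163-46a2-b495-bc352a8de883","facebook_id":"0123482174440907"}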

Related

Spark dataframe map root key with elements of array of another column of string type

I am stuck on a problem where I have a DataFrame with 2 columns having this schema:
scala> df1.printSchema
root
|-- actions: string (nullable = true)
|-- event_id: string (nullable = true)
The actions column actually contains an array of objects, but its type is string, and hence I can't use explode here directly.
Sample data :
------------------------------------------------------------------------------------------------------------------
| event_id | actions |
------------------------------------------------------------------------------------------------------------------
| 1 | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}] |
------------------------------------------------------------------------------------------------------------------
There are some other keys present in each object of actions, but for simplicity I have taken 2 here.
I want to convert this to the format below.
OUTPUT:
---------------------------------------
| event_id | name | score |
---------------------------------------
| 1 | Vijay | 843 |
---------------------------------------
| 1 | Manish | 840 |
---------------------------------------
| 1 | Mayur | 930 |
---------------------------------------
How can I do this with a Spark DataFrame?
I have tried to read the actions column using
val df2 = spark.read.option("multiline", true).json(df1.rdd.map(row => row.getAs[String]("actions")))
but this way I am not able to map event_id to each line.
You can do this by using the from_json function.
This function takes 2 inputs:
- the column containing the JSON string
- the schema with which to parse the JSON string
That would look something like this:
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, explode, from_json}
// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")
df.show(false)
+--------+---------------------------------------------------------------------------------------------------+
|event_id|actions |
+--------+---------------------------------------------------------------------------------------------------+
|1 |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|
+--------+---------------------------------------------------------------------------------------------------+
// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
new StructType()
.add("name", StringType)
.add("score", IntegerType)
)
// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
.withColumn("parsedActions",explode(from_json(col("actions"), actionsSchema)))
.drop("actions")
.select("event_id", "parsedActions.*")
parsedDf.show(false)
+--------+------+-----+
|event_id| name|score|
+--------+------+-----+
| 1| Vijay| 843|
| 1|Manish| 840|
| 1| Mayur| 930|
+--------+------+-----+
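If the full set of keys inside actions is not known up front (the question notes there are more keys than name and score), one possible variant, sketched here assuming Spark 2.4+ and that the first row is representative of the structure, is to let Spark infer the schema from a sample string with schema_of_json:

import org.apache.spark.sql.functions.{col, explode, from_json, lit, schema_of_json}

// Take one actions value as a sample and let Spark derive the element schema from it,
// then parse, explode and flatten exactly as above.
val sample = df.select("actions").as[String].head()

val inferredDf = df
  .withColumn("parsedActions", explode(from_json(col("actions"), schema_of_json(lit(sample)))))
  .drop("actions")
  .select("event_id", "parsedActions.*")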
Hope this helps!

How to ignore empty/null fields when collecting to a list with PySpark

I have these two tables:
author_df
+---------+------+
|AUTHOR_ID| NAME |
+---------+------+
|      102| Camus|
|      103| Hugo |
+---------+------+

book_df
+---------+--------+----------+
|AUTHOR_ID| BOOK_ID| BOOK_NAME|
+---------+--------+----------+
|        1|   Camus| Etranger |
|        1|    Hugo| Mesirable|
+---------+--------+----------+
I did some joins and aggregations as follows:
result_df = author_df.join(book_df, 'AUTHOR_ID', 'left')\
.groupby('AUTHOR_ID', 'NAME')\
.agg(f.collect_list(f.struct('BOOK_ID', 'BOOK_NAME')).alias('BOOK_LIST'))
And I get a dataframe with this structure
root
|-- AUTHOR_ID: integer
|-- NAME: string
|-- BOOK_LIST: array
| |-- BOOK_ID: integer
| |-- BOOK_NAME: string
The problem is that when I write result_df to JSON, I get empty objects {} in the list, like:
{
...
"BOOK_LIST": [
{}
]
}
Even when I tried the ignoreNullFields option:
(result_df
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("ignoreNullFields", "true")  # this option
    .format("json")
    .save(path))
But I still get objects with null fields that I want to remove:
{
...
"BOOK_LIST": [
{
"BOOK_ID": null,
"BOOK_NAME": null
}
]
}
In the resulting DataFrame, result_df.show(truncate=False) gives:
+---------+---------------+
|         | BOOK_LIST     |
+---------+---------------+
|         | [{null, null}]|
+---------+---------------+
So how do I remove the objects with null values and get an empty array instead? Thanks
Simply add a condition with f.when().otherwise() to the collect_list; in our case:
.agg(
    f.collect_list(
        f.when(f.col("BOOK_ID").isNotNull(), f.struct('BOOK_ID', 'BOOK_NAME'))
         .otherwise(f.lit(None))
    ).alias("BOOK_LIST")
)
collect_list skips null elements, so authors with no book end up with an empty BOOK_LIST array instead of a list holding an empty struct.

How to use a dataframe inside a UDF and parse the data in Spark Scala

I am new to Scala and Spark. I have a requirement to create a new DataFrame by using a UDF.
I have 2 dataframes: df1 contains 3 columns, namely company, id, and name.
df2 contains 2 columns, namely company and message.
df2 JSON will be like this
{"company": "Honda", "message": ["19:[\"cost 500k\"],[\"colour blue\"]","20:[\"cost 600k\"],[\"colour white\"]"]}
{"company": "BMW", "message": ["19:[\"cost 1500k\"],[\"colour blue\"]"]}
df2 will be like this:
+-------+--------------------+
|company| message|
+-------+--------------------+
| Honda|[19:["cost 500k"]...|
| BMW|[19:["cost 1500k"...|
+-------+--------------------+
|-- company: string (nullable = true)
|-- message: array (nullable = true)
| |-- element: string (containsNull = true)
df1 will be like this:
+----------+---+-------+
|company | id| name|
+----------+---+-------+
| Honda | 19| city |
| Honda | 20| amaze |
| BMW | 19| x1 |
+----------+---+-------+
I want to create a new data frame by replacing the id in df2 with the name in df1.
["city:[\"cost 500k\"],[\"colour blue\"]","amaze:[\"cost 600k\"],[\"colour white\"]"]
I tried with a UDF by passing message as Seq[String] along with company, but I was not able to select the data from df1 inside the UDF.
I want the output like this:
+-------+----------------------+
|company| message |
+-------+----------------------+
| Honda|[city:["cost 500k"]...|
| BMW|[x1:["cost 1500k"... |
+-------+----------------------+
I tried using the following UDF but I was facing errors while selecting the name:
def asdf(categories: Seq[String]): String = {
  var data = ""
  for (w <- categories) {
    if (w != null) {
      var id = w.toString().indexOf(":")
      var namea = df1.select("name").where($"id" === 20).map(_.getString(0)).collect()
      var name = namea(0)
      println(name)
      var ids = w.toString().substring(0, id)
      var li = w.toString().replace(ids, name)
      println(li)
      data = data + li
    }
  }
  data
}
Please check the code below.
scala> df1.show(false)
+-------+---------------------------------------------------------------------+
|company|message |
+-------+---------------------------------------------------------------------+
|Honda |[19:["cost 500k"],["colour blue"], 20:["cost 600k"],["colour white"]]|
|BMW |[19:["cost 1500k"],["colour blue"]] |
+-------+---------------------------------------------------------------------+
scala> df2.show(false)
+-------+---+-----+
|company|id |name |
+-------+---+-----+
|Honda | 19|city |
|Honda | 20|amaze|
|BMW | 19|x1 |
+-------+---+-----+
val replaceFirst = udf((message: String,id:String,name:String) =>
if(message.contains(s"""${id}:""")) message.replaceFirst(s"""${id}:""",s"${name}:") else ""
)
val jdf =
df1
.withColumn("message",explode($"message"))
.join(df2,df1("company") === df2("company"),"inner")
.withColumn(
"message_data",
replaceFirst($"message",trim($"id"),$"name")
)
.filter($"message_data" =!= "")
scala> jdf.show(false)
+-------+---------------------------------+-------+---+-----+------------------------------------+
|company|message |company|id |name |message_data |
+-------+---------------------------------+-------+---+-----+------------------------------------+
|Honda |19:["cost 500k"],["colour blue"] |Honda | 19|city |city:["cost 500k"],["colour blue"] |
|Honda |20:["cost 600k"],["colour white"]|Honda | 20|amaze|amaze:["cost 600k"],["colour white"]|
|BMW |19:["cost 1500k"],["colour blue"]|BMW | 19|x1 |x1:["cost 1500k"],["colour blue"] |
+-------+---------------------------------+-------+---+-----+------------------------------------+
df1.join(df2, df1("company") === df2("company"), "inner")
  .select(df1("company"), df1("message"), df2("id"), df2("name"))
  .withColumn("message", explode($"message"))
  .withColumn("message", replaceFirst($"message", trim($"id"), $"name"))
  .filter($"message" =!= "")
  .groupBy($"company")
  .agg(collect_list($"message").cast("string").as("message"))
  .show(false)
+-------+--------------------------------------------------------------------------+
|company|message |
+-------+--------------------------------------------------------------------------+
|Honda |[amaze:["cost 600k"],["colour white"], city:["cost 500k"],["colour blue"]]|
|BMW |[x1:["cost 1500k"],["colour blue"]] |
+-------+--------------------------------------------------------------------------+
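If a UDF is to be avoided altogether, a rough alternative sketch (assuming Spark 3.0+, where regexp_replace also accepts Column arguments, and reusing the df1/df2 names from above) is:

import org.apache.spark.sql.functions.{collect_list, concat, explode, lit, regexp_replace, trim}

// Explode the messages, join on company, keep only the message/id pairs that
// actually match, then rewrite the leading "<id>:" prefix to "<name>:".
val noUdfDf = df1
  .withColumn("message", explode($"message"))
  .join(df2, Seq("company"), "inner")
  .filter($"message".startsWith(concat(trim($"id"), lit(":"))))
  .withColumn("message",
    regexp_replace($"message", concat(lit("^"), trim($"id"), lit(":")), concat($"name", lit(":"))))
  .groupBy($"company")
  .agg(collect_list($"message").as("message"))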

How to add a Seq[T] column to a Dataset that contains elements of two Datasets?

I have two Datasets AccountData and CustomerData, with the corresponding case classes:
case class AccountData(customerId: String, accountId: String, balance: Long)
+----------+---------+-------+
|customerId|accountId|balance|
+----------+---------+-------+
| IND0002| ACC0002| 200|
| IND0002| ACC0022| 300|
| IND0003| ACC0003| 400|
+----------+---------+-------+
case class CustomerData(customerId: String, forename: String, surname: String)
+----------+-----------+--------+
|customerId| forename| surname|
+----------+-----------+--------+
| IND0001|Christopher| Black|
| IND0002| Madeleine| Kerr|
| IND0003| Sarah| Skinner|
+----------+-----------+--------+
How do I derive the following Dataset, which adds an accounts column containing the Seq[AccountData] for each customerId?
+----------+-----------+-------+----------------------------------------------+
|customerId|forename   |surname|accounts                                      |
+----------+-----------+-------+----------------------------------------------+
|IND0001   |Christopher|Black  |[]                                            |
|IND0002   |Madeleine  |Kerr   |[[IND0002,ACC0002,200], [IND0002,ACC0022,300]]|
|IND0003   |Sarah      |Skinner|[[IND0003,ACC0003,400]]                       |
+----------+-----------+-------+----------------------------------------------+
I've tried:
val joinCustomerAndAccount = accountDS.joinWith(customerDS, customerDS("customerId") === accountDS("customerId")).drop(col("_2"))
which gives me the following Dataframe:
+---------------------+
|_1 |
+---------------------+
|[IND0002,ACC0002,200]|
|[IND0002,ACC0022,300]|
|[IND0003,ACC0003,400]|
+---------------------+
If I then do:
val result = customerDS.withColumn("accounts", joinCustomerAndAccount("_1")(0))
I get the following Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Field name should be String Literal, but it's 0;
Accounts can be grouped by "customerId" and joined with Customers:
import spark.implicits._
import org.apache.spark.sql.functions.{collect_set, struct}

// data
val accountDS = Seq(
AccountData("IND0002", "ACC0002", 200),
AccountData("IND0002", "ACC0022", 300),
AccountData("IND0003", "ACC0003", 400)
).toDS()
val customerDS = Seq(
CustomerData("IND0001", "Christopher", "Black"),
CustomerData("IND0002", "Madeleine", "Kerr"),
CustomerData("IND0003", "Sarah", "Skinner")
).toDS()
// action
val accountsGroupedDF = accountDS.toDF
.groupBy("customerId")
.agg(collect_set(struct("accountId", "balance")).as("accounts"))
val result = customerDS.toDF.alias("c")
.join(accountsGroupedDF.alias("a"), $"c.customerId" === $"a.customerId", "left")
.select("c.*","accounts")
result.show(false)
Output:
+----------+-----------+-------+--------------------------------+
|customerId|forename |surname|accounts |
+----------+-----------+-------+--------------------------------+
|IND0001 |Christopher|Black |null |
|IND0002 |Madeleine |Kerr |[[ACC0002, 200], [ACC0022, 300]]|
|IND0003 |Sarah |Skinner|[[ACC0003, 400]] |
+----------+-----------+-------+--------------------------------+
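If an empty array is preferred over null for customers without accounts, as in the expected output in the question, one option (a sketch, not part of the original answer) is to left-join first and collect with a when guard, since collect_list skips null elements:

import org.apache.spark.sql.functions.{col, collect_list, struct, when}

// Left join customers to accounts, then collect the account structs per customer.
// when(...) yields null on the all-null rows the left join produces for customers
// without accounts, and collect_list drops nulls, so IND0001 ends up with [].
val resultWithEmptyArrays = customerDS.toDF
  .join(accountDS.toDF, Seq("customerId"), "left")
  .groupBy("customerId", "forename", "surname")
  .agg(
    collect_list(
      when(col("accountId").isNotNull, struct(col("accountId"), col("balance")))
    ).as("accounts")
  )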

I've a table with Map as the column data type; how can I explode it to generate 2 columns, one for the key and one for the value?

Hive Table: (Name_Age: Map[String, Int] and ID: Int)
+---------------------------------------------------------+----+
| Name_Age                                                | ID |
+---------------------------------------------------------+----+
| "SUBHAJIT SEN":28,"BINOY MONDAL":26,"SHANTANU DUTTA":35 | 15 |
| "GOBINATHAN SP":35,"HARSH GUPTA":27,"RAHUL ANAND":26    | 16 |
+---------------------------------------------------------+----+
I've exploded the Name_Age column into multiple Rows:
def toUpper(name: Seq[String]) = (name.map(a => a.toUpperCase)).toSeq
sqlContext.udf.register("toUpper",toUpper _)
var df = sqlContext.sql("SELECT toUpper(name) FROM namelist").toDF("Name_Age")
df.explode(df("Name_Age")) {
  case org.apache.spark.sql.Row(arr: Seq[String]) => arr.toSeq.map(v => Tuple1(v))
}.drop(df("Name_Age")).withColumnRenamed("_1", "Name_Age")
+-------------------+
| Name_Age |
+-------------------+
| [SUBHAJIT SEN,28]|
| [BINOY MONDAL,26]|
|[SHANTANU DUTTA,35]|
| [GOBINATHAN SP,35]|
| [HARSH GUPTA,27]|
| [RAHUL ANAND,26]|
+-------------------+
But I want to explode and create 2 columns: Name and Age
+-------------------+-------+
| Name | Age |
+-------------------+-------+
| SUBHAJIT SEN | 28 |
| BINOY MONDAL | 26 |
|SHANTANU DUTTA | 35 |
| GOBINATHAN SP | 35 |
| HARSH GUPTA | 27 |
| RAHUL ANAND | 26 |
+-------------------+-------+
Could anyone please help with the explode code modification?
All you need here is to drop the toUpper call and use the explode function:
import org.apache.spark.sql.functions.explode
val df = Seq((Map("foo" -> 1, "bar" -> 2), 1)).toDF("name_age", "id")
val exploded = df.select($"id", explode($"name_age")).toDF("id", "name", "age")
exploded.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = false)
// |-- age: integer (nullable = false)
You can convert to upper case using built-in functions afterwards:
import org.apache.spark.sql.functions.upper
exploded.withColumn("name", upper($"name"))
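Applied back to the original Hive table, a rough sketch (assuming the table is registered as namelist with columns Name_Age and ID, as described in the question) would be:

import org.apache.spark.sql.functions.{col, explode, upper}

// Exploding a map column produces two columns named "key" and "value";
// rename them to Name/Age and upper-case the name afterwards.
val result = sqlContext.table("namelist")
  .select(col("ID"), explode(col("Name_Age")))
  .toDF("ID", "Name", "Age")
  .withColumn("Name", upper(col("Name")))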