Spark: create a DataFrame from a JSON string and a string (Scala)

I have a JSON string and a separate string that I'd like to turn into a DataFrame.
val body = """{
| "time": "2020-07-01T17:17:15.0495314Z",
| "ver": "4.0",
| "name": "samplename",
| "iKey": "o:something",
| "random": {
| "stuff": {
| "eventFlags": 258,
| "num5": "DHM",
| "num2": "something",
| "flags": 415236612,
| "num1": "4004825",
| "seq": 44
| },
| "banana": {
| "id": "someid",
| "ver": "someversion",
| "asId": 123
| },
| "something": {
| "example": "somethinghere"
| },
| "apple": {
| "time": "2020-07-01T17:17:37.874Z",
| "flag": "something",
| "userAgent": "someUserAgent",
| "auth": 12,
| "quality": 0
| },
| "loc": {
| "country": "US"
| }
| },
| "EventEnqueuedUtcTime": "2020-07-01T17:17:59.804Z"
|}
|""".stripMargin
val offset = "10"
I tried
val data = Seq(body, offset)
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
val df = data.toDF(columns:_*)
As well as
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
but for both I get this error: "value toDF is not a member of org.apache.spark.RDD[String]"
Is there a different way I can create a dataframe that will have one column with my json body data, and another column with my offset string value?
Edit: I've also tried the following:
val offset = "1000"
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
and get a column mismatch error: "The number of columns doesn't match.
Old column names (1): value
New column names (2): body, offset"
I don't understand why my data has the column name "value".

I guess the issue is with your Seq syntax: the elements should be tuples. The code below worked for me,
val data = Seq((body, offset)) // <--- Check this line
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
data.toDF(columns:_*).printSchema()
/*
root
 |-- body: string (nullable = true)
 |-- offset: string (nullable = true)
*/
data.toDF(columns:_*).show()
/*
+--------------------+------+
|                body|offset|
+--------------------+------+
|{
 "time": "2020...|    10|
+--------------------+------+
*/
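As an aside, the "value" column name in the error comes from the fact that calling toDF on a collection of plain Strings produces a single-column DataFrame whose default column name is value, with one row per string; that is why two column names could not be applied to it. The RDD route also works once the elements are tuples; a minimal sketch, assuming the sparkSession, body and offset values from the question:
import sparkSession.implicits._
// A Seq of one tuple becomes one row with two columns; a Seq[String]
// would instead become two rows in a single column named "value".
val rddOfTuples = sparkSession.sparkContext.parallelize(Seq((body, offset)))
val dfFromRdd = rddOfTuples.toDF("body", "offset")
dfFromRdd.show(20, false)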

Related

Spark: join condition with Array (nullable ones)

I have two DataFrames that I'd like to join and then filter: for each TransactionId I want to drop the rows whose value matches an entry in OrgTypeToExclude.
In short, TransactionId is the join condition and OrgTypeToExclude is the exclusion condition. Here is a simple example:
import org.apache.spark.sql.functions.expr
import spark.implicits._
val jsonstr ="""{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}"""
val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
val json = spark.read.json(Seq(jsonstr).toDS).select("Transactions.TransactionId","Transactions.OrgTypeToExclude")
df.printSchema()
json.printSchema()
df.join(json, $"code" <=> $"TransactionId".cast("string") && !expr("array_contains(OrgTypeToExclude, Alp)"), "inner").show()
Expected output:
+---+----+---+
|id |code|Alp|
+---+----+---+
|4  |USAL|C  |
|2  |USMD|B  |
|3  |USGA|C  |
+---+----+---+
Thanks,
Manoj.
Transactions is an array type, and you are accessing TransactionId and OrgTypeToExclude directly on it, so you get arrays of values back.
Instead, explode the root-level Transactions array first and extract the struct fields OrgTypeToExclude and TransactionId; the next steps are then easy.
Please check the code below.
scala> val jsonstr ="""{
|
| "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
| "Transactions": [
| {
| "TransactionId": "USAL",
| "OrgTypeToExclude": ["A","B"]
| },
| {
| "TransactionId": "USMD",
| "OrgTypeToExclude": ["E"]
| },
| {
| "TransactionId": "USGA",
| "OrgTypeToExclude": []
| }
| ]
| }"""
jsonstr: String =
{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}
scala> val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
df: org.apache.spark.sql.DataFrame = [id: int, code: string ... 1 more field]
scala> val json = spark.read.json(Seq(jsonstr).toDS).select(explode($"Transactions").as("Transactions")).select($"Transactions.*")
json: org.apache.spark.sql.DataFrame = [OrgTypeToExclude: array<string>, TransactionId: string]
scala> df.show(false)
+---+----+---+
|id |code|Alp|
+---+----+---+
|1 |USAL|A |
|4 |USAL|C |
|2 |USMD|B |
|5 |USMD|E |
|3 |USGA|C |
+---+----+---+
scala> json.show(false)
+----------------+-------------+
|OrgTypeToExclude|TransactionId|
+----------------+-------------+
|[A, B] |USAL |
|[E] |USMD |
|[] |USGA |
+----------------+-------------+
scala> df.join(json, (df("code") === json("TransactionId") && !array_contains(json("OrgTypeToExclude"), df("Alp"))), "inner").select("id","code","alp").show(false)
+---+----+---+
|id |code|alp|
+---+----+---+
|4 |USAL|C |
|2 |USMD|B |
|3 |USGA|C |
+---+----+---+
scala>
First, it looks like you overlooked the fact that Transactions is also an array, which we can deal with using explode:
val json = spark.read.json(Seq(jsonstr).toDS)
.select(explode($"Transactions").as("t")) // deal with Transactions array first
.select($"t.TransactionId", $"t.OrgTypeToExclude")
Also, array_contains wants a value rather than a column as its second argument. I'm not aware of a version that supports referencing a column, so we'll make a udf:
val arr_con = udf { (a: Seq[String], v: String) => a.contains(v) }
We can then modify the join condition like so:
df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
And the expected result:
scala> df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
+---+----+---+-------------+----------------+
| id|code|Alp|TransactionId|OrgTypeToExclude|
+---+----+---+-------------+----------------+
| 4|USAL| C| USAL| [A, B]|
| 2|USMD| B| USMD| [E]|
| 3|USGA| C| USGA| []|
+---+----+---+-------------+----------------+
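As a side note, the expr-based form from the question may also work without a udf, because inside a SQL expression the second argument of array_contains can itself be a column reference; whether this applies may depend on your Spark version, so treat this as a hedged sketch reusing the df and json DataFrames from above:
import org.apache.spark.sql.functions.expr
// Join condition written entirely as a SQL expression; array_contains here
// compares the array column against the Alp column of each joined row.
df.join(json, expr("code <=> TransactionId AND NOT array_contains(OrgTypeToExclude, Alp)"), "inner").show()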

Tabular data from DB to MAP Data structure

I am fetching a few values from the DB and want to create a nested map data structure out of them. The tabular data looks like this:
+---------+--------------+----------------+------------------+----------------+-----------------------+
| Cube_ID | Dimension_ID | Dimension_Name | Partition_Column | Display_name | Dimension_Description |
+---------+--------------+----------------+------------------+----------------+-----------------------+
| 1 | 1 | Reporting_Date | Y | Reporting_Date | Reporting_Date |
| 1 | 2 | Platform | N | Platform | Platform |
| 1 | 3 | Country | N | Country | Country |
| 1 | 4 | OS_Version | N | OS_Version | OS_Version |
| 1 | 5 | Device_Version | N | Device_Version | Device_Version |
+---------+--------------+----------------+------------------+----------------+-----------------------+
I want to create a nested structure something like this
{
CubeID = "1": {
Dimension ID = "1": [
{
"Name": "Reporting_Date",
"Partition_Column": "Y"
"Display": "Reporting_Date"
}
]
Dimension ID = "2": [
{
"Name": "Platform",
"Column": "N"
"Display": "Platform"
}
]
},
CubeID = "2": {
Dimension ID = "1": [
{
"Name": "Reporting_Date",
"Partition_Column": "Y"
"Display": "Reporting_Date"
}
]
Dimension ID = "2": [
{
"Name": "Platform",
"Column": "N"
"Display": "Platform"
}
]
}
}
I read the result set from the DB using the following. I am able to pull the individual columns, but I'm not sure how to build a map for later computation:
while (rs.next()) {
val Dimension_ID = rs.getInt("Dimension_ID")
val Dimension_Name = rs.getString("Dimension_Name")
val Partition_Column = rs.getString("Partition_Column")
val Display_name = rs.getString("Display_name")
val Dimension_Description = rs.getString("Dimension_Description")
}
I believe I should write a case class for this, but I am not sure how to define one and load the values into it.
Thanks for the help. I can provide any other info needed. Let me know
You can define data classes for this as below:
case class Dimension(
dimensionId: Long,
name: String,
partitionColumn: String,
display: String
)
case class Record(
cubeId: Int,
dimension: Dimension
)
case class Data(records: List[Record])
And this is how you can construct the data:
val data =
Data(
List(
Record(
cubeId = 1,
dimension = Dimension(
dimensionId = 1,
name = "Reporting_Date",
partitionColumn = "Y",
display = "Reporting_Date"
)
),
Record(
cubeId = 2,
dimension = Dimension(
dimensionId = 1,
name = "Platform",
partitionColumn = "N",
display = "Platform"
)
)
)
)
Now to your question: since you are using JDBC, you have to build the list of records either in a mutable way or with a Scala Iterator. Below is the mutable way to construct the data classes above, but you can explore other options.
import scala.collection.mutable.ListBuffer
var mutableData = new ListBuffer[Record]()
while (rs.next()) {
mutableData += Record(
cubeId = rs.getInt("Cube_ID"),
dimension = Dimension(
dimensionId = rs.getInt("Dimension_ID"),
name = rs.getString("Dimension_Name"),
partitionColumn = rs.getString("Partition_Column"),
display = rs.getString("Display_name")
)
)
}
val data = Data(records = mutableData.toList)
Also read - Any better way to convert SQL ResultSet to Scala List
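If you then want the nested CubeID -> DimensionID shape you sketched, you can group the flat records in memory; a minimal sketch on top of the case classes above (the exact Map shape is an assumption based on your example):
// Map[cubeId -> Map[dimensionId -> Dimension]]
val nested: Map[Int, Map[Long, Dimension]] =
  data.records
    .groupBy(_.cubeId)
    .map { case (cubeId, recs) =>
      cubeId -> recs.map(r => r.dimension.dimensionId -> r.dimension).toMap
    }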

Minimum value of a struct type column in a dataframe

I have the JSON file below, which contains struct-type data in col3, where col3 holds dynamic nested columns (dynamic names) with values. I am able to extract the rows but not able to find the minimum value of col3.
The input data is:
"result": { "data" :
[ {"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},
{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},
{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}]
The expected output is:
col1      col2      col3      col4      col5 (min value of col3)
value1    value2    [3,5]     value4    3
value11   value22   [6,9,2]   value44   2
value12   value23   [6]       value43   6
I am able to read the file and explode the data into records, but not able to find the min value of col3.
val bestseller_df1 = bestseller_json.withColumn("extractedresult", explode(col("result.data")))
Can you please help me with the code to find the min value of col3 in Spark/Scala?
My JSON file is:
{"success":true, "result": { "data": [ {"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}],"total":3}}
Here is how you would do it
scala> val df = spark.read.json("/tmp/stack/pathi.json")
df: org.apache.spark.sql.DataFrame = [result: struct<data: array<struct<col1:string,col2:string,col3:struct<abc:bigint,aeio:bigint,aero:bigint,ddc:bigint,def:bigint,dyno:bigint>,col4:string>>, total: bigint>, success: boolean]
scala> df.printSchema
root
|-- result: struct (nullable = true)
| |-- data: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- col1: string (nullable = true)
| | | |-- col2: string (nullable = true)
| | | |-- col3: struct (nullable = true)
| | | | |-- abc: long (nullable = true)
| | | | |-- aeio: long (nullable = true)
| | | | |-- aero: long (nullable = true)
| | | | |-- ddc: long (nullable = true)
| | | | |-- def: long (nullable = true)
| | | | |-- dyno: long (nullable = true)
| | | |-- col4: string (nullable = true)
| |-- total: long (nullable = true)
|-- success: boolean (nullable = true)
scala> df.show(false)
+-------------------------------------------------------------------------------------------------------------------------------+-------+
|result |success|
+-------------------------------------------------------------------------------------------------------------------------------+-------+
|[[[value1, value2, [, 5,,,, 3], value4], [value11, value22, [6,, 2,, 9,], value44], [value12, value23, [,,, 6,,], value43]], 3]|true |
+-------------------------------------------------------------------------------------------------------------------------------+-------+
scala> df.select(explode($"result.data")).show(false)
+-----------------------------------------+
|col |
+-----------------------------------------+
|[value1, value2, [, 5,,,, 3], value4] |
|[value11, value22, [6,, 2,, 9,], value44]|
|[value12, value23, [,,, 6,,], value43] |
+-----------------------------------------+
By looking at the schema, we now know the list of possible columns inside "col3", so we can compute the minimum of all those values by hard-coding them as below:
scala> df.select(explode($"result.data")).select(least($"col.col3.abc",$"col.col3.aeio",$"col.col3.aero",$"col.col3.ddc",$"col.col3.def",$"col.col3.dyno")).show(false)
+--------------------------------------------------------------------------------------------+
|least(col.col3.abc, col.col3.aeio, col.col3.aero, col.col3.ddc, col.col3.def, col.col3.dyno)|
+--------------------------------------------------------------------------------------------+
|3 |
|2 |
|6 |
+--------------------------------------------------------------------------------------------+
Dynamic handling:
I'll assume that up to col.col3 the structure remains the same, so we proceed by creating another dataframe:
scala> val df2 = df.withColumn("res_data",explode($"result.data")).select(col("success"),col("res_data"),$"res_data.col3.*")
df2: org.apache.spark.sql.DataFrame = [success: boolean, res_data: struct<col1: string, col2: string ... 2 more fields> ... 6 more fields]
scala> df2.show(false)
+-------+-----------------------------------------+----+----+----+----+----+----+
|success|res_data |abc |aeio|aero|ddc |def |dyno|
+-------+-----------------------------------------+----+----+----+----+----+----+
|true |[value1, value2, [, 5,,,, 3], value4] |null|5 |null|null|null|3 |
|true |[value11, value22, [6,, 2,, 9,], value44]|6 |null|2 |null|9 |null|
|true |[value12, value23, [,,, 6,,], value43] |null|null|null|6 |null|null|
+-------+-----------------------------------------+----+----+----+----+----+----+
Other than "success" and "res_data", the rest of the columns are the dynamic ones from "col3"
scala> val p = df2.columns
p: Array[String] = Array(success, res_data, abc, aeio, aero, ddc, def, dyno)
Filter those two out and map the rest of them to Spark Columns:
scala> val m = p.filter(_!="success").filter(_!="res_data").map(col(_))
m: Array[org.apache.spark.sql.Column] = Array(abc, aeio, aero, ddc, def, dyno)
Now pass m:_* as the argument to the least function (least ignores null values, so rows with missing keys still get the correct minimum), and you get your results as below:
scala> df2.withColumn("minv",least(m:_*)).show(false)
+-------+-----------------------------------------+----+----+----+----+----+----+----+
|success|res_data |abc |aeio|aero|ddc |def |dyno|minv|
+-------+-----------------------------------------+----+----+----+----+----+----+----+
|true |[value1, value2, [, 5,,,, 3], value4] |null|5 |null|null|null|3 |3 |
|true |[value11, value22, [6,, 2,, 9,], value44]|6 |null|2 |null|9 |null|2 |
|true |[value12, value23, [,,, 6,,], value43] |null|null|null|6 |null|null|6 |
+-------+-----------------------------------------+----+----+----+----+----+----+----+
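If you want output closer to the shape you sketched (col1 to col4 plus a col5 holding the minimum), you can project the original struct fields alongside least; a small sketch reusing df2 and m from above:
// Pull the original fields back out of res_data and append the row-wise minimum
df2.select($"res_data.col1", $"res_data.col2", $"res_data.col3", $"res_data.col4",
  least(m:_*).as("col5")).show(false)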
Hope this helps.
dbutils.fs.put("/tmp/test.json", """
{"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"}
{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"}
{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}
""", true)
import org.apache.spark.sql.functions._
val df_json = spark.read.json("/tmp/test.json")
val tf = df_json.withColumn("col3", explode(array($"col3.*"))).toDF
val tmp_group = tf.groupBy("col1").agg( min(tf.col("col3")).alias("col3"))
val top_rows = tf.join(tmp_group, Seq("col3","col1"), "inner")
top_rows.select("col1", "col2", "col3","col4").show()
Wrote 282 bytes.
+-------+-------+----+-------+
| col1| col2|col3| col4|
+-------+-------+----+-------+
| value1| value2| 3| value4|
|value11|value22| 2|value44|
|value12|value23| 6|value43|
+-------+-------+----+-------+

Slick-pg: How to use arrayElementsText and overlap operator "?|"?

I'm trying to write the following query in Scala using Slick/slick-pg, but I don't have much experience with Slick and can't figure out how:
SELECT *
FROM attributes a
WHERE a.other_id = 10
and ARRAY(SELECT jsonb_array_elements_text(a.value->'value'))
&& array['1','30','205'];
This is a simplified version of the attributes table, where the value field is a jsonb:
class Attributes(tag: Tag) extends Table[Attribute](tag, "ship_attributes") {
def id = column[Int]("id")
def other_id = column[Int]("other_id")
def value = column[Json]("value")
def * = (id, other_id, value) <> (Attribute.tupled, Attribute.unapply)
}
Sample data:
| id | other_id | value |
|:-----|:-----------|:------------------------------------------|
| 1 | 10 | {"type": "IdList", "value": [1, 21]} |
| 2 | 10 | {"type": "IdList", "value": [5, 30]} |
| 3 | 10 | {"type": "IdList", "value": [7, 36]} |
This is my current query:
attributes
.filter(_.other_id === 10)
.filter { a =>
val innerQuery = attributes.map { _ =>
a.+>"value".arrayElementsText
}.to[List]
innerQuery #& List("1", "30", "205").bind
}
But it's complaining about the .to[List] conversion.
I've tried to create a SimpleFunction.unary[X, List[String]]("ARRAY"), but I don't know how to pass innerQuery to it (innerQuery is Query[Rep[String], String, Seq]).
Any ideas are very much appreciated.
UPDATE 1:
While I can't figure this out, I changed the app to store the JSON field in the database as a list of strings instead of integers, so that I can use this simpler query:
attributes
.filter(_.other_id === 10)
.filter(_.+>"value" ?| List("1", "30", "205").bind)
| id | other_id | value |
|:-----|:-----------|:------------------------------------------|
| 1 | 10 | {"type": "IdList", "value": ["1", "21"]} |
| 2 | 10 | {"type": "IdList", "value": ["5", "30"]} |
| 3 | 10 | {"type": "IdList", "value": ["7", "36"]} |

Convert Spark data frame into array of JSON objects

I have the following Dataframe as an example:
+--------------------------------------+------------+------------+------------------+
| user_id | city | user_name | facebook_id |
+--------------------------------------+------------+------------+------------------+
| 55c3c59d-0163-46a2-b495-bc352a8de883 | Toronto | username_x | 0123482174440907 |
| e2ddv22d-4132-c211-4425-9933aa8de454 | Washington | username_y | 0432982476780234 |
+--------------------------------------+------------+------------+------------------+
How can I convert it to an array of JSON objects like:
[{
"user_id": "55c3c59d-0163-46a2-b495-bc352a8de883",
"facebook_id": "0123482174440907"
},
{
"user_id": "e2ddv22d-4132-c211-4425-9933aa8de454",
"facebook_id": "0432982476780234"
}]
Assuming you have already loaded the given data as a dataframe, you can use the toJSON function on it:
scala> sc.parallelize(Seq(("55c3c59d-0163-46a2-b495-bc352a8de883","Toronto","username_x","0123482174440907"))).toDF("user_id","city","user_name","facebook_id")
res2: org.apache.spark.sql.DataFrame = [user_id: string, city: string, user_name: string, facebook_id: string]

scala> res2.toJSON.take(1)
res3: Array[String] = Array({"user_id":"55c3c59d-0163-46a2-b495-bc352a8de883","city":"Toronto","user_name":"username_x","facebook_id":"0123482174440907"})
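Since the expected output only contains user_id and facebook_id, you can select those columns first and then collect the JSON strings; a minimal sketch, assuming your DataFrame is named df:
val jsonStrings: Array[String] =
  df.select("user_id", "facebook_id")
    .toJSON
    .collect()
// Each element looks like {"user_id":"...","facebook_id":"..."}; wrap the
// collected strings yourself if you need a single JSON array literal.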