Merge Spark dataframe rows based on key column in Scala

I have a streaming DataFrame with two columns: a key column represented as a String, and an objects column which is an array containing one object element. I want to be able to merge rows in the DataFrame that have the same key, such that the merged rows form an array of objects.
Dataframe
----------------------------------------------------------------
|key | objects |
----------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}] |
|abc | [{"name": "image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
----------------------------------------------------------------
Merged Dataframe
--------------------------------------------------------------------------------------------------------------
|key | objects                                                                                                 |
--------------------------------------------------------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}, {"name": "image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}]                                                      |
--------------------------------------------------------------------------------------------------------------
One option is to convert this into a PairRDD and apply the reduceByKey function, but I'd prefer to do this with DataFrames if possible, since it would be more optimal. Is there any way to do this with DataFrames without compromising on performance?
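For reference, the RDD route I have in mind looks roughly like this (a sketch only, ignoring the streaming aspect, with the DataFrame named df and objects assumed to hold JSON strings):
// Sketch of the pair-RDD alternative: build (key, objects) pairs and concatenate
// the per-row arrays for each key.
import org.apache.spark.rdd.RDD
val mergedRdd: RDD[(String, Seq[String])] = df.rdd
  .map(row => (row.getString(0), row.getSeq[String](1)))
  .reduceByKey(_ ++ _)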

Assuming column objects is an array of a single JSON string, here's how you can merge objects by key:
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and the $ column syntax

case class Obj(name: String, `type`: String, code: String)

val df = Seq(
    ("abc", Obj("file", "sample", "123")),
    ("abc", Obj("image", "sample", "456")),
    ("xyz", Obj("doc", "sample", "707"))
  ).
  toDF("key", "object").
  select($"key", array(to_json($"object")).as("objects"))
df.show(false)
// +---+-----------------------------------------------+
// |key|objects |
// +---+-----------------------------------------------+
// |abc|[{"name":"file","type":"sample","code":"123"}] |
// |abc|[{"name":"image","type":"sample","code":"456"}]|
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// +---+-----------------------------------------------+
df.groupBy($"key").agg(collect_list($"objects"(0)).as("objects")).
show(false)
// +---+---------------------------------------------------------------------------------------------+
// |key|objects |
// +---+---------------------------------------------------------------------------------------------+
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// |abc|[{"name":"file","type":"sample","code":"123"}, {"name":"image","type":"sample","code":"456"}]|
// +---+---------------------------------------------------------------------------------------------+
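If an objects array could ever hold more than one element, a variant of the same aggregation (assuming Spark 2.4+ for flatten) keeps every element instead of only element 0:
// Variant: collect the arrays as-is and flatten them, so no element is dropped.
df.groupBy($"key").
  agg(flatten(collect_list($"objects")).as("objects")).
  show(false)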

Related

Confluent schema registry not found error

I'm using Confluent's Schema Registry to experiment a bit, but I'm running into trouble when writing to Kafka.
I have the following DataFrame that I want to write to a Kafka topic using the Schema Registry in Databricks:
+---------------+-----+------+------+-----------------------------+
|name |id |gender|salary|headers |
+---------------+-----+------+------+-----------------------------+
|James Smith |36636|M |3100 |[[id, 36636], [salary, 3100]]|
|Michael Rose |40288|M |4300 |[[id, 40288], [salary, 4300]]|
|Robert Williams|42114|M |1400 |[[id, 42114], [salary, 1400]]|
|Maria Jones |39192|F |5500 |[[id, 39192], [salary, 5500]]|
|Jen Mary Brown | |F |-1 |[[id, ], [salary, -1]] |
+---------------+-----+------+------+-----------------------------+
After formatting it for writing, it looks like this:
+-----+------------------------------------------------------------------+-----------------------------+
|key |value |headers |
+-----+------------------------------------------------------------------+-----------------------------+
|36636|{"name":"James Smith","id":"36636","gender":"M","salary":3100} |[[id, 36636], [salary, 3100]]|
|40288|{"name":"Michael Rose","id":"40288","gender":"M","salary":4300} |[[id, 40288], [salary, 4300]]|
|42114|{"name":"Robert Williams","id":"42114","gender":"M","salary":1400}|[[id, 42114], [salary, 1400]]|
|39192|{"name":"Maria Jones","id":"39192","gender":"F","salary":5500} |[[id, 39192], [salary, 5500]]|
| |{"name":"Jen Mary Brown","id":"","gender":"F","salary":-1} |[[id, ], [salary, -1]] |
+-----+------------------------------------------------------------------+-----------------------------+
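(The formatting code itself is not in the post; a sketch of something that could produce that shape, with employeesDf as an assumed name for the first DataFrame, would be:)
// Hypothetical formatting step, for illustration only: key from id, value as a
// JSON string of the row, headers passed through unchanged.
import org.apache.spark.sql.functions._
val preparedDf = employeesDf.select(
  col("id").as("key"),
  to_json(struct(col("name"), col("id"), col("gender"), col("salary"))).as("value"),
  col("headers")
)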
Whenever I try to write it to Kafka using the to_avro() function, it says the schema is not found. I made sure the schema is compatible with the data using this validator.
The write statement looks like this:
preparedDf
  .select(
    col("key"),
    to_avro($"value", lit(s"$KAFKA_TOPIC-value"), SCHEMA_REGISTRY_SERVER).as("value"),
    col("headers"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_BROKERS)
  .option("topic", KAFKA_TOPIC)
  .option("includeHeaders", true)
  .save()
but it fails with the following message:
org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
I've also tried just
preparedDf
  .select(
    col("key"),
    to_avro($"value", lit(s"$KAFKA_TOPIC"), SCHEMA_REGISTRY_SERVER).as("value"),
    col("headers"))
but that fails with Subject not found; error code: 40401. At first I thought the data was incompatible with the provided schema (a data type mismatch or something similar), but after validating it I ran out of ideas. Have you faced this problem before?
the schema I'm using for this data is:
{
  "type": "record",
  "namespace": "com.some.namespace",
  "name": "value",
  "doc": "test Data message value",
  "fields": [
    {
      "name": "id",
      "type": "string",
      "doc": "Id for the employee"
    },
    {
      "name": "gender",
      "type": {
        "name": "genderEnum",
        "type": "enum",
        "symbols": ["M", "F"]
      },
      "doc": "Employees gender"
    },
    {
      "name": "salary",
      "type": "int",
      "doc": "Employees salary"
    },
    {
      "name": "name",
      "type": "string",
      "doc": "Employees names"
    }
  ]
}
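A quick sanity check (not part of the original post) is to ask the registry which subjects it actually knows about, for example from the same Scala session; this assumes SCHEMA_REGISTRY_SERVER is the registry's plain base URL, e.g. http://schema-registry:8081:
// Hedged side check: list all registered subjects and look for "<topic>-value".
// SCHEMA_REGISTRY_SERVER is assumed to be the registry base URL used above.
import scala.io.Source
val subjects = Source.fromURL(s"$SCHEMA_REGISTRY_SERVER/subjects").mkString
println(subjects) // e.g. ["my-topic-value", ...]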

How to validate my data with jsonSchema scala

I have a dataframe which looks like this:
+--------------------+----------------+------+------+
| id | migration|number|string|
+--------------------+----------------+------+------+
|[5e5db036e0403b1a. |mig | 1| str |
+--------------------+----------------+------+------+
and I have a jsonSchema:
{
  "title": "Section",
  "type": "object",
  "additionalProperties": false,
  "required": ["migration", "id"],
  "properties": {
    "migration": {
      "type": "string",
      "additionalProperties": false
    },
    "string": {
      "type": "string"
    },
    "number": {
      "type": "number",
      "min": 0
    }
  }
}
I would like to validate the schema of my dataframe with my jsonSchema.
Thank you
Please find inline code comments for the explanation
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataType, StructField, StructType}

val newSchema : StructType = DataType.fromJson("""{
  |  "type": "struct",
  |  "fields": [
  |    {
  |      "name": "id",
  |      "type": "string",
  |      "nullable": true,
  |      "metadata": {}
  |    },
  |    {
  |      "name": "migration",
  |      "type": "string",
  |      "nullable": true,
  |      "metadata": {}
  |    },
  |    {
  |      "name": "number",
  |      "type": "integer",
  |      "nullable": false,
  |      "metadata": {}
  |    },
  |    {
  |      "name": "string",
  |      "type": "string",
  |      "nullable": true,
  |      "metadata": {}
  |    }
  |  ]
  |}""".stripMargin).asInstanceOf[StructType] // Load your schema from a JSON string
// println(newSchema)

val spark = Constant.getSparkSess // Create SparkSession object

// Correct data
val correctData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.", "mig", 1, "str")))
val dfNew = spark.createDataFrame(correctData, newSchema) // validating the data
dfNew.show()

// Incorrect data
val inCorrectData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.", 1, 1, "str")))
val dfInvalid = spark.createDataFrame(inCorrectData, newSchema) // this will throw RuntimeException: java.lang.Integer is not a valid external type for schema of string when the data is evaluated
dfInvalid.show()

val res = spark.sql("") // Load the SQL dataframe
val diffColumn : Seq[StructField] = res.schema.diff(newSchema) // compare the SQL dataframe schema with the JSON schema
diffColumn.foreach(f => println(f.name)) // Print the diff columns
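As a small extension of the same idea (not in the original snippet), you can report exactly which columns are missing or have a different type instead of eyeballing the raw diff:
// Hedged sketch: compare the actual DataFrame schema (res.schema) against the
// expected newSchema field by field and report mismatches.
val expectedByName = newSchema.fields.map(f => f.name -> f.dataType).toMap
res.schema.fields.foreach { f =>
  expectedByName.get(f.name) match {
    case None =>
      println(s"Unexpected column: ${f.name}")
    case Some(expected) if expected != f.dataType =>
      println(s"Type mismatch for ${f.name}: got ${f.dataType}, expected $expected")
    case _ => // column name and type match
  }
}
// Columns required by the expected schema but absent from the DataFrame
newSchema.fieldNames.diff(res.schema.fieldNames).foreach(n => println(s"Missing column: $n"))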

Convert nested json to dataframe in scala spark

I want to create a dataframe out of JSON for only a given key. Its value is a list of nested JSON objects. I tried flattening, but I think there could be a simpler workaround, since I only need one key of the JSON to convert into a dataframe.
I have json like:
("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
Now I want to create a DataFrame using Spark for only the key 'metadata'. I have written this code:
val json = Json.parse("""
{
  "Id_columns": 2,
  "metadata": [{
    "id": "1234",
    "type": "file",
    "length": 395
  }, {
    "id": "1235",
    "type": "file2",
    "length": 396
  }]
}""")
var jsonlist = Json.stringify(json("metadata"))
val rddData = spark.sparkContext.parallelize(jsonlist)
val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
resultDF.show()
But it's giving me error:
overloaded method value json with alternatives:
cannot be applied to (org.apache.spark.rdd.RDD[Char])
[error] val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
^
I am expecting result:
+----+-----+--------+
| id | type| length |
+----+-----+--------+
|1234|file1| 395 |
|1235|file2| 396 |
+----+-----+--------+
You need to explode your array like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.json(
  spark.sparkContext.parallelize(Seq("""{"Id_columns":2,"metadata":[{"id":"1234","type":"file","length":395},{"id":"1235","type":"file2","length":396}]}"""))
)
df.select(explode($"metadata").as("metadata"))
  .select("metadata.*")
  .show(false)
Output :
+----+------+-----+
|id |length|type |
+----+------+-----+
|1234|395 |file |
|1235|396 |file2|
+----+------+-----+

Dataframe columns do not keep their order and columns with null values are excluded when writing to a CosmosDB collection

I tried to copy data into a CosmosDB collection from a dataframe in Spark.
The data is written into CosmosDB, but with two issues:
The order of columns in the dataframe is not maintained in CosmosDB.
Columns with null values are not written to CosmosDB; they are excluded entirely.
Below is the data available in dataframe:
+-------+------+--------+---------+---------+-------+
| NUM_ID| TIME| SIG1| SIG2| SIG3| SIG4|
+-------+------+--------+---------+---------+-------+
|X00030 | 13000|35.79893| 139.9061| 48.32786| null|
|X00095 | 75000| null| null| null|5860505|
|X00074 | 43000| null| 8.75037| 98.9562|8014505|
+-------+------+--------+---------+---------+-------+
Below is the code written in spark to copy the dataframe into cosmosDB.
val finalSignals = spark.sql("""SELECT * FROM db.tableName""")
val toCosmosDF = finalSignals
  .withColumn("NUM_ID", trim(col("NUM_ID")))
  .withColumn("SIG1", round(col("SIG1"), 5))
  .select("NUM_ID", "TIME", "SIG1", "SIG2", "SIG3", "SIG4")

// write DF into CosmosDB
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.SaveMode
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._

val writeConfig = Config(Map(
  "Endpoint" -> "xxxxxxxx",
  "Masterkey" -> "xxxxxxxxxxx",
  "Database" -> "xxxxxxxxx",
  "Collection" -> "xxxxxxxxx",
  "preferredRegions" -> "xxxxxxxxx",
  "Upsert" -> "true"
))
toCosmosDF.write.mode(SaveMode.Append).cosmosDB(writeConfig)
Below is the data written into cosmosDB.
"SIG3": 48.32786,
"SIG2": 139.9061,
"TIME": 13000,
"NUM_ID": "X00030",
"id": "xxxxxxxxxxxx2a",
"SIG1": 35.79893,
"_rid": "xxxxxxxxxxxx",
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
{
"TIME": 75000,
"NUM_ID": "X00095",
"id": "xxxxxxxxxxxx2a",
"_rid": "xxxxxxxxxxxx",
"SIG4": 5860505,
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
{
"SIG3": 98.9562,
"SIG2": 8.75037,
"TIME": 43000,
"NUM_ID": "X00074",
"id": "xxxxxxxxxxxx2a",
"SIG4": 8014505,
"_rid": "xxxxxxxxxxxx",
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
The entries for columns that are null in the dataframe are missing from the CosmosDB documents.
The data written into CosmosDB does not keep the column order of the dataframe.
How can I resolve these two issues?
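One workaround sketch for the null issue (not from the original post, and assuming the SIG columns are numeric) is to substitute placeholder values before writing, so the keys at least appear in the documents:
// Hedged sketch only: replace nulls in the signal columns with a sentinel value
// before writing, so those fields are present in the emitted documents.
val withPlaceholders = toCosmosDF.na.fill(-9999.0, Seq("SIG1", "SIG2", "SIG3", "SIG4"))
withPlaceholders.write.mode(SaveMode.Append).cosmosDB(writeConfig)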

OrientDB ETL create edge using multiple fields in match criteria

I have some data that I'm tracking that looks something like this:
node.csv
Label1,Label2
Alpha,A
Alpha,B
Alpha,C
Bravo,A
Bravo,B
The pair Label1 and Label2 define a unique entry in this data set.
I have another table that has some values in it that I want to link to the vertices created in Table1:
data.csv
Label1,Label2,Data
Alpha,A,10
Alpha,A,20
Alpha,B,30
Bravo,A,99
I'd like to generate edges from entries in Data to Node when both Label1 and Label2 fields match in each.
In this case, I'd have:
Data(Alpha,A,10) ---> Node(Alpha,A)
Data(Alpha,A,20) ---> Node(Alpha,A)
Data(Alpha,B,30) ---> Node(Alpha,B)
Data(Bravo,A,99) ---> Node(Bravo,A)
In another question it appears that this issue gets solved by simply adding an extra "joinFieldName" entry into the json file, but I'm not getting the same result with my data.
My node.json file looks like:
{
  "config": { "log": "info" },
  "source": { "file": { "path": "./node.csv" } },
  "extractor": { "csv": {} },
  "transformers": [ { "vertex": { "class": "Node" } } ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:test.orientdb",
      "dbType": "graph",
      "batchCommit": 1000,
      "classes": [ { "name": "Node", "extends": "V" } ],
      "indexes": []
    }
  }
}
and my data.json file looks like this:
{
  "config": { "log": "info" },
  "source": { "file": { "path": "./data.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Data" } },
    { "edge": { "class": "Source",
        "joinFieldName": "Label1",
        "lookup": "Node.Label1",
        "joinFieldName": "Label2",
        "lookup": "Node.Label2",
        "direction": "in"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:test.orientdb",
      "dbType": "graph",
      "batchCommit": 1000,
      "classes": [ { "name": "Data", "extends": "V" },
                   { "name": "Source", "extends": "E" } ],
      "indexes": []
    }
  }
}
After I run these, I get this output when I query the result:
orientdb {db=test.orientdb}> SELECT FROM V
+----+-----+------+------+------+-------------------+----+-------------+
|# |#RID |#CLASS|Label1|Label2|out_Source |Data|in_Source |
+----+-----+------+------+------+-------------------+----+-------------+
|0 |#25:0|Node |Alpha |A |[#41:0,#43:0,#47:0]| | |
|1 |#26:0|Node |Alpha |B |[#45:0] | | |
|2 |#27:0|Node |Alpha |C | | | |
|3 |#28:0|Node |Bravo |A |[#42:0,#44:0,#48:0]| | |
|4 |#29:0|Node |Bravo |B |[#46:0] | | |
|5 |#33:0|Data |Alpha |A | |10 |[#41:0,#42:0]|
|6 |#34:0|Data |Alpha |A | |20 |[#43:0,#44:0]|
|7 |#35:0|Data |Alpha |B | |30 |[#45:0,#46:0]|
|8 |#36:0|Data |Bravo |A | |99 |[#47:0,#48:0]|
+----+-----+------+------+------+-------------------+----+-------------+
9 item(s) found. Query executed in 0.012 sec(s).
This is incorrect. I don't want Edges #42:0, #44:0, #46:0 and #47:0:
#42:0 connects Node(Bravo,A) and Data(Alpha,A)
#44:0 connects Node(Bravo,A) and Data(Alpha,A)
#46:0 connects Node(Bravo,B) and Data(Alpha,B)
#47:0 connects Node(Alpha,A) and Data(Bravo,A)
It looks like adding multiple joinFieldName entries in the transformer is resulting in an OR operation, but I'd like an 'AND' here.
Does anyone know how to fix this? I'm not sure what I'm doing differently than the other StackOverflow question...
After debugging the ETL code, I figured out a workaround. As you suspected, there is no way to make multiple joinFieldNames form a single edge; each joinFieldName creates its own edge.
What you can do is generate an extra column in the CSV file by concatenating "Label1" and "Label2", and use a lookup query in the edge transformation. For example, assume your data.csv has one extra field called label1_label2 whose values look like "label1====label2".
Your edge transformation should then contain the following:
{ "edge": { "class": "Source",
"joinFieldName": "label1_label2",
"lookup": "select expand(n) from (match {class: Node, as: n} return n) where n.Label1+'===='+n.Label2 = ?",
"direction": "in"
}
}
Don't forget to expand the vertex; otherwise ETL thinks it is a Document. The trick here is to write a single lookup query that concatenates multiple fields and to pass the matching concatenated joinFieldName.