Join data-frame based on value in list of WrappedArray - scala

I have to join two Spark DataFrames in Scala based on a custom function. Both DataFrames have the same schema.
Sample Row of data in DF1:
{
"F1" : "A",
"F2" : "B",
"F3" : "C",
"F4" : [
{
"name" : "N1",
"unit" : "none",
"count" : 50.0,
"sf1" : "val_1",
"sf2" : "val_2"
},
{
"name" : "N2",
"unit" : "none",
"count" : 100.0,
"sf1" : "val_3",
"sf2" : "val_4"
}
]
}
Sample Row of data in DF2:
{
"F1" : "A",
"F2" : "B",
"F3" : "C",
"F4" : [
{
"name" : "N1",
"unit" : "none",
"count" : 80.0,
"sf1" : "val_5",
"sf2" : "val_6"
},
{
"name" : "N2",
"unit" : "none",
"count" : 90.0,
"sf1" : "val_7",
"sf2" : "val_8"
},
{
"name" : "N3",
"unit" : "none",
"count" : 99.0,
"sf1" : "val_9",
"sf2" : "val_10"
}
]
}
RESULT of Joining these sample rows:
{
"F1" : "A",
"F2" : "B",
"F3" : "C",
"F4" : [
{
"name" : "N1",
"unit" : "none",
"count" : 80.0,
"sf1" : "val_5",
"sf2" : "val_6"
},
{
"name" : "N2",
"unit" : "none",
"count" : 100.0,
"sf1" : "val_3",
"sf2" : "val_4"
},
{
"name" : "N3",
"unit" : "none",
"count" : 99.0,
"sf1" : "val_9",
"sf2" : "val_10"
}
]
}
The result is:
full-outer-join based on value of "F1", "F2" and "F3" +
join of "F4" keeping unique nodes(use name as id) with max value of "count"
I am not very familiar with Scala and have been struggling with this for more than a day now. Here is what I have gotten to so far:
val df1 = sqlContext.read.parquet("stack_a.parquet")
val df2 = sqlContext.read.parquet("stack_b.parquet")
val df4 = df1.toDF(df1.columns.map(_ + "_A"):_*)
val df5 = df2.toDF(df2.columns.map(_ + "_B"):_*)
val df6 = df4.join(df5, df4("F1_A") === df5("F1_B") && df4("F2_A") === df5("F2_B") && df4("F3_A") === df5("F3_B"), "outer")
def joinFunction(r:Row) = {
//Need the real-deal here!
//print(r(3)) //-->Any = WrappedArray([..])
//also considering parsing as json to do the processing but not sure about the performance impact
//val parsed = JSON.parseFull(r.json) //then play with parsed
r.toSeq //
}
val finalResult = df6.rdd.map(joinFunction)
finalResult.collect
I was planning to add the custom merge logic in joinFunction but I am struggling to convert the WrappedArray/Any class to something I can work with.
Any inputs on how to do the conversion or the join in a better way will be very helpful.
Thanks!
Edit (7 Mar, 2021)
The full-outer join actually has to be performed only on "F1".
Hence, using @werner's answer, I am doing:
val df1_a = df1.toDF(df1.columns.map(_ + "_A"):_*)
val df2_b = df2.toDF(df2.columns.map(_ + "_B"):_*)
val finalResult = df1_a.join(df2_b, df1_a("F1_A") === df2_b("F1_B"), "full_outer")
.drop("F1_B")
.withColumn("F4", joinFunction(col("F4_A"), col("F4_B")))
.drop("F4_A", "F4_B")
.withColumn("F2", when(col("F2_A").isNull, col("F2_B")).otherwise(col("F2_A")))
.drop("F2_A", "F2_B")
.withColumn("F3", when(col("F3_A").isNull, col("F3_B")).otherwise(col("F3_A")))
.drop("F3_A", "F3_B")
But I am getting an error with this code. What am I missing?

You can implement the merge logic with the help of a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
//case class to define the schema of the udf's return value
case class F4(name: String, unit: String, count: Double, sf1: String, sf2: String)
val joinFunction = udf((a: Seq[Row], b: Seq[Row]) =>
  //after a full outer join either array can be null, so treat null as empty
  (Option(a).getOrElse(Seq.empty[Row]) ++ Option(b).getOrElse(Seq.empty[Row]))
    .map(r => F4(r.getAs[String]("name"),
      r.getAs[String]("unit"),
      r.getAs[Double]("count"),
      r.getAs[String]("sf1"),
      r.getAs[String]("sf2")))
    //group the elements from both arrays by name
    .groupBy(_.name)
    //take the element with the max count from each group
    .map { case (_, d) => d.maxBy(_.count) }
    .toSeq)
//join the two dataframes
val finalResult = df1.withColumnRenamed("F4", "F4_A").join(
  df2.withColumnRenamed("F4", "F4_B"), Seq("F1", "F2", "F3"), "full_outer")
  //call the merge function
  .withColumn("F4", joinFunction(col("F4_A"), col("F4_B")))
  //drop the intermediate columns
  .drop("F4_A", "F4_B")
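The merge rule itself (group by name, keep the entry with the highest count) can be checked without Spark. The following is a minimal plain-Scala sketch using the sample rows from the question; the names Node and mergeNodes are hypothetical and only serve the illustration, and the final sort by name is added just to make the output order predictable.

```scala
// Hypothetical stand-in for one element of the F4 array.
final case class Node(name: String, unit: String, count: Double, sf1: String, sf2: String)

// Merge two arrays: one node per name, keeping the node with the highest count.
// Either side may be null, as happens for unmatched rows in a full outer join.
def mergeNodes(a: Seq[Node], b: Seq[Node]): Seq[Node] =
  (Option(a).getOrElse(Seq.empty[Node]) ++ Option(b).getOrElse(Seq.empty[Node]))
    .groupBy(_.name)
    .values
    .map(_.maxBy(_.count))
    .toSeq
    .sortBy(_.name) // groupBy gives no ordering guarantee, so sort for a stable result

// F4 of the sample row in DF1
val f4A = Seq(
  Node("N1", "none", 50.0, "val_1", "val_2"),
  Node("N2", "none", 100.0, "val_3", "val_4"))

// F4 of the sample row in DF2
val f4B = Seq(
  Node("N1", "none", 80.0, "val_5", "val_6"),
  Node("N2", "none", 90.0, "val_7", "val_8"),
  Node("N3", "none", 99.0, "val_9", "val_10"))

val merged = mergeNodes(f4A, f4B)
```

On the sample data this keeps N1 from DF2 (count 80.0), N2 from DF1 (count 100.0) and N3 from DF2, which is the RESULT shown in the question.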

Related

Decoding a nested json using circe

Hi, I am trying to write a decoder for a nested JSON using circe in Scala 3 but can't quite figure out how. The JSON I have looks something like this:
[{
"id" : "something",
"clientId" : "something",
"name" : "something",
"rootUrl" : "something",
"baseUrl" : "something",
"surrogateAuthRequired" : something boolean,
"enabled" : something boolean,
"alwaysDisplayInConsole" : someBoolean,
"clientAuthenticatorType" : "client-secret",
"redirectUris" : [
"/realms/WISEMD_V2_TEST/account/*"
],
"webOrigins" : [
],
.
.
.
.
"protocolMappers" : [
{
"id" : "some Id",
"name" : "something",
"protocol" : "something",
"protocolMapper" : "something",
"consentRequired" : someBoolean,
"config" : {
"claim.value" : "something",
"userinfo.token.claim" : "someBoolean",
"id.token.claim" : "someBoolean",
"access.token.claim" : "someBoolean",
"claim.name" : "something",
"jsonType.label" : "something",
"access.tokenResponse.claim" : "something"
}
},
{
"id" : "some Id",
"name" : "something",
"protocol" : "something",
"protocolMapper" : "something",
"consentRequired" : someBoolean,
"config" : {
"claim.value" : "something",
"userinfo.token.claim" : "someBoolean",
"id.token.claim" : "someBoolean",
"access.token.claim" : "someBoolean",
"claim.name" : "something",
"jsonType.label" : "something",
"access.tokenResponse.claim" : "something"
},
.
.
}
],
}]
What I want is for my decoder to produce a list of protocolMappers with name and claim.value, something like List(ProtocolMappers("something", Configs("something")), ProtocolMappers("something", Configs("something"))).
The case class I have consists of just the needed keys and looks something like this:
case class ClientsResponse (
id: String,
clientId: String,
name: String,
enabled: Boolean,
alwaysDisplayInConsole: Boolean,
redirectUris: Seq[String],
directAccessGrantsEnabled: Boolean,
publicClient: Boolean,
access: Access,
protocolMappers : List[ProtocolMappers]
)
case class ProtocolMappers (
name: String,
config: Configs
)
case class Configs (
claimValue: String
)
And my decoder looks something like this:
given clientsDecoder: Decoder[ClientsResponse] = new Decoder[ClientsResponse] {
override def apply(x: HCursor) =
for {
id <- x.downField("id").as[Option[String]]
clientId <- x.downField("clientId").as[Option[String]]
name <- x.downField("name").as[Option[String]]
enabled <- x.downField("enabled").as[Option[Boolean]]
alwaysDisplayInConsole <- x
.downField("alwaysDisplayInConsole")
.as[Option[Boolean]]
redirectUris <- x.downField("redirectUris").as[Option[Seq[String]]]
directAccessGrantsEnabled <- x
.downField("directAccessGrantsEnabled")
.as[Option[Boolean]]
publicClient <- x.downField("publicClient").as[Option[Boolean]]
access <- x.downField("access").as[Option[Access]]
protocolMapper <- x.downField("protocolMappers").as[Option[List[ProtocolMappers]]]
} yield ClientsResponse(
id.getOrElse(""),
clientId.getOrElse(""),
name.getOrElse(""),
enabled.getOrElse(false),
alwaysDisplayInConsole.getOrElse(false),
redirectUris.getOrElse(Seq()),
directAccessGrantsEnabled.getOrElse(false),
publicClient.getOrElse(false),
access.getOrElse(Access(false, false, false)),
protocolMapper.getOrElse(List(ProtocolMappers("", Configs(""))))
)
}
given protocolMapperDecoder: Decoder[ProtocolMappers] = new Decoder[ProtocolMappers] {
override def apply(x: HCursor) =
for {
protocolName <- x.downField("protocolMappers").downField("name").as[Option[String]]
configs <- x.downField("protocolMappers").downField("config").as[Option[Configs]]
claimValue <- x.downField("protocolMappers").downField("config").downField("claim.value").as[Option[String]]
}yield ProtocolMappers(protocolName.getOrElse(""), configs.getOrElse(Configs("")))
}
given configsDecoder: Decoder[Configs] = new Decoder[Configs] {
override def apply(x: HCursor) =
for {
claimValue <- x.downField("protocolMappers").downField("config").downField("claim.value").as[Option[String]]
}yield Configs(claimValue.getOrElse(""))
}
But it just returns empty strings. Can you please help me with how to do this?

how to transform a nested mongodb table into spark dataframe

I have a nested MongoDB table whose documents look like this:
{
"_id" : "35228334dbd1090f6117c5a0011b56b0",
"brasidas" : [
{
"key" : "buy",
"value" : 859193
}
],
"crawl_time" : NumberLong(1526296211997),
"date" : "2018-05-11",
"id" : "44874f4c8c677087bcd5f829b2843e66",
"initNumber" : 0,
"repurchase" : 0,
"source_url" : "http://query.sse.com.cn/commonQuery.do?jsonCallBack=jQuery11120015170331124618408_1526262411932&isPagination=true&sqlId=COMMON_SSE_SCSJ_CJGK_ZQZYSHG_JYSLMX_L&beginDate&endDate&securityCode&pageHelp.pageNo=1&pageHelp.beginPage=1&pageHelp.cacheSize=1&pageHelp.endPage=1&pageHelp.pageSize=25",
"stockCode" : "600020",
"stockName" : "ZYGS",
"type" : "SSE"
}
I want to transform it into a Spark DataFrame and extract the "key" and "value" fields of "brasidas" as separate columns, like this:
initNumber repurchase key value stockName type date
50000 50000 buy 286698 shgf SSE 2015/3/30
But there is a problem with the "brasidas" field: it comes in three forms:
[{ "key" : "buy", "value" : 286698 }]
[{ "value" : 15311500, "key" : "buy_free" }, { "value" : 0, "key" : "buy_limited" }]
[{ "key" : "buy_free" }, { "key" : "buy_limited" }]
So when I define a StructType in Scala, it doesn't fit every document. I can only take "brasidas" as a single column and have failed to split it by "key". This is what I get:
initNumber repurchase brasidas stockName type date
50000 50000 [{ "key" : "buy", "value" : 286698 }] shgf SSE 2015/3/30
This is the code for getting mongodb document:
val readpledge =ReadConfig(Map("uri"-> (mongouri_beehive+".pledge")))
val pledge = getMongoDB.readCollection(sc, readpledge,"initNumber","repurchase","brasidas","stockName","type","date")
.selectExpr("cast(initNumber as int) initNumber", "cast(repurchase as int) repurchase","brasidas","stockName","type","date")
If you run df.printSchema(), you'll probably see that brasidas has an ArrayType, most likely an array of maps.
So I'd suggest implementing a UDF that takes the array as a parameter and transforms it the way you need:
def arrayProcess(arr: Seq[AnyRef]): Seq[AnyRef] = ???
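As a rough illustration of what such a transformation could do, here is a plain-Scala sketch that is independent of Spark and of the MongoDB connector. The types Entry, PledgeDoc and PledgeRow and the function flattenBrasidas are hypothetical names introduced for this example; the missing "value" in the third form is modelled as an Option. It produces one output row per array entry, which is the shape the question asks for.

```scala
// Hypothetical model of one "brasidas" entry; value is absent in the third form.
final case class Entry(key: String, value: Option[Long])

// Hypothetical, trimmed-down document and output row (only stockName is carried over).
final case class PledgeDoc(stockName: String, brasidas: Seq[Entry])
final case class PledgeRow(stockName: String, key: String, value: Option[Long])

// One output row per array entry; a null or empty array yields no rows.
def flattenBrasidas(doc: PledgeDoc): Seq[PledgeRow] =
  Option(doc.brasidas).getOrElse(Seq.empty[Entry])
    .map(e => PledgeRow(doc.stockName, e.key, e.value))

// The three forms from the question, attached to an illustrative stock name.
val docs = Seq(
  PledgeDoc("shgf", Seq(Entry("buy", Some(286698L)))),
  PledgeDoc("shgf", Seq(Entry("buy_free", Some(15311500L)), Entry("buy_limited", Some(0L)))),
  PledgeDoc("shgf", Seq(Entry("buy_free", None), Entry("buy_limited", None))))

val rows = docs.flatMap(flattenBrasidas)
```

Inside Spark the same idea would have to be expressed against the actual inferred schema, which is why checking printSchema() first matters.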

Updating Mongo documents

I would like to update a collection by transforming all documents from this form:
{
"_id" : "somestring i made",
"value" : {
"a" : 0.42361499999999996,
"b" : 3,
"c" : "foo",
"d" : "bar"
}
}
To this form (with new id's):
{
"_id" : ObjectId("77d987f6dsf6f76sa7676df"),
"a" : 0.42361499999999996,
"b" : 3,
"c" : "foo",
"d" : "bar"
}
So essentially take the fields out of the object "value" and reset the id to a real document id.
First get the document, convert it to the required format, remove the old document, and then insert the modified one.
Something like:
db.collection.find({}).forEach(function(doc){
var obj = { a : doc.value.a,
b : doc.value.b,
c : doc.value.c,
d : doc.value.d};
db.collection.remove(doc);
db.collection.insert(obj);
});

Casbah cast from BasicDBObject to my type

I have a collection in the database that looks like below:
Question
{
"_id" : ObjectId("52b3248a43fa7cd2bc4a2d6f"),
"id" : 1001,
"text" : "Which is a valid java access modifier?",
"questype" : "RADIO_BUTTON",
"issourcecode" : true,
"sourcecodename" : "sampleques",
"examId" : 1000,
"answers" : [
{
"id" : 1,
"text" : "private",
"isCorrectAnswer" : true
},
{
"id" : 2,
"text" : "personal",
"isCorrectAnswer" : false
},
{
"id" : 3,
"text" : "protect",
"isCorrectAnswer" : false
},
{
"id" : 4,
"text" : "publicize",
"isCorrectAnswer" : false
}
]
}
I have case classes that represent the Question and the Answer. The Question case class has a List of Answer objects. I tried to convert the DBObject returned by the find operation to my Answer type:
def toList[T](dbObj: DBObject, key: String): List[T] =
(List[T]() ++ dbObj(key).asInstanceOf[BasicDBList]) map { _.asInstanceOf[T]}
The result of the above operation when I call it like
toList[Answer](dbObj, "answers") map {y => Answer(y.id,y.text, y.isCorrectAnswer)}
fails with the following exception:
com.mongodb.BasicDBObject cannot be cast to domain.content.Answer
Why should it fail? Is there a way to convert the DBObject to Answer type that I want?
You have to retrieve values from BasicDBObject, cast them and then populate the Answer class:
Answer class:
case class Answer(id:Int,text:String,isCorrectAnswer:Boolean)
toList, changed to return List[Answer]:
def toList(dbObj: DBObject, key: String): List[Answer] = dbObj.get(key).asInstanceOf[BasicDBList].map { o=>
Answer(
o.asInstanceOf[BasicDBObject].as[Int]("id"),
o.asInstanceOf[BasicDBObject].as[String]("text"),
o.asInstanceOf[BasicDBObject].as[Boolean]("isCorrectAnswer")
)
}.toList
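As for why the original generic version fails: on the JVM the type parameter T is erased, so asInstanceOf[T] inside toList[T] checks nothing at runtime, and the ClassCastException only appears later, when an element is actually used as an Answer. The following plain-Scala sketch shows this without Casbah; a Map[String, Any] stands in for BasicDBObject, and unsafeToList and toAnswer are hypothetical names for the illustration.

```scala
final case class Answer(id: Int, text: String, isCorrectAnswer: Boolean)

// Generic cast: T is erased at runtime, so asInstanceOf[T] performs no check here.
def unsafeToList[T](raw: List[Any]): List[T] = raw.map(_.asInstanceOf[T])

// Explicit conversion: read each field, cast it, and build the case class.
def toAnswer(doc: Map[String, Any]): Answer =
  Answer(
    doc("id").asInstanceOf[Int],
    doc("text").asInstanceOf[String],
    doc("isCorrectAnswer").asInstanceOf[Boolean])

// Stand-in for the "answers" list of the stored document.
val raw: List[Map[String, Any]] = List(
  Map("id" -> 1, "text" -> "private", "isCorrectAnswer" -> true),
  Map("id" -> 2, "text" -> "personal", "isCorrectAnswer" -> false))

// The cast itself goes through; the elements are still Maps.
val casted: List[Answer] = unsafeToList[Answer](raw)

// Using an element as Answer is what triggers the ClassCastException.
val castAttempt = scala.util.Try(casted.head.text)

// Building the objects field by field works.
val converted: List[Answer] = raw.map(toAnswer)
```

This is the same reason the answer above populates Answer from the individual fields instead of casting the whole object.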

Mongodb map reduce trivial query

I have the following map function:
var mapFunction = function() {
    if (this.url.match(/http:\/\/test.com\/category\/.*?\/checkout/)) {
        var key = this.em;
        var value = {
            url : 'checkout',
            count : 1,
            account_id : this.accId
        };
        emit(key, value);
    }
    if (this.url.match(/http:\/\/test.com\/landing/)) {
        var key = this.em;
        var value = {
            url : 'landing',
            count : 1,
            account_id : this.accId
        };
        emit(key, value);
    }
}
Then I have defined the reduce function like this:
var reduceFunction = function (keys, values) {
var reducedValue = {count_checkout:0, count_landing:0};
for (var idx = 0; idx < values.length; idx++) {
if(values[idx].url=='checkout'){
reducedValue.count_checkout++;
}
else {
reducedValue.count_landing++;
}
}
return reducedValue;
}
Now, let's say I have only 1 record:
{
"_id" : ObjectId("516a7cff6dad5949ddf3f7b6"),
"ip" : "1.2.3.4",
"accId" : 123,
"em" : "testing#test.com",
"pgLdTs" : ISODate("2013-04-11T18:30:00Z"),
"url" : "http://test.com/category/prr/checkout",
"domain" : "www.test.com",
"pgUdTs" : ISODate("2013-04-14T09:55:11.682Z"),
"title" : "Test",
"ua" : "Mozilla",
"res" : "1024*768",
"rfr" : "www.google.com"
}
Now if I fire my map reduce like below:
db.test_views.mapReduce(mapFunction,reduceFunction,{out:{inline:1}})
Then I get the following result:
{
"_id" : "testing#test.com",
"value" : {
"url" : "checkout",
"count" : 1,
"account_id" : 123
}
}
So it's basically returning the value emitted by the map function. Now, if I add another document for this email id, the collection looks like this:
{
"_id" : ObjectId("516a7cff6dad5949ddf3f7b6"),
"ip" : "1.2.3.4",
"accId" : 123,
"em" : "testing#test.com",
"pgLdTs" : ISODate("2013-04-11T18:30:00Z"),
"url" : "http://test.com/category/prr/checkout",
"domain" : "www.test.com",
"pgUdTs" : ISODate("2013-04-14T09:55:11.682Z"),
"title" : "Test",
"ua" : "Mozilla",
"res" : "1024*768",
"rfr" : "www.google.com"
}
{
"_id" : ObjectId("516a7e1b6dad5949ddf3f7b7"),
"ip" : "1.2.3.4",
"accId" : 123,
"em" : "testing#test.com",
"pgLdTs" : ISODate("2013-04-11T18:30:00Z"),
"url" : "http://test.com/category/prr/checkout",
"domain" : "www.test.com",
"pgUdTs" : ISODate("2013-04-14T09:59:55.326Z"),
"title" : "Test",
"ua" : "Mozilla",
"res" : "1024*768",
"rfr" : "www.google.com"
}
Then, when I run the map-reduce again, it gives me the proper result:
{
"_id" : "testing#test.com",
"value" : {
"count_checkout" : 2,
"count_landing" : 0
}
}
Can anyone please help me understand why it returns the map output for a single document and doesn't do the counting in reduce?
Thanks for the help.
-Lalit
Can anyone please help me understand why it returns the map output for a single document and doesn't do the counting in reduce?
The reduce step combines the values emitted for the same key into a single result. If a key has only one value emitted by your map function, that value is already "reduced" and reduce() will not be called for it.
This is the expected behaviour of the MapReduce algorithm.
The reduce function should return the same type of value objects as the map function emits.
As you've seen, when there's a single value associated with a key, the reduce function will not be called at all.
From the MongoDB MapReduce Documentation:
Requirements for the reduce Function:
...
the type of the return object must be identical to the type of the value emitted by the map function to ensure that the following operations is true:
reduce(key, [ C, reduce(key, [ A, B ]) ] ) == reduce( key, [ C, A, B ] )
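To make the requirement concrete, here is a small model of the contract written in plain Scala rather than in the mongo shell. It is only a sketch: Counts, mapDoc, reduceCounts and runMapReduce are hypothetical names, and the toy engine merely imitates the behaviour described above (reduce is skipped for keys with a single value). The key point is that map already emits values in the final shape, so the result looks the same whether or not reduce runs.

```scala
// Value shape shared by the map output and the reduce output.
final case class Counts(checkout: Int, landing: Int)

// URL patterns from the question, with the dots escaped.
val Checkout = "http://test\\.com/category/.*?/checkout".r
val Landing = "http://test\\.com/landing".r

// map step: emit (em, Counts) pairs that are already in the reduced shape.
def mapDoc(em: String, url: String): Seq[(String, Counts)] = {
  val checkout = if (Checkout.findFirstIn(url).isDefined) Seq(em -> Counts(1, 0)) else Seq.empty
  val landing = if (Landing.findFirstIn(url).isDefined) Seq(em -> Counts(0, 1)) else Seq.empty
  checkout ++ landing
}

// reduce step: sums the counters, so its input and output have the same type.
def reduceCounts(values: Seq[Counts]): Counts =
  values.reduce((x, y) => Counts(x.checkout + y.checkout, x.landing + y.landing))

// Toy engine: skips reduce for keys that have a single emitted value.
def runMapReduce(docs: Seq[(String, String)]): Map[String, Counts] =
  docs.flatMap { case (em, url) => mapDoc(em, url) }
    .groupBy(_._1)
    .map { case (key, pairs) =>
      val values = pairs.map(_._2)
      key -> (if (values.size == 1) values.head else reduceCounts(values))
    }

val oneDoc = runMapReduce(Seq(
  "testing#test.com" -> "http://test.com/category/prr/checkout"))

val twoDocs = runMapReduce(Seq(
  "testing#test.com" -> "http://test.com/category/prr/checkout",
  "testing#test.com" -> "http://test.com/category/prr/checkout"))
```

With this shape, one document yields a checkout count of 1 and two documents yield 2, and re-reducing partial results gives the same answer as reducing everything at once, which is what the quoted identity demands.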