Spark fails to parse field from elasticsearch - scala

I'm trying to read data from Elasticsearch into Spark using Scala.
The problematic field in Elasticsearch is defined as an object.
When I tried without a schema I got an empty object; with a schema I get an exception.
The object in ES looks like:
{
  "field1" : "value1",
  "field2" : "value2",
  "field3" : "value",
  "field4" : true,
  "field5" : true,
  "field6" : true,
  "field7" : "value7",
  "field8" : "value8",
  "field9" : "value9",
  "field10" : "value10"
},
...
,
{
  "field1" : "value1",
  "field2" : "value2",
  "field3" : "value",
  "field4" : true,
  "field5" : true,
  "field6" : true,
  "field7" : "value7",
  "field8" : "value8",
  "field9" : "value9",
  "field10" : "value10"
}
I tried a lot of different options. For example:
case class Element(
  field1: String,
  field2: String,
  field3: String,
  field4: Boolean,
  field5: Boolean,
  field6: Boolean,
  field7: String,
  field8: String,
  field9: String,
  field10: String
)
case class Elements(innerElement: Array[Element])

val elementsSchema = ScalaReflection.schemaFor[Elements].dataType.asInstanceOf[StructType]

val customSchema = StructType(Array(
  StructField("docField1", StringType, true),
  StructField("docField2", StringType, true),
  ...
  StructField("docField20", elementsSchema, true),
  StructField("docField21", StringType, true)
))
Exception:
Caused by: java.lang.RuntimeException: scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of struct<Element:array<struct<field1:string,field2:string,field3:string,field4:boolean,field5:boolean,field6:boolean,field7:string,field8:string,field9:string,field10:string>>>
I tried with .option("es.read.field.as.array.include", "elements").
I also tried different APIs: spark.read.format("org.elasticsearch.spark.sql") and sc.esRDD.
I would like some advice, thanks.
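From the exception it looks like the connector hands that field back as a plain Java list (hence the JListWrapper), while the declared schema expects a struct that wraps an array. A hedged sketch of an alternative, assuming the ES field is actually an array of those objects and is called docField20 (the field names, index name and read options below are placeholders, not confirmed details):

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._

// reuse the Element case class from above and map the ES field straight to array<struct<...>>
val elementSchema = ScalaReflection.schemaFor[Element].dataType.asInstanceOf[StructType]

val customSchema = StructType(Array(
  StructField("docField1", StringType, true),
  // ... other fields ...
  StructField("docField20", ArrayType(elementSchema), true), // array of structs, not a struct wrapping an array
  StructField("docField21", StringType, true)
))

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .schema(customSchema)
  .option("es.read.field.as.array.include", "docField20") // the real ES field name, not the case class name
  .load("index_name") // placeholder resource

The key difference from the attempt above is that the array field is declared as ArrayType(elementSchema) instead of a struct containing an array, which is what the error message seems to complain about.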

Related

Decoding a nested json using circe

Hi, I am trying to write a decoder for a nested JSON using circe in Scala 3 but can't quite figure out how. The JSON I have looks something like this:
[{
  "id" : "something",
  "clientId" : "something",
  "name" : "something",
  "rootUrl" : "something",
  "baseUrl" : "something",
  "surrogateAuthRequired" : someBoolean,
  "enabled" : someBoolean,
  "alwaysDisplayInConsole" : someBoolean,
  "clientAuthenticatorType" : "client-secret",
  "redirectUris" : [
    "/realms/WISEMD_V2_TEST/account/*"
  ],
  "webOrigins" : [ ],
  ...
  "protocolMappers" : [
    {
      "id" : "some Id",
      "name" : "something",
      "protocol" : "something",
      "protocolMapper" : "something",
      "consentRequired" : someBoolean,
      "config" : {
        "claim.value" : "something",
        "userinfo.token.claim" : "someBoolean",
        "id.token.claim" : "someBoolean",
        "access.token.claim" : "someBoolean",
        "claim.name" : "something",
        "jsonType.label" : "something",
        "access.tokenResponse.claim" : "something"
      }
    },
    {
      "id" : "some Id",
      "name" : "something",
      "protocol" : "something",
      "protocolMapper" : "something",
      "consentRequired" : someBoolean,
      "config" : {
        "claim.value" : "something",
        "userinfo.token.claim" : "someBoolean",
        "id.token.claim" : "someBoolean",
        "access.token.claim" : "someBoolean",
        "claim.name" : "something",
        "jsonType.label" : "something",
        "access.tokenResponse.claim" : "something"
      }
    },
    ...
  ],
  ...
}]
What I want is for my decoder to produce a list of protocolMappers with name and claim.value, something like List(ProtocolMappers("something", Configs("something")), ProtocolMappers("something", Configs("something"))).
The case classes I have consist of just the needed keys and look something like this:
case class ClientsResponse(
  id: String,
  clientId: String,
  name: String,
  enabled: Boolean,
  alwaysDisplayInConsole: Boolean,
  redirectUris: Seq[String],
  directAccessGrantsEnabled: Boolean,
  publicClient: Boolean,
  access: Access,
  protocolMappers: List[ProtocolMappers]
)

case class ProtocolMappers(
  name: String,
  config: Configs
)

case class Configs(
  claimValue: String
)
And my decoders look something like this:
given clientsDecoder: Decoder[ClientsResponse] = new Decoder[ClientsResponse] {
  override def apply(x: HCursor) =
    for {
      id <- x.downField("id").as[Option[String]]
      clientId <- x.downField("clientId").as[Option[String]]
      name <- x.downField("name").as[Option[String]]
      enabled <- x.downField("enabled").as[Option[Boolean]]
      alwaysDisplayInConsole <- x.downField("alwaysDisplayInConsole").as[Option[Boolean]]
      redirectUris <- x.downField("redirectUris").as[Option[Seq[String]]]
      directAccessGrantsEnabled <- x.downField("directAccessGrantsEnabled").as[Option[Boolean]]
      publicClient <- x.downField("publicClient").as[Option[Boolean]]
      access <- x.downField("access").as[Option[Access]]
      protocolMapper <- x.downField("protocolMappers").as[Option[List[ProtocolMappers]]]
    } yield ClientsResponse(
      id.getOrElse(""),
      clientId.getOrElse(""),
      name.getOrElse(""),
      enabled.getOrElse(false),
      alwaysDisplayInConsole.getOrElse(false),
      redirectUris.getOrElse(Seq()),
      directAccessGrantsEnabled.getOrElse(false),
      publicClient.getOrElse(false),
      access.getOrElse(Access(false, false, false)),
      protocolMapper.getOrElse(List(ProtocolMappers("", Configs(""))))
    )
}

given protocolMapperDecoder: Decoder[ProtocolMappers] = new Decoder[ProtocolMappers] {
  override def apply(x: HCursor) =
    for {
      protocolName <- x.downField("protocolMappers").downField("name").as[Option[String]]
      configs <- x.downField("protocolMappers").downField("config").as[Option[Configs]]
      claimValue <- x.downField("protocolMappers").downField("config").downField("claim.value").as[Option[String]]
    } yield ProtocolMappers(protocolName.getOrElse(""), configs.getOrElse(Configs("")))
}

given configsDecoder: Decoder[Configs] = new Decoder[Configs] {
  override def apply(x: HCursor) =
    for {
      claimValue <- x.downField("protocolMappers").downField("config").downField("claim.value").as[Option[String]]
    } yield Configs(claimValue.getOrElse(""))
}
But it just returns empty strings. Can you please help me with how to do this?
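One likely cause (a guess based on the symptom): inside protocolMapperDecoder and configsDecoder the cursor already points at a single array element, so x.downField("protocolMappers") finds nothing, every as[Option[...]] becomes None, and the getOrElse("") fallbacks produce the empty strings. A minimal sketch of decoders that read fields relative to the element itself:

given configsDecoder: Decoder[Configs] = new Decoder[Configs] {
  override def apply(x: HCursor) =
    for {
      // the cursor is already at the "config" object, so read "claim.value" directly
      claimValue <- x.downField("claim.value").as[Option[String]]
    } yield Configs(claimValue.getOrElse(""))
}

given protocolMapperDecoder: Decoder[ProtocolMappers] = new Decoder[ProtocolMappers] {
  override def apply(x: HCursor) =
    for {
      // the cursor is already at one element of the "protocolMappers" array
      name   <- x.downField("name").as[Option[String]]
      config <- x.downField("config").as[Option[Configs]]
    } yield ProtocolMappers(name.getOrElse(""), config.getOrElse(Configs("")))
}

With these in scope, the existing x.downField("protocolMappers").as[Option[List[ProtocolMappers]]] line in clientsDecoder should then fill the list, because each element decoder reads name and config from the element itself.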

Join data-frame based on value in list of WrappedArray

I have to join two Spark DataFrames in Scala based on a custom function. Both DataFrames have the same schema.
Sample Row of data in DF1:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 50.0,
      "sf1" : "val_1",
      "sf2" : "val_2"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 100.0,
      "sf1" : "val_3",
      "sf2" : "val_4"
    }
  ]
}
Sample Row of data in DF2:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 80.0,
      "sf1" : "val_5",
      "sf2" : "val_6"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 90.0,
      "sf1" : "val_7",
      "sf2" : "val_8"
    },
    {
      "name" : "N3",
      "unit" : "none",
      "count" : 99.0,
      "sf1" : "val_9",
      "sf2" : "val_10"
    }
  ]
}
RESULT of Joining these sample rows:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : "C",
  "F4" : [
    {
      "name" : "N1",
      "unit" : "none",
      "count" : 80.0,
      "sf1" : "val_5",
      "sf2" : "val_6"
    },
    {
      "name" : "N2",
      "unit" : "none",
      "count" : 100.0,
      "sf1" : "val_3",
      "sf2" : "val_4"
    },
    {
      "name" : "N3",
      "unit" : "none",
      "count" : 99.0,
      "sf1" : "val_9",
      "sf2" : "val_10"
    }
  ]
}
The result is:
a full outer join based on the values of "F1", "F2" and "F3", plus
a merge of "F4" that keeps unique entries (using name as the id) with the maximum value of "count".
I am not very familiar with Scala and have been struggling with this for more than a day now. Here is what I have gotten to so far:
val df1 = sqlContext.read.parquet("stack_a.parquet")
val df2 = sqlContext.read.parquet("stack_b.parquet")

val df4 = df1.toDF(df1.columns.map(_ + "_A"): _*)
val df5 = df2.toDF(df1.columns.map(_ + "_B"): _*)
val df6 = df4.join(df5, df4("F1_A") === df5("F1_B") && df4("F2_A") === df5("F2_B") && df4("F3_A") === df5("F3_B"), "outer")

def joinFunction(r: Row) = {
  //Need the real-deal here!
  //print(r(3)) //-->Any = WrappedArray([..])
  //also considering parsing as json to do the processing but not sure about the performance impact
  //val parsed = JSON.parseFull(r.json) //then play with parsed
  r.toSeq
}

val finalResult = df6.rdd.map(joinFunction)
finalResult.collect
I was planning to add the custom merge logic in joinFunction but I am struggling to convert the WrappedArray/Any class to something I can work with.
Any inputs on how to do the conversion or the join in a better way will be very helpful.
Thanks!
Edit (7 Mar, 2021)
The full-outer join actually has to be performed only on "F1".
Hence, using #werner's answer, I am doing:
val df1_a = df1.toDF(df1.columns.map(_ + "_A"):_*)
val df2_b = df2.toDF(df2.columns.map(_ + "_B"):_*)
val finalResult = df1_a.join(df2_b, df1_a("F1_A") === df2_b("F1_B"), "full_outer")
.drop("F1_B")
.withColumn("F4", joinFunction(col("F4_A"), col("F4_B")))
.drop("F4_A", "F4_B")
.withColumn("F2", when(col("F2_A").isNull, col("F2_B")).otherwise(col("F2_A")))
.drop("F2_A", "F2_B")
.withColumn("F3", when(col("F3_A").isNull, col("F3_B")).otherwise(col("F3_A")))
.drop("F3_A", "F3_B")
But I am getting this error. What am I missing..?
You can implement the merge logic with the help of a udf:
//case class to define the schema of the udf's return value
case class F4(name: String, unit: String, count: Double, sf1: String, sf2: String)

val joinFunction = udf((a: Seq[Row], b: Seq[Row]) =>
  (a ++ b).map(r => F4(r.getAs[String]("name"),
      r.getAs[String]("unit"),
      r.getAs[Double]("count"),
      r.getAs[String]("sf1"),
      r.getAs[String]("sf2")))
    //group the elements from both arrays by name
    .groupBy(_.name)
    //take the element with the max count from each group
    .map { case (_, d) => d.maxBy(_.count) }
    .toSeq)

//join the two dataframes
val finalResult = df1.withColumnRenamed("F4", "F4_A").join(
    df2.withColumnRenamed("F4", "F4_B"), Seq("F1", "F2", "F3"), "full_outer")
  //call the merge function
  .withColumn("F4", joinFunction('F4_A, 'F4_B))
  //drop the intermediate columns
  .drop("F4_A", "F4_B")

Assign SQL schema to Spark DataFrame

I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.
This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?
create_table_sql = '''
CREATE TABLE public.example (
id LONG,
example VARCHAR(80)
)'''
spark.sql(create_table_sql)
schema = spark.sql("DESCRIBE public.example").collect()
s3_data = spark.read.\
option("delimiter", "|")\
.csv(
path="s3a://"+s3_bucket_path,
schema=schema
)\
.saveAsTable('public.example')
Yes, there is a way to create a schema from a string, although I am not sure it really looks like SQL! You can use:
from pyspark.sql.types import _parse_datatype_string
_parse_datatype_string("id: long, example: string")
This will create the following schema:
StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))
Or you may have a complex schema as well:
schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")
StructType(List(
  StructField(customers, ArrayType(
    StructType(List(
      StructField(id,LongType,true),
      StructField(name,StringType,true),
      StructField(address,StringType,true)
    )),
  true), true)
))
You can check for more examples here
Adding to what has already been said, making a schema (e.g. StructType-based or JSON) is more straightforward in Scala Spark than in PySpark:
> import org.apache.spark.sql.types.StructType
> val s = StructType.fromDDL("customers array<struct<id: long, name: string, address: string>>")
> s
res3: org.apache.spark.sql.types.StructType = StructType(StructField(customers,ArrayType(StructType(StructField(id,LongType,true),StructField(name,StringType,true),StructField(address,StringType,true)),true),true))
> s.prettyJson
res9: String =
{
  "type" : "struct",
  "fields" : [ {
    "name" : "customers",
    "type" : {
      "type" : "array",
      "elementType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "id",
          "type" : "long",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "name",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "address",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        } ]
      },
      "containsNull" : true
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
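If the goal is to feed such a schema into a read, the DDL string can be used for that as well. A Scala sketch (the delimiter and S3 path are placeholders carried over from the question):

import org.apache.spark.sql.types.StructType

// build the schema from a DDL string and use it for a CSV read
val schema = StructType.fromDDL("id LONG, example STRING")
val df = spark.read
  .option("delimiter", "|")
  .schema(schema)          // on Spark 2.3+ .schema("id LONG, example STRING") also works directly
  .csv("s3a://some-bucket/some-path")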

How to update a subdocument in mongodb

I know the question has been asked many times, but I can't figure out how to update a subdocument in MongoDB.
Here's my Schema:
// Schemas
var ContactSchema = new mongoose.Schema({
  first: String,
  last: String,
  mobile: String,
  home: String,
  office: String,
  email: String,
  company: String,
  description: String,
  keywords: []
});

var UserSchema = new mongoose.Schema({
  email: {
    type: String,
    unique: true,
    required: true
  },
  password: {
    type: String,
    required: true
  },
  contacts: [ContactSchema]
});
My collection looks like this:
db.users.find({}).pretty()
{
  "_id" : ObjectId("5500b5b8908520754a8c2420"),
  "email" : "test#random.org",
  "password" : "$2a$08$iqSTgtW27TLeBSUkqIV1SeyMyXlnbj/qavRWhIKn3O2qfHOybN9uu",
  "__v" : 8,
  "contacts" : [
    {
      "first" : "Jessica",
      "last" : "Vento",
      "_id" : ObjectId("550199b1fe544adf50bc291d"),
      "keywords" : [ ]
    },
    {
      "first" : "Tintin",
      "last" : "Milou",
      "_id" : ObjectId("550199c6fe544adf50bc291e"),
      "keywords" : [ ]
    }
  ]
}
Say I want to update the subdocument with id 550199c6fe544adf50bc291e by doing:
db.users.update({_id: ObjectId("5500b5b8908520754a8c2420"), "contacts._id": ObjectId("550199c6fe544adf50bc291e")}, myNewDocument)
with myNewDocument like:
{ "_id" : ObjectId("550199b1fe544adf50bc291d"), "first" : "test" }
It returns an error:
db.users.update({_id: ObjectId("5500b5b8908520754a8c2420"), "contacts._id": ObjectId("550199c6fe544adf50bc291e")}, myNewdocument)
WriteResult({
  "nMatched" : 0,
  "nUpserted" : 0,
  "nModified" : 0,
  "writeError" : {
    "code" : 16837,
    "errmsg" : "The _id field cannot be changed from {_id: ObjectId('5500b5b8908520754a8c2420')} to {_id: ObjectId('550199b1fe544adf50bc291d')}."
  }
})
I understand that mongo tries to replace the parent document and not the subdocument, but in the end, I don't know how to update my subdocument.
You need to use the positional $ operator to update a subdocument in an array.
Using contacts.$ will point MongoDB to the relevant subdocument.
db.users.update(
  {_id: ObjectId("5500b5b8908520754a8c2420"), "contacts._id": ObjectId("550199c6fe544adf50bc291e")},
  {"$set": {"contacts.$": myNewDocument}}
)
I am not sure why you are changing the _id of the subdocument. That is not advisable.
If you want to change a particular field of the subdocument, use contacts.$.<field_name> to update just that field.
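For example, to change only the first name of the matched contact (same ids as in the question):

db.users.update(
  {_id: ObjectId("5500b5b8908520754a8c2420"), "contacts._id": ObjectId("550199c6fe544adf50bc291e")},
  {"$set": {"contacts.$.first": "test"}}
)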

Casbah cast from BasicDBObject to my type

I have a collection in the database that looks like below:
Question
{
  "_id" : ObjectId("52b3248a43fa7cd2bc4a2d6f"),
  "id" : 1001,
  "text" : "Which is a valid java access modifier?",
  "questype" : "RADIO_BUTTON",
  "issourcecode" : true,
  "sourcecodename" : "sampleques",
  "examId" : 1000,
  "answers" : [
    {
      "id" : 1,
      "text" : "private",
      "isCorrectAnswer" : true
    },
    {
      "id" : 2,
      "text" : "personal",
      "isCorrectAnswer" : false
    },
    {
      "id" : 3,
      "text" : "protect",
      "isCorrectAnswer" : false
    },
    {
      "id" : 4,
      "text" : "publicize",
      "isCorrectAnswer" : false
    }
  ]
}
I have case classes that represent both the Question and the Answer. The Question case class has a List of Answer objects. I tried converting the result of the find operation from DBObject to my Answer type:
def toList[T](dbObj: DBObject, key: String): List[T] =
  (List[T]() ++ dbObj(key).asInstanceOf[BasicDBList]) map { _.asInstanceOf[T] }

When I call it like
toList[Answer](dbObj, "answers") map { y => Answer(y.id, y.text, y.isCorrectAnswer) }
it fails with the following exception:
com.mongodb.BasicDBObject cannot be cast to domain.content.Answer
Why does it fail? Is there a way to convert the DBObject to the Answer type that I want?
You have to retrieve the values from the BasicDBObject, cast them, and then populate the Answer class:
Answer class:
case class Answer(id:Int,text:String,isCorrectAnswer:Boolean)
toList, which I changed to return List[Answer]:
// requires a Java-to-Scala collection conversion in scope,
// e.g. import scala.collection.JavaConversions._ together with Casbah's Imports._
def toList(dbObj: DBObject, key: String): List[Answer] =
  dbObj.get(key).asInstanceOf[BasicDBList].map { o =>
    Answer(
      o.asInstanceOf[BasicDBObject].as[Int]("id"),
      o.asInstanceOf[BasicDBObject].as[String]("text"),
      o.asInstanceOf[BasicDBObject].as[Boolean]("isCorrectAnswer")
    )
  }.toList
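A usage sketch (hypothetical names; it assumes questions is a Casbah MongoCollection and the conversions mentioned above are imported):

// look up one question document and extract its answers
questions.findOne(MongoDBObject("id" -> 1001)).foreach { dbObj =>
  val answers: List[Answer] = toList(dbObj, "answers")
  println(answers)
}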