I almost have what I need, but the method that converts to a JSON object does not give me the missing piece. I want the same output, except that there will be more content inside "add" and inside "firsts", so I need both to be arrays of objects.
My code:
case class FirstIdentity(docType: String, docNumber: String, pId: String)
case class SecondIdentity(firm: String, code: String, orgType: String,
  orgNumber: String, typee: String, perms: Seq[String])
case class General(id: Int, pName: String, description: String, add: Seq[SecondIdentity],
  delete: Seq[String], act: String, firsts: Seq[FirstIdentity])
val someDF = Seq(
("0010XR_TYPE_6","0010XR", "222222", "6", "TYPE", "77444478", "6", 123, 1, "PF 1", "name", "description",
Seq("PERM1", "PERM2"))
).toDF("firm", "code", "org_number", "org_type", "type", "doc_number",
"doc_type", "id", "p_id", "p_name", "name", "description", "perms")
someDF.createOrReplaceTempView("vw_test")
val filter = spark.sql("""
select
firm, code, org_number, org_type, type, doc_number,
doc_type, id, p_id, p_name, name, description, perms
from vw_test
""")
val group =
filter.rdd.map(x => {
(
x.getInt(x.fieldIndex("id")),
x.getString(x.fieldIndex("p_name")),
x.getString(x.fieldIndex("description")),
SecondIdentity(
x.getString(x.fieldIndex("firm")),
x.getString(x.fieldIndex("code")),
x.getString(x.fieldIndex("org_type")),
x.getString(x.fieldIndex("org_number")),
x.getString(x.fieldIndex("type")),
x.getSeq(x.fieldIndex("perms"))
),
"act",
FirstIdentity(
x.getString(x.fieldIndex("doc_number")),
x.getString(x.fieldIndex("doc_type")),
x.getInt(x.fieldIndex("p_id")).toString
)
)
})
.toDF("id", "name", "desc", "add", "actKey", "firsts")
.groupBy("id", "name", "desc", "add", "actKey", "firsts")
.agg(collect_list("add").as("null"))
.drop("null")
group.toJSON.show(false)
result:
{
"id": 123,
"name": "PF 1",
"desc": "description",
"add": {
"firm": "0010XR_TYPE_6",
"code": "0010XR",
"orgType": "6",
"orgNumber": "222222",
"typee": "TYPE",
"perms": [
"PERM1",
"PERM2"
]
},
"actKey": "act",
"firsts": {
"docType": "77444478",
"docNumber": "6",
"pId": "1"
}
}
I want "add" and "firsts" to be arrays, like this (edited):
{
"id": 123,
"name": "PF 1",
"desc": "description",
"add": [ <----
{
"firm": "0010XR_TYPE_6",
"code": "0010XR",
"orgType": "6",
"orgNumber": "222222",
"typee": "TYPE",
"perms": [
"PERM1",
"PERM2"
]
},
{
"firm": "0010XR_TYPE_6",
"code": "0010XR",
"orgType": "5",
"orgNumber": "11111",
"typee": "TYPE2",
"perms": [
"PERM1",
"PERM2"
]
}
],
"actKey": "act",
"firsts": [ <----
{
"docType": "77444478",
"docNumber": "6",
"pId": "1"
},
{
"docType": "411133",
"docNumber": "6",
"pId": "2"
}
]
}
As per your comment, you want to aggregate add by some grouping. Check which columns you actually want to group by: the columns you want to aggregate cannot also be grouping columns. If they are, every distinct value forms its own group and you will always get separate records.
This will work as per your expectations (I suppose):
val group =
filter.rdd.map(x => {
(
x.getInt(x.fieldIndex("id")),
x.getString(x.fieldIndex("p_name")),
x.getString(x.fieldIndex("description")),
SecondIdentity(
x.getString(x.fieldIndex("firm")),
x.getString(x.fieldIndex("code")),
x.getString(x.fieldIndex("org_type")),
x.getString(x.fieldIndex("org_number")),
x.getString(x.fieldIndex("type")),
x.getSeq(x.fieldIndex("perms"))
),
"act",
FirstIdentity(
x.getString(x.fieldIndex("doc_number")),
x.getString(x.fieldIndex("doc_type")),
x.getInt(x.fieldIndex("p_id")).toString
)
)
})
.toDF("id", "name", "desc", "add", "actKey", "firsts")
.groupBy("id", "name", "desc", "actKey")
// no alias and no drop here, so the aggregated column is kept (as "collect_list(add)")
.agg(collect_list("add"))
Result:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"id":123,"name":"PF 1","desc":"description","actKey":"act","collect_list(add)":[{"firm":"0010XR_TYPE_6","code":"0010XR","orgType":"6","orgNumber":"222222","typee":"TYPE","perms":["PERM1","PERM2"]},{"firm":"0010XR_TYPE_5","code":"0010XR","orgType":"5","orgNumber":"222223","typee":"TYPE","perms":["PERM1","PERM2"]}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
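The grouping rule can also be sketched with plain Scala collections, without Spark. This is only an analogue: `groupBy` followed by `map` stands in for `groupBy(...).agg(collect_list(...))`, and `Row` and `Grouped` are hypothetical helper types that are not part of the original code. The second input row reuses the values from the expected output in the question.

```scala
// Plain-collections analogue of groupBy(keys).agg(collect_list(...)).
// Row and Grouped are hypothetical helper types for this illustration.
case class FirstIdentity(docType: String, docNumber: String, pId: String)
case class SecondIdentity(firm: String, code: String, orgType: String,
                          orgNumber: String, typee: String, perms: Seq[String])
case class Row(id: Int, name: String, desc: String, add: SecondIdentity,
               actKey: String, firsts: FirstIdentity)
case class Grouped(id: Int, name: String, desc: String, add: Seq[SecondIdentity],
                   actKey: String, firsts: Seq[FirstIdentity])

val rows = Seq(
  Row(123, "PF 1", "description",
    SecondIdentity("0010XR_TYPE_6", "0010XR", "6", "222222", "TYPE", Seq("PERM1", "PERM2")),
    "act", FirstIdentity("77444478", "6", "1")),
  Row(123, "PF 1", "description",
    SecondIdentity("0010XR_TYPE_6", "0010XR", "5", "11111", "TYPE2", Seq("PERM1", "PERM2")),
    "act", FirstIdentity("411133", "6", "2"))
)

// Group by the key fields only; "add" and "firsts" are collected per group,
// so they must not appear in the grouping key.
val grouped: Seq[Grouped] = rows
  .groupBy(r => (r.id, r.name, r.desc, r.actKey))
  .map { case ((id, name, desc, actKey), rs) =>
    Grouped(id, name, desc, rs.map(_.add), actKey, rs.map(_.firsts))
  }
  .toSeq
```

If `add` or `firsts` were part of the key tuple, the two rows would land in two different groups, which is the behaviour described above.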
Inside your map function, you are not wrapping FirstIdentity and SecondIdentity in a Seq, so add and firsts are not converted to arrays.
Change your map function to this:
filter.rdd.map(x => {
(
x.getInt(x.fieldIndex("id")),
x.getString(x.fieldIndex("p_name")),
x.getString(x.fieldIndex("description")),
Seq(SecondIdentity(
x.getString(x.fieldIndex("firm")),
x.getString(x.fieldIndex("code")),
x.getString(x.fieldIndex("org_type")),
x.getString(x.fieldIndex("org_number")),
x.getString(x.fieldIndex("type")),
x.getSeq(x.fieldIndex("perms"))
)),
"act",
Seq(FirstIdentity(
x.getString(x.fieldIndex("doc_number")),
x.getString(x.fieldIndex("doc_type")),
x.getInt(x.fieldIndex("p_id")).toString
))
)
})
This results in:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"id":123,"name":"PF 1","desc":"description","add":[{"firm":"0010XR_TYPE_6","code":"0010XR","orgType":"6","orgNumber":"222222","typee":"TYPE","perms":["PERM1","PERM2"]}],"actKey":"act","firsts":[{"docType":"77444478","docNumber":"6","pId":"1"}]}|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have a simple JSON document:
{
"name": "John",
"placesVisited": [
{
"name": "Paris",
"data": {
"weather": "warm",
"date": "31/01/22"
}
},
{
"name": "New York",
"data": [
{
"weather": "warm",
"date": "31/01/21"
},
{
"weather": "cold",
"date": "28/01/21"
}
]
}
]
}
As you can see, this JSON has a placesVisited field. When name is "New York" the "data" field is a list, and when name is "Paris" it is an object.
I want to pull out the placesVisited element where "name" is "New York" and parse it into a case class I have. I can't use this case class for both elements in placesVisited because "data" has a different type in each.
So I thought of doing something like:
(myJson \ "placesVisited") and here I need to add something that gives me the element whose name is "New York". How can I do that?
My result should be this:
{
"name": "New York",
"data": [
{
"weather": "warm",
"date": "31/01/21"
},
{
"weather": "cold",
"date": "28/01/21"
}
]
}
Something like this might work, but it's horrible, haha:
(Json.parse(myjson) \ "placesVisited").as[List[JsObject]].find(item => {
item.value.get("name").toString.contains("New York")
}).getOrElse(throw Exception("could not find New York element")).as[NewYorkModel]
item.value.get("name").toString can be simplified slightly to (item \ "name").as[String], but otherwise there's not much to improve.
Another option is to use a case class Place(name: String, data: JsValue), which needs an implicit Reads[Place] in scope (for example from Json.reads[Place]), and do it like this:
(Json.parse(myjson) \ "placesVisited")
.as[List[Place]]
.find(_.name == "New York")
I am using PostgreSQL 13.5. From the PostgreSQL docs:
The technical difference between a jsonb_ops and a jsonb_path_ops GIN
index is that the former creates independent index items for each key
and value in the data, while the latter creates index items only for
each value in the data. Basically, each jsonb_path_ops index item
is a hash of the value and the key(s) leading to it; for example to
index {"foo": {"bar": "baz"}}
Understanding the above in detail is important to me because my jdata document is big, with many keys and nested objects. Consider JSON data like the following, stored as jsonb in a column named jdata:
{
"supplier": {
"id": "3c67b6eb-3b0d-492d-8736-66df107b83b3",
"customer": {
"type": "pro",
"name": "John George",
"address": [
{
"add-id": "098ad4df-2a90-4fda-8f92-dbe8d7196732",
"addressActive": true,
"street": "abc street",
"zip": 94044,
"staying-since": "long long",
"accessibility": {
"traffic": "heavy/congested",
"bestwaytoreach": {
"weekdays": {
"bart/metro/calltrain": true,
"price": {
"off-peak-hours": "affordable",
"peak-hours": "high"
},
"journey-time": "super-fast"
}
},
"weekends": {
"byroad": {
"ok": true,
"distance": "long",
"has-tolls": {
"true": true,
"toll-price": "relatively-high"
},
"journey-speed": "fast"
}
}
}
},
{
"add-id": "ddd1d2a0-9050-4bcf-a3ad-2e608d65e468",
"addressActive": true,
"street": "xyz street",
"zip": 10001,
"staying-since": "moved recently",
"accessibility": {
"traffic": "heavy/congested",
"bestwaytoreach": {
"weekdays": {
"subway": true,
"price": {
"off-peak-hours": "affordable",
"peak-hours": "high"
},
"journey-speed": "super-fast"
}
},
"weekends": {
"byroad": {
"ok": true,
"distance": "moderate",
"tolls": {
"has-tolls": true,
"toll-price": "relatively-high"
},
"journey-time": "super-fast"
}
}
}
}
],
"firstName": "John",
"lastName": "CRAWFORD",
"emailAddresses": {
"personal": [
"johnreplies@jg.com",
"ursjohn@jg.com",
"1234@jg.com"
],
"official": [
{
"repies-in": "1 day",
"email": "jg@jg.com"
},
{
"check's regularly": true,
"repies-in": "1 Hour",
"email": "jg-watching@jg.com"
}
]
},
"cities": [
"NYC",
"LA",
"SF",
"DC"
],
"splCustFlag": null,
"isPerson": true,
"isEntity": false,
"allowEmailSolicit": "Y",
"allowPhoneSolicit": "Y",
"taxPayer": true,
"suffix": null,
"title": null,
"birthDate": "05/10/1993",
"loyaltyPrograms": null,
"phoneNumbers-summary": [
1234567890,
1234567899,
1234567898,
1234567897
],
"phoneNumbers": [
{
"description": null,
"extension": null,
"number": 1234567890,
"countryCode": null,
"type": "Business"
},
{
"description": null,
"extension": null,
"number": 1234567899,
"countryCode": null,
"type": "Home"
}
],
"data-privacy": {
"required": true,
"laws": [
"CCPA",
"GDPR"
]
}
}
}
}
Now, if I create a GIN jsonb_ops index on the jdata column, I want to clarify which keys and values will be part of the index.
For example, "staying-since" is a key nested at the path below, and it sits inside the "address" array too. It is still a key, though nested deep in the document. Will it be part of the index?
{
"supplier": {
"customer": {
"address": [
{
"staying-since": "long long" ...
Similarly, "long long" is the value of a deeply nested key. Will it also be indexed?
And if a GIN jsonb_path_ops index is created on the jdata column:
Will a hash of the "long long" value, together with the path that leads to it, also be indexed?
hash(
"supplier": {
"customer": {
"address":[{"staying-since": "long long"}]
}
}
)
Will the above also get indexed?
I am aware of the operators supported by each GIN operator class and of how to use them:
jsonb_ops ? ?& ?| @> @? @@
jsonb_path_ops @> @? @@
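Taking the quoted passage literally, the difference between the two operator classes can be modelled in a few lines of Scala. This is a toy model of the documented behaviour only, not PostgreSQL's implementation: `J`, `JObj`, `JArr`, `JLeaf`, `opsItems` and `pathOpsItems` are hypothetical names, leaf values are simplified to strings, and the path-based items are kept as readable lists where a real index would store a hash.

```scala
// Toy model of the index items described in the quoted documentation.
// All names here are hypothetical; this is not how PostgreSQL is implemented.
sealed trait J
case class JObj(fields: List[(String, J)]) extends J
case class JArr(items: List[J]) extends J
case class JLeaf(value: String) extends J

// jsonb_ops style: every key and every value becomes its own independent item,
// regardless of how deeply it is nested.
def opsItems(j: J): Set[String] = j match {
  case JObj(fs) => fs.flatMap { case (k, v) => opsItems(v) + s"key:$k" }.toSet
  case JArr(xs) => xs.flatMap(opsItems).toSet
  case JLeaf(v) => Set(s"value:$v")
}

// jsonb_path_ops style: one item per leaf value, combining the value with
// the chain of keys leading to it (a real index would hash this chain).
def pathOpsItems(j: J, path: List[String] = Nil): Set[List[String]] = j match {
  case JObj(fs) => fs.flatMap { case (k, v) => pathOpsItems(v, path :+ k) }.toSet
  case JArr(xs) => xs.flatMap(x => pathOpsItems(x, path)).toSet
  case JLeaf(v) => Set(path :+ v)
}

// The nested fragment from the question.
val address  = JArr(List(JObj(List("staying-since" -> JLeaf("long long")))))
val customer = JObj(List("address" -> address))
val doc      = JObj(List("supplier" -> JObj(List("customer" -> customer))))
```

In this model `opsItems(doc)` contains separate entries for the key "staying-since" and the value "long long", while `pathOpsItems(doc)` holds a single entry that combines supplier, customer, address, staying-since and "long long". That matches the wording of the documentation, but the actual index contents are best confirmed against the PostgreSQL source or docs.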
Considering 2 sets of data as follows:
JSON1=> {
"data": [
{"id": "1-abc",
"model": "Agile",
"status":"open",
"configuration": {
"state": "running",
"rootVolumeSize": "0.00000",
"count": "2",
"type": "large",
"platform": "Linux"
},
"stateId":"123-567"
}
]}
JSON2=>{
"data": [
{"id": "1-abc",
"model": "Agile",
"configuration": {
"state": "running",
"diskSize": "0",
"type": "small",
"platform":"Windows"
}
}
]}
I need to compare JSON1 and JSON2 on the first field, id. If they match, I need to merge JSON1 into JSON2, retaining the existing values in JSON2 (only appending fields that are not present).
I have coded it as below:
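private def merger(JSON1: Seq[JSON], JSON2: Seq[JSON]): Seq[JSON] = {
  // Index the first data set by id, keeping the first record per id
  val abcKey = JSON1.groupBy(_.id).map { case (k, v) => (k, v.head) }
  val mergedRecords = for {
    xyzJSON <- JSON2
  } yield abcKey.get(xyzJSON.id) match {
    // The binder must start with a lowercase letter: an uppercase name in a
    // pattern is compared against an existing value instead of binding a new one
    case Some(abc) => xyzJSON.copy(status = abc.status, stateId = abc.stateId)
    case None      => xyzJSON.copy(origin = "N/A")
  }
  mergedRecords
}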
I am not able to arrive at a solution for reconciling the fields within the configurationMap.
The expected result should look like this:
{
"data": [
{"id": "1-abc",
"model": "Agile",
"status":"open",
"configuration": {
"state": "running",
"diskSize": "0",
"rootVolumeSize": "0.00000",
"count": "2",
"type": "small",
"platform": "Windows"
},
"stateId":"123-567"
}
]}
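One possible way to reconcile the configuration fields, assuming they are held as a `Map[String, String]`, is the `++` operator on maps: on duplicate keys the right-hand operand wins, so placing JSON2's map on the right keeps its values and appends only the keys it lacks. `Item` and `mergeConfig` are hypothetical names, since the original `JSON` case class is not shown.

```scala
// Sketch under the assumption that "configuration" is a Map[String, String].
// Item and mergeConfig are hypothetical names; the original JSON class is not shown.
case class Item(id: String, configuration: Map[String, String])

def mergeConfig(from1: Map[String, String], from2: Map[String, String]): Map[String, String] =
  from1 ++ from2 // on duplicate keys the right-hand operand (JSON2) wins

val r1 = Item("1-abc", Map("state" -> "running", "rootVolumeSize" -> "0.00000",
  "count" -> "2", "type" -> "large", "platform" -> "Linux"))
val r2 = Item("1-abc", Map("state" -> "running", "diskSize" -> "0",
  "type" -> "small", "platform" -> "Windows"))

// JSON2's record is kept and only its configuration is replaced by the merged map.
val merged = r2.copy(configuration = mergeConfig(r1.configuration, r2.configuration))
```

Inside `merger`, the same call could be added to the `copy` in the `Some` branch, next to `status` and `stateId`.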
I'm trying to implement a service in my play2 app that uses elastic4s to get a document by Id.
My document in elasticsearch:
curl -XGET 'http://localhost:9200/test/venues/3659653'
{
"_index": "test",
"_type": "venues",
"_id": "3659653",
"_version": 1,
"found": true,
"_source": {
"id": 3659653,
"name": "Salong Anna och Jag",
"description": "",
"telephoneNumber": "0811111",
"postalCode": "16440",
"streetAddress": "Kistagången 12",
"city": "Kista",
"lastReview": null,
"location": {
"lat": 59.4045675,
"lon": 17.9502138
},
"pictures": [],
"employees": [],
"reviews": [],
"strongTags": [
"skönhet ",
"skönhet ",
"skönhetssalong"
],
"weakTags": [
"Frisörsalong",
"Frisörer"
],
"reviewCount": 0,
"averageGrade": 0,
"roundedGrade": 0,
"recoScore": 0
}
}
My Service:
@Singleton
class VenueSearchService extends ElasticSearchService[IndexableVenue] {

  /**
   * Elastic search conf
   */
  override def path = "test/venues"

  def getVenue(companyId: String) = {
    val resp = client.execute(
      get id companyId from path
    ).map { response =>
      // transform response to IndexableVenue
      response
    }
    resp
  }
}
If I use getFields() on the response object I get an empty object. But if I call response.getSourceAsString I get the document as json:
{
"id": 3659653,
"name": "Salong Anna och Jag ",
"description": "",
"telephoneNumber": "0811111",
"postalCode": "16440",
"streetAddress": "Kistagången 12",
"city": "Kista",
"lastReview": null,
"location": {
"lat": 59.4045675,
"lon": 17.9502138
},
"pictures": [],
"employees": [],
"reviews": [],
"strongTags": [
"skönhet ",
"skönhet ",
"skönhetssalong"
],
"weakTags": [
"Frisörsalong",
"Frisörer"
],
"reviewCount": 0,
"averageGrade": 0,
"roundedGrade": 0,
"recoScore": 0
}
As you can see, the get request omits this info:
"_index": "test",
"_type": "venues",
"_id": "3659653",
"_version": 1,
"found": true,
"_source": {}
If I try to do a regular search:
def getVenue(companyId: String) = {
val resp = client.execute(
search in "test"->"venues" query s"id:${companyId}"
//get id companyId from path
).map { response =>
Logger.info("response: "+response.toString)
}
resp
}
I get:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "venues",
"_id": "3659653",
"_score": 1,
"_source": {
"id": 3659653,
"name": "Salong Anna och Jag ",
"description": "",
"telephoneNumber": "0811111",
"postalCode": "16440",
"streetAddress": "Kistagången 12",
"city": "Kista",
"lastReview": null,
"location": {
"lat": 59.4045675,
"lon": 17.9502138
},
"pictures": [],
"employees": [],
"reviews": [],
"strongTags": [
"skönhet ",
"skönhet ",
"skönhetssalong"
],
"weakTags": [
"Frisörsalong",
"Frisörer"
],
"reviewCount": 0,
"averageGrade": 0,
"roundedGrade": 0,
"recoScore": 0
}
}
]
}
}
My Index Service:
trait ElasticIndexService [T <: ElasticDocument] {
val clientProvider: ElasticClientProvider
def path: String
def indexInto[T](document: T, id: String)(implicit writes: Writes[T]) : Future[IndexResponse] = {
Logger.debug(s"indexing into $path document: $document")
clientProvider.getClient.execute {
index into path doc JsonSource(document) id id
}
}
}
case class JsonSource[T](document: T)(implicit writes: Writes[T]) extends DocumentSource {
def json: String = {
val js = Json.toJson(document)
Json.stringify(js)
}
}
and indexing:
@Singleton
class VenueIndexService @Inject()(
  stuff...) extends ElasticIndexService[IndexableVenue] {
  def indexVenue(indexableVenue: IndexableVenue) = {
    indexInto(indexableVenue, s"${indexableVenue.id.get}")
  }
}
Why is getFields empty when doing get?
Why is query info left out when doing getSourceAsString in a get request?
Thank you!
What you're hitting in question 1 is that you're not specifying which fields to return. By default, ES returns the source and not fields (other than _type and _id). See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-fields.html
I've added a test to elastic4s to show how to retrieve fields, see:
https://github.com/sksamuel/elastic4s/blob/master/src%2Ftest%2Fscala%2Fcom%2Fsksamuel%2Felastic4s%2FSearchTest.scala
I am not sure about question 2.
The fields are empty because Elasticsearch doesn't return them by default.
If you need fields, you must indicate in the query which fields you want.
This is your search query without fields:
search in "test"->"venues" query s"id:${companyId}"
And in this query we indicate which fields we want, in this case 'name' and 'description':
search in "test"->"venues" fields ("name","description") query s"id:${companyId}"
Now you can retrieve the fields:
for (x <- response.getHits.hits()) {
  println(x.getFields.get("name").getValue)
}
getSourceAsString returns the document in a get request because the _source parameter is on by default, while fields are off by default.
I hope this helps.
I want to extract JSON values using for-comprehensions.
My code is this:
import net.liftweb.json._
val json = parse("""
{
"took": 212,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.625,
"hits": [
{
"_index": "siteindex",
"_type": "posts",
"_id": "1",
"_score": 0.625,
"_source": {
"title": "title 1",
"content": "content 1"
},
"highlight": {
"title": [
"<b>title</b> 1"
]
}
},
{
"_index": "siteindex",
"_type": "posts",
"_id": "4",
"_score": 0.19178301,
"_source": {
"title": "title 4",
"content": "content 4"
},
"highlight": {
"title": [
"<b>title</b> 4"
]
}
},
{
"_index": "siteindex",
"_type": "posts",
"_id": "2",
"_score": 0.19178301,
"_source": {
"title": "title 2",
"content": "content 2"
},
"highlight": {
"title": [
"<b>title</b> 2"
]
}
},
{
"_index": "siteindex",
"_type": "posts",
"_id": "3",
"_score": 0.19178301,
"_source": {
"title": "title 3",
"content": "content 3"
},
"highlight": {
"title": [
"<b>title</b> 3"
]
}
}
]
}
}
""")
My case class is this:
case class Document(title:String, content:String)
My for-comprehension is this:
val ret: List[Document] = for {
JObject(child) <- json
JField("title", JString(title)) <- child
JField("content", JString(content)) <- child
} yield (Document( title, content ))
And my list is this:
ret: List[Document] = List(Document(title 1,content 1), Document(title 4,content 4), Document(title 2,content 2), Document(title 3,content 3))
Up to here, everything is fine.
But now I need something like this:
List(Document2(1,<b>title</b> 1,content 1), Document2(4,<b>title</b> 4,content 4), Document2(2,<b>title</b> 2,content 2), Document2(3,<b>title</b> 3,content 3))
I need the value of:
"highlight": {
"title": [
"<b>*</b> *"
]
}
and this:
"_id": "*",
in my list.
My case class is this:
case class Document2(_id:String, title:String, content:String)
I tried this, but it does not work:
val ret: List[Document2] = for {
JObject(child) <- json
JField("_id", JString(_id)) <- child
JField("title", JString(title)) <- child
JField("content", JString(content)) <- child
} yield (Document2( _id, title, content ))
I don't know whether there is a better way to extract data from this JSON, but the result is this:
<console>:23: warning: `withFilter' method does not yet exist on net.liftweb.json.JValue, using `filter' method instead
JObject(child) <- json
ret: List[Document2] = List()
Any suggestions, please?
Thanks for your help.
Here is an answer similar to the approach from @flav, but I'll also give you the structure that maps to the JSON and show how to get your end result. First, the case classes:
case class Document2(_id:String, title:String, content:String)
case class Results(hits:HitsList)
case class HitsList(hits:List[Hit])
case class Hit(_id:String, _source:Source, highlight:Highlight)
case class Source(title:String, content:String)
case class Highlight(title:List[String])
Then, the code for parsing it and converting it:
implicit val formats = DefaultFormats
val results = json.extract[Results]
val docs2 = results.hits.hits.map{ hit =>
Document2(hit._id, hit.highlight.title.head, hit._source.content)
}
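The mapping step can be tried without the JSON library by building a `Results` value by hand. In this sketch `sampleResults` and `safeDocs` are hypothetical names, and the second hit with an empty highlight list is made up to show one design choice: `headOption` with a fallback to the plain title, since `.head` throws on an empty list.

```scala
// Same case classes as above; sampleResults stands in for json.extract[Results].
case class Document2(_id: String, title: String, content: String)
case class Results(hits: HitsList)
case class HitsList(hits: List[Hit])
case class Hit(_id: String, _source: Source, highlight: Highlight)
case class Source(title: String, content: String)
case class Highlight(title: List[String])

val sampleResults = Results(HitsList(List(
  Hit("1", Source("title 1", "content 1"), Highlight(List("<b>title</b> 1"))),
  Hit("4", Source("title 4", "content 4"), Highlight(Nil)) // hypothetical: no highlight fragment
)))

// headOption falls back to the plain title when the highlight list is empty,
// where .head would throw a NoSuchElementException.
val safeDocs: List[Document2] = sampleResults.hits.hits.map { hit =>
  Document2(
    hit._id,
    hit.highlight.title.headOption.getOrElse(hit._source.title),
    hit._source.content
  )
}
```

Whether the fallback is wanted depends on the query; if every hit is guaranteed to carry a highlight, the `.head` version above is equivalent.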