JOLT subtree validation based on one of the property values - jolt

I am new to Jolt.
I have a JSON document whose structure may vary based on the type property. See my example below.
{
"recipientId": "xxx",
"messages": [
{
"type": "text",
"text": "hi there!"
},
{
"type": "image",
"url": "http://example.com/image.jpg",
"preview": "http://example.com/thumbnail.jpg"
}
]
}
After transformation I would like to receive the following output:
{
"messages" : [ {
"text" : "hi there!",
"type" : "text"
}, {
"type" : "image",
"url" : "http://example.com/image.jpg",
"preview": "http://example.com/thumbnail.jpg"
} ],
"to" : "xxx"
}
Here is the spec that I came up with:
[
{
"operation": "shift",
"spec": {
"recipientId": "to",
"messages": {
"*": {
"type": "messages[&1].type",
"text": "messages[&1].text",
"url": "messages[&1].url",
"preview": "messages[&1].previewImageUrl"
}
}
}
}
]
The problem with this approach is that if a message has "type": "text" and also carries a "preview" property, that property is passed through even though it makes no sense for a text message.
So I would like Jolt either to ignore some properties based on the value of the "type" property or to refuse to transform such payloads.
Is there a way to do such "validations" in Jolt? The other option I see is validating the payload with a Jackson type hierarchy.

What you can do is match down to the value of "type", then jump back up the tree and process the message as either a "text" type or an "image" type.
Input
{
"recipientId": "xxx",
"messages": [
{
"type": "text",
"text": "hi there!",
"preview": "SHOULD NOT PASS THRU"
},
{
"type": "image",
"url": "http://example.com/image.jpg",
"preview": "http://example.com/thumbnail.jpg"
}
]
}
Spec
[
{
"operation": "shift",
"spec": {
"recipientId": "to",
"messages": {
"*": { // the array index of the messages array, referenced below
// as [&2] or [&4] depending how far down they have gone
"type": {
// always pass the value of type thru
"@": "messages[&2].type",
"text": {
// if the value of type was "text", then
// go back up the tree 3 levels (0,1,2)
// and process the whole message as a "text" type
"@2": {
"text": "messages[&4].text"
}
},
"image": {
"@2": {
// if the value of type was "image", then
// go back up the tree 3 levels (0,1,2)
// and process the whole message as a "image" type
"url": "messages[&4].url",
"preview": "messages[&4].previewImageUrl"
}
}
}
}
}
}
}
]
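To make the intent of the spec easier to check, here is a rough plain-Python sketch of the same "dispatch on the value of type" idea. It is not Jolt and does not call any Jolt API; ALLOWED_FIELDS and transform are hypothetical names made up for this illustration, and the field mapping simply mirrors the spec above.

```python
# Plain-Python illustration of the type-dispatch idea; this is NOT Jolt.
# ALLOWED_FIELDS is a hypothetical table: input field -> output field, per type.
ALLOWED_FIELDS = {
    "text": {"text": "text"},
    "image": {"url": "url", "preview": "previewImageUrl"},
}


def transform(payload):
    messages = []
    for message in payload.get("messages", []):
        msg_type = message.get("type")
        out = {"type": msg_type}  # the type value is always passed through
        # Copy only the fields that belong to this type; everything else is dropped.
        for src, dst in ALLOWED_FIELDS.get(msg_type, {}).items():
            if src in message:
                out[dst] = message[src]
        messages.append(out)
    result = {"messages": messages}
    if "recipientId" in payload:
        result["to"] = payload["recipientId"]
    return result


payload = {
    "recipientId": "xxx",
    "messages": [
        {"type": "text", "text": "hi there!", "preview": "SHOULD NOT PASS THRU"},
        {
            "type": "image",
            "url": "http://example.com/image.jpg",
            "preview": "http://example.com/thumbnail.jpg",
        },
    ],
}
result = transform(payload)
print(result)
```

The stray "preview" on the text message is dropped, while the image message keeps both of its fields. A message with an unknown type would keep only its type value.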

Related

What is the reason for this difference in Facebook messages webhook JSON scheme?

I'm reading the Facebook Send API and events docs, and I'm surprised to see some examples define the message at the top level of the JSON object hierarchy (here for "text message")
{
"sender":{
"id":"<PSID>"
},
"recipient":{
"id":"<PAGE_ID>"
},
"timestamp":1458692752478,
"message":{
"mid":"mid.1457764197618:41d102a3e1ae206a38",
"text":"hello, world!",
"quick_reply": {
"payload": "<DEVELOPER_DEFINED_PAYLOAD>"
}
}
}
While other examples have a different object structure:
{
"id": "682498302938465",
"time": 1518479195594,
"messaging": [
{
"sender": {
"id": "<PSID>"
},
"recipient": {
"id": "<PAGE_ID>"
},
"timestamp": 1518479195308,
"message": {
"mid": "mid.$cAAJdkrCd2ORnva8ErFhjGm0X_Q_c",
"attachments": [
{
"type": "<image|video|audio|file>",
"payload": {
"url": "<ATTACHMENT_URL>"
}
}
]
}
}
]
}
Here it seems like the object in the first example is instead contained in a messaging array. I also can't find what id corresponds to in this second example: there is no table detailing the fields or explaining why the hierarchy is different here.
Finally, there is another example with a different structure:
{
"object": "page",
"entry": [
{
"id": "<PAGE_ID>",
"time": 1583173667623,
"messaging": [
{
"sender": {
"id": "<PSID>"
},
"recipient": {
"id": "<PAGE_ID>"
},
"timestamp": 1583173666767,
"message": {
"mid": "m_toDnmD...",
"text": "This is where I want to go: https:\/\/youtu.be\/bbo_fZAjIhg",
"attachments": [
{
"type": "fallback",
"payload": {
"url": "<ATTACHMENT_URL >",
"title": "TAHITI - Heaven on Earth"
}
}
]
}
}
]
}
]
}
Are these differences documented elsewhere? Why do they exist in the first place? Why is messaging an array? Can it contain multiple messages at once?

Elastic4s ngram mapping

I have to create an ElasticSearch mapping like this using elastic4s:
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"name": {
"type": "text",
"analyzer": "ngram_analyzer",
"fielddata": true
},
"lang": {
"type": "keyword"
},
"order": {
"type": "long"
},
"active": {
"type": "boolean"
},
"description": {
"type": "text"
}
}
}
I can do
def mapping: Option[MappingDefinition] =
Some(
properties(
KeywordField("id"),
KeywordField("lang"),
BasicField("order", "long"),
BasicField("active", "boolean"),
TextField("description")
)
)
for id, lang, order, active and description.
But how can I do such a mapping for name? The problem is the analyzer and fielddata settings inside it.
You should use this:
TextField("name").fielddata(true).analyzer("ngram_analyzer")
You also need to make sure to properly create the ngram_analyzer in your index settings.
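As a rough sketch of what "creating the ngram_analyzer in your index settings" could look like, here is the settings body built as a Python dict and printed as JSON. The tokenizer name ngram_tokenizer, the gram sizes and the token_chars are assumptions chosen for illustration, not values from the question; older Elasticsearch versions spell the tokenizer type "nGram".

```python
import json

# Hypothetical index settings that define "ngram_analyzer".
# Tokenizer name, gram sizes and token_chars are illustrative assumptions.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

# The printed JSON is the body you would send when creating the index.
print(json.dumps(index_settings, indent=2))
```

The analyzer name has to match the one passed to analyzer("ngram_analyzer") in the field definition, otherwise index creation fails because the analyzer cannot be found.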

how to validate datatype in jolt transformation

I'm new to Jolt transformations. I was wondering if there is a way to validate a data type and then proceed.
I'm processing JSON to insert records into HBase. From the source I get the timestamp repeated for the same resource id, which I want to use for the row key.
So I just retrieve the first timestamp and concatenate it with the resource id to create the row key. But I have an issue when there is only one timestamp in the record, i.e. when it's not a list. I would appreciate it if someone could help me handle this situation.
input data
{ "resource": {
"id": "200629068",
"name": "resource_name_1)",
"parent": {
"id": 200053744,
"name": "parent_name"
},
"properties": {
"AP_ifSpeed": "0",
"DisplaySpeed": "0 (NotApplicable)",
"description": "description"
}
},
"data": [
{
"metric": {
"id": "2215",
"name": "metric_name 1"
},
"timestamp": 1535064595000,
"value": 0
},
{
"metric": {
"id": "2216",
"name": "metric_name_2"
},
"timestamp": 1535064595000,
"value": 1
}
]
}
Jolt transformation
[{
"operation": "shift",
"spec": {
"resource": {
// "id": "resource_&",
"name": "resource_&",
"id": "resource_&",
"parent": {
"id": "parent_&",
"name": "parent_&"
},
"properties": {
"*": "&"
}
},
"data": {
"*": {
"metric": {
"id": {
"*": {
"@(3,value)": "&1"
}
},
"name": {
"*": {
"@(3,value)": "&1"
}
}
},
"timestamp": "timestamp"
}
}
}
}, {
"operation": "shift",
"spec": {
"timestamp": {
// get first element from list
"0": "&1"
},
"*": "&"
}
},
{
"operation": "modify-default-beta",
"spec": {
"rowkey": "=concat(@(1,resource_id),'_',@(1,timestamp))"
}
}
]
Output I'm getting
{ "resource_name" : "resource_name_1)",
"resource_id" : "200629068",
"parent_id" : 200053744,
"parent_name" : "parent_name",
"AP_ifSpeed" : "0",
"DisplaySpeed" : "0 (NotApplicable)",
"description" : "description",
"2215" : 0,
"metric_name 1" : 0,
"timestamp" : 1535064595000,
"2216" : 1,
"metric_name_2" : 1,
"rowkey" : "200629068_1535064595000"
}
When there is only one timestamp, I get
"rowkey" : "200629068_"
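In your first shift, make the output "timestamp" always be an array, even if the incoming data array only has one element in it. Otherwise a single match is written as a plain value rather than a list, and the "0" lookup in the second shift finds nothing.
"timestamp": "timestamp[]"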

How can I use CloudKit web services to query based on a reference field?

I've got two CloudKit data objects that look somewhat like this:
Parent Object:
{
"records": [
{
"recordName": "14102C0A-60F2-4457-AC1C-601BC628BF47-184-000000012D225C57",
"recordType": "ParentObject",
"fields": {
"fsYear": {
"value": "2015",
"type": "STRING"
},
"displayOrder": {
"value": 2015221153856287200,
"type": "INT64"
},
"fjpFSGuidForReference": {
"value": "14102C0A-60F2-4457-AC1C-601BC628BF47-184-000000012D225C57",
"type": "STRING"
},
"fsDateSearch": {
"value": "2015221153856287158",
"type": "STRING"
}
},
"recordChangeTag": "id4w7ivn",
"created": {
"timestamp": 1439149087571,
"userRecordName": "_0d26968032e31bbc72c213037b6cb35d",
"deviceID": "A19CD995FDA3093781096AF5D818033A241D65C1BFC3D32EC6C5D6B3B4A9AA6B"
},
"modified": {
"timestamp": 1439149087571,
"userRecordName": "_0d26968032e31bbc72c213037b6cb35d",
"deviceID": "A19CD995FDA3093781096AF5D818033A241D65C1BFC3D32EC6C5D6B3B4A9AA6B"
}
}
],
"total":
}
Child Object:
{
"records": [
{
"recordName": "2015221153856287168",
"recordType": "ChildObject",
"fields": {
"District": {
"value": "002",
"type": "STRING"
},
"ZipCode": {
"value": "12345",
"type": "STRING"
},
"InspecReference": {
"value": {
"recordName": "14102C0A-60F2-4457-AC1C-601BC628BF47-184-000000012D225C57",
"action": "NONE",
"zoneID": {
"zoneName": "_defaultZone"
}
},
"type": "REFERENCE"
}
},
"recordChangeTag": "id4w7lew",
"created": {
"timestamp": 1439149090856,
"userRecordName": "_0d26968032e31bbc72c213037b6cb35d",
"deviceID": "A19CD995FDA3093781096AF5D818033A241D65C1BFC3D32EC6C5D6B3B4A9AA6B"
},
"modified": {
"timestamp": 1439149090856,
"userRecordName": "_0d26968032e31bbc72c213037b6cb35d",
"deviceID": "A19CD995FDA3093781096AF5D818033A241D65C1BFC3D32EC6C5D6B3B4A9AA6B"
}
}
],
"total": 1
}
I'm trying to write a query to directly access the CloudKit web service and return the Child Object based on the reference of the parent object.
My test JSON looks something like this:
{"query":{"recordType":"ChildObject","filterBy":{"fieldName":"InspecReference","fieldValue":{ "value" : "14102C0A-60F2-4457-AC1C-601BC628BF47-184-000000012D225C57", "type" : "string" },"comparator":"EQUALS"}},"zoneID":{"zoneName":"_defaultZone"}}
However, I'm getting the following error from CloudKit:
{"uuid":"33db91f3-b768-4a68-9056-216ecc033e9e","serverErrorCode":"BAD_REQUEST","reason":"BadRequestException:
Unexpected input"}
I'm guessing I have the Record Field Dictionary in the query wrong. However, the documentation isn't clear on what this should look like on a reference object.
You have to re-create the actual object of the reference. In this particular case, the JSON looks like this:
{
"query": {
"recordType": "ChildObject",
"filterBy": {
"fieldName": "InspecReference",
"fieldValue": {
"value": {
"recordName": "14102C0A-60F2-4457-AC1C-601BC628BF47-184-000000012D225C57",
"action": "NONE"
},
"type": "REFERENCE"
},
"comparator": "EQUALS"
}
},
"zoneID": {
"zoneName": "_defaultZone"
}
}
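If you build this request in code, a small helper keeps the nested reference value in one place. The sketch below is only an illustration: reference_query is a hypothetical function name, and it does nothing more than assemble the same body as the JSON above.

```python
import json


def reference_query(record_type, field_name, record_name, zone_name="_defaultZone"):
    """Build a query body that filters on a REFERENCE field (hypothetical helper)."""
    return {
        "query": {
            "recordType": record_type,
            "filterBy": {
                "fieldName": field_name,
                "fieldValue": {
                    # The value is the reference object itself, not a plain string.
                    "value": {"recordName": record_name, "action": "NONE"},
                    "type": "REFERENCE",
                },
                "comparator": "EQUALS",
            },
        },
        "zoneID": {"zoneName": zone_name},
    }


body = reference_query(
    "ChildObject",
    "InspecReference",
    "14102C0A-60F2-4457-AC1C-601BC628BF47-184-000000012D225C57",
)
print(json.dumps(body, indent=2))
```

The important difference from the failing request is that fieldValue.value is an object with recordName and action, and type is "REFERENCE" rather than "string".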

ElasticSearch autocomplete returning 0 hits

I am trying to build an autocomplete feature for our database running on MongoDB. We need to provide autocomplete which lets users complete their queries by offering suggestions while they are typing in the search box.
I have a collection of articles from various sources, which has the following fields:
{
"title" : "Its the title of a random article",
"cont" : { "paragraphs" : [ .... ] },
and so on..
}
I went through a video by Clinton Gormley. From minute 37:00 through 42:00, Gormley describes an autocomplete using edgeNGram. I also referred to this question and recognized that both are almost the same thing; just the mappings differ.
So based on these, I built almost identical settings and mappings and then restored the articles collection to ensure that it is indexed by ElasticSearch.
The indexing scheme is as follows:
POST /title_autocomplete/title
{
"settings": {
"analysis": {
"filter": {
"autocomplete": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 50
}
},
"analyzer": {
"title" : {
"type" : "standard",
"stopwords":[]
},
"autocomplete": {
"type" : "autocomplete",
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete"]
}
}
}
},
"mappings": {
"title": {
"type": "multi_field",
"fields" : {
"title" : {
"type": "string",
"analyzer": "title"
},
"autocomplete" : {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer" : "title"
}
}
}
}
}
But when I run the search query, I am unable to get any hits!
GET /title_autocomplete/title/_search
{
"query": {
"bool" : {
"must" : {
"match" : {
"title.autocomplete" : "Its the titl"
}
},
"should" : {
"match" : {
"title" : "Its the titl"
}
}
}
}
}
Can anybody please explain what's wrong with the mapping query or settings? I have been reading ElasticSearch docs for over 7 days now but seem to get nowhere more than full text searches!
ElasticSearch version : 0.90.10
MongoDB version : v2.4.9
using _river
Ubuntu 12.04 64bit
UPDATE
I realised that the mapping is broken after applying the previous settings:
GET /title_autocomplete/_mapping
{
"title_autocomplete": {
"title": {
"properties": {
"analysis": {
"properties": {
"analyzer": {
"properties": {
"autocomplete": {
"properties": {
"filter": {
"type": "string"
},
"tokenizer": {
"type": "string"
},
"type": {
"type": "string"
}
}
},
"title": {
"properties": {
"type": {
"type": "string"
}
}
}
}
},
"filter": {
"properties": {
"autocomplete": {
"properties": {
"max_gram": {
"type": "long"
},
"min_gram": {
"type": "long"
},
"type": {
"type": "string"
}
}
}
}
}
}
},
"content": {
... paras and all ...
}
"title": {
"type": "string"
},
"url": {
"type": "string"
}
}
}
}
}
The analyzers and filters are actually mapped into the document after the settings are applied, whereas the original title field is not affected at all! Is this normal?
I guess this explains why the query is not matching. There is no title.autocomplete field or title.title field at all.
So how should I proceed now?
For those facing this problem, it's better to delete the index and start again instead of wasting time with the _river, just as DrTech pointed out in the comment.
This saves time but is not a solution. (Therefore I am not marking it as the answer.)
The key is to set up the mappings and index before you initiate the river.
We had an existing setup with a MongoDB river and an index called coresearch that we wanted to add autocomplete capability to. This is the set of commands we used to delete the existing index and river and start again.
Stack is:
ElasticSearch 1.1.1
MongoDB 2.4.9
ElasticSearchMapperAttachments v2.0.0
ElasticSearchRiverMongoDb/2.0.0
Ubuntu 12.04.2 LTS
curl -XDELETE "localhost:9200/_river/node"
curl -XDELETE "localhost:9200/coresearch"
curl -XPUT "localhost:9200/coresearch" -d '
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}'
curl -XPUT "localhost:9200/coresearch/_mapping/users" -d '{
"users": {
"properties": {
"firstname": {
"type": "string",
"search_analyzer": "standard",
"index_analyzer": "autocomplete"
},
"lastname": {
"type": "string",
"search_analyzer": "standard",
"index_analyzer": "autocomplete"
},
"username": {
"type": "string",
"search_analyzer": "standard",
"index_analyzer": "autocomplete"
},
"email": {
"type": "string",
"search_analyzer": "standard",
"index_analyzer": "autocomplete"
}
}
}
}'
curl -XPUT "localhost:9200/_river/node/_meta" -d '
{
"type": "mongodb",
"mongodb": {
"servers": [
{ "host": "127.0.0.1", "port": 27017 }
],
"options": {
"exclude_fields": ["time"],
"import_all_collections": true
},
"db": "users",
"gridfs": false
},
"index": {
"name": "coresearch",
"type": "documents"
}
}'
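To see why the mapping above pairs an autocomplete analyzer at index time with the standard analyzer at search time, here is a small plain-Python sketch of what an edge n-gram filter with min_gram 1 and max_gram 20 produces. It only approximates Elasticsearch: the standard tokenizer is replaced by a whitespace split, and edge_ngrams, index_terms and search_terms are hypothetical names for this illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token whose length is between min_gram and max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text):
    # Index time: lowercase, split on whitespace (a stand-in for the standard
    # tokenizer), then expand every token into its edge n-grams.
    terms = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(token))
    return terms


def search_terms(text):
    # Search time: lowercase and split only, with no n-gram expansion.
    return text.lower().split()


indexed = index_terms("Jonathan Smith")
print(sorted(indexed))
print(any(term in indexed for term in search_terms("Jon")))
```

Because every prefix of each word is stored in the index, a partially typed word matches as a whole term. Applying the n-gram filter to the query as well would turn "jon" into "j", "jo" and "jon" and match far too many documents, which is why the search analyzer stays "standard".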