Can Druid perform a nested query such that each result contains one dimension and a list of associated dimensions?

For instance, given this data:
{"timestamp": "2011-01-12T00:00:00.000Z", "ip": "1" "user": "abc" }
{"timestamp": "2011-01-12T00:00:00.000Z", "ip": "1" "user": "def" }
{"timestamp": "2011-01-12T00:00:00.000Z", "ip": "1" "user": "hgi" }
{"timestamp": "2011-01-12T00:00:00.000Z", "ip": "2" "user": "mno" }
{"timestamp": "2011-01-12T00:00:00.000Z", "ip": "2" "user": "qrs" }
{"timestamp": "2011-01-12T00:00:00.000Z", "ip": "3" "user": "xyz" }
Is it possible to do an efficient query that returns
{
  "timestamp": "...",
  "event": {
    "ip": 1,
    "user": ["abc", "def", "hgi"]
  }
},
{
  "timestamp": "...",
  "event": {
    "ip": 2,
    "user": ["mno", "qrs"]
  }
},
{
  "timestamp": "...",
  "event": {
    "ip": 3,
    "user": ["xyz"]
  }
}
And if so, is it possible to limit the result count of only the user list?

With a Druid groupBy query you cannot apply a "sub-group" or "group_concat" function; these simply are not available. Druid will group the results by the dimensions you select.
You can, of course, group per ip and then count the number of rows, or even the number of distinct users.
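For illustration only (the datasource name and interval are assumptions, and the exact aggregator spec can vary by Druid version), a groupBy query along those lines might look like this:
{
  "queryType": "groupBy",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2011-01-12/2011-01-13"],
  "dimensions": ["ip"],
  "aggregations": [
    { "type": "count", "name": "rows" },
    { "type": "cardinality", "name": "distinct_users", "fields": ["user"] }
  ]
}
This returns one row per ip with a row count and an approximate distinct-user count, rather than the nested list of users asked for above.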

Related

How to fetch records from mongoDB on the basis of duplicate data in multiple fields

The requirement is to fetch the documents in a particular collection where certain fields have the same values across documents.
For example, we have the three documents below:
1. {
"_id": "finance100",
"status": "ACTIVE",
"customerId": "100",
"contactId": "contact_100",
"itemId": "profile_100",
"audit": {
"dateCreated": "2022-02-16T16:34:52.718539Z",
"dateModified": "2022-03-18T09:36:42.774271Z",
"createdBy": "41d38c187155427fa37c855a4d1868d1",
"modifiedBy": "41d38c187155427fa37c855a4d1868d1"
},
"location": "US"
}
2. {
"_id": "finance101",
"status": "ACTIVE",
"customerId": "100",
"contactId": "contact_100",
"itemId": "profile_100",
"audit": {
"dateCreated": "2022-02-16T16:34:52.718539Z",
"dateModified": "2022-03-18T09:36:42.774271Z",
"createdBy": "41d38c187155427fa37c855a4d1868d1",
"modifiedBy": "41d38c187155427fa37c855a4d1868d1"
},
"location": "US"
}
3. {
"_id": "finance101",
"status": "ACTIVE",
"customerId": "100",
"contactId": "contact_100",
"itemId": "profile_100",
"audit": {
"dateCreated": "2022-02-16T16:34:52.718539Z",
"dateModified": "2022-03-18T09:36:42.774271Z",
"createdBy": "41d38c187155427fa37c855a4d1868d1",
"modifiedBy": "41d38c187155427fa37c855a4d1868d1"
},
"location": "UK"
}
The following parameters should have the same values:
customerId
contactId
itemId
location
So I need to fetch the records where these parameters have the same values across documents.
It should fetch the first two documents (1 and 2) because the values of customerId, contactId, itemId, and location are the same in both, while in the 3rd document only the location value is different (so it will not be fetched).
Could you please share an appropriate Mongo query for this? I tried aggregation but it did not work. Thanks in advance.
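A minimal aggregation sketch along those lines (not from the question; the collection name finance and the output shape are assumptions) groups on the four fields and keeps only the documents whose combination occurs more than once:
db.finance.aggregate([
  // group by the fields that must match, collecting the original documents
  { $group: {
      _id: {
        customerId: "$customerId",
        contactId: "$contactId",
        itemId: "$itemId",
        location: "$location"
      },
      docs: { $push: "$$ROOT" },
      count: { $sum: 1 }
  } },
  // keep only combinations shared by more than one document
  { $match: { count: { $gt: 1 } } },
  // flatten back to the original documents
  { $unwind: "$docs" },
  { $replaceRoot: { newRoot: "$docs" } }
])
With the three sample documents above, this would return documents 1 and 2.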

How to avoid huge json documents in mongoDB

I am new to MongoDB modelling. I have been working on a small app that used to have just one collection with all my data, like this:
{
"name": "Thanos",
"age": 999,
"lastName": "whatever",
"subjects": [
{
"name": "algebra",
"mark": 999
},
{
"name": "quemistry",
"mark": 999
},
{
"name": "whatever",
"mark": 999
}
]
}
I know this is standard in MongoDB, since we don't have to map relationships to other collections like in a relational database. My problem is that my app is growing, and my JSON, even though it works perfectly fine, is starting to get huge since it has a few more (and quite big) nested fields:
{
"name": "Thanos",
"age": 999,
"lastName": "whatever",
"subjects": [
{
"name": "algebra",
"mark": 999
},
{
"name": "quemistry",
"mark": 999
},
{
"name": "whatever",
"mark": 999
}
],
"tutors": [
{
"name": "John",
"phone": 2000,
"status": "father"
},
{
"name": "Anne",
"phone": 200000,
"status": "mother"
}
],
"exams": [
{
"id": "exam1",
"file": "file"
},
{
"id": "exam2",
"file": "file"
},
{
"id": "exam3",
"file": "file"
}
]
}
Notice that I have simplified the JSON a lot; the nested fields have many more fields. I have two questions:
Is this a proper way to model MongoDB one-to-many relationships, and how do I avoid such long JSON documents without splitting them into more documents?
Isn't it a performance issue that I have to go through all the students just to get the subjects, for example?
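For context on the second question (an illustrative sketch, not from the question; the students collection name and the name filter are assumptions), a projection lets you return just the subjects array for one student instead of the whole document:
// fetch only the subjects array for a single student
db.students.find(
  { name: "Thanos" },        // filter (use an indexed field in practice)
  { subjects: 1, _id: 0 }    // projection: return only the subjects field
)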

How to select filtered postgresql jsonb field with performance prioritization?

A table:
CREATE TABLE events_holder(
id serial primary key,
version int not null,
data jsonb not null
);
The data field can be very large (up to 100 MB) and looks like this:
{
"id": 5,
"name": "name5",
"events": [
{
"id": 255,
"name": "festival",
"start_date": "2022-04-15",
"end_date": "2023-04-15",
"values": [
{
"id": 654,
"type": "text",
"name": "importance",
"value": "high"
},
{
"id": 655,
"type": "boolean",
"name": "epic",
"value": "true"
}
]
},
{
"id": 256,
"name": "discovery",
"start_date": "2022-02-20",
"end_date": "2022-02-22",
"values": [
{
"id": 711,
"type": "text",
"name": "importance",
"value": "low"
},
{
"id": 712,
"type": "boolean",
"name": "specificAttribute",
"value": "false"
}
]
}
]
}
I want to select the data field by version, but filtered with an extra condition: only events whose end_date > '2022-03-15'. The output must look like this:
{
"id": 5,
"name": "name5",
"events": [
{
"id": 255,
"name": "festival",
"start_date": "2022-04-15",
"end_date": "2023-04-15",
"values": [
{
"id": 654,
"type": "text",
"name": "importance",
"value": "high"
},
{
"id": 655,
"type": "boolean",
"name": "epic",
"value": "true"
}
]
}
]
}
How can I do this with maximum performance? How should I index the data field?
My primary solution:
with cte as (
select eh.id, eh.version, jsonb_agg(events) as filteredEvents from events_holder eh
cross join jsonb_array_elements(eh.data #> '{events}') as events
where version = 1 and (events ->> 'end_date')::timestamp >= '2022-03-15'::timestamp
group by id, version
)
select jsonb_set(data, '{events}', cte.filteredEvents) from events_holder, cte
where events_holder.id = cte.id;
But I don't think this is a good approach.
You can do this using a JSON path expression:
select eh.id, eh.version,
jsonb_path_query_array(data,
'$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())')
from events_holder eh
where eh.version = 1
and eh.data @? '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'
Given your example JSON, this returns:
[
{
"id": 255,
"name": "festival",
"values": [
{
"id": 654,
"name": "importance",
"type": "text",
"value": "high"
},
{
"id": 655,
"name": "epic",
"type": "boolean",
"value": "true"
}
],
"end_date": "2023-04-15",
"start_date": "2022-04-15"
}
]
Depending on your data distribution a GIN index on data or an index on version could help.
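As an illustration (these statements are an assumption, not part of the original answer), those indexes could be created like this; jsonb_path_ops is a common operator class choice because it supports the @? and @@ jsonpath operators:
-- b-tree index for the version filter
create index idx_events_holder_version on events_holder (version);

-- GIN index to support jsonpath operators (@?, @@) and containment (@>) on data
create index idx_events_holder_data on events_holder using gin (data jsonb_path_ops);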
If you need to re-construct the whole JSON content but with just a filtered events array, you can do something like this:
select (data - 'events')||
jsonb_build_object('events', jsonb_path_query_array(data, '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'))
from events_holder eh
...
(data - 'events') removes the events key from the JSON. Then the result of the JSON path query is appended back to that (partial) object.

Cloudant JSON data into dashdb table joins

I have successfully imported some JSON data into Cloudant; the JSON data has three levels. I then created the dashDB warehouse from Cloudant to put the data into relational tables. It appears that dashDB has created one table for each of the three levels in the JSON data, but has not provided me with a key to join back to the top level. Is there a customisation done somewhere that tells dashDB how to join the tables?
A sample JSON doc is below:
{
"_id": "579b56388aa56fd03a4fd0a9",
"_rev": "1-698183d4326352785f213b823749b9f8",
"v": 0,
"startTime": "2016-07-29T12:48:04.204Z",
"endTime": "2016-07-29T13:11:48.962Z",
"userId": "Ranger1",
"uuid": "497568578283117a",
"modes": [
{
"startTime": "2016-07-29T12:54:22.565Z",
"endTime": "2016-07-29T12:54:49.894Z",
"name": "bicycle",
"_id": "579b56388aa56fd03a4fd0b1",
"locations": []
},
{
"startTime": "2016-07-29T12:48:02.477Z",
"endTime": "2016-07-29T12:53:28.503Z",
"name": "walk",
"_id": "579b56388aa56fd03a4fd0ad",
"locations": [
{
"at": "2016-07-29T12:49:05.716Z",
"_id": "579b56388aa56fd03a4fd0b0",
"location": {
"coords": {
"latitude": -34.0418308,
"longitude": 18.3503616,
"accuracy": 37.5,
"speed": 0,
"heading": 0,
"altitude": 0
},
"battery": {
"is_charging": true,
"level": 0.7799999713897705
}
}
},
{
"at": "2016-07-29T12:49:48.488Z",
"_id": "579b56388aa56fd03a4fd0af",
"location": {
"coords": {
"latitude": -34.0418718,
"longitude": 18.3503895,
"accuracy": 33,
"speed": 0,
"heading": 0,
"altitude": 0
},
"battery": {
"is_charging": true,
"level": 0.7799999713897705
}
}
},
{
"at": "2016-07-29T12:50:20.760Z",
"_id": "579b56388aa56fd03a4fd0ae",
"location": {
"coords": {
"latitude": -34.0418788,
"longitude": 18.3503887,
"accuracy": 33,
"speed": 0,
"heading": 0,
"altitude": 0
},
"battery": {
"is_charging": true,
"level": 0.7799999713897705
}
}
}
]
},
{
"startTime": "2016-07-29T12:53:37.137Z",
"endTime": "2016-07-29T12:54:18.505Z",
"name": "carshare",
"_id": "579b56388aa56fd03a4fd0ac",
"locations": []
},
{
"startTime": "2016-07-29T12:54:54.112Z",
"endTime": "2016-07-29T13:11:47.818Z",
"name": "bus",
"_id": "579b56388aa56fd03a4fd0aa",
"locations": [
{
"at": "2016-07-29T13:00:08.039Z",
"_id": "579b56388aa56fd03a4fd0ab",
"location": {
"coords": {
"latitude": -34.0418319,
"longitude": 18.3503623,
"accuracy": 36,
"speed": 0,
"heading": 0,
"altitude": 0
},
"battery": {
"is_charging": false,
"level": 0.800000011920929
}
}
}
]
}
]
}
The SQL for the three tables created in dashDB, showing all the fields in each table, is below. Note there is no FK that I can see; the "_ID" fields are unique to each table.
SELECT ENDTIME,STARTTIME,USERID,UUID,V,"_ID","_REV"
FROM <schemaname>.RANGER_DATA
where "_ID" = '579b56388aa56fd03a4fd0a9'
SELECT ARRAY_INDEX,ENDTIME,NAME,STARTTIME,TOTALPAUSEDMS,"_ID"
FROM <schemaname>.RANGER_DATA_MODES
where "_ID" = '579b56388aa56fd03a4fd0b1'
SELECT ARRAY_INDEX,AT,LOCATION_BATTERY_IS_CHARGING,LOCATION_BATTERY_LEVEL,LOCATION_COORDS_ACCURACY,LOCATION_COORDS_ALTITUDE,LOCATION_COORDS_HEADING,LOCATION_COORDS_LATITUDE,LOCATION_COORDS_LONGITUDE,LOCATION_COORDS_SPEED,RANGER_DATA_MODES,"_ID"
FROM <schemaname>.RANGER_DATA_MODES_LOCATIONS
where "_ID" = '579b56388aa56fd03a4fd0b0'
Cloudant uses _id for its UID for each document. It seems that the warehousing task iterates over these documents and assumes that there is a new document every time it sees a new _id.
Because you're using _id in your modes and locations, this will produce undesired results in the SQL DB.
Renaming your _id in modes and locations to something else should fix the problem.
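For illustration (the replacement field name modeId is an assumption, not part of the answer), a renamed mode entry would look something like this before import:
{
  "startTime": "2016-07-29T12:54:22.565Z",
  "endTime": "2016-07-29T12:54:49.894Z",
  "name": "bicycle",
  "modeId": "579b56388aa56fd03a4fd0b1",
  "locations": []
}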

Custom API in Elastic search

Building my first RESTful API, I thought I'd try Elasticsearch for a base. Is there a way to customize the API in Elasticsearch to only return certain fields from the results of a query? For instance, if I have data with fname, lname, city, state, zip, and email, I may only want to return a list of fnames and cities for every query matching the city field. So something like this:
curl -XPOST "http://localhost:9200/custom_call/_search" -d'
{
"query": {
"query_string": {
"query": "Toronto",
"fields": ["city"]
}
}
}'
Would ideally return something like:
{"took": 52, "timed_out": false, "_shards": {
"total": 35,
"successful": 35,
"failed": 0
}, "hits": {
"total": 1,
"max_score": 0.375,
"hits": [
{
"_index": "persons",
"_type": "person",
"_id": "6",
"_score": 0.375,
"_source": {
"fname": "Bob",
"city": "Toronto",
}
},
{
"_index": "persons",
"_type": "person",
"_id": "13",
"_score": 0.375,
"_source": {
"fname": "Sue",
"city": "Toronto",
}
},
{
"_index": "persons",
"_type": "person",
"_id": "21",
"_score": 0.375,
"_source": {
"fname": "Jose",
"city": "Toronto",
}
}
]
}}
Not sure if Elasticsearch is set up to do this, or even if you would want it to. This is my first foray into building a RESTful API. I figure if NPR and StackOverflow like it, it's worth a shot! Thanks for the help.
Yes, you can; it doesn't look like you tried to find this out on your own. Here is how to do it:
POST localhost:9200/index/type/_search
{
"query": {
"query_string": {
"query": "Toronto",
"fields": ["city"]
}
},
"_source" :["fields_you_want_to_get"]
}
The term you are looking for is source filtering.
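Matching the curl style of the question (index and type names taken from the sample output above; the requested fields are assumed), a request using source filtering might look like this:
curl -XPOST "http://localhost:9200/persons/person/_search" -d'
{
  "query": {
    "query_string": {
      "query": "Toronto",
      "fields": ["city"]
    }
  },
  "_source": ["fname", "city"]
}'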