Apache Druid - preserving order of elements in a multi-value dimension

I am using Apache Druid to store multi-value dimensions for customers.
While loading data from a CSV, I noticed that the order of the elements in the multi-value dimension is getting changed. E.g. Mumbai|Delhi|Chennai gets ingested as ["Chennai","Mumbai","Delhi"].
It is important for us to preserve the order of the elements so that we can apply filters in the query using the MV_OFFSET function. One workaround is to prepend an explicit order marker to each element (like ["3~Chennai","1~Mumbai","2~Delhi"]), but this hampers plain group-by aggregations.
Is there any way to preserve the order of the elements in a multi-value dimension during load time?

Thanks to a response from Navis Ryu on the Druid Slack channel: the following dimension spec will keep the order of the elements unchanged:
"dimensions": [
"page",
"language",
{
"type": "string",
"name": "userId",
"multiValueHandling": "ARRAY"
}
]
More details about the functionality are available here.
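For a CSV source like the one in the question, two pieces of the ingestion spec are involved: the CSV input format needs a listDelimiter so that Mumbai|Delhi|Chennai is split into an array, and the dimensionsSpec carries the multiValueHandling setting. A minimal sketch of the relevant fragments, assuming hypothetical column names ts, customerId and cities:
"inputFormat": {
  "type": "csv",
  "columns": ["ts", "customerId", "cities"],
  "listDelimiter": "|"
},
"dimensionsSpec": {
  "dimensions": [
    "customerId",
    {
      "type": "string",
      "name": "cities",
      "multiValueHandling": "ARRAY"
    }
  ]
}
With "multiValueHandling": "ARRAY" the values are stored in the order they appear in the input row, whereas the default handling (SORTED_ARRAY) sorts them.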

Related

list contains for structure data type in DMN decision table

I am planning to use Drools for executing the DMN models. However, I am having trouble writing a condition in a DMN decision table where the input is an array of objects with a structured data type and the condition is to check whether the array contains an object with specific fields. For example:
Input to decision table is as below:
[
  {
    "name": "abc",
    "lastname": "pqr"
  },
  {
    "name": "xyz",
    "lastname": "lmn"
  },
  {
    "name": "pqr",
    "lastname": "jkl"
  }
]
Expected output: True if the above list contains an element that matches {"name": "abc", "lastname": "pqr"}, i.e. both fields match on the same element in the list.
I see that FEEL has support for list contains, but I could not find the syntax for the case where the objects in the array are not of primitive types like number or string but structures. So, I need help writing this condition in the decision table.
Thanks!
Edited description:
I am trying to achieve the following using the decision table, where details is a list of info structures. Unfortunately, as you can see, I am not getting the desired output even though my input list contains the specific element I am looking for.
Input: details = [{"name": "hello", "lastname": "world"}]
Expected Output = "Hello world" based on condition match in row 1 of the decision table.
Actual Output = null
NOTE: In row 2 of the decision table, I check a condition where I am only interested in matching the name field.
The content of the DMN file can be found here.
In this question the overall need and requirements for the decision table are not clear.
Regarding the part of the question about:
True if the above list contains an element that matches {"name": "abc", "lastname": "pqr"}
...
I see that FEEL has support for list contains, but I could not find the syntax for the case where the objects in the array are not of primitive types like number or string but structures.
This can indeed be achieved with the list contains() function, described here.
Example expression
list contains(my list, {"name": "abc", "lastname": "pqr"})
where my list is the verbatim FEEL list from the original question statement.
Example run:
giving the expected output, true.
Naturally, two contexts (complex structures) are the same if all their properties and fields are equivalent.
In DMN, there are multiple ways to achieve the same result.
If I understand the real goal of your use case correctly, I want to suggest a better approach that is much easier to maintain from a design point of view.
First of all, you have a list of users as input, so these are the data types:
Then, you have to structure your decision a bit:
The decision node at least one user match will go through the user list and check whether there is at least one user that matches the conditions inside the matching BKM.
at least one user match can be implemented with the following FEEL expression:
some user in users satisfies matching(user)
The great benefit of this approach is that you can reason about a specific element of your list inside the matching BKM, which makes the matching decision table extremely straightforward:

Scroll vs (from+size) pagination vs search_after in stateless data sync APIs

I have an ES index which stores the unique key and last updated date for each document.
I need to write an API which will be used to sync the data related to this key, for example a delta sync based on the stored date (e.g. give me data updated after 3rd Mar 2020).
Rough ES mapping:
{
  "mappings": {
    "userdata": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "userId": {
          "type": "long"
        },
        "userUUID": {
          "type": "keyword"
        },
        "uniqueKey": {
          "type": "keyword"
        },
        "updatedTimestamp": {
          "type": "date"
        }
      }
    }
  }
}
I will use this ES index to find the list of unique keys matching the date filter and build the remaining details for each key from Cassandra.
The API is stateless.
The number of documents matching the date filter could range from a few thousand to a few hundred thousand.
Now, when syncing such data, the client will need to paginate the results.
To paginate, I plan to use 'lastSynchedUniqueKey'. For each subsequent call, the client will provide this value and the API will internally perform a range query on this field and fetch the data with uniqueKey > lastSynchedUniqueKey.
So, the ES query will have the following components (a rough sketch is shown after this list):
search query: (date range query) + (uniqueKey > lastSynchedUniqueKey) + (query on username)
sort: on uniqueKey in asc order
size: 100 --> this is the max page size (suggest whether it can be changed based on the total number of documents to be synced; the only concern is that I don't want to load the ES cluster with these queries, since there are other indices in the cluster which are used for user-facing searches)
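A rough sketch of what such a request body could look like against the mapping above (the date and the lastSynchedUniqueKey value are placeholders, and the term filter on userUUID stands in for the "query on username" part, which is an assumption):
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "range": { "updatedTimestamp": { "gt": "2020-03-03" } } },
        { "range": { "uniqueKey": { "gt": "<lastSynchedUniqueKey>" } } },
        { "term": { "userUUID": "<some-user-uuid>" } }
      ]
    }
  },
  "sort": [
    { "uniqueKey": "asc" }
  ]
}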
Which is the better option to perform pagination in this case?
pagination using (from + size) with the same filter and sort params: I know this will not be performant.
scroll: with the same filter and sort params
The ES documentation suggests using '_doc' for sorting in scrolls, which is not possible in my case. Is it OK to use a field in the index instead?
Is scroll faster than search_after? (A rough search_after sketch is included below for reference.)
Please provide your inputs about sorting and pagination, both from the client perspective and internally.
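For reference, a minimal sketch of what a search_after request could look like with the sort above (the sort value is simply the uniqueKey of the last hit returned by the previous page; the date and key values are placeholders):
{
  "size": 100,
  "query": {
    "range": { "updatedTimestamp": { "gt": "2020-03-03" } }
  },
  "sort": [
    { "uniqueKey": "asc" }
  ],
  "search_after": ["<uniqueKey-of-last-hit-from-previous-page>"]
}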

unnest string array in druid in ingestion phase for better rollup

I am trying to define a Druid ingestion spec for the following case.
I have a field of type string array and I want it to be unnested and rolled up by Druid during ingestion. For example, if I have the following two entries in the raw data:
["a","b","c"]
["a", "c"]
In the rollup table I would like to see three entries:
"a" 2
"b" 1
"c" 2
If I just define this column as a dimension, the array is kept as is and the individual values are not extracted. I've looked at possible solutions with transformSpec and expressions, but with no luck.
I know how to use GROUP BY at query time to get what I need (a rough sketch of that query is included below for reference), but I'd like to have this functionality at ingestion time. Is there some way to define it in the dataSchema?
Thank you.
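For context, the query-time GROUP BY mentioned above could look roughly like the following native groupBy query; the datasource name mydata, the dimension name tags and the interval are assumptions:
{
  "queryType": "groupBy",
  "dataSource": "mydata",
  "granularity": "all",
  "intervals": ["2020-01-01/2021-01-01"],
  "dimensions": ["tags"],
  "aggregations": [
    { "type": "count", "name": "count" }
  ]
}
Grouping by a multi-value dimension produces one group per individual value, which is what yields the "a" 2, "b" 1, "c" 2 result at query time.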

What is the best way to store column oriented table in MongoDB for optimal query of data

I have a large table where the columns are user_id, user_feature_1, user_feature_2, ...., user_feature_n
So each row corresponds to a user and his or her features.
I stored this table in MongoDB by storing each column's values as an array, e.g.
{
  'name': 'user_feature_1',
  'values': [
    15,
    10,
    ...
  ]
}
I am using Meteor to pull data from MongoDB, and this way of storage facilitates fast and easy retrieval of the whole column's values for graph plotting.
However, this way of storing has a major drawback: I can't store arrays larger than 16 MB.
There are a couple of possible solutions, but none of them seems good enough:
Store each column's values using GridFS. I am not sure if Meteor supports GridFS, and it lacks support for slicing the data, i.e., I may need to get just the top 1000 values of a column.
Store the table in row-oriented format, e.g.
{
  'user_id': 1,
  'user_feature_1': 10,
  'user_feature_2': 0.9,
  ....
  'user_feature_n': 42
}
But I think this way of storing data is inefficient for querying a feature column's values.
Or is MongoDB not suitable at all and SQL is the way to go? But Meteor does not support SQL.
Update 1:
I found this interesting article which explains why large arrays in MongoDB are inefficient: https://www.mongosoup.de/blog-entry/Storing-Large-Lists-In-MongoDB.html
The following explanation is from http://bsonspec.org/spec.html:
Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
This means that we can store at most about 1 million values in a document if the values and keys are of float type (16 MB / 128 bits per key-value pair).
There is also a third option. A separate document for each user and feature:
{ u:"1", f:"user_feature_1", v:10 },
{ u:"1", f:"user_feature_2", v:11 },
{ u:"1", f:"user_feature_3", v:52 },
{ u:"2", f:"user_feature_1", v:4 },
{ u:"2", f:"user_feature_2", v:13 },
{ u:"2", f:"user_feature_3", v:12 },
You will have no document growth problems and you can query both "all values for user x" and "all values for feature x" without also accessing any unrelated data.
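As a rough illustration, the two corresponding query filters against such a collection are as simple as the following (field names as in the documents above; the values are just examples, and suitable indexes on u and f are assumed):
All values for user "1":
{ "u": "1" }
All values for feature "user_feature_1":
{ "f": "user_feature_1" }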
16 MB / 64-bit float = 2,000,000 uncompressed data points. What kind of graph requires a minimum of 2 million points per column? Instead, try:
Saving a picture on an s3 server
Using a map-reduce solution like Hadoop (probably your best bet)
Reducing numbers to small ints if they're currently floats
Computing the data on the fly, on the client (preferred, if possible)
Using a compression algo so you can save a subset & interpolate the rest
That said, a document-based DB would outperform a SQL DB in this use case, because a SQL DB would do exactly as Philipp suggested. Either way, you cannot send multiple 16 MB files to a client: if the client doesn't leave you over the poor UX, you'll go broke on server costs :-).

How to find ID for existing Fi-Ware sensors

I'm working with Fi-Ware and I would like to include existing information from smart cities in my project. Following the link below, I could find information about the ID pattern and type of the different devices (for example OUTSMART.NODE.).
https://forge.fi-ware.org/plugins/mediawiki/wiki/fiware/index.php/Publish/Subscribe_Broker_-_Orion_Context_Broker_-_User_and_Programmers_Guide#Sample_code
However, I don't know what comes after that pattern.
I've tried random numbers (OUTSMART.NODE.1 or OUTSMART.NODE.0001).
Is there some kind of list, or somewhere to find that information?
Thank you!
In order to know the particular entity IDs for a given type, you can use a "discovery" query on the type associated with the sensor, with the .* global pattern. E.g., in order to get the IDs associated with the type "santander:traffic" you could use:
{
  "entities": [
    {
      "type": "santander:traffic",
      "isPattern": "true",
      "id": ".*"
    }
  ],
  "attributes": [
    "TimeInstant"
  ]
}
Using "TimeInstant" in the "attributes" field is not strictly needed. You can leave "attribute" empty, in order to get all the attributes from each sensor. However, if you are insterested only in the IDs, "TimeInstant" would suffice and you will save length in the JSON response (the respone of the above query is around 17KB, while if you use an empty "attributes" field, the response will be around 48KB).
EDIT: since the update to Orion 0.14.0 on orion.lab.fi-ware.org on July 2nd, 2014, the NGSI API implements pagination. The default limit is 20 entities, so if you want to get all of them you will need to implement pagination in your client, using the limit and details URI parameters. Have a look at the pagination section in the user manual for details.
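As a rough sketch of that client-side pagination, the same query body would be POSTed repeatedly while increasing an offset parameter (the offset parameter and the exact endpoint path are assumptions here; check the user manual for the version you are running):
POST /NGSI10/queryContext?limit=100&offset=0&details=on
POST /NGSI10/queryContext?limit=100&offset=100&details=on
POST /NGSI10/queryContext?limit=100&offset=200&details=on
With details=on the response should also include the total number of matching entities, which tells the client when to stop paging.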