Filter out events during ingestion in Druid

I am ingesting data into the Druid database event by event, but I want to drop all the events that belong to a particular user.
For example, while ingesting data I want to drop all events having name="Ram".

You can use the filter property of the transformSpec to filter out events during ingestion. This is the standard way to specify filters in an ingestion spec. According to the docs,
Transform specs allow Druid to filter and transform input data during ingestion.
Any Druid filter can be used in a transformSpec. For example, to filter out events with a specific name, the transformSpec would look like this:
"transformSpec": {
"filter": {
"type": "not",
"field": {
"type": "selector",
"dimension": "name",
"value": "Ram"
}
},
"transforms": []
}
More details on transform spec can be found here: Transform Spec Documentation Link
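For context, the transformSpec sits inside the dataSchema of the ingestion spec. A minimal sketch, assuming a native batch (index_parallel) task and hypothetical datasource, timestamp, and dimension names:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["name", "city"] },
      "transformSpec": {
        "filter": {
          "type": "not",
          "field": {
            "type": "selector",
            "dimension": "name",
            "value": "Ram"
          }
        }
      }
    }
  }
}
```

Rows matching the inner selector filter (name = "Ram") are excluded, since the "not" filter keeps only rows that do not match.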

Related

How to update Avro schema with a reference to another schema on a Kafka topic?

What is the correct way to update an Avro schema on a Kafka topic, if this schema is used as a reference in another schema?
For example, let's say we have two Kafka topics: one uses Avro schema User {"type" : "record", "namespace" : "test", "name" : "User", "fields" : [{"name": "username", "type": "string"}]} and the second one UserAction {"type" : "record", "namespace" : "test", "name" : "UserAction", "fields" : [{"name": "action", "type": "string"}, {"name": "user", "type": "test.User"}]}.
Then I want to add an additional field to User, a "surname", so it will look like this: ... "fields" : [{"name": "username", "type": "string"}, {"name": "surname", "type": ["string", "null"], "default": null}], with a null default to make this change backward compatible. To do this I can change the Avro schema file, regenerate the POJOs using the Maven schema plugin, and then if I send a message to the first topic with a KafkaTemplate, the schema gets updated and the new field becomes visible on the topic.
The issue is that if I send a message with UserAction to the second topic, it still refers to the old User schema, without the "surname" field, even though the POJOs see it correctly. Because of this, any "surname" sent won't be stored in the topic and will be received as null in the consumer.
Is there any way to force update UserAction schema on the second topic to refer to the new User schema?
While the Confluent Schema Registry allows schema references at registration time, I don't think it will dynamically update when you change only one model.
Instead, you can define a schema "monorepo" where you package and register your schema changes together.
For example, in Avro IDL, you could define both records in one file:
record User {
  // fields here
}

record UserAction {
  User user;
  string action;
}
And if you use the Avro Maven plugin's idl-schemata action, it will reflect all User changes in both output AVSC schema files.
And when the Java models are generated, they will have all the necessary fields. You still need to update any external clients that depend on these models separately.
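As a sketch of what that output could look like, the generated UserAction.avsc would inline the full User definition at its first use, so the nested record always matches the current User. This is an illustrative reconstruction based on the example fields above, not actual tool output:

```json
{
  "type": "record",
  "namespace": "test",
  "name": "UserAction",
  "fields": [
    {
      "name": "user",
      "type": {
        "type": "record",
        "name": "User",
        "fields": [
          {"name": "username", "type": "string"},
          {"name": "surname", "type": ["string", "null"], "default": null}
        ]
      }
    },
    {"name": "action", "type": "string"}
  ]
}
```

Because the User record is embedded rather than referenced, registering this single schema keeps both definitions in sync.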

Apache Druid - preserving order of elements in a multi-value dimension

I am using Apache Druid to store multi-value dimensions for customers.
While loading data from a CSV, I noticed that the order of the elements in the multi-value dimension is getting changed. E.g. Mumbai|Delhi|Chennai gets ingested as ["Chennai","Mumbai","Delhi"].
It is important for us to preserve the order of elements in order to apply filters in the query using the MV_OFFSET function. One workaround is to prepend an explicit order token to each element (like ["3~Chennai","1~Mumbai","2~Delhi"]), but this hampers plain group-by aggregations.
Is there any way to preserve the order of the elements in a multi-value dimension during load time?
Thanks to a response from Navis Ryu on the Druid Slack channel, the following dimension spec will keep the order of the elements unchanged:
"dimensions": [
"page",
"language",
{
"type": "string",
"name": "userId",
"multiValueHandling": "ARRAY"
}
]
More details around the functionality here.
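To illustrate, Druid's default multiValueHandling is SORTED_ARRAY, which is why Mumbai|Delhi|Chennai comes back sorted; setting it to ARRAY preserves the input order. A sketch of a dimensionsSpec for the question's CSV, with a hypothetical "cities" column:

```json
"dimensionsSpec": {
  "dimensions": [
    {
      "type": "string",
      "name": "cities",
      "multiValueHandling": "ARRAY"
    }
  ]
}
```

With this spec, Mumbai|Delhi|Chennai should ingest as ["Mumbai","Delhi","Chennai"], so MV_OFFSET sees the original positions.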

Scroll vs (from+size) pagination vs search_after in stateless data sync APIs

I have an ES index which stores the unique key and last updated date for each document.
I need to write an API which will be used to sync the data related to this key as a delta (based on the stored date, e.g. "give me data updated after 3rd Mar 2020").
Rough ES mapping:
{
  "mappings": {
    "userdata": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "userId": {
          "type": "long"
        },
        "userUUID": {
          "type": "keyword"
        },
        "uniqueKey": {
          "type": "keyword"
        },
        "updatedTimestamp": {
          "type": "date"
        }
      }
    }
  }
}
I will use this ES index to find the list of unique keys matching the date filter and build the remaining details for each key from Cassandra.
The API is stateless.
The number of documents matching the date filter could range from thousands to a few hundred thousand.
Now, when syncing such data, the client will need to paginate the results.
To paginate, I plan to use a 'lastSynchedUniqueKey'. For each subsequent call, the client will provide this value, and the API will internally perform a range query on this field and fetch the data with uniqueKey > lastSynchedUniqueKey.
So, ES query will have following components:
search query: (date range query) + (uniqueKey > lastSynchedUniqueKey) + (query on username)
sort: on uniqueKey in ascending order
size: 100 --> this is the max page size (suggest if it can be changed based on the total number of documents to be synced. The only concern is that I don't want to load the ES cluster with these queries; there are other indices in the cluster used for user-facing searches.)
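The plan above is essentially what search_after formalizes: sort on a unique field and pass the last seen sort value into the next request. A sketch of such a query, assuming the mapping above and a hypothetical last-seen key of "key-123":

```json
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "range": { "updatedTimestamp": { "gte": "2020-03-03" } } }
      ]
    }
  },
  "sort": [
    { "uniqueKey": "asc" }
  ],
  "search_after": ["key-123"]
}
```

Because the query is stateless on the server side, this fits the API better than scroll, which holds a search context open between calls.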
Which is the better option to perform pagination in this case:
pagination using (from + size) with the same filter and sort params: I know this will not be performant.
scroll: with the same filter and sort params
The ES documentation suggests sorting on '_doc' for scrolls, which is not possible in my case. Is it OK to use a field in the index instead?
Is scroll faster than search_after?
Please provide your inputs about sorting and pagination, both from the client's perspective and internally.

How do I store timestamp or datetime in Orion?

I'm using Orion to store context information, and among all the entity attributes there are two that are time specific:
updated_at
created_at
How can I store this? Is there a timestamp or datetime attribute type in Orion?
You can use the attribute type date to store dates, as described in the NGSIv2 specification, section "Special attribute types". For example, you could create the following entity:
POST /v2/entities
{
  "id": "myEntity",
  "type": "myType",
  "updated_at": {
    "value": "2017-06-17T07:21:24.00Z",
    "type": "date"
  },
  "created_at": {
    "value": "2017-06-17T07:21:24.00Z",
    "type": "date"
  }
}
Note that (at least in the newest Orion version, 0.28.0) the precision is seconds. In other words, you can create/update with 2017-06-17T07:21:24.238Z but you will get back 2017-06-17T07:21:24.00Z.
Note also that Orion manages entity creation and modification dates automatically, i.e. your client doesn't need to manage them. To retrieve the entity creation and/or modification dates, include dateCreated and/or dateModified in the options URI param, as described in the NGSIv2 specification, section "Virtual attributes". For example:
GET /v2/entities/myEntity?options=dateCreated,dateModified
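With those options, the response includes the virtual attributes alongside the entity's own attributes. A sketch of what the payload could look like (the timestamp values here are purely illustrative):

```json
{
  "id": "myEntity",
  "type": "myType",
  "dateCreated": {
    "value": "2017-06-17T07:21:24.00Z",
    "type": "DateTime"
  },
  "dateModified": {
    "value": "2017-06-17T07:25:00.00Z",
    "type": "DateTime"
  }
}
```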

Select top N from my_table in druid.io - the simplest request - how?

How can I request a few "records" from Druid without any search/filter criteria, using as simple a request as possible? In a relational DB I'd do that with select top 10 from my_table.
The reason I want to do that is to make sure data exists and see its structure.
Druid has a simple SELECT query. Just use something like this:
{
  "queryType": "select",
  "dataSource": "wikipedia",
  "intervals": [
    "2015-01-01/2016-01-02"
  ],
  "pagingSpec": {
    "pagingIdentifiers": {},
    "threshold": 10
  }
}