Can we change the data type of a dimension post-ingestion in Druid?

We are doing a POC on Druid to check whether it fits our use cases. We are able to ingest data, but we are not sure about the following:
How does Druid support schemaless input? Let's say the input dimensions are at the end user's discretion, so there is no defined schema. The onus then lies on the application to identify new dimensions, identify their data types, and ingest them. Is there any way to achieve this?
How does Druid support data type changes? Let's say that at some point (say after ingesting 100 GB of data) there is a need to change the data type of a dimension from string to long, or long to string (or another type). What is the recommended way to do this without hampering ongoing ingestion?
I looked over the docs but could not find a substantial overview of either use case.

For question 1, I'd ingest everything as a string and figure it out later. It should be possible to query string columns in Druid as numbers.
The possible behaviours are explained in https://github.com/apache/incubator-druid/issues/4888:
1. Consider the values to be zero and do not try to parse string values. This seems to be the current behaviour.
2. Try to parse string values, and consider a value to be zero if it is not parseable, or null, or multi-valued.
One current inconsistency is that with expression-based column selectors (anything that goes through Parser/Expr) the behaviour is (2); see IdentifierExpr and how it handles strings that are treated as numbers. With direct column selectors the behaviour is (1). In particular, this means that e.g. a longSum aggregator behaves differently with "fieldName" : "x" than with "expression" : "x", even though you might think they should behave the same.
You can follow the entire discussion in that issue.
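For concreteness, here is a hypothetical pair of aggregators (the column "x" and the output names are made up) illustrating that difference; the dicts mirror what would go into a native query's "aggregations" array as JSON:

```python
import json

# Two longSum aggregators over the same (hypothetical) string column "x".
aggregations = [
    # Direct column selector: non-numeric string values count as 0 (behaviour 1).
    {"type": "longSum", "name": "x_sum_direct", "fieldName": "x"},
    # Expression-based selector: string values are parsed as numbers (behaviour 2).
    {"type": "longSum", "name": "x_sum_expr", "expression": "x"},
]
print(json.dumps(aggregations, indent=2))
```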
For question 2, I think a reindex of the data is necessary:
- http://druid.io/docs/latest/ingestion/update-existing-data.html
- http://druid.io/docs/latest/ingestion/schema-changes.html
I hope this helps.

1) In such cases, you don't need to specify any dimension columns in the Druid ingestion spec; Druid will treat every column that is not the timestamp as a dimension.
More detail about this approach can be found here:
Druid Schemaless Ingestion
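For illustration, a minimal sketch (column names and the excluded column are assumptions) of the relevant fragment of an ingestion spec; leaving the dimensions list empty is what turns on the schemaless behaviour:

```python
import json

ingestion_fragment = {
    "timestampSpec": {"column": "timestamp", "format": "auto"},
    "dimensionsSpec": {
        # An empty (or omitted) dimensions list tells Druid to treat every
        # non-timestamp, non-excluded input column as a string dimension.
        "dimensions": [],
        "dimensionExclusions": ["count"],  # columns you never want as dimensions
    },
}
print(json.dumps(ingestion_fragment, indent=2))
```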
2) For the second question, you can change the schema, and Druid will create new segments with the new data type while your old segments still use the old data type.
If you want all of your segments to use the new data type, you can reindex all the segments. Please check out this link for a further description of reindexing segments: http://druid.io/docs/latest/ingestion/update-existing-data.html
Additional info on schema changes can be found here:
http://druid.io/docs/latest/ingestion/schema-changes.html
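As a rough, hedged sketch (the datasource name, interval, and exact field layout are assumptions; check the linked update-existing-data doc for your Druid version), a re-indexing task combines an ingestSegment firehose that reads the existing segments back with a dimensionsSpec that declares the new type for the changed dimension:

```python
import json

reindex_fragment = {
    # Reads the already-ingested segments back as the input of the new task.
    "ioConfig": {
        "type": "index",
        "firehose": {
            "type": "ingestSegment",
            "dataSource": "my_datasource",          # assumption
            "interval": "2018-01-01/2018-02-01",    # assumption
        },
    },
    # Declares the new type for the changed dimension in the new segments.
    "dimensionsSpec": {
        "dimensions": [
            {"name": "user_id", "type": "long"},    # previously a string dimension
            "country",                              # unchanged string dimension
        ]
    },
}
print(json.dumps(reindex_fragment, indent=2))
```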

Related

How can I retain domain objects after a BigQueryIO write?

My team has a Beam pipeline where we're writing an unbounded PCollection of domain objects to BigQuery using the BigQueryIO.write() function. We're transforming the domain objects into TableRow objects inside the BigQueryIO.write().withFormatFunction(). WriteResult.getSuccessfulInserts() gives us a PCollection of TableRows that were successfully written to BigQuery, but we would rather have access to the original domain objects again, specifically only the domain objects that were successfully written.
We've come up with one solution where we add a groupingKey field to the domain objects and put the groupingKey field into the TableRows when we do the withFormatFunction call. This allows us to take the original input (PCollection<DomainObj>) and transform it into a PCollection<KV<String, DomainObj>> where the String key is the groupingKey, transform the output from writeResult.getSuccessfulTableRows into a PCollection<KV<String, TableRow>> keyed the same way, and then do a CoGroupByKey operation on the KV keys to get a PCollection<KV<DomainObj, TableRow>>. We can then drop the TableRows and end up with the PCollection of successfully written DomainObjects. There are a couple of reasons why this solution is undesirable:
- We thought of using the BigQueryIO.write().ignoreUnknownValues() option to ensure the new groupingKey field we added to the TableRows doesn't end up in our BQ tables. This is a problem because our BigQuery schema is altered from time to time by an upstream application, and there are occasional instances where we do want unknown fields to be written to the table (we just don't want this groupingKey in the table).
- The CoGroupByKey operation requires equal-length windowing on its inputs, and it's possible that the BigQueryIO.write operation could exceed that window length. This would force us to come up with complex solutions to handle items arriving past their window deadline.
Are there any more elegant solutions to write an unbounded PCollection of domain objects to BigQuery and end up with a PCollection of just the successfully written domain objects? Solutions that don't involve storing extra information in the TableRows are preferred. Thank you.
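For what it's worth, here is a minimal runnable sketch of the CoGroupByKey re-join the question describes, written with the Beam Python SDK (the question uses the Java API); written_rows below is just a stand-in for the successfully written rows, and the BigQuery write itself is omitted:

```python
import apache_beam as beam

def keep_written(element):
    # element is (grouping_key, {'domain': [domain objects], 'written': [rows]})
    _, grouped = element
    if grouped['written']:           # a row with this key made it into BigQuery
        yield from grouped['domain']

with beam.Pipeline() as p:
    # Stand-ins: 'domain' is the original PCollection of domain objects;
    # 'written_rows' plays the role of the successfully written TableRows.
    domain = p | 'Domain' >> beam.Create([{'id': 'a', 'v': 1},
                                          {'id': 'b', 'v': 2}])
    written_rows = p | 'Written' >> beam.Create([{'groupingKey': 'a'}])

    keyed_domain = domain | 'KeyDomain' >> beam.Map(lambda d: (d['id'], d))
    keyed_rows = written_rows | 'KeyRows' >> beam.Map(lambda r: (r['groupingKey'], r))

    ({'domain': keyed_domain, 'written': keyed_rows}
     | 'Join' >> beam.CoGroupByKey()
     | 'KeepWritten' >> beam.FlatMap(keep_written)
     | 'Print' >> beam.Map(print))
```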

Kafka Message Keys with Composite Values

I am working on a system that will produce Kafka messages. These messages will be organized into topics that more or less represent database tables. Many of these tables have composite keys, and this aspect of the design is out of my control. The goal is to prepare these messages in a way that they can be easily consumed by common sink connectors, without a lot of manipulation.
I will be using the schema registry and the Avro format for all of the obvious advantages. Having the entire "row" expressed as a record in the message value is fine for upsert operations, but I also need to support deletes. From what I can tell, this means my messages need keys so I can have "tombstone" messages. Also keep in mind that I want to avoid any sort of transforms unless absolutely necessary.
In a perfect world, the message key would be a "record" that included strongly typed key-column values, and the message value would have the other column values (both controlled by the schema registry). However, it seems like a lot of the tooling around Kafka expects message keys to be a single primitive value. This makes me wonder if I need to compute a key value by concatenating my multiple key columns into a single string value and keeping the individual columns in my message value. Is this right, or am I missing something? What other options do I have?
I'm assuming that you know the relationship between the message key and partition assignment.
As per my understanding, there is nothing that stops you from using a complex type like STRUCT as a key, with or without a key schema. Please refer to the API here. If you are using an out-of-the-box connector that does not support a complex type as the key, then you may have to write your own Single Message Transform (SMT) to move the key attributes into the value.
The approach you mentioned, concatenating columns to create the key while keeping the same columns in the value, would work in many cases if you don't want to write code. The only downside I can see is that your messages would be larger than required. If you don't need a partition assignment strategy or ordering requirement, then the message can have no key or a random key.
I wanted to follow up with an answer that solved my issue:
The strategy I mentioned of using a concatenated string technically worked; however, it certainly wasn't very elegant.
My original issue in using a structured key was that I wasn't using the correct converter for deserializing the key, which led to other errors. Once I used the Avro converter, I was able to get my multi-part key and use it effectively.
Both approaches, when implemented appropriately, allowed me to produce valid tombstone messages that could represent deletes.
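For illustration, a hedged sketch with the confluent-kafka Python client (the topic, field names, and broker/registry addresses are made up) of producing a message whose key is an Avro record holding the composite key columns, plus a tombstone keyed the same way for a delete:

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Composite key expressed as an Avro record (illustrative field names).
KEY_SCHEMA = """
{"type": "record", "name": "OrderKey",
 "fields": [{"name": "region",   "type": "string"},
            {"name": "order_id", "type": "long"}]}
"""
VALUE_SCHEMA = """
{"type": "record", "name": "OrderValue",
 "fields": [{"name": "amount", "type": "double"}]}
"""

registry = SchemaRegistryClient({'url': 'http://localhost:8081'})
producer = SerializingProducer({
    'bootstrap.servers': 'localhost:9092',
    'key.serializer': AvroSerializer(registry, KEY_SCHEMA),
    'value.serializer': AvroSerializer(registry, VALUE_SCHEMA),
})

key = {'region': 'us-east', 'order_id': 42}
producer.produce('orders', key=key, value={'amount': 9.99})  # upsert
producer.produce('orders', key=key, value=None)              # tombstone (delete)
producer.flush()
```

Note that the default partitioner hashes the serialized key bytes, so the key schema and field order need to stay consistent across producers for the same logical key to keep landing on the same partition.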

Handle range-based queries

We came across a case where we want to retrieve data from a time series. Let's say we have time-based data: ["t1-t2": {data1}, "t2-t3": {data2}, "t3-t4": {data3}]
With this kind of data, we want to look up the exact data for a given time. For example, for a given time t1.5 the result should be data1, and for t2.6 it should be data2.
To solve this, we are planning to store the data in a sorted map in Aerospike, as shown below: {"t1": {data1}, "t2": {data2}, "t3": {data3}}
When a client asks for t1.5, we must return data1. To achieve this, we implemented a UDF at the server level that does a binary search for the nearest key at or below the given input (i.e. t1.5), which returns t1's value, i.e. data1.
Is there a better way of achieving this, since it incurs a cost at the server level for every request? Even a UDF doing a binary search requires loading all the data into memory; can we avoid that?
We are planning to use Aerospike for this. Is there a better data store to handle such queries?
Thinking aloud: storing t1-t2 and t2-t3 is redundant on t2. Just store t1; t2 is inferred from the next key:value pair, i.e. {t1: data, t2: data, ...}, stored key-sorted (map policy). You must know the maximum difference between any t1 and t2. Build a secondary index on MAPKEY with type numeric (this essentially does the bulk of the sort work for you upfront in RAM). Search for records where t is between t - maxdiff and t + maxdiff; that gives you a small subset of records, which you pass to your UDF. Invoke the UDF on this small subset to return the data; it will be a very simple UDF. Note: UDFs are limited to 128 concurrent executions at any given time.
I'm not sure I understand the problem. First, you should be inserting into a K-ordered map, where the key is the timestamp (in millisecond or second or another resolution). The value would be a map of the attributes.
To get back any range of time you'd use a get_by_key_interval (for example the Python client's Client.map_get_by_key_range). You can figure out how to build the range but it's simply all between two timestamps.
Don't use a UDF for this, it is not going to perform as well or scale as the native map/list operations would.
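To make that concrete, a hedged sketch with the Aerospike Python client (namespace, set, and bin names are illustrative, and the exact shape of the returned entries may vary by client version): store the series as a key-ordered map of timestamp -> data and read a key range back server-side, with no UDF involved.

```python
import aerospike
from aerospike_helpers.operations import map_operations as mo

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()
record_key = ('test', 'series', 'sensor-1')           # namespace, set, user key
ordered = {'map_order': aerospike.MAP_KEY_ORDERED}     # key-ordered map policy

# Store the series as a key-ordered map of timestamp -> data in one bin.
client.operate(record_key, [
    mo.map_put('points', 1000, {'data': 'data1'}, ordered),
    mo.map_put('points', 2000, {'data': 'data2'}, ordered),
    mo.map_put('points', 3000, {'data': 'data3'}, ordered),
])

# Lookup: fetch the entries in [t - max_gap, t] server-side and take the last
# one, i.e. the nearest key at or below t (max_gap is the known maximum spacing).
t, max_gap = 1500, 1000
_, _, bins = client.operate(record_key, [
    mo.map_get_by_key_range('points', t - max_gap, t + 1,
                            aerospike.MAP_RETURN_KEY_VALUE),
])
print(bins['points'])   # key/value entries in key order; the last one is data1 here

client.close()
```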

Scalar values vs arrays in MongoDB

Consider the following two collections and the notes below. Which one do you think is more appropriate?
// #1
{x:'a'}
{x:'b'}
{x:'c'}
{x:['d','e']}
{x:'f'}
//#2
{x:['a']}
{x:['b']}
{x:['c']}
{x:['d','e']}
{x:['f']}
Some facts:
- field x usually has only one value (95%) and sometimes has more (5%).
- MongoDB treats {x:['a']} like {x:'a'} when querying.
- MongoVUE shows the scalar values in #1 directly, but shows Array[0] for #2.
- Using #1, when you want to append a new value you have to cast the data type.
- #1 may be a little faster in some CRUD operations (?).
To amplify ZaidMasud's point, I recommend sticking with scalars or arrays and not mixing the two. If you have unavoidable reasons for having both (legacy data, say), then I recommend that you get very familiar with how MongoDB queries work with arrays; it is not intuitive at first glance. See for example this puzzler.
From a schema design perspective, even though MongoDB allows you to store different data types for a key value pair, it's not necessarily a good idea to do so. If there is no compelling reason to use different data types, it's often best to use the same datatype for a given key/value pair.
So given that reasoning, I would prefer #2. Application code will generally be simpler in this case. Additionally, if you ever need to use the Aggregation Framework, you will find it useful to have uniform types.
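To make the querying point concrete, here is a small pymongo sketch (local server and a made-up throwaway collection) showing that an equality query matches both the scalar and the array form, while $push only works once the field really is an array:

```python
from pymongo import MongoClient
from pymongo.errors import WriteError

coll = MongoClient()['demo']['x_test']   # local server, throwaway collection
coll.drop()
coll.insert_many([
    {'_id': 1, 'x': 'a'},        # scalar form (#1)
    {'_id': 2, 'x': ['a']},      # array form (#2)
    {'_id': 3, 'x': ['d', 'e']},
])

print(coll.count_documents({'x': 'a'}))  # 2 -- matches both 'a' and ['a']
print(coll.count_documents({'x': 'd'}))  # 1 -- array elements match equality queries

coll.update_one({'_id': 2}, {'$push': {'x': 'z'}})      # fine: x is already an array
try:
    coll.update_one({'_id': 1}, {'$push': {'x': 'z'}})  # x is a scalar string here
except WriteError as exc:
    print('cannot $push onto a scalar:', exc)
```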

What's more efficient: a Core Data fetch or manipulating/creating arrays?

I have a Core Data application and I would like to get results from the database based on certain parameters, for example grabbing only the events that occurred in the last week, or the events that occurred in the last month. Is it better to do a fetch for the whole entity and then work with that result array to create arrays for each situation, or is it better to use predicates and make multiple fetches?
The answer depends on a lot of factors. I'd recommend perusing the documentation's description of the various store types. If you use the SQLite store type, for example, it's far more efficient to make proper use of date range predicates and fetch only those in the given range.
Conversely, say you need a non-standard match, such as searching for a substring in an encrypted string: you'll have to pull everything in, decrypt the strings, do your search, and note the matches.
On the far end of the spectrum, you have the binary store type, which means the whole thing will always be pulled into memory regardless of what kind of fetches you might do.
You'll need to describe your managed object model and the types of fetches you plan to do in order to get a more specific answer.