unnest string array in druid in ingestion phase for better rollup

I am trying to define a Druid ingestion spec for the following case.
I have a field of type string array and I want it to be unnested and rolled up by Druid during ingestion. For example, if I have the following two entries in the raw data:
["a","b","c"]
["a", "c"]
In the rollup table I would like to see three entries:
"a" 2
"b" 1
"c" 2
If I just define this column as a dimension, the array is kept as is and the individual values are not extracted. I've looked at possible solutions with transformSpec and expressions, but with no help.
I know how to use GROUP BY at query time to get what I need, but I'd like to have this functionality at ingestion time. Is there some way to define it in the dataSchema?
Thank you.
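For reference, a minimal sketch of where these pieces sit in a native batch dataSchema (the datasource, timestamp column and field names are placeholders). Declaring the array column as a plain string dimension stores it as a multi-value dimension, i.e. the behavior described above, and as far as I can tell transformSpec expressions work row by row, so they cannot split one input row into several rolled-up rows:

{
  "dataSchema": {
    "dataSource": "my_datasource",
    "timestampSpec": { "column": "ts", "format": "auto" },
    "dimensionsSpec": { "dimensions": ["my_array_field"] },
    "metricsSpec": [ { "type": "count", "name": "count" } ],
    "granularitySpec": {
      "segmentGranularity": "day",
      "queryGranularity": "none",
      "rollup": true
    },
    "transformSpec": { "transforms": [] }
  }
}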

Related

Nested fields and partitioning in MongoDB to BigQuery template

I want to use the MongoDB to BigQuery Dataflow template and I have 2 questions:
Can I somehow configure partitioning for the destination table? For example, if I want to dump my database every day?
Can I map nested fields in MongoDB to records in BQ instead of columns with string values?
I see the User option with values FLATTEN and NONE, but FLATTEN will flatten documents for 1 level only.
Could either of these two approaches help?
Create a destination table with the structure definition before running the Dataflow job
Use a UDF (a sketch follows below)
I tried to use the MongoDB to BigQuery Dataflow template with the User option set to FLATTEN.
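For the UDF idea, this is roughly the kind of transform I have in mind, assuming the usual contract for Google-provided template UDFs (the function receives one document serialized as a JSON string and returns a JSON string); the function name and the underscore-joined column naming are only illustrative:

function transform(inJson) {
  var doc = JSON.parse(inJson);

  // Recursively flatten nested objects into underscore-separated column names,
  // e.g. {"address": {"city": "Paris"}} becomes {"address_city": "Paris"}.
  function flatten(obj, prefix, out) {
    for (var key in obj) {
      var value = obj[key];
      var name = prefix ? prefix + '_' + key : key;
      if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
        flatten(value, name, out);
      } else {
        out[name] = value;
      }
    }
    return out;
  }

  return JSON.stringify(flatten(doc, '', {}));
}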

What does the distinct on clause mean in cloud datastore and how does it affect the reads?

This is what the cloud datastore doc says but I'm having a hard time understanding what exactly this means:
A projection query that does not use the distinct on clause is a small operation and counts as only a single entity read for the query itself.
Grouping
Projection queries can use the distinct on clause to ensure that only the first result for each distinct combination of values for the specified properties will be returned. This will return only the first result for entities which have the same values for the properties that are being projected.
Let's say I have a table for questions and I only want to get the question text sorted by the created date. Would this be counted as a single read and the rest as small operations?
If your goal is just to project the date and text fields, you can create a composite index on those two fields. When you query, this is a small operation, with all the results counting as a single entity read. You are not trying to de-duplicate (so no distinct on) in this case, and so it stays a small operation with a single read.
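As a sketch of that with the Node.js client (the kind and property names here are made up for the example), a projection over created and text backed by a composite index on those two properties:

const {Datastore} = require('@google-cloud/datastore');

async function listQuestionTexts() {
  const datastore = new Datastore();

  // Projection query: only 'created' and 'text' come back, served from the
  // composite index on (created, text). No distinct on, so per the quoted
  // docs it counts as a small operation plus a single entity read.
  const query = datastore
    .createQuery('Question')
    .select(['created', 'text'])
    .order('created', {descending: true});

  const [rows] = await datastore.runQuery(query);
  return rows;
}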

When putting data into elasticsearch, how do you handle fields that sometimes have different structures?

I'm getting several mapping exceptions when trying to insert data from my MongoDB into Elastic. After some investigative work, it seems that the error comes from the fact that I have a field in my db that is sometimes an array of strings, while other times an array of objects.
Meaning, for some documents in mongo it will have this:
{"my_field": ["one", "two"]}
while others have this:
{"my_field": [{"key":"value", "key2":"value"}, {"key":"value", "key2":"value"}, ...]}
I'm having a difficult time in pinning down how exactly this situation is handled in Elastic.
You will need to massage the data before it is indexed so that it conforms to Elasticsearch's rules. One approach is for my_field to be a nested document - for one document you might have
{"my_field": {"string_value": ["one", "two"]}}
and for another
{"my_field": {"doc_value": {"key":"value", "key2":"value"}}}
This assumes that the values for key and key2 will always have the same type and that there is a small number of possible keys in this document. If this document contains arbitrary data you might be better off indexing as
{"my_field": [{"key": "key1", "string_value": "value"},
{"key": "key2", "int_value": "123"}]}
As for how you massage, one option is to do this before you send the data to Elasticsearch, as in the sketch below. The downside is that the _source attribute will obviously contain the transformed data.
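A minimal sketch of that client-side massaging, in plain JavaScript, using the field names from the examples above (string_value / doc_value are the wrapper names suggested earlier, not anything Elasticsearch requires):

// Wrap my_field so that string values and object values land in different,
// consistently typed sub-fields before the document is indexed.
function normalizeMyField(doc) {
  var value = doc.my_field;
  if (value === undefined) {
    return doc;
  }
  var items = Array.isArray(value) ? value : [value];
  var allStrings = items.every(function (item) {
    return typeof item === 'string';
  });
  doc.my_field = allStrings
    ? {string_value: items}   // ["one", "two"] -> {"string_value": ["one", "two"]}
    : {doc_value: items};     // [{...}, {...}] -> {"doc_value": [{...}, {...}]}
  return doc;
}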
Another approach is to send the data to Elasticsearch as-is, but to have a transform in the mapping that Elasticsearch will run to transform the data before indexing.

MongoDB compare two big data collections

I want to compare two very big collections; the main goal of the operation is to know which elements have changed or been deleted.
My collections 1 and 2 have the same structure and have more than 3 million records.
Example:
record 1 {id:'7865456465465', name:'tototo', info:'tototo'}
So I want to know: which elements have changed, and which elements are not present in collection 2.
What is the best solution to do this?
1) Define what equality of 2 documents means. For me it would be: both documents should contain all fields with exactly the same values, given that their ids are unique. Note that Mongo does not guarantee field order, and if you update a field it might move to the end of the document, which is fine.
2) I would use some framework that can connect to Mongo and fetch data while converting it to a map-like data structure or even JSON. For instance, I would go with Scala + Lift Record (db.coll.findAll()) + Lift JSON. The Lift JSON library has a Diff function that will give you a diff of 2 JSON docs.
3) Finally, I would sort both collections by ids, open db cursors, then iterate and compare (a sketch of this follows below).
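A rough mongo-shell sketch of step 3, assuming id is unique and indexed in both collections (the collection names coll1 and coll2 are placeholders); it walks two sorted cursors in parallel, merge-style:

var c1 = db.coll1.find().sort({id: 1});
var c2 = db.coll2.find().sort({id: 1});
var d1 = c1.hasNext() ? c1.next() : null;
var d2 = c2.hasNext() ? c2.next() : null;

while (d1 !== null) {
  if (d2 === null || d1.id < d2.id) {
    // Present in collection 1 but missing from collection 2.
    print('deleted: ' + d1.id);
    d1 = c1.hasNext() ? c1.next() : null;
  } else if (d1.id > d2.id) {
    // Present only in collection 2 (new document); skip or report as needed.
    d2 = c2.hasNext() ? c2.next() : null;
  } else {
    // Same id in both: compare the documents. tojson() is field-order
    // sensitive, so per point 1 you may want to compare a normalized form
    // (sorted keys, _id excluded) instead.
    if (tojson(d1) !== tojson(d2)) {
      print('changed: ' + d1.id);
    }
    d1 = c1.hasNext() ? c1.next() : null;
    d2 = c2.hasNext() ? c2.next() : null;
  }
}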
If the schema is flat, which in your case it is, you can use a free tool (dataq.io) to compare the data in two tables.
Disclaimer: I am the founder of this product.

Iterate on the columns of a row in MongoDB Map Reduce

My Mongo schema is as follows:
KEY: Client ID
Value: { Location1: Bitwise1, Location2: Bitwise2, ...}
So the column names would be the names of locations. This data represents the locations a client has been to, and the bitwise value captures the days for which the client was present at that location.
I'd like to run a map-reduce query on the above schema. In it, I'd like to iterate over all the columns of the Value for a row. How can that be done? Could someone give a small code snippet which explains it clearly? I'm having a hard time finding it on the web.
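To illustrate, something along these lines is what I am after, assuming each document looks like { _id: <clientId>, value: { Location1: <bits>, Location2: <bits>, ... } }; the collection name visits and the emitted value are only examples, and the path would change if the locations are top-level fields:

var mapFn = function () {
  // 'this' is the current document; walk every location key in its value.
  for (var location in this.value) {
    if (this.value.hasOwnProperty(location)) {
      emit(location, {clients: 1, bits: this.value[location]});
    }
  }
};

var reduceFn = function (key, values) {
  // Must return the same shape as the emitted values.
  var out = {clients: 0, bits: 0};
  values.forEach(function (v) {
    out.clients += v.clients;
    out.bits = out.bits | v.bits;   // e.g. OR the day bitmasks together
  });
  return out;
};

db.visits.mapReduce(mapFn, reduceFn, {out: {inline: 1}});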