Ingesting multi-valued dimension from comma sep string - druid

I have event data from Kafka with the following structure that I want to ingest in Druid
{
"event": "some_event",
"id": "1",
"parameters": {
"campaigns": "campaign1, campaign2",
"other_stuff": "important_info"
}
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion so far looks as follows
{
"type": "kafka",
"dataSchema": {
"dataSource": "event-data",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "posix"
},
"flattenSpec": {
"fields": [
{
"type": "root",
"name": "parameters"
},
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
}
},
"dimensionSpec": {
"dimensions": [
"event",
"id",
"campaigns"
]
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"type": "uniform",
...
}
},
"tuningConfig": {
"type": "kafka",
...
},
"ioConfig": {
"topic": "production-tracking",
...
}
}
Which however leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in flattenSpec nor did I find something like a string split expression that may be used as a transformSpec.
Any suggestions?

Try setting useFieldDiscover: false in your ingestion spec. when this flag is set to true (which is default case) then it interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link to use flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html

Looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using expression string_to_array should do the trick!

Related

What column type do I need for this nested data in BigQuery?

I have a JSON schema for a Kafka stream that I am integrating with BigQuery but I can't get the data type correct at the BigQuery end. This is the schema:
"my_meta_data": {
"type": "object",
"properties": {
"property_1": {
"type": "array",
"items": {
"type": "number"
}
},
"property_2": {
"type": "array",
"items": {
"type": "number"
}
},
"property_3": {
"type": "array",
"items": {
"type": "number"
}
}
}
}
I tried this in the JSON file defining the BigQuery table:
{
"name": "my_meta_data",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "property_1",
"type": "INT64",
"mode": "REPEATED"
},
{
"name": "property_2",
"type": "INT64",
"mode": "REPEATED"
},
{
"name": "property_3",
"type": "INT64",
"mode": "REPEATED"
}
]
}
I am using a hosted connector from Confluent, the Kafka provider, and the error message is
The connector is failing because it cannot write a non-array element to an array column. Please check the schemas of the data in Kafka and the BigQuery tables the connector is writing to, and ensure that all data from Kafka that will be written to an array column in BigQuery is contained in an array.
I haven't defined an array column though, I've defined a RECORD column that contains arrays. Any ideas how I can set up the BigQuery table to capture this data? Thanks in advance.

JSON Schema - can array / list validation be combined with anyOf?

I have a json document I'm trying to validate with this form:
...
"products": [{
"prop1": "foo",
"prop2": "bar"
}, {
"prop3": "hello",
"prop4": "world"
},
...
There are multiple different forms an object may take. My schema looks like this:
...
"definitions": {
"products": {
"type": "array",
"items": { "$ref": "#/definitions/Product" },
"Product": {
"type": "object",
"oneOf": [
{ "$ref": "#/definitions/Product_Type1" },
{ "$ref": "#/definitions/Product_Type2" },
...
]
},
"Product_Type1": {
"type": "object",
"properties": {
"prop1": { "type": "string" },
"prop2": { "type": "string" }
},
"Product_Type2": {
"type": "object",
"properties": {
"prop3": { "type": "string" },
"prop4": { "type": "string" }
}
...
On top of this, certain properties of the individual product array objects may be indirected via further usage of anyOf or oneOf.
I'm running into issues in VSCode using the built-in schema validation where it throws errors for every item in the products array that don't match Product_Type1.
So it seems the validator latches onto that first oneOf it found and won't validate against any of the other types.
I didn't find any limitations to the oneOf mechanism on jsonschema.org. And there is no mention of it being used in the page specifically dealing with arrays here: https://json-schema.org/understanding-json-schema/reference/array.html
Is what I'm attempting possible?
Your general approach is fine. Let's take a slightly simpler example to illustrate what's going wrong.
Given this schema
{
"oneOf": [
{ "properties": { "foo": { "type": "integer" } } },
{ "properties": { "bar": { "type": "integer" } } }
]
}
And this instance
{ "foo": 42 }
At first glance, this looks like it matches /oneOf/0 and not oneOf/1. It actually matches both schemas, which violates the one-and-only-one constraint imposed by oneOf and the oneOf fails.
Remember that every keyword in JSON Schema is a constraint. Anything that is not explicitly excluded by the schema is allowed. There is nothing in the /oneOf/1 schema that says a "foo" property is not allowed. Nor does is say that "foo" is required. It only says that if the instance has a keyword "foo", then it must be an integer.
To fix this, you will need required and maybe additionalProperties depending on the situation. I show here how you would use additionalProperties, but I recommend you don't use it unless you need to because is does have some problematic properties.
{
"oneOf": [
{
"properties": { "foo": { "type": "integer" } },
"required": ["foo"],
"additionalProperties": false
},
{
"properties": { "bar": { "type": "integer" } },
"required": ["bar"],
"additionalProperties": false
}
]
}

Druid GroupBy query gives different response when changing the order by fields

I have a question regarding an Apache Druid incubating query.
I have a simple group by to select the number of calls per operator. See here my query:
{
"queryType": "groupBy",
"dataSource": "ivr-calls",
"intervals": [
"2019-12-06T00:00:00.000Z/2019-12-07T00:00:00.000Z"
],
"dimensions": [
{
"type": "lookup",
"dimension": "operator_id",
"outputName": "value",
"name": "ivr_operator",
"replaceMissingValueWith": "Unknown"
},
{
"type": "default",
"dimension": "operator_id",
"outputType": "long",
"outputName": "id"
}
],
"granularity": "all",
"aggregations": [
{
"type": "longSum",
"name": "calls",
"fieldName": "calls"
}
],
"limitSpec": {
"type": "default",
"limit": 999999,
"columns": [
{
"dimension": "value",
"direction": "ascending",
"dimensionOrder": "numeric"
}
]
}
}
In this query I order the result by the "value" dimension, I receive 218 results.
I noticed that some of the records are duplicate. (I see some operators two times in my resultset). This is strange because in my experience all dimensions which you select are also used for grouping by. So, they should be unique.
If I add an order by to the "id" dimension, I receive 183 results (which is expected):
"columns": [
{
"dimension": "value",
"direction": "ascending",
"dimensionOrder": "numeric"
},
{
"dimension": "id",
"direction": "ascending",
"dimensionOrder": "numeric"
}
]
The documentation tells me nothing about this strange behavior (https://druid.apache.org/docs/latest/querying/limitspec.html).
My previous experience with druid is that the order by is just "ordering".
I am running druid version 0.15.0-incubating-iap9.
Can anybody tell me why there is a difference in the result set based on the column sorting?
I resolved this problem for now by specifying all columns in my order by.
Issue seems to be related to a bug in druid. See: https://github.com/apache/incubator-druid/issues/9000

Create index from raw JSON using elastic4s

I need to create an index that has some context completion suggester mappings as in (https://www.elastic.co/guide/en/elasticsearch/reference/current/suggester-context.html). As I see in https://github.com/sksamuel/elastic4s/issues/452 this doesn't seem to be supported by the DSL.
So, it would be nice to create an index from a raw JSON string (similar to raw queries). Is it possible to achieve this?
Considering that you have the JSON mapping in a variable rawMapping like this:
val rawMapping =
"""{
"service": {
"properties": {
"name": {
"type" : "string"
},
"tag": {
"type" : "string"
},
"suggest_field": {
"type": "completion",
"context": {
"color": {
"type": "category",
"path": "color_field",
"default": ["red", "green", "blue"]
},
"location": {
"type": "geo",
"precision": "5m",
"neighbors": true,
"default": "u33"
}
}
}
}
}
}"""
You can create the index using the raw mapping like this:
client.execute {
create index "services" source rawMapping
}

OData REST Filter for deeply nested data

I have a working REST request that returns a large results collection. (trimmed here)
The original URL is:
http://intranet.domain.com//_api/SP.UserProfiles.PeopleManager/GetPropertiesFor(accountName=#v)?#v='domain\kens'&$select=AccountName,DisplayName,Email,Title,UserProfileProperties
The response is:
{
"d": {
"__metadata": {
"id": "stuff",
"uri": "morestuff",
"type": "SP.UserProfiles.PersonProperties"
},
"AccountName": "domain\\KenS",
"DisplayName": "Ken Sanchez",
"Email": "KenS#domain.com",
"Title": "Research Assistant",
"UserProfileProperties": {
"results": [
{
"__metadata": {
"type": "SP.KeyValue"
},
"Key": "UserProfile_GUID",
"Value": "1c419284-604e-41a8-906f-ac34fd4068ab",
"ValueType": "Edm.String"
},
{
"__metadata": {
"type": "SP.KeyValue"
},
"Key": "SID",
"Value": "S-1-5-21-2740942301-4273591597-3258045437-1132",
"ValueType": "Edm.String"
},
{
"__metadata": {
"type": "SP.KeyValue"
},
"Key": "ADGuid",
"Value": "",
"ValueType": "Edm.String"
},
{
"__metadata": {
"type": "SP.KeyValue"
},
"Key": "AccountName",
"Value": "domain\\KenS",
"ValueType": "Edm.String"
}...
Is it possible to change the REST request with a $filter that only returns the Key Values from the results collection where Key=SID OR Key= other values?
I only need about 3 values from the results collection by name.
In OData, you can't filter an inner feed.
Instead you could try to query the entity set that UserProfileProperties comes from and expand the associated SP.UserProfiles.PersonProperties entity.
The syntax will need to be adjusted for your scenario, but I'm thinking something along these lines:
service.svc/UserProfileProperties?$filter=Key eq 'SID' and RelatedPersonProperties/AccountName eq 'domain\kens'&$expand=RelatedPersonProperties
That assumes you have a top-level entity set of UserProfileProperties and each is tied back to a single SP.UserProfiles.PersonProperties entity via a navigation property called (in my example) RelatedPersonProperties.