array_sort function sorting the data based on first numerical element in Array<Struct> - scala

I need to sort an array<struct> based on a particular element of the struct. I am trying to use the array_sort function, and I can see that by default it sorts the array, but based on the first numerical element. Is this the expected behavior? Sample code and output below.
val jsonData = """
{
  "topping":
  [
    { "id": "5001", "id1": "5001", "type": "None" },
    { "id": "5002", "id1": "5008", "type": "Glazed" },
    { "id": "5005", "id1": "5007", "type": "Sugar" },
    { "id": "5007", "id1": "5002", "type": "Powdered Sugar" },
    { "id": "5006", "id1": "5005", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "id1": "5004", "type": "Chocolate" },
    { "id": "5004", "id1": "5003", "type": "Maple" }
  ]
}
"""
val json_df = spark.read.json(Seq(jsonData).toDS)
val sort_df = json_df.select(array_sort($"topping").as("sort_col"))
display(sort_df)
OUTPUT
As you can see, the above output is sorted based on the id element, which is the first numerical element in the struct.
Is there any way to specify the element based on which sorting can be done?

Is this the expected behavior?
Short answer: yes!
For arrays with struct-type elements, it compares the first fields to determine the order, and if they are equal it compares the second fields, and so on. You can see this easily by modifying your input data so that two entries share the same id value; you'll then notice the order is determined by the second field.
The array_sort function uses the collection operation ArraySort. If you look into its code, you'll find how it handles complex data types like StructType.
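For example, a quick sketch of that check, with a hypothetical second entry reusing id 5001 (same imports and session as the snippets above):
val dupData = """{"topping": [{ "id": "5001", "id1": "5009", "type": "Sugar" }, { "id": "5001", "id1": "5001", "type": "None" }]}"""
val dup_df = spark.read.json(Seq(dupData).toDS)
dup_df.select(array_sort($"topping").as("sort_col")).show(false)
// => [[5001, 5001, None], [5001, 5009, Sugar]]  -- the id values tie, so id1 decides the order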
Is there any way to specify the element based on which sorting can be done?
One way is to use a transform function to change the positions of the struct fields so that the first field contains the values you want the ordering to be based on. For example, if you want to order by the field type:
val transform_expr = "TRANSFORM(topping, x -> struct(x.type as type, x.id as id, x.id1 as id1))"
val transform_df = json_df.select(expr(transform_expr).alias("topping_transform"))
transform_df.show(false)
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|topping_transform |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|[[None, 5001, 5001], [Glazed, 5002, 5008], [Sugar, 5005, 5007], [Powdered Sugar, 5007, 5002], [Chocolate with Sprinkles, 5006, 5005], [Chocolate, 5003, 5004], [Maple, 5004, 5003]]|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
val sort_df = transform_df.select(array_sort($"topping_transform").as("sort_col"))
sort_df.show(false)
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|sort_col |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|[[Chocolate, 5003, 5004], [Chocolate with Sprinkles, 5006, 5005], [Glazed, 5002, 5008], [Maple, 5004, 5003], [None, 5001, 5001], [Powdered Sugar, 5007, 5002], [Sugar, 5005, 5007]]|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
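If you need the struct fields back in their original order after sorting, the same TRANSFORM trick can be applied in reverse (a sketch, reusing sort_df from above):
val restore_expr = "TRANSFORM(sort_col, x -> struct(x.id as id, x.id1 as id1, x.type as type))"
val restored_df = sort_df.select(expr(restore_expr).alias("topping"))
restored_df.show(false)
// first element: [5003, 5004, Chocolate] -- still ordered by type, fields back in id, id1, type order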

Related

How to use `in` expression in Mapbox GL map.setFilter

I'm trying to filter a GeoJSON layer (called "destinations") based on a set of conditions. My "destinations" layer has a property called uniqueID of type string, which most of the time holds only one numeric ID, but sometimes has more than one. I would like to filter this layer (using map.setFilter) based on an array of IDs. The filter should do something like this: if any of the uniqueID values in the "destinations" layer are found in the array of IDs, then filter that feature out.
Here is a snippet of my "destinations" layer:
{ "type": "Feature", "properties": { "destinationLon": -74.20716879, "destinationLat": 40.69097357, "uniqueID": "2029" }, "geometry": { "type": "Point", "coordinates": [ -74.20716879, 40.69097357 ] } },
{ "type": "Feature", "properties": { "destinationLon": -74.20670807, "destinationLat": 40.69137214, "uniqueID": "984,985" }, "geometry": { "type": "Point", "coordinates": [ -74.20670807, 40.69137214 ] } },
{ "type": "Feature", "properties": { "destinationLon": -74.20651489, "destinationLat": 40.71887889, "uniqueID": "1393" }, "geometry": { "type": "Point", "coordinates": [ -74.20651489, 40.71887889 ] } }
And here is a sample of the array of IDs I might use: [2000, 984, 1393]
I have tried building a filter with the in expression (documentation here) like this:
let thisFilter = ["any"].concat(uniqueIDs.map(id => ["in", id, ["get", "uniqueID"]]));
map.setFilter("destinations", thisFilter);
But I keep on getting this error message:
Error: "layers.destinations.filter[0][2]: string, number, or boolean expected, array found"
However, the documentation states the following:
["in",
keyword: InputType (boolean, string, or number),
input: InputType (array or string)
]: boolean
The third argument in the expression is supposed to be an array. Why then am I getting that error?
Any ideas? Thank you!
I think your error has nothing to do with in. You could discover this by simplifying your expression, and separating out the any from the individual in parts.
Your error is that any takes a variable number of parameters:
['any', ['in', ...], ['in', ...]]
Whereas you are passing in an array:
['any', [['in', ...], ['in', ...]]]
I would rewrite your expression like this:
let thisFilter = ["any", ...uniqueIDs.map(id => ["in", id, ["get", "uniqueID"]])];
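With the sample ID array from the question, the spread passes each in clause to any as its own argument, so the filter ends up shaped like this:
["any",
 ["in", 2000, ["get", "uniqueID"]],
 ["in", 984, ["get", "uniqueID"]],
 ["in", 1393, ["get", "uniqueID"]]
]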

Ingesting a multi-valued dimension from a comma-separated string

I have event data from Kafka with the following structure that I want to ingest into Druid:
{
  "event": "some_event",
  "id": "1",
  "parameters": {
    "campaigns": "campaign1, campaign2",
    "other_stuff": "important_info"
  }
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "event-data",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "posix"
        },
        "flattenSpec": {
          "fields": [
            {
              "type": "root",
              "name": "parameters"
            },
            {
              "type": "jq",
              "name": "campaigns",
              "expr": ".parameters.campaigns"
            }
          ]
        }
      },
      "dimensionSpec": {
        "dimensions": [
          "event",
          "id",
          "campaigns"
        ]
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      ...
    }
  },
  "tuningConfig": {
    "type": "kafka",
    ...
  },
  "ioConfig": {
    "topic": "production-tracking",
    ...
  }
}
This, however, leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your flattenSpec. When this flag is set to true (the default), all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level are interpreted as columns.
Here is a good example and reference for using the flattenSpec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
It looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so the string_to_array expression should do the trick!
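For reference, a minimal sketch of what such a transform might look like, assuming the values are separated by ", " and that a transformSpec is added under dataSchema (adjust the field name and delimiter to your data):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "campaigns",
      "expression": "string_to_array(\"campaigns\", ', ')"
    }
  ]
}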

Parsing Really Messy Nested JSON Strings

I have a series of deeply nested JSON strings in a pyspark dataframe column. I need to explode and filter based on the contents of these strings and would like to add them as columns. I've tried defining the StructTypes, but each time it continues to return an empty DF.
I tried using json_tuple to parse, but there are no common keys to rejoin the dataframes, and the row numbers don't match up. I think it might have to do with some null fields.
The sub field can be nullable.
Sample JSON
{
  "TIME": "datatime",
  "SID": "yjhrtr",
  "ID": {
    "Source": "Person",
    "AuthIFO": {
      "Prov": "Abc",
      "IOI": "123",
      "DETAILS": {
        "Id": "12345",
        "SId": "ABCDE"
      }
    }
  },
  "Content": {
    "User1": "AB878A",
    "UserInfo": "False",
    "D": "ghgf64G",
    "T": "yjuyjtyfrZ6",
    "Tname": "WE ARE THE WORLD",
    "ST": null,
    "TID": "BPV 1431: 1",
    "src": "test",
    "OT": "test2",
    "OA": "test3",
    "OP": "test34"
  },
  "Test": false
}

How can I count all possible subdocument elements for a given top element in Mongo?

Not sure I am using the right terminology here, but assume the following oversimplified JSON structure in Mongo:
{
  "_id": 1234,
  "labels": {
    "label1": {
      "id": "l1",
      "value": "abc"
    },
    "label3": {
      "id": "l2",
      "value": "def"
    },
    "label5": {
      "id": "l3",
      "value": "ghi"
    },
    "label9": {
      "id": "l4",
      "value": "xyz"
    }
  }
}
{
  "_id": 5678,
  "labels": {
    "label1": {
      "id": "l1",
      "value": "hjk"
    },
    "label5": {
      "id": "l5",
      "value": "def"
    },
    "label10": {
      "id": "l10",
      "value": "ghi"
    },
    "label24": {
      "id": "l24",
      "value": "xyz"
    }
  }
}
I know my base element name (labels in the example), but I do not know the various sub-elements it can have (so in this case the labelx names).
How can I group / count the existing elements (as if I were using a wildcard) so that I get a distinct overview like
"label1":2
"label3":1
"label5":2
"label9":1
"label10":1
"label24":1
as a result? So far I have only found examples where you actually need to know the element names. But I don't know them, and I want some way to get all possible sub-element names for a given top element for easy review.
In reality the label names can be pretty wild; I used labelx in the example for readability.
You can try the below aggregation in MongoDB 3.4.
Use $objectToArray to transform the object into an array of key-value pairs, followed by $unwind and $group on the key to count occurrences.
db.col.aggregate([
{"$project":{"labels":{"$objectToArray":"$labels"}}},
{"$unwind":"$labels"},
{"$group":{"_id":"$labels.k","count":{"$sum":1}}}
])
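Against the two sample documents above, this pipeline returns one document per distinct label key (in no particular order), e.g.:
{ "_id" : "label1", "count" : 2 }
{ "_id" : "label5", "count" : 2 }
{ "_id" : "label3", "count" : 1 }
{ "_id" : "label9", "count" : 1 }
{ "_id" : "label10", "count" : 1 }
{ "_id" : "label24", "count" : 1 }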

Filtering nested results in an OData query

I have an OData query returning a bunch of items. The results come back looking like this:
{
  "d": {
    "__metadata": {
      "id": "http://dev.sp.swampland.local/_api/SP.UserProfiles.PeopleManager/GetPropertiesFor(accountName=#v)",
      "uri": "http://dev.sp.swampland.local/_api/SP.UserProfiles.PeopleManager/GetPropertiesFor(accountName=#v)",
      "type": "SP.UserProfiles.PersonProperties"
    },
    "UserProfileProperties": {
      "results": [
        {
          "__metadata": {
            "type": "SP.KeyValue"
          },
          "Key": "UserProfile_GUID",
          "Value": "66a0c6c2-cbec-4abb-9e25-cc9e924ad390",
          "ValueType": "Edm.String"
        },
        {
          "__metadata": {
            "type": "SP.KeyValue"
          },
          "Key": "ADGuid",
          "Value": "System.Byte[]",
          "ValueType": "Edm.String"
        },
        {
          "__metadata": {
            "type": "SP.KeyValue"
          },
          "Key": "SID",
          "Value": "S-1-5-21-2355771569-1952171574-2825027748-500",
          "ValueType": "Edm.String"
        }
      ]
    }
  }
}
In reality, there are a lot of items (100+) coming back in the UserProfileProperties collection; however, I'm only looking for a few whose Key matches certain values, and I can't figure out exactly what my filter needs to be. I've tried $filter=UserProfileProperties/Key eq 'SID' but that still gives me everything. I'm also trying to figure out how to pull back multiple items.
Ideas?
I believe you're overlooking that each entry in results has a Key, not UserProfileProperties itself, so UserProfileProperties/Key doesn't actually exist. Instead, because results is an array, you must either check a certain position (e.g. results(1)) or use the OData functions any or all.
Try $filter=UserProfileProperties/results/any(r: r/Key eq 'SID') if you want all the profiles where at least one of the keys is SID, or use
$filter=UserProfileProperties/results/all(r: r/Key eq 'SID') if you want the profiles where every result has a Key equal to 'SID'.
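If you also need to pull back items matching several keys at once, one option (assuming the endpoint accepts or inside the lambda) is to combine conditions, e.g.:
$filter=UserProfileProperties/results/any(r: r/Key eq 'SID' or r/Key eq 'ADGuid')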