ksql query array of structs by value in the struct - apache-kafka

I have an array of structs. The array is the output of delta processing, which is why the number of structs differs (and struct B sometimes has position 1, but can also have position 5).
A struct in the array looks like this:
{
val:{
asString:"12345"
},
position:"1200"
}
Another Example:
{
val:{
asString:"12927"
},
position:"1120"
}
I want to query the value as a string by using the position key. I know which position keys to query, but I don't know how to get the value, because the position value is part of the struct.
The whole object has a key that identifies it.
I thought of exploding the array and creating a new event with the object-identifying key.
The object itself has about 6000 lines, which would produce a huge number of events (which I am trying to avoid).
Maybe a switch to Kafka Streams is necessary?

I found a solution that works well:
In my case I knew which keys (or position attributes in the structs) I was searching for.
You can therefore use the FILTER function on arrays to filter out the struct whose position attribute holds the specific value I was searching for; see the sketch below.
You can do this for every value you are searching for (if the position key is missing, you just get a null value, but in my case that was totally fine).
That's a good way to implement it in ksqlDB without using Kafka Streams.
(FYI: I am trying to avoid Kafka Streams because it is not a Confluent Cloud native product and requires deeper Java knowledge.)
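For illustration, a minimal ksqlDB sketch of that approach; the stream and column names (readings, object_id, delta_array) are assumptions, and the position values come from the examples above. FILTER keeps only the structs whose position matches, and [1] picks the first (and only) match, which yields NULL when nothing matches:

CREATE STREAM positions_extracted AS
  SELECT
    object_id,
    FILTER(delta_array, s => s->position = '1200')[1]->val->asString AS value_at_1200,
    FILTER(delta_array, s => s->position = '1120')[1]->val->asString AS value_at_1120
  FROM readings
  EMIT CHANGES;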

Related

Creating attributes for elements of a PCollection in Apache Beam

I'm fairly new to Apache Beam and was wondering if I can create my own attributes for elements in a PCollection.
I went through the docs but could not find anything.
Example 2 in this ParDo doc shows how to access the TimestampParam and the WindowParam, which are - from what I understand - attributes of each element in a PCollection:
import apache_beam as beam

class AnalyzeElement(beam.DoFn):
    def process(
            self,
            elem,
            timestamp=beam.DoFn.TimestampParam,
            window=beam.DoFn.WindowParam):
        yield [...]
So my question is whether it is possible to create such attributes (e.g. TablenameParam) for the elements in a PCollection, and if not, whether there is some kind of workaround to achieve that.
What you are describing would simply be part of the element. For your example, the TablenameParam would be a field of the type you add to the PCollection.
The reason that WindowParam and TimestampParam are treated differently is that they are often propagated implicitly, and are a part of the Beam model for every element no matter the actual data. Most code that is only operating on the main data does not need to touch them.
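As a minimal sketch of that idea (the element shape, the attach_tablename helper, the literal table name, and the sample input are hypothetical assumptions), the table name simply travels as a field of each element rather than as a DoFn parameter:

import apache_beam as beam

class AnalyzeElement(beam.DoFn):
    def process(self, elem, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
        # The custom "attribute" is just a field of the element itself.
        yield {"tablename": elem["tablename"], "payload": elem["payload"], "ts": timestamp}

def attach_tablename(row):
    # Hypothetical helper: wrap each raw row together with the table it came from.
    return {"tablename": "my_table", "payload": row}

with beam.Pipeline() as p:
    (p
     | beam.Create([{"id": 1}, {"id": 2}])   # stand-in for the real source
     | beam.Map(attach_tablename)
     | beam.ParDo(AnalyzeElement()))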

Kafka Message Keys with Composite Values

I am working on a system that will produce Kafka messages. These messages will be organized into topics that more or less represent database tables. Many of these tables have composite keys, and this aspect of the design is out of my control. The goal is to prepare these messages in a way that they can be easily consumed by common sink connectors, without a lot of manipulation.
I will be using the Schema Registry and Avro format for all of the obvious advantages. Having the entire "row" expressed as a record in the message value is fine for upsert operations, but I also need to support deletes. From what I can tell, this means my message needs a key so I can have "tombstone" messages. Also keep in mind that I want to avoid any sort of transforms unless absolutely necessary.
In a perfect world, the message key would be a "record" that included strongly-typed key-column values and the message value would have the other column values (both controlled by the schema registry). However, it seems like a lot of the tooling around kafka expects message keys to be a single, primitive value. This makes me wonder if I need to compute a key value where I concatenate my multiple key columns into a single string value and keep the individual columns in my message value. Is this right or am I missing something? What other options do I have?
I'm assuming that you know the relationship between the message key and partition assignment.
As per my understanding, there is nothing that stops you from using a complex type like STRUCT as a key, with or without a key schema. Please refer to the API here. If you are using an out-of-the-box connector that does not support a complex type as the key, then you may have to write your own Single Message Transformation (SMT) to move the key attributes into the value.
The approach that you mentioned (concatenating columns to create the key and keeping those same columns in the value) would work in many cases if you don't want to write code. The only downside I can see is that your messages would be larger than required. If you have no partition assignment strategy or ordering requirement, then the message can have no key or a random key.
I wanted to follow up with an answer that solved my issue:
The strategy I mentioned of using a concatenated string technically worked. However, it certainly wasn't very elegant.
My original issue with using a structured key was that I wasn't using the correct converter for deserializing the key, which led to other errors. Once I used the Avro converter, I was able to get my multi-part key and use it effectively.
Both approaches, when implemented appropriately, allowed me to produce valid tombstone messages that could represent deletes.
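For reference, a hedged sketch of the Kafka Connect converter settings this usually comes down to (the Schema Registry URL is a placeholder); the key converter must be the Avro converter so the structured key deserializes correctly:

# Hypothetical sink connector / worker properties
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081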

Is mgo bson marshalling guaranteed to preserve order of struct components?

I am saving Go structs in MongoDB using mgo. I wish to save them with a hash of that struct (and a secret) to determine whether they have been tampered with (and I do not wish the Mongo DB itself to have the secret).
Currently I am hashing the structs by serializing them using gob, whose ordering of struct components is well-defined. This works great, except that when I go to reread the struct from Mongo, things have changed - to be precise, the time values in Mongo have truncated accuracy compared to Go - therefore the hashes do not match up.
My planned workaround for this is simply to marshal and unmarshal the struct from BSON before calculating the hash, i.e.:
Marshal struct to BSON
Unmarshal struct from BSON (thereby losing precision on time)
Marshal struct to gob and hash the resultant []byte
Put hash in struct
Save struct to Mongo
Now, that's more than a little circuitous.
If I could guarantee that the BSON itself always preserved the order of components in structs, I could:
Marshal struct to BSON
Hash the resultant []byte
Put hash in struct
Save struct to Mongo
Which would be less nasty (albeit that it would still require converting to BSON twice).
Any ideas?
Answering your actual question, yes, you can trust mgo/bson to always marshal struct fields in the order they are observed in the code. Although not yet documented (issue), this is very much intentional behavior, and even mgo itself depends on it internally.
Now, responding to your intended usage: don't do that. The fact that fields are in a guaranteed order does not mean the binary format is stable as a whole. Even now there are known ways the output can change without breaking existing clients, but that would break a hash of its output.
Here is some suggested reading to understand a bit better the ins and outs of the problem you are trying to solve (I'm the author):
http://blog.labix.org/2013/06/25/strepr-v1
http://blog.labix.org/2013/07/03/reference-strepr-implementation
These are precisely about the issue of marshaling an arbitrary struct or map with arbitrary fields and keys in a stable way, to obtain a stable hash out of it for signatures. The reference implementation is in Go.
BSON keeps order. I would do all the secret handling in a couple of simple wrapper functions (HMAC-SHA256 is one possible choice for the "hash based on orig bytes" step):

import (
    "crypto/hmac"
    "crypto/sha256"

    "gopkg.in/mgo.v2/bson"
)

var secret = []byte("shared-secret") // kept outside the database

type StructWrap struct {
    Value bson.Raw
    Hash  []byte
}

func save2mgo(in interface{}) error {
    orig, err := bson.Marshal(in)
    if err != nil {
        return err
    }
    mac := hmac.New(sha256.New, secret)
    mac.Write(orig)
    str := StructWrap{Value: bson.Raw{Kind: 0x03, Data: orig}, Hash: mac.Sum(nil)} // 0x03 = document
    // Save str to mgo here; because Value is already bson.Raw, it will not be marshalled again.
    _ = str
    return nil
}

func unmarshal(str StructWrap, inType interface{}) error {
    // Read str from mgo and recompute the HMAC over str.Value.Data to check str.Hash here.
    return str.Value.Unmarshal(inType)
}

Hope this helps.

Traversing an object model recursively to search for a value

I need to recursively traverse a very large and complicated object model to search for a particular value of an ID.
The value I'm looking for is in a property called "ID", but objects with a particular ID might have many children, some of which are arrays, each having a different ID, and each of those children in turn can have a different ID and so on and so forth.
So if I give you an object, say $web, and you know that deep down in its object model there is a value you are looking for, how do you look for it using recursion and reflection?
Note: This is a generic powershell/recursion/programming question even though the topic is SharePoint.
How about using Format-Custom? For example, getting a lot of nested member data from a DirectoryInfo object is done like so:
(gci)[0] | fc > test.txt
This will give some 8800 lines of data for the expanded members.
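Format-Custom only dumps the expanded members to text, though. If you want the matching objects back, a small recursive helper is another option; the following is a rough PowerShell sketch (the Find-ById name, the hard-coded "ID" property name, and the depth limit are assumptions, and some SharePoint property getters may throw and need a try/catch):

function Find-ById {
    param(
        [Parameter(Mandatory)] $InputObject,
        [Parameter(Mandatory)] [string] $Id,
        [int] $Depth = 0,
        [int] $MaxDepth = 20
    )
    if ($null -eq $InputObject -or $Depth -gt $MaxDepth) { return }

    # Emit the current object if it carries the ID we are looking for.
    if ($InputObject.PSObject.Properties['ID'] -and $InputObject.ID -eq $Id) {
        $InputObject
    }

    # Reflect over every property and recurse into nested objects and collections.
    foreach ($prop in $InputObject.PSObject.Properties) {
        $value = $prop.Value
        if ($value -is [System.Collections.IEnumerable] -and $value -isnot [string]) {
            foreach ($item in $value) {
                Find-ById -InputObject $item -Id $Id -Depth ($Depth + 1) -MaxDepth $MaxDepth
            }
        }
        elseif ($null -ne $value -and $value.GetType().IsClass -and $value -isnot [string]) {
            Find-ById -InputObject $value -Id $Id -Depth ($Depth + 1) -MaxDepth $MaxDepth
        }
    }
}

# Example usage: Find-ById -InputObject $web -Id $targetId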

Storing the order of embedded documents in a separate array

I have a set of objects that the user can sort arbitrarily. I would like to make my client remember the sorting of the set of objects so that when the user visits the page again the ordering he/she chose will be preserved. However, the client-side framework should also be able to quickly lookup the objects from whatever array/hashmap they are stored in based upon the ordering. What is the most efficient way of doing this?
The best way I have found for doing this is using an array that stores the IDs of the objects in the particular order the user chose. From there, I can look up the objects themselves by converting the object array into a hashmap keyed by ID using Underscore.js.
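As a small sketch of that idea (the id field name and the sample data are assumptions; _.indexBy is Underscore's build-a-lookup-by-unique-key helper):

// Objects as they come from the server; the "id" field is assumed.
var objects = [
  { id: "a1", name: "First" },
  { id: "b2", name: "Second" },
  { id: "c3", name: "Third" }
];

// The user-chosen ordering, persisted separately as a plain array of IDs.
var order = ["b2", "c3", "a1"];

// O(1) lookup table keyed by id, built with Underscore.
var byId = _.indexBy(objects, "id");

// Reconstruct the ordered list whenever the page is shown again.
var ordered = order.map(function (id) { return byId[id]; });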