KafkaStreams: Sequential access vs time-bound access in SessionStore

When you materialize a store from a SessionWindowedKStream, you are forced to do it as a SessionStore, because the method takes a
Materialized<K,VR,SessionStore<org.apache.kafka.common.utils.Bytes,byte[]>> materialized
parameter. So what you get is a SessionStore<org.apache.kafka.common.utils.Bytes,byte[]>.
In this type of store, you can fetch by key, but not by key and time as in a WindowStore, even though the key type is a Windowed<K>. So you would have to iterate through it to find the time-related entries, which should be less efficient than querying by time.
How can you use the aggregated session store of Windowed<K> in order to query the store with (key, time)?
Or, put differently, why are there no findSessions-like methods (i.e., time-bound access) in ReadOnlySessionStore, while there are in SessionStore?
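Concretely, the iteration looks like this (a sketch; the store name, key, and time bounds are illustrative, and depending on your Kafka version ReadOnlySessionStore may have gained time-bound methods since):

ReadOnlySessionStore<String, Long> store =
        streams.store("session-store", QueryableStoreTypes.sessionStore());
try (KeyValueIterator<Windowed<String>, Long> iter = store.fetch("user-42")) {
    while (iter.hasNext()) {
        KeyValue<Windowed<String>, Long> entry = iter.next();
        // keep only sessions that overlap [fromTime, toTime]
        if (entry.key.window().end() >= fromTime
                && entry.key.window().start() <= toTime) {
            // process the matching session
        }
    }
}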

Related

Is it possible to extract the Schema ID when using KStream processing?

I am processing messages from a sourceTopic to a targetTopic using a KStream (via the map method). In the map method, I generate a new schema for the target topic from the incoming messages (since I need to extract explicit fields). But because the KStream operation is per message, I wish to avoid regenerating the schema for every message; instead, I want to cache the schema ID of the incoming messages (for both key and value) and generate a new target schema only when the source schema changes.
Is there a way to do this via the KStream object or from the key/value objects used in the map method?
Update:
I was not able to get the schema ID for my use case above. As a workaround, I cached the schema in a local variable, checked on each iteration whether it had changed, and processed further as required.
You will only have access to the ID if you use Serdes.Bytes(); after the records are deserialized, you'll only have access to the Schema.
The AvroSerdes from Confluent already cache the IDs, though.
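For reference, with Serdes.Bytes() the ID can be read straight off the Confluent wire format, which starts every payload with a 0x0 magic byte followed by a 4-byte big-endian schema ID. A minimal sketch (the method name is illustrative):

import java.nio.ByteBuffer;

static int schemaIdOf(byte[] payload) {
    ByteBuffer buf = ByteBuffer.wrap(payload);
    if (buf.get() != 0x0) {
        throw new IllegalArgumentException("not in Confluent wire format");
    }
    return buf.getInt(); // the 4-byte schema ID follows the magic byte
}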

Achieving tombstoning in Kafka

I have a KStream with key-value pairs that are grouped by key. Every key should be unique, and the only reason it might not be is that the same key is streamed again with null as its value.
In my streams application I need to filter out all records for a key if the value of one of that key's records is null (a tombstone). How do I get started?
KStream<Key, Value> table = builder.stream(kafkaProperties.getTopicName());
// If key exists multiple times, check for null value and if found
// remove / ignore record
So when no tombstone arrives, the key needs to stay; but when a tombstone arrives, the complete key with all its values needs to be thrown away.
This is quite tricky to achieve. Data is processed linearly, and thus you would need to buffer all key-value pairs in a state store, e.g., using a transform(). You would insert each input key-value pair into the key-value store; if you receive a null value, you delete the key from the store.
The difficult part is deciding/knowing that there won't be any null value for a key in the future. How to determine this depends on your overall setup, and there is no generic answer. If at some point you can decide that a key-value pair in the store cannot receive a future tombstone, you can send it downstream and also delete it from the store.
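A minimal sketch of the buffering mechanism, assuming (purely for illustration) that a key counts as settled if no tombstone has arrived by the next punctuation; the store name and interval are hypothetical:

builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("buffer-store"),
        Serdes.String(), Serdes.String()));

stream.transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @SuppressWarnings("unchecked")
    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("buffer-store");
        // periodically forward and evict records that survived without a tombstone
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, ts -> {
            try (KeyValueIterator<String, String> iter = store.all()) {
                while (iter.hasNext()) {
                    KeyValue<String, String> entry = iter.next();
                    context.forward(entry.key, entry.value);
                    store.delete(entry.key);
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        if (value == null) {
            store.delete(key);      // tombstone: drop the buffered record
        } else {
            store.put(key, value);  // buffer until considered settled
        }
        return null; // nothing emitted inline; the punctuator forwards
    }

    @Override
    public void close() { }
}, "buffer-store");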

Unable to retrieve metadata for a state store key in Kafka Streams

I'm trying to use Kafka Streams with a state store distributed among two instances. Here's how the store and the associated KTable are defined:
KTable<String, Double> userBalancesTable = kStreamBuilder.table(
"balances-table",
Consumed.with(String(), Double()),
Materialized.<String, Double, KeyValueStore<Bytes, byte[]>>as(BALANCES_STORE).withKeySerde(String()).withValueSerde(Double())
);
Next, I have some stream processing logic which aggregates some data into this balances-table KTable:
transactionsStream
.leftJoin(...)
...
.aggregate(...)
.to("balances-table", Produced.with(String(), Double()));
And at some point, from a REST handler, I query the state store:
ReadOnlyKeyValueStore<String, Double> balances = streams.store(BALANCES_STORE, QueryableStoreTypes.<String, Double>keyValueStore());
return Optional.ofNullable(balances.get(userId)).orElse(0.0);
Which works perfectly - as long as I have a single stream processing instance.
Now, I'm adding a second instance (note: my topics all have 2 partitions). As explained in the docs, the state store BALANCES_STORE is distributed among the instances based on the key of each record (in my case, the key is a user ID). Therefore, an instance must:
Make a call to KafkaStreams#metadataForKey to discover which instance is handling the part of the state store containing the key it wants to retrieve (see the sketch after this list)
Make an RPC (e.g., REST) call to that instance to retrieve it
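For reference, that lookup looks roughly like this (a sketch; the host comparison and forwarding are illustrative):

StreamsMetadata metadata = streams.metadataForKey(
        BALANCES_STORE, userId, Serdes.String().serializer());
if (metadata == null) {
    // no metadata available (e.g., during a rebalance)
} else if (metadata.hostInfo().equals(thisInstance)) {
    // key is handled locally: query the store directly
} else {
    // forward the request to metadata.host() + ":" + metadata.port()
}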
My problem is that the call to KafkaStreams#metadataForKey always returns a null metadata object. However, KafkaStreams#allMetadataForStore() returns metadata describing both instances. So it behaves as if it doesn't know about the key I'm querying, although looking it up in the state stores works.
Additional note: my application.server property is correctly set.
Thank you!

Storing custom temporary data in Sitecore xDB

I am using Sitecore 8.1 with xDB enabled (MongoDB). I would like to store the user-roles of the visiting users in the xDB, so I can aggregate on these data in my reports. These roles can change over time, so one user could have one set of roles at some point in time and another set of roles at a later time.
I could go and store these user-roles as custom facets on the Contact entity, but as they may change for a user from visit to visit, I will lose historical data if I update the data in the facet every time the user logs in (e.g., I will not be able to tell which roles a given user had at some given visit).
Instead, I could create a custom IElement for my facet data, and store the roles along with a timestamp saying when the given roles were registered, but this model may be hard to handle during the reporting phase, where I would need to connect the interaction data with the role-data based on timestamps every time I generate a report.
Is it possible to store these custom data in the xDB in something else than the Contact collection? Can I store custom data in the Interactions collection? There is a property called Tracker.Current.Session.Interaction.CustomValues which sounds like what I need, but if I store data here, will I be able to perform proper aggregation/reporting on the data? Any other approaches I haven't thought about?
CustomValues
Yes, the CustomValues dictionary is what I would use in your case. This dictionary will get serialized to MongoDB as a nested document of every interaction (unless the dictionary is empty).
Also note that, since CustomValues is a member of the base class Sitecore.Analytics.Model.Entity, this dictionary is available in many other data classes of xDB. For example, you can store custom values in PageData and PageEventData objects.
Since CustomValues takes an object of any class, your custom data class needs some extra things for it to be successfully saved to and subsequently loaded from MongoDB:
It has to be marked as [Serializable].
It needs to be registered in the MongoDB driver like this:
using Sitecore.Analytics.Data.DataAccess.MongoDb;
// [...]
MongoDbObjectMapper.Instance.RegisterModelExtension<YourCustomClassName>();
This needs to be done only once per application lifetime - for example, in an initialize pipeline processor.
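For illustration, such a class and its use could look like this (a sketch; the class name, properties, and dictionary key are all hypothetical):

using System;
using Sitecore.Analytics;

[Serializable]
public class VisitRoles
{
    public string[] Roles { get; set; }
    public DateTime RecordedAt { get; set; }
}

// e.g., in a login or session-start handler:
Tracker.Current.Session.Interaction.CustomValues["VisitRoles"] =
    new VisitRoles { Roles = currentRoles, RecordedAt = DateTime.UtcNow };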
Your own storage
Of course, you don't have to use Sitecore's API to store your custom data. The alternative would be to manually save data to a custom MongoDB collection or an SQL table. You can then read that data in your aggregation processor, finding it by the ID of the currently processed interaction.
The benefit of this approach is that you can decide where and how your data is stored. The downside is extra work of implementing and maintaining this data storage.

Mitigation techniques for Insecure direct object reference

What are the mitigation techniques for preventing horizontal privilege escalation through insecure direct object references, other than securing the session? In other words, how do we achieve access control at the horizontal level, where functionality, data, etc. are accessible to everyone on the same level? If we are breaching privilege, I feel the only possible way other than hijacking a session is through an insecure direct object reference. Or is there some other way that I'm not aware of?
Maybe use the link below to prevent insecure direct object references: http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/org/owasp/esapi/AccessReferenceMap.html
Whether horizontal or vertical, an IDOR occurs when an authorization check has been forgotten on the path to an object in the system. It is critical if the reached object is sensitive, such as displaying an invoice that belongs to another user of the system.
So, I advise using randomly generated IDs or UUIDs to avoid IDOR altogether: the attacker would have to guess valid random ID values that belong to another user.
If that sounds hard to apply because changing the IDs isn't possible: even if you use auto-incremented object IDs, you can apply a salted hash function to them and put the results in a hash map as key-value pairs, then store that map in the session.
Instead of exposing the auto-increment IDs to the user, you expose the hash values of the corresponding IDs. When you get a value back from the user, you find the actual ID by looking it up in the session's key-value map. Even if the attacker spoofs a generated value, it will not exist in the map, which means the IDOR is no longer exploitable.
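A minimal sketch of that indirection pattern (the same idea behind ESAPI's AccessReferenceMap; class and method names here are illustrative), with one instance kept per user session:

import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class IndirectReferenceMap {
    private final Map<String, Long> aliasToId = new HashMap<>();
    private final Map<Long, String> idToAlias = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    // expose this alias to the user instead of the raw auto-increment ID
    public String aliasFor(long id) {
        return idToAlias.computeIfAbsent(id, k -> {
            String alias = Long.toHexString(random.nextLong());
            aliasToId.put(alias, k);
            return alias;
        });
    }

    // resolve an alias coming back from the user; null means a spoofed
    // or unknown value, so the request must be rejected
    public Long resolve(String alias) {
        return aliasToId.get(alias);
    }
}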
To read all about IDOR and its mitigation, here is a post I wrote covering every possible aspect: https://medium.com/@aysebilgegunduz/everything-you-need-to-know-about-idor-insecure-direct-object-references-375f83e03a87