Can a context consumer retrieve historical values for an entity through Orion? - fiware-orion

I am new to FIWARE technologies and have read many documents about the Orion Context Broker, but it is not clear to me whether a context consumer can request historical values for an entity from Orion.
I mean that if the context consumer needs the measurements from day 1 to day 10 from a single temperature sensor in order to run an analysis, can it request those values from Orion, or can the context consumer only retrieve those values from the database where they are stored?
Does Orion have this capability?

The context managed by the Orion Context Broker corresponds to the current status of the system. In other words, if a given attribute of a given entity has a value and a new update changes that value, then the old one is overridden.
This is not a limitation of Orion but a design principle: the responsibility for storing historical context lies with other FIWARE components. In particular, Cygnus is used to persist such historical information. It plays the role of a context consumer, subscribing to Orion and storing data in several persistence backends (HDFS, CKAN, MySQL, MongoDB, etc.). It can be used in combination with the Short Term Historic (STH), which provides a REST API similar to the Orion NGSIv1 API for retrieving raw historical data and some basic aggregations (sum, average, etc.).
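To make the flow concrete, here is a minimal sketch (not part of the original answer) of the kind of NGSIv2 subscription that lets a historical sink such as Cygnus receive every change; the entity id, hostnames, and notification URL are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OrionSubscriptionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical NGSIv2 subscription: ask Orion to notify Cygnus every time
        // the "temperature" attribute of one sensor entity changes (Java 15+ text block).
        String subscription = """
                {
                  "description": "Notify Cygnus of temperature changes",
                  "subject": {
                    "entities": [{ "id": "TemperatureSensor1", "type": "Sensor" }],
                    "condition": { "attrs": ["temperature"] }
                  },
                  "notification": {
                    "http": { "url": "http://cygnus:5050/notify" },
                    "attrs": ["temperature"]
                  }
                }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://orion:1026/v2/subscriptions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(subscription))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode()); // 201 Created on success
    }
}
```

From there, Cygnus (or any other consumer registered the same way) persists each notified value, and historical queries go to that backend or to STH rather than to Orion.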

Related

What is the point of using Kafka in this example and why not use DB straightaway?

Here is an example of how Kafka should run for a social network site.
But it is hard for me to understand the point of Kafka here. We would not want to store posts and likes in Kafka, as they will be destroyed after some time. So Kafka would be an intermediate store between the view and the DB.
But why would we need it? Wouldn't it be better to use the DB straightaway?
I guess that we could use Kafka as some kind of cache, so the data accumulates in Kafka and we then insert it into the DB in one big batch query. But I am pretty sure that is not the reason Kafka is here.
What's not shown in the diagram is the processes querying the database (RocksDB, in this case). Without using Kafka Streams, you'd need to write some external service to run GROUP BY / SUM on the database. The "website" box on the left is doing some sort of front-end Javascript, and it is unclear how the Kafka backend consumer sends data to it (perhaps WebSockets?).
With Kafka Streams Interactive Queries, that logic can be moved closer to the actual event source and performed in near real time rather than as a polling batch. In a streaming framework, you could also send out individual event hooks (WebSockets, for example) to dynamically update "likes per post", "shares per post", "trends", etc. without needing the user to refresh the page, or having the page issue AJAX calls with large API responses for those details for every rendered item.
More specifically, each Kafka Streams instance serves a specific query, rather than the API hitting one database for all queries. Therefore, load is more distributed and fault tolerant.
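As a hedged illustration of that point (not from the original answer), a likes-per-post count in the Streams DSL might look like the sketch below; the topic name, application id, and application.server address are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

import java.util.Properties;

public class LikesPerPost {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "likes-aggregator");      // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // application.server lets other instances discover which one holds a given key
        props.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "localhost:7070");

        StreamsBuilder builder = new StreamsBuilder();
        // "likes" events keyed by post id; the continuously updated count replaces
        // an external GROUP BY / SUM job polling the database.
        KStream<String, String> likes = builder.stream("likes");
        KTable<String, Long> likesPerPost = likes.groupByKey()
                .count(Materialized.as("likes-per-post"));   // queryable state store

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Each instance materializes the counts for the partitions it owns, and application.server is what lets a query for a post held by another instance be forwarded there via interactive queries.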
Worth pointing out that Apache Pinot loaded from Kafka is more suited for such real time analytical queries than Kafka Streams.
Also, as you pointed out, Kafka or any message queue would act as a buffer ahead of any database (not a cache, although Redis could be added as a cache, just like the search service mentioned later). And there's nothing preventing you from adding another database connected via a Kafka Connect sink. For instance, a popular design is to write data to an RDBMS as well as to Elasticsearch for text-based search indexing. The producer code only cares about one Kafka topic, not about every downstream system where the data is needed.

How to route requests to correct consumer in consumer group

From an event sourcing/CQRS perspective: Say I have a consumer group of 2 instances, that's subscribed to a topic. On startup/subscription, each instance processes its share of the event stream, and builds a local view of the data.
When an external request comes in with a command to update the data, how would that request be routed to the correct instance in the group? If the data were partitioned by entity ID so that odd-numbered IDs went to consumer 1 and even ones to consumer 2, how would that be communicated to the consumers, or, for that matter, to whatever reverse proxy or service mesh is responsible for sending that incoming request to the correct instance?
And what happens when the consumer group is rebalanced due to the addition or removal of consumers? Is that somehow automatically communicated to the routing mechanism?
Is there a gap in service while the consumers all rebuild their local model from their new set of events from the given topics?
This seems to apply to both the command and query side of things, if they're both divided between multiple instances with partitioned data...
Am I even thinking about this correctly?
Thank you
Kafka partitioning is great for sharding streams of commands and events by the entity they affect, but not for using this sharding in other means (e.g. for routing requests).
The broad technique for sharding the entity state I'd recommend is not to rely on Kafka partitioning for that (use the topic partitions only to ensure ordering of commands/events for an entity, i.e. by having all commands/events for a given entity be in one partition), but instead to use something external to coordinate those shards (candidates include leases in ZooKeeper/etcd/Consul, or cluster sharding from Akka (JVM), Akka.NET, or Cloudstate/Akka Serverless (more polyglot)). From there, there are two broad approaches you can take:
(mostly applicable if the number of entity shards for state and processing happens to equal the number of Kafka partitions) move part of the consumer group protocol into your application and have the instance that owns a particular shard consume a particular partition (see the sketch after this list)
have the instances ingesting from Kafka resolve the shard for an entity and which instance owns that shard, and then route a request to that instance. The same pattern would also allow things like HTTP requests for an entity to be handled by any instance. By doing this, you're making a service implemented in a stateful manner present itself to things like a service mesh/container scheduler/load balancer the way a more stateless service would.
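If the shards do happen to line up one-to-one with Kafka partitions (the first approach above), a rough sketch of the required bookkeeping might look like this; the group id, topic, and reliance on the default partitioner's hash are assumptions, and the caveat in the answer stands: this lookup goes stale the moment the group rebalances, which is why external coordination is recommended.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;
import org.apache.kafka.clients.admin.MemberDescription;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;
import java.util.Collections;

public class CommandRouterSketch {
    // Same hash the default partitioner applies to a keyed record, so all
    // commands for an entity land in one deterministic partition.
    static int partitionFor(String entityId, int numPartitions) {
        byte[] keyBytes = entityId.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    // Ask the brokers which member of the consumer group currently owns that partition.
    static String ownerHost(Admin admin, String groupId, String topic, int partition) throws Exception {
        ConsumerGroupDescription group = admin
                .describeConsumerGroups(Collections.singleton(groupId))
                .describedGroups().get(groupId).get();
        for (MemberDescription member : group.members()) {
            if (member.assignment().topicPartitions().contains(new TopicPartition(topic, partition))) {
                return member.host();   // route the external request to this instance
            }
        }
        return null;   // group is rebalancing or the partition is unassigned
    }
}
```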

Is Kafka global store visible from multiple instances of app

My question is: I have a Kafka Streams processing microservice which listens to multiple topics, aggregates the state from those topics, and stores it in a state store. We send that aggregated message on to a downstream system. The downstream system will respond with another message on a different Kafka topic. I need to create a global state store which is visible to all of my microservice instances. Is it possible to achieve this in Kafka using a global store? Can I get a sample code example of how to create one?
Quoted from the documentation:
The state that is locally available on any given instance is only a subset of the application’s entire state. Querying the local stores on an instance will only return data locally available on that particular instance.
And in this case, to query across instances, you must expose the application.server config, as discussed in the link in the comments. Also relevant: Is Kafka Stream StateStore global over all instances or just local?
However, a GlobalKTable has a full copy of all data on each instance.
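A minimal sketch of a GlobalKTable backed by a queryable store, assuming String keys/values and a hypothetical topic name; every instance holds a full copy, so any instance can answer a lookup (the store API shown needs Kafka Streams 2.5+).

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

import java.util.Properties;

public class GlobalStoreSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-service");   // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Every instance replicates the full contents of this topic into a local store.
        GlobalKTable<String, String> aggregates = builder.globalTable(
                "aggregated-state",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("aggregate-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // In a real service, wait until the instance is RUNNING before querying.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("aggregate-store", QueryableStoreTypes.keyValueStore()));
        String value = store.get("some-entity-id");   // any instance can read any key
    }
}
```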

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, Kafka Streams is built on top of KafkaProducers/Consumers, so everything that is possible with Kafka Streams is also possible with plain Consumers/Producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start a long discussion on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on the city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.toStream().to()), and finally, using Kafka Connect, the messages will be stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API too, but it would require much more code.
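A hedged sketch of that first scenario; the topic names are hypothetical, and a plain JSON String stands in for a real order serde.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class OrderFilterSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter");          // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Orders arrive as JSON strings keyed by order id; a real app would use a proper serde.
        KStream<String, String> orders = builder.stream("orders");
        orders.filter((orderId, orderJson) -> orderJson.contains("\"city\":\"Berlin\""))  // keep one delivery city
              .to("berlin-orders");   // Kafka Connect can sink this topic into a DB table and Elasticsearch

        new KafkaStreams(builder.build(), props).start();
    }
}
```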
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactionality from the source topic up to the filtered topic (and, with an idempotent Kafka Connect sink, on to the DB and Elasticsearch). There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates, such as the count of orders at the level of an individual product. In such scenarios, duplicates will always give you a wrong result.
You can also enrich your incoming data with very low latency. Let's assume, in the above example, you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now, all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams, and also, as the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
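A sketch of that lookup as a KStream-KTable join, reusing the kind of configuration shown in the previous sketch; the topic names and plain String values are assumptions.

```java
// Assumes the same String serdes / StreamsConfig as the previous sketch.
StreamsBuilder builder = new StreamsBuilder();
// Compacted topic: customerId -> email address, materialized locally as a KTable.
KTable<String, String> customerEmails = builder.table("customers-compacted");
// Orders re-keyed by customerId so the join can match them against the table.
KStream<String, String> ordersByCustomer = builder.stream("orders-by-customer");

ordersByCustomer
        .join(customerEmails, (order, email) -> order + ",customerEmail=" + email)  // local lookup, no REST call
        .to("orders-enriched");
```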
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In that case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events come within a one-hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
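And a sketch of that one-hour windowed join, again with hypothetical topic names; both streams are assumed to be keyed by order id, and JoinWindows.ofTimeDifferenceWithNoGrace needs Kafka 3.0+ (older versions used JoinWindows.of).

```java
// Imports and config as in the earlier sketches, plus java.time.Duration and
// org.apache.kafka.streams.kstream.JoinWindows.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> orders = builder.stream("orders");
KStream<String, String> payments = builder.stream("payments");

orders.join(payments,
            (order, payment) -> order + "|" + payment,                        // pair the two events
            JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)))     // must arrive within one hour of each other
      .to("paid-orders");
```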

Ingesting data from REST api to Kafka

I have many REST APIs to pull data from different data sources, and now I want to publish these REST responses to different Kafka topics. I also want to make sure that duplicate data is not produced.
Are there any tools available to do this kind of operation?
So, in general, a Kafka processing pipeline should be able to handle messages that are sent multiple times. Exactly-once delivery of Kafka messages is a feature that has only been around since mid-2017 (given that I'm writing this in January 2018) and Kafka 0.11, so in general, unless your Kafka installation is on the bleeding edge, your pipeline should be able to handle multiple deliveries of the same message.
That's, of course, your pipeline. Now you have a problem where a data source may deliver a message multiple times to your HTTP -> Kafka microservice.
Theoretically, you should design your pipeline to be idempotent: multiple applications of the same change message should only affect the data once. This is, of course, easier said than done. But if you manage this, then "problem solved": just send duplicate messages through and it doesn't matter. This is probably the best thing to aim for, regardless of whatever once-only-delivery, CAP-theorem-bending magic KIP-98 does. (And if you don't get why this is super magic, well, here's a homework topic :) )
Let's say your input data is posts about users. If your posted data includes some kind of updated_at date, you could create a transaction-log Kafka topic. Set the key to be the user ID and the values to be all the (say) updated_at fields applied to that user. When you're processing an HTTP POST, look up the user in a local KTable for that topic and check whether your post has already been recorded. If it has already been recorded, then don't produce the change into Kafka.
Even without the updated_at field, you could save the user document in the KTable. If Kafka is a stream of transaction-log data (the database inside out), then KTables are the stream right side out: a database again. If the current value in the KTable (the accumulation of all applied changes) matches the object you were given in your POST, then you've already applied the changes.
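A hedged sketch of the updated_at idea above, done inside a Streams topology rather than at the HTTP layer; the topic names and plain String values are assumptions, and it presumes updated_at is an ISO-8601 timestamp so string comparison orders correctly.

```java
// Config and serdes omitted; assumes String keys/values as in a typical Streams app.
StreamsBuilder builder = new StreamsBuilder();

// userId -> last updated_at value we have already accepted (compacted topic).
KTable<String, String> lastAccepted = builder.table("user-last-updated");

// userId -> updated_at carried by each incoming HTTP POST, produced by the REST ingester.
KStream<String, String> incoming = builder.stream("incoming-posts");

incoming
        .leftJoin(lastAccepted,
                  (postedAt, knownAt) -> knownAt == null || postedAt.compareTo(knownAt) > 0
                          ? postedAt   // newer than anything recorded: accept
                          : null)      // already recorded: mark as duplicate
        .filter((userId, postedAt) -> postedAt != null)   // drop duplicates
        .to("user-last-updated");      // feeds the KTable so later duplicates are rejected
```

A real service would also need to handle the small race between reading the table and producing the update, for example by keying on user ID so all posts for one user are processed in order on one instance.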