IoT Streaming Architecture [closed]

I just started learning about IoT and data streaming. Apologies if this question seems too obvious or generic.
I am working on a school project which involves streaming data from hundreds (maybe thousands) of IoT sensors, storing that data in a database, and then retrieving it for display in a web-based UI.
Things to note are:
fault-tolerance and the ability to accept incomplete data entries
the database must be able to load and query data by stream
I've looked around on Google for some ideas on how to build an architecture that can support these requirements. Here's what I have in mind:
Sensor data is collected by Fluentd and converted into a stream (a sample Fluentd config is sketched after this list)
Apache Spark manages a cluster of MongoDB servers
a. the MongoDB servers are connected to the same storage
b. Spark will handle fault-tolerance and load balancing between MongoDB servers
BigQuery will be used to handle queries from the UI/web application.
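A minimal sketch of what the Fluentd collection step (1) could look like; the HTTP input, port numbers, tag, and downstream host are assumptions for illustration, not details from the project:

```
# Accept sensor readings over HTTP, e.g. POST http://<collector-host>:9880/sensors.temperature
<source>
  @type http
  port 9880
  bind 0.0.0.0
</source>

# Forward everything tagged sensors.* to the downstream aggregator / stream processor
<match sensors.**>
  @type forward
  <server>
    # placeholder hostname for whatever consumes the stream (Spark, Kafka, etc.)
    host stream-ingest.example.internal
    port 24224
  </server>
</match>
```

Whichever system sits downstream, the forward output only needs a matching listener on the other side.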
That is my current idea of an IoT streaming architecture.
The question now is whether this architecture is feasible, or whether it would work at all. I'm open to any ideas and suggestions.
Thanks in advance!

Note that you could stream your device data directly into BigQuery and avoid an intermediate buffering step.
See:
https://cloud.google.com/bigquery/streaming-data-into-bigquery
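For illustration, a minimal sketch of the streaming-insert approach using the google-cloud-bigquery Java client from Scala; the dataset, table, and field names are assumptions, not part of the original question:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, InsertAllRequest, TableId}

object StreamReadingToBigQuery {
  def main(args: Array[String]): Unit = {
    // Uses application-default credentials for the active GCP project
    val bigquery = BigQueryOptions.getDefaultInstance.getService

    // Hypothetical dataset/table; the schema must already exist in BigQuery
    val tableId = TableId.of("iot_demo", "sensor_readings")

    // One sensor reading; in practice, batch many rows per insertAll call
    val row = new java.util.HashMap[String, AnyRef]()
    row.put("sensor_id", "sensor-042")
    row.put("temperature", Double.box(21.7))
    row.put("recorded_at", "2019-03-01T12:00:00Z")

    val response = bigquery.insertAll(
      InsertAllRequest.newBuilder(tableId).addRow(row).build()
    )

    if (response.hasErrors)
      println(s"Failed rows: ${response.getInsertErrors}")
  }
}
```

In practice you would batch many rows per request and handle the per-row errors that insertAll reports.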

Related

Optimal ways to rate limit a Spark stream [closed]

I have a Spark stream whose source is blob storage; it transforms and enriches data in a batch process written in Scala 2.12. The enrichment step calls an external service to get some additional information that is added before the data is dropped into a sink.
The web service is rate limited, so I'm looking for ways to control the request rate to the service. These are the ways I have thought of so far:
Use partitioning from the Spark libraries to process the data in smaller chunks (a rough sketch of this option follows the list)
Make use of Akka Streams: load the Spark stream onto an Akka stream and throttle the requests there. The disadvantage of this approach is that I'll end up loading a lot of data into memory, though I can mitigate that by producing multiple smaller blob files that are processed one after another.
Look for an HTTP client library that takes care of throttling and retrying for me.
Implement some kind of circuit breaker that stops when it encounters HTTP 429 and resumes later.
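As an illustration of the first option, here is a rough sketch of a per-partition throttle using Guava's RateLimiter inside mapPartitions; the paths, partition count, permits-per-second value, the enrich stub, and the assumption that Guava is on the classpath are all mine, not from the question:

```scala
import com.google.common.util.concurrent.RateLimiter
import org.apache.spark.sql.SparkSession

object ThrottledEnrichment {
  // Stand-in for the real call to the rate-limited web service
  def enrich(record: String): String = record + ",enriched"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("throttled-enrichment").getOrCreate()
    import spark.implicits._

    // Placeholder blob-storage paths
    val input = spark.read.textFile("wasbs://data@myaccount.blob.core.windows.net/incoming/")

    val enriched = input
      .repartition(8) // smaller chunks => a bounded number of concurrent callers
      .mapPartitions { records =>
        // One limiter per partition: 8 partitions * 5 permits/s gives roughly 40 requests/s overall
        val limiter = RateLimiter.create(5.0)
        records.map { r =>
          limiter.acquire() // blocks until a permit is available
          enrich(r)
        }
      }

    enriched.write.text("wasbs://data@myaccount.blob.core.windows.net/enriched/")
    spark.stop()
  }
}
```

The trade-off is that the effective rate is the per-partition rate times the number of partitions processed in parallel, so both numbers have to be tuned against the service's limit.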
What's the best way to solve this?

Which messaging system should be used? [closed]

I would like to transfer data from one database system to other database systems. Which messaging system (Kafka, ActiveMQ, RabbitMQ, or similar) would be best for achieving this with high throughput and performance?
I guess the answer to this type of question is "it depends".
You can probably find plenty of information on the internet comparing these message brokers. From our experience and knowledge, Kafka and its ecosystem tools, such as Kafka Connect, provide the behaviour you're asking for: source connectors and sink connectors move the data, with Kafka in the middle.
Kafka Connect is a framework that lets you add plugins called connectors:
Sink connectors read from Kafka and send that data to a target system.
Source connectors read from a source store and write to Kafka.
Using Kafka Connect is essentially "no code": you call a REST API to set each connector's configuration.
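For example, a source connector is created by POSTing a JSON configuration to the Connect REST API; the sketch below assumes the Confluent JDBC source connector plugin is installed, and every connection detail is a placeholder:

```json
{
  "name": "jdbc-source-orders",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://source-db:5432/shop",
    "connection.user": "connect_user",
    "connection.password": "connect_password",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "shop-"
  }
}
```

You POST this to the Connect REST endpoint (port 8083 by default), and a matching sink connector on the other side, for example a JDBC sink, writes the resulting topic out to the target database.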
Kafka is a distributed system that supports very high throughput with low latency, and it supports near-real-time streaming of data.
Kafka is widely adopted by the biggest companies around the world.
There are many tools and vendors that support your use case; they vary in price and support. It depends on which sources you need to take data from, which targets you want to write to, and whether it should be a CDC/near-real-time feed or a "batch" copy.

Pitfalls of Kafka tiered storage [closed]

I have a question regarding the tiered storage feature in Kafka. I like this feature since it means, in my case, that I can use Kafka as infinite storage (with a GCS backend, for example). However, let's suppose that for whatever reason the Kafka cluster gets deleted and the Kafka data is lost.
Is the data in the GCS/S3 store still useful?
I mean, can I plug the old logs into a new Kafka cluster, or are they totally useless now (terabytes of logs)?
BTW, I know I can analyse the segments in the GCS/S3 store and extract the data, but that's a bit hacky, which is why I'm trying to see if there is a clean solution.
As of right now, if the cluster or specifically the topic that has tiered storage enabled gets deleted, the data in GCS/S3 will not be "reloaded" if you connect it to another cluster.
If you want to keep the data that's in GCS/S3, you will need to stream the data to a new topic that does not have tiered storage enabled, or use Kafka Connect to independently write the data out to a usable format before deleting it (a sketch of the Connect approach is shown below).
We do plan on improving this use case in the future.
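As a sketch of the Connect workaround mentioned above (not an official procedure), the Confluent S3 sink connector can copy a topic out to object storage in a plainly readable format before the cluster or topic is deleted; the topic, bucket, and region names here are placeholders:

```json
{
  "name": "s3-sink-backup",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-kafka-backup-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

Unlike the tiered-storage segment files, the resulting JSON objects can later be re-ingested into a new cluster with, for example, an S3 source connector or any other loader.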

Scaling a database in the cloud and on local servers [closed]

I am considering using MongoDB (it could be PostgreSQL or any other database) as a data warehouse. My concern is that twenty or more users could be running queries at a time, and this could have serious performance implications.
My question is: what is the best approach to handling this in a cloud-based and in a non-cloud-based environment? Do cloud-based databases handle this automatically? If so, would the data stay consistent across all instances when the data is refreshed? In a non-cloud-based environment, would the best approach be to load balance across all instances? Again, how would you ensure data integrity across all instances?
Thanks in advance.
I think auto sharding is what I am looking for
http://docs.mongodb.org/v2.6/MongoDB-sharding-guide.pdf
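For reference, a minimal sketch of what enabling auto-sharding looks like from the mongo shell, run against a mongos router of an existing sharded cluster; the database, collection, and shard-key names are made up for illustration:

```javascript
// Enable sharding for the hypothetical data-warehouse database
sh.enableSharding("warehouse")

// A hashed shard key spreads the collection evenly across the shards
sh.shardCollection("warehouse.sales_facts", { customer_id: "hashed" })

// Inspect how chunks are distributed across the shards
sh.status()
```

The mongos routers then spread queries from the concurrent users across the shards, which is the load-balancing behaviour the question is after.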

Is combining MongoDB with Neo4J a good practice? [closed]

I already have a .Net Web project running on MongoDB where I store some news/feed data.
After a while I needed a faster way to track "who shared what" and to find relationships based on that information.
So I came up with the idea of using a graph DB to track related feeds and users.
Since the system is already running on MongoDB, I am thinking of leaving the data in Mongo and creating the graph representation in Neo4j so I can run graph searches.
I do not want to migrate all my data to Neo4j, because many people tell me MongoDB's I/O performance is much better than Neo4j's, and they also point to its sharding feature.
What would you suggest in this situation?
And if I follow my idea, will it be good practice?
Personally, I think there is no single answer or best practice here. Polyglot persistence is common usage.
Everything depends on your context, and there are points we can't just answer for you:
How much time do you have? Learning a new technology well enough to use it in production and sleep soundly is not a matter of days.
How much money can you invest in the project? Sharding is, AFAIK, a Neo4j Enterprise feature, and licenses have a cost if you're not open source (i.e., you're a commercial company). There are also hosting costs for Neo4j in cluster mode.
How much data? As long as your graph fits in memory, you won't run into I/O issues.
Those points aside: yes, as a first step you can map Neo4j on top of MongoDB.
Maybe try incremental migrations, and at the end of the process ask yourself the following question: WHY do you need MongoDB to handle graph structures?
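As a sketch of the "graph on the side" idea (written in Scala with the official Neo4j Java driver rather than .NET, purely for illustration), the full document stays in MongoDB and only the share relationship is mirrored into Neo4j; the URI, credentials, labels, and property names are assumptions:

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

object MirrorShareToGraph {
  def main(args: Array[String]): Unit = {
    val driver = GraphDatabase.driver("bolt://localhost:7687",
      AuthTokens.basic("neo4j", "secret"))
    val session = driver.session()
    try {
      // The news/feed document lives in MongoDB; Neo4j only stores the
      // MongoDB _id values plus the SHARED relationship used for graph search.
      session.run(
        """MERGE (u:User {mongoId: $userId})
          |MERGE (f:Feed {mongoId: $feedId})
          |MERGE (u)-[:SHARED]->(f)""".stripMargin,
        Values.parameters("userId", "user-5a1b2c", "feedId", "feed-9f8e7d"))

      // "Which other users shared something I shared": a two-hop graph query
      val result = session.run(
        """MATCH (me:User {mongoId: $userId})-[:SHARED]->(:Feed)<-[:SHARED]-(other:User)
          |RETURN DISTINCT other.mongoId AS otherId""".stripMargin,
        Values.parameters("userId", "user-5a1b2c"))
      while (result.hasNext)
        println(result.next().get("otherId").asString())
    } finally {
      session.close()
      driver.close()
    }
  }
}
```

The returned mongoId values can then be looked up in MongoDB to load the full documents, which keeps each store doing what it is best at.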