We have certain Linux devices that send data such as battery percentage, CPU utilization, RAM utilization, etc. at regular intervals. We want to run analytics on this data. Should we capture it in MongoDB (https://www.mongodb.com/blog/post/time-series-data-and-mongodb-part-1-introduction) or use a dedicated time-series database (TSDB) like InfluxDB? The data generated is around 100 GB per day and we want to keep the last 3 months.
TSDB benchmarks (TimescaleDB vs MongoDB, InfluxDB vs MongoDB) show that dedicated time-series databases outperform MongoDB. At 100 GB per day over 3 months, on-disk compression also matters. VictoriaMetrics appears to lead in ingestion rate, query speed and compression for typical use cases, although TimescaleDB has recently improved its data compression. Have a look at the Yandex ClickHouse benchmarks too.
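Whichever TSDB you pick, ingesting this kind of device telemetry is usually just a small HTTP write per sample or batch. As a rough sketch (assuming InfluxDB 1.x; the host, database, measurement, tag and field names are illustrative, not from the question):

```python
# A minimal sketch of pushing one device sample to InfluxDB (1.x) over its
# HTTP line protocol. Host, database, measurement, tag and field names are
# illustrative assumptions.
import time
import requests

sample = {"battery_pct": 87.5, "cpu_pct": 12.3, "ram_pct": 41.0}
line = (
    "device_metrics,device_id=dev-001 "
    + ",".join(f"{k}={v}" for k, v in sample.items())
    + f" {int(time.time())}"
)

resp = requests.post(
    "http://localhost:8086/write",
    params={"db": "telemetry", "precision": "s"},
    data=line,
)
resp.raise_for_status()
```

In practice you would batch many lines per POST; one request per data point will bottleneck any of the databases mentioned above.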
For another alternative, check out QuestDB at Questdb.io. QuestDB outperforms all of the above-mentioned TSDBs and is SQL-based.
You can try it out for speed at http://try.questdb.io:9000/, a live instance loaded with 1.9B rows of data from the NYC Taxi dataset.
For time-series data, it's highly recommended to use a time-series database rather than an RDBMS or a general-purpose NoSQL database, because a TSDB's storage layout and query engine are optimized for time-series workloads.
Here I want to recommend TDengine, a lightweight, high-performance, open-source time-series database. TDengine is a distributed TSDB, its clustering solution is also open source, and it supports SQL for easy use.
https://tdengine.com/
Problem:
Looking for the best solution to store a large amount of weather data and make it easily available to a team of machine learning specialists.
Initially I'm fetching data from cds.climate.copernicus.eu in netCDF or GRIB format. There will be around 10-20 TB of GRIB or netCDF files.
Requirements:
ML specialists can easily query data for a given location (point, polygon) in a given time range.
Results are returned in a reasonable time.
Ideas:
Postgres. I thought that maybe Postgres could handle that amount of data, but the problem I encountered is that loading the data into Postgres would take ages, and it would also take much more space than 10-20 TB (because I planned to store it in a row-like format with two tables, Point and WeatherMeasurement). Is this a good idea? Does anyone have experience with this kind of data in Postgres?
Amazon Redshift. Would it be a good approach for weather data? How would I load netCDF or GRIB into it? I have zero experience with cloud solutions like this.
Files. Just store the data in the GRIB or netCDF files and write a simplified Python interface to fetch data from them (roughly along the lines of the sketch below). But the question is: will the queries be fast enough? Does anyone have experience with this?
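For context, such a Python interface might look roughly like the following sketch, which uses xarray to pull one variable at a point over a time range; the file name, variable name ("t2m") and coordinate names are assumptions and will differ per dataset:

```python
# Minimal sketch: select one variable at a point over a time range from a
# netCDF file with xarray. File, variable and coordinate names are
# illustrative assumptions.
import xarray as xr

ds = xr.open_dataset("era5_sample.nc")  # GRIB would need engine="cfgrib"

subset = (
    ds["t2m"]
    .sel(latitude=52.2, longitude=21.0, method="nearest")
    .sel(time=slice("2020-01-01", "2020-03-31"))
)

df = subset.to_dataframe()  # hand off to the ML team as a pandas DataFrame
print(df.head())
```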
For data this size that you want to sub-select quickly along multiple dimensions, I'd lean toward Redshift. You will want to pay attention to how you want to query the data and establish a data model that provides the fastest access for the needed subsets. You may want to get some help setting this up initially, as a trial-and-error approach will take a while with this data size. Also, Redshift isn't cheap at this scale, so ask the budget questions too. The cost can be reduced if the database only needs to be up part of the time.
Files aren't a terrible idea as long as you can partition the data so that only a subset of files needs to be accessed for any query. A partitioning strategy based on YEAR, MONTH, LAT-decade, and LON-decade might work - you'll need to understand what queries need to be performed and how fast ("reasonable time" needs a number). This approach will cost the least.
There is also a combo option - Redshift Spectrum. Redshift can use in-database data AND data stored in S3 in the same queries. Again, setting up the Redshift data model and the S3 partitioning will be critical, but this combination could give you valuable flexibility.
For any of these options you will want to convert the data to a more database-friendly format like Parquet (or even CSV). This conversion process, along with how to merge new data, will need to be worked out. There are lots of cloud tools to help with this processing.
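As a rough sketch of that conversion (assuming netCDF input and a year/month partitioning scheme; the file, variable, and column names are illustrative):

```python
# Rough sketch: flatten a netCDF file into partitioned Parquet with xarray
# and pandas (pyarrow engine). File, variable, and partition column names
# are illustrative assumptions.
import xarray as xr

ds = xr.open_dataset("era5_sample.nc")
df = ds[["t2m", "tp"]].to_dataframe().reset_index()  # columns: time, latitude, longitude, ...

df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month

df.to_parquet(
    "weather_parquet/",          # one directory tree, partitioned on disk
    engine="pyarrow",
    partition_cols=["year", "month"],
    index=False,
)
```

The resulting directory layout (year=.../month=...) is exactly what Redshift Spectrum or a file-based approach can prune against.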
Given the size of the data you are working with, I'll stress again that learning as you go will be time-consuming. You will likely want to find experts in the tools you choose (and at the data sizes you have) to get up to speed quickly.
I am developing a system that involves a lot of OLAP work. According to my research, a column-based data warehouse is the best choice, but I am struggling to choose a good data warehouse product.
All the data warehouse comparison articles I can find are from before 2012, and there seem to be few recent ones. Is data warehousing out of date? Is Hadoop HBase better?
As far as I know, InfiniDB is a high-performance open-source data warehouse product, but it has not been maintained for 2 years (https://github.com/infinidb/infinidb), and there is little documentation about it. Has InfiniDB been abandoned by developers?
Which is the best data warehouse product right now?
How do I incrementally move my business data stored in a MySQL database into the data warehouse?
Thank you for your answer!
Data warehousing is still a hot topic. HBase is not the fastest option, but it is very well known and widely compatible (many applications build on it).
I went on the journey for a good column store some years ago and finally chose InfiniDB because of the easy migration from plain MySQL. It's a nice piece of software, but it still has bugs, so I cannot fully recommend it for production use (not without a second failover instance).
However, MariaDB has picked up the InfiniDB technology and is porting it over to their MariaDB Database Server. This new product is called MariaDB ColumnStore [1], of which a testing build is available. They have already put a lot of effort into it, so I think ColumnStore will become a major MariaDB product within the next two years.
I can't answer that. I'm still with InfiniDB and also helping others with their projects.
This totally depends on your data structure and usage.
InfiniDB is great at querying; in my tests it had roughly 8% better performance than Impala. However, while InfiniDB supports INSERT, UPDATE, DELETE and transactions, it is not great for transactional workloads. Simply moving a community-driven website, where visitors are constantly manipulating data, to InfiniDB will NOT work well: one insert with 10,000 rows will work well, but 10,000 inserts with 1 row each will kill it.
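To illustrate the batching point, a minimal sketch (it assumes a MySQL-protocol connection, which InfiniDB and MariaDB ColumnStore both speak, and illustrative connection details and table/column names):

```python
# Sketch: send 10,000 rows as batched INSERTs instead of 10,000 single-row
# INSERTs. Connection details and table/column names are illustrative.
import pymysql

rows = [(i, f"value-{i}") for i in range(10_000)]

conn = pymysql.connect(host="localhost", user="app", password="secret", database="analytics")
try:
    with conn.cursor() as cur:
        # executemany folds these into multi-row INSERT statements, which
        # columnar engines handle far better than row-at-a-time writes.
        cur.executemany("INSERT INTO measurements (id, label) VALUES (%s, %s)", rows)
    conn.commit()
finally:
    conn.close()
```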
We deployed InfiniDB for our customers to 'aid' the query performance of a regular MariaDB installation - we created a tool that imports and updates MariaDB tables into InfiniDB for faster querying. Manipulations on those tables are still done in MariaDB, and the changes get batch-imported into InfiniDB with about a 30-second delay. Since the original and InfiniDB tables have the same structure and are accessible through the MySQL API, we can just switch the database connection and get super-fast SELECT queries. This works well for our use case.
We also built new statistics/analytics applications from the ground up to work with InfiniDB and replace an older MySQL-based system, which also works great and exceeds all performance expectations. (We now have 15x the data we had in MariaDB, and it's still easier to maintain and much faster to query.)
[1] https://mariadb.com/products/mariadb-columnstore
I would give Splice Machine a shot (open source). It stores data on HBase and provides the core data management functions that a warehouse provides (primary keys, constraints, foreign keys, etc.).
I want to implement MongoDB as a distributed database, but I cannot find good tutorials for it. Whenever I search for "distributed database" with MongoDB, I get links about sharding, so I am confused: are the two the same thing?
Generally speaking, if you have a read-heavy system, you may want to use replication: one primary with at most 50 secondaries. The secondaries share the read load while the primary takes care of writes. It is an auto-failover system, so when the primary goes down, one of the secondaries takes over and becomes the new primary.
Sharding, however, is more flexible. All the shards share both write and read load; that is, the data are distributed across different shards. Each shard can itself consist of a replica set, with auto-failover working as described above.
I would choose replication first because it's simple and basically enough for most scenarios. Once it's no longer enough, you can convert from a replica set to a sharded cluster.
There is also another discussion of the differences between replication and sharding for your reference.
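As a small sketch of the read-heavy case above (assuming a replica set named rs0 and illustrative host, database, and collection names), a client can be told to prefer secondaries for reads:

```python
# Sketch: connect to a replica set and route reads to secondaries when
# possible. Hosts, replica set name, and database/collection names are
# illustrative.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)

db = client["telemetry"]
count = db["events"].count_documents({"status": "ok"})  # served by a secondary if one is available
print(count)
```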
Just some perspective on distributed databases:
In the early nineties, a lot of applications were desktop-based and had a local database containing MBs/GBs of data.
Now, with the advent of web-based applications, there can be millions of users storing their data, and this data can run into GBs/TBs/PBs. Storing all of it on a single server is economically expensive, so it is spread across a cluster of servers (commodity hardware) and horizontally partitioned. Sharding is another term for horizontal partitioning of data.
For example, say you have a Customer table containing 100 rows and you want to shard it across 4 servers. With 'key'-based (range) sharding, the customers would be distributed as follows: SHARD-1 (1-25), SHARD-2 (26-50), SHARD-3 (51-75), SHARD-4 (76-100).
Sharding can be done in 2 ways (see the MongoDB sketch after this list):
Hash-based
Key-based (range)
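In MongoDB these correspond to hashed and ranged shard keys. A minimal sketch, assuming a connection to a mongos router and illustrative database, collection, and key names:

```python
# Sketch: enable sharding and pick a shard key via a mongos router.
# Host, database, collection, and key names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017/")

client.admin.command("enableSharding", "shop")

# Hash-based sharding: documents are spread evenly by a hash of customer_id.
client.admin.command("shardCollection", "shop.customers", key={"customer_id": "hashed"})

# Key/range-based sharding (alternative): contiguous ranges of customer_id
# land on the same shard, like the SHARD-1 (1-25) ... example above.
# client.admin.command("shardCollection", "shop.customers", key={"customer_id": 1})
```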
I want to understand more about the system and DB architecture of MongoDB.
I am trying to understand how MongoDB stores and retrieves documents, and whether it's all in memory, etc.
A comparative analysis between MongoDB and Oracle would be a bonus, but I am mostly focused on understanding the MongoDB architecture per se.
Any pointers will be helpful.
MongoDB memory-maps its database files and lets the OS manage the mapping, allocating as much RAM as is available to it. As MongoDB reads from and writes to the DB, it is effectively reading and writing RAM. All indexes on the documents in the database are also held in RAM. The memory-mapped files are flushed to disk every 60 seconds. To prevent data loss in the event of a power failure, the default is to run with journaling switched on; the journal is flushed to disk every 100 ms and, if there is a power loss, is used to bring the database back to a consistent state.
An important design decision with MongoDB is the amount of RAM. You need to figure out your working set size - i.e., if you are only going to be reading and writing the most recent 10% of the data in the database, then that 10% is your working set and should be held in memory for maximum performance. So if your working set is 10 GB, you are going to need 10 GB of RAM for maximum performance - otherwise your queries/updates will slow down as pages are paged from disk into memory.
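A rough sketch of how you might size this from a running instance (the database name is illustrative; dataSize and indexSize are fields returned by MongoDB's dbStats command):

```python
# Sketch: pull data and index sizes from dbStats to reason about RAM needs.
# Database name and the 10% "hot data" assumption are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
stats = client["ticks"].command("dbStats")

data_gb = stats["dataSize"] / 1024**3
index_gb = stats["indexSize"] / 1024**3

# If ~10% of the data is "hot", budget roughly that slice plus all indexes.
working_set_gb = 0.10 * data_gb + index_gb
print(f"data={data_gb:.1f} GB, indexes={index_gb:.1f} GB, "
      f"estimated working set ~ {working_set_gb:.1f} GB")
```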
Other important aspects of MongoDB are replication (for redundancy and backups) and sharding (for scaling).
There are a lot of great online resources for learning. MongoDB is free and open source.
EDIT:
It's a good idea to check out the tutorial (http://www.mongodb.org/display/DOCS/Tutorial) and the manual (http://www.mongodb.org/display/DOCS/Manual); the Admin Zone (http://www.mongodb.org/display/DOCS/Admin+Zone) is useful too. And if you get bored of reading, the presentations are worth checking out: http://www.10gen.com/presentations
I was thinking of using a database like MongoDB or RavenDB to store a lot of stock tick data, and I wanted to know whether this would be viable compared to a standard relational database such as SQL Server.
The data would not really be relational and would be a couple of huge tables. I was also thinking that I could pre-aggregate (sum/min/max) rows of data by minute/hour/day/week/month etc. for even faster calculations.
Example data:
500 symbols * 60 min * 60 sec * 300 days... (per record we store: date, open, high, low, close, volume, openint - all decimal/float)
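(For scale, that multiplication works out to 500 × 3,600 × 300 = 540 million records.)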
So what do you guys think?
Since this question was asked in 2010, several database engines have been released or have developed features that specifically handle time series such as stock tick data:
InfluxDB - see my other answer
Cassandra
With MongoDB or other document-oriented databases, if you target performance, the advice is to contort your schema to organize ticks in an object keyed by second (or an object of minutes, with each minute being another object holding 60 seconds). With a specialized time-series database, you can query the data simply with:
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
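For contrast, the bucketed MongoDB document shape mentioned above might look roughly like this (a sketch only; the collection granularity and field names are illustrative):

```python
# Sketch of a per-minute "bucket" document holding one entry per second.
# Field names and the one-document-per-symbol-per-minute layout are
# illustrative, not a MongoDB requirement.
minute_bucket = {
    "symbol": "AAPL",
    "minute": "2016-09-14T14:30",
    "ticks": {
        "0": {"open": 107.51, "close": 107.53, "volume": 1200},
        "1": {"open": 107.53, "close": 107.49, "volume": 950},
        # ... keys "2" through "59", one per second
    },
}
```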
I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
TL;DR
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.
The answer here will depend on scope.
MongoDB is a great way to get the data "in", and it's really fast at querying individual pieces. It's also nice in that it is built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to Mongo -> run map-reduce -> push the results to the web servers for real-time optimization -> rinse / repeat.
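For tick data, the "run map-reduce" step could look roughly like the following sketch, which uses MongoDB's aggregation pipeline (the modern replacement for the mapReduce command) with illustrative collection and field names; $dateTrunc requires MongoDB 5.0+:

```python
# Sketch: roll raw ticks up into per-minute OHLCV bars.
# Collection and field names are illustrative; requires MongoDB 5.0+ for $dateTrunc.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
ticks = client["market"]["ticks"]

pipeline = [
    {"$match": {"symbol": "AAPL"}},
    {"$sort": {"ts": 1}},  # so $first/$last below mean open/close
    {"$group": {
        "_id": {"$dateTrunc": {"date": "$ts", "unit": "minute"}},
        "open":   {"$first": "$price"},
        "high":   {"$max": "$price"},
        "low":    {"$min": "$price"},
        "close":  {"$last": "$price"},
        "volume": {"$sum": "$size"},
    }},
    {"$sort": {"_id": 1}},
]

for bar in ticks.aggregate(pipeline):
    print(bar)
```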
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
Horizontal scalability. If you have lots of boxes then you can shard them and get roughly linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is a lot more costly.
Really fast, and as with point #1, you get the ability to add RAM horizontally to keep up the speed.
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.
Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation or trend-based analysis on it.
If you use a document-based DB as your source, loading and manipulating each row of data (CRUD operations) is very simple. Very efficient, very straightforward, basically lovely.
What sucks is that there are very few, if any, options to extract this data and load it into a structure more suitable for statistical analysis, e.g. a columnar database or a cube. If you load it into a basic relational database, there are a host of tools, both commercial and open source (such as Pentaho), that will handle the ETL and analysis very nicely.
Ultimately though, what you want to keep in mind is that every financial firm in the world has a stock analysis/ auto-trader application; they just caused a major U.S. stock market tumble and they are not toys. :)
A simple datastore such as a key-value or document database is also beneficial in cases where the analytics workload reasonably exceeds a single system's capacity (or would require an exceptionally large machine to handle the load). In these cases it makes sense to use a simple store, since the analytics require batch processing anyway. I would personally look for a horizontally scaling processing method to come up with the per-unit/per-time analytics required.
I would investigate using something built on Hadoop for parallel processing. Either use the framework natively in Java/C++ or some higher-level abstraction: Pig, Wukong, binary executables through the streaming interface, etc. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience, but many do and depend on it for their businesses.)
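For the streaming-interface route, a mapper can be any executable that reads lines from stdin. A minimal sketch (the input tick format and field positions are illustrative assumptions) that keys each price by symbol and minute so a reducer can compute per-minute min/max:

```python
#!/usr/bin/env python3
# Sketch of a Hadoop Streaming mapper: reads CSV ticks from stdin and emits
# "symbol|minute <TAB> price" so a reducer can compute per-minute min/max.
# The input format (symbol,timestamp,price,...) is an illustrative assumption.
import sys

for raw in sys.stdin:
    parts = raw.strip().split(",")
    if len(parts) < 3:
        continue                      # skip malformed lines
    symbol, ts, price = parts[0], parts[1], parts[2]
    minute = ts[:16]                  # e.g. "2010-09-14T14:30" from an ISO timestamp
    print(f"{symbol}|{minute}\t{price}")
```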