I need a timeseries datastore and visualization platform where I can dump experiment data into hierarchical namespaces and then go back later for analysis. Features like saving graph templates and linking to graphs, to go from analysis to presentation, would be very useful. Initially I was really excited to read about Graphite and Graphiti, because they appear to fit the bill. However, the events I'm tracking are milliseconds apart and I need to keep millisecond precision without aggregation or averaging. It looks like the only way to make Graphite play nice is to aggregate up from statsd to per-second metrics, which will obscure the events I'm interested in. Optional aggregation would be fine in some cases, but not always.
Cube takes events with millisecond timestamps, but Cubism appears to be a rich library and not a full-fledged platform like Graphite. It also appears to be heavily real-time oriented. If I can't find a good stack to meet my needs I'll probably use Cube to store my data, but visualizing it with batch scripts that generate piles and piles of matplotlib graphs is not fun.
Am I misinformed, or is there another framework out there which will give me decent analysis/interactivity with an arbitrary time granularity?
Cubism.js is just a front-end for Graphite (and other back-ends, like Cube), so I think it would fit your needs.
You would need to set up a Graphite system to store your metrics (rather than Cube) with the appropriate level of detail (e.g. millisecond), and then use Cubism's Graphite context to display it with the same step value.
I'm building a data visualization system for Forex trading and I'm exploring ways of storing the historical Forex trading tick data that I have.
The data are chronological ticks of ask and bid prices per currency pair (e.g. USD/CAD). At the end of the day I need my data to be indexed in Elasticsearch, and what I'm searching for is the best way to get them there.
I found a couple of approaches online; they start out simple but then get complicated. I'm wondering if adding that extra complexity is worth it. Some of my options are:
Storing tick data in PostgreSQL and syncing it to Elasticsearch via a plugin (here)
Storing tick data in PostgreSQL, pushing it to Logstash and then to Elasticsearch
Finally, storing tick data in PostgreSQL, pushing it to Redis, then to Logstash, and then to Elasticsearch
My intuition says that solution No 2 would be the ideal one, but what is considered best practice?
It's a good idea to store your data in a long-term storage DB, such as PostgreSQL or similar. That way you can decide at any time whether you need to change your mappings, add fields, remove fields, change their types, or what have you, and then you can easily rebuild your ES index/indices without too much trouble from your primary source of truth (i.e. PostgreSQL) and you always have clean data in ES.
I don't know ZomboDB (solution 1), so I can't really speak for it. All I can say is that I'm generally not too fond of tying two different technologies together: it makes it hard to upgrade either of them when you need or want to apply patches or benefit from new features.
Unless you have big and costly transformations to do on your source data, I feel that solution 3 doesn't bring much: the extra step of staging data in an intermediary Redis adds little, in my opinion (your mileage may vary here). A temporary store such as Redis or Kafka is a good idea when you risk losing data along the pipeline, but in this case, since you have your data in PostgreSQL, you don't really run the risk of losing anything. If anything goes wrong, you can relaunch your pipeline and rebuild a few days of data.
That leaves solution 2, which would be fine given the information at hand. Using the Logstash JDBC input, you can easily retrieve the latest changes and forward them to ES every x minutes.
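If you want to prototype what that sync does before wiring up Logstash, here is a rough Python sketch of the same idea: pull rows newer than a watermark from PostgreSQL and bulk-index them into Elasticsearch. The table, column, and index names are invented for illustration, not taken from your setup.

import psycopg2
from elasticsearch import Elasticsearch, helpers

# All names below (table, columns, index) are placeholders.
PG_DSN = "dbname=forex user=forex"
ES_INDEX = "ticks"

def sync_new_ticks(es, last_seen_ts):
    """Index every tick newer than last_seen_ts; return the new watermark."""
    with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT pair, ask, bid, ts FROM ticks WHERE ts > %s ORDER BY ts",
            (last_seen_ts,),
        )
        rows = cur.fetchall()
    actions = (
        {"_index": ES_INDEX,
         "_source": {"pair": pair, "ask": ask, "bid": bid, "ts": ts.isoformat()}}
        for pair, ask, bid, ts in rows
    )
    helpers.bulk(es, actions)
    return rows[-1][3] if rows else last_seen_ts

Logstash's JDBC input does essentially the same thing on a schedule, which is why solution 2 stays operationally simple.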
Eric from ZomboDB here. I wanted to try and answer your question as it relates to ZDB.
ZomboDB is really designed for full-text searching within Postgres. It's important to note that it's not a tool to synchronize your PG data to Elasticsearch. It's a fully-functional Postgres index type (akin to the built-in types like btree, gin, and gist) that happens to be backed by Elasticsearch. The fact that ZomboDB uses Elasticsearch is really an implementation detail.
While ZDB does provide a number of UDFs that expose access to ES' aggregate facilities, again, it's really designed for text searching.
So if your data is really just pairs of numbers, you're probably better off using ES directly -- especially if you're loading in one batch per day. There's no doubt that ZDB could provide superior aggregate performance compared to standard Postgres "GROUP BY" queries (because it passes it through to Elasticsearch), but you're paying a heavy operational penalty for a limited use-case.
If, on the other hand, your ask/bid data comes with a lot of related metadata, and:
You need PG to be your source of truth,
You need to text-search that metadata (with or without aggregation support), and
You don't want to learn ES and introduce another database system to your application, then...
... ZomboDB could be right for you.
I suspect Stack Overflow isn't the place to get into this, so feel free to contact me via the ways ZDB's github page recommends.
I have somewhat of a unique problem that looks similar to the problem discussed here:
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps and picking out specific packets to save in some format in a C++ program. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond resolution and have another application sort out the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My requirements are:
Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
Be able to export chunks of this data into MATLAB (HDF5).
Query this data once or twice a day for analytics purposes
Another nice thing that's not a hard requirement is:
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention all 4 of these boxes are in physically different locations, so there is some overhead in sharing data.
The second requirement is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into MATLAB are something like "Pull metric X between time Y and Z", so this would eventually have to end up in HDF5 format. There is an external library called MatIO that I can use to write MATLAB files if needed, but it would be even better if there wasn't a translation step. I have read the entire thread mentioned above, and a few options appear to stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get the data into the MATLAB HDF5 format, and whether any of these would make that harder than the others.
If anyone has experience doing something similar, it would be a big help. Thanks!
A kdb+ tickerplant is certainly capable of capturing data at that rate; however, there are lots of things you need to make sure of (whatever solution you pick):
Do the machine(s) that are capturing the data have enough cores? Best to taskset a tickerplant, for example, to a core that nothing else will contend with
Similarly with disk - SSD, be sure there is no contention on the bus
Separate the workload - you can write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
Basically, there are lots of ways you can cut this. I can say, though, that with the appropriate hardware kdb+ could do the job. However, given that you want HDF5, it's probably even better to have a simple process capturing the data and writing/converting it to disk on the fly.
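To make that last point concrete, here is a minimal PyTables sketch of a capture process that appends timestamped rows to an HDF5 file and later pulls one metric between two times. The table layout, names, and use of epoch-second floats are assumptions for illustration, not something from the thread, and MATLAB may still need h5read or a small conversion step to consume the PyTables layout.

import time
import tables as tb

class Sample(tb.IsDescription):
    timestamp = tb.Float64Col()   # epoch seconds; fractional part keeps sub-second precision
    metric    = tb.StringCol(16)  # e.g. b"pkt_len"
    value     = tb.Float64Col()

h5 = tb.open_file("capture.h5", mode="a")
table = (h5.root.samples if "/samples" in h5
         else h5.create_table("/", "samples", Sample, "captured packet metrics"))

def append_batch(batch):
    """batch is an iterable of (metric, value) pairs captured just now."""
    row = table.row
    now = time.time()
    for metric, value in batch:
        row["timestamp"] = now
        row["metric"] = metric
        row["value"] = value
        row.append()
    table.flush()  # keep the on-disk file close to real time

def pull(metric, y, z):
    """'Pull metric X between time Y and Z' as an in-kernel query."""
    cond = "(metric == name) & (timestamp >= y) & (timestamp <= z)"
    return table.read_where(cond, condvars={"name": metric, "y": y, "z": z})

append_batch([(b"pkt_len", 1514.0), (b"latency_us", 83.0)])
rows = pull(b"pkt_len", 0.0, time.time())
h5.close()

At 50 MB/s sustained you would probably want to batch the appends and consider PyTables compression filters, but the structure stays the same.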
Are there any technologies that can take raw semi-structured, schema-less big data input (say from HDFS or S3), perform near-real-time computation on it, and generate output that can be queried or plugged in to BI tools?
If not, is anyone at least working on it for release in the next year or two?
There are some solutions that take big semi-structured input and produce queryable output, but they are usually
one-off
expensive
fairly secretive
If you are able to avoid direct computation with neural networks or expert systems, you will be close enough to a low-latency system. All you need is a team of brilliant mathematicians to model your problem, a team of programmers to turn the model into code, and some cash to buy servers and the input/output bandwidth they need.
Have you taken a look at Splunk? We use it to analyze Windows Event Logs and Splunk does an excellent job indexing this information to allow for fast querying of any string that appears in the data.
My Data
It's primarily monitoring data, passed in the form of Timestamp: Value for each monitored value on each monitored appliance, and collected regularly across many appliances and many monitored values.
Additionally, it has the quirky feature of many of these data values being derived at the source, with the calculation changing from time to time. This means that my data is effectively versioned, and I need to be able to simply call up only data from the most recent version of the calculation. Note: This is not versioning where the old values are overwritten. I simply have timestamp cutoffs, beyond which the data changes its meaning.
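To make the cutoff idea concrete, a toy Python illustration (the metric names and numbers are made up, not part of the actual data):

# Toy illustration of the "timestamp cutoff" versioning described above.
# For each derived metric we keep the epoch timestamps at which its source
# calculation changed; only points at or after the latest cutoff belong to
# the current version. All names and numbers here are invented.
cutoffs = {
    "queue_depth_derived": [1325376000, 1370044800],  # two recalculations
}

def current_version_points(metric, points):
    """points is a list of (timestamp, value); keep only the latest version."""
    latest = max(cutoffs.get(metric, [0]))
    return [(ts, v) for ts, v in points if ts >= latest]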
My Usage
Downstream, I'm going to have various undefined data mining/machine learning uses for the data. It's not really clear yet what those uses are, but it is clear that I will be writing all of the downstream code in Python. Also, we are a very small shop, so I can really only deal with so much complexity in setup, maintenance, and interfacing to downstream applications. We just don't have that many people.
The Choice
I am not allowed to use a SQL RDBMS to store this data, so I have to find the right NoSQL solution. Here's what I've found so far:
Cassandra
Looks totally fine to me, but it seems like some of the major users have moved on. It makes me wonder if it's just not going to be that much of a vibrant ecosystem. This SE post seems to have good things to say: Cassandra time series data
Accumulo
Again, this seems fine, but I'm concerned that this is not a major, actively developed platform. It seems like this would leave me a bit starved for tools and documentation.
MongoDB
I have a, perhaps irrational, intense dislike for the Mongo crowd, and I'm looking for any reason to discard this as a solution. It seems to me like the data model of Mongo is all wrong for things with such a static, regular structure. My data even comes in (and has to stay in) order. That said, everybody and their mother seems to love this thing, so I'm really trying to evaluate its applicability. See this and many other SE posts: What NoSQL DB to use for sparse Time Series like data?
HBase
This is where I'm currently leaning. It seems like the successor to Cassandra with a totally usable approach for my problem. That said, it is a big piece of technology, and I'm concerned about really knowing what it is I'm signing up for, if I choose it.
OpenTSDB
This is basically a time-series specific database, built on top of HBase. Perfect, right? I don't know. I'm trying to figure out what another layer of abstraction buys me.
My Criteria
Open source
Works well with Python
Appropriate for a small team
Very well documented
Has specific features to take advantage of ordered time series data
Helps me solve some of my versioned data problems
So, which NoSQL database actually can help me address my needs? It can be anything, from my list or not. I'm just trying to understand what platform actually has code, not just usage patterns, that support my super specific, well understood needs. I'm not asking which one is best or which one is cooler. I'm trying to understand which technology can most natively store and manipulate this type of data.
Any thoughts?
It sounds like you are describing one of the most common use cases for Cassandra. Time series data in general is often a very good fit for the Cassandra data model. More specifically, many people store metric/sensor data like you are describing. See:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://engineering.rockmelt.com/post/17229017779/modeling-time-series-data-on-top-of-cassandra
As far as your concerns about the community go, I'm not sure what is giving you that impression; there is quite a large community (see IRC and the mailing lists) as well as a growing number of Cassandra users.
http://www.datastax.com/cassandrausers
Regarding your criteria:
Open source
Yes
Works well with Python
http://pycassa.github.com/pycassa/
Appropriate for a small team
Yes
Very well documented
http://www.datastax.com/docs/1.1/index
Has specific features to take advantage of ordered time series data
See above links
Helps me solve some of my versioned data problems
If I understand your description correctly, you could solve this in multiple ways. You could start writing a new row when the version changes. Alternatively, you could use composite columns to store the version along with the timestamp/value pair.
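For concreteness, here is a rough pycassa sketch of the "new row per calculation version" approach. The keyspace, column family, and key layout are assumptions for illustration, and it assumes the column family's comparator is DateType so columns sort by timestamp.

from datetime import datetime
import pycassa

# Placeholder names; assumes a 'metrics' column family whose comparator is
# DateType, so each column name is a timestamp and columns stay sorted in time.
pool = pycassa.ConnectionPool('monitoring', ['localhost:9160'])
metrics = pycassa.ColumnFamily(pool, 'metrics')

def write_point(appliance, metric, version, ts, value):
    # One row per (appliance, metric, calculation version): bumping the version
    # starts a fresh row, so old and new calculations never mix.
    row_key = '%s:%s:v%d' % (appliance, metric, version)
    metrics.insert(row_key, {ts: str(value)})

def read_range(appliance, metric, version, start, end):
    # Column slice between two timestamps within one version's row.
    row_key = '%s:%s:v%d' % (appliance, metric, version)
    return metrics.get(row_key, column_start=start, column_finish=end)

write_point('appliance-7', 'cpu_load', 2, datetime.utcnow(), 0.42)

The composite-column variant would instead keep one row per appliance/metric and encode (version, timestamp) as a composite column name, so you can slice out the newest version without minting new row keys.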
I'll also note that Accumulo, HBase, and Cassandra all have essentially the same data model. You will still find small differences around the data model in regards to specific features that each database offers, but the basics will be the same.
The bigger difference between the three will be the architecture of the system. Cassandra takes its architecture from Amazon's Dynamo: every server in the cluster is the same, and it is quite simple to set up. HBase and Accumulo are more direct clones of BigTable. These have more moving parts and will require more setup and more types of servers: for example, HDFS, ZooKeeper, and the HBase/Accumulo-specific server roles.
Disclaimer: I work for DataStax (we work with Cassandra)
I only have experience with Cassandra and MongoDB, but that experience might add something.
So you're basically doing time-based metrics?
OK, if I understand right, you use the timestamp as a versioning mechanism, so you query by a certain timestamp: say, to get the latest calculation used, you go by the metric ID (or whatever), sort by ts DESC, and take the first row?
It sounds like a versioned key value store at times.
With this in mind I probably would not recommend either of the two I have used.
Cassandra is too rigid and too hierarchical; it's so based around how you query that you can only really make one pivot of graph data out of a column family (I presume you would want to graph these metrics), which is crazy, hence why I dropped it. As for searching (which is what Facebook used it for, and only that), it's not that impressive either.
MongoDB: well, I love MongoDB and I'm an active member of the user group, and it could work here if you didn't use a key-value storage policy. But at the end of the day, if your mind is not set and you don't like the tech, then let me be the very first to say: don't use it! You will be no good with a tech you don't like, so stay away from it.
Though I would picture this happening in Mongo much like:
{
    _id: ObjectId(),
    metricId: 'AvailableMessagesInQueue',
    formula: '4+5/10.01',
    result: NaN,
    ts: ISODate()
}
And you query for the latest version of your calculation by:
// Sort newest-first and take the first document
var results = db.metrics.find({ 'metricId': 'AvailableMessagesInQueue' }).sort({ ts: -1 }).limit(1);
var latest = results.next();
That would output the doc structure you see above. Without knowing more about exactly how you wish to query, and the general server and app scenario, that's the best I can come up with.
I found this thread on HBase though: http://mail-archives.apache.org/mod_mbox/hbase-user/201011.mbox/%3C5A76F6CE309AD049AAF9A039A39242820F0C20E5#sc-mbx04.TheFacebook.com%3E
It might be of interest; it seems to support the argument that HBase is a good time-based key-value store.
I have not personally used HBase so do not take anything I say about it seriously....
I hope I have added something; if not, you could try narrowing your criteria so we can answer more specific questions.
Hope it helps a little,
Not a plug for any particular technology but this article on Time Series storage using MongoDB might provide another way of thinking about the storage of large amounts of "sensor" data.
http://www.10gen.com/presentations/mongodc-2011/time-series-data-storage-mongodb
Axibase Time-Series Database
Open source
There is a free Community Edition
Works well with Python
https://github.com/axibase/atsd-api-python. There are also other language wrappers, for example ATSD R client.
Appropriate for a small team
Built-in graphics and rule engine make it productive for building an in-house reporting, dashboarding, or monitoring solution with less coding.
Very well documented
It's hard to beat IBM Redbooks, but we're trying. The API, configuration, and administration are documented in detail and with examples.
Has specific features to take advantage of ordered time series data
It's a time-series database from the ground up, so aggregation, filtering, and non-parametric ARIMA and Holt-Winters forecasts are available.
Helps me solve some of my versioned data problems
ATSD supports versioned time-series data natively in SE and EE editions. Versions keep track of status, change-time and source changes for the same timestamp for audit trails and reconciliations. It's a useful feature to have if you need clean, verified data with tracing. Think energy metering, PHMR records. ATSD schema also supports series tags, which you could use to store versioning columns manually if you're on CE edition or you need to extend default versioning columns: status, source, change-time.
Disclosure - I work for the company that develops ATSD.
I need to design a near-real-time system where documents (with fields: id, keywords, timestamp) are added to the system. The requirement is to get the top-k keywords from the documents added in the last x minutes. The typical document addition rate is around 100 documents/sec, which may increase in the future (hence the technology should be horizontally scalable).
I am thinking of using Solr facets (with sharding) to generate the top-k keywords, though I am a bit concerned about the high writes/sec for Solr. Another option is to use Cassandra, but I'm not sure how it will scale for range queries (to compute aggregates), as the OrderPreservingPartitioner could make it difficult to distribute the load.
This sounds like a Complex Event Processing problem, not a database or search problem. You don't care so much about keeping the data as you do about getting keyword trends or something similar in real time (see the sketch after the list below). Some vendors/libraries that you'd want to look at are (with many omissions):
Microsoft/.NET:
StreamInsight - Microsoft's CEP offering
Reactive Extensions for .NET - extensions to .NET for stream processing
Java:
Esper - very popular OSS stream processing library
Storm - stream processing on Hadoop, likely to scale to extremely large data volumes
3rd Party / Proprietary (Won't comment on these, don't know enough about them):
StreamBase
Aleri
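For intuition, the core of what a CEP engine does here is maintain a sliding window of keyword counts. A single-process toy sketch in Python follows; the window size and function names are made up, and a real deployment at 100+ docs/sec across machines would use one of the tools above rather than this.

from collections import Counter, deque
import time

WINDOW_SECONDS = 5 * 60   # "last x minutes"

events = deque()          # (timestamp, keyword) pairs, oldest first
counts = Counter()        # live keyword counts inside the window

def add_document(keywords, ts=None):
    """Register the keywords of one incoming document."""
    ts = time.time() if ts is None else ts
    for kw in keywords:
        events.append((ts, kw))
        counts[kw] += 1

def top_k(k, now=None):
    """Expire events older than the window, then return the k most common keywords."""
    now = time.time() if now is None else now
    while events and events[0][0] < now - WINDOW_SECONDS:
        _, kw = events.popleft()
        counts[kw] -= 1
        if counts[kw] == 0:
            del counts[kw]
    return counts.most_common(k)

Tools like Esper or Storm express this same windowed aggregation, but as declarative window queries or as counting partitioned across machines.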
Use SQL and NoSQL at the same time: SQL for the queries and NoSQL for the complex data. It means writing the persistence code twice, but you can solve each problem with the tool that fits its needs.