How to get all time-series InfluxDB entries with one Python query? - raspberry-pi

I have a question about using Python together with InfluxDB. I've got multiple Raspberry Pis collecting time-series data from sensors (like temperature, humidity, ...) and saving it to my InfluxDB.
Now I want to use another Pi to access that InfluxDB data and do some calculations, like the similarity of those time series. Because the number of entries can differ from time to time, I want to dynamically ask for a list of all entries and then query that data.
I worked through this really helpful tutorial: https://www.influxdata.com/blog/getting-started-python-influxdb/
There it's stated to use
client.get_list_database()
to get a list of all databases, which in my case returns:
[{'name': 'db1'}, {'name': 'db2'}, {'name': 'sensordata'}]
My goal now is to "go deeper" into the sensordata database and get a list of all time series contained in that database, for example RP1-Temperature1, RP2-Brightness1, and so on. So, to make things clear: my magic query would take the database name and the desired length and would return a Python dictionary containing the names and values of the time series.
Thanks in Advance!!

The Python client allows you to query the database with InfluxQL.
The command
SHOW SERIES
will yield all series contained within a database.
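For example, with the influxdb Python package it could look roughly like this (an untested sketch - host/port and the LIMIT are my own assumptions; the database name is from your question):
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086)
client.switch_database('sensordata')

# all series (measurement name plus tag set) in the database
series = [s['key'] for s in client.query('SHOW SERIES').get_points()]

# build a dict of {measurement name: list of recent points}
measurements = [m['name'] for m in client.query('SHOW MEASUREMENTS').get_points()]
data = {}
for name in measurements:
    result = client.query('SELECT * FROM "{0}" ORDER BY time DESC LIMIT 100'.format(name))
    data[name] = list(result.get_points())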
What to do with the result is up to you and I think you should be good on your own from here.
Actually reading the Influx Python client documentation would have answered most of your question.

Related

Pagination Options in KDB

I am looking to support a use case that returns kdb datasets back to users. The users connect to kdb using the Java API, run the query synchronously and retrieve the results.
However, issues come up when returning larger datasets, so I would like to return the data from kdb to the Java process in pages/slices. Unfortunately, users need to be able to run queries that return millions of rows, and it would be easier to handle if they were passed back in slices of, say, 100,000 rows (Cassandra and other DBs do this sort of thing).
The potential approaches I have come up with are as follows:
Run the "where" part of the query on the database and return only the indices/date partitions (if applicable) of the data required. The java process would then use these indices to select the data required slice by slice . This approach would control memory usage on the kdb side as it would not have to load all HDB data required at once. However, overall this would increase the run time of the query as data would have to be searched/queried multiple times. This could work well for simple selects but complicated queries may need to go through an "onboarding" process which I want to avoid.
Store the results of the query in a global variable in kdb, which the Java process can then query slice by slice (sketched below). This simpler method could support any query but could potentially hit limits on the kdb side (memory/timeout) if too large a dataset is queried.
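Roughly what I have in mind for this second approach, illustrated here with Python/qPython only for brevity (the real client is the Java API; the port and the .paging.run/.paging.page helper names are made up):
from qpython import qconnection

q = qconnection.QConnection(host='localhost', port=5000)  # hypothetical gateway port
q.open()

# server side: run the full query once, keep the result in a global,
# then hand back fixed-size slices on demand
q('.paging.run:{[qry] `.paging.res set value qry; count .paging.res}')
q('.paging.page:{[i;n] (i;n) sublist .paging.res}')

# client side: fetch the result page by page
total = int(q('.paging.run["select from trade where sym=`AAA"]'))
page_size = 100000
for start in range(0, total, page_size):
    chunk = q('.paging.page[{0};{1}]'.format(start, page_size))
    # hand chunk back to the user / process it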
Other points to consider:
It should support users running queries on any type of process - gateway, hdb, rdb etc
It should support more than just simple selects e.g.
((1!select sym, price from trade where sym=`AAA) uj
1!select sym,price from order where sym=`AAA)
lj select avgBid:avg bid by sym from quote where sym=`AAA
The paging functionality should be hidden from the end user
Does anyone have any views on whether there are any options available other than the ones listed above? Essentially I am looking for a select[m n]-type approach that supports any query.

Which data model to choose for a big data project with > 100 million items

I am working on a big data project where large amounts of product information is gathered from different online sellers, such as prices, titles, sellers and so on (30+ data points per item).
In general, there are 2 use cases for the project:
Display the latest data points for a specific product in a web app or widget
Analyze historical data, e.g. price history, product clustering, semantic analysis and so on
I first decided to use MongoDB to be able to scale horizontally as the data stored for the project is assumed to be in the range of hundreds of GBs and the data could be sharded dynamically with MongoDB across many MongoDB instances.
The 30+ data points per product won't be collected at once, but at different times, e.g. one crawler collects the prices, and a couple of days later another one collects the product description. However, some data points might overlap because both crawlers collect e.g. the product title. For example, the result could be something like:
Document 1:
{
'_id': 1,
'time': ISODate('01.05.2016'),
'price': 15.00,
'title': 'PlayStation4',
'description': 'Some description'
}
Document 2:
{
'_id': 1,
'time': ISODate('02.05.2016'),
'price': 16.99,
'title': 'PlayStation4',
'color': 'black'
}
Therefore I initially came up with the following idea (Idea 1):
All the data points found at one specific crawl process end up in one document as described above. To get the latest product info, I would then query each data point individually and get the newest entry that is not older than some threshold, e.g. a week, to make sure that the product info is not outdated for "Use Case 1" and that we have all the data points (because a single document may not include all data points but only a subset).
However, as some data points (e.g. product titles) do not change regularly, just saving all the data all the time (to be able to do time series analysis and advanced analytics) would lead to massive redundancy in the database, e.g. the same product description would be saved every day even though it doesn't change. Therefore I thought I might check the latest value in the DB and only save the value if it has changed. However, this leads to a lot of additional DB queries (one for each data point) and, due to the time threshold mentioned above, we would lose the information whether the data point did not change or was removed from the website by the owner of the shop.
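For Idea 1, the read side could look roughly like this with pymongo (an untested sketch; the connection string, collection name and the product_id field are placeholders I made up, not part of my schema):
from datetime import datetime, timedelta
from pymongo import MongoClient, DESCENDING

client = MongoClient('mongodb://localhost:27017')
products = client['crawler']['products']

def latest_value(product_id, field, max_age_days=7):
    # newest document that carries this data point and is not older than the threshold
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)
    doc = products.find_one(
        {'product_id': product_id, field: {'$exists': True}, 'time': {'$gte': cutoff}},
        sort=[('time', DESCENDING)])
    return doc[field] if doc else None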
Thus, I was thinking about a different solution (Idea 2):
I wanted to split up all the data points into different documents, e.g. the price and the title are stored in separate documents with their own timestamps. If a data point does not change, the timestamp can be updated to indicate that the data point did not change and is still available on the website. However, this would lead to a tremendous overhead for small data points, e.g. just boolean values, because every document needs its own key, timestamp and so on to be able to find/filter/sort them quickly using indexes.
For example:
{
'_id': 1,
'timestamp': ISODate('04.05.2016'),
'type': 'price',
'value': 15.00
}
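The write side of Idea 2 would then be roughly the following (again pymongo, with made-up names): touch the timestamp if the value is unchanged, insert a new document otherwise.
from datetime import datetime
from pymongo import MongoClient, DESCENDING

client = MongoClient('mongodb://localhost:27017')
points = client['crawler']['datapoints']

def record_datapoint(product_id, dp_type, value):
    now = datetime.utcnow()
    latest = points.find_one(
        {'product_id': product_id, 'type': dp_type},
        sort=[('timestamp', DESCENDING)])
    if latest and latest['value'] == value:
        # data point unchanged and still on the website: just refresh the timestamp
        points.update_one({'_id': latest['_id']}, {'$set': {'timestamp': now}})
    else:
        # value changed (or first sighting): start a new document
        points.insert_one({'product_id': product_id, 'type': dp_type,
                           'value': value, 'timestamp': now})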
Therefore, I am struggling to find the right model and / or database to use for this project. To sum it up, these are the requirements:
Collect hundreds of millions of products (hundreds of GBs even TBs)
Overlapping subsets of product information are retrieved by distributed crawlers at different points of time
Information should be stored in a distributed, horizontally scalable database
Data redundancy should be reduced to a minimum
Time series information about the data points should be retained
I would be very grateful for any ideas (data model / architecture, different database, ...) that might help me advance the project. Thanks a lot in advance!
Are the fields / data points already known and specified? I.e., do you have a fixed schema? If so, then you can consider relational databases as well.
DB2 has what they call temporal tables. In the 'system' form, the DB handles versioning transparently. Any inserts are automatically timestamped, and whenever you update a row, the previous row is automatically migrated to a history table (keeping its old timestamp). Thereafter, you can run SQL queries at any given point in time, and DB2 will return the data as it was at the time (or time range) specified. They also have an 'application' form, in which you specify the time periods that the row is valid for when you insert the row (e.g. if prices are valid for a specific period of time), but the ultimate SQL queries still work the same way. What's nice is that either way, all the time complexity is managed by the database and you can write relatively clean SQL queries.
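From Python that could look roughly like this (a sketch I haven't run; the connection string and table/column names are invented, though FOR SYSTEM_TIME AS OF is the documented DB2 clause):
import ibm_db

# hypothetical connection details
conn = ibm_db.connect(
    'DATABASE=products;HOSTNAME=db2host;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=secret', '', '')

# ask for the row as it looked on a given date; DB2 consults the history table itself
sql = ("SELECT price, title, description FROM product_info "
       "FOR SYSTEM_TIME AS OF TIMESTAMP '2016-05-01 00:00:00' "
       "WHERE product_id = 1")
stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)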
You can check out more at their DeveloperWorks site.
I know that other relational DBs like Oracle also have special capabilities for time series data that manage the versioning / timestamping stuff for you.
As far as space efficiency and scale, I'm not sure as I don't run any databases that big :-)
(OTOH, if you don't have a fixed schema, or you know you'll have multiple schemas for the different data inputs and you can't model it with sparse tables, then a document DB like mongo might be your best bet)

What data structure to use for timeseries data logging in Mongodb

I have 21 million rows (lines in csv files) that I want to import into MongoDB to report on.
The data comes from a process on each PC within our organisation, which creates a row every 15 minutes showing who is logged on.
Columns are: date/time, PC Name, UserName, Idle time (if user logged on)
I need to be able to report from a PC POV (PC usage metrics) and a User POV (user dwell time and activity/movement).
Initially I just loaded the data using mongoimport. But this raw data structure is not easy to report on. This could simply be my lack of knowledge of MongoDB.
I have been reading http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb which is a great article on schema design for time series data in mongodb.
This makes sense for reporting on PC usage - as I could pre-process the data and load it into Mongo as one document per PC/date combination, with an array of hourly buckets.
However I suspect this would make reporting from the user POV difficult.
I'm now thinking of creating two collections - one for PC data and another for user data (one document per user/date combination, etc.).
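For the PC collection, what I have in mind is roughly this (a pymongo sketch; connection and field names are just placeholders): one document per PC per day, upserted per 15-minute sample into an hourly bucket, following the schema-design article linked above.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
pc_usage = client['logging']['pc_usage']

def log_sample(pc_name, user_name, ts, idle_minutes):
    # one document per PC per day, with an embedded bucket per hour
    day = ts.strftime('%Y-%m-%d')
    hour = str(ts.hour)
    pc_usage.update_one(
        {'_id': '{0}:{1}'.format(pc_name, day)},
        {'$inc': {'hours.{0}.samples'.format(hour): 1,
                  'hours.{0}.idle_minutes'.format(hour): idle_minutes},
         '$addToSet': {'hours.{0}.users'.format(hour): user_name}},
        upsert=True)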
I would like to know if I'm on the right track - or if anyone could suggest a better solution, or if indeed the original, raw data would suffice and I just need to know how to query it from both angles (some kind of map-reduce).
Thanks
Tim

Is it possible to visualize an HBase table using JavaScript

I am new to HBase. Here is my problem.
I have a very large HBase table. Here is an example row from the table:
1003:15:Species1:MONTH:01 0.1,02 0.7,03 0.3,04 0.1,05 0.1,06 0,07 0,08 0,09 0.1,10 0.2,11 0.3,12 0.1:LATITUDE 26.664503840000002 29.145674380000003,LONGITUDE -96.27139215 -90.40762858
As you can see, for each species there is a month attribute (12 monthly values), latitude & longitude, etc. There are around 300 unique species and several thousand observations for one particular species.
I have written a MapReduce job which does K-means clustering on one particular species. The output of my MR job is
C1:1003:15:Species1:MONTH:01 0.1,02 0.7,03 0.3,04 0.1,05 0.1,06 0,07 0,08 0,09 0.1,10 0.2,11 0.3,12 0.1:LATITUDE 26.664503840000002 29.145674380000003,LONGITUDE -96.27139215 -90.40762858
The C1 indicates which cluster it belongs to.
Now I want to visualize the output i.e plot all the Lat and Long for each cluster on a Map. I was thinking of using Mapbox.js and D3.js for my data visualization, since the Lat and Longs in the data are bounding boxes for a particular region.
If I write the output of my MR job back to HBase, is it possible to retrieve the data using JavaScript on the client side?
I was thinking of either writing the data to MongoDB, which I can query using JS, or writing a program to create JSON from the HBase table which I can visualize. Any suggestions?
You can use the HBase REST API, though security-wise it is probably safer to put your own service in the middle.
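To give an idea of the shape of the REST responses (the same call works from browser JavaScript via fetch/XHR), here is a rough sketch - host, port, table and row key are placeholders, and note that column names and values come back base64-encoded:
import base64, requests

resp = requests.get('http://hbase-rest-host:8080/speciesdata/C1:1003:15:Species1',
                    headers={'Accept': 'application/json'})
resp.raise_for_status()
for row in resp.json().get('Row', []):
    for cell in row.get('Cell', []):
        column = base64.b64decode(cell['column']).decode()
        value = base64.b64decode(cell['$']).decode()
        print(column, value)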
You can also use node-hbase-client (https://github.com/alibaba/node-hbase-client) to read the HBase data.
You can also use hbase-rpc-client (https://github.com/falsecz/hbase-rpc-client) to read data from Node.js. This client supports HBase 0.96+.

Realtime querying/aggregating millions of records - Hadoop? HBase? Cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof of concept, I can see this will scale nicely, but Hadoop/HDFS has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results back to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially, each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
Check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of datasets A and B combined as sqrt(E[X^2] - (E[X])^2) over the combined data, where E[X^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and E[X] = (sum(A) + sum(B)) / (|A| + |B|).
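In other words, each dataset only needs to contribute a (count, sum, sum of squares) triple, and any selection of datasets can then be aggregated from those triples without rescanning the data. A small Python illustration (example numbers made up):
import math

def combine(parts):
    # parts: list of (count, sum, sum_of_squares) triples, one per dataset
    n = s = sq = 0
    for c, sm, ss in parts:
        n += c; s += sm; sq += ss
    mean = s / n
    return mean, math.sqrt(sq / n - mean ** 2)

# dataset A = [1, 2, 3] -> (3, 6.0, 14.0); dataset B = [4, 5] -> (2, 9.0, 41.0)
print(combine([(3, 6.0, 14.0), (2, 9.0, 41.0)]))   # (3.0, ~1.414)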
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open-source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open-source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give you an idea of what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time, you can store your data differently for better results. In HBase that would look something like:
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering; otherwise you'd have to do a two-phase read
column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, on which you can compute avg etc. quite easily.
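With that layout the read path is a single row fetch per key; roughly like this with the happybase Python client (a sketch only - the table name, the column family 'd' and the Thrift gateway host are all assumptions):
import happybase

connection = happybase.Connection('hbase-host')   # assumes the HBase Thrift gateway is running
price_table = connection.table('price')           # one table per numeric data column

def average_for_key(key, dataset_ids):
    # one column per dataset inside a single column family 'd'
    row = price_table.row(key.encode())
    values = [float(row[('d:%s' % ds).encode()])
              for ds in dataset_ids
              if ('d:%s' % ds).encode() in row]
    return sum(values) / len(values) if values else None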
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like your data set needs joins, you should be fine. You can have the indexes set up to find the data sets, and then do the math either in SQL or in the app.