Knowledge Graph for Time-Series Data - knowledge-graph

Would storing time series data in a Knowledge Graph be a good idea? What could be the benefits of doing so?

It depends on the queries you want to do on the time series data, but I suspect the answer is NO.
Typical queries on time series data include the following:
moving averages; e.g. 30 day average of stock prices
median
accounting functions; e.g. average growth rate, amortization, internal rate of return and so on.
statistical functions; e.g. autocorrelation, and correlation between two series.
pattern finding; i.e. find a time series (or multiple time series) that has a similar pattern to this time series
In general, time series data calls for aggregating over a collection of data points rather than for creating a graph of the data, which will likely cause time-series queries to perform poorly on a graph-like database (the moving-average sketch below illustrates the shape of such a query).
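To illustrate, here is a minimal sketch of the first query type (a 30-day moving average) in pandas; the CSV source and column names are hypothetical:

    import pandas as pd

    # Hypothetical input: one row per trading day with a closing price.
    prices = (pd.read_csv("stock_prices.csv", parse_dates=["date"])
                .set_index("date")
                .sort_index())

    # A rolling window over ordered samples -- a scan-and-aggregate workload
    # that columnar/time-series stores handle well and that a graph traversal
    # does not naturally express.
    prices["ma_30d"] = prices["close"].rolling("30D").mean()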
Another factor to consider is that the amount of data stored for a time series can be far bigger than that of a typical knowledge graph, depending on the sample rate of the time series data.
Here are some of the references that brought me to this conclusion:
Indexing Strategies for Time Series Data
Demystifying Graph Databases - Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries

Related

Athena/DDB to condense millions of data points for plotting them on a graph

I need to plot trend charts on the react app based on user inputs such as timestamps, devices, etc. I have related time series data in DynamoDB and S3 (which I can query using Athena).
Returning all those millions of data points for a graph seems unreasonable and is super laggy.
I guess one option is "binning", where I decide the number of bins based on how big the time range is and take averages of the readings in each bin. However, I'm concerned about how well that will capture the drops and highs we need to show accurately.
Both Athena queries and DDB queries (due to the 1 MB page limit) seem fairly slow so far.
Of course, the size of the response payload is another concern, as API Gateway and Lambda limit it to 10 MB and 6 MB respectively.
Any ideas?
I can't suggest anything smarter than "binning", but if you are concerned that the bucket interval might become too wide and performance might suffer, you can fix the interval. Then create more than one table. For example, the interval can be 1 hour and you can have a new table for each week.
This is what we did when we had to deal with time series in DynamoDB. At some point, we decided to switch to Amazon Timestream.
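For what it's worth, a minimal binning sketch in Python, assuming the raw readings have already been fetched as (timestamp, value) pairs (all names here are hypothetical). Keeping a per-bin min and max alongside the average is one way to preserve the drops and spikes that averaging alone would smooth out:

    def bin_points(points, start_ts, end_ts, num_bins=800):
        """Downsample (timestamp, value) pairs to at most num_bins buckets."""
        width = (end_ts - start_ts) / num_bins
        bins = {}
        for ts, value in points:
            i = min(int((ts - start_ts) / width), num_bins - 1)
            s = bins.setdefault(i, [0.0, 0, value, value])  # sum, count, min, max
            s[0] += value
            s[1] += 1
            s[2] = min(s[2], value)
            s[3] = max(s[3], value)
        # One (bin_center, avg, min, max) tuple per non-empty bin.
        return [(start_ts + (i + 0.5) * width, s[0] / s[1], s[2], s[3])
                for i, s in sorted(bins.items())]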

Effective way to display the data in the chart

I have an application where some values are stored in a DB, e.g. one value per second. That is 604800 values per 7 days, and if I want to view these values in a graph I need some effective way to get only, e.g., 800 values from the DB when the chart is 800 px wide.
I use some aggregation logic where a mean value is computed for the values in 2, 3, 4, 5, 6, 10 and 12 minute intervals, and then hour and day interval aggregates are computed.
I use PostgreSQL, and these aggregations are computed with a statement like:
"INSERT INTO aggre_table_ ... SELECT sum(...)/count(*) ... WHERE timestamp > ... and timestamp < ..."
Is there any better way to do this, or what is the best way to aggregate data for later display in charts?
Is it better to do this with a trigger or by calling stored procedures?
Is there any DB support for aggregation for D3.js, Highcharts or Google Charts?
How to aggregate your data is a large topic that is independent of your technology choices. It depends largely on how sensitive the data is, what the important indicators of the data are, what the implications of those indicators are, etc.
Is a single out of range point significant? Or are you looking for the overall trend? These are big questions with answers that aren't always easy.
My general suggestion (a small Python/pandas sketch of the hourly aggregation follows the list):
to display a week's worth of data, aggregate to hourly averages
provide a range around that line indicating the distribution of points around each average
if something significant happened within that aggregated point, indicate it with a separate marker
provide drill-down capability for each aggregated point to see the full detail charted, if that level of detail is important (chances are, it's not)
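As an illustration of the hourly-average idea, here is a sketch in Python/pandas (the connection string, table and column names are all hypothetical; the same aggregation could of course be kept inside PostgreSQL instead):

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical PostgreSQL connection and table layout.
    engine = create_engine("postgresql+psycopg2://user:password@localhost/metrics")
    raw = (pd.read_sql("SELECT timestamp, value FROM readings", engine,
                       parse_dates=["timestamp"])
             .set_index("timestamp")
             .sort_index())

    # One row per hour: 7 days of one-per-second readings collapse from
    # 604800 points to 168. The mean is the line to plot; min/max give the
    # band around each average mentioned above.
    hourly = (raw.loc["2024-01-01":"2024-01-07", "value"]
                 .resample("1h")
                 .agg(["mean", "min", "max"]))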
In Highcharts (Highstock, in fact), dataGrouping is used for this kind of approximation (see the demo).
Also, you can find more about Highstock here.

What is a better approach for storing and querying a big dataset of meteorological data

I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is in the middle of the question.
Previously I was looking in the direction of MongoDB (I was using it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about HDF5 data format. Reading about it, I found some similarities with Mongo:
HDF5 simplifies the file structure to include only two major types of object: Datasets, which are multidimensional arrays of a homogeneous type, and Groups, which are container structures which can hold datasets and other groups. This results in a truly hierarchical, filesystem-like data format. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets.
This looks like arrays and embedded objects in Mongo, and HDF5 also supports indices for querying the data.
Because it uses B-trees to index table objects, HDF5 works well for
time series data such as stock price series, network monitoring data,
and 3D meteorological data.
The data:
A specific region is divided into smaller squares. At each intersection a sensor is located (a dot).
This sensor collects the following information every X minutes:
solar luminosity
wind direction and speed
humidity
and so on (this information is mostly the same, sometimes a sensor does not collect all the information)
It also collects this at different heights (0 m, 10 m, 25 m). The heights will not always be the same. Each sensor also has some sort of metainformation:
name
lat, lng
is it in water, and many others
Given this, I do not expect the size of one element to be bigger than 1 MB.
Also, I have enough storage in one place to save all the data (so, as far as I understand, no sharding is required).
Operations with the data.
There are several ways I am going to interact with a data:
convert and store a big amount of it: a few TB of data will be given to me at some point in NetCDF format and I will need to store it (it is relatively easy to convert to HDF5). Then, periodically, smaller parts of data (1 GB per week) will be provided and I will have to add them to the storage. Just to highlight: I have enough storage to save all this data on one machine.
query the data. Often there is a need to query the data in real time. The most frequent queries are: tell me the temperature of sensors in a specific region for a specific time, show me the data from a specific sensor for a specific time, show me the wind for some region for a given time range. Aggregated queries (what is the average temperature over the last two months) are highly unlikely. Here I think that Mongo is nicely suitable, but hdf5+pytables is an alternative.
perform some statistical analysis. Currently I do not know exactly what it would be, but I know that it does not need to be in real time. So I was thinking that using Hadoop with Mongo might be a nice idea, but HDF5 with R is a reasonable alternative.
I know that the questions about better approach are not encouraged, but I am looking for an advice of experienced users. If you have any questions, I would be glad to answer them and will appreciate your help.
P.S. I reviewed some interesting discussions similar to mine: hdf-forum, searching in hdf5, storing meteorological data
It's a difficult question and I am not sure if I can give a definite answer but I have experience with both HDF5/pyTables and some NoSQL databases.
Here are some thoughts.
HDF5 per se has no notion of an index. It is only a hierarchical storage format that is well suited to multidimensional numeric data. It is possible to build an index on top of HDF5 (e.g. PyTables, HDF5 FastQuery).
HDF5 (unless you are using the MPI version) does not support concurrent write access (concurrent read access is possible).
HDF5 supports compression filters which can, contrary to popular belief, actually make data access faster (however, you have to think about a proper chunk size, which depends on the way you access the data).
HDF5 is not a database. MongoDB has ACID properties, HDF5 doesn't (this might be important).
There is a package (SciHadoop) that combines Hadoop and HDF5.
HDF5 makes it relatively easy to do out-of-core computation (i.e. when the data is too big to fit into memory).
PyTables supports some fast "in kernel" computations directly on HDF5 data using numexpr (see the sketch below).
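For illustration, a minimal PyTables sketch of such an in-kernel query; the file, node and column names are hypothetical:

    import tables  # PyTables

    with tables.open_file("sensors.h5", mode="r") as h5:
        readings = h5.root.region_a.readings  # hypothetical Table node
        # The condition is evaluated "in kernel" by numexpr inside PyTables,
        # so matching rows are filtered without loading the whole table.
        temps = [row["temperature"]
                 for row in readings.where(
                     "(timestamp >= t0) & (timestamp < t1) & (height == 10)",
                     condvars={"t0": 1_600_000_000, "t1": 1_600_086_400})]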
I think your data generally is a good fit for storing in HDF5. You can also do statistical analysis either in R or via Numpy/Scipy.
But you can also think about a hybrid approach: store the raw bulk data in HDF5 and use MongoDB for the metadata or for caching specific values that are often used (a small sketch of this split follows).
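A rough sketch of what that split could look like, assuming pymongo and h5py (all database, collection, file and path names are hypothetical):

    import h5py
    from pymongo import MongoClient

    # Per-sensor metadata lives in MongoDB, where it is easy to query and update.
    sensors = MongoClient().weather.sensors
    sensors.insert_one({
        "name": "S-042",
        "lat": 52.37, "lng": 4.90,
        "in_water": False,
        "hdf5_path": "/region_a/S-042/temperature",  # pointer into the bulk store
    })

    # The bulk numeric readings live in HDF5 as chunked, compressed arrays.
    meta = sensors.find_one({"name": "S-042"})
    with h5py.File("readings.h5", "r") as f:
        temperature = f[meta["hdf5_path"]][:]  # load this sensor's series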
You can try SciDB if loading NetCDF/HDF5 into this array database is not a problem for you. Note that if your dataset is extremely large, the data loading phase will be very time consuming. I'm afraid this is a problem for all the databases. Anyway, SciDB also provides an R package, which should be able to support the analysis you need.
Alternatively, if you want to perform queries without transforming HDF5 into something else, you can use the product here: http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf
Moreover, if you want to perform a selection query efficiently, you should use an index; if you want to perform an aggregation query in real time (in seconds), you can consider approximate aggregation. Our group has developed some products to support those functions.
In terms of statistical analysis, I think the answer depends on the complexity of your analysis. If all you need is to compute something like entropy or a correlation coefficient, we have products to do it in real time. If the analysis is very complex and ad hoc, you may consider SciHadoop or SciMATE, which can process scientific data in the MapReduce framework. However, I am not sure whether SciHadoop currently supports HDF5 directly.

Reducing Large Datasets with MongoDB and D3

I'm working on a D3 visualization and luckily have made some progress. However, I've run into an issue... and to be honest, I'm not sure if it's a MongoDB issue or a D3 issue. You see, I'm trying to make a series of graphs from a set of sensor points (my JSON objects contain timestamps, light, temperature, humidity, and motion detection levels for each datapoint). However, my sensors are uploading data to my MongoDB database every 8 seconds. So, if I query the database for just one day's worth of data, I get 10,800 datapoints. Worse, if I were to ask for one month of data, I'd be swamped with 324,000 datapoints.
My issue is that my D3 visualization slows to a crawl when dealing with more than about 1000 points (I'm visualizing the data on four different graphs, each of which uses a single brush to select a certain domain on the graph). Is there a way to limit the amount of data I'm trying to visualize? Is this better done using MongoDB (basically filtering the data I'm querying and only getting every nth data point, based on how big a time range I'm trying to query)? Or is there a better way? Should I try to filter the data using D3 once I've retrieved the entire dataset? What is the best way to go about reducing the number of points I need to deal with? Thanks in advance.
MongoDB is great at filtering. If you really only need a subset of the data, specify that in a find query -- this could limit the result to a subset of time, or, if you're clever, only get the data for the first minute of every hour, or similar.
Or you can literally reduce the amount of data coming out of MongoDB using the aggregation framework. This can be used to get partial sums or averages or similar (a sketch follows).
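For example, a sketch of an aggregation pipeline in pymongo that collapses a day of 8-second readings into hourly averages (the database, collection and field names are hypothetical; $dateTrunc needs MongoDB 5.0+, older servers can group on $dateToString instead):

    from datetime import datetime
    from pymongo import MongoClient

    readings = MongoClient().sensors.readings

    pipeline = [
        {"$match": {"timestamp": {"$gte": datetime(2024, 1, 1),
                                  "$lt": datetime(2024, 1, 2)}}},
        # Truncate each timestamp to the hour and average within the bucket,
        # so ~10,800 raw points per day become 24 per series.
        {"$group": {
            "_id": {"$dateTrunc": {"date": "$timestamp", "unit": "hour"}},
            "light": {"$avg": "$light"},
            "temperature": {"$avg": "$temperature"},
            "humidity": {"$avg": "$humidity"},
            "motion": {"$avg": "$motion"},
        }},
        {"$sort": {"_id": 1}},
    ]
    hourly = list(readings.aggregate(pipeline))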

Any easy performance data exploration programs?

I'm trying to optimize some software, so I generated a large volume of real world performance measurements - nothing fancy, just a few numbers describing the case plus time in milliseconds.
I did some basic analysis on it - mostly dividing data into buckets in various ways and calculating bucket averages - and it was quite helpful in giving me a general idea, but it seems these relationships are more complex than I expected.
In the meantime I'll keep throwing various formulas at the data, but perhaps there is a tool I could use to explore such data visually and look for patterns that way? Any recommendations?
If you are willing to spend some money, Tableau and Spotfire are good at visualizing data of practically any kind.
I like Excel for this sort of raw performance data analysis. Dump your raw data into a .csv file, load it up in Excel, and from there you can group and graph the data however you want. Once the data is graphed, discernible patterns will often emerge.