Grafana snapshots - is the needed data stored or fetched from the source?

We want to use Grafana to show measuring data. Now, our measuring setup creates a huge amount of data that is saved in files. We keep the files as-is and do post-processing on them directly with Spark ("Data Lake" approach).
We now want to create some visualization and I thought of setting up Cassandra on the cluster running Spark and HDFS (where the files are stored). There will be a service (or Spark-Streaming job) that dumps selected channels from the measuring data files to a Kafka topic and another job that puts them into Cassandra. I use this approach because we have other stream processing jobs that do on the fly calculations as well.
I now thought of writing a small REST service that makes Grafana's Simple JSON datasource usable to pull the data in and visualize it. So far so good, but as the amount of data we are collecting is huge (sometimes about 300MiB per minute) the Cassandra database should only hold the most recent few hours of data.
My question now is: if someone looks at the data, finds something interesting and creates a snapshot of a dashboard or panel (or a certain event occurs and a snapshot is taken automatically), and the original data is later deleted from Cassandra, can the snapshot still be viewed? Is the data saved with it? Or does the snapshot only save metadata, so that the data source is queried anew?

According to Grafana docs:
Dashboard snapshot
A dashboard snapshot is an instant way to share an interactive dashboard publicly. When created, we strip sensitive data like queries (metric, template and annotation) and panel links, leaving only the visible metric data and series names embedded into your dashboard. Dashboard snapshots can be accessed by anyone who has the link and can reach the URL.
So the data is saved inside the snapshot, and the snapshot no longer depends on the original data.
As far as I understand, a local snapshot is stored in the Grafana database. At your data scale, using external storage (WebDAV, etc.) for snapshots may be a better option.
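For the automatic-snapshot case from the question, a snapshot can also be created through Grafana's HTTP API (`POST /api/snapshots`). The following is only a minimal sketch that builds the request without sending it. The helper name `build_snapshot_request`, the base URL and the token are placeholders, and it assumes the dashboard model passed in already carries the panel data to embed, since as far as I understand the snapshot stores what it is given and does not query the data source later.

```python
import json
import urllib.request


def build_snapshot_request(base_url, api_token, dashboard, expires_seconds=0, name=None):
    """Build (but do not send) a POST /api/snapshots request.

    base_url, api_token and dashboard are placeholders supplied by the caller.
    expires_seconds is in seconds; 0 is assumed to mean "never expire".
    """
    payload = {"dashboard": dashboard, "expires": expires_seconds}
    if name is not None:
        payload["name"] = name
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/snapshots",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would be a matter of passing the returned object to `urllib.request.urlopen`; the response should contain the snapshot URL and a delete key.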

Related

How to use cached data or a scheduled data load in Grafana from PostgreSQL

I am using Grafana to connect to PostgreSQL and visualize data. The data set is very large, and loading it through either a view or a direct SQL query takes a long time. I want either a scheduled load into Grafana or cached data, so that the data loads faster. The data need not be live and up to date, so a once-a-day load would be fine if such an option is available. Is it possible to do this? I couldn't find a solution anywhere.

How to access gold table in delta lake for web dashboards and other?

I am using the delta lake oss version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using delta lake.
My question is, is there a well known way to access these gold table data and deliver them to a web dashboard for example?
In my understanding, you need a running spark session to query a delta table.
So one possible solution could be to write a web api, which executes these spark queries.
Also you could write the gold results in a database like postgres to access it, but that seems just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where the data is located - HDFS/S3/...), etc. Possible approaches are:
Run Spark in local mode inside your application - this may require a lot of memory, etc.
Run the Thrift JDBC/ODBC server as a separate process, and access the data via JDBC/ODBC
Read the data directly using the Delta Standalone Reader library for the JVM, or via the delta-rs library, which works with Rust/Python/Ruby
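As a rough illustration of the last option, the delta-rs Python bindings (the `deltalake` package) can read a table without a Spark session. This is a minimal sketch, not a recommendation: the helper name `load_gold_table` and the table path are hypothetical, and it assumes `deltalake` and `pandas` are installed and that the table fits in memory.

```python
def load_gold_table(table_path, columns=None):
    """Read a Delta table into a pandas DataFrame without Spark.

    table_path is a placeholder (local path or object-store URI).
    The import is done lazily because deltalake is a third-party package.
    """
    from deltalake import DeltaTable  # pip install deltalake

    table = DeltaTable(table_path)
    # columns=None reads every column; pass a list to limit what is loaded.
    return table.to_pandas(columns=columns)
```

A web API could call something like this per request or on a refresh schedule; whether that is fast enough depends on the latency and data-size requirements mentioned above.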

Data streaming to Google Cloud ML Engine

I found that Google ML Engine expects data in Cloud Storage, BigQuery, etc. Is there any way to stream data to ML Engine? For example, imagine that I need to use data from a WordPress or Drupal site to create a TensorFlow model, say a spam detector. One way is to export the whole data set as CSV and upload it to Cloud Storage using the google-cloud-php library. The problem here is that, for every minor change, we have to upload the whole data set again. Is there any better way?
By minor change, do you mean "when you get new data, you have to upload everything--the old and new data--again to GCS"? One idea is to export just the new data to GCS on some schedule, producing many CSV files over time. You can write your trainer to take a file pattern and expand it using get_matching_files/Glob, or to take multiple file paths.
You can also modify your training code to start from an old checkpoint and train over just the new data (which is in its own file) for a few steps.
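The file-pattern idea can be sketched as follows. This is only an illustration: it uses Python's standard `glob` module as a stand-in for TensorFlow's file-matching helpers, and the `find_new_files` helper and the `export-*.csv` naming scheme are made up for the example.

```python
import glob
import os


def find_new_files(pattern, already_trained):
    """Expand a file pattern and keep only files not trained on yet.

    already_trained is a set of base file names seen in earlier runs
    (how that set is persisted is left open here).
    """
    matches = sorted(glob.glob(pattern))
    return [path for path in matches if os.path.basename(path) not in already_trained]
```

The trainer would then restore the latest checkpoint and run a few steps over just the returned files.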

Adding user information to centralized logging with ELK stack

I am using ELK stack (first project) to centralize logs of a server and visualize some real-time statistics with Kibana. The logs are stored in an ES index and I have another index with user information (IP, name, demographics). I am trying to:
1. Join user information with the server logs, matching on the IPs. I want to include this information in the Kibana dashboard (e.g. to show in real time the usernames of the connected users).
2. Create new indexes with filtered and processed information (e.g. users that have visited a certain URL more than 3 times).
What is the best design to solve these problems (e.g. include the username in the Logstash stage through a filter, run scheduled jobs, ...)? If the processing task (2) gets more complex, would it be better to use MongoDB instead?
Thank you!
I recently wanted to cross-reference some log data with user data (containing IPs among other data) and just used Elasticsearch's bulk import API. This meant extracting the data from an RDBMS, converting it to JSON and outputting a flat file that adhered to the format expected by the bulk import API (basically prefixing each document with a row that describes the index and type).
That should work for an initial import; your delta could then be handled using triggers in whatever stores your user data. You might simply write the changes to a flat file and process it like the other logs. Another option might be the JDBC River.
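The flat-file format mentioned above can be sketched like this. It is a minimal example under a few assumptions: the `to_bulk_ndjson` helper, the `users` index name and the sample row are hypothetical, and the `_type` field is left out because recent Elasticsearch versions no longer use mapping types.

```python
import json


def to_bulk_ndjson(rows, index_name, id_field=None):
    """Convert dict rows into the newline-delimited bulk format.

    Each document is preceded by an action line naming the target index.
    If id_field is given, its value is used as the document _id.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name}}
        if id_field is not None:
            action["index"]["_id"] = str(row[id_field])
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    # The bulk API expects every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)
```

For example, `to_bulk_ndjson([{"ip": "10.0.0.1", "name": "alice"}], "users", id_field="ip")` yields two lines: the action line, then the document itself.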
I am also interested to know where the data is originally stored (in a DB, pushed straight from a server, ...). However, I initially used the ELK stack to pull data back from a DB server using a batch file utilizing BCP (running as a scheduled task), storing it in a flat file, monitoring the file with Logstash, and manipulating the data inside the Logstash config (grok filter). You may also consider a simple console/web application to manipulate the data before grokking with Logstash.
If possible, I would attempt to pull your data via SQL Server SPROC/BCP command and match the returned, complete message within Logstash. You can then store the information in a single index.
I hope this helps as I am by no means an expert, but I will be happy to answer more questions for you if you get a little more specific with the details of your current data storage; namely how the data is entering Logstash. RabbitMQ is another valuable tool to take a look at for your input source.

Is it possible to configure ArangoDB to make snapshots of the graph database at specific times?

As far as I know, ArangoDB uses MVCC and therefore creates revisions of nodes and edges, which are kept for an undefined period of time until the garbage collector removes them.
I would like to implement a graph database schema, and I need to keep the state of this database at specific times. This means I will configure times at which the database management system takes a snapshot of the state (e.g. every week).
So my question in short: is it possible to keep the revisions/versions of nodes/edges in ArangoDB (or maybe with a plugin), together with a timestamp of their creation?
If not, is there another graph database that is able to do this?
I think you can use the arangodump binary (see the ArangoDB client tools manual) to create a snapshot at the desired point in time.
This will save the state of the database (or just the specific collections that contain your graph data) to JSON files, which can be used for auditing or later reloading the data.
arangodump is contained in the ArangoDB distributions.
The data dumped by arangodump will not contain any creation timestamps, but if you need them you can make them part of your data by just filling a "created" attribute in each node / edge when you create it.
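A minimal sketch of that last point, independent of any particular ArangoDB driver: the `with_created` helper is hypothetical, and the ISO-8601 UTC format is just one reasonable choice for the attribute value.

```python
from datetime import datetime, timezone


def with_created(document, now=None):
    """Return a copy of a node/edge document with a "created" timestamp.

    An existing "created" value is kept; now can be injected for testing.
    """
    stamped = dict(document)
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    stamped.setdefault("created", timestamp)
    return stamped
```

The application would call this on every node and edge just before inserting it, so the timestamp ends up in the dumped JSON files as ordinary data.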
I hope this helps.