Just installed graphite using postgres for storage and sending data to graphite using statsd. Works fine!
My issue is I created a bunch of series (mostly gauges) that were just for testing and I want them gone but see no way to delete them. I have no whisper files to delete since I am using postgres.
In looking at the tables in postgres for the graphite database I see nothing that contains the series. I see my custom graphs and my user but nowhere in the graphite database can I find my testing series to blow away.
Any pointers? Are the series not kept in the postgres DB?
Graphite only uses PostgreSQL/MySQL/SQLite for storing user profiles, saved graphs & dashboards, and events (annotation-style data). Time-series metrics are stored in the native Whisper files. In most cases these files will exist under /opt/graphite/storage/whisper/.
Say you accidentally sent a metric named foo.bar.baz. Its Whisper file will exist at /opt/graphite/storage/whisper/foo/bar/baz.wsp and can be deleted from the command line with sudo rm /opt/graphite/storage/whisper/foo/bar/baz.wsp.
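If there are many test series, deleting them file by file gets tedious. Here is a small sketch for removing a whole metric namespace at once; it assumes the default /opt/graphite/storage/whisper layout, and the "foo.bar" prefix is just an example:

```python
import shutil
from pathlib import Path

WHISPER_ROOT = Path("/opt/graphite/storage/whisper")

def delete_metric_tree(prefix: str, root: Path = WHISPER_ROOT) -> list:
    """Delete the .wsp file or whole subtree for a dotted metric prefix.

    Returns the paths of the .wsp files that were removed.
    """
    target = root.joinpath(*prefix.split("."))
    removed = []
    leaf = target.with_suffix(".wsp")
    if leaf.is_file():          # the prefix names a single metric
        removed.append(str(leaf))
        leaf.unlink()
    if target.is_dir():         # the prefix names a whole namespace
        removed.extend(str(p) for p in target.rglob("*.wsp"))
        shutil.rmtree(target)
    return removed
```

Run it once per test namespace (e.g. `delete_metric_tree("foo")`), with Carbon stopped or immediately after, so the files are not recreated by in-flight metrics.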
I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options are there for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it in a one time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason I want this streamed into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data comes from a Google BigQuery database. E.g. if we have a table with 1M entries and we want a Google Data Studio parameter to be supplied by the user, this turns into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster and won't hit the 100K-record limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options described in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, plus some coding. There isn't a feature that automatically syncs your PostgreSQL database with BigQuery, apart from the EXTERNAL_QUERY() functionality, which has some limitations (one connection per database, performance overhead, a cap on the total number of connections, etc.).
If you do not need the data in real time, you could, for instance, have an Airflow DAG that connects to all your DBs once per day (using the KubernetesPodOperator, say), extracts the previous day's data, and loads it into BQ. A typical ETL process, though in this case more EL(T). You can run this process more often if you cannot wait a day for the previous day's data.
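The extract-and-load step inside such a daily task could be sketched like this. Everything here is illustrative: the table and column names and the my_dataset BigQuery dataset are made up, and it assumes a DB-API source connection plus the google-cloud-bigquery client:

```python
from datetime import date, timedelta

def past_day_query(table: str, ts_column: str, day: date) -> str:
    """Build the query that extracts one full day of rows."""
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    return (
        f"SELECT * FROM {table} "
        f"WHERE {ts_column} >= '{start}' AND {ts_column} < '{end}'"
    )

def extract_and_load(src_conn, bq_client, table: str, ts_column: str, day: date):
    """EL step: pull one day of rows from PostgreSQL, stream them into BQ."""
    cur = src_conn.cursor()
    cur.execute(past_day_query(table, ts_column, day))
    cols = [c[0] for c in cur.description]
    rows = [dict(zip(cols, r)) for r in cur.fetchall()]
    # insert_rows_json appends to an existing BigQuery table
    errors = bq_client.insert_rows_json(f"my_dataset.{table}", rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

For large tables you would page through the cursor instead of calling fetchall(), but the shape of the task is the same.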
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) to stream each row into BigQuery at the same moment you write it to your PostgreSQL DB.
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.
I have a Postgres database deployed in Kubernetes, attached to a PVC [with RWX access mode]. What is the right way to update the database (e.g. create a table) through my CI/CD pipeline instead of logging in to the pod and running queries [without deleting the PVC]?
My understanding is that the underlying question is how to deploy DB structure changes to production with minimal downtime. For that I'd go with Blue-Green Deployments [1].
After your comments I assume that you already have a running instance of PostgreSQL and would like to modify the content of the DB by altering the file structure directly on "disk" (in this case the PVC).
Modifying data structures directly on disk is not a good idea as far as data integrity is concerned.
The reasons are explained in this article [2], which describes exactly how PostgreSQL stores data on disk.
PostgreSQL (by default) writes blocks of data (what PostgreSQL calls pages) to disk in 8k chunks.
Additionally, there is a mapping between tables and file paths, so PostgreSQL knows exactly which file stores which table:
SELECT pg_relation_filepath('test_data');
pg_relation_filepath
----------------------
base/20886/186770
In this example the file base/20886/186770, relative to the data directory, contains the actual data for the table test_data.
What is the right way to update the database instead of logging in to the pod and running queries
However, if you are sure that you have a complete set of files for the DB to operate on (like the ones you work with during pg_dump / pg_restore), you can try placing that data on another PVC and recreating the pod; note that this will still result in downtime.
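Instead of touching files on the PVC at all, the usual CI/CD approach is to run plain SQL migrations over a normal connection to the Postgres Service. A minimal sketch, demonstrated with sqlite3 so it is self-contained; in the cluster you would pass a psycopg2 connection instead, and this simple version assumes one statement per .sql file:

```python
import sqlite3
from pathlib import Path

def apply_migrations(conn, migrations_dir: Path) -> list:
    """Apply every *.sql file in lexical order; each file is one statement."""
    applied = []
    for path in sorted(migrations_dir.glob("*.sql")):
        cur = conn.cursor()
        cur.execute(path.read_text())   # works for sqlite3 and psycopg2 cursors
        conn.commit()
        applied.append(path.name)
    return applied
```

Your pipeline can run this as a Job or one-off container against the database Service, so no pod login and no PVC changes are needed. Real migration tools (Flyway, Alembic, etc.) add a bookkeeping table so each file runs exactly once.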
Hope that helps.
We want to use Grafana to show measuring data. Now, our measuring setup creates a huge amount of data that is saved in files. We keep the files as-is and do post-processing on them directly with Spark ("Data Lake" approach).
We now want to create some visualization and I thought of setting up Cassandra on the cluster running Spark and HDFS (where the files are stored). There will be a service (or Spark-Streaming job) that dumps selected channels from the measuring data files to a Kafka topic and another job that puts them into Cassandra. I use this approach because we have other stream processing jobs that do on the fly calculations as well.
I now thought of writing a small REST service that makes Grafana's Simple JSON datasource usable to pull the data in and visualize it. So far so good, but as the amount of data we are collecting is huge (sometimes about 300MiB per minute) the Cassandra database should only hold the most recent few hours of data.
My question now is: if someone looks at the data, finds something interesting, and creates a snapshot of a dashboard or panel (or a certain event occurs and a snapshot is taken automatically), and the original data is later deleted from Cassandra, can the snapshot still be viewed? Is the data saved with it? Or does the snapshot only save metadata, with the data source queried anew?
According to Grafana docs:
Dashboard snapshot
A dashboard snapshot is an instant way to share an interactive dashboard publicly. When created, we strip sensitive data like queries (metric, template and annotation) and panel links, leaving only the visible metric data and series names embedded into your dashboard. Dashboard snapshots can be accessed by anyone who has the link and can reach the URL.
So, the data is saved inside the snapshot and no longer depends on the original data source.
As far as I understand, local snapshots are stored in the Grafana DB. At your data scale, using external storage (WebDAV, etc.) for snapshots may be the better option.
I connected my SonarQube server to my Postgres DB; however, when I view the "metrics" table, it lacks the actual values of the metrics.
Those are all the columns I get, which are not particularly helpful. How can I get the actual values of the metrics?
My end goal is to obtain metrics such as duplicate code, function size, complexity etc. on my projects. I understand I could also use the REST api to do this however another application I am using will need a db to extract data from.
As far as I know, connecting SonarQube to a DB just lets it store its data; the DB is not meant to be read for display.
You can check the stored data in SonarQube's GUI:
Click on a project
Click on Activity
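For the other application, the Web API route the question mentions is usually a cleaner source than the raw tables. A sketch using only the standard library; the server URL, project key, and token are placeholders:

```python
import base64
import json
import urllib.parse
import urllib.request

def measures_url(server: str, project_key: str, metric_keys: list) -> str:
    """Build the URL for SonarQube's /api/measures/component endpoint."""
    query = urllib.parse.urlencode(
        {"component": project_key, "metricKeys": ",".join(metric_keys)}
    )
    return f"{server}/api/measures/component?{query}"

def fetch_measures(server, project_key, metric_keys, token):
    req = urllib.request.Request(measures_url(server, project_key, metric_keys))
    # SonarQube accepts a user token as the basic-auth username, empty password
    cred = base64.b64encode(f"{token}:".encode()).decode()
    req.add_header("Authorization", f"Basic {cred}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["component"]["measures"]
```

For example, fetch_measures("http://localhost:9000", "my_project", ["complexity", "duplicated_lines_density", "functions"], token) covers the complexity, duplication, and function-count measures the question asks about.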
I have deployed a Grafana monitoring system; it saves its databases in the "/home/user/data" directory. The problem is that the data persists forever, so the filesystem is at full usage, and I would like to remove this data, for example weekly.
You do not say what data you would like to remove or what is generating all the data (is it logs?). It seems strange to just delete data from your database; will your users not miss their dashboards?
I am going to assume you want to remove data from the database. There are a few ways to do this.
Save a clean copy of the Sqlite database and then replace the database file once a week. This will lose all your data.
For most data saved in the database, you could use the Grafana API to remove data.
An example, would be to remove dashboards. With curl and basic auth:
curl -X DELETE http://admin:admin@localhost:3000/api/dashboards/db/testdash
Use the sqlite cli to write sql queries to delete data directly.
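For that last option, here is a sketch of deleting rows older than a cutoff directly from grafana.db. The table and column names in the example call (dashboard_snapshot.created) are assumptions; inspect your own file with the sqlite3 CLI's .schema first, stop Grafana, and back the file up before editing:

```python
import sqlite3
from datetime import datetime, timedelta

def purge_older_than(db_path: str, table: str, ts_column: str, days: int) -> int:
    """Delete rows whose timestamp column is older than `days` days.

    Returns the number of rows removed.
    """
    cutoff = (datetime.utcnow() - timedelta(days=days)).isoformat(" ")
    conn = sqlite3.connect(db_path)
    with conn:   # commit on success, roll back on error
        cur = conn.execute(
            f"DELETE FROM {table} WHERE {ts_column} < ?", (cutoff,)
        )
    conn.close()
    return cur.rowcount

# e.g. purge_older_than("/home/user/data/grafana.db",
#                       "dashboard_snapshot", "created", 7)
```

Scheduled weekly via cron, this keeps the database bounded without replacing the whole file.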