How to run Apache Ignite SQL queries over a REST API

I am trying to access Apache Ignite over the HTTP REST API. I see that the API mostly provides the ability to request data by a specific key (meaning you should always know/have the key to query data).
However, I would like to understand:
1) whether we have the ability to query a set of records filtered on one or more of the value fields of the POJO value;
2) whether we can run join-like SQL queries through the REST API when my data is spread across more than one cache and some fields in them have common values that create the relation among the caches.

Take a look at the REST API documentation for the list of available commands - there are commands that allow you to execute SQL.
Also take a look at the Ignite SQL documentation for the syntax reference and some examples.
Finally, please see the Ignite SQL examples - you can find them in the Ignite distribution or in the git repository. E.g. SqlDmlExample should give a notion of how to execute various SQL queries on an Ignite cache.
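For example, a SQL fields query goes through the qryfldexe REST command. Below is a minimal sketch in Python; the host/port and the PersonCache/CompanyCache names are assumptions for illustration, not something taken from your setup:

# A minimal sketch of running a SQL fields query (including a cross-cache join)
# through Ignite's HTTP REST API.
import requests

IGNITE_REST = "http://localhost:8080/ignite"   # endpoint exposed by the ignite-rest-http module

params = {
    "cmd": "qryfldexe",             # SQL fields query execute
    "cacheName": "PersonCache",     # hypothetical cache backing the Person table
    "pageSize": 10,
    # Filter on value fields and join across caches, just like regular Ignite SQL:
    "qry": 'SELECT p.name, c.name FROM Person p '
           'JOIN "CompanyCache".Company c ON p.companyId = c.id '
           'WHERE p.salary > ?',
    "arg1": 50000,                  # positional argument bound to the ? placeholder
}

resp = requests.get(IGNITE_REST, params=params)
resp.raise_for_status()
payload = resp.json()
print(payload["response"]["items"])   # first page of rows; fetch the rest with cmd=qryfetch and the returned queryId

That covers both of your points: 1) filtering on value fields and 2) joining data that lives in more than one cache.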

Related

How to combine data from PostgreSQL and dynamic JSON in Grafana

I have a Grafana dashboard where I want to use an Orchestra Cities map dashboard to show the status of some stations. The status is available as JSON from an HTTP server (using Nagios for this part), but the status has no idea of the location of the stations. That I have in a PostGIS database.
I know I can set up a script that reads the status JSON and inserts the data into a table in the PostGIS database. This could run every five minutes or so. It feels a bit kludgy, though, so I wonder if there are other ways of doing this.
Could it be possible to use a foreign data wrapper to fetch the JSON into PostGIS? The only JSON FDW I have found reads a set of files; I would need to read from an HTTP server.
If not, is it possible to combine data from JSON and Postgres in one dataset in Grafana? I can read in data from both sources and present them e.g. as time series in one panel, but here I need to be able to join the two, so that I can use some of the attributes from the JSON to categorize the points from PostGIS (or the other way around if that is easier).
In theory you can do that in Grafana. You need to have 2 queries with results from both sources (how to write the queries and configure the datasources is not in the scope of this question), plus you need a key that can be used for a join in both results (e.g. city_id).
Then you can use a join transformation to "join" both query results into a single dataset.
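If it helps, here is a conceptual sketch in plain Python (not Grafana internals) of what that join amounts to; the field names (city_id, status, lat, lon) are assumptions for illustration:

# Conceptual sketch of the join transformation: combine rows from the two
# query results on a shared key. Field names here are made up.
json_status = [                      # rows from the HTTP/JSON datasource
    {"city_id": 1, "status": "OK"},
    {"city_id": 2, "status": "DOWN"},
]
postgis_locations = [                # rows from the PostGIS datasource
    {"city_id": 1, "lat": 59.91, "lon": 10.75},
    {"city_id": 2, "lat": 60.39, "lon": 5.32},
]

locations_by_id = {row["city_id"]: row for row in postgis_locations}
joined = [
    {**row, **locations_by_id[row["city_id"]]}
    for row in json_status
    if row["city_id"] in locations_by_id
]
print(joined)  # each station now carries both its status and its coordinates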

Streaming PostgreSQL tables into Google BigQuery

I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason why I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data comes from a Google BigQuery database. E.g. if we have a table with 1M entries, and we want a Google Data Studio parameter to be supplied by the user, this will turn into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster and won't hit the 100K records limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, plus some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per DB, performance, total number of connections, etc.).
In case you are not looking for the data in real time, what you can do is, with Airflow for instance, have a DAG that connects to all your DBs once per day (using the KubernetesPodOperator for instance), extracts the data (from the past day) and loads it into BQ. A typical ETL process, though in this case more EL(T). You can run this process more often if you cannot wait one day for the previous day's data.
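A minimal sketch of such a daily EL(T) DAG, assuming Airflow 2 with psycopg2 and the google-cloud-bigquery client available; the connection details, the events table and its columns are placeholders, and incremental bookmarks and error handling are left out:

# Sketch of a daily DAG: extract yesterday's rows from PostgreSQL and load them into BigQuery.
from datetime import datetime

import psycopg2
from google.cloud import bigquery
from airflow import DAG
from airflow.operators.python import PythonOperator


def postgres_to_bq(ds, **_):
    # Extract the previous day's rows from the external PostgreSQL database...
    conn = psycopg2.connect(host="pg.example.com", dbname="mydb",
                            user="readonly", password="secret")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, name, created_at::text FROM events WHERE created_at::date = %s",
            (ds,),
        )
        rows = [{"id": r[0], "name": r[1], "created_at": r[2]} for r in cur.fetchall()]

    # ...and load them into BigQuery (schema autodetected for this sketch).
    client = bigquery.Client()
    client.load_table_from_json(rows, "my_project.my_dataset.events").result()


with DAG(
    dag_id="postgres_to_bigquery_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=postgres_to_bq)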
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
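A rough sketch of what such a Dataflow job could look like with the Beam Python SDK's JDBC connector (a batch-style read that you could schedule or adapt); the connection details, table name and BigQuery destination are placeholders, and the Dataflow runner options are omitted:

# Sketch of a Beam pipeline: read a PostgreSQL table over JDBC and write it to BigQuery.
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.io.gcp.bigquery import WriteToBigQuery

with beam.Pipeline() as p:
    (
        p
        | "ReadFromPostgres" >> ReadFromJdbc(
            table_name="events",
            driver_class_name="org.postgresql.Driver",
            jdbc_url="jdbc:postgresql://pg.example.com:5432/mydb",
            username="readonly",
            password="secret",
        )
        | "RowsToDicts" >> beam.Map(lambda row: row._asdict())  # NamedTuple-style rows -> dicts for the BQ sink
        | "WriteToBQ" >> WriteToBigQuery(
            table="my_project:my_dataset.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # assumes the table already exists
        )
    )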
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) if, at the same moment you write to your PostgreSQL DB, you also stream your data into BigQuery.
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.

Expose Redshift tables through REST API

I am currently contemplating on how to expose data present in Redshift tables in a meaningful and consistent way through REST API.
The way I want it to work is that the caller calls the API and then we do some kind of dynamic querying on the tables. I am worried about the latency, as the queries could range from simple to very complicated ones. Since Redshift requires connecting to the database as a client, some of the approaches we could take are:
Create a lambda function connecting to Redshift, which is invoked through API gateway
Using OData to create RESTful APIs. However, I don't think Redshift supports OData out of the box.
I am leaning towards OData since it has advanced filtering options along with pagination.
I am seeking advice: will OData be enough, and if so, how exactly does one integrate OData with Redshift?
Any other advice/approaches are welcome too.
Thanks!
Let me go over the different options:
Redshift Data API
The Redshift Data API lets you invoke queries and get their results in an asynchronous manner.
You can use the API directly from front-end, or you can put it behind API Gateway.
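A minimal sketch of calling it from Python with boto3; the cluster, database, user and SQL below are placeholders. The flow is asynchronous: submit the statement, wait for it to finish, then fetch the result set:

# Sketch of the Redshift Data API flow: execute_statement -> describe_statement -> get_statement_result.
import time
import boto3

client = boto3.client("redshift-data")

submitted = client.execute_statement(
    ClusterIdentifier="my-cluster",     # or WorkgroupName=... for Redshift Serverless
    Database="analytics",
    DbUser="readonly_user",             # or SecretArn=... to authenticate via Secrets Manager
    Sql="SELECT customer_id, total FROM orders WHERE order_date = :d",
    Parameters=[{"name": "d", "value": "2023-01-01"}],
)
statement_id = submitted["Id"]

# Poll until the statement finishes (a front-end or API Gateway integration would do this elsewhere).
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = client.get_statement_result(Id=statement_id)
    for record in result["Records"]:
        print(record)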
Lambda
If you trust your users and can set up proper authentication, you can simply invoke the Lambda directly from the front-end and pass it some SQL to run, or generate the SQL based on the parameters. You can potentially swap this with Athena using federated query. Optionally you can add API Gateway for some additional features like rate limiting and different forms of authentication. Keep in mind that both Lambda and API Gateway have limits in terms of data returned and execution time.
For long-running queries I would suggest that the Lambda, API Gateway or even the front-end itself invoke an AWS Glue Python Shell job, which will use an UNLOAD query to drop the results in S3. The front-end can then poll for when the job is done.
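A sketch of what that Lambda (behind API Gateway) could look like; the Glue job name, bucket and argument names are hypothetical, and the Glue Python Shell job itself would run the UNLOAD and write the results under the given prefix:

# Sketch of a Lambda handler that starts a Glue job for a long-running query
# and returns enough information for the front-end to poll for completion.
import json
import uuid
import boto3

glue = boto3.client("glue")

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    result_prefix = f"s3://my-results-bucket/queries/{uuid.uuid4()}/"

    run = glue.start_job_run(
        JobName="redshift-unload-job",            # hypothetical Glue Python Shell job
        Arguments={
            "--query_name": body.get("query_name", "default"),
            "--result_prefix": result_prefix,     # where the UNLOAD drops its files
        },
    )

    return {
        "statusCode": 202,   # accepted; results will appear in S3 when the job finishes
        "body": json.dumps({"job_run_id": run["JobRunId"], "results": result_prefix}),
    }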
If you only have a few types of queries, then you can build a proper REST API.
Instead of Lambda, you can also use Amazon Athena Federated Query, which you can actually query directly from the front-end.
OData Implementation
There are third-party OData implementations for Redshift. Just google it. With a front-end library that consumes OData (I used KendoUI in the past) you can potentially build a working, feature-rich front-end in days. The main concern with this option is that the tool costs may be over your budget. Of course the hours you spend building things are also a cost, but it really depends on what your actual requirements are.
So how to choose?
Depending on your requirements, I would suggest simply going through the options and selecting one based on cost, time to implement, performance, reliability and security.
How about Redshift performance?
This is the most difficult part about Redshift and on-demand queries. On Redshift you don't have indexes, data can be compressed, and the data is stored in a columnar fashion. All of these can make Redshift slower than your average relational database for a random query.
However, if you make sure that your table is sorted, the distribution style matches your queries, and your queries use the columnar storage to their advantage (not all columns are requested), then it can be faster.
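To illustrate, a sketch of those table design choices (distribution key and sort key matching the common filter/join columns, and only selecting the columns you need); the table/column names and connection details are placeholders, using AWS's redshift_connector driver:

# Sketch: a dist key and sort key chosen for the expected queries, and a query
# that only touches the columns it needs so the columnar storage pays off.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="analytics", user="admin", password="secret",
)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        customer_id BIGINT,
        order_date  DATE,
        total       DECIMAL(12, 2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)   -- co-locate rows that are joined/filtered by customer
    SORTKEY (order_date)    -- range filters on order_date can skip blocks
""")
conn.commit()

cur.execute("SELECT customer_id, total FROM orders WHERE order_date >= '2023-01-01'")
print(cur.fetchmany(10))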
Another thing to keep in mind is that Redshift doesn't handle concurrency well. I believe by default there can only be 8 concurrent queries, which you can increase, but you definitely wouldn't want to go above 20.
If your users can wait for their queries (I've seen bad queries go over 2 hours, and I'm sure you can make them take longer), then Redshift is fine. If not, you could try putting Postgres in front of Redshift using external tables and then use regular indexes in front of it to speed things up.

Is it possible to use Grafana to write query results of SQL DBs (Postgres / MySQL) into InfluxDB?

I would like to query several different DBs using Grafana, and in order to keep metrics history I would like to keep it in InfluxDB.
I know that I can write my own little process that runs the queries and sends the results to InfluxDB, but I wonder if it's possible with Grafana only?
You won't be able to use Grafana to do that. Grafana isn't really an appropriate tool for transforming/writing data. Either way, its query engine generally works with a single datasource/database at a time, rather than multiple, which is what you'd need here.
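For completeness, the "own little process" mentioned in the question can be quite small; a sketch with psycopg2 and the influxdb-client package, run from cron or a systemd timer, where the connection details, query and measurement/field names are placeholders:

# Sketch: read recent rows from PostgreSQL and write them into InfluxDB as points.
import psycopg2
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

pg = psycopg2.connect(host="pg.example.com", dbname="mydb",
                      user="readonly", password="secret")
with pg, pg.cursor() as cur:
    cur.execute("SELECT station_id, value FROM metrics WHERE ts > now() - interval '5 minutes'")
    rows = cur.fetchall()

influx = InfluxDBClient(url="http://influx.example.com:8086", token="my-token", org="my-org")
write_api = influx.write_api(write_options=SYNCHRONOUS)
points = [
    Point("station_metric").tag("station_id", str(station)).field("value", float(value))
    for station, value in rows
]
write_api.write(bucket="metrics", record=points)
influx.close()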

REST API vs Sqoop

I was trying to import data from MySQL to HDFS.
I was able to do it with Sqoop, but this can also be done by fetching the data from an API.
My question is: when should I use a REST API to load data into HDFS instead of Sqoop?
Please specify some differences, with use cases!
You could use Sqoop to pull data from MySQL into HBase, then put a REST API over HBase (on Hadoop)... It would not be much different from a REST API over MySQL.
Basically, you're comparing two different things. Hadoop is not meant to replace traditional databases or N-tier user-facing applications; it is just a more distributed, fault-tolerant place to store large amounts of data.
And you typically wouldn't use a REST API to talk to a database and then put those values into Hadoop, because that wouldn't be distributed and all database results would go through a single process.
Sqoop (SQL <=> Hadoop) is basically used for loading data from an RDBMS into HDFS.
It's a direct connection to the database, where you can even append/modify/delete data in table(s) using the sqoop eval command if privileges are not defined properly for the user accessing the DB from Sqoop.
But using a REST web services API we can fetch data from various databases (both NoSQL and RDBMS) connected internally via code.
Consider calling a getUsersData RESTful web service with the curl command: it is specifically designed only to provide user data and doesn't allow appending/modifying/updating any components of the DB, irrespective of the database type (RDBMS/NoSQL).
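For comparison, a sketch of the REST-based path: call a (hypothetical) getUsersData web service and write the response into HDFS via WebHDFS. The service URL, namenode address and paths are placeholders, and note that all the results flow through this single client process, which is the limitation mentioned above:

# Sketch: fetch user data from a read-only REST service and store it in HDFS.
import json

import requests
from hdfs import InsecureClient   # hdfscli package, talks to the WebHDFS endpoint

resp = requests.get("https://api.example.com/getUsersData", params={"since": "2023-01-01"})
resp.raise_for_status()
users = resp.json()

hdfs_client = InsecureClient("http://namenode.example.com:9870", user="etl")
with hdfs_client.write("/data/users/users.json", encoding="utf-8", overwrite=True) as writer:
    json.dump(users, writer)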