How to get column dependencies in views in Redshift or Postgres? - postgresql

I've defined some views, built on other views/tables in Redshift, and would like to get info from the system tables regarding the dependencies at the column level.
Say, for example, I have these definitions:
CREATE TABLE t1 AS (SELECT 2 as a, 4 as b, 99 as c );
CREATE VIEW v1 AS (SELECT a, b FROM t1);
CREATE VIEW v2 AS (SELECT a*b/2 as x FROM v1);
What I'd like to do is create some sort of query on the system or catalog tables that will return something like:
target_column | target_table_or_view | source_column | source_table_or_view
--------------+----------------------+---------------+---------------------
x             | v2                   | a             | v1
x             | v2                   | b             | v1
a             | v1                   | a             | t1
b             | v1                   | b             | t1
I've tried the solution given here: How to create a dependency list for an object in Redshift?. However, this query doesn't produce the "target column" column I'm looking for and I don't know how to adjust it.
Is this possible? Ideally I'd like to do this in Redshift, but if needed I can use a newer version of Postgres.

There is no dependency associated with the “target column” in PostgreSQL, so you cannot find it in the metadata.
It is the complete view (its query rewrite rule, to be exact) that has a dependency on the source table and column.
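What the catalogs can give you is the source side of each edge: which table or view columns a view's rewrite rule depends on. A sketch for PostgreSQL (the column-level entries live in pg_depend, with refobjsubid pointing at the source column; Redshift may not expose these the same way):

SELECT dv.relname AS dependent_view,
       st.relname AS source_table_or_view,
       sa.attname AS source_column
FROM pg_depend d
JOIN pg_rewrite r    ON r.oid = d.objid
JOIN pg_class dv     ON dv.oid = r.ev_class
JOIN pg_class st     ON st.oid = d.refobjid
JOIN pg_attribute sa ON sa.attrelid = d.refobjid
                    AND sa.attnum   = d.refobjsubid
WHERE d.classid    = 'pg_rewrite'::regclass
  AND d.refclassid = 'pg_class'::regclass
  AND d.deptype    = 'n'
  AND dv.relname IN ('v1', 'v2')
ORDER BY dependent_view, source_table_or_view, source_column;

For the example definitions this should list the four source pairs (v2 -> v1.a, v2 -> v1.b, v1 -> t1.a, v1 -> t1.b); the target_column stays blank because it is simply not recorded anywhere.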


Is it possible to create a graph in AGE using existing table in the database?

I have just started with Apache AGE extension. I am exploring the functionalities of graph database. Is there a way to create a graph from existing tables/schema such that the table becomes the label and the attributes become the properties for the vertex?
The create_graph('graph_name') function is used for creating graphs, but it only creates a new, empty graph.
It's not as simple as that. For a start you have to understand this.
When deriving a graph model from a relational model, keep in mind some general guidelines.
A row is a node.
A table name is a label name.
A join or foreign key is a relationship.
Using those relationships, you can model out the data; following these guidelines helps you avoid modelling errors.
Without a concrete example to work from, here is a dynamic way of creating a graph from a relational model.
First, make a PostgreSQL function that takes the row values as arguments (for example, the name and title of a Person) and creates a node.
CREATE OR REPLACE FUNCTION public.create_person(name text, title text)
RETURNS void
LANGUAGE plpgsql
VOLATILE
AS $BODY$
BEGIN
    LOAD 'age';
    SET search_path TO ag_catalog;
    -- quote_literal wraps the values in quotes so Cypher sees string literals
    -- (quote_ident would leave lower-case values unquoted, and Cypher would then
    -- read them as identifiers rather than strings; values containing quotes may
    -- still need extra escaping)
    EXECUTE format('SELECT * FROM cypher(''graph_name'', $$CREATE (:Person {name: %s, title: %s})$$) AS (a agtype);',
                   quote_literal(name), quote_literal(title));
END
$BODY$;
Second, use the function like so:
SELECT public.create_person(sql_person.name, sql_person.title)
FROM sql_schema.Person AS sql_person;
You'll have created a node for every row in SQL_SCHEMA.Person
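Following the third guideline above, a foreign key between two rows can then be turned into an edge between the corresponding nodes. A hedged sketch, assuming two Person nodes named 'Alice' and 'Bob' already exist (hypothetical values) and that the relational model had a manager-style foreign key:

LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Turn a "reports to" foreign key into an edge between two existing Person nodes
SELECT * FROM cypher('graph_name', $$
    MATCH (p:Person {name: 'Alice'}), (m:Person {name: 'Bob'})
    CREATE (p)-[:REPORTS_TO]->(m)
$$) AS (a agtype);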
To export data from a PGSQL table to an AGE graph, you can try exporting a CSV file. For example, if you have the following table called employees:
SELECT * from employees;
 id |          name          | manager_id |   title
----+------------------------+------------+------------
  1 | Gabriel Garcia Marquez |            | Boss
  2 | Dostoevsky             |          1 | Director
  3 | Victor Hugo            |          1 | Manager
  4 | Albert Camus           |          2 | Engineer
  5 | Haruki Murakami        |          3 | Analyst
  6 | Virginia Woolf         |          1 | Consultant
  7 | Liu Cixin              |          2 | Manager
  8 | Franz Kafka            |          4 | Intern
  9 | Daphne Du Maurier      |          7 | Engineer
First export a CSV using the following command:
\copy (SELECT * FROM employees) to '/home/username/employees.csv' with csv header
Now you can import this into AGE. Remember that for a graph database, the name of the table is the name of the vertex label. The columns of the table are the properties of the vertex.
First make sure you create a label for your graph. In this case, the label name will be 'employees', the same as the table name.
SELECT create_vlabel('graph_name','employees');
Now we load all the nodes of this label (each row from the original table is one node in the graph).
SELECT load_labels_from_file('graph_name','employees','/home/username/employees.csv');
Now your graph should have all the table data of the employees table.
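As a quick sanity check (a sketch, assuming the graph is called graph_name as above), you can count the loaded vertices with a Cypher query:

LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Count the employee vertices that were just loaded from the CSV
SELECT * FROM cypher('graph_name', $$
    MATCH (e:employees)
    RETURN count(e)
$$) AS (employee_count agtype);

With the sample table above this should return 9.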
More information can be found on the documentation:
https://age.apache.org/age-manual/master/intro/agload.html
I don't think it's possible to create a graph directly from existing tables, because when you create a graph the graph name becomes a schema name and the vertex and edge label names become table names. Create a sample graph and then run the command below to see which schemas and table names are present in PostgreSQL.
SELECT * FROM pg_catalog.pg_tables
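For example, after SELECT create_graph('graph_name') you should see a schema named graph_name with one table per vertex or edge label inside it (a sketch):

-- List only the tables AGE created for the graph named 'graph_name'
SELECT schemaname, tablename
FROM pg_catalog.pg_tables
WHERE schemaname = 'graph_name';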

Cloud Data Fusion ETL from Postgres to BigQuery - idempotent load

I'm trying to use Google's Cloud Data Fusion (CDF) to perform an ETL of some OLTP data from Postgres into BigQuery (BQ). We will copy the contents of a few select tables into an equivalent table in BQ every night - we will add one column with the datestamp.
So imagine we have a table with two columns A & B, and one row of data like this in Postgres
|-----|------|
| A   | B    |
|-----|------|
| egg | milk |
|-----|------|
Then over two days, the BigQuery table would look like this
|----------|-----|------|
| ds       | A   | B    |
|----------|-----|------|
| 22-01-01 | egg | milk |
|----------|-----|------|
| 22-01-02 | egg | milk |
|----------|-----|------|
However, I'm worried that the way I am doing this in CDF is not idempotent, and if the pipeline runs twice I'll have duplicate data for a given day in BQ (not desired).
One idea is to delete the rows for that day in BQ before doing the ETL (as part of the same pipeline). However, I'm not sure how to do this, or whether it is best practice. Any ideas?
You could delete the data in a BigQuery action at the start of the pipeline, though that runs into other issues if people are actively querying the table, or if the delete action succeeds but the rest of the pipeline fails.
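As a sketch of that first approach (project, dataset, table and the ds column are placeholders, and ${logicalStartTime(yyyy-MM-dd)} is a Data Fusion runtime macro you may need to adapt), the BigQuery action at the start of the pipeline could run something like:

-- Pre-load cleanup: drop any rows already written for the run date
DELETE FROM `my_project.my_dataset.my_table`
WHERE ds = DATE '${logicalStartTime(yyyy-MM-dd)}';

This keeps re-runs from duplicating a day, but as noted, if the delete succeeds and the load then fails, that day's data is missing until the pipeline is re-run.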
The BigQuery sink allows you to configure it to upsert data instead of inserting. This should make it idempotent as long as your data has a key that can be used.
Some other possibilities are to place a BigQuery execute after the sink that runs a BigQuery MERGE, or to write a custom Condition plugin that queries BigQuery and only runs the rest of the pipeline if data for the date does not already exist.
You can use one of these 2 options, depending on what you want to do with the information:
Option 1
You can create a blank new_table with the same schema (ds, A, B), while Data Fusion keeps inserting into old_table. The MERGE statement then compares old_table against new_table: rows that do not yet exist in new_table are inserted, and rows that already exist but have different values are updated.
MERGE merge_example.new_table T
USING dataset.old_table S
ON T.ds = S.ds
WHEN MATCHED THEN
  UPDATE SET T.A = S.A, T.B = S.B
WHEN NOT MATCHED THEN
  INSERT (ds, A, B) VALUES (ds, A, B)
Option 2
It is the same process as Option 1, but this query only inserts the data that does not yet exist in new_table.
insert into `dataset.new_table`
select ds, A, B from `dataset.old_table`
where ds not in (select ds from `dataset.new_table`)
The difference between Option 1 and Option 2 is that Option 1 updates existing rows whose values differ in new_table and inserts the new rows, while Option 2 only inserts the new rows.
You can execute these queries once a day with a scheduled query; see the BigQuery scheduled queries documentation for details.

Publishing table back to tickerplant

I am trying to publish a table straight from a real-time engine. Basically I have a real-time engine that connects to the tickerplant, subscribes to a raw version of a table and adds some new columns. Now I want this enhanced version of the table to be pushed back to the tickerplant. I have a pub function which pushes the table in the following way:
neg[handle](`.u.upd;`tablename;tabledata)
The problem is that I get a type error. I looked at the schemas of the two tables and they are slightly different.
meta table1
c   | t f a
----| -----
time| p
sym | s
col1| c
col2| s
col3| i
meta table2
c   | t f a
----| -----
time| p
sym | s
col1| C
col2| s
col3| i
That capital C most likely is the problem. However, I cannot load the schema in the tickerplant with capital letters. Any idea how I should go about this?
You can define the schema with a generic list type and it will take its type from the first insert.
tab:([] col1:`int$();generic:();col3:`$())
Another issue is that your tickerplant might be expecting a list (of lists) to be sent to its .u.upd rather than the table you may be sending to it, so you may want to value flip your table before sending it. (And note that the tickerplant would try to prepend a timestamp if the first column isn't already a timestamp)
The capital C in your meta table is the result of the incoming data being nested. To resolve this you should declare the schema with an untyped empty list.
table2:([] time:`timestamp$();sym:`$();col1:();col2:`$();col3:"I"$())
Consequently, until a row is inserted its meta is:
q)meta table2
c   | t f a
----| -----
time| p
sym | s
col1|
col2| s
col3| i
This will then be updated to match the first entry into the table.
Also, .u.upd requires the input to be a list of lists rather than a table; this can be resolved using:
neg[handle](`.u.upd;`tablename;value flip tabledata)

Using filtered results as field for calculated field in Tableau

I have a table that looks like this:
+------------+-----------+---------------+
| Invoice_ID | Charge_ID | Charge_Amount |
+------------+-----------+---------------+
| 1          | A         | $10           |
| 1          | B         | $20           |
| 2          | A         | $10           |
| 2          | B         | $20           |
| 2          | C         | $30           |
| 3          | C         | $30           |
| 3          | D         | $40           |
+------------+-----------+---------------+
In Tableau, how can I have a field that SUMs the Charge_Amount for the Charge_IDs B, C and D, where the invoice has a Charge_ID of A? The result would be $70.
My datasource is SQL Server, so I was thinking that I could add a field (called Has_ChargeID_A) to the SQL Server Table that tells if the invoice has a Charge_ID of A, and then in Tableau just do a SUM of all the rows where Has_ChargeID_A is true and Charge_ID is either B, C or D. But I would prefer if I can do this directly in Tableau (not this exactly, but anything that will get me to the same result).
Your intuition is steering you in the right direction. You do want to filter to only Invoices that contain row with a Charge_ID of A, and you can do this directly in Tableau.
First place Invoice_ID on the filter shelf, then select the Condition tab for the filter. Then select the "By formula" option on the condition tab and enter the formula you wish to use to determine which invoice_ids are included by the filter.
Here is a formula for your example:
count(if Charge_ID = 'A' then 'Y' end) > 0
For each data row, it calculates the value of the expression inside the parentheses, and then only includes invoice_ids with at least one non-null value for that inner expression. (The implicit else branch of the if statement returns null.)
The condition tab for a dimension field equates to a HAVING clause in SQL.
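For intuition, the whole calculation is roughly equivalent to the following SQL (the table name charges is hypothetical; in Tableau the condition filter and the SUM do this for you):

-- Sum B/C/D charges, but only for invoices that also contain a charge A
SELECT SUM(Charge_Amount) AS total_b_c_d
FROM charges
WHERE Charge_ID IN ('B', 'C', 'D')
  AND Invoice_ID IN (
        SELECT Invoice_ID
        FROM charges
        GROUP BY Invoice_ID
        HAVING COUNT(CASE WHEN Charge_ID = 'A' THEN 'Y' END) > 0
      );

On the sample data this returns $70: invoices 1 and 2 contain charge A, so their B and C charges ($20 + $20 + $30) are summed, while invoice 3 is excluded.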
If condition formulas get complex, it's often a good idea to define them with a calculated field -- or a combination of several simpler calculated fields, just to keep things manageable.
Finally, if you end up working with sets of dimensions like this frequently, you can define them as sets. You can still drop sets on the filter shelf, but you can also reuse them in other ways: testing set membership in a calculated field (like a SQL IN clause), or creating new sets with intersection and union operators. You can think of sets as named filters, such as the set of invoices that contain a type A charge.

perform join on 2 DB

I have 2 DBs on the same server with the same user:
ckan_default
datastore_default
The relations of ckan_default are:
Schema | Name           | Type  | Owner
-------+----------------+-------+---------
public | resource       | table | ckanuser
public | resource_group | table | ckanuser
public | package        | table | ckanuser
....
The relations of datastore_default are:
Schema | Name                                 | Type  | Owner
-------+--------------------------------------+-------+---------
public | 1bc7932e-2507-467b-8c12-c9f321b760f7 | table | ckanuser
public | 449138df-e089-41f2-8939-dcee53a31bc1 | table | ckanuser
public | 7235f781-1b16-4abf-ac04-8d68fa62e432 | table | ckanuser
....
I want to JOIN the 2 DBs ON ckan_default.resource.id = datastore_default."NAME OF RELATION".
How?
I don't think you can.
You can use the dblink extension to query database B from database A, but the query will be separated from the data context of database A; this is how PostgreSQL works.
EDIT: you can populate a view from the result of a dblink query, and then use it:
CREATE VIEW myremote_pg_proc AS
SELECT *
FROM dblink('dbname=postgres', 'select proname, prosrc from pg_proc')
AS t1(proname name, prosrc text);
SELECT * FROM myremote_pg_proc WHERE proname LIKE 'bytea%';
Examples are in the link I posted.
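Applied to the databases in the question, a minimal sketch could look like this (the remote column list _id integer is an assumption about the datastore table and has to match its real definition):

-- Run this in ckan_default; requires the dblink extension (CREATE EXTENSION dblink;)
SELECT r.id, remote.*
FROM resource AS r
CROSS JOIN dblink(
       'dbname=datastore_default',
       'SELECT _id FROM "1bc7932e-2507-467b-8c12-c9f321b760f7"'
     ) AS remote(_id integer)
WHERE r.id = '1bc7932e-2507-467b-8c12-c9f321b760f7';

Since the "join key" in the question is the name of the remote relation rather than a column, the relation name is simply interpolated into the dblink query text.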
PL/Proxy is another option, similar to dblink. I have used it in the past to talk between servers, where my use case was a poor man's distributed database cluster. The data on the other servers was pulled in for certain large reports and it worked pretty well. The servers were all in the same colocation though, so if the other databases are geographically spread out then you are going to pay an additional penalty for network latency and data transfer times.