Is it possible to create a graph in AGE using existing table in the database? - postgresql

I have just started with Apache AGE extension. I am exploring the functionalities of graph database. Is there a way to create a graph from existing tables/schema such that the table becomes the label and the attributes become the properties for the vertex?
The create_graph('graph name') is used for creating graphs but I can only create a new graph using this function.

It's not as simple as that. For a start you have to understand this.
When deriving a graph model from a relational model, keep in mind some general guidelines.
A row is a node.
A table name is a label name.
A join or foreign key is a relationship.
Using those relationships, you can model out the data. This is if you need to ensure no errors.
Without an example here is the Dynamic way of creating Graph from Relational model.
1st make a PostgreSQL function that takes in the arguments. Example, name and title of Person. It will create a node.
CREATE OR REPLACE FUNCTION public.create_person(name text, title text)
RETURNS void
LANGUAGE plpgsql
VOLATILE
AS $BODY$
BEGIN
load 'age';
SET search_path TO ag_catalog;
EXECUTE format('SELECT * FROM cypher(''graph_name'', $$CREATE (:Person {name: %s, title: %s})$$) AS (a agtype);', quote_ident(name), quote_ident(title));
END
$BODY$;
2nd use the function like so,
SELECT public.create_person(sql_person.name, sql_person.title)
FROM sql_schema.Person AS sql_person;
You'll have created a node for every row in SQL_SCHEMA.Person

To export data from a PGSQL table to an AGE graph, you can try exporting a CSV file. For example, if you have the following table called employees:
SELECT * from employees;
id | name | manager_id | title
----+------------------------+------------+------------
1 | Gabriel Garcia Marquez | | Boss
2 | Dostoevsky | 1 | Director
3 | Victor Hugo | 1 | Manager
4 | Albert Camus | 2 | Engineer
5 | Haruki Murakami | 3 | Analyst
6 | Virginia Woolf | 1 | Consultant
7 | Liu Cixin | 2 | Manager
8 | Franz Kafka | 4 | Intern
9 | Daphne Du Maurier | 7 | Engineer
First export a CSV using the following command:
\copy (SELECT * FROM employees) to '/home/username/employees.csv' with csv header
Now you can import this into AGE. Remember that for a graph database, the name of the table is the name of the vertex label. The columns of the table are the properties of the vertex.
First make sure you create a label for your graph. In this case, the label name will be 'employees', the same as the table name.
SELECT create_vlabel('graph_name','employees');
Now we load all the nodes of this label (each row from the original table is one node in the graph).
SELECT load_labels_from_file('graph_name','employees','/home/username/employees.csv');
Now your graph should have all the table data of the employees table.
More information can be found on the documentation:
https://age.apache.org/age-manual/master/intro/agload.html

I don't think it's possible to create a graph using existing tables. Because when we create a graph the graph name is the schema name and the label name for vertices and edges are table names. Create a sample graph and then run the below command to understand more about what schemas and table names are present in Postgresql.
SELECT * FROM pg_catalog.pg_tables

Related

Insert last characters of a value in Postgres table while inserting that same value

CONTEXT:
I'm currently building a custom import script in Python with psycopg2 that inserts values from a csv file into a postgres database. The csv however provides me with a value that needs refining.
PROBLEM: In the example below you can see what I want:
I want the 5 last digits from the 15-digit.
mytestdb=# select * from testtable;
uid | first_name | last_name | age | 15-digit | last_5_digits
-----+------------+-----------+-----+----------------------------+-----------------
1 | John | Doe | 42 | 99999999912345 | 12345
I know I could accomplish this by first inserting the supplied values (first_name, last_name, age and 15-digit) and then using RIGHT(15-digit,5) and an UPDATE statement fill the last_5_digits field.
I would however prefer to do this during the initial INSERT of the row. This would considerably lower the amount of transactions on the database.
Could anyone help me getting this done?

Versioning in the database

I want to store full versioning of the row every time a update is made for amount sensitive table.
So far, I have decided to use the following approach.
Do not allow updates.
Every time a update is made create a new
entry in the table.
However, I am undecided on what is the best database structure design for this change.
Current Structure
Primary Key: id
id(int) | amount(decimal) | other_columns
First Approach
Composite Primary Key: id, version
id(int) | version(int) | amount(decimal) | change_reason
1 | 1 | 100 |
1 | 2 | 20 | correction
Second Approach
Primary Key: id
Uniqueness Index on [origin_id, version]
id(int) | origin_id(int) | version(int) | amount(decimal) | change_reason
1 | NULL | 1 | 100 | NULL
2 | 1 | 2 | 20 | correction
I would suggest a new table which store unique id for item. This serves as lookup table for all available items.
item Table:
id(int)
1000
For the table which stores all changes for item, let's call it item_changes table. item_id is a FOREIGN KEY to item table's id. The relationship between item table to item_changes table, is one-to-many relationship.
item_changes Table:
id(int) | item_id(int) | version(int) | amount(decimal) | change_reason
1 | 1000 | 1 | 100 | NULL
2 | 1000 | 2 | 20 | correction
With this, item_id will never be NULL as it is a valid FOREIGN KEY to item table.
The best method is to use Version Normal Form (vnf). Here is an answer I gave for a neat way to track all changes to specific fields of specific tables.
The static table contains the static data, such as PK and other attributes which do not change over the life of the entity or such changes need not be tracked.
The version table contains all dynamic attributes that need to be tracked. The best design uses a view which joins the static table with the current version from the version table, as the current version is probably what your apps need most often. Triggers on the view maintain the static/versioned design without the app needing to know anything about it.
The link above also contains a link to a document which goes into much more detail including queries to get the current version or to "look back" at any version you need.
Why you are not going for SCD-2 (Slowly Changing Dimension), which is a rule/methodology to describe the best solution for your problem. Here is the SCD-2 advantage and example for using, and it makes standard design pattern for the database.
Type 2 - Creating a new additional record. In this methodology, all history of dimension changes is kept in the database. You capture attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key(or other durable identifiers). Also 'effective date' and 'current indicator' columns are used in this method. There could be only one record with the current indicator set to 'Y'. For 'effective date' columns, i.e. start_date, and end_date, the end_date for current record usually is set to value 9999-12-31. Introducing changes to the dimensional model in type 2 could be very expensive database operation so it is not recommended to use it in dimensions where a new attribute could be added in the future.
id | amount | start_date |end_date |current_flag
1 100 01-Apr-2018 02-Apr-2018 N
2 80 04-Apr-2018 NULL Y
Detail Explanation::::
Here, all you need to add the 3 extra column, START_DATE, END_DATE, CURRENT_FLAG to track your record properly. When the first time record inserted # source, this table will be store the value as:
id | amount | start_date |end_date |current_flag
1 100 01-Apr-2018 NULL Y
And, when the same record will be updated then you have to update the "END_DATE" of the previous record as current_system_date and "CURRENT_FLAG" as "N", and insert the second record as below. So you can track everything about your records. as below...
id | amount | start_date |end_date |current_flag
1 100 01-Apr-2018 02-Apr-2018 N
2 80 04-Apr-2018 NULL Y

perform join on 2 DB

I have 2 DB on the same server with the same user:
ckan_default
datastore_default
The relation of ckan_default is:
Schema | Name | Type | Owner
-------+-------------------------------+-------+----------
public | resource | table | ckanuser
public | resource_group | table | ckanuser
public | package | table | ckanuser
....
The relation of datastore_default is:
Schema | Name | Type | Owner
-------+--------------------------------------+-------+----------
public | 1bc7932e-2507-467b-8c12-c9f321b760f7 | table | ckanuser
public | 449138df-e089-41f2-8939-dcee53a31bc1 | table | ckanuser
public | 7235f781-1b16-4abf-ac04-8d68fa62e432 | table | ckanuser
....
I wont to JOIN the 2 DB ON ckan_default.resource.id = datastore_default."NAME OF RELATION".
How?
I dont think you can.
You can use dblink extension to query database B from A, but the query will be separated from the data context of database A.. this is how postgresql works.
EDIT: you can populate a view from the result of a dblink query, and then use it:
CREATE VIEW myremote_pg_proc AS
SELECT *
FROM dblink('dbname=postgres', 'select proname, prosrc from pg_proc')
AS t1(proname name, prosrc text);
SELECT * FROM myremote_pg_proc WHERE proname LIKE 'bytea%';
Examples in the link i posted.
PL/Proxy is another option, similar to dblink. I have used it in the past to talk between servers, where my use-case was a poor-man's distributed database cluster. The data on the the other servers was pulled in for certain large reports and it worked pretty well. The servers were all in the same colocation though, so if the other databases are geographically spread out then you are going to pay an additional penalty for network latency and data transfer times.

fetching a table with another tables "conditions"

I have two tables like this:
Table Name: users
emx | userid
---------------
1 | 1
2 | 2
and another table called bodies
id | emx | text
--------------------------
1 | 1 | Hello
2 | 2 | How are you?
As you can see, bodies table has emx which is id numbers of users table. Now, when i want to fetch message that contains Hello i just search it on bodies and get the emx numbers and after that i fetch users table with these emx numbers. So, i am doing 2 sql queries to find it.
So, all i want to do is make this happen in 1 SQL query.
I tried some queries which is not correct and also i tried JOIN too. No luck yet. I just want to fetch users table with message contains 'Hello' in bodies table.
Note: I am using PostgreSQL 9.1.3.
Any idea / help is appreciated.
Read docs on how to join tables.
Try this:
SELECT u.emx, u.userid, b.id, b.text
FROM bodies b
JOIN users u USING (emx)
WHERE b.text ~ 'Hello';
This is how I'd do the join. I've left out the exact containment test.
SELECT users.userid
FROM users JOIN bodies ON (users.emx = bodies.emx)
WHERE ⌜true if bodies.text contains ?⌟

How to create a PostgreSQL partitioned sequence?

Is there a simple (ie. non-hacky) and race-condition free way to create a partitioned sequence in PostgreSQL. Example:
Using a normal sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
Using a partitioned sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
I do not believe there is a simple way that is as easy as regular sequences, because:
A sequence stores only one number stream (next value, etc.). You want one for each partition.
Sequences have special handling that bypasses the current transaction (to avoid the race condition). It is hard to replicate this at the SQL or PL/pgSQL level without using tricks like dblink.
The DEFAULT column property can use a simple expression or a function call like nextval('myseq'); but it cannot refer to other columns to inform the function which stream the value should come from.
You can make something that works, but you probably won't think it simple. Addressing the above problems in turn:
Use a table to store the next value for all partitions, with a schema like multiseq (partition_id, next_val).
Write a multinextval(seq_table, partition_id) function that does something like the following:
Create a new transaction independent on the current transaction (one way of doing this is through dblink; I believe some other server languages can do it more easily).
Lock the table mentioned in seq_table.
Update the row where the partition id is partition_id, with an incremented value. (Or insert a new row with value 2 if there is no existing one.)
Commit that transaction and return the previous stored id (or 1).
Create an insert trigger on your projects table that uses a call to multinextval('projects_table', NEW.Project_ID) for insertions.
I have not used this entire plan myself, but I have tried something similar to each step individually. Examples of the multinextval function and the trigger can be provided if you want to attempt this...