Virtualizing a star schema

Does anyone have experience with virtualizing a star schema model?
I.e., making database views that contain the data for the facts and dimensions, instead of having them in physical tables.
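By "virtualizing" I mean something like the following sketch (the source-table and column names here are made up for illustration):

```sql
-- A virtual dimension: the customer dimension is assembled
-- from normalized source tables at query time.
CREATE VIEW dim_customer AS
SELECT c.customer_id AS customer_key,
       c.name,
       r.region_name
FROM customer c
JOIN region r ON r.region_id = c.region_id;

-- A virtual fact: one row per order line, with measures
-- and foreign keys pointing at the dimension views.
CREATE VIEW fact_sales AS
SELECT o.order_date  AS date_key,
       o.customer_id AS customer_key,
       ol.product_id AS product_key,
       ol.quantity,
       ol.quantity * ol.unit_price AS amount
FROM orders o
JOIN order_line ol ON ol.order_id = o.order_id;
```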

Related

PostgreSQL database design on multiple disks

Currently I have one physical machine with a few SSD disks and a fresh PostgreSQL installation.
I'll load ~1-2 TB of data into a few distinct tables (they have no interconnection between themselves), where each table holds a distinct data entity.
I'm thinking about two approaches:
Create a DB (with the corresponding table for its data entity) on each disk, one per entity.
Create one DB, but store each table for the corresponding data entity on a separate disk.
So my question is as follows: which approach is preferred, and which can be achieved at lower cost?
Eagerly waiting for your advice, comrades
You can answer the question yourself.
Are the data used by the same application?
Are the data from these tables joined?
Should these tables always be started and stopped together and have the same PostgreSQL version?
If yes, then they had best be stored together in a single database. Create three logical volumes, each striped across your SSDs: one for the data, one for pg_wal, one for the logs.
If not, you might be best off with a database or a database cluster per table.
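If you do want to pin individual tables to specific disks inside a single cluster, tablespaces are PostgreSQL's built-in mechanism for that. A minimal sketch, assuming hypothetical mount points and table definitions:

```sql
-- One tablespace per SSD mount point (the directories must
-- exist and be owned by the postgres OS user).
CREATE TABLESPACE ssd1 LOCATION '/mnt/ssd1/pgdata';
CREATE TABLESPACE ssd2 LOCATION '/mnt/ssd2/pgdata';

-- Place each large table on its own disk.
CREATE TABLE entity_a (id bigint PRIMARY KEY, payload jsonb) TABLESPACE ssd1;
CREATE TABLE entity_b (id bigint PRIMARY KEY, payload jsonb) TABLESPACE ssd2;
```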

Are schemas in PostgreSQL physical objects?

I use schemas in PostgreSQL to organize my huge accounting database. At the end of every year I run a reconciliation process by creating a new schema for the next year.
Are the files of the new schema physically separated from those of the old schema? Or are all schemas stored together on the hard disk?
This is vital for me because at the end of every year I have huge tables with millions of records, which means I'll soon be running heavy queries (I didn't plan for this when I decided to choose PostgreSQL).
Schemas are namespaces so they are a "logical" thing, not a physical thing.
As documented in the manual, each table is represented as one (or more) files inside the directory corresponding to the database the table is created in. The namespaces (schemas) are not reflected in the physical database layout.
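You can see this for yourself with pg_relation_filepath(); the schema makes no difference to where the file lives (schema and table names below are hypothetical):

```sql
-- Both paths point under the same database directory,
-- base/<database-oid>/..., regardless of schema.
SELECT pg_relation_filepath('year2023.ledger');
SELECT pg_relation_filepath('year2024.ledger');
```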
In general you shouldn't care about the physical storage of the database to begin with; your SQL queries don't know (and don't need to know) where the actual data is stored.
"millions" of rows is not considered "huge" these days. If you do run in performance problems, you will tune your query using e.g. indexes or by rewriting it to a more efficient solution. In rare cases partitioning a table can help with really huge tables - but we are talking hundreds of millions or even billions of rows. With medium to small sized tables, partitioning usually doesn't help with performance.

Best approach to implement inheritance in a data warehouse based on a postgres database

I am developing a multi-step data pipeline that should optimize the following process:
1) Extract data from a NoSQL database (MongoDB).
2) Transform and load the data into a relational (PostgreSQL) database.
3) Build a data warehouse using the Postgres database.
I have manually coded a script to handle steps 1) and 2), which is an intermediate ETL pipeline. Now my goal is to build the data warehouse using the Postgres database, but I ran into a few doubts regarding the DW design. Below is the dimensional model for the relational database:
There are 2 main tables, Occurrence and Canonical, from which a set of others inherit (drawn in red and blue, respectively). Note that there are 2 child data types, ObserverNodeOccurrence and CanonicalObserverNode, that have an extra many-to-many relationship with another table.
I did some research on how inheritance should be implemented in a data warehouse and figured the best practice would be to merge the family of data types (super and child tables) into a single table. Doing this would imply adding extra attributes and a lot of NULL values. My new dimensional model would look like the following:
Question 1: Do you think this is the best approach to address this problem? If not, what would be?
Question 2: Any software recommendations for on-premise data warehouses? (on-premise is a must since it contains sensitive data)
Having fewer tables to join and denormalized data will usually improve query performance in a data warehouse, so both are often considered a good thing there.
This would suggest your second table design. NULL values occupy practically no space in a PostgreSQL table, so you need not worry about that.
As described here, there are three options for implementing inheritance in a relational database.
IMO the only practicable option for a data warehouse is Table-Per-Hierarchy, which merges all entities into one table.
The reason is not only the performance gain from saving the joins. In a data warehouse, the historical view of the data is often important. Think: how would you model a change of subtype in some entity?
An important thing is to define a discriminator column which uniquely identifies the source entity.
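A minimal sketch of such a merged table (names invented for illustration; they are not taken from your model):

```sql
-- Super- and subtype attributes merged into one table.
-- occurrence_type is the discriminator naming the source entity.
CREATE TABLE occurrence (
    occurrence_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    occurrence_type text        NOT NULL,  -- e.g. 'observer_node'
    occurred_at     timestamptz NOT NULL,
    description     text,                  -- shared by all subtypes
    node_id         bigint,                -- subtype-specific; NULL elsewhere
    CHECK (occurrence_type <> 'observer_node' OR node_id IS NOT NULL)
);
```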

How does pglogical-2 handle logical replication on the same table while allowing it to be writable in both databases?

Based on the above image, there are certain tables I want to be in the Internal Database (right hand side). The other tables I want to be replicated in the external database.
In reality there's only one set of values that SHOULD NOT be replicated across. The rest of the database can be replicated. Basically the actual price columns in the prices table cannot be replicated across. It should stay within the internal database.
Because the vendors are external to the network, they have no access to the internal app.
My plan is to create a replicated version of the same app and allow vendors to submit quotations and pick items.
Let's say the replicated tables are at least quotations and quotation_line_items. These tables should be writable (INSERTs, UPDATEs, and DELETEs) in both the external database and the internal database, and the data in them should be replicated across in both directions.
The data in the other tables are going to be replicated in a single direction (from internal to external) except for the actual raw prices columns in the prices table.
The quotation_line_items table will have a price_id column. However, the raw price values in the prices table should not appear in the external database.
Ultimately, I want the data to be consistent for the replicated tables on both databases. I don't need synchronous replication, so a bit of delay (say, a couple of seconds for write operations) is fine.
I came across pglogical (https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE), which has the concept of a PUBLISHER and a SUBSCRIBER.
I cannot tell from the README which database would act as publisher and which as subscriber, or how to configure it for my situation.
That won't work. With the setup you are dreaming of, you will necessarily end up with replication conflicts.
How do you intend to prevent data from being modified in conflicting ways in the two databases? If you say that won't happen, think again.
I believe that you would be much better off using a single database with two users: one that can access the “secret” table and one that cannot.
If you want to restrict access only to certain columns, use a view. Simple views are updateable in PostgreSQL.
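A minimal sketch of that single-database approach, with hypothetical role, table, and column names:

```sql
-- Two classes of users: internal staff may see raw prices,
-- vendors may not.
CREATE ROLE internal_user LOGIN;
CREATE ROLE vendor_user LOGIN;

-- Vendors see the prices table only through a view that
-- leaves out the raw price column.
CREATE VIEW prices_public AS
SELECT price_id, product_id, currency  -- raw price column omitted
FROM prices;

REVOKE ALL ON prices FROM vendor_user;
GRANT SELECT ON prices_public TO vendor_user;

-- Simple views like this one are automatically updatable,
-- so the visible columns can even be written through the view:
GRANT INSERT, UPDATE ON prices_public TO vendor_user;
```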
It is possible with BDR replication, which uses pglogical: on a basic level, by allocating ranges of key IDs to each node, so that writes are possible in both locations without conflict. However, BDR is now a commercial, paid-for product.
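The key-range idea can be sketched with plain sequences (this illustrates the principle only; it is not BDR's actual mechanism, and the names are made up): give each node a disjoint ID space so concurrent inserts can never collide.

```sql
-- On the internal node: odd IDs only.
CREATE SEQUENCE quotations_id_seq START 1 INCREMENT 2;

-- On the external node: even IDs only.
-- CREATE SEQUENCE quotations_id_seq START 2 INCREMENT 2;

-- Both nodes define the table identically, each drawing keys
-- from its own half of the ID space.
CREATE TABLE quotations (
    quotation_id bigint PRIMARY KEY DEFAULT nextval('quotations_id_seq'),
    vendor_id    bigint      NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now()
);
```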

Transforming relational databases to graph databases

As part of my final thesis, I must transform a relational database into a graph-oriented database, specifically a PostgreSQL database into an embedded Neo4j database. The problem is how. In Rik Van Bruggen's book Learning Neo4j, he mentions a data import process using ETL activities with the Trascend and MuleSoft tools, but on their official sites there is no documentation about how to do it, neither help documentation nor examples. Apart from these tools, what other ways can I use to transform this information without writing my own code?
Some modeling advice:
A well-normalized relational model that has not yet been denormalized for performance reasons can be translated into an equivalent graph model.
Graph model shapes are mostly driven by use-cases, so there will be opportunity for optimization and model evolution afterwards.
A good, normalized Entity-Relationship diagram often already represents a decent graph model.
So if you still have the original ER diagram available, try to use it as a guide.
Here are some tips that help you with the transformation:
Each entity table is represented by a label on nodes
Each row in a table is a node
Columns on those tables become node properties.
Remove technical primary keys, keep business primary keys
Add unique constraints for business primary keys, add indexes for frequent lookup attributes
Replace foreign keys with relationships to the other table, then remove the foreign-key columns afterwards
Remove data with default values, no need to store those
Data in tables that is denormalized and duplicated might have to be pulled out into separate nodes to get a cleaner model.
Numbered column names (like email1, email2, email3) might indicate an array property
JOIN tables are transformed into relationships, columns on those tables become relationship properties
It is important to have an understanding of the graph model before you start to import data, then it just becomes the task of hydrating that model.
LOAD CSV might be your best option, but of course it means outputting a CSV first (see the export sketch after the links below). Here are some great resources:
http://neo4j.com/docs/stable/query-load-csv.html
http://watch.neo4j.org/video/112447027
http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
http://jexp.de/blog/2014/10/load-cvs-with-success/
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
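A minimal sketch of the export step on the PostgreSQL side, with hypothetical table and column names (LOAD CSV then picks these files up on the Neo4j side; use psql's \copy instead if the server can't write to the target path):

```sql
-- Export each future node label as its own CSV file.
COPY (SELECT person_id, name, email FROM person)
TO '/tmp/person.csv' WITH (FORMAT csv, HEADER);

-- Export a join table as the relationship list:
-- one row per future relationship, endpoint keys first.
COPY (SELECT person_id, company_id, start_date FROM employment)
TO '/tmp/works_at.csv' WITH (FORMAT csv, HEADER);
```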
I've also written a Ruby gem which lets you write a little Ruby code to import data from various sources. It's called neo4apis. You can look at the neo4apis-twitter gem to get an idea of how it works:
https://github.com/neo4jrb/neo4apis-twitter/
https://github.com/neo4jrb/neo4apis-twitter/blob/master/lib/neo4apis/twitter.rb
I've actually been wanting to implement a neo4apis-activerecord to make it easy to import from SQL with ActiveRecord.
You cannot directly export data from a relational database and import it into Neo4j, because the two have different database structures.
Relational Database -
A relational database is a set of tables containing data fitted into predefined categories. Each table (which is sometimes called a relation) contains one or more data categories in columns. Each row contains a unique instance of data for the categories defined by the columns.
Graph-oriented database -
A graph database is essentially a collection of nodes and edges. Each node represents an entity (such as a person or business) and each edge represents a connection or relationship between two nodes.
Solution to your problem -
First, you need to design the Neo4j data structure, e.g. what nodes you require and what the relationships between the nodes will be.
After that, create a script in your application language to fetch data from the relational database and insert it into Neo4j.
LOAD CSV is an option for import/export (backup) functionality with a graph database; you cannot directly export/import data between a relational DB and a graph DB.