Compare two pg_dump results - postgresql

I've got two PostgreSQL databases with similar but not identical schemas. I want to compare those schemas in order to add missing columns and indices.
So far I've collected the schemas using pg_dump -s. But some columns appear in tables in different order, therefore I can't just use diff to spot the difference.
Can I somehow alphabetically sort columns in pg_dump?
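One possible approach, as a rough sketch rather than a finished tool: post-process each pg_dump -s file so that the lines inside every CREATE TABLE (...) body are sorted alphabetically, then diff the results. The function name sort_create_table_columns is made up for illustration, and the code assumes the usual pg_dump layout (one column or constraint per line, body closed by a line that is ")" or starts with ");"). Column definitions that span several lines would need extra handling.

```python
import re


def sort_create_table_columns(dump_text):
    """Return dump_text with the lines of each CREATE TABLE body sorted.

    Hypothetical helper: assumes one column/constraint per line, as
    pg_dump -s normally writes them.
    """
    out = []
    body = None  # collects body lines while inside a CREATE TABLE block
    for line in dump_text.splitlines():
        if body is None:
            out.append(line)
            if re.match(r"CREATE TABLE .*\($", line):
                body = []
        elif line == ")" or line.startswith(");"):
            # Drop trailing commas before sorting, then put them back on
            # every line except the last one.
            cols = sorted(c.strip().rstrip(",") for c in body)
            out.extend("    " + c + "," for c in cols[:-1])
            if cols:
                out.append("    " + cols[-1])
            out.append(line)
            body = None
        else:
            body.append(line)
    return "\n".join(out)
```

Running both dumps through this and then using plain diff should leave only real differences (missing columns, changed types), since column order no longer matters.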

Related

Wide column vs column family vs columnar vs column oriented DB definition

There is a lot of confusion among these terms. I'd like to throw my understanding out and see if people agree. I have seen conflicting and wrong definitions all over the web.
In my mind, wide column and column family DBs are essentially the same thing. In both:
the data is organized logically as a group of key-value pairs (each one called a column);
each row is identified by a unique row key;
each row can have a variable number or definition of columns; and
rows are stored on disk one after another. So a column family (wide column) table is similar to a relational DB's table in that it is still organized as rows.
The main difference is that it doesn't have a fixed schema for columns and can't do table joins.
An example of 3 rows (column families): each row has a different length and/or different columns, but on disk rowkey1's entire content is one contiguous run followed by the other rows, similar to a relational DB:
rowkey1 k1-v k2-v k3-v
rowkey2 k1-v k4v
rowkey3 k2-v k4-v k5-v
On the other hand, the term columnar DB means the same as column oriented DB. These store data on disk one column at a time, not one row at a time. This is great for time series or any multi-series analytical purpose. Because each column holds the same type of data and is stored together, it also compresses better.
an example:
on disk:
a:1 b:2 c:3 d:4
10:1 9:2 8:3 7:4
The definition from Wikipedia also helps further:
Wide-column stores such as Bigtable and Apache Cassandra are not column stores in the original sense of the term, since their two-level structures do not use a columnar data layout. In genuine column stores, a columnar data layout is adopted such that each column is stored separately on disk. Wide-column stores do often support the notion of column families that are stored separately. However, each such column family typically contains multiple columns that are used together, similar to traditional relational database tables. Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately. Wide-column stores that support column families are also known as column family databases.
Reference: https://en.wikipedia.org/wiki/Wide-column_store
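To make the distinction concrete, here is a small in-memory sketch of the two layouts using the three example rows from above. It is only an illustration of the ordering, not how any real engine serializes data; the function names row_layout and column_layout are invented for this example.

```python
# The three example rows: each row key maps to its own set of columns.
rows = {
    "rowkey1": {"k1": "v", "k2": "v", "k3": "v"},
    "rowkey2": {"k1": "v", "k4": "v"},
    "rowkey3": {"k2": "v", "k4": "v", "k5": "v"},
}


def row_layout(rows):
    """Wide-column style: everything belonging to one row stays together."""
    return [(key, sorted(cols.items())) for key, cols in rows.items()]


def column_layout(rows):
    """Columnar style: all values of one column stay together."""
    columns = {}
    for key, cols in rows.items():
        for name, value in cols.items():
            columns.setdefault(name, []).append((key, value))
    return columns
```

With row_layout, reading rowkey1 touches one contiguous entry. With column_layout, scanning column k2 touches one contiguous entry, which is why the columnar form suits analytical scans.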

PostgreSQL - Compare ts_vector fields

I have two tables holding data that comes from two different sources. One of the fields in each table contains the title of a movie, but for reasons outside my control, the titles are not always exactly the same.
So I use the ts_vector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two ts_vector values without taking the numeric values into account, only the text content. If I compare the two fields directly, I only get exact matches, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare ts_vectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.
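For anyone unsure what comparing stripped vectors means, here is a tiny client-side illustration in Python. It is not how PostgreSQL implements strip(); it just parses the text form of a tsvector (for example 'lord':2 'ring':5) and keeps the lexemes, so two vectors that differ only in positions or weights compare equal. The helper name lexemes and the sample strings are made up.

```python
import re

# One quoted lexeme, optionally followed by :positions with weights A-D.
_LEXEME = re.compile(r"'((?:[^']|'')*)'(?::[0-9A-D,]+)?")


def lexemes(tsvector_text):
    """Return the set of lexemes in a tsvector's text form, ignoring
    positions and weights (roughly what comparing strip() output does)."""
    return frozenset(m.group(1) for m in _LEXEME.finditer(tsvector_text))
```

In the database itself, comparing strip(ts_title) on both sides is still the simpler route; this sketch only shows what information is being thrown away.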

Tableau Extract API with multiple tables in a database

I am currently experimenting with the Tableau Extract API to generate TDE files from the tables I have in a PostgreSQL database. I was able to write code that generates a TDE from a single table, but I would like to do this for multiple joined tables. To be more specific, if I have two tables that are inner joined on some field, how would I generate the TDE for this?
I can see that if I am working with a small number of tables, I could use a SQL query with JOIN clauses to create one gigantic table, and generate the TDE from that table.
SELECT * INTO new_table_1
FROM table_1 INNER JOIN table_2
ON table_1.id_1 = table_2.id_2;
SELECT * INTO new_table_2
FROM new_table_1 INNER JOIN table_3
ON new_table_1.id_1 = table_3.id_3;
and then generate the TDE from new_table_2.
However, I have some tables that have over 40 different fields, so this could get messy.
Is this even a possibility with the current version of the API?
You can read from as many tables or other sources as you want. You can use a complex query with lots of joins, or create a view and read from that. Creating a view is usually helpful when you have a complex query joining many tables.
The data extract API is totally agnostic about how or where you get the data to feed it -- the whole point is to allow you to grab data from unusual sources that don't have pre-built drivers for Tableau.
Since Tableau has a Postgres driver and can read from it directly, you don't need to write a program with the data extract API at all. You can define your extract with Tableau Desktop. If you need to schedule automated refreshes of the extract, you can use Tableau Server or its tabcmd command.
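A minimal sketch of the view approach, with heavy assumptions: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it can run anywhere, the table and column names (table_1, table_2, name, joined) are hypothetical, and the Tableau part is left out entirely. The point is only that the view gives clashing column names explicit aliases, so whatever reads from it sees one flat, unambiguous table.

```python
import sqlite3


def build_joined_view(conn):
    """Create two sample tables and a view that joins them.

    Both tables have a 'name' column; the view aliases each one so the
    result has no duplicate field names.
    """
    conn.executescript("""
        CREATE TABLE table_1 (id_1 INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE table_2 (id_2 INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO table_1 VALUES (1, 'alpha'), (2, 'beta');
        INSERT INTO table_2 VALUES (1, 'one'), (3, 'three');
        CREATE VIEW joined AS
        SELECT t1.id_1 AS id, t1.name AS t1_name, t2.name AS t2_name
        FROM table_1 AS t1
        INNER JOIN table_2 AS t2 ON t1.id_1 = t2.id_2;
    """)


conn = sqlite3.connect(":memory:")
build_joined_view(conn)
cursor = conn.execute("SELECT * FROM joined")
column_names = [d[0] for d in cursor.description]  # feed these to the table definition
joined_rows = cursor.fetchall()                    # feed these to the extract, row by row
```

The extract code then only ever deals with one source (the view), and the join logic lives in the database where it is easier to read and change.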
Many thanks for your replies. I am aware that I could use Tableau Desktop to define my extract. In fact, I have done this many times before. I am trying to create the extracts using the API because I need some calculated fields that are nearly impossible to create using Tableau Desktop.
At this point, I am hesitant to use JOINs in the SQL query because the resulting table would look too complicated to comprehend (some of these tables also have same field names).
When you say that I could read from multiple tables or sources, does that mean with the Tableau Extract API? At this point, I cannot find anything in this API that accommodates multiple sources. For example, I know that when I use multiple tables in Tableau Desktop, there are icons on the left-hand side that tell me that the extract is composed of multiple tables. This just doesn't seem to be happening with the API, which leaves me stranded. Anyway, thank you again for your replies.
Going back to the topic, this is something that I tried a few days ago in my Python code:
import os
import dataextract as tde

try:
    tdefile = tde.Extract("extract.tde")
except Exception:
    os.remove("extract.tde")
    tdefile = tde.Extract("extract.tde")
tableDef = tde.TableDefinition()
# Read each column in table and set the column data types using tableDef.addColumn
# Some code goes here...
for eachTable in tableNames:
    tableAdd = tdefile.addTable(eachTable, tableDef)
    # Use SQL query to retrieve bunch_of_rows from eachTable
    for some_row in bunch_of_rows:
        # Read each row in table, and set the values in each column position of each row
        # Some code goes here...
        tableAdd.insert(some_row)
        some_row.close()
tdefile.close()
When I execute this code, I get an error saying that eachTable has to be called "Extract".
Of course, this code has its flaws, as nowhere in it do I say how the tables are joined.
So I am a little thrown off here, because it doesn't seem like I can use multiple tables unless I use JOINs to generate one table that contains everything.

Mongodb import and deciphering changed rows

I have a large CSV file which contains over 30 million rows. I need to load this file on a daily basis and identify which of the rows have changed. Unfortunately there is no unique key field, but four of the fields can be combined to make a unique key. Once I have identified the changed rows I will then want to export the data. I have tried a traditional SQL Server solution, but the performance is so slow it's not going to work. Therefore I have been looking at MongoDB, which managed to import the file in about 20 minutes (which is fine). I don't have any experience using MongoDB and, more importantly, don't know the best practices. So, my idea is the following:
As a one off - Import data into a collection using the mongoimport.
Copy all of the unique ids generated by Mongo and put them in a separate collection.
Import new data into the existing collection using upsert fields which should create a new id for each new and changed row.
Compare the 'copy' to the new collection to list out all the changed rows.
Export changed data.
This to me will work but I am hoping there is a much better way to tackle this problem.
Use unix sort and diff.
Sort the file on disk
sort -o new_file.csv -t ',' big_file.csv
sort -o old_file.csv -t ',' yesterday.csv
diff new_file.csv old_file.csv
Commands may need some tweaking.
You can also use MySQL to import the file via LOAD DATA INFILE
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
and then create a KEY (or primary key) on the 4 fields.
Then load yesterday's file into a different table and use two SQL statements to compare the files...
But, diff will work best!
-daniel
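If sort and diff are not available, the same comparison can be sketched in Python. This is only an illustration under assumptions: the four key fields are taken to be the first four CSV columns, the function name changed_rows is made up, and yesterday's file is held in memory as a dict, which for 30 million rows needs a lot of RAM (sorting on disk, as above, scales better).

```python
import csv


def changed_rows(old_lines, new_lines, key_fields=(0, 1, 2, 3)):
    """Return rows from new_lines that are new or differ from old_lines.

    Rows are matched on the composite key made of key_fields
    (assumed here to be the first four columns).
    """
    old = {tuple(r[i] for i in key_fields): r for r in csv.reader(old_lines)}
    changed = []
    for row in csv.reader(new_lines):
        key = tuple(row[i] for i in key_fields)
        if old.get(key) != row:  # key missing, or same key with different data
            changed.append(row)
    return changed
```

Both arguments can be open file objects, for example changed_rows(open("yesterday.csv", newline=""), open("big_file.csv", newline="")), and the returned rows can be written straight back out with csv.writer.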

How should a table with two sets of almost duplicate column names be designed?

I have a table that has around 40 columns. The only difference in the column names is that the last 20 all start with "B" before the column name. This table is used for comparing: we compare the data in the first 20 columns to the data in the last 20 columns.
I know this is very bad design, so how should this table be redesigned, so that there are only 20 columns, yet we can still compare the data?
EDIT: if it helps, we also use this data to find a matched cohort
Also note that performance is the main concern here. With the duplicated columns, fetching the data is extremely fast.
Thanks!
Two possible architectures and a query tip.
1) Build your table with a "Type" column, and use that to flag "primary" vs. "alternate". In your case, "A" vs. "B" might be appropriate.
2) Build a vertical partition, two identical tables (for primary and alternate data), that share a common primary key. (If Id = 42 is in one table, it must be in the other--unless "alternate" data is optional, in which case don't populate the second table.) Also optionally, have a third table that tracks all possible primary keys, along with any data that is known to always be common to both tables.
Tip: Read up on SELECT...EXCEPT and SELECT...INTERSECT. They run disturbingly quickly, and are ideal for comparing all columns and rows between two datasets for differences (except) and matches (intersect). You can use this fairly easily with either of the two structures, and it would work with your existing code as well (though it might be fussier to write the query).
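Here is a small runnable sketch of EXCEPT and INTERSECT against option 2 (two identical tables sharing a primary key). It uses Python's built-in sqlite3 purely so it can be tried without a server; the table names primary_data and alternate_data and the columns height and weight are invented for the example, and your own database's syntax may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE primary_data   (id INTEGER PRIMARY KEY, height INTEGER, weight INTEGER);
    CREATE TABLE alternate_data (id INTEGER PRIMARY KEY, height INTEGER, weight INTEGER);
    INSERT INTO primary_data   VALUES (1, 170, 60), (2, 180, 80);
    INSERT INTO alternate_data VALUES (1, 170, 60), (2, 181, 80);
""")

# Rows in primary_data with no identical row in alternate_data.
differences = conn.execute("""
    SELECT id, height, weight FROM primary_data
    EXCEPT
    SELECT id, height, weight FROM alternate_data
    ORDER BY id
""").fetchall()

# Rows that are identical in both tables.
matches = conn.execute("""
    SELECT id, height, weight FROM primary_data
    INTERSECT
    SELECT id, height, weight FROM alternate_data
    ORDER BY id
""").fetchall()
```

With the existing 40-column table, the same idea should work by selecting the first 20 columns in one half of the query and the "B" columns in the other, as long as the column order lines up.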