ETL Process when and how to add in Foreign Keys T-SQL SSIS - tsql

I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand by the adding of a Primary key (not a natural key) this will then allow me to make the connections between the facts and dimensions.
Sounds like a silly question but how exactly is this done? Are there any good articles that run through this process?
I would imagine we bring in all of the Dimensions first. And when the fact data is brought over a lookup is performed that "pushes" the Foreign key into the Fact table? At what point is this done? Within SSIS whats is the "best practice" method? Is this all done in one package for example?
Is that roughly how it happens?
In this case do we have to be particularly careful in what order we load our data, or we could be loading facts for which there is no corresponding dimension?

I would imagine we bring in all of the Dimensions first. And when the
fact data is brought over a lookup is performed that "pushes" the
Foreign key into the Fact table? At what point is this done? Within
SSIS whats is the "best practice" method? Is this all done in one
package for example?
It would depend on your schema and table design.
Assuming it's star schema and the FK is based on the data value itself:
DIM1 <- FACT1 -> DIM2
^
|
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1 as you would need the FK.
Assuming it's snowflake schema:
DIM1_1
^
|
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1 then DIM1 and DIM2 before inserting into FACT1.
Assuming the FK relation is based on something else (mostly a number) instead of the data value itself (kinda an optimization when dealing with huge amount of data and/or strings as dimension values), you won't need to wait until you insert the data into DIM table. I'm sure it's very confusing :), so I'll try to explain in short. The steps involved would be something like (assume a simple star schema with 2 tables, FACT1 and DIMENSION1):
Extract FACT and DIMENSION values from the data set you are processing.
Generate a unique number based on the DIMENSION's value (which say is a string), using a reproducible algorithm (e.g. SHA1, given same string, it always gives same number).
Insert into FACT1 table, the number and FACT values.
Insert into DIMENSION1 table, the number and DIMENSION values.
Steps 3 & 4 can be done in parallel. as long as there is NO constraint in place. A join on a numeric column would be more efficient than one of a string.
And there is no need to store the mapping for #2 because it's reproducible (just ensure you pick the right algo).
Obviously this can be extended for snowflake schema and/or multiple dimensions.
HTH

Related

Binary to binary cast with JSONb

How to avoid the unnecessary CPU cost?
See this historic question with failure tests. Example: j->'x' is a JSONb representing a number and j->'y' a boolean. Since the first versions of JSONb (issued in 2014 with 9.4) until today (6 years!), with PostgreSQL v12... Seems that we need to enforce double conversion:
Discard j->'x' "binary JSONb number" information and transforms it into printable string j->>'x';discard j->'y' "binary JSONb boolean" information and transforms it into printable string j->>'y'.
Parse string to obtain "binary SQL float" by casting string (j->>'x')::float AS x; parse string to obtain "binary SQL boolean" by casting string (j->>'y')::boolean AS y.
Is there no syntax or optimized function to a programmer enforce the direct conversion?
I don't see in the guide... Or it was never implemented: is there a technical barrier to it?
NOTES about typical scenario where we need it
(responding to comments)
Imagine a scenario where your system need to store many many small datasets (real example!) with minimal disk usage, and managing all with a centralized control/metadata/etc. JSONb is a good solution, and offer at least 2 good alternatives to store in the database:
Metadata (with schema descriptor) and all dataset in an array of arrays;
Separating Metadata and table rows in two tables.
(and variations where metadata is translated to a cache of text[], etc.) Alternative-1, monolitic, is the best for the "minimal disk usage" requirement, and faster for full information retrieval. Alternative-2 can be the choice for random access or partial retrieval, when the table Alt2_DatasetLine have also more one column, like time, for time series.
You can create all SQL VIEWS in a separated schema, for example
CREATE mydatasets.t1234 AS
SELECT (j->>'d')::date AS d, j->>'t' AS t, (j->>'b')::boolean AS b,
(j->>'i')::int AS i, (j->>'f')::float AS f
FROM (
select jsonb_array_elements(j_alldata) j FROM Alt1_AllDataset
where dataset_id=1234
) t
-- or FROM alt2...
;
And CREATE VIEW's can by all automatic, running the SQL string dynamically ... we can reproduce the above "stable schema casting" by simple formating rules, extracted from metadata:
SELECT string_agg( CASE
WHEN x[2]!='text' THEN format(E'(j->>\'%s\')::%s AS %s',x[1],x[2],x[1])
ELSE format(E'j->>\'%s\' AS %s',x[1],x[1])
END, ',' ) as x2
FROM (
SELECT regexp_split_to_array(trim(x),'\s+') x
FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
) t2;
... It's a "real life scenario", this (apparently ugly) model is surprisingly fast for small traffic applications. And other advantages, besides disk usage reduction: flexibility (you can change datataset schema without need of change in the SQL schema) and scalability (2, 3, ... 1 billion of different datasets on the same table).
Returning to the question: imagine a dataset with ~50 or more columns, the SQL VIEW will be faster if PostgreSQL offers a "bynary to bynary casting".
Short answer: No, there is no better way to extract a jsonb number as PostgreSQL than (for example)
CAST(j ->> 'attr' AS double precision)
A JSON number happens to be stored as PostgreSQL numeric internally, so that wouldn't work “directly” anyway. But there is no principal reason why there could not be a more efficient way to extract such a value as numeric.
So, why don't we have that?
Nobody has implemented it. That is often an indication that nobody thought it worth the effort. I personally think that this would be a micro-optimization – if you want to go for maximum efficiency, you extract that column from the JSON and store it directly as column in the table.
It is not necessary to modify the PostgreSQL source to do this. It is possible to write your own C function that does exactly what you envision. If many people thought this was beneficial, I'd expect that somebody would already have written such a function.
PostgreSQL has just-in-time compilation (JIT). So if an expression like this is evaluated for a lot of rows, PostgreSQL will build executable code for that on the fly. That mitigates the inefficiency and makes it less necessary to have a special case for efficiency reasons.
It might not be quite as easy as it seems for many data types. JSON standard types don't necessarily correspond to PostgreSQL types in all cases. That may seem contrived, but look at this recent thread in the Hackers mailing list that deals with the differences between the numeric types between JSON and PostgreSQL.
All of the above are not reasons that such a feature could never exist, I just wanted to give reasons why we don't have it.

Why to create empty (no rows, no columns) table in PostgreSQL

In answer to this question I've learned that you can create empty table in PostgreSQL.
create table t();
Is there any real use case for this? Why would you create empty table? Because you don't know what columns it will have?
These are the things from my point of view that a column less table is good for. They probably fall more into the warm and fuzzy category.
1.
One practical use of creating a table before you add any user
defined columns to it is that it allows you to iterate fast when
creating a new system or just doing rapid dev iterations in general.
2.
Kind of more of 1, but lets you stub out tables that your app logic or procedure can make reference too, even if the columns have
yet to
be put in place.
3.
I could see it coming in handing in a case where your at a big company with lots of developers. Maybe you want to reserve a name
months in advance before your work is complete. Just add the new
column-less table to the build. Of course they could still high
jack it, but you may be able to win the argument that you had it in
use well before they came along with their other plans. Kind of
fringe, but a valid benefit.
All of these are handy and I miss them when I'm not working in PostgreSQL.
I don't know the precise reason for its inclusion in PostgreSQL, but a zero-column table - or rather a zero-attribute relation - plays a role in the theory of relational algebra, on which SQL is (broadly) based.
Specifically, a zero-attribute relation with no tuples (in SQL terms, a table with no columns and no rows) is the relational equivalent of zero or false, while a relation with no attributes but one tuple (SQL: no columns, but one row, which isn't possible in PostgreSQL as far as I know) is true or one. Hugh Darwen, an outspoken advocate of relational theory and critic of SQL, dubbed these "Table Dum" and "Table Dee", respectively.
In normal algebra x + 0 == x and x * 0 == 0, whereas x * 1 == x; the idea is that in relational algebra, Table Dum and Table Dee can be used as similar primitives for joins, unions, etc.
PostgreSQL internally refers to tables (as well as views and sequences) as "relations", so although it is geared around implementing SQL, which isn't defined by this kind of pure relation algebra, there may be elements of that in its design or history.
It is not empty table - only empty result. PostgreSQL rows contains some invisible (in default) columns. I am not sure, but it can be artifact from dark age, when Postgres was Objected Relational database - and PG supported language POSTQUEL. This empty table can work as abstract ancestor in class hierarchy.
List of system columns
I don't think mine is the intended usage however recently I've used an empty table as a lock for a view which I create and change dynamically with EXECUTE. The function which creates/replace the view has ACCESS EXCLUSIVE on the empty table and the other functions which uses the view has ACCESS.

Cassandra CompositeType as row key Validator

I'm working on some POC.
I have the Column Family which stores server event. Avoiding to get row oversize we are splitting each row to N another rows using compositeType in row key:
CREATE COLUMN FAMILY logs with comparator='ReversedType(TimeUUIDType)' and key_validation_class='CompositeType(UTF8Type,IntegerType)' and default_validation_class=UTF8Type;
so for each server name we have N rows and we are writing data to each row using Very Simple Round Robin algorithm.
I have no problem to write data to any row:
Mutator<Composite> mutator = HFactory.createMutator(keySpace, CompositeSerializer.get());
HColumn<UUID,String> col =
HFactory.createColumn( TimeUUIDUtils.getUniqueTimeUUIDinMillis(), log);
Composite rowName = new Composite();
rowName.addComponent(serverName, StringSerializer.get());
rowName.addComponent(this.roundRobinDestributor.getRow(), IntegerSerializer.get());
mutator.insert(rowName, columnFamilyName, col);
}
So far so good, but now I have two quetions:
1) Due to the fact that if I want to get all logs for some serverName I would scan row keys, should I use ByteOrderedPartitioner?
2) Can any body help me, or point me on some help how to create Hector query which will bring all rows for server1 ( {server1:0}, {server1:1} {server1:2), etc...)? I saw a lot of example using CompositeType as comparator, but no example for key validator.
Any help or comment is highly appreciated.
First of all, row oversizing shouldn't be a problem in cassandra. Despite that, it might worth to spilt rows, since data distribution across cluster will be more even in this situation.
ByteOrderedPartitioner doesn't look like a good option here, since it would be hard to achieve uniform distribution of rows across cluster, that will lead to hotspots.
There's no way to query range of keys when using RandomPartitioner. However, if the maximum N value is reasonably small (up to 256) MultigetSliceQuery might be used to query whole set of rows.

How should a table with two sets of almost duplicate column names be designed?

I have a table that has around 40 columns. The only difference in the columns names is that the last 20 all start with "B" before the column name. This table is used for comparing. In other words, compare the data in the first 20 columns to the data in the last 20 columns.
I know this is very bad design, so how should this table be redesigned, so that there are only 20 columns, yet we can still compare the data?
EDIT: if it helps, we also use this data to find a matched cohort
Also note that performance is of main concern here. By duplicating the columns the getting of data is extremely fast.
Thanks!
Two possible architectures and a query tip.
1) Build your table with a "Type" column, and use that to flag "primary" vs. "alternate". In your case, "A" vs. "B" might be appropriate.
2) Build a vertical partition, two identical tables (for primary and alternate data), that share a common primary key. (If Id = 42 is in one table, it must be in the other--unless "alternate" data is optional, in which case don't populate the second table.) Also optionally, have a third table that tracks all possible primary keys, along with any data that is known to always be common to both tables.
Tip: Read up on SELECT...EXCEPT and SELECT...INTERSECT. They run disturbingly quickly, and are idea for comparing all columns and rows between two datasets for differences (except) and matches (intersect). You can use this fairly easily with either of the two structures, and it would work with your existing code as well (though it might be fussier to write the query).

DB2 Auto generated Column / GENERATED ALWAYS pros and cons over sequence

Earlier we were using 'GENERATED ALWAYS' for generating the values for a primary key. But now it is suggested that we should, instead of using 'GENERATED ALWAYS' , use sequence for populating the value of primary key. What do you think can be the reason of this change? It this just a matter of choice?
Earlier Code:
CREATE TABLE SCH.TAB1
(TAB_P INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1, NO CACHE),
.
.
);
Now it is
CREATE TABLE SCH.TAB1
(TAB_P INTEGER ),
.
.
);
now while inserting, generate the value for TAB_P via sequence.
I tend to use identity columns more than sequences, but I'll compare the two for you.
Sequences can generate numbers for any purpose, while an identity column is strictly attached to a column in a table.
Since a sequence is an independent object, it can generate numbers for multiple tables (or anything else), and is not affected when any table is dropped. When a table with a identity column is dropped, there is no memory of what value was last assigned by that identity column.
A table can have only one identity column, so if you want to want to record multiple sequential numbers into different columns in the same table, sequence objects can handle that.
The most common requirement for a sequential number generator in a database is to assign a technical key to a row, which is handled well by an identity column. For more complicated number generation needs, a sequence object offers more flexibility.
This might probably be to handle ids in case there are lots of deletes on the table.
For eg: In case of identity, if your ids are
1
2
3
Now if you delete record 3, your table will have
1
2
And then if your insert a new record, the ids will be
1
2
4
As opposed to this, if you are not using an identity column and are generating the id using code, then after delete for the new insert you can calculate id as max(id) + 1, so the ids will be in order
1
2
3
I can't think of any other reason, why an identity column should not be used.
Heres something I found on the publib site:
Comparing IDENTITY columns and sequences
While there are similarities between IDENTITY columns and sequences, there are also differences. The characteristics of each can be used when designing your database and applications.
An identity column has the following characteristics:
An identity column can be defined as
part of a table only when the table
is created. Once a table is created,
you cannot alter it to add an
identity column. (However, existing
identity column characteristics might
be altered.)
An identity column
automatically generates values for a
single table.
When an identity
column is defined as GENERATED
ALWAYS, the values used are always
generated by the database manager.
Applications are not allowed to
provide their own values during the
modification of the contents of the
table.
A sequence object has the following characteristics:
A sequence object is a database
object that is not tied to any one
table.
A sequence object generates
sequential values that can be used in
any SQL or XQuery statement.
Since a sequence object can be used
by any application, there are two
expressions used to control the
retrieval of the next value in the
specified sequence and the value
generated previous to the statement
being executed. The PREVIOUS VALUE
expression returns the most recently
generated value for the specified
sequence for a previous statement
within the current session. The NEXT
VALUE expression returns the next
value for the specified sequence. The
use of these expressions allows the
same value to be used across several
SQL and XQuery statements within
several tables.
While these are not all of the characteristics of these two items, these characteristics will assist you in determining which to use depending on your database design and the applications using the database.
I don't know why anyone would EVER use an identity column rather than a sequence.
Sequences accomplish the same thing and are far more straight forward. Identity columns are much more of a pain especially when you want to do unloads and loads of the data to other environments. I not going to go into all the differences as that information can be found in the manuals but I can tell you that the DBA's have to almost always get involved anytime a user wants to migrate data from one environment to another when a table with an identity is involved because it can get confusing for the users. We have no issues when a sequence is used. We allow the users to update any schema objects so they can alter their sequences if they need to.