Relational Data to Flat File - rdbms

I hope you can help me find an answer to a problem that will become a recurring theme at work. It involves denormalising data from RDBMS tables into flat-file formats with repeating groups (columns that share the same domain and meaning). Unfortunately this is unavoidable.
Here's a very simplified example of the transformation I'd require:
TABLE A                           TABLE B
---------------------  1 -> MANY  ---------------------------------
A_KEY      FIELD_A                B_KEY      A_KEY      FIELD_B
A_KEY_01   A_VALUE_01             B_KEY_01   A_KEY_01   B_VALUE_01
A_KEY_02   A_VALUE_02             B_KEY_02   A_KEY_01   B_VALUE_02
                                  B_KEY_03   A_KEY_02   B_VALUE_03
This will become:
A_KEY      FIELD_A      B_KEY1     FIELD_B1     B_KEY2     FIELD_B2
A_KEY_01   A_VALUE_01   B_KEY_01   B_VALUE_01   B_KEY_02   B_VALUE_02
A_KEY_02   A_VALUE_02   B_KEY_03   B_VALUE_03
Each entry from TABLE A will have one row in the output flat file with one column per related field from TABLE B. Columns in the output file can have empty values for fields obtained from TABLE B.
I realise this will create an extremely wide file, but that is a requirement. I've had a look at MapForce and Apatar, but either this problem is too unusual or I'm not using them correctly.
My question: is there already a tool that will accomplish this or should I develop one from scratch (I don't want to reinvent the wheel)?

I'm pretty sure you can't solve this in plain SQL, but depending on your RDBMS, it may be possible to create a stored procedure or some such thing. Otherwise it's a fairly easy thing to do in a scripting language. Which technology are you using?

Does this help?
using-pivot-in-sql-server-2008
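That link covers SQL Server's PIVOT. For this particular shape (pairs of B_KEY/FIELD_B columns), a ROW_NUMBER plus conditional-aggregation sketch may be closer to what you need; this is not the linked answer, just one common way to do it when the maximum number of B rows per A row is known (add more CASE pairs for more slots):

SELECT A.A_KEY, A.FIELD_A,
       MAX(CASE WHEN b.rn = 1 THEN b.B_KEY   END) AS B_KEY1,
       MAX(CASE WHEN b.rn = 1 THEN b.FIELD_B END) AS FIELD_B1,
       MAX(CASE WHEN b.rn = 2 THEN b.B_KEY   END) AS B_KEY2,
       MAX(CASE WHEN b.rn = 2 THEN b.FIELD_B END) AS FIELD_B2
FROM A
LEFT JOIN (
    -- number the B rows within each A_KEY so each one lands in its own column slot
    SELECT B_KEY, A_KEY, FIELD_B,
           ROW_NUMBER() OVER (PARTITION BY A_KEY ORDER BY B_KEY) AS rn
    FROM B
) b ON b.A_KEY = A.A_KEY
GROUP BY A.A_KEY, A.FIELD_A;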

Thanks for all your help. As it turns out the relationship is ONE -> a maximum of 3, and this constraint will not change because the data is now static, so the following run-of-the-mill SQL works:
select A.A_KEY, A.FIELD_A,
       B.B_KEY,  B.FIELD_B,
       B2.B_KEY, B2.FIELD_B,
       B3.B_KEY, B3.FIELD_B
from A
left join B    on (A.A_KEY = B.A_KEY)
left join B B2 on (A.A_KEY = B2.A_KEY and B2.B_KEY != B.B_KEY)
left join B B3 on (A.A_KEY = B3.A_KEY and B3.B_KEY != B.B_KEY
                   and B3.B_KEY != B2.B_KEY)
-- the self-joins produce permuted duplicates per A_KEY; the GROUP BY collapses them
-- (this relies on the RDBMS allowing non-aggregated columns alongside GROUP BY)
group by A.A_KEY
order by A.A_KEY

Related

Perl : Tracking duplicates

I am trying to figure out the best way to locate duplicates in six-column CSV data. The real data has more than a million rows in it.
These are the six columns:
Name, address, city, post-code, phone number, machine number
The data is not of fixed length, and values may be missing from certain columns in certain rows.
I am thinking of using Perl to first normalise all the short forms used in names, cities and addresses. Fellow Perl enthusiasts from Stack Overflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering: is it possible to match content based on "likeness / similarity" (e.g. "google" similar to "gugl")? The likeness matching is needed to overcome errors that crept in while the data was collected.
I have two tasks in hand w.r.t. the data:
1. Flag duplicate rows with a certain identifier.
2. Report the percentage match between similar rows.
I would really appreciate suggestions as to which methods could be employed and which would probably work best, given their respective merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
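For example, a minimal import sketch in PostgreSQL (column names are taken from the question; the file name and the all-text types are placeholders):

CREATE TABLE whatever (
    id             serial PRIMARY KEY,
    name           text,
    address        text,
    city           text,
    post_code      text,
    phone_number   text,
    machine_number text
);

-- psql meta-command; expects a header row in the CSV
\copy whatever (name, address, city, post_code, phone_number, machine_number) from 'data.csv' with (format csv, header true)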
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you wanted to search case insensitively you can add an index on the lower-case name. (Thanks #asjo for pointing that out)
CREATE INDEX ON whatever ((lower(name)));
-- This will be much faster
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.
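For instance, a sketch of the metaphone approach, reusing the same whatever table from above (metaphone's second argument is the maximum output length):

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

ALTER TABLE whatever ADD COLUMN name_metaphone text;
UPDATE whatever SET name_metaphone = metaphone(name, 10);
CREATE INDEX ON whatever (name_metaphone);

-- candidate duplicates whose names sound alike
SELECT t1.id, t2.id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name_metaphone = t2.name_metaphone;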
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ', 'g');
Finally, you can use the Postgres Trigram module to add fuzzy indexing to your table (thanks again to #asjo).
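A sketch of the trigram approach, which also gives you a similarity score you could report as a percentage (again assuming the whatever table from above):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON whatever USING gin (name gin_trgm_ops);

-- % matches pairs above the module's similarity threshold;
-- similarity() returns a score between 0 and 1
SELECT t1.id, t2.id, similarity(t1.name, t2.name) AS name_similarity
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name % t2.name;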

Why to create empty (no rows, no columns) table in PostgreSQL

In answer to this question I've learned that you can create an empty table in PostgreSQL.
create table t();
Is there any real use case for this? Why would you create an empty table? Because you don't know what columns it will have?
These are the things, from my point of view, that a column-less table is good for. They probably fall more into the warm and fuzzy category.
1. One practical use of creating a table before you add any user-defined columns to it is that it allows you to iterate fast when creating a new system, or just doing rapid dev iterations in general.
2. More of the same as 1, but it lets you stub out tables that your app logic or procedures can reference, even if the columns have yet to be put in place.
3. I could see it coming in handy at a big company with lots of developers. Maybe you want to reserve a name months in advance, before your work is complete. Just add the new column-less table to the build. Of course someone could still hijack it, but you may be able to win the argument that you had it in use well before they came along with their other plans. Kind of fringe, but a valid benefit.
All of these are handy and I miss them when I'm not working in PostgreSQL.
I don't know the precise reason for its inclusion in PostgreSQL, but a zero-column table - or rather a zero-attribute relation - plays a role in the theory of relational algebra, on which SQL is (broadly) based.
Specifically, a zero-attribute relation with no tuples (in SQL terms, a table with no columns and no rows) is the relational equivalent of zero or false, while a relation with no attributes but one tuple (SQL: no columns, but one row, which isn't possible in PostgreSQL as far as I know) is true or one. Hugh Darwen, an outspoken advocate of relational theory and critic of SQL, dubbed these "Table Dum" and "Table Dee", respectively.
In normal algebra x + 0 == x and x * 0 == 0, whereas x * 1 == x; the idea is that in relational algebra, Table Dum and Table Dee can be used as similar primitives for joins, unions, etc.
PostgreSQL internally refers to tables (as well as views and sequences) as "relations", so although it is geared around implementing SQL, which isn't defined by this kind of pure relation algebra, there may be elements of that in its design or history.
It is not really an empty table, only an empty column list: PostgreSQL rows contain some columns that are invisible by default. I am not sure, but it may be an artifact from the dark ages, when Postgres was an object-relational database and supported the POSTQUEL language. Such an empty table can work as an abstract ancestor in a class hierarchy.
List of system columns
I don't think mine is the intended usage, but recently I've used an empty table as a lock for a view which I create and replace dynamically with EXECUTE. The function which creates/replaces the view takes an ACCESS EXCLUSIVE lock on the empty table, and the other functions which use the view take an ACCESS SHARE lock.
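A minimal sketch of that locking pattern (my_view_lock, rebuild_my_view and my_view are hypothetical names, and the view body is a placeholder):

CREATE TABLE my_view_lock ();  -- column-less table used purely as a lock target

CREATE OR REPLACE FUNCTION rebuild_my_view() RETURNS void AS $$
BEGIN
    -- writers serialize on the empty table
    LOCK TABLE my_view_lock IN ACCESS EXCLUSIVE MODE;
    EXECUTE 'CREATE OR REPLACE VIEW my_view AS SELECT 1 AS dummy';
END;
$$ LANGUAGE plpgsql;

-- readers take the weaker lock inside their transaction before using the view:
-- LOCK TABLE my_view_lock IN ACCESS SHARE MODE;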

ETL Process when and how to add in Foreign Keys T-SQL SSIS

I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand that adding a surrogate primary key (rather than relying on the natural key) will then allow me to make the connections between the facts and dimensions.
Sounds like a silly question but how exactly is this done? Are there any good articles that run through this process?
I would imagine we bring in all of the dimensions first, and when the fact data is brought over, a lookup is performed that "pushes" the foreign key into the fact table? At what point is this done? Within SSIS, what is the "best practice" method? Is this all done in one package, for example?
Is that roughly how it happens?
In this case do we have to be particularly careful about the order in which we load our data, since otherwise we could be loading facts for which there is no corresponding dimension?
"I would imagine we bring in all of the dimensions first, and when the fact data is brought over, a lookup is performed that 'pushes' the foreign key into the fact table? At what point is this done? Within SSIS, what is the 'best practice' method? Is this all done in one package, for example?"
It would depend on your schema and table design.
Assuming it's star schema and the FK is based on the data value itself:
DIM1 <- FACT1 -> DIM2
  ^
  |
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1 as you would need the FK.
Assuming it's snowflake schema:
DIM1_1
  ^
  |
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1 then DIM1 and DIM2 before inserting into FACT1.
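In SQL terms, the "fill the dimension first, then look up its key while loading the fact" step might look something like this sketch (Staging, DIM1, FACT1 and the column names are hypothetical; in SSIS this is typically a Lookup transformation in the fact-load data flow):

-- load the dimension first, keeping its surrogate keys
INSERT INTO DIM1 (NaturalKey, Attribute1)
SELECT DISTINCT s.Dim1NaturalKey, s.Dim1Attribute
FROM Staging s
WHERE NOT EXISTS (SELECT 1 FROM DIM1 d WHERE d.NaturalKey = s.Dim1NaturalKey);

-- then load the fact, looking up the surrogate key by natural key
INSERT INTO FACT1 (Dim1Key, Measure)
SELECT d.Dim1Key, s.Measure
FROM Staging s
JOIN DIM1 d ON d.NaturalKey = s.Dim1NaturalKey;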
Assuming the FK relation is based on something else (usually a number) instead of the data value itself (a kind of optimization when dealing with huge amounts of data and/or strings as dimension values), you won't need to wait until you insert the data into the DIM table. I'm sure that sounds confusing :), so I'll try to explain briefly. The steps involved would be something like this (assume a simple star schema with 2 tables, FACT1 and DIMENSION1):
Extract FACT and DIMENSION values from the data set you are processing.
Generate a unique number based on the DIMENSION's value (say it is a string), using a reproducible algorithm (e.g. a hash such as SHA1: given the same string, it always gives the same number).
Insert into FACT1 table, the number and FACT values.
Insert into DIMENSION1 table, the number and DIMENSION values.
Steps 3 & 4 can be done in parallel, as long as there is NO constraint in place. A join on a numeric column will also be more efficient than one on a string.
And there is no need to store the mapping for #2 because it's reproducible (just ensure you pick the right algo).
Obviously this can be extended for snowflake schema and/or multiple dimensions.
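A T-SQL sketch of the hash-based variant (table and column names are hypothetical; HASHBYTES is deterministic, so the fact and dimension loads can derive the same key independently; for SHA1 the DimKey column would be VARBINARY(20)):

-- step 3: fact rows get the hashed key directly, no lookup needed
INSERT INTO FACT1 (DimKey, FactValue)
SELECT HASHBYTES('SHA1', s.DimValue), s.FactValue
FROM Staging s;

-- step 4: dimension rows derive the same hashed key independently
INSERT INTO DIMENSION1 (DimKey, DimValue)
SELECT DISTINCT HASHBYTES('SHA1', s.DimValue), s.DimValue
FROM Staging s;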
HTH

Link tables when one column is padded with 0's in Crystal Reports

I have a database that has two tables that need to be linked, but in one table the data is padded with zeros. For example, the tables may look like this:
CUSTOMER.CUSTNUM = 00000000123456
CUSTOMERPHONE.CUSTNUM = 123456
I can't figure out how to get these tables to properly join.
What I'm trying to do now is trick Crystal Reports into specifying the Join clause by adding the following to the selection expert:
Right ({CUSTOMER.CUSTNUM}) = {CUSTOMERPHONE.CUSTNUM}
That's not working though, and I get no records at all in my report.
Any ideas?
Crystal doesn't like heterogeneous joins.
Options:
use a command object, which will give you more control over the linkage
create a SQL Expression that performs the desired concatenation; link fields in the record-selection formula
use a subreport for the linked table
alter the table to make the data types compatible
create a SQL view that performs the join (a sketch follows this list)
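For example, the SQL-view option might look something like the following sketch (CUSTOMER and CUSTOMERPHONE come from the question; the PHONE column and the exact cast syntax are assumptions and depend on your database):

CREATE VIEW customer_with_phone AS
SELECT c.CUSTNUM,
       p.PHONE                                             -- hypothetical column name
FROM CUSTOMER c
LEFT JOIN CUSTOMERPHONE p
       ON CAST(c.CUSTNUM AS BIGINT) = CAST(p.CUSTNUM AS BIGINT);  -- strips the leading zeros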
First thing, why does CUSTOMER.CUSTNUM have leading zeros in the first place? It seems to me that it should be a NUMERIC data type instead of a VARCHAR. CUSTNUM should be consistent in all of the tables. Just a thought.
Anyway, to answer your question, you could try creating a SQL Command in Crystal to join the two tables. In the join, just use your database's function for converting from a varchar to a number. For example, in Access you could do:
SELECT *
FROM `Customer`
LEFT OUTER JOIN `Orders` ON `Orders`.`Numeric Customer ID` = CLng(`Customer`.`Varchar Customer ID`)
If fastest performance isn't an issue, you can accomplish this using Select Expert. I think the problem is your formula.
Try changing your formula from this:
{CUSTOMERPHONE.CUSTNUM} = Right({CUSTOMER.CUSTNUM})
to this:
{CUSTOMERPHONE.CUSTNUM} = Right({CUSTOMER.CUSTNUM}, Length({CUSTOMERPHONE.CUSTNUM}))

Postgres: n:m intermediate table with type

I have a table called "Tag" which consists of an Id, Name and Description column.
Now let's say I have the tables Character (C), Movie (M), Series (S), etc.
And I want to be able to tag entries in C, M, S with multiple tags and one tag may be used for multiple entries.
So I could realize it like this:
T -> TC <- C
T -> TM <- M
T -> TS <- S
Where TC, TM, TS are the intermediate tables.
I was wondering if I could combine TC, TM, TS into one table with a type column added and still use foreign keys.
As of yet I haven't found a way to do it.
Or is this something I shouldn't be doing?
As the comments above suggested, you can't combine multiple tables into a single one. If you want a single view of the "tag relationships", you can pull the needed information into a view. This way, you only need to write the longer query once and can then use it like a single table. Keep in mind that you can't insert data into a view directly (there are ways to do so, but they are a little advanced).
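If you go the view route, a sketch might look like this (all table and column names here are assumptions based on the question's TC/TM/TS intermediate tables):

CREATE VIEW tagged_entities AS
SELECT tag_id, character_id AS entity_id, 'character' AS entity_type FROM tag_character
UNION ALL
SELECT tag_id, movie_id,     'movie'                  FROM tag_movie
UNION ALL
SELECT tag_id, series_id,    'series'                 FROM tag_series;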