ADF: copy distinct values into lookup table, add FK column to dataset - azure-data-factory

Tough to come up with a reasonable title for this one!
I am copying data from a source table (let's call it Books) that has an enum column (Category):
ID  Title      Category
----------------------------
1   Test1      Education
2   Blah       Leisure
3   Brown fox  Leisure
...
So in this example, there are two enum members, Education and Leisure.
The sink is SQL, and I'm getting the distinct set of enum values and putting them in a lookup table (Categories in this example). The Books table in the sink should have a foreign key column called CategoryId that refers to the PK in the Categories lookup table.
So I need to figure out how to use the text from the Category column to get the ID from the lookup table and use it as the value in the Books.CategoryId column. Anyone know how to do that? I'm just getting my feet wet with ADF so I'll really appreciate any assistance.
Thanks!

Add a data flow to your pipeline; that will allow you to build a pattern to dedupe and perform value lookups. If you need some guidance on building data flows, see our YouTube channel of helper videos: https://aka.ms/dataflowvids
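If you would rather resolve the lookup on the SQL side (for example from a stored procedure or script run after copying into a staging table), a rough T-SQL sketch could look like the one below. The staging table name and the Categories columns (CategoryId, Name) are assumptions for illustration, not names from the question:

-- Populate the lookup table with any category values it does not have yet
INSERT INTO dbo.Categories (Name)
SELECT DISTINCT s.Category
FROM dbo.Books_Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.Categories AS c WHERE c.Name = s.Category);

-- Load Books, resolving the category text to the lookup table's PK
INSERT INTO dbo.Books (ID, Title, CategoryId)
SELECT s.ID, s.Title, c.CategoryId
FROM dbo.Books_Staging AS s
JOIN dbo.Categories AS c ON c.Name = s.Category;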

Related

Feedback about my database design (multi tenancy)

The idea of the SaaS tool is to have dynamic tables with dynamic custom fields and values of different types. We were thinking of following the "force.com/salesforce.com" example, but it seems too complicated to maintain moving forward, and it forces reports to be built with a huge level of abstraction, so we came up with a simple idea, but we have to be sure it is a reasonably good approach.
This is the architecture we have today (in a few steps).
Each tenant has its own separate database on the cluster (Postgres 12).
A TABLE table, used to keep all of those tables as references; this entity has a ManyToOne relation to the META table and a OneToMany relation with the DATA table.
The META table is used for metadata configuration and has a OneToMany relation with FIELDS (which holds the name of each field as well as its type, e.g. TEXT/INTEGER/BOOLEAN/DATETIME etc., with the attribute value stored as a string, only as a reference).
The DATA table has a ManyToOne relation to TABLES and 50 character varying columns named attribute1...50, which are NULL-able.
Example flow today:
When a user wants to open a TABLE's DATA, e.g. "CARS", we load the META table with all the FIELDS (to get the fields for this query). Say the user specified that he wants to query against the Brand, Class, Year and Price columns.
Our logic then checks the references for Brand, Class, Year and Price in the META > FIELDS table, so we know that Brand = attribute2, Class = attribute5, Year = attribute6 and Price = attribute7.
We parse his request into a query, e.g. SELECT [attr...2,5,6,7] FROM DATA, and then show the results to the user. If the user decides to apply some filters, e.g. Year > 2017 AND Class = 'A', we use SQL's CAST() functionality, for example SELECT CAST(attribute6 AS int), attribute5 FROM DATA WHERE CAST(attribute6 AS int) > 2017 AND attribute5 = 'A';, so we can actually support most principles of SQL.
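For concreteness, a rough sketch of those two steps in SQL; the META/FIELDS column names (meta_id, attribute_column, tableid) and the :cars_table_id placeholder are assumptions for illustration, not the actual schema:

-- 1. Resolve logical field names to physical attribute columns
SELECT f.name, f.attribute_column
FROM META m
JOIN FIELDS f ON f.meta_id = m.id
WHERE m.table_id = :cars_table_id;

-- 2. Generated query against the shared DATA table
SELECT attribute2              AS brand,
       attribute5              AS class,
       CAST(attribute6 AS int) AS year,
       attribute7              AS price
FROM DATA
WHERE tableid = :cars_table_id
  AND CAST(attribute6 AS int) > 2017
  AND attribute5 = 'A';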
However, moving forward we are a bit scared about:
Managing such an environment for more tenants as we add more tables (e.g. 50 per customer, with roughly 1-5 million rows per TABLE; 5 million is the maximum we allow, for bigger data we have BigQuery), which gives us 50-250 million rows in a single DATA_X table. This might affect query performance, especially since we expose simple WHERE statements (less, equal, null etc.) through an abstraction language, e.g. GET CARS [BRAND,CLASS,PRICE...] FILTER [EQ(CLASS,A),MT(YEAR,2017)], developed to be similar to JQL (Jira Query Language).
Transaction locking: we allow batch CSV uploads into DATA_X, so once someone loads e.g. 1 GB of data, it essentially locks the table for other systems accessing the DATA table.
Keeping multiple NULL columns, which can affect space a bit (for now we are not too worried: at TABLE creation the customer can decide how many columns he wants, and based on that we assign the TABLE to one of the hardcoded entities DATA_5, DATA_10, DATA_15, DATA_20, DATA_30 or DATA_50, where the number corresponds to the limit on attribute columns; those entities are separate, and we also support a migration option if they decide to switch from 5 to 10 attributes etc.).
We are at a super early stage, so we can/should make these changes before we scale. We knew this was most likely not the best approach, but we kept it to get the project running for small customers, for whom it is working just fine so far.
We were also thinking about JSONB objects, but that is not an option, as we want to keep retrieving the data simple.
What do you think about this solution (FYI, DATA has a composite PRIMARY KEY of (ID, TABLEID) and a built-in CreatedAt column which is used for most of the queries, so there will be at most 3 indexes)?
If it seems bad, what would you recommend as an alternative, based on the details I shared (basically a schema-less RDBMS)?
IMHO, I anticipate issues when you want to join tables and also when using CAST etc.
We followed the approach below, which may be of help to you.
We have a table called Cars and also a couple of tables like CarsMeta and CarsExtension. The underlying Cars table has all the fields that are common to all tenants. The CarsMeta table points out which types of columns you can have for extending the Cars entity. In the CarsExtension table, you have columns like StringCol1...5, IntCol1...5, LongCol1...10.
In this way, you can also easily filter data, like so:
If you have a filter on the base table, perform the search; if results are found, match the IDs to the CarsExtension table to get the list of extended rows for this entity.
If the filter is on the extended fields, do a search on the extension table and match against the base entity IDs.
We have the extension table organized like below:
id - UniqueId
entityid - uniqueid (points to the primary key of the entity)
StringCol1 - string,
...
IntCol1 - int,
...
In this case, it will be easy to do a join for the entity and then get the data along with the extension fields.
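A rough sketch of those two lookup directions in SQL; the Brand column is assumed to be one of the common base-table fields, and the key names follow the layout above:

-- Filter on a base-table field, then pull the matching extension rows
SELECT c.*, e.StringCol1, e.IntCol1
FROM Cars AS c
JOIN CarsExtension AS e ON e.entityid = c.id
WHERE c.Brand = 'BMW';

-- Filter on an extended field, then match back to the base entity
SELECT c.*
FROM CarsExtension AS e
JOIN Cars AS c ON c.id = e.entityid
WHERE e.IntCol1 > 2017;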
If the table metadata and the data have to be inferred from separate tables, it will be difficult to maintain over a long period of time and with a huge volume of data.
HTH

What is the proper way to insert data into multiple separate tables when inserting into a table?

For example I have a table called product_list, which holds a list of products.
If I insert 1 row of data into product_list, part of the data (such as product_id & product name) should also be inserted in another table like product_price which holds the price for all products (new products would have 0 or NULL values for their price).
My question here is the method in approaching this. What is the proper way to do this?
My current ideas:
1 - Using a trigger to insert into the other tables like product_price, etc. whenever I insert product data into product_list (a rough sketch follows below)
2 - Using a function (stored procedure) like product_add to add a new product into each table.
Which method is better? Or if there is a better suggestion, I'd like to know about it. Thanks in advance.
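To illustrate option 1, a minimal trigger sketch, assuming PostgreSQL (11+) and the column names product_id, product_name and price; the function and trigger names are made up for illustration:

CREATE FUNCTION add_product_price() RETURNS trigger AS $$
BEGIN
    -- Create the matching price row with an empty price for the new product
    INSERT INTO product_price (product_id, product_name, price)
    VALUES (NEW.product_id, NEW.product_name, NULL);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER product_list_after_insert
AFTER INSERT ON product_list
FOR EACH ROW EXECUTE FUNCTION add_product_price();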

Cassandra: Column Families for complex queries?

Every source tells me that supporting complex queries in Cassandra is complicated and that you usually need to create a new Column Family to support specific queries (like JOINs in a relational database).
I don't understand why you would actually need another Column Family for a query.
An example of this was demonstrated by IBM here: http://www.ibm.com/developerworks/library/os-apache-cassandra/
The system has Books with the following columns: Author,Price, tag1, tag2, tag...
If I wanted to perform a query like "Get all authors that have written books with the tag sci-fi", they recommend creating a column family called TagsToAuthor. Why is this necessary? I believe you could use the following two solutions without creating a new column family:
Create a Tag column family with the columns: Book1, Book2, Book..., Author1, Author2, Author...
Create a Tag column family & create a BookTag column family that contains the columns book_id & tag_id. Although Cassandra doesn't have join functionality, you can simply get the tag ID from the Tag column family, then get the list of book_ids by querying BookTag, then use those IDs to query Book, just like you would in a normal relational database.
What are the disadvantages to these solutions?

What are the proper use-cases for the PostgreSQL Array Datatype?

It seems to me that the functionality of the PostgreSQL array datatype overlaps a lot with the standard one-to-many and many-to-many relationships.
For example, a table called users could have an array field called "favorite_colors", or there could be a separate table called "favorite_colors" and a join table between "users" and "favorite_colors".
In what cases is the array datatype OK to use instead of a full-blown join?
An array should not be used like a relation. It should rather contain indexed values that relate to one row very tightly. For example, if you had a table with the results of a football match, you would not need to do
id team1 team2 goals1 goals2
but could instead do
id team[2] goals[2]
because in this example most would also consider it silly to normalize this into two tables.
So all in all, I would use it in cases where you are not interested in making relations and where you would otherwise add fields like field1, field2, field3.
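In DDL, that layout could look roughly like this (the table name and types are assumptions):

CREATE TABLE match_result (
    id     integer PRIMARY KEY,
    team   text[],     -- team[1], team[2]
    goals  integer[]   -- goals[1], goals[2]
);

-- Goals scored by the second team in match 7 (PostgreSQL arrays are 1-based)
SELECT goals[2] FROM match_result WHERE id = 7;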
One incredibly handy use case is tagging:
CREATE TABLE posts (
    title TEXT,
    tags TEXT[]
);
-- Select all posts with tag 'kitty'
SELECT * FROM posts WHERE tags @> '{kitty}';
I totally agree with @marc. Arrays are to be used when you are absolutely sure you don't need to create any relationship between the items in the array and any other table. They should be used for a tightly coupled one-to-many relationship.
A typical example is building a multiple-choice question system. Since other questions don't need to be aware of a question's options, the options can be stored in an array.
e.g.
CREATE TABLE Question (
    id integer PRIMARY KEY,
    question TEXT,
    options VARCHAR(255)[],
    answer VARCHAR(255)
);
This is much better than creating a question_options table and getting the options with a join.
The PostgreSQL documentation gives good examples:
CREATE TABLE sal_emp (
    name text,
    pay_by_quarter integer[],
    schedule text[][]
);
The above command will create a table named sal_emp with a column of type text (name), a one-dimensional array of type integer (pay_by_quarter), which represents the employee's salary by quarter, and a two-dimensional array of text (schedule), which represents the employee's weekly schedule.
Or, if you prefer:
CREATE TABLE tictactoe (
    squares integer[3][3]
);
If I want to store a set of similar data where the values don't have any other attributes, I prefer to use arrays.
One example is storing contact numbers for a user: when we want to store contact numbers, usually a main one and an alternate one, I prefer to use an array.
CREATE TABLE students (
    name text,
    contacts varchar ARRAY -- or varchar[]
);
But if the data has additional attributes, say you are storing cards:
a card can have an expiry date and other details.
Also, storing tags as an array is a bad idea; a tag can be associated with multiple posts.
Don't use arrays in such cases.
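For card-like data, a separate table is the better fit; a minimal sketch, where the table and column names are assumptions for illustration:

CREATE TABLE cards (
    id          serial PRIMARY KEY,
    owner_id    integer,   -- FK to whatever table owns the card (assumes it has an id column)
    card_number text,
    expiry_date date
);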

Insert a record into table in ms access

I need to create a small "application" using MS Access 2007. All I need is to create a form that will take care of input/output related to a few DB tables.
I created the tables: patients, treatments and labresults.
Primary key in patients table is ID, and it serves as a foreign key in treatments and labresults tables, where it is named patientID.
I also created the form that I mentioned at the beginning of this question. It has multiple tabs. The 1st tab opens/creates the patient, and the second one is used to enter data into the labresults table. The same applies to the treatments table.
My problem: I added a button to the 2nd tab, and I want to attach an 'action' that will build an INSERT query with values taken from the fields (controls) in tab 2, together with the ID field taken from tab 1 (which corresponds to the patient ID from the patients table), and then execute the query.
Right now I'm trying to achieve this myself, but with little success. Also, searching the MS site for solutions is kind of hard, since it always shows results that have 'query' in them :)... And a query isn't something I want to use. (However, I'll accept any solution.)
Thx
Tables:
patients
ID - primary key, autogenerated
patientID - internal number of the patient record. I could've used this, but it would complicate my life later on :)
gender
age
dateOfDiagnose
- field names are actually in Serbian, but the field names aren't that important
labtests
ID - primary key
patientID - foreign key, from patients table
... bunch of numerical data :)
There are 2 more tables, but they basically reflect some additional info and are not as important.
My form needs to enable the user to enter data about the patient and then enter several rows in the labtests table as treatment progresses. There are 2 types of treatment, and there is a table related to that, but it only has a few fields in it, containing info about the start of the treatment; the rest is info about lab tests.
It is quite possible to run SQL under VBA, but I suspect that it may be easier to simply set up the form properly. For example, if you set up the ID as the Link Child & Master fields for the lab results subform (which I hope you are using), then it will be filled in automatically. Furthermore, it is also possible to set the default value of a control to the value that was previously entered with very little code. I therefore suggest that you add some notes on what you wish to achieve.
Some further notes based on subsequent comment
From your comments, it appears that you have a minimum of three relevant tables:
Patients
PatientID PK, Counter
Treatments
TreatmentID PK, Counter
PatientID FK
TreatmentTypeID FK
LabResults
LabResultID PK, Counter
TreatmentID FK
PatientID FK <-- Optional, can be got through TreatmentID
LabResultTypeID FK
In addition, you will need a TreatmentTypes table that lists the two current treatments and allows for further treatment types:
TreatmentTypes
TreatmentTypeID PK, Counter
TreatmentDescription
You will also need:
LabResultTypes
LabResultTypeID PK, Counter <-- This can be referenced in the LabResults table
TreatmentTypeID FK
There are arguments for PKs other than those suggested, in that you have natural keys, but I think it is easier when working with Access to use autonumbers; however, you will need indexes for the natural keys.
I strongly recommend that all tables also include date created, date modified/updated and a user ID for both. Access does not keep this data, you must do it yourself. Date created can be set by a default value, and if you are using the 2010 version, the other fields can have 'triggers' (data-level macros), otherwise, you will need code, but it is not too difficult. Note also that these same 'triggers' could be used to insert relevant records : Meet the Access 2010 macro designer.
Your form will contain records for each patient and two subforms, Treatments and LabResults. The Treatments subform is bound to the Treatments table having PatientID for LinkChild and Master fields and a combobox for TreatmentTypeID:
Rowsource: SELECT TreatmentTypeID, TreatmentDescription FROM TreatmentTypes
ColumnCount: 2
BoundColumn: 1
After a treatment type is added to the Treatments subform, you can run a query or run SQL under VBA. You can either use the After Insert event of the form contained in the Treatments subform, or a button; the After Insert event has advantages, but it is only triggered when the user saves the record or moves from the record, which is the same thing. Working from the Treatments subform and the After Insert event, the SQL would be along the lines of:
''Reference Microsoft DAO x.x Object Library
Dim db As DAO.Database
Dim sSQL As String

Set db = CurrentDb
'Insert one LabResults row per lab result type defined for this treatment type
sSQL = "INSERT INTO LabResults (TreatmentID, PatientID, LabResultTypeID) " _
    & "SELECT " _
    & Me.TreatmentID & ", " _
    & Me.PatientID & ", " _
    & "LabResultTypeID FROM LabResultTypes " _
    & "WHERE TreatmentTypeID = " & Me.TreatmentTypeID
db.Execute sSQL
MsgBox "Records inserted: " & db.RecordsAffected
Me.Parent.LabResults_Subform.Form.Requery
The query should also include the date and username, as suggested above, if your version is less than 2010 and you have not set up triggers.
You will be tempted to use datasheets for the subforms, I suspect. I suggest you resist the temptation and use either single forms or continuous forms; they are more powerful. For an interesting approach, you may wish to look at the Customer Orders form in the Northwind sample database.