Merge two datasets duplicate BY variables Or I want to make following form - merge

I am a novice in SAS program.
I have a question about merging two dataset.
The two data sets look like (please click this Image link (Excel sheet image):
Please let me know key concepts or code to make this happen!
I have searched the answer through Googling etc., but there is no site that exactly solve what I want.
(If it is possible to tackle above question without PROC SQL.)

To get the desired result you should do a cartesian product (Cross join) which returns all the rows in all tables. Each row in table1 is paired with all the rows in table2. I have used Proc SQL to do this and I am eager to see how this can be done using Data step. Here's what I know,
Proc Sql;
create table test_merge as
select a.*, b.type_rhs, b.rhs1, b.rhs2
from test a, test11 b
where a.yearmonth=b.yearmonth
;
quit;
Again, I am new to SAS as well and I think this is one of the ways to create the desired output.
When working with huge data, you will see a note in log that says "The execution of this query involves performing one or more Cartesian product joins that can not be optimized."

Related

Reference foreign keys using SSIS-Lookup

I am asking for help on the following topic. I am trying to create an ETL process with two Excel data sources (S1 ~300 rows and S2 ~7000 rows). S1 contains project information and employee details and S2 contains the amount of hours, which each employee worked in which project at a timestamp.
I want to insert the amount of hours, which each employee worked in each project at a timestamp, into the fact table by referencing to the existing primary keys in the dimension tables. If an entry is not present in the dimension tables already, i want to add a new entry first and use the newly generated id. The destination table structure looks as follows (Data Warehouse, Star Schema):Destination Table Structure
In SSIS, i created three Data Flow tasks for filling the Dimension Tables (project, employee and time) with distinct values (using group by, as S1 and S2 contain a lot of duplicate rows)first, and a fourth data flow task (see image below) to insert the FactTable data, and this is where I'm running into problems:
Data Flow Task FactTable
I am using three LookUp functions to retrieve the foreignKeys project_id, employee_id and time_id from the Dimension tables (using project name, employee number and timestamp). If the id is found, it is passed on all the way to Merge Join 1, if not, a new Dimension Entry is created (lets say project) and the generated project_id passed on instead. Same goes for employee and time respectively.
There is two issues with this:
1) The "amount of hours" (passed by Multicast four, see image above) is not matched in the final result (No Match)
2) The amount of rows being inserted keeps increasing forever (Endless Join, I belive due to the Merge joins).
What I've tried:
I have used one UNION instead of three Merge Joins before, but this resulted in the foreign keys being in seperate rows each, instead of merged together.
I used Merge (instead of Merge Join) and combined the join as well as sort conditions in as I fell all possible ways.
I understand that this scenario might be confusing for everybody else, but thank your for taking time looking at it! Any help is greatly appreciated.
Solved it
For anybody having similar issues:
Seperate Data Flows for filling Dimension Tables with those filling Fact Tables will do the trick.
Its a clean solution and easier to debug.
Also: Dont run the LookUp Functions in parallel, but rather one after each other and pass on the attributes. Saves unnecessary Merges as well.
So as a Sum Up:
Four Data Flow Tasks, three for filling dimension tables ONLY and one for filling fact tables ONLY.
Loading Multiple Tables using SSIS keeping foreign key relationships
The answer posted by onupdatecascade is basically it.
Good luck!

Data Lake Analytics - Large vertex query

I have a simple query which make a GROUP BY using two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows but query takes several minutes to finish. Seeing the graph explorer I see it uses 2500 vertices because I'm having 2500 CodFactura+DateKey unique rows. I don't know if it normal ADAL behaviour. Is there any way to reduce the vertices number and execute this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns and you either have a predicate that reduces the number of rows per partition and/or you do not have uptodate statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into less, larger files.
You can also try to hint a small number of rows in your query that creates #table_facturas if the above does not help by adding OPTION(ROWCOUNT=4000).

merge features n time

Here is my issue, I'm working on a project of water supply where I have to merge line features representing pipes if they have the same material of construction and if they are touching each other. The merge is done two by two what it means that some features will be duplicated in some cases like described in the figure below :
this shows exactly my issue. After the merge I will get three records but what I want is just one record which encompasses the whole pipes that fill the conditions put in the where clause :
Here is the query that helps me do the merge :
drop table if exists touches_material;
create table touches_material as
select distinct a.*,
st_Union(a.geom,b.geom) as fusion from pipe a, pipe b
where a.id < b.id and a.material = b.material and
st_touches(a.geom,b.geom)
group by a.id,a.geom,b.geom
the following picture picture shows the expected result on a test data, it's realized via QGIS GIS software :
but this is what I' getting with my query :
if you have any idea about how to achieve the aim that I invoked, I would be very thankful to get an answer from you. Best regards.

Tableau Extract API with multiple tables in a database

I am currently experimenting with Tableau Extract API to generate some TDE from the tables I have in a PostgreSQL database. I was able to write a code to generate the TDE from single table, but I would like to do this for multiple joined tables. To be more specific, if I have two tables that are inner joined by some field, how would I generate the TDE for this?
I can see that if I am working with small number of tables, I could use a SQL query with JOIN clauses to create a one gigantic table, and generate the TDE from that table.
>> SELECT * FROM table_1 INNER JOIN table_2
INTO new_table_1
ON table_1.id_1 = table_2.id_2;
>> SELECT * FROM new_table_1 INNER JOIN TABLE_3
INTO new_table_2
ON new_table_1.id_1 = table_3.id_3
and then generate the TDE from new_table_2.
However, I have some tables that have over 40 different fields, so this could get messy.
Is this even a possibility with current version of the API?
You can read from as many tables or other sources as you want. Or use complex query with lots of joins, or create a view and read from that. Usually, creating a view is helpful when you have a complex query joining many tables.
The data extract API is totally agnostic about how or where you get the data to feed it -- the whole point is to allow you to grab data from unusual sources that don't have pre-built drivers for Tableau.
Since Tableau has a Postgres driver and can read from it directly, you don't need to write a program with the data extract API at all. You can define your extract with Tableau Desktop. If you need to schedule automated refreshes of the extract, you can use Tableau Server or its tabcmd command.
Many thanks for your replies. I am aware that I could use Tableau Desktop to define my extract. In fact, I have done this many times before. I am just trying to create the extracts using the API, because I need to create some calculated fields, which is near impossible to create using the Tableau Desktop.
At this point, I am hesitant to use JOINs in the SQL query because the resulting table would look too complicated to comprehend (some of these tables also have same field names).
When you say that I could read from multiple tables or sources, does that mean with the Tableau Extract API? At this point, I cannot find anywhere in this API that accommodates multiple sources. For example, I know that when I use multiple tables in the Tableau Desktop, there are icons on the left hand side that tells me that the extract is composed of multiple tables. This just doesn't seem to be happening with the API, which leaves me stranded. Anyways, thank you again for your replies.
Going back to the topic, this is something that I tried few days ago on my python code
try:
tdefile= tde.Extract("extract.tde")
except:
os.remove("extract.tde")
tdefile = tde.Extract("extract.tde")
tableDef = tde.TableDefinition()
# Read each column in table and set the column data types using tableDef.addColumn
# Some code goes here...
for eachTable in tableNames:
tableAdd = tdeFile.addTable(eachTable, tableDef)
# Use SQL query to retrieve bunch_of_rows from eachTable
for some_row in bunch_of_rows:
# Read each row in table, and set the values in each column position of each row
# Some code goes here...
tableAdd.insert(some_row)
some_row.close()
tdefile.close()
When I execute this code, I get the error that eachTable has to be called "Extract".
Of course, this code has its flaws, as there is no where in this code that tells how each table are being joined.
So I am little thrown off here, because it doesn't seem like I can use multiple tables unless I use JOINs to generate one table that contains everything.

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but, I want to throw it out there because it's a little bit different than any that I've read on stackoverflow.
Basic problem.
I'm working on moving from mysql to PostgreSQL 9.1.5 (hosted on Heroku). As a part of that, I need to import multiple CSV files everyday. Some of the data is sales information and is almost guaranteed to be new and need to be inserted. But, other parts of the data is almost guaranteed to be the same. For example, the csv files (note plural) will have POS (point of sale) information in them. This rarely changes (and is most likely only via additions). Then there is product information. There are about 10,000 products (vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (but is important), is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace that back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, then I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. final destination). So, I'm copying them into tables in the staging schema that match the CSV (note: one schema per datasource). The tables in the staging schemas will have a before insert row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, then it will try to insert first. If the record is already there, then it will return NULL (and stop the insert into the staging table). For tables that rarely change, then it will query the table and see if the record is found. If it is, then I need a way to see if any of the fields are changed. (because remember, I need to show that the record was modified by import x from file y) I obviously can just boiler plate out the code and test each column. But, was looking for something a little more "eloquent" and more maintainable than that.
In a way, what I'm kind of doing is combining a importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like the hstore might be a nice way of getting changes (and being able to easily ignore some columns in the table that aren't important - e.g. "last_modified")
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database. I could certainly write a python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and will lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
he system is grow and new data sources are likely to be added that will greatly increase the amount of data being processed (so, I'm trying to keep things efficient)
I know this is not nice, simple SO question (like "how to sort a list in python") but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts about how they think the best way to solve it is.
I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE ADD CONSTRAINT target_tmp_pkey PRIMARY KEY(target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE NOT EXISTS (
SELECT 1 FROM target_tmp t1
WHERE t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
Code untested, but should work in any modern version of PostgreSQL except for typos.