PowerBI: Appending tables with calculated columns

PowerBI: Appending tables with calculated columns - append

i have uploaded into Power BI 4 Tables: 1) Exchange Rates 2) Ledger Tree 3) Actuals results 4) Forecast.
Tables 3 and 4 had different columns or information (eg. 3 showed values by ledger in usd and 4 in local currencies; table 3) is not showing the ledger currency while Table 4 does), so I had to add in both several calculated columns (including looking values from tables 1 and 2) to make them look alike.
I would like now to put table 3 and 4 above one another one but I am not sure how to do it?
Any help is much appreciated

In Power Query to Union tables use append query
When selecting this, you can add to the existing query, or create a new table from the appended tables. If you do append you can select the old tables no longer used, and turn off 'Enable Load' so the reference table isn't loaded into the Power Pivot
In DAX, you can union tables using the Union function

Related

PostgreSQL - 100 million records transfer from archive to a new table

I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Table A and B are archive tables without any indexes. (Millions of records)
Table X and Y are a replica of A and B with good indexes. (Some thousands of records)
Below is the code for my project.
with data as
(
SELECT a.*, b.* FROM A_archive a
join B_archive b where a.transaction_id = b.transaction_id
UNION
SELECT x.*, y.* FROM X x
join Y y where x.transaction_id = y.transaction_id
)
INSERT INTO
Another_Table
(
columns
)
select * from data
On Conflict(transaction_id)
do udpate ...
The above whole thing is running in production environment and has nearly 140 million records.
Due to this production database is taking almost 10 hours to process the data and failing.
I am also having a distributed job scheduler in AWS to schedule this query inside a function and retrieve the latest records every 5 hours. The archive tables store closed invoice data. Pega UI will be using this table for retrieving data about closed invoices and showing to the customer.
Please suggest something that is a bit more performant.

UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication. It will save the s**tton of data shuffling and comparisons required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.

That UNION must be costly, it pretty much builds a temp-table in the background containing all the A-B + X-Y records, sorts it (over all fields) and then removes any doubles. If you say 100 million records are involved then that's a LOT of sorting going on that most likely will involve swapping out to disk.
Keep in mind that you only need to do this if there are expected duplicates
in the result from the JOIN between A and B
in the result from the JOIN between X and Y
in the combined result from the two above
IF neither of those are expected, just use UNION ALL
In fact, in that case, why not have 1 INSERT operation for A-B and another one for X-Y? Going by the description I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
Also, as mentioned by O.Jones, archive tables or not, they should come at least with a (preferably clustered) index on the transaction_id fields you're JOINing on. (same for the Another_Table btw)
All that said, processing 100M records in 1 transaction IS going to take some time, it's just a lot of data that's being moved around. But 10h does sound excessive indeed.

column values change between loading two partioned tables in KDB (q)

I have two partioned kdb tables on disk (one called trades, one called books). I created the data by
using
.Q.dpft[`:I:/check/trades/;2020.01.01;`symTrade;`trades]
and
.Q.dpft[`:I:/check/books/;2020.01.01;`sym;`books]
for each day. If I select data from the trades table and then load the books table (without selecting data) the values in the symTrade columns of my result change to new values. I assume it has got something to do with the paritioning in the books table getting applied to the result from trades table (also the trades table is no longer accessible after loading the books table).
How do I:
keep the trades table accessible after loading the books table?
avoid having my symTrade column overwritten by the sym values in
the books table?
Here is an example:
system "l I:/check/trades/";
test: 10 sublist select from trades where date=2020.01.01;
show cols test;
// gives `date`symTrade`time`Price`Qty`Volume
select distinct symTrade from test;
// gives TICKER1
// now loading another table
system "l I:/check/books";
select distinct symTrade from test;
// now gives a different value e.g. TICKER200

I think the problem is that you are saving these tables to two different databases.
The first argument in .Q.dpft is the path to the root of the database, and the fourth argument is the name of the table you want to store. So when you do
.Q.dpft[`:I:/check/trades/;2020.01.01;`symTrade;`trades]
You are storing the trades table in a database in I:/check/trades and when you do
.Q.dpft[`:I:/check/books/;2020.01.01;`sym;`books]
you are storing the books table in a database in I:/check/books. I think q can only load in one database at a time, so that might be the problem.
Try doing this
.Q.dpft[`:I:/check/;2020.01.01;`symTrade;`trades]
.Q.dpft[`:I:/check/;2020.01.01;`sym;`books]
system "l I:/check/";
Let us know if that works!

Can redshift stored procs be used to make a date range UNION ALL query

Since redshift does not natively support date partitioning, other than in redshift spectrum, all our tables are date partitioned
my_table_name_YYYY_MM_DD
So every time we do queries it's usually looks like this
select columns, i, want from
(select * from tbl1_date UNION ALL
select * from tbl2_date UNION ALL
select * from tbl3_date UNION ALL
select * from tbl4_date);
Where there's one UNION ALL per day.
Can stored procedures generate a date rangeso our business analysts stop losing their hair when I send them a python or bash script to generate the date range?

Yes, you could create a stored procedure that generates dynamic SQL using only the needed tables. See my answer here for a template to start from: Issue with passing column name as a parameter to "PREPARE" in Redshift
However, you should be aware that Redshift is able to achieve most of what you want automatically using a "Time Series Table" view. This documented here:
Using Time Series Tables
Use Time-Series Tables
You define a view that is composed of a UNION ALL over a sequence of identical tables with a sort key defined on a commonly filtered date or timestamp column. When you query that view Redshift is able to eliminate the scans on any UNION'ed tables that would not contain relevant data.
For example:
CREATE OR REPLACE VIEW store_sales_vw
AS SELECT * FROM store_sales_1998
UNION ALL SELECT * FROM store_sales_1999
UNION ALL SELECT * FROM store_sales_2001
UNION ALL SELECT * FROM store_sales_2002
UNION ALL SELECT * FROM store_sales_2003
;
SELECT cd.cd_education_status
,COUNT(*) sales_count
,AVG(ss_quantity) avg_quantity
FROM store_sales_vw vw
JOIN customer_demographics cd
ON vw.ss_cdemo_sk = cd.cd_demo_sk
WHERE ss_sold_ts BETWEEN '1999-09-01' AND '2000-08-31'
GROUP BY cd.cd_education_status
In this example Redshift will only use the store_sales_1999 and store_sales_2000 tables, skipping the other tables in the view. Note that the table skipping is not based the name of the table. Redshift knows the MIN and MAX values of the sort key timestamp in each table.
If you purse this approach please be sure to keep the total size of the UNION fairly low. I recommend (at most) daily tables for the last week [7], weekly tables for the last month [5], quarterly tables for the last year [4], and then yearly tables for older data.
You can use ALTER TABLE … APPEND to merge the daily tables in weekly tables and so on.

SQL (Redshift) to get the intersect of multiple tables

I'm using Redshift and have 6 tables of IDs in. I want to get the intersect between each of the tables.
So my final output would look something like this:
Table 1 & Table 2 have 10% common IDs
Table 1 & Table 3 have 50% common IDs
.....
.....
Table 6 & Table 4 have 20% common IDs
Table 6 & Table 5 have 3% common IDs
I can easily get the data, but it would be a lot of repeating the same SQL, so I've tried to create some tables of all the IDs and tables they are in but I'm stuck as to what to get the data in one or two SQL's.
Any ideas welcome!

you could try to full join all these tables by ID in a subquery and then use conditional aggregate so that Table 1 & Table 2 have 10% common IDs would be expressed as
100.0*sum(case when id1 is not null and id2 is not null then 1 end)/count(id1)
(taking Table 1 row count as denominator)

How to duplicate partition content?

I'm trying to set-up a testing environment for performance testing, currently we have a table with 8 million records and we want to duplicate this records for 30 days.
In other words:
- Table 1
--Partition1(8 million records)
--Partition2(0 records)
.
.
--Partition30(0 records)
Now I want to take the 8 million records in Partition1 and duplicate them across the rest of partitions, the only difference that they have is a column that contains a DATE. This column should vary 1 day in each copy.
Partition1(DATE)
Partition2(DATE+1)
Partition3(DATE+2)
And so on.
The last restrictions are that there are 2 indexes in the original table and they must be preserved in the copies and Oracle DB is 10g.
How can I duplicate this content?
Thanks!

It seems to me to be as simple as running as efficient an insert as possible.
Probably if you cross-join the existing data to a list of integers, 1 .. 29, then you can generate the new dates you need.
with list_of_numbers as (
select rownum day_add
from dual
connect by level <= 29)
insert /*+ append */ into ...
select date_col + day_add, ...
from ...,
list_of_numbers;
You might want to set NOLOGGING on the table, since this is test data.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

PowerBI: Appending tables with calculated columns - append

Related

PostgreSQL - 100 million records transfer from archive to a new table

column values change between loading two partioned tables in KDB (q)

Can redshift stored procs be used to make a date range UNION ALL query

SQL (Redshift) to get the intersect of multiple tables

How to duplicate partition content?

Categories

Resources