How to duplicate partition content? - oracle10g

I'm trying to set up a testing environment for performance testing. Currently we have a table with 8 million records, and we want to duplicate these records across 30 days.
In other words:
- Table 1
-- Partition1 (8 million records)
-- Partition2 (0 records)
-- ...
-- Partition30 (0 records)
Now I want to take the 8 million records in Partition1 and duplicate them across the rest of the partitions. The only difference between the copies is a column that contains a DATE; this column should increase by 1 day in each copy.
Partition1(DATE)
Partition2(DATE+1)
Partition3(DATE+2)
And so on.
The last restrictions are that the original table has 2 indexes that must be preserved in the copies, and that the Oracle DB is 10g.
How can I duplicate this content?
Thanks!

It seems to me to be as simple as running as efficient an insert as possible.
Probably if you cross-join the existing data to a list of integers, 1 .. 29, then you can generate the new dates you need.
insert /*+ append */ into ...
with list_of_numbers as (
  select rownum day_add
  from dual
  connect by level <= 29
)
select date_col + day_add, ...
from ...,
     list_of_numbers;
You might want to set NOLOGGING on the table, since this is test data.
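For illustration, a fuller sketch with made-up names (table_1 as the partitioned table, created_date as the partition-key column, col1/col2 for the remaining columns, PARTITION1 as the source partition) could look like the following. Because the copies go into the same table, its two indexes automatically cover the new rows as well.
alter table table_1 nologging;

insert /*+ append */ into table_1 (created_date, col1, col2)
with list_of_numbers as (
  select rownum as day_add
  from dual
  connect by level <= 29
)
select t.created_date + n.day_add, t.col1, t.col2
from table_1 partition (partition1) t
     cross join list_of_numbers n;

commit;
alter table table_1 logging;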

Related

PostgreSQL - 100 million records transfer from archive to a new table

I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Tables A and B are archive tables without any indexes (millions of records).
Tables X and Y are replicas of A and B with good indexes (some thousands of records).
Below is the code for my project.
with data as
(
  SELECT a.*, b.* FROM A_archive a
  JOIN B_archive b ON a.transaction_id = b.transaction_id
  UNION
  SELECT x.*, y.* FROM X x
  JOIN Y y ON x.transaction_id = y.transaction_id
)
INSERT INTO Another_Table
(
  columns
)
SELECT * FROM data
ON CONFLICT (transaction_id)
DO UPDATE ...
The whole thing above runs in a production environment against nearly 140 million records. Because of this, the production database takes almost 10 hours to process the data and then fails.
I also have a distributed job scheduler in AWS that runs this query inside a function and retrieves the latest records every 5 hours. The archive tables store closed invoice data. The Pega UI will use this table to retrieve data about closed invoices and show it to the customer.
Please suggest something that is a bit more performant.
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication; it will save the enormous amount of data shuffling and comparisons required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
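To make the VIEW and indexing suggestions concrete, here is a minimal sketch; invoice_date and amount are stand-in column names, only transaction_id comes from the question:
CREATE INDEX IF NOT EXISTS a_archive_txn_idx ON A_archive (transaction_id);
CREATE INDEX IF NOT EXISTS b_archive_txn_idx ON B_archive (transaction_id);

CREATE VIEW closed_invoices AS
SELECT a.transaction_id, a.invoice_date, b.amount
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
UNION ALL
SELECT x.transaction_id, x.invoice_date, y.amount
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id;

-- A selective query can then use the indexes through the view:
-- SELECT * FROM closed_invoices WHERE transaction_id = 12345;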
That UNION must be costly: it pretty much builds a temp table in the background containing all the A-B and X-Y records, sorts it (over all fields), and then removes any duplicates. If 100 million records are involved, that's a LOT of sorting, which will most likely spill out to disk.
Keep in mind that you only need UNION's deduplication if duplicates are expected:
- in the result from the JOIN between A and B
- in the result from the JOIN between X and Y
- in the combined result from the two above
If none of those are expected, just use UNION ALL.
In fact, in that case, why not have 1 INSERT operation for A-B and another one for X-Y (see the sketch after this answer)? Going by the description I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
Also, as mentioned by O.Jones, archive tables or not, they should come at least with a (preferably clustered) index on the transaction_id fields you're JOINing on. (same for the Another_Table btw)
All that said, processing 100M records in 1 transaction IS going to take some time, it's just a lot of data that's being moved around. But 10h does sound excessive indeed.
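A rough sketch of that two-INSERT idea, again with invoice_date and amount standing in for the real columns (if a single join branch can produce the same transaction_id twice, deduplicate it first, or ON CONFLICT DO UPDATE will refuse to touch a row a second time):
-- A-B first, X-Y second, so the X-Y rows overrule the A-B rows via ON CONFLICT.
INSERT INTO Another_Table (transaction_id, invoice_date, amount)
SELECT a.transaction_id, a.invoice_date, b.amount
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
ON CONFLICT (transaction_id) DO UPDATE
SET invoice_date = EXCLUDED.invoice_date,
    amount = EXCLUDED.amount;

INSERT INTO Another_Table (transaction_id, invoice_date, amount)
SELECT x.transaction_id, x.invoice_date, y.amount
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id
ON CONFLICT (transaction_id) DO UPDATE
SET invoice_date = EXCLUDED.invoice_date,
    amount = EXCLUDED.amount;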

how to test if a postgres partition has been populated or not

How can I (quickly) test if a postgres partition has any rows in it?
I have a partitioned postgres table 'TABLE_A', partitioned by date range. The name of each individual partition indicates the date range, e.g. TABLE_A_20220101 (1st Jan this year), TABLE_A_20220102 (2nd Jan 2022).
The table includes many years of data, so it includes several thousand individual partitions, each partition contains many millions of rows.
Is there a quick way of testing if a partition has any data in it? There are several solutions I've found, but they all involve count(*) and all take ages.
Please note - I'm NOT trying to accurately determine the row-count, just determine if each partition has any rows in it.
You can use an exists condition:
select exists (select * from partition_name limit 1)
That will return true if partition_name contains at least one row.
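If you want to run that test across every partition at once, a sketch along these lines should work; it assumes the parent table is named table_a and walks pg_inherits to probe each child:
DO $$
DECLARE
    part     regclass;
    has_rows boolean;
BEGIN
    FOR part IN
        SELECT inhrelid::regclass
        FROM pg_inherits
        WHERE inhparent = 'table_a'::regclass
    LOOP
        -- regclass renders a safely quoted relation name, so %s is fine here
        EXECUTE format('SELECT EXISTS (SELECT 1 FROM %s)', part) INTO has_rows;
        RAISE NOTICE '%: %', part, has_rows;
    END LOOP;
END $$;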

How to remove sort phase in spark dataframe join?

I created a bucketed table using the command below in Spark:
df.write.bucketBy(200, "UserID").sortBy("UserID").saveAsTable("topn_bucket_test")
Size of Table : 50 GB
Then I joined another table (say t2, size: 70 GB, bucketed as before) with the above table on the UserID column. I found that in the execution plan the table topn_bucket_test was being sorted (but not shuffled) before the join, whereas I expected it to be neither shuffled nor sorted before the join, since it was bucketed. What can be the reason, and how can I remove the sort phase for topn_bucket_test?
As far as I know, it is not possible to avoid the sort phase. Even when using the same bucketBy call, it is unlikely that the physical bucketing will be identical in both tables. Imagine the first table having UserID values ranging from 1 to 1000 and the second from 1 to 2000: different UserIDs might end up in the 200 buckets, and within those buckets there might be multiple different (and unsorted!) UserIDs.

Postgresql: checking if values exist in full table efficiently

We have a transaction table of sales to customers with over 2000 million rows on Redshift. Every month's transactions add 5 million rows. For MIS (the monthly 5 million rows only), I need to check whether a customer is new based on mobile number, i.e. whether the mobile number already exists in the 2000-million-row table, without joining on the full table, so that my query remains efficient.
What I have tried:
newtable = SELECT DISTINCT mobile_no, 'old' as category FROM table
maintable = SELECT maintable.*, coalesce(nq.category, 'new')
FROM maintable as maintable
LEFT JOIN (newtable) as nq ON nq.mobile_no = maintable.mobile_no;
This is very slow and takes over 50 mins. I also tried
SELECT exists (SELECT 1 FROM newtable WHERE mobile_no = maintable.mobile_no LIMIT 1) as category, but this gives an 'out of memory' error.
Amazon Redshift is a data warehouse, so by design it won't be fast on queries like this. If you will be doing analysis on the data and expect a faster result, you might want to explore other products they offer, such as EMR, to run your queries faster.
Here is a reference on what each service's intention is: https://aws.amazon.com/big-data/datalakes-and-analytics/

Slow SQL Server 2008 R2 performance?

I'm using SQL Server 2008 R2 on my development machine (not a server box).
I have a table with 12.5 million records. It has 126 columns, half of which are int. Most columns in most rows are NULL. I've also tested with an EAV design which seems 3-4 times faster to return the same records (but that means pivoting data to make it presentable in a table).
I have a website that paginates the data. When the user tries to go to the last page of records (last 25 records), the resulting query is something like this:
select * from (
select
A.Id, part_id as PartObjectId,
Year_formatted 'year', Make_formatted 'Make',
Model_formatted 'Model',
row_number() over ( order by A.id ) as RowNum
FROM vehicles A
) as innerQuery where innerQuery.RowNum between 775176 and 775200
... but this takes nearly 3 minutes to run. That seems excessive? Is there a better way to structure this query? In the browser front-end I'm using jqGrid to display the data. The user can navigate to the next, previous, first, or last page. They can also filter and order data (example: show all records whose Make is "Bugatti").
vehicles.Id is int and is the primary key (clustered ASC). part_id is int, Make and Model are varchar(100) and typically only contain 20 - 30 characters.
Table vehicles is updated ~100 times per day in individual transactions, and 20 - 30 users use the webpage to view, search, and edit/add vehicles 8 hours/day. It gets read from and updated a lot.
Would it be wise to shard the vehicles table into multiple tables only containing say 3 million records each? Would that have much impact on performance?
I see lots of videos and websites talking about people having tables with 100+ million rows that are read from and updated often without issue.
Note that the performance issues I observe are on my own development computer. The database has a dedicated 16GB of RAM. I'm not using SSD or even SCSI for that matter. So I know hardware would help, but 3 minutes to retrieve the last 25 records seems a bit excessive no?
Though I'm running these tests on SQL Server 2008 R2, I could also use 2012 if there is much to be gained from doing so.
Yes, there is a better way, even on older releases of MS SQL, but it is involved. First, this process should be done in a stored procedure. The stored procedure should take, as 2 of its input parameters, the page requested (@page) and the page size (number of records per page, @pgSiz).
In the stored procedure,
Create a temporary table variable and put into it a sorted list of the integer Primary Keys for all the records, with a rowNumber column that is itself an indexed, integer, Primary Key for the temp table
Declare @PKs table
    (rowNo integer identity(1,1) not null primary key,
     vehicleId integer not null)

Insert @PKs (vehicleId)
Select vehicleId from Vehicles
Order By --[Here put sort criteria as you want pages sorted]
         --[Try to only include columns that are in an index]
Then, based on which page (and page size) the user requested (@page, @pgSiz), the stored proc selects the actual data for that page by joining to this temp table variable:
Select [The data columns you want]
From @PKs p join Vehicles v
     on v.vehicleId = p.vehicleId
Where p.rowNo between @page*@pgSiz+1 and (@page+1)*@pgSiz
order by p.rowNo -- if you want the page of records sorted on the server
assuming @page is 0-based. Also, the stored proc will need some input argument validation to ensure that the @page and @pgSiz values are reasonable (do not take the code past the end of the records).
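Put together, a minimal sketch of such a procedure might look like this (the procedure name GetVehiclePage is made up, the columns follow the query in the question, and the fixed ORDER BY Id stands in for whatever sort the grid actually requests):
CREATE PROCEDURE dbo.GetVehiclePage
    @page  int,   -- 0-based page number
    @pgSiz int    -- records per page
AS
BEGIN
    SET NOCOUNT ON;

    -- Basic argument validation; a production version would also clamp @page to the last page.
    IF @page < 0 SET @page = 0;
    IF @pgSiz <= 0 SET @pgSiz = 25;

    DECLARE @PKs table
        (rowNo     int identity(1,1) not null primary key,
         vehicleId int not null);

    INSERT @PKs (vehicleId)
    SELECT Id
    FROM dbo.Vehicles
    ORDER BY Id;   -- replace with the sort criteria the page was requested in

    SELECT v.Id, v.part_id, v.Year_formatted, v.Make_formatted, v.Model_formatted
    FROM @PKs p
    JOIN dbo.Vehicles v ON v.Id = p.vehicleId
    WHERE p.rowNo BETWEEN @page * @pgSiz + 1 AND (@page + 1) * @pgSiz
    ORDER BY p.rowNo;
END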