Polybase query from external table not honoring reject options - sql-server-2019

We are running SQL 2019 with CU 12 with an external data source that points to ADLS Gen2 storage account. We have two parquet files in the same directory where one file has 2 columns and the other file has 3 columns. We purposely did this to test the reject options knowing that our schemas will change over time.
/employee/file1.csv (2 columns/5 rows)
/employee/file2.csv (3 columns/5 rows)
Based on the documentation for reject options, we should be able to query across the external table and return non-dirty rows in the result set if reject rows fall within the reject configuration which is listed below.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
CREATE EXTERNAL TABLE [dbo].[Employee] (
[FirstName] varchar(100) NOT NULL,
[LastName] varchar(100) NOT NULL
)
WITH (LOCATION='/employee/',
DATA_SOURCE = DATA_LAKE,
FILE_FORMAT = ParquetFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1000000
);
When we select from the external table, I would expect to have it return the 5 rows from the one file that has 2 columns and reject the 5 rows from the file that contains 3 columns. Instead, we get no rows at all with the following exception.
Unexpected error encountered creating the record reader.
HadoopExecutionException: Column count mismatch. Source file has 3
columns, external table definition has 2 columns.
I feel like I must be missing something or my understanding of how reject options support file schema differences is incorrect. Can anyone shed any light on this?

The way Polybase works is that it first queries the schema on both files to see if they match the specification; since they don't, Polybase doesn't query one file successfully and fails on the other one. Both files must adhere to the external table specification first, and then you can have some records with two columns and other records with three columns. You can learn more about how Polybase works in my book "Hands-on data virtualization with Polybase".

Related

PowerBI: Appending tables with calculated columns

i have uploaded into Power BI 4 Tables: 1) Exchange Rates 2) Ledger Tree 3) Actuals results 4) Forecast.
Tables 3 and 4 had different columns or information (eg. 3 showed values by ledger in usd and 4 in local currencies; table 3) is not showing the ledger currency while Table 4 does), so I had to add in both several calculated columns (including looking values from tables 1 and 2) to make them look alike.
I would like now to put table 3 and 4 above one another one but I am not sure how to do it?
Any help is much appreciated
In Power Query to Union tables use append query
When selecting this, you can add to the existing query, or create a new table from the appended tables. If you do append you can select the old tables no longer used, and turn off 'Enable Load' so the reference table isn't loaded into the Power Pivot
In DAX, you can union tables using the Union function

How to avoid skewing in redshift for Big Tables?

I wanted to load the table which is having a table size of more than 1 TB size from S3 to Redshift.
I cannot use DISTSTYLE as ALL because it is a big table.
I cannot use DISTSTYLE as EVEN because I want to use this table in joins which are making performance issue.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our redshift cluster has 20 nodes.
So, I tried distribution key on a workday but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How to avoid the skew in such cases?
How we avoid skewing of the table in case of an uneven number of row counts for the unique key (let's say hour1 have 1million rows, hour2 have 1.5million rows, hour3 have 2million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either SORTKEY or COMPOUND SORTKEY. Sort Key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised choose a column that causes the least amount of skew as the DISTKEY. A column which has many distinct values, such as timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types, or days of week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Here is the architecture that I recommend
1) load to a staging table with dist even and sort by something that is sorted on your loaded s3 data - this means you will not have to vacuum the staging table
2) set up a production table with the sort / dist you need for your queries. after each copy from s3, load that new data into the production table and vacuum.
3) you may wish to have 2 mirror production tables and flip flop between them using a late binding view.
its a bit complex to do this you need may need some professional help. There may be specifics to your use case.
As of writing this(Just after Re-invent 2018), Redshift has Automatic Distribution available, which is a good starter.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in Answers POSTED earlier try a few combinations by replicating the same table with different DIST keys ,if you don't like what Automatic DIST is doing. After the tables are created run the admin utility from the git repos (preferably create a view on the SQL script in the Redshift DB).
Also, if you have good clarity on query usage pattern then you can use the following queries to check how well the sort key are performing using the below SQLs.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
of course , there is always room for improvement in the SQLs above, depending on specific stats that you may want to look at or drill down to.
Hope this helps.

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column in the table with a value containing a comma delimited string with up to 60 values into multiple rows with 1 value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL this processes 10 million records in just under 1 hour which is fine for my purposes. Obviously a direct conversation to Redshift isn't possible due to Redshift not having a Cross Apply or String Split functionality so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows) which utilizes split_part() to split the comma delimited string into multiple columns. Then another query that unions everything to get the final output into a single column. But the typical run takes over 6 hours to process the same amount of data.
I wasn't expecting to run into this issue just knowing the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM while the Redshift server is based on the dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift so I'm still assuming I'm not understanding something. But I have no idea what am I missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an adim so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations but unless I start looking into Cross Joins, something I'd like to avoid (Plus when I tried to talk to the DBA's running the Redshift cluster about this option their response was a flat "No, can't do that.") I'm not even sure where to go at this point so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a number table, for which you can take any table as long as it has more rows that the numbers you need. You need to make sure the numbers are int by forcing the type. Using the funcion regexp_count on the column to be split for the ON statement to count the number of fields (delimiter +1), will generate a table with a row per repetition.
Then you use the split_part function on the column, and use the number.num column to extract for each of the rows a different part of the text.
SELECT comma_delimited_string, numbers.num, REGEXP_COUNT(comma_delimited_string , ',')+1 AS nfields, SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN
(
select
(row_number() over (order by 1))::int as num
from
mytable
limit 15 --max num of fields
) as numbers
ON numbers.num <= regexp_count(comma_delimited_string , ',') + 1

Slow SQL Server 2008 R2 performance?

I'm using SQL Server 2008 R2 on my development machine (not a server box).
I have a table with 12.5 million records. It has 126 columns, half of which are int. Most columns in most rows are NULL. I've also tested with an EAV design which seems 3-4 times faster to return the same records (but that means pivoting data to make it presentable in a table).
I have a website that paginates the data. When the user tries to go to the last page of records (last 25 records), the resulting query is something like this:
select * from (
select
A.Id, part_id as PartObjectId,
Year_formatted 'year', Make_formatted 'Make',
Model_formatted 'Model',
row_number() over ( order by A.id ) as RowNum
FROM vehicles A
) as innerQuery where innerQuery.RowNum between 775176 and 775200
... but this takes nearly 3 minutes to run. That seems excessive? Is there a better way to structure this query? In the browser front-end I'm using jqGrid to display the data. The user can navigate to the next, previous, first, or last page. They can also filter and order data (example: show all records whose Make is "Bugatti").
vehicles.Id is int and is the primary key (clustered ASC). part_id is int, Make and Model are varchar(100) and typically only contain 20 - 30 characters.
Table vehicles is updated ~100 times per day in individual transactions, and 20 - 30 users use the webpage to view, search, and edit/add vehicles 8 hours/day. It gets read from and updated a lot.
Would it be wise to shard the vehicles table into multiple tables only containing say 3 million records each? Would that have much impact on performance?
I see lots of videos and websites talking about people having tables with 100+ million rows that are read from and updated often without issue.
Note that the performance issues I observe are on my own development computer. The database has a dedicated 16GB of RAM. I'm not using SSD or even SCSI for that matter. So I know hardware would help, but 3 minutes to retrieve the last 25 records seems a bit excessive no?
Though I'm running these tests on SQL Server 2008 R2, I could also use 2012 if there is much to be gained from doing so.
Yes there is a better way, even on older releases of MsSQL But it is involved. First, this process should be done in a stored procedure. The stored procedure should take as 2 of it's input parameters, the page requested (#page)and the page size (number of records per page - #pgSiz).
In the stored procedure,
Create a temporary table variable and put into it a sorted list of the integer Primary Keys for all the records, with a rowNumber column that is itself an indexed, integer, Primary Key for the temp table
Declare #PKs table
(rowNo integer primary key Identity not null,
vehicleId integer not null)
Insert #PKS (vehicleId)
Select vehicleId from Vehicles
Order By --[Here put sort criteria as you want pages sorted]
--[Try to only include columns that are in an index]
then, based on which page (and the page size), (#page, #pgSiz) the user requested, the stored proc selects the actual data for that page by joining to this temp table variable:
Select [The data columns you want]
From #PKS p join Vehicles v
on v.VehicleId = p.VehicleId
Where rowNo between #page*#pgSiz+1 and (#page+1)*#pgSiz
order by rowNo -- if you want to sort page of records on server
assuming #page is 0-based. Also, the Stored proc will need some input argument validation to ensure that the #page, #pgSize values are reasonable (do not take the code pas the end of the records.)

Redshift Copy and auto-increment does not work

I am using the COPY command from redshift to copy json data from S3.
The table definition is as follows:
CREATE TABLE my_raw
(
id BIGINT IDENTITY(1,1),
...
...
) diststyle even;
The command for copy i am using is as follows:
COPY my_raw FROM 's3://dev-usage/my/2015-01-22/my-usage-1421928858909-15499f6cc977435b96e610298919db26' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXX' json 's3://bemole-usage/schemas/json_schema' ;
I am expecting that any new id inserted will always be > select max(id) from my_raw . In fact it's clearly not the case.
If I issue the above copy command twice, the first time the ids start from 1 to N although that file is creating 114 records(that's a known issue with redshift when it has multiple shards). The second time the ids are also between 1 and N but it took free numbers that were not used in the first copy.
See below for a demo:
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=#
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/my_json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
Thx in advance
The only solution i found to make sure have sequential Ids based on the insertion is to maintain a pair of tables. The first one is the stage table in which the items are inserted by the COPY command. The stage table will actually not have an ID column.
Then I have another table that is the exact replica of the stage table except that it has an additional column for the Ids. Then there is a job that takes care of filling the master table from the stage using the ROW_NUMBER() function.
In practice, this means executing the following statement after each Redshift COPY is performed:
insert into master
(id,result_code,ct_timestamp,...)
select
#{startIncrement}+row_number() over(order by ct_timestamp) as id,
result_code,...
from stage;
Then the Ids are guaranteed to be sequential/consecutives in the master table.
I can't reproduce your problem, however it is interesting how you have identity columns set correctly in conjunction with copy. Here a small summary:
Be aware that you can specify the columns (and their order) for a copy command.
COPY my_table (col1, col2, col3) FROM s3://...
So if:
EXPLICIT_IDS flag is NOT set
no columns listed like shown above
and you csv does not contain data for the IDENTITY column
then the identity values in the table will be set automatically in monotonously as we all want it.
doc:
If an IDENTITY column is included in the column list, then EXPLICIT_IDS must also be specified; if an IDENTITY column is omitted, then EXPLICIT_IDS cannot be specified. If no column list is specified, the command behaves as if a complete, in-order column list was specified, with IDENTITY columns omitted if EXPLICIT_IDS was also not specified.