Loaded multiple files into a table and now want to find which file the Min and Max values come from - tsql

I have loaded multiple files in SQL Server.
Let's Say
table structure
Now
select COL_A, min(COL_C),max(COL_C)
from tbl
group by COL_A
I want to write a SQL query that finds which file each Min(COL_C) and Max(COL_C) value comes from. As you can see, I am storing FILENAME in the table.
So I want result like this

Here's one possible way you could get your results, using apply to correlate each aggregated value back to the same table to find the filename for that value, and using string_agg (available from SQL Server 2017) to produce a delimited list if there are ties.
select * from (
select COL_A, Min(COL_C) MinColC,Max(COL_C) MaxColC
from T
group by COL_A
)x
outer apply (
select string_agg([filename], ', ') MinFilename
from T
where T.COL_A=x.COL_A and T.COL_C=x.MinColC
)mn
outer apply (
select string_agg([filename], ', ') MaxFilename
from T
where T.COL_A=x.COL_A and T.COL_C=x.MaxColC
)mx
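
The correlate-back idea can be tried out on a handful of rows. The sketch below is only an illustration: it uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, so correlated scalar subqueries take the place of outer apply and group_concat takes the place of string_agg. The sample rows and file names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T (COL_A TEXT, COL_C INTEGER, filename TEXT);
INSERT INTO T VALUES
  ('a', 1, 'f1.csv'), ('a', 5, 'f2.csv'), ('a', 1, 'f3.csv'),
  ('b', 7, 'f1.csv'), ('b', 9, 'f3.csv');
""")

# Aggregate once per COL_A, then look up the file(s) holding each extreme value.
query = """
SELECT x.COL_A, x.MinColC, x.MaxColC,
       (SELECT group_concat(m.filename, ', ') FROM T AS m
         WHERE m.COL_A = x.COL_A AND m.COL_C = x.MinColC) AS MinFilename,
       (SELECT group_concat(m.filename, ', ') FROM T AS m
         WHERE m.COL_A = x.COL_A AND m.COL_C = x.MaxColC) AS MaxFilename
FROM (SELECT COL_A, MIN(COL_C) AS MinColC, MAX(COL_C) AS MaxColC
        FROM T GROUP BY COL_A) AS x
ORDER BY x.COL_A
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Group 'a' has a tie for the minimum, so its MinFilename lists both files; the order of names inside a tied list is not guaranteed.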

Related

SQL Pivot using a subquery in FOR

Using SQL Server 2016 and referring to this article:
https://www.sqlshack.com/dynamic-pivot-tables-in-sql-server/
That article uses this pivot:
SELECT * FROM (
SELECT
[Student],
[Subject],
[Marks]
FROM Grades
) StudentResults
PIVOT (
SUM([Marks])
FOR [Subject]
IN (
[Mathematics],
[Science],
[Geography]
)
) AS PivotTable
How can you change the query so that the Subjects ([Mathematics], [Science], [Geography]) don't have to be hardcoded in the query?
Can you rather get the Subject list using a subquery? How do you get the FOR to work with a query like this?
...
FOR [Subject]
IN (
SELECT subject FROM grades WHERE student = 'Jacob'
)
How can you change the query so that the Subjects ([Mathematics], [Science], [Geography]) don't have to be hardcoded in the query?
You can't; you'll have to form the SQL as a string and execute it dynamically.
SQL makes it easy to have a variable number of columns (you just write more words in a SELECT), which also makes it easy to forget that columns are like properties of an object (and an entire row is like an instance of an object); they aren't something that varies dynamically every time you run a program. As a Person, you don't have a Name this week and not next week.
The number of columns output from a query isn't meant to vary; the number of rows is. If you want variable numbers of attributes, you'll have to form them as rows and then have your front end behave differently to account for them (i.e. don't do the pivot). If you can't do this because you have no front end, and you really do need a varying number of columns, you have to write a different SQL each time (which you can do by concatenating together a new SQL string and EXECing it, but be under no illusions: it works because it's a totally different SQL, the programmatic equivalent of you editing your hardcoded query and re-running it).
It looks something like (not tested - consider this pseudocode):
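DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- Build the bracketed column list, e.g. [Geography],[Mathematics],[Science].
-- STRING_AGG needs SQL Server 2017+; on 2016 build the list with STUFF(... FOR XML PATH('')) instead.
SELECT @cols = STRING_AGG(CAST(QuotedSubject AS NVARCHAR(MAX)), ',')
FROM (SELECT DISTINCT QUOTENAME([Subject]) AS QuotedSubject FROM Grades) x;

SET @sql = CONCAT(N'
SELECT * FROM (
SELECT
[Student],
[Subject],
[Marks]
FROM Grades
) StudentResults
PIVOT (
SUM([Marks])
FOR [Subject]
IN (', @cols, N')
) AS PivotTable');

-- A dynamic statement held in a variable is run with sp_executesql (or EXEC (@sql))
EXEC sp_executesql @sql;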
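
The "build a new SQL string each time" idea can also be shown end to end in a small runnable sketch. This is an illustration under assumptions, not the answer's method: it uses SQLite through Python's sqlite3 module, and because SQLite has no PIVOT operator it swaps in conditional aggregation (SUM over CASE). The quote_ident helper and the sample rows are made up for the example.

```python
import sqlite3

def quote_ident(name):
    """Quote an identifier for SQLite, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Grades (Student TEXT, Subject TEXT, Marks INTEGER);
INSERT INTO Grades VALUES
  ('Jacob', 'Mathematics', 100), ('Jacob', 'Science', 95),
  ('Amilee', 'Mathematics', 90), ('Amilee', 'Geography', 80);
""")

# Step 1: read the subject list from the data instead of hardcoding it.
subjects = [r[0] for r in conn.execute(
    "SELECT DISTINCT Subject FROM Grades ORDER BY Subject")]

# Step 2: generate one output column per subject. Subject values are passed as
# parameters; only the quoted column aliases are spliced into the SQL text.
columns = ", ".join(
    f"SUM(CASE WHEN Subject = ? THEN Marks END) AS {quote_ident(s)}"
    for s in subjects)
sql = f"SELECT Student, {columns} FROM Grades GROUP BY Student ORDER BY Student"

# Step 3: execute the freshly built statement.
cursor = conn.execute(sql, subjects)
header = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(header)
for row in rows:
    print(row)
```

The statement text differs whenever the set of subjects changes, which is exactly the point made above: the column list is fixed per statement, so a varying column list means a different statement.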

SELECT * except nth column

Is it possible to SELECT * but without n-th column, for example 2nd?
I have some view that have 4 and 5 columns (each has different column names, except for the 2nd column), but I do not want to show the second column.
SELECT * -- how to prevent 2nd column to be selected?
FROM view4
WHERE col2 = 'foo';
SELECT * -- how to prevent 2nd column to be selected?
FROM view5
WHERE col2 = 'foo';
without having to list all the columns (since they all have different column name).
The real answer is that, practically, you cannot (see LINK). This feature has been requested for decades and the developers refuse to implement it. The best practice is to name the columns instead of using *; using * is itself a source of performance penalties.
However, if you really need it, you can select the column names directly from the schema (check LINK), or, as in the example below, use two PostgreSQL built-in functions: ARRAY and ARRAY_TO_STRING. The first transforms a query result into an array, and the second concatenates the array's elements into a string. The separator between elements is given by the second parameter of ARRAY_TO_STRING:
SELECT 'SELECT ' ||
ARRAY_TO_STRING(ARRAY(SELECT COLUMN_NAME::VARCHAR(50)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME='view4' AND
COLUMN_NAME NOT IN ('col2')
ORDER BY ORDINAL_POSITION
), ', ') || ' FROM view4';
where strings are concatenated with the standard operator ||. The COLUMN_NAME data type is information_schema.sql_identifier. This data type requires explicit conversion to CHAR/VARCHAR data type.
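
The same generate-the-statement idea can be sketched in a runnable form. This is an illustration under assumptions: it uses SQLite through Python's sqlite3 module rather than PostgreSQL, reads the column names from the cursor metadata instead of INFORMATION_SCHEMA, and drops a column by position. The helpers quote_ident and select_except_nth, the base table and the sample rows are all hypothetical.

```python
import sqlite3

def quote_ident(name):
    """Quote an identifier for SQLite, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def select_except_nth(conn, relation, n):
    """Build a SELECT listing every column of `relation` except the n-th (1-based)."""
    # LIMIT 0 fetches no rows but still exposes the column names.
    cursor = conn.execute(f"SELECT * FROM {quote_ident(relation)} LIMIT 0")
    names = [d[0] for d in cursor.description]
    kept = [quote_ident(c) for pos, c in enumerate(names, start=1) if pos != n]
    return f"SELECT {', '.join(kept)} FROM {quote_ident(relation)}"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base (a INTEGER, col2 TEXT, c INTEGER, d INTEGER);
INSERT INTO base VALUES (1, 'foo', 3, 4), (5, 'bar', 7, 8);
CREATE VIEW view4 AS SELECT a, col2, c, d FROM base;
""")

# The filter can still use col2 even though it is not in the select list.
sql = select_except_nth(conn, "view4", 2) + " WHERE col2 = ?"
rows = conn.execute(sql, ("foo",)).fetchall()
print(sql)
print(rows)
```

As with the PostgreSQL version, this takes two steps (generate, then run), so it belongs in client code or a script rather than in a single query.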
But that is not recommended either. What if you add more columns in the long run that are not required for that query?
You would start pulling more columns than you need.
What if the select is part of an insert as in
Insert into tableA (col1, col2, col3.. coln) Select everything but 2 columns FROM tableB
The column match will be wrong and your insert will fail.
It's possible, but I still recommend writing out every needed column in every select, even if nearly every column is required.
Conclusion:
Since you are already using a VIEW, the simplest and most reliable way is to alter your view and name the columns explicitly, excluding your 2nd column.
-- my table with 2 rows and 4 columns
DROP TABLE IF EXISTS t_target_table;
CREATE TEMP TABLE t_target_table as
SELECT 1 as id, 1 as v1 ,2 as v2,3 as v3,4 as v4
UNION ALL
SELECT 2 as id, 5 as v1 ,-6 as v2,7 as v3,8 as v4
;
-- my computation and stuff that I have to measure; any logic could be done here!
DROP TABLE IF EXISTS t_processing;
CREATE TEMP TABLE t_processing as
SELECT *, md5(t_target_table::text) as row_hash, case when v2 < 0 THEN true else false end as has_negative_value_in_v2
FROM t_target_table
;
-- now we want to insert that stuff into the t_target_table
-- this is standard
-- INSERT INTO t_target_table (id, v1, v2, v3, v4) SELECT id, v1, v2, v3, v4 FROM t_processing;
-- this is advanced ;-)
INSERT INTO t_target_table
-- the following row selects only the columns that are present in the target table and ignores the others.
SELECT r.* FROM (SELECT to_jsonb(t_processing) as d FROM t_processing) t JOIN LATERAL jsonb_populate_record(NULL::t_target_table, d) as r ON TRUE
;
-- WARNING: you need an object that represents the target structure; excluding a single column is not possible
For columns col1, col2, col3 and col4 you will need to request
SELECT col1, col3, col4 FROM...
to omit the second column. Requesting
SELECT *
will get you all the columns.

Returning the results in the 'SAME ORDER' as the input param

In the query below, what should I do to get the results in the same order as the input param?
DECLARE @sql varchar(max)
SET @sql = 'SELECT a.num AS Num, a.photo as Photo , row_number() over (order by (select 0)) rn
FROM tbl a (nolock) WHERE a.num IN (' + @NumList + ') '
I pass the following to the @NumList param (as an example):
1-235,1-892,2-847,1-479,3-890,1-239,2-892
This works fine; however, I need the results returned in the 'SAME ORDER' as the input param.
I have created a SQL Fiddle
If @NumList contains unique values, you could use CharIndex to find their position within the param, e.g.:
order by charindex(a.num, @NumList)
1. Create a local temporary table #numbers. Ensure that it has an auto-increment identity column.
2. Insert the numbers from @NumList into #numbers in the correct order.
   - Split @NumList at the commas and turn the individual values into rows. See e.g. this question for ideas how to do this.
   - Alternatively, turn @NumList from a VARCHAR-typed variable into a table variable. (That way, you might even be able to use it directly, i.e. in place of #numbers.)
3. Modify your query so that rows from tbl a are joined to #numbers. Also, add an ORDER BY clause that sorts the result by the auto-increment identity column of #numbers.
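
The steps above can be sketched as follows. This is only an illustration: it uses SQLite through Python's sqlite3 module instead of SQL Server, the split is done in Python rather than in SQL, and the numbers table, its pos column and the photo values are made-up names and data.

```python
import sqlite3

num_list = "1-235,1-892,2-847,1-479,3-890,1-239,2-892"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (num TEXT, photo TEXT)")
# Stored in a different order than the input list, plus one row that is not requested.
conn.executemany("INSERT INTO tbl VALUES (?, ?)", [
    ("2-892", "g.jpg"), ("1-239", "f.jpg"), ("9-999", "x.jpg"), ("3-890", "e.jpg"),
    ("1-479", "d.jpg"), ("2-847", "c.jpg"), ("1-892", "b.jpg"), ("1-235", "a.jpg"),
])

# Step 1: a temporary table with an auto-increment column that records input order.
conn.execute(
    "CREATE TEMP TABLE numbers (pos INTEGER PRIMARY KEY AUTOINCREMENT, num TEXT)")

# Step 2: split the list at the commas and insert the values in their original order.
conn.executemany("INSERT INTO numbers (num) VALUES (?)",
                 [(n,) for n in num_list.split(",")])

# Step 3: join to the numbers table and sort by the recorded position.
rows = conn.execute("""
    SELECT a.num, a.photo
    FROM tbl AS a
    JOIN numbers AS n ON n.num = a.num
    ORDER BY n.pos
""").fetchall()
for row in rows:
    print(row)
```

Joining on the split values also removes the need to splice the list into an IN (...) clause, and it avoids the partial-match risk of a substring search (for example, a value such as 1-23 would also be found inside 1-235).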

Hive: How to do a SELECT query to output a unique primary key using HiveQL?

I have the following dataset, which I want to transform into a table that can be exported to SQL. I am using Hive. The input is as follows:
call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,
The output table needs to have call_id as its primary key so it needs to be unique. The output schema should be
call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,
The problem is that when I use the keyword DISTINCT in the Hive query, DISTINCT applies to all the columns combined. I want to apply the DISTINCT operation only to call_id, something along the lines of
SELECT DISTINCT(call_id), stat2,stat3 from intable;
However, this is not valid in Hive (I am not well-versed in SQL either).
The only legal query seems to be
SELECT DISTINCT call_id, stat2,stat3 from intable;
But this returns multiple rows with same call_id as the other columns are different and the row on the whole is distinct.
NOTE: There is no arithmetic relation between a,b,c,x,y,z, etc. So any trick of averaging or summing is not viable.
Any ideas how I can do this?
One quick idea, not the best one, but it will do the job:
hive>create table temp1(a int,b string);
hive>insert overwrite table temp1
select call_id,max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;
hive>insert overwrite table intable
select a,split(b,'\\|')[0],split(b,'\\|')[1],split(b,'\\|')[2] from temp1;
"I want to apply the DISTINCT operation only to the call_id"
But how will Hive then know which row to eliminate?
Without knowing the amount of data or the size of the stat fields you have, the following query can do the job:
select distinct i1.call_id, i1.stat2, i1.stat3 from (
select call_id, MIN(concat(stat1, stat2, stat3)) as smin
from intable group by call_id
) i2 join intable i1 on i1.call_id = i2.call_id
AND concat(i1.stat1, i1.stat2, i1.stat3) = i2.smin;
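
To see what this query returns on the sample input, here is a small runnable sketch. It is an illustration under assumptions: SQLite through Python's sqlite3 module stands in for Hive, the || operator replaces concat, and a '|' separator is added between the fields (an addition not in the query above) so that two different rows cannot produce the same concatenated key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE intable (call_id INTEGER, stat1 TEXT, stat2 TEXT, stat3 TEXT);
INSERT INTO intable VALUES
  (1, 'a', 'b', 'c'), (2, 'x', 'y', 'z'), (3, 'd', 'e', 'f'), (1, 'j', 'k', 'l');
""")

# For each call_id, keep the row whose concatenated stats sort first.
rows = conn.execute("""
    SELECT DISTINCT i1.call_id, i1.stat2, i1.stat3
    FROM (
        SELECT call_id, MIN(stat1 || '|' || stat2 || '|' || stat3) AS smin
        FROM intable
        GROUP BY call_id
    ) AS i2
    JOIN intable AS i1
      ON i1.call_id = i2.call_id
     AND i1.stat1 || '|' || i1.stat2 || '|' || i1.stat3 = i2.smin
    ORDER BY i1.call_id
""").fetchall()
for row in rows:
    print(row)
```

For call_id 1 the row (a, b, c) wins over (j, k, l) simply because its concatenated string sorts first; the choice is arbitrary but deterministic, which is all a unique key needs here.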

Finding duplicates between two tables

I've got two SQL2008 tables, one is a "Import" table containing new data and the other a "Destination" table with the live data. Both tables are similar but not identical (there's more columns in the Destination table updated by a CRM system), but both tables have three "phone number" fields - Tel1, Tel2 and Tel3. I need to remove all records from the Import table where any of the phone numbers already exist in the destination table.
I've tried knocking together a simple query (just a SELECT to test with just now):
select t2.account_id
from ImportData t2, Destination t1
where
(t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
... but I'm aware this is almost certainly Not The Way To Do Things, especially as it's very slow. Can anyone point me in the right direction?
This query requires a little more information than this. To write it efficiently, we need to know whether each load contains more duplicates or more new records. I assume that account_id is the primary key and has a clustered index.
I would use the temporary table approach, that is, create a normalized table #tmp with an index on the phone number and account_id, like
SELECT Phone, account_id INTO #tmp
FROM
(SELECT account_id, Tel1, Tel2, Tel3
FROM Destination) p
UNPIVOT
(Phone FOR TelField IN
(Tel1, Tel2, Tel3)
) AS unpvt;
Create a nonclustered index on this table with the phone number as the first column and the account number as the second. You can't escape one full table scan, so I assume you can scan the import table (probably the smaller one). Then just join with this table and use the NOT EXISTS qualifier as explained. Then, of course, drop the table after processing.
luke
I am not sure about the performance of this query, but since I made the effort of writing it, I will post it anyway...
;with aaa(tel)
as
(
select Tel1
from Destination
union
select Tel2
from Destination
union
select Tel3
from Destination
)
,bbb(tel, id)
as
(
select Tel1, account_id
from ImportData
union
select Tel2, account_id
from ImportData
union
select Tel3, account_id
from ImportData
)
select distinct b.id
from bbb b
where b.tel in
(
select a.tel
from aaa a
intersect
select b2.tel
from bbb b2
)
Exists will short-circuit the query and not do a full traversal of the table like a join. You could refactor the where clause as well, if this still doesn't perform the way you want.
SELECT *
FROM ImportData t2
WHERE NOT EXISTS (
select 1
from Destination t1
where (t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
)
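
Putting the suggestions in this thread together (normalize the Destination phone numbers into one indexed column, then test each import row against it), a runnable sketch might look like the following. It is an illustration under assumptions: SQLite through Python's sqlite3 module stands in for SQL Server, the dest_phones table is a hypothetical name, and the phone numbers are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Destination (account_id INTEGER PRIMARY KEY, Tel1 TEXT, Tel2 TEXT, Tel3 TEXT);
CREATE TABLE ImportData  (account_id INTEGER PRIMARY KEY, Tel1 TEXT, Tel2 TEXT, Tel3 TEXT);
INSERT INTO Destination VALUES (1, '555-0100', '', '555-0101'), (2, '555-0102', '', '');
INSERT INTO ImportData VALUES
  (10, '555-0199', '', ''),          -- no overlap
  (11, '', '555-0101', ''),          -- Tel2 matches Destination 1, Tel3
  (12, '555-0150', '', '555-0102'),  -- Tel3 matches Destination 2, Tel1
  (13, '', '', '');                  -- blanks only, must not count as a match

-- One row per distinct non-blank destination phone; the primary key gives an index.
CREATE TEMP TABLE dest_phones (phone TEXT PRIMARY KEY);
INSERT INTO dest_phones
  SELECT Tel1 FROM Destination WHERE Tel1 <> ''
  UNION SELECT Tel2 FROM Destination WHERE Tel2 <> ''
  UNION SELECT Tel3 FROM Destination WHERE Tel3 <> '';
""")

# Import rows where any of the three numbers already exists in Destination.
duplicates = [r[0] for r in conn.execute("""
    SELECT i.account_id
    FROM ImportData AS i
    WHERE EXISTS (SELECT 1 FROM dest_phones AS d
                  WHERE d.phone IN (i.Tel1, i.Tel2, i.Tel3))
    ORDER BY i.account_id
""")]

# Remove them, leaving only the genuinely new records.
conn.execute("""
    DELETE FROM ImportData
    WHERE EXISTS (SELECT 1 FROM dest_phones AS d
                  WHERE d.phone IN (ImportData.Tel1, ImportData.Tel2, ImportData.Tel3))
""")
remaining = [r[0] for r in conn.execute(
    "SELECT account_id FROM ImportData ORDER BY account_id")]
print(duplicates, remaining)
```

Filtering blanks out once, while building dest_phones, means the per-row check no longer needs the three `!=''` guards.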