More efficient way to fuzzy match large datasets in SAS - merge

I have a dataset with over 33 million records that includes a name field. I need to flag records where this name field value also appears in a second dataset that includes about 5 million records. For my purposes, a fuzzy match would be both acceptable and beneficial.
I wrote the following program to do this. It works, but it has now been running for 4 days, so I'd like to find a more efficient way to write it.
proc sql noprint;
create table INDIV_MATCH as
select A.NAME, SPEDIS(A.NAME, B.NAME) as SPEDIS_VALUE, COMPGED(A.NAME,B.NAME) as COMPGED_SCORE
from DATASET1 A join DATASET2 B
on COMPGED(A.NAME, B.NAME) le 400 and SPEDIS(A.NAME, B.NAME) le 10
order by A.name;
quit;
Any help would be much appreciated!
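For illustration, a hedged sketch of one possible direction (assuming matching names share the same first letter, which may or may not hold for this data): add a cheap equality condition to the join so that COMPGED and SPEDIS are only evaluated for a small subset of the 33M x 5M candidate pairs rather than for every pair.
/* Sketch only, not a tested solution; the first-letter "blocking" key is an assumption */
proc sql noprint;
create table INDIV_MATCH as
select A.NAME,
       SPEDIS(A.NAME, B.NAME) as SPEDIS_VALUE,
       COMPGED(A.NAME, B.NAME) as COMPGED_SCORE
from DATASET1 A join DATASET2 B
on substr(A.NAME, 1, 1) = substr(B.NAME, 1, 1)
   and COMPGED(A.NAME, B.NAME) le 400
   and SPEDIS(A.NAME, B.NAME) le 10
order by A.NAME;
quit;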

Related

PostgreSQL how to GROUP BY single field from returned table

So I have a complicated query; to simplify, let it be like:
SELECT
t.*,
SUM(a.hours) AS spent_hours
FROM (
SELECT
person.id,
person.name,
person.age,
SUM(contacts.id) AS contact_count
FROM
person
JOIN contacts ON contacts.person_id = person.id
) AS t
JOIN activities AS a ON a.person_id = t.id
GROUP BY t.id
Such a query works fine in MySQL, but Postgres needs to know that the GROUP BY field is unique, and even though it actually is, in this case I would need to GROUP BY all the fields returned from the t table.
I can do that, but I don't believe that will work efficiently with big data.
I can't JOIN with activities directly in the first query, as a person can have several contacts, which would lead the query to count the hours of activity several times, once for every joined contact.
Is there a Postgres way to make this query work? Maybe a way to force Postgres to treat t.id as unique, or some other solution that achieves the same result the Postgres way?
This query will not work on either database system; there is an aggregate function in the inner query, but you are not grouping by anything (unless you use window functions). Of course there is a special case for MySQL: you can make it run by disabling "only_full_group_by" in sql_mode. So MySQL allows this usage because of a database engine setting, but you cannot do that in PostgreSQL.
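For illustration, a minimal sketch of what the inner query alone would need in either engine (assuming a per-person contact count is what was intended; the original used SUM(contacts.id)):
SELECT
    person.id,
    person.name,
    person.age,
    COUNT(contacts.id) AS contact_count -- every non-aggregated column is grouped below
FROM person
JOIN contacts ON contacts.person_id = person.id
GROUP BY person.id, person.name, person.age;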
I knew MySQL allowed indeterminate grouping, but I honestly never knew how it implemented it... it always seemed imprecise to me, conceptually.
So depending on what that means (I'm too lazy to look it up), you might need one of two possible solutions, or maybe a third.
If your intent is to see all rows (perform the aggregate function but not consolidate/group rows), then you want a windowing function, invoked with partition by. Here is a really dumbed-down version of your query:
SELECT
t.*,
SUM(a.hours) over (partition by t.id) AS spent_hours
FROM t -- t stands in for the derived table from your original query
JOIN activities AS a ON a.person_id = t.id
This means you want all records in table t, not one record per t.id. But each row will also contain a sum of the hours for all rows with that value of id.
For example the sum column would look like this:
Name   Hours  Sum Hours
-----  -----  ---------
Smith     20        120
Jones     30         30
Smith    100        120
Whereas a group by would have had Smith once and could not have displayed the hours column in detail.
If you really did only want one row per t.id, then Postgres will require you to tell it how to determine which row. In the example above for Smith, do you want to see the 20 or the 100?
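If that is what you're after, one common Postgres idiom for "one row per id, picked by a rule" is DISTINCT ON; a minimal sketch, assuming you'd keep the activity row with the most hours:
SELECT DISTINCT ON (t.id)
    t.*,
    a.hours
FROM t -- again, t stands in for your derived table
JOIN activities AS a ON a.person_id = t.id
ORDER BY t.id, a.hours DESC;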
There is another possibility, but I think I'll let you reply first. My gut tells me option 1 is what you're after and you want the analytic function.

Merge multiple tables having different columns

I have 4 tables, and each table has a different number of columns, as listed below.
tableA - 34
tableB - 47
tableC - 26
tableD - 16
Every table has a common column called id. Now I need to perform a union; the problem is that since the column lists are not the same length and are entirely different, I can't do a union.
Based on id alone I can get the details from every table, so how should I approach this?
What is an optimized way to solve this? I tried a full join, but that takes too much time.
Tried so far
SELECT * FROM tableA
FULL JOIN tableB USING (id)
FULL JOIN tableC USING (id)
FULL JOIN tableD USING (id)
WHERE tableA.id = 123 OR
      tableB.id = 123 OR
      tableC.id = 123 OR
      tableD.id = 123
Snowflake does have a documented limitation on the use of set operators (such as UNION):
When using these operators:
Make sure that each query selects the same number of columns.
[...]
However, since the column names are well known, it is possible to come up with a superset of all unique column names required in the final result and project them explicitly from each query.
There's not enough information in the question about how many columns overlap (do the other tables' columns all fall within tableB's 47, giving 47 unique columns?) or whether they are all different (46 + 33 + 25 + 15 = 119 unique non-id columns, 120 including id?). The answer to this determines the amount of effort required to write out each query, as it involves adapting a query from the following form:
SELECT * FROM t1
Into an explicit form with dummy columns, defined with acceptable defaults that match the data types of those columns in the tables where they are present:
SELECT
present_col1,
NULL AS absent_col2,
0.0 AS absent_col3,
present_col4,
[...]
FROM
t1
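Put together, a small hedged example with made-up column names (suppose col_a exists only in tableA and col_b only in tableB) to show the shape of the final union:
-- Hypothetical columns, for illustration only
SELECT id, col_a, NULL AS col_b FROM tableA
UNION ALL
SELECT id, NULL AS col_a, col_b FROM tableB
-- ...repeat the same superset column list for tableC and tableD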
You can also use some metaprogramming with stored procedures to "generate" such an altered query, by inspecting each individual result's column names using the Statement::getColumnCount(), Statement::getColumnName(), etc. APIs and forming a superset union version with default/empty values.

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to take a column containing a comma-delimited string with up to 60 values and convert it into multiple rows, with one value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible, since Redshift has neither CROSS APPLY nor STRING_SPLIT, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma-delimited string into multiple columns, followed by another query that unions everything to get the final output into a single column. But a typical run takes over 6 hours to process the same amount of data.
I wasn't expecting to run into this issue, given the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM, while the Redshift cluster is based on dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams, but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm assuming I'm not understanding something, but I have no idea what I'm missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but short of looking into cross joins, something I'd like to avoid (plus, when I tried to talk to the DBAs running the Redshift cluster about this option, their response was a flat "No, can't do that."), I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a numbers table, for which you can take any table as long as it has more rows than the number of fields you need. You need to make sure the numbers are int by forcing the type. Using the regexp_count function on the column to be split in the ON condition, to count the number of fields (delimiters + 1), generates a table with a row per repetition.
Then you use the split_part function on the column, and use the numbers.num column to extract a different part of the text for each of those rows.
SELECT comma_delimited_string,
       numbers.num,
       REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
    FROM mytable
    LIMIT 15 -- max num of fields
) AS numbers
ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
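If the goal is the same SELECT ... INTO behaviour as the T-SQL version, a hedged sketch (the target table name is assumed) that wraps the approach above in a CREATE TABLE AS and keeps the original filter:
-- Sketch only: materialize one row per value, mirroring the original INTO
CREATE TABLE my_results_table AS
SELECT id_1,
       id_2,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS value
FROM mytable
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
    FROM mytable
    LIMIT 60 -- the question mentions up to 60 values
) AS numbers
ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
WHERE comma_delimited_string IS NOT NULL;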

LEFT JOIN returns incorrect result in PostgreSQL

I have two tables: A (525,968 records) and B (517,831 records). I want to generate a table with all the rows from A and the matched records from B. Both tables have a column "id" and a column "year". The combination of id and year in table A is unique, but not in table B. So I wrote the following query:
SELECT
A.id,
A.year,
A.v1,
B.x1,
B.e1
FROM
A
LEFT JOIN B ON (A.id = B.id AND A.year = B.year);
I thought the result should contain the same total number of records as table A, but it only returns about 517,950 records. I'm wondering what the possible cause may be.
Thanks!
First of all, I understand that this is an example, but Postgres may have issues with capital letters in table names.
Secondly, it may be a good idea to check how exactly you calculated the 525,968 records. The thing is, if you use some kind of database administration / query client, it may show you different / technical information about the tables (there are internal row counters in Postgres that may differ from the actual number of records).
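One hedged sanity check along those lines (a LEFT JOIN can never return fewer rows than its left table, so comparing the two counts directly shows which figure is the odd one out):
-- Compare the raw count of A with the count of the LEFT JOIN result
SELECT
    (SELECT count(*) FROM "A") AS rows_in_a,
    (SELECT count(*)
     FROM "A"
     LEFT JOIN "B" ON "A".id = "B".id AND "A".year = "B".year) AS rows_in_join;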
And finally, to check for yourself, do something like:
SELECT
count("A".id)
FROM
"A"

T-SQL - CROSS APPLY to a PIVOT? (using pivot with a table-valued function)?

I have a table-valued function, basically a split-type function, that returns up to 4 rows per string of data.
So I run:
select * from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
I get:
Seq Data
1 15
2 25
However, my end data needs to look like
15 25
so I do a pivot.
select [1],[2],[3],[4]
from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
pivot (max(data) for seq in ([1],[2],[3],[4]))
as pivottable
which works as expected:
1 2
--- ---
15 25
HOWEVER, that's great for one row. I now need to do it for several hundred records at once. My thought is to do a CROSS APPLY, but I'm not sure how to combine a CROSS APPLY and a PIVOT.
(yes, obviously the easy answer is to write a modified version that returns 4 columns, but that's not a great option for other reasons)
Any help greatly appreciated.
And the reason I'm doing this: the current query uses a scalar-valued version of SPLIT, called 12 times within the same SELECT against the same million rows (where the data string is 500+ bytes).
So far as I know, that requires it to scan the same 500 bytes x 1,000,000 rows, 12 times.
This is how you use CROSS APPLY. Assume table1 is your table and Line is the field in your table that you want to split:
SELECT * FROM table1 AS a
CROSS APPLY dbo.split(a.Line) AS b
PIVOT (MAX(data) FOR seq IN ([1],[2],[3],[4])) AS p
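If you only want a key column plus the pivoted values (rather than every remaining column of table1 taking part in the implicit grouping), a hedged sketch with an assumed key column id:
-- Sketch only: table1.id is assumed to be the row key you want to keep
SELECT p.id, p.[1], p.[2], p.[3], p.[4]
FROM (
    SELECT a.id, b.seq, b.data
    FROM table1 AS a
    CROSS APPLY dbo.split(a.Line) AS b
) AS src
PIVOT (MAX(data) FOR seq IN ([1], [2], [3], [4])) AS p;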