Splitting a dataset into training and test set in postgres - postgresql

I have a dataset and I want to use Postgres SQL to split it into a training and a test set in a 70:30 ratio. How can I do that? I used the following code, but it doesn't seem to work:
create table training_test as
(
WITH TEMP as
(
SELECT ROW_NUMBER() AS ROW_ID , Random() as RANDOM_VALUE,D.*
FROM analytics.model_data_discharge_v1 as D
ORDER BY RANDOM_VALUE
)
SELECT 'Training',T.* FROM TEMP T
WHERE ROW_ID <= 493896*0.70
UNION
SELECT 'Test',T.* FROM TEMP T
WHERE ROW_ID > 493896*0.70
) distributed by(hospitalaccountrecord);

-- assign each row at random to 'training' (~70%) or 'test' (~30%)
select t.*,
       case
           when random() < 0.7 then 'training'
           else 'test'
       end as split
from analytics.model_data_discharge_v1 t;
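A hedged aside that is not part of either answer here: if you want the random()-based split to be repeatable within a session, Postgres lets you seed the generator with setseed() first (the seed value below is arbitrary, and the result is only stable as long as the rows are scanned in the same order):

-- seed the session's random number generator so the split can be reproduced
select setseed(0.42);

select t.*,
       case when random() < 0.7 then 'training' else 'test' end as split
from analytics.model_data_discharge_v1 t;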

Don't use random splitting: it is NOT repeatable! random() will return different results each time.
Instead, you can, for example, use hashing and modulo to split a dataset as Google Cloud suggests.
Hash, using a fingerprint function, one of the columns/features that is not correlated with the target (to avoid leaving valuable information out of the training set). Otherwise, concatenate all the fields as a JSON string and hash that.
Take the absolute value of the hash
Calculate the abs(hash(column)) modulo 10
If the result < 8 then it will be part of the 80% training set
If the result >= 8 then it will be part of the 20% test set
Example using BigQuery (I've taken it from a GCP ML course): one query selects the training set and another the test set.
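A minimal sketch of what those two queries could look like in BigQuery standard SQL, assuming a table `dataset.table` and a string column `uid` to hash on (both names are placeholders, not from the course):

-- Training set: ~80% of rows, and the same rows every time the query runs
SELECT *
FROM `dataset.table`
WHERE MOD(ABS(FARM_FINGERPRINT(uid)), 10) < 8;

-- Test set: the remaining ~20%
SELECT *
FROM `dataset.table`
WHERE MOD(ABS(FARM_FINGERPRINT(uid)), 10) >= 8;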
That way, you get exactly the same 80% of the data each time.
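Since the question is about Postgres, which has no FARM_FINGERPRINT, a similar repeatable split could be sketched by hashing a key column with md5. The snippet below assumes hospitalaccountrecord is a stable identifier and uses a common hex-to-integer cast trick; treat it as a sketch, not part of the original answer:

-- Repeatable 70/30 split keyed on hospitalaccountrecord
select t.*,
       case
           when abs(mod(('x' || substr(md5(t.hospitalaccountrecord::text), 1, 8))::bit(32)::int, 10)) < 7
               then 'training'
           else 'test'
       end as split
from analytics.model_data_discharge_v1 t;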

If you want a stratified split, you can use the following code.
The first CTE guarantees that each group is large enough to be split.
with ssize as (
    -- keep only groups large enough to be split
    select
        "group"
    from to_split_table
    group by "group"
    having count(*) >= {{ MINIMUM GROUP SIZE }}  -- {{ MINIMUM GROUP SIZE }} = 1 / {{ TEST_THRESHOLD }}
)
select
    id_aux,
    ts."group",
    case
        when cast(row_number() over (partition by ts."group" order by random()) as double precision)
             / cast(count(*) over (partition by ts."group") as double precision)
             < {{ TEST_THRESHOLD }} then 'test'
        else 'train'
    end as splitting
from to_split_table ts
join ssize
  on ts."group" = ssize."group"
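For example, with {{ TEST_THRESHOLD }} = 0.3 (a 70/30 split), {{ MINIMUM GROUP SIZE }} would be about 1 / 0.3, i.e. at least 4 rows per group, so every surviving group can contribute at least one test row.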

Query table by a value in the second dimension of a two dimensional array column

WHAT I HAVE
I have a table with the following definition:
CREATE TABLE "Highlights"
(
id uuid,
chunks numeric[][]
)
WHAT I NEED TO DO
I need to query the data in the table using the following predicate:
... WHERE id = 'some uuid' and chunks[????????][1] > 10 and chunks[????????][3] < 20
What should I put instead of [????????] in order to scan all items in the first dimension of the array?
Notes
I'm not entirely sure that chunks[][1] is even close to what I need.
All I need is to test a row: whether its chunks column contains a two-dimensional array that has some specific values in any of its tuples.
Maybe there's a better alternative, but this might do - you just go over the first dimension of each array and test your condition:
select *
from highlights as h
where
exists (
select
from generate_series(1, array_length(h.chunks, 1)) as tt(i)
where
-- your condition goes here
h.chunks[tt.i][1] > 10 and h.chunks[tt.i][3] < 20
)
db<>fiddle demo
Update: as @arie-r pointed out, it'd be better to use the generate_subscripts function:
select *
from highlights as h
where
exists (
select *
from generate_subscripts(h.chunks, 1) as tt(i)
where
h.chunks[tt.i][3] = 6
)
db<>fiddle demo

Postgresql upper limit of a calculated field

Is there a way to set an upper limit on a calculation (calculated field) that is already inside a CASE clause? I'm calculating percentages and, obviously, don't want the highest value to exceed 100.
If it weren't already in a CASE clause, I'd create something like 'case when calculation > 100.0 then 100 else calculation end as needed_percent', but I can't do that now.
Thanks for any suggestions.
I think using the least() function will be the best option.
select least((case when ...), 100) from ...
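For instance, a percentage calculation could be capped like this (the table and column names below are made up for illustration):

-- cap the computed percentage at 100
select least(
           case when denominator > 0 then 100.0 * numerator / denominator else 0 end,
           100
       ) as needed_percent
from some_stats_table;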
There is a way to set an upper limit on a calculated field by creating an outer query. Check out my example below. The inner query is the query you have currently; then wrap an outer query around it and use a WHERE clause to limit the result to <= 1.
SELECT
    z.id,
    z.name,
    z.percent
FROM (
    SELECT
        id,
        name,
        -- window SUM so each row can be divided by the overall total
        CASE WHEN id = 2 THEN sales / SUM(sales) OVER () ELSE NULL END AS percent
    FROM
        users_table
) AS z
WHERE z.percent <= 1

BigQuery - Partitioning data according to some hash criteria

I have a table in BigQuery. I have a certain string column which represents a unique id (uid). I want to filter only a sample of this table, by taking only a portion of the uids (let's say 1/100).
So my idea is to sample the data by doing something like this:
if(ABS(HASH(uid)) % 100 == 0) ...
The problem is this will actually filter at a 1/100 ratio only if the distribution of the hash values is uniform. So, in order to check that, I would like to generate the following table:
(n goes from 0 to 99)
0 <number of rows in which uid % 100 == 0>
1 <number of rows in which uid % 100 == 1>
2 <number of rows in which uid % 100 == 2>
3 <number of rows in which uid % 100 == 3>
.. etc.
If I see the numbers in each row are of the same magnitude, then my assumption is correct.
Any idea how to create such a query, or alternatively do the sampling another way?
Something like
Select ABS(HASH(uid)) % 100 as cluster , count(*) as cnt
From yourtable
Group each by cluster
If the UID is of different cases (upper, lower) and types, you can use some string manipulation within the hash, something like:
Select ABS(HASH(upper(string(uid)))) % 100 as cluster , count(*) as cnt
From yourtable
Group each by cluster
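If you are on BigQuery standard SQL rather than legacy SQL, the legacy HASH() function isn't available; a sketch of the same distribution check using FARM_FINGERPRINT might look like this (assuming uid is a STRING):

SELECT MOD(ABS(FARM_FINGERPRINT(uid)), 100) AS cluster, COUNT(*) AS cnt
FROM yourtable
GROUP BY cluster
ORDER BY cluster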
As an alternative to HASH(), you can try RAND() - it doesn't depend on the ids being uniformly distributed.
For example, this would give you 10 roughly equally sized partitions:
SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]
Verification:
SELECT part, COUNT(*) FROM (
SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]
)
GROUP BY part
ORDER BY part
Each group ends up with about 16465 elements.

Greatest N per group in Open SQL

Selecting the rows from a table by (partial) key with the maximum value in a particular column is a common task in SQL. This question has some excellent answers that cover a variety of approaches to it. Unfortunately I'm struggling to replicate this in my ABAP program.
None of the commonly used approaches seem to be supported:
Joining on a subquery is not supported in syntax: SELECT * FROM X as x INNER JOIN ( SELECT ... ) AS y
Using IN for a composite key is not supported in syntax as far as I know: SELECT * FROM X WHERE (key1, key2) IN ( SELECT key1 key2 FROM ... )
Left join to itself with smaller-than comparison is not supported, outer joins only support EQ comparisons: SELECT * FROM X AS x LEFT JOIN X as xmax ON x-key1 = xmax-key1 AND x-key2 < xmax-key2 WHERE xmax-key IS INITIAL
After trying each of these solutions in turn only to discover that ABAP doesn't seem to support them and being unable to find any equivalents I'm starting to think that I'll have no choice but to dump the data of the subquery to an itab.
What is the best practice for this common programming requirement in ABAP development?
First of all, a specific requirement would get you a better answer. As it happens, I bumped into this question while working on a program that uses 3 distinct methods of pseudo-grouping (while looking for alternatives), and ALL 3 can be used to answer your question, depending on what exactly you need to do. I'm sure there are more ways to do it.
For instance, you can pull maximum values within a group by simply selecting max( your_field ) and grouping by some fields, if that's all you need.
select bname, nation, max( date_from ) from adrp group by bname, nation. "selects the highest date_from for each bname
If you need to use that max value as a filter condition within a query, you can do it by performing pseudo-grouping using a sub-query with max inside it, like this (notice how I move the BNAME check into the sub-query, which means I don't have to check both fields using the IN (subquery) addition):
select ... from adrp as b_adrp "Pulls the latest person info for a user (some conditions are missing, but this is a part of an actual query)
where b_adrp~date_from in (
select max( date_from ) "Highest date_from where both dates are valid
from adrp where persnumber = b_adrp~persnumber and nation = b_adrp~nation and date_from <= @sy-datum )
The query above allows you to select all user info from the base query (whereas the first one only allows you to take aggregated and grouped data).
Finally, if you need to check based on a composite key and compare it to multiple aggregate function results, the implementation will heavily depend on the specifics of your requirement (and since your question has none, I'll provide a generic one). The easiest option is to use exists / not exists instead of in (subquery), in exactly the same way, and form the subquery to check for the existence of a specific key or condition rather than pull a list (you can nest subqueries if you have to):
select * from bkpf where exists ( select 1 from bkpf as b where belnr = bkpf~belnr and gjahr = bkpf~gjahr group by belnr, gjahr having max( budat ) = bkpf~budat ) "Took an available example, that I had in testing program.
All 3 queries will get you max value of a column within a group and in fact, all 3 can use joins to achieve identical results.
Please find my answers below your questions.
Joining on a subquery is not supported in syntax: SELECT * FROM X as x INNER JOIN ( SELECT ... ) AS y
Putting the subquery in your WHERE condition should do the work: SELECT * FROM X AS x INNER JOIN Y AS y ON x~a = y~b WHERE ( SELECT * FROM y WHERE ... )
Using IN for a composite key is not supported in syntax as far as I know: SELECT * FROM X WHERE (key1, key2) IN ( SELECT key1 key2 FROM ... )
You have to split your WHERE clause: SELECT * FROM X WHERE key1 IN ( SELECT key1 FROM y ) AND key2 IN ( SELECT key2 FROM y )
Left join to itself with smaller-than comparison is not supported, outer joins only support EQ comparisons.
Yes, that's right at the moment.
Left join to itself with smaller-than comparison is not supported, outer joins only support EQ comparisons:
SELECT * FROM X AS x LEFT JOIN X as xmax ON x-key1 = xmax-key1 AND x-key2 < xmax-key2 WHERE xmax-key IS INITIAL
This is not true. This SELECT is perfectly valid:
SELECT b1~budat
INTO TABLE lt_bkpf
FROM bkpf AS b1
LEFT JOIN bkpf AS b2
ON b2~belnr < b1~belnr
WHERE b1~bukrs <> ''.
And it has been valid at least since 7.40 SP08 (July 2013), so it was valid at the time you asked this question as well.

sp_executesql vs user defined scalar function

In the table below I am storing some conditions like this:
Then, generally, in the second table, I have the following records:
and what I need is to compare these values using the right condition and store the result (let's say '0' for false and '1' for true) in an additional column.
I am going to do this in a stored procedure, and basically I am going to compare from several to hundreds of records.
One possible solution is to use sp_executesql for each row, building dynamic statements; the other is to create my own scalar function and call it for each row using cross apply.
Could anyone tell me which is the more efficient way?
Note: I know that the best way to answer this is to implement both solutions and test them, but I am hoping there might be an answer to this already, based on things like caching and SQL Server's internal optimizations, which would save me a lot of time because this is only part of a bigger problem.
I don't see the need for sp_executesql in this case. You can obtain the result for all records at once in a single statement:
select Result = case
when ct.Abbreviation='=' and t.ValueOne=t.ValueTwo then 1
when ct.Abbreviation='>' and t.ValueOne>t.ValueTwo then 1
when ct.Abbreviation='>=' and t.ValueOne>=t.ValueTwo then 1
when ct.Abbreviation='<=' and t.ValueOne<=t.ValueTwo then 1
when ct.Abbreviation='<>' and t.ValueOne<>t.ValueTwo then 1
when ct.Abbreviation='<' and t.ValueOne<t.ValueTwo then 1
else 0 end
from YourTable t
join ConditionType ct on ct.ID = t.ConditionTypeID
and update the additional column with something like:
;with cte as (
select t.AdditionalColumn, Result = case
when ct.Abbreviation='=' and t.ValueOne=t.ValueTwo then 1
when ct.Abbreviation='>' and t.ValueOne>t.ValueTwo then 1
when ct.Abbreviation='>=' and t.ValueOne>=t.ValueTwo then 1
when ct.Abbreviation='<=' and t.ValueOne<=t.ValueTwo then 1
when ct.Abbreviation='<>' and t.ValueOne<>t.ValueTwo then 1
when ct.Abbreviation='<' and t.ValueOne<t.ValueTwo then 1
else 0 end
from YourTable t
join ConditionType ct on ct.ID = t.ConditionTypeID
)
update cte
set AdditionalColumn = Result
If the above logic is supposed to be applied in many places, not just over one table, then yes, you may think about a function. Though I would rather use an inline table-valued function (not a scalar one), because there is overhead imposed by user-defined scalar functions (to call and return, and the more rows to be processed, the more time is wasted).
create function ftComparison
(
    @v1 float,
    @v2 float,
    @cType int
)
returns table
as return
select
    Result = case
        when ct.Abbreviation='=' and @v1=@v2 then 1
        when ct.Abbreviation='>' and @v1>@v2 then 1
        when ct.Abbreviation='>=' and @v1>=@v2 then 1
        when ct.Abbreviation='<=' and @v1<=@v2 then 1
        when ct.Abbreviation='<>' and @v1<>@v2 then 1
        when ct.Abbreviation='<' and @v1<@v2 then 1
        else 0
    end
from ConditionType ct
where ct.ID = @cType
which can be applied then as:
select f.Result
from YourTable t
cross apply ftComparison(ValueOne, ValueTwo, t.ConditionTypeID) f
or
select f.Result
from YourAnotherTable t
cross apply ftComparison(SomeValueColumn, SomeOtherValueColumn, @someConditionType) f