I have a table in BigQuery. I have a certain string column which represents a unique id (uid). I want to filter only a sample of this table, by taking only a portion of the uids (let's say 1/100).
So my idea is to sample the data by doing something like this:
if(ABS(HASH(uid)) % 100 == 0) ...
The problem is that this will actually filter at a 1/100 ratio only if the distribution of the hash values is uniform. So, in order to check that, I would like to generate the following table:
(n goes from 0 to 99)
0 <number of rows in which ABS(HASH(uid)) % 100 == 0>
1 <number of rows in which ABS(HASH(uid)) % 100 == 1>
2 <number of rows in which ABS(HASH(uid)) % 100 == 2>
3 <number of rows in which ABS(HASH(uid)) % 100 == 3>
.. etc.
If I see the numbers in each row are of the same magnitude, then my assumption is correct.
Any idea how to create such a query, or alternatively do the sampling another way?
Something like
Select ABS(HASH(uid)) % 100 as cluster , count(*) as cnt
From yourtable
Group each by cluster
If the UID has mixed cases (upper, lower) and types, you can use some string manipulation within the hash. Something like:
Select ABS(HASH(upper(string(uid)))) % 100 as cluster , count(*) as cnt
From yourtable
Group each by cluster
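Once the counts look uniform, the actual 1/100 sampling filter itself would be something along these lines (legacy SQL, with yourtable as a placeholder, reusing the same hash expression as above):
Select *
From yourtable
Where ABS(HASH(upper(string(uid)))) % 100 = 0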
As an alternative to HASH(), you can try RAND() - it doesn't depend on the ids being uniformly distributed.
For example, this would give you 10 roughly equally sized partitions:
SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]
Verification:
SELECT part, COUNT(*) FROM (
SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]
)
GROUP BY part
ORDER BY part
Each group ends up with about 16465 elements.
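If you only need a one-off sample rather than stable partitions, you could also filter on RAND() directly (a sketch; note that the result changes on every run, unlike the hash-based approach):
SELECT word
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 0.1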
WHAT I HAVE
I have a table with the following definition:
CREATE TABLE "Highlights"
(
id uuid,
chunks numeric[][]
)
WHAT I NEED TO DO
I need to query the data in the table using the following predicate:
... WHERE id = 'some uuid' and chunks[????????][1] > 10 and chunks[????????][3] < 20
What should I put instead of [????????] in order to scan all items in the first dimension of the array?
Notes
I'm not entirely sure that chunks[][1] is even close to what I need.
All I need is to test whether a row's chunks column contains a two-dimensional array that has some specific values in any of its sub-arrays.
Maybe there's a better alternative, but this might do - you just go over the first dimension of each array and test your condition:
select *
from highlights as h
where
exists (
select
from generate_series(1, array_length(h.chunks, 1)) as tt(i)
where
-- your condition goes here
h.chunks[tt.i][1] > 10 and h.chunks[tt.i][3] < 20
)
db<>fiddle demo
Update: as @arie-r pointed out, it'd be better to use the generate_subscripts function:
select *
from highlights as h
where
exists (
select *
from generate_subscripts(h.chunks, 1) as tt(i)
where
h.chunks[tt.i][3] = 6
)
db<>fiddle demo
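Since the fiddle contents aren't reproduced here, a minimal setup to try both queries could look like this (the sample values are made up; gen_random_uuid() is built in from PostgreSQL 13, older versions need the pgcrypto extension):
create table highlights (
    id uuid default gen_random_uuid(),
    chunks numeric[][]
);

-- row 1 matches the second query ([3] = 6), row 2 matches the first one ([1] > 10 and [3] < 20)
insert into highlights (chunks) values
    ('{{1,2,6},{7,8,9}}'),
    ('{{11,12,13},{14,15,16}}');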
I am comparing columns from different tables to get the COLUMN_NAME of the MAXIMUM value.
Examples.
These are example tables: Fruit_tb, Vegetable_tb, State_tb, Foods_tb
Under Fruit_tb
fr_id fruit_one fruit_two
1 20 50
Under Vegetables_tb (v = Vegetables)
v_id v_one V_two
1 10 9
Under State_tb
stateid stateOne stateTwo
1 70 87
Under Food_tb
foodid foodOne foodTwo
1 10 3
Now here is the scenario, I want to get the COLUMN NAMES of the max or greatest value in each table.
You could first find the row which contains the max value of a column. For example:
SELECT fr_id , MAX(fruit_one) FROM Fruit_tb GROUP BY fr_id;
In order to find out the max value of a table:
SELECT fr_id ,fruit_one FROM Fruit_tb WHERE fruit_one<(SELECT max(fruit_one ) from Fruit_tb) ORDER BY fr_id DESC limit 1;
A follow up SO for the above scenario.
Maybe you can use GREATEST in order to get the column name which has the max value. But then what I'm not sure is whether you'll be able to retrieve all the columns of different tables at once. You can do something like this to retrieve from a single table:
SELECT CASE GREATEST(`id`,`fr_id`)
WHEN `id` THEN `id`
WHEN `fr_id` THEN `fr_id`
ELSE 0
END AS maxcol,
GREATEST(`id`,`fr_id`) as maxvalue FROM Fruit_tb;
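Applied to the value columns from the question rather than the id columns, it would look something like this (a sketch based on the Fruit_tb example above):
SELECT fr_id,
       CASE GREATEST(`fruit_one`,`fruit_two`)
         WHEN `fruit_one` THEN 'fruit_one'
         WHEN `fruit_two` THEN 'fruit_two'
       END AS maxcol,
       GREATEST(`fruit_one`,`fruit_two`) AS maxvalue
FROM Fruit_tb;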
Maybe this SO could help you. Hope it helps!
I have a dataset and I want to use PostgreSQL to split it into a 70:30 ratio of training and test sets. How can I do that? I used the following code, but it doesn't seem to work:
create table training_test as
(
WITH TEMP as
(
SELECT ROW_NUMBER() AS ROW_ID , Random() as RANDOM_VALUE,D.*
FROM analytics.model_data_discharge_v1 as D
ORDER BY RANDOM_VALUE
)
SELECT 'Training',T.* FROM TEMP T
WHERE ROW_ID <= 493896*0.70
UNION
SELECT 'Test',T.* FROM TEMP T
WHERE ROW_ID > 493896*0.70
) distributed by(hospitalaccountrecord);
select t.*,
case
when random() < 0.7 then 'training'
else 'test'
end as split
from analytics.model_data_discharge_v1 t
Don't use random splitting - it is NOT repeatable! random() will return different results each time.
Instead, you can, for example, use hashing and modulo to split a dataset as Google Cloud suggests.
Hash one of the columns/features that is not correlated with the target using a fingerprint function (to avoid leaving valuable information out of the training set); otherwise, concatenate all the fields as a JSON string and hash on that
Take the absolute value of the hash
Calculate the abs(hash(column)) modulo 10
If the result is < 8, the row goes into the 80% training set
If the result is >= 8 (i.e. 8 or 9), the row goes into the 20% test set
Example using BigQuery (taken from a GCP ML course):
Training set
Test set
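The queries themselves didn't survive here (only the "Training set" / "Test set" captions did); in BigQuery standard SQL they would look roughly like this (a sketch following the GCP course pattern, with project.dataset.table and split_field as hypothetical names):
-- ~80% training set
SELECT *
FROM `project.dataset.table`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(split_field AS STRING))), 10) < 8

-- ~20% test set
SELECT *
FROM `project.dataset.table`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(split_field AS STRING))), 10) >= 8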
That way, you get the same 80% of the data each time.
If you want a stratified split, you can use the following code.
The first bit guarantees that each group has the minimum size to be split.
with ssize as (
select
group
from to_split_table
group by group
having count(*) >= {{ MINIMUM GROUP SIZE }}) -- {{ MINIMUM GROUP SIZE }} = 1 / {{ TEST_THRESHOLD }}
select
id_aux,
ts.group,
case
when
cast(row_number() over (partition by ts.group order by rand()) as double) / cast(count(*) over (partition by ts.group) as double)
< {{ TEST_THRESHOLD }} then 'test'
else 'train'
end as splitting
from to_split_table ts
join ssize
on ts.group = ssize.group
I need to use an UPDATE command to divide rows (selected from a subselect) of a PostgreSQL table into groups; these groups will be identified by an integer value in one of the columns. The groups should be the same size. The source table contains billions of records.
For example I need to divide 213 selected rows into groups, every group should contains 50 records. The result will be:
1 - 50. row => 1
51 - 100. row => 2
101 - 150. row => 3
151 - 200. row => 4
201 - 213. row => 5
It's not a problem to do this with a loop (or with PostgreSQL window functions), but I need it to be very efficient and fast. I can't rely on a sequence in the id column because there can be gaps in those ids.
I had an idea to use a random integer generator and set it as a default value for the column, but that is not usable when I need to adjust the group size.
The query below should display 213 rows with a group number from 0-4. Just add 1 if you want 1-5.
SELECT i, (row_number() OVER () - 1) / 50 AS grp
FROM generate_series(1001,1213) i
ORDER BY i;
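Since the question mentions assigning the group with an UPDATE, one way to apply the same row_number() idea is PostgreSQL's UPDATE ... FROM, joining back on a key column (a sketch; t, id and grp are assumed names):
UPDATE t
SET grp = s.grp
FROM (
    -- integer division of row_number() assigns 50 consecutive rows per group
    SELECT id, (row_number() OVER (ORDER BY id) - 1) / 50 + 1 AS grp
    FROM t
) AS s
WHERE t.id = s.id;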
create temporary sequence s minvalue 0 start with 0;
select *, nextval('s') / 50 grp
from t;
drop sequence s;
I think it has the potential to be faster than the row_number version by @Richard, but the difference might not be relevant depending on the specifics.
I am using SQL 2008 and trying to process the data I have in a table in batches, however, there is a catch. The data is broken into groups and, as I do my processing, I have to make sure that a group will always be contained within a batch or, in other words, that the group will never be split across different batches. It's assumed that the batch size will always be much larger than the group size. Here is the setup to illustrate what I mean (the code is using Jeff Moden's data generation logic: http://www.sqlservercentral.com/articles/Data+Generation/87901)
DECLARE @NumberOfRows INT = 1000,
@StartValue INT = 1,
@EndValue INT = 500,
@Range INT
SET @Range = @EndValue - @StartValue + 1
IF OBJECT_ID('tempdb..#SomeTestTable','U') IS NOT NULL
DROP TABLE #SomeTestTable;
SELECT TOP (@NumberOfRows)
GroupID = ABS(CHECKSUM(NEWID())) % @Range + @StartValue
INTO #SomeTestTable
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
This will create a table with about 435 groups of records, containing between 1 and 7 records each. Now, let's say I want to process these records in batches of 100 records per batch. How can I make sure that my GroupIDs don't get split between different batches? I am fine if each batch is not exactly 100 records; it could be a little more or a little less.
I appreciate any suggestions!
This will result in batches slightly smaller than 100 entries; it removes all groups that aren't entirely within the selection:
WITH cte AS (
    SELECT TOP 100 *
    FROM (
        SELECT GroupID, ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY GroupID) r
        FROM #SomeTestTable
    ) a
    ORDER BY GroupID, r DESC
)
SELECT c1.GroupID
FROM cte c1
JOIN cte c2
  ON c1.GroupID = c2.GroupID
 AND c2.r = 1
It selects the groups with the lowest GroupIDs, limited to 100 entries, into a common table expression along with the row number; then it uses the row number to throw away any groups that aren't entirely in the selection (row number 1 needs to be in the selection for the group to be included, since the row number is ordered descending before cutting with TOP).
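To pull the actual rows of a batch (rather than just the GroupIDs), you could join that result back to the source table; a sketch reusing the same CTE:
-- same batch selection as above, joined back to fetch the rows themselves
WITH cte AS (
    SELECT TOP 100 *
    FROM (
        SELECT GroupID, ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY GroupID) r
        FROM #SomeTestTable
    ) a
    ORDER BY GroupID, r DESC
)
SELECT t.*
FROM #SomeTestTable t
WHERE t.GroupID IN (
    SELECT c1.GroupID
    FROM cte c1
    JOIN cte c2
      ON c1.GroupID = c2.GroupID
     AND c2.r = 1
);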