How to group by more than 64 keys in BigQuery

Using Google-BigQuery, I created a query with almost 100 fields, grouping by 96 of them:
SELECT
field1,field2,(...),MAX(field100) as max100
FROM dataset.table1
GROUP BY field1,field2,(...),field96
and I got this error
Error: Maximum number of keys in GROUP BY clause is 64, query has 96 GROUP BY keys.
So it seems there is no way to group by more than 64 fields in BigQuery. Any suggestions?

If some of these fields are strings, and there is a character that cannot appear in them (say, ':'), then you could concatenate them and group by the concatenated value, so that several fields count as a single key, i.e.
SELECT CONCAT(field1, ':', field2, ':', field3) as composite_field, ...
FROM dataset.table
GROUP BY 1, 2, ..., 64
In order to recover the original fields later, you could use
SELECT
regexp_extract(composite_field, r'([^:]*):') field1,
regexp_extract(composite_field, r'[^:]*:([^:]*)') field2,
regexp_extract(composite_field, r'[^:]*:[^:]*:(.*)') field3,
...
FROM (...)
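The concatenate-then-split idea can be modelled outside BigQuery. The following Python sketch is only an illustration of the logic, not BigQuery code: the sample rows, the helper names make_key and split_key, and the choice of ':' as separator are all assumptions.

```python
SEP = ":"  # assumed never to occur inside any field value

def make_key(*fields):
    """Join the grouping fields into one composite string key."""
    for field in fields:
        if SEP in field:
            raise ValueError(f"separator {SEP!r} found in field {field!r}")
    return SEP.join(fields)

def split_key(key):
    """Recover the original fields from a composite key."""
    return key.split(SEP)

# Hypothetical rows: three string grouping fields plus one value to aggregate.
rows = [
    ("a", "x", "p", 3),
    ("a", "x", "p", 7),
    ("b", "y", "q", 5),
]

# Rough equivalent of GROUP BY composite_field with MAX(value).
max_by_key = {}
for f1, f2, f3, value in rows:
    key = make_key(f1, f2, f3)
    max_by_key[key] = max(max_by_key.get(key, value), value)

# Split each key back into its fields, as the regexp_extract step does.
result = [(*split_key(key), best) for key, best in max_by_key.items()]
```

The guard in make_key makes the precondition explicit: if the separator could appear in a value, two different field combinations could produce the same key.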

This appears to be an internal limit that is not documented.
Another solution, which I developed, is similar to Mosha's.
Add an extra column called, for example, hashref. It is computed from all the columns you want to group by: join them with a separator such as a pipe, then apply md5 or sha256 to the resulting line.
You can then group by hashref alone and apply min(), which is also an aggregate function, to each of the other columns.
line = name + "|" + surname + "|" + age
hashref = md5(line)
... and then ...
SELECT hashref, min(name), min(surname)
FROM mytable
GROUP BY hashref
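As a rough illustration of how hashref could be computed before loading the data, here is a minimal Python sketch. The sample rows and the function name hashref are assumptions made for the example; only hashlib.md5 is a real library call.

```python
import hashlib

def hashref(*fields):
    """md5 of the grouping fields joined with a pipe, as in the pseudocode above."""
    line = "|".join(str(field) for field in fields)
    return hashlib.md5(line.encode("utf-8")).hexdigest()

# Hypothetical rows: (name, surname, age).
rows = [
    ("Ann", "Lee", 30),
    ("Ann", "Lee", 30),
    ("Bob", "Ray", 41),
]

# Rough equivalent of GROUP BY hashref.
groups = {}
for row in rows:
    groups.setdefault(hashref(*row), []).append(row)

# Every row in a group has the same field values, so min() simply returns them.
summary = {
    h: (min(r[0] for r in g), min(r[1] for r in g), len(g))
    for h, g in groups.items()
}
```

One caveat follows directly from the construction: the fields ("a|b", "c") and ("a", "b|c") produce the same line and therefore the same hash, so the separator should be a character that cannot occur in the data.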

Related

Redshift how to split a stringified array into separate parts

Say I have a varchar column, religions, that holds a value like ["Christianity", "Buddhism", "Judaism"] (yes, the brackets are part of the string). I want to split that string (it is not an array) into multiple rows, "Christianity", "Buddhism", "Judaism", so the values can be used in a WHERE clause.
Eventually I want to use the results of the query in a where clause like this:
SELECT ...
FROM religions
WHERE name in
(
<this subquery>
)
How can one do this?
You can use the function JSON_PARSE to convert the varchar string into an array. Then you can use the strategy described in Convert varchar array to rows in redshift - Stack Overflow to convert the array to separate rows.
You can do the following.
Create a temporary table with sequence of numbers
Using the sequence and split_part function available in redshift, you can split the values based on the numbers generated in the temporary table by doing a cross join.
To replace the double quote and square brackets, you can use the regexp_replace function in Redshift.
create temp table seq as
with recursive numbers(NUMBER) as
(
select 1 UNION ALL
select NUMBER + 1 from numbers where NUMBER < 28
)
select * from numbers;
select regexp_replace(split_part(val,',',seq.number),'[]["]','') as value
from
(select '["christianity","Buddhism","Judaism"]' as val) -- You can select the actual column from the table here.
cross join
seq
where seq.number <= regexp_count(val,'[,]')+1;
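To see what the two approaches do to the string, here is a small Python model of them. It is not Redshift code: str.split and re.sub stand in for split_part and regexp_replace, and json.loads stands in for JSON_PARSE.

```python
import json
import re

val = '["christianity","Buddhism","Judaism"]'

# Same idea as the split_part / regexp_replace query:
# split on commas, then strip brackets and double quotes from each piece.
by_split = [re.sub(r'[\[\]"]', "", part) for part in val.split(",")]

# Same idea as parsing the string as JSON first.
by_json = json.loads(val)

# With a space after each comma, as in the question, the split approach
# keeps the leading space while JSON parsing does not.
val_spaced = '["Christianity", "Buddhism", "Judaism"]'
by_split_spaced = [re.sub(r'[\[\]"]', "", part) for part in val_spaced.split(",")]
by_json_spaced = json.loads(val_spaced)
```

If the real column has spaces after the commas, the split-based query would likely need an extra trim, and a value that itself contains a comma would be cut in two.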

PostgreSQL calculate prefix combinations after split

I have an input string of the form foo:bar:something:221. I'm looking for a way to generate a table with all prefixes of this string, like:
foo
foo:bar
foo:bar:something
foo:bar:something:221
I wrote the following query to split the string, but can't figure out where to go from there:
select unnest(string_to_array('foo:bar:something:221', ':'));
An option is to simulate a loop over all elements, then take the sub-array from the input for each element index:
with data(input) as (
values (string_to_array('foo:bar:something:221', ':'))
)
select array_to_string(input[1:g.idx], ':')
from data
cross join generate_series(1, cardinality(input)) as g(idx);
generate_series(1, cardinality(input)) generates as many rows as the array has elements, and the expression input[1:g.idx] takes the sub-array from the first element up to element idx. Since the result is an array, array_to_string re-creates the ':'-separated representation.
You can use string_agg as a window function. The default frame is from the beginning of the partition to the current row:
SELECT string_agg(s, ':') OVER (ORDER BY n)
FROM unnest(string_to_array('foo:bar:something:221', ':')) WITH ORDINALITY AS u(s, n);
string_agg
-----------------------
foo
foo:bar
foo:bar:something
foo:bar:something:221
(4 rows)
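Both answers can be mirrored in a few lines of Python, which may help when checking the expected output. This is a sketch of the logic only; the function names are made up for the example.

```python
from itertools import accumulate

def prefixes_by_slicing(text, sep=":"):
    """Like input[1:g.idx]: join the first i parts for every i."""
    parts = text.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]

def prefixes_by_running_join(text, sep=":"):
    """Like string_agg(...) OVER (ORDER BY n): a running concatenation."""
    return list(accumulate(text.split(sep), lambda acc, part: acc + sep + part))

print(prefixes_by_slicing("foo:bar:something:221"))
```

The slicing version rebuilds each prefix from scratch, while the running join extends the previous prefix, which is the same difference as between the sub-array query and the window function.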

SPLIT_PART with a negative value [Postgres 9.5]

I need to use the split_part function on that query:
CREATE TABLE client_group_by_group_test AS SELECT *, SPLIT_PART( groupe,
',', 1 ) AS group1, SPLIT_PART(SPLIT_PART(groupe,',',2),',',-1) AS
group2, SPLIT_PART(SPLIT_PART(groupe,'',3),'',-3) AS group3,
SPLIT_PART(groupe,'',-4) AS group4 FROM planification_client
but it gives me the following error:
ERROR: field position must be greater than zero
So, how can I deal with negative values here?
Could a statement like reverse(split_part(reverse(col_A), '_'::text, 1)) work? I'm referring to that question.
EDIT: I'm completely stuck with this query.
More details: I have one column with the server name and another with its groups, separated by commas.
server name| group |
-----------+------------------------------+
XPTERTBIEP9|GRNW_SPO_S_F_H, GRNW_SPO_S_I_J|
The output I need: if a server has multiple groups, each group should go in its own column, such as group1, group2, and so on.
server name| group |group1 |group 2
-----------+------------------------------+--------------+--------------
XPTERTBIEP9|GRNW_SPO_S_F_H, GRNW_SPO_S_I_J|GRNW_SPO_S_F_H|GRNW_SPO_S_I_J
If the negative number is supposed to indicate the offset from the end, a two-step approach might be better:
CREATE TABLE client_group_by_group_test
AS
SELECT ...,
agroups[1] as group1,
-- arr[cardinality(arr)] is the last element; adjust the offsets to your convention
agroups2[cardinality(agroups2) - 1] as group2,
agroups3[cardinality(agroups3) - 3] as group3,
agroups[cardinality(agroups) - 4] as group4
from (
select *,
string_to_array(groupe, ',') as agroups,
-- split first, pick one element, then split that element again
string_to_array((string_to_array(groupe, ','))[2], ',') as agroups2,
string_to_array((string_to_array(groupe, ','))[3], ',') as agroups3
from planification_client
) t
Note that you need to list the desired columns in the outermost SELECT to exclude the intermediate "agroups" columns.
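The intended semantics can be pinned down with a small Python model of split_part that accepts negative positions. It assumes that -1 means the last field, which is the convention PostgreSQL 14 and later use for split_part itself; the function below is an illustration, not PostgreSQL code.

```python
def split_part(text, delimiter, n):
    """Return the n-th field (1-based); a negative n counts from the end."""
    if n == 0:
        raise ValueError("field position must not be zero")
    parts = text.split(delimiter)
    index = n - 1 if n > 0 else len(parts) + n
    if 0 <= index < len(parts):
        return parts[index]
    return ""  # out-of-range positions yield an empty string

groupe = "GRNW_SPO_S_F_H, GRNW_SPO_S_I_J"
group1 = split_part(groupe, ",", 1).strip()
group2 = split_part(groupe, ",", -1).strip()
```

Under this convention the array equivalent of position -k is arr[cardinality(arr) - k + 1], so it is worth checking which convention the offsets in a query assume.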

Perl + PostgreSQL-- Selective Column to Row Transpose

I'm trying to find a way to use Perl to further process PostgreSQL output. If there's a better way to do this in PostgreSQL itself, please let me know. I basically need to concatenate the values of certain columns (Realtime, Value) into a single row per group while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID  CAT  Realtime  Value
A   1    time1     55
A   1    time2     57
B   1    time3     75
C   2    time4     60
C   3    time5     66
C   3    time6     67
Output:
ID  CAT  Time         Values
A   1    time1,time2  55,57
B   1    time3        75
C   2    time4        60
C   3    time5,time6  66,67
You could do this most simply in Postgres like so (using array columns)
CREATE TEMP TABLE output AS SELECT
id, cat, ARRAY_AGG(realtime) as time, ARRAY_AGG(value) as values
FROM input GROUP BY id, cat;
Then select whatever you want out of the output table.
SELECT id
, cat
, string_agg(realtime, ',') AS realtimes
, string_agg(value, ',') AS values
FROM input
GROUP BY 1, 2
ORDER BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values into a delimiter-separated string, while array_agg() (v8.4+) creates an array out of the input values.
About 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.
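For comparison with the Perl post-processing the question had in mind, here is a minimal Python sketch of the same aggregation. The rows are the question's sample input; the variable names are assumptions.

```python
from collections import defaultdict

# Sample input from the question: (id, cat, realtime, value).
rows = [
    ("A", 1, "time1", 55),
    ("A", 1, "time2", 57),
    ("B", 1, "time3", 75),
    ("C", 2, "time4", 60),
    ("C", 3, "time5", 66),
    ("C", 3, "time6", 67),
]

# Collect realtime and value per (id, cat), like GROUP BY 1, 2.
grouped = defaultdict(lambda: ([], []))
for id_, cat, realtime, value in rows:
    times, values = grouped[(id_, cat)]
    times.append(realtime)
    values.append(str(value))

# Join each list with commas, like string_agg(..., ','), ordered like ORDER BY 1, 2.
output = [
    (id_, cat, ",".join(times), ",".join(values))
    for (id_, cat), (times, values) in sorted(grouped.items())
]
```

Doing the aggregation in the database is usually simpler, but the sketch shows that the result is just a group-and-join over the (id, cat) pairs.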

SQL basic full-text search

I have not worked much with TSQL or the full-text search feature of SQL Server so bear with me.
I have a table nvarchar column (Col) like this:
Col ... more columns
Row 1: '1'
Row 2: '1|2'
Row 3: '2|40'
I want to do a search to match similar users. So if I have a user that has a Col value of '1' I would expect the search to return the first two rows. If I had a user with a Col value of '1|2' I would expect to get Row 2 returned first and then Row 1. If I try to match users with a Col value of '4' I wouldn't get any results. I thought of doing a 'contains' by splitting the value I am using to query but it wouldn't work since '2|40' contains 4...
I looked up the documentation on using the 'FREETEXT' keyword but I don't think that would work for me since I essentially need to break up the Col values into words using the '|' as a break.
Thanks,
John
You should not pack two values into one field as '1|2'. If you have at most 2 values, use 2 fields to store them. If you can have zero to many values, store them in a new table with a foreign key pointing to the primary key of your table.
If you have at most 2 values per row, you can find your data like this:
DECLARE @s VARCHAR(3) = '1'
SELECT *
FROM <table>
WHERE @s IN(
PARSENAME(REPLACE(col, '|', '.'), 1),
PARSENAME(REPLACE(col, '|', '.'), 2)
--,PARSENAME(REPLACE(col, '|', '.'), 3) -- if col can contain 3
--,PARSENAME(REPLACE(col, '|', '.'), 4) -- or 4 values this can be used
)
Parsename can handle max 4 values. If 'col' can contain more than 4 values use this
DECLARE @s VARCHAR(3) = '1'
SELECT *
FROM <table>
WHERE '|' + col + '|' like '%|' + @s + '|%'
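The reason for wrapping both sides in the delimiter is easiest to see with the question's own data. This Python sketch models the LIKE test; the function name has_token is made up for the example.

```python
def has_token(col, token, sep="|"):
    """True if token is one of the sep-delimited values in col.

    Wrapping both strings in the separator prevents '4' from matching '40'.
    """
    return f"{sep}{token}{sep}" in f"{sep}{col}{sep}"

# Col values from the question.
cols = ["1", "1|2", "2|40"]

matches_for_1 = [c for c in cols if has_token(c, "1")]
matches_for_4 = [c for c in cols if has_token(c, "4")]
```

A plain substring test would report '2|40' as containing '4'; with the delimiters added, only whole values match.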
This returns the left-hand and right-hand sides of the value; you still need to wrap it in a CASE for values that contain no '|':
select left('2|10', CHARINDEX('|', '2|10') - 1)
select right('2|10', LEN('2|10') - CHARINDEX('|', '2|10'))