I have data like the sample below, and I need to flatten it so that each id gets its key and size in two separate columns.
I was watching a Snowflake tutorial that uses this function:
select distinct json.key as column_name
from raw.public.table_name,
lateral flatten(input => table_name) json
I was trying to find an equivalent in PostgreSQL. The data looks like this:
id | json_data
1 | {"KEY": "mekq1232314342134434", "size": 0}
2 | {"KEY": "meksaq12323143421344", "size": 2}
3 | {"KEY": "meksaq12323324421344", "size": 3}
So I need two things here:
1. A distinct list of the keys in these jsonb columns.
2. The jsonb columns flattened into regular columns, like this:
id | KEY | size
1 | mekq1232314342134434 | 0
Another option, besides ->>, would be to use jsonb_to_record:
with
sample_data (id, json) as (values
(1, '{"KEY": "mekq1232314342134434", "size": 0}' :: jsonb),
(2, '{"KEY": "meksaq12323143421344", "size": 2}' :: jsonb),
(3, '{"KEY": "meksaq12323324421344", "size": 3}' :: jsonb)
)
select
id, "KEY", size
from
sample_data,
lateral jsonb_to_record(sample_data.json) as ("KEY" text, size int);
-- `lateral` is optional in this case
┌────┬──────────────────────┬──────┐
│ id │ KEY │ size │
├────┼──────────────────────┼──────┤
│ 1 │ mekq1232314342134434 │ 0 │
│ 2 │ meksaq12323143421344 │ 2 │
│ 3 │ meksaq12323324421344 │ 3 │
└────┴──────────────────────┴──────┘
(3 rows)
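For comparison, the same flattening can be sketched outside the database with Python's standard json module. This is only an illustration of the two requested results (the flattened rows and the distinct set of keys) using the sample rows from the question; the variable names are made up for the example, and it is not a replacement for the SQL above.

```python
import json

# Sample rows from the question: (id, json_data as text).
rows = [
    (1, '{"KEY": "mekq1232314342134434", "size": 0}'),
    (2, '{"KEY": "meksaq12323143421344", "size": 2}'),
    (3, '{"KEY": "meksaq12323324421344", "size": 3}'),
]

# Flatten: one output row per id, with KEY and size as separate columns.
flattened = []
for row_id, raw in rows:
    doc = json.loads(raw)
    flattened.append((row_id, doc["KEY"], doc["size"]))

# Distinct top-level keys across all documents.
distinct_keys = sorted({key for _, raw in rows for key in json.loads(raw)})

for record in flattened:
    print(record)
print(distinct_keys)
```

Like jsonb_to_record, this assumes every document has both keys; a missing key would raise a KeyError here.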
I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
data = [
["x",123, 4.5, "misc"],
["y",456,10.0,"other"],
["z",789,99.5,"value"],
],
columns = ["a","b","c","d"],
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
pl.DataFrame(
data = [('x',123),('y',456)],
columns = {col:tp for col,tp in df.schema.items() if col in ("a","b")},
orient = 'row',
),
on = ["a","b"],
how = "inner", # or 'anti' for "not in"
)
Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next polars release >0.13.44 this will work on the struct datatype.
We convert the 2 (or more) columns we want to check to a struct with pl.struct and call the is_in expression. (A conversion to struct is a free operation)
df = pl.DataFrame(
data=[
["x", 123, 4.5, "misc"],
["y", 456, 10.0, "other"],
["z", 789, 99.5, "value"],
],
columns=["a", "b", "c", "d"],
)
df.filter(
pl.struct(["a", "b"]).is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
Filtering by data in another DataFrame.
The idiomatic way to filter data by presence in another DataFrame is a semi or anti join. Inner joins also filter by presence, but they include the columns of the right-hand DataFrame, whereas a semi join does not and only filters the left-hand side.
semi: keep rows/keys that are in both DataFrames
anti: remove rows/keys that are in both DataFrames
The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.
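To make the semantics concrete, here is a minimal sketch in plain Python, with no polars involved. It only illustrates what a semi join and an anti join keep for a two-column key; the tuples mirror the example frame, and the names rows, wanted, semi and anti are invented for this illustration.

```python
# Rows of the example frame as plain tuples: (a, b, c, d).
rows = [
    ("x", 123, 4.5, "misc"),
    ("y", 456, 10.0, "other"),
    ("z", 789, 99.5, "value"),
]

# The multi-column key to filter on, as a set of (a, b) pairs.
wanted = {("x", 123), ("y", 456)}

# Semi join: keep left rows whose key appears on the right; no columns are added.
semi = [row for row in rows if (row[0], row[1]) in wanted]

# Anti join: keep left rows whose key does NOT appear on the right.
anti = [row for row in rows if (row[0], row[1]) not in wanted]

print(semi)
print(anti)
```

The set lookup plays the role of the right-hand DataFrame: it decides which rows survive but never contributes columns of its own.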
Given the following data, I'm looking to group by and combine two columns into one holding a dictionary. One column supplies the keys, while the values come from another column that is first aggregated into a list.
import polars as pl
data = pl.DataFrame(
{
"names": ["foo", "ham", "spam", "cheese", "egg", "foo"],
"dates": ["1", "1", "2", "3", "3", "4"],
"groups": ["A", "A", "B", "B", "B", "C"],
}
)
>>> print(data)
names dates groups
0 foo 1 A
1 ham 1 A
2 spam 2 B
3 cheese 3 B
4 egg 3 B
5 foo 4 C
# This is what I'm trying to do:
groups combined
0 A {'1': ['foo', 'ham']}
1 B {'2': ['spam'], '3': ['cheese', 'egg']}
2 C {'4': ['foo']}
In pandas I can do this using two groupby statements, and in PySpark using a set of operations around "map_from_entries", but despite various attempts I haven't figured out a way in polars. So far I use agg_list(), convert to pandas and use a lambda. While this works, it certainly doesn't feel right.
data = data.groupby(["groups", "dates"])["names"].agg_list()
data = (
data.to_pandas()
.groupby(["groups"])
.apply(lambda x: dict(zip(x["dates"], x["names_agg_list"])))
.reset_index(name="combined")
)
Alternatively, inspired by this post, I've tried a number of variations similar to the following, including converting the dict to JSON strings, among other things.
data = data.groupby(["groups"]).agg(
pl.apply(exprs=["dates", "names_agg_list"], f=build_dict).alias("combined")
)
With the release of polars>=0.12.10 you can do this:
print(data
.groupby(["groups", "dates"]).agg(pl.col("names").list().keep_name())
.groupby("groups")
.agg([
pl.apply([pl.col("dates"), pl.col("names")], lambda s: dict(zip(s[0], s[1].to_list())))
])
)
shape: (3, 2)
┌────────┬─────────────────────────────────────┐
│ groups ┆ dates │
│ --- ┆ --- │
│ str ┆ object │
╞════════╪═════════════════════════════════════╡
│ A ┆ {'1': ['foo', 'ham']} │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ C ┆ {'4': ['foo']} │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ {'3': ['cheese', 'egg'], '2': ['... │
└────────┴─────────────────────────────────────┘
This is not really how you should be using DataFrames, though. There is likely a solution that lets you work with flatter DataFrames and doesn't require you to put slow Python objects in them.
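If the nested dictionary is only needed at the very end (for example, to serialize it), one option is to keep the DataFrame flat and build the structure in plain Python from its rows. A minimal sketch, assuming the rows have already been pulled out of the DataFrame as (name, date, group) tuples; it does not use polars at all, and the variable names are made up for the example.

```python
from collections import defaultdict

# Flat records from the question: (name, date, group).
records = [
    ("foo", "1", "A"),
    ("ham", "1", "A"),
    ("spam", "2", "B"),
    ("cheese", "3", "B"),
    ("egg", "3", "B"),
    ("foo", "4", "C"),
]

# group -> date -> list of names, built in a single pass.
combined = defaultdict(lambda: defaultdict(list))
for name, date, group in records:
    combined[group][date].append(name)

# Convert to plain dicts for display or serialization.
result = {group: dict(by_date) for group, by_date in combined.items()}
print(result)
```

This keeps the Python objects out of the DataFrame itself, so all column operations before this step stay on native types.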
tab:
num │ value_two │ value_three │ value_four
─────┼───────────┼─────────────┼────────────
1 │ a │ A │ 4.0
2 │ a │ A2 │ 75.0
3 │ b │ A3 │ 7.0
I want to create a 2D json array like this
[[1,"a","A",4.0],[2,"a","A2",75.0],[3,"b","A3",7.0]]
I have tried two things:
First SELECT json_agg(tab) FROM tab but it returns an array of objects.
The second thing that I tried kinda works, the only detail is that it returns a 2d string array.
SELECT json_agg(ARRAY[num::TEXT,value_two,value_three,value_four::TEXT]) FROM tab
[["1","a","A","4.0"],["2","a","A2","75.0"],["3","b","A3","7.0"]]
Short answer:
=# select json_agg(json_build_array(num, value_two, value_three, value_four)) as answer
from tab;
answer
-----------------------------------------------------------------
[[1, "a", "A", 4.0], [2, "a", "A2", 75.0], [3, "b", "A3", 7.0]]
(1 row)
Native PostgreSQL arrays like the one you created with
ARRAY[num::TEXT,value_two,value_three,value_four::TEXT]
are strictly typed, which is why you had to cast num and value_four to text.
To get the type mixing allowed in JSON, use json_build_array(), instead.
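The difference between a typed array and a JSON array can also be seen in a small Python sketch using the standard json module. This is an analogy only, not what PostgreSQL does internally: casting every element to text first gives an all-string result, while keeping the native types gives the mixed array the question asks for.

```python
import json

# Rows of `tab` from the question, keeping each value's native type.
rows = [
    (1, "a", "A", 4.0),
    (2, "a", "A2", 75.0),
    (3, "b", "A3", 7.0),
]

# Mixed types survive: numbers stay numbers, text stays text.
mixed = json.dumps([list(row) for row in rows])

# Casting everything to text first (like ARRAY[num::TEXT, ...]) quotes every element.
as_text = json.dumps([[str(value) for value in row] for row in rows])

print(mixed)
print(as_text)
```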
Is there any way to embed a flag in a select that indicates that it is the first or the last row of a result set? I'm thinking something to the effect of:
> SELECT is_first_row() AS f, is_last_row() AS l FROM blah;
f | l
-----------
t | f
f | f
f | f
f | f
f | t
The answer might be in window functions but I've only just learned about them, and I question their efficiency.
SELECT first_value(unique_column) OVER () = unique_column, last_value(unique_column) OVER () = unique_column, * FROM blah;
seems to do what I want. Unfortunately, I don't even fully understand that syntax, but since unique_column is unique and NOT NULL it should deliver unambiguous results. But if it does sorting, then the cure might be worse than the disease. (Actually, in my tests, unique_column is not sorted, so that's something.)
EXPLAIN ANALYZE doesn't indicate there's an efficiency problem, but when has it ever told me what I needed to know?
And I might need to use this in an aggregate function, but I've just been told window functions aren't allowed there. 😕
Edit:
Actually, I just added ORDER BY unique_column to the above query and the rows identified as first and last were thrown into the middle of the result set. It's as if first_value()/last_value() really means "the first/last value I picked up before I began sorting." I don't think I can safely do this optimally. Not unless a much better understanding of the use of the OVER keyword is to be had.
I'm running PostgreSQL 9.6 in a Debian 9.5 environment.
This isn't a duplicate, because I'm trying to get the first row and last row of the result set to identify themselves, while "Postgres: get min, max, aggregate values in one select" is just going for the minimum and maximum values for a column in a result set.
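You can use the lead() and lag() window functions (over the appropriate window) and compare them to NULL. Keep in mind that lag() is NULL on the first row of the window and lead() is NULL on the last row, so the is_first and is_last aliases in the queries below are named in the reverse sense of the window order: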
-- \i tmp.sql
CREATE TABLE ztable
( id SERIAL PRIMARY KEY
, starttime TIMESTAMP
);
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '1 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '2 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '3 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '4 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '5 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '6 minute');
SELECT id, starttime
, ( lead(id) OVER www IS NULL) AS is_first
, ( lag(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY id )
ORDER BY id
;
SELECT id, starttime
, ( lead(id) OVER www IS NULL) AS is_first
, ( lag(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY starttime )
ORDER BY id
;
SELECT id, starttime
, ( lead(id) OVER www IS NULL) AS is_first
, ( lag(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY starttime )
ORDER BY random()
;
Result:
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
id | starttime | is_first | is_last
----+----------------------------+----------+---------
1 | 2018-08-31 18:38:45.567393 | f | t
2 | 2018-08-31 18:37:45.575586 | f | f
3 | 2018-08-31 18:36:45.587436 | f | f
4 | 2018-08-31 18:35:45.592316 | f | f
5 | 2018-08-31 18:34:45.600619 | f | f
6 | 2018-08-31 18:33:45.60907 | t | f
(6 rows)
id | starttime | is_first | is_last
----+----------------------------+----------+---------
1 | 2018-08-31 18:38:45.567393 | t | f
2 | 2018-08-31 18:37:45.575586 | f | f
3 | 2018-08-31 18:36:45.587436 | f | f
4 | 2018-08-31 18:35:45.592316 | f | f
5 | 2018-08-31 18:34:45.600619 | f | f
6 | 2018-08-31 18:33:45.60907 | f | t
(6 rows)
id | starttime | is_first | is_last
----+----------------------------+----------+---------
2 | 2018-08-31 18:37:45.575586 | f | f
4 | 2018-08-31 18:35:45.592316 | f | f
6 | 2018-08-31 18:33:45.60907 | f | t
5 | 2018-08-31 18:34:45.600619 | f | f
1 | 2018-08-31 18:38:45.567393 | t | f
3 | 2018-08-31 18:36:45.587436 | f | f
(6 rows)
[updated: added a randomly sorted case]
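As a quick check of which function is NULL at which end of the window, here is a small sketch run against SQLite through Python's sqlite3 module rather than PostgreSQL. It assumes the bundled SQLite is 3.25 or newer (the first version with window functions); the table is a stripped-down stand-in for ztable, and the label column is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ztable (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO ztable (label) VALUES (?)",
    [("a",), ("b",), ("c",), ("d",)],
)

# lag() has no previous row on the first row of the window, so it is NULL there;
# lead() has no next row on the last row of the window, so it is NULL there.
flags = conn.execute(
    """
    SELECT id,
           lag(id)  OVER w IS NULL AS is_first,
           lead(id) OVER w IS NULL AS is_last
    FROM ztable
    WINDOW w AS (ORDER BY id)
    ORDER BY id
    """
).fetchall()

for row in flags:
    print(row)
```

SQLite reports the booleans as 1 and 0, so the first row comes back as (1, 1, 0) and the last as (4, 0, 1).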
It is simple using window functions with particular frames:
with t(x, y) as (select generate_series(1,5), random())
select *,
count(*) over (rows between unbounded preceding and current row),
count(*) over (rows between current row and unbounded following)
from t;
┌───┬───────────────────┬───────┬───────┐
│ x │ y │ count │ count │
├───┼───────────────────┼───────┼───────┤
│ 1 │ 0.543995119165629 │ 1 │ 5 │
│ 2 │ 0.886343683116138 │ 2 │ 4 │
│ 3 │ 0.124682310037315 │ 3 │ 3 │
│ 4 │ 0.668972567655146 │ 4 │ 2 │
│ 5 │ 0.266671542543918 │ 5 │ 1 │
└───┴───────────────────┴───────┴───────┘
As you can see, count(*) over (rows between unbounded preceding and current row) returns the number of rows from the beginning of the data set to the current row, and count(*) over (rows between current row and unbounded following) returns the number of rows from the current row to the end of the data set. A value of 1 indicates the first or last row, respectively.
This works until you sort the data set with ORDER BY. In that case you need to repeat the ordering in the frame definitions:
with t(x, y) as (select generate_series(1,5), random())
select *,
count(*) over (order by y rows between unbounded preceding and current row),
count(*) over (order by y rows between current row and unbounded following)
from t order by y;
┌───┬───────────────────┬───────┬───────┐
│ x │ y │ count │ count │
├───┼───────────────────┼───────┼───────┤
│ 1 │ 0.125781774986535 │ 1 │ 5 │
│ 4 │ 0.25046408502385 │ 2 │ 4 │
│ 5 │ 0.538880597334355 │ 3 │ 3 │
│ 3 │ 0.802807193249464 │ 4 │ 2 │
│ 2 │ 0.869908029679209 │ 5 │ 1 │
└───┴───────────────────┴───────┴───────┘
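The frame arithmetic can be mimicked in plain Python to see why the value 1 marks the ends. This is only a sketch of the counting logic, not of how PostgreSQL evaluates window frames; it reuses the (x, y) values from the ordered output above, and the names ordered, flagged, first and last are made up.

```python
# (x, y) pairs taken from the ordered example output.
rows = [
    (1, 0.125781774986535),
    (2, 0.869908029679209),
    (3, 0.802807193249464),
    (4, 0.25046408502385),
    (5, 0.538880597334355),
]

# Equivalent of ORDER BY y, both in the query and in the frames.
ordered = sorted(rows, key=lambda r: r[1])
n = len(ordered)

# For each row: rows from the start up to it, and rows from it to the end.
flagged = [(x, y, i + 1, n - i) for i, (x, y) in enumerate(ordered)]

first = [x for x, _, before, _ in flagged if before == 1]
last = [x for x, _, _, after in flagged if after == 1]
print(first, last)
```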
PS: As mentioned by a_horse_with_no_name in the comment:
there is no such thing as the "first" or "last" row without sorting.
In fact, window functions are a great approach, and they fit this requirement of yours well.
Regarding efficiency, window functions work over the data set already at hand, which means the DBMS only adds some extra processing to infer the first/last values.
Just one thing I'd like to suggest: I like to put an ORDER BY clause inside the OVER clause, to ensure the data set order is the same between executions and therefore returns the same values to you.
Try using
-- Parentheses are required so that each ORDER BY ... LIMIT applies to its own SELECT
(SELECT columns
FROM mytable
Join conditions
WHERE conditions ORDER BY date DESC LIMIT 1)
UNION ALL
(SELECT columns
FROM mytable
Join conditions
WHERE conditions ORDER BY date ASC LIMIT 1)
Using two SELECTs like this cut the processing time roughly in half. An index on the date column can help as well.
Say I have a database table teams with an ordering column position. The position is either null, if the team is the last result, or the id of the next team, which is positioned one higher than that team. This results in a list that is always strictly sorted, and insertion becomes less complicated (with integer positions you have to manage all the other position values when inserting a new team, i.e. increment them all by one).
But retrieving this table in sorted order has proved tricky. Here is where I'm at so far:
WITH RECURSIVE teams AS (
SELECT *, 1 as depth FROM team
UNION
SELECT t.*, ts.depth + 1 as depth
FROM team t INNER JOIN teams ts ON ts."order" = t.id
) -- close the CTE before the outer SELECT; "order" is a reserved word and must be quoted
SELECT
id, "order", depth
FROM
teams
;
Which gets me something like:
id | order | depth
----+-------+-------
53 | 55 | 1
55 | 52 | 1
55 | 52 | 2
52 | 54 | 2
52 | 54 | 3
54 | | 3
54 | | 4
Which kind of reflects where I need to get to in terms of ordering (the max of depth represents the ordering I want...), but I can't work out how to alter the query to get something like:
id | order | depth
----+-------+-------
53 | 55 | 1
55 | 52 | 2
52 | 54 | 3
54 | | 4
It seems that however I change the query, it complains at me about applying a GROUP BY across both id and depth. How do I get from where I am now to where I want to be?
Your recursive query should start somewhere (for now you are selecting the whole table in the first subquery). I propose starting from the last record, where the order column is null, and walking back to the first record:
with recursive team(id, ord) as (values(53,55),(55,52),(52,54),(54,null)),
teams as (
select *, 1 as depth from team where ord is null -- select the last record here
union all
select t.*, ts.depth + 1 as depth
from team t join teams ts on ts.id = t.ord) -- note that the JOIN condition is reversed compared to the original query
select * from teams order by depth desc; -- finally reverse the order
┌────┬──────┬───────┐
│ id │ ord │ depth │
├────┼──────┼───────┤
│ 53 │ 55 │ 4 │
│ 55 │ 52 │ 3 │
│ 52 │ 54 │ 2 │
│ 54 │ ░░░░ │ 1 │
└────┴──────┴───────┘
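The same walk can be sketched in plain Python to show what the recursion does step by step. This mirrors the idea of the query (start at the record whose ord is null, follow the pointers backwards, then reverse); it is an illustration only, and the names previous_of, walk and ordered_ids are invented for it.

```python
# (id, ord) pairs from the example: ord is the id of the next team, None for the last.
team = [(53, 55), (55, 52), (52, 54), (54, None)]

# Reverse the pointers: for each "next" id, remember which team points at it.
previous_of = {nxt: team_id for team_id, nxt in team}

# Start from the last record (ord is NULL) and walk backwards, counting depth.
walk = []
current = previous_of[None]
depth = 1
while current is not None:
    walk.append((current, depth))
    current = previous_of.get(current)
    depth += 1

# Reverse so the head of the list comes first, as ORDER BY depth DESC does.
ordered_ids = [team_id for team_id, _ in reversed(walk)]
print(ordered_ids)
```

Each loop iteration corresponds to one round of the recursive term: it finds the single team whose ord points at the team found in the previous round.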