Kusto query which calculates percentages of values by keys

I want to calculate percentages of values by keys. An example would be, given a table like:
datatable (key: string, value: string)
[
"a","1",
"a","2",
"b","x",
"b","x",
"b","x",
"b","y",
]
I want to get results like:
[
"a","1",.5,
"a","2",.5,
"b","x",.75,
"b","y",.25,
]
I understand how to use as and toscalar to get percentages of values across all keys, but I can't figure out how to make that work by keys.

We need to use a join between aggregations at two levels:
datatable (key: string, value: string)
[
"a","1",
"a","2",
"b","x",
"b","x",
"b","x",
"b","y",
]
| summarize count() by key, value
| as summarize_by_key_value
| summarize sum(count_) by key
| join kind=inner summarize_by_key_value on key
| project key, value, percentage = 1.0 * count_ / sum_count_
key    value    percentage
a      1        0.5
a      2        0.5
b      x        0.75
b      y        0.25

Related

Combine JSONB array of values by consecutive pairs

In PostgreSQL, I have a simple data store with a single JSONB column:
data
----------------------------
{"foo": [1,2,3,4]}
{"foo": [10,20,30,40,50,60]}
...
I need to convert consequent pairs of values into data points, essentially calling the array variant of ST_MakeLine like this: ST_MakeLine(ARRAY(ST_MakePoint(10,20), ST_MakePoint(30,40), ST_MakePoint(50,60))) for each row of the source data.
Needed result (note that the x,y order of each point might need to be reversed):
data                          geometry (after decoding)
----------------------------  --------------------------
{"foo": [1,2,3,4]}            LINE (1 2, 3 4)
{"foo": [10,20,30,40,50,60]}  LINE (10 20, 30 40, 50 60)
...
Partial solution
I can already iterate over individual array values, but it is the pairing that is giving me trouble. Also, I am not certain if I need to introduce any ordering into the query to preserve the original ordering of the array elements.
SELECT ARRAY(
    SELECT elem::int
    FROM jsonb_array_elements(data -> 'foo') elem
) AS arr
FROM mytable;
You can achieve this by using window functions lead or lag, then picking only every second row:
SELECT (
    SELECT array_agg((a, b) ORDER BY o)
    FROM (
        SELECT elem::int AS a, lead(elem::int) OVER (ORDER BY o) AS b, o
        FROM jsonb_array_elements(data -> 'foo') WITH ORDINALITY els(elem, o)
    ) AS pairs
    WHERE o % 2 = 1
) AS arr
FROM example;
And yes, I would recommend specifying the ordering explicitly, making use of WITH ORDINALITY.
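If the end goal is the PostGIS line itself, the same pairing subquery can feed ST_MakePoint and the array variant of ST_MakeLine mentioned in the question. A minimal sketch, assuming PostGIS is installed and that x precedes y in the array (the question notes the order might need to be reversed):
SELECT (
    SELECT ST_MakeLine(array_agg(ST_MakePoint(a, b) ORDER BY o))  -- build the line from the ordered points
    FROM (
        SELECT elem::int AS a, lead(elem::int) OVER (ORDER BY o) AS b, o
        FROM jsonb_array_elements(data -> 'foo') WITH ORDINALITY els(elem, o)
    ) AS pairs
    WHERE o % 2 = 1                                               -- keep every second element as the start of a pair
) AS geom
FROM example;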

When merging two queries in Power BI, can I exact match on one key and fuzzy match on a second key?

I am merging two tables in Power BI where I want to have an exact match on one field, then a fuzzy match on a second field.
In the example below, I want there to be an exact match on the "Key" columns in Table 1 and Table 2. In Table 2, the "Key" column is not a unique identifier and can have multiple names associated with a key. So, I then want to fuzzy match on the name column. Is there a way to do this in Power BI?
Table 1
Key  Name1    info_a
1    Michael  a
2    Robert   b
Table 2
Key  Name2     info_b
1    Mike      aa
1    Andrea    cc
2    Robbie    bb
2    Michelle  dd
Result
Key  Name1    Name2   info_a  info_b
1    Michael  Mike    a       aa
2    Robert   Robbie  b       bb
I ended up using a Python script to solve this problem.
I merged Table 1 and Table 2 on the field ("Key") where an exact match was required.
Then I added this Python script:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

def get_fuzz_score(
    df: pd.DataFrame, col1: str, col2: str, scorer=fuzz.token_sort_ratio
) -> pd.Series:
    """
    Parameters
    ----------
    df: pd.DataFrame
    col1: str, name of column from df
    col2: str, name of column from df
    scorer: fuzzywuzzy scorer (e.g. fuzz.ratio, fuzz.WRatio, fuzz.partial_ratio, fuzz.token_sort_ratio)

    Returns
    -------
    scores: pd.Series
    """
    scores = []
    for _, row in df.iterrows():
        if row[col1] in [np.nan, None] or row[col2] in [np.nan, None]:
            scores.append(None)
        else:
            scores.append(scorer(row[col1], row[col2]))
    return pd.Series(scores, index=df.index)

dataset['fuzzy_score'] = get_fuzz_score(dataset, 'Name1', 'Name2', fuzz.WRatio)
dataset['MatchRank'] = dataset.groupby(['Key'])['fuzzy_score'].rank('first', ascending=False)
Then I could just consider the matches where MatchRank = 1

postgresql: create random integers between 1 and 100, including null values

I want to generate a table with 1000 rows of
-- random int between `1-100` (including 1 and 100)
-- random int between `1-100` and also null (including 1 and 100)
-- random float between `0-100` (including 0 and 100)
-- random float between `0-100`and also null (including 0 and 100)
-- random Male (M) and Female (F) i.e M/F values
-- random Male (M) and Female (F) including null/empty i.e M/F values
-- random names of cities from a list (i.e newyork, london, mumbai, dubai etc)
-- random names of cities from a list including null/empty (i.e newyork, london, mumbai, dubai etc)
Currently I know:
create table foo as
select random() as test
from generate_series(1,1000) s(i);
How can I do this?
You can use multiplication and type casts or CASE expressions.
To get an integer between 42 and 1001:
42 + CAST (floor(random() * 960) AS integer)
I cannot think of a way to generate a double precision value that includes the upper bound, but then you never need that with double precision.
To get m or f evenly distributed:
CASE WHEN random() < 0.5 THEN 'm' ELSE 'f' END
For the cities, select a random entry from a lookup table.
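Putting those pieces together, a minimal sketch of the requested 1000-row table could look like this; the column names, the 10% NULL share, and the city list are illustrative assumptions rather than anything from the original answer:
create table foo as
select
    1 + cast(floor(random() * 100) as integer)                   as int_1_100,
    case when random() < 0.1 then null
         else 1 + cast(floor(random() * 100) as integer) end     as int_1_100_or_null,
    random() * 100                                                as float_0_100,
    case when random() < 0.1 then null else random() * 100 end    as float_0_100_or_null,
    case when random() < 0.5 then 'M' else 'F' end                as gender,
    case when random() < 0.1 then null
         when random() < 0.5 then 'M' else 'F' end                as gender_or_null,
    (array['newyork', 'london', 'mumbai', 'dubai'])[1 + floor(random() * 4)::int] as city
from generate_series(1, 1000) s(i);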

Convert jsonb comma separated values into a json object using a psql script

I have a table in postgresql that has two columns:
Table "schemaname.tablename"
Column | Type | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
_key | character varying | | not null |
value | jsonb | | |
Indexes:
"tablename_pkey" PRIMARY KEY, btree (_key)
and I'd like to convert a nested property value of the jsonb that looks like this:
{
    "somekey": "[k1=v1, k2=v2, k3=v2]"
}
into this:
{
    "somekey": [
        "java.util.LinkedHashMap",
        {
            "k1": "v1",
            "k2": "v2",
            "k3": "v3"
        }
    ]
}
I've managed to parse the comma-separated string into an array of strings, but aside from still having to apply another split on '=', I don't really know how to do the actual UPDATE on all rows of the table and generate the proper jsonb value for the "somekey" key.
select regexp_split_to_array(RTRIM(LTRIM(value->>'somekey','['),']'),',') from schemaname.tablename;
Any ideas?
Try this one (self-contained test data):
WITH tablename (_key, value) AS (
    VALUES
        ('test',   '{"somekey":"[k1=v1, k2=v2, k3=v2]"}'::jsonb),
        ('second', '{"somekey":"[no one=wants to, see=me, with garbage]"}'::jsonb),
        ('third',  '{"somekey":"[some,key=with a = in it''s value, some=more here]"}'::jsonb)
)
SELECT
    tab._key,
    jsonb_insert(
        '{"somekey":["java.util.LinkedHashMap"]}', -- basic JSON structure
        '{somekey,0}',                             -- path of the element to insert after
        jsonb_object(                              -- create a JSONB object on-the-fly from the key-value array
            array_agg(key_values)                  -- aggregate all key-value rows into one array
        ),
        true                                       -- we want to insert after the matching element, not before it
    ) AS json_transformed
FROM
    tablename AS tab,
    -- the following is an implicit LATERAL join (the function is evaluated for each row of the previous table)
    regexp_matches(                                -- produces multiple rows
        btrim(tab.value->>'somekey', '[]'),        -- as you started with
        '(\w[^=]*)=([^,]*)',                       -- regular expression groups for keys and values
        'g'                                        -- we want all key-value sets
    ) AS key_values
GROUP BY 1
;
...resulting in:
_key | json_transformed
--------+-------------------------------------------------------------------------------------------------------
second | {"somekey": ["java.util.LinkedHashMap", {"see": "me", "no one": "wants to"}]}
third | {"somekey": ["java.util.LinkedHashMap", {"some": "more here", "some,key": "with a = in it's value"}]}
test | {"somekey": ["java.util.LinkedHashMap", {"k1": "v1", "k2": "v2", "k3": "v2"}]}
(3 rows)
I hope the inline comments explain how it works in enough detail.
Without requiring aggregate/group by:
The following variant requires no grouping because it does not need the aggregate function array_agg, but it is a little less strict about the key-value format and will break easily on some data (the previous variant just drops the offending key-value pairs):
WITH tablename (_key, value) AS (
    VALUES
        ('test',   '{"somekey":"[k1=v1, k2=v2, k3=v2]"}'::jsonb),
        ('second', '{"somekey":"[no one=wants to, see=me, with garbage]"}'::jsonb)
)
SELECT
    _key,
    jsonb_insert(
        '{"somekey":["java.util.LinkedHashMap"]}', -- basic JSON structure
        '{somekey,0}',                             -- path of the element to insert after
        jsonb_object(                              -- create a JSONB object on-the-fly from the key-value array
            key_values                             -- take the keys + values as split by the function
        ),
        true                                       -- we want to insert after the matching element, not before it
    ) AS json_transformed
FROM
    tablename AS tab,
    -- the following is an implicit LATERAL join (the function is evaluated for each row of the previous table)
    regexp_split_to_array(                         -- produces an array of keys and values: [k, v, k, v, ...]
        btrim(tab.value->>'somekey', '[]'),        -- as you started with
        '(=|,\s*)'                                 -- regex matching both separators
    ) AS key_values
;
...resulting in:
_key | json_transformed
--------+--------------------------------------------------------------------------------
test | {"somekey": ["java.util.LinkedHashMap", {"k1": "v1", "k2": "v2", "k3": "v2"}]}
second | {"somekey": ["java.util.LinkedHashMap", {"see": "me", "no one": "wants to"}]}
(2 rows)
Feeding it with garbage (as in the "second" row before) or with an = character in the value (as in the "third" row before) would result in the following error here:
ERROR: array must have even number of elements
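Since the question also asks how to run the actual UPDATE on all rows, here is a hedged sketch of the first (grouped) variant rewritten as an UPDATE against the question's table; it assumes "somekey" should be overwritten in place via jsonb_set, and rows whose value yields no key-value matches are simply left untouched:
UPDATE schemaname.tablename AS t
SET value = jsonb_set(t.value, '{somekey}', sub.new_somekey)
FROM (
    SELECT
        tab._key,
        jsonb_insert(
            '["java.util.LinkedHashMap"]',       -- target array
            '{0}',                               -- position of the first element
            jsonb_object(array_agg(key_values)), -- object built from all key-value pairs
            true                                 -- insert after that element
        ) AS new_somekey
    FROM schemaname.tablename AS tab,
         regexp_matches(
             btrim(tab.value->>'somekey', '[]'),
             '(\w[^=]*)=([^,]*)',
             'g'
         ) AS key_values
    GROUP BY tab._key
) AS sub
WHERE t._key = sub._key;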

Proper way to archive large JSON objects in a PostgreSQL table that will be API accessible?

I've been working at this problem for a bit now. I'm working on a statistics website as a hobby for a game that I play.
Basically I have a script accessing the game's API every 5 minutes (probably going to increase this to 15 minutes) and pulling the current state of all the matches at once. I was originally storing this object as a JSON column in my table. (Each row then had a 118kb object in the JSON column)
The problem was trying to query the table to get the entire archive for a one-week period (which is the duration of a match). Basically, it was pulling 2016 records of 118 kB each for a week-long match-up when all I wanted was a specific key out of the JSON. Requests to this API endpoint were taking about 10 seconds to complete!
I've only found ways in PostgreSQL to query a row based on a JSON key, but not a way to do something like SELECT match.kills FROM matches WHERE....
I've realized that that's not going to work so I want to try to take keys from the JSON objects and insert them into the corresponding table column.
The JSON object skeleton looks like this:
{
    id: string,
    start_time: timestamp,
    end_time: timestamp,
    scores: {
        green: number,
        blue: number,
        red: number
    },
    worlds: number[],
    all_worlds: number[][],
    deaths: {
        green: number,
        blue: number,
        red: number
    },
    kills: {
        green: number,
        blue: number,
        red: number
    },
    maps: [
        {
            id: number,
            type: string,
            scores: same as above,
            bonuses: {
                type: string,
                owner: string
            },
            deaths: same as above,
            kills: same as above,
            objectives: [
                {
                    id: string,
                    type: string,
                    owner: string,
                    last_flipped: timestamp,
                    claimed_by: guild id (put this into another api endpoint),
                    claimed_at: timestamp
                },
                ... (repeat 17 times)
            ]
        },
        ... (repeat 3 times)
    ]
}
So I want to store this in my database with the keys as columns, but I'm not quite sure how to accomplish it for keys with values of the type object.
The end goal is to store this in a way that I'll have an API accessible by a URL such as:
mywebsite.com/api/v1/matcharchive?data=kills,deaths,score&matchid=1-1&archive_time=2016-07-09T02:00:00Z
and it will query the database for only those 3 keys in the object and return them.
What is the proper way to store a JSON object with this many keys into a PSQL table?
You just need to use the -> operator on the json field. This example is slightly edited so the auto-increment keys are a little off.
host=# create table tmp1 ( id serial primary key, data json);
CREATE TABLE
host=# \d tmp1
Table "public.tmp1"
Column | Type | Modifiers
--------+---------+---------------------------------------------------
id | integer | not null default nextval('tmp1_id_seq'::regclass)
data | json |
Indexes:
"tmp1_pkey" PRIMARY KEY, btree (id)
host=# insert into tmp1 (data) values ('{"a":1, "b":2}'), ('{"a":3, "b":4}'), ('{"a":5, "c":6}');
INSERT 0 3
host=# select * from tmp1;
id | data
----+----------------
2 | {"a":1, "b":2}
3 | {"a":3, "b":4}
4 | {"a":5, "c":6}
(3 rows)
host=# select id, data->'b' from tmp1;
id | ?column?
----+----------
2 | 2
3 | 4
4 |
(3 rows)
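Applied to the match JSON from the question, the same operator chain lets the endpoint return just the requested keys instead of the whole 118 kB object. A hedged sketch, assuming a table named matches with columns archive_time and data:
select
    archive_time,
    data -> 'kills' ->> 'green' as kills_green,
    data -> 'kills' ->> 'blue'  as kills_blue,
    data -> 'kills' ->> 'red'   as kills_red
from matches
where data ->> 'id' = '1-1';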