Grouping data into buckets by frequency in Postgres 11.6 - postgresql

Using Postgres 11.6, I'm trying to analyze some event data. The goal is to find the durations of all events with a specific name, then split those durations into evenly sized buckets. We're looking for any times that "clump" for a specific event. I'm editing my question because the specific case may be obscuring what I'm trying to ask.
Simple example
The question is "how do you group rows by a value, then split occurrences by frequency into buckets with count and average for each of those buckets." Here's a hand-done toy example with rounded averages:
Months with values, each number here represents a row.
Jan 12 24 60 150 320 488
Feb 8 16 40 100 220
Mar 4 8 20 310
Overall figures
Month Count Avg Min Max
Jan 6 176 12 488
Feb 5 77 8 220
Mar 4 86 4 310
The same example, but with more data, including repeated values and a wider range.
Jan 12 12 12 12 24 24 60 60 150 320 488 500
Feb 8 8 8 8 8 16 40 100 220 440 1100
Mar 4 8 8 8 8 20 20 20 20 310
Overall figures
Month Count Avg Min Max
Jan 12 140 12 500
Feb 11 178 8 1100
Mar 10 43 4 310
Mock-up of one of the sets of data split out into 3 buckets
Month Count Avg Min Max Bucket
Jan 4 12 12 12 0
Jan 4 42 24 60 1
Jan 4 365 150 500 2
...and so on for Feb and Mar
I'm just guessing at how the buckets would split in the mock-up above.
That pretty much captures what I'm trying to do. Group by month name (from_to_node in my real case), split the resulting rows into buckets, and then get min, max, avg, and count for each bucket. It's starting to sound like a pivot (?)
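For what it's worth, the mock-up above is just equal-count bucketing: sort each month's values and deal them into three groups of the same size. A minimal sketch of that in SQL, assuming a made-up toy table toy_events(month_name, val) (names are only for illustration), would use ntile():

WITH bucketed AS (
    SELECT month_name,
           val,
           ntile(3) OVER (PARTITION BY month_name ORDER BY val) - 1 AS bucket
    FROM toy_events
)
SELECT month_name,
       bucket,
       count(*)        AS bucket_count,
       round(avg(val)) AS bucket_avg,
       min(val)        AS bucket_min,
       max(val)        AS bucket_max
FROM bucketed
GROUP BY month_name, bucket
ORDER BY month_name, bucket;

ntile() gives every bucket (roughly) the same number of rows, which is a different notion of "bucket" from width_bucket() in the draft attempt further down, where every bucket spans the same range of values.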
Real Table Setup
Here's the structure of the table I'm getting a feed for:
CREATE TABLE IF NOT EXISTS data.edge_event (
id uuid,
inv_id uuid,
facility_id uuid,
from_node citext,
to_node citext,
from_to_node citext,
from_node_dts timestamp without time zone,
to_node_dts timestamp without time zone,
seconds integer,
cycle_id uuid
);
The duration is pre-calculated in seconds, and the area of interest for now is only the from_to_node name. So, it's fair to think of the example as
CREATE TABLE IF NOT EXISTS data.edge_event (
from_to_node citext,
seconds integer
);
Raw Data
Within the edge_event table, there are 159 distinct from_to_node values over around 300K event rows. Some are found in only a handful of edge_event records, some are found in thousands, or tens of thousands. That's too much to provide a good sample for. But to make the problem simpler to follow, a from_to_node might be
Boxing_Assembly 1256
Meaning "it took 1256 seconds to move this part from the Boxing phase to the Assembly phase." And here we might have 10,000 other records for "Boxing_Assembly" with different durations.
Goal
We're looking for two things out of each from_to_node. For something like Boxing_Assembly, I'm trying to do this:
Sort the seconds taken into buckets, say 20 buckets. This is for a histogram.
For each bucket get the
count of edge_event rows
avg(seconds) within the bucket
min/first_value(seconds) within the bucket
max/last_value(seconds) within the bucket
So, we're looking to chart durations to look for clusters, and then get the raw seconds out of any common clusters.
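The bucket assignment itself is what Postgres's width_bucket() does: it maps a value onto one of N equal-width buckets between a low and a high bound. A quick illustration of how it numbers things (this is stock width_bucket() behaviour, nothing specific to this data):

SELECT v,
       width_bucket(v, 0, 100, 10) AS bucket
FROM (VALUES (0), (5), (10), (99), (100)) AS t(v);
-- 0 and 5 land in bucket 1, 10 in bucket 2, 99 in bucket 10,
-- and 100 (the upper bound itself) lands in bucket 11, i.e. count + 1.
-- In-range values get buckets 1..count; out-of-range values get 0 or count + 1.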
What I've tried
I've tried a lot of different code, and I've not succeeded. It seems like a problem for GROUP BY and/or window functions. There's something I'm not getting, as my results are far from the mark.
I know that I haven't provided sample data, which makes it harder to help. But I'm guessing that what I'm missing is one or more concepts. Pretty much, I want to know how to split out the edge_event data by from_to_node and then by seconds. Given the huge ranges across from_to_nodes, I'm trying to bucket each one individually based on its own min/max.
Thanks very much for any help.
Draft Attempt
I've developed a query that partly works, but not entirely. This is an edit of my original post, which had broken code.
WITH
min_max AS
(
SELECT from_to_node,
min(seconds),
max(seconds)
FROM edge_event
GROUP BY from_to_node
)
SELECT edge_event.from_to_node,
width_bucket(seconds, min_max.min, min_max.max, 99) as bucket, -- In-range values get bucket numbers 1 through 99; the maximum value itself lands in bucket 100 (count + 1).
count(*) as frequency,
min(seconds) as seconds_min,
max(seconds) as seconds_max,
max(seconds) - min(seconds) as bucket_width,
round(avg(seconds)) as seconds_avg
FROM edge_event
JOIN min_max ON (min_max.from_to_node = edge_event.from_to_node)
WHERE min_max.min <> min_max.max AND -- Can't have a bucket with an upper and lower bound that are the same.
edge_event.from_to_node IN (
'Boxing_Assembly',
'Assembly_Waiting For QA')
GROUP BY edge_event.from_to_node,
bucket
ORDER BY from_to_node,
bucket
What I'm getting back looks pretty good:
from_to_node bucket frequency seconds_min seconds_max bucket_width seconds_avg
Boxing_Assembly 1 912 17 7052 7035 3037
Boxing_Assembly 2 226 7058 13937 6879 9472
Boxing_Assembly 3 41 14151 21058 6907 16994
Boxing_Assembly 4 16 21149 27657 6508 23487
Boxing_Assembly 5 4 28926 33896 4970 30867
Boxing_Assembly 6 1 37094 37094 0 37094
Boxing_Assembly 7 1 43228 43228 0 43228
Boxing_Assembly 10 2 63666 64431 765 64049
Boxing_Assembly 14 1 94881 94881 0 94881
Boxing_Assembly 16 1 108254 108254 0 108254
Boxing_Assembly 37 1 257226 257226 0 257226
Boxing_Assembly 40 1 275140 275140 0 275140
Boxing_Assembly 68 1 471727 471727 0 471727
Boxing_Assembly 100 1 696732 696732 0 696732
Assembly_Waiting For QA 1 41875 1 18971 18970 726
Assembly_Waiting For QA 9 1 207457 207457 0 207457
Assembly_Waiting For QA 15 1 336711 336711 0 336711
Assembly_Waiting For QA 38 1 906519 906519 0 906519
Assembly_Waiting For QA 100 1 2369669 2369669 0 2369669
One problem here is that the buckets aren't evenly sized...they seem kind of weird. I've also tried specifying 10, 20, or 100 buckets, and get similar results. I'm hoping that there is a better way to allocate the data to buckets that I'm missing, and that there's a way to have zero-entry buckets instead of gaps.
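For the gaps specifically, one possible direction (just a sketch, not verified against the real data) is to generate every bucket number with generate_series() and LEFT JOIN the aggregates onto that list, so empty buckets come back with a count of 0:

WITH min_max AS (
    SELECT from_to_node,
           min(seconds) AS min_s,
           max(seconds) AS max_s
    FROM edge_event
    GROUP BY from_to_node
),
bucketed AS (
    SELECT e.from_to_node,
           width_bucket(e.seconds, m.min_s, m.max_s, 20) AS bucket,
           count(*)              AS frequency,
           min(e.seconds)        AS seconds_min,
           max(e.seconds)        AS seconds_max,
           round(avg(e.seconds)) AS seconds_avg
    FROM edge_event e
    JOIN min_max m USING (from_to_node)
    WHERE m.min_s <> m.max_s
    GROUP BY e.from_to_node, bucket
)
SELECT m.from_to_node,
       s.bucket,
       coalesce(b.frequency, 0) AS frequency,  -- zero-entry buckets instead of gaps
       b.seconds_min,
       b.seconds_max,
       b.seconds_avg
FROM min_max m
CROSS JOIN generate_series(1, 21) AS s(bucket)  -- 21 = 20 buckets plus the count+1 bucket that the max value falls into
LEFT JOIN bucketed b
       ON b.from_to_node = m.from_to_node
      AND b.bucket = s.bucket
WHERE m.min_s <> m.max_s
ORDER BY m.from_to_node, s.bucket;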

I would use the PostgreSQL optimizer for that. It collects exactly the information you want.
Create a temporary table with the values you are interested in and ANALYZE it. Then look into pg_stats for the following:
if there are "most common values", you have them and their frequency right there.
Otherwise, look for adjacent histogram boundaries that are close together. Such a bucket is an interval where values are "lumped".
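A rough sketch of that approach (table and column names as in the question; the temp table name is made up), for one from_to_node at a time:

-- Copy the durations of interest into a temp table and let ANALYZE gather statistics on it.
CREATE TEMP TABLE boxing_assembly AS
SELECT seconds
FROM edge_event
WHERE from_to_node = 'Boxing_Assembly';

ANALYZE boxing_assembly;

-- most_common_vals / most_common_freqs are the values that "clump";
-- histogram_bounds are the boundaries of equal-frequency buckets for the remaining values.
SELECT most_common_vals,
       most_common_freqs,
       histogram_bounds
FROM pg_stats
WHERE tablename = 'boxing_assembly'
  AND attname = 'seconds';

The number of histogram buckets is governed by default_statistics_target (100 by default), or per column via ALTER TABLE ... ALTER COLUMN ... SET STATISTICS.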

Related

How to determine the number of digits of a number in a table

I am trying to determine the number of digits of a number in a table. For example if I have a table like this:
4 200 50 1236
69 54 285 1
1458 2 69 555
The answer would be
1 3 2 4
2 2 3 1
4 1 2 3
I used to be able to do this with this code
strlength(num2str(ADCPCRUM2(i,2)))
but then my input was numeric, and not a table.
How do I determine the length of a number in a table?
floor(log10(A)) + 1 does this: log10() gives the position of the most significant digit relative to the decimal separator, so flooring it and adding 1 yields the digit count. For example, floor(log10(1236)) = 3, and 3 + 1 = 4 digits.
When using this on a table, a simple call to table2array() first should solve it.
Caveat: this only works for positive integers; for non-integer (or non-positive) inputs it would get a lot more involved.

Number of different insertion sequence of Key values in a hash table

A hash table of length 10 uses open addressing with hash function h(k)=k mod 10, and linear probing. After inserting 8 values into an empty hash table, the table is as shown below
0 |
1 | 91
2 | 2
3 | 13
4 | 24
5 | 12
6 | 62
7 | 77
8 | 82
9 |
How many different insertion sequences of the key values using the same hash function and linear probing will result in the hash table shown above?
ANSWER - 128.
I know that for 91, 2, 13, 24, 77 it's 5! = 120, but I can't figure out what the other 8 combinations are.
The given answer is wrong. It was actually a mock test, and the answer provided by the source is incorrect. The real answer is 168.
It can be worked out in two ways:
1) Take the sequence 91, 2, 13, 24, 12, 62, 82 and look at the available gaps:
_, 91, _, 2, _, 13, _, 24, _, 12, _, 62, _, 82
In any of these gaps we could place 77 and it would still end up in slot 7, so the number of places 77 can go is 7.
Now 91, 2, 13, 24 can come in any order, so there are 4! arrangements as above, and for every one of those 4! arrangements 77 can go in any of the 7 places, so the answer is 4! * 7 = 168.
2) The second way: there are only 3 possible sequence patterns.
i) 91, 2, 13, 24, 77, 12, 62, 82
Here 91, 2, 13, 24, 77 can come in any order; they each land in their respective slots, so 5! ways in total.
ii) 91, 2, 13, 24, 12, 77, 62, 82
Here 91, 2, 13, 24 can come in any order, and 77 is fixed after 12, so 4! ways in total.
iii) 91, 2, 13, 24, 12, 62, 77, 82
Same here: 91, 2, 13 and 24 can come in any order (4! ways), and 77 is fixed after 62.
So the total is 5! + 4! + 4! = 168.

Displaying data to a map, creating a choropleth

What I would like to do is create a choropleth map which is darker or lighter based on the number of data points in a particular area.
I have the following data:
RO-B, 9
PL-MZ, 24
SE-C, 3
DE-NI, 5
PL-DS, 14
ES-CM, 11
RO-IS, 2
DE-BY, 51
SE-Z, 18
CH-BE, 10
PL-WP, 1
ES-IB, 1
DE-BW, 21
DE-BE, 24
DE-BB, 1
IE-M, 26
ES-PV, 1
DE-SN, 6
CH-ZH, 31
ES-GA, 1
NL-GE, 2
IE-U, 1
ES-AN, 4
FR-J, 82
DE-HH, 34
PL-PD, 1
PL-LD, 6
GB-WLS, 60
GB-ENG, 8619
RO-BV, 45
CH-VD, 2
PL-SL, 1
DE-HE, 17
SE-I, 1
HU-PE, 4
PL-MA, 4
SE-AB, 3
CH-BS, 20
ES-CT, 31
DE-TH, 25
IE-C, 1
CZ-ST, 1
DE-NW, 29
NL-NH, 3
DE-RP, 9
CZ-PR, 4
IE-L, 134
HU-BU, 10
RO-CJ, 1
GB-NIR, 29
ES-MD, 33
CH-LU, 11
GB-SCT, 172
CH-GE, 3
BE-BRU, 30
BE-VLG, 25
Each entry references the ISO 3166-2 code of a country subdivision, and the number is the count of data points affiliated with that region.
I've seen this project on GitHub which seems to also use the same ISO3166-2 to reference countries.
I'm trying to figure out how I could modify their code to display my data, so that regions with higher numbers are shaded darker and regions with lower numbers are shaded lighter.
It seems like it should be possible. The first thing I tried was modifying this jsfiddle code, which seems to be very close to what I need, but I couldn't get it to work.
For instance this line:
osmeRegions.geoJSON('RU-MOW',
seems to directly reference an ISO 3166-2 code, but it's not as simple as just changing that (or maybe it is, but I couldn't get it to work properly).
Does anyone know if I could possibly adapt the code from that project to create the map rendering I've described?
Or perhaps there's a different way to achieve the same goal?

Stata longwise average

I'm using Stata and trying to compute conditional means based on time/date. For each store I want to calculate mean (inventory) per year. If there are missing year gaps, then I want to take the mean from the closest two dates' inventory values.
I've used (below) to get overall means per store, but I need more granularity.
egen mean_inv = mean(inventory), by (store)
I've also tried this loop with similar results:
by id, sort: gen v1 = _n
forvalues x = 1/`=n' {
by store: sum inventory if v1==`x'
replace mean_inv = r(mean) if v1==`x'
}
Visually, I want mean inventory per store: (store id is not sequential)
5/1/2003 2/3/2006 8/9/2006 3/5/2007 6/9/2007 2/1/2008
13 18 12 15 24 11
[mean1] [mean2] [mean3] [mean4] [mean5]
store date inventory
1 16750 17
1 18234 16
1 15844 13
1 17111 14
1 17870 13
1 16929 13.5
1 17503 13
4 15987 18
4 15896 16
4 18211 16
4 17154 18
4 17931 24
4 16776 23
12 16426 26
12 17681 17
12 16386 17
12 16603 18
12 17034 16
12 17205 16
42 15798 18
42 16022 18
42 17496 16
42 17870 18
42 16204 18
42 16778 14
33 18053 23
33 16086 13
33 16450 21
33 17374 19
33 16814 19
33 15834 16
33 16167 16
56 17686 16
56 17623 18
56 17231 20
56 15978 16
56 16811 15
56 17861 20
It is hard to relate your code to the word description of your problem.
Your egen call calculates means by store, not year.
Your longer fragment does not make complete sense given lack of definitions and at least one typo.
Note that your variable v1 contains identifiers that run 1 up within groups of store, and does not distinguish different values of store, as you (seem to) imply. It strains credibility that it produces results anywhere near those by the egen call.
n is not defined and the code evaluating it is presumably intended to be
`=n'
If you calculate
by store: sum inventory if v1 == `x'
several means will be calculated in turn but only the last to be calculated will be accessible as r(mean).
The sample data are unrelated to the problem as described: there is no year variable, and the dates are shown only as raw integers (presumably Stata daily dates).
Setting all that aside, suppose you have variables store, inventory and year. You can try
collapse inventory, by(store year)
fillin store year
ipolate inventory year, gen(inventory2) by(store)
The collapse produces a reduced dataset of means. The ipolate interpolates across gaps, as you ask. fillin may not be adequate to give all the store and year combinations you want and you may need to add further years manually before the interpolation. If you want to put these results back with the original data, that's a merge.
In total, this is a pretty messy question.

cluster my data and testing of input

Cluster the given data and use any retrieval algorithm to show output as shown below.
(any clustering algorithm)
Euclidean distance may be used for finding closest cases.
Let a data file contain input vectors like:
caseid f1 f2 f3 f4
1 30 45 9.5 1500
2 35 45 8 1600
3 38 47 10 1550
4 32 50 9.5 1800
..
..
..
t1 30 45 9.5 1500 (target)
The output should look like
NO. f1 f2 f3 f4
t1 30 45 9.5 1500 (target)
21 35 45 10 1500 (1st closest to target)
39 35 50 8 1500 (2nd closest)
56 35 42 9.5 1500 (3rd closest)
This looks like a classic nearest neighbor query to me, not like clustering.
Also, I'd be careful about using Euclidean distance here. A difference of 1 in attribute f1 does not look like it is equivalent to a difference of 1 in attribute f4; the values seem to have completely different magnitudes.
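If the data were loaded into a database table, the nearest-neighbour lookup is just an ORDER BY on the distance. A sketch in SQL, assuming a hypothetical table cases(caseid, f1, f2, f3, f4), dividing each difference by the column's standard deviation so that f4's large magnitude does not dominate:

WITH spread AS (
    SELECT stddev_samp(f1) AS s1,
           stddev_samp(f2) AS s2,
           stddev_samp(f3) AS s3,
           stddev_samp(f4) AS s4
    FROM cases
),
target AS (
    SELECT 30.0 AS f1, 45.0 AS f2, 9.5 AS f3, 1500.0 AS f4
)
SELECT c.caseid, c.f1, c.f2, c.f3, c.f4
FROM cases c
CROSS JOIN spread s
CROSS JOIN target t
ORDER BY ((c.f1 - t.f1) / s.s1) ^ 2
       + ((c.f2 - t.f2) / s.s2) ^ 2
       + ((c.f3 - t.f3) / s.s3) ^ 2
       + ((c.f4 - t.f4) / s.s4) ^ 2
LIMIT 3;  -- the three closest cases to the target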