I have a jsonb field in a table with values like this:
{
"1":[{"start":64800,"finish":68400},{"start":61200,"finish":64800},{"start":75600,"finish":79200},{"start":79200,"finish":82800}],
"2":[{"start":68400,"finish":72000},{"start":72000,"finish":75600},{"start":75600,"finish":79200},{"start":79200,"finish":82800}],
"3":[{"start":46800,"finish":50400},{"start":50400,"finish":54000}],
"4":[{"start":50400,"finish":54000}],
"5":[{"start":79200,"finish":82800},{"start":82800,"finish":0},{"start":0,"finish":3600}],
"6":[{"start":68400,"finish":72000},{"start":72000,"finish":75600},{"start":79200,"finish":82800}]
}
The keys 0...6 are days of the week; each value is an array of working-time intervals.
So I need to aggregate working time by days of week. For example, one row has
"5":[{"start":79200,"finish":82800},{"start":82800,"finish":0},{"start":0,"finish":3600}]
another row has
"5":[{"start":75600,"finish":79200}]
and I want to get
"5":[{"start":75600,"finish":79200},{"start":79200,"finish":82800},{"start":82800,"finish":0},{"start":0,"finish":3600}]
If you only need to run this report manually and irregularly, you can do it within PostgreSQL itself. But if performance matters to you, you should normalize your schema (for example, if your jsonb column's structure is really that simple, there is no need to use JSON at all: you don't have any unstructured data).
SELECT jsonb_object_agg(dow, working_times_agg)
FROM (
    SELECT dow, jsonb_agg(working_time) AS working_times_agg
    FROM   table t,
           jsonb_each(t.jsonb_column) AS o(dow, working_times),
           jsonb_array_elements(working_times) AS working_time
    GROUP BY dow
) AS working_times_agg_by_dow
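If you do decide to normalize instead, here is a minimal sketch of what that could look like (table and column names are assumptions, not taken from your schema): one row per interval makes the per-day aggregation a plain GROUP BY.

-- Illustrative normalized layout; names are assumed.
CREATE TABLE working_time
(
    entity_id INTEGER  NOT NULL,  -- whatever the jsonb row belongs to
    dow       SMALLINT NOT NULL CHECK (dow BETWEEN 0 AND 6),
    start     INTEGER  NOT NULL,  -- seconds since midnight
    finish    INTEGER  NOT NULL
);

-- Aggregating working time per day of week is then a simple GROUP BY:
SELECT dow,
       jsonb_agg(jsonb_build_object('start', start, 'finish', finish)) AS working_times
FROM working_time
GROUP BY dow;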
I have a system with a large number of tables that contain historical data. Each table has a ts_from and ts_to column which are of type timestamptz. These represent the time period in which the data for a particular row was valid.
These columns are indexed.
If I want to query all rows that were valid at a particular timestamp, it is trivial to write the ts_from <= #at_timestamp AND ts_to >= #at_timestamp WHERE clause to utilise the index.
However, I wanted to create a function called Temporal.at which would take the #at_timestamp column and the ts_from / ts_to columns and do this by hiding the complexity of the comparison from the query that uses it. You might think this is trivial, but I would also like to extend the concept to create a function called Temporal.between which would take a #from_timestamp and #to_timestamp and select all rows that were valid between those two periods. That function would not be trivial, as one would have to check where rows partially overlap the period rather than always being fully enclosed by it.
The issue is this: I have written these functions but they do not cause the index to be used. The query performance is woefully slow on the history tables, some of which have hundreds of millions of rows.
The questions therefore are:
a) Is there a way to write these functions so that we can be sure the indexes will be used?
b) Am I going about this completely the wrong way and is there a better way to proceed?
This is complicated if you model ts_from and ts_to as two different timestamp columns. Instead, you should use a range type: tstzrange. Then everything will become simple:
for containment in an interval, use #at_timestamp <@ from_to
for interval overlap, use tstzrange(#from_timestamp, #to_timestamp) && from_to
Both queries can be supported by a GiST index on the range column.
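A minimal sketch of how that could look (the table and column names are assumptions):

-- Assumed history table using a range column instead of ts_from / ts_to.
CREATE TABLE history
(
    id      BIGINT,
    payload TEXT,
    from_to TSTZRANGE NOT NULL
);

CREATE INDEX history_from_to_idx ON history USING gist (from_to);

-- All rows valid at a single point in time:
SELECT * FROM history WHERE TIMESTAMPTZ '2020-08-26 12:00:00+00' <@ from_to;

-- All rows whose validity overlaps a period:
SELECT * FROM history WHERE tstzrange('2020-08-01', '2020-09-01') && from_to;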
I have a table column with a tilde-delimited value like below:
vendorAndDate - Column name
Chipotle~08-26-2020 - column value
I want to query by month ("vendorAndPurchaseDate like '%~08%2020'") and by year ending with 2020 ("vendorAndPurchaseDate like '%2020'"). I am using Spring Data JPA to query the values. I have not worked with tilde-delimited columns before. Please point me in the right direction or to some examples.
You cannot.
If vendorAndPurchaseDate is your partition key, you need to pass the whole value.
If vendorAndPurchaseDate is your range key, you can only perform
=, <, <=, >, >=, between, and begins_with operations along with a partition key.
Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
DynamoDB does not support this type of wildcard query.
Let's consider a more DynamoDB way of handling this type of query. It sounds like you want to support 2 access patterns:
Get Item by month
Get Item by year
You don't describe your Primary Keys (Partition Key/Sort Key), so I'm going to make some assumptions to illustrate one way to address these access patterns.
Your attribute appears to be a composite key, consisting of <vendor>~<date>, where the date is expressed by MM-DD-YYYY. I would recommend storing your date fields in YYYY-MM-DD format, which would allow you to exploit the sort-ability of the date field. An example will make this much clearer. Imagine your table looked like this:
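PK        SK
-------   --------------------
Vendors   Chipotle~2020-08-12
Vendors   Chipotle~2020-08-26
Vendors   Chipotle~2020-11-05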
I'm calling your vendorAndDate attribute SK, since I'm using it as a Sort Key in this example. This table structure allows me to implement your two access patterns by executing the following queries (in pseudocode to remain language agnostic):
Access Pattern 1: Fetch all Chipotle records for August 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-08-00 and Chipotle~2020-08-31
Access Pattern 2: Fetch all Chipotle records for 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-01-01 and Chipotle~2020-12-31
Because dates stored in ISO8601 format (e.g. YYYY-MM-DD...) are lexicographically sortable, you can perform range queries in DynamoDB in this way.
Again, I've made some assumptions about your data and access patterns for the purpose of illustrating the technique of using lexicographically sortable timestamps to implement range queries.
I have a query like this, which we use to generate data for our custom dashboard (a Rails app):
SELECT AVG(wait_time) FROM (
SELECT TIMESTAMPDIFF(MINUTE,a.finished_time,b.start_time) wait_time
FROM (
SELECT max(start_time + INTERVAL avg_time_spent SECOND) finished_time, branch
FROM mytable
WHERE name IN ('test_name')
AND status = 'SUCCESS'
GROUP by branch) a
INNER JOIN
(
SELECT MIN(start_time) start_time, branch
FROM mytable
WHERE name IN ('test_name_specific')
GROUP by branch) b
ON a.branch = b.branch
HAVING avg_time_spent between 0 and 1000)t
GROUP BY week
Now I am trying to port this to Tableau, and I am not able to find a way to represent this data in Tableau. I am stuck at how to represent the inner group by in a calculated field. I could also try to just use a custom SQL data source, but I am already using another data source.
columns in mytable -
start_time
avg_time_spent
name
branch
status
I think this could be achieved with the new Level of Detail formulas, but unfortunately I am stuck on version 8.3.
Save custom SQL for rare cases. This doesn't look like a rare case. Let Tableau generate the SQL for you.
If you simply connect to your table, then you can usually write calculated fields to get the information you want. I'm not exactly sure why you have test_name in one part of your query but test_name_specific in another, so ignoring that, here is a simplified example of a similar query.
If you define a calculated field called worst_case_test_time as
datediff('minute', min([start_time]), max(dateadd('second', [avg_time_spent], [start_time])))
that seems close to what your original query computes.
It would help if you explained what exactly you are trying to compute. It appears to be some sort of worst-case bound for average test time. There may be an even simpler formula, but it's hard to know without a little context.
You could filter on status = "Success" and avg_time_spent < 1000, and place branch and WEEK(start_time) on say the row and column shelves.
P.S. Your query seems a little off. Don't you need an aggregation function like MAX or AVG after the HAVING keyword?
The below is a sample of my Cassandra CF.
column1 column2 column3 ......
row1 : name:abay,value:10 name:benny,value:7 name:catherine,value:24 ................
ComparatorType:utf8
How can I fetch columns with the names ('abay', 'john', 'peter', 'allen') from this row in a single query using the Hector API?
The number of names in the list may vary every time.
I know that I can get them in sorted order using a SliceQuery.
But there are cases where I need to fetch data randomly, as mentioned above.
Kindly help me.
Based on your query, it seems you have two options.
If you only need to run this query occasionally, you can get all columns for the row and filter them on the client. If you have at most a few thousand columns, this should be ok for an occasional query.
If you need to run this frequently, you'll want to write the data such that you can query using name as the key. This probably means you'll have to write the data twice into two CFs, where one is by your current key, and the other is by name. This is a common Cassandra tactic.
I am new to Postgres and am experimenting with the hstore extension. Looking for some guidance. I need to support basic reporting on timeseries data for various products that we sell. I have a large amount of data in the format "Timestamp, Value" for each product. This data is available in a csv file for each product.
I am thinking of using hstore to store this data in key-value format, assuming that all the timeseries data for a single product can be stored in a single hstore object. I need to be able to query this data by specific times, say, what was the value of a product at a given time? I also need to run simple queries like retrieving the times when the product cost more than $100.
I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:
The hstore column needs to be loaded from the thousands of timestamp,value records that exist in a csv. The hstore should be appended to whenever we get a new csv.
The table needs to store the productId and corresponding Timeseries data.
Can you please advise whether using hstore would be helpful? If yes, how can I load the data from csv as explained above? Also, if there could be any impact on the performance of inserts/updates in the hstore as the data grows, please share your experiences.
I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:
CREATE TABLE product_data
(
product TEXT, -- I'm making an assumption about the types of your columns
time TIMESTAMP,
value DOUBLE PRECISION,
PRIMARY KEY (product, time)
);
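With that schema, loading each product's csv and answering the queries you mention is straightforward. A sketch, assuming the column names above; the file path and the product name 'widget' are placeholders:

-- Load one product's CSV (timestamp,value rows) via a staging table,
-- attaching the product id on insert.
CREATE TEMP TABLE staging (time TIMESTAMP, value DOUBLE PRECISION);
COPY staging FROM '/path/to/widget.csv' WITH (FORMAT csv);
INSERT INTO product_data (product, time, value)
SELECT 'widget', time, value FROM staging;

-- What was the value of a product at a given time?
SELECT value
FROM product_data
WHERE product = 'widget' AND time = TIMESTAMP '2014-01-01 12:00:00';

-- At which times did the product cost more than $100?
SELECT time, value
FROM product_data
WHERE product = 'widget' AND value > 100;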
I would definitely keep hstore and similar options in mind, if and when your data becomes large enough that efficiency is more important than simplicity. But note that all options have an efficiency tradeoff.
Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?
What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value), if the product has many distinct timestamps.
Other options
hstore is most useful if you want to store a variable set of arbitrary key-value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in the product's hstore. The downsides are that keys and values in hstore are text, whereas your keys are timestamps and your values are numbers of some kind, so there will be a certain reduction in type checking and a certain increase in type-casting cost. Another possible downside is that some queries on the hstore might not use indexes very efficiently. The above table can use simple btree indexes for range queries (say you want to pull out the values between two dates for a product), but hstore indexes are much more limited; you can use a gist or gin index on an hstore column to find all the rows that feature a certain key.
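For comparison, a rough sketch of the hstore layout just described (the names are illustrative, not from your setup):

-- One row per product; timestamps become hstore keys (stored as text).
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE product_data_hstore
(
    product TEXT PRIMARY KEY,
    data    HSTORE
);

-- Look up one timestamp's value; note the cast back from text.
SELECT (data -> '2014-01-01 12:00:00')::DOUBLE PRECISION AS value
FROM product_data_hstore
WHERE product = 'widget';

-- A GIN index mainly helps key-existence lookups such as data ? 'key'.
CREATE INDEX ON product_data_hstore USING gin (data);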
Another option (which I've played with and use experimentally for some of my databases) is arrays. Basically, each product will have an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value every hour for every day, you could use a table like this:
CREATE TABLE product_data
(
product TEXT,
day DATE,
"values" DOUBLE PRECISION[], -- One element per hour, 0 to 23 ("values" needs quoting because it is a reserved word).
PRIMARY KEY (product, day)
);
You can construct views and indexes to make querying this table moderately easy. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)
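For instance, a view along these lines (a sketch, assuming one value per hour as above) expands the arrays back into individual (product, time, value) rows:

-- Unnest each day's array back into one row per hour.
CREATE VIEW product_data_points AS
SELECT product,
       day + (i - 1) * INTERVAL '1 hour' AS time,
       "values"[i] AS value
FROM product_data,
     generate_subscripts("values", 1) AS i;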
But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.