How to sort a Scala List[Map[String, Any]] by an arbitrary number of keys in the Map?

I have a List[Map[String, Any]] that represents the results of a query.
The keys of a map instance (or row) are the column names.
Each query is different and its result may contain a different set of columns, compared to any other query. Queries cannot be predicted in advance, hence I cannot use case classes to represent a result row.
Within the results for a given query, all columns appear in every row.
The values are largely Int, Double and String types.
I need to be able to sort the results by multiple columns, in both ascending and descending order.
For example, in pseudocode / SQL:
ORDER BY column1 ASC, column2 DESC, column3 ASC
I have three distinct problems:
1. Sort by a single column whose type in the map (as opposed to its underlying runtime type) is Any
2. Sort in either direction, ascending or descending
3. Chain multiple sort instructions together
How can I do this?
UPDATE
I can do parts 1 and 2 by writing a custom Ordering[Any]. I don't yet know how to chain the sorts together.
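One possible sketch that covers all three parts (assumptions: Scala 2.13, whose Ordering gained orElse; the SortKey helper and the set of handled value types are illustrative, not from any library):
// Compare two Any values that share a runtime type
val anyOrdering: Ordering[Any] = new Ordering[Any] {
  def compare(a: Any, b: Any): Int = (a, b) match {
    case (x: Int, y: Int)       => x compare y
    case (x: Double, y: Double) => x compare y
    case (x: String, y: String) => x compare y
    case _                      => sys.error(s"cannot compare $a with $b")
  }
}
// Part 2: one sort instruction is a column name plus a direction
case class SortKey(column: String, ascending: Boolean = true)
// Part 1: lift anyOrdering to an ordering on whole rows by one column
def byColumn(key: SortKey): Ordering[Map[String, Any]] = {
  val ord = Ordering.by[Map[String, Any], Any](_(key.column))(anyOrdering)
  if (key.ascending) ord else ord.reverse
}
// Part 3: orElse falls through to the next key when the previous one ties
def orderBy(keys: Seq[SortKey]): Ordering[Map[String, Any]] =
  keys.map(byColumn).reduceLeft(_ orElse _) // Ordering#orElse needs Scala 2.13+
For example, ORDER BY column1 ASC, column2 DESC, column3 ASC becomes:
rows.sorted(orderBy(Seq(SortKey("column1"), SortKey("column2", ascending = false), SortKey("column3"))))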

Related

grouping multiple queries into a single one, with Postgres

I have a very simple query:
SELECT * FROM someTable
WHERE instrument = '{instrument}' AND ts >= '{fromTime}' AND ts < '{toTime}'
ORDER BY ts
That query is applied to 3 tables across 2 databases.
I receive a list of rows that have timestamps (ts). I take the last timestamp and it serves as the basis for the 'fromTime' of the next iteration. toTime is usually equal to 'now'.
This allows me to only get new rows at every iteration.
I have about 30 instrument types and I need an update every 1s.
So that's 30 instruments * 3 queries = 90 queries per second.
How can I rewrite the query so that I could use a function like this:
getData table [(instrument, fromTime) list] toTime
and get back some dictionary, in the form:
Dictionary<instrument, MyDataType list>
To use a list of instruments, I could do something like:
WHERE instrument in '{instruments list}'
but this wouldn't help with the various fromTime as there is one value per instrument.
I could take the min of all fromTime values, get the data for all instruments and then filter the results, but that's wasteful, since I could potentially query a lot of data only to throw it away right after.
What is the right strategy for this?
So there is a single toTime to test against per query, but a different fromTime per instrument.
One solution to group them into a single query would be to pass a list of (instrument, fromTime) pairs as a relation.
The query would look like this:
SELECT [columns] FROM someTable
JOIN (VALUES
('{name of instrument1}', '{fromTime for instrument1}'),
('{name of instrument2}', '{fromTime for instrument2}'),
('{name of instrument3}', '{fromTime for instrument3}'),
...
) AS params(instrument, fromTime)
ON someTable.instrument = params.instrument AND someTable.ts >= params.fromTime
WHERE ts < '{toTime}';
Depending on your datatypes and on the method the client-side driver uses to pass parameters, you may have to be explicit about the datatype of your parameters by casting the first value of the list, as in, for example:
JOIN (VALUES
('{name of instrument1}', '{fromTime for instrument1}'::timestamptz),
If you had many more than 30 values, a variant of this query with arrays as parameters (instead of the VALUES clause) could be preferable. The difference is that it would take 3 parameters, 2 arrays + 1 upper bound, instead of N*2+1 parameters. But it depends on the ability of the client-side driver to support Postgres arrays as a datatype, and to pass them as a single value.
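For illustration, a hedged sketch of that array variant (assuming the driver can bind a text[] and a timestamptz[] as single values; the '{...}' placeholders follow the same convention as above):
SELECT [columns] FROM someTable
JOIN unnest('{instruments array}'::text[],
'{fromTimes array}'::timestamptz[]) AS params(instrument, fromTime)
ON someTable.instrument = params.instrument AND someTable.ts >= params.fromTime
WHERE ts < '{toTime}';
unnest with two array arguments zips them positionally into a two-column relation, so each instrument is matched with its own fromTime.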

Postgres: Query Values in nested jsonb-structure with unknown keys

I am quite new to working with psql.
The goal is to get values from a nested jsonb structure where the keys at the last level have so many different names that it is not possible to query them explicitly.
The jsonb-structure in any row is as follows:
TABLE_Products
{"products":[{"product1":["TYPE1"], "product2":["TYPE2","TYPE3"], "productN":["TYPE_N"]}]}
I want to get the values (TYPE1, etc.) assigned to each product key (product1, etc.). The product keys are the unknown part, because there are too many different names.
My work so far achieves to pull out a tuple for each key:value-pair on the last level. To illustrate this here you can see my code and the results from the previously described structure.
My Code:
select id, jsonb_each(pro)
from (
select id , jsonb_array_elements(data #> '{products}') as pro
from TABLE_Products
where data is not null
) z
My result:
("product2","[""TYPE2""]")
("product2","[""TYPE3""]")
My questions:
Is there a way to split this tuple on two columns?
Or how can I query the values 'unsupervised', i.e. without knowing the exact names of 'product1 ... n'?
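A hedged sketch of one way to split that pair into two columns: move jsonb_each into a LATERAL join, so key and value arrive separately (the product/types aliases are illustrative). This also covers the 'unsupervised' part, since jsonb_each enumerates every key without naming it:
select z.id, e.key as product, e.value as types
from (
select id, jsonb_array_elements(data #> '{products}') as pro
from TABLE_Products
where data is not null
) z
cross join lateral jsonb_each(z.pro) as e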

When to use the SORT BY clause in HiveQL

I checked the difference between the SORT BY and ORDER BY clauses in Hive.
ORDER BY is used when total ordering is required, while SORT BY is used when there are multiple reducers and the input to each reducer needs to be in sorted order. Hence SORT BY yields a total order if there is only one reducer, and a partial (per-reducer) order if there are multiple reducers.
Ref- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
My question is: when do we need to use the SORT BY clause in HiveQL?
When the data is sorted, joins are faster, since the optimizer knows the data is in a specific order and after which value it can stop looking for the required predicate (the WHERE clause condition).
Case 1 - ORDER BY
If the data in a given field has a specific order, or your SELECT query needs the data in a specific order, e.g.
rank employees by their salary (i.e. order by salary and band), or
order employees by joining date (i.e. order by joining date),
then you need to save the data/result using the ORDER BY clause (to get a total order), e.g. ORDER BY salary, so that whenever you query the target data you get it in the required order by default.
Case 2 - SORT BY
If the data in a given field is not required in a specific order, e.g. a uniquely generated alphanumeric field like customer_id: logically the final data does not need to be in any particular order based on customer_id, but since it is a unique key that is mostly used in joins, the customer transaction details stored in each partition should be kept sorted to make those joins faster.
So in this case we use SORT BY customer_id while storing the final result.
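For illustration, a hedged HiveQL sketch of both cases (table and column names are made up):
-- Case 1: total order across the whole result (forces a single reducer)
INSERT OVERWRITE TABLE employees_ranked
SELECT * FROM employees ORDER BY salary, band;
-- Case 2: sorted within each reducer's output, scales across reducers
INSERT OVERWRITE TABLE customer_txn_sorted
SELECT * FROM customer_txn SORT BY customer_id;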

PostgreSQL queries treat Int as string datatypes

I store the following rows in my table ('DataScreen') under a JSONB column ('Results')
{"Id":11,"Product":"Google Chrome","Handle":3091,"Description":"Google Chrome"}
{"Id":111,"Product":"Microsoft Sql","Handle":3092,"Description":"Microsoft Sql"}
{"Id":22,"Product":"Microsoft OneNote","Handle":3093,"Description":"Microsoft OneNote"}
{"Id":222,"Product":"Microsoft OneDrive","Handle":3094,"Description":"Microsoft OneDrive"}
Here, in these JSON objects, "Id" and "Handle" are integer properties and the others are string properties.
When I query my table like below
Select Results->>'Id' From DataScreen
order by Results->>'Id' ASC
I get improper results because the ->> operator returns text, so PostgreSQL orders the values as text and not as integers.
Hence it gives the result as
11,111,22,222
instead of
11,22,111,222.
I don't want to use explicit casting to retrieve them, like below:
Select Results->>'Id' From DataScreen order by CAST(Results->>'Id' AS INT) ASC
because I cannot be sure of the column's datatype: the JSON structure is dynamic, and the keys and values may change next time, so the same could happen with another JSON that mixes integer and string values.
I want integers in the JSON structure of the JSONB column to be treated as integers only, and not as text (strings).
How do I write my query so that Id and Handle are retrieved as integer values and not as strings, without explicit casting?
I think your assumptions about the Id field don't make sense. You said:
(a) either Id contains integers only, or
(b) it contains strings and integers.
I'd say:
If (a), then numerical ordering is correct.
If (b), then lexical ordering is correct.
But if (a) holds for some time and then (b), then the correct order changes, too. And that doesn't make sense. Imagine:
For the current database you expect the order 11,22,111,222. Then you add a row
{"Id":"aa","Product":"Microsoft OneDrive","Handle":3095,"Description":"Microsoft OneDrive"}
and suddenly the correct order of the other rows changes to 11,111,22,222,aa. That sudden change is what bothers me.
So I would either expect a lexical ordering ab initio, or restrict my Id field to integers and use explicit casting.
Every other option I can think of is just not practical. You could, for example, create a custom < and > implementation for your Id field which results in 11,22,111,222,aa. ("Order all integers by numerical value and all strings by lexical order, and put all integers before the strings.")
But that is a lot of work (it involves a custom data type, a custom cast function and a custom operator function) and yields some counterintuitive results, e.g. 11,22,111,222,0a,1a,2a,aa (note the position of 0a and so on: it comes after 222).
Hope that helps ;)
If Id is always an integer, you can cast it in the SELECT part and just use ORDER BY 1:
select (Results->>'Id')::int From DataScreen order by 1 ASC
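As a hedged aside (PostgreSQL 9.4+): ordering by the jsonb value itself, i.e. -> instead of ->>, makes PostgreSQL compare jsonb values, and jsonb numbers are compared numerically, so all-integer Ids come back in numeric order without a cast; mixed-type Ids then cluster by jsonb's type ordering instead of raising an error:
Select Results->>'Id' From DataScreen order by Results->'Id' ASC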

array_agg guaranteed consistent across multiple columns in Postgres?

Suppose I have the following table in Postgres 9.4:
a | b
---+---
1 | 2
3 | 1
2 | 3
1 | 1
If I run
select array_agg(a) as a_agg, array_agg(b) as b_agg from foo
I get what I want
a_agg | b_agg
-----------+-----------
{1,3,2,1} | {2,1,3,1}
The orderings of the two arrays are consistent: the first element of each comes from a single row, as does the second, as does the third. I don't actually care about the order of the arrays, only that they be consistent across columns.
It seems natural that this would "just happen", and it seems to. But is it reliable? Generally, the ordering of SQL things is undefined unless an ORDER BY clause is specified. It is perfectly possible to get postgres to generate inconsistent pairings with inconsistent ORDER BY clauses within array_agg (with some explicitly counterproductive extra work):
select array_agg(a order by b) as agg_a, array_agg(b order by a) as agg_b from foo;
yields
agg_a | agg_b
-----------+-----------
{3,1,1,2} | {2,1,3,1}
This is no longer consistent. The first array elements 3 and 2 did not come from the same original row.
I'd like to be certain that, without any ORDER BY clause, the natural thing just happens. Even with an ordering on either column, ambiguity would remain because of the duplicate elements. I'd prefer to avoid imposing an unambiguous sort, because in my real application, the tables will be large and the sorting might be costly. But I can't find any documentation that guarantees or specifies that, absent imposition of inconsistent orderings, multiple array_agg calls will be ordered consistently, even though it'd be very surprising if they weren't.
Is it safe to assume that the ordering of multiple array_agg columns will be consistently ordered when no ordering is explicitly imposed on the query or within the aggregate functions?
According to the PostgreSQL documentation:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. [...]
However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering.
The way I understand it: you can't be sure that the order of rows is preserved unless you use ORDER BY.
It seems there is a similar (or almost the same) question here:
PostgreSQL array_agg order
I prefer ebk's answer there:
So I think it's fine to assume that all the aggregates, none of which uses ORDER BY, in your query will see input data in the same order. The order itself is unspecified though (which depends on the order the FROM clause supplies rows).
But you can still add an ORDER BY inside the array_agg call to force the same order.
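For example, a sketch using the sample table above: give both aggregates the same ORDER BY, listing both columns so duplicate values cannot pair ambiguously:
select array_agg(a order by a, b) as a_agg, array_agg(b order by a, b) as b_agg from foo;
With the sample data this yields {1,1,2,3} and {1,2,3,1}, where each position in both arrays is guaranteed to come from the same row.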