When to use the SORT BY clause in HiveQL - hiveql

I checked the difference between the SORT BY and ORDER BY clauses in Hive.
ORDER BY is used when total ordering is required, while SORT BY is used when there are multiple reducers and the input to each reducer needs to be in sorted order. Hence SORT BY yields a total order if there is only one reducer and a partial order if there are multiple reducers.
Ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
My question is: when do we actually need to use the SORT BY clause in HiveQL?

When the data is sorted, joins are faster, since the optimizer knows the data is in a specific order and after which value it can stop looking for the required predicate (the WHERE clause condition).
Case 1 - ORDER BY
If the data in a given field has a meaningful order, or your SELECT query needs the data in a specific order,
e.g.
ranking employees by their salary (i.e. ORDER BY salary, band)
or
ordering employees by their joining date (i.e. ORDER BY joining_date),
then you need to save the data/result using the ORDER BY clause (to get a total order), e.g. ORDER BY salary, so that whenever you query the target data you get the required ordering by default.
Case 2 - SORT BY
If the data in a given field is not needed in any particular order, for example a uniquely generated alphanumeric field like customer_id:
logically the final data does not have to be ordered by customer_id, but since it is a unique key and is mostly used in joins,
the customer transaction data stored in each partition needs to be kept in sorted order to make those joins faster.
So, in this case we use SORT BY (customer_id) while storing the final result.
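For illustration, a hedged HiveQL sketch of both cases; the table and column names (employees_ranked, customer_txn, etc.) are invented for the example and are not from the question:
-- Case 1: a total order is required, so ORDER BY forces a single final sort.
INSERT OVERWRITE TABLE employees_ranked
SELECT emp_id, name, salary, band
FROM employees
ORDER BY salary DESC, band;
-- Case 2: only per-reducer (per-file) ordering on the join key is needed,
-- so route each customer_id to one reducer and sort within it.
INSERT OVERWRITE TABLE customer_txn_sorted
SELECT customer_id, txn_id, amount
FROM customer_txn
DISTRIBUTE BY customer_id
SORT BY customer_id;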

Related

Postgres massive UPSERT not wasting sequence numbers

I have a too complicated task for me, and hope someone can help me :)
I have two different structures, containing data about products:
1.
products with product_id, brand_id (products that I have)
products_sku with product_id, sku_id, vendor_code (SKUs for products)
products_avail with sku_id, scheme_id, quantity (availability for each product SKU and scheme)
2.
products_external with product_id_e, brand_id_e, vendor_code_e, sku_id_e
products_avail_external with product_id_e, quantity_e, scheme_id_e
Each SKU is identified by a (brand_id, vendor_code) pair, so one product from (2) corresponds to one SKU from (1). Also, I can have several availability entries for different schemes. Availability records can number up to tens of millions.
The field sku_id_e in (2) is updated by a cron task, so if it is defined (i.e. not zero) I can find the corresponding record in (1).
I need to get all records from (2) with sku_id_e defined, group them by (sku_id_e, scheme_id_e) and produce a set of records in (1), so that each record contains the SUM(quantity) of all records with a given (sku_id_e, scheme_id_e).
I can do an UPSERT, but in that case I will waste sequence numbers (which is a problem in my case because the request will be executed relatively frequently and on a massive number of records).
I could use something like ON EMPTY or NOT EXISTS, but it is too complicated for me to combine into one request.
I could just select both datasets and do the matching programmatically, but that is definitely not the best solution.
Can you help me write SQL that will update records in (1), or insert them if such records do not exist, without wasting sequence numbers?
Thank you in advance!
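One common shape for this kind of statement is a data-modifying CTE that updates existing rows first and inserts only the rows that are still missing, so the id sequence is consumed only for genuine inserts. A hedged sketch, assuming products_avail is the target from structure (1) and that external availability joins to products_external on product_id_e; the exact table and column names are inferred from the description, not confirmed:
WITH agg AS (
    -- aggregate external availability per (sku_id_e, scheme_id_e)
    SELECT e.sku_id_e     AS sku_id,
           a.scheme_id_e  AS scheme_id,
           SUM(a.quantity_e) AS quantity
    FROM products_avail_external a
    JOIN products_external e ON e.product_id_e = a.product_id_e
    WHERE e.sku_id_e <> 0
    GROUP BY e.sku_id_e, a.scheme_id_e
), upd AS (
    -- update the combinations that already exist
    UPDATE products_avail p
    SET quantity = agg.quantity
    FROM agg
    WHERE p.sku_id = agg.sku_id AND p.scheme_id = agg.scheme_id
    RETURNING p.sku_id, p.scheme_id
)
-- insert only the combinations the UPDATE did not touch,
-- so nextval() on the id column fires only for real inserts
INSERT INTO products_avail (sku_id, scheme_id, quantity)
SELECT agg.sku_id, agg.scheme_id, agg.quantity
FROM agg
WHERE NOT EXISTS (
    SELECT 1 FROM upd
    WHERE upd.sku_id = agg.sku_id AND upd.scheme_id = agg.scheme_id
);
Note that with concurrent writers two sessions could both miss the same (sku_id, scheme_id) pair and one INSERT would then fail on a unique constraint; the sketch assumes the cron-driven run is not executed concurrently.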

DynamoDB column with tilde and query using JPA

I have a table column with a tilde-separated value, like below:
vendorAndDate - column name
Chipotle~08-26-2020 - column value
I want to query by month ("vendorAndPurchaseDate like '%~08%2020'") and by year, ending with 2020 ("vendorAndPurchaseDate like '%2020'"). I am using Spring Data JPA to query the values. I have not worked with tilde-separated column values before. Please point me in the right direction or to some examples.
You cannot.
If vendorAndPurchaseDate is your partition key, you need to pass the whole value.
If vendorAndPurchaseDate is a range key, you can only perform
=, <, >, <=, >=, BETWEEN and begins_with operations, along with a partition key.
Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
DynamoDB does not support this type of wildcard query.
Let's consider a more DynamoDB way of handling this type of query. It sounds like you want to support 2 access patterns:
Get Item by month
Get Item by year
You don't describe your Primary Keys (Partition Key/Sort Key), so I'm going to make some assumptions to illustrate one way to address these access patterns.
Your attribute appears to be a composite key consisting of <vendor>~<date>, where the date is expressed as MM-DD-YYYY. I would recommend storing your date fields in YYYY-MM-DD format instead, which lets you exploit the sortability of the date field. An example will make this much clearer: imagine a table where every vendor record lives under a single generic partition key (PK = "Vendors") and the vendor/date composite (e.g. Chipotle~2020-08-26) is stored in the sort key.
I'm calling your vendorAndDate attribute SK, since I'm using it as a Sort Key in this example. This table structure allows me to implement your two access patterns by executing the following queries (in pseudocode to remain language agnostic):
Access Pattern 1: Fetch all Chipotle records for August 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-08-00 and Chipotle~2020-08-31
Access Pattern 2: Fetch all Chipotle records for 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-01-01 and Chipotle~2020-12-31
Because dates stored in ISO8601 format (e.g. YYYY-MM-DD...) are lexicographically sortable, you can perform range queries in DynamoDB in this way.
Again, I've made some assumptions about your data and access patterns for the purpose of illustrating the technique of using lexicographically sortable timestamps to implement range queries.
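For reference, the two pseudocode queries above could also be expressed in DynamoDB's PartiQL dialect (run via ExecuteStatement); the table name MyTable and the PK/SK attribute names are the assumptions made above, not values from the question:
Access Pattern 1 (Chipotle records for August 2020):
SELECT * FROM "MyTable" WHERE "PK" = 'Vendors' AND "SK" BETWEEN 'Chipotle~2020-08-00' AND 'Chipotle~2020-08-31'
Access Pattern 2 (Chipotle records for 2020):
SELECT * FROM "MyTable" WHERE "PK" = 'Vendors' AND "SK" BETWEEN 'Chipotle~2020-01-01' AND 'Chipotle~2020-12-31'
Because each statement fixes the partition key and gives a range on the sort key, DynamoDB should process it as a Query rather than a Scan.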

How to sort a Scala List[Map[String, Any]] by an arbitrary number of keys in the Map?

I have a List[Map[String, Any]] that represents the results of a query.
The keys of a map instance (or row) are the column names.
Each query is different and its result may contain a different set of columns, compared to any other query. Queries cannot be predicted in advance, hence I cannot use case classes to represent a result row.
Within the results for a given query, all columns appear in every row.
The values are largely Int, Double and String types.
I need to be able to sort the results by multiple columns, in both ascending and descending order.
For example, in pseudocode / SQL:
ORDER BY column1 ASC, column2 DESC, column3 ASC
I have three distinct problems:
Sort by a single column where its type (in the map, as opposed to its underlying type) is Any
Sort by either direction
Chain multiple sort instructions together
How can I do this?
UPDATE
I can do parts 1 and 2 by writing a custom Ordering[Any]. I don't yet know how to chain the sorts together.

Where clause versus join clause in Spark SQL

I am writing a query to get records from Table A which satisfy a condition based on records in Table B. For example:
Table A is:
Name Profession City
John Engineer Palo Alto
Jack Engineer SF
Table B is:
Profession City NewJobOffer
Engineer SF Yes
and I want to get Table C:
Name Profession City NewJobOffer
Jack Engineer SF Yes
I can do this in two ways, using a WHERE clause or a JOIN query; which one is faster in Spark SQL, and why?
Is it better to use a WHERE clause to compare the columns and select those records, or to join on the columns themselves?
It's better to provide the filter in the WHERE clause. These two expressions are not equivalent.
When you provide the filtering in the JOIN clause, the two data sources are retrieved in full and then joined on the specified condition. Since the join is done by first shuffling the data (redistributing it between executors), you are going to shuffle a lot of data.
When you provide the filter in the WHERE clause, Spark can recognize it, and the two data sources are filtered first and then joined. This way you shuffle less data. What may be even more important is that Spark may also be able to do a filter pushdown, filtering the data at the data source level, which means even less network pressure.
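For illustration, a hedged Spark SQL sketch of the two formulations, assuming the data has been registered as temporary views table_a and table_b (names invented for the example); running EXPLAIN on each shows the plan Spark actually chooses:
-- Variant 1: join condition expressed in the WHERE clause (implicit join)
SELECT a.Name, a.Profession, a.City, b.NewJobOffer
FROM table_a a, table_b b
WHERE a.Profession = b.Profession
  AND a.City = b.City;
-- Variant 2: the same condition expressed in the JOIN ... ON clause
SELECT a.Name, a.Profession, a.City, b.NewJobOffer
FROM table_a a
JOIN table_b b
  ON a.Profession = b.Profession
 AND a.City = b.City;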

Finding out the hash value of a group of rows

I have a table person in my PostgreSQL database, which contains data of different users.
I need to write a test case which ensures that some routine does modify the data of user 1 and does not modify the data of user 2.
For this purpose, I need to
a) calculate a hash code of all rows of user 1 and those of user 2,
b) then perform the operation under test,
c) calculate the hash code again and
d) compare hash codes from steps a) and c).
I found a way to calculate the hash code for a single row:
SELECT md5(CAST((f.*) AS text))
FROM person f;
In order to achieve my goal (find out whether rows of user 2 have been changed), I need to perform a query like this:
SELECT user_id, SOME_AGGREGATE_FUNCTION(md5(CAST((f.*) AS text)))
FROM person f
GROUP BY user_id;
What aggregate function can I use in order to calculate the hash code of a set of rows?
Note: I just want to know whether any rows of user 2 have been changed. I do not want to know, what exactly has changed.
The simplest way is to just concatenate all the md5 strings with string_agg. But to use this aggregate correctly you must specify an ORDER BY.
Or use md5(string_agg(md5(CAST((f.*) AS text)), '')) with some ORDER BY - it will change if any field of f.* changes, and it is cheap to compare.
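Putting that together, a hedged sketch of the full grouped query; the ORDER BY inside the aggregate uses the row's own text representation so the result does not depend on physical row order (any stable key, such as a primary key, would work as well):
SELECT f.user_id,
       md5(string_agg(md5(CAST((f.*) AS text)), '' ORDER BY CAST((f.*) AS text))) AS rows_hash
FROM person f
GROUP BY f.user_id;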
An even simpler way to do it:
SELECT user_id, md5(textin(record_out(A))) AS hash
FROM person A