Update an aggregate table with information from another table - scala

I have an aggregate table that needs to be updated frequently from a table that is regenerated with new data every few hours. Since the aggregate table grows with every update, I need an efficient way to refresh it.
Can someone please show me how to merge information from a new table into an aggregate table, as shown below?
For example:
val aggregate_table = Seq(("A", 10),("B", 20),("C", 30),("D", 40)).toDF("id", "total")
| id | total |
|-----+-------|
| "A" | 10 |
| "B" | 20 |
| "C" | 30 |
| "D" | 40 |
val new_info_table = Seq(("X"),("B"),("C"),("B"),("A"),("A"),("C"),("B")).toDF("id")
| id |
|-----|
| "X" |
| "B" |
| "C" |
| "B" |
| "A" |
| "A" |
| "C" |
| "B" |
Resulting aggregate_table after it has been merged with new_info_table:
| id | total |
|-----+-------|
| "A" | 12 |
| "B" | 23 |
| "C" | 33 |
| "D" | 40 |
| "X" | 1 |

Related

Compare consecutive rows and extract words (excluding the subsets) in Spark

I am working on a Spark dataframe. The input dataframe looks like the one below (Table 1). I need to write logic to get the keywords with maximum length for each session id. There can be multiple keywords in the output for each session id; the expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column, and then tried to compare each row with its subsequent row to see if it is of maximum length. But this solution only works partially.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id",$"value",$"Timestamp").withColumn("col_length",length($"value"))
val ts = Window
.orderBy("session_id")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
.withColumn("running_max", max("col_length") over ts)
.where($"running_max" === $"col_length")
.select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one output per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function. The window should be partitioned by session_id and ordered by Timestamp, so each row is compared with the next keystroke of the same session:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  // keep a row when the next value is shorter (the word was finished),
  // or when it is the last value in its session (lead is null)
  .filter((length(col("lead")) < length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show()

T-SQL: Pivot table without aggregate

I am trying to understand how to pivot data within T-SQL but can't seem to get it working. I have the following table structure:
+-------------------+-----------------------+
| Name | Value |
+-------------------+-----------------------+
| TaskId | 12417 |
| TaskUid | XX00044497 |
| TaskDefId | 23 |
| TaskStatusId | 4 |
| Notes | |
| TaskActivityIndex | 0 |
| ModifiedBy | Orange |
| Modified | /Date(1554540200000)/ |
| CreatedBy | Apple |
| Created | /Date(2121212100000)/ |
| TaskPriorityId | 40 |
| OId | 2 |
+-------------------+-----------------------+
I want to pivot the values of the Name column into columns. Expected output:
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| TASKID | TASKUID | TASKDEFID | TASKSTATUSID | NOTES | TASKACTIVITYINDEX | MODIFIEDBY | MODIFIED | CREATEDBY | CREATED | TASKPRIORITYID | OID |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| 12417 | XX00044497 | 23 | 4 | | 0 | Orange | /Date(1554540200000)/ | Apple | /Date(2121212100000)/ | 40 | 2 |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
Is there an easy way of doing it? The columns are fixed (not dynamic).
Any help is appreciated.
Try this (the full column list comes from the Name values in your table):
select *
from yourtable
pivot
(
    min(value)
    for Name in ([TaskId],[TaskUid],[TaskDefId],[TaskStatusId],[Notes],[TaskActivityIndex],[ModifiedBy],[Modified],[CreatedBy],[Created],[TaskPriorityId],[OId])
) as pivotable
You can also use CASE expressions. Note that you must use an aggregate function inside PIVOT; since each Name appears only once per task here, min(value) simply returns that value.
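For the CASE route, here is a minimal conditional-aggregation sketch; it assumes the same two-column table, referred to here as yourtable:
select
    min(case when Name = 'TaskId' then Value end) as TaskId,
    min(case when Name = 'TaskUid' then Value end) as TaskUid,
    min(case when Name = 'TaskDefId' then Value end) as TaskDefId
    -- ...repeat the pattern for the remaining Name values
from yourtable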
If you want to learn more, here is the reference:
https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-2017

Add columns but keep a specific id

I have a table "Listing" that looks like this:
| listing_id | amenities |
|------------|--------------------------------------------------|
| 5629709 | {"Air conditioning",Heating, Essentials,Shampoo} |
| 4156372 | {"Wireless Internet",Kitchen,"Pets allowed"} |
And another table "Amenity" like this:
| amenity_id | amenities |
|------------|--------------------------------------------------|
| 1 | Air conditioning |
| 2 | Kitchen |
| 3 | Heating |
Is there a way to join the two tables into a new one, "Listing_Amenity", like this:
| listing_id | amenities |
|------------|-----------|
| 5629709 | 1 |
| 5629709 | 3 |
| 4156372 | 2 |
You could use unnest:
CREATE TABLE Listing_Amenity AS
SELECT l.listing_id, a.amenity_id
FROM Listing l
   , unnest(l.amenities) sub(elem)
JOIN Amenity a
  ON a.amenities = sub.elem;
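If amenities is stored as plain text rather than a real array, cast it before unnesting; a hedged variant of the same query under that assumption:
-- Assumes Listing.amenities is text holding a Postgres array literal
SELECT l.listing_id, a.amenity_id
FROM Listing l
   , unnest(l.amenities::text[]) sub(elem)
JOIN Amenity a
  ON a.amenities = sub.elem;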

Postgres Group/compress result row inside 1 linked row

I have a table structure with two tables like this:
result table: one row with generic info plus a row UUID.
--------------------------
| uuid | name | other |
--------------------------
| result1 | foo | bar |
--------------------------
| result2 | foo2 | bar2 |
--------------------------
criteria_result:
-----------------------------------
| result_uuid | crit_uuid | value |
-----------------------------------
| result1 | crit1 | 7 |
-----------------------------------
| result1 | crit2 | 8 |
-----------------------------------
| result1 | crit3 | 9 |
-----------------------------------
| result1 | crit7 | 4 |
-----------------------------------
| result2 | crit1 | 2 |
-----------------------------------
What I need is one row per result, but with all the matching criteria_result rows grouped inside it, e.g.:
----------------------------------------------------
| uuid | name | result_crit |
----------------------------------------------------
| result1 | foo | [
| | crit1 | crit2 | crit3 | crit7 |
| | 7 | 8 | 9 | 4 |]
----------------------------------------------------
| result2 | foo2 | [
| | crit1 |
| | 2 | ]
----------------------------------------------------
Or even
-----------------------------------------
| uuid | name | result_crit |
-----------------------------------------
| result1 | foo | [ | name | value |
| crit1 | 7 |
| crit2 | 8 |
| crit3 | 9 |
| crit7 | 4 | ]
-----------------------------------------
-----------------------------------------
| result2 | foo2 | [ | name | value |
| crit1 | 2 | ]
-----------------------------------------
Anything that returns only one row per result when I export it, but also includes all of that result's criteria in a nested array/object, would work. My current (unaggregated) join is:
SELECT
result.uuid,
result.name,
criteria_result.result_uuid
FROM
public.criteria_result,
public.result
WHERE
result.uuid = criteria_result.result_uuid;
I tried CUBE, GROUP BY, and GROUPING SETS, but I can't seem to get it right or find the answer.
Thanks
Note: I do have a recent Postgres, 9.5.1.
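One way to get this (not in the original post) is to aggregate each result's criteria into a JSON array; json_agg and json_build_object are both available in Postgres 9.5. A minimal sketch, assuming the table and column names shown above:
-- One row per result; result_crit carries all of its criteria as JSON
SELECT r.uuid,
       r.name,
       json_agg(json_build_object('name', cr.crit_uuid, 'value', cr.value)) AS result_crit
FROM result r
JOIN criteria_result cr
  ON cr.result_uuid = r.uuid
GROUP BY r.uuid, r.name;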

How to return the number of records as part of a select statement?

I'd like to know if there is a way to include row numbers (basically telling me how many records I'm getting back from a database query).
I have the following SQL query
SELECT w.widget_id, w.class_id, wg.name classname, wg.label AS classgroup, c.label, c.seq,
g.name AS group, p.name, p.type, CASE WHEN v.value IS NOT NULL THEN v.value WHEN g2p.value IS NOT NULL THEN g2p.value ELSE p.value END AS value
FROM widgets_to_categories w
INNER JOIN widget_classes c ON w.class_id = c.class_id
JOIN classes_to_param_groups t2g ON c.class_id = t2g.class_id
JOIN widget_groups g ON t2g.group_id = g.group_id
JOIN param_groups_to_params g2p ON t2g.group_id = g2p.group_id
JOIN provisioning_params p ON g2p.param_id = p.param_id
INNER JOIN widget_cat_groups wg ON c.class_group_id = wg.class_group_id
LEFT JOIN widget_values v ON(w.widget_id=v.device_id AND p.param_id=v.param_id AND g.name=v.group_name )
WHERE w.widget_id=8 ORDER BY c.class_id ASC
And it returns data like:
widget_id | class_id | classname | classgroup | label | seq | group | name | type | value
8 | 1 | toy | group A | test label | 1 | toy | reg | text | af
8 | 1 | toy | group A | test label | 1 | reg2 | fall | text | 25327
8 | 1 | toy | group A | test label | 1 | reg2 | pd | text | dvaa
8 | 1 | toy | group A | test label | 1 | reg2 | ext | text | 28235
8 | 1 | toy | group A | test label | 1 | reg1 | ext | text | 28230
8 | 1 | toy | group A | test label | 1 | toy | meec | text | 094F22DE501
8 | 1 | toy | group A | test label | 1 | toy | mmap | text | 0|
8 | 1 | toy | group A | test label | 1 | reg1 | fna | text | 26014
8 | 1 | toy | group A | test label | 1 | reg1 | fall | text | t-123
8 | 1 | toy | group A | test label | 1 | toy | uen | boolean | false
8 | 1 | toy | group A | test label | 1 | toy | adminpd |
I'd like to know if there's a way to have the database auto generate and return another column that is just an identifier for the row, like so:
id |widget_id | class_id | classname | classgroup | label | seq | group | name | type | value
1 | 8 | 1 | toy | group A | test label | 1 | toy | reg | text | af
2 | 8 | 1 | toy | group A | test label | 1 | reg2 | fall | text | 25327
3 | 8 | 1 | toy | group A | test label | 1 | reg2 | pd | text | dvaa
4 | 8 | 1 | toy | group A | test label | 1 | reg2 | ext | text | 28235
5 | 8 | 1 | toy | group A | test label | 1 | reg1 | ext | text | 28230
6 | 8 | 1 | toy | group A | test label | 1 | toy | meec | text | 094F22DE501
7 | 8 | 1 | toy | group A | test label | 1 | toy | mmap | text | 0|
8 | 8 | 1 | toy | group A | test label | 1 | reg1 | fna | text | 26014
9 | 8 | 1 | toy | group A | test label | 1 | reg1 | fall | text | t-123
10 | 8 | 1 | toy | group A | test label | 1 | toy | uen | boolean | false
11 | 8 | 1 | toy | group A | test label | 1 | toy | adminpd | boolean | false
I think I can do this by selecting into a temporary table, but I haven't figured out the syntax yet, and I'm wondering if there's a simpler way.
Once I get the data back from the database, having this ID field makes it easier to manipulate.
Thanks.
You can use the row_number window function to keep track of each row number.
Like so:
create table foo
(
id serial,
val text
);
INSERT INTO foo (val)
VALUES ('One'), ('Two'), ('Three');
SELECT f.*, row_number() OVER(ORDER BY val)
FROM foo AS f
ORDER BY val;
Here's an SQL Fiddle which shows this:
http://sqlfiddle.com/#!15/0c434/2
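If what you actually want is the total number of rows returned alongside each row, a window count works too; a small self-contained sketch against the same foo table (total_rows is a name chosen here, not from the question):
SELECT f.*,
       row_number() OVER (ORDER BY val) AS id,
       count(*) OVER () AS total_rows -- same total repeated on every row
FROM foo AS f
ORDER BY val;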
Additional options:
You could count the result with a query of the form below; note that Postgres requires an alias on the derived table:
SELECT count(*)
FROM
(
    SELECT *
    FROM foo
) AS sub;
Or you may be able to get the row count back as part of the Postgres library you're using. For example, psycopg2 (Python) and DBI (Perl) allow for this (with some caveats). The library you're using may offer something similar.