Power Query: some values are counted from zero when count should start at a higher value - merge

I have a column with values counting occurrences.
I am trying to continue the series in Power Query.
I am thus trying to add 1 to the maximum of the given column.
The ID column has rows with letter tags: AB or BE. Each letter tag is followed by a number from one of two ranges: 0000 to 3000, or 3001 to 6000.
I thus have the following possibilities: from AB0000 to AB3000, from AB3001 to AB6000, from BE0000 to BE3000, and from BE3001 to BE6000.
Each category matches a specific item in my geography column, from the other workbook: from AB0000 to AB3000 it is ItalyZ, from AB3001 to AB6000 it is ItalyB, from BE0000 to BE3000 it is UKY, and from BE3001 to BE6000 it is UKM.
I am thus trying to find the highest number associated with the first AB category, the second AB category, the first BE category, and the second BE category.
My issue is that for some values, there is simply "nothing" yet in the source file.
This means that there is no occurrence yet of UKM, for example.
Here is an example with no UKM or UKY:
| Max  | Geography |
|------|-----------|
| 0562 | ItalyZ    |
| 0563 | ItalyZ    |
Hence, I have the following result:
| Increment | Place  |
|-----------|--------|
| 0564      | ItalyZ |
| 0565      | ItalyZ |
| 0565      | ItalyZ |
| null      | UKM    |
Here is the Power Query code used:
let
    Source = #table(
        {"Prefix", "Seq_Start", "Seq_End", "GeoLocation"},
        {{"AB", 0, 2999, "ItalyZ"}, {"AB", 3000, 6000, "ItalyB"},
         {"BE", 0, 2999, "UKY"}, {"BE", 3000, 6000, "UKM"}}),
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Seq_Start", Int64.Type}, {"Seq_End", Int64.Type}}),
    // Left outer join against the HighestID query on the letter prefix
    #"Merged Queries" = Table.NestedJoin(#"Changed Type", {"Prefix"}, HighestID, {"Prefix"}, "HighestID", JoinKind.LeftOuter),
    #"Expanded HighestID" = Table.ExpandTableColumn(#"Merged Queries", "HighestID", {"Number"}, {"Number"}),
    // Keep only the numbers that fall inside each category's range
    #"Filtered Rows" = Table.SelectRows(#"Expanded HighestID", each [Number] >= [Seq_Start] and [Number] <= [Seq_End]),
    #"Grouped Rows" = Table.Group(#"Filtered Rows", {"Prefix", "Seq_Start", "Seq_End", "GeoLocation"}, {{"NextSeq", each List.Max([Number]) + 1, type number}})
in
    #"Grouped Rows"
I would like to know how I could ensure that for the first occurrence of a value I get "0000" (or 0) rather than "null", and so on for the next occurrences.
Because, for example, if I have 0 occurrences of UKM before, I do not know why, but the end result will be as follows:
| Increment | Place |
|-----------|-------|
| 1         | UKM   |
| 2         | UKM   |
Which is not ideal, because the UKM range starts at 3000. And because no values were recorded before, it starts with "null" and then 1, 2... rather than 3001 and 3002.
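A possible fix, as a minimal sketch using the question's own step and column names: keep the unmatched (null) rows in the filter step, then fall back to Seq_Start - 1 before adding 1, so a category with no occurrences yet starts at the bottom of its range (the ?? coalescing operator assumes a reasonably recent Power Query version):
    #"Filtered Rows" = Table.SelectRows(#"Expanded HighestID", each
        [Number] = null or ([Number] >= [Seq_Start] and [Number] <= [Seq_End])),
    // List.RemoveNulls drops the placeholder nulls; if nothing is left,
    // List.Max returns null and ?? falls back to Seq_Start - 1
    #"Grouped Rows" = Table.Group(#"Filtered Rows",
        {"Prefix", "Seq_Start", "Seq_End", "GeoLocation"},
        {{"NextSeq", each (List.Max(List.RemoveNulls([Number])) ?? ([Seq_Start]{0} - 1)) + 1, type number}})
With no BE rows yet in HighestID, NextSeq then comes out as 0 for UKY and 3000 for UKM instead of null.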

Related

Update telephone number format to include country identifier

Being a beginner in SQL, I am trying to get the telephone fields into the format: '"+" + country identifier + telephone number'.
UPDATE public.contact
SET phone_number = CASE
        WHEN country_code = 'FR'
             AND phone_number NOT LIKE '+33%'
             AND phone_number <> NULL
            THEN CONCAT('+33', phone_number)
        WHEN country_code = 'GB'
             AND phone_number NOT LIKE '+44%'
             AND phone_number <> NULL
            THEN CONCAT('+44', phone_number)
        ELSE phone_number
    END;
I want to update the telephone number format to include the country identifier, e.g. 0606080905 -> +33606080905 if country_code = 'FR'. I am looking for a faster and less complex way than what I did.
You can do this with a regular expression using regexp_replace.
Imagine your data being:
Table 'numbers':

+---------+--------------+
| country | phone        |
+---------+--------------+
| FR      | 0606080905   |
| FR      | +33606080906 |
| GB      | 0123456789   |
| GB      | +44987654321 |
| GB      | NULL         |
+---------+--------------+
Then the following update would replace the leading 0 with the country code +33 for all numbers that do not start with a +xx and have FR as country.
UPDATE numbers
SET phone = REGEXP_REPLACE(trim(phone), '^(0)', '+33')
WHERE country = 'FR'
Explained:
the ^ means start of the string
the (0) is the match that gets replaced (leading zero)
the +33 is the string that is used to replace it
the trim() is just added for safety, in case there are leading spaces
NULL phone numbers won't be affected, as they do not match
You could now do this as you did before, with a CASE WHEN or something similar for each of the different possibilities. But since the expression is always the same, an easier way would be to keep your country codes and their numerical mapping in a separate table:
Table 'mapping':

+---------+--------+
| country | prefix |
+---------+--------+
| FR      | +33    |
| GB      | +44    |
+---------+--------+
You could then do
UPDATE numbers n
SET phone = REGEXP_REPLACE(trim(phone), '^(0)', prefix)
FROM mapping m
WHERE m.country = n.country
and update all your numbers in one go:
+---------+--------------+
| country | phone        |
+---------+--------------+
| FR      | +33606080905 |
| FR      | +33606080906 |
| GB      | +44123456789 |
| GB      | +44987654321 |
| GB      | NULL         |
+---------+--------------+
EDIT: Previously, I had this needlessly complicated answer. You may need something like this if your phone number patterns are more diverse...
The following update would replace the leading 0 with the country code +33 for all numbers that do not start with a +xx and have FR as country.
UPDATE numbers
SET phone = REGEXP_REPLACE(trim(phone), '^(?<![+\d{2}])(0)', '+33')
WHERE country = 'FR'
Explained:
the (?<![+\d{2}]) is a negative lookbehind assertion that makes sure the regex only matches if the leading zero is not preceded by a + or digits
the (0) is the match that gets replaced
the +33 is the string that is used to replace it
the trim() is just added for safety, in case there are leading spaces
NULL phone numbers won't be affected, as they do not match
That's about as simple as it gets.
The only way I can imagine to speed up processing is to add a WHERE condition that avoids updating the rows that don't have to be modified.
You could also run several such statements in parallel, where each modifies a different part of the table.
As mentioned in the comment, <> NULL is never true.
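Since comparisons with NULL yield NULL, a hedged sketch of one branch of the original CASE rewritten as its own statement, with the corrected IS NOT NULL test (the WHERE clause also skips rows that need no change):
UPDATE public.contact
SET phone_number = CONCAT('+33', phone_number)
WHERE country_code = 'FR'
  AND phone_number NOT LIKE '+33%'
  AND phone_number IS NOT NULL;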

Count records between rolling date range in Tableau

I have a file with a [start] and [end] date in Tableau and would like to create a calculated field that counts, on a rolling basis, the number of rows that occur between [start] and [end] for each [person]. The data looks like so:
| Start    | End       | Person |
|----------|-----------|--------|
| 1/1/2019 | 1/7/2019  | A      |
| 1/3/2019 | 1/9/2019  | A      |
| 1/8/2019 | 1/15/2019 | A      |
| 1/1/2019 | 1/7/2019  | B      |
I'd like to create a calculated field [count] with results like so:
| Start    | End       | Person | Count |
|----------|-----------|--------|-------|
| 1/1/2019 | 1/7/2019  | A      | 1     |
| 1/3/2019 | 1/9/2019  | A      | 2     |
| 1/8/2019 | 1/15/2019 | A      | 2     |
| 1/1/2019 | 1/7/2019  | B      | 1     |
EDITED: A good analogy for what [count] represents is: "how many videos does each person have rented at the same time, as of that moment?" With the 1st row for person A, count is 1, with 1 item rented. As of row 2, person A has 2 items rented. But for the 3rd row, [count] = 2, since the video rented in the first row is no longer rented.
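For illustration only, here is the same logic as a SQL self-join sketch (the table name rentals is an assumption; in Tableau itself this would typically be built with a self-join or data blend rather than a single calculated field): for each row, count the rows of the same person whose rental interval contains that row's start date.
SELECT a.Start, a."End", a.Person, COUNT(*) AS "Count"
FROM rentals a
JOIN rentals b
  ON  b.Person = a.Person
  -- b is still rented at the moment a starts
  AND a.Start BETWEEN b.Start AND b."End"
GROUP BY a.Start, a."End", a.Person;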

pyspark: how to modify column value based on other columns for the same Id

I have a pyspark dataframe with 5 columns: Id, a value X, lower & upper bounds of X, and the update date (the dataframe is ordered by "Id, Update"). I read it from a Hive table:
spark.sql("SELECT * FROM table1 ORDER BY Update")
+---+----------+----------+----------+----------+
| Id| X| LB| UB| Update|
+---+----------+----------+----------+----------+
| 1|2019-01-20|2019-01-15|2019-01-25|2019-01-02|
| 1|2019-01-17|2019-01-15|2019-01-25|2019-01-03|
| 1|2019-01-10|2019-01-15|2019-01-25|2019-01-05|
| 1|2019-01-12|2019-01-15|2019-01-25|2019-01-07|
| 1|2019-01-15|2019-01-15|2019-01-25|2019-01-08|
| 2|2018-12-12|2018-12-07|2018-12-17|2018-11-17|
| 2|2018-12-15|2018-12-07|2018-12-17|2018-11-18|
When "X" is lower than "LB" or greater than "UB", "LB" & "UB" must be re-computed according to X, for that row and for all the following rows having the same Id:
if (X < LB or X > UB):
    LB = X - 5 (in days)
    UB = X + 5 (in days)
The result should look like this:
+---+----------+----------+----------+----------+
| Id| X| LB| UB| Update|
+---+----------+----------+----------+----------+
| 1|2019-01-20|2019-01-15|2019-01-25|2019-01-02|
| 1|2019-01-17|2019-01-15|2019-01-25|2019-01-03|
| 1|2019-01-10|2019-01-05|2019-01-15|2019-01-05|
| 1|2019-01-12|2019-01-05|2019-01-15|2019-01-07|
| 1|2019-01-15|2019-01-05|2019-01-15|2019-01-08|
| 2|2018-12-12|2018-12-07|2018-12-17|2018-11-17|
| 2|2018-12-15|2018-12-07|2018-12-17|2018-11-18|
The third, fourth & fifth rows are changed.
How can I achieve this?
Try a CASE statement within a select expression:
df.selectExpr("Id AS Id",
              "X AS X",
              "CASE WHEN X<LB OR X>UB THEN date_sub(X,5) ELSE LB END AS LB",
              "CASE WHEN X<LB OR X>UB THEN date_add(X,5) ELSE UB END AS UB",
              "Update AS Update").show()

How to create a Postgres trigger that calculates values

How would you create a trigger that uses the values of the row being inserted in a calculation, so that the value being inserted gets transformed?
Let's say I have this table labor_rates,
+---------------+-----------------+--------------+------------+
| labor_rate_id | rate_per_minute | unit_minutes | created_at |
+---------------+-----------------+--------------+------------+
| bigint | numeric | numeric | timestamp |
+---------------+-----------------+--------------+------------+
Each time a new record is created, I need the rate to be calculated as rate/unit (the smallest unit here is a minute).
For example, when inserting a new record:
INSERT INTO labor_rates(rate, unit)
VALUES (60, 480);
It would create a new record with these values:
+---------------+-----------------+--------------+----------------------------+
| labor_rate_id | rate_per_minute | unit_minutes | created_at |
+---------------+-----------------+--------------+----------------------------+
| 1000000 | 1.1979 | 60 | 2017-03-16 01:59:47.208111 |
+---------------+-----------------+--------------+----------------------------+
One could argue that this should be left as a calculated field instead of storing the calculated value. But in this case, it would be best if the calculated value is stored.
I am fairly new to triggers so any help would be much appreciated.
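A hedged sketch of one way to do this with a BEFORE INSERT trigger, assuming the raw values arrive in the rate_per_minute and unit_minutes columns (the question's INSERT names rate and unit, which do not appear in the table definition, so adjust the names to the real schema):
CREATE OR REPLACE FUNCTION labor_rates_calc_rate()
RETURNS trigger AS $$
BEGIN
    -- Divide the raw rate by the unit count to store a per-minute rate
    NEW.rate_per_minute := NEW.rate_per_minute / NEW.unit_minutes;
    RETURN NEW;  -- the modified row is what actually gets inserted
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_labor_rates_calc
BEFORE INSERT ON labor_rates
FOR EACH ROW
EXECUTE PROCEDURE labor_rates_calc_rate();
A BEFORE trigger is the right place for this: it can rewrite NEW before the row is written, whereas an AFTER trigger would need a second UPDATE.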

PostgreSQL BETWEEN selects record when not fulfilled

Why does this query return a record?
db2=> select * FROM series WHERE start <= '882001010000' AND "end" >= '882001010000' ORDER BY timestamp DESC LIMIT 1;
  id   |      timestamp      |  start   |   end
-------+---------------------+----------+----------
 23443 | 2016-12-23 17:10:05 | 88160000 | 88209999
or with BETWEEN:
db2=> select * FROM series WHERE '882001010000' BETWEEN start AND "end" ORDER BY timestamp DESC LIMIT 1;
  id   |      timestamp      |  start   |   end
-------+---------------------+----------+----------
 23443 | 2016-12-23 17:10:05 | 88160000 | 88209999
start and end are TEXT columns.
They are returning records because you are doing the comparisons as strings not as numbers.
Hence: '8' is between '7000000' and '9000', because the comparisons are one character at a time.
If you want numeric comparisons, you can cast the values to numbers. Or, better yet, store the values as numeric. Postgres's numeric type supports very large precision.
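For illustration, the original query with the casts applied (a sketch; changing the column types is the cleaner fix):
SELECT *
FROM series
WHERE start::numeric <= 882001010000
  AND "end"::numeric >= 882001010000
ORDER BY timestamp DESC
LIMIT 1;
Compared numerically, the example row no longer matches, since 88209999 < 882001010000.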