I have this DataFrame below:
Ref ° | Indice_1 | Indice_2 | 1  | 2  | indice_from     | indice_from  | indice_to       | indice_to
------|----------|----------|----|----|-----------------|--------------|-----------------|-----------
1     | 19       | 37.1     | 32 | 62 | ["20031,10031"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
2     | 19       | 37.1     | 44 | 12 | ["40062,30062"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
3     | 19       | 37.1     | 22 | 64 | ["20031,10031"] | ["13,11/12"] | ["20031,10031"] | ["13,11/12"]
4     | 19       | 37.1     | 32 | 98 | ["20032,10032"] | ["13,11/12"] | ["40062,30062"] | ["13,11/12"]
I want to sort the values of the columns indice_from, indice_from, indice_to, and indice_to in ascending order, without touching the rest of the columns of my DataFrame.
Note that the indice_from and indice_to columns sometimes contain a number plus a letter, like ["14,14A"]. In such a case the structure is always the same: if the first value is 15, the second value is 15 plus a letter, and 15 sorts before 15 + letter; if the first value is 9, the second value is 9 plus a letter, and 9 sorts before 9 + letter.
New DataFrame:
Ref ° | Indice_1 | Indice_2 | 1  | 2  | indice_from     | indice_from  | indice_to       | indice_to
------|----------|----------|----|----|-----------------|--------------|-----------------|-----------
1     | 19       | 37.1     | 32 | 62 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
2     | 19       | 37.1     | 44 | 12 | ["30062,40062"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
3     | 19       | 37.1     | 22 | 64 | ["10031,20031"] | ["11/12,13"] | ["10031,20031"] | ["11/12,13"]
4     | 19       | 37.1     | 32 | 98 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["11/12,13"]
Can someone please help me sort the values of the columns indice_from, indice_from, indice_to, and indice_to to obtain a new DataFrame like the second one above?
Thank you
If I understand it correctly, then
from pyspark.sql import functions as F

columns_to_sort = ['indice_from', 'indice_from', 'indice_to', 'indice_to']

for c in columns_to_sort:
    df = df.withColumn(c, F.sort_array(c))
will do the trick. Let me know if it doesn't.
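For reference, here is a minimal runnable sketch of the same idea (the toy data is made up, the columns are assumed to hold arrays of strings, and I use a single indice_from and indice_to, since withColumn cannot target one of two columns that share a name):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy row mimicking the question's array columns.
df = spark.createDataFrame(
    [(1, ["20031", "10031"], ["14A", "14"])],
    ["Ref", "indice_from", "indice_to"],
)

# sort_array orders string arrays lexicographically, which already
# puts "14" before "14A" and "11/12" before "13", as required.
for c in ["indice_from", "indice_to"]:
    df = df.withColumn(c, F.sort_array(c))

df.show(truncate=False)
# +---+--------------+----------+
# |Ref|indice_from   |indice_to |
# +---+--------------+----------+
# |1  |[10031, 20031]|[14, 14A] |
# +---+--------------+----------+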
I have a DB with a timestamp field,
and I want to partition it into buckets of 2 seconds (I know how to do it for 1 minute and for 1 second).
This is an example of the DB:
create table data_t(id integer, time_t timestamp without time zone, data_t integer );
insert into data_t(id,time_t,data_t) values(1,'1999-01-08 04:05:06',248),
(2,'1999-01-08 04:05:06.03',45),
(3,'1999-01-08 04:05:06.035',98),
(4,'1999-01-08 04:05:06.9',57),
(5,'1999-01-08 04:05:07',86),
(6,'1999-01-08 04:05:08',84),
(7,'1999-01-08 04:05:08.5',832),
(8,'1999-01-08 04:05:08.7',86),
(9,'1999-01-08 04:05:08.9',863),
(10,'1999-01-08 04:05:9',866),
(11,'1999-01-08 04:05:10',862),
(12,'1999-01-08 04:05:10.5',863),
(13,'1999-01-08 04:05:10.55',826),
(14,'1999-01-08 04:05:11',816),
(15,'1999-01-08 04:05:11.7',186),
(16,'1999-01-08 04:05:12',862),
(17,'1999-01-08 04:05:12.5',826)
;
with t as (
    select id,
           time_t,
           date_trunc('second', data_t.time_t) as time_t_1,
           data_t
    from data_t
), t1 as (
    select *,
           extract(hour from time_t_1) as h,
           extract(minute from time_t_1) as m,
           extract(second from time_t_1) as s
    from t
)
select *,
       row_number() over (partition by h, m, s order by time_t_1) as t_sequence
from t1;
The output of this is:
| id | time_t | time_t_1 | data_t | h | m | s | t_sequence |
|----|--------------------------|----------------------|--------|---|---|----|------------|
| 1 | 1999-01-08T04:05:06Z | 1999-01-08T04:05:06Z | 248 | 4 | 5 | 6 | 1 |
| 2 | 1999-01-08T04:05:06.03Z | 1999-01-08T04:05:06Z | 45 | 4 | 5 | 6 | 2 |
| 3 | 1999-01-08T04:05:06.035Z | 1999-01-08T04:05:06Z | 98 | 4 | 5 | 6 | 3 |
| 4 | 1999-01-08T04:05:06.9Z | 1999-01-08T04:05:06Z | 57 | 4 | 5 | 6 | 4 |
| 5 | 1999-01-08T04:05:07Z | 1999-01-08T04:05:07Z | 86 | 4 | 5 | 7 | 1 |
| 6 | 1999-01-08T04:05:08Z | 1999-01-08T04:05:08Z | 84 | 4 | 5 | 8 | 1 |
| 7 | 1999-01-08T04:05:08.5Z | 1999-01-08T04:05:08Z | 832 | 4 | 5 | 8 | 2 |
| 8 | 1999-01-08T04:05:08.7Z | 1999-01-08T04:05:08Z | 86 | 4 | 5 | 8 | 3 |
| 9 | 1999-01-08T04:05:08.9Z | 1999-01-08T04:05:08Z | 863 | 4 | 5 | 8 | 4 |
| 10 | 1999-01-08T04:05:09Z | 1999-01-08T04:05:09Z | 866 | 4 | 5 | 9 | 1 |
| 11 | 1999-01-08T04:05:10Z | 1999-01-08T04:05:10Z | 862 | 4 | 5 | 10 | 1 |
| 12 | 1999-01-08T04:05:10.5Z | 1999-01-08T04:05:10Z | 863 | 4 | 5 | 10 | 2 |
| 13 | 1999-01-08T04:05:10.55Z | 1999-01-08T04:05:10Z | 826 | 4 | 5 | 10 | 3 |
| 14 | 1999-01-08T04:05:11Z | 1999-01-08T04:05:11Z | 816 | 4 | 5 | 11 | 1 |
| 15 | 1999-01-08T04:05:11.7Z | 1999-01-08T04:05:11Z | 186 | 4 | 5 | 11 | 2 |
| 16 | 1999-01-08T04:05:12Z | 1999-01-08T04:05:12Z | 862 | 4 | 5 | 12 | 1 |
| 17 | 1999-01-08T04:05:12.5Z | 1999-01-08T04:05:12Z | 826 | 4 | 5 | 12 | 2 |
As you can see, t_sequence starts over every second, but I want it to start over every 2 seconds.
Is there a way to do it?
link for SQL fiddle with all the data
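One possible way (a sketch, assuming PostgreSQL as in the fiddle): derive the bucket directly from the epoch seconds instead of the h/m/s triple, so that rows falling into the same 2-second window share a partition:

-- Integer-divide the epoch seconds by 2 to get a 2-second bucket id,
-- then number the rows within each bucket.
select *,
       row_number() over (
           partition by floor(extract(epoch from time_t) / 2)
           order by time_t
       ) as t_sequence
from data_t;

Note that these windows are aligned to even epoch seconds; adding an offset inside floor() shifts the boundaries if a different alignment is needed.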
At the moment I'm struggling with a problem that looks very easy.
Table content:
Primary keys: Timestamp, COL_A, COL_B, COL_C, COL_D
+------------------+-------+-------+-------+-------+--------+--------+
| Timestamp | COL_A | COL_B | COL_C | COL_D | Data_A | Data_B |
+------------------+-------+-------+-------+-------+--------+--------+
| 31.07.2019 15:12 | - | - | - | - | 1 | 2 |
| 31.07.2019 15:32 | 1 | 1 | 100 | 1 | 5000 | 20 |
| 10.08.2019 09:33 | - | - | - | - | 1000 | 7 |
| 31.07.2019 15:38 | 1 | 1 | 100 | 1 | 33 | 5 |
| 06.08.2019 08:53 | - | - | - | - | 0 | 7 |
| 06.08.2019 09:08 | - | - | - | - | 0 | 7 |
| 06.08.2019 16:06 | 3 | 3 | 3 | 3 | 0 | 23 |
| 07.08.2019 10:43 | - | - | - | - | 0 | 42 |
| 07.08.2019 13:10 | - | - | - | - | 0 | 24 |
| 08.08.2019 07:19 | 11 | 111 | 111 | 12 | 0 | 2 |
| 08.08.2019 10:54 | 2334 | 65464 | 565 | 76 | 1000 | 19 |
| 08.08.2019 11:15 | 232 | 343 | 343 | 43 | 0 | 2 |
| 08.08.2019 11:30 | 2323 | rtttt | 3434 | 34 | 0 | 2 |
| 10.08.2019 14:47 | - | - | - | - | 123 | 23 |
+------------------+-------+-------+-------+-------+--------+--------+
Needed query output:
+------------------+-------+-------+-------+-------+--------+--------+
| Timestamp | COL_A | COL_B | COL_C | COL_D | Data_A | Data_B |
+------------------+-------+-------+-------+-------+--------+--------+
| 31.07.2019 15:38 | 1 | 1 | 100 | 1 | 33 | 5 |
| 06.08.2019 16:06 | 3 | 3 | 3 | 3 | 0 | 23 |
| 08.08.2019 07:19 | 11 | 111 | 111 | 12 | 0 | 2 |
| 08.08.2019 10:54 | 2334 | 65464 | 565 | 76 | 1000 | 19 |
| 08.08.2019 11:15 | 232 | 343 | 343 | 43 | 0 | 2 |
| 08.08.2019 11:30 | 2323 | rtttt | 3434 | 34 | 0 | 2 |
| 10.08.2019 14:47 | - | - | - | - | 123 | 23 |
+------------------+-------+-------+-------+-------+--------+--------+
As you can see, I'm trying to get a single row per primary-key combination, using the latest timestamp, which is itself part of the primary key.
Currently I'm using a query like:
SELECT Timestamp, COL_A, COL_B, COL_C, COL_D, Data_A, Data_B
FROM XY op
WHERE Timestamp = (
    SELECT MAX(Timestamp)
    FROM XY AS tsRow
    WHERE op.COL_A = tsRow.COL_A
      AND op.COL_B = tsRow.COL_B
      AND op.COL_C = tsRow.COL_C
      AND op.COL_D = tsRow.COL_D
);
which gives me a result that looks fine at first glance.
Is there a better or safer way to get my preferred result?
demo:db<>fiddle
You can use the DISTINCT ON clause, which gives you the first record of each ordered group. Here your group is (COL_A, COL_B, COL_C, COL_D), ordered by the Timestamp column in descending order so that the most recent record comes first.
SELECT DISTINCT ON ("COL_A", "COL_B", "COL_C", "COL_D")
*
FROM
mytable
ORDER BY "COL_A", "COL_B", "COL_C", "COL_D", "Timestamp" DESC
If you want to get your expected order, you need a second ORDER BY after this operation:
SELECT
*
FROM (
SELECT DISTINCT ON ("COL_A", "COL_B", "COL_C", "COL_D")
*
FROM
mytable
ORDER BY "COL_A", "COL_B", "COL_C", "COL_D", "Timestamp" DESC
) s
ORDER BY "Timestamp"
Note: if the Timestamp column is part of the PK, are you sure you really need the four other columns in the PK as well? It seems the Timestamp column alone is already unique.
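For comparison, a portable sketch of the same query using the standard row_number() window function instead of the Postgres-specific DISTINCT ON (same assumed table and column names as above):

SELECT "Timestamp", "COL_A", "COL_B", "COL_C", "COL_D", "Data_A", "Data_B"
FROM (
    SELECT *,
           -- Number the rows of each (A, B, C, D) group, newest first.
           row_number() OVER (
               PARTITION BY "COL_A", "COL_B", "COL_C", "COL_D"
               ORDER BY "Timestamp" DESC
           ) AS rn
    FROM mytable
) s
WHERE rn = 1
ORDER BY "Timestamp";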
In PostgreSQL, what is the best way to sort records using start and end fields in a generic way, without having to hard-code the first record (where start_id = 3) in the query?
Example table:
+-------+----------+--------+--------+
| FK_ID | START_ID | END_ID | STRING |
+-------+----------+--------+--------+
| 77 | 1 | 9 | E |
| 82 | 5 | 2 | A |
| 77 | 7 | 1 | I |
| 77 | 3 | 7 | W |
| 82 | 9 | 5 | Q |
| 77 | 9 | 5 | X |
| 82 | 2 | 7 | G |
+-------+----------+--------+--------+
Sorted where FK_ID = 77:
+----+---+---+---+
| 77 | 3 | 7 | W |
| 77 | 7 | 1 | I |
| 77 | 1 | 9 | E |
| 77 | 9 | 5 | X |
+----+---+---+---+
Sorted where FK_ID = 82:
+----+---+---+---+
| 82 | 9 | 5 | Q |
| 82 | 5 | 2 | A |
| 82 | 2 | 7 | G |
+----+---+---+---+
Result query sequence:
+-------+----------+
| FK_ID | SEQUENCE |
+-------+----------+
| 82 | QAG |
| 77 | WIEX |
+-------+----------+
I do not think this is the most efficient way, but you can try a recursive CTE:
WITH RECURSIVE path AS (
    -- Anchor: per fk_id, the row whose start_id is no other row's end_id.
    SELECT t1.*, 1 AS step FROM myTable AS t1
    WHERE NOT EXISTS (
        SELECT 1 FROM myTable AS t2
        WHERE t2.fk_id = t1.fk_id AND t2.end_id = t1.start_id
    )
    UNION ALL
    -- Walk each chain: the next row starts where the previous one ended.
    SELECT myTable.*, path.step + 1 FROM myTable
    JOIN path ON path.fk_id = myTable.fk_id AND path.end_id = myTable.start_id
)
SELECT fk_id, string_agg(string, '' ORDER BY step) AS sequence
FROM path GROUP BY fk_id;
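The step counter carried through the recursion is what lets string_agg return the letters in chain order; without the ORDER BY inside the aggregate, the row order within each fk_id group would not be guaranteed.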
I want to calculate an average in Spotfire only when there are at least 3 values. If there are no values, or just 2 values, the average should be blank.
Raw data:
+---------+-----+---------+
| Product | Age | Average |
+---------+-----+---------+
| 1       |     |         |
| 2       |     |         |
| 3       | 10  |         |
| 4       | 12  |         |
| 5       | 13  | 11      |
| 6       |     |         |
| 7       | 18  |         |
| 8       | 19  |         |
| 9       | 20  | 19      |
| 10      | 21  | 20      |
+---------+-----+---------+
The only way I could really do this is with 3 calculated columns. Insert these calculated columns in this order:
If(Min(If([Age] IS NULL,0,[Age])) over (LastPeriods(3,[Product]))<>0,1) as [BitFlag]
Avg([Age]) over (LastPeriods(3,[Product])) as [TempAvg]
If([BitFlag]=1,[TempAvg]) as [Average]
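(LastPeriods(3, [Product]) spans the current row and the two preceding ones, so [BitFlag] is 1 only when none of those three Age values is null; [TempAvg] is the plain 3-row moving average, and [Average] exposes it only where the flag is set.)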
This will give you the following results. You can ignore / hide the two columns you don't care about.
RESULTS
+---------+-----+---------+------------------+------------------+
| Product | Age | BitFlag | TempAvg | Average |
+---------+-----+---------+------------------+------------------+
| 1 | | | | |
| 2 | | | | |
| 3 | 10 | | 10 | |
| 4 | 12 | | 11 | |
| 5 | 13 | 1 | 11.6666666666667 | 11.6666666666667 |
| 6 | | | 12.5 | |
| 7 | 18 | | 15.5 | |
| 8 | 19 | | 18.5 | |
| 9 | 20 | 1 | 19 | 19 |
| 10 | 21 | 1 | 20 | 20 |
| 11 | | | 20.5 | |
| 12 | 22 | | 21.5 | |
| 13 | 36 | | 29 | |
| 14 | | | 29 | |
| 15 | 11 | | 23.5 | |
| 16 | 23 | | 17 | |
| 17 | 14 | 1 | 16 | 16 |
+---------+-----+---------+------------------+------------------+
Now I need 5 formulas, one for the sum of each column. It works fine, but I wish it could be simplified to one formula. Is it possible?
|----+----+----+-----+----|
| a | b | c | d | e |
|----+----+----+-----+----|
| 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 |
| 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 |
|----+----+----+-----+----|
| 34 | 38 | 42 | 160 | 50 |
|----+----+----+-----+----|
#+TBLFM: #>$5=vsum(#2$5..#-1$5)::#>$4=vsum(#2$1..#-1$4)::#>$3=vsum(#2$3..#-1$3)::#>$2=vsum(#2$2..#-1$2)::#>$1=vsum(#2$1..#-1$1)
This should work (it also fixes column d, where the range in the original $4 formula started at $1 by mistake, giving 160 instead of 46):
|----+----+----+----+----|
| a | b | c | d | e |
|----+----+----+----+----|
| 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 |
| 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 |
|----+----+----+----+----|
| 34 | 38 | 42 | 46 | 50 |
|----+----+----+----+----|
#+TBLFM: #>$1..#>$5=vsum(#2$0..#-1$0)
$0 on the RHS is the current column.
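After editing the #+TBLFM line, press C-c C-c on that line to re-apply all stored formulas (or C-u C-c C-c anywhere inside the table to recompute the whole table).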