pyspark lag function (based on row) - pyspark

I want to flag a row as overpriced when the price has increased by more than 100 compared to the previous row. Assume that, from time to time, the price varies within the day.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy('product').orderBy('date')
display(market_price.withColumn('new_price', f.lag('price', 1, 0).over(w))
        .select('market_date', 'new_price')
        )
OUTPUT:
+------------------------------+-----------+
| date                         | new_price |
+------------------------------+-----------+
| 2011-01-01T04:07:28.000+0000 | 0         |
| 2011-01-01T04:07:50.000+0000 | 110       |
| 2011-01-01T04:08:30.000+0000 | 150       |
| 2011-01-01T04:09:45.000+0000 | 280       |
+------------------------------+-----------+
MY DESIRED OUTPUT:
+------------------------------+-----------+----------------+
| date                         | new_price | status         |
+------------------------------+-----------+----------------+
| 2011-01-01T04:07:28.000+0000 | 0         | not overpriced |
| 2011-01-01T04:07:50.000+0000 | 110       | not overpriced |
| 2011-01-01T04:08:30.000+0000 | 150       | not overpriced |
| 2011-01-01T04:09:45.000+0000 | 280       | overpriced     |
+------------------------------+-----------+----------------+
Note that the last row is marked overpriced because the price jumped from 150 to 280.

Here's what I think you want; it takes your input table and produces the desired output (w is the window spec defined in your question):
(market_price
    .withColumn("old_price", f.lag("price", 1, 0).over(w))               # get the previous row's price
    .select(
        f.col("date"),
        f.col("price"),
        f.when(f.col("price") - f.col("old_price") > 100, "overpriced")  # is the current price more than 100 over the old one?
         .otherwise("not overpriced")                                    # if not, mark it as not overpriced
         .alias("status"))                                               # name the new column
    .show())
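
For reference, here is a minimal, self-contained PySpark sketch of the same idea, using made-up sample data (the column names product, date, and price are assumed from the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Made-up sample rows; in practice market_price would be your real table.
market_price = spark.createDataFrame(
    [("p1", "2011-01-01T04:07:28", 100),
     ("p1", "2011-01-01T04:07:50", 110),
     ("p1", "2011-01-01T04:08:30", 150),
     ("p1", "2011-01-01T04:09:45", 280)],
    ["product", "date", "price"],
)

w = Window.partitionBy("product").orderBy("date")

result = (
    market_price
    .withColumn("old_price", f.lag("price", 1, 0).over(w))  # previous row's price, 0 for the first row
    .withColumn(
        "status",
        f.when(f.col("price") - f.col("old_price") > 100, "overpriced")
         .otherwise("not overpriced"),
    )
    .select("date", "price", "status")
)
result.show(truncate=False)

The last row is flagged as overpriced because 280 - 150 = 130, which is more than 100.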

Related

Compare consecutive rows and extract words (excluding the subsets) in Spark

I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with the maximum length for each session id. There are multiple keywords that would be part of the output for each session_id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later on I tried to compare each row with its subsequent row to see if it is of maximum length. But this solution only works partially.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{length, max}
import spark.implicits._

val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id", $"value", $"Timestamp").withColumn("col_length", length($"value"))
val ts = Window
  .orderBy("session_id")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
  .withColumn("running_max", max("col_length") over ts)
  .where($"running_max" === $"col_length")
  .select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
I couldn't get multiple columns to work inside the orderBy clause with the window function, so I didn't get the desired output; I got one output row per session_id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, lead}

val windowSpec = Window.orderBy("session_id")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  .filter((functions.length(col("lead")) < functions.length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show
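
For anyone who wants the same approach in PySpark (matching the first question on this page), a rough equivalent might look like the sketch below. It assumes a DataFrame dfM with session_id, value, and Timestamp, and adds Timestamp to the ordering to make the within-session order explicit:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Order by session_id and Timestamp so consecutive typing states stay adjacent.
window_spec = Window.orderBy("session_id", "Timestamp")

result = (
    dfM
    .withColumn("lead", F.lead("value", 1).over(window_spec))
    # Keep a row when the next value is shorter (the word stopped growing)
    # or when there is no next row at all.
    .filter((F.length(F.col("lead")) < F.length(F.col("value"))) | F.col("lead").isNull())
    .drop("lead")
)
result.show()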

Create Calculated Pivot from Several Query Results in PostgreSQL

I have a question regarding how to make a calculated pivot table from several query results in PostgreSQL. I've managed to produce three query results but don't have any idea how to combine and calculate all the data into a single table. I've tried to Google it but found that most questions are about making a pivot table from a single table, which I'm able to do using sum, case, and group by. Here's a simplified version of my query results.
Result from query 1 which contains gross values
| city | code | gross |
|-------|------|--------|
| city1 | 21 | 194793 |
| city1 | 25 | 139241 |
| city1 | 28 | 231365 |
| city2 | 21 | 282025 |
| city2 | 25 | 334458 |
| city2 | 28 | 410852 |
| city3 | 21 | 109237 |
Result from query 2 which contains positive adjustments
| city | code | adj_pos |
|-------|------|---------|
| city1 | 21 | 16259 |
| city1 | 25 | 13634 |
| city1 | 28 | 45854 |
| city2 | 25 | 18060 |
| city2 | 28 | 18220 |
Result from query 3 which contains negative adjustments
| city | code | adj_neg |
|-------|------|---------|
| city1 | 25 | 23364 |
| city2 | 21 | 27478 |
| city2 | 25 | 23474 |
And what I want to do is create something like this
| city | 21_gross | 25_gross | 28_gross | 21_pos | 25_pos | 28_pos | 21_neg | 25_neg | 28_neg |
|-------|----------|----------|----------|--------|--------|--------|--------|--------|--------|
| city1 | 194793 | 139241 | 231365 | 16259 | 13634 | 45854 | | 23364 | |
| city2 | 282025 | 334458 | 410852 | | 18060 | 18220 | 27478 | 23474 | |
| city3 | 109237 | | | | | | | | |
or perhaps the final calculation, which comes from gross + positive adjustment - negative adjustment for each city and each code, like this
| city | 21_nett | 25_nett | 28_nett |
|-------|---------|---------|---------|
| city1 | 211052 | 129511 | 277219 |
| city2 | 254547 | 329044 | 429072 |
| city3 | 109237 | 0 | 0 |
Any suggestion will be appreciated. Thank you!
I think the best you can achieve is to get the pivoting part as JSON - http://sqlfiddle.com/#!17/b7d64/23:
select
  city,
  json_object_agg(
    code,
    coalesce(gross, 0) + coalesce(adj_pos, 0) - coalesce(adj_neg, 0)
  ) as js
from q1
left join q2 using (city, code)
left join q3 using (city, code)
group by city

Split postgres records into groups based on time fields

I have a table with records that look like this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
The time column represents a time in milliseconds. What I want to do is find all coord-x, coord-y pairs as a set of points within a given timeframe for a given id. For any given id there is a unique coord-x, coord-y, and time.
What I need to do, however, is group these points as long as they are no more than n milliseconds apart. So if I have this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
| 1 | 0 | 6 | 140 |
| 1 | 0 | 7 | 141 |
I would want a result similar to this:
| id | points              | start-time | end-time |
----------------------------------------------------
| 1  | (0,0), (0,1), (0,3) | 123        | 125      |
| 1  | (0,6), (0,7)        | 140        | 141      |
I do have PostGIS installed on my database. The times I posted above are not representative; I kept them small just as a sample. The time is just a millisecond timestamp.
The tricky part is picking the expression inside your GROUP BY. If n = 5, you can do something like time / 5. To match the example exactly, the query below uses (time - 3) / 5. Once you group it, you can aggregate them into an array with array_agg.
SELECT
  array_agg(("coord-x", "coord-y")) AS points,
  min(time) AS time_start,
  max(time) AS time_end
FROM "<your_table>"
WHERE id = 1
GROUP BY (time - 3) / 5
Here is the output
+---------------------------+--------------+------------+
| points | time_start | time_end |
|---------------------------+--------------+------------|
| {"(0,0)","(0,1)","(0,3)"} | 123 | 125 |
| {"(0,6)","(0,7)"} | 140 | 141 |
+---------------------------+--------------+------------+
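
To see why the (time - 3) / 5 expression produces exactly those two groups, here is a quick, purely illustrative Python check of the bucket values (Postgres integer division behaves like Python's // for these positive integers):

# Bucket each sample time the same way the GROUP BY expression does.
times = [123, 124, 125, 140, 141]
buckets = {t: (t - 3) // 5 for t in times}
print(buckets)  # {123: 24, 124: 24, 125: 24, 140: 27, 141: 27}
# Times 123-125 land in bucket 24; 140 and 141 land in bucket 27.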

In PostgreSQL, how do you aggregate based on a time range

For example, say I have a database table of transactions done over the counter, and I would like to find out whether there was any period that was extremely busy (more than 10 transactions processed in the span of 10 minutes). How would I go about querying that? Could I aggregate based on time ranges and count the number of transaction ids within those ranges?
Adding example to clarify my input and desired output:
+----+--------------------+
| Id | register_timestamp |
+----+--------------------+
| 25 | 08:10:50 |
| 26 | 09:07:36 |
| 27 | 09:08:06 |
| 28 | 09:08:35 |
| 29 | 09:12:08 |
| 30 | 09:12:18 |
| 31 | 09:12:44 |
| 32 | 09:15:29 |
| 33 | 09:15:47 |
| 34 | 09:18:13 |
| 35 | 09:18:42 |
| 36 | 09:20:33 |
| 37 | 09:20:36 |
| 38 | 09:21:04 |
| 39 | 09:21:53 |
| 40 | 09:22:23 |
| 41 | 09:22:42 |
| 42 | 09:22:51 |
| 43 | 09:28:14 |
+----+--------------------+
Desired output would be something like:
+-------+----------+
| Count | Min |
+-------+----------+
| 1 | 08:10:50 |
| 3 | 09:07:36 |
| 7 | 09:12:08 |
| 8 | 09:20:33 |
+-------+----------+
How about this:
SELECT c, time
FROM (
  SELECT count(*) AS c, min(time) AS time
  FROM transactions
  GROUP BY floor(extract(epoch from time) / 600)
) AS t
WHERE c > 10;
This will find all ten minute intervals for which more than ten transactions occurred within that interval. It assumes that the table is called transactions and that it has a column called time where the timestamp is stored.
Thanks to redneb, I ended up with the following query:
SELECT count(*) AS c, min(register_timestamp) AS register_timestamp
FROM trak_participants_data
GROUP BY floor(extract(epoch from register_timestamp)/600)
order by register_timestamp
It works well enough for me to be able to tell which time chunks are the busiest for the counter.

Total found count mismatch while doing sorting in Sphinx V2.1

When I sort on a unix timestamp date field, the total_found count shows a different result. Here is my query:
SELECT * FROM CA_SAC_persons,CA_KC_persons,CA_SFC_persons,CA_SJ_persons
WHERE MATCH('#fullname("^John$" | "^Joseph$" | "^Jose$" | "^Josh$" | "^Robs$")')
ORDER BY filing_date_ts DESC LIMIT 0,1;SHOW META;
Result:
+---------------+-------------+
| Variable_name | Value |
+---------------+-------------+
| total | 1000 |
| total_found | 4813 |
| time | 0.019 |
| docs[9] | 4603 |
| hits[9] | 5312 |
+---------------+-------------+
SELECT * FROM CA_SAC_persons,CA_KC_persons,CA_SFC_persons,CA_SJ_persons
WHERE MATCH('#fullname("^John$" | "^Joseph$" | "^Jose$" | "^Josh$" | "^Robs$")')
ORDER BY filing_date_ts ASC LIMIT 0,1;SHOW META;
Result:
+---------------+-------------+
| Variable_name | Value |
+---------------+-------------+
| total | 1000 |
| total_found | 4812 |
| time | 0.019 |
| docs[9] | 4603 |
| hits[9] | 5312 |
+---------------+-------------+
Why does total_found show one record less in the second query?