replaceWhere in PySpark

I have saved a dataframe as a Delta table partitioned by [customer, site, machine, date] in overwrite mode, with replaceWhere on date >= value1 and date < value2:
df.coalesce(1).write.mode('overwrite') \
    .option("replaceWhere", "date >= '2022-04-01' AND date < '2022-04-02'") \
    .partitionBy("customer", "site", "machine", "date") \
    .format('delta').save(output_filepath)
When I execute the statement twice (first run for customer1, second run for customer2), the customer1 data gets overwritten by the customer2 run for 2022-04-01.
So in the replaceWhere clause I added customer: '(date >= '2022-04-01' and date < '2022-04-02') and (customer.in(['customervalue']))'
I am getting the error AnalysisException: Cannot recognize the predicate.
What are the other possible ways I can overwrite only a particular customer and a particular date?
Thanks in advance!!
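One way that should work, sketched below: replaceWhere takes a plain SQL predicate string, so DataFrame Column syntax such as customer.in([...]) is not recognized, while a SQL equality (or IN list) on the customer partition column is. The value 'customer1' below stands in for whichever customer the current run is writing.
# Sketch only: restrict the overwrite to one customer and one date range.
# 'customer1' is a placeholder for the value written by the current run.
predicate = "customer = 'customer1' AND date >= '2022-04-01' AND date < '2022-04-02'"
df.coalesce(1).write.mode('overwrite') \
    .option("replaceWhere", predicate) \
    .partitionBy("customer", "site", "machine", "date") \
    .format('delta').save(output_filepath)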

How to get the latest value over a time duration (say 1 minute) in a Kusto table

I want to find the latest value of a column for a particular time duration (1 minute in my case) from a Kusto table.
I have timeseries data in a PostgreSQL table and I am using the last() function (https://docs.timescale.com/api/latest/hyperfunctions/last/) to find the latest value of scaled_value for each 1-minute time bucket. I want to do the same in a Kusto table: get the latest value of scaled_value per bucket. Which Kusto function corresponds to PostgreSQL's last()?
Code I am using in PostgreSQL:
SELECT CAST(EXTRACT(EPOCH FROM time_bucket('1 minutes', vessel_telemetry.timestamp) AT TIME ZONE 'UTC') * 1000 AS BIGINT) AS timestamp_epoch,
       vessel_telemetry.timeSeries,
       last(vessel_telemetry.scaled_value, vessel_telemetry.timestamp) AS scaled_value
FROM shipping.vessel_telemetry
WHERE vessel_telemetry.ingested_timestamp >= '2022-07-20T10:10:58.71Z' AND vessel_telemetry.ingested_timestamp < '2022-07-20T10:15:33.703985Z'
GROUP BY time_bucket('1 minutes', vessel_telemetry.timestamp), vessel_telemetry.timeSeries
Corresponding code I am using in ADX
VesselTelemetry_DS
| where ingested_timestamp >= datetime(2022-07-20T10:10:58.71Z) and ingested_timestamp < datetime(2022-07-20T10:15:33.703985Z)
| summarize max_scaled_value = max(scaled_value) by bin(timestamp, 1m), timeSeries
| project timestamp_epoch =(datetime_diff('second', timestamp, datetime(1970-01-01)))*1000, timeSeries, max_scaled_value
The data I am getting from PostgreSQL does not match the data I am getting from the ADX query. I think the functionality of PostgreSQL's last() is different from ADX's max(). Is there any function in ADX that performs the same as PostgreSQL's last()?
arg_max()
arg_max (ExprToMaximize, * | ExprToReturn [, ...])
Please note the order of the parameters, which is the opposite of Timescale's last():
first the expression to maximize (in your case timestamp), and then the expression(s) to return (in your case scaled_value).

Spark: replace all values smaller than X by their sum

I have a dataframe that has a type and a sub-type (broadly speaking), plus a value column.
What I'd like to do, for each type, is sum all values that are smaller than X (say 100 here) and replace them with a single row whose sub-type is "Other" (see the hypothetical example below).
Using a window over(Type), I guess I could make two dfs (<100, >=100), sum the first, pick one row and hack it into the single "Other" row, and then union the result with the >= one. But that seems a rather clumsy way to do it.
(apologies, I don't have access to pyspark right now to do some code).
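For concreteness, here is a hypothetical input and the expected output, assuming columns named Type, Sub-Type and Value and an existing SparkSession called spark (the data is invented for illustration):
# Hypothetical input, invented for illustration
data = [
    ("A", "a1", 250), ("A", "a2", 40), ("A", "a3", 30),
    ("B", "b1", 500), ("B", "b2", 90),
]
df = spark.createDataFrame(data, ["Type", "Sub-Type", "Value"])
# Expected output: per Type, rows with Value < 100 collapse into one "Other" row
#   A | a1    | 250
#   A | Other | 70     (40 + 30)
#   B | b1    | 500
#   B | Other | 90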
The way I would propose takes into account the need for a key to apply an aggregation that is valid for each row; otherwise you would 'lose' the rows with Value >= 100.
The idea is therefore to add a column that identifies the rows to be aggregated and gives every other row a unique key. Afterwards, you have to clean up the result to match the expected output.
Here is what I propose:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df \
    .withColumn("to_agg",
                # rows to aggregate share the key "Other"; every other row
                # gets a unique Type/Sub-Type key so it survives untouched
                F.when(F.col("Value") < 100, "Other")
                 .otherwise(F.concat(F.col("Type"), F.lit("-"), F.col("Sub-Type")))) \
    .withColumn("sum_other",
                F.sum(F.col("Value")).over(Window.partitionBy("Type", "to_agg"))) \
    .withColumn("Sub-Type",
                F.when(F.col("to_agg") == "Other", F.col("to_agg"))
                 .otherwise(F.col("Sub-Type"))) \
    .withColumn("Value", F.col("sum_other")) \
    .drop("to_agg", "sum_other") \
    .dropDuplicates(["Type", "Sub-Type"]) \
    .orderBy(F.col("Type").asc(), F.col("Value").desc())
Note: the solution using a groupBy is also valid and simpler, but you will keep only the columns used in the statement plus the aggregation result. That is why I prefer the window function, which keeps all the other columns from the original dataset.
You could simply replace Sub-Type with "Other" for all rows with Value < 100 and then group by and sum:
import pyspark.sql.functions as F

(
    df
    .withColumn('Sub-Type',
                F.when(F.col('Value') < 100, 'Other').otherwise(F.col('Sub-Type')))
    .groupby('Type', 'Sub-Type')
    .agg(
        F.sum('Value').alias('Value')
    )
)

How can I create a KSQL table from a topic using a composite key?

Say I have a topic with temperature forecast data, as follows:
2018-10-25,Melbourne,21
2018-10-26,Melbourne,17
2018-10-27,Melbourne,21
2018-10-25,Sydney,22
2018-10-26,Sydney,20
2018-10-27,Sydney,23
2018-10-26,Melbourne,18
2018-10-27,Melbourne,22
2018-10-26,Sydney,21
2018-10-27,Sydney,24
Each entry contains a date, a city, and a forecast temperature, and represents an update to the forecast for that city on that date. I can describe it as a KSQL stream like this:
CREATE STREAM forecasts_csv ( \
date VARCHAR, \
city VARCHAR, \
temperature INTEGER \
) WITH (kafka_topic='forecasts-csv', value_format='DELIMITED');
Now, I want a table that represents the current (i.e. the latest) forecast temperature for each city, as well as the min and max of that forecast over time. An example desired output is:
{ date='2018-10-27', city='Melbourne', latest=22, min=21, max=22 }
How can I achieve this?
I've managed to get the aggregates (min/max) as follows:
CREATE STREAM forecasts_keyed \
WITH (partitions=4, value_format='JSON') \
AS SELECT date + '/' + city AS forecast_key, * \
FROM forecasts_csv \
PARTITION BY forecast_key;
CREATE TABLE forecasts_minmax \
WITH (partitions=4, value_format='JSON') \
AS SELECT forecast_key, date, city, \
min(temperature) as min, max(temperature) as max \
FROM forecasts_keyed \
GROUP by forecast_key, date, city;
which gives me output messages like:
{"FORECAST_KEY":"2018-10-27/Melbourne","DATE":"2018-10-27","CITY":"Melbourne","MIN":21,"MAX":22}
but I can't work out how to combine this with the "latest" reading.
You need to implement a UDAF, let's call it LATEST, that keeps the latest value of a given column per key. This is fairly trivial, and the KSQL docs explain how to add your own custom UDAF: https://docs.confluent.io/current/ksql/docs/developer-guide/udf.html#udafs
Assuming that you have the LATEST UDAF available you can write the following query:
CREATE TABLE foo AS
SELECT
date,
city,
MIN(temperature) AS minValue,
MAX(temperature) AS maxValue,
LATEST(temperature) AS latestValue
FROM forecasts_csv
GROUP BY date, city;

How to amend properties in Spark read.jdbc according to the growing size of a table?

I have a Spark job that moves data from Postgres to Redshift on a regular basis. I'm using the read.jdbc function with lowerBound and upperBound params:
df = spark.read.jdbc(url=jdbc_url, \
table='some_table',\
column='id',\
lowerBound=1,\
upperBound=20000000, \
numPartitions=50)
At the moment upperBound is hardcoded, but the table grows every day, so I need to update the upperBound value dynamically to reflect the size of the table at the start of the next job run. How can I make the upperBound value equal to the current size of the table?
You can fetch the bound values before you execute the main read and then use them:
query = "(SELECT min({0}), max({0}) FROM {1}) AS temp".format(
partition_column, table
)
(lower_bound, upper_bound) = (spark.read
.jdbc(url=url, table=query. properties=properties)
.first())
# lowerBound/upperBound only set the partition stride for the id column;
# rows outside the range are still read, just into the edge partitions
df = spark.read.jdbc(url=jdbc_url,
                     table='some_table',
                     column='id',
                     lowerBound=lower_bound,
                     upperBound=upper_bound + 10,
                     numPartitions=50)
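One practical wrinkle, noted here as an assumption rather than part of the original answer: if the table happens to be empty, min/max come back as NULL (None in Python) and upper_bound + 10 will fail, so a small guard before the main read helps.
# guard against an empty table; the fallback bounds are illustrative defaults
if lower_bound is None or upper_bound is None:
    lower_bound, upper_bound = 0, 1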

Postgres query including time range

I have a query that pulls part of the data that I need for tracking. What I need to add is either a column that includes the date or the ability to query the table for a date range. I would prefer the column if possible. I am using psql 8.3.3.
Here is the current query:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing, department
where mailing.department_id = department.department_id
group by department.name;
This returns the following information:
mailing_count | org
---------------+-----------------------------------------
2 | org1 name
8 | org2 name
22 | org3 name
21 | org4 name
39 | org5 name
The table that I am querying has 3 columns with dates in timestamp format: target_launch_date, created_time and modified_time.
When I try to add the date range to the query I get an error:
Query:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing, department
where mailing.department_id = department.department_id
group by department.name,
WHERE (target_launch_date)>= 2016-09-01 AND < 2016-09-05;
Error:
ERROR: syntax error at or near "WHERE" LINE 1:
...department.department_id group by department.name,WHERE(targ...
I've tried moving the location of the date range in the string and a variety of other changes, but cannot achieve what I am looking for.
Any insight would be greatly appreciated!
Here's a query that would do what you need:
SELECT
count(m.mailing_id) as mailing_count,
d.name as org
FROM mailing m
JOIN department d USING( department_id )
WHERE
m.target_launch_date BETWEEN '2016-09-01' AND '2016-09-05'
GROUP BY 2
Since your target_launch_date is of type timestamp, you can safely do <= '2016-09-05', which will actually convert to 2016-09-05 00:00:00.00000, giving you all the dates that are before the start of that day or exactly 2016-09-05 00:00:00.00000.
A couple of additional notes:
Use aliases for table names to shorten the code, e.g. mailing m
Use explicit JOIN syntax to connect data from related tables
Apply your WHERE clause before GROUP BY to exclude rows that don't match it
Use the BETWEEN operator to handle the date >= X AND date <= Y case
You can use USING instead of ON in the JOIN syntax when the joined column names are the same
You can use column numbers in GROUP BY, which point to the position of a column in your SELECT
To gain more insight into how a SELECT statement is processed step by step, look at the documentation.
Edit
The approach using the BETWEEN operator would include 2016-09-05 00:00:00.00000 in the result set. If this timestamp should be discarded, change BETWEEN x AND y to either of these two:
(...) BETWEEN x AND y::timestamp - INTERVAL '1 microsecond'
(...) >= x AND (...) < y
You were close; you need to supply the column name in the second part of the WHERE condition too, and then you have a single WHERE clause:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing
inner join department on mailing.department_id = department.department_id
where target_launch_date >= '2016-09-01 00:00:00'
AND target_launch_date < '2016-09-05 00:00:00'
group by department.name;
EDIT: This part is just for Kamil G., showing clearly that BETWEEN should NOT be used:
create table sample (id int, d timestamp);
insert into sample (id, d)
values
(1, '2016/09/01'),
(2, '2016/09/02'),
(3, '2016/09/03'),
(4, '2016/09/04'),
(5, '2016/09/05'),
(6, '2016/09/05 00:00:00'),
(7, '2016/09/05 00:00:01'),
(8, '2016/09/06');
select * from sample where d between '2016-09-01' and '2016-09-05';
Result:
1;"2016-09-01 00:00:00"
2;"2016-09-02 00:00:00"
3;"2016-09-03 00:00:00"
4;"2016-09-04 00:00:00"
5;"2016-09-05 00:00:00"
6;"2016-09-05 00:00:00"
BTW, if you wouldn't believe it without seeing EXPLAIN, here it is:
Filter: ((d >= '2016-09-01 00:00:00'::timestamp without time zone) AND
(d <= '2016-09-05 00:00:00'::timestamp without time zone))