(Databricks) last_value function usage on huge dataset

(Databricks) last_value function usage on huge dataset - pyspark

For table which contains bilions of rows (10 mln each day) we trying to "fix/clean" source data. For missing rows with measures x,y,z we copy them from the past. To do it right we are using last_value function on specified aggregation level combined by CROSS JOIN with simple table which contains missing days.
It worked in simple POC where we have less than 1mln rows. In production dataset described above time and performance are pretty bad. We cannot just simply extend cluster, we need to try different approach.
Any thoughts what we can use insted of sql (simple last_value)?
All is caluclated in Azure Databricks environment.

Related

Union with Redshift native tables and external tables (Spectrum)

If I have a view that contains a union between a native table and external table like so (pseudocode):
create view vwPageViews as
select from PageViews
union all
select from PageViewsHistory
PageViews has for the last 2 years. External table has for older data than 2 years.
If a user selects from the view with filters for the last 6 months, how does RS Spectrum handle it - does it read the entire external table even though none will be returned (and accordingly cost us money for all of it)? (Assuming the s3 files are parquet based).
ex.
Select from vwPageViews where MyDate >= '01/01/2021'
What's the best approach for querying both cold and historical data using RS and Spectrum? Thanks!

How this will happen on Spectrum will depend on whether or not you have provided partitions for the data in S3. Without partitions (and a where clause on the partition) the Spectrum engines in S3 will have to read every file to determine if the needed data is in any of them. The cost of this will depend on the number and size of the files AND what format they are in. (CSV is more expensive than Parquet for example.)
The way around this is to partition the data in S3 and to have a WHERE clause on the partition value. This will exclude files from needing to be read when they don't match on the partition value.
The rub is in providing the WHERE clause for the partition as this will likely be less granular than the date or timestamp you using in your base data. For example if you partition on YearMonth (YYYYMM) and want to have a day level WHERE clause you will need to 2 parts to the WHERE clause - WHERE date_col >= 2015-07-12 AND part_col >= 201507. How to produce both WHERE conditions will depend on your solution around Redshift.

Is is possible limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is this possible to do this with Dataprep? I have reviewed the documentation of sampling, but that only applies to the input of the Transformer tool and not the outcome of the process.

You can do this, but it does take a bit of extra work in your Recipe--set up a formula in a new column using something like RANDBETWEEN to give you a random integer output between 1 and 1,000 (in this million-to-billion case). From there, you can filter rows based on whatever random integer between 1 and 1,000 as what you'll keep, and then your output will only have your randomized subset. Just have your last part of the recipe remove this temporary column.

So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the 2 functions that create random-number out of a range.
randbetween :
rand :
These methods can be used to slice an "even" portion of your data. As suggested, a randbetween(1,1000) , then pick 1<x<1000 to filter, because it's 1\1000 of data (million out of a billion).
Alternatively, if you just want to have million records in your output, but either
Don't want to rely on the knowledge of the size of the entire table
just want the first million rows, agnostic to how many rows there are -
You can just use 2 of these 3 row filtering methods: (top rows\ range)
P.S
By understanding the $sourcerownumber metadata parameter (can read in-product documentation), you can filter\keep a portion of the data (as per the first scenario) in 1 step (AKA without creating an additional column.
BTW, an easy way of "discovery" of how-to's in Trifacta would be to just type what you're looking for in the "search-transtormation" pane (accessed via ctrl-k). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!

Is there a technique with timescaledb to delete rows to reduce the frequency of older timescale data?

I'm storing a number of rows in a hypertable. The table size is growing quite large now even in its current test configuration.
I'd like to reduce the frequency of data from say once every 5 seconds to say once every 60 seconds for data older than a week by deleting a number of these older records.
Can anyone recommend an approach for doing so, or perhaps a better approach that better fits with timescaledb design?

So one of the next releases will have a bit in feature around data retention policies around continuous aggregations, so that you can define a continuous aggregation policy that rolls up secondly data into minutely data, then drop the secondly data that's older than some time period.
(That capability doesn't exist today with continuous aggs, but will very shortly. Right now the best approach is either to have some cron job that deletes old data, or one that copies from one table to a second while aggregating, then calling drop_chunks on the first table.)

Ok, I've read 2 minutes of timescaledb documentation, so I'm an expert, right. Here's what I propose:
You already have a table (I'll call it the business table) and a hypertable with raw 5-second data in it
Create a second hypertable with the same columns as the first hypertable
Insert into the 2nd hypertable using a 60-second windowing function and average, minimum, or maximum values for your readings data (you have to decide on which aggregation function is meaningful for your case.) This insert SQL looks something like:
INSERT into minute_table (timestamp, my_reading)
(SELECT time_bucket('60 seconds', time) as the_minute, avg(my_raw_reading)
FROM five_second_table
WHERE time < (now() - interval '1week')
GROUP BY the_minute
);
Next, delete from the 5-second hypertable where the timestamp in there is within any range of times in the 60-second hypertable.
Finally, schedule something like this to run every week.
Sorry I'm not fluent in all the timescaledb functions but this should get you started on the 'heavy lift' of manually aggregating up from 5-second to 60-second samples.

Take a look on Data Retention
For example:
SELECT drop_chunks(interval '24 hours', 'conditions');
This will drop all chunks from the hypertable 'conditions' that only include data older than this duration, and will not delete any individual rows of data in chunks.

Best way to query 4 B+ records in Tableau

I am looking a best way to analyse 4B records (1TB data) stored in Vertica using Tableau. I tried using extract of 1M records which works perfectly. but dont know how to manage 4B records, because its taking too long to query on 4B records.
I have following dataset :
timestamp id url domain keyword nor_word cat_1 cat_2 cat_3
So here I need to create descending list of Top 10 ID's, Top 10 url, Top 10 domain, Top 10 keyword, Top 10 nor_word, Top 10 cat_1, Top 10 cat_2, Top 10 cat_3 depending count of each field value in separate worksheet and combine all worksheet in one dashboard.
There is no primary key. This dataset of 1 month so I want to make global filter start date and end date to reduce the query size. But don't know how to create global date filter and display on dashboard ?

You have two questions, one about Vertica and one about Tableau. You should split these up.
Regarding Vertica, you need to know that Vertica stores data in ascending sort order in physical storage. This means that an additional step will always be required anytime you want to get a descending sort order.
I would suggest creating a partition on the date, and subsequently running Database Designer (DBD) in incremental mode and using your queries as samples. By partitioning the data, Vertica can eliminate the partitions during optimization.
Running the DBD will generate some better optimized projections. You should consider the trade-off between how often you will need this data and whether it's worth creating these additional projections as it will impact your load performance.

How to filter in records from one dataset to another?

I am developing an SSRS 2008 report. I created one dataset to give me the records I want. I also created a different dataset that determines access based on security profiles of the person running the report. What I want to do now is join the results of this second dataset with those of the first dataset in T-SQL. So I would like to perform this join within the T-sql sproc for the first dataset. How do I do this? Note that in most cases the results of the second (security) dataset is more than one record.

Are both datasets coming from the same database? You can probably wrap this logic in a stored procedure instead.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

(Databricks) last_value function usage on huge dataset - pyspark

Related

Union with Redshift native tables and external tables (Spectrum)

Is is possible limit the number of rows in the output of a Dataprep flow?

Is there a technique with timescaledb to delete rows to reduce the frequency of older timescale data?

Best way to query 4 B+ records in Tableau

How to filter in records from one dataset to another?

Categories

Resources