multiple negative lag (aka lead) variable in h2o

Background:
I have a large table (~20 million rows, 40 columns).
I am using h2o to fit models on the data in this table.
I found that there is h2o.difflag1, which gives a lag-1 transform.
I want to create a column that is a lead of roughly 8000 rows.
Question
How do I make a 3600-row lead without making 8000 lag transforms on 39 columns, using h2o on my 20-million-row dataset? Is there a better way to do this?
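One way to get a lead of k rows without chaining k lag-1 transforms is to attach a row index and join the table to itself with the index shifted by k. The sketch below shows that idea in Spark/Scala (the engine used in the related questions further down), not h2o's own API; df, valueCol and k are placeholders, and the frame is assumed to already be sorted in the intended row order.

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Sketch: build a lead-k column via a self-join on a shifted row index,
// instead of applying a lag-1 transform k times.
def withLead(spark: SparkSession, df: DataFrame, valueCol: String, k: Long): DataFrame = {
  // Attach a contiguous row index (monotonically_increasing_id is not contiguous).
  val indexed = spark.createDataFrame(
    df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
    df.schema.add("rowIdx", LongType)
  )
  // Shift the index back by k, so row i gets paired with the value from row i + k.
  val future = indexed.select(
    (col("rowIdx") - k).as("rowIdx"),
    col(valueCol).as(s"${valueCol}_lead_$k")
  )
  indexed.join(future, Seq("rowIdx"), "left").drop("rowIdx")
}

The analogous move in h2o itself would be to add a row-index column and merge the frame with a copy whose index is shifted by the lead size; whether your h2o version exposes convenient primitives for that is worth checking before committing to it.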

Related

How to optimally decide the number of partitions in a dataframe dynamically?

I have two dataframes of around 11 million records. After a transformation and some window analytic functions I end up with around 7 million records. I am currently trying to find a dynamic way to calculate the number of partitions. Normally I take the size of the dataframe from the UI and divide it by 256 MB (the partition size in bytes, which is 128 MB by default) to decide the number of partitions. I want to avoid this manual step and would like to know whether there is a dynamic, programmatic way of doing the same. Any help on this will be appreciated.
Thanks
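A hedged, programmatic version of the same rule of thumb could look like the sketch below (assuming Spark 2.3+, where the optimizer exposes a size estimate; the 256 MB target and the names are placeholders, and the estimate can be rough unless table statistics are up to date).

import org.apache.spark.sql.DataFrame

// Derive a partition count from Catalyst's size estimate instead of reading the UI.
def estimatedPartitions(df: DataFrame, targetBytesPerPartition: Long = 256L * 1024 * 1024): Int = {
  val estimatedBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes // BigInt
  math.max((estimatedBytes / BigInt(targetBytesPerPartition)).toInt, 1)
}

// Usage: repartition the post-transformation dataframe before the expensive stage.
// val repartitioned = transformedDf.repartition(estimatedPartitions(transformedDf))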

What are space partitioning and dimensions in TimescaleDB?

I am new to the Timescale database. I was learning about chunks and how to create chunks based on time.
But there is another kind of time/space chunking which is confusing me a lot. Please help me with the questions below.
What is a dimension in a timescale DB?
What is space chunking and how it works?
Thanks in advance.
A dimension in TimescaleDB is associated with a column. Each hypertable requires at least a time dimension, which is a time column for the time series. The hypertable is then divided into chunks, where each chunk contains data for one time interval of the time dimension. As a result, new data usually arrives in the latest chunk, while the other chunks contain older data.
It is then possible to define space dimensions on other columns, for example a device column and/or a location column. No interval is defined for a space dimension; instead, a number of partitions is defined. So for the same time interval, several chunks will be created, equal to the number of partitions. Data are distributed among the partitions by a hash function on the values of the space dimension. For example, if 3 partitions are defined for a space dimension on the device column and 12 distinct device values are present in the data, each space chunk will contain 4 distinct values, assuming the hash function distributes the values uniformly.
Space dimensions are especially useful for parallel I/O, when data are stored on several disks. Another scenario is multinode, i.e., the distributed version of a hypertable (a beta feature coming to release in 2.0).
There are also some more complex use cases where space partitioning is helpful.
You can read more in the add_dimension docs and in the cloud KB article about space partitioning.
A note in the doc:
Supporting more than one additional dimension is currently experimental.

How to perform large computations on Spark

I have 2 tables in Hive, user and item, and I am trying to calculate the cosine similarity between 2 features of each table for the Cartesian product of the 2 tables, i.e. a cross join.
There are around 20,000 users and 5,000 items, resulting in 100 million rows of calculation. I am running the computation using Scala Spark on a Hive cluster with 12 cores.
The code goes a little something like this:
val pairs = userDf.crossJoin(itemDf).repartition(100)
val results = pairs.mapPartitions(computeScore) // computeScore is a function to compute the similarity scores I need
The Spark job always fails due to memory issues (GC allocation failure) on the Hadoop cluster. If I reduce the computation to around 10 million rows, it reliably finishes in under 15 minutes.
How do I compute the whole set without increasing the hardware specifications? I am fine if the job takes longer to run and does not fail halfway.
If you take a look at the Spark documentation you will see that Spark uses different strategies for data management. These policies are enabled by the user via settings in the Spark configuration files or directly in the code or script.
Among the documented data management policies, the "MEMORY_AND_DISK" policy would be good for you: if the data (RDD) does not fit in RAM, the remaining partitions are stored on the hard disk. This strategy can be slow, however, if you have to access the hard drive often.
There are a few steps for doing that:
1. Check the expected data volume after the cross join and divide it by 200, since spark.sql.shuffle.partitions defaults to 200, and check whether that leaves more than 1 GB of raw data per partition.
2. Calculate the size of each row and multiply it by the other table's row count; that gives you a rough estimate of the volume. The process works much better with Parquet than with CSV files.
3. spark.sql.shuffle.partitions needs to be set based on total data volume / 500 MB (see the sketch after this list).
4. spark.shuffle.minNumPartitionsToHighlyCompress needs to be set a little lower than the shuffle partition count.
5. Bucketize the source Parquet data on the joining column for both of the files/tables.
6. Provide high Spark executor memory, and manage the Java heap memory too, keeping the heap space in mind.
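A rough sketch of steps 1-3 in code, assuming the usual spark session is in scope; the bytes-per-row figure is a placeholder to be measured on your own data, not a value from the question.

// Estimate the shuffled volume of the cross join and size the shuffle partitions to it.
val crossJoinRows = 20000L * 5000L // ~100 million user-item pairs
val bytesPerRow   = 2500L          // placeholder: measure a sample of your joined rows
val totalBytes    = crossJoinRows * bytesPerRow

// Aim for roughly 500 MB per shuffle partition, never fewer than the default of 200.
val shufflePartitions = math.max(200L, totalBytes / (500L * 1024L * 1024L))
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)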

Are there alternative solutions without a cross join in Spark 2?

Stackoverflow!
I wonder if there is a fancy way in Spark 2.0 to solve the situation below.
The situation is like this.
Dataset1 (TargetData) has this schema and about 20 million records.
id (String)
vector of embedding result (Array, 300 dim)
Dataset2 (DictionaryData) has this schema and has about 9,000 records.
dict key (String)
vector of embedding result (Array, 300 dim)
For each record's vector in dataset 1, I want to find the dict key in dataset 2 with the maximum cosine similarity.
Initially, I tried to cross join dataset1 and dataset2 and calculate the cosine similarity for all record pairs, but the amount of data was too large for my environment.
I have not tried it yet, but I thought of collecting dataset2 as a list and then applying a UDF.
Is there any other method for this situation?
Thanks,
There might be two options. The first is to broadcast Dataset2: since you need to scan it for each row of Dataset1, broadcasting avoids the network delay of fetching it from a different node. Of course, in this case you first need to consider whether your cluster can handle the memory cost, which is 9,000 rows x 300 columns (not too big, in my opinion). You still need your join, although with broadcasting it should be faster. The other option is to populate a RowMatrix from your existing vectors and let Spark do the calculations for you.
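A minimal sketch of the broadcast option, assuming the question's frames are available as dataset1 with columns (id, vec) and dataset2 with columns (key, vec), both vectors as Array[Double]; the names and column order are assumptions, so adjust the encoders to your actual schema.

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._

// Plain cosine similarity over two dense vectors.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  var dot = 0.0; var na = 0.0; var nb = 0.0; var i = 0
  while (i < a.length) { dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i); i += 1 }
  dot / (math.sqrt(na) * math.sqrt(nb))
}

// Broadcast the small dictionary (~9,000 x 300 doubles) to every executor.
val dict = spark.sparkContext.broadcast(dataset2.as[(String, Array[Double])].collect())

// For each row of the large dataset, scan the broadcast dictionary for the best match.
val bestMatches = dataset1.as[(String, Array[Double])].map { case (id, vec) =>
  val (bestKey, bestScore) = dict.value
    .map { case (key, dvec) => (key, cosine(vec, dvec)) }
    .maxBy(_._2)
  (id, bestKey, bestScore)
}.toDF("id", "dict_key", "cosine_similarity")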

Best way to store data: many columns vs. many rows for a case of 10,000 new rows a day

After checking a lot of similar questions on Stack Overflow, it seems that context determines which way is the best to hold the data...
Short story: I add over 10,000 new rows of data a day to a very simple table containing only 3 columns. I will NEVER update the rows, only do selects, grouping and averages. I'm looking for the best way of storing this data to make the average calculations as fast as possible.
To put you in context, I'm analyzing a recorded audio file (a pink noise playback in a sound mixing studio) using FFTs. The results for a single audio file are always in the same format: the frequency bin's ID (integer) and its value in decibels (float). I want to store these values in a PostgreSQL DB.
Each bin (band) of frequencies (width = 8Hz) gets an amplitude in decibels. The first bin is ignored, so it goes like this (not actual dB values):
bin 1: 8Hz-16Hz, -85.0dB
bin 2: 16Hz-32Hz, -73.0dB
bin 3: 32Hz-40Hz, -65.0dB
...
bin 2499: 20,000Hz-20,008Hz, -49.0dB
The goal is to store an amplitude of each bin from 8Hz through 20,008Hz (1 bin covers 8Hz).
Many rows approach
For each analyzed audio file, there would be 2,499 rows of 3 columns: "Analysis UID", "Bin ID" and "dB".
For each of the 4 studios, there is one recording daily to be appended to the database (that's 4 times 2,499 = 9,996 new rows per day).
After a recording in one studio, the new 2,499 rows are used to show a plot of the frequency response.
My concern is that we also need to make a plot of the averaged dB values of every bin in a single studio for 5-30 days, to see if the frequency response tends to change significantly over time (thus telling us that a calibration is needed in a studio).
I came up with the following data structure for the many rows approach:
"analysis" table:
analysisUID (serial)
studioUID (Foreign key)
analysisTimestamp
"analysis_results" table:
analysisUID (Foreign key)
freq_bin_id (integer)
amplitude_dB (float)
Is this the optimal way of storing the data? A single table holding close to 10,000 new rows a day, with averages computed over 5 or more analyses by grouping on analysisUID and freq_bin_id? That would give me 2,499 rows (each corresponding to a bin and giving me the averaged dB value).
Many columns approach:
I thought I could do it the other way around, breaking the frequency bins into 4 tables (Low, Med Low, Med High, High). Since the Postgres documentation says the column limit is "250 - 1600 depending on column types", it would be realistic to make 4 tables of around 625 columns (2,499 / 4) each, every column representing a bin and containing its "dB" value, like so:
"low" table:
analysisUID (Foreign key)
freq_bin_id_1_amplitude_dB (float)
freq_bin_id_2_amplitude_dB (float)
...
freq_bin_id_625_amplitude_dB (float)
"med_low" table:
analysisUID (Foreign key)
freq_bin_id_626_amplitude_dB (float)
freq_bin_id_627_amplitude_dB (float)
...
freq_bin_id_1250_amplitude_dB (float)
etc...
Would the averages be computed faster if the server only has to group by analysisUID and average each column?
Rows are not going to be an issue; however, the way in which you insert said rows could be. If insert time is one of the primary concerns, then make sure you can bulk insert them OR go for a format with fewer rows.
You can potentially store all the data in jsonb format, especially since you will not be doing any updates to the data. It may be convenient to store it all in one table, but the performance may be lower.
In any case, since you're not updating the data, the (usually default) fillfactor of 100 is appropriate.
I would NOT use the "many column" approach, as the amount of data you're talking about really isn't that much. Using your first example of 2 tables and few columns is very likely the optimal way to get your results.
It may be useful to index the following columns:
analysis_results.freq_bin_id
analysis.analysisTimestamp
As to breaking the data into different sections, it'll depend on what types of queries you're running. If you're looking at ALL freq bins, using multiple tables will just be a hassle and net you nothing.
If you only query some freq_bins at a time, it could theoretically help; however, you're basically doing table partitioning, and once you've moved into that land, you might as well make a partition for each frequency band.
If I were you, I'd create your first table structure, fill it with 30 days worth of data and query away. You may (as we often do) be overanalyzing the situation. Postgres can be very, very fast.
Remember, the raw data you're analyzing is something on the order of a few (5 or fewer) MB per day at an absolute maximum. Analyzing 150 MB of data is no sweat for a DB running on modern hardware if it's indexed and stored properly.
The optimizer is going to find the correct rows in the "smaller" table really, really fast and likely cache all of those, then go looking for the child rows, and it'll know exactly what ID's and ranges to search for. If your data is all inserted in chronological order, there's a good chance it'll read it all in very few reads with very few seeks.
My main concern is with the insert speed, as doing 10,000 individual inserts can take a while if you're not doing bulk inserts.
Since the measurements seem well behaved, you could use an array, using the freq_bin as an index (note: array indices are 1-based in SQL).
This has the additional advantage that the array is kept in TOAST storage, keeping the physical table small.
CREATE TABLE herrie
( analysisUID serial NOT NULL PRIMARY KEY
, studioUID INTEGER NOT NULL REFERENCES studio(studioUID)
, analysisTimestamp TIMESTAMP NOT NULL
, decibels float[] -- array of amplitudes, one entry per frequency bin
, UNIQUE (studioUID, analysisTimestamp) -- one analysis per studio per timestamp
);