Spark Imputer for filling in missing values - Scala

Requirement -
In the picture attached, consider the first 3 columns as my raw data. Some rows have a NULL value in the quantity column, which is exactly what I want to fill in.
Ideally, I would fill any NULL value with the previous KNOWN value.
Spark Imputer seemed like an easily usable library that could help me fill missing values.
But the issue is that Spark Imputer is limited to a mean or median calculation over all NON-NULL values present in the data frame, as a result of which I don't get the desired result (4th column in the picture).
Logic -
import org.apache.spark.ml.feature.Imputer

// Impute "quantity" using the mean of all non-null values in that column
val imputer = new Imputer()
  .setInputCols(Array("quantity"))
  .setOutputCols(Array("quantity_imputed"))
  .setStrategy("mean")

val model = imputer.fit(new_combinedDf)
model.transform(new_combinedDf).show()
Result -
Now, is it possible to limit the mean calculation for EACH null value to the MEAN of the last n values?
i.e.
For 2020-09-26, where we get the first null value, is it possible to tweak Spark Imputer to calculate the mean over the last n values only, instead of over all non-null values in the dataframe?
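As far as I know, Imputer only computes a single whole-column statistic, so a rolling mean over the last n rows (or the forward fill with the previous known value mentioned above) would have to come from window functions rather than from Imputer itself. A rough sketch of both ideas, assuming the ordering column is named "date" and picking n = 3 arbitrarily:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val n = 3  // hypothetical window size

// Mean of the previous n rows; avg ignores nulls inside the window
val lastN = Window.orderBy("date").rowsBetween(-n, -1)

// All rows up to and including the current one, for a forward fill
val allPrev = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)

val filled = new_combinedDf
  .withColumn("quantity_rolling_mean",
    coalesce(col("quantity"), avg(col("quantity")).over(lastN)))
  .withColumn("quantity_ffill",
    last(col("quantity"), ignoreNulls = true).over(allPrev))

filled.show()

Note that without a partitionBy these windows pull all rows onto a single partition, which is fine for small data but worth keeping in mind.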

Related

KDB - Multiplying a col by -1 if a different col = certain value

I have a set of data whereby there is a column which can take the value "BUYS" or "SELLS", and another column where the quantity is displayed (shown in absolute terms). I want to be able to query this data and make sure that when the value = "SELLS" I am multiplying the quantity by -1.
thanks
You could try using a vector conditional?
https://code.kx.com/q4m3/10_Execution_Control/#1013-vector-conditional-evaluation

How do I map and remap string values to Int or Double in Scala

I have a data file with some columns. I am performing some mathematical computations on the values; for that purpose I want to map my non-integer columns to Int, and after the operations on the values I want to map them back.
Following are my column values:
atom_id,molecule_id,element,type,charge
d100_1,d100,c,22,-0.128
d100_10,d100,h,3,0.132
d100_11,d100,c,29,0.002
d100_12,d100,c,22,-0.128
d100_13,d100,c,22,-0.128
Suppose I want to map only 2 columns and then remap only those columns' values. I have searched for methods and found StringIndexer, but it maps all of the columns of the DF; I need to map only specific columns and then remap the values of those specific columns. Any help will be appreciated.
// Edited part
I have the following columns in my DataFrame:
ind1,inda,logp,lumo,mutagenic,element
1,0,4.23,-1.246,yes,c
1,0,4.62,-1.387,yes,b
0,0,2.68,-1.034,no,h
1,0,6.26,-1.598,yes,c
1,0,2.4,-3.172,yes,a
Basically I am writing code for synthetic data generation based on the given input data, so I want to use the column values, i.e. ind1, inda, logp, lumo, mutagenic, element, one row at a time; after applying some math functions to a row I will get a row consisting of 6 values, each representing the corresponding column value.
Now the problem is that all column values are of type double except mutagenic and element. I want to map the mutagenic and element columns to double values, for example yes to 0 and no to 1, so that I can use them, and then, when I receive the output row, reverse-map the generated mutagenic value back to the corresponding string value using that mapping.
I hope I am clear this time.
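For what it's worth, StringIndexer only indexes the column(s) passed via setInputCol/setInputCols, and IndexToString can reverse the mapping afterwards, so the two string columns can be handled on their own. A rough sketch under the assumption that the DataFrame is named df and uses the column names from the sample above (note that StringIndexer assigns indices by label frequency, not a fixed yes = 0 / no = 1 mapping):

import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Index only the two string columns; the numeric columns are left untouched
val mutagenicIndexer = new StringIndexer()
  .setInputCol("mutagenic")
  .setOutputCol("mutagenic_idx")
  .fit(df)

val elementIndexer = new StringIndexer()
  .setInputCol("element")
  .setOutputCol("element_idx")
  .fit(df)

val indexed = elementIndexer.transform(mutagenicIndexer.transform(df))

// ... run the numeric computations on mutagenic_idx / element_idx here ...

// Reverse the mapping; IndexToString picks up the labels from the metadata
// StringIndexer attached to mutagenic_idx (or pass them explicitly with
// setLabels if that metadata has been lost along the way)
val restored = new IndexToString()
  .setInputCol("mutagenic_idx")
  .setOutputCol("mutagenic_restored")
  .transform(indexed)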

How to get the average of multiple columns with NULL in PostgreSQL

The AVG function in PostgreSQL ignores NULL values when it calculates the average. But what if I want to compute the average value of multiple columns with many NULL values?
Neither of the commands below works:
AVG(col1,col2,col3)
AVG(col1)+AVG(col2)+AVG(col3) -> summing the averages alone gives a wrong value because of the NULL handling
This question is similar to Average of multiple columns, but is there any simple solution for the PostgreSQL-specific case?

Getting null when trying to change datatype in pyspark

I have a dataset C1.txt that has one column named features. All the rows are strings representing x and y, the coordinates of a two-dimensional point. I want to change the type to double, but when I do that with this code:
from pyspark.sql.types import (StructField, StringType, IntegerType, StructType, DoubleType)
changedTypedf = df.withColumn("features", df["features"].cast(DoubleType()))
I receive null for all rows (they were not null before changing the datatype).
I don't know what is wrong, please help me solve this problem.
Thanks

Calculating median of column "Balance" from table "Marketing"

I have a Spark (Scala) dataframe "Marketing" with approximately 17 columns, one of them being "Balance". The data type of this column is Int. I need to find the median Balance. I can get as far as arranging it in ascending order, but how do I proceed after that? I have been given a hint that the percentile function can be used. I don't have any idea about this percentile function. Can anyone help?
Median is the same thing as the 50th percentile. If you do not mind using Hive functions you can do one of the following:
marketingDF.selectExpr("percentile(CAST(Balance AS BIGINT), 0.5) AS median")
If you do not need an exact figure you can look into using percentile_approx() instead.
Documentation for both functions is located here.
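A non-Hive alternative, in case it helps: the DataFrame API exposes approxQuantile on df.stat, which returns the exact value when the relative error is set to 0.0. A minimal sketch, assuming the dataframe is marketingDF as above:

// relativeError = 0.0 asks for the exact quantile (more expensive);
// a small value such as 0.01 trades a little accuracy for speed
val Array(median) = marketingDF.stat.approxQuantile("Balance", Array(0.5), 0.0)
println(s"Median balance: $median")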