Should i be using StringIndexer on int column?

Should i be using StringIndexer on int column? - pyspark

Hello so i have a pyspark dataframe with the columns "month" and "day_of_week".
these are represented by numeric values already with :
1 representing januaray in "month" a
1 in "day_of_week" representing Monday.
These are when i printSchem(), identified as integer.
Should i be passing through StringIndexer to identify them as categorical variable for machine learning or can i just leave it as is ?

StringIndexer is designed to handle categorical columns (String type). So the answer is no, you can leave it as is.

Related

Spark Imputer for filling up missing values

requirement -
In the Picture attached, consider the first 3 columns as my raw data. Some rows have quantity column as NULL value which is exactly what I want to fill up.
In an Ideal case, I would fill up any NULL value with the previous KNOWN value.
Spark Imputer seemed to be a very easily implementable library that can help me fill missing values.
But here the issue is,Spark Imputer is limited to mean or Median calculation according to all NON-BULL values present in the data frame as a result of which I don't get desired result (4th column in the Pic).
Logic -
val imputer = new Imputer()
.setInputCols(Array("quantity"))
.setOutputCols(Array("quantity_imputed"))
.setStrategy("mean")
val model = imputer.fit(new_combinedDf)
model.transform(new_combinedDf).show()
Result -
Now is it possible to limit the Mean calculation for EACH null value to be the MEAN of last n values ?
i.e
For 2020-09-26 , where we get the first null value, Is it possible to tweak Spark Imputer to calculate the Mean over last n values only instead of all non-null values in the dataframe ?

How do I Map and Remap string values to Int or double in scala

I have a data file of some columns. I am performing some mathematical computations on the values for that purpose I want to map my non Integer value columns to Int and then after the operations on the values I want to remap them.
Following are my columns values
atom_id,molecule_id,element,type,charge
d100_1,d100,c,22,-0.128
d100_10,d100,h,3,0.132
d100_11,d100,c,29,0.002
d100_12,d100,c,22,-0.128
d100_13,d100,c,22,-0.128
Suppose I want to map only 2 columns and then remap those columns values only. I have searched for methods and found String Indexer but it maps all of the columns of the DF, I need to map only specific columns and then remap the values of those specific columns. Any help will be appreciated.
//edited Part
I have the following columns in my DataFrame
ind1,inda,logp,lumo,mutagenic,element
1,0,4.23,-1.246,yes,c
1,0,4.62,-1.387,yes,b
0,0,2.68,-1.034,no,h
1,0,6.26,-1.598,yes,c
1,0,2.4,-3.172,yes,a
Basically I am writing the code for synthetic Data Generation based on the given input data, so I want to use column values i.e ind1,inda,logp,lumo,mutagenic,element. single row at a time and after applying some math functions on it I will get a row which will consist of 6 values and each value will be representing the corresponding column value.
Now the problem is that all column values are of type double except mutagenic and element. I want to map this mutagenic and element columns to double values for example yes to 0 and No to 1 so that I can use them and then when I will receive the output row then I will reverse map that generated mutagenic value back to the corresponding string value using that mapping function.
Hope so I am clear this time

Spark StringIndexer consistent output value for given input [duplicate]

This question already has an answer here:
Spark ML StringIndexer Different Labels Training/Testing
(1 answer)
Closed 5 years ago.
Is it possible to use Spark's StringIndexer to consistently return the same output for a given input (I.e a column labelled 'Apple' will always output say '56.0')
The use case is when indexing multiple DataFrames and not all the inputs are inclusive in both but, you want to ensure ones which are, are converted to the same indexed value.
I'm trying to avoid my own String => Number mapping and wondered if StringIndexer could do this.

After looking some more I came across this similar post:
Spark ML StringIndexer Different Labels Training/Testing
If you save the StringIndexerModel used firstly and reuse it for transformation of any further DataFrames you'll get the same outputs.
I've flagged this post as a duplicate.

Calculating median of column "Balance" from table "Marketing"

I have a spark (scala) dataframe "Marketing" with approx 17 columns with 1 of them as "Balance". The data type of this column is Int. I need to find the median Balance. I can do upto arranging it in ascending order, but how to proceed after that? I have a given hint that the percentile function of scala can be used. I don't have any idea about this percentile function. Can anyone help?

Median is the same thing as the 50th percentile. If you do not mind using hive functions you can do one of the following:
marketingDF.selectExpr("percentile(CAST(Balance AS BIGINT), 0.5) AS median")
If you do not need an exact figure you can look into using percentile_approx() instead.
Documentation for both functions is located here.

How can I convert one column data to a vector using Spark Scala

I am using Spark, Scala to process data. I have one question couldn't figure out. I have a dataframe, which is one column:
data
1
2
3
4
5
I want it to a single vector
[1.0,2.0,3.0,4.0,5.0]
How can I implemented it ? I tried
df.collect().toVector or rdd.foreach, but everytime it returns to me an array of vectors [1,0], [2.0], [3.0], [4.0], [5.0], not one single vector.

This is happening because when you collect a dataframe you get an Array of rows. You need to extract the values from the row objects.
df.collect().map(x => x.getDouble(0)).toVector

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse