Tried/failed to replace null values with means in spark dataframe - scala

Update: I was wrong; the error stems from the VectorAssembler, not the random forest (or possibly from both). The issue is the same either way: when I use the df_noNulls dataframe in the VectorAssembler, it says it cannot vectorize the columns because there are null values.
I've looked at other answers for this question and liberated/borrowed/stolen the answer code to try to get this to work. My end goal is RF/GB/other ML modeling, none of which take kindly to null values. I've put together the following code to pull all numeric columns, get each column's mean, then create a new dataframe that joins the two and replaces all the nulls with the mean. When I then try to create a vector of the numeric columns as the "features" part of the random forest, it returns an error that says "Values to assemble cannot be null".
val numCols = DF.schema.fields filter {
x => x.dataType match {
case x: org.apache.spark.sql.types.DoubleType => true
case x: org.apache.spark.sql.types.IntegerType => true
case x: org.apache.spark.sql.types.LongType => true
case _ => false
}
} map {x => x.name}
//NUMCOLS NOW IS AN ARRAY OF ALL NUMERIC COLUMN NAMES
val numDf = DF.select(numCols.map(col): _*)
//NUMDF IS A DATAFRAME OF ALL NUMERIC COLUMNS
val means = numDf.agg(numDf.columns.map(c => (c -> "avg")).toMap)
//CREATES A DATAFRAME OF MEANS OF ALL NUMERIC VARIABLES
means.persist()
//PERSIST TABLE 'MEANS' FOR JOINING --BROADCAST ALSO WORKS BUT I WAS GETTING MEMORY ISSUES WITH IT SO I SWITCHED IT
val exprs = numDf.columns.map(c => coalesce(col(c), col(s"avg($c)")).alias(c))
//EXPRS CREATES FUNCTION TO REPLACE NULLS WITH MEANS
val df_noNulls = DF.crossJoin(means).select(exprs: _*)
df_noNulls should now be a dataframe of only the numeric columns with no null values, the nulls having been replaced with the column means. Yet when trying to make a vector of all the values (minus the label/target) I get the "Values to assemble cannot be null" error. I've attached a screenshot of the error in case that might help. It also says it failed to execute a user defined function.
I know I've been asking a lot of questions about scala here recently, sorry about that, I'm just really trying to learn to do this. Below is the rest of the code to the RF step in case the mistake is there somewhere:
val num_feat = numCols.filter(! _.contains("call"))
val features=num_feat
val featureAssembler = new VectorAssembler().setInputCols(features).setOutputCol("features")
val reweight_vector = featureAssembler.transform(df_noNulls)
val rf50 = new RandomForestClassifier().setSeed(9).setLabelCol("call_ind").setFeaturesCol("features").setNumTrees(500).setMaxBins(100).fit(reweight_vector)

I am guessing that the cause for this is a column that is entirely null - in that case, the average would be null too. To avoid that, you can simply add another "fallback" in the coalesce expression, using a literal 0 for example:
val exprs = numDf.columns.map(c => coalesce(col(c), col(s"avg($c)"), lit(0.0)).alias(c))
With the rest of the code unchanged, this should ensure none of the values in df_noNulls is null.
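To see why the extra fallback matters, here is a small plain-Scala model of the same logic. It does not use Spark: Option[Double] stands in for a nullable column value, and the names CoalesceFallbackSketch, columnMean and fillWithMean are made up for this illustration. A column that is entirely null has no mean, so only the final literal keeps the result non-null.

```scala
object CoalesceFallbackSketch {
  // Mean of the non-null entries; None when every entry is null,
  // which mirrors avg() yielding null for an all-null column.
  def columnMean(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum / present.size)
  }

  // Mirrors coalesce(col, mean, lit(fallback)): the first non-null value wins.
  def fillWithMean(values: Seq[Option[Double]], fallback: Double = 0.0): Seq[Double] = {
    val mean = columnMean(values)
    values.map(v => v.orElse(mean).getOrElse(fallback))
  }
}
```

With a partly null column the gaps take the mean; with an all-null column every value falls through to the literal fallback.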

Related

Spark-Scala: Map the first element of list with every other element of list when lists are of varying length

I have a dataset of the following type in a text file:
1004,bb5469c5|2021-09-19 01:25:30,4f0d-bb6f-43cf552b9bc6|2021-09-25 05:12:32,1954f0f|2021-09-19 01:27:45,4395766ae|2021-09-19 01:29:13,
1018,36ba7a7|2021-09-19 01:33:00,
1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,
1022,3623fe40|2021-09-19 01:33:00,
1028,6c77d26c-6fb86|2021-09-19 01:50:50,f0ac93b3df|2021-09-19 01:51:11,
1032,ac55-4be82f28d|2021-09-19 01:54:20,82229689e9da|2021-09-23 01:19:47,
I read the file using sc.textFile, which returns an RDD[String], after which I perform the operations .map(x=>x.substring(1,x.length()-1)).map(x=>x.split(",").toList)
After split.toList I want to map the first element of each of the lists obtained to every other element of the list for which I use .map(x=>(x(0),x(1))).toDF("c1","c2")
This works fine for those lists which have only one value after the split, but for lists with more than one value it skips all the remaining elements, for obvious reasons. For example:
.map(x=>(x(0),x(1))) returns [1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59] but skips out on the third element here 77a90a1c97b|2021-09-19 01:34:53
How can I write a map function which returns [1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59], [1020,77a90a1c97b|2021-09-19 01:34:53] given that all the lists created using .map(x=>x.split(",").toList) are of varying lengths (have varying number of elements)?
I noted the ',' at the end of each line, but split drops trailing empty strings, so it causes no problem.
The solution is as follows, just try it and you will see it works:
// x._n cannot work here initially.
val rdd = spark.sparkContext.textFile("/FileStore/tables/oddfile_01.txt")
val rdd2 = rdd.map(line => line.split(','))
val rdd3 = rdd2.map(x => (x(0), x.tail.toList))
val rdd4 = rdd3.flatMap{case (x, y) => y.map((x, _))}
rdd4.collect
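Note that the cardinality changes with this approach: the output has one row per (first element, other element) pair rather than one row per input line.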
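The pairing step can also be checked without a cluster. The following is a minimal sketch on plain Scala strings, assuming the same comma-separated layout as the sample data; the object and method names are hypothetical.

```scala
object PairWithFirstSketch {
  // Pair the first field of a comma-separated line with every later field.
  def pairs(line: String): List[(String, String)] = {
    // String.split drops trailing empty strings, so the final ',' adds no field.
    val fields = line.split(",")
    fields.tail.toList.map(value => (fields.head, value))
  }

  // Two lines taken from the sample data in the question.
  val sampleLines = List(
    "1018,36ba7a7|2021-09-19 01:33:00,",
    "1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,"
  )

  // flatMap flattens the per-line lists, giving one element per pair.
  val allPairs: List[(String, String)] = sampleLines.flatMap(pairs)
}
```

Here two input lines produce three pairs, which is the change in cardinality mentioned above.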

Converting Fields to Ints, Doubles, etc. in Scala in Spark Shell RDD

I have an assignment where I need to load a csv dataset in a spark-shell using spark.read.csv(), and accomplish the following:
1. Convert the dataset to an RDD
2. Remove the heading (the first record (line) in the dataset)
3. Convert the first two fields to integers
4. Convert the other fields except the last one to doubles. Question marks should be NaN. The last field should be converted to a Boolean.
I was able to do steps 1 and 2 with the following code:
//load the dataset as an RDD
val dataRDD = spark.read.csv("block_1.csv").rdd //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:23
dataRDD.count() //output 574914
//import Row since RDD is of Row
import org.apache.spark.sql.Row
//function to recognize if a string contains "id_1"
def isHeader(r : Row) = r.toString.contains("id_1")
//filter applies !isHeader to every row in dataRDD;
//the rows that pass form another RDD
val nohead = dataRDD.filter(x => !isHeader(x))
nohead.count() //output is now 574913
nohead.first //output is [37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE]
nohead //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[15] at filter at <console>:28
I'm trying to convert the fields but every time I use a function like toDouble I get an error stating not a member of:
:25: error: value toDouble is not a member of
org.apache.spark.sql.Row
if ("?".equals(s)) Double.NaN else s.toDouble
I'm not sure what I'm doing wrong and I've taken a look at the website https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html#anyNull()
but I still don't know what I'm doing wrong.
I'm not sure how to convert something if there isn't a toDouble, toInt, or toBoolean function.
Can someone please guide me in the right direction to figure what I'm doing wrong? Where I can possibly look to answer? I need to convert the first two fields to integers, the other fields except for the last one to doubles. Question marks should be NaN. The last field should be converted to Boolean.
3. Convert the first two fields to integers
4. Convert the other fields except the last one to doubles. Question marks should be NaN. The last field should be converted to a Boolean.
You can do both 3 and 4 at once using a parse function.
First create the toDouble function since it is used in the parse function:
def toDouble(s: String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
def parse(line: String) = {
val pieces = line.split(',')
val id1 = pieces(0).toInt
val id2 = pieces(1).toInt
val scores = pieces.slice(2, 11).map(toDouble)
val matched = pieces(11).toBoolean
(id1, id2, scores, matched)
}
After you do this, you can call parse on each row in your RDD using map; however, you still have the type issue. To fix this, you could convert nohead from an RDD[Row] to an RDD[String]; however, it's probably easier to just convert the row to a string as you pass it:
val parsed = nohead.map(line => parse(line.mkString(",")))
This will give parsed the type RDD[(Int, Int, Array[Double], Boolean)]
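As a quick sanity check outside Spark, the same parsing logic can be run on the first data row shown in the question. This is only a sketch: the object name ParseSketch and the sample value are assumptions for illustration, and the row is written as the comma-joined string that mkString(",") would produce.

```scala
object ParseSketch {
  // "?" marks a missing score and becomes NaN.
  def toDouble(s: String): Double =
    if (s == "?") Double.NaN else s.toDouble

  def parse(line: String): (Int, Int, Array[Double], Boolean) = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val scores = pieces.slice(2, 11).map(toDouble) // fields 2 to 10 inclusive
    val matched = pieces(11).toBoolean             // accepts "TRUE" regardless of case
    (id1, id2, scores, matched)
  }

  // The row printed by nohead.first in the question, joined with commas.
  val sample = "37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE"
}
```

Calling ParseSketch.parse(ParseSketch.sample) yields two Int ids, nine Double scores with NaN wherever the row had a question mark, and a Boolean.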

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying out the map function, but I believe it loops through each row of a dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
.setInputCols(df.columns)
.setOutputCols(df.columns.map(c => s"${c}_imputed"))
.setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=df.columns,
outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: *).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
1. Create the dictionary mean_dict mapping column names to the aggregate operation (mean).
2. Calculate the mean for each column, and save it as the dictionary col_avgs.
3. The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the leading avg( and the trailing ) so only the column name remains.
4. Fill the columns of the dataframe with the averages using col_avgs.

Transforming Spark Dataframe Column

I am working with Spark dataframes. I have a categorical variable in my dataframe with many levels. I am attempting a simple transformation of this variable - Only pick the top few levels which has greater than n observations (say,1000). Club all other levels into an "Others" category.
I am fairly new to Spark, so I have been struggling to implement this. This is what I have been able to achieve so far:
// Extract all levels having > 10000 observations (df is the dataframe name)
val levels_count = df.groupBy("Col_name").count.filter("count >10000").sort(desc("count"))
// Extract the level names
val level_names = levels_count.select("Col_name").rdd.map(x => x(0)).collect
This gives me an Array which has the level names that I would like to retain. Next, I should define the transformation function which can be applied to the column. This is where I am getting stuck. I believe we need to create a User defined function. This is what I tried:
// Define UDF
val var_transform = udf((x: String) => {
if (level_names contains x) x
else "others"
})
// Apply UDF to the column
val df_new = df.withColumn("Var_new", var_transform($"Col_name"))
However, when I try df_new.show it throws a "Task not serializable" exception. What am I doing wrong? Also, is there a better way to do this?
Thanks!
Here is a solution that would be, in my opinion, better for such a simple transformation: stick to the DataFrame API and trust Catalyst and Tungsten to optimise it (e.g. by making a broadcast join):
val levels_count = df
.groupBy($"Col_name".as("new_col_name"))
.count
.filter("count >10000")
val df_new = df
.join(levels_count,$"Col_name"===$"new_col_name", joinType="leftOuter")
.drop("Col_name")
.withColumn("new_col_name",coalesce($"new_col_name", lit("other")))
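The count-then-replace logic behind this join can be sketched on a plain Scala collection. This is only a model of the idea, not Spark code; the names LumpLevelsSketch and lump, and the parameters minCount and other, are hypothetical.

```scala
object LumpLevelsSketch {
  // Keep levels seen more than minCount times; replace all others with `other`.
  def lump(values: Seq[String], minCount: Int, other: String = "other"): Seq[String] = {
    // Equivalent of groupBy("Col_name").count
    val counts: Map[String, Int] =
      values.groupBy(identity).map { case (level, rows) => level -> rows.size }
    // Equivalent of the left outer join followed by coalesce(..., lit("other"))
    values.map(v => if (counts(v) > minCount) v else other)
  }
}
```

For example, lump(Seq("a", "a", "a", "b", "c", "a"), 2) keeps "a" and turns "b" and "c" into "other".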

how to use select() and map() in spark - scala?

I'm writing code for data migration from MySQL to Cassandra using Spark. I'm trying to generalize it so that, given a conf file, it can migrate any table. I'm stuck in two places:
val dataframe2 = dataframe.select("a","b","c","d","e","f")
After loading the table from MySQL I wish to select only a few columns; I have the names of these columns as a list. How can the list be used here?
val RDDtuple = dataframe2.map(r => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3), r.getAs(4), r.getAs(5)))
Here again every table may have a different number of columns, so how can this be achieved?
To use variable number of columns in select(), your list of columns can be converted like this:
val columns = List("a", "b", "c", "d")
val dfSelectedCols = dataframe.select(columns.head, columns.tail: _*)
Explanation: the first param in DataFrame's select(String, String...) is mandatory, so use columns.head. The remaining part of the list needs to be converted to varargs using columns.tail: _*.
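The head/tail varargs pattern is plain Scala and can be tried without Spark. Below is a minimal sketch in which a made-up select method has the same shape as the select(String, String...) overload; VarargsSketch and its members are hypothetical names.

```scala
object VarargsSketch {
  // Same shape as select(col: String, cols: String*): one mandatory name, then varargs.
  def select(col: String, cols: String*): List[String] = col :: cols.toList

  val columns = List("a", "b", "c", "d")

  // `: _*` expands the remaining list into individual varargs arguments.
  val selected: List[String] = select(columns.head, columns.tail: _*)
}
```

Because the list is split into head and tail, the pattern assumes it holds at least one column name; columns.head throws on an empty list.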
It's not very clear from your example, but I suppose that x is an RDD[Row] and that you are trying to convert it into an RDD of tuples, right? Please give more details and also use meaningful variable names. x, y or z are bad choices, especially if there is no explicit typing.