Replacing null values with 0 after spark dataframe left outer join - scala

I have two dataframes called left and right.
scala> left.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
scala> right.printSchema
root
|-- user_uid: double (nullable = false)
|-- real_labelVal: double (nullable = false)
Then, I join them to get the joined Dataframe. It is a left outer join. Anyone interested in the natjoin function can find it here.
https://gist.github.com/anonymous/f02bd79528ac75f57ae8
scala> val joinedData = natjoin(predictionDataFrame, labeledObservedDataFrame, "left_outer")
scala> joinedData.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
|-- real_labelVal: double (nullable = false)
Since it is a left outer join, the real_labelVal column has nulls when user_uid is not present in right.
scala> val realLabelVal = joinedData.select("real_labelval").distinct.collect
realLabelVal: Array[org.apache.spark.sql.Row] = Array([0.0], [null])
I want to replace the null values in the realLabelVal column with 1.0.
Currently I do the following:
I find the index of real_labelval column and use the spark.sql.Row API to set the nulls to 1.0.
(This gives me a RDD[Row])
Then I apply the schema of the joined dataframe to get the cleaned dataframe.
The code is as follows:
val real_labelval_index = 3
def replaceNull(row: Row) = {
  val rowArray = row.toSeq.toArray
  rowArray(real_labelval_index) = 1.0
  Row.fromSeq(rowArray)
}
val cleanRowRDD = joinedData.map(row => if (row.isNullAt(real_labelval_index)) replaceNull(row) else row)
val cleanJoined = sqlContext.createDataFrame(cleanRowRDD, joinedData.schema)
Is there an elegant or efficient way to do this?
Googling hasn't helped much.
Thanks in advance.

Have you tried using na?
joinedData.na.fill(1.0, Seq("real_labelval"))
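As a quick sanity check (a sketch reusing the names from the question), na.fill only touches nulls in the listed columns, so the existing 0.0 values are preserved:

```scala
// Replace nulls with 1.0 in real_labelval only; other columns and
// non-null values are left untouched.
val cleanJoined = joinedData.na.fill(1.0, Seq("real_labelval"))

// The distinct values should now be 0.0 and 1.0.
cleanJoined.select("real_labelval").distinct.show()
```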

Related

In pyspark 2.4, how to handle columns with the same name resulting from a self join?

Using pyspark 2.4, I am doing a left join of a dataframe on itself.
df = df.alias("t1") \
.join(df.alias("t2"),
col(t1_anc_ref) == col(t2_anc_ref), "left")
The resulting structure of this join is the following:
root
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
I would like to be able to drop the penultimate column of this dataframe (anc_ref_1).
Using the column name is not possible, as there are duplicates. So instead of this, I select the column by index and then try to drop it:
col_to_drop = len(df.columns) - 2
df = df.drop(df[col_to_drop])
However, that gives me the following error:
pyspark.sql.utils.AnalysisException: "Reference 'anc_ref_1' is
ambiguous, could be: t1.anc_ref_1, t2.anc_ref_1.;"
Question:
When I print the schema, there is no mention of t1 and t2 in the column names. Yet they are mentioned in the stack trace. Why is that, and can I use them to reference a column?
I tried df.drop("t2.anc_ref_1") but it had no effect (no column dropped)
EDIT: Works well with df.drop(col("t2.anc_ref_1"))
How can I handle the duplicate column names ? I would like to rename/drop so that the result is:
root
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
|-- anc_ref_1: string (nullable = true) -> dropped
|-- anc_ref_2: string (nullable = true) -> renamed to anc_ref_3
Option 1
Drop the column by referring to the original source dataframe.
Data
df = spark.createDataFrame([('Value1', 'Something'),
                            ('Value2', '1057873 1057887'),
                            ('Value3', 'Something Something'),
                            ('Value4', None),
                            ('Value5', '13139'),
                            ('Value6', '1463451 1463485'),
                            ('Value7', 'Not In Database'),
                            ('Value8', '1617275 16288')
                           ], ('anc_ref_1', 'anc_ref'))
df.show()
Code
df_as1 = df.alias("df_as1")
df_as2 = df.alias("df_as2")
df1 = df_as1.join(df_as2, df_as1.anc_ref == df_as2.anc_ref, "left").drop(df_as1.anc_ref_1)#.drop(df_as2.anc_ref)
df1.show()
Option 2
Use a string sequence to join and then select the join column:
df_as1.join(df_as2, "anc_ref", "left").select('anc_ref',df_as1.anc_ref_1).show()

apache spark add column which is a complex calculation

I have a following dataset df1 in Spark:
root
|-- id: integer (nullable = true)
|-- t: string (nullable = true)
|-- x: double (nullable = false)
|-- y: double (nullable = false)
|-- z: double (nullable = false)
and I need to create a column which will be a kind of calculation result of
sqrt(x)+cqrt(y)+z*constantK
I'm trying something like following:
val constantK=100500
val df2= df1.select($"id", (scala.math.sqrt($"x")+scala.math.cqrt($"y")+$"z"*constantK ))
however, I got a type mismatch error
<console>:59: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Double
val df2= df1.select($"id", (scala.math.sqrt($"x")+scala.math.cqrt($"y")+$"z"*constantK ))
what is the proper way of adding columns with complex calculations which are based on the values of other columns in the dataframe?
That is because you are trying to use scala.math functions on Spark SQL columns. Spark SQL has its own operations and types:
import org.apache.spark.sql.functions.sqrt
df1.select($"id", (sqrt($"x")+sqrt($"y")+$"z"*constantK ))
The operator '*' is supported. Take a look at https://spark.apache.org/docs/2.3.0/api/sql/index.html
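If cqrt was meant to be a cube root, Spark SQL ships that as a column function too (cbrt), so the whole expression can stay on Column types. A sketch using withColumn, which keeps the original columns and names the result:

```scala
import org.apache.spark.sql.functions.{sqrt, cbrt}

val constantK = 100500
// sqrt and cbrt here are the Spark SQL column functions, not scala.math,
// so they accept Columns and are evaluated on the executors.
val df2 = df1.withColumn("result", sqrt($"x") + cbrt($"y") + $"z" * constantK)
```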

How to add assign value to empty dataframe existing column in scala?

I am reading a csv file which has a trailing | delimiter, so in Spark 1.6 the load method creates a last column in the dataframe with no name and no values.
df.withColumnRenamed(df.columns(83),"Invalid_Status").drop(df.col("Invalid_Status"))
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter","|").option("header","true").load("filepath")
val df2 = df.withColumnRenamed(df.columns(83),"Invalid_Status")
I expected result
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- Invalid_Status: string (nullable = true)
but actual output is
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- : string (nullable = true)
with no values in that column, so I have to drop this column and create a new one.
It is not completely clear what you want to do: just rename the column to Invalid_Status, or drop the column entirely. What I understand is that you are trying to operate (rename/drop) on the last column, which has no name.
I will try to help you with both solutions -
To rename the column, keeping its (blank) values as they are:
val df2 = df.withColumnRenamed(df.columns.last,"Invalid_Status")
To drop the last column without knowing its name, use:
val df3 = df.drop(df.columns.last)
And then add the "Invalid_Status" column with default values (lit comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.lit
val requiredDf = df3.withColumn("Invalid_Status", lit("Any_Default_Value"))
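Putting the two steps together in one chain (same placeholder default value as above, purely as a sketch):

```scala
import org.apache.spark.sql.functions.lit

// Drop the unnamed trailing column, then append Invalid_Status
// with a default value.
val requiredDf = df
  .drop(df.columns.last)
  .withColumn("Invalid_Status", lit("Any_Default_Value"))
```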

How to compute statistics on a streaming dataframe for different type of columns in a single query?

I have a streaming dataframe having three columns time, col1,col2.
+-----------------------+-------------------+--------------------+
|time |col1 |col2 |
+-----------------------+-------------------+--------------------+
|2018-01-10 15:27:21.289|0.4988615628926717 |0.1926744113882285 |
|2018-01-10 15:27:22.289|0.5430687338123434 |0.17084552928040175 |
|2018-01-10 15:27:23.289|0.20527770821641478|0.2221980020202523 |
|2018-01-10 15:27:24.289|0.130852802747647 |0.5213147910202641 |
+-----------------------+-------------------+--------------------+
The datatype of col1 and col2 is variable. It could be a string or numeric datatype.
So I have to calculate statistics for each column.
For a string column, calculate only valid count and invalid count.
For a timestamp column, calculate only min & max.
For a numeric column, calculate min, max, average and standard deviation.
I have to compute all statistics in a single query.
Right now, I have computed with three queries separately for every type of column.
Enumerate the cases you want and select them. For example, if the stream is defined as:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
val schema = StructType(Seq(
  StructField("v", TimestampType),
  StructField("x", IntegerType),
  StructField("y", StringType),
  StructField("z", DecimalType(10, 2))
))
val df = spark.readStream.schema(schema).format("csv").load("/tmp/foo")
The result would be
val stats = df.select(df.dtypes.flatMap {
  case (c, "StringType") =>
    Seq(count(c) as s"valid_${c}", count("*") - count(c) as s"invalid_${c}")
  case (c, t) if Seq("TimestampType", "DateType") contains t =>
    Seq(min(c), max(c))
  case (c, t) if (Seq("FloatType", "DoubleType", "IntegerType") contains t) || t.startsWith("DecimalType") =>
    Seq(min(c), max(c), avg(c), stddev(c))
  case _ => Seq.empty[Column]
}: _*)
// root
// |-- min(v): timestamp (nullable = true)
// |-- max(v): timestamp (nullable = true)
// |-- min(x): integer (nullable = true)
// |-- max(x): integer (nullable = true)
// |-- avg(x): double (nullable = true)
// |-- stddev_samp(x): double (nullable = true)
// |-- valid_y: long (nullable = false)
// |-- invalid_y: long (nullable = false)
// |-- min(z): decimal(10,2) (nullable = true)
// |-- max(z): decimal(10,2) (nullable = true)
// |-- avg(z): decimal(14,6) (nullable = true)
// |-- stddev_samp(z): double (nullable = true)
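To actually run this on the stream, the aggregation result has to be written out in complete (or update) output mode. A minimal sketch with a console sink (the sink choice is a placeholder, not from the question):

```scala
// Streaming aggregations cannot use append mode; "complete" re-emits
// the full (single-row) statistics table on every trigger.
val query = stats.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```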

How to get the first row data of each list?

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
+------------------------+----------------------------------------+
and I get the column of probability using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is :
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get only the first value of each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
how can this be done?
The input is standard random forest output; the Data above is created with val Data = predictions.select("docID", "probability")
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column - in this case the first item (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
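One caveat: the schema above shows probability: vector, i.e. an ML Vector rather than an SQL array, and Column.apply may not be able to index it directly. If $"probability"(0) fails with an analysis error, a small UDF is a common workaround. A sketch, assuming Spark 2.x ml vectors (in Spark 1.6 the type is org.apache.spark.mllib.linalg.Vector instead):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Extract element 0 from the ML Vector column via a UDF,
// then collect the result as Doubles.
val firstProb = udf((v: Vector) => v(0))
val proVal = Data.select(firstProb($"probability")).as[Double].collect()
```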