Infinity value in a DataFrame - Spark / Scala

I have a DataFrame with Infinity values. How can I replace them with 0.0?
I tried this, but it doesn't work:
val Nan= dataframe_final.withColumn("Vitesse",when(col("Vitesse").isin(Double.NaN,Double.PositiveInfinity,Double.NegativeInfinity),0.0))
Example of dataframe:
+------------------+
|           Vitesse|
+------------------+
| 8.171069002316942|
|          Infinity|
| 4.290418664272539|
| 16.19811830014666|
+------------------+
How can I replace "Infinity" with 0.0?
Thank you.

scala> df.withColumn("Vitesse", when(col("Vitesse").equalTo(Double.PositiveInfinity),0.0).otherwise(col("Vitesse")))
res1: org.apache.spark.sql.DataFrame = [Vitesse: double]
scala> res1.show
+-----------------+
| Vitesse|
+-----------------+
|8.171069002316942|
| 0.0|
|4.290418664272539|
+-----------------+
You can try it like the above.
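An alternative, if you prefer the built-in na helpers, is a rough sketch like the following (assuming Vitesse is a DoubleType column and dataframe_final is the DataFrame from the question):
// Sketch: map +/-Infinity to 0.0 with na.replace, then clear NaN with na.fill
// (na.fill covers null and NaN for numeric columns)
val cleaned = dataframe_final.na
  .replace("Vitesse", Map(Double.PositiveInfinity -> 0.0, Double.NegativeInfinity -> 0.0))
  .na.fill(0.0, Seq("Vitesse"))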

Your approach using when().otherwise() is correct; just add the missing .otherwise() clause so that Vitesse keeps its value when it is not Infinity, -Infinity, or NaN.
Example:
val df=Seq(("8.171".toDouble),("4.2904".toDouble),("16.19".toDouble),(Double.PositiveInfinity),(Double.NegativeInfinity),(Double.NaN)).toDF("Vitesse")
df.show()
//+---------+
//| Vitesse|
//+---------+
//| 8.171|
//| 4.2904|
//| 16.19|
//| Infinity|
//|-Infinity|
//| NaN|
//+---------+
df.withColumn("Vitesse", when(col("Vitesse").isin(Double.PositiveInfinity,Double.NegativeInfinity,Double.NaN),0.0).
otherwise(col("Vitesse"))).
show()
//+-------+
//|Vitesse|
//+-------+
//| 8.171|
//| 4.2904|
//| 16.19|
//| 0.0|
//| 0.0|
//| 0.0|
//+-------+
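If several double columns need the same cleanup, the same when().isin().otherwise() pattern can be folded over a list of column names. A sketch, where cols_to_clean is a hypothetical list you would supply:
import org.apache.spark.sql.functions.{col, when}
// hypothetical list of double columns to clean
val cols_to_clean = Seq("Vitesse")
val cleaned = cols_to_clean.foldLeft(df){ (acc, c) =>
  acc.withColumn(c,
    when(col(c).isin(Double.PositiveInfinity, Double.NegativeInfinity, Double.NaN), 0.0)
      .otherwise(col(c)))
}
cleaned.show()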

Related

How can I add a sequence of strings as columns on a DataFrame and make it a transform?

I have a sequence of strings:
val listOfString : Seq[String] = Seq("a","b","c")
How can I write a transform like:
def addColumn(example: Seq[String]): DataFrame => DataFrame = {
  // some code which returns a transform that adds these strings as columns to the dataframe
}
input
+---+
| id|
+---+
|  1|
+---+
output
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|  0|  0|  0|
+---+---+---+---+
I am only interested in expressing it as a transform.
You can use the transform method of Datasets together with a single select statement:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def addColumns(extraCols: Seq[String])(df: DataFrame): DataFrame = {
  // keep the existing columns and append a literal 0 aliased to each new column name
  val selectCols = df.columns.map{col(_)} ++ extraCols.map{c => lit(0).as(c)}
  df.select(selectCols :_*)
}
// usage example
val yourExtraColumns : Seq[String] = Seq("a","b","c")
df.transform(addColumns(yourExtraColumns))
Resources
https://towardsdatascience.com/dataframe-transform-spark-function-composition-eb8ec296c108
https://mungingdata.com/apache-spark/chaining-custom-dataframe-transformations/
Use .toDF() and pass your listOfString.
Example:
//sample dataframe
df.show()
//+---+---+---+
//| _1| _2| _3|
//+---+---+---+
//| 0| 0| 0|
//+---+---+---+
df.toDF(listOfString:_*).show()
//+---+---+---+
//| a| b| c|
//+---+---+---+
//| 0| 0| 0|
//+---+---+---+
UPDATE:
Use foldLeft to add the columns with literal values to the existing dataframe.
val df=Seq(("1")).toDF("id")
val listOfString : Seq[String] = Seq("a","b","c")
val new_df=listOfString.foldLeft(df){(df,colName) => df.withColumn(colName,lit("0"))}
//+---+---+---+---+
//| id| a| b| c|
//+---+---+---+---+
//| 1| 0| 0| 0|
//+---+---+---+---+
//or creating a function
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def addColumns(extraCols: Seq[String], df: DataFrame): DataFrame = {
  // fold over the column names, adding each one with a literal "0"
  extraCols.foldLeft(df){ (df, colName) => df.withColumn(colName, lit("0")) }
}
addColumns(listOfString,df).show()
//+---+---+---+---+
//| id| a| b| c|
//+---+---+---+---+
//| 1| 0| 0| 0|
//+---+---+---+---+

Splitting the file name

Good morning guys,
I have the following DataFrame:
+--------------+--------------------------------------------------------+-----------------+
|co_tipo_arquiv|filename                                                |count_tipo_arquiv|
+--------------+--------------------------------------------------------+-----------------+
|05            |hdfs://spbrhdpdev1.br.experian.local:8020/files/files_01|2                |
|01            |hdfs://spbrhdpdev1.br.experian.local:8020/files/files_02|2                |
+--------------+--------------------------------------------------------+-----------------+
I would like to keep only the file name in the filename column, like this:
+--------------+--------------------------------------------------------+-----------------+
|co_tipo_arquiv|filename                                                |count_tipo_arquiv|
+--------------+--------------------------------------------------------+-----------------+
|05            |files_01                                                |2                |
|01            |files_02                                                |2                |
+--------------+--------------------------------------------------------+-----------------+
I thought about doing a split, but I don't know how to get the last element:
split(col("filename"), "/")
but .last doesn't work:
+--------------+-------------------------------------------------------------+
|co_tipo_arquiv|filename |
+--------------+-------------------------------------------------------------+
|05 |[hdfs:, , spbrhdpdev1.br.experian.local:8020,files, files_01]|
|01 |[hdfs:, , spbrhdpdev1.br.experian.local:8020,files, files_02]|
+--------------+-------------------------------------------------------------+
From Spark 2.4+:
We can use the element_at function to get the last element of the array.
1. Using the element_at function:
df.withColumn("filename",element_at(split(col("filename"),"/"),-1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+
For Spark < 2.4:
2. Using the substring_index function:
df.withColumn("filename",substring_index(col("filename"),"/",-1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+
3. Using the regexp_extract function:
df.withColumn("filename",regexp_extract(col("filename"),"([^\\/]+$)",1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+

How to unpivot Spark DataFrame without hardcoding column names in Scala?

Assume you have
val df = Seq(("Jack", 91, 86), ("Mike", 79, 85), ("Julia", 93, 70)).toDF("Name", "Maths", "Art")
which gives:
+-----+-----+---+
| Name|Maths|Art|
+-----+-----+---+
| Jack| 91| 86|
| Mike| 79| 85|
|Julia| 93| 70|
+-----+-----+---+
Now you want to unpivot it by:
df.select($"Name", expr("stack(2, 'Maths', Maths, 'Art', Art) as (Subject, Score)"))
which gives:
+-----+-------+-----+
| Name|Subject|Score|
+-----+-------+-----+
| Jack| Maths| 91|
| Jack| Art| 86|
| Mike| Maths| 79|
| Mike| Art| 85|
|Julia| Maths| 93|
|Julia| Art| 70|
+-----+-------+-----+
So far so good! Now, what if you don't know the list of column names? What if the list of column names is long or can change? How can we avoid hardcoding the column names like that?
Something like this would also be good:
// fake code
df.select($"Name", unpivot(df.columns.diff("Name")) as ("Subject", "Score"))
Why don't we have an API like this?
This works quite well indeed:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

def melt(preserves: Seq[String], toMelt: Seq[String], column: String = "variable", row: String = "value", df: DataFrame): DataFrame = {
  // one struct per melted column: (column name, column value)
  val _vars_and_vals = array((for (c <- toMelt) yield { struct(lit(c).alias(column), col(c).alias(row)) }): _*)
  // explode the array so each (name, value) pair becomes its own row
  val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
  val cols = preserves.map(col _) ++ { for (x <- List(column, row)) yield { col("_vars_and_vals")(x).alias(x) }}
  _tmp.select(cols: _*)
}
Source: How to melt Spark DataFrame?
Thanks to #user10938362
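For reference, a minimal usage sketch of the melt function above against the df from the question (parameters passed positionally, so column and row become Subject and Score):
// melt everything except Name into (Subject, Score) pairs
melt(Seq("Name"), df.columns.filter(_ != "Name").toSeq, "Subject", "Score", df).show()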
By making use of mkString with a start, separator, and end, we can build the stack expression as a string and use it in expr.
Example:
df.show()
//+-----+-----+---+
//| Name|Maths|Art|
//+-----+-----+---+
//| Jack| 91| 86|
//| Mike| 79| 85|
//|Julia| 93| 70|
//+-----+-----+---+
//filtering required cols
val cols=df.columns.filter(_.toLowerCase != "name")
//defining alias cols string
val alias_cols="Subject,Score"
//mkString with 3 separators (start, separator, end)
val stack_exp=cols.map(x => s"""'${x}',${x}""").mkString(s"stack(${cols.size},",",",s""") as (${alias_cols})""")
df.select($"Name", expr(s"""${stack_exp}""")).show()
//+-----+-------+-----+
//| Name|Subject|Score|
//+-----+-------+-----+
//| Jack| Maths| 91|
//| Jack| Art| 86|
//| Mike| Maths| 79|
//| Mike| Art| 85|
//|Julia| Maths| 93|
//|Julia| Art| 70|
//+-----+-------+-----+
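As an aside, if you can run Spark 3.4 or later, the Dataset API now has a built-in unpivot (with a melt alias), so the stack expression is no longer needed. A sketch, assuming Spark 3.4+:
// Spark 3.4+: id columns, value columns, variable column name, value column name
df.unpivot(
  Array(col("Name")),
  df.columns.filter(_ != "Name").map(col),
  "Subject",
  "Score"
).show()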

Converting from org.apache.spark.sql.Dataset to CoordinateMatrix

I have a Spark SQL Dataset whose schema is defined as follows:
User_id <String> | Item_id <String> | Bought_Status <Boolean>
I would like to convert this to a sparse matrix to apply recommender-system algorithms. This is a very large dataset, so I read that CoordinateMatrix is the right way to create a sparse matrix out of it.
However, I got stuck at the point where the API doc says that an RDD[MatrixEntry] is mandatory to create a CoordinateMatrix. Also, MatrixEntry needs a format of int, int, long.
I am not able to convert my data schema to this format. Can you please help me convert this data to a sparse matrix in Spark? I am currently programming in Scala.
Please note that MatrixEntry is of type (Long, Long, Double).
Reference: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry
Also, as the user/item columns are strings, they need to be indexed before processing. Here is how you can create a CoordinateMatrix with Scala:
//Imports needed
scala> import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
scala> import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer
//Let's create a dummy dataframe
scala> val df = spark.sparkContext.parallelize(List(
| ("u1","i1" ,true),
| ("u1","i2" ,true),
| ("u2","i3" ,false),
| ("u2","i4" ,false),
| ("u3","i1" ,true),
| ("u3","i3" ,true),
| ("u4","i3" ,false),
| ("u4","i4" ,false))).toDF("user","item","bought")
df: org.apache.spark.sql.DataFrame = [user: string, item: string ... 1 more field]
scala> df.show
+----+----+------+
|user|item|bought|
+----+----+------+
| u1| i1| true|
| u1| i2| true|
| u2| i3| false|
| u2| i4| false|
| u3| i1| true|
| u3| i3| true|
| u4| i3| false|
| u4| i4| false|
+----+----+------+
//Index user/ item columns
scala> val indexer1 = new StringIndexer().setInputCol("user").setOutputCol("userIndex")
indexer1: org.apache.spark.ml.feature.StringIndexer = strIdx_2de8d35b8301
scala> val indexed1 = indexer1.fit(df).transform(df)
indexed1: org.apache.spark.sql.DataFrame = [user: string, item: string ... 2 more fields]
scala> val indexer2 = new StringIndexer().setInputCol("item").setOutputCol("itemIndex")
indexer2: org.apache.spark.ml.feature.StringIndexer = strIdx_493ce45dbec3
scala> val indexed2 = indexer2.fit(indexed1).transform(indexed1)
indexed2: org.apache.spark.sql.DataFrame = [user: string, item: string ... 3 more fields]
scala> val tempDF = indexed2.withColumn("userIndex",indexed2("userIndex").cast("long")).withColumn("itemIndex",indexed2("itemIndex").cast("long")).withColumn("bought",indexed2("bought").cast("double")).select("userIndex","itemIndex","bought")
tempDF: org.apache.spark.sql.DataFrame = [userIndex: bigint, itemIndex: bigint ... 1 more field]
scala> tempDF.show
+---------+---------+------+
|userIndex|itemIndex|bought|
+---------+---------+------+
| 0| 1| 1.0|
| 0| 3| 1.0|
| 1| 0| 0.0|
| 1| 2| 0.0|
| 2| 1| 1.0|
| 2| 0| 1.0|
| 3| 0| 0.0|
| 3| 2| 0.0|
+---------+---------+------+
//Create coordinate matrix of size 4*4
scala> val corMat = new CoordinateMatrix(tempDF.rdd.map(m => MatrixEntry(m.getLong(0),m.getLong(1),m.getDouble(2))), 4, 4)
corMat: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@16be6b36
//Check the content of coordinate matrix
scala> corMat.entries.collect
res2: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(MatrixEntry(0,1,1.0), MatrixEntry(0,3,1.0), MatrixEntry(1,0,0.0), MatrixEntry(1,2,0.0), MatrixEntry(2,1,1.0), MatrixEntry(2,0,1.0), MatrixEntry(3,0,0.0), MatrixEntry(3,2,0.0))
Hope this helps!
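If you don't know the matrix dimensions up front, CoordinateMatrix also has a constructor that takes only the entries RDD and infers the size from the maximum indices. A sketch:
// size is computed lazily from the entries (max row index + 1, max column index + 1)
val corMatAuto = new CoordinateMatrix(tempDF.rdd.map(m => MatrixEntry(m.getLong(0), m.getLong(1), m.getDouble(2))))
println(s"${corMatAuto.numRows()} x ${corMatAuto.numCols()}")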

How to use DataFrame Window expressions and withColumn and not to change partition?

For some reason I have to convert an RDD to a DataFrame, then do something with the DataFrame.
My interface is RDD, so I have to convert the DataFrame back to an RDD, and when I use df.withColumn, the number of partitions changes to 1, so I have to repartition and sortBy the RDD.
Is there any cleaner solution?
This is my code :
val rdd = sc.parallelize(List(1,3,2,4,5,6,7,8),4)
val partition = rdd.getNumPartitions
println(partition + "rdd")
val df=rdd.toDF()
val rdd2=df.rdd
val result = rdd.toDF("col1")
.withColumn("csum", sum($"col1").over(Window.orderBy($"col1")))
.withColumn("rownum", row_number().over(Window.orderBy($"col1")))
.withColumn("avg", $"csum"/$"rownum").rdd
println(result.getNumPartitions + "rdd2")
Let's make this as simple as possible: we will generate the same data across 4 partitions.
scala> val df = spark.range(1,9,1,4).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
+---+
scala> df.rdd.getNumPartitions
res13: Int = 4
We don't need 3 window functions to prove this, so let's do it with one:
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val df2 = df.withColumn("csum", sum($"id").over(Window.orderBy($"id")))
df2: org.apache.spark.sql.DataFrame = [id: bigint, csum: bigint]
So what's happening here is that we didn't just add a column: we computed a cumulative sum over a window of the data, and since you haven't provided a partition column, the window function moves all the data to a single partition. You even get a warning from Spark:
scala> df2.rdd.getNumPartitions
17/06/06 10:05:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
res14: Int = 1
scala> df2.show
17/06/06 10:05:56 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+----+
| id|csum|
+---+----+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
| 6| 21|
| 7| 28|
| 8| 36|
+---+----+
So let's now add a column to partition on. We will create a new DataFrame just for the sake of demonstration:
scala> val df3 = df.withColumn("x", when($"id"<5,lit("a")).otherwise("b"))
df3: org.apache.spark.sql.DataFrame = [id: bigint, x: string]
It indeed has the same number of partitions that we defined explicitly on df:
scala> df3.rdd.getNumPartitions
res18: Int = 4
Let's perform our window operation using the column x to partition by:
scala> val df4 = df3.withColumn("csum", sum($"id").over(Window.orderBy($"id").partitionBy($"x")))
df4: org.apache.spark.sql.DataFrame = [id: bigint, x: string ... 1 more field]
scala> df4.show
+---+---+----+
| id| x|csum|
+---+---+----+
| 5| b| 5|
| 6| b| 11|
| 7| b| 18|
| 8| b| 26|
| 1| a| 1|
| 2| a| 3|
| 3| a| 6|
| 4| a| 10|
+---+---+----+
The window function will repartition our data using the default number of shuffle partitions set in the Spark configuration (spark.sql.shuffle.partitions, 200 by default).
scala> df4.rdd.getNumPartitions
res20: Int = 200
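That 200 comes from spark.sql.shuffle.partitions, so if the partition count matters downstream, you can either lower that setting before running the window or repartition the result afterwards. A sketch:
// option 1: change the shuffle partition count used by subsequent shuffles
spark.conf.set("spark.sql.shuffle.partitions", "4")
// option 2: explicitly repartition the windowed result
val df5 = df4.repartition(4)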
I was just reading about controlling the number of partitions when using groupBy aggregation, from https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-performance-tuning-groupBy-aggregation.html. It seems the same trick works with Window. In my code I'm defining a window like
windowSpec = Window \
.partitionBy('colA', 'colB') \
.orderBy('timeCol') \
.rowsBetween(1, 1)
and then doing
next_event = F.lead('timeCol', 1).over(windowSpec)
and creating a dataframe via
df2 = df.withColumn('next_event', next_event)
and indeed, it has 200 partitions. But, if I do
df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)
it has 10!
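The same trick in Scala, for consistency with the rest of this page (a sketch, assuming a DataFrame df with colA, colB, and timeCol columns):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

// rowsBetween(1, 1) is exactly the frame that lead(..., 1) requires
val windowSpec = Window.partitionBy($"colA", $"colB").orderBy($"timeCol").rowsBetween(1, 1)
// repartitioning by the same keys beforehand keeps the 10 partitions through the window
val df2 = df.repartition(10, $"colA", $"colB").withColumn("next_event", lead($"timeCol", 1).over(windowSpec))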