describe() function over rows instead of columns - Scala

As described in:
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
the describe() function works on each numerical column. Is it possible to do it across rows instead? My DF is 53 columns by 346,143 rows, so transposing is not an option. How can I do it?
I'm using Spark with Scala 2.11.

You can write your own UDF. Either you make a separate UDF for each quantity, or you put everything in one UDF returning a complex result:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1.0, 2.0, 3.0, 4.0, 5.0)
).toDF("x1", "x2", "x3", "x4", "x5")

// One UDF computing min, max and mean over a whole row,
// returned as a struct with fields _1, _2, _3
val describe = udf { xs: Seq[Double] =>
  val xmin = xs.min
  val xmax = xs.max
  val mean = xs.sum / xs.size.toDouble
  (xmin, xmax, mean)
}

df
  .withColumn("describe", describe(array("*")))
  .withColumn("min", $"describe._1")
  .withColumn("max", $"describe._2")
  .withColumn("mean", $"describe._3")
  .drop($"describe")
  .show
gives:
+---+---+---+---+---+---+---+----+
| x1| x2| x3| x4| x5|min|max|mean|
+---+---+---+---+---+---+---+----+
|1.0|2.0|3.0|4.0|5.0|1.0|5.0| 3.0|
+---+---+---+---+---+---+---+----+
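As a side note, for simple row-wise statistics like these you can skip the UDF entirely. A minimal sketch using the built-in least and greatest functions, assuming all columns are numeric:
import org.apache.spark.sql.functions._

val cols = df.columns.map(col)
df.withColumn("min", least(cols: _*))
  .withColumn("max", greatest(cols: _*))
  .withColumn("mean", cols.reduce(_ + _) / lit(cols.length))
  .show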

Related

Spark 2.3: subtract dataframes but preserve duplicate values (Scala)

Copying example from this question:
As a conceptual example, if I have two dataframes:
words = [the, quick, fox, a, brown, fox]
stopWords = [the, a]
then I want the output to be, in any order:
words - stopWords = [quick, brown, fox, fox]
ExceptAll can do this in 2.4 but I cannot upgrade. The answer in the linked question is specific to a dataframe:
words.join(stopwords, words("id") === stopwords("id"), "left_outer")
.where(stopwords("id").isNull)
.select(words("id")).show()
that is, you need to know the primary key and the other columns.
Can anyone come up with an answer that will work on any dataframe?
Here is an implementation for you all. I have tested it in Spark 2.4.2; it should work for 2.3 too (not 100% sure).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

def exceptAllCustom(df1: DataFrame, df2: DataFrame, pks: Seq[String]): DataFrame = {
  // After the left outer join, a row of df1 had no match in df2
  // exactly when all of df2's key columns are null
  val noMatchCondition = pks.foldLeft(lit(true))((cond, cName) => cond && df2(cName).isNull)
  val joinCondition = pks.foldLeft(lit(true))((cond, cName) => cond && df2(cName) === df1(cName))
  val result = df1.join(df2, joinCondition, "left_outer")
    .where(noMatchCondition)
  // Drop df2's copies of the key columns
  pks.foldLeft(result)((df, cName) => df.drop(df2(cName)))
}

val df1 = spark.createDataset(Seq("the", "quick", "fox", "a", "brown", "fox")).toDF("c1")
val df2 = spark.createDataset(Seq("the", "a")).toDF("c1")
exceptAllCustom(df1, df2, Seq("c1")).show()
Result -
+-----+
| c1|
+-----+
|quick|
| fox|
|brown|
| fox|
+-----+
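One caveat with the version above: === never matches null against null, so key columns containing nulls will not behave like exceptAll. If your keys can be null, a possible sketch is to build the join condition with the null-safe <=> operator instead (the isNull anti-join test would need rethinking in that case too):
val joinCondition = pks.foldLeft(lit(true))((cond, cName) => cond && (df2(cName) <=> df1(cName)))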
Turns out it's easier to do df1.except(df2) and then join the results with df1 to get all the duplicates.
Full code:
import org.apache.spark.sql.{Column, DataFrame}

def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
  // except() returns the distinct rows of df1 that do not appear in df2
  val except = df1.except(df2)
  val columns = df1.columns
  // Null-safe equality on every column, so rows containing nulls still match
  val colExpr: Column = df1(columns.head) <=> except(columns.head)
  val joinExpression = columns.tail.foldLeft(colExpr) { (colExpr, p) =>
    colExpr && df1(p) <=> except(p)
  }
  // The inner join with df1 restores the duplicates that except() collapsed
  val join = df1.join(except, joinExpression, "inner")
  join.select(df1("*"))
}
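For example, using the df1 and df2 defined in the first answer, the call below should again yield quick, brown and two fox rows (row order is not guaranteed):
exceptAllCustom(df1, df2).show()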

Combine multiple ArrayType Columns in Spark into one ArrayType Column

I want to merge multiple ArrayType[StringType] columns in Spark to create one ArrayType[StringType]. For combining two columns I found the solution here:
Merge two spark sql columns of type Array[string] into a new Array[string] column
But how do I go about combining if I don't know the number of columns at compile time? At run time, I will know the names of all the columns to be combined.
One option is to use the UDF defined in the above Stack Overflow question to add two columns, multiple times in a loop. But this involves multiple reads on the entire dataframe. Is there a way to do this in just one go?
+------+------+---------+
| col1 | col2 | combined|
+------+------+---------+
| [a,b]| [i,j]|[a,b,i,j]|
| [c,d]| [k,l]|[c,d,k,l]|
| [e,f]| [m,n]|[e,f,m,n]|
| [g,h]| [o,p]|[g,h,o,p]|
+------+------+---------+
import org.apache.spark.SparkException
import org.apache.spark.sql.{Column, Row}
import org.apache.spark.sql.functions._

// Concatenate any number of WrappedArray[String] values into one
def assemble(rowEntity: Any*): collection.mutable.WrappedArray[String] = {
  var outputArray =
    rowEntity(0).asInstanceOf[collection.mutable.WrappedArray[String]]
  rowEntity.drop(1).foreach {
    case v: collection.mutable.WrappedArray[String] =>
      outputArray ++= v
    case null =>
      throw new SparkException("Values to assemble cannot be null.")
    case o =>
      throw new SparkException(s"$o of type ${o.getClass.getName} is not supported.")
  }
  outputArray
}

// Column names are only known at run time
val arrStr: Array[String] = Array("col1", "col2")
val arrCol: Array[Column] = arrStr.map(c => df(c))

// Pack the columns into a single struct so everything happens in one pass
val assembleFunc = udf { r: Row => assemble(r.toSeq: _*) }
val outputDf = df.select(col("*"), assembleFunc(struct(arrCol: _*)).as("combined"))

outputDf.show(false)
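As a side note, if you are on Spark 2.4 or later, a sketch without any UDF is possible using the built-in flatten and array functions (reusing the arrCol array above):
import org.apache.spark.sql.functions.{array, flatten}

// flatten(array(...)) concatenates the arrays, keeping order and duplicates
val outputDf2 = df.withColumn("combined", flatten(array(arrCol: _*)))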
The idea: process the dataframe schema and get all the columns of type ArrayType(StringType), create the combined column with functions.array_union of the first two columns, then iterate through the rest of the columns, folding each of them into the combined column. Note that array_union removes duplicate elements, so this is a set union rather than a plain concatenation.
>>>from pyspark.sql import Row
>>>from pyspark.sql.functions import array_union, col
>>>from pyspark.sql.types import ArrayType, StringType
>>>df = spark.createDataFrame([Row(col1=['aa1', 'bb1'],
...                                col2=['aa2', 'bb2'],
...                                col3=['aa3', 'bb3'],
...                                col4=['a', 'ee'], foo="bar"
...                                )])
>>>df.show()
+----------+----------+----------+-------+---+
| col1| col2| col3| col4|foo|
+----------+----------+----------+-------+---+
|[aa1, bb1]|[aa2, bb2]|[aa3, bb3]|[a, ee]|bar|
+----------+----------+----------+-------+---+
>>>cols = [col_.name for col_ in df.schema
... if col_.dataType == ArrayType(StringType())
... or col_.dataType == ArrayType(StringType(), False)
... ]
>>>print(cols)
['col1', 'col2', 'col3', 'col4']
>>>
>>>final_df = df.withColumn("combined", array_union(cols[0], cols[1]))
>>>
>>>for col_ in cols[2:]:
... final_df = final_df.withColumn("combined", array_union(col('combined'), col(col_)))
>>>
>>>final_df.select("combined").show(truncate=False)
+-------------------------------------+
|combined |
+-------------------------------------+
|[aa1, bb1, aa2, bb2, aa3, bb3, a, ee]|
+-------------------------------------+

Spark Scala: How to update each column of a DataFrame in correspondence with each position of a Vector

I have a DF like this:
+--------------------+-----+--------------------+
| col_0|col_1| col_2|
+--------------------+-----+--------------------+
|0.009069428120139292| 0.3|9.015488712438252E-6|
|0.008070826019024355| 0.4|3.379696051366339...|
|0.009774715414895803| 0.1|1.299590589291292...|
|0.009631155146285946| 0.9|1.218569739510422...|
+--------------------+-----+--------------------+
And two vectors:
v1 = [7.0, 0.007, 0.052]
v2 = [804.0, 553.0, 143993.0]
The total number of columns is the same as the total number of positions in each vector.
How can I apply an equation that uses the number saved in the ith position of a vector to update the current values of the DF's ith column? I mean, I need to update all values in the DF using the values in the vectors.
Perhaps something like this is what you're after?
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

val df = Seq((1, 2, 3), (4, 5, 6)).toDF
val updateVector = Vector(10, 20, 30)

// The update rule, kept as a parameter so it is easy to swap out
val updateFunction = (columnValue: Column, vectorValue: Int) => columnValue * lit(vectorValue)

// Pair the ith column with the ith vector entry and apply the rule
val updateColumns = (df: DataFrame, updateVector: Vector[Int], updateFunction: ((Column, Int) => Column)) => {
  val columns = df.columns
  updateVector.zipWithIndex.map { case (updateValue, index) =>
    updateFunction(col(columns(index)), updateValue).as(columns(index))
  }
}

val dfUpdated = df.select(updateColumns(df, updateVector, updateFunction) :_*)
dfUpdated.show
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 10| 40| 90|
| 40|100|180|
+---+---+---+
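Since the update rule is just a parameter, you can swap in any other column expression without touching the rest. For example, adding the vector value instead of multiplying:
val addFunction = (columnValue: Column, vectorValue: Int) => columnValue + lit(vectorValue)
df.select(updateColumns(df, updateVector, addFunction) :_*).show
// first row becomes 11, 22, 33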

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame with a column containing Vector values. The vector values are all n-dimensional, i.e. all of the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each of which corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old one, but that requires a Spark context to use createDataFrame. I only want to transform the existing data frame. I also know about .withColumn("fi", value), but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add an ArrayType column
val dfArr = df.withColumn("featuresArr", vecToArray($"features"))
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked;
// the size of `elements` should equal the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
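If n is large, as the question worries, the element names need not be written out by hand. Assuming the f1..fn naming scheme from the question, they can be generated to match the vector length:
// Generate the names f1..fn (n = 5 matches the example above)
val n = 5
val elements = (1 to n).map(i => s"f$i").toArray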

Replace all occurrences of a String in all columns in a dataframe in scala

I have a dataframe with 20 columns, and in these columns there is a value XX which I want to replace with an empty string. How do I achieve that in Scala? The withColumn function works on a single column, but I want to pass all 20 columns and replace every value of XX in the entire frame with an empty string. Can someone suggest a way?
Thanks
You can gather all the StringType columns in a list and use foldLeft to apply your removeXX UDF to each of the columns as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq(
  (1, "aaXX", "bb"),
  (2, "ccXX", "XXdd"),
  (3, "ee", "fXXf")
).toDF("id", "desc1", "desc2")

// Collect the names of all StringType columns
val stringColumns = df.schema.fields.collect {
  case StructField(name, StringType, _, _) => name
}

// Null-safe removal of all occurrences of "XX"
val removeXX = udf( (s: String) =>
  if (s == null) null else s.replaceAll("XX", "")
)

val dfResult = stringColumns.foldLeft(df)( (acc, c) =>
  acc.withColumn(c, removeXX(df(c)))
)

dfResult.show
dfResult.show
+---+-----+-----+
| id|desc1|desc2|
+---+-----+-----+
| 1| aa| bb|
| 2| cc| dd|
| 3| ee| ff|
+---+-----+-----+
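As a side note, the same foldLeft works without a UDF by using the built-in regexp_replace, which passes nulls through unchanged:
import org.apache.spark.sql.functions.{col, regexp_replace}

val dfResult2 = stringColumns.foldLeft(df)( (acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), "XX", ""))
)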
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def clearValueContains(dataFrame: DataFrame, token: String, columnsToBeUpdated: List[String]): DataFrame = {
  columnsToBeUpdated.foldLeft(dataFrame) { (dataset, columnName) =>
    // Blank out the whole cell whenever it contains the token
    dataset.withColumn(columnName, when(col(columnName).contains(token), "").otherwise(col(columnName)))
  }
}
You can use this function, passing "XX" as the token. columnsToBeUpdated is the list of columns in which to search for the token.
dataset.withColumn(columnName, when(col(columnName) === token, "").otherwise(col(columnName)))
You can use the line above instead if you want to replace on an exact match only.
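For example, with the df from the first answer (note that this blanks the entire cell, so "aaXX" becomes "" rather than "aa"):
val cleaned = clearValueContains(df, "XX", List("desc1", "desc2"))
cleaned.show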
We can also do it like this in Scala.
// Get all column names
val columns: Seq[String] = df.columns
// Use DataFrameNaFunctions to replace the value in every column
val changedDF = df.na.replace(columns, Map("XX" -> ""))
Note that na.replace matches whole cell values only, so "aaXX" would be left untouched.
Hope this helps.