Calculating average string length of a Column in Dataframes - scala

I'm lost on how I can calculate the average string length of any column in a dataframe using scala. I have been able to easily do it for numeric columns doing the following
val avgDF = df.dtypes.filter(x => x._2 == "DoubleType").map(ct =>avg(col(ct._1))).toList

import org.apache.spark.sql.functions._
val avgDF = df.agg(mean(length(col("yourColumn"))))

val findLength = udf { (ColValue: String) => ColValue.size }
myData.dtypes.filter(x=>x._2=="StringType").foreach(f=>
myData.select(avg(findLength(col(f._1)))).show()
)
Sample Data
Name|Age|email
Hari|12|hary#h0otmail.ocm
Hari|12|hary#h0otmail.ocm
Hari|12|hary#h0otmail.ocm
Hari|12|hary#h0otmail.ocm
Hasasasi|12|hary#h0otmail.in
Output
+-------------------+
|AVG(scalaUDF(Name))|
+-------------------+
| 4.8|
+-------------------+
+--------------------+
|AVG(scalaUDF(email))|
+--------------------+
| 16.8|
+--------------------+

Related

Combine multiple ArrayType Columns in Spark into one ArrayType Column

I want to merge multiple ArrayType[StringType] columns in spark to create one ArrayType[StringType]. For combining two columns I found the soluton here:
Merge two spark sql columns of type Array[string] into a new Array[string] column
But how do I go about combining, if I don't know the number of columns at compile time. At run time, I will know the names of all the columns to be combined.
One option is to use the UDF defined in the above stackoverflow question, to add two columns, multiple times in a loop. But this involves multiple reads on the entire dataframe. Is there a way to do this in just one go?
+------+------+---------+
| col1 | col2 | combined|
+------+------+---------+
| [a,b]| [i,j]|[a,b,i,j]|
| [c,d]| [k,l]|[c,d,k,l]|
| [e,f]| [m,n]|[e,f,m,n]|
| [g,h]| [o,p]|[g,h,o,p]|
+------+----+-----------+
val arrStr: Array[String] = Array("col1", "col2")
val arrCol: Array[Column] = arrString.map(c => df(c))
val assembleFunc = udf { r: Row => assemble(r.toSeq: _*)}
val outputDf = df.select(col("*"), assembleFunc(struct(arrCol:
_*)).as("combined"))
def assemble(rowEntity: Any*):
collection.mutable.WrappedArray[String] = {
var outputArray =
rowEntity(0).asInstanceOf[collection.mutable.WrappedArray[String]]
rowEntity.drop(1).foreach {
case v: collection.mutable.WrappedArray[String] =>
outputArray ++= v
case null =>
throw new SparkException("Values to assemble cannot be
null.")
case o =>
throw new SparkException(s"$o of type ${o.getClass.getName}
is not supported.")
}
outputArray
}
outputDf.show(false)
Process the dataframe schema and get all the columns of the type ArrayType[StringType].
create a new dataframe with functions.array_union of the first two columns
iterate through the rest of the columns and adding each of them to the combined column
>>>from pyspark import Row
>>>from pyspark.sql.functions import array_union
>>>df = spark.createDataFrame([Row(col1=['aa1', 'bb1'],
col2=['aa2', 'bb2'],
col3=['aa3', 'bb3'],
col4= ['a', 'ee'], foo="bar"
)])
>>>df.show()
+----------+----------+----------+-------+---+
| col1| col2| col3| col4|foo|
+----------+----------+----------+-------+---+
|[aa1, bb1]|[aa2, bb2]|[aa3, bb3]|[a, ee]|bar|
+----------+----------+----------+-------+---+
>>>cols = [col_.name for col_ in df.schema
... if col_.dataType == ArrayType(StringType())
... or col_.dataType == ArrayType(StringType(), False)
... ]
>>>print(cols)
['col1', 'col2', 'col3', 'col4']
>>>
>>>final_df = df.withColumn("combined", array_union(cols[:2][0], cols[:2][1]))
>>>
>>>for col_ in cols[2:]:
... final_df = final_df.withColumn("combined", array_union(col('combined'), col(col_)))
>>>
>>>final_df.select("combined").show(truncate=False)
+-------------------------------------+
|combined |
+-------------------------------------+
|[aa1, bb1, aa2, bb2, aa3, bb3, a, ee]|
+-------------------------------------+

Spark How can I filter out rows that contain char sequences from another dataframe?

So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return back df2 without them.
df1
+--------------------+
| key|
+--------------------+
| Monthly Beginning|
| Annual Percentage|
+--------------------+
df2
+--------------------+--------------------------------+
| key| Value|
+--------------------+--------------------------------+
| Date| 1/1/2018|
| Date| Monthly Beginning on Tuesday|
| Number| Annual Percentage Rate for...|
| Number| 17.5|
+--------------------+--------------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
But that doesn't work and I'm not surprised, but I think it helps show what I am trying to do if my previous explanation was not sufficient enough. Thank you for your help ahead of time.
Convert the first dataframe df1 to List[String] and then create one udf and apply filter condition
Spark-shell-
import org.apache.spark.sql.functions._
//Converting df1 to list
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
//Creating udf , spark stands for spark session
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
//Applying filter
df2.filter("filterUDF(Value)=0").show
//output
+------+--------+
| key| Value|
+------+--------+
| Date|1/1/2018|
|Number| 17.5|
+------+--------+
Scala-IDE -
val sparkSession=SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df1.csv")
val df2=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df2.csv")
import sparkSession.implicits._
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to List. Convert df2 to Dataset.
case class s(key:String,Value:String)
df2Ds = df2.as[s]
Then we can use the filter method to filter out the records.
Somewhat like this.
def check(str:String):Boolean = {
var i = ""
for(i<-df1List)
{
if(str.contains(i))
return false
}
return true
}
df2Ds.filter(s=>check(s.Value)).collect

Spark Scala: How to update each column of a DataFrame in correspondence with each position of a Vector

I have a DF like this:
+--------------------+-----+--------------------+
| col_0|col_1| col_2|
+--------------------+-----+--------------------+
|0.009069428120139292| 0.3|9.015488712438252E-6|
|0.008070826019024355| 0.4|3.379696051366339...|
|0.009774715414895803| 0.1|1.299590589291292...|
|0.009631155146285946| 0.9|1.218569739510422...|
And two Vectors:
v1[7.0,0.007,0.052]
v2[804.0,553.0,143993.0]
The total number of columns is the same as the total number of position in each vector.
How can apply an equation using the numbers saved in the ith position to make some computation to update the current value of the DF (in the ith position)? I mean, I need to update all values in the DF, using the values in the vectors.
Perhaps something like this is what you're after?
import org.apache.spark.sql.Column
import org.apache.spark.sql.DataFrame
val df = Seq((1,2,3),(4,5,6)).toDF
val updateVector = Vector(10,20,30)
val updateFunction = (columnValue: Column, vectorValue: Int) => columnValue * lit(vectorValue)
val updateColumns = (df: DataFrame, updateVector: Vector[Int], updateFunction:((Column, Int) => Column)) => {
val columns = df.columns
updateVector.zipWithIndex.map{case (updateValue, index) => updateFunction(col(columns(index)), updateVector(index)).as(columns(index))}
}
val dfUpdated = df.select(updateColumns(df, updateVector, updateFunction) :_*)
dfUpdated.show
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 10| 40| 90|
| 40|100|180|
+---+---+---+

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, aka with the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join with the old one, but it requires spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value) but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then using getItem to extract individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+

How to Transform Spark Dataframe column for HH:MM:SS:Ms to value in seconds?

I would like to transform a spark dataframe column from its value hour min seconds
E.g "01:12:17.8370000"
Would become 4337 s thanks for the comment.
or "00:00:39.0390000"
would become 39 s.
I have read this question but I am lost on how I can use this code to transform my spark dataframe column.
Convert HH:mm:ss in seconds
Something like this
df.withColumn("duration",col("duration")....)
I am using scala 2.10.5 and spark 1.6
Thank you
Assuming the column "duration" contains the duration in a string, you can just use "unix_timestamp" function of the functions package to get the number of seconds passing the pattern:
import org.apache.spark.sql.functions._
val df = Seq("01:12:17.8370000", "00:00:39.0390000").toDF("duration")
val newColumn = unix_timestamp(col("duration"), "HH:mm:ss")
val result = df.withColumn("duration", newColumn)
result.show
+--------+
|duration|
+--------+
| 4337|
| 39|
+--------+
If you have a string column, you can write a udf to calculate this manually:
val df = Seq("01:12:17.8370000", "00:00:39.0390000").toDF("duration")
def str_sec = udf((s: String) => {
val Array(hour, minute, second) = s.split(":")
hour.toInt * 3600 + minute.toInt * 60 + second.toDouble.toInt
})
df.withColumn("duration", str_sec($"duration")).show
+--------+
|duration|
+--------+
| 4337|
| 39|
+--------+
there are inbuilt functions you can take advantage of which are faster and efficient than using udf functions
given input dataframe as
+----------------+
|duration |
+----------------+
|01:12:17.8370000|
|00:00:39.0390000|
+----------------+
so you can do something like below
df.withColumn("seconds", hour($"duration")*3600+minute($"duration")*60+second($"duration"))
you should be getting output as
+----------------+-------+
|duration |seconds|
+----------------+-------+
|01:12:17.8370000|4337 |
|00:00:39.0390000|39 |
+----------------+-------+