How to run udf on every column in a dataframe? - scala

I have a UDF:
val TrimText = (s: AnyRef) => {
  // does some logic, returns a string
}
And a dataframe:
var df = spark.read.option("sep", ",").option("header", "true").csv(root_path + "/" + file)
I would like to perform TrimText on every value in every column in the dataframe.
However, the problem is that I have a dynamic number of columns. I know I can get the list of columns with df.columns, but I am unsure how that helps here. How can I solve this problem?
TL;DR: apply a UDF to every column in a dataframe when the dataframe has an unknown number of columns.
Attempting to use:
df.columns.foldLeft( df )( (accDF, c) =>
  accDF.withColumn(c, TrimText(col(c)))
)
Throws this error:
error: type mismatch;
found : String
required: org.apache.spark.sql.Column
accDF.withColumn(c, TrimText(col(c)))
TrimText is supposed to return a string and expects its input to be a value from a column, so it will standardize every value in every row of the entire dataframe.

You can use foldLeft to traverse the column list, iteratively applying withColumn with your UDF:
df.columns.foldLeft( df )( (accDF, c) =>
  accDF.withColumn(c, TrimText(col(c)))
)
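Note that if TrimText is a plain Scala function (as in the question) rather than a Spark UDF, this will not compile, which matches the type mismatch error above. A minimal sketch, assuming the trimming logic is just a null-safe String.trim, wraps it with udf first:
import org.apache.spark.sql.functions.{col, udf}

// Assumed trimming logic; substitute your own implementation
val trimTextUdf = udf((s: String) => if (s == null) null else s.trim)

val trimmedDf = df.columns.foldLeft(df)((accDF, c) =>
  accDF.withColumn(c, trimTextUdf(col(c)))
)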

>> I would like to perform TrimText on every value in every column in the dataframe.
>> I have a dynamic number of columns.
When a SQL function is already available for trimming, why use a UDF? The code below may fit your needs:
import org.apache.spark.sql.functions._

def buildcolumnlst(myCols: Set[String]) = {
  myCols.map(x => "TrimText(" + x + ")" + " as " + x).mkString(",")
}

spark.udf.register("TrimText", (x: String) => ..... )

val df2 = sc.parallelize(List(
  (26, true, 60000.00),
  (32, false, 35000.00)
)).toDF("age", "education", "income")

val cols2 = df2.columns.toSet
df2.createOrReplaceTempView("table1")

val query = "select " + buildcolumnlst(cols2) + " from table1 "
println(query)

val dfresult = spark.sql(query)
dfresult.show()
Results (the generated query, then the output):
select TrimText(age) as age,TrimText(education) as education,TrimText(income) as income from table1
+---+---------+-------+
|age|education| income|
+---+---------+-------+
| 26|     true|60000.0|
| 32|    false|35000.0|
+---+---------+-------+
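If you prefer to stay in the DataFrame API, a minimal variant of the same idea uses selectExpr instead of a temp view (this assumes the TrimText UDF has been registered as above):
// same generated expressions, no temp view or spark.sql needed
val trimmedDf2 = df2.selectExpr(df2.columns.map(c => s"TrimText($c) AS $c"): _*)
trimmedDf2.show()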

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val a = sc.parallelize(Seq(("1 ", " 2"), (" 3", "4"))).toDF()

def TrimText(s: Column): Column = {
  // the trimming logic, returning a Column
  trim(s)
}

a.select(a.columns.map(c => TrimText(col(c))): _*).show
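Because this version works Column-to-Column on top of the built-in trim, it needs no UDF registration, and Catalyst can optimise the resulting expressions instead of treating them as an opaque UDF.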

Related

Combine multiple ArrayType Columns in Spark into one ArrayType Column

I want to merge multiple ArrayType[StringType] columns in Spark into one ArrayType[StringType] column. For combining two columns I found the solution here:
Merge two spark sql columns of type Array[string] into a new Array[string] column
But how do I go about combining them if I don't know the number of columns at compile time? At run time, I will know the names of all the columns to be combined.
One option is to use the UDF defined in the above Stack Overflow question and add two columns at a time in a loop, but this involves multiple passes over the entire dataframe. Is there a way to do this in just one go?
+------+------+---------+
| col1 | col2 | combined|
+------+------+---------+
| [a,b]| [i,j]|[a,b,i,j]|
| [c,d]| [k,l]|[c,d,k,l]|
| [e,f]| [m,n]|[e,f,m,n]|
| [g,h]| [o,p]|[g,h,o,p]|
+------+------+---------+
import org.apache.spark.SparkException
import org.apache.spark.sql.{Column, Row}
import org.apache.spark.sql.functions.{col, struct, udf}

def assemble(rowEntity: Any*): collection.mutable.WrappedArray[String] = {
  var outputArray =
    rowEntity(0).asInstanceOf[collection.mutable.WrappedArray[String]]
  rowEntity.drop(1).foreach {
    case v: collection.mutable.WrappedArray[String] =>
      outputArray ++= v
    case null =>
      throw new SparkException("Values to assemble cannot be null.")
    case o =>
      throw new SparkException(s"$o of type ${o.getClass.getName} is not supported.")
  }
  outputArray
}

val arrStr: Array[String] = Array("col1", "col2")
val arrCol: Array[Column] = arrStr.map(c => df(c))

val assembleFunc = udf { r: Row => assemble(r.toSeq: _*) }

val outputDf = df.select(col("*"), assembleFunc(struct(arrCol: _*)).as("combined"))
outputDf.show(false)
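As a side note, if you are on Spark 2.4 or later, a minimal sketch that avoids the UDF entirely uses the built-in concat, which also accepts array columns and preserves duplicates (unlike array_union, used in the next answer, which removes them):
import org.apache.spark.sql.functions.{col, concat}

// Spark 2.4+: concat works on array columns as well as strings
val combinedDf = df.withColumn("combined", concat(arrStr.map(c => col(c)): _*))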
Process the dataframe schema and get all the columns of type ArrayType[StringType].
Create a new dataframe with functions.array_union of the first two columns.
Iterate through the rest of the columns, adding each of them to the combined column.
>>>from pyspark.sql import Row
>>>from pyspark.sql.functions import array_union, col
>>>from pyspark.sql.types import ArrayType, StringType
>>>df = spark.createDataFrame([Row(col1=['aa1', 'bb1'],
col2=['aa2', 'bb2'],
col3=['aa3', 'bb3'],
col4= ['a', 'ee'], foo="bar"
)])
>>>df.show()
+----------+----------+----------+-------+---+
| col1| col2| col3| col4|foo|
+----------+----------+----------+-------+---+
|[aa1, bb1]|[aa2, bb2]|[aa3, bb3]|[a, ee]|bar|
+----------+----------+----------+-------+---+
>>>cols = [col_.name for col_ in df.schema
... if col_.dataType == ArrayType(StringType())
... or col_.dataType == ArrayType(StringType(), False)
... ]
>>>print(cols)
['col1', 'col2', 'col3', 'col4']
>>>
>>>final_df = df.withColumn("combined", array_union(cols[:2][0], cols[:2][1]))
>>>
>>>for col_ in cols[2:]:
... final_df = final_df.withColumn("combined", array_union(col('combined'), col(col_)))
>>>
>>>final_df.select("combined").show(truncate=False)
+-------------------------------------+
|combined |
+-------------------------------------+
|[aa1, bb1, aa2, bb2, aa3, bb3, a, ee]|
+-------------------------------------+

Dataframe to RDD[Row] replacing space with nulls

I am converting a Spark dataframe to RDD[Row] so I can map it to the final schema to write into a Hive ORC table. I want to convert any blank or whitespace-only value in the input to an actual null, so the Hive table stores a real null instead of an empty string.
Input DataFrame (a single column with pipe-delimited values):
col1
1|2|3||5|6|7|||...|
My code:
inputDF.rdd.
  map { x: Row => x.get(0).asInstanceOf[String].split("\\|", -1) }.
  map { x => Row(nullConverter(x(0)), nullConverter(x(1)), nullConverter(x(2)) .... nullConverter(x(200))) }

def nullConverter(input: String): String = {
  if (input.trim.length > 0) input.trim
  else null
}
Is there any clean way of doing this rather than calling the nullConverter function 200 times?
Update based on single column:
Going with your approach, I will do something like:
inputDf.rdd.map((row: Row) => {
  val values = row.get(0).asInstanceOf[String].split("\\|", -1).map(nullConverter)
  Row(values: _*)
})
Make your nullConverter or any other logic a udf:
import org.apache.spark.sql.functions._
val nullConverter = udf((input: String) => {
  if (input.trim.length > 0) input.trim
  else null
})
Now, use the udf on your df and apply to all columns:
val convertedDf = inputDf.select(inputDf.columns.map(c => nullConverter(col(c)).alias(c)):_*)
Now, you can do your RDD logic.
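For example, a minimal sketch of the hand-off back to the RDD API, assuming the rest of your mapping logic stays the same:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// empty and whitespace-only strings have already been replaced with nulls here
val finalRdd: RDD[Row] = convertedDf.rdd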
This would be easier to do using the DataFrame API before converting to an RDD. First, split the data:
val df = Seq(("1|2|3||5|6|7|8||")).toDF("col0") // Example dataframe
val df2 = df.withColumn("col0", split($"col0", "\\|")) // Split on "|"
Then find out the length of the array:
val numCols = df2.first.getAs[Seq[String]](0).length
Now, for each element in the array, use the nullConverter UDF and then assign it to its own column.
val nullConverter = udf((input: String) => {
  if (input.trim.length > 0) input.trim
  else null
})
val df3 = df2.select((0 until numCols).map(i => nullConverter($"col0".getItem(i)).as("col" + i)): _*)
The result using the example dataframe:
+----+----+----+----+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|col6|col7|col8|col9|
+----+----+----+----+----+----+----+----+----+----+
|   1|   2|   3|null|   5|   6|   7|   8|null|null|
+----+----+----+----+----+----+----+----+----+----+
Now convert it to an RDD or continue using the data as a DataFrame depending on your needs.
There is no point in converting the dataframe to an RDD:
from pyspark.sql.functions import col, regexp_replace

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])

df.select([regexp_replace(col(c), " ", "NULL").alias(c) for c in df.columns]).show()
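If the goal is real SQL nulls rather than the literal string "NULL", a Scala sketch of the same no-RDD idea (assuming string-typed columns) would be:
import org.apache.spark.sql.functions.{col, length, trim, when}

// when(...) with no otherwise(...) yields null whenever the condition is false
val cleaned = inputDf.select(inputDf.columns.map(c =>
  when(length(trim(col(c))) > 0, trim(col(c))).as(c)
): _*)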

Spark: Convert/concat an WrappedArray of Row to a String

I have a Spark dataframe with a column of type array<struct<_1:long,_2:string>> and the following sample data:
WrappedArray([Value,Title1], [Value,Title2], [Value,Title3])
I want to convert this column from a WrappedArray to a single String.
Here is the desired output:
Value+Title1,Value+Title2,Value+Title3
I tried the following udf passing that column of the dataframe:
val f = (x:Seq[Row]) => x.mkString(",")
sqlContext.udf.register("f", f)
But the result is [Value,Title1],[Value,Title2],[Value,Title3]
Try this:
val f = (x: Seq[Row]) => x.map(row => "%s+%s".format(row.getString(0), row.getString(1))).mkString(",")
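A minimal usage sketch, assuming the UDF is registered as "f" above and that the array<struct<...>> column is named arr (a hypothetical name):
import org.apache.spark.sql.functions.{callUDF, col}

// apply the registered UDF to the hypothetical "arr" column
val result = df.withColumn("concatenated", callUDF("f", col("arr")))
result.show(false)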

DataFrame: Append column name to rows data

I'm looking for a way to prefix each row's values in a dataframe with the corresponding column name.
The number of columns can differ from time to time.
I'm on Spark 1.4.1.
I have a dataframe:
Edit: all data is of String type only.
+---+----------+
|key|     value|
+---+----------+
|foo|       bar|
|bar|  one, two|
+---+----------+
I'd like to get:
+-------+---------------------+
|    key|                value|
+-------+---------------------+
|key_foo|            value_bar|
|key_bar| value_one, value_two|
+-------+---------------------+
I tried
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val concatColNamesWithElems = udf { seq: Seq[Row] =>
  seq.map { case Row(y: String) => (col + "_" + y) }
}
Save the DataFrame as a table (e.g. dfTable), so that you can write SQL against it:
df.registerTempTable("dfTable")
Create and register a UDF. I'll assume your value column type is String:
sqlContext.udf.register("prefix", (columnVal: String, prefix: String) =>
  columnVal.split(",").map(x => prefix + "_" + x.trim).mkString(", ")
)
Use the UDF in a query:
// prepare the column expressions: wrap every column in the UDF and alias it back to its name
// e.g. prefix(key, "key") AS key -- you can check this representation with the println below
val columns = df.columns.map(col => s"""prefix($col, "$col") AS $col """).mkString(",")
println(columns) // for testing how the columns are framed
val resultDf = sqlContext.sql("SELECT " + columns + " FROM dfTable")
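With the two-column example dataframe above, the generated query should look roughly like:
SELECT prefix(key, "key") AS key ,prefix(value, "value") AS value  FROM dfTable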

apache spark groupBy pivot function

I am new to Spark and am using Spark 1.6.1. I am using the pivot function to create a new column based on an integer value. Say I have a csv file like this:
year,winds
1990,50
1990,55
1990,58
1991,45
1991,42
1991,58
I am loading the csv file like this:
var df =sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data/sample.csv")
I want to aggregate the winds column, counting the winds greater than 55, so that I get an output file like this:
year, majorwinds
1990,2
1991,1
I am using the code below:
val df2=df.groupBy("major").pivot("winds").agg(>55)->"count")
But I get this error
error: expected but integer literal found
What is the correct syntax here? Thanks in advance
In your case, if you just want output like:
+----+----------+
|year|majorwinds|
+----+----------+
|1990|         2|
|1991|         1|
+----+----------+
It's not necessary to use pivot here.
You can achieve this with filter, groupBy and count:
df.filter($"winds" >= 55)
  .groupBy($"year")
  .count()
  .withColumnRenamed("count", "majorwinds")
  .show()
Use this generic function to do the pivot (note that dynamicRow and getSchema are helper functions that are not shown in this snippet):
def transpose(sqlCxt: SQLContext, df: DataFrame, compositeId: Vector[String], pair: (String, String), distinctCols: Array[Any]): DataFrame = {
  val rdd = df.map { row =>
    (compositeId.collect { case id => row.getAs(id).asInstanceOf[Any] },
     scala.collection.mutable.Map(row.getAs(pair._1).asInstanceOf[Any] -> row.getAs(pair._2).asInstanceOf[Any]))
  }
  val pairRdd = rdd.reduceByKey(_ ++ _)
  val rowRdd = pairRdd.map(r => dynamicRow(r, distinctCols))
  sqlCxt.createDataFrame(rowRdd, getSchema(compositeId ++ distinctCols))
}