DataFrame: Append column name to rows data - scala

I'm looking for a way to prepend the column name to each value in a DataFrame's rows.
The number of columns can vary from run to run.
I'm on Spark 1.4.1.
I have a DataFrame (edit: all data is of String type only):
+---+----------+
|key| value|
+---+----------+
|foo| bar|
|bar| one, two|
+---+----------+
I'd like to get :
+-------+---------------------+
|key | value|
+-------+---------------------+
|key_foo| value_bar|
|key_bar| value_one, value_two|
+-------+---------------------+
I tried
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val concatColNamesWithElems = udf { seq: Seq[Row] =>
seq.map { case Row(y: String) => (col +"_"+y)}}

Register the DataFrame as a temporary table (e.g. dfTable) so that you can run SQL against it:
df.registerTempTable("dfTable")
Create and register a UDF (I assume your value column is of type String):
sqlContext.udf.register("prefix", (columnVal: String, prefix: String) =>
columnVal.split(",").map(x => prefix + "_" + x.trim).mkString(", ")
)
Use the UDF in a query:
// Build one expression per column that applies the UDF and keeps the column name via AS,
// e.g. prefix(key, "key") AS key
val columns = df.columns.map(col => s"""prefix($col, "$col") AS $col """).mkString(",")
println(columns) //for testing how columns framed
val resultDf = sqlContext.sql("SELECT " + columns + " FROM dfTable")
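To see what the registered function and the generated SELECT list do without starting Spark, the same logic can be sketched in plain Scala. The object and method names below (PrefixSketch, prefixValues, selectList) are hypothetical and exist only for illustration:

```scala
// Hypothetical helpers mirroring the body of the "prefix" UDF above (plain Scala, no Spark needed).
object PrefixSketch {
  // Split a comma-separated value, prefix each trimmed element, and join the elements again.
  def prefixValues(columnVal: String, prefix: String): String =
    columnVal.split(",").map(x => prefix + "_" + x.trim).mkString(", ")

  // Build the SELECT list one expression per column name, as the answer does.
  def selectList(columns: Seq[String]): String =
    columns.map(c => s"""prefix($c, "$c") AS $c""").mkString(", ")
}
```

For example, PrefixSketch.prefixValues("one, two", "value") yields "value_one, value_two", which matches the desired value column in the question.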

Related

Apply a transformation to all the columns with the same data type on Spark

I need to apply a transformation to all the Integer columns of my DataFrame before writing a CSV. The transformation consists of changing the type to String and then converting the value to the European number format (e.g. 1234567 -> "1234567" -> "1.234.567").
Does Spark have a way to apply this transformation to all the Integer columns? I want it to be generic (because I need to write multiple CSVs) instead of hardcoding the columns to transform for each DataFrame.
DataFrame has a dtypes method, which returns the column names along with their data types as an array of (column name, data type) pairs of strings.
You can map over this array, applying a different expression to each column based on its data type, and then pass the mapped list to the select method:
import spark.implicits._
import org.apache.spark.sql.functions._
val dataSeq = Seq(
(1246984, 993922, "test_1"),
(246984, 993922, "test_2"),
(246984, 993922, "test_3"))
val df = dataSeq.toDF("int_1", "int_2", "str_3")
df.show
+-------+------+------+
| int_1| int_2| str_3|
+-------+------+------+
|1246984|993922|test_1|
| 246984|993922|test_2|
| 246984|993922|test_3|
+-------+------+------+
val columns =
df.dtypes.map{
case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
case (c, t) => col(c)
}
val df2 = df.select(columns:_*)
df2.show
+---------+-------+------+
| int_1| int_2| str_3|
+---------+-------+------+
|1.246.984|993.922|test_1|
|  246.984|993.922|test_2|
|  246.984|993.922|test_3|
+---------+-------+------+
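The two-step idea behind the expression above (group the digits with commas, then swap the commas for dots) can be tried on a single value in plain Scala. This is only a sketch; EuropeanFormatSketch and toEuropean are hypothetical names, and it uses the standard library's locale-aware formatting rather than Spark's format_number:

```scala
import java.util.Locale

object EuropeanFormatSketch {
  // Group digits with the US locale (comma separators), then replace commas with dots,
  // analogous to format_number followed by regexp_replace.
  def toEuropean(n: Int): String =
    "%,d".formatLocal(Locale.US, n).replace(',', '.')
}
```

EuropeanFormatSketch.toEuropean(1234567) returns "1.234.567", the format requested in the question.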

How can i check for empty values on spark Dataframe using User defined functions

I have this user-defined function to check whether the values in the text column are empty:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = Seq(
  (0, "", "Mongo"),
  (1, "World", "sql"),
  (2, "", "")
).toDF("id", "text", "Source")
// Define a "regular" Scala function
val checkEmpty: String => Boolean = x => {
  var test = false
  if (x.isEmpty) {
    test = true
  }
  test
}
// Wrap the function as a UDF and apply it to the text column
val upper = udf(checkEmpty)
df.withColumn("isEmpty", upper('text)).show
I'm actually getting this dataframe:
+---+-----+------+-------+
| id| text|Source|isEmpty|
+---+-----+------+-------+
| 0| | Mongo| true|
| 1|World| sql| false|
| 2| | | true|
+---+-----+------+-------+
How can I check every row for empty values and return a message like:
id 0 has the text column with empty values
id 2 has the text,source column with empty values
A UDF that receives the columns to check as a Row (built with struct) can return the names of the empty columns. Rows without any empty column can then be filtered out:
val emptyColumnList = (r: Row) => r
.toSeq
.zipWithIndex
.filter(_._1.toString().isEmpty)
.map(pair => r.schema.fields(pair._2).name)
val emptyColumnListUDF = udf(emptyColumnList)
val columnsToCheck = Seq($"text", $"Source")
val result = df
.withColumn("EmptyColumns", emptyColumnListUDF(struct(columnsToCheck: _*)))
.where(size($"EmptyColumns") > 0)
.select(format_string("id %s has the %s columns with empty values", $"id", $"EmptyColumns").alias("description"))
Result:
+----------------------------------------------------+
|description |
+----------------------------------------------------+
|id 0 has the [text] columns with empty values |
|id 2 has the [text,Source] columns with empty values|
+----------------------------------------------------+
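The per-row logic of such a UDF can be sketched in plain Scala, which makes it easy to test outside Spark. The names EmptyColumnsSketch, emptyColumns and describe are hypothetical, and treating null as empty is an assumption that goes beyond the answer's code:

```scala
object EmptyColumnsSketch {
  // Return the names of the columns whose value is an empty string.
  // Assumption: a null value is also reported as empty.
  def emptyColumns(names: Seq[String], values: Seq[Any]): Seq[String] =
    names.zip(values).collect {
      case (name, null)                   => name
      case (name, s: String) if s.isEmpty => name
    }

  // Build the message for one row, or None when no column is empty.
  def describe(id: Int, names: Seq[String], values: Seq[Any]): Option[String] = {
    val empty = emptyColumns(names, values)
    if (empty.isEmpty) None
    else Some(s"id $id has the ${empty.mkString(",")} column with empty values")
  }
}
```

Returning an Option mirrors the where(size(...) > 0) filter: rows without empty columns simply produce no message.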
You could do something like this:
case class IsEmptyRow(id: Int, description: String) // case class that provides the output column names
val isEmptyDf = df.map {
  row => row.getInt(row.fieldIndex("id")) -> row // the row's id becomes the first element
    .toSeq // for the second element, convert the row's values to a Seq
    .zip(df.columns) // zip the values with the column names
    .collect { // keep the column name when the value is an empty string
      case (value: String, column) if value.isEmpty => column
    }
}.map { // build the description string and pack the result into the case class
  case (id, Nil) => IsEmptyRow(id, s"id $id has no columns with empty values")
  case (id, List(column)) => IsEmptyRow(id, s"id $id has the $column column with empty values")
  case (id, columns) => IsEmptyRow(id, s"id $id has the ${columns.mkString(", ")} columns with empty values")
}
Then running isEmptyDf.show(truncate = false) will show:
+---+---------------------------------------------------+
|id |description |
+---+---------------------------------------------------+
|0 |id 0 has the text columns with empty values |
|1 |id 1 has no columns with empty values |
|2 |id 2 has the text, Source columns with empty values|
+---+---------------------------------------------------+
You can also join back with original dataset:
df.join(isEmptyDf, "id").show(truncate = false)

How to run udf on every column in a dataframe?

I have a UDF:
val TrimText = (s: AnyRef) => {
//does logic returns string
}
And a dataframe:
var df = spark.read.option("sep", ",").option("header", "true").csv(root_path + "/" + file)
I would like to perform TrimText on every value in every column in the dataframe.
However, the problem is, I have a dynamic number of columns. I know I can get the list of columns by df.columns. But I am unsure on how this will help me with my issue. How can I solve this problem?
TLDR Issue - Performing a UDF on every column in a dataframe, when the dataframe has an unknown number of columns
Attempting to use:
df.columns.foldLeft( df )( (accDF, c) =>
accDF.withColumn(c, TrimText(col(c)))
)
Throws this error:
error: type mismatch;
found : String
required: org.apache.spark.sql.Column
accDF.withColumn(c, TrimText(col(c)))
TrimText is supposed to return a string and expects its input to be a value in a column, so it standardizes every value in every row of the entire DataFrame.
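You can use foldLeft to traverse the column list and iteratively apply withColumn to the DataFrame using your UDF. Note that withColumn expects a Column, so the function has to be wrapped with udf(...) first; calling a plain Scala function that returns a String produces the type mismatch shown above: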
df.columns.foldLeft( df )( (accDF, c) =>
accDF.withColumn(c, TrimText(col(c)))
)
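The foldLeft pattern itself is independent of Spark: the accumulator starts as the original value and each step returns an updated copy. A minimal plain-Scala analogue, with a Map standing in for a row and a hypothetical trimText standing in for the real TrimText logic (which the question does not show):

```scala
object FoldLeftSketch {
  // Stand-in for the real TrimText logic; an assumption for illustration only.
  def trimText(s: String): String = if (s == null) null else s.trim

  // Visit every column name and replace that column's value with the transformed one,
  // threading the updated row through the fold as the accumulator.
  def transformAll(row: Map[String, String], columns: Seq[String]): Map[String, String] =
    columns.foldLeft(row) { (acc, c) => acc.updated(c, trimText(acc(c))) }
}
```

In the DataFrame version the accumulator is the DataFrame and acc.updated corresponds to withColumn, so the number of columns never has to be known in advance.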
>> I would like to perform TrimText on every value in every column in the dataframe.
>> I have a dynamic number of columns.
Why use a UDF when a built-in SQL function for trimming is available? See whether the code below fits your case:
import org.apache.spark.sql.functions._
spark.udf.register("TrimText", (x:String) => ..... )
val df2 = sc.parallelize(List(
(26, true, 60000.00),
(32, false, 35000.00)
)).toDF("age", "education", "income")
val cols2 = df2.columns.toSet
df2.createOrReplaceTempView("table1")
val query = "select " + buildcolumnlst(cols2) + " from table1 "
println(query)
val dfresult = spark.sql(query)
dfresult.show()
def buildcolumnlst(myCols: Set[String]) = {
myCols.map(x => "TrimText(" + x + ")" + " as " + x).mkString(",")
}
Result:
select trim(age) as age,trim(education) as education,trim(income) as income from table1
+---+---------+-------+
|age|education| income|
+---+---------+-------+
| 26| true|60000.0|
| 32| false|35000.0|
+---+---------+-------+
val a = sc.parallelize(Seq(("1 "," 2"),(" 3","4"))).toDF()
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def TrimText(s: Column): Column = {
//does logic returns string
trim(s)
}
a.select(a.columns.map(c => TrimText(col(c))):_*).show

Combine multiple ArrayType Columns in Spark into one ArrayType Column

I want to merge multiple ArrayType[StringType] columns in Spark into a single ArrayType[StringType] column. For combining two columns I found a solution here:
Merge two spark sql columns of type Array[string] into a new Array[string] column
But how do I combine them if I don't know the number of columns at compile time? At run time, I will know the names of all the columns to be combined.
One option is to apply the UDF from the above Stack Overflow question, which combines two columns, repeatedly in a loop. But this involves multiple passes over the entire DataFrame. Is there a way to do this in one go?
+------+------+---------+
| col1 | col2 | combined|
+------+------+---------+
| [a,b]| [i,j]|[a,b,i,j]|
| [c,d]| [k,l]|[c,d,k,l]|
| [e,f]| [m,n]|[e,f,m,n]|
| [g,h]| [o,p]|[g,h,o,p]|
+------+------+---------+
import org.apache.spark.SparkException
import org.apache.spark.sql.{Column, Row}
import org.apache.spark.sql.functions.{col, struct, udf}

// Concatenate all array values of a row into a single array
def assemble(rowEntity: Any*): collection.mutable.WrappedArray[String] = {
  var outputArray = rowEntity(0).asInstanceOf[collection.mutable.WrappedArray[String]]
  rowEntity.drop(1).foreach {
    case v: collection.mutable.WrappedArray[String] =>
      outputArray ++= v
    case null =>
      throw new SparkException("Values to assemble cannot be null.")
    case o =>
      throw new SparkException(s"$o of type ${o.getClass.getName} is not supported.")
  }
  outputArray
}

// Column names are only known at run time
val arrStr: Array[String] = Array("col1", "col2")
val arrCol: Array[Column] = arrStr.map(c => df(c))
// Pack the columns into a struct so the UDF sees them as a single Row
val assembleFunc = udf { r: Row => assemble(r.toSeq: _*) }
val outputDf = df.select(col("*"), assembleFunc(struct(arrCol: _*)).as("combined"))
outputDf.show(false)
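Process the DataFrame schema and get all the columns of type ArrayType[StringType].
Create a new DataFrame with functions.array_union of the first two columns.
Iterate through the rest of the columns, adding each of them to the combined column.
Note that array_union removes duplicate elements; if duplicates must be kept, concat (which accepts array columns since Spark 2.4) can be used instead.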
>>>from pyspark import Row
>>>from pyspark.sql.functions import array_union, col
>>>from pyspark.sql.types import ArrayType, StringType
>>>df = spark.createDataFrame([Row(col1=['aa1', 'bb1'],
col2=['aa2', 'bb2'],
col3=['aa3', 'bb3'],
col4= ['a', 'ee'], foo="bar"
)])
>>>df.show()
+----------+----------+----------+-------+---+
| col1| col2| col3| col4|foo|
+----------+----------+----------+-------+---+
|[aa1, bb1]|[aa2, bb2]|[aa3, bb3]|[a, ee]|bar|
+----------+----------+----------+-------+---+
>>>cols = [col_.name for col_ in df.schema
... if col_.dataType == ArrayType(StringType())
... or col_.dataType == ArrayType(StringType(), False)
... ]
>>>print(cols)
['col1', 'col2', 'col3', 'col4']
>>>
>>>final_df = df.withColumn("combined", array_union(cols[:2][0], cols[:2][1]))
>>>
>>>for col_ in cols[2:]:
... final_df = final_df.withColumn("combined", array_union(col('combined'), col(col_)))
>>>
>>>final_df.select("combined").show(truncate=False)
+-------------------------------------+
|combined |
+-------------------------------------+
|[aa1, bb1, aa2, bb2, aa3, bb3, a, ee]|
+-------------------------------------+
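Whether the arrays are concatenated or unioned matters when values repeat across columns, since array_union drops duplicates. A plain-Scala sketch of the two semantics for an arbitrary number of arrays (CombineArraysSketch, concatAll and unionAll are hypothetical names used only here):

```scala
object CombineArraysSketch {
  // Concatenate every array in order, keeping duplicates.
  def concatAll(arrays: Seq[Seq[String]]): Seq[String] = arrays.flatten

  // Union semantics: same traversal, but repeated elements are dropped.
  def unionAll(arrays: Seq[Seq[String]]): Seq[String] = arrays.flatten.distinct
}
```

With the inputs Seq("a", "b") and Seq("b", "c"), concatAll keeps both occurrences of "b" while unionAll keeps only the first.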

Dataframe to RDD[Row] replacing space with nulls

I am converting a Spark DataFrame to RDD[Row] so I can map it to the final schema and write it into a Hive ORC table. I want to convert any blank value in the input to an actual null so that the Hive table stores a real null instead of an empty string.
Input DataFrame (a single column with pipe delimited values):
col1
1|2|3||5|6|7|||...|
My code:
inputDF.rdd.
map { x: Row => x.get(0).asInstanceOf[String].split("\\|", -1)}.
map { x => Row (nullConverter(x(0)),nullConverter(x(1)),nullConverter(x(2)).... nullConverter(x(200)))}
def nullConverter(input: String): String = {
if (input.trim.length > 0) input.trim
else null
}
Is there a cleaner way of doing this than calling the nullConverter function 200 times?
Update based on single column:
Going with your approach, I will do something like:
inputDf.rdd.map((row: Row) => {
  // Limit -1 keeps trailing empty fields; Row.fromSeq creates one field per value
  val values = row.get(0).asInstanceOf[String].split("\\|", -1).map(nullConverter)
  Row.fromSeq(values)
})
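The string handling in this approach can be checked in plain Scala before wiring it into the RDD. The sketch below reuses the nullConverter from the question and the limit of -1 from the question's split call; NullConverterSketch and convertLine are hypothetical names:

```scala
object NullConverterSketch {
  // Same logic as the question's nullConverter: blank input becomes null.
  def nullConverter(input: String): String =
    if (input.trim.length > 0) input.trim else null

  // Split one pipe-delimited line and convert every field in a single pass.
  // Limit -1 keeps trailing empty fields, which split would otherwise drop.
  def convertLine(line: String): Seq[String] =
    line.split("\\|", -1).map(nullConverter).toSeq
}
```

Mapping over the split result replaces the 200 explicit nullConverter calls, whatever the number of fields in the line.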
Make your nullConverter or any other logic a udf:
import org.apache.spark.sql.functions._
val nullConverter = udf((input: String) => {
if (input.trim.length > 0) input.trim
else null
})
Now, use the udf on your df and apply to all columns:
val convertedDf = inputDf.select(inputDf.columns.map(c => nullConverter(col(c)).alias(c)):_*)
Now, you can do your RDD logic.
This would be easier to do using the DataFrame API before converting to an RDD. First, split the data:
val df = Seq(("1|2|3||5|6|7|8||")).toDF("col0") // Example dataframe
val df2 = df.withColumn("col0", split($"col0", "\\|")) // Split on "|"
Then find out the length of the array:
val numCols = df2.first.getAs[Seq[String]](0).length
Now, for each element in the array, apply the nullConverter UDF and assign the result to its own column.
val nullConverter = udf((input: String) => {
if (input.trim.length > 0) input.trim
else null
})
val df3 = df2.select((0 until numCols).map(i => nullConverter($"col0".getItem(i)).as("col" + i)): _*)
The result using the example dataframe:
+----+----+----+----+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|col6|col7|col8|col9|
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3|null| 5| 6| 7| 8|null|null|
+----+----+----+----+----+----+----+----+----+----+
Now convert it to an RDD or continue using the data as a DataFrame depending on your needs.
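There is no need to convert the DataFrame to an RDD; blank values can be replaced with null directly in the DataFrame API:
import org.apache.spark.sql.functions._
val df = Seq((1, "foo bar"), (2, "foobar "), (3, " ")).toDF("k", "v")
// Replace whitespace-only values with a real null in every column
df.select(df.columns.map(c => when(trim(col(c)) === "", lit(null)).otherwise(col(c)).as(c)): _*)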