I have a Spark dataframe with a column of datatype array<struct<_1:long,_2:string>> and the following sample data:
WrappedArray([Value,Title1], [Value,Title2], [Value,Title3])
I want to convert this column from a WrappedArray to a single String.
Here is the desired output:
Value+Title1,Value+Title2,Value+Title3
I tried the following UDF, passing that column of the dataframe to it:
val f = (x:Seq[Row]) => x.mkString(",")
sqlContext.udf.register("f", f)
But the result is [Value,Title1],[Value,Title2],[Value,Title3]
Try this:
val f = (x:Seq[Row]) => x.map(row => "%s+%s".format(row.getString(0),
row.getString(1))).mkString(",")
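For completeness, here is a minimal self-contained sketch of the whole flow. The sample data, the column names id and arr, and the selectExpr call are my own assumptions (not from the original question); it assumes a spark-shell session so toDF is in scope:
import org.apache.spark.sql.Row
// Hypothetical data matching the array<struct<_1:long,_2:string>> schema
val df = Seq(
  (1, Seq((1L, "Title1"), (2L, "Title2"), (3L, "Title3")))
).toDF("id", "arr")
// Register the mapping UDF and apply it via SQL
sqlContext.udf.register("f", (xs: Seq[Row]) => xs.map(r => s"${r.get(0)}+${r.getString(1)}").mkString(","))
df.selectExpr("id", "f(arr) as arr_str").show(false)
// arr_str: 1+Title1,2+Title2,3+Title3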
Related
I have a UDF:
val TrimText = (s: AnyRef) => {
//does logic returns string
}
And a dataframe:
var df = spark.read.option("sep", ",").option("header", "true").csv(root_path + "/" + file)
I would like to perform TrimText on every value in every column in the dataframe.
However, the problem is that I have a dynamic number of columns. I know I can get the list of columns with df.columns, but I am unsure how that helps with my issue. How can I solve this problem?
TLDR Issue - Performing a UDF on every column in a dataframe, when the dataframe has an unknown number of columns
Attempting to use:
df.columns.foldLeft( df )( (accDF, c) =>
accDF.withColumn(c, TrimText(col(c)))
)
Throws this error:
error: type mismatch;
found : String
required: org.apache.spark.sql.Column
accDF.withColumn(c, TrimText(col(c)))
TrimText is supposed to return a string and expects the input to be a value from a column, so it will standardize every value in every row of the entire dataframe.
You can use foldLeft to traverse the column list and apply withColumn iteratively. Note that withColumn expects a Column, which is exactly what the "found: String, required: Column" error is pointing at: a plain Scala function has to be wrapped in udf first so that it produces a Column:
import org.apache.spark.sql.functions.{col, udf}
val trimTextUDF = udf((s: String) => TrimText(s)) // assumes TrimText returns a String
df.columns.foldLeft( df )( (accDF, c) =>
  accDF.withColumn(c, trimTextUDF(col(c)))
)
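As a quick sanity check, a self-contained sketch of the same pattern with a trivial trimming function (the sample data and column names are made up; assumes a spark-shell session so toDF is available):
import org.apache.spark.sql.functions.{col, udf}
val sample = Seq((" a ", " b "), ("c ", " d")).toDF("x", "y")
val trimUDF = udf((s: String) => if (s == null) null else s.trim)
val cleaned = sample.columns.foldLeft(sample)((acc, c) => acc.withColumn(c, trimUDF(col(c))))
cleaned.show() // x and y are trimmed in every row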
>> I would like to perform TrimText on every value in every column in the dataframe.
>> I have a dynamic number of columns.
If a built-in SQL function such as trim already does what you need, you may not need a UDF at all; either way, you can build the SELECT list over all columns dynamically, as in the code below:
import org.apache.spark.sql.functions._
spark.udf.register("TrimText", (x:String) => ..... )
val df2 = sc.parallelize(List(
(26, true, 60000.00),
(32, false, 35000.00)
)).toDF("age", "education", "income")
val cols2 = df2.columns.toSet
df2.createOrReplaceTempView("table1")
def buildcolumnlst(myCols: Set[String]) = {
  myCols.map(x => "TrimText(" + x + ") as " + x).mkString(",")
}
val query = "select " + buildcolumnlst(cols2) + " from table1"
println(query)
val dfresult = spark.sql(query)
dfresult.show()
Result:
select TrimText(age) as age,TrimText(education) as education,TrimText(income) as income from table1
+---+---------+-------+
|age|education| income|
+---+---------+-------+
| 26| true|60000.0|
| 32| false|35000.0|
+---+---------+-------+
Alternatively, skip UDFs entirely and work with Column functions, applying them to every column in a single select:
val a = sc.parallelize(Seq(("1 "," 2"),(" 3","4"))).toDF()
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def TrimText(s: Column): Column = {
  // put your logic here; this version just trims the value
  trim(s)
}
a.select(a.columns.map(c => TrimText(col(c))):_*).show
I'm trying to change the type of a list of columns for a Dataframe in Spark 1.6.0.
All the examples I have found so far, however, only cast a single column (df.withColumn) or all the columns in the dataframe:
val castedDF = filteredDf.columns.foldLeft(filteredDf)((filteredDf, c) => filteredDf.withColumn(c, col(c).cast("String")))
Is there any efficient, batch way of doing this for a list of columns in the dataframe?
There is nothing wrong with withColumn* but you can use select if you prefer:
import org.apache.spark.sql.functions.col
val columnsToCast: Set[String]
val outputType: String = "string"
df.select(df.columns map (
c => if(columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)
* Execution plan will be the same for a single select as with chained withColumn.
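For illustration, a minimal sketch with made-up column names (again assuming a spark-shell session so toDF is available):
import org.apache.spark.sql.functions.col
val df = Seq((1, 2.0, "x"), (3, 4.0, "y")).toDF("a", "b", "c")
val columnsToCast = Set("a", "b")
val outputType = "string"
val casted = df.select(df.columns.map(c =>
  if (columnsToCast.contains(c)) col(c).cast(outputType).as(c) else col(c)
): _*)
casted.printSchema() // a and b come back as string, c is left as-is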
I am new to Spark and using Spark 1.6.1. I am using the pivot function to create a new column based on an integer value. Say I have a csv file like this:
year,winds
1990,50
1990,55
1990,58
1991,45
1991,42
1991,58
I am loading the csv file like this:
var df =sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data/sample.csv")
I want to aggregate the winds column, filtering those winds greater than 55, so that I get an output file like this:
year, majorwinds
1990,2
1991,1
I am using the code below:
val df2=df.groupBy("major").pivot("winds").agg(>55)->"count")
But I get this error
error: expected but integer literal found
What is the correct syntax here? Thanks in advance
In your case, if you just want output like:
+----+----------+
|year|majorwinds|
+----+----------+
|1990| 2|
|1991| 1|
+----+----------+
It's not necessary to use pivot.
You can achieve this with filter, groupBy and count:
df.filter($"winds" >= 55)
.groupBy($"year")
.count()
.withColumnRenamed("count", "majorwinds")
.show()
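Equivalently, if you prefer to fold the rename into the aggregation, a small variation on the same idea (same df as above):
import org.apache.spark.sql.functions.count
df.filter($"winds" >= 55)
  .groupBy($"year")
  .agg(count("winds").as("majorwinds"))
  .show()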
You can use this generic function to do the pivot (dynamicRow and getSchema are helpers the snippet assumes you have defined; a rough sketch of them follows the snippet):
def transpose(sqlCxt: SQLContext, df: DataFrame, compositeId: Vector[String], pair: (String, String), distinctCols: Array[Any]): DataFrame = {
val rdd = df.map { row => (compositeId.collect { case id => row.getAs(id).asInstanceOf[Any] }, scala.collection.mutable.Map(row.getAs(pair._1).asInstanceOf[Any] -> row.getAs(pair._2).asInstanceOf[Any])) }
val pairRdd = rdd.reduceByKey(_ ++ _)
val rowRdd = pairRdd.map(r => dynamicRow(r, distinctCols))
sqlCxt.createDataFrame(rowRdd, getSchema(compositeId ++ distinctCols))
}
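The two helpers are not shown in the answer; here is one hedged guess at what they could look like, assuming every value in the transposed output is treated as a string (these are hypothetical implementations, not the original author's code):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// Hypothetical helper: one StringType field per output column name
def getSchema(fieldNames: Seq[Any]): StructType =
  StructType(fieldNames.map(n => StructField(n.toString, StringType, nullable = true)))
// Hypothetical helper: composite-id values first, then one value per distinct column (null if missing)
def dynamicRow(r: (Seq[Any], scala.collection.mutable.Map[Any, Any]), distinctCols: Array[Any]): Row =
  Row.fromSeq((r._1 ++ distinctCols.map(c => r._2.getOrElse(c, null)))
    .map(v => if (v == null) null else v.toString))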
I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried
val test = myDF.select("my_column").rdd.map(r => getTimestamp(r))
but this returns an RDD and just with the transformed column.
If you really need to use your function, I can suggest two options:
Using map / toDF:
import org.apache.spark.sql.Row
import sqlContext.implicits._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val test = myDF.select("my_column").rdd.map {
case Row(string_val: String) => (string_val, getTimestamp(string_val))
}.toDF("my_column", "new_column")
Using UDFs (UserDefinedFunction):
import org.apache.spark.sql.functions._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column
val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF
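In both options getTimestamp is left as a placeholder. Purely for illustration, a hypothetical version that parses "yyyy-MM-dd HH:mm" strings could look like this:
import java.text.SimpleDateFormat
def getTimestamp: (String => java.sql.Timestamp) = s =>
  new java.sql.Timestamp(new SimpleDateFormat("yyyy-MM-dd HH:mm").parse(s).getTime)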
Alternatively,
If you just want to transform a StringType column into a TimestampType column you can use the unix_timestamp column function available since Spark SQL 1.5:
val test = myDF
.withColumn("new_column", unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm")
.cast("timestamp"))
Note: For Spark 1.5.x, it is necessary to multiply the result of unix_timestamp by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:
val test = myDF
.withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") *1000L)
.cast("timestamp"))
Edit: Added udf option
val columnName=Seq("col1","col2",....."coln");
Is there a way to do a dataframe.select operation to get a dataframe containing only the columns specified?
I know I can do dataframe.select("col1","col2"...)
but the columnName list is generated at runtime.
I could call dataframe.select() repeatedly for each column name in a loop. Will that have any performance overhead? Is there a simpler way to accomplish this?
val columnNames = Seq("col1","col2",....."coln")
// using the string column names:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
// or, equivalently, using Column objects:
val result = dataframe.select(columnNames.map(c => col(c)): _*)
Since dataframe.select() expects Column arguments and we have a sequence of strings, we need to map the strings to Columns. columnName.map(name => col(name)) produces a Seq[Column], and : _* expands it as varargs so it can be passed to select():
val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(name => col(name)): _*)
Alternatively, you can also write it like this:
val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(DF(_)): _*)