Scala explode followed by UDF on a dataframe fails - scala

I have a scala dataframe with the following schema:
root
|-- time: string (nullable = true)
|-- itemId: string (nullable = true)
|-- itemFeatures: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I want to explode the itemFeatures column and then send my dataframe to a UDF. But as soon as I include the explode, calling the UDF results in this error:
org.apache.spark.SparkException: Task not serializable
I can't figure out why???
Environment: Scala 2.11.12, Spark 2.4.4
Full example:
val dataList = List(
("time1", "id1", "map1"),
("time2", "id2", "map2"))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode("itemFeatures"))
val doNextThingUDF: UserDefinedFunction = udf(doNextThing _)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time"))
where my UDF looks like this:
val doNextThing(time: String): String = {
time+"blah"
}
If I remove the explode, everything works fine, or if I don't call the UDF after the explode, everything works fine. I could imagine Spark is somehow unable to send each row to a UDF if it is dynamically executing the explode and doesn't know how many rows that are going to exist, but even when I add ex dfExploded.cache() and dfExploded.count() I still get the error. Is this a known issue? What am I missing?

I think the issue come from how you define your donextThing function. Also
there is couple of typos in your "full example".
Especially the itemFeatures column is a string in your example, I understand it should be a Map.
But here is a working example:
val dataList = List(
("time1", "id1", Map("map1" -> 1)),
("time2", "id2", Map("map2" -> 2)))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode($"itemFeatures"))
val doNextThing = (time: String) => {time+"blah"}
val doNextThingUDF = udf(doNextThing)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time")))

Related

Reading ambiguous column name in Spark sql Dataframe using scala

I have duplicate columns in text file and when I try to load that text file using spark scala code, it gets loaded successfully into data frame and I can see the first 20 rows by df.Show()
Full Code:-
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
df.registerTempTable("Sample_File")
df.Show()
Till this point my code works fine.
But as soon as I try below code then it gives me error.
val results = hivesql.sql("Select id,sequence,sequence from Sample_File")
so I have 2 columns with same name in text file i.e sequence
How can I access that two columns.. I tried with sequence#2 but still not working
Spark Version:-1.6.0
Scala Version:- 2.10.5
result of df.printschema()
|-- id: string (nullable = true)
|-- sequence: string (nullable = true)
|-- sequence: string (nullable = true)
I second #smart_coder's approach, I have a slightly different approach though. Please find it below.
You need to have unique column names to do query from hivesql.sql.
you can rename the column names dynamically by using below code:
Your code:
val df = hivesql.createDataFrame(rowRDD, schema)
After this point, we need to remove ambiguity, below is the solution:
var list = df.schema.map(_.name).toList
for(i <- 0 to list.size -1){
val cont = list.count(_ == list(i))
val col = list(i)
if(cont != 1){
list = list.take(i) ++ List(col+i) ++ list.drop(i+1)
}
}
val df1 = df.toDF(list: _*)
// you would get the output as below:
result of df1.printschema()
|-- id: string (nullable = true)
|-- sequence1: string (nullable = true)
|-- sequence: string (nullable = true)
So basically, we are getting all the column names as a list, then checking if any column is repeating more than once,
if a column is repeating, we are appending the column name with the index, then we create a new dataframe d1 with the new list with renamed column names.
I have tested this in Spark 2.4, but it should work in 1.6 as well.
The below code might help you to resolve your problem. I have tested this in Spark 1.6.3.
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
val colNames = Seq("id","sequence1","sequence2")
val df1 = df.toDF(colNames: _*)
df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id,sequence1,sequence2 from Sample_File")

Update DataFrame col names from a list avoiding using var

I have a list of defined columns as:
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "text", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
And a file (csv with header columns Products Selled, Total Value) which is readed as dataframe.
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(filePath)
// csv file have header as colNames
var finalDf = df
.withColumn("row_id", monotonically_increasing_id)
.select(cols
.map(_.name.trim)
.map(col): _*)
// convert df col names as colCodes (for kudu table columns)
cols.foreach(col => finalDf = finalDf.withColumnRenamed(col.name.trim, col.colCode.trim))
In last line, I change the dataframe column name from Products Selled into products_selled. Due of this, finalDf is a var.
I want to know if is a solution to declare finalDf as val, and not var.
I tried something like below code, but withColumnRenamed return a new DataFrame, but I can not do this outside cols.foreach
cols.foreach(col => finalDf.withColumnRenamed(col.name.trim, col.colCode.trim))
Using select You can rename columns.
renaming columns inside select is faster than foldLeft, check post for comparison.
Try below code.
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "string", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
val colExpr = cols.map(c => trim(col(c.colName)).as(c.colCode.trim))
If you are storing valid column data type in ExcelColumn case class, you can use column data type like below.
val colExpr = cols.map(c => trim(col(c.colName).cast(c.colType)).as(c.colCode.trim))
finalDf.select(colExpr:_*)
The better way is to use foldLeft with withColumnRenamed
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "text", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
val resultDF = cols.foldLeft(df){(acc, name ) =>
acc.withColumnRenamed(name.colName.trim, name.colCode.trim)
}
Original Schema:
root
|-- Products Selled: integer (nullable = false)
|-- Total Value: string (nullable = true)
|-- value: integer (nullable = false)
New Schema:
root
|-- products_selled: integer (nullable = false)
|-- total_value: string (nullable = true)
|-- value: integer (nullable = false)

Scala and Spark UDF function

I made a simple UDF to convert or extract some values from a time field in a temptabl in spark. I register the function but when I call the function using sql it throws a NullPointerException. Below is my function and process of executing it. I am using Zeppelin. Strangly this was working yesterday but it stopped working this morning.
Function
def convert( time:String ) : String = {
val sdf = new java.text.SimpleDateFormat("HH:mm")
val time1 = sdf.parse(time)
return sdf.format(time1)
}
Register the Function
sqlContext.udf.register("convert",convert _)
Test the function without SQL -- This works
convert(12:12:12) -> returns 12:12
Test the function with SQL in Zeppelin this FAILS.
%sql
select convert(time) from temptable limit 10
Structure of temptable
root
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- serverip: string (nullable = true)
|-- request: string (nullable = true)
|-- resource: string (nullable = true)
|-- protocol: integer (nullable = true)
|-- sourceip: string (nullable = true)
Part of the stacktrace that I am getting.
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:643)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
at org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
Use udf instead of define a function directly
import org.apache.spark.sql.functions._
val convert = udf[String, String](time => {
val sdf = new java.text.SimpleDateFormat("HH:mm")
val time1 = sdf.parse(time)
sdf.format(time1)
}
)
A udf's input parameter is Column(or Columns). And the return type is Column.
case class UserDefinedFunction protected[sql] (
f: AnyRef,
dataType: DataType,
inputTypes: Option[Seq[DataType]]) {
def apply(exprs: Column*): Column = {
Column(ScalaUDF(f, dataType, exprs.map(_.expr), inputTypes.getOrElse(Nil)))
}
}
You have to define your function as a UDF.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
val convertUDF: UserDefinedFunction = udf((time:String) => {
val sdf = new java.text.SimpleDateFormat("HH:mm")
val time1 = sdf.parse(time)
sdf.format(time1)
})
Next you would apply your UDF on your DataFrame.
// assuming your DataFrame is already defined
dataFrame.withColumn("time", convertUDF(col("time"))) // using the same name replaces existing
Now, as to your actual problem, one reason you are receiving this error could be because your DataFrame contains rows which are nulls. If you filter them out before you apply the UDF, you should be able to continue no problem.
dataFrame.filter(col("time").isNotNull)
I'm curious what else causes a NullPointerException when running a UDF other than it encountering a null, if you found a reason different than my suggestion, I'd be glad to know.

How to apply a function to a column of a Spark DataFrame?

Let's assume that we have a Spark DataFrame
df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame
with the following schema
df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
| |-- element: string (containsNull = true)
Given that each row of the tk column is an array of strings, how to write a Scala function that will return the number of elements in each row?
You don't have to write a custom function because there is one:
import org.apache.spark.sql.functions.size
df.select(size($"tk"))
If you really want you can write an udf:
import org.apache.spark.sql.functions.udf
val size_ = udf((xs: Seq[String]) => xs.size)
or even create custom a expression but there is really no point in that.
One way is to access them using the sql like below.
df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")
df2.show()
To get size of array column,
val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()
If your Spark version is older, you can use HiveContext instead of Spark's SQL Context.
I would also try for some thing that traverses.

How can I change column types in Spark SQL's DataFrame?

Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
|-- year: string (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)
df.show()
year make model comment blank
2012 Tesla S No comment
1997 Ford E350 Go get one now th...
But I really wanted the year as Int (and perhaps transform some other columns).
The best I could come up with was
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.
df2 <- df %>%
mutate(year = year %>% as.integer,
make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in Spark/Scala...
Edit: Newest newest version
Since spark 2.x you should use dataset api instead when using Scala [1]. Check docs here:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
If working with python, even though easier, I leave the link here as it's a very highly voted question:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
[1] https://spark.apache.org/docs/latest/sql-programming-guide.html:
In the Scala API, DataFrame is simply a type alias of Dataset[Row].
While, in Java API, users need to use Dataset to represent a
DataFrame.
Edit: Newest version
Since spark 2.x you can use .withColumn. Check the docs here:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
Oldest answer
Since Spark version 1.4 you can apply the cast method with DataType on the column:
import org.apache.spark.sql.types.IntegerType
val df2 = df.withColumn("yearTmp", df.year.cast(IntegerType))
.drop("year")
.withColumnRenamed("yearTmp", "year")
If you are using sql expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year",
"make",
"model",
"comment",
"blank")
For more info check the docs:
http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer, I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner].
I think your approach is ok, recall that a Spark DataFrame is an (immutable) RDD of Rows, so we're never really replacing a column, just creating new DataFrame each time with a new schema.
Assuming you have an original df with the following schema:
scala> df.printSchema
root
|-- Year: string (nullable = true)
|-- Month: string (nullable = true)
|-- DayofMonth: string (nullable = true)
|-- DayOfWeek: string (nullable = true)
|-- DepDelay: string (nullable = true)
|-- Distance: string (nullable = true)
|-- CRSDepTime: string (nullable = true)
And some UDF's defined on one or several columns:
import org.apache.spark.sql.functions._
val toInt = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val toHour = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
val days_since_nearest_holidays = udf(
(year:String, month:String, dayOfMonth:String) => year.toInt + 27 + month.toInt-12
)
Changing column types or even building a new DataFrame from another can be written like this:
val featureDf = df
.withColumn("departureDelay", toDouble(df("DepDelay")))
.withColumn("departureHour", toHour(df("CRSDepTime")))
.withColumn("dayOfWeek", toInt(df("DayOfWeek")))
.withColumn("dayOfMonth", toInt(df("DayofMonth")))
.withColumn("month", toInt(df("Month")))
.withColumn("distance", toDouble(df("Distance")))
.withColumn("nearestHoliday", days_since_nearest_holidays(
df("Year"), df("Month"), df("DayofMonth"))
)
.select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
"month", "distance", "nearestHoliday")
which yields:
scala> df.printSchema
root
|-- departureDelay: double (nullable = true)
|-- departureHour: integer (nullable = true)
|-- dayOfWeek: integer (nullable = true)
|-- dayOfMonth: integer (nullable = true)
|-- month: integer (nullable = true)
|-- distance: double (nullable = true)
|-- nearestHoliday: integer (nullable = true)
This is pretty close to your own solution. Simply, keeping the type changes and other transformations as separate udf vals make the code more readable and re-usable.
As the cast operation is available for Spark Column's (and as I personally do not favour udf's as proposed by #Svend at this point), how about:
df.select( df("year").cast(IntegerType).as("year"), ... )
to cast to the requested type? As a neat side effect, values not castable / "convertable" in that sense, will become null.
In case you need this as a helper method, use:
object DFHelper{
def castColumnTo( df: DataFrame, cn: String, tpe: DataType ) : DataFrame = {
df.withColumn( cn, df(cn).cast(tpe) )
}
}
which is used like:
import DFHelper._
val df2 = castColumnTo( df, "year", IntegerType )
First, if you wanna cast type, then this:
import org.apache.spark.sql
df.withColumn("year", $"year".cast(sql.types.IntegerType))
With same column name, the column will be replaced with new one. You don't need to do add and delete steps.
Second, about Scala vs R.
This is the code that most similar to R I can come up with:
val df2 = df.select(
df.columns.map {
case year # "year" => df(year).cast(IntegerType).as(year)
case make # "make" => functions.upper(df(make)).as(make)
case other => df(other)
}: _*
)
Though the code length is a little longer than R's. That is nothing to do with the verbosity of the language. In R the mutate is a special function for R dataframe, while in Scala you can easily ad-hoc one thanks to its expressive power.
In word, it avoid specific solutions, because the language design is good enough for you to quickly and easy build your own domain language.
side note: df.columns is surprisingly a Array[String] instead of Array[Column], maybe they want it look like Python pandas's dataframe.
You can use selectExpr to make it a little cleaner:
df.selectExpr("cast(year as int) as year", "upper(make) as make",
"model", "comment", "blank")
Java code for modifying the datatype of the DataFrame from String to Integer
df.withColumn("col_name", df.col("col_name").cast(DataTypes.IntegerType))
It will simply cast the existing(String datatype) to Integer.
I think this is lot more readable for me.
import org.apache.spark.sql.types._
df.withColumn("year", df("year").cast(IntegerType))
This will convert your year column to IntegerType with creating any temporary columns and dropping those columns.
If you want to convert to any other datatype, you can check the types inside org.apache.spark.sql.types package.
To convert the year from string to int, you can add the following option to the csv reader: "inferSchema" -> "true", see DataBricks documentation
Generate a simple dataset containing five values and convert int to string type:
val df = spark.range(5).select( col("id").cast("string") )
So this only really works if your having issues saving to a jdbc driver like sqlserver, but it's really helpful for errors you will run into with syntax and types.
import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
import org.apache.spark.sql.jdbc.JdbcType
val SQLServerDialect = new JdbcDialect {
override def canHandle(url: String): Boolean = url.startsWith("jdbc:jtds:sqlserver") || url.contains("sqlserver")
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
case StringType => Some(JdbcType("VARCHAR(5000)", java.sql.Types.VARCHAR))
case BooleanType => Some(JdbcType("BIT(1)", java.sql.Types.BIT))
case IntegerType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
case LongType => Some(JdbcType("BIGINT", java.sql.Types.BIGINT))
case DoubleType => Some(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
case FloatType => Some(JdbcType("REAL", java.sql.Types.REAL))
case ShortType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
case ByteType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
case BinaryType => Some(JdbcType("BINARY", java.sql.Types.BINARY))
case TimestampType => Some(JdbcType("DATE", java.sql.Types.DATE))
case DateType => Some(JdbcType("DATE", java.sql.Types.DATE))
// case DecimalType.Fixed(precision, scale) => Some(JdbcType("NUMBER(" + precision + "," + scale + ")", java.sql.Types.NUMERIC))
case t: DecimalType => Some(JdbcType(s"DECIMAL(${t.precision},${t.scale})", java.sql.Types.DECIMAL))
case _ => throw new IllegalArgumentException(s"Don't know how to save ${dt.json} to JDBC")
}
}
JdbcDialects.registerDialect(SQLServerDialect)
the answers suggesting to use cast, FYI, the cast method in spark 1.4.1 is broken.
for example, a dataframe with a string column having value "8182175552014127960" when casted to bigint has value "8182175552014128100"
df.show
+-------------------+
| a|
+-------------------+
|8182175552014127960|
+-------------------+
df.selectExpr("cast(a as bigint) a").show
+-------------------+
| a|
+-------------------+
|8182175552014128100|
+-------------------+
We had to face a lot of issue before finding this bug because we had bigint columns in production.
df.select($"long_col".cast(IntegerType).as("int_col"))
You can use below code.
df.withColumn("year", df("year").cast(IntegerType))
Which will convert year column to IntegerType column.
Using Spark Sql 2.4.0 you can do that:
spark.sql("SELECT STRING(NULLIF(column,'')) as column_string")
This method will drop the old column and create new columns with same values and new datatype. My original datatypes when the DataFrame was created were:-
root
|-- id: integer (nullable = true)
|-- flag1: string (nullable = true)
|-- flag2: string (nullable = true)
|-- name: string (nullable = true)
|-- flag3: string (nullable = true)
After this I ran following code to change the datatype:-
df=df.withColumnRenamed(<old column name>,<dummy column>) // This was done for both flag1 and flag3
df=df.withColumn(<old column name>,df.col(<dummy column>).cast(<datatype>)).drop(<dummy column>)
After this my result came out to be:-
root
|-- id: integer (nullable = true)
|-- flag2: string (nullable = true)
|-- name: string (nullable = true)
|-- flag1: boolean (nullable = true)
|-- flag3: boolean (nullable = true)
So many answers and not much thorough explanations
The following syntax works Using Databricks Notebook with Spark 2.4
from pyspark.sql.functions import *
df = df.withColumn("COL_NAME", to_date(BLDFm["LOAD_DATE"], "MM-dd-yyyy"))
Note that you have to specify the entry format you have (in my case "MM-dd-yyyy") and the import is mandatory as the to_date is a spark sql function
Also Tried this syntax but got nulls instead of a proper cast :
df = df.withColumn("COL_NAME", df["COL_NAME"].cast("Date"))
(Note I had to use brackets and quotes for it to be syntaxically correct though)
PS : I have to admit this is like a syntax jungle, there are many possible ways entry points, and the official API references lack proper examples.
Another solution is as follows:
1) Keep "inferSchema" as False
2) While running 'Map' functions on the row, you can read 'asString' (row.getString...)
//Read CSV and create dataset
Dataset<Row> enginesDataSet = sparkSession
.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema","false")
.load(args[0]);
JavaRDD<Box> vertices = enginesDataSet
.select("BOX","BOX_CD")
.toJavaRDD()
.map(new Function<Row, Box>() {
#Override
public Box call(Row row) throws Exception {
return new Box((String)row.getString(0),(String)row.get(1));
}
});
Why not just do as described under http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast
df.select(df.year.cast("int"),"make","model","comment","blank")
One can change data type of a column by using cast in spark sql.
table name is table and it has two columns only column1 and column2 and column1 data type is to be changed.
ex-spark.sql("select cast(column1 as Double) column1NewName,column2 from table")
In the place of double write your data type.
Another way:
// Generate a simple dataset containing five values and convert int to string type
val df = spark.range(5).select( col("id").cast("string")).withColumnRenamed("id","value")
In case you have to rename dozens of columns given by their name, the following example takes the approach of #dnlbrky and applies it to several columns at once:
df.selectExpr(df.columns.map(cn => {
if (Set("speed", "weight", "height").contains(cn)) s"cast($cn as double) as $cn"
else if (Set("isActive", "hasDevice").contains(cn)) s"cast($cn as boolean) as $cn"
else cn
}):_*)
Uncasted columns are kept unchanged. All columns stay in their original order.
val fact_df = df.select($"data"(30) as "TopicTypeId", $"data"(31) as "TopicId",$"data"(21).cast(FloatType).as( "Data_Value_Std_Err")).rdd
//Schema to be applied to the table
val fact_schema = (new StructType).add("TopicTypeId", StringType).add("TopicId", StringType).add("Data_Value_Std_Err", FloatType)
val fact_table = sqlContext.createDataFrame(fact_df, fact_schema).dropDuplicates()
In case if you want to change multiple columns of a specific type to another without specifying individual column names
/* Get names of all columns that you want to change type.
In this example I want to change all columns of type Array to String*/
val arrColsNames = originalDataFrame.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)
//iterate columns you want to change type and cast to the required type
val updatedDataFrame = arrColsNames.foldLeft(originalDataFrame){(tempDF, colName) => tempDF.withColumn(colName, tempDF.col(colName).cast(DataTypes.StringType))}
//display
updatedDataFrame.show(truncate = false)