Spark read dataframe column value as string [duplicate] - scala

This question already has answers here:
Concatenate columns in Apache Spark DataFrame
(18 answers)
Closed 4 years ago.
I have a dataframe in Spark 2.2 and I want to read a column value as a string.
val df1 = df.withColumn("col1",
  when(col("col1").isNull, col("col2") + "some_string"))
When col1 is null, I want to get the string value in col2 and append my custom string to it.
The problem is that col("col2") is always an org.apache.spark.sql.Column. How can I convert this value into a String so I can append my custom string?

lit and concat will do the trick. You can give a string value using the lit function, and with the concat function you can concatenate it to the string value of the column.
import org.apache.spark.sql.functions._
df.withColumn("col1", when(col("col1").isNull,
concat(col("col2"), lit("some_string"))))
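Note that when without an otherwise leaves the non-matching rows null. A minimal runnable sketch (the sample data and column values are hypothetical) that also keeps the existing col1 values:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sample data: the first row has a null col1
val df = Seq((null.asInstanceOf[String], "abc"), ("keep", "xyz")).toDF("col1", "col2")

val df1 = df.withColumn("col1",
  when(col("col1").isNull, concat(col("col2"), lit("some_string")))
    .otherwise(col("col1")))

df1.show() // col1 becomes "abcsome_string" for the null row; "keep" stays unchanged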

You can use the lit function to turn the string into a Column and then use the concat function.
val df1 = df.withColumn("col1",
  when(col("col1").isNull, concat(col("col2"), lit("some_string"))))
Hope this helps!

Related

How could I unpivot a dataframe in Spark? [duplicate]

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Closed 3 years ago.
I have a dataframe with the following schema:
subjectID, feature001, feature002, feature003, ..., feature299
Let's say my dataframe looks like:
123,0.23,0.54,0.35,...,0.26
234,0.17,0.49,0.47,...,0.69
Now, what I want is:
subjectID, featureID, featureValue
The above dataframe would look like:
123,001,0.23
123,002,0.54
123,003,0.35
......
123,299,0.26
234,001,0.17
234,002,0.49
234,003,0.47
......
234,299,0.69
I know how to achieve it if I have only a few columns:
val newDF = df.select($"subjectID", expr("stack(3, '001', feature001, '002', feature002, '003', feature003) as (featureID, featureValue)"))
However, I am looking for a way to deal with 300 columns.
You can build an array of structs from your columns and then use explode to turn them into rows:
import org.apache.spark.sql.functions.{explode, struct, lit, array, col}
import spark.implicits._ // for the $"colName" syntax, assuming a SparkSession named spark
// build an array of struct expressions from the feature columns
val columnExprs = df.columns
  .filter(_.startsWith("feature"))
  .map(name => struct(lit(name.replace("feature", "")) as "id", col(name) as "value"))
// unpivot the DataFrame: one output row per (subjectID, feature) pair
val newDF = df.select($"subjectID", explode(array(columnExprs: _*)) as "feature")
  .select($"subjectID",
    $"feature.id" as "featureID",
    $"feature.value" as "featureValue")

How to put a default value like "9999-12-31" in a date field where we have null [duplicate]

This question already has answers here:
How to replace null values with a specific value in Dataframe using spark in Java?
(4 answers)
Closed 4 years ago.
test is a dataframe with t_dob, a date field which has null values in it. I want to hardcode a value like "9999-12-31" whenever I have a null in the date field, using Spark Scala. I could not find any such option in the na.fill() method for date fields. Could anyone let me know how this can be done?
Expected output is as below:
+-------------+-------+----------+
|s_customer_id| s_name|     t_dob|
+-------------+-------+----------+
|          101|shameer|9999-12-31|
|          102| rajesh|9999-12-31|
+-------------+-------+----------+
Here is my approach
import java.sql.Date // for the Date value in the sample data
import org.apache.spark.sql.functions.{lit, when, to_date}
val spark = getSession() // answerer's helper that returns a SparkSession
import spark.implicits._
val data = Seq(("101", "Shameer", null),
               ("102", "Rajesh", new Date(new java.util.Date().getTime)))
val df = spark.sparkContext.parallelize(data).toDF("s_customer_id", "s_name", "t_dob")
// replace null dates with the hard-coded default, keep existing dates otherwise
df.withColumn("t_dob",
    when($"t_dob".isNull, to_date(lit("9999-12-31"), "yyyy-MM-dd")).otherwise($"t_dob"))
  .show()
Output
+-------------+-------+----------+
|s_customer_id| s_name|     t_dob|
+-------------+-------+----------+
|          101|Shameer|9999-12-31|
|          102| Rajesh|2019-02-21|
+-------------+-------+----------+
Try this one: val newTest = test.withColumn("t_dob_changed", when(col("t_dob").isNull, to_date(lit("9999-12-31"))).otherwise(col("t_dob")))
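A coalesce-based variant (a sketch; coalesce simply returns its first non-null argument) avoids when/otherwise altogether:
import org.apache.spark.sql.functions.{coalesce, col, lit, to_date}

// null t_dob values fall through to the hard-coded default date
val newTest = test.withColumn("t_dob", coalesce(col("t_dob"), to_date(lit("9999-12-31"), "yyyy-MM-dd")))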

Create boolean flag based on column value containing element of a List [duplicate]

This question already has answers here:
How to use Column.isin with list?
(5 answers)
Closed 4 years ago.
def myFunction(df: DataFrame): DataFrame = {
  val myList = List("a", "b", "c")
  df.withColumn("myFlag",
    if (myList.contains(df.select(col("columnName1")))) lit("true") else lit(false))
}
I want to write a function that takes a DataFrame and adds a column to it named "myFlag".
I want "myFlag" to be true if the corresponding "columnName1" has a value that is an element of "myList", false otherwise.
For simplicity, "columnName1" values and "myList" only contain Strings.
My function above will not work. Any suggestions?
This can be done using isin which is defined on Column:
import spark.implicits._

df.withColumn("myFlag", $"columnName1".isin(myList: _*))
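Put together as the function from the question, a sketch (columnName1 and the list contents are the placeholders used in the question):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def myFunction(df: DataFrame): DataFrame = {
  val myList = List("a", "b", "c")
  // isin builds a Column predicate that is evaluated per row, yielding a boolean myFlag
  df.withColumn("myFlag", col("columnName1").isin(myList: _*))
}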

How to convert Array[String] into spark Dataframe to save CSV file format? [duplicate]

This question already has answers here:
How to create DataFrame from Scala's List of Iterables?
(5 answers)
Closed 4 years ago.
Code that I'm using to parse the CSV
val listOfNames = List("Ramesh", "Suresh", "Ganesh") // the list of names is built dynamically
val separator = listOfNames.mkString(",")            // also tried .map(_.split(","))
sc.parallelize(Array(separator)).toDF().write.csv("path")
Output I am getting:
"Ramesh,Suresh,Ganesh" // the entire list ends up in a single column of the CSV
Expected output:
Ramesh, Suresh, Ganesh // each name in its own column of the CSV
The output should be a single row, with each string in its own comma-separated column.
If I try to change anything, it says "CSV data sources do not support array of string data type".
How to solve this?
If you are looking to convert your list of size n to a Spark DataFrame that holds n rows with a single column, the solution looks like this:
import sparkSession.sqlContext.implicits._
val listOfNames = List("Ramesh","Suresh","Ganesh")
val df = listOfNames.toDF("names")
df.show(false)
output:
+------+
|names |
+------+
|Ramesh|
|Suresh|
|Ganesh|
+------+
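If the goal from the question is instead a single row with one name per column, here is a sketch (the generated column names name_0, name_1, ... and the SparkSession named spark are assumptions):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val listOfNames = List("Ramesh", "Suresh", "Ganesh")

// one StructField per name, so each name lands in its own CSV column
val schema = StructType(listOfNames.indices.map(i => StructField(s"name_$i", StringType)))
val singleRowDF = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(listOfNames: _*))),
  schema)

singleRowDF.write.option("header", "true").csv("path")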

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the parameters passed to agg must be Columns (a first Column plus the rest as a Seq[Column]).
I then have a dataframe "expr" containing the following column:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the value in each row as an input to agg, but I am not able to access those values as Columns.
Any idea how to approach this?
If you can change the strings in the sequences column to be valid SQL expressions, then this is possible. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq of Columns, do:
import org.apache.spark.sql.functions.expr
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
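An end-to-end sketch with hypothetical data (the key column, the sample values, and the simpler SQL expressions are assumptions, not from the question):
import org.apache.spark.sql.functions.expr
import spark.implicits._ // assumes an in-scope SparkSession named spark

// hypothetical input: "key" is the grouping column, A and B are value columns
val df = Seq(("k1", 2, "x"), ("k1", 3, "y"), ("k2", 5, "z")).toDF("key", "A", "B")

// the aggregation expressions held as SQL strings in a "sequences" column
val exprDF = Seq("sum(A) as sumA", "count(B) as countB").toDF("sequences")
val seqs = exprDF.as[String].collect().map(expr(_))

df.groupBy("key").agg(seqs.head, seqs.tail: _*).show()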