How could I unpivot a dataframe in Spark? [duplicate] - scala

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Closed 3 years ago.
I have a dataframe with the following schema:
subjectID, feature001, feature002, feature003, ..., feature299
Let's say my dataframe looks like:
123,0.23,0.54,0.35,...,0.26
234,0.17,0.49,0.47,...,0.69
Now, what I want is:
subjectID, featureID, featureValue
The above dataframe would look like:
123,001,0.23
123,002,0.54
123,003,0.35
......
123,299,0.26
234,001,0.17
234,002,0.49
234,003,0.47
......
234,299,0.69
I know how to achieve it if I have only a few columns:
val newDF = df.select($"subjectID", expr("stack(3, '001', feature001, '002', feature002, '003', feature003) as (featureID, featureValue)"))
However, I am looking for a way to deal with 300 columns.

You can build an array of structs from your columns and then use explode to turn them into rows:
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
import spark.implicits._ // for the $"..." syntax

// build an array of struct expressions from the feature columns
val columnExprs = df.columns
  .filter(_.startsWith("feature"))
  .map(name => struct(lit(name.replace("feature", "")) as "id", col(name) as "value"))

// unpivot the DataFrame
val newDF = df.select($"subjectID", explode(array(columnExprs: _*)) as "feature")
  .select(
    $"subjectID",
    $"feature.id" as "featureID",
    $"feature.value" as "featureValue")

Related

Creating pyspark dataframe from same entries in the list

I have two lists, one is col_name = ['col1', 'col2', 'col3'] and the other is col_value = ['val1', 'val2', 'val3']. I am trying to create a dataframe from the two lists, with col_name providing the column names. I need the output to have 3 columns and 1 row (with col_name as the header).
I'm finding it difficult to get a solution for this. Please help.
Construct the data directly as a list of rows, and use the createDataFrame method to create the dataframe:
col_name = ['col1', 'col2', 'col3']
col_value = ['val1', 'val2', 'val3']
data = [col_value]
df = spark.createDataFrame(data, col_name)
df.printSchema()
df.show(truncate=False)
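With those two lists the schema comes out as three string columns, so the printed output should look roughly like this:
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|val1|val2|val3|
+----+----+----+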

How to convert Array[String] into spark Dataframe to save CSV file format? [duplicate]

This question already has answers here:
How to create DataFrame from Scala's List of Iterables?
(5 answers)
Closed 4 years ago.
Code that I'm using to parse the CSV
val ListOfNames = List("Ramesh", "Suresh", "Ganesh") // the list of names is added dynamically
val Seperator = ListOfNames.map(x => x.split(",")) // .mkString(",")
sc.parallelize(Array(Seperator)).toDF().csv("path")
Getting output :
"Ramesh,Suresh,Ganesh" // Hence entire list into a single column in CSV
Expected output:
Ramesh, Suresh, Ganesh // each name into a single column in CSV
The output should be a single row, with each name in its own column of the CSV. If I try to change anything, it says CSV data sources do not support array of string data type.
How to solve this?
If you are looking to convert your list of size n to a Spark DataFrame that holds n rows with only one column, then the solution will look like below:
import sparkSession.sqlContext.implicits._
val listOfNames = List("Ramesh","Suresh","Ganesh")
val df = listOfNames.toDF("names")
df.show(false)
output:
+------+
|names |
+------+
|Ramesh|
|Suresh|
|Ganesh|
+------+
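If what you actually want is the expected output from the question, i.e. a single row with each name in its own column, here is a minimal sketch; the name_N column names and the spark SparkSession are my own assumptions:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val listOfNames = List("Ramesh", "Suresh", "Ganesh")

// one column per name; the name_N column names are just placeholders
val schema = StructType(listOfNames.indices.map(i => StructField(s"name_$i", StringType, nullable = true)))
val singleRowDF = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row.fromSeq(listOfNames))),
  schema)

singleRowDF.write.csv("path") // the single CSV row becomes: Ramesh,Suresh,Ganesh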

Spark read dataframe column value as string [duplicate]

This question already has answers here:
Concatenate columns in Apache Spark DataFrame
(18 answers)
Closed 4 years ago.
I have dataframe in Spark 2.2 and I want to read a column value as string.
val df1 = df.withColumn("col1",
  when(col("col1").isNull, col("col2") + "some_string"))
When col1 is null, I want to take the string value in col2 and append my own string to it. The problem is that I always get col("col2") as an org.apache.spark.sql.Column. How can I convert this value into a String so I can append my custom string?
lit and concat will do the trick. You can wrap a string value in a Column using the lit function, and with the concat function you can concatenate it to the string value of the column.
import org.apache.spark.sql.functions._
df.withColumn("col1",
  when(col("col1").isNull, concat(col("col2"), lit("some_string"))))
You can use the lit function to turn the string value into a Column and then use the concat function.
val df1 = df.withColumn("col1",
  when(col("col1").isNull, concat(col("col2"), lit("some_string"))))
Hope this helps!
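One caveat worth noting: a when without an otherwise turns every non-null col1 into null. In practice you usually want to keep the existing value, along the lines of this sketch (the toy data and the spark session are my own):

import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq((null.asInstanceOf[String], "abc"), ("keep", "xyz")).toDF("col1", "col2")

df.withColumn("col1",
    when(col("col1").isNull, concat(col("col2"), lit("some_string")))
      .otherwise(col("col1"))) // preserve col1 where it is not null
  .show()
// +--------------+----+
// |          col1|col2|
// +--------------+----+
// |abcsome_string| abc|
// |          keep| xyz|
// +--------------+----+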

How to convert all columns of a dataframe to numeric in Spark Scala?

I loaded a CSV as a dataframe. I would like to cast all columns to float, knowing that the file is too big to write out all the column names:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
Given this DataFrame as example:
val df = sqlContext.createDataFrame(Seq(("0", 0),("1", 1),("2", 0))).toDF("id", "c0")
with schema:
StructType(
StructField(id,StringType,true),
StructField(c0,IntegerType,false))
You can loop over the DF columns using the .columns method:
import org.apache.spark.sql.functions.col

val castedDF = df.columns.foldLeft(df) { (current, c) =>
  current.withColumn(c, col(c).cast("float"))
}
So the new DF schema looks like:
StructType(
StructField(id,FloatType,true),
StructField(c0,FloatType,false))
EDIT:
If you want to exclude some columns from casting, you could do something like this (supposing we want to exclude the column id):
val exclude = Array("id")
val someCastedDF = (df.columns.toBuffer --= exclude).foldLeft(df) { (current, c) =>
  current.withColumn(c, col(c).cast("float"))
}
where exclude is an Array of all columns we want to exclude from casting.
So the schema of this new DF is:
StructType(
StructField(id,StringType,true),
StructField(c0,FloatType,false))
Please note that this may not be the best way to do it, but it can be a starting point.
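As a possible alternative (not from the original answer), a single select with one cast per column avoids chaining hundreds of withColumn calls, which can get expensive on wide DataFrames:

import org.apache.spark.sql.functions.col

val excludeSet = Set("id")
val castedDF2 = df.select(df.columns.map { c =>
  if (excludeSet.contains(c)) col(c)      // leave excluded columns untouched
  else col(c).cast("float").as(c)         // cast everything else to float
}: _*)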

Filter dataframe by value NOT present in column of other dataframe [duplicate]

This question already has answers here:
Filter Spark DataFrame based on another DataFrame that specifies denylist criteria
(2 answers)
Closed 6 years ago.
Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of another dataframe.
I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains or the "isin" keyword, or one of the join methods.
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
val res = df1.filter(df2("cities").contains("city") === false)
// doesn't work, nor do the 20 other variants I have tried
Anyone got any ideas?
I've discovered that I can solve this with a simpler method: it turns out an anti-join is available as a join-type parameter to the join method, although the Spark Scaladoc does not describe it:
import org.apache.spark.sql.functions._
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
df1.join(df2, df1("city") === df2("cities"), "leftanti").show
Results in:
+----------+-------+
| location| city|
+----------+-------+
|Chittagong|Chennai|
+----------+-------+
P.S. thanks for the pointer to the duplicate - duly marked as such
If you are trying to filter a DataFrame using another, you should use join (or any of its variants). If what you need is to filter it using a List or any data structure that fits in your master and workers, you could broadcast it and then reference it inside the filter or where method.
For instance, I would do something like this:
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and the $"..." syntax

val df1 = Seq(("Hampstead", "London"),
              ("Spui", "Amsterdam"),
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"), ("Amsterdam"), ("New York")).toDF("cities")

df2.join(df1, joinExprs = df1("city") === df2("cities"), joinType = "full_outer")
  .select("city", "cities")
  .where(isnull($"cities"))
  .drop("cities")
  .show()
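If the values to exclude are just a small local collection, the isin route mentioned above is often enough and avoids the join entirely. A minimal sketch, reusing df1 from the snippet above:

// negated isin against a local denylist of cities
val citiesToExclude = Seq("London", "Amsterdam", "New York")

val res = df1.filter(!col("city").isin(citiesToExclude: _*))
res.show()
// +----------+-------+
// |  location|   city|
// +----------+-------+
// |Chittagong|Chennai|
// +----------+-------+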