Given any df, I want to add a column called "has_duplicates" indicating whether each row is unique. Example input df:
val df = Seq((1, 2), (2, 5), (1, 7), (1, 2), (2, 5)).toDF("A", "B")
Given an input columns: Seq[String], I know how to get the count of each row:
val countsDf = df.withColumn("count", count("*").over(Window.partitionBy(columns.map(col(_)): _*)))
But I'm not sure how to use this to create a column expression for the final column indicating whether each row is unique.
Something like
def getEvaluationExpression(df: DataFrame): Column = {
  when(col("count") > 1, lit("fail")).otherwise(lit("pass"))
}
but the count needs to be evaluated on the spot using the query above.
Try the code below.
scala> df.withColumn("has_duplicates", when(count("*").over(Window.partitionBy(df.columns.map(col(_)): _*)) > 1 , lit("fail")).otherwise("pass")).show(false)
+---+---+--------------+
|A |B |has_duplicates|
+---+---+--------------+
|1 |7 |pass |
|1 |2 |fail |
|1 |2 |fail |
|2 |5 |fail |
|2 |5 |fail |
+---+---+--------------+
Or
scala> df.withColumn("count",count("*").over(Window.partitionBy(df.columns.map(col(_)): _*))).withColumn("has_duplicates", when($"count" > 1 , lit("fail")).otherwise("pass")).show(false)
+---+---+-----+--------------+
|A |B |count|has_duplicates|
+---+---+-----+--------------+
|1 |7 |1 |pass |
|1 |2 |2 |fail |
|1 |2 |2 |fail |
|2 |5 |2 |fail |
|2 |5 |2 |fail |
+---+---+-----+--------------+
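If you want to fold this back into a reusable helper like the getEvaluationExpression sketch in the question, something along these lines should work (a minimal sketch; hasDuplicatesExpr is just an illustrative name):
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, when}

// Build the pass/fail expression from a list of column names; the windowed
// count is evaluated on the spot, as asked in the question.
def hasDuplicatesExpr(columns: Seq[String]): Column =
  when(count("*").over(Window.partitionBy(columns.map(col): _*)) > 1, lit("fail"))
    .otherwise(lit("pass"))

val result = df.withColumn("has_duplicates", hasDuplicatesExpr(df.columns))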
What I have:
|ids     |items   |item_id|value|timestamp|
+--------+--------+-------+-----+---------+
|[A,B,C] |1.0     |1      |5    |100      |
|[A,B,D] |1.0     |2      |6    |90       |
|[D]     |0.0     |3      |7    |80       |
|[C]     |0.0     |4      |8    |80       |
+--------+--------+-------+-----+---------+
|ids     |id_num  |
+--------+--------+
|A       |1       |
|B       |2       |
|C       |3       |
|D       |4       |
+--------+--------+
What I want:
| ids |
+--------+
|[1,2,3] |
|[1,2,4] |
|[3] |
|[4] |
+--------+
Is there a way to do this without an explode? Thank you for your help!
You can use a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# Suppose this is the dictionary you want to map
map_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

def array_map(array_col):
    # If you prefer a list comprehension, you can return [map_dict[k] for k in array_col]
    return list(map(map_dict.get, array_col))

# ArrayType needs an element type; the mapped values are integers here
array_map_udf = udf(array_map, ArrayType(IntegerType()))

df = df.withColumn("mapped_array", array_map_udf(col("ids")))
I can't think of a different method, but if you need to build the dictionary from the reference df instead of hard-coding it, you can use the toJSON method. It will require further processing depending on the kind of reference df you have:
import json
df_json = df.toJSON().map(lambda x: json.loads(x))
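If you are working in Scala rather than PySpark, as in the other questions here, a rough equivalent with a Scala UDF might look like this (a sketch; the lookup map and the ids column name come from the example above):
import org.apache.spark.sql.functions.{col, udf}

// Hard-coded lookup for illustration; in practice it could be collected from the reference df.
val mapDict: Map[String, Int] = Map("A" -> 1, "B" -> 2, "C" -> 3, "D" -> 4)

// Maps each id in the array; throws if an id is missing from the map.
val arrayMapUdf = udf((ids: Seq[String]) => ids.map(mapDict))

val mapped = df.withColumn("mapped_array", arrayMapUdf(col("ids")))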
I am writing a Spark project using Scala in which I need to make some calculations from "demo" datasets. I am using the Databricks platform.
I need to pass the 2nd column of my Dataframe (trainingCoordDataFrame) into a list. The type of the list must be List[Int].
The dataframe is as shown below:
> +---+---+---+---+
> |_c0|_c1|_c2|_c3|
> +---+---+---+---+
> |1 |0 |0 |a |
> |11 |9 |1 |a |
> |12 |2 |7 |c |
> |13 |2 |9 |c |
> |14 |2 |4 |b |
> |15 |1 |3 |c |
> |16 |4 |6 |c |
> |17 |3 |5 |c |
> |18 |5 |3 |a |
> |2 |0 |1 |a |
> |20 |8 |9 |c |
> |3 |1 |0 |b |
> |4 |3 |4 |b |
> |5 |8 |7 |b |
> |6 |4 |9 |b |
> |7 |2 |5 |a |
> |8 |1 |9 |a |
> |9 |3 |6 |a |
> +---+---+---+---+
I am trying to create the list I want using the following command:
val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => (each.getAs[Int]("_c1"))).toList
The error message I get is this:
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Integer
Note that the procedure is:
1) Upload the dataset from local PC to databricks (so no standard data can be used).
val mainDataFrame = spark.read.format("csv").option("header", "false").load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
2) Create the dataframe (step one: split the main DataFrame randomly; step two: remove the unnecessary columns).
val Array(trainingDataFrame,testingDataFrame) = mainDataFrame.randomSplit(Array(0.8,0.2)) //step one
val trainingCoordDataFrame = trainingDataFrame.drop("_c0", "_c3") //step two
3) Create the list. <- Here is the failing command.
What is the correct way to reach the result I want?
I think there are several ways to deal with this problem.
A) Define a schema for your CSV:
For example:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val customSchema = StructType(Array(
  StructField("_c0", IntegerType),
  StructField("_c1", IntegerType),
  StructField("_c2", IntegerType),
  StructField("_c3", StringType)))
When you read the CSV, add the schema option with the StructType we created earlier:
val mainDataFrame = spark.read.format("csv").option("header", "false").schema(customSchema).load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
Now if we look at the output of the mainDataFrame.printSchema() command we'll see that the columns are typed according to your use case:
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: string (nullable = true)
This means we can actually run your original command without getting an error.
trainingCoordDataFrame.select("_c2").map(r => r.getInt(0)).collect.toList
B) Cast the entire column to Int
Refer to the column itself instead of the column name and then cast the column to IntegerType. Now that the column type is Int you can again use getInt where it failed earlier:
trainingCoordDataFrame.select($"_c2".cast(IntegerType)).map(r => r.getInt(0)).collect.toList
C) Cast each value individually
Use map to retrieve each individual value as a String and then cast it to Int:
trainingCoordDataFrame.select("_c2").map(r => r.getString(0).toInt).collect.toList
The column's values are of type String, so read each one as a String and use Scala's String.toInt method. A cast is definitely the wrong tool here.
val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => each.getAs[String]("_c1").toInt).toList
Or use the Dataset API with a custom schema, e.g. with tuples.
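A minimal sketch of that Dataset route, assuming the custom schema from option A is already applied so the remaining columns are integers:
import spark.implicits._

val trainingCoordList: List[Int] = trainingCoordDataFrame
  .as[(Int, Int)]   // _c1 and _c2 as a typed tuple
  .map(_._2)        // keep the second field
  .collect()
  .toList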
I am learning Spark and Scala, and was experimenting in the spark REPL.
When I try to convert a List to a DataFrame, it works as follows:
val convertedDf = Seq(1,2,3,4).toDF("Field1")
However, when I try to convert a list of lists to a DataFrame with two columns (Field1, Field2), it fails with the error message
java.lang.IllegalArgumentException: requirement failed: The number of
columns doesn't match
The failing command is:
val twoColumnDf = Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).toDF("Field1", "Field2")
How to convert such a List of Lists to a DataFrame in Scala?
If you want each element of each sequence to end up in its own row of the respective column, then the following are your options.
zip
zip both sequences and then apply toDF as
val twoColumnDf =Seq(1,2,3,4,5).zip(Seq(5,4,3,2,3)).toDF("Field1", "Field2")
which should give you twoColumnDf as
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
zipped
Another, better way is to use zipped as
val threeColumnDf = (Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).zipped.toList.toDF("Field1", "Field2", "field3")
which should give you
+------+------+------+
|Field1|Field2|field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
But zipped works only for a maximum of three sequences (thanks to @Shaido for pointing that out).
Note: the number of rows is determined by the shortest sequence present.
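For example, zipping sequences of unequal length silently drops the extra elements:
// Only two rows are produced because the shorter sequence has two elements.
Seq(1, 2, 3).zip(Seq(5, 4)).toDF("Field1", "Field2").show(false)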
transpose
transpose combines all the sequences just as zip and zipped do, but it returns lists instead of tuples, so a little extra work is needed as
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).transpose.map{case List(a,b) => (a, b)}.toDF("Field1", "Field2")
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
and
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).transpose.map{case List(a,b,c) => (a, b, c)}.toDF("Field1", "Field2", "Field3")
+------+------+------+
|Field1|Field2|Field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
and so on ...
Note: transpose requires all sequences to be of the same length.
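Unlike zip, transpose does not truncate: with sequences of unequal length it throws an IllegalArgumentException, so it can be worth guarding up front, for example
val rows = Seq(Seq(1, 2, 3, 4, 5), Seq(5, 4, 3, 2, 3))
// Fail fast with a clear message if the sequences are ragged.
require(rows.map(_.length).distinct.size == 1, "all sequences must have the same length")
rows.transpose.map { case List(a, b) => (a, b) }.toDF("Field1", "Field2")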
I hope the answer is helpful
By default, each element is considered to be a Row of the DataFrame.
If you want each of the Seqs to be a different column, you need to group them inside a Tuple:
val twoColumnDf = Seq((Seq(1,2,3,4,5), Seq(5,4,3,2,3))).toDF("Field1", "Field2")
twoColumnDf.show
+---------------+---------------+
| Field1| Field2|
+---------------+---------------+
|[1, 2, 3, 4, 5]|[5, 4, 3, 2, 3]|
+---------------+---------------+
I am trying to filter out table rows based on a column value.
I have a dataframe:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |1 |
|3 |0 |
|4 |1 |
|4 |0 |
|4 |0 |
+---+-----+
I want to create a new dataframe deleting all rows with value!=0:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |0 |
|4 |0 |
|4 |0 |
+---+-----+
I figured the syntax should be something like this but couldn't get it right:
val newDataFrame = OldDataFrame.filter($"value"==0)
The correct way is as follows. You just forgot to add one = sign:
val newDataFrame = OldDataFrame.filter($"value" === 0)
There are various ways you can do the filtering:
val newDataFrame = OldDataFrame.filter($"value" === 0)
val newDataFrame = OldDataFrame.filter(OldDataFrame("value") === 0)
val newDataFrame = OldDataFrame.filter("value = 0")
You can also use the where function instead of filter.
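For example, where is just an alias for filter:
val newDataFrame = OldDataFrame.where($"value" === 0)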