I am writing a Spark project using Scala in which I need to make some calculations on "demo" datasets. I am using the Databricks platform.
I need to pass the 2nd column of my DataFrame (trainingCoordDataFrame) into a list. The type of the list must be List[Int].
The dataframe is shown below:
> +---+---+---+---+
> |_c0|_c1|_c2|_c3|
> +---+---+---+---+
> |1 |0 |0 |a |
> |11 |9 |1 |a |
> |12 |2 |7 |c |
> |13 |2 |9 |c |
> |14 |2 |4 |b |
> |15 |1 |3 |c |
> |16 |4 |6 |c |
> |17 |3 |5 |c |
> |18 |5 |3 |a |
> |2 |0 |1 |a |
> |20 |8 |9 |c |
> |3 |1 |0 |b |
> |4 |3 |4 |b |
> |5 |8 |7 |b |
> |6 |4 |9 |b |
> |7 |2 |5 |a |
> |8 |1 |9 |a |
> |9 |3 |6 |a |
> +---+---+---+---+
I am trying to create the list I want using the following command:
val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => (each.getAs[Int]("_c1"))).toList
The error message is this:
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Integer
Note that the procedure is :
1) Upload the dataset from local PC to databricks (so no standard data can be used).
val mainDataFrame = spark.read.format("csv").option("header", "false").load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
2) Create dataframe. ( Step one: Split the main Dataframe randomly. Step two : Remove the unnecessary columns)
val Array(trainingDataFrame,testingDataFrame) = mainDataFrame.randomSplit(Array(0.8,0.2)) //step one
val trainingCoordDataFrame = trainingDataFrame.drop("_c0", "_c3") //step two
3) Create the list. <- Here is the failing command.
What is the correct way to reach the result I want?
I think there are several ways to deal with this problem.
A) Define a schema for your CSV:
For example:
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("_c0", IntegerType),
  StructField("_c1", IntegerType),
  StructField("_c2", IntegerType),
  StructField("_c3", StringType)))
When you read the CSV, add the schema option with the StructType we created earlier:
val mainDataFrame = spark.read.format("csv").option("header", "false").schema(customSchema).load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
Now if we look at the output of the mainDataFrame.printSchema() command we'll see that the columns are typed according to your use case:
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: string (nullable = true)
This means we can actually run your original command without getting an error.
trainingCoordDataFrame.select("_c2").map(r => r.getInt(0)).collect.toList
B) Cast the entire column to Int
Refer to the column itself instead of the column name and then cast the column to IntegerType. Now that the column type is Int you can again use getInt where it failed earlier:
trainingCoordDataFrame.select($"_c2".cast(IntegerType)).map(r => r.getInt(0)).collect.toList
C) Cast each value individually
Use map to retrieve each individual value as a String and then convert it to Int:
trainingCoordDataFrame.select("_c2").map(r => r.getString(0).toInt).collect.toList
The column's values are of type String, so read the column as String and use Scala's String.toInt method. A cast is definitely the wrong tool here.
val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => each.getAs[String]("_c1").toInt).toList
Or use the Dataset API with a custom schema, e.g. with tuples, as sketched below.
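For example, a minimal sketch assuming the customSchema from option A was applied when reading the CSV (so both remaining columns are already Int):
import spark.implicits._
// View the two remaining Int columns as a Dataset of (Int, Int) tuples
val trainingCoordDS = trainingCoordDataFrame.as[(Int, Int)]
// Take the first element of each tuple to obtain a List[Int]
val trainingCoordList: List[Int] = trainingCoordDS.map(_._1).collect().toList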
Related
I have the below data, which is stored in a csv file:
1|Roy|NA|2|Marry|4.6|3|Richard|NA|4|Joy|NA|5|Joe|NA|6|Jos|9|
Now I want to read the file and store it in a Spark dataframe. Before storing it in the dataframe, I want to split at every 3rd | and store each chunk as a row.
Output Expected :
1|Roy|NA|
2|Marry|4.6|
3|Richard|NA|
4|Joy|NA|
5|Joe|NA|
6|Jos|9|
Could anyone help me get the output like the above?
Start by reading your csv file
val df = spark.read.option("delimiter", "|").csv(file)
This will give you this dataframe
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4 |_c5|_c6|_c7 |_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |null|
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |null|
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |null|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
The last column is created because of the trailing delimiter in your csv file, so we get rid of it:
val dataframe = df.drop(df.schema.last.name)
dataframe.show(false)
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4 |_c5|_c6|_c7 |_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
Then you need to create an array containing the column names you want in your final dataframe:
val names : Array[String] = Array("colOne", "colTwo", "colThree")
Finally, you need a function that reads the columns three at a time:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

def splitCSV(dataFrame: DataFrame, columnNames: Array[String], sparkSession: SparkSession): DataFrame = {
  import sparkSession.implicits._
  val columns = dataFrame.columns
  // Start with an empty 3-column DataFrame and union each group of 3 columns into it
  var finalDF: DataFrame = Seq.empty[(String, String, String)].toDF(columnNames: _*)
  for (order <- 0 until columns.length by 3) {
    finalDF = finalDF.union(dataFrame.select(
      col(columns(order)).as(columnNames(0)),
      col(columns(order + 1)).as(columnNames(1)),
      col(columns(order + 2)).as(columnNames(2))))
  }
  finalDF
}
After we apply this function on the dataframe:
val finalDF = splitCSV(dataframe, names, sparkSession)
finalDF.show(false)
+------+-------+--------+
|colOne|colTwo |colThree|
+------+-------+--------+
|1 |Roy |NA |
|1 |Roy |NA |
|1 |Roy |NA |
|2 |Marry |4.6 |
|2 |Marry |4.6 |
|2 |Marry |4.6 |
|3 |Richard|NA |
|3 |Richard|NA |
|3 |Richard|NA |
|4 |Joy |NA |
|4 |Joy |NA |
|4 |Joy |NA |
|5 |Joe |NA |
|5 |Joe |NA |
|5 |Joe |NA |
|6 |Jos |9 |
|6 |Jos |9 |
|6 |Jos |9 |
+------+-------+--------+
You can use regex for most of it. There's no straightforward regex for "split at nth matching occurrence", so we work around it by using a match to pick out the pattern, then insert a custom splitter that we can then use.
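(A minimal setup sketch: the snippet below assumes ds is the raw file read as a single string column named value; the path is just a placeholder.)
import org.apache.spark.sql.functions._
import spark.implicits._
// Read the whole file as one string column called "value"
val ds = spark.read.text("/path/to/input.csv")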
ds
.withColumn("value",
regexp_replace('value, "([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|", "$1|$2|$3||")) // 1
.withColumn("value", explode(split('value, "\\|\\|"))) // 2
.where(length('value) > 0) // 3
Explanation
1) Replace every group of 3 |'s with the captured components, then terminate the group with ||
2) Split on each || and use explode to move each chunk to a separate row
3) Unfortunately, the split picks up the empty group at the end, so we filter it out
Output for your given input:
+------------+
|value |
+------------+
|1|Roy|NA |
|2|Marry|4.6 |
|3|Richard|NA|
|4|Joy|NA |
|5|Joe|NA |
|6|Jos|9 |
+------------+
Given any df, I want to add a column called "has_duplicates" to the df, with a boolean value indicating whether each row is unique. Example input df:
val df = Seq((1, 2), (2, 5), (1, 7), (1, 2), (2, 5)).toDF("A", "B")
Given input columns: Seq[String], I know how to get the count of each row:
val countsDf = df.withColumn("count", count("*").over(Window.partitionBy(columns.map(col(_)): _*)))
But I'm not sure how to use this to create a column expression for the final column indicating whether each row is unique.
Something like
def getEvaluationExpression(df: DataFrame): Column = {
  when(col("count") > 1, lit("fail")).otherwise(lit("pass"))
}
but the count needs to be evaluated on the spot using the query above.
Try the below code.
scala> df.withColumn("has_duplicates", when(count("*").over(Window.partitionBy(df.columns.map(col(_)): _*)) > 1 , lit("fail")).otherwise("pass")).show(false)
+---+---+--------------+
|A |B |has_duplicates|
+---+---+--------------+
|1 |7 |pass |
|1 |2 |fail |
|1 |2 |fail |
|2 |5 |fail |
|2 |5 |fail |
+---+---+--------------+
Or
scala> df.withColumn("count",count("*").over(Window.partitionBy(df.columns.map(col(_)): _*))).withColumn("has_duplicates", when($"count" > 1 , lit("fail")).otherwise("pass")).show(false)
+---+---+-----+--------------+
|A |B |count|has_duplicates|
+---+---+-----+--------------+
|1 |7 |1 |pass |
|1 |2 |2 |fail |
|1 |2 |2 |fail |
|2 |5 |2 |fail |
|2 |5 |2 |fail |
+---+---+-----+--------------+
I want to convert a DataFrame which contains Double values into a List so that I can use it to make calculations. What is your suggestion for getting a List of the correct type (i.e. Double)?
My approach is this:
var newList = myDataFrame.collect().toList
but it returns a List[org.apache.spark.sql.Row], and I don't know what that is exactly!
Is it possible to skip that step and simply pass my DataFrame into a function and do calculations with it? (For example, I want to compare the third element of its second column with a specific double. Is it possible to do so directly from my DataFrame?)
In any case, I have to understand how to create a List of the right type each time!
EDIT:
Input Dataframe:
+---+---+
|_c1|_c2|
+---+---+
|0 |0 |
|8 |2 |
|9 |1 |
|2 |9 |
|2 |4 |
|4 |6 |
|3 |5 |
|5 |3 |
|5 |9 |
|0 |1 |
|8 |9 |
|1 |0 |
|3 |4 |
|8 |7 |
|4 |9 |
|2 |5 |
|1 |9 |
|3 |6 |
+---+---+
Result after conversion:
List((0,0), (8,2), (9,1), (2,9), (2,4), (4,6), (3,5), (5,3), (5,9), (0,1), (8,9), (1,0), (3,4), (8,7), (4,9), (2,5), (1,9), (3,6))
But every element in the List has to be Double type.
You can cast the column you need to Double, convert the result to an RDD, and collect it.
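For example, a minimal sketch for one column (assuming the column names _c1/_c2 from the input above):
// Cast the string column to double, convert to an RDD of Double, and collect into a List
// Note: values that cannot be parsed become null and would need cleaning first (see the udf below)
val c1List: List[Double] = myDataFrame
  .select($"_c1".cast("double"))
  .rdd
  .map(_.getDouble(0))
  .collect()
  .toList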
If you have data that cannot be parsed, then you can use a udf to clean it before casting to double:
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DoubleType

val stringToDouble = udf((data: String) => {
  Try(data.toDouble) match {
    case Success(value) => value
    case Failure(exception) => Double.NaN
  }
})
val df = Seq(
("0.000","0"),
("0.000008","24"),
("9.00000","1"),
("-2","xyz"),
("2adsfas","1.1.1")
).toDF("a", "b")
.withColumn("a", stringToDouble($"a").cast(DoubleType))
.withColumn("b", stringToDouble($"b").cast(DoubleType))
After this you will get output as
+------+----+
|a |b |
+------+----+
|0.0 |0.0 |
|8.0E-6|24.0|
|9.0 |1.0 |
|-2.0 |NaN |
|NaN |NaN |
+------+----+
To get Array[(Double, Double)]
val result = df.rdd.map(row => (row.getDouble(0), row.getDouble(1))).collect()
The result will be Array[(Double, Double)]
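Since the question asks for a List rather than an Array, you can simply convert the collected result:
// Turn the driver-side Array into a List[(Double, Double)]
val resultList: List[(Double, Double)] = result.toList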
Convert the DataFrame to a Dataset using a case class and then convert it to a list.
It will return a list of objects of your case class type. All the fields inside the class (mapping to the columns in your table) will already be type-cast, so you won't need to type-cast every time.
Please execute the below code to check and verify (Scala):
val wa = Array("one", "two", "two")
val wr = sc.parallelize(wa, 3).map(x => (x, "x", 1))
val wdf = wr.toDF("a", "b", "c")
case class wc(a: String, b: String, c: Int)
val wds = wdf.as[wc]   // convert the DataFrame to a Dataset of the case class
val myList = wds.collect.toList
myList.foreach(x => println(x))
myList.foreach(x => println(x.a.getClass, x.b.getClass, x.c.getClass))
Or, assuming the two columns are already of type Double, collect them directly as a list of tuples:
myDataFrame.select("_c1", "_c2").collect().map(each => (each.getAs[Double]("_c1"), each.getAs[Double]("_c2"))).toList
I am learning Spark and Scala, and was experimenting in the spark REPL.
When I try to convert a List to a DataFrame, it works as follows:
val convertedDf = Seq(1,2,3,4).toDF("Field1")
However, when I try to convert a list of lists to a DataFrame with two columns (Field1, Field2), it fails with the error message:
java.lang.IllegalArgumentException: requirement failed: The number of
columns doesn't match
The code:
val twoColumnDf = Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).toDF("Field1", "Field2")
How to convert such a List of Lists to a DataFrame in Scala?
If you want each element of each sequence to end up in its own row of the respective column, then the following are the options for you.
zip
Zip both sequences and then apply toDF:
val twoColumnDf =Seq(1,2,3,4,5).zip(Seq(5,4,3,2,3)).toDF("Field1", "Field2")
which should give you twoColumnDf as
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
zipped
Another way is to use zipped:
val threeColumnDf = (Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).zipped.toList.toDF("Field1", "Field2", "field3")
which should give you
+------+------+------+
|Field1|Field2|field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
But zipped works only for a maximum of three sequences. Thanks for pointing that out, @Shaido.
Note: the number of rows is determined by the shortest sequence present
transpose
transpose combines all sequences as zip and zipped do, but it returns lists instead of tuples, so a little extra work is needed:
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).transpose.map{case List(a,b) => (a, b)}.toDF("Field1", "Field2")
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
and
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).transpose.map{case List(a,b,c) => (a, b, c)}.toDF("Field1", "Field2", "Field3")
+------+------+------+
|Field1|Field2|Field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
and so on ...
Note: transpose requires all sequences to be of the same length
I hope the answer is helpful
By default, each element is considered to be a Row of the DataFrame.
If you want each of the Seqs to be a different column, you need to group them inside a tuple:
val twoColumnDf =Seq((Seq(1,2,3,4,5), Seq(5,4,3,2,3))).toDF("Field1", "Field2")
twoColumnDf.show
+---------------+---------------+
| Field1| Field2|
+---------------+---------------+
|[1, 2, 3, 4, 5]|[5, 4, 3, 2, 3]|
+---------------+---------------+
I am trying to filter out table rows based on a column value.
I have a dataframe:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |1 |
|3 |0 |
|4 |1 |
|4 |0 |
|4 |0 |
+---+-----+
I want to create a new dataframe, deleting all rows with value != 0:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |0 |
|4 |0 |
|4 |0 |
+---+-----+
I figured the syntax should be something like this but couldn't get it right:
val newDataFrame = OldDataFrame.filter($"value"==0)
The correct way is as follows. You just forgot to add one more = sign:
val newDataFrame = OldDataFrame.filter($"value" === 0)
There are various ways by which you can do the filtering:
val newDataFrame = OldDataFrame.filter($"value"===0)
val newDataFrame = OldDataFrame.filter(OldDataFrame("value") === 0)
val newDataFrame = OldDataFrame.filter("value === 0")
You can also use the where function instead of filter.
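For example (where is just an alias for filter):
val newDataFrame = OldDataFrame.where($"value" === 0)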