Given any df, I want to add a column called "has_duplicates" indicating whether each row is unique. Example input df:
val df = Seq((1, 2), (2, 5), (1, 7), (1, 2), (2, 5)).toDF("A", "B")
Given an input columns: Seq[String], I know how to get the count of each row:
val countsDf = df.withColumn("count", count("*").over(Window.partitionBy(columns.map(col(_)): _*)))
But I'm not sure how to use this to create a column expression for the final column indicating whether each row is unique.
Something like
def getEvaluationExpression(df: DataFrame): Column = {
  when(col("count") > 1, lit("fail")).otherwise(lit("pass"))
}
but the count needs to be evaluated on the spot using the query above.
Try the code below.
scala> df.withColumn("has_duplicates", when(count("*").over(Window.partitionBy(df.columns.map(col(_)): _*)) > 1 , lit("fail")).otherwise("pass")).show(false)
+---+---+--------------+
|A |B |has_duplicates|
+---+---+--------------+
|1 |7 |pass |
|1 |2 |fail |
|1 |2 |fail |
|2 |5 |fail |
|2 |5 |fail |
+---+---+--------------+
Or
scala> df.withColumn("count",count("*").over(Window.partitionBy(df.columns.map(col(_)): _*))).withColumn("has_duplicates", when($"count" > 1 , lit("fail")).otherwise("pass")).show(false)
+---+---+-----+--------------+
|A |B |count|has_duplicates|
+---+---+-----+--------------+
|1 |7 |1 |pass |
|1 |2 |2 |fail |
|1 |2 |2 |fail |
|2 |5 |2 |fail |
|2 |5 |2 |fail |
+---+---+-----+--------------+
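If you want to fold this back into a reusable helper like the getEvaluationExpression sketch in the question, something along these lines should work (a minimal sketch; hasDuplicatesExpr is just an illustrative name):
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, when}

// Build the pass/fail expression from a list of column names; the windowed
// count is evaluated on the spot, as asked in the question.
def hasDuplicatesExpr(columns: Seq[String]): Column =
  when(count("*").over(Window.partitionBy(columns.map(col): _*)) > 1, lit("fail"))
    .otherwise(lit("pass"))

val result = df.withColumn("has_duplicates", hasDuplicatesExpr(df.columns))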
What I have:
|ids     |items   |item_id|value|timestamp|
+--------+--------+-------+-----+---------+
|[A,B,C] |1.0     |1      |5    |100      |
|[A,B,D] |1.0     |2      |6    |90       |
|[D]     |0.0     |3      |7    |80       |
|[C]     |0.0     |4      |8    |80       |
+--------+--------+-------+-----+---------+
|ids     |id_num  |
+--------+--------+
|A       |1       |
|B       |2       |
|C       |3       |
|D       |4       |
+--------+--------+
What I want:
| ids |
+--------+
|[1,2,3] |
|[1,2,4] |
|[3] |
|[4] |
+--------+
Is there a way to do this without an explode? Thank you for your help!
You can use a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# Suppose this is the dictionary you want to map
map_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

def array_map(array_col):
    # If you prefer a list comprehension, you can return [map_dict[k] for k in array_col]
    return list(map(map_dict.get, array_col))

# ArrayType needs an element type; the mapped values are integers here
array_map_udf = udf(array_map, ArrayType(IntegerType()))

df = df.withColumn("mapped_array", array_map_udf(col("ids")))
I can't think of a different method, but if you need to build the dictionary from the reference df instead of hard-coding it, you can use the toJSON method. It will require further processing depending on the kind of reference df you have:
import json
df_json = df.toJSON().map(lambda x: json.loads(x))
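If you are working in Scala rather than PySpark, as in the other questions here, a rough equivalent with a Scala UDF might look like this (a sketch; the lookup map and the ids column name come from the example above):
import org.apache.spark.sql.functions.{col, udf}

// Hard-coded lookup for illustration; in practice it could be collected from the reference df.
val mapDict: Map[String, Int] = Map("A" -> 1, "B" -> 2, "C" -> 3, "D" -> 4)

// Maps each id in the array; throws if an id is missing from the map.
val arrayMapUdf = udf((ids: Seq[String]) => ids.map(mapDict))

val mapped = df.withColumn("mapped_array", arrayMapUdf(col("ids")))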
I am writing a Spark project using Scala in which I need to make some calculations from "demo" datasets. I am using the Databricks platform.
I need to pass the 2nd column of my Dataframe (trainingCoordDataFrame) into a list. The type of the list must be List[Int].
The dataframe is as shown below:
> +---+---+---+---+
> |_c0|_c1|_c2|_c3|
> +---+---+---+---+
> |1 |0 |0 |a |
> |11 |9 |1 |a |
> |12 |2 |7 |c |
> |13 |2 |9 |c |
> |14 |2 |4 |b |
> |15 |1 |3 |c |
> |16 |4 |6 |c |
> |17 |3 |5 |c |
> |18 |5 |3 |a |
> |2 |0 |1 |a |
> |20 |8 |9 |c |
> |3 |1 |0 |b |
> |4 |3 |4 |b |
> |5 |8 |7 |b |
> |6 |4 |9 |b |
> |7 |2 |5 |a |
> |8 |1 |9 |a |
> |9 |3 |6 |a |
> +---+---+---+---+
I am trying to create the list I want using the following command:
val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => (each.getAs[Int]("_c1"))).toList
The error message I get is this:
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Integer
Note that the procedure is:
1) Upload the dataset from local PC to databricks (so no standard data can be used).
val mainDataFrame = spark.read.format("csv").option("header", "false").load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
2) Create the dataframe (step one: split the main DataFrame randomly; step two: remove the unnecessary columns).
val Array(trainingDataFrame,testingDataFrame) = mainDataFrame.randomSplit(Array(0.8,0.2)) //step one
val trainingCoordDataFrame = trainingDataFrame.drop("_c0", "_c3") //step two
3) Create the list. <- Here is the failing command.
What is the correct way to reach the result I want?
I think there are several ways to deal with this problem.
A) Define a schema for your CSV:
For example:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val customSchema = StructType(Array(
  StructField("_c0", IntegerType),
  StructField("_c1", IntegerType),
  StructField("_c2", IntegerType),
  StructField("_c3", StringType)))
When you read the CSV, add the schema option with the StructType we created earlier:
val mainDataFrame = spark.read.format("csv").option("header", "false").schema(customSchema).load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
Now if we look at the output of the mainDataFrame.printSchema() command we'll see that the columns are typed according to your use case:
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: string (nullable = true)
This means we can actually run your original command without getting an error.
trainingCoordDataFrame.select("_c2").map(r => r.getInt(0)).collect.toList
B) Cast the entire column to Int
Refer to the column itself instead of the column name and then cast the column to IntegerType. Now that the column type is Int you can again use getInt where it failed earlier:
trainingCoordDataFrame.select($"_c2".cast(IntegerType)).map(r => r.getInt(0)).collect.toList
C) Cast each value individually
Use map to retrieve each individual value as a String and then cast it to Int:
trainingCoordDataFrame.select("_c2").map(r => r.getString(0).toInt).collect.toList
The column's values are of type String, so read each one as a String and use Scala's String.toInt method. A cast is definitely the wrong tool here.
val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => each.getAs[String]("_c1").toInt).toList
Or use the Dataset API with a custom schema, e.g. with tuples.
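A minimal sketch of that Dataset route, assuming the custom schema from option A is already applied so the remaining columns are integers:
import spark.implicits._

val trainingCoordList: List[Int] = trainingCoordDataFrame
  .as[(Int, Int)]   // _c1 and _c2 as a typed tuple
  .map(_._2)        // keep the second field
  .collect()
  .toList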
I am learning Spark and Scala, and was experimenting in the spark REPL.
When I try to convert a List to a DataFrame, it works as follows:
val convertedDf = Seq(1,2,3,4).toDF("Field1")
However, when I try to convert a list of lists to a DataFrame with two columns (Field1, Field2), it fails with the error message
java.lang.IllegalArgumentException: requirement failed: The number of
columns doesn't match
The failing command is:
val twoColumnDf = Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).toDF("Field1", "Field2")
How to convert such a List of Lists to a DataFrame in Scala?
If you want each element of each sequence to end up in its own row of the respective column, then the following are your options.
zip
zip both sequences and then apply toDF as
val twoColumnDf =Seq(1,2,3,4,5).zip(Seq(5,4,3,2,3)).toDF("Field1", "Field2")
which should give you twoColumnDf as
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
zipped
Another, better way is to use zipped as
val threeColumnDf = (Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).zipped.toList.toDF("Field1", "Field2", "field3")
which should give you
+------+------+------+
|Field1|Field2|field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
But zipped works only for a maximum of three sequences (thanks to @Shaido for pointing that out).
Note: the number of rows is determined by the shortest sequence present.
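For example, zipping sequences of unequal length silently drops the extra elements:
// Only two rows are produced because the shorter sequence has two elements.
Seq(1, 2, 3).zip(Seq(5, 4)).toDF("Field1", "Field2").show(false)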
transpose
transpose combines all the sequences just as zip and zipped do, but it returns lists instead of tuples, so a little extra work is needed as
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).transpose.map{case List(a,b) => (a, b)}.toDF("Field1", "Field2")
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
and
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).transpose.map{case List(a,b,c) => (a, b, c)}.toDF("Field1", "Field2", "Field3")
+------+------+------+
|Field1|Field2|Field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
and so on ...
Note: transpose requires all sequences to be of the same length.
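Unlike zip, transpose does not truncate: with sequences of unequal length it throws an IllegalArgumentException, so it can be worth guarding up front, for example
val rows = Seq(Seq(1, 2, 3, 4, 5), Seq(5, 4, 3, 2, 3))
// Fail fast with a clear message if the sequences are ragged.
require(rows.map(_.length).distinct.size == 1, "all sequences must have the same length")
rows.transpose.map { case List(a, b) => (a, b) }.toDF("Field1", "Field2")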
I hope the answer is helpful
By default, each element is considered to be a Row of the DataFrame.
If you want each of the Seqs to be a different column, you need to group them inside a Tuple:
val twoColumnDf = Seq((Seq(1,2,3,4,5), Seq(5,4,3,2,3))).toDF("Field1", "Field2")
twoColumnDf.show
+---------------+---------------+
| Field1| Field2|
+---------------+---------------+
|[1, 2, 3, 4, 5]|[5, 4, 3, 2, 3]|
+---------------+---------------+
I am trying to filter out table rows based on a column value.
I have a dataframe:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |1 |
|3 |0 |
|4 |1 |
|4 |0 |
|4 |0 |
+---+-----+
I want to create a new dataframe deleting all rows with value!=0:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |0 |
|4 |0 |
|4 |0 |
+---+-----+
I figured the syntax should be something like this but couldn't get it right:
val newDataFrame = OldDataFrame.filter($"value"==0)
The correct way is as follows. You just forgot to add one = sign:
val newDataFrame = OldDataFrame.filter($"value" === 0)
There are various ways you can do the filtering:
val newDataFrame = OldDataFrame.filter($"value" === 0)
val newDataFrame = OldDataFrame.filter(OldDataFrame("value") === 0)
val newDataFrame = OldDataFrame.filter("value = 0")
You can also use the where function instead of filter.
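For example, where is just an alias for filter:
val newDataFrame = OldDataFrame.where($"value" === 0)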