I have a dataframe (around 20,000,000 rows) and I'd like to drop duplicates based on two columns when those columns hold the same pair of values, even if the values appear in reverse order.
For example the original dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 1| B|
| 2| 1| C|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
| 4| 3| G|
+----+----+----+
where the schema of the columns is as follows:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
The desired dataframe should look like:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
+----+----+----+
The dropDuplicates() method only removes duplicates when the values appear in the same order.
I followed the accepted answer to the question "Pandas: remove reverse duplicates from dataframe", but it took too much time.
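For illustration, a minimal sketch (assuming an active SparkSession named spark) of the behaviour described above, where dropDuplicates() keeps both orderings of a pair:
# Two rows whose pair values are reversed
pairs = spark.createDataFrame([("2", "1", "C"), ("1", "2", "D")], ["col1", "col2", "col3"])
# Both rows survive, because ("2", "1") and ("1", "2") are different keys
pairs.dropDuplicates(["col1", "col2"]).show()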
You can use this:
Hope this helps.
Note: in col3, 'D' will be removed instead of 'C', because 'C' is positioned before 'D' in the original data.
from pyspark.sql import functions as F

df = spark.read.csv('/FileStore/tables/stack2.csv', header='True')
# Normalize each pair so reversed values share the same key, then deduplicate on it
df2 = df.select(F.least(df.col1, df.col2).alias('col1'),
                F.greatest(df.col1, df.col2).alias('col2'), df.col3)
df2.dropDuplicates(['col1', 'col2']).show()
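As a quick sanity check, the same idea can be tried on the example rows built in memory (a sketch, assuming an active SparkSession named spark). It also illustrates the note above: in practice the first occurrence ('C') is usually kept, but dropDuplicates does not guarantee which of the tied rows survives.

from pyspark.sql import functions as F

# Example rows from the question, built in memory instead of reading the CSV
rows = [("1", "1", "A"), ("1", "1", "B"), ("2", "1", "C"), ("1", "2", "D"),
        ("3", "5", "E"), ("3", "4", "F"), ("4", "3", "G")]
sample = spark.createDataFrame(rows, ["col1", "col2", "col3"])

# Normalize each pair so reversed values share the same key, then deduplicate on it
deduped = sample.select(F.least("col1", "col2").alias("col1"),
                        F.greatest("col1", "col2").alias("col2"),
                        "col3").dropDuplicates(["col1", "col2"])
deduped.show()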
Trying to resolve a transformation within dataframes; any help is much appreciated.
In Scala (Spark 2.3.1): I have a dataframe whose columns hold arrays of strings and arrays of longs.
+------+---------+----------+---------+---------+
|userId| varA| varB| varC| varD|
+------+---------+----------+---------+---------+
| 1|[A, B, C]| [0, 2, 5]|[1, 2, 9]|[0, 0, 0]|
| 2|[X, Y, Z]|[1, 20, 5]|[9, 0, 6]|[1, 1, 1]|
+------+---------+----------+---------+---------+
I would like my output to be the dataframe below.
+------+---+---+---+---+
|userId| A| B| C| D|
+------+---+---+---+---+
| 1| A| 0| 1| 0|
| 1| B| 2| 2| 0|
| 1| C| 5| 9| 0|
| 2| X| 1| 9| 1|
| 2| Y| 20| 0| 1|
| 2| Z| 5| 6| 1|
+------+---+---+---+---+
I tried doing this using explode, but I get a Cartesian product. Is there a way to keep the record count at 6 rows instead of 18?
scala> val data = sc.parallelize(Seq("""{"userId": 1,"varA": ["A", "B", "C"], "varB": [0, 2, 5], "varC": [1, 2, 9], "varD": [0, 0, 0]}""","""{"userId": 2,"varA": ["X", "Y", "Z"], "varB": [1, 20, 5], "varC": [9, 0, 6], "varD": [1, 1, 1]}"""))
scala> val df = spark.read.json(data)
scala> df.show()
+------+---------+----------+---------+---------+
|userId| varA| varB| varC| varD|
+------+---------+----------+---------+---------+
| 1|[A, B, C]| [0, 2, 5]|[1, 2, 9]|[0, 0, 0]|
| 2|[X, Y, Z]|[1, 20, 5]|[9, 0, 6]|[1, 1, 1]|
+------+---------+----------+---------+---------+
scala>
scala> df.printSchema
root
|-- userId: long (nullable = true)
|-- varA: array (nullable = true)
| |-- element: string (containsNull = true)
|-- varB: array (nullable = true)
| |-- element: long (containsNull = true)
|-- varC: array (nullable = true)
| |-- element: long (containsNull = true)
|-- varD: array (nullable = true)
| |-- element: long (containsNull = true)
scala>
scala> val zip_str = udf((x: Seq[String], y: Seq[Long]) => x.zip(y))
zip_str: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,LongType,false)),true),Some(List(ArrayType(StringType,true), ArrayType(LongType,false))))
scala> val zip_long = udf((x: Seq[Long], y: Seq[Long]) => x.zip(y))
zip_long: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,LongType,false), StructField(_2,LongType,false)),true),Some(List(ArrayType(LongType,false), ArrayType(LongType,false))))
scala> df.withColumn("zip_1", explode(zip_str($"varA", $"varB"))).withColumn("zip_2", explode(zip_long($"varC", $"varD"))).select($"userId", $"zip_1._1".alias("A"),$"zip_1._2".alias("B"),$"zip_2._1".alias("C"),$"zip_2._2".alias("D")).show()
+------+---+---+---+---+
|userId| A| B| C| D|
+------+---+---+---+---+
| 1| A| 0| 1| 0|
| 1| A| 0| 2| 0|
| 1| A| 0| 9| 0|
| 1| B| 2| 1| 0|
| 1| B| 2| 2| 0|
| 1| B| 2| 9| 0|
| 1| C| 5| 1| 0|
| 1| C| 5| 2| 0|
| 1| C| 5| 9| 0|
| 2| X| 1| 9| 1|
| 2| X| 1| 0| 1|
| 2| X| 1| 6| 1|
| 2| Y| 20| 9| 1|
| 2| Y| 20| 0| 1|
| 2| Y| 20| 6| 1|
| 2| Z| 5| 9| 1|
| 2| Z| 5| 0| 1|
| 2| Z| 5| 6| 1|
+------+---+---+---+---+
scala>
Reference used:
https://intellipaat.com/community/17050/explode-transpose-multiple-columns-in-spark-sql-table
Something along the lines of combining posexplode and expr should work.
If we do the following:
df.select(
col("userId"),
posexplode("varA"),
col("varB"),
col("varC")
).withColumn(
"varB",
expr("varB[pos]")
).withColumn(
"varC",
expr("varC[pos]")
)
I am writing this from memory so I am not 100% sure. I will run a test later and update with Edit if I verify.
EDIT
The expression above works, except that one minor correction is needed (posexplode takes a Column, not a string) and varD has to be included. Updated expression:
df.select(col("userId"),posexplode(col("varA")),col("varB"),col("varC"), col("varD")).withColumn("varB",expr("varB[pos]")).withColumn("varC",expr("varC[pos]")).withColumn("varD",expr("varD[pos]")).show()
Output:
+------+---+---+----+----+----+
|userId|pos|col|varB|varC|varD|
+------+---+---+----+----+----+
| 1| 0| A| 0| 1| 0|
| 1| 1| B| 2| 2| 0|
| 1| 2| C| 5| 9| 0|
| 2| 0| X| 1| 9| 1|
| 2| 1| Y| 20| 0| 1|
| 2| 2| Z| 5| 6| 1|
+------+---+---+----+----+----+
You don't need UDFs; this can be achieved with Spark SQL's arrays_zip followed by explode:
df.select('userId,explode(arrays_zip('varA,'varB,'varC,'varD)))
.select("userId","col.varA","col.varB","col.varC","col.varD")
.show
output:
+------+----+----+----+----+
|userId|varA|varB|varC|varD|
+------+----+----+----+----+
| 1| A| 0| 1| 0|
| 1| B| 2| 2| 0|
| 1| C| 5| 9| 0|
| 2| X| 1| 9| 1|
| 2| Y| 20| 0| 1|
| 2| Z| 5| 6| 1|
+------+----+----+----+----+
Currently, the schema for my table is:
root
|-- product_id: integer (nullable = true)
|-- product_name: string (nullable = true)
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
I want to apply the schema below to the above table and delete all the rows that do not conform to it:
import org.apache.spark.sql.types._

val productsSchema = StructType(Seq(
  StructField("product_id", IntegerType, nullable = true),
  StructField("product_name", StringType, nullable = true),
  StructField("aisle_id", IntegerType, nullable = true),
  StructField("department_id", IntegerType, nullable = true)
))
Use option "DROPMALFORMED" while loading the data which ignores corrupted records.
spark.read.format("json")
.option("mode", "DROPMALFORMED")
.option("header", "true")
.schema(productsSchema)
.load("sample.json")
If the data does not match the schema, Spark puts null as the value in that column. We then just have to filter out the rows that are null in every column.
The filter below keeps rows that have at least one non-null column.
scala> "cat /tmp/sample.json".! // JSON File Data, one row is not matching with schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0
scala> schema.printTreeString
root
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
|-- product_id: long (nullable = true)
|-- product_name: string (nullable = true)
scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Load the JSON data; fields that do not match the schema come back as null, and fully malformed rows become all-null.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]
scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA |AAD |1 |sampleA |
|AAB |AADB |2 |sampleBB |
|CC |CCC |3 |sampleCC |
|dd |null |3 |sampledd |
|null |null |null |null |
+--------+-------------+----------+------------+
scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Keep rows with at least one non-null column, i.e. drop the all-null rows.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA |AAD |1 |sampleA |
|AAB |AADB |2 |sampleBB |
|CC |CCC |3 |sampleCC |
|dd |null |3 |sampledd |
+--------+-------------+----------+------------+
scala>
Do check out the na.drop functions on DataFrame: you can drop rows based on null values, on a minimum number of non-null values per row, and also based on specific columns that contain nulls.
scala> sc.parallelize(Seq((1,"a","a"),(1,"a","a"),(2,"b","b"),(3,"c","c"),(4,"d","d"),(4,"d",null))).toDF
res7: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]
scala> res7.show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
// drop a row if any null is found
scala> res7.na.drop.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
// drops rows that have fewer than 3 non-null values (minNonNulls = 3)
scala> res7.na.drop(minNonNulls = 3).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
// drops nothing, since every row has at least 2 non-null values
scala> res7.na.drop(minNonNulls = 2).show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
// drops rows that have a null in the `_3` column
scala> res7.na.drop(Seq("_3")).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
I am generating a query string dynamically as follows and passing it to selectExpr().
queryString=''''category_id as cat_id','category_department_id as cat_dpt_id','category_name as cat_name''''
df.selectExpr(queryString)
As per the documentation:
selectExpr(*expr) :
Projects a set of SQL expressions and returns a new DataFrame.
This is a variant of select() that accepts SQL expressions.
The issue is that the variable "queryString" is treated as a single string instead of three separate column expressions (and rightly so). The error is as follows:
: org.apache.spark.sql.catalyst.parser.ParseException:
.........
== SQL ==
'category_id as cat_id', 'category_department_id as cat_dpt_id', 'category_name as cat_name'
------------------------^^^
Is there any way I can pass the dynamically generated "queryString" as an argument of selectExpr()?
If possible, while generating your query string, try to put the individual column expressions in a list right away instead of concatenating them into one string.
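For example, a minimal sketch of that first approach, assuming the source column/alias pairs are available individually (the renames list below is hypothetical):
# Hypothetical column/alias pairs, assumed here for illustration
renames = [("category_id", "cat_id"),
           ("category_department_id", "cat_dpt_id"),
           ("category_name", "cat_name")]
# Build a list of SQL expressions instead of one concatenated string
exprs = ["{} as {}".format(src, dst) for src, dst in renames]
df.selectExpr(*exprs)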
If that is not possible, you can split your query string into separate column expressions, which can then be passed to selectExpr.
import numpy as np
import pandas as pd

# generate some dummy data
data = pd.DataFrame(np.random.randint(0, 5, size=(5, 3)), columns=list("abc"))
df = spark.createDataFrame(data)
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 1| 4|
| 1| 2| 1|
| 3| 3| 2|
| 3| 2| 2|
| 2| 0| 2|
+---+---+---+
# create example query string
query_string="'a as aa','b as bb','c as cc'"
# split and pass
column_expr = query_string.replace("'", "").split(",")
df.selectExpr(column_expr).show()
+---+---+---+
| aa| bb| cc|
+---+---+---+
| 1| 1| 4|
| 1| 2| 1|
| 3| 3| 2|
| 3| 2| 2|
| 2| 0| 2|
+---+---+---+
I have:
+---+-------+-------+
| id| var1| var2|
+---+-------+-------+
| a|[1,2,3]|[1,2,3]|
| b|[2,3,4]|[2,3,4]|
+---+-------+-------+
I want:
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
The solution provided by How to split a list to multiple columns in Pyspark?
df1.select('id', df1.var1[0], df1.var1[1], ...).show()
works, but some of my arrays are very long (max 332).
How can I write this so that it takes account of all length arrays?
This solution will work for your problem, no matter the number of initial columns or the size of your arrays. Moreover, if a column has arrays of different sizes (e.g. [1,2] and [3,4,5]), it will produce the maximum number of columns, with null values filling the gaps.
from pyspark.sql import functions as F

df = spark.createDataFrame(sc.parallelize([['a', [1,2,3], [1,2,3]], ['b', [2,3,4], [2,3,4]]]), ["id", "var1", "var2"])
columns = df.drop('id').columns
# Find the largest array size for each array column
df_sizes = df.select(*[F.size(col).alias(col) for col in columns])
df_max = df_sizes.agg(*[F.max(col).alias(col) for col in columns])
max_dict = df_max.collect()[0].asDict()
# Expand each array column into one column per index, up to its maximum size
df_result = df.select('id', *[df[col][i] for col in columns for i in range(max_dict[col])])
df_result.show()
>>>
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
val df = (Seq((1, "a", "10"),(1,"b", "12"),(1,"c", "13"),(2, "a", "14"),
(2,"c", "11"),(1,"b","12" ),(2, "c", "12"),(3,"r", "11")).
toDF("col1", "col2", "col3"))
So I have a spark dataframe with 3 columns.
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 10|
| 1| b| 12|
| 1| c| 13|
| 2| a| 14|
| 2| c| 11|
| 1| b| 12|
| 2| c| 12|
| 3| r| 11|
+----+----+----+
My requirement is that I need to perform two levels of groupBy, as explained below.
Level1:
If I do a groupBy on col1 and a sum of col3, I will get the two columns below:
1. col1
2. sum(col3)
I will lose col2 here.
Level2:
If I again group by col1 and col2 and do a sum of col3, I will get the 3 columns below:
1. col1
2. col2
3. sum(col3)
My requirement is to perform both levels of groupBy and have both columns (sum(col3) of level 1 and sum(col3) of level 2) in one final dataframe.
How can I do this, can anyone explain?
spark : 1.6.2
Scala : 2.10
One option is to do the two sums separately and then join them back:
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 3| r| 11.0| 11.0|
| 1| a| 10.0| 47.0|
+----+----+----------+----------+
Another option is to use window functions, given that sum_level1 is the sum of sum_level2 grouped by col1:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"col1")
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
withColumn("sum_level1", sum($"sum_level2").over(w)).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 1| a| 10.0| 47.0|
| 3| r| 11.0| 11.0|
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
+----+----+----------+----------+