Compare 2 pyspark dataframe columns and change values of another column based on it - pyspark

I have a problem where I have generated a dataframe from a graph algorithm that I have written. The thing is that I want to keep the value of the underlying component to be the same essentially after every run of the graph code.
This is a sample dataframe generated:
df = spark.createDataFrame(
[
(1, 'A1'),
(1, 'A2'),
(1, 'A3'),
(2, 'B1'),
(2, 'B2'),
(3, 'B3'),
(4, 'C1'),
(4, 'C2'),
(4, 'C3'),
(4, 'C4'),
(5, 'D1'),
],
['old_comp_id', 'db_id']
)
After another run the values change completely, so the new run has values like these,
df2 = spark.createDataFrame(
[
(2, 'A1'),
(2, 'A2'),
(2, 'A3'),
(3, 'B1'),
(3, 'B2'),
(3, 'B3'),
(1, 'C1'),
(1, 'C2'),
(1, 'C3'),
(1, 'C4'),
(4, 'D1'),
],
['new_comp_id', 'db_id']
)
So the thing I need to do is to compare the values between the above two dataframes and change the values of the component id based on the database id associated.
if the database_id are the same then update the component id to be from the 1st dataframe
if they are different then assign a completely new comp_id (new_comp_id = max(old_comp_id)+1)
This is what I have come up with so far:
old_ids = df.groupBy("old_comp_id").agg(F.collect_set(F.col("db_id")).alias("old_db_id"))
new_ids = df2.groupBy("new_comp_id").agg(F.collect_set(F.col("db_id")).alias("new_db_id"))
joined = new_ids.join(old_ids,old_ids.old_comp_id == new_ids.new_comp_id,"outer")
joined.withColumn("update_comp", F.when( F.col("new_db_id") == F.col("old_db_id"), F.col('old_comp_id')).otherwise(F.max(F.col("old_comp_id")+1))).show()

In order to use aggregated functions in non-aggregated columns, you should use Windowing Functions.
First, you outer-join the DFs with the db_id:
from pyspark.sql.functions import when, col, max
joinedDF = df.join(df2, df["db_id"] == df2["new_db_id"], "outer")
Then, start to building the Window (which where you group by db_id, and order by old_comp_id, in order to have in first rows the old_comp_id with highest value.
from pyspark.sql.window import Window
from pyspark.sql.functions import desc
windowSpec = Window\
.partitionBy("db_id")\
.orderBy(desc("old_comp_id"))\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
Then, you build the max column using the windowSpec
from pyspark.sql.functions import max
maxCompId = max(col("old_comp_id")).over(windowSpec)
Then, you apply it on the select
joinedDF.select(col("db_id"), when(col("new_db_id").isNotNull(), col("old_comp_id")).otherwise(maxCompId+1).alias("updated_comp")).show()
For more information, please refer to the documentation (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window)
Hope this helps

Related

Run SQL query on dataframe via Pyspark

I would like to run sql query on dataframe but do I have to create a view on this dataframe?
Is there any easier way?
df = spark.createDataFrame([
('a', 1, 1), ('a',1, None), ('b', 1, 1),
('c',1, None), ('d', None, 1),('e', 1, 1)
]).toDF('id', 'foo', 'bar')
and the query I want to run some complex queries against this dataframe.
For example
I can do
df.createOrReplaceTempView("temp_view")
df_new = pyspark.sql("select id,max(foo) from temp_view group by id")
but do I have to convert it to view first before querying it?
I know there is a dataframe equivalent operation.
The above query is only an example.
You can just do
df.select('id', 'foo')
This will return a new Spark DataFrame with columns id and foo.

Pyspark isin with column in argument doesn't exclude rows

I need to exclude rows which doesn't have True value in column Status.
In my opinion this filter( isin( )== False) structure should solve my problem but it doesn't.
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
df_t = df[df.status == "True"]
from pyspark.sql import functions as sf
df_f = df.filter(df.status.isin(df_t.name)== False)
I expect row:
B | False
any help is greatly appreciated!
First, I think in your last statement, you meant to use df.name instead of df.status.
df_f = df.filter(df.status.isin(df_t.name)== False)
Second, even if you use df.name, it still won't work.
Because it's mixing the columns (Column type) from two DataFrames, i.e. df_t and df in your final statement. I don't think this works in pyspark.
However, you can achieve the same effect using other methods.
If I understand correctly, you want to select 'A' and 'C' first through 'status' column, then select the rows excluding ['A', 'C']. The thing here is to extend the selection to the second row of 'A', which can be achieved by Window. See below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
# create an auxiliary column satisfying the condition
df = df.withColumn("flag", F.when(df['status']=="True", 1).otherwise(0))
df.show()
# extend the selection to other rows with the same 'name'
df = df.withColumn('flag', F.max(df['flag']).over(Window.partitionBy('name')))
df.show()
#filter is now easy
df_f = df.filter(df.flag==0)
df_f.show()

How to join two spark RDD

I have 2 spark RDD, the 1st one contains a mapping between some indices and ids which are strings and the 2nd one contains tuples of related indices
val ids = spark.sparkContext.parallelize(Array[(Int, String)](
(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))).toDF("index", "idx")
val relationships = spark.sparkContext.parallelize(Array[(Int, Int)](
(1, 3), (2, 3), (4, 5))).toDF("index1", "index2")
I want to join somehow these RDD (or merge or sql or any best spark practice) to have at the end related ids instead:
The result of my combined RDD should return:
("a", "c"), ("b", "c"), ("d", "e")
Any idea how I can achieve this operation in an optimal way without loading any of the RDD into a memory map (because in my scenarios, these RDD can potentially load millions of records)
You can approach this by creating a two views from DataFrame as following
relationships.createOrReplaceTempView("relationships");
ids.createOrReplaceTempView("ids");
Next run the following SQL query to generate the required result which performs inner join between relationships and ids view to generate the required result
import sqlContext.sql;
val result = spark.sql("""select t.index1, id.idx from
(select id.idx as index1, rel.index2
from relationships rel
inner join
ids id on rel.index1=id.index) t
inner join
ids id
on id.index=t.index2
""");
result.show()
Another approach using DataFrame without creating views
relationships.as("rel").
join(ids.as("ids"), $"ids.index" === $"rel.index1").as("temp").
join(ids.as("ids"), $"temp.index2"===$"ids.index").
select($"temp.idx".as("index1"), $"ids.idx".as("index2")).show

The questions about Spark dataframe operations [duplicate]

This question already has an answer here:
get TopN of all groups after group by using Spark DataFrame
(1 answer)
Closed 5 years ago.
if I create a dataframe like this:
val df1 = sc.parallelize(List((1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (1, 3), (2, 1), (2, 2), (2, 2), (2, 3)).toDF("key1","key2")
Then I group by "key1" and "key2", and count "key2".
val df2 = df1.groupBy("key1","key2").agg(count("key2") as "k").sort(col("k").desc)
My question is how to filter this dataframe and leave the top 2 num of the "k" from each "key1"?
if I don't use window functions ,what should I solve this problem?
This can be done using window-function, using row_number() (or also rank()/dense_rank(), depending on your requirements):
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
df2
.withColumn("rnb", row_number().over(Window.partitionBy($"key1").orderBy($"k".desc)))
.where($"rnb" <= 2).drop($"rnb")
.show()
EDIT:
Here a solution using RDD (which do not require a HiveContext):
df2
.rdd
.groupBy(_.getAs[Int]("key1"))
.flatMap{case (_,rows) => {
rows.toSeq
.sortBy(_.getAs[Long]("k")).reverse
.take(2)
.map{case Row(key1:Int,key2:Int,k:Long) => (key1,key2,k)}
}
}
.toDF("key1","key2","k")
.show()

efficiently using union in spark

I am new to scala and spark and now I have two RDD like A is [(1,2),(2,3)] and B is [(4,5),(5,6)] and I want to get RDD like [(1,2),(2,3),(4,5),(5,6)]. But thing is my data is large, suppose both A and B is 10GB. I use sc.union(A,B) but it is slow. I saw in spark UI there are 28308 tasks in this stage.
Is there more efficient way to do this?
Why don't you convert the two RDDs to dataframes and use union function.
Converting to dataframe is easy you just need to import sqlContext.implicits._ and apply .toDF() function with header names.
for example:
val sparkSession = SparkSession.builder().appName("testings").master("local").config("", "").getOrCreate()
val sqlContext = sparkSession.sqlContext
var firstTableColumns = Seq("col1", "col2")
var secondTableColumns = Seq("col3", "col4")
import sqlContext.implicits._
var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns:_*)
var secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)) .toDF(secondTableColumns: _*)
firstDF = firstDF.union(secondDF)
It should be very easy for you to work with dataframes than with RDDs. Changing dataframe to RDD is quite easy too, just call .rdd function
val rddData = firstDF.rdd