Pyspark union column values from comma-separated string into array - pyspark

I have a PySpark requirement to union multiple dataframes on id. Each dataframe has a column containing comma-separated strings, i.e.
df1=[("1", "a,b,c,"),
("2", "i,j,k"),
("3", "x,y,z")]
df2=[("1", "b,d,e"),
("2", "l,m,n"),
("3", "x")]
Now I want to union this column's values for each id together, i.e.
df3 = [("1", "a,b,c,d,e"),
       ("2", "i,j,k,l,m,n"),
       ("3", "x,y,z")]
Is there a function to do that?

What you are looking for is the array_union function.
from pyspark.sql import functions as F

data1 = [
    ('1', 'a,b,c'),
    ('2', 'i,j,k'),
    ('3', 'x,y,z')
]
data2 = [
    ('1', 'b,d,e'),
    ('2', 'l,m,n'),
    ('3', 'x')
]
df1 = spark.createDataFrame(data1, ['id', 'values1'])
df2 = spark.createDataFrame(data2, ['id', 'values2'])

# split each comma-separated string into an array, take the set union of the
# two arrays, then join the result back into a comma-separated string
df = df1.join(df2, 'id') \
    .select('id',
            F.array_join(
                F.array_union(F.split('values1', ','), F.split('values2', ',')),
                ',').alias('values'))
df.show(truncate=False)
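For the sample data above, df.show() prints something like the following (row order may vary), which matches the df3 in the question:
+---+-----------+
|id |values     |
+---+-----------+
|1  |a,b,c,d,e  |
|2  |i,j,k,l,m,n|
|3  |x,y,z      |
+---+-----------+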

Related

Run SQL query on dataframe via Pyspark

I would like to run a SQL query on a dataframe, but do I have to create a view on this dataframe first?
Is there any easier way?
df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1)
]).toDF('id', 'foo', 'bar')
I want to run some complex queries against this dataframe.
For example, I can do
df.createOrReplaceTempView("temp_view")
df_new = spark.sql("select id, max(foo) from temp_view group by id")
but do I have to convert it to view first before querying it?
I know there is a dataframe equivalent operation.
The above query is only an example.
You can just do
df.select('id', 'foo')
This will return a new Spark DataFrame with columns id and foo.
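And for the aggregation in the question, a sketch of the same query with the DataFrame API, without creating any temp view (using the df defined above; the max_foo alias is just illustrative), would be:
from pyspark.sql import functions as F

# equivalent of: select id, max(foo) from temp_view group by id
df_new = df.groupBy('id').agg(F.max('foo').alias('max_foo'))
df_new.show()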

Compare 2 pyspark dataframe columns and change values of another column based on it

I have a problem where I have generated a dataframe from a graph algorithm that I have written. The thing is that I want the underlying component id to stay essentially the same after every run of the graph code.
This is a sample dataframe generated:
df = spark.createDataFrame(
    [
        (1, 'A1'),
        (1, 'A2'),
        (1, 'A3'),
        (2, 'B1'),
        (2, 'B2'),
        (3, 'B3'),
        (4, 'C1'),
        (4, 'C2'),
        (4, 'C3'),
        (4, 'C4'),
        (5, 'D1'),
    ],
    ['old_comp_id', 'db_id']
)
After another run the values change completely, so the new run has values like these,
df2 = spark.createDataFrame(
    [
        (2, 'A1'),
        (2, 'A2'),
        (2, 'A3'),
        (3, 'B1'),
        (3, 'B2'),
        (3, 'B3'),
        (1, 'C1'),
        (1, 'C2'),
        (1, 'C3'),
        (1, 'C4'),
        (4, 'D1'),
    ],
    ['new_comp_id', 'db_id']
)
So what I need to do is compare the values between the above two dataframes and update the component id based on the associated database id:
if the db_ids are the same, then update the component id to be the one from the 1st dataframe
if they are different, then assign a completely new comp_id (new_comp_id = max(old_comp_id) + 1)
This is what I have come up with so far:
old_ids = df.groupBy("old_comp_id").agg(F.collect_set(F.col("db_id")).alias("old_db_id"))
new_ids = df2.groupBy("new_comp_id").agg(F.collect_set(F.col("db_id")).alias("new_db_id"))
joined = new_ids.join(old_ids, old_ids.old_comp_id == new_ids.new_comp_id, "outer")
joined.withColumn(
    "update_comp",
    F.when(F.col("new_db_id") == F.col("old_db_id"), F.col("old_comp_id"))
     .otherwise(F.max(F.col("old_comp_id") + 1))
).show()
In order to use aggregated functions on non-aggregated columns, you should use window functions.
First, you outer-join the DFs on the db_id:
from pyspark.sql.functions import when, col, max

# df2's id column is also named db_id, so rename it first to avoid ambiguity after the join
df2_renamed = df2.withColumnRenamed("db_id", "new_db_id")
joinedDF = df.join(df2_renamed, df["db_id"] == df2_renamed["new_db_id"], "outer")
Then, start building the Window, where you partition by db_id and order by old_comp_id descending, so that the first rows hold the old_comp_id with the highest value.
from pyspark.sql.window import Window
from pyspark.sql.functions import desc

windowSpec = Window \
    .partitionBy("db_id") \
    .orderBy(desc("old_comp_id")) \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
Then, you build the max column using the windowSpec
from pyspark.sql.functions import max
maxCompId = max(col("old_comp_id")).over(windowSpec)
Then, you apply it in the select:
joinedDF.select(
    col("db_id"),
    when(col("new_db_id").isNotNull(), col("old_comp_id"))
        .otherwise(maxCompId + 1)
        .alias("updated_comp")
).show()
For more information, please refer to the documentation (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window)
Hope this helps

Pyspark isin with column in argument doesn't exclude rows

I need to exclude rows which don't have a True value in the column status.
In my opinion, this filter(isin() == False) structure should solve my problem, but it doesn't.
df = sqlContext.createDataFrame(
    [("A", "True"), ("A", "False"), ("B", "False"), ("C", "True")],
    ("name", "status"))
df.registerTempTable("df")
df_t = df[df.status == "True"]

from pyspark.sql import functions as sf
df_f = df.filter(df.status.isin(df_t.name) == False)
I expect row:
B | False
any help is greatly appreciated!
First, I think in your last statement, you meant to use df.name instead of df.status.
df_f = df.filter(df.status.isin(df_t.name)== False)
Second, even if you use df.name, it still won't work, because it mixes columns (Column type) from two different DataFrames, i.e. df_t and df, in your final statement. I don't think this works in PySpark.
However, you can achieve the same effect using other methods.
If I understand correctly, you want to select 'A' and 'C' first through the 'status' column, then select the rows excluding ['A', 'C']. The thing here is to extend the selection to the second row of 'A', which can be achieved with a Window. See below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
# create an auxiliary column satisfying the condition
df = df.withColumn("flag", F.when(df['status']=="True", 1).otherwise(0))
df.show()
# extend the selection to other rows with the same 'name'
df = df.withColumn('flag', F.max(df['flag']).over(Window.partitionBy('name')))
df.show()
#filter is now easy
df_f = df.filter(df.flag==0)
df_f.show()
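For this sample data, df_f should contain just the B row (exact show() formatting aside):
+----+------+----+
|name|status|flag|
+----+------+----+
|   B| False|   0|
+----+------+----+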

pyspark generate rdd row wise using another field as a source

Input RDD
--------------------
A,123|124|125|126
B,123|124|125|126
From this RDD I need to generate another one in the below format:
Output RDD
--------------------
A,123
A,124
A,125
A,126
B,123
B,124
B,125
B,126
x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
def f(x): return x
x.flatMapValues(f).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
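Applied to the input in the question, a sketch (assuming the source RDD holds raw lines like 'A,123|124|125|126' that still need splitting) would be:
rdd = sc.parallelize(["A,123|124|125|126", "B,123|124|125|126"])

# split each line into (key, list of values), then flatten the values
pairs = rdd.map(lambda line: line.split(",", 1)) \
           .map(lambda kv: (kv[0], kv[1].split("|")))

pairs.flatMapValues(lambda v: v).collect()
# [('A', '123'), ('A', '124'), ('A', '125'), ('A', '126'),
#  ('B', '123'), ('B', '124'), ('B', '125'), ('B', '126')]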

How to sort by count and retain unique items in value

I have a dataframe with 2 columns, of the form
col1 col2
k1 'a'
k2 'b'
k1 'a'
k1 'c'
k2 'c'
k1 'b'
k1 'b'
k2 'c'
k1 'b'
I want the output to be
k1 ['b', 'a', 'c']
k2 ['c', 'b']
So the unique set of entries, sorted by the number of times each entry occurs (in descending order). In the above example, 'b' is associated with k1 thrice, 'a' twice, and 'c' once.
How do I go about doing this?
groupBy($"col1").count()
only looks at the number of times the entries in col1 occur, but that's not what I'm looking for.
You can do the following:
for each key and column value, calculate the count
for each key, calculate a list with all related column values and their counts
use udf to sort the list and drop the counts
Like this (in Scala):
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// sorts an array of (count, col2) structs by count descending and keeps only col2
val sort_by_count_udf = udf {
  arr: mutable.WrappedArray[Row] =>
    arr.map {
      case Row(count: Long, col2: String) => (count, col2)
    }.sortBy(-_._1).map { case (count, col2) => col2 }
}

val df = List(("k1", "a"),
  ("k1", "a"), ("k1", "c"), ("k1", "b"),
  ("k2", "b"), ("k2", "c"), ("k2", "c"),
  ("k1", "b"), ("k1", "b"))
  .toDF("col1", "col2")

val grouped = df
  .groupBy("col1", "col2")
  .count()
  .groupBy("col1")
  .agg(collect_list(struct("count", "col2")).as("list"))

grouped.withColumn("list_ordered", sort_by_count_udf(col("list"))).show
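For the sample data, list_ordered comes out as [b, a, c] for k1 and [c, b] for k2; note that the order of elements inside the unsorted list column itself is not guaranteed.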
Here's one (not so pretty) solution using only built-in functions:
df.groupBy($"col1" , $"col2")
.agg(count($"col2").alias("cnt") )
.groupBy($"col1")
.agg(sort_array(collect_list(struct(-$"cnt", $"col2"))).as("list"))
.withColumn("x" , $"list".getItem("col2") )
.show(false)
Since sort_array sorts the elements in ascending order according to their natural ordering, -$"cnt" helps us get the elements sorted in descending order based on their count. getItem is used to get the value of col2 from the struct.
Output:
+----+------------------------+---------+
|col1|list |x |
+----+------------------------+---------+
|k2 |[[-2,c], [-1,b]] |[c, b] |
|k1 |[[-3,b], [-2,a], [-1,c]]|[b, a, c]|
+----+------------------------+---------+