How to move a specific column of a pyspark dataframe to the start of the dataframe - pyspark

I have a pyspark dataframe as follows (this is just a simplified example, my actual dataframe has hundreds of columns):
col1,col2,......,col_with_fix_header
1,2,.......,3
4,5,.......,6
2,3,........,4
and I want to move col_with_fix_header to the start, so that the output looks as follows:
col_with_fix_header,col1,col2,............
3,1,2,..........
6,4,5,....
4,2,3,.......
I don't want to list all the columns in the solution.

Since you don't want to list all the columns of your dataframe, you can use the DataFrame property columns. It gives you a Python list of the column names, which you can slice and reorder:
df = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
    ("d", "David", 29),
    ("e", "Esther", 32),
    ("f", "Fanny", 36),
    ("g", "Gabby", 60)], ["id", "name", "age"])
df.select([df.columns[-1]] + df.columns[:-1]).show()
Output:
+---+---+-------+
|age| id| name|
+---+---+-------+
| 34| a| Alice|
| 36| b| Bob|
| 30| c|Charlie|
| 29| d| David|
| 32| e| Esther|
| 36| f| Fanny|
| 60| g| Gabby|
+---+---+-------+
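If the column you want in front is not conveniently the last one, the same columns trick works by filtering on the name instead of slicing by position. A minimal, self-contained sketch, reusing the column name from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small frame mirroring the question: some columns plus one fixed-name column.
df2 = spark.createDataFrame(
    [(1, 2, 3), (4, 5, 6), (2, 3, 4)],
    ["col1", "col2", "col_with_fix_header"])

fixed = "col_with_fix_header"
# Put the fixed column first, then every other column in its original order.
df2.select([fixed] + [c for c in df2.columns if c != fixed]).show()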

Select Data from multiple rows to one row

I'm pretty new to functional programming and pyspark, and I'm currently struggling to condense the data I want from my source data.
Let's say I have two tables as DataFrames:
from pyspark.sql import SparkSession

# create a SparkSession if one has not already been created for you
spark = SparkSession.builder.getOrCreate()
columns = ['Id', 'JoinId', 'Name']
vals = [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')]
persons = spark.createDataFrame(vals, columns)

columns = ['Id', 'JoinId', 'Specification', 'Date', 'Destination']
vals = [
    (1, 10, 'I', '20051205', 'New York City'),
    (2, 11, 'I', '19991112', 'Berlin'),
    (3, 11, 'O', '20030101', 'Madrid'),
    (4, 13, 'I', '20200113', 'Paris'),
    (5, 11, 'U', '20070806', 'Lissabon'),
]
movements = spark.createDataFrame(vals, columns)
persons.show()
+---+------+----------+
| Id|JoinId| Name|
+---+------+----------+
| 1| 11| FirstName|
| 2| 12|SecondName|
| 3| 13| ThirdName|
+---+------+----------+
movements.show()
+---+------+-------------+--------+-------------+
| Id|JoinId|Specification| Date| Destination|
+---+------+-------------+--------+-------------+
| 1| 10| I|20051205|New York City|
| 2| 11| I|19991112| Berlin|
| 3| 11| O|20030101| Madrid|
| 4| 13| I|20200113| Paris|
| 5| 11| U|20070806| Lissabon|
+---+------+-------------+--------+-------------+
What I want to create is
+--------+----------+---------+---------+-----------+
|PersonId|PersonName|    IDate|    ODate|Destination|
+--------+----------+---------+---------+-----------+
|       1| FirstName| 19991112| 20030101|     Berlin|
|       3| ThirdName| 20200113|         |      Paris|
+--------+----------+---------+---------+-----------+
The rules would be:
PersonId is the Id of the Person
IDate is the Date saved in the Movements DataFrame where Specification is I
ODate is the Date saved in the Movements DataFrame where Specification is O
The Destination is the Destination of the joined entry where the Specification was I
I have already joined the dataframes on JoinId:
from pyspark.sql.functions import col

joined = persons.withColumnRenamed('JoinId', 'P_JoinId') \
    .join(movements, col('P_JoinId') == movements.JoinId, how='inner')
joined.show()
+---+--------+---------+---+------+-------------+--------+-----------+
| Id|P_JoinId| Name| Id|JoinId|Specification| Date|Destination|
+---+--------+---------+---+------+-------------+--------+-----------+
| 1| 11|FirstName| 2| 11| I|19991112| Berlin|
| 1| 11|FirstName| 3| 11| O|20030101| Madrid|
| 1| 11|FirstName| 5| 11| U|20070806| Lissabon|
| 3| 13|ThirdName| 4| 13| I|20200113| Paris|
+---+--------+---------+---+------+-------------+--------+-----------+
But I'm struggling to select data from multiple rows and put them with the given rules into a single row...
Thank you for your help
Note: I have renamed the Id column in movements to Id_movements, to avoid confusion when grouping later.
You can pivot your joined data on Specification and aggregate Date and Destination; that gives you the date and destination per specification value.
import pyspark.sql.functions as F
persons = sqlContext.createDataFrame(
    [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')],
    schema=['Id', 'JoinId', 'Name'])
movements = sqlContext.createDataFrame(
    [(1, 10, 'I', '20051205', 'New York City'),
     (2, 11, 'I', '19991112', 'Berlin'),
     (3, 11, 'O', '20030101', 'Madrid'),
     (4, 13, 'I', '20200113', 'Paris'),
     (5, 11, 'U', '20070806', 'Lissabon')],
    schema=['Id_movements', 'JoinId', 'Specification', 'Date', 'Destination'])

df_joined = persons.withColumnRenamed('JoinId', 'P_JoinId') \
    .join(movements, F.col('P_JoinId') == movements.JoinId, how='inner')
df_pivot = df_joined.groupby(['Id', 'Name']).pivot('Specification') \
    .agg(F.min('Date').alias('date'), F.min('Destination').alias('destination'))
Here I have chosen the min aggregation, but you can pick whichever aggregation suits your need and drop the irrelevant columns afterwards.
Result:
+---+---------+--------+-------------+--------+-------------+--------+-------------+
| Id| Name| I_date|I_destination| O_date|O_destination| U_date|U_destination|
+---+---------+--------+-------------+--------+-------------+--------+-------------+
| 1|FirstName|19991112| Berlin|20030101| Madrid|20070806| Lissabon|
| 3|ThirdName|20200113| Paris| null| null| null| null|
+---+---------+--------+-------------+--------+-------------+--------+-------------+
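To get from the pivoted frame to the exact layout asked for in the question, one possible follow-up (not part of the original answer) is to select and rename the relevant columns; the I_*/O_* names come from the pivot output above, the aliases from the desired result:
df_result = df_pivot.select(
    F.col('Id').alias('PersonId'),
    F.col('Name').alias('PersonName'),
    F.col('I_date').alias('IDate'),
    F.col('O_date').alias('ODate'),
    F.col('I_destination').alias('Destination'))
df_result.show()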

Spark: get rows value from a Window

With Spark I defined a Window:
val window = Window
.partitionBy("myaggcol")
.orderBy("datefield")
.rowsBetween(-2, 0)
Then I can compute a new column from the window's rows, e.g.:
dataset
.withColumn("newcol", last("diffcol").over(window) - first("diffcol").over(window))
This will compute, for each row, the difference in "diffcol" between the current row and the row two before it (n-2).
Now my question: how can I get the "diffcol" of the n-1 row, neither the first nor the last but the intermediate one?
If I understand your question correctly, the window function lag would work better than rowsBetween, as shown in the following example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = Seq(
  ("a", 1, 100), ("a", 2, 200), ("a", 3, 300), ("a", 4, 400),
  ("b", 1, 500), ("b", 2, 600), ("b", 3, 700)
).toDF("c1", "c2", "c3")

val win = Window.partitionBy("c1").orderBy("c2")

df.
  withColumn("c3Diff1", $"c3" - coalesce(lag("c3", 1).over(win), lit(0))).
  withColumn("c3Diff2", $"c3" - coalesce(lag("c3", 2).over(win), lit(0))).
  show
// +---+---+---+-------+-------+
// | c1| c2| c3|c3Diff1|c3Diff2|
// +---+---+---+-------+-------+
// | b| 1|500| 500| 500|
// | b| 2|600| 100| 600|
// | b| 3|700| 100| 200|
// | a| 1|100| 100| 100|
// | a| 2|200| 100| 200|
// | a| 3|300| 100| 200|
// | a| 4|400| 100| 200|
// +---+---+---+-------+-------+
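For anyone following along in PySpark rather than Scala, a roughly equivalent sketch using pyspark.sql.functions.lag (same column names as the Scala example above):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 100), ("a", 2, 200), ("a", 3, 300), ("a", 4, 400),
     ("b", 1, 500), ("b", 2, 600), ("b", 3, 700)],
    ["c1", "c2", "c3"])

win = Window.partitionBy("c1").orderBy("c2")
# lag("c3", 1) is the value of c3 on the previous (n-1) row within the partition.
df.withColumn("c3Diff1", F.col("c3") - F.coalesce(F.lag("c3", 1).over(win), F.lit(0))) \
  .withColumn("c3Diff2", F.col("c3") - F.coalesce(F.lag("c3", 2).over(win), F.lit(0))) \
  .show()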

Flatten Group By in Pyspark

I have a pyspark dataframe. For example,
d = hiveContext.createDataFrame(
    [("A", 1), ("B", 2), ("D", 3), ("D", 3), ("A", 4), ("D", 3)],
    ["Col1", "Col2"])
+----+----+
|Col1|Col2|
+----+----+
| A| 1|
| B| 2|
| D| 3|
| D| 3|
| A| 4|
| D| 3|
+----+----+
I want to group by Col1 and then create a list of Col2. I need to flatten the groups. I do have a lot of columns.
+----+----------+
|Col1| Col2|
+----+----------+
| A| [1,4] |
| B| [2] |
| D| [3,3,3]|
+----+----------+
You can do a groupBy() and use collect_list() as your aggregate function:
import pyspark.sql.functions as f
d.groupBy('Col1').agg(f.collect_list('Col2').alias('Col2')).show()
#+----+---------+
#|Col1| Col2|
#+----+---------+
#| B| [2]|
#| D|[3, 3, 3]|
#| A| [1, 4]|
#+----+---------+
Update
If you had multiple columns to combine, you could use collect_list() on each, and then combine the resulting lists using struct() and udf(). Consider the following example:
Create Dummy Data
from operator import add
import pyspark.sql.functions as f
# create example dataframe
d = sqlcx.createDataFrame(
    [
        ("A", 1, 10),
        ("B", 2, 20),
        ("D", 3, 30),
        ("D", 3, 10),
        ("A", 4, 20),
        ("D", 3, 30)
    ],
    ["Col1", "Col2", "Col3"]
)
Collect Desired Columns into lists
Suppose you have a list of columns whose values you want to collect into lists. You could do the following:
cols_to_combine = ['Col2', 'Col3']
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine]).show()
#+----+---------+------------+
#|Col1| Col2| Col3|
#+----+---------+------------+
#| B| [2]| [20]|
#| D|[3, 3, 3]|[30, 10, 30]|
#| A| [4, 1]| [20, 10]|
#+----+---------+------------+
Combine Resultant Lists into one Column
Now we want to combine the list columns into one list. If we use struct(), we will get the following:
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', f.struct(*cols_to_combine).alias('Combined'))\
.show(truncate=False)
#+----+------------------------------------------------+
#|Col1|Combined |
#+----+------------------------------------------------+
#|B |[WrappedArray(2),WrappedArray(20)] |
#|D |[WrappedArray(3, 3, 3),WrappedArray(10, 30, 30)]|
#|A |[WrappedArray(1, 4),WrappedArray(10, 20)] |
#+----+------------------------------------------------+
Flatten Wrapped Arrays
Almost there. We just need to combine the WrappedArrays. We can achieve this with a udf():
from functools import reduce
from pyspark.sql.types import ArrayType, IntegerType

combine_wrapped_arrays = f.udf(lambda val: reduce(add, val), ArrayType(IntegerType()))
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', combine_wrapped_arrays(f.struct(*cols_to_combine)).alias('Combined'))\
.show(truncate=False)
#+----+---------------------+
#|Col1|Combined |
#+----+---------------------+
#|B |[2, 20] |
#|D |[3, 3, 3, 30, 10, 30]|
#|A |[1, 4, 10, 20] |
#+----+---------------------+
References
Pyspark Merge WrappedArrays Within a Dataframe
Update 2
A simpler way, without having to deal with WrappedArrays:
from functools import reduce
from operator import add
from pyspark.sql.types import ArrayType, IntegerType

combine_udf = lambda cols: f.udf(
    lambda *args: reduce(add, args),
    ArrayType(IntegerType())
)
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', combine_udf(cols_to_combine)(*cols_to_combine).alias('Combined'))\
.show(truncate=False)
#+----+---------------------+
#|Col1|Combined |
#+----+---------------------+
#|B |[2, 20] |
#|D |[3, 3, 3, 30, 10, 30]|
#|A |[1, 4, 10, 20] |
#+----+---------------------+
Note: This last step only works if the data types of all the columns are the same. You cannot use this function to combine wrapped arrays with mixed types.
From Spark 2.4 onward you can use pyspark.sql.functions.flatten instead of a udf. Note that flatten expects an array of arrays, so the snippet below assumes the column you collect (here Col2) is itself an array column:
import pyspark.sql.functions as f
df.groupBy('Col1').agg(f.flatten(f.collect_list('Col2')).alias('Col2')).show()
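Applied to the multi-column example from the earlier answer (reusing d, f and cols_to_combine defined there), a sketch of the same udf-free result on Spark 2.4+ would be:
# array() turns the two collected list columns into an array of arrays,
# which flatten() then concatenates into a single list per group.
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine]) \
    .select('Col1', f.flatten(f.array(*cols_to_combine)).alias('Combined')) \
    .show(truncate=False)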

Get the last element of a window in Spark 2.1.1

I have a dataframe in which I have subcategories, and want the last element of each of these subcategories.
val windowSpec = Window.partitionBy("name").orderBy("count")
sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1),
      ("A", 2),
      ("A", 3),
      ("B", 10),
      ("B", 20),
      ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", first("count").over(windowSpec))
  .withColumn("lastCountOfName", last("count").over(windowSpec))
  .show()
returns me something strange:
+----+-----+----------------+---------------+
|name|count|firstCountOfName|lastCountOfName|
+----+-----+----------------+---------------+
| B| 10| 10| 10|
| B| 20| 10| 20|
| B| 30| 10| 30|
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
+----+-----+----------------+---------------+
As we can see, the first value is computed correctly, but the last isn't; it is always the current value of the column.
Does anyone have a solution for this?
According to the issue SPARK-20969, you should be able to get the expected results by defining explicit bounds on your window, as shown below.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window
  .partitionBy("name")
  .orderBy("count")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1),
      ("A", 2),
      ("A", 3),
      ("B", 10),
      ("B", 20),
      ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", first("count").over(windowSpec))
  .withColumn("lastCountOfName", last("count").over(windowSpec))
  .show()
Alternatively, if you are ordering on the same column you are computing first and last over, you can switch to min and max with a non-ordered window; that should also work properly.
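For illustration, a PySpark sketch of that min/max alternative (same data as above; note the window spec has no ordering and no frame bounds):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 1), ("A", 2), ("A", 3), ("B", 10), ("B", 20), ("B", 30)],
    ["name", "count"])

# Without orderBy the window covers the whole partition, so min/max give the
# smallest and largest count per name, matching first/last under that ordering.
w = Window.partitionBy("name")
df.withColumn("firstCountOfName", F.min("count").over(w)) \
  .withColumn("lastCountOfName", F.max("count").over(w)) \
  .show()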

How to combine two spark data frames in sorted order

I want to combine two dataframes a and b into a dataframe c that is sorted on a column.
val a = Seq(("a", 1), ("c", 2), ("e", 3)).toDF("char", "num")
val b = Seq(("b", 4), ("d", 5)).toDF("char", "num")
val c = ??? // how do I sort on the char column?
Here is the result I want:
a.show() b.show() c.show()
+----+---+ +----+---+ +----+---+
|char|num| |char|num| |char|num|
+----+---+ +----+---+ +----+---+
| a| 1| | b| 4| | a| 1|
| c| 2| | d| 5| | b| 4|
| e| 3| +----+---+ | c| 2|
+----+---+ | d| 5|
| e| 3|
+----+---+
Simply put, you can use sort() on each dataframe and then union() them:
val a = Seq(("a", 1), ("c", 2), ("e", 3)).toDF("char", "num").sort($"char")
val b = Seq(("b", 4), ("d", 5)).toDF("char", "num").sort($"char")
val c = a.union(b).sort($"char")
If you want to union multiple dataframes, you can try this approach:
val df1 = sc.parallelize(List(
  (50, 2, "arjun"),
  (34, 4, "bob")
)).toDF("age", "children", "name")

val df2 = sc.parallelize(List(
  (51, 3, "jane"),
  (35, 5, "bob")
)).toDF("age", "children", "name")

val df3 = sc.parallelize(List(
  (50, 2, "arjun"),
  (34, 4, "bob")
)).toDF("age", "children", "name")

val result = Seq(df1, df2, df3)
val res_union = result.reduce(_ union _).sort($"age", $"name", $"children")
res_union.show()
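For completeness, the same reduce-over-union pattern in PySpark; a sketch assuming all the frames share the same schema:
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(50, 2, "arjun"), (34, 4, "bob")], ["age", "children", "name"])
df2 = spark.createDataFrame([(51, 3, "jane"), (35, 5, "bob")], ["age", "children", "name"])
df3 = spark.createDataFrame([(50, 2, "arjun"), (34, 4, "bob")], ["age", "children", "name"])

# Fold the list of frames into one DataFrame with union, then sort the result.
res_union = reduce(lambda a, b: a.union(b), [df1, df2, df3]).sort("age", "name", "children")
res_union.show()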