Pyspark : select specific column with its position - pyspark

I would like to know how to select a specific column with its number but not with its name in a dataframe ?
Like this in Pandas:
df = df.iloc[:,2]
It's possible ?

You can always get the name of the column with df.columns[n] and then select it:
df = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
To select column at position n:
n = 1
df.select(df.columns[n]).show()
+---+
| b|
+---+
| 2|
| 4|
+---+
To select all but column n:
n = 1
You can either use drop:
df.drop(df.columns[n]).show()
+---+
| a|
+---+
| 1|
| 3|
+---+
Or select with manually constructed column names:
df.select(df.columns[:n] + df.columns[n+1:]).show()
+---+
| a|
+---+
| 1|
| 3|
+---+

Same solution as mirkhosro:
For a dataframe df, you can select the column n using df[n], where n is the index of the column.
Example:
df = df.filter(df[3]!=0)
will remove the rows of df, where the value in the fourth column is 0.
Note that you can check the columns using df.printSchema()

Related

Apache Spark calculating column value on the basis of distinct value of columns

I am processing the following tables and I would like to compute a new column (outcome) based on the distinct value of 2 other columns.
| id1 | id2 | outcome
| 1 | 1 | 1
| 1 | 1 | 1
| 1 | 3 | 2
| 2 | 5 | 1
| 3 | 1 | 1
| 3 | 2 | 2
| 3 | 3 | 3
The outcome should begin in incremental order starting from 1 based on the combined value of id1 and id2. Any hints how this can be accomplished in Scala. row_number doesn't seem to be useful here in this case.
The logic here is that for each unique value of id1 we will start numbering the outcome with min(id2) for corresponding id1 being assigned a value of 1.
You could try dense_rank()
with your example
val df = sqlContext
.read
.option("sep","|")
.option("header", true)
.option("inferSchema",true)
.csv("/home/cloudera/files/tests/ids.csv") // Here we read the .csv files
.cache()
df.show()
df.printSchema()
df.createOrReplaceTempView("table")
sqlContext.sql(
"""
|SELECT id1, id2, DENSE_RANK() OVER(PARTITION BY id1 ORDER BY id2) AS outcome
|FROM table
|""".stripMargin).show()
output
+---+---+-------+
|id1|id2|outcome|
+---+---+-------+
| 2| 5| 1|
| 1| 1| 1|
| 1| 1| 1|
| 1| 3| 2|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
+---+---+-------+
Use Window function to club(partition) them by first id and then order each partition based on second id.
Now you just need to assign a rank (dense_rank) over each Window partition.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
df
.withColumn("outcome", dense_rank().over(Window.partitionBy("id1").orderBy("id2")))

Show all pyspark columns after group and agg

I wish to groupby a column and then find the max of another column. Lastly, show all the columns based on this condition. However, when I used my codes, it only show 2 columns and not all of it.
# Normal way of creating dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
(2,2,'0-2'),
(2,23,'22-24')],
['a', 'b', 'c']
)
sdataframe_temp2 = spark.createDataFrame([
(4,6,'4-6'),
(5,7,'6-8')],
['a', 'b', 'c']
)
# Concat two different pyspark dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})
sdataframe_union_1_2_g.show()
output:
+---+------+
| a|max(b)|
+---+------+
| 5| 7|
| 2| 23|
| 4| 6|
+---+------+
Expected output:
+---+------+-----+
| a|max(b)| c |
+---+------+-----+
| 5| 7|6-8 |
| 2| 23|22-24|
| 4| 6|4-6 |
+---+------+---+
You can use a Window function to make it work:
Method 1: Using Window function
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("a").orderBy(F.desc("b"))
(sdataframe_union_1_2
.withColumn('max_val', F.row_number().over(w) == 1)
.where("max_val == True")
.drop("max_val")
.show())
+---+---+-----+
| a| b| c|
+---+---+-----+
| 5| 7| 6-8|
| 2| 23|22-24|
| 4| 6| 4-6|
+---+---+-----+
Explanation
Window functions are useful when we want to attach a new column to the existing set of columns.
In this case, I tell Window function to groupby partitionBy('a') column and sort the column b in descending order F.desc(b). This make the first value in b in each group its max value.
Then we use F.row_number() to filter the max values where row number equals 1.
Finally, we drop the new column since it is not being used after filtering the data frame.
Method 2: Using groupby + inner join
f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a','b'], how='inner').show()
+---+---+-----+
| a| b| c|
+---+---+-----+
| 2| 23|22-24|
| 5| 7| 6-8|
| 4| 6| 4-6|
+---+---+-----+

PySpark DataFrame multiply columns based on values in other columns

Pyspark newbie here. I have a dataframe, say,
+------------+-------+----+
| id| mode|count|
+------------+------+-----+
| 146360 | DOS| 30|
| 423541 | UNO| 3|
+------------+------+-----+
I want a dataframe with a new column aggregate with count * 2 , when mode is 'DOS' and count * 1 when mode is 'UNO'
+------------+-------+----+---------+
| id| mode|count|aggregate|
+------------+------+-----+---------+
| 146360 | DOS| 30| 60|
| 423541 | UNO| 3| 3|
+------------+------+-----+---------+
Appreciate your inputs and also some pointers to best practices :)
Method 1: using pyspark.sql.functions with when :
from pyspark.sql.functions import when,col
df = df.withColumn('aggregate', when(col('mode')=='DOS', col('count')*2).when(col('mode')=='UNO', col('count')*1).otherwise('count'))
Method 2: using SQL CASE expression with selectExpr:
df = df.selectExpr("*","CASE WHEN mode == 'DOS' THEN count*2 WHEN mode == 'UNO' THEN count*1 ELSE count END AS aggregate")
The result:
+------+----+-----+---------+
| id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS| 30| 60|
|423541| UNO| 3| 3|
+------+----+-----+---------+

Adding historical path feature to a PySpark dataframe

I have column 'Event' in my original dataframe, I want to add the other 2 columns.
Event
Event_lag
Hist_event
0
N
N
0
0
N0
1
0
N00
0
1
N001
from pyspark.sql.functions import lag, col, monotonically_increasing_id, collect_list, concat_ws
from pyspark.sql import Window
#sample data
df= sc.parallelize([[0], [0], [1], [0]]).toDF(["Event"])
#add row index to the dataframe
df = df.withColumn("row_idx", monotonically_increasing_id())
w = Window.orderBy("row_idx")
#add 'Event_Lag' column to the dataframe
df = df.withColumn("Event_Lag", lag(col('Event').cast('string')).over(w))
df = df.fillna({'Event_Lag':'N'})
#finally add 'Hist_Event' column to the dataframe and remove row index column (i.e. 'row_idx') to have the final result
df = df.withColumn("Hist_Event", collect_list(col('Event_Lag')).over(w)).\
withColumn("Hist_Event", concat_ws("","Hist_Event")).\
drop("row_idx")
df.show()
Sample input:
+-----+
|Event|
+-----+
| 0|
| 0|
| 1|
| 0|
+-----+
Output is:
+-----+---------+----------+
|Event|Event_Lag|Hist_Event|
+-----+---------+----------+
| 0| N| N|
| 0| 0| N0|
| 1| 0| N00|
| 0| 1| N001|
+-----+---------+----------+

Get distinct values of specific column with max of different columns

I have the following DataFrame
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
| B|null|null| 4|
| B|null| 2|null|
| B|null| 1|null|
| A| 4|null|null|
+----+----+----+----+
What I would like to do in Spark is to return all entries in col1 in the case it has a maximum value for one of the columns col2, col3 or col4.
This snippet won't do what I want:
df.groupBy("col1").max("col2","col3","col4").show()
And this one just gives the max only for one column (1):
df.groupBy("col1").max("col2").show()
I even tried to merge the single outputs by this:
//merge rows
val rows = test1.rdd.zip(test2.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
//merge schemas
val schema = StructType(test1.schema.fields ++ test2.schema.fields)
// create new df
val test3: DataFrame = sqlContext.createDataFrame(rows, schema)
where test1 and test2 are DataFramesdone with queries as (1).
So how do I achive this nicely??
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
+----+----+----+----+
Or even only the distinct values:
+----+
|col1|
+----+
| A|
| B|
| C|
+----+
Thanks in advance! Best
You can use some thing like below :-
sqlcontext.sql("select x.* from table_name x ,
(select max(col2) as a,max(col3) as b, max(col4) as c from table_name ) temp
where a=x.col2 or b= x.col3 or c=x.col4")
Will give the desired result.
It can be solved like this:
df.registerTempTable("temp")
spark.sql("SELECT max(col2) AS max2, max(col3) AS max3, max(col4) AS max4 FROM temp").registerTempTable("max_temp")
spark.sql("SELECT col1 FROM temp, max_temp WHERE col2 = max2 OR col3 = max3 OR col4 = max4").show