I have the following df in pyspark:
id   col1   col2
on   true   true
off  false  true
on   true   false
off  false  true
on   true   false
off  false  true
and I would like to summarize it as shown below:
col   id   true  false
col1  on   3     0
col1  off  0     3
col2  on   1     2
col2  off  3     0
where the integers represent the count() of true and false instances. I've tried using unpivot unsuccessfully. Any help will be appreciated.
You can also pivot your table (once per column, so a comprehension could be used here as well):
from pyspark.sql import functions as F
c1 = df.groupBy('id').pivot('col1').count().withColumn('col', F.lit('col1'))
c2 = df.groupBy('id').pivot('col2').count().withColumn('col', F.lit('col2'))
c1.union(c2).fillna(0).select('col', 'id', 'true', 'false').show()
+----+---+----+-----+
| col| id|true|false|
+----+---+----+-----+
|col1| on|   3|    0|
|col1|off|   0|    3|
|col2| on|   1|    2|
|col2|off|   3|    0|
+----+---+----+-----+
A few mechanics: a list comprehension can also be of help.
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
#Pivot: iterate over the non-id columns and crosstab each one. That gives a list of dfs
cols = df.drop('id').columns
lst = [df.crosstab('id', c).withColumn('col', F.lit(c)) for c in cols]
#Use unionByName to reduce the list of dfs into one df, renaming the first column of each df to 'id' first
df1 = reduce(DataFrame.unionByName, [y.withColumnRenamed(y.columns[0], 'id') for y in lst])
df1.show()
+---+-----+----+----+
| id|false|true| col|
+---+-----+----+----+
| on|    0|   3|col1|
|off|    3|   0|col1|
| on|    2|   1|col2|
|off|    0|   3|col2|
+---+-----+----+----+
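If you are on Spark 3.4+, the unpivot route mentioned in the question also works. A minimal sketch, assuming DataFrame.unpivot (also exposed as melt) is available in your version:
# unpivot col1/col2 into (col, val) pairs, then pivot the values and count them
long_df = df.unpivot(['id'], ['col1', 'col2'], 'col', 'val')
long_df.groupBy('col', 'id').pivot('val').count().fillna(0).show()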
Related
I have an all-string Spark dataframe and I need to return the columns in which all rows meet a certain criterion.
scala> val df = spark.read.format("csv").option("delimiter",",").option("header", "true").option("inferSchema", "true").load("file:///home/animals.csv")
df.show()
+--------+---------+--------+
|Column 1| Column 2|Column 3|
+--------+---------+--------+
|(ani)mal|   donkey|    wolf|
|  mammal|(mam)-mal|  animal|
| chi-mps|   chimps|    goat|
+--------+---------+--------+
Here the criterion is to return the columns where all row values have length == 6, ignoring special characters. The result should be the dataframe below, since all rows in Column 1 and Column 2 have length == 6:
+--------+---------+
|Column 1| Column 2|
+--------+---------+
|(ani)mal|   donkey|
|  mammal|(mam)-mal|
| chi-mps|   chimps|
+--------+---------+
You can use regexp_replace to delete the special characters (if you know which ones they are), then compute the length and filter to the columns you want.
val cols = df.columns
val df2 = cols.foldLeft(df) {
  (df, c) => df.withColumn(c + "_len", length(regexp_replace(col(c), "[()-]", "")))
}
// keep only the columns whose cleaned-up length is 6 on every row
val keep = cols.filter(c => df2.filter(col(c + "_len") =!= 6).count() == 0)
val result = df.select(keep.map(col): _*)  // the two-column dataframe shown in the question
df2.show()
+--------+---------+-------+-----------+-----------+-----------+
| Column1|  Column2|Column3|Column1_len|Column2_len|Column3_len|
+--------+---------+-------+-----------+-----------+-----------+
|(ani)mal|   donkey|   wolf|          6|          6|          4|
|  mammal|(mam)-mal| animal|          6|          6|          6|
| chi-mps|   chimps|   goat|          6|          6|          4|
+--------+---------+-------+-----------+-----------+-----------+
I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need to get the cumulative sum of C1 to C100, based on ID and Date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping over C1 to C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID")
    .orderBy(col("date").asc)))
I found a similar question here, but it was done manually for only two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = spark.range(1000)
  .withColumn("c1", 'id + 3)
  .withColumn("c2", 'id % 2 + 1)
  .withColumn("date", monotonically_increasing_id)
  .withColumn("id", 'id % 10 + 1)
// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2|       date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// |  1|  3|  1|          0|         3|         1|
// |  1| 13|  1|         10|        16|         2|
// |  1| 23|  1|         20|        39|         3|
// |  1| 33|  1|         30|        72|         4|
// |  1| 43|  1|         40|       115|         5|
// |  1| 53|  1| 8589934592|       168|         6|
// |  1| 63|  1| 8589934602|       231|         7|
Here is another way, using a simple select expression:
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID|      Date|   C1|   C2|C100|
+---+----------+-----+-----+----+
|  1|02/04/2019|32.09|45.06|  99|
|  1|02/06/2019|64.18|90.12| 198|
|  2|02/03/2019|32.09|45.06|  99|
|  2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+
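For completeness, the same select-expression pattern in PySpark; a sketch, assuming a DataFrame df with the ID, Date and C* columns:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# cumulative sum per ID, ordered by Date, for every column except the keys
w = (Window.partitionBy('ID')
     .orderBy(F.col('Date').asc())
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
columns_to_sum = [c for c in df.columns if c not in ('ID', 'Date')]
select_expr = [F.col('ID'), F.col('Date')] + [F.sum(c).over(w).alias(c) for c in columns_to_sum]
df.select(*select_expr).show()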
I have a pyspark.sql.DataFrame df
id col1
1 abc
2 bcd
3 lal
4 bac
I want to add one more column, flag, to the df such that if id is an odd number, flag should be 'odd', and if it's even, 'even'.
The final output should be:
id col1 flag
1 abc odd
2 bcd even
3 lal odd
4 bac even
I tried:
def myfunc(num):
    if num % 2 == 0:
        flag = 'EVEN'
    else:
        flag = 'ODD'
    return flag

df['new_col'] = df['id'].map(lambda x: myfunc(x))
df['new_col'] = df['id'].apply(lambda x: myfunc(x))
It gave me the error: TypeError: 'Column' object is not callable
How do I use .apply (as I would with a pandas dataframe) in pyspark?
pyspark doesn't provide apply; the alternative is to use the withColumn function to perform this operation.
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
[1,"abc"],
[2,"bcd"],
[3,"lal"],
[4,"bac"]
],
["id","col1"]
)
df.show()
+---+----+
| id|col1|
+---+----+
|  1| abc|
|  2| bcd|
|  3| lal|
|  4| bac|
+---+----+
df.withColumn(
    "flag",
    F.when(F.col("id") % 2 == 0, F.lit("even")).otherwise(F.lit("odd"))
).show()
+---+----+----+
| id|col1|flag|
+---+----+----+
|  1| abc| odd|
|  2| bcd|even|
|  3| lal| odd|
|  4| bac|even|
+---+----+----+
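If you would rather reuse your plain Python function (the closest thing to pandas .apply), you can wrap it in a udf. A minimal sketch reusing myfunc from the question (returning lowercase to match the desired output); note that built-in expressions like when/otherwise are generally faster than a Python udf:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def myfunc(num):
    return 'even' if num % 2 == 0 else 'odd'

# wrap the Python function as a Spark UDF and apply it to the id column
flag_udf = F.udf(myfunc, StringType())
df.withColumn('flag', flag_udf(F.col('id'))).show()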
Say I have two PySpark DataFrames df1 and df2.
df1= 'a'
1
2
5
df2= 'b'
3
6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find a y that achieves min(abs(x-y)) over all y in df2['b'] (note: we can assume there is only one y that achieves the minimum distance), and the result would be
'a' 'b'
1 3
2 3
5 6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
def dict(x, y):
    return abs(x - y)
udf_dict = udf(dict, IntegerType())
sql_sc = SQLContext(sc)
udf_dict(df1.a, df2.b)
which gives
Column<PythonUDF#dist(a,b)>
Then I tried
sql_sc.CreateDataFrame(udf_dict(df1.a, df2.b))
which runs forever without giving error/output.
My questions are:
As I'm new to Spark, is my way to construct the output DataFrame efficient? (My way would be creating a distance matrix for all the a and b values first, and then find the min one)
What's wrong with the last line of my code and how to fix it?
Starting with your second question: you can only apply a udf to an existing dataframe, so I think you were aiming for something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
+---+---+--------+
|  a|  b|distance|
+---+---+--------+
|  1|  3|       2|
|  1|  6|       5|
|  2|  3|       1|
|  2|  6|       4|
|  5|  3|       2|
|  5|  6|       1|
+---+---+--------+
But there is a more efficient way to compute this distance, using the built-in abs (import min as well, it is needed below):
>>> from pyspark.sql.functions import abs, min
>>> df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
Then you can find matching numbers by calculating:
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
+---+---+
|  a|  b|
+---+---+
|  5|  6|
|  1|  3|
|  2|  3|
+---+---+
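One caveat: depending on your Spark version and the spark.sql.crossJoin.enabled setting, a join without a condition may be rejected as an implicit cartesian product. Spelling out the cross join avoids that; a sketch of the same logic:
from pyspark.sql.functions import abs, min

# explicit cross join, then keep the (a, b) pairs with the minimum distance
distances = df1.crossJoin(df2).withColumn('distance', abs(df1.a - df2.b))
min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()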
I have an eventlog in csv consisting of three columns timestamp, eventId and userId.
What I would like to do is append a new column nextEventId to the dataframe.
An example eventlog:
eventlog = sqlContext.createDataFrame(Array((20160101, 1, 0),(20160102,3,1),(20160201,4,1),(20160202, 2,0))).toDF("timestamp", "eventId", "userId")
eventlog.show(4)
+---------+-------+------+
|timestamp|eventId|userId|
+---------+-------+------+
| 20160101|      1|     0|
| 20160102|      3|     1|
| 20160201|      4|     1|
| 20160202|      2|     0|
+---------+-------+------+
The desired end result would be:
+---------+-------+------+-----------+
|timestamp|eventId|userId|nextEventId|
+---------+-------+------+-----------+
| 20160101|      1|     0|          2|
| 20160102|      3|     1|          4|
| 20160201|      4|     1|        Nil|
| 20160202|      2|     0|        Nil|
+---------+-------+------+-----------+
So far I've been messing around with sliding windows but can't figure out how to compare 2 rows...
val w = Window.partitionBy("userId").orderBy(asc("timestamp")) //should be a sliding window over 2 rows...
val nextNodes = second($"eventId").over(w) //should work if there are only 2 rows
What you're looking for is lead (or lag). Using the window you already defined:
import org.apache.spark.sql.functions.lead
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w))
For a true sliding window (like a sliding average) you can use the rowsBetween or rangeBetween clauses of the window definition, but that is not really needed here. Nevertheless, example usage could look like this:
val w2 = Window.partitionBy("userId")
  .orderBy(asc("timestamp"))
  .rowsBetween(-1, 0)
avg($"foo").over(w2)