Spark monotonically_increasing_id() gives consecutive ids for all the partitions - scala

I have a dataframe df in Spark which looks something like this:
val df = (1 to 10).toList.toDF()
When I check the number of partitions, I see that I have 10 partitions:
df.rdd.getNumPartitions
res0: Int = 10
Now I generate an ID column:
val dfWithID = df.withColumn("id", monotonically_increasing_id())
dfWithID.show()
+-----+---+
|value| id|
+-----+---+
| 1| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| 5| 4|
| 6| 5|
| 7| 6|
| 8| 7|
| 9| 8|
| 10| 9|
+-----+---+
So all the generated ids are consecutive though I have 10 partitions. Then I repartition the dataframe:
val dfp = df.repartition(10)
val dfpWithID = dfp.withColumn("id", monotonically_increasing_id())
dfpWithID.show()
+-----+-----------+
|value| id|
+-----+-----------+
| 10| 0|
| 1| 8589934592|
| 7|17179869184|
| 5|25769803776|
| 4|42949672960|
| 9|42949672961|
| 2|51539607552|
| 8|60129542144|
| 6|68719476736|
| 3|77309411328|
+-----+-----------+
Now the ids I get are not consecutive anymore. Based on the Spark documentation, it should put the partition ID in the upper 31 bits, and in both cases I have 10 partitions. Why does it only add the partition ID after calling repartition()?

I assume this is because all the data in your initial dataframe resides in a single partition, the other 9 being empty.
To verify this, use the answers given here: Apache Spark: Get number of records per partition
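For instance, a minimal sketch of that check (assuming the dfWithID and dfpWithID dataframes from the question are in scope; the pid column name is just illustrative) could look like the following. It also shifts the generated id right by 33 bits, which recovers the partition ID that monotonically_increasing_id() encodes in the upper bits:
import org.apache.spark.sql.functions.{spark_partition_id, shiftRight, col}
// How many records actually sit in each partition?
dfWithID.groupBy(spark_partition_id().as("pid")).count().show()
dfpWithID.groupBy(spark_partition_id().as("pid")).count().show()
// The partition ID lives in the upper bits of the generated id,
// so shifting right by 33 bits recovers it:
dfpWithID.withColumn("pid", shiftRight(col("id"), 33)).show()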

Related

Is there any method by which we can limit the rows in repartition function?

In Spark I am trying to limit the number of rows to 100 in each partition. But I don't want to write it to a file yet; I need to perform more operations on the data before overwriting the records.
You can do it using repartition.
To keep n records in each partition you need to repartition your data so that total_data_count / number_of_partitions = n.
For example: I have 100 records; if I want each partition to hold 10 records, then I have to repartition my data into 10 parts with df.repartition(10)
>>> from pyspark.sql.functions import spark_partition_id, asc
>>> df=spark.read.csv("/path to csv/sample2.csv",header=True)
>>> df.count()
100
>>> df1=df.repartition(10)
>>> df1\
... .withColumn("partitionId", spark_partition_id())\
... .groupBy("partitionId")\
... .count()\
... .orderBy(asc("count"))\
... .show()
+-----------+-----+
|partitionId|count|
+-----------+-----+
| 6| 10|
| 3| 10|
| 5| 10|
| 9| 10|
| 8| 10|
| 4| 10|
| 7| 10|
| 1| 10|
| 0| 10|
| 2| 10|
+-----------+-----+
Here you can see each partition has 10 records.
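Since this thread is Scala, here is a minimal Scala sketch of the same sizing rule (targetRowsPerPartition is a hypothetical parameter, and df is assumed to be the DataFrame you want to split):
val targetRowsPerPartition = 100
// pick the partition count so that total_data_count / number_of_partitions is roughly the target
val numPartitions = math.max(1, math.ceil(df.count().toDouble / targetRowsPerPartition).toInt)
val repartitioned = df.repartition(numPartitions)
// note: repartition distributes rows roughly evenly, so each partition holds about this many rows, not exactly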

Calculate running total in Pyspark dataframes and break the loop when a condition occurs

I have a Spark dataframe where I need to calculate a running total of the amount values, based on the current and previous rows, per Col_x. Whenever a negative amount occurs in Col_y, I should break the running total of the previous records and restart the running total from the current row.
A sample dataset and the expected output are shown in the answers below.
How to achieve this with dataframes using PySpark?
Another way
Imports
import sys
import pyspark.sql.functions as f
from pyspark.sql.window import Window
Create Index
df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Regenerate Columns
df = df.select('index', 'value.*')#.show()
Create groups bounded by negative values
w=Window.partitionBy().orderBy('index').rowsBetween(-sys.maxsize,0)
df=df.withColumn('cat', f.min('Col_y').over(w))
Cumsum within groups
y=Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding,0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y),2)).sort('index').drop('cat','index').show()
Outcome
+-----+-------------------+------+
|Col_x| Col_y|cumsum|
+-----+-------------------+------+
| ID1|-17.899999618530273| -17.9|
| ID1| 21.899999618530273| 4.0|
| ID1| 236.89999389648438| 240.9|
| ID1| 4.989999771118164|245.89|
| ID1| 610.2000122070312|856.09|
| ID1| -35.79999923706055| -35.8|
| ID1| 21.899999618530273| -13.9|
| ID1| 17.899999618530273| 4.0|
+-----+-------------------+------+
I am hoping that in a real scenario you will have a timestamp column to order the data; here I am ordering the data by line number using zipWithIndex for the sake of explanation.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
    ("ID1", -17.9),
    ("ID1", 21.9),
    ("ID1", 236.9),
    ("ID1", 4.99),
    ("ID1", 610.2),
    ("ID1", -35.8),
    ("ID1", 21.9),
    ("ID1", 17.9)
]
schema = StructType([
    StructField('Col_x', StringType(), True),
    StructField('Col_y', FloatType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")
w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x','group').orderBy('index')
df_final=spark.sql("select value.Col_x,round(value.Col_y,1) as Col_y, index from valuewithorder")
"""Group The data into different groups based on the negative value existance"""
df_final = df_final.withColumn("valueChange",(f.col('Col_y')<0).cast("int")) \
.fillna(0,subset=["valueChange"])\
.withColumn("indicator",(~((f.col("valueChange") == 0))).cast("int"))\
.withColumn("group",f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))
"""Cumlative sum with idfferent parititon of group and col_x"""
df_cum_sum = df_final.withColumn("Col_z", sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x , Col_y ,round(Col_z,1) as Col_z from FinalCumSum")
df_cum_sum.show()
Results of the intermediate dataset and the final output
>>> df_cum_sum.show()
+-----+-----+-----+
|Col_x|Col_y|Col_z|
+-----+-----+-----+
| ID1|-17.9|-17.9|
| ID1| 21.9| 4.0|
| ID1|236.9|240.9|
| ID1| 5.0|245.9|
| ID1|610.2|856.1|
| ID1|-35.8|-35.8|
| ID1| 21.9|-13.9|
| ID1| 17.9| 4.0|
+-----+-----+-----+
>>> df_final.show()
+-----+-----+-----+-----------+---------+-----+
|Col_x|Col_y|index|valueChange|indicator|group|
+-----+-----+-----+-----------+---------+-----+
| ID1|-17.9| 0| 1| 1| 1|
| ID1| 21.9| 1| 0| 0| 1|
| ID1|236.9| 2| 0| 0| 1|
| ID1| 5.0| 3| 0| 0| 1|
| ID1|610.2| 4| 0| 0| 1|
| ID1|-35.8| 5| 1| 1| 2|
| ID1| 21.9| 6| 0| 0| 2|
| ID1| 17.9| 7| 0| 0| 2|
+-----+-----+-----+-----------+---------+-----+
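Since this thread is Scala, here is a sketch of the same group-and-cumsum idea in Scala (assuming a dataframe df that already has Col_x, Col_y and an ordering column index, as constructed above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// running count of negative values marks the groups, then sum within each group
val byRow = Window.partitionBy("Col_x").orderBy("index").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val grouped = df
  .withColumn("indicator", (col("Col_y") < 0).cast("int")) // 1 marks a reset point
  .withColumn("group", sum("indicator").over(byRow))        // running count of resets = group id
val byGroup = Window.partitionBy("Col_x", "group").orderBy("index").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val result = grouped.withColumn("cumsum", round(sum("Col_y").over(byGroup), 2))
result.orderBy("index").show()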

How to create a sequence of events (column values) per some other column?

I have a Spark data frame as shown below -
val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")
scala> myDF.show
+-------+-------+---------+-------------+------+
|visitor|channel|timestamp|purchase_flag|amount|
+-------+-------+---------+-------------+------+
| 1| A| 100| 0| 0|
| 1| E| 200| 0| 0|
| 1| | 300| 1| 49|
| 2| A| 200| 0| 0|
| 2| C| 300| 0| 0|
| 2| D| 100| 0| 0|
+-------+-------+---------+-------------+------+
I would like to create, for every visitor in myDF, a sequence dataframe that traces the visitor's path to purchase, ordered by the timestamp dimension.
The output dataframe should look like the one below (-> can be any delimiter):
+-------+---------------------+
|visitor|channel sequence |
+-------+---------------------+
| 1| A->E->purchase |
| 2| D->A->C->no_purchase|
+-------+---------------------+
To make things clear, visitor 2 has been exposed to channel D, then A, and then C, and does not make a purchase.
Hence the sequence is formed as D->A->C->no_purchase.
NOTE: Whenever a purchase happens, the channel value goes blank and purchase_flag is set to 1.
I want to do this using a Scala UDF in Spark so that I can re-apply the method to other datasets.
Here's how it is done using a udf function:
val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
def sequenceUdf = udf((struct: Seq[Row], purchased: Seq[Int]) =>
  struct.map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
    .sortBy(_._2)
    .map(_._1)
    .filterNot(_ == "")
    .mkString("->") + { if (purchased.contains(1)) "->purchase" else "->no_purchase" })
myDF.groupBy("visitor").agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
.select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
.show(false)
which should give you
+-------+--------------------+
|visitor|channel sequence |
+-------+--------------------+
|1 |A->E->purchase |
|2 |D->A->C->no_purchase|
+-------+--------------------+
You can make it as generic as you need; this is just a demo of how you could proceed.
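For example, one way to make it reusable on other datasets with the same columns is to wrap the aggregation in a small helper (a sketch only; the channelSequence name is illustrative):
import org.apache.spark.sql.DataFrame
def channelSequence(df: DataFrame): DataFrame =
  df.groupBy("visitor")
    .agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
    .select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
// usage: channelSequence(myDF).show(false)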

Compare two dataframes and update the values

I have two dataframes like following.
val file1 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file1.csv")
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
+---+-------+-----+-----+-------+
val file2 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file2.csv")
file2.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 70| 5| 0|
+---+-------+-----+-----+-------+
Now I am comparing the two dataframes and filtering out the mismatched values like this.
val columns = file1.schema.fields.map(_.name)
val selectiveDifferences = columns.map(col => file1.select(col).except(file2.select(col)))
selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})
+-----+
|mark1|
+-----+
| 10|
+-----+
I need to add an extra row to the dataframe for the mismatched value from the second dataframe, and update the version number, like this:
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
| 3| Teju | 70| 5| 1|
+---+-------+-----+-----+-------+
I am struggling to achieve the above step, which is my expected output. Any help would be appreciated.
You can get your final dataframe by using except and union as follows:
val count = file1.count()
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
file1.union(file2.except(file1)
.withColumn("version", lit(1)) //changing the version
.withColumn("id", (row_number.over(Window.orderBy("id")))+lit(count)) //changing the id number
)
The lit, row_number and Window functions are used to generate the new id and version values.
Note: using a window function without a partition to generate the new ids makes the process inefficient, as all the data would be collected in one executor to generate them.
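If that single-executor step ever becomes a concern for a larger diff, a hedged alternative is to number the new rows with RDD.zipWithIndex instead of a global window. This is only a sketch and assumes the schema shown above (id, name, mark1, mark2, version) with the ids in file1 running from 1 to count:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val newRows = file2.except(file1).withColumn("version", lit(1)).drop("id")
val startId = file1.count()
// zipWithIndex numbers rows without shuffling everything into one executor
val numbered = newRows.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq((startId + i + 1) +: row.toSeq) }
val schema = StructType(StructField("id", LongType, nullable = false) +: newRows.schema.fields)
val newDF = spark.createDataFrame(numbered, schema)
file1.withColumn("id", col("id").cast("long")).union(newDF).show()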

Spark Scala DF. add a new Column to DF based in processing of some rows of the same column

Dears,
I'm new to Spark Scala, and I have a DF with two columns, "UG" and "Counts", and I would like to obtain the third one, as shown in the list below.
DF (the columns are UG, Counts, CUG):
UG    Counts  CUG
of    12      4
of    23      4
the   134     3
love  68      2
pain  3       1
the   18      3
love  100     2
of    23      4
the   12      3
of    11      4
I need to add a new column called "CUG", the third one shown, where CUG(i) is the number of times that the string in UG on row i appears in the whole column.
I tried the following scheme:
With the DF like the previous table in df, I wrote a SQL UDF to count the number of times that the string appears in the column "UG", that is:
val NW1 = (w1: String) => {
  df.filter($"UG".like(w1.substring(1, w1.length - 1))).count()
}: Long
val sqlfunc = udf(NW1)
val df2 = df.withColumn("CUG", sqlfunc(col("UG")))
But when I tried it, it didn't work: I got a NullPointerException. The UDF worked in isolation but not within the DF.
What can I do in order to obtain the desired results using the DF?
Thanks In advance.
jm3
What you can do is first count the number of rows grouped by the UG column, which gives the third column you need, and then join the result with the original dataframe. You can rename the column with the withColumnRenamed function.
scala> import org.apache.spark.sql.functions._
scala> myDf.show()
+----+------+
| UG|Counts|
+----+------+
| of| 12|
| of| 23|
| the| 134|
|love| 68|
|pain| 3|
| the| 18|
|love| 100|
| of| 23|
| the| 12|
| of| 11|
+----+------+
scala> myDf.join(myDf.groupBy("UG").count().withColumnRenamed("count", "CUG"), "UG").show()
+----+------+---+
| UG|Counts|CUG|
+----+------+---+
| of| 12| 4|
| of| 23| 4|
| the| 134| 3|
|love| 68| 2|
|pain| 3| 1|
| the| 18| 3|
|love| 100| 2|
| of| 23| 4|
| the| 12| 3|
| of| 11| 4|
+----+------+---+
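An equivalent sketch using a window function instead of the groupBy/join, in case you prefer to keep everything in one expression (it produces the same CUG values):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// count over a window partitioned by UG gives, for every row, how many rows share that UG value
val withCUG = myDf.withColumn("CUG", count("UG").over(Window.partitionBy("UG")))
withCUG.show()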