Spark DataFrame: How to add an index column (aka distributed data index) - Scala

I read data from a CSV file, but it has no index.
I want to add a column numbered from 1 up to the number of rows.
What should I do? Thanks. (Scala)

With Scala you can use:
import org.apache.spark.sql.functions._
df.withColumn("id", monotonically_increasing_id())
You can refer to this example and the Scala docs.
With PySpark you can use:
from pyspark.sql.functions import monotonically_increasing_id
df_index = df.select("*").withColumn("id", monotonically_increasing_id())

monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
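A minimal Scala sketch (assuming an active SparkSession named spark) that makes the non-consecutive behaviour visible by forcing a few partitions:
import org.apache.spark.sql.functions.monotonically_increasing_id

// Six rows spread over three partitions: the generated ids are unique and
// increasing, but they jump at every partition boundary.
spark.range(0, 6).repartition(3)
  .withColumn("mono_id", monotonically_increasing_id())
  .show()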
"I want to add a column from 1 to row's number."
Let's say we have the following DataFrame:
+--------+-------------+-------+
| userId | productCode | count |
+--------+-------------+-------+
| 25 | 6001 | 2 |
| 11 | 5001 | 8 |
| 23 | 123 | 5 |
+--------+-------------+-------+
To generate the IDs starting from 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("count")
val result = df.withColumn("index", row_number().over(w))
This adds an index column ordered by increasing value of count.
+--------+-------------+-------+-------+
| userId | productCode | count | index |
+--------+-------------+-------+-------+
| 25 | 6001 | 2 | 1 |
| 23 | 123 | 5 | 2 |
| 11 | 5001 | 8 | 3 |
+--------+-------------+-------+-------+

How to get a sequential id column id[1, 2, 3, 4...n]:
from pyspark.sql.functions import desc, row_number, monotonically_increasing_id
from pyspark.sql.window import Window
df_with_seq_id = df.withColumn('index_column_name', row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)
Note that row_number() starts at 1, so subtract 1 if you want a 0-indexed column.
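For completeness, a minimal Scala sketch of the same trick (assuming df is any DataFrame):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// row_number() over an ordering by monotonically_increasing_id yields
// consecutive numbers 1..n; subtract 1 for a 0-based index.
val dfWithSeqId = df.withColumn(
  "index_column_name",
  row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)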

NOTE: The approaches above don't give a consecutive sequence number; they only give an increasing id.
A simple way to get consecutive numbers while preserving the order of the rows is zipWithIndex, as shown below.
Sample data.
+-------------------+
| Name|
+-------------------+
| Ram Ghadiyaram|
| Ravichandra|
| ilker|
| nick|
| Naveed|
| Gobinathan SP|
|Sreenivas Venigalla|
| Jackela Kowski|
| Arindam Sengupta|
| Liangpi|
| Omar14|
| anshu kumar|
+-------------------+
package com.example

import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row}

/**
 * DistributedDataIndex : Program to index a DataFrame with zipWithIndex
 */
object DistributedDataIndex extends App with Logging {

  val spark = builder
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()
  import spark.implicits._

  val df = spark.sparkContext.parallelize(
    Seq("Ram Ghadiyaram", "Ravichandra", "ilker", "nick"
      , "Naveed", "Gobinathan SP", "Sreenivas Venigalla", "Jackela Kowski", "Arindam Sengupta", "Liangpi", "Omar14", "anshu kumar"
    )).toDF("Name")
  df.show

  logInfo("addColumnIndex here")
  // Add index now...
  val df1WithIndex = addColumnIndex(df)
    .withColumn("monotonically_increasing_id", monotonically_increasing_id)
  df1WithIndex.show(false)

  /**
   * Add a column index to each row of the dataframe
   */
  def addColumnIndex(df: DataFrame) = {
    spark.sqlContext.createDataFrame(
      df.rdd.zipWithIndex.map {
        case (row, index) => Row.fromSeq(row.toSeq :+ index)
      },
      // Create schema for the index column
      StructType(df.schema.fields :+ StructField("index", LongType, false)))
  }
}
Result :
+-------------------+-----+---------------------------+
|Name |index|monotonically_increasing_id|
+-------------------+-----+---------------------------+
|Ram Ghadiyaram |0 |0 |
|Ravichandra |1 |8589934592 |
|ilker |2 |8589934593 |
|nick |3 |17179869184 |
|Naveed |4 |25769803776 |
|Gobinathan SP |5 |25769803777 |
|Sreenivas Venigalla|6 |34359738368 |
|Jackela Kowski |7 |42949672960 |
|Arindam Sengupta |8 |42949672961 |
|Liangpi |9 |51539607552 |
|Omar14 |10 |60129542144 |
|anshu kumar |11 |60129542145 |
+-------------------+-----+---------------------------+
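The jumps in the monotonically_increasing_id column come from how the id is assembled: the partition index goes into the upper 31 bits and the record number within the partition into the lower 33 bits, so the exact values depend on how many partitions your local run uses. A quick check against the output above:
// id = (partitionIndex.toLong << 33) + recordNumberWithinPartition
assert((1L << 33) == 8589934592L)   // first record of partition 1 ("Ravichandra")
assert((7L << 33) == 60129542144L)  // first record of partition 7 ("Omar14")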

As Ram said, zipWithIndex is better than monotonically_increasing_id if you need consecutive row numbers. Try this (PySpark environment):
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
new_schema = StructType(original_dataframe.schema.fields[:] + [StructField("index", LongType(), False)])
zipped_rdd = original_dataframe.rdd.zipWithIndex()
indexed = (zipped_rdd.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])).toDF(new_schema))
where original_dataframe is the dataframe you want to add the index to, and row_with_index is the new schema with the column index, which you can write as
row_with_index = Row(
    "calendar_date",
    "year_week_number",
    "year_period_number",
    "realization",
    "index"
)
Here, calendar_date, year_week_number, year_period_number and realization were the columns of my original dataframe. You can replace them with the names of your own columns. index is the new column name you have to add for the row numbers.

If you require a unique sequence number for each row, I have a slightly different approach: add a static column and use it to compute the row number over a window.
val srcData = spark.read.option("header","true").csv("/FileStore/sample.csv")
srcData.show(5)
+--------+--------------------+
| Job| Name|
+--------+--------------------+
|Morpheus| HR Specialist|
| Kayla| Lawyer|
| Trisha| Bus Driver|
| Robert|Elementary School...|
| Ober| Judge|
+--------+--------------------+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val srcDataModf = srcData.withColumn("sl_no", lit("1"))
val windowSpecRowNum = Window.partitionBy("sl_no").orderBy("sl_no")
srcDataModf.withColumn("row_num", row_number().over(windowSpecRowNum)).drop("sl_no").select("row_num", "Name", "Job").show(5)
+-------+--------------------+--------+
|row_num| Name| Job|
+-------+--------------------+--------+
| 1| HR Specialist|Morpheus|
| 2| Lawyer| Kayla|
| 3| Bus Driver| Trisha|
| 4|Elementary School...| Robert|
| 5| Judge| Ober|
+-------+--------------------+--------+

For SparkR:
(Assuming sdf is some sort of spark data frame)
sdf <- withColumn(sdf, "row_id", SparkR:::monotonically_increasing_id())

Related

Merge multiple spark rows inside dataframe by ID into one row based on update_time

We need to merge multiple rows based on ID into a single record using PySpark. If there are multiple updates to a column, then we have to select the one with the last update made to it.
Please note that NULL means there was no update made to the column in that instance.
So, basically, we have to create a single row with the consolidated updates made to the records.
I'm looking for something similar to Merge rows in a spark scala Dataframe, but in PySpark.
So, for example, if this is the dataframe:
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update1 | <*no-update*> | 1634228709 |
| 123 | <*no-update*> | 80 | 1634228724 |
| 123 | update2 | <*no-update*> | 1634229000 |
expected output is -
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update2 | 80 | 1634229000 |
Let's say that our input dataframe is:
+---+-------+----+----------+
|id |col1 |col2|updated_at|
+---+-------+----+----------+
|123|null |null|1634228709|
|123|null |80 |1634228724|
|123|update2|90 |1634229000|
|12 |update1|null|1634221233|
|12 |null |80 |1634228333|
|12 |update2|null|1634221220|
+---+-------+----+----------+
What we want is to convert updated_at to TimestampType, then order by id and updated_at in descending order:
df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
F.col("id"), F.col("updated_at").desc()
)
that gives us:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|12 |null |80 |2021-10-14 18:18:53|
|12 |update1|null|2021-10-14 16:20:33|
|12 |update2|null|2021-10-14 16:20:20|
|123|update2|90 |2021-10-14 18:30:00|
|123|null |80 |2021-10-14 18:25:24|
|123|null |null|2021-10-14 18:25:09|
+---+-------+----+-------------------+
Now group by id and take the first non-null value in each column (or null if they are all null):
exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
df = df.groupBy(F.col("id")).agg(*exp)
And the result is:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|123|update2|90 |2021-10-14 18:30:00|
|12 |update1|80 |2021-10-14 18:18:53|
+---+-------+----+-------------------+
Here's the full example code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    data = [
        (123, None, None, 1634228709),
        (123, None, 80, 1634228724),
        (123, "update2", 90, 1634229000),
        (12, "update1", None, 1634221233),
        (12, None, 80, 1634228333),
        (12, "update2", None, 1634221220),
    ]
    columns = ["id", "col1", "col2", "updated_at"]
    df = spark.createDataFrame(data, columns)
    df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
        F.col("id"), F.col("updated_at").desc()
    )
    exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
    df = df.groupBy(F.col("id")).agg(*exp)
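For readers following the Scala side of this page, a minimal sketch of the same approach (assuming a DataFrame df with the same columns; like the PySpark version, it relies on ordering before the groupBy to steer first(), which is a best-effort trick rather than a hard guarantee):
import org.apache.spark.sql.functions.{col, first}
import org.apache.spark.sql.types.TimestampType

// Cast, order so the latest update comes first, then keep the first
// non-null value of every column per id.
val ordered = df
  .withColumn("updated_at", col("updated_at").cast(TimestampType))
  .orderBy(col("id"), col("updated_at").desc)

val aggExprs = ordered.columns.filter(_ != "id")
  .map(c => first(c, ignoreNulls = true).as(c))

val merged = ordered.groupBy("id").agg(aggExprs.head, aggExprs.tail: _*)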

Fetch the partial value from a column having key value pairs and assign it to new column in Spark Dataframe

I have a data frame as below
+----+-----------------------------+
|id | att |
+----+-----------------------------+
| 25 | {"State":"abc","City":"xyz"}|
| 26 | null |
| 27 | {"State":"pqr"} |
+----+-----------------------------+
I want a dataframe with columns id and City: the City value if the att column has a City attribute, otherwise null.
+----+------+
|id | City |
+----+------+
| 25 | xyz |
| 26 | null |
| 27 | null |
+----+------+
Language : Scala
You can use from_json to parse your JSON data and convert it to a Map. Then access the map item using one of:
getItem method of the Column class
default accessor, i.e map("map_key")
element_at function
import org.apache.spark.sql.functions.{element_at, from_json}
import org.apache.spark.sql.types.{MapType, StringType}
import sparkSession.implicits._
val df = Seq(
(25, """{"State":"abc","City":"xyz"}"""),
(26, null),
(27, """{"State":"pqr"}""")
).toDF("id", "att")
val schema = MapType(StringType, StringType)
df.select($"id", from_json($"att", schema).getItem("City").as("City"))
//or df.select($"id", from_json($"att", schema)("City").as("City"))
//or df.select($"id", element_at(from_json($"att", schema), "City").as("City"))
// +---+----+
// | id|City|
// +---+----+
// | 25| xyz|
// | 26|null|
// | 27|null|
// +---+----+
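If you only need that single key, a hedged alternative (same df as above) is get_json_object, which extracts one JSON path directly and returns null when the key or the whole att value is missing:
import org.apache.spark.sql.functions.get_json_object

// City comes back null for id 26 (att is null) and id 27 (no City key).
df.select($"id", get_json_object($"att", "$.City").as("City")).show()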

How to extract subnet from full address from a Dataframe?

I have created a temporary dataframe as below:
var someDF = Seq(("1","1.2.3.4"), ("2","5.26.6.3")).toDF("s/n", "ip")
Is there a way to extract the subnet from the full IP address and put it into a new column "subnet"?
Example of output:
---------------------------
|s/N | ip | subnet |
---------------------------
|1 | 1.2.3.4 | 1.2.3.x |
|2 | 5.26.6.3 | 5.26.6.x|
---------------------------
You could use a UDF to do this:
import org.apache.spark.sql.functions.udf

val getSubnet = udf((ip: String) => ip.split("\\.").init.mkString(".") + ".x")
val df = someDF.withColumn("subnet", getSubnet($"ip"))
Which would give you this dataframe:
+---+--------+--------+
|s/n| ip| subnet|
+---+--------+--------+
| 1| 1.2.3.4| 1.2.3.x|
| 2|5.26.6.3|5.26.6.x|
+---+--------+--------+
You can achieve this with the built-in functions concat_ws and substring_index:
import org.apache.spark.sql.functions._
someDF.withColumn("subnet", concat_ws(".", substring_index($"ip", ".", 3), lit("x")))
You could also try the following: very simple code that stays with built-in functions, which generally performs better than a UDF:
import org.apache.spark.sql.functions.{concat, lit, col, regexp_replace}
someDF.withColumn("subnet", concat(regexp_replace(col("ip"), "(.*\\.)\\d+$", "$1"), lit("x"))).show()
Output
+---+--------+--------+
|s/n| ip| subnet|
+---+--------+--------+
| 1| 1.2.3.4| 1.2.3.x|
| 2|5.26.6.3|5.26.6.x|
+---+--------+--------+

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
---------------------------------
|ID | words |
---------------------------------
1 | ['apple','ball','ballon'] |
2 | ['cat','camel','james'] |
I even want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.extend([(i, my_data[i])])
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import *
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple approach:
my_data = [['apple','ball','ballon'], ['cat','camel','james'], ['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+--+
|words                |id|
+---------------------+--+
|[apple, ball, ballon]|0 |
|[cat, camel, james]  |1 |
|[none, focus, cake]  |2 |
+---------------------+--+

Spark Dataframe - Write a new record for a change in VALUE for a particular KEY group

We need to write a row whenever there is a change in the "AMT" column within a particular "KEY" group.
E.g.:
Scenario 1: For KEY=2, the first change is 90 to 20, so we need to write a record with value (20 - 90).
Similarly, the next change for the same key group is 20 to 30.5, so we need to write another record with value (30.5 - 20).
Scenario 2: For KEY=1, there is only one record for this KEY group, so write it as is.
Scenario 3: For KEY=3, since the same AMT value exists twice, write it once.
How can this be implemented? Using window functions or groupBy/agg functions?
Sample Input Data :
val DF1 = List((1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)).toDF("KEY", "AMT")
DF1.show(false)
+-----+-------------------+
|KEY |AMT |
+-----+-------------------+
|1 |34.6 |
|2 |90.0 |
|2 |90.0 |
|2 |20.0 |----->[ 20.0 - 90.0 = -70.0 ]
|2 |30.5 |----->[ 30.5 - 20.0 = 10.5 ]
|3 |89.0 |
|3 |89.0 |
+-----+-------------------+
Expected Values :
scala> df2.show()
+----+--------------------+
|KEY | AMT |
+----+--------------------+
| 1 | 34.6 |-----> As Is
| 2 | -70.0 |----->[ 20.0 - 90.0 = -70.0 ]
| 2 | 10.5 |----->[ 30.5 - 20.0 = 10.5 ]
| 3 | 89.0 |-----> As Is, with one record only
+----+--------------------+
I have tried to solve it in PySpark, not in Scala.
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
w1 = Window.partitionBy("key").orderBy("key")
DF4 =spark.createDataFrame([(1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)],["KEY", "AMT"])
DF4.createOrReplaceTempView('keyamt')
DF7=spark.sql('select distinct key,amt from keyamt where key in ( select key from (select key,count(distinct(amt))dist from keyamt group by key) where dist=1)')
DF8=DF4.join(DF7,DF4['KEY']==DF7['KEY'],'leftanti').withColumn('new_col',((lag('AMT',1).over(w1)).cast('double') ))
DF9=DF8.withColumn('new_col1', ((DF8['AMT']-DF8['new_col'].cast('double'))))
DF9.withColumn('new_col1', ((DF9['AMT']-DF9['new_col'].cast('double')))).na.fill(0)
DF9.filter(DF9['new_col1'] !=0).select(DF9['KEY'],DF9['new_col1']).union(DF7).orderBy(DF9['KEY'])
Output:
+---+--------+
|KEY|new_col1|
+---+--------+
| 1| 34.6|
| 2| -70.0|
| 2| 10.5|
| 3| 89.0|
+---+--------+
You can implement your logic using a window function in combination with when, lead, monotonically_increasing_id() (for ordering) and the withColumn API, as below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("KEY").orderBy("rowNo")
val tempdf = DF1.withColumn("rowNo", monotonically_increasing_id())
tempdf.select($"KEY", when(lead("AMT", 1).over(windowSpec).isNull || (lead("AMT", 1).over(windowSpec)-$"AMT").as("AMT")===lit(0.0), $"AMT").otherwise(lead("AMT", 1).over(windowSpec)-$"AMT").as("AMT")).show(false)