PySpark - Find new, left and existing sales

I have a dataframe of sales with an id, an amount (amt) and a date (dt), and I want the output aggregated per consecutive year band (new, existing and left sales per band).
I need to aggregate the sales for each year band as follows. For example, for 2018-2019:
New_sales = sum of all sales of 2019 (the later year in 2018-2019) where the id doesn't exist in 2018 but exists in 2019
Existing_sales = sum of sales of 2018 where the id is present in both 2018 and 2019, minus the sum of sales of 2019 for those ids
Existing_sales = 50+75 (sales of 2018) - (20+50) (sales of 2019) = 125-70 = 55
Left_sales = sum of all sales of 2018 (the earlier year in 2018-2019) where the id exists in 2018 but not in 2019
How do I achieve that?

I've added some more dummy data to include the year 2020 for this example.
# +---+---+----------+
# | id|amt| dt|
# +---+---+----------+
# | 1| 50|2018-12-31|
# | 2|100|2018-12-31|
# | 3| 75|2018-12-31|
# | 1| 20|2019-12-31|
# | 3| 50|2019-12-31|
# | 5| 25|2019-12-31|
# | 1| 70|2020-12-31|
# | 2|150|2020-12-31|
# | 5|125|2020-12-31|
# +---+---+----------+
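For reference, this sample dataframe can be built along these lines - a sketch assuming a SparkSession named spark; the answer below refers to it as data_sdf and imports pyspark.sql.functions as func:
from pyspark.sql import functions as func

# recreate the dummy data shown above; id and amt are inferred as longs, dt is converted to a date
data_sdf = spark.createDataFrame(
    [(1, 50, '2018-12-31'), (2, 100, '2018-12-31'), (3, 75, '2018-12-31'),
     (1, 20, '2019-12-31'), (3, 50, '2019-12-31'), (5, 25, '2019-12-31'),
     (1, 70, '2020-12-31'), (2, 150, '2020-12-31'), (5, 125, '2020-12-31')],
    ['id', 'amt', 'dt']
). \
    withColumn('dt', func.to_date('dt'))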
You can extract the year from the dates and pivot it, so that each year becomes a column with its sales value under it. This keeps the calculations manageable by reducing the number of rows - there are only a handful of distinct years, so the subsequent calculations stay cheap.
yr_sales_sdf = data_sdf. \
    fillna(0, subset=['amt']). \
    withColumn('yr', func.year('dt')). \
    groupBy('id'). \
    pivot('yr'). \
    agg(func.first('amt'))
# +---+----+----+----+
# | id|2018|2019|2020|
# +---+----+----+----+
# | 5|null| 25| 125|
# | 1| 50| 20| 70|
# | 3| 75| 50|null|
# | 2| 100|null| 150|
# +---+----+----+----+
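Note that agg(func.first('amt')) assumes at most one sales row per id and year, which holds for the sample data. If an id can have several rows in the same year, summing them is probably what you want - a small variation on the snippet above, not something the data here requires:
yr_sales_sdf = data_sdf. \
    fillna(0, subset=['amt']). \
    withColumn('yr', func.year('dt')). \
    groupBy('id'). \
    pivot('yr'). \
    agg(func.sum('amt'))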
We'll need the years as a list - extracted from the pivoted dataframe's columns into the yrs list below - and a case-when expression that decides whether an ID is existing, left or new for a pair of years. The sale_cat_cond function below does this (it is pure pyspark column logic, so Spark can optimize it): it takes two consecutive years as columns, checks the condition, and returns the category together with its sales as a struct. Structs are very helpful here because the same condition has to produce more than one required value.
yrs = [k for k in yr_sales_sdf.columns if k[0:2] == '20']

sale_cat_cond = lambda frstCol, scndCol: (
    func.when(func.col(frstCol).isNull() & func.col(scndCol).isNotNull(),
              func.struct(func.lit(frstCol + '-' + scndCol).alias('year_band'),
                          func.lit('new_sales').alias('salecat'),
                          func.col(scndCol).alias('sale')))
    .when(func.col(frstCol).isNotNull() & func.col(scndCol).isNull(),
          func.struct(func.lit(frstCol + '-' + scndCol).alias('year_band'),
                      func.lit('left_sales').alias('salecat'),
                      func.col(frstCol).alias('sale')))
    .otherwise(func.struct(func.lit(frstCol + '-' + scndCol).alias('year_band'),
                           func.lit('existing_sales').alias('salecat'),
                           (func.col(frstCol) - func.col(scndCol)).alias('sale')))
)
Run sale_cat_cond on every pair of consecutive year columns, using a list comprehension, to calculate the sales categories and their sales. This adds one column per year band.
yr_salecat_sdf = yr_sales_sdf. \
    select('*',
           *[sale_cat_cond(yrs[i], yrs[i+1]).alias(yrs[i] + '_' + yrs[i+1] + '_salecat')
             for i in range(len(yrs) - 1)]
           )
# +---+----+----+----+-------------------------------+---------------------------------+
# |id |2018|2019|2020|2018_2019_salecat |2019_2020_salecat |
# +---+----+----+----+-------------------------------+---------------------------------+
# |5 |null|25 |125 |{2018-2019, new_sales, 25} |{2019-2020, existing_sales, -100}|
# |1 |50 |20 |70 |{2018-2019, existing_sales, 30}|{2019-2020, existing_sales, -50} |
# |3 |75 |50 |null|{2018-2019, existing_sales, 25}|{2019-2020, left_sales, 50} |
# |2 |100 |null|150 |{2018-2019, left_sales, 100} |{2019-2020, new_sales, 150} |
# +---+----+----+----+-------------------------------+---------------------------------+
The only thing left is to pivot and sum the sales categories and their sales. To do this, first collate all the year-band category structs into an array - this makes it easy to explode them per ID (using SQL's inline function) and then pivot.
yr_salecat_sdf. \
    withColumn('salecat_struct_arr',
               func.array(*[k for k in yr_salecat_sdf.columns if '_salecat' in k])
               ). \
    selectExpr('id', 'inline(salecat_struct_arr)'). \
    groupBy('year_band'). \
    pivot('salecat'). \
    agg(func.sum('sale')). \
    show()
# +---------+--------------+----------+---------+
# |year_band|existing_sales|left_sales|new_sales|
# +---------+--------------+----------+---------+
# |2019-2020| -150| 50| 150|
# |2018-2019| 55| 100| 25|
# +---------+--------------+----------+---------+
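If a year band ever has no sales in one of the categories, the pivot will leave a null in that cell. As in the last answer further below, a fillna(0) on the pivoted result tidies that up - a sketch, assuming the pivot-sum chain above is assigned to a (hypothetical) result_sdf instead of ending in show():
# result_sdf is a hypothetical name for the pivoted result above
result_sdf. \
    fillna(0, subset=['existing_sales', 'left_sales', 'new_sales']). \
    show()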
Additional details - schemas for all dataframes
data_sdf
# root
# |-- id: long (nullable = true)
# |-- amt: long (nullable = true)
# |-- dt: date (nullable = true)
yr_sales_sdf
# root
# |-- id: long (nullable = true)
# |-- 2018: long (nullable = true)
# |-- 2019: long (nullable = true)
# |-- 2020: long (nullable = true)
yr_salecat_sdf
# root
# |-- id: long (nullable = true)
# |-- 2018: long (nullable = true)
# |-- 2019: long (nullable = true)
# |-- 2020: long (nullable = true)
# |-- 2018_2019_salecat: struct (nullable = false)
# | |-- year_band: string (nullable = false)
# | |-- salecat: string (nullable = false)
# | |-- sale: long (nullable = true)
# |-- 2019_2020_salecat: struct (nullable = false)
# | |-- year_band: string (nullable = false)
# | |-- salecat: string (nullable = false)
# | |-- sale: long (nullable = true)
final result
# root
# |-- year_band: string (nullable = false)
# |-- existing_sales: long (nullable = true)
# |-- left_sales: long (nullable = true)
# |-- new_sales: long (nullable = true)

Here is the solution - let me know if you have any questions about it.
Approach:
Create two dataframes based on year(date), then:
do an inner join ---> to find the existing sales
df_2018 left_anti with df_2019 ---> gives left_sales
df_2019 left_anti with df_2018 ---> gives new_sales
Combine these three by union and you get the result.
Solution:-
import pyspark.sql.functions as F

schema = ["id", "date_val", "sales"]
data = [("1", "2018-12-31", "50"),
        ("2", "2018-12-31", "100"),
        ("3", "2018-12-31", "75"),
        ("1", "2019-12-31", "20"),
        ("3", "2019-12-31", "50"),
        ("5", "2019-12-31", "25")]
date_range = ["2018", "2019"]

df = spark.createDataFrame(data, schema)
df = df.withColumn("date_val", F.col("date_val").cast("date"))\
    .withColumn("year", F.year(F.col("date_val")).cast("string"))\
    .withColumn("year_bands", F.lit(date_range[0]+"-"+date_range[1]))

filter_cond_2018 = (F.col("year") == "2018")
df_2018 = df.filter(filter_cond_2018)
df_2019 = df.filter(~filter_cond_2018)

df_left_sales = df_2018.join(df_2019, ["id"], "left_anti")\
    .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Left_Sales"))
df_new_sales = df_2019.join(df_2018, ["id"], "left_anti")\
    .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("New_Sales"))
df_ext_sales_2018 = df_2018.join(df_2019, ["id"], "inner").select(df_2018["*"])\
    .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Existing_Sale_{}".format(date_range[0])))
df_ext_sales_2019 = df_2019.join(df_2018, ["id"], "inner").select(df_2019["*"])\
    .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Existing_Sale_{}".format(date_range[1])))

df_agg = df_left_sales.join(df_new_sales, ["year_bands"])\
    .join(df_ext_sales_2018, ["year_bands"])\
    .join(df_ext_sales_2019, ["year_bands"])

df_agg_fnl = df_agg\
    .withColumn("Existing_Sales", F.col("Existing_Sale_{}".format(date_range[0])) - F.col("Existing_Sale_{}".format(date_range[1])))\
    .select(["year_bands", "Left_Sales", "New_Sales", "Existing_Sales"])

df_agg_fnl.show(10, 0)
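Note that sales is created as a string in this example; if you want the sums to come back as longs rather than doubles, cast it right after creating df (a small optional tweak):
# optional: cast sales to a numeric type right after createDataFrame so the aggregated sums are longs
df = df.withColumn("sales", F.col("sales").cast("long"))
Either way, the single 2018-2019 row should come out to Left_Sales = 100, New_Sales = 25 and Existing_Sales = 55, matching the hand calculation in the question.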
Generic solution:
from functools import reduce

from pyspark.sql import functions as F, DataFrame

schema = ["id", "date_val", "sales"]
data = [("1", "2018-12-31", "50"),
        ("2", "2018-12-31", "100"),
        ("3", "2018-12-31", "75"),
        ("1", "2019-12-31", "20"),
        ("3", "2019-12-31", "50"),
        ("5", "2019-12-31", "25"),
        ("6", "2020-12-31", "25"),
        ("5", "2020-12-31", "10"),
        ("7", "2020-12-31", "25")]

df = spark.createDataFrame(data, schema)
df = df.withColumn("year", F.split(F.col('date_val'), '-').getItem(0))
# sort the distinct years so that consecutive list entries form valid year bands
year_bands = sorted(df.select("year").distinct().toPandas()["year"].tolist())

def calculate_agg_data(df, start_year, end_year):
    df_start_year = df.filter(F.col("year").isin([start_year]))
    df_end_year = df.filter(F.col("year").isin([end_year]))
    df_left_sales = df_start_year.join(df_end_year, ["id"], "left_anti")\
        .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Left_Sales"))
    df_new_sales = df_end_year.join(df_start_year, ["id"], "left_anti")\
        .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("New_Sales"))
    df_start_year_ext_sales = df_start_year.join(df_end_year, ["id"], "inner").select(df_start_year["*"])\
        .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Existing_Sale_{}".format(start_year)))
    df_end_year_ext_sales = df_end_year.join(df_start_year, ["id"], "inner").select(df_end_year["*"])\
        .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Existing_Sale_{}".format(end_year)))
    # final agg
    df_agg = df_left_sales.join(df_new_sales, ["year_bands"])\
        .join(df_start_year_ext_sales, ["year_bands"])\
        .join(df_end_year_ext_sales, ["year_bands"])
    df_agg_fnl = df_agg\
        .withColumn("Existing_Sales", F.col("Existing_Sale_{}".format(start_year)) - F.col("Existing_Sale_{}".format(end_year)))\
        .select(["year_bands", "Left_Sales", "New_Sales", "Existing_Sales"])
    return df_agg_fnl

df_lst = []
for index in range(len(year_bands)-1):
    start_year = year_bands[index]
    end_year = year_bands[index+1]
    df = df.withColumn("year_bands", F.lit(start_year+"-"+end_year))
    df_flt = df.filter(F.col("year").isin([start_year, end_year]))
    df_agg = calculate_agg_data(df_flt, start_year, end_year)
    df_lst.append(df_agg)

df_fnl = reduce(DataFrame.unionByName, df_lst)
df_fnl.show(10, 0)
Alternative for reduce:
df_fnl = df_lst[0]
for index in range(1, len(df_lst)):
    df_fnl = df_fnl.unionByName(df_lst[index])
df_fnl.show(40, 0)
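For the sample data above, both variants should produce one row per year band. Per the hand calculation in the question for 2018-2019, and applying the same logic to the 2020 rows, the result should look roughly like this (the exact numeric formatting depends on the type of the sales column):
# +----------+----------+---------+--------------+
# |year_bands|Left_Sales|New_Sales|Existing_Sales|
# +----------+----------+---------+--------------+
# |2018-2019 |100       |25       |55            |
# |2019-2020 |70        |50       |15            |
# +----------+----------+---------+--------------+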

So, based on your previous post, we have a dataframe df which is:
+---+----------+----+------------+---------+---------+---------+---------+
|id |date |year|year_lst |2017-2018|2018-2019|2019-2020|2020-2021|
+---+----------+----+------------+---------+---------+---------+---------+
|1 |31/12/2017|2017|[2017, 2018]|Existing |Left |null |null |
|2 |31/12/2017|2017|[2017] |Left |null |null |null |
|3 |31/12/2017|2017|[2017, 2018]|Existing |Left |null |null |
|1 |31/12/2018|2018|[2017, 2018]|Existing |Left |null |null |
|3 |31/12/2018|2018|[2017, 2018]|Existing |Left |null |null |
|5 |31/12/2018|2018|[2018] |New |Left |null |null |
+---+----------+----+------------+---------+---------+---------+---------+
As you mentioned, you now also have a sales column on which you need to do the aggregation, so assume that df.columns looks like this:
['id', 'date', 'year', 'sales', 'year_lst', '2017-2018', '2018-2019', '2019-2020', '2020-2021']
To achieve your goal, what you can do is:
from pyspark.sql import functions as func

# loop over the year-band columns (index 5 onwards, skipping year_lst), starting from the 2017-2018 column
for idx, year_year in enumerate(df.columns[5:]):
    new_df = df.filter(func.col('year') == year_year.split('-')[0])\
        .select('id',
                'sales',
                func.lit(year_year).alias('year_year'),
                func.col(year_year).alias('status'))
    if idx == 0:
        output = new_df
    else:
        output = output.unionAll(new_df)
What remains is just the aggregation:
output.groupby('year_year').pivot('status').agg(func.sum('sales')).fillna(0).orderBy(func.col('year_year')).show()

Related

Spark select item in array by max score

Given the following DataFrame containing an id and a Seq of Stuff (with an id and score), how do I select the "best" Stuff in the array by score?
I'd like NOT to use UDFs and possibly work with Spark DataFrame functions only.
case class Stuff(id: Int, score: Double)
val df = spark.createDataFrame(Seq(
  (1, Seq(Stuff(11, 0.4), Stuff(12, 0.5))),
  (2, Seq(Stuff(22, 0.9), Stuff(23, 0.8)))
)).toDF("id", "data")
df.show(false)
+---+----------------------+
|id |data |
+---+----------------------+
|1 |[[11, 0.4], [12, 0.5]]|
|2 |[[22, 0.9], [23, 0.8]]|
+---+----------------------+
df.printSchema
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: integer (nullable = false)
| | |-- score: double (nullable = false)
I tried going down the route of window functions but the code gets a bit too convoluted. Expected output:
+---+---------+
|id |topStuff |
+---+---------+
|1  |[12, 0.5]|
|2  |[22, 0.9]|
+---+---------+
You can use Spark 2.4 higher-order functions:
df
  .selectExpr("id", "(filter(data, x -> x.score == array_max(data.score)))[0] as topstuff")
  .show()
gives
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+
As an alternative, use window-functions (requires shuffling!):
df
  .select($"id", explode($"data").as("topstuff"))
  .withColumn("selector", max($"topstuff.score").over(Window.partitionBy($"id")))
  .where($"topstuff.score" === $"selector")
  .drop($"selector")
  .show()
also gives:
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+

Pivot spark dataframe array of kv pairs into individual columns

I have following schema:
root
|-- id: string (nullable = true)
|-- date: timestamp (nullable = true)
|-- config: struct (nullable = true)
| |-- entry: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- key: string (nullable = true)
| | | |-- value: string (nullable = true)
There will be no more than 3 key-value pairs (k1, k2, k3) in the array, and I would like to turn each key into its own column, with the corresponding data coming from the value of the same kv pair.
+--------+----------+----------+----------+---------+
|id |date |k1 |k2 |k3 |
+--------+----------+----------+----------+---------+
| id1 |2019-08-12|id1-v1 |id1-v2 |id1-v3 |
| id2 |2019-08-12|id2-v1 |id2-v2 |id2-v3 |
+--------+----------+----------+----------+---------+
So far I tried something like this:
sourceDF.filter($"someColumn".contains("SOME_STRING"))
  .select($"id", $"date", $"config.entry" as "kvpairs")
  .withColumn($"kvpairs".getItem(0).getField("key").toString(), $"kvpairs".getItem(0).getField("value"))
  .withColumn($"kvpairs".getItem(1).getField("key").toString(), $"kvpairs".getItem(1).getField("value"))
  .withColumn($"kvpairs".getItem(2).getField("key").toString(), $"kvpairs".getItem(2).getField("value"))
But in this case, the column names are shown as kvpairs[0][key], kvpairs[1][key] and kvpairs[2][key] as shown below:
+--------+----------+---------------+---------------+---------------+
|id |date |kvpairs[0][key]|kvpairs[1][key]|kvpairs[2][key]|
+--------+----------+---------------+---------------+---------------+
| id1 |2019-08-12| id1-v1 | id1-v2 | id1-v3 |
| id2 |2019-08-12| id2-v1 | id2-v2 | id2-v3 |
+--------+----------+---------------+---------------+---------------+
Two questions:
Is my approach right? Is there a better and easier way to pivot this such that I get one row per array, with the 3 kv pairs as 3 columns? I want to handle cases where the order of the kv pairs may differ.
If the above approach is fine, how do I alias the column name to the data of the "key" element in the array?
Using multiple withColumn together with getItem will not work since the order of the kv pairs may differ. What you can do instead is explode the array and then use pivot as follows:
sourceDF.filter($"someColumn".contains("SOME_STRING"))
  .select($"id", $"date", explode($"config.entry") as "exploded")
  .select($"id", $"date", $"exploded.*")
  .groupBy("id", "date")
  .pivot("key")
  .agg(first("value"))
The usage of first inside the aggregation here assumes there will be a single value for each key. Otherwise collect_list or collect_set can be used.
Result:
+---+----------+------+------+------+
|id |date      |k1    |k2    |k3    |
+---+----------+------+------+------+
|id1|2019-08-12|id1-v1|id1-v2|id1-v3|
|id2|2019-08-12|id2-v1|id2-v2|id2-v3|
+---+----------+------+------+------+

How to return ListBuffer as a column from UDF using Spark Scala?

I am trying to use UDFs and return a ListBuffer as a column from the UDF, but I am getting an error.
I have created the DF by executing the below code:
val df = Seq((1,"dept3##rama##kumar","dept3##rama##kumar"), (2,"dept31##rama1##kumar1","dept33##rama3##kumar3")).toDF("id","str1","str2")
df.show()
It shows like below:
+---+--------------------+--------------------+
| id| str1| str2|
+---+--------------------+--------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar|
| 2|dept31##rama1##ku...|dept33##rama3##ku...|
+---+--------------------+--------------------+
As per my requirement, I have to split the above columns based on some inputs, so I have tried a UDF like below:
def appendDelimiterError=udf((id: Int, str1: String, str2: String)=> {
var lit = new ListBuffer[Any]()
if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("#&"){val a=str1.split("#&")}
if(str2.contains("##"){ val b=str2.split("##")}
else if(str2.contains("##"){ val b=str2.split("##") }
else if(str1.contains("##"){val b=str2.split("##")}
var tmp_row = List(a,"test1",b)
lit +=tmp_row
return lit
})
val
I try to call it by executing the below code:
val df1=df.appendDelimiterError("newcol",appendDelimiterError(df("id"),df("str1"),df("str2"))
I am getting the error "this was a bad call". I want to use a ListBuffer/List to store the result and return it to the calling place.
my expected output will be:
+---+--------------------+------------------------+----------------------------------------------------------------------+
| id| str1| str2 | newcol |
+---+--------------------+------------------------+----------------------------------------------------------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar |ListBuffer(List("dept","rama","kumar"),List("dept3","rama","kumar")) |
| 2|dept31##rama1##kumar1|dept33##rama3##kumar3 | ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3")) |
+---+--------------------+------------------------+----------------------------------------------------------------------+
How to achieve this?
An alternative, with my own fictional data that you can tailor to your needs, and no UDF:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = Seq(
  (1, "111##cat##666", "222##fritz##777"),
  (2, "AAA##cat##555", "BBB##felix##888"),
  (3, "HHH##mouse##yyy", "123##mickey##ZZZ")
).toDF("c0", "c1", "c2")

val df2 = df.withColumn("c_split", split(col("c1"), ("(##)|(##)|(##)|(##)")))
  .union(df.withColumn("c_split", split(col("c2"), ("(##)|(##)|(##)|(##)"))))
df2.show(false)
df2.printSchema()

val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data"))
df3.show(false)
df3.printSchema()
Gives the answer, but with no ListBuffer - is that really necessary? - as follows:
+---+---------------+----------------+------------------+
|c0 |c1 |c2 |c_split |
+---+---------------+----------------+------------------+
|1 |111##cat##666 |222##fritz##777 |[111, cat, 666] |
|2 |AAA##cat##555 |BBB##felix##888 |[AAA, cat, 555] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[HHH, mouse, yyy] |
|1 |111##cat##666 |222##fritz##777 |[222, fritz, 777] |
|2 |AAA##cat##555 |BBB##felix##888 |[BBB, felix, 888] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+
root
|-- c0: integer (nullable = false)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c_split: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---------------------------------------+
|c0 |List_of_Data |
+---+---------------------------------------+
|1 |[[111, cat, 666], [222, fritz, 777]] |
|3 |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2 |[[AAA, cat, 555], [BBB, felix, 888]] |
+---+---------------------------------------+
root
|-- c0: integer (nullable = false)
|-- List_of_Data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)

Removing duplicate array structs by last item in array struct in Spark Dataframe

So my table looks something like this:
customer_1|place|customer_2|item |count
-------------------------------------------------
a | NY | b |(2010,304,310)| 34
a | NY | b |(2024,201,310)| 21
a | NY | b |(2010,304,312)| 76
c | NY | x |(2010,304,310)| 11
a | NY | b |(453,131,235) | 10
I've tried the following, but it does not eliminate the duplicates, as the former array is still there (as it should be - I need it for the end results).
val df = df_one.withColumn("vs", struct(col("item").getItem(size(col("item"))-1), col("item"), col("count")))
  .groupBy(col("customer_1"), col("place"), col("customer_2"))
  .agg(max("vs").alias("vs"))
  .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))
I would like to group by the customer_1, place and customer_2 columns and return only the array structs whose last item (-1) is unique, with the highest count. Any ideas?
Expected output:
customer_1|place|customer_2|item |count
-------------------------------------------------
a | NY | b |(2010,304,312)| 76
a | NY | b |(2010,304,310)| 34
a | NY | b |(453,131,235) | 10
c | NY | x |(2010,304,310)| 11
Given that the schema of the dataframe is as follows:
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- count: string (nullable = true)
You can apply the concat function to create a temp column for checking duplicate rows, as done below:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item"(size($"item")-1)))
.dropDuplicates("temp")
.drop("temp")
You should get the following output:
+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item |count|
+----------+-----+----------+----------------+-----+
|a |NY |b |[2010, 304, 312]|76 |
|c |NY |x |[2010, 304, 310]|11 |
|a |NY |b |[453, 131, 235] |10 |
|a |NY |b |[2010, 304, 310]|34 |
+----------+-----+----------+----------------+-----+
Struct
Given that the schema of the dataframe is as follows:
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: integer (nullable = false)
| |-- _3: integer (nullable = false)
|-- count: string (nullable = true)
We can still do the same as above, with a slight change to get the third item from the struct:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item._3"))
.dropDuplicates("temp")
.drop("temp")
Hope the answer is helpful

Processing (OSM) PBF files in Spark

OSM data is available in PBF format. There are specialised libraries (such as https://github.com/plasmap/geow) for parsing this data.
I want to store this data on S3 and parse the data into an RDD as part of an EMR job.
What is a straightforward way to achieve this? Can I fetch the file to the master node and process it locally? If so, would I create an empty RDD and add to it as streaming events are parsed from the input file?
One solution would be to skip the PBFs. One Spark-friendly representation is Parquet. In this blog post it is shown how to convert the PBFs to Parquets and how to load the data in Spark.
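Once converted, loading the Parquet data back into Spark is a one-liner (sketched in PySpark; the S3 path is a placeholder for wherever you keep the converted files):
# placeholder path - point it at your converted Parquet files on S3
osm_df = spark.read.parquet("s3://your-bucket/osm/parquet/")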
I released a new version of Osm4Scala that includes support for Spark 2 and 3.
There are a lot of examples in the README.md
It is really simple to use:
scala> val osmDF = spark.sqlContext.read.format("osm.pbf").load("<osm files path here>")
osmDF: org.apache.spark.sql.DataFrame = [id: bigint, type: tinyint ... 5 more fields]
scala> osmDF.createOrReplaceTempView("osm")
scala> spark.sql("select type, count(*) as num_primitives from osm group by type").show()
+----+--------------+
|type|num_primitives|
+----+--------------+
| 1| 338795|
| 2| 10357|
| 0| 2328075|
+----+--------------+
scala> spark.sql("select distinct(explode(map_keys(tags))) as tag_key from osm order by tag_key asc").show()
+------------------+
| tag_key|
+------------------+
| Calle|
| Conference|
| Exper|
| FIXME|
| ISO3166-1|
| ISO3166-1:alpha2|
| ISO3166-1:alpha3|
| ISO3166-1:numeric|
| ISO3166-2|
| MAC_dec|
| Nombre|
| Numero|
| Open|
| Peluqueria|
| Residencia UEM|
| Telefono|
| abandoned|
| abandoned:amenity|
| abandoned:barrier|
|abandoned:building|
+------------------+
only showing top 20 rows
scala> spark.sql("select id, latitude, longitude, tags from osm where type = 0").show()
+--------+------------------+-------------------+--------------------+
| id| latitude| longitude| tags|
+--------+------------------+-------------------+--------------------+
| 171933| 40.42006|-3.7016600000000004| []|
| 171946| 40.42125|-3.6844500000000004|[highway -> traff...|
| 171948|40.420230000000004|-3.6877900000000006| []|
| 171951|40.417350000000006|-3.6889800000000004| []|
| 171952| 40.41499|-3.6889800000000004| []|
| 171953| 40.41277|-3.6889000000000003| []|
| 171954| 40.40946|-3.6887900000000005| []|
| 171959| 40.40326|-3.7012200000000006| []|
|20952874| 40.42099|-3.6019200000000007| []|
|20952875|40.422610000000006|-3.5994900000000007| []|
|20952878| 40.42136000000001| -3.601470000000001| []|
|20952879| 40.42262000000001| -3.599770000000001| []|
|20952881| 40.42905000000001|-3.5970500000000007| []|
|20952883| 40.43131000000001|-3.5961000000000007| []|
|20952888| 40.42930000000001| -3.596590000000001| []|
|20952890| 40.43012000000001|-3.5961500000000006| []|
|20952891| 40.43043000000001|-3.5963600000000007| []|
|20952892| 40.43057000000001|-3.5969100000000007| []|
|20952893| 40.43039000000001|-3.5973200000000007| []|
|20952895| 40.42967000000001|-3.5972300000000006| []|
+--------+------------------+-------------------+--------------------+
only showing top 20 rows
You should definitely take a look at the Atlas project (written in Java): https://github.com/osmlab/atlas and https://github.com/osmlab/atlas-generator. It is being built by Apple's developers and allows distributed processing of osm.pbf files using Spark.
I wrote a spark data source for .pbf files. It uses Osmosis libraries underneath and leverages pruning of entities: https://github.com/igorgatis/spark-osmpbf
You probably want to read .pbf and write into a parquet file to make future queries much faster. Sample usage:
import io.github.igorgatis.spark.osmpbf.OsmPbfOptions
val df = spark.read
  .format(OsmPbfOptions.FORMAT)
  .options(new OsmPbfOptions()
    .withExcludeMetadata(true)
    .withTagsAsMap(true)
    .toMap)
  .load("path/to/some.osm.pbf")
df.printSchema
Prints:
root
|-- entity_type: string (nullable = false)
|-- id: long (nullable = false)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- index: integer (nullable = false)
| | |-- nodeId: long (nullable = false)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- member_id: long (nullable = false)
| | |-- role: string (nullable = true)
| | |-- type: string (nullable = true)