OSM data is available in PBF format. There are specialised libraries for parsing this data (such as https://github.com/plasmap/geow).
I want to store this data on S3 and parse the data into an RDD as part of an EMR job.
What is a straightforward way to achieve this? Can I fetch the file to the master node and process it locally? If so, would I create an empty RDD and add to it as streaming events are parsed from the input file?
One solution would be to skip the PBFs altogether. A Spark-friendly representation is Parquet; this blog post shows how to convert the PBFs to Parquet and how to load the data in Spark.
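For example, once the conversion is done, loading the result is a plain Parquet read. A minimal sketch, assuming the converter wrote node and way data to hypothetical S3 paths:

// Hypothetical S3 layout; adjust the paths to wherever the converter wrote its output.
val nodes = spark.read.parquet("s3://my-bucket/osm/nodes.parquet")
val ways = spark.read.parquet("s3://my-bucket/osm/ways.parquet")

nodes.printSchema()
println(nodes.count())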
I released a new version of Osm4Scala that includes support for Spark 2 and 3.
There are a lot of examples in the README.md.
It is really simple to use:
scala> val osmDF = spark.sqlContext.read.format("osm.pbf").load("<osm files path here>")
osmDF: org.apache.spark.sql.DataFrame = [id: bigint, type: tinyint ... 5 more fields]
scala> osmDF.createOrReplaceTempView("osm")
scala> spark.sql("select type, count(*) as num_primitives from osm group by type").show()
+----+--------------+
|type|num_primitives|
+----+--------------+
| 1| 338795|
| 2| 10357|
| 0| 2328075|
+----+--------------+
scala> spark.sql("select distinct(explode(map_keys(tags))) as tag_key from osm order by tag_key asc").show()
+------------------+
| tag_key|
+------------------+
| Calle|
| Conference|
| Exper|
| FIXME|
| ISO3166-1|
| ISO3166-1:alpha2|
| ISO3166-1:alpha3|
| ISO3166-1:numeric|
| ISO3166-2|
| MAC_dec|
| Nombre|
| Numero|
| Open|
| Peluqueria|
| Residencia UEM|
| Telefono|
| abandoned|
| abandoned:amenity|
| abandoned:barrier|
|abandoned:building|
+------------------+
only showing top 20 rows
scala> spark.sql("select id, latitude, longitude, tags from osm where type = 0").show()
+--------+------------------+-------------------+--------------------+
| id| latitude| longitude| tags|
+--------+------------------+-------------------+--------------------+
| 171933| 40.42006|-3.7016600000000004| []|
| 171946| 40.42125|-3.6844500000000004|[highway -> traff...|
| 171948|40.420230000000004|-3.6877900000000006| []|
| 171951|40.417350000000006|-3.6889800000000004| []|
| 171952| 40.41499|-3.6889800000000004| []|
| 171953| 40.41277|-3.6889000000000003| []|
| 171954| 40.40946|-3.6887900000000005| []|
| 171959| 40.40326|-3.7012200000000006| []|
|20952874| 40.42099|-3.6019200000000007| []|
|20952875|40.422610000000006|-3.5994900000000007| []|
|20952878| 40.42136000000001| -3.601470000000001| []|
|20952879| 40.42262000000001| -3.599770000000001| []|
|20952881| 40.42905000000001|-3.5970500000000007| []|
|20952883| 40.43131000000001|-3.5961000000000007| []|
|20952888| 40.42930000000001| -3.596590000000001| []|
|20952890| 40.43012000000001|-3.5961500000000006| []|
|20952891| 40.43043000000001|-3.5963600000000007| []|
|20952892| 40.43057000000001|-3.5969100000000007| []|
|20952893| 40.43039000000001|-3.5973200000000007| []|
|20952895| 40.42967000000001|-3.5972300000000006| []|
+--------+------------------+-------------------+--------------------+
only showing top 20 rows
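As a small follow-up, sticking to the columns already shown above (type and the tags map), the registered temp view also supports per-tag lookups, e.g. counting the nodes that carry a highway tag:

// Hypothetical follow-up query that reuses the "osm" temp view registered above.
spark.sql("select count(*) as highway_nodes from osm where type = 0 and tags['highway'] is not null").show()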
You should definitely take a look at the Atlas project (written in Java): https://github.com/osmlab/atlas and https://github.com/osmlab/atlas-generator. It is being built by Apple's developers and allows distributed processing of osm.pbf files using Spark.
I wrote a spark data source for .pbf files. It uses Osmosis libraries underneath and leverages pruning of entities: https://github.com/igorgatis/spark-osmpbf
You probably want to read .pbf and write into a parquet file to make future queries much faster. Sample usage:
import io.github.igorgatis.spark.osmpbf.OsmPbfOptions
val df = spark.read
  .format(OsmPbfOptions.FORMAT)
  .options(new OsmPbfOptions()
    .withExcludeMetadata(true)
    .withTagsAsMap(true)
    .toMap)
  .load("path/to/some.osm.pbf")
df.printSchema
Prints:
root
|-- entity_type: string (nullable = false)
|-- id: long (nullable = false)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- index: integer (nullable = false)
| | |-- nodeId: long (nullable = false)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- member_id: long (nullable = false)
| | |-- role: string (nullable = true)
| | |-- type: string (nullable = true)
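As the note above suggests, writing the loaded DataFrame out as Parquet makes future queries much faster. A minimal sketch (the output path is just a placeholder):

// Persist the parsed entities as Parquet; partitioning by entity_type (node/way/relation) is optional.
df.write
  .mode("overwrite")
  .partitionBy("entity_type")
  .parquet("path/to/some.osm.parquet")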
Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map-type column and select all the columns that end with _id.
I'm using the code below.
val exploded = batch_df.select($"event_date", explode($"ATTRIBUTES")).show()
I am getting the below sample output.
+----------+-----------+-----+
|      date|        key|value|
+----------+-----------+-----+
|2021-05-18|    SYST_id|   85|
|2021-05-18|   RECVR_id|    1|
|2021-05-18| Account_Id|12345|
|2021-05-18|      Vb_id|  845|
|2021-05-18|SYS_INFO_id|  640|
|2021-05-18|     mem_id|  456|
+----------+-----------+-----+
However, my required output is as below.
+----------+-------+--------+----------+-----+-----------+------+
|      date|SYST_id|RECVR_id|Account_Id|Vb_id|SYS_INFO_id|mem_id|
+----------+-------+--------+----------+-----+-----------+------+
|2021-05-18|     85|       1|     12345|  845|        640|   456|
+----------+-------+--------+----------+-----+-----------+------+
Could someone please assist?
Your approach works; just drop the trailing .show() so that exploded remains a DataFrame, then add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as aggregation function.
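For instance, a minimal sketch of the collect_list variant (same exploded DataFrame and columns as above):

// If (date, key) is not unique, keep all values per key as an array instead of taking the first one.
exploded.groupBy("date").pivot("key").agg(collect_list("value")).show()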
Edit:
To add srcId and srctype, simply add these columns to the select statement:
val exploded = batch_df.select($"event_date", $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns
exploded.filter($"key".isin(relevant_cols: _*).or($"key".endsWith(lit("_id"))))
.groupBy("date").pivot("key").agg(first("value")).show()
I know this question has been asked many times on Stack Overflow and has been satisfactorily answered in most posts, but I'm not sure if this is the best way in my case.
I have a Dataset that has several struct types embedded in it:
root
|-- STRUCT1: struct (nullable = true)
| |-- FIELD_1: string (nullable = true)
| |-- FIELD_2: long (nullable = true)
| |-- FIELD_3: integer (nullable = true)
|-- STRUCT2: struct (nullable = true)
| |-- FIELD_4: string (nullable = true)
| |-- FIELD_5: long (nullable = true)
| |-- FIELD_6: integer (nullable = true)
|-- STRUCT3: struct (nullable = true)
| |-- FIELD_7: string (nullable = true)
| |-- FIELD_8: long (nullable = true)
| |-- FIELD_9: integer (nullable = true)
|-- ARRAYSTRUCT4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FIELD_10: integer (nullable = true)
| | |-- FIELD_11: integer (nullable = true)
+-------+------------+------------+------------------+
|STRUCT1| STRUCT2 | STRUCT3 | ARRAYSTRUCT4 |
+-------+------------+------------+------------------+
|[1,2,3]|[aa, xx, yy]|[p1, q2, r3]|[[1a, 2b],[3c,4d]]|
+-------+------------+------------+------------------+
I want to convert this into:
1. A dataset where the structs are expanded into columns.
2. A data set where the array (ARRAYSTRUCT4) is exploded into rows.
root
|-- FIELD_1: string (nullable = true)
|-- FIELD_2: long (nullable = true)
|-- FIELD_3: integer (nullable = true)
|-- FIELD_4: string (nullable = true)
|-- FIELD_5: long (nullable = true)
|-- FIELD_6: integer (nullable = true)
|-- FIELD_7: string (nullable = true)
|-- FIELD_8: long (nullable = true)
|-- FIELD_9: integer (nullable = true)
|-- FIELD_10: integer (nullable = true)
|-- FIELD_11: integer (nullable = true)
+-------+-------+-------+-------+     +--------+--------+
|FIELD_1|FIELD_2|FIELD_3|FIELD_4| ... |FIELD_10|FIELD_11|
+-------+-------+-------+-------+     +--------+--------+
|      1|      2|      3|     aa| ... |      1a|      2b|
+-------+-------+-------+-------+     +--------+--------+
To achieve this, I could use:
val expanded = df.select("STRUCT1.*", "STRUCT2.*", "STRUCT3.*", "ARRAYSTRUCT4")
followed by an explode:
val exploded = expanded.select(explode(expanded("ARRAYSTRUCT4")))
However, I was wondering if there's a more functional way to do this, especially the select. I could use withColumn as below:
data.withColumn("FIELD_1", $"STRUCT1".getField("FIELD_1"))
  .withColumn("FIELD_2", $"STRUCT1".getField("FIELD_2"))
.....
But I have 80+ columns. Is there a better way to achieve this?
You can first make all columns struct-type by explode-ing any Array(struct) columns into struct columns via foldLeft, then use map to interpolate each of the struct column names into col.*, as shown below:
import org.apache.spark.sql.functions._
case class S1(FIELD_1: String, FIELD_2: Long, FIELD_3: Int)
case class S2(FIELD_4: String, FIELD_5: Long, FIELD_6: Int)
case class S3(FIELD_7: String, FIELD_8: Long, FIELD_9: Int)
case class S4(FIELD_10: Int, FIELD_11: Int)
val df = Seq(
  (S1("a1", 101, 11), S2("a2", 102, 12), S3("a3", 103, 13), Array(S4(1, 1), S4(3, 3))),
  (S1("b1", 201, 21), S2("b2", 202, 22), S3("b3", 203, 23), Array(S4(2, 2), S4(4, 4)))
).toDF("STRUCT1", "STRUCT2", "STRUCT3", "ARRAYSTRUCT4")
// +-----------+-----------+-----------+--------------+
// | STRUCT1| STRUCT2| STRUCT3| ARRAYSTRUCT4|
// +-----------+-----------+-----------+--------------+
// |[a1,101,11]|[a2,102,12]|[a3,103,13]|[[1,1], [3,3]]|
// |[b1,201,21]|[b2,202,22]|[b3,203,23]|[[2,2], [4,4]]|
// +-----------+-----------+-----------+--------------+
val arrayCols = df.dtypes
  .filter(t => t._2.startsWith("ArrayType(StructType"))
  .map(_._1)
// arrayCols: Array[String] = Array(ARRAYSTRUCT4)
val expandedDF = arrayCols.foldLeft(df)((accDF, c) =>
  accDF.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)
)
val structCols = expandedDF.columns
expandedDF.select(structCols.map(c => col(s"$c.*")): _*).show
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 1| 1|
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 3| 3|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 2| 2|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 4| 4|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
Note that for simplicity it's assumed that your DataFrame has only struct and Array(struct)-type columns. If there are other data types, just apply filtering conditions to arrayCols and structCols accordingly.
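A sketch of that filtering, assuming any non-struct columns should simply be carried through unchanged (structOnlyCols and otherCols are names introduced here just for illustration):

// Expand only the struct columns; keep every other column as-is.
val structOnlyCols = expandedDF.dtypes.filter(_._2.startsWith("StructType")).map(_._1)
val otherCols = expandedDF.columns.filterNot(structOnlyCols.contains)

expandedDF.select((otherCols.map(col) ++ structOnlyCols.map(c => col(s"$c.*"))): _*).show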
I want to filter a Spark sql.DataFrame, keeping only the wanted array elements, without knowing the whole schema beforehand (I don't want to hardcode it).
Schema:
root
|-- callstartcelllabel: string (nullable = true)
|-- calltargetcelllabel: string (nullable = true)
|-- measurements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- enodeb: string (nullable = true)
| | |-- label: string (nullable = true)
| | |-- ltecelloid: long (nullable = true)
|-- networkcode: long (nullable = true)
|-- ocode: long (nullable = true)
|-- startcelllabel: string (nullable = true)
|-- startcelloid: long (nullable = true)
|-- targetcelllabel: string (nullable = true)
|-- targetcelloid: long (nullable = true)
|-- timestamp: long (nullable = true)
I want the whole root row, but only with the particular measurements that pass the filter, and each row must contain at least one measurement after filtering.
I have a dataframe of this root, and I have a dataframe of filtering values (one column).
So, for example: I would only know that my root contains a measurements array, and that this array contains labels. I want the whole root row with all the measurements whose labels are in ("label1", "label2").
My last attempt with explode and collect_list fails with: grouping expressions sequence is empty, and 'callstartcelllabel' is not an aggregate function... Is it even possible to generalize such a filtering case? I don't know yet what such a generic UDAF should look like.
I'm new to Spark.
EDIT:
The current solution I've come to is:
explode the array -> filter out rows with unwanted array members -> group by everything except the array members -> agg(collect_list(col("measurements")))
Would it be faster to do this with a UDF? I can't figure out how to write a generic UDF that filters a generic array while knowing only the filtering values...
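Roughly, under the schema above, that pipeline could look like the sketch below (df stands for the root DataFrame and wantedLabels for the filtering values; both names are placeholders):

import org.apache.spark.sql.functions._

val wantedLabels = Seq("label1", "label2")
// Group by every column except the array itself; rows with no matching measurement are dropped.
val keyCols = df.columns.filterNot(_ == "measurements").map(col)

df.withColumn("measurement", explode(col("measurements")))
  .filter(col("measurement.label").isin(wantedLabels: _*))
  .groupBy(keyCols: _*)
  .agg(collect_list(col("measurement")).as("measurements"))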
case class Test(a: Int, b: Int) // case class declared to reproduce the scenario above
var df = List((1, 2, Test(1, 2)), (2, 3, Test(3, 4)), (4, 2, Test(5, 6))).toDF("name", "rank", "array")
+----+----+------+
|name|rank| array|
+----+----+------+
| 1| 2|[1, 2]|
| 2| 3|[3, 4]|
| 4| 2|[5, 6]|
+----+----+------+
df.printSchema
//dataFrame structure look like this
root
|-- name: integer (nullable = false)
|-- rank: integer (nullable = false)
|-- array: struct (nullable = true)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
df.filter(df("array")("a")>1).show
//after filter on dataFrame on specified condition
+----+----+------+
|name|rank| array|
+----+----+------+
| 2| 3|[3, 4]|
| 4| 2|[5, 6]|
+----+----+------+
//Above code help you to understand the Scenario
//use this piece of code:
df.filter(df("measurements")("label") === "label1" || df("measurements")("label") === "label2").show
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
var df = Seq(
  (1, 2, Array(Test(1, 2), Test(5, 6))),
  (1, 3, Array(Test(1, 2), Test(5, 3))),
  (10, 11, Array(Test(1, 6)))
).toDF("name", "rank", "array")
df.show
+----+----+----------------+
|name|rank| array|
+----+----+----------------+
| 1| 2|[[1, 2], [5, 6]]|
| 1| 3|[[1, 2], [5, 3]]|
| 10| 11| [[1, 6]]|
+----+----+----------------+
def test = {
  udf((a: scala.collection.mutable.WrappedArray[Row]) => {
    val b = a.toArray.map(x => (x.getInt(0), x.getInt(1)))
    b.filter(y => y._1 > 1)
  })
}
df.withColumn("array", test(df("array"))).show
+----+----+--------+
|name|rank| array|
+----+----+--------+
| 1| 2|[[5, 6]]|
| 1| 3|[[5, 3]]|
| 10| 11| []|
+----+----+--------+
I was in the process of flattening a Spark Schema using the method suggested here, when I came across an edge case -
val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  )))
))
writerSchema.printTreeString()
root
|-- f1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- f2: array (nullable = true)
| | | | |-- element: long (containsNull = true)
This prints only f1 as output, and not
f1
f1.f2
as I expected.
Questions -
Is writerSchema a valid Spark schema?
How do I handle ArrayType objects when flattening the schema?
If you want to handle data like this
val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""
The valid schema will be
val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  ))
))
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
You shouldn't be putting an ArrayType inside another ArrayType.
So let's suppose you have a dataframe inputDF:
inputDF.printSchema
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
inputDF.show(false)
+-------------------------------------------------------------------------------------------------------+
|f1 |
+-------------------------------------------------------------------------------------------------------+
|[[WrappedArray(1, 2, 3)], [WrappedArray(4, 5, 6)], [WrappedArray(7, 8, 9)], [WrappedArray(10, 11, 12)]]|
+-------------------------------------------------------------------------------------------------------+
To flatten this dataframe we can explode the array columns (f1 and f2):
First, flatten column 'f1'
val semiFlattenDF = inputDF.select(explode(col("f1"))).select(col("col.*"))
semiFlattenDF.printSchema
root
|-- f2: array (nullable = true)
| |-- element: long (containsNull = true)
semiFlattenDF.show
+------------+
| f2|
+------------+
| [1, 2, 3]|
| [4, 5, 6]|
| [7, 8, 9]|
|[10, 11, 12]|
+------------+
Now flatten column 'f2' and get the column name as 'value'
val fullyFlattenDF = semiFlattenDF.select(explode(col("f2")).as("value"))
So now the DataFrame is flattened:
fullyFlattenDF.printSchema
root
|-- value: long (nullable = true)
fullyFlattenDF.show
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
+-----+
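If the data really did match the doubly nested schema from the question (an array of arrays of structs), a sketch of the same approach would simply chain one more explode (nestedDF is a hypothetical DataFrame with that schema):

import org.apache.spark.sql.functions.{col, explode}

// f1: array<array<struct<f2: array<long>>>>  ->  one long value per row
val flattened = nestedDF
  .select(explode(col("f1")).as("outer"))      // array<struct<f2: array<long>>>
  .select(explode(col("outer")).as("inner"))   // struct<f2: array<long>>
  .select(explode(col("inner.f2")).as("value"))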
I have a dataframe with the following schema:
|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _v1: string (nullable = true)
| | |-- _v2: string (nullable = true)
VALUES are like -
[["1","a"],["2","b"],["3","c"],["4","d"]]
[["4","g"]]
[["3","e"],["4","f"]]
I want to take, from VALUES, the element with the lowest integer, i.e.
the result df should look like this (each value will now be a StructType, not an Array[Struct]):
["1","a"]
["4","g"]
["3","e"]
Can someone please guide me on how I can approach this problem by creating a UDF?
Thanks in advance.
You don't need a UDF for that. Just use sort_array and pick the first element.
df.show
+--------------------+
| data_arr|
+--------------------+
|[[4,a], [2,b], [1...|
| [[1,a]]|
| [[3,b], [1,v]]|
+--------------------+
df.printSchema
root
|-- data_arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = false)
import org.apache.spark.sql.functions.sort_array
df.withColumn("first_asc", sort_array($"data_arr")(0)).show
+--------------------+---------+
| data_arr|first_asc|
+--------------------+---------+
|[[4,a], [2,b], [1...| [1,c]|
| [[1,a]]| [1,a]|
| [[3,b], [1,v]]| [1,v]|
+--------------------+---------+
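And if only the smallest element itself is needed (rather than the original array next to it), the same idea works in a plain select:

df.select(sort_array($"data_arr")(0).as("first_asc")).show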
Using the same dataframe as in the example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val findSmallest = udf((rows: Seq[Row]) => {
  rows.map(row => (row.getAs[String](0), row.getAs[String](1))).sorted.head
})
df.withColumn("SMALLEST", findSmallest($"VALUES"))
Will give a result like this:
+---+--------------------+--------+
| ID|              VALUES|SMALLEST|
+---+--------------------+--------+
|  1|[[1,a], [2,b], [3...|   [1,a]|
|  2|             [[4,g]]|   [4,g]|
|  3|      [[3,e], [4,f]]|   [3,e]|
+---+--------------------+--------+
If you only want the final values, use select("SMALLEST").