Spark reading Open Street Map data and selecting entries - scala

I have OpenStreetMap (OSM) data for a country, read from an .orc file into the variable nlorc, and I am trying to extract the data for specific cities. As far as I know, a city entity is defined as a 'relation' in OSM. Calling nlorc.printSchema() on my data returns the following:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- lat: decimal(9,7) (nullable = true)
|-- lon: decimal(10,7) (nullable = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
|-- changeset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- uid: long (nullable = true)
|-- user: string (nullable = true)
|-- version: long (nullable = true)
|-- visible: boolean (nullable = true)
As an example, https://www.openstreetmap.org/relation/47798#map=13/51.4373/4.8888 shows that the name of the city is part of "Tags". How can I access the keys of Tags and select specific cities?

You can use getItem to access the elements of the map:
val df = ... // the DataFrame read from the .orc file (nlorc in the question)
df.filter(df("tags").getItem("name") === "Baarle-Nassau").show()
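As a rough sketch of the same idea, the snippet below lists the keys of the tags map with map_keys and then filters relations by their name tag. The sample rows, the DataFrame name osm and the list wanted are made up for illustration; only id, type and tags from the schema above are modelled.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, map_keys}

val spark = SparkSession.builder().master("local[*]").appName("osm-tags").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for nlorc: only id, type and tags are modelled.
val osm = Seq(
  (47798L, "relation", Map("name" -> "Baarle-Nassau", "boundary" -> "administrative")),
  (1L, "node", Map("name" -> "Baarle-Nassau")),
  (2L, "relation", Map("name" -> "Tilburg"))
).toDF("id", "type", "tags")

// List the keys present in each row's tags map.
osm.select(col("id"), map_keys(col("tags")).as("tag_keys")).show(false)

// Keep only relations whose name tag matches one of the wanted cities.
val wanted = Seq("Baarle-Nassau", "Tilburg")
val cities = osm.filter(
  col("type") === "relation" && col("tags").getItem("name").isin(wanted: _*))
cities.show(false)
```

Rows without a name tag yield null from getItem, so the filter drops them.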

Related

How to drop nested column or filter nested column in scala

root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- index: string (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
How can I remove columns from the webhooks struct based on a list given as input,
e.g. filterList: List[String] = List("index", "status")? Is there a way to do this by iterating over rows, so that the intermediate schema changes but not the final schema?
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- index: string (nullable = false)
| |-- status: string (nullable = true)
Check the code below.
scala> df.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = true)
| |-- index: string (nullable = true)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
scala> import org.apache.spark.sql.functions.{col, struct}
scala> val actualColumns = df.select("webhooks.*").columns // field names inside webhooks
scala> val removeColumns = Seq("index", "status") // fields to drop
scala> val webhooks = struct(actualColumns.filter(c => !removeColumns.contains(c)).map(c => col(s"webhooks.$c")): _*).as("webhooks")
Output
scala> df.withColumn("webhooks",webhooks).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- updated_at: string (nullable = true)
You can also look at https://stackoverflow.com/a/39943812/2204206
That approach can be more convenient when removing deeply nested columns.
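On Spark 3.1 or later, Column.dropFields may be a shorter route to the same result. The sketch below assumes that version; the sample row is invented, and removeColumns plays the same role as in the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[*]").appName("drop-fields").getOrCreate()
import spark.implicits._

// Hypothetical sample row shaped like the webhooks struct in the question.
val df = Seq(("1", "i0", "2020-01-01", "ok", "2020-01-02"))
  .toDF("_id", "index", "failed_at", "status", "updated_at")
  .select(
    col("_id"),
    struct(col("index"), col("failed_at"), col("status"), col("updated_at")).as("webhooks"))

val removeColumns = Seq("index", "status")

// Column.dropFields (Spark 3.1+) removes the named fields from the struct.
val trimmed = df.withColumn("webhooks", col("webhooks").dropFields(removeColumns: _*))
trimmed.printSchema()
```

Unlike rebuilding the struct by hand, this does not need the list of remaining fields.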

How to append more columns to a structural dataframe in scala

I have two dataframes (A and B). A has a nested (struct) schema, whereas B has a flat schema, as shown below. I want to append the columns of B to A to produce C.
A:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
B:
root
|-- globalPackageId: long (nullable = true)
|-- order_id: long (nullable = true)
|-- order_address: string (nullable = true)
|-- order_number: integer (nullable = true)
C:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
| |-- order_id: long (nullable = true)
| |-- order_address: string (nullable = true)
| |-- order_number: integer (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
I have been trying to use .withColumn(struct("xxx"), "xxx"),
but the result is still not what I expect.
Does anyone have experience with this?
Thanks,
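One possible approach, sketched under assumptions: join A and B on globalPackageId, then rebuild the package struct from its existing fields plus the remaining columns of B. The names dfA, dfB, dfC and the sample values are hypothetical, and A is reduced to its package struct for brevity.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[*]").appName("append-to-struct").getOrCreate()
import spark.implicits._

// Hypothetical sample data; A keeps only the package struct for brevity.
val dfA = Seq((1L, "na-1", "first"), (2L, "na-2", "second"))
  .toDF("globalPackageId", "naPackageId", "packageName")
  .select(struct(col("globalPackageId"), col("naPackageId"), col("packageName")).as("package"))
val dfB = Seq((1L, 100L, "Main St 1", 7))
  .toDF("globalPackageId", "order_id", "order_address", "order_number")

// Columns of B to nest under package (everything except the join key).
val extra = dfB.columns.filter(_ != "globalPackageId")

// Left join keeps every row of A; unmatched rows get null order fields.
val joined = dfA.join(dfB, dfA("package.globalPackageId") === dfB("globalPackageId"), "left")

// Rebuild package from its existing fields plus the extra columns of B,
// then drop the flat copies that came from B.
val packageFields = dfA.select("package.*").columns.map(f => col(s"package.$f"))
val dfC = joined
  .withColumn("package", struct(packageFields ++ extra.map(col): _*))
  .drop(dfB.columns: _*)
dfC.printSchema()
```

The rebuilt package struct is reported as non-nullable, which differs slightly from the schema of C shown above. On Spark 3.1 or later, Column.withField could add the fields without listing the existing ones.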

How to display the string variable in the root using Spark SQL?

Long story short: I am using Spark code in Scala IDE to convert JSON to CSV. I have no Spark knowledge, as I have only worked with RDBMSs such as Oracle, TD and DB2. All I was given was the code that converts the JSON data to CSV, and how to pass the arguments to retrieve data from the schema.
I am now able to fetch data that is inside a struct and an array by using
val val1 = df.select(explode($"data.business").as("ID")).select($"ID.amountTO")
val1.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(args(2) + "\\Result" + "\\" + timeForpath + "\\val1")
I don't know how to export the columns that are not inside the struct but sit directly at the root of the schema, such as QAYONOutCome, QA1PartiesComments, etc.
root
|-- QAYONOutCome: string (nullable = true)
|-- QA1PartiesComments: string (nullable = true)
|-- QA1PartiesQID: string (nullable = true)
|-- QA1PartiesResponse: string (nullable = true)
|-- QAHolderTypeComments: string (nullable = true)
|-- QAHolderTypeQID: string (nullable = true)
|-- QAHolderTypeResponse: string (nullable = true)
|-- QAhighRiskComments: string (nullable = true)
|-- QAhighRiskQID: string (nullable = true)
|-- QAhighRiskResponse: string (nullable = true)
|-- QA2ClassComments: string (nullable = true)
|-- QA2ClassQID: string (nullable = true)
|-- QA2ClassResponse: string (nullable = true)
|-- QAoutcomeComments: string (nullable = true)
|-- QAoutcomeQID: string (nullable = true)
|-- QAoutcomeResponse: string (nullable = true)
|-- data: struct (nullable = true)
| |-- business: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- amountTO: string (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Registration: struct (nullable = true)
| | | | |-- country: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- line1: string (nullable = true)
| | | | |-- line2: string (nullable = true)
| | | | |-- postCode: string (nullable = true)
Any help is appreciated. Apologies if my question sounds very dumb :(. Please let me know if some more information is needed to provide a solution or clarity. Thanks much in advance.
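A minimal sketch of one way to do this: root-level columns are selected by name with no struct prefix, and they can be selected next to an exploded array. The JSON line below is invented and keeps only a few of the fields from the schema above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[*]").appName("root-columns").getOrCreate()
import spark.implicits._

// Hypothetical JSON line with two root-level strings and a nested array.
val json = Seq(
  """{"QAYONOutCome":"yes","QA1PartiesComments":"none","data":{"business":[{"amountTO":"10","ID":"a"},{"amountTO":"20","ID":"b"}]}}""")
val df = spark.read.json(json.toDS())

// Root-level columns are selected by name, with no struct prefix.
val rootOnly = df.select(col("QAYONOutCome"), col("QA1PartiesComments"))
rootOnly.show(false)

// They can also sit next to an exploded array; each root value repeats per element.
val combined = df
  .select(col("QAYONOutCome"), col("QA1PartiesComments"), explode(col("data.business")).as("business"))
  .select(col("QAYONOutCome"), col("QA1PartiesComments"), col("business.amountTO"))
combined.show(false)
```

Either result can then be written out the same way as val1 in the question.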

Spark Scala: How to Replace a Field in Deeply Nested DataFrame

I have a DataFrame which contains multiple nested columns. The schema is not static and could change upstream of my Spark application. Schema evolution is guaranteed to always be backward compatible. An anonymized, shortened version of the schema is pasted below
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: string (nullable = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
The Spark job is supposed to transform "structX.uriArguments" from string to map(string, string). A somewhat similar situation is discussed in this post. However, the answer there assumes the schema is static and does not change, so a case class does not work in my situation.
What would be the best way to transform structX.uriArguments without hard-coding the entire schema inside the code? The outcome should look like this:
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
Thanks
You could try DataFrame.withColumn(). The column expression passed to it can reference nested fields, so you could add a new map column built from structX.uriArguments and drop the flat string one. This question shows how to handle structs with withColumn.
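As a sketch of that suggestion without hard-coding the schema, the field names of structX can be read from the DataFrame at runtime and the struct rebuilt with only uriArguments replaced. The sample row and the "k1=v1&k2=v2" argument format are assumptions; the SQL function str_to_map does the parsing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, struct}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().master("local[*]").appName("replace-nested").getOrCreate()
import spark.implicits._

// Hypothetical sample: structX with two fields, plus one root column.
val df = Seq((true, 1, "a=1&b=2"))
  .toDF("isXPresent", "hostIPAddress", "uriArguments")
  .select(col("isXPresent"), struct(col("hostIPAddress"), col("uriArguments")).as("structX"))

// Read the field names of structX from the runtime schema instead of hard-coding them.
val fieldNames = df.schema("structX").dataType.asInstanceOf[StructType].fieldNames

// Rebuild structX field by field, swapping only uriArguments for a parsed map.
// Assumption: arguments look like "k1=v1&k2=v2".
val rebuilt = fieldNames.map {
  case "uriArguments" => expr("str_to_map(structX.uriArguments, '&', '=')").as("uriArguments")
  case other          => col(s"structX.$other").as(other)
}
val result = df.withColumn("structX", struct(rebuilt: _*))
result.printSchema()
```

One caveat: a row whose structX is null ends up with a non-null struct of null fields. On Spark 3.1 or later, col("structX").withField("uriArguments", ...) is likely a simpler alternative.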

How can I create a nested column by joining in Spark?

I would like to perform a "join" on two Spark DataFrames (Scala), but instead of a SQL-like join, I'd like to insert the "joined" row from the second DataFrame as a single nested column in the first. The reason to do so is, ultimately, to write back out to JSON with a nested structure. I know the answer is likely already on Stackoverflow, but some searching has not turned up my answer.
Table 1
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Table 2
root
|-- BioProject: string (nullable = true)
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- abstract: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- dbGaP: string (nullable = true)
|-- description: string (nullable = true)
|-- external_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- submitter_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Join is on table1.study_accession with table2.accession. Result is below. Note the new column called study that contains record equivalents of Rows from table 2.
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
|-- accession: string (nullable = true)
|-- study: struct (nullable = true)
| |-- BioProject: string (nullable = true)
| |-- Insdc: string (nullable = true)
| |-- LastMetaUpdate: string (nullable = true)
| |-- LastUpdate: string (nullable = true)
| |-- Published: string (nullable = true)
| |-- Received: string (nullable = true)
| |-- ReplacedBy: string (nullable = true)
| |-- Status: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- abstract: string (nullable = true)
| |-- accession: string (nullable = true)
| |-- alias: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tag: string (nullable = true)
| | | |-- value: string (nullable = true)
| |-- dbGaP: string (nullable = true)
| |-- description: string (nullable = true)
| |-- external_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- submitter_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- tags: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: string (nullable = true)
From my understanding of your question, let's say you have two dataframes
df1
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
and
df2
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: double (nullable = false)
You will have to combine all the columns of df2 into a struct column, then select the joining column together with that struct column. Here I am taking col1 as the joining column
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"..." column syntax
val nestedDF2 = df2.select($"col1", struct(df2.columns.map(col):_*).as("nested_df2"))
The final step is the join (the default is an inner join)
df1.join(nestedDF2, Seq("col1"))
which should give you
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
|-- nested_df2: struct (nullable = false)
| |-- col1: string (nullable = true)
| |-- col2: string (nullable = true)
| |-- col3: double (nullable = false)
I hope the answer is helpful
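Applied to the tables in the question, where the key columns have different names (study_accession and accession), the same idea might look like the sketch below. The two tables are heavily trimmed, the sample values are invented, and study_key is a hypothetical helper column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[*]").appName("nested-join").getOrCreate()
import spark.implicits._

// Hypothetical, heavily trimmed versions of Table 1 and Table 2.
val table1 = Seq(("SRX1", "SRP1"), ("SRX2", "SRP9")).toDF("accession", "study_accession")
val table2 = Seq(("SRP1", "Study one")).toDF("accession", "title")

// Pack every Table 2 column into one struct and expose the key under a new name
// so it does not clash with Table 1's own accession column.
val nested = table2.select(
  col("accession").as("study_key"),
  struct(table2.columns.map(col): _*).as("study"))

// Left join keeps Table 1 rows without a matching study (study is null there).
val result = table1
  .join(nested, col("study_accession") === col("study_key"), "left")
  .drop("study_key")
result.printSchema()
result.toJSON.show(false)
```

The left join is a design choice for this sketch; an inner join, as in the answer above, would drop Table 1 rows that have no study.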