Accessing a Nested Map column in Spark Dataframes without using explode - scala

I have a column in a Spark dataframe where the schema looks something like this:
|-- seg: map (nullable = false)
| |-- key: string
| |-- value: array (valueContainsNull = false)
| | |-- element: struct (containsNull = false)
| | | |-- id: integer (nullable = false)
| | | |-- expiry: long (nullable = false)
The value in the column looks something like this:
Map(10000124 -> WrappedArray([20185255,1561507200], [20185256,1561507200]))
What I want to do is create a column from this Map column that contains only the array [20185255,20185256] (its elements are the first field of each struct in the WrappedArray). How do I do this?
I am trying not to use "explode".
Also, is there a way I can use a UDF that takes in the Map and returns those values?
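For the UDF part, here is a minimal sketch of the extraction logic in plain Python rather than Scala. It assumes the map reaches the function as a dict whose values are lists of (id, expiry) pairs, which is how PySpark hands map, array and struct values to a Python UDF (structs arrive as Row objects, which can be indexed like tuples). The name ids_from_seg is made up for illustration. Spark's built-in map_values and flatten functions (both available from Spark 2.4) may also get there without a UDF or explode.

```python
def ids_from_seg(seg):
    """Collect the first field (id) of every struct in every map value.

    `seg` is assumed to be a dict of key -> list of (id, expiry) pairs.
    """
    if seg is None:
        return []
    # Walk every array stored in the map and keep only the id of each entry.
    return [entry[0] for entries in seg.values() for entry in entries]


sample = {"10000124": [(20185255, 1561507200), (20185256, 1561507200)]}
ids = ids_from_seg(sample)
print(ids)
```

Wrapping this in a UDF would also need a declared return type of array of integers.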

Related

Scala Spark : How to extract nested column names from parquet file and adding prefix to it

The idea is to read a parquet file into a DataFrame, then extract all column names and types from its schema. If there are nested columns, I would like to add a "prefix" before the column name.
Note that a nested column can have properly named sub-columns, or it can be just an array of arrays whose only sub-column name is "element".
val dfSource: DataFrame = spark.read.parquet("path.parquet")
val dfSourceSchema: StructType = dfSource.schema
Example of dfSourceSchema (Input):
|-- exCar: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: binary (nullable = true)
|-- exProduct: string (nullable = true)
|-- exName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exNameOne: string (nullable = true)
| | |-- exNameTwo: string (nullable = true)
Desired output :
((exCar.prefix.prefix, binary), (exProduct, String), (exName.prefix.exNameOne, String), (exName.prefix.exNameTwo, String))
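One way to approach this is a recursive walk over the schema. The sketch below is plain Python, not Scala, and works on the JSON form of a Spark schema (what schema.json produces in Scala, or df.schema.jsonValue() in PySpark). The function name flatten_schema and the hand-written schema dict are assumptions for illustration. Type names come out in Spark's lowercase spelling ("string", "binary") rather than "String".

```python
def flatten_schema(data_type, path=(), prefix="prefix"):
    """Return (dotted_name, type_name) pairs for every leaf of a schema.

    `data_type` is the JSON form of a Spark data type: a plain string for
    primitives, or a dict with a "type" key for struct/array/map.
    """
    if isinstance(data_type, dict) and data_type.get("type") == "struct":
        leaves = []
        for field in data_type["fields"]:
            leaves.extend(flatten_schema(field["type"], path + (field["name"],), prefix))
        return leaves
    if isinstance(data_type, dict) and data_type.get("type") == "array":
        # An array level has no column name of its own ("element"), so use the prefix.
        return flatten_schema(data_type["elementType"], path + (prefix,), prefix)
    # Primitive types are strings; anything else (e.g. map) is reported by its type tag.
    type_name = data_type if isinstance(data_type, str) else data_type["type"]
    return [(".".join(path), type_name)]


# Hand-written JSON form of the example schema from the question.
schema = {
    "type": "struct",
    "fields": [
        {"name": "exCar", "nullable": True, "metadata": {},
         "type": {"type": "array", "containsNull": True,
                  "elementType": {"type": "array", "containsNull": True,
                                  "elementType": "binary"}}},
        {"name": "exProduct", "nullable": True, "metadata": {}, "type": "string"},
        {"name": "exName", "nullable": True, "metadata": {},
         "type": {"type": "array", "containsNull": True,
                  "elementType": {"type": "struct", "fields": [
                      {"name": "exNameOne", "nullable": True, "metadata": {}, "type": "string"},
                      {"name": "exNameTwo", "nullable": True, "metadata": {}, "type": "string"},
                  ]}}},
    ],
}

flat = flatten_schema(schema)
for name, type_name in flat:
    print(name, type_name)
```

The same recursion could be written in Scala by pattern matching on StructType and ArrayType.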

How to perform general processing on spark StructType in Scala UDF?

I have a dataframe with the following schema:
root
|-- name: integer (nullable = true)
|-- address: integer (nullable = true)
|-- cases: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- caseOpenTime: integer (nullable = true)
| | |-- caseDescription: string (nullable = true)
| | |-- caseSymptons: map (nullable = true)
| | | |-- key: string
| | | |-- value: struct (valueContainsNull = true)
| | | | |-- duration: integer (nullable = true)
| | | | |-- description: string (nullable = true)
I want to write a UDF that takes the "cases" column of the dataframe and produces another column, "derivedColumn1", from it.
I want to write this derivation logic as general-purpose processing, without using the SQL constructs supported by Spark DataFrames. The steps would be:
val deriveDerivedColumen1_with_MapType = udf(_ => MapType = (casesColumn: ArrayType) => {
val derivedMapSchema = DataTypes.createMapType(StringType, LongType)
1. Convert casesColumn to scala object-X
2. ScalaMap<String, Long> scalaMap = myScalaObjectProcessing(object-X)
3. (return) scalaMap.convertToSparkMapType(schema = derivedMapSchema)
})
For specific use cases, DataFrame SQL constructs can be used. But I am looking for general processing that is not constrained by SQL constructs, so specifically I am looking for ways to:
Convert a complex Spark StructType into a Scala object (Object-X)
Perform "SOME" general-purpose processing on Object-X
Convert Object-X back into a Spark MapType that can be added as a new column in the dataframe
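To make the three steps concrete, here is a rough sketch of what the body of such a UDF could look like. It is written in plain Python instead of Scala, and the derivation itself (total duration per symptom name) is purely hypothetical, since the question leaves the processing open. It assumes each case and each symptom value can be read with ["field"] lookups, which holds for plain dicts and for the Row objects PySpark passes to a UDF. The returned dict of string to integer is the kind of value that would be declared as MapType(StringType, LongType).

```python
def derive_symptom_durations(cases):
    """Hypothetical derivation: sum the duration of each symptom over all cases."""
    totals = {}
    for case in cases or []:  # a null "cases" column arrives as None
        symptoms = case["caseSymptons"] or {}
        for name, detail in symptoms.items():
            # Map values are nullable structs, and so is their duration field.
            if detail is not None and detail["duration"] is not None:
                totals[name] = totals.get(name, 0) + detail["duration"]
    return totals


cases = [
    {"caseOpenTime": 1, "caseDescription": "first",
     "caseSymptons": {"fever": {"duration": 3, "description": "mild"},
                      "cough": {"duration": 2, "description": "dry"}}},
    {"caseOpenTime": 2, "caseDescription": "second",
     "caseSymptons": {"fever": {"duration": 4, "description": "high"},
                      "rash": None}},
]
derived = derive_symptom_durations(cases)
print(derived)
```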

Retrieve subkey values of all the keys in a JSON Spark dataframe

I have a dataframe with a schema like the one below (I have a large number of keys):
|-- loginRequest: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
|-- loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
I want to create a column holding the responseHeader.status value across all of these keys.
Expected:
+------------+-------------+------+
|loginRequest|loginResponse|status|
+------------+-------------+------+
|       [0,1]|         null|     0|
|        null|        [0,1]|     0|
|        null|        [0,1]|     0|
|        null|        [1,0]|     1|
+------------+-------------+------+
Thanks in Advance
A simple select will solve your problem.
You have a nested field:
loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status
A quick way would be to flatten your dataframe.
Doing something like this:
df.select(df.col("loginRequest.*"),df.col("loginResponse.*"))
and work from there.
Or, you could use something like this:
var explodeDF = df.withColumn("statusRequest", df("loginRequest.responseHeader"))
These questions may also help:
Flattening Rows in Spark
DataFrame explode list of JSON objects
To populate the column from either the response or the request, you can use a when condition in Spark.
- How to use AND or OR condition in when in Spark
You can reach the subfields with the . delimiter in the select statement, and with the help of the coalesce method you should get exactly what you are aiming for. Let's call the input dataframe df, with your specified input schema; then this piece of code should do the work:
import org.apache.spark.sql.functions.{coalesce, col}
val df_status = df.withColumn("status",
  coalesce(
    col("loginRequest.responseHeader.status"),
    col("loginResponse.responseHeader.status")
  )
)
coalesce takes the first non-null value, in the order of the input columns passed to the method; if there is no non-null value, it returns null (see https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#coalesce-org.apache.spark.sql.Column...-).
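Since the question mentions a large number of keys, the column paths can be generated from a list of key names instead of being typed out. The following is a plain-Python illustration of that idea and of the coalesce semantics, not Spark code: rows are modelled as nested dicts, and get_path, coalesce, and the keys list are stand-ins written for this sketch.

```python
def get_path(row, path):
    """Follow a dotted path through nested dicts, returning None on any null."""
    current = row
    for part in path.split("."):
        if current is None:
            return None
        current = current.get(part)
    return current


def coalesce(*values):
    """Return the first value that is not None, or None if there is none."""
    return next((v for v in values if v is not None), None)


# One path per top-level key; with many keys this list would be built from the schema.
keys = ["loginRequest", "loginResponse"]
paths = [f"{key}.responseHeader.status" for key in keys]

rows = [
    {"loginRequest": {"responseHeader": {"status": 0, "code": 1}}, "loginResponse": None},
    {"loginRequest": None, "loginResponse": {"responseHeader": {"status": 0, "code": 1}}},
    {"loginRequest": None, "loginResponse": {"responseHeader": {"status": 0, "code": 1}}},
    {"loginRequest": None, "loginResponse": {"responseHeader": {"status": 1, "code": 0}}},
]

statuses = [coalesce(*(get_path(row, p) for p in paths)) for row in rows]
print(statuses)
```

Note the explicit "is not None" check: a status of 0 is a valid value and must not be skipped as if it were null.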

In PySpark how to parse an embedded JSON

I am new to PySpark.
I have a JSON file which has below schema
df = spark.read.json(input_file)
df.printSchema()
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUrl: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
|-- type: long (nullable = true)
I want a new result dataframe with only two columns: type and UrlsInfo.element.DisplayUrl.
This is the code I tried, which doesn't give the expected output:
df.createOrReplaceTempView("the_table")
resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
resultDF.show()
I want resultDF to be something like this:
Type | DisplayUrl
----- ------------
2 | http://example.com
This is related to "JSON file parsing in Pyspark", but that question doesn't answer mine.
As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item thus refers not to a named property (you're trying to access it by .element) but to an array element (which responds to an index like [0]).
I've reproduced your schema by hand:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()
root
|-- Type: long (nullable = true)
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUri: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
and I'm able to produce a table like what you seem to be looking for by using an index:
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()
+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
| 2| http://example.com|
+----+----------------------+
However, this only gives the first element (if any) of UrlsInfo in the second column.
EDIT: I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()
+----+--------------------+
|type| displayUri|
+----+--------------------+
| 2| http://example.com|
| 2|http://another-ex...|
+----+--------------------+

Take only a part of a MongoDB Document into a Spark Dataframe

I'm holding relatively large Documents in my MongoDB, and I need only a small part of each Document's information to be loaded into a Spark Dataframe to work on. This is an example of a Document (with many unnecessary fields removed for the readability of this question):
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- customerInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- events: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- relevantField: integer (nullable = true)
| | | | |-- relevantField_2: string (nullable = true)
| | |-- situation: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- currentRank: integer (nullable = true)
| | |-- info: struct (nullable = true)
| | | |-- customerId: integer (nullable = true)
What I do now is explode "customerInfo":
val df = MongoSpark.load(sparksess)
val new_df = df.withColumn("customerInfo", explode(col("customerInfo")))
  .select(col("_id"),
    col("customerInfo.situation").getItem(13).getField("currentRank").alias("currentRank"),
    col("customerInfo.info.customerId"),
    col("customerInfo.events.relevantField"),
    col("customerInfo.events.relevantField_2"))
To my understanding, this loads the whole "customerInfo" into memory in order to act on it, which is a waste of time and resources. How can I explode only the specific information I need? Thank you!
how can I explode only the specific information I need?
Use Filters to filter the data in MongoDB first before sending it to Spark.
MongoDB Spark Connector will construct an Aggregation Pipeline to only send the filtered data into Spark, reducing the amount of data.
You could use the $project aggregation stage to project only certain fields. See also MongoDB Spark Connector: Filters and Aggregation
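As a rough illustration, a $project stage for the fields used in the question might look like the pipeline below, built here as plain Python data and serialized to JSON. The field list is taken from the select in the question. How the pipeline is handed to the connector depends on the connector version and API, so treat that step as something to check against the connector documentation; it is not shown here.

```python
import json

# Keep only the fields the Spark job reads; _id is included by default.
pipeline = [
    {
        "$project": {
            "customerInfo.situation.currentRank": 1,
            "customerInfo.info.customerId": 1,
            "customerInfo.events.relevantField": 1,
            "customerInfo.events.relevantField_2": 1,
        }
    }
]

# JSON text of the pipeline, in case the connector expects it as a string.
pipeline_json = json.dumps(pipeline)
print(pipeline_json)
```

With this projection MongoDB drops the other fields before sending documents to Spark, so the explode only has to handle the reduced documents.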