I'm holding relatively large documents in my MongoDB, and I only need a small part of each document loaded into a Spark DataFrame to work on. This is an example of a document (with a lot of unnecessary fields removed for readability of this question):
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- customerInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- events: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- relevantField: integer (nullable = true)
| | | | |-- relevantField_2: string (nullable = true)
| | |-- situation: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- currentRank: integer (nullable = true)
| | |-- info: struct (nullable = true)
| | | |-- customerId: integer (nullable = true)
What I do now is explode "customerInfo":
val df = MongoSpark.load(sparksess)
val new_df = df.withColumn("customerInfo", explode(col("customerInfo")))
.select(col("_id"),
col("customerInfo.situation").getItem(13).getField("currentRank").alias("currentRank"),
col("customerInfo.info.customerId"),
col("customerInfo.events.relevantField"),
col("customerInfo.events.relevantField_2"))
Now, to my understanding, this loads the whole "customerInfo" into memory to perform actions on it, which is a waste of time and resources. How can I explode only the specific information I need? Thank you!
how can I explode only the specific information I need?
Use filters to filter the data in MongoDB before sending it to Spark.
The MongoDB Spark Connector will construct an aggregation pipeline so that only the filtered data is sent to Spark, reducing the amount of data transferred.
You could also use the $project aggregation stage to project only the fields you need. See also MongoDB Spark Connector: Filters and Aggregation.
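For example, a minimal sketch of both approaches, assuming the connector 2.x API used in the question; the projected field paths and the isNotNull filter are illustrative, not taken from the original post:

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.functions.col
import org.bson.Document

// Option 1: selections and filters applied to the loaded DataFrame are pushed
// down by the connector, which builds the aggregation pipeline for you.
val trimmed = MongoSpark.load(sparksess)
  .select(col("_id"), col("customerInfo"))   // only these fields leave MongoDB
  .filter(col("customerInfo").isNotNull)     // becomes a $match stage

// Option 2: state the pipeline explicitly via the RDD API, then convert to a DataFrame.
val projected = MongoSpark.load(sparksess.sparkContext)
  .withPipeline(Seq(Document.parse(
    """{ "$project": { "customerInfo.situation.currentRank": 1,
      |                "customerInfo.info.customerId": 1,
      |                "customerInfo.events": 1 } }""".stripMargin)))
  .toDF()

Either way, the parts of customerInfo that are not projected never leave MongoDB, and the explode/select from the question can then run on the reduced DataFrame.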
Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map-type column and select all the columns which end with _id.
I'm using the below code.
val exploded = batch_df.select($"event_date", explode($"ATTRIBUTES"))
exploded.show()
I am getting the below sample output.
+----------+-----------+-----+
|      date|        key|value|
+----------+-----------+-----+
|2021-05-18|    SYST_id|   85|
|2021-05-18|   RECVR_id|    1|
|2021-05-18| Account_Id|12345|
|2021-05-18|      Vb_id|  845|
|2021-05-18|SYS_INFO_id|  640|
|2021-05-18|     mem_id|  456|
+----------+-----------+-----+
However, my required output is as below.
+----------+-------+--------+----------+-----+-----------+------+
|      date|SYST_id|RECVR_id|Account_Id|Vb_id|SYS_INFO_id|mem_id|
+----------+-------+--------+----------+-----+-----------+------+
|2021-05-18|     85|       1|     12345|  845|        640|   456|
+----------+-------+--------+----------+-----+-----------+------+
Could someone please assist?
Your approach works. You only have to add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as the aggregation function.
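For instance, a sketch of the collect_list variant, reusing the exploded frame from above:

// Each pivoted cell now holds an array of all values for that date/key pair.
exploded.groupBy("date").pivot("key").agg(collect_list("value")).show()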
Edit:
To add srcId and srctype, simply add these columns to the select statement:
val exploded = batch_df.select($"event_date", $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns
exploded.filter($"key".isin(relevant_cols:_*).or($"key".endsWith(lit("_split"))))
.groupBy("date").pivot("key").agg(first("value")).show()
I am reading data from Kafka and loading it into a data warehouse. From one Kafka topic I am
creating a data frame, and after applying the required transformations I am creating multiple DFs out of it and loading those DFs into different tables, but this operation happens in sequence. Is there a way I can parallelize this table load process?
root
|-- attribute1Formatted: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- accexecattributes: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- primary: boolean (nullable = true)
| | |-- accountExecUUID: string (nullable = true)
|-- attribute2Formatted: struct (nullable = true)
| |-- Jake-DOT-Sandler#xyz.com: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- primary: boolean (nullable = true)
I have created two different dataframes, one for attribute1Formatted and one for attribute2Formatted, and these DFs are then saved into different tables in the database.
I don't have much knowledge of Spark Streaming, but I believe streaming runs as iterative micro-batches, and in Spark batch execution each action has one sink/output. So you can't store the data in different tables with one execution.
Now,
if you write it to one table, readers can simply read only the columns they require. I mean: do you really need to store it in different places?
You can write it twice, filtering out the fields that are not required.
Both write actions will execute the computation of the full dataset and then drop the columns that are not required.
If computing the full dataset is expensive, you can cache it before the filtering + write, as in the sketch below.
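A minimal sketch of that cache-then-write-twice pattern for the batch case, assuming the combined DataFrame is called df; the selected columns and target table names are placeholders, not from the original post:

import org.apache.spark.sql.DataFrame

// Cache once so both writes reuse the same computed dataset.
val cached: DataFrame = df.cache()

cached
  .select("attribute1Formatted")                        // keep only what table 1 needs
  .write.mode("append").saveAsTable("table_attribute1") // placeholder table name

cached
  .select("attribute2Formatted")                        // keep only what table 2 needs
  .write.mode("append").saveAsTable("table_attribute2") // placeholder table name

cached.unpersist()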
I have a dataframe with the following schema:
root
|-- name: integer (nullable = true)
|-- address: integer (nullable = true)
|-- cases: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- caseOpenTime: integer (nullable = true)
| | |-- caseDescription: string (nullable = true)
| | |-- caseSymptons: map (nullable = true)
| | | |-- key: string
| | | |-- value: struct (valueContainsNull = true)
| | | | |-- duration: integer (nullable = true)
| | | | |-- description: string (nullable = true)
I want to write a UDF that can take the "cases" column in the data frame and produce another column "derivedColumn1" from it.
I want to write this derivation logic as general-purpose processing, without using the SQL constructs supported by Spark DataFrames. So the steps will be:
val deriveDerivedColumn1_with_MapType = udf((casesColumn: Seq[Row]) => {
  val derivedMapSchema = DataTypes.createMapType(StringType, LongType)
  // 1. Convert casesColumn to a Scala object X
  // 2. val scalaMap: Map[String, Long] = myScalaObjectProcessing(objectX)
  // 3. return scalaMap, converted to a Spark MapType matching derivedMapSchema
})
For specific use cases, DataFrame SQL constructs can be used. But I am looking for general processing that is not constrained by SQL constructs, so I am specifically looking for ways to do the following:
How to convert a complex Spark StructType into a Scala datatype Object-X?
Then perform "some" general-purpose processing on the Scala Object-X.
How to convert the Scala Object-X back into a Spark MapType which can be added as a new column in the dataframe?
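A sketch of what such a UDF could look like, assuming the schema above and that df is the dataframe; myScalaObjectProcessing is a hypothetical placeholder. The point is only the conversions: Spark hands each array element to the UDF as a Row, and a returned Scala Map[String, Long] comes back as MapType(StringType, LongType):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical general-purpose processing over plain Scala values.
def myScalaObjectProcessing(cases: Seq[(String, Int)]): Map[String, Long] =
  cases.groupBy(_._1).map { case (desc, xs) => desc -> xs.map(_._2.toLong).sum }

// Each element of the "cases" array arrives as a Row; read its fields by name,
// hand them to ordinary Scala code, and return a Scala Map.
val deriveDerivedColumn1 = udf((cases: Seq[Row]) => {
  val asScala = cases.map(c => (c.getAs[String]("caseDescription"), c.getAs[Int]("caseOpenTime")))
  myScalaObjectProcessing(asScala)   // becomes MapType(StringType, LongType) in the DataFrame
})

val withDerived = df.withColumn("derivedColumn1", deriveDerivedColumn1(col("cases")))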
I am new to PySpark.
I have a JSON file which has the below schema:
df = spark.read.json(input_file)
df.printSchema()
root
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUrl: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
|-- type: long (nullable = true)
I want a new result dataframe which should have only two columns: type and UrlsInfo.element.DisplayUrl.
This is my attempt, which doesn't give the expected output:
df.createOrReplaceTempView("the_table")
resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
resultDF.show()
I want resultDF to be something like this:
Type | DisplayUrl
----- ------------
2 | http://example.com
This is related to JSON file parsing in Pyspark, but it doesn't answer my question.
As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item thus refers not to a named property (you're trying to access it by .element) but to an array element (which responds to an index like [0]).
I've reproduced your schema by hand:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()
root
|-- Type: long (nullable = true)
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUri: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
and I'm able to produce a table like what you seem to be looking for by using an index:
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()
+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
|   2|    http://example.com|
+----+----------------------+
However, this only gives the first element (if any) of UrlsInfo in the second column.
EDIT: I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()
+----+--------------------+
|type|          displayUri|
+----+--------------------+
|   2|  http://example.com|
|   2|http://another-ex...|
+----+--------------------+
I have a Spark DF with one column, where each row is of the type
org.apache.spark.sql.Row
and has the form:
col1: array (nullable = true)
| |-- A1: struct (containsNull = true)
| | |-- B1: struct (nullable = true)
| | | |-- B11: string (nullable = true)
| | | |-- B12: string (nullable = true)
| | |-- B2: string (nullable = true)
I am trying to get the value of
A1->B1->B11.
Are there any methods to fetch this with the DataFrame APIs or by indexing, without converting each row into a Seq and then iterating through it, which hurts my performance badly? Any suggestions would be great. Thanks.
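A sketch of the kind of column-level access this seems to call for, assuming the DataFrame is called df and that A1 in the schema above denotes the array element; both forms stay inside the DataFrame API, so there is no per-row conversion to a Seq:

import org.apache.spark.sql.functions.col

// Drill into one element of the array, then into the nested structs.
val firstB11 = df.select(
  col("col1").getItem(0).getField("B1").getField("B11").alias("B11"))

// Or take B11 from every element at once: dot notation over an array of structs
// yields an array column containing each element's B1.B11.
val allB11 = df.select(col("col1.B1.B11").alias("B11s"))

The first form returns a single string per row, the second an array holding the B11 value of every element.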