How to partition Spark RDD when importing Postgres using JDBC? - postgresql

I am importing a Postgres database into Spark. I know that I can partition on import, but that requires that I have a numeric column (I don't want to use the value column because it's all over the place and doesn't maintain order):
df = spark.read.format('jdbc').options(url=url, dbtable='tableName', properties=properties).load()
df.printSchema()
root
|-- id: string (nullable = false)
|-- timestamp: timestamp (nullable = false)
|-- key: string (nullable = false)
|-- value: double (nullable = false)
Instead, I am converting the dataframe into an rdd (of enumerated tuples) and trying to partition that instead:
rdd = df.rdd.flatMap(lambda x: enumerate(x)).partitionBy(20)
Note that I used 20 because I have 5 workers with 4 cores each in my cluster, and 5*4=20.
Unfortunately, the following command still takes forever to execute:
result = rdd.first()
Therefore I am wondering whether my logic above makes sense. Am I doing anything wrong? From the web UI, it looks like the workers are not being used.

Since you already know you can partition by a numeric column, this is probably what you should do. Here is the trick: first, let's find the minimum and maximum epoch:
url = ...
properties = ...
min_max_query = """(
SELECT
CAST(min(extract(epoch FROM timestamp)) AS bigint),
CAST(max(extract(epoch FROM timestamp)) AS bigint)
FROM tablename
) tmp"""
min_epoch, max_epoch = spark.read.jdbc(
url=url, table=min_max_query, properties=properties
).first()
and use it to query the table:
numPartitions = ...
query = """(
SELECT *, CAST(extract(epoch FROM timestamp) AS bigint) AS epoch
FROM tablename) AS tmp"""
spark.read.jdbc(
url=url, table=query,
lowerBound=min_epoch, upperBound=max_epoch + 1,
column="epoch", numPartitions=numPartitions, properties=properties
).drop("epoch")
Since this splits data into ranges of the same size, it is relatively sensitive to data skew, so you should use it with caution.
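One way to see how well the ranges worked out is to count rows per partition; a minimal sketch, assuming the partitioned read above is assigned to partitioned_df (this scans the data once):

# Count rows in each JDBC partition to spot skew.
sizes = partitioned_df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()
print(sizes)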
You could also provide a list of disjoint predicates as a predicates argument.
predicates = [
    "id BETWEEN 'a' AND 'c'",
    "id BETWEEN 'd' AND 'g'",
    ...  # continue to get full coverage and the desired number of predicates
]

spark.read.jdbc(
    url=url, table="tablename", properties=properties,
    predicates=predicates
)
The latter approach is much more flexible and can address certain issues with non-uniform data distribution but requires more knowledge about the data.
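For example, if you want evenly sized epoch ranges expressed as explicit predicates, you could generate them from the min_epoch and max_epoch computed above; a rough sketch (numPredicates is just an illustrative count):

# Build disjoint, exhaustive epoch-range predicates covering [min_epoch, max_epoch].
# Each predicate becomes the WHERE clause of one partition's query.
numPredicates = 20
step = (max_epoch - min_epoch) // numPredicates + 1

predicates = [
    ("CAST(extract(epoch FROM timestamp) AS bigint) >= {0} AND "
     "CAST(extract(epoch FROM timestamp) AS bigint) < {1}").format(
        min_epoch + i * step, min_epoch + (i + 1) * step)
    for i in range(numPredicates)
]

df = spark.read.jdbc(
    url=url, table="tablename", properties=properties,
    predicates=predicates
)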
Using partitionBy fetches the data first and then performs a full shuffle to get the desired number of partitions, so it is relatively expensive.

Related

Is there any way to change one spark DF's datatype to match with other DF's datatype

I have one Spark DF1 with these datatypes:
string (nullable = true)
integer (nullable = true)
timestamp (nullable = true)
And one more Spark DF2 (which I created from a Pandas DF):
object
int64
object
Now I need to change the DF2 datatypes to match the DF1 datatypes. Is there a generic way to do that? Every time I may get different columns and different datatypes.
Is there a way to assign the DF1 schema to some StructType and use that for DF2?
Suppose you have two dataframes, data1_sdf and data2_sdf. You can use a dataframe's schema to extract a column's data type with data1_sdf.select('column_name').schema[0].dataType.
Here's an example where data2_sdf columns are cast using data1_sdf's types within a select:
from pyspark.sql import functions as func
# cast each column that also exists in data1_sdf to that column's type in data1_sdf
data2_sdf. \
    select(*[func.col(c).cast(data1_sdf.select(c).schema[0].dataType) if c in data1_sdf.columns else c for c in data2_sdf.columns])
If you make sure that your first object is a string-like column and your third object is a timestamp-like column, you can try this method:
df2 = spark.createDataFrame(
    df2.rdd, schema=df1.schema
)
However, this method is not guaranteed to work; some cases are not valid (e.g. transforming from string to integer). Also, this method might not be good from a performance perspective. Therefore, you are better off using cast to change the data type of each column.
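A rough sketch of that cast-per-column route, assuming both dataframes share the same column names (df1_types is just a helper name introduced here):

from pyspark.sql import functions as F

# Map each column name in df1 to its declared data type.
df1_types = {field.name: field.dataType for field in df1.schema.fields}

# Cast df2's columns to df1's types; columns unknown to df1 are left untouched.
df2_casted = df2.select(
    *[F.col(c).cast(df1_types[c]) if c in df1_types else F.col(c) for c in df2.columns]
)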

How to create a list of all values of a column in Structured Streaming?

I have a spark structured streaming job which gets records from Kafka (10,000 as maxOffsetsPerTrigger). I get all those records by spark's readStream method. This dataframe has a column named "key".
I need string(set(all values in that column 'key')) to use this string in a query to ElasticSearch.
I have already tried df.select("key").collect().distinct() but it throws an exception:
collect() is not supported with structured streaming.
Thanks.
EDIT:
DATAFRAME:
+-------+-------------------+----------+
| key| ex|new column|
+-------+-------------------+----------+
| fruits| [mango, apple]| |
|animals| [cat, dog, horse]| |
| human|[ram, shyam, karun]| |
+-------+-------------------+----------+
SCHEMA:
root
|-- key: string (nullable = true)
|-- ex: array (nullable = true)
| |-- element: string (containsNull = true)
|-- new column: string (nullable = true)
STRING I NEED:
'["fruits", "animals", "human"]'
You cannot apply collect on a streaming dataframe. streamingDf here refers to the dataframe read from Kafka.
val query = streamingDf
  .select(col("Key").cast(StringType))
  .writeStream
  .format("console")
  .start()

query.awaitTermination()
It will print your data to the console. To write data to an external source, you have to provide an implementation of ForeachWriter; for reference, see the linked example, where data is streamed through Kafka, read by Spark, and eventually written to Cassandra.
Hope it helps.
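If you go the row-by-row route in PySpark instead, the foreach sink accepts an object with open/process/close methods; a minimal sketch, where df is the streaming dataframe from readStream and the ElasticSearch calls are hypothetical placeholders:

class ESWriter:
    def open(self, partition_id, epoch_id):
        # hypothetical: open a connection to ElasticSearch here
        return True
    def process(self, row):
        # hypothetical: send row["key"] to ElasticSearch
        pass
    def close(self, error):
        # hypothetical: close the connection, surface any error
        pass

query = df.select("key").writeStream.foreach(ESWriter()).start()
query.awaitTermination()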
For such a use case, I'd recommend using the foreachBatch operator:
foreachBatch(function: (Dataset[T], Long) ⇒ Unit): DataStreamWriter[T]
Sets the output of the streaming query to be processed using the provided function. This is supported only in micro-batch execution mode (that is, when the trigger is not continuous).
In every micro-batch, the provided function will be called with (i) the output rows as a Dataset and (ii) the batch identifier.
The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
Quoting the official documentation (with a few modifications): the foreachBatch operation allows you to apply arbitrary operations and custom writing logic on the output of each micro-batch of a streaming query.
And in the same official documentation you can find sample code showing that you could handle your use case fairly easily:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.select("key").distinct().collect()
}
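If you are working in PySpark, the same idea might look like the following sketch; json.dumps builds the string the question asks for, df is the streaming dataframe from readStream, and query_elasticsearch is a hypothetical placeholder:

import json

def process_batch(batch_df, batch_id):
    # Distinct keys of this micro-batch, e.g. '["fruits", "animals", "human"]'.
    keys = [row["key"] for row in batch_df.select("key").distinct().collect()]
    key_string = json.dumps(keys)
    # query_elasticsearch(key_string)  # hypothetical: use the string in your ES query

query = df.writeStream.foreachBatch(process_batch).start()
query.awaitTermination()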

Multiple Spark DataFrame mutations in a single pipe

Consider a Spark DataFrame df with the following schema:
root
|-- date: timestamp (nullable = true)
|-- customerID: string (nullable = true)
|-- orderID: string (nullable = true)
|-- productID: string (nullable = true)
One column should be cast to a different type; the other columns should just have their whitespace trimmed.
df.select(
    $"date",
    df("customerID").cast(IntegerType),
    $"orderID",
    $"productId")
  .withColumn("orderID", trim(col("orderID")))
  .withColumn("productID", trim(col("productID")))
The operations seem to require different syntax; casting is done via select, while trim is done via withColumn.
I'm used to R and dplyr where all the above would be handled in a single mutate function, so mixing select and withColumn feels a bit cumbersome.
Is there a cleaner way to do this in a single pipe?
You can use either one. The difference is that withColumn will add a new column to the dataframe (or replace an existing one if the same name is used), while select keeps only the columns you specify. Depending on the situation, choose the one to use.
The cast can be done using withColumn as follows:
df.withColumn("customerID", $"customerID".cast(IntegerType))
  .withColumn("orderID", trim($"orderID"))
  .withColumn("productID", trim($"productID"))
Note that you do not need to use withColumn on the date column above.
The trims and the cast can also be done in a single select; here the column names are kept the same:
df.select(
    $"date",
    $"customerID".cast(IntegerType),
    trim($"orderID").as("orderID"),
    trim($"productID").as("productID"))

UDF to Concatenate Arrays of Undefined Case Class Buried in a Row Object

I have a dataframe, called sessions, with columns that may change over time. (Edit to Clarify: I do not have a case class for the columns - only a reflected schema.) I will consistently have a uuid and clientId in the outer scope with some other inner and outer scope columns that might constitute a tracking event so ... something like:
root
|-- runtimestamp: long (nullable = true)
|-- clientId: long (nullable = true)
|-- uuid: string (nullable = true)
|-- oldTrackingEvents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timestamp: long (nullable = true)
| | |-- actionid: integer (nullable = true)
| | |-- actiontype: string (nullable = true)
| | |-- <tbd ... maps, arrays and other stuff matches sibling> section
...
|-- newTrackingEvents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timestamp: long (nullable = true)
| | |-- actionid: integer (nullable = true)
| | |-- actiontype: string (nullable = true)
| | |-- <tbd ... maps, arrays and other stuff matches sibling>
...
I'd now like to merge oldTrackingEvents and newTrackingEvents with a UDF that takes these parameters and has yet-to-be-resolved code logic:
val mergeTEs = udf((oldTEs: Seq[Row], newTEs: Seq[Row]) =>
  // do some stuff - figure out the best way
  // - to merge both groups of tracking events
  // - remove duplicate tracking event structures
  // - limit total tracking events < 500
  result // same type as UDF input params
)
The UDF's return value would be an array of the same structure: the resulting list from concatenating the two fields.
QUESTION:
My question is how to construct such a UDF: (1) using the correct passed-in parameter types, (2) a way to manipulate these collections within the UDF, and (3) a clear way to return a value that doesn't cause a compiler error. I unsuccessfully tested Seq[Row] for the input/output (with val testUDF = udf((trackingEvents: Seq[Row]) => trackingEvents)) and received the error java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported for a direct return of trackingEvents. However, I get no error for returning Some(1) instead of trackingEvents ... What is the best way to manipulate the collections so that I can concatenate two lists of identical structures, as suggested by the schema above, with the UDF performing the steps sketched in the comments? The goal is to use this operation:
sessions.select(mergeTEs('oldTrackingEvents, 'newTrackingEvents).as("cleanTrackingEvents"))
And, in each row, ... get back a single array of the 'trackingEvents' structure in a memory- and speed-efficient manner.
SUPPLEMENTAL:
Looking at a question shown to me ... there's a possible hint, if it is relevant ... Defining a UDF that accepts an Array of objects in a Spark DataFrame? ... "To create a struct, the function passed to udf has to return a Product type (Tuple* or a case class), not Row."
Perhaps ... this other post is relevant / useful.
I think that the question you've linked explains it all, so just to reiterate. When working with a udf:
The input representation for a StructType is a weakly typed Row object.
The output type for a StructType has to be a Scala Product. You cannot return a Row object.
If this is too much of a burden, you should use a strongly typed Dataset:
val f: T => U
sessions.as[T].map(f): Dataset[U]
where T is an algebraic data type representing Session schema, and U is algebraic data type representing the result.
Alternatively ... if your goal is to merge sequences of some arbitrary row structure / schema with some manipulation, this is an alternative, generally stated approach that avoids the partitioning talk:
From the master dataframe, create dataframes for each trackingEvents section, new and old. With each, select the exploded 'trackingEvents' section's columns. Save these val dataframe declarations as newTE and oldTE.
Create another dataframe where the columns picked uniquely identify each tracking event in the arrays of oldTrackingEvents and newTrackingEvents, such as each event's uuid, clientId and timestamp. Your pseudo-schema would be:
(uuid: String, clientId: Long, newTE: Seq[Long], oldTE: Seq[Long])
Use a UDF to join the two simple sequences of your structure, both Seq[Long]; something like this untested example:
val limitEventsUDF = udf { (newTE: Seq[Long], oldTE: Seq[Long], limit: Int, tooOld: Long) =>
  (newTE ++ oldTE).filter(_ > tooOld).sortWith(_ > _).distinct.take(limit)
}
The UDF will return a dataframe of cleaned tracking events, and you now have a very slim dataframe with the removed events to self-join back to your exploded newTE and oldTE frames after they are unioned back to each other.
GroupBy as needed thereafter using collect_list.
Still ... this seems like a lot of work. Should this be voted as "the answer"? I'm not sure.
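For reference, the merge / dedupe / limit logic inside limitEventsUDF above, translated to a PySpark UDF; this is only a sketch, in which the limit and cutoff are baked in rather than passed as columns, and slim_df stands for the slim frame from step 2:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType

LIMIT = 500
TOO_OLD = 0  # hypothetical cutoff timestamp

@F.udf(returnType=ArrayType(LongType()))
def limit_events(new_te, old_te):
    # Merge, drop duplicates and too-old entries, keep newest first, cap the size.
    merged = set((new_te or []) + (old_te or []))
    kept = sorted((t for t in merged if t > TOO_OLD), reverse=True)
    return kept[:LIMIT]

cleaned = slim_df.select("uuid", "clientId", limit_events("newTE", "oldTE").alias("cleanTE"))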

PySpark DataFrames: filter where some value is in array column

I have a DataFrame in PySpark that has a nested array value for one of its fields. I would like to filter the DataFrame where the array contains a certain string. I'm not seeing how I can do that.
The schema looks like this:
root
|-- name: string (nullable = true)
|-- lastName: array (nullable = true)
| |-- element: string (containsNull = false)
I want to return all the rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', and the equality there should be case insensitive (like I did for the name). I found the isin() function on a column value, but that seems to work backwards from what I want. It seems like I need a contains() function on a column value. Does anyone have ideas for a straightforward way to do this?
You could consider working on the underlying RDD directly.
def my_filter(row):
    if row.name.upper() == 'JOHN':
        for it in row.lastName:
            if it.upper() == 'SMITH':
                yield row
                break  # emit each matching row only once

dataframe = dataframe.rdd.flatMap(my_filter).toDF()
An update in 2019: Spark 2.4.0 introduced higher-order functions such as transform (see the official documentation), so together with array_contains this can now be done in the SQL expression language.
For your problem, it should be
dataframe.filter('array_contains(transform(lastName, x -> upper(x)), "SMITH")')
It is better than the previous solution using RDD as a bridge, because DataFrame operations are much faster than RDD ones.
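The same thing expressed with the DataFrame API instead of a SQL string; a sketch covering both conditions from the question (F.expr wraps the higher-order transform so this also works on Spark 2.4):

from pyspark.sql import functions as F

result = dataframe.filter(
    (F.upper(F.col("name")) == "JOHN") &
    F.expr('array_contains(transform(lastName, x -> upper(x)), "SMITH")')
)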