Longest Run without being in the UK - scala

I have the following Spark DataFrame:
val inputDf = List(
  ("1", "1", "UK", "Spain", "2022-01-01"),
  ("1", "2", "Spain", "Germany", "2022-01-02"),
  ("1", "3", "Germany", "China", "2022-01-03"),
  ("1", "4", "China", "France", "2022-01-04"),
  ("1", "5", "France", "Spain", "2022-01-05"),
  ("1", "6", "Spain", "Italy", "2022-01-09"),
  ("1", "7", "Italy", "UK", "2022-01-14"),
  ("1", "8", "UK", "USA", "2022-01-15"),
  ("1", "9", "USA", "Canada", "2022-01-16"),
  ("1", "10", "Canada", "UK", "2022-01-17"),
  ("2", "16", "USA", "Finland", "2022-01-11"),
  ("2", "17", "Finland", "Russia", "2022-01-12"),
  ("2", "18", "Russia", "Turkey", "2022-01-13"),
  ("2", "19", "Turkey", "Japan", "2022-01-14"),
  ("2", "20", "Japan", "UK", "2022-01-15")
).toDF("passengerId", "flightId", "from", "to", "date")
I would like to get the longest run for each passenger without being in the UK.
So, for example, in the case of passenger 1 the itinerary was UK>Spain>Germany>China>France>Spain>Italy>UK>USA>Canada>UK>Finland>Russia>Turkey>Japan>Spain>Germany>China>France>Spain>Italy>UK>USA>Canada>UK, therefore the longest run would be 10.
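To make the counting rule concrete, here is a plain-Scala (non-Spark) sketch of what "longest run" means here; it simply resets a counter every time UK appears in that itinerary and is only an illustration, not part of the Spark solution:

// Plain Scala illustration of the counting rule, using the itinerary above
val itinerary =
  "UK>Spain>Germany>China>France>Spain>Italy>UK>USA>Canada>UK>Finland>Russia>Turkey>Japan>Spain>Germany>China>France>Spain>Italy>UK>USA>Canada>UK"
val longestRun = itinerary
  .split(">")
  .foldLeft(List(0)) { (runs, country) =>
    if (country == "UK") 0 :: runs else (runs.head + 1) :: runs.tail
  }
  .max
// longestRun == 10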
I first merge the from and to columns using the following code:
val passengerWithCountries = inputDf.groupBy("passengerId")
  .agg(
    // concat concatenates the two lists of strings collected from columns "from" and "to"
    concat(
      // collect_list gathers all values of the given column into an array
      collect_list(col("from")),
      collect_list(col("to"))
    ).name("countries")
  )
Output:
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|passengerId|countries |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[UK, Spain, Germany, China, France, Spain, Italy, UK, USA, Canada, UK, Finland, Russia, Turkey, Japan, Spain, Germany, China, France, Spain, Italy, UK, USA, Canada, UK, Finland, Russia, Turkey, Japan, UK]|
|2 |[USA, Finland, Russia, Turkey, Japan, Finland, Russia, Turkey, Japan, UK] |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The solution I have tried is the following. However, since the values of my countries column are Array[String] and not String, it does not work.
passengerWithCountries
  .withColumn("countries_new", explode(split(Symbol("countries"), "UK,")))
  .withColumn("journey_outside_UK", size(split(Symbol("countries"), ",")))
  .groupBy("passengerId")
  .agg(max(Symbol("journey_outside_UK")) as "longest_run")
  .show()
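As a side note, the snippet above fails because split and explode expect a string column rather than an Array[String]. A minimal, hedged sketch of the type fix (assuming Spark 2.4+ for array_join) is shown below; it only makes the splitting work, and does not yet deal with the duplicated countries produced by the concat above or with the empty segments that appear around UK:

import org.apache.spark.sql.functions._

// Sketch only: collapse the array into a single string so split/explode can be applied
passengerWithCountries
  .withColumn("countries_str", array_join(col("countries"), ","))
  .withColumn("segment", explode(split(col("countries_str"), "UK")))
  .withColumn("journey_outside_UK", size(split(trim(col("segment"), ","), ",")))
  .show(false)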
I am looking to have the following output:
+-----------+-----------+
|passengerId|longest_run|
+-----------+-----------+
|1 |10 |
|2 |5 |
+-----------+-----------+
Please let me know if you have a solution.

// Added some edge cases:
// passengerId=3: just one itinerary from UK to non-UK, longest run must be 1
// passengerId=4: just one itinerary from non-UK to UK, longest run must be 1
// passengerId=5: just one itinerary from UK to UK, longest run must be 0
// passengerId=6: one itinerary from UK to UK, followed by UK to non-UK, longest run must be 1
val inputDf = List(
  ("1", "1", "UK", "Spain", "2022-01-01"),
  ("1", "2", "Spain", "Germany", "2022-01-02"),
  ("1", "3", "Germany", "China", "2022-01-03"),
  ("1", "4", "China", "France", "2022-01-04"),
  ("1", "5", "France", "Spain", "2022-01-05"),
  ("1", "6", "Spain", "Italy", "2022-01-09"),
  ("1", "7", "Italy", "UK", "2022-01-14"),
  ("1", "8", "UK", "USA", "2022-01-15"),
  ("1", "9", "USA", "Canada", "2022-01-16"),
  ("1", "10", "Canada", "UK", "2022-01-17"),
  ("2", "16", "USA", "Finland", "2022-01-11"),
  ("2", "17", "Finland", "Russia", "2022-01-12"),
  ("2", "18", "Russia", "Turkey", "2022-01-13"),
  ("2", "19", "Turkey", "Japan", "2022-01-14"),
  ("2", "20", "Japan", "UK", "2022-01-15"),
  ("3", "21", "UK", "Spain", "2022-01-01"),
  ("4", "22", "Spain", "UK", "2022-01-01"),
  ("5", "23", "UK", "UK", "2022-01-01"),
  ("6", "24", "UK", "UK", "2022-01-01"),
  ("6", "25", "UK", "Spain", "2022-01-02"),
  ("7", "25", "Spain", "Germany", "2022-01-02")
).toDF("passengerId", "flightId", "from", "to", "date")
import org.apache.spark.sql.expressions.Window
// Declare window for analytic functions
val w = Window.partitionBy("passengerId").orderBy("date")
// Use an analytic function (a running count of departures from the UK) to partition rows into UK-...-UK itineraries
val ukArrivals = inputDf.withColumn("newUK", sum(expr("case when from = 'UK' then 1 else 0 end")).over(w))
+-----------+--------+-------+-------+----------+-----+
|passengerId|flightId| from| to| date|newUK|
+-----------+--------+-------+-------+----------+-----+
| 1| 1| UK| Spain|2022-01-01| 1|
| 1| 2| Spain|Germany|2022-01-02| 1|
| 1| 3|Germany| China|2022-01-03| 1|
| 1| 4| China| France|2022-01-04| 1|
| 1| 5| France| Spain|2022-01-05| 1|
| 1| 6| Spain| Italy|2022-01-09| 1|
| 1| 7| Italy| UK|2022-01-14| 1|
| 1| 8| UK| USA|2022-01-15| 2|
| 1| 9| USA| Canada|2022-01-16| 2|
| 1| 10| Canada| UK|2022-01-17| 2|
| 2| 16| USA|Finland|2022-01-11| 0|
| 2| 17|Finland| Russia|2022-01-12| 0|
| 2| 18| Russia| Turkey|2022-01-13| 0|
| 2| 19| Turkey| Japan|2022-01-14| 0|
| 2| 20| Japan| UK|2022-01-15| 0|
| 3| 21| UK| Spain|2022-01-01| 1|
| 4| 22| Spain| UK|2022-01-01| 0|
| 5| 23| UK| UK|2022-01-01| 1|
| 6| 24| UK| UK|2022-01-01| 1|
| 6| 25| UK| Spain|2022-01-02| 2|
+-----------+--------+-------+-------+----------+-----+
// Calculate longest runs outside the UK
val runs = (
  ukArrivals
    .groupBy("passengerId", "newUK") // for each UK-...-UK itinerary
    .agg((
      sum(
        expr("""
          case
            when 'UK' not in (from, to) then 1 -- count all non-UK countries, except for the first one
            when from = to then -1             -- special case for UK-UK itineraries
            else 0                             -- don't count itineraries from/to UK
          end""")
      ) + 1 // count the first non-UK country
    ).as("notUK"))
    .groupBy("passengerId")
    .agg(max("notUK").as("longest_run_outside_UK"))
)
runs.orderBy("passengerId").show
+-----------+----------------------+
|passengerId|longest_run_outside_UK|
+-----------+----------------------+
| 1| 6|
| 2| 5|
| 3| 1|
| 4| 1|
| 5| 0|
| 6| 1|
+-----------+----------------------+

Related

Sink Connector auto create tables with proper data type

I have a Debezium source connector for PostgreSQL with Avro as the value converter, and it uses the schema registry.
Source DDL:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+-----------------------------+-----------+----------+----------------------------------+----------+-------------+--------------+-------------
id | integer | | not null | nextval('tbl1_id_seq'::regclass) | plain | | |
name | character varying(100) | | | | extended | | |
col4 | numeric | | | | main | | |
col5 | bigint | | | | plain | | |
col6 | timestamp without time zone | | | | plain | | |
col7 | timestamp with time zone | | | | plain | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Publications:
"dbz_publication"
Access method: heap
In the schema registry:
{
"type": "record",
"name": "Value",
"namespace": "test.public.tbl1",
"fields": [
{
"name": "id",
"type": {
"type": "int",
"connect.parameters": {
"__debezium.source.column.type": "SERIAL",
"__debezium.source.column.length": "10",
"__debezium.source.column.scale": "0"
},
"connect.default": 0
},
"default": 0
},
{
"name": "name",
"type": [
"null",
{
"type": "string",
"connect.parameters": {
"__debezium.source.column.type": "VARCHAR",
"__debezium.source.column.length": "100",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
},
{
"name": "col4",
"type": [
"null",
{
"type": "double",
"connect.parameters": {
"__debezium.source.column.type": "NUMERIC",
"__debezium.source.column.length": "0"
}
}
],
"default": null
},
{
"name": "col5",
"type": [
"null",
{
"type": "long",
"connect.parameters": {
"__debezium.source.column.type": "INT8",
"__debezium.source.column.length": "19",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
},
{
"name": "col6",
"type": [
"null",
{
"type": "long",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMP",
"__debezium.source.column.length": "29",
"__debezium.source.column.scale": "6"
},
"connect.name": "io.debezium.time.MicroTimestamp"
}
],
"default": null
},
{
"name": "col7",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMPTZ",
"__debezium.source.column.length": "35",
"__debezium.source.column.scale": "6"
},
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
},
{
"name": "col8",
"type": [
"null",
{
"type": "boolean",
"connect.parameters": {
"__debezium.source.column.type": "BOOL",
"__debezium.source.column.length": "1",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
}
],
"connect.name": "test.public.tbl1.Value"
}
But in the target PostgreSQL database the data types are completely mismatched for the ID column and the timestamp columns, and sometimes for the decimal columns as well (that's due to this).
Target:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+------------------+-----------+----------+---------+----------+-------------+--------------+-------------
id | text | | not null | | extended | | |
name | text | | | | extended | | |
col4 | double precision | | | | plain | | |
col5 | bigint | | | | plain | | |
col6 | bigint | | | | plain | | |
col7 | text | | | | extended | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Access method: heap
I'm trying to understand why, even with the schema registry, it is not creating the target tables with the proper data types.
Sink config:
{
"name": "t1-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "test.public.tbl1",
"connection.url": "jdbc:postgresql://172.31.85.***:5432/test",
"connection.user": "postgres",
"connection.password": "***",
"dialect.name": "PostgreSqlDatabaseDialect",
"auto.create": "true",
"insert.mode": "upsert",
"delete.enabled": "true",
"pk.fields": "id",
"pk.mode": "record_key",
"table.name.format": "tbl1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"internal.key.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.key.converter.schemas.enable": "true",
"internal.value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.value.converter.schemas.enable": "true",
"value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.region": "us-east-1",
"key.converter.region": "us-east-1",
"key.converter.schemaAutoRegistrationEnabled": "true",
"value.converter.schemaAutoRegistrationEnabled": "true",
"key.converter.avroRecordType": "GENERIC_RECORD",
"value.converter.avroRecordType": "GENERIC_RECORD",
"key.converter.registry.name": "bhuvi-debezium",
"value.converter.registry.name": "bhuvi-debezium",
"value.converter.column.propagate.source.type": ".*",
"value.converter.datatype.propagate.source.type": ".*"
}
}

Filling missing value with mean by grouping multiple columns

Description:"
How can I fill the missing value in price column with mean, grouping data by condition and model columns in Pyspark? My python code would be like this :cars['price'] = np.ceil(cars['price'].fillna(cars.groupby(['condition', 'model' ])['price'].transform('mean')))
Error:
I try different codes in Pyspark but each time I get different errors. Like this, code:cars_new=cars.fillna((cars.groupBy("condition", "model").agg(mean("price"))['avg(price)']))
Error :
ValueError: value should be a float, int, long, string, bool or dict
DataFrame: (screenshot of the input data, not reproduced)
Not sure what your input data looks like, but let's say we have a dataframe that looks like this:
+---------+-----+-----+
|condition|model|price|
+---------+-----+-----+
|A |A |1 |
|A |B |2 |
|A |B |2 |
|A |A |1 |
|A |A |null |
|B |A |3 |
|B |A |null |
|B |B |4 |
+---------+-----+-----+
We want to fill the nulls with the average, but over condition and model.
For this we can define a Window, calculate the average, and then replace the nulls.
Example:
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("test").getOrCreate()

data = [
    {"condition": "A", "model": "A", "price": 1},
    {"condition": "A", "model": "B", "price": 2},
    {"condition": "A", "model": "B", "price": 2},
    {"condition": "A", "model": "A", "price": 1},
    {"condition": "A", "model": "A", "price": None},
    {"condition": "B", "model": "A", "price": 3},
    {"condition": "B", "model": "A", "price": None},
    {"condition": "B", "model": "B", "price": 4},
]

window = Window.partitionBy(["condition", "model"]).orderBy("condition")

df = spark.createDataFrame(data=data)
df = (
    df.withColumn("avg", F.avg("price").over(window))
    .withColumn(
        "price", F.when(F.col("price").isNull(), F.col("avg")).otherwise(F.col("price"))
    )
    .drop("avg")
)
df.show(truncate=False)
Which gives us:
+---------+-----+-----+
|condition|model|price|
+---------+-----+-----+
|A |A |1.0 |
|A |A |1.0 |
|A |A |1.0 |
|B |B |4.0 |
|B |A |3.0 |
|B |A |3.0 |
|A |B |2.0 |
|A |B |2.0 |
+---------+-----+-----+
It could also be done with window functions like this (note that avg ignores nulls, so the window average is computed over the non-null prices):
from pyspark.sql.window import Window
from pyspark.sql.functions import avg, col, when

w = Window().partitionBy('condition', 'model')
cars = cars.withColumn(
    'price',
    when(col('price').isNull(), avg(col('price')).over(w)).otherwise(col('price'))
)

Pyspark: How to create a table by crossing information in two columns?

I have a dataframe like this in Pyspark:
A 1 info_A1
A 2 info_A2
B 2 info_B2
B 3 info_B3
I would like to obtain this result:
info_A1 null
info_A2 info_B2
null info_B3
Is there any function in PySpark that does this automatically, or should I iterate over each row separately?
Try using groupBy and pivot:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    {"x": "A", "y": 1, "z": "info_A1"},
    {"x": "A", "y": 2, "z": "info_A2"},
    {"x": "B", "y": 2, "z": "info_B2"},
    {"x": "B", "y": 3, "z": "info_B3"},
]

df = spark.createDataFrame(data)
df = df.groupBy("y").pivot("x").agg(F.max("z")).orderBy("y").drop("y")
Result:
+-------+-------+
|A |B |
+-------+-------+
|info_A1|null |
|info_A2|info_B2|
|null |info_B3|
+-------+-------+

Dataframe columns do not keep order and columns with null values are excluded while writing to CosmosDB Collection

I tried to copy data into a cosmosDB collection from a dataframe in Spark.
The data is being written into cosmosDB, but with two issues:
The order of the columns in the dataframe is not maintained in cosmosDB.
Columns with null values are not written to cosmosDB; they are excluded entirely.
Below is the data available in the dataframe:
+-------+------+--------+---------+---------+-------+
| NUM_ID| TIME| SIG1| SIG2| SIG3| SIG4|
+-------+------+--------+---------+---------+-------+
|X00030 | 13000|35.79893| 139.9061| 48.32786| null|
|X00095 | 75000| null| null| null|5860505|
|X00074 | 43000| null| 8.75037| 98.9562|8014505|
Below is the code written in Spark to copy the dataframe into cosmosDB:
val finalSignals = spark.sql("""SELECT * FROM db.tableName""")

val toCosmosDF = finalSignals
  .withColumn("NUM_ID", trim(col("NUM_ID")))
  .withColumn("SIG1", round(col("SIG1"), 5))
  .select("NUM_ID", "TIME", "SIG1", "SIG2", "SIG3", "SIG4")

// Write the dataframe into CosmosDB
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.SaveMode
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._

val writeConfig = Config(Map(
  "Endpoint" -> "xxxxxxxx",
  "Masterkey" -> "xxxxxxxxxxx",
  "Database" -> "xxxxxxxxx",
  "Collection" -> "xxxxxxxxx",
  "preferredRegions" -> "xxxxxxxxx",
  "Upsert" -> "true"
))

toCosmosDF.write.mode(SaveMode.Append).cosmosDB(writeConfig)
Below is the data written into cosmosDB.
"SIG3": 48.32786,
"SIG2": 139.9061,
"TIME": 13000,
"NUM_ID": "X00030",
"id": "xxxxxxxxxxxx2a",
"SIG1": 35.79893,
"_rid": "xxxxxxxxxxxx",
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
{
"TIME": 75000,
"NUM_ID": "X00095",
"id": "xxxxxxxxxxxx2a",
"_rid": "xxxxxxxxxxxx",
"SIG4": 5860505,
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
{
"SIG3": 98.9562,
"SIG2": 8.75037,
"TIME": 43000,
"NUM_ID": "X00074",
"id": "xxxxxxxxxxxx2a",
"SIG4": 8014505,
"_rid": "xxxxxxxxxxxx",
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
The entries for columns that are null in the dataframe are missing from the cosmosDB documents.
The data written into cosmosDB does not keep the column order of the dataframe.
How can I resolve these two issues?
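A note on the two issues: JSON documents do not have a significant field order, so the column order generally cannot be enforced on the cosmosDB side. For the dropped null columns, one hedged workaround (assuming sentinel values are acceptable for your data) is to fill the nulls before writing, reusing toCosmosDF and writeConfig from above:

// Hedged sketch, not a confirmed connector setting: filling nulls keeps the SIG
// fields present in every written document, at the cost of storing 0.0 / 0 instead of null
val withPlaceholders = toCosmosDF.na.fill(Map(
  "SIG1" -> 0.0,
  "SIG2" -> 0.0,
  "SIG3" -> 0.0,
  "SIG4" -> 0
))

withPlaceholders.write.mode(SaveMode.Append).cosmosDB(writeConfig)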

Merge Spark dataframe rows based on key column in Scala

I have a streaming DataFrame with two columns: a key column represented as a String, and an objects column, which is an array containing one object element. I want to be able to merge records (rows) in the DataFrame that have the same key, so that the merged records form an array of objects.
Dataframe
----------------------------------------------------------------
|key | objects |
----------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}] |
|abc | [{"name": "image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
----------------------------------------------------------------
Merged Dataframe
----------------------------------------------------------------
|key | objects |
----------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}, {"name": "image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
----------------------------------------------------------------
One option is to convert this into a pair RDD and apply the reduceByKey function, but I'd prefer to do this with DataFrames if possible, since that would be more optimal. Is there any way to do this with DataFrames without compromising on performance?
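For reference, here is a hedged sketch of the RDD route mentioned above, written against a static (non-streaming) DataFrame with the same shape, since converting a streaming DataFrame to an RDD is generally not supported; it is shown only for comparison with the DataFrame approach below.

// Hypothetical, non-streaming illustration of the reduceByKey idea;
// staticDf is assumed to have columns key: String and objects: Array[String]
val mergedRdd = staticDf.rdd
  .map(row => (row.getAs[String]("key"), row.getAs[Seq[String]]("objects")))
  .reduceByKey(_ ++ _) // concatenate the object arrays per key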
Assuming column objects is an array of a single JSON string, here's how you can merge objects by key:
import org.apache.spark.sql.functions._
case class Obj(name: String, `type`: String, code: String)
val df = Seq(
("abc", Obj("file", "sample", "123")),
("abc", Obj("image", "sample", "456")),
("xyz", Obj("doc", "sample", "707"))
).
toDF("key", "object").
select($"key", array(to_json($"object")).as("objects"))
df.show(false)
// +---+-----------------------------------------------+
// |key|objects |
// +---+-----------------------------------------------+
// |abc|[{"name":"file","type":"sample","code":"123"}] |
// |abc|[{"name":"image","type":"sample","code":"456"}]|
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// +---+-----------------------------------------------+
df.groupBy($"key").agg(collect_list($"objects"(0)).as("objects")).
show(false)
// +---+---------------------------------------------------------------------------------------------+
// |key|objects |
// +---+---------------------------------------------------------------------------------------------+
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// |abc|[{"name":"file","type":"sample","code":"123"}, {"name":"image","type":"sample","code":"456"}]|
// +---+---------------------------------------------------------------------------------------------+