PySpark Hudi writing timestamps as binary - pyspark

I am trying to write a PySpark DataFrame to S3 in Hudi parquet format. Everything is working fine; however, the timestamps are being written in binary format. I would like to write them as the Hive timestamp format so that I can query the data in Athena.
My PySpark config is as follows:
LOCAL_SPARK_CONF = (
    SparkConf()
    .set(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.2.2,org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1,org.apache.spark:spark-avro_2.12:3.0.2",
    )
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.sql.hive.convertMetastoreParquet", "false")
)
Hudi options as follows:
hudi_options = {
    "hoodie.table.name": hudi_table,
    "hoodie.datasource.write.recordkey.field": "hash",
    "hoodie.datasource.write.partitionpath.field": "version, date",
    "hoodie.datasource.write.table.name": hudi_table,
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS",
    "hoodie.index.type": "GLOBAL_BLOOM",  # This is required if we want to ensure we upsert a record, even if the partition changes
    "hoodie.bloom.index.update.partition.path": "true",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "data_timestamp",
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
}
From reading the documentation, "hoodie.datasource.hive_sync.support_timestamp": "true" should maintain Hive timestamps and "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS" should maintain the output format. However, when I subsequently check the data it's a binary timestamp. How can I avoid this?
I write the data as follows:
df.write.format("org.apache.hudi").options(**hudi_options).mode("overwrite").save(basePath)
I have tried converting the data to a string and writing it as a string type, but it still converts to binary.
df = df.withColumn("data_timestamp", col("data_timestamp").cast(StringType()))
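For reference, a minimal sketch of the direction I understand the docs to suggest: keep the column as a proper Spark TimestampType (rather than casting it to string) and enable Hive/Glue sync, since support_timestamp only takes effect when the table is synced to the metastore that Athena reads. The hive_sync options and database name below are assumptions/placeholders, not part of my current setup.
from pyspark.sql.functions import to_timestamp

# Keep the column as a TimestampType instead of casting it to string.
df = df.withColumn("data_timestamp", to_timestamp("data_timestamp"))

# Assumed extra options: sync the table to the (Glue) Hive metastore so that
# hoodie.datasource.hive_sync.support_timestamp can apply.
hudi_options.update({
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "my_database",  # placeholder
    "hoodie.datasource.hive_sync.table": hudi_table,
    "hoodie.datasource.hive_sync.partition_fields": "version,date",
    "hoodie.datasource.hive_sync.mode": "hms",
})

df.write.format("org.apache.hudi").options(**hudi_options).mode("overwrite").save(basePath)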

Related

How to read data from MongoDB to Spark with a specific query

I am getting data from MongoDB using the query:
db.objects.find({ _key: { $in: ["user:130"] } }, { _id: 0, uid: 1, username: 1 }).pretty();
Now I need to get the same data in Spark.
val readConf = ReadConfig(Map("uri" -> host, "database" -> "nodebb", "collection" -> "objects"))
val data = spark.read.mongo(readConf)
This is giving the complete data from MongoDB.
How can I apply that query too?
Thanks
If, for example, you just want to filter some records, you can use .filter on your df (see the sketch after this answer).
If you want to use SQL queries on the data loaded from Mongo, you can create a temp view from your df and then query it with spark.sql:
df.createOrReplaceTempView("temp")
some_fruit = spark.sql("SELECT type, qty FROM temp WHERE type LIKE '%e%'")
some_fruit.show()
More details in the documentation: MongoDB Spark connector documentation
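For the specific query in the question, a minimal sketch of the .filter approach (assuming the objects collection has already been loaded into df and the column names match the Mongo fields):
from pyspark.sql.functions import col

# Rough equivalent of db.objects.find({_key: {$in: ["user:130"]}}, {uid: 1, username: 1})
some_users = df.filter(col("_key").isin("user:130")).select("uid", "username")
some_users.show()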

Kafka Connect Sink Partition by recordField which is in Ticks

I have a Kafka Connect sink. Within my topic, I have a field which is expressed in ticks and not a proper timestamp. I would ultimately want to use that as the partitioning field in the destination (in this case an Azure Data Lake Gen 2).
I have tried using TimeBasedPartitioner along with timestamp.extractor and TimestampConverter, but it's just erroring out on the format. From what I see, all these timestamp converters expect a "timestamp" field whereas mine is in ticks, so I have to do additional transformations before I can use TimestampConverter, but I am not sure how, as the SMTs I have looked into do not provide any such thing.
The error I get:
java.lang.IllegalArgumentException: Invalid format: "20204642-05-16 21:34:40.000+0000" is malformed at " 21:34:40.000+0000"
This is how my sink configuration looks:
{
  "name": "azuredatalakegen2-sink-featuretracking-dl",
  "config": {
    "connector.class": "io.confluent.connect.azure.datalake.gen2.AzureDataLakeGen2SinkConnector",
    "topics": "sometopic",
    "topics.dir": "filesystem/folder",
    "flush.size": "1",
    "file.delim": "-",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "locale": "UTC",
    "timezone": "UTC",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "300000",
    "timestamp.extractor": "RecordField",
    "timestamp.field": "event_date_time_ticks",
    "format.class": "io.confluent.connect.azure.storage.format.parquet.ParquetFormat",
    "transforms": "flatten,TimestampConverter",
    "transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
    "transforms.flatten.delimiter": "_",
    "transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
    "transforms.TimestampConverter.format": "yyyy-MM-dd HH:mm:ss.SSSZ",
    "transforms.TimestampConverter.field": "event_date_time_ticks",
    "transforms.TimestampConverter.target.type": "string",
    "max.retries": "288",
    "retry.backoff.ms": "300000",
    "errors.retry.timeout": "3600000",
    "errors.retry.delay.max.ms": "60000",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.protobuf.ProtobufConverter",
    .......other conn related configs.......
  }
}
Here are the SMTs I have seen: SMTs in Confluent Platform
How can I partition the data in the destination using the field event_date_time_ticks, which is in ticks? E.g. 637535500015510000 means 2021-04-09T07:26:41.551Z.
Tick conversion to datetime: Tick to Datetime
Even if I try FieldPartitioner, how can I convert that tick value into a datetime format in the sink configuration above? Or do I have to write something custom?
TimestampConverter expects Unix epoch time, not ticks.
You'll need to convert it, which would have to be a custom transform or a modification in your producer (which shouldn't be a major problem, because most languages have datetime epoch functions).
Convert ticks to unix timestamp
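For illustration, the arithmetic such a conversion has to do: .NET-style ticks are 100-nanosecond intervals since 0001-01-01T00:00:00Z, and the Unix epoch corresponds to 621355968000000000 ticks, so a rough sketch (in Python, purely to show the formula) is:
# .NET ticks: 100-ns intervals since 0001-01-01T00:00:00Z.
# The Unix epoch (1970-01-01T00:00:00Z) is at 621355968000000000 ticks.
TICKS_AT_UNIX_EPOCH = 621355968000000000
TICKS_PER_MILLISECOND = 10000

def ticks_to_epoch_millis(ticks):
    return (ticks - TICKS_AT_UNIX_EPOCH) // TICKS_PER_MILLISECOND

# 637535500015510000 ticks -> 1617953201551 ms -> 2021-04-09T07:26:41.551Z
print(ticks_to_epoch_millis(637535500015510000))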

Batch writes to Cosmos DB from Databricks

Can someone let me know what the asterisks ** achieve when writing to Cosmos DB from Databricks.
# Write configuration
writeConfig = {
    "Endpoint": "https://doctorwho.documents.azure.com:443/",
    "Masterkey": "YOUR-KEY-HERE",
    "Database": "DepartureDelays",
    "Collection": "flights_fromsea",
    "Upsert": "true"
}
# Write to Cosmos DB from the flights DataFrame
flights.write.format("com.microsoft.azure.cosmosdb.spark").options(
    **writeConfig).save()
Thanks
This simply allows you to pass multiple keyword arguments at once by unpacking a dictionary (in your case); a single * would similarly unpack a list or tuple of positional arguments.
So rather than saying:
flights.write.format("com.microsoft.azure.cosmosdb.spark")\
.option("Endpoint", "https://doctorwho.documents.azure.com:443/")\
.option("Upsert", "true")\
.option("Masterkey", "YOUR-KEY-HERE")\
...etc
You simply have all your arguments in a dictionary and then pass it like the following:
flights.write.format("com.microsoft.azure.cosmosdb.spark").options(
**yourdict).save()
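As a plain-Python illustration of what the ** does, independent of the Spark API:
def show_options(**opts):
    # ** in the signature collects keyword arguments into a dict;
    # ** at the call site unpacks a dict into keyword arguments.
    for key, value in opts.items():
        print(key, "=", value)

write_config = {"Database": "DepartureDelays", "Upsert": "true"}

# These two calls are equivalent:
show_options(**write_config)
show_options(Database="DepartureDelays", Upsert="true")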

Apache Spark export PostgreSQL data in Parquet format

I have PostgreSQL database with ~1000 different tables. I'd like to export all of these tables and data inside them into Parquet files.
In order to do it, I'm going to read each table into a DataFrame and then store this df as a Parquet file. Many of the PostgreSQL tables contain user-defined types.
The biggest issue is that I can't manually specify the schema for each DataFrame. Will Apache Spark, in this case, be able to automatically infer the PostgreSQL table schemas and store them appropriately in Parquet format, or is this impossible with Apache Spark, so that some other technology must be used for this purpose?
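For reference, a rough PySpark-style sketch of the per-table export loop described above; the table list, output path and credentials are placeholders (in practice the list could be read from information_schema.tables):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-to-parquet").getOrCreate()

tables = ["moving_boxes", "users"]  # placeholder; query information_schema.tables in practice

for table in tables:
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql:sparktest")
          .option("dbtable", table)
          .option("user", "user")
          .option("password", "password")
          .load())
    df.write.mode("overwrite").parquet("/tmp/export/" + table)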
UPDATED
I have created the following PostgreSQL user-defined type, table and records:
create type dimensions as (
  width integer,
  height integer,
  depth integer
);
create table moving_boxes (
  id serial primary key,
  dims dimensions not null
);
insert into moving_boxes (dims) values (row(3,4,5)::dimensions);
insert into moving_boxes (dims) values (row(1,4,2)::dimensions);
insert into moving_boxes (dims) values (row(10,12,777)::dimensions);
Implemented the following Spark application:
// that gives a one-partition Dataset
val opts = Map(
  "url" -> "jdbc:postgresql:sparktest",
  "dbtable" -> "moving_boxes",
  "user" -> "user",
  "password" -> "password")
val df = spark.
  read.
  format("jdbc").
  options(opts).
  load
println(df.printSchema())
df.write.mode(SaveMode.Overwrite).format("parquet").save("moving_boxes.parquet")
This is df.printSchema output:
root
 |-- id: integer (nullable = true)
 |-- dims: string (nullable = true)
As you may see, Spark DataFrame infer dims schema as string and not as a complex nested type.
This is the log information from ParquetWriteSupport:
18/11/06 10:08:52 INFO ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "integer",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "dims",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  optional int32 id;
  optional binary dims (UTF8);
}
Could you please explain: will the original complex dims type (defined in PostgreSQL) be lost in the saved Parquet file or not?

Using MongoDB Spark Connector to filter based on timestamp

I am using the Spark MongoDB connector to fetch data from MongoDB. However, I am not able to figure out how I can query Mongo from Spark using an aggregation pipeline (rdd.withPipeline). Following is my code, where I want to fetch records based on a timestamp and store the result in a DataFrame:
val appData=MongoSpark.load(spark.sparkContext,readConfig)
val df=appData.withPipeline(Seq(Document.parse("{ $match: { createdAt : { $gt : 2017-01-01 00:00:00 } } }"))).toDF()
Is this a correct way to query MongoDB from Spark for a timestamp value?
As the comment mentioned, you could utilise the Extended JSON format for the date filter.
val appDataRDD = MongoSpark.load(sc)
val filteredRDD = appDataRDD.withPipeline(Seq(Document.parse("{$match:{timestamp:{$gt:{$date:'2017-01-01T00:00:00.000'}}}}")))
filteredRDD.foreach(println)
See also MongoDB Spark Connector: Filters and Aggregation for an alternative filter.
Try this:
val pipeline = "{'$match': {'CreationDate':{$gt: {$date:'2020-08-26T00:00:00.000Z'}}}}"
val sourceDF = spark.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://administrator:password@10.XXXXX:27017/?authSource=admin")
  .option("database", "_poc")
  .option("collection", "activity")
  .option("pipeline", pipeline)
  .load()
Try this (but it has a limitation: Mongo date and ISODate can only take a TZ-format timestamp):
option("pipeline", s"""[{ $$match: { "updatedAt" : { $$gte : new ISODate("2022-11-29T15:26:21.556Z") } } }]""").mongo[DeltaComments]