Pyspark inserting Time in Cassandra using datastax connector

I'm inserting data from Pyspark to Cassandra using:
com.datastax.spark:spark-cassandra-connector_2.11:2.4.0
Among the variables that I'm inserting there is also a time value, and the connector doesn't like it.
I tried to send '16:51:35.634652' and received the following error:
com.datastax.spark.connector.types.TypeConversionException: Cannot convert object 16:51:35.634652 of type class java.lang.String to java.lang.Long.
Basically the converter doesn't like the string and wants to convert it to java.lang.Long, while the column is a time in Cassandra and a string in Python.
I'm wondering how I can just write the value to Cassandra without converting anything to Long; converting a time to a Long doesn't make much sense to me.

Found it! I had a look at:
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkSupportedTypes.html
There I found that a CQL timestamp maps to Scala Long, java.util.Date, java.sql.Date, or org.joda.time.DateTime.
So I converted my variable:
import datetime
# parse the time string into a datetime object
date_time_str = '11:12:27.243860'
date_time_obj = datetime.datetime.strptime(date_time_str, '%H:%M:%S.%f')
I sent that to the connector and everything is working fine!

Related

Unsupported type:TIMESTAMP(6) WITH LOCAL TIME ZONE NOT NULL

I am reading data from a PostgreSQL database into Flink, and using Table API and SQL. Apparently my timestamp column (called inserttime) is of type TIMESTAMP(6) WITH LOCAL TIME ZONE NOT NULL. But when I do an SQL query like SELECT inserttime FROM books, Flink is complaining that it is an unsupported type, with the error Unsupported type:TIMESTAMP(6) WITH LOCAL TIME ZONE NOT NULL.
Is this because of the NOT NULL at the end? If so, how should I cast it such that it can be read by Flink?
I've tried to use a UDF to convert it to something readable by Flink, like below:
public static class TimestampFunction extends ScalarFunction {
    public java.sql.Timestamp eval(java.sql.Timestamp t) {
        // do some conversion
    }
}
Obviously, the eval function signature is wrong, but I'm wondering if this could be a method to get Flink to read my timestamp type? If so, what should the eval function signature be?
I've also tried doing CAST(inserttime as TIMESTAMP) but the same Unsupported type error occurs.
Any ideas?

Scala spark dataframe hive conversion

Hi, in a Hive query I am using the syntax below for displaying decimal values, e.g.:
CAST(column AS decimal(10,6))
How do I write the same cast for a DataFrame?
$"column".cast("decimal(10,6)")
Will that work?
It will. It's totally legit to cast it like that:
df.withColumn("new_column_name", $"old_column_name".cast("decimal(10,6)"))
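For completeness, here is a minimal self-contained sketch of that cast; the column names and sample values are made up purely for illustration:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("decimal-cast").master("local[*]").getOrCreate()
import spark.implicits._
// toy DataFrame with numeric text stored as strings
val df = Seq("12.345678", "0.1").toDF("old_column_name")
// cast the string column to decimal(10,6) into a new column
val casted = df.withColumn("new_column_name", $"old_column_name".cast("decimal(10,6)"))
casted.printSchema() // new_column_name: decimal(10,6)
casted.show()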

how can I save an isodate field using MongoSpark.save from mongodb spark connector v2.1?

Hi everyone, a little bit of help would be nice, and I thank you for it. I'm trying to save a document which contains a datetime field. Using the MongoDB Spark connector through the MongoSpark.save() method, this turns out to be a challenge:
if I set the field as a string, it's quite obvious that what will be saved is a string, not an ISODate (even if the string fulfills the ISO 8601 format);
if I build an expression like this: my:date:{$date:}, where xxxx is some epoch time in milliseconds, then I get a BulkWriteError which says that the '$' sign is not valid for storage.
I get the documents to be updated from a library which returns BsonDocument docs. Datetime fields are treated as BsonDateTime fields, so I need to make some conversions before saving/updating, because getting the corresponding JSON string from the BsonDocument generates the $date not-valid-for-storage issue.
To obtain the BsonDocument, I just call a method from a library built by another developer:
val bdoc = handlePermanentProduct(p_id, operationsByProduct)
Then I convert the org.bson.BsonDocument into an org.bson.Document using a method I wrote:
val doc: Document = convert(bdoc)
Then, the usual code for getting a DataFrame and saving/updating my documents:
val rdd = sc.parallelize(Seq(doc.toJson))
val df = sparkSession.read.json(rdd)
MongoSpark.save(df.write.option("collection", "products").option("replaceDocument", "false").mode(SaveMode.Append))
Thanks again, and in advance
I'm using Scala 2.11.8, Spark 2.11, and the MongoDB Spark connector v2.1.
Definitely, the way I was trying to save/update documents is not the right one. Reading the documentation, of course, I found out that there is a type-matching process when saving/updating with the MongoSpark.save(...) method: datetime fields can be created as java.sql.Timestamp, and the driver makes the proper conversions. It was really easy once I found that. So, it's solved.
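For illustration, here is a minimal sketch of that approach. The case class, the field names and the MongoDB URI are assumptions made up for the example; only the use of java.sql.Timestamp and the write options already shown above come from the question and answer:
import java.sql.Timestamp
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.{SaveMode, SparkSession}
// hypothetical case class, just for the sketch
case class ProductEvent(productId: String, insertedAt: Timestamp)
val spark = SparkSession.builder()
  .appName("mongo-timestamp-sketch")
  .config("spark.mongodb.output.uri", "mongodb://localhost/test.products") // assumed URI
  .getOrCreate()
import spark.implicits._
val df = Seq(ProductEvent("p-1", new Timestamp(System.currentTimeMillis()))).toDF()
// because insertedAt is a java.sql.Timestamp, the connector stores it as an ISODate
MongoSpark.save(df.write.option("collection", "products").mode(SaveMode.Append))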

spark scala data cleansing

When developing a Spark job, I always run into the same data-cleansing issues. Can you please give me some clues on how to implement this kind of data cleansing?
The input can be CSV/Kafka/text containing string fields, integer fields and timestamp fields. I would like to remove all the lines that don't comply with the data model. Sometimes I get an IP address instead of an integer, sometimes the timestamp can't be cast because it is in the wrong format...
Additionally, I would like not to kill the performance of the job by manipulating lots of Java objects and complex Scala structures, and I would like to be able to plug in business rules.
Imagine a dataset with this model
case class Flow(a: String, b: Int, c: Timestamp)
A very simple version would be:
val file = sc.textFile(....).map(_.split(",")).filter { l =>
  l.length > x &&
  (l(1) forall Character.isDigit)
  // l(2) ???? how to match the date format without a costly regexp ????
}
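Just to make the third check concrete, here is a purely illustrative sketch: attempt a real parse and keep the line only when it succeeds (the timestamp pattern below is an assumption, not from my actual data):
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try
// true when the string parses with the assumed pattern, without regexps and without letting exceptions escape
def isValidTimestamp(s: String): Boolean =
  Try(LocalDateTime.parse(s, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))).isSuccess
// the third condition in the filter could then become: isValidTimestamp(l(2))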
Do you have any other suggestions? Maybe using more complex solutions like Scalaz with monads, or the Play framework?
What would be the difference regarding performance?
I have also looked at the Spark CSV parser included in 2.x, but it's just a brutal try/catch/logging on type casting...

Sqoop2 import job from Oracle to HDFS(Text format) inserting string 'null' into HDFS

I am trying to import data from an Oracle DB to HDFS using Sqoop2. I realized that Sqoop2 is inserting the string 'null' rather than a non-string NULL value. Is there a way to keep this from happening?
I know there is an option for this in Sqoop1, but I'm looking for an option in Sqoop2.
Thanks in advance
Sqoop 1.99.4 was released recently with a change that made this more usable. Sqoop2 now uses NULL in its output.