Upgrading to AWS Glue 3.0 RDD issue - pyspark

When trying to transform a DataFrame to an RDD with
delta_rdd = delta_df.rdd
I get the following error message:
An error occurred while calling o401.javaToPython. You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is
Although I'm not reading from Parquet files and I don't have a timestamp type (I have a string type instead).
Note: I added the necessary config from https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html#migrating-version-30-from-20.
I can read from Parquet files and do transformations, but the job fails as soon as I call .rdd.
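For reference, the rebase settings named in the error can also be set from inside the job script itself. A minimal sketch, assuming a Glue 3.0 PySpark job with a placeholder S3 path; use 'LEGACY' instead of 'CORRECTED' for data written by Spark 2.x or legacy Hive:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Tell Spark 3 how to rebase old dates/timestamps when reading Parquet
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")

delta_df = spark.read.parquet("s3://my-bucket/my-prefix/")   # placeholder path
delta_rdd = delta_df.rdd   # the call that previously raised o401.javaToPython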

Related

Continuously Updating Partitioned Parquet

I have a Spark script that pulls data from a database and writes it to S3 in Parquet format. The Parquet data is partitioned by date. Because of the size of the table, I'd like to run the script daily and have it just rewrite the most recent few days of data (the overlap is there because data may still change for a couple of days).
I'm wondering how I can go about writing the data to s3 in a way that only overwrites the partitions of the days I'm working with. SaveMode.Overwrite unfortunately wipes everything before it, and the other save modes don't seem to be what I'm looking for.
Snippet of my current write:
table
.filter(row => row.ts.after(twoDaysAgo)) // update most recent 2 days
.withColumn("date", to_date(col("ts"))) // add a column with just date
.write
.mode(SaveMode.Overwrite)
.partitionBy("date") // use the new date column to partition the parquet output
.parquet("s3a://some-bucket/stuff") // pick a parent directory to hold the parquets
Any advice would be much appreciated, thanks!
The answer I was looking for was Dynamic Overwrite, detailed in this article. Short answer: adding this line fixed my problem:
sparkConf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
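A rough PySpark equivalent of the write above with dynamic overwrite enabled (the original snippet is Scala; table and two_days_ago stand for the DataFrame and cutoff from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = (SparkSession.builder
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

(table                                        # the DataFrame pulled from the database
 .filter(col("ts") >= two_days_ago)           # keep the most recent two days
 .withColumn("date", to_date(col("ts")))      # derive the partition column
 .write
 .mode("overwrite")                           # with dynamic mode, only the incoming date partitions are replaced
 .partitionBy("date")
 .parquet("s3a://some-bucket/stuff"))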

Spark Elasticsearch connector fails to parse dates as dates; they are ingested as long

I am pushing HDFS Parquet data to Elasticsearch using the ES Spark connector.
I have two columns containing dates that I am unable to get parsed as dates; they keep being ingested as long. They have the following formats:
basic_ordinal_date: YYYYddd
epoch_millis: in milliseconds; e.g. 1555498747861
I tried the following
Defining ingest pipelines
Defining mappings
Defining dynamic mappings
Defining index_templates with mapping
Converting my date columns to string so Elastic does some pattern matching
Depending on the method, either I get errors and my Spark job fails, or the documents are pushed but the columns in question remain typed as long.
How should I proceed to have my basic_ordinal_date and epoch_millis columns parsed as dates?
Thank you in advance for your help
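A variation on the last idea in the list above is to convert the columns to real Spark timestamp/date types before writing, so the connector sends values Elasticsearch can map as date rather than long. A minimal sketch, with placeholder paths, column names, and index name (day-of-year pattern support can vary by Spark version):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs:///data/events/")   # placeholder path

df2 = (df
       # epoch milliseconds -> timestamp (casting a number of seconds to timestamp)
       .withColumn("event_ts", (col("epoch_millis") / 1000).cast("timestamp"))
       # ordinal date YYYYddd -> date, using the day-of-year pattern
       .withColumn("event_date", to_date(col("basic_ordinal_date").cast("string"), "yyyyDDD")))

(df2.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")     # placeholder cluster address
    .option("es.resource", "events")          # placeholder index
    .mode("append")
    .save())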

What changes do I have to make to migrate an application from Spark 1.5 to Spark 2.1?

I have to migrate an application written in Scala 2.10.4 on Spark 1.6 to Spark 2.1.
The application processes text files of around 7 GB and contains several RDD transformations.
I was told to try recompiling it with Scala 2.11, which should be enough to make it work with Spark 2.1. This sounds strange to me, as I know there are some relevant changes in Spark 2, like:
Introduction of the SparkSession object
Merging of the Dataset and DataFrame APIs
I managed to recompile the application with Spark 2 and Scala 2.11, with only minor changes due to Kryo serializer registration.
I still have some runtime errors that I am trying to solve, and I am trying to figure out what will come next.
My question is about which changes are "necessary" to make the application work as before, which changes are "recommended" in terms of performance optimization (I need to keep at least the same level of performance), and anything else you think could be useful for a Spark newbie :).
Thanks in advance!
I did the same a year ago; there are not that many changes you need to make. What comes to mind (a short sketch follows this list):
If your code is cluttered with spark/sqlContext, just extract these variables from the SparkSession instance at the beginning of your code.
df.map switched to the RDD API in Spark 1.6; in Spark 2.x you stay in the DataFrame API (which now has its own map method). To get the same functionality as before, replace df.map with df.rdd.map. The same is true for df.foreach, df.mapPartitions, etc.
unionAll in Spark 1.6 is just union in Spark 2.x.
The Databricks CSV library is now included in Spark.
When you insert into a partitioned Hive table, the partition columns must now come last in the schema; in Spark 1.6 they had to come first.
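A short sketch of the first few points (written in PySpark for brevity; the Scala API exposes the same methods under the same names):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migrated-app").getOrCreate()
sc = spark.sparkContext      # the old SparkContext, extracted from the session
# use `spark` itself wherever sqlContext was used before

df = spark.range(5).toDF("id")

# Spark 1.6: df.map(...) dropped to the RDD API; in Spark 2.x be explicit via .rdd
ids = df.rdd.map(lambda row: row.id).collect()

# Spark 1.6: df.unionAll(other); Spark 2.x: df.union(other)
doubled = df.union(df)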
What you should consider (but would require more work):
migrate RDD code to Dataset code
enable CBO (cost-based optimizer)
collect_list can now be used with structs; in Spark 1.6 it could only be used with primitives. This can simplify some things.
the Datasource API was improved/unified
the leftanti join type was introduced (this and collect_list with structs are illustrated in the sketch below)
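For the last two points, a quick illustration (again in PySpark, with made-up data):
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, struct

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([(1, "a", 10), (1, "b", 20), (2, "c", 5)], ["cust", "item", "qty"])
blocked = spark.createDataFrame([(2,)], ["cust"])

# collect_list over a struct: one row per customer with a list of (item, qty) pairs
per_cust = orders.groupBy("cust").agg(collect_list(struct("item", "qty")).alias("lines"))

# left_anti join: keep only rows whose customer does NOT appear in `blocked`
allowed = orders.join(blocked, "cust", "left_anti")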

How to save a Spark DataFrame in Tableau format?

I am trying to save a Spark DataFrame (Python) in .tde format. Will including these four JARs in the jars folder of Spark work:
jna.jar, tableauextract.jar, tableaucommon.jar, tableauserver.jar? If so, how do I get these JARs? I could not find them via a Google search.
You can get a pandas DataFrame using the .toPandas() method available on any Spark DataFrame. From there, some options exist to get to a Tableau .tde file; check out this link:
https://github.com/chinchon/python-tableau-tde
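Roughly, the pandas hand-off looks like the sketch below; write_tde is a hypothetical stand-in for the TDE-writing code in the linked repo (which uses the Tableau SDK), not a real library function:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

pdf = spark_df.toPandas()        # collects all rows to the driver, so only for data that fits in memory
write_tde(pdf, "output.tde")     # hypothetical helper: replace with the repo's TDE-writing routine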

Custom record reader for PST file format in Spark Scala

I am working on PST files. I have written custom record readers for MapReduce programs for different input formats, but this time it is going to be Spark.
I am not finding any clue or documentation on implementing record readers in Spark.
Can somebody help with this? Is it possible to implement this functionality in Spark?
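A common route (sketched below under stated assumptions) is to keep the Hadoop InputFormat/RecordReader already written for MapReduce and hand it to Spark via newAPIHadoopFile; in Scala this is sc.newAPIHadoopFile with the class objects. In the PySpark sketch, com.example.pst.PstInputFormat is a hypothetical custom InputFormat whose RecordReader emits Text keys and BytesWritable values:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pst_rdd = sc.newAPIHadoopFile(
    "hdfs:///data/mail/archive.pst",                   # placeholder input path
    "com.example.pst.PstInputFormat",                  # hypothetical custom InputFormat class
    keyClass="org.apache.hadoop.io.Text",              # converted to str on the Python side
    valueClass="org.apache.hadoop.io.BytesWritable")   # converted to bytearray

# each element is one (key, value) record produced by the custom RecordReader
print(pst_rdd.take(1))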