AWS Glue add new partitions and overwrite existing partitions - pyspark

I'm attempting to write PySpark code in Glue that lets me update the Glue Catalog by adding new partitions and overwriting existing partitions in the same call.
I read that there is no way to overwrite partitions in Glue, so we must use PySpark code similar to this:
final_df.withColumn('year', date_format('date', 'yyyy'))\
    .withColumn('month', date_format('date', 'MM'))\
    .withColumn('day', date_format('date', 'dd'))\
    .write.mode('overwrite')\
    .format('parquet')\
    .partitionBy('year', 'month', 'day')\
    .save('s3://my_bucket/')
However, with this method the Glue Catalog does not get updated automatically, so an MSCK REPAIR TABLE call is needed after each write. AWS recently released a new feature, enableUpdateCatalog, with which newly created partitions are immediately registered in the Glue Catalog. The code looks like this:
additionalOptions = {"enableUpdateCatalog": True}
additionalOptions["partitionKeys"] = ["year", "month", "day"]

dyn_frame_catalog = glueContext.write_dynamic_frame_from_catalog(
    frame=partition_dyf,
    database="my_db",
    table_name="my_table",
    format="parquet",
    additional_options=additionalOptions,
    transformation_ctx="my_ctx"
)
Is there a way to combine these two approaches, or will I need to use the PySpark write.mode('overwrite') method and run MSCK REPAIR TABLE my_table on every run of the Glue job?
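For reference, the repair step could also be issued from the Glue job itself through Athena with boto3; a rough sketch (the query result location below is just a placeholder):

import boto3

# Rough sketch: trigger the catalog repair from the Glue job via Athena.
# The OutputLocation is a placeholder for wherever Athena query results should go.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my_bucket/athena-query-results/"}
)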

If you have not already found your answer, I believe the following will work:
DataSink5 = glueContext.getSink(
    path = "s3://...",
    connection_type = "s3",
    updateBehavior = "UPDATE_IN_DATABASE",
    partitionKeys = ["year", "month", "day"],
    enableUpdateCatalog = True,
    transformation_ctx = "DataSink5")

DataSink5.setCatalogInfo(
    catalogDatabase = "my_db",
    catalogTableName = "my_table")

DataSink5.setFormat("glueparquet")
DataSink5.writeFrame(partition_dyf)
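If you would rather keep the plain Spark writer from the question, a possible middle ground is Spark's dynamic partition overwrite mode, which only rewrites the partitions present in the DataFrame; the Glue Catalog is not updated by this, so a repair or crawler step is still needed. A minimal sketch:

# Sketch: dynamic partition overwrite with the plain Spark writer.
# Only the partitions present in final_df are replaced; the Glue Catalog
# still has to be updated separately (MSCK REPAIR TABLE or a crawler).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

final_df.write.mode("overwrite") \
    .format("parquet") \
    .partitionBy("year", "month", "day") \
    .save("s3://my_bucket/")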

Related

MongoDB read operations take too long on spark, Is there any additional conf argument?

I want to query a MongoDB collection with Spark using PySpark. But when I use "collect", "count", etc. with some filter conditions (for example, field > 5), the job doesn't show up on the Spark UI for a long time and takes too long to complete. With small collections there is no problem and everything runs normally. I think Apache Spark is trying to read all the data.
My conf:
conf = SparkConf()\
    .set('spark.executor.memory', '7g')\
    .set('spark.executor.cores', '2')\
    .set('spark.num.executors', '8')\
    .set('spark.cores.max', '16')\
    .set('spark.driver.memory', '10g')\
    .set('spark.driver.port', '10001')\
    .set('spark.driver.host', 'xx.xx.xx.xx')\
    .set('spark.dynamicAllocation.enabled', "false")\
    .set("spark.mongodb.read.connection.uri", "mongodb://XX.XX.XX.XX:27017")\
    .set("spark.mongodb.read.database", "collection_1")\
    .set("spark.mongodb.read.collection", "ADPacket")\
    .set("spark.mongodb.write.connection.uri", "mongodb://XX.XX.XX.XX:27017")\
    .set("spark.mongodb.write.database", "collection_1")\
    .set("spark.mongodb.write.collection", "ADPacket")\
    .set("spark.mongodb.write.operationType", "update")\
    .set("spark.mongodb.readPreference.name", "secondaryPreferred")\
    .setAppName('hello')\
    .setMaster('spark://xx.xx.xx.xx:7077')

sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
sqlContext = SQLContext(sc)
and I load the data:
df = sqlContext.read.format("mongodb").load()
df.createOrReplaceTempView("ADPacket")
I tried both of the commands below.
sqlContext.sql("SELECT * FROM ADPacket WHERE AssetConnectDeviceKey='xxxx' and TimeStamp<1655555989 LIMIT 500").collect()
df.filter((df["TimeStamp"]>1650050989) & (df["TimeStamp"]<1655555989)).count()
In addition, AssetConnectDeviceKey and TimeStamp are indexed in the MongoDB collection.
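A quick way to see whether those filters are actually pushed down to MongoDB (rather than Spark reading the whole collection and filtering afterwards) is to inspect the physical plan; a minimal sketch:

# Sketch: check whether the TimeStamp predicate appears in the Mongo scan node.
# If it does not show up as a pushed filter, Spark is reading the full collection.
filtered = df.filter((df["TimeStamp"] > 1650050989) & (df["TimeStamp"] < 1655555989))
filtered.explain(True)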

Synapse - Notebook not working from Pipeline

I have a notebook in Azure Synapse that reads parquet files into a data frame using the synapsesql function and then pushes the data frame contents into a table in the SQL Pool.
Executing the notebook manually is successful and the table is created and populated in the Synapse SQL pool.
When I call the same notebook from an Azure Synapse pipeline, it reports success but does not create the table. I am using the Synapse Notebook activity in the pipeline.
What could be the issue here?
I am getting deprecation warnings around the synapsesql function but don't know what is actually deprecated.
The code is below.
%%spark
val pEnvironment = "t"
val pFolderName = "TestFolder"
val pSourceDatabaseName = "TestDatabase"
val pSourceSchemaName = "TestSchema"
val pRootFolderName = "RootFolder"
val pServerName = pEnvironment + "synas01"
val pDatabaseName = pEnvironment + "syndsqlp01"
val pTableName = pSourceDatabaseName + "_" + pSourceSchemaName + "_" + pFolderName
// Import functions and Synapse connector
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SqlAnalyticsConnector._
// Get list of "FileLocation" from control.FileLoadStatus
val fls:DataFrame = spark.read.
synapsesql(s"${pDatabaseName}.control.FileLoadStatus").
select("FileLocation","ProcessedDate")
// Read all parquet files in folder into data frame
// Add file name as column
val df:DataFrame = spark.read.
parquet(s"/source/${pRootFolderName}/${pFolderName}/").
withColumn("FileLocation", input_file_name())
// Join parquet file data frame to FileLoadStatus data frame
// Exclude rows in parquet file data frame where ProcessedDate is not null
val df2 = df.
join(fls,Seq("FileLocation"), "left").
where(fls("ProcessedDate").isNull)
// Write data frame to sql table
df2.write.
option(Constants.SERVER,s"${pServerName}.sql.azuresynapse.net").
synapsesql(s"${pDatabaseName}.xtr.${pTableName}",Constants.INTERNAL)
This happens often. To get the output after the pipeline execution, follow these steps:
Pick up the Apache Spark application name from the output of the pipeline.
Navigate to Apache Spark applications under the Monitor tab and search for that application name.
Four tabs are available there: Diagnostics, Logs, Input data, Output data.
Go to Logs and check 'stdout' for the required output.
https://www.youtube.com/watch?v=ydEXCVVGAiY
See the video link above for a detailed walkthrough of the procedure.

Flink Table API: GROUP BY in SQL Execution throws org.apache.flink.table.api.TableException

I have this very simplified use case: I want to use Apache Flink (1.11) to read data from a Kafka topic (let's call it source_topic), count an attribute in it (called b) and write the result into another Kafka topic (result_topic).
I have the following code so far:
from pyflink.datastream import StreamExecutionEnvironment, TimeCharacteristic
from pyflink.table import StreamTableEnvironment, EnvironmentSettings


def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env_settings = EnvironmentSettings.new_instance().use_blink_planner().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(stream_execution_environment=env, environment_settings=env_settings)
    t_env.get_config().get_configuration().set_boolean("python.fn-execution.memory.managed", True)
    t_env.get_config().get_configuration().set_string("pipeline.jars", "file:///opt/flink-1.11.2/lib/flink-sql-connector-kafka_2.11-1.11.2.jar")

    source_ddl = """
        CREATE TABLE source_table(
            a STRING,
            b INT
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'source_topic',
            'properties.bootstrap.servers' = 'node-1:9092',
            'scan.startup.mode' = 'earliest-offset',
            'format' = 'csv',
            'csv.ignore-parse-errors' = 'true'
        )
    """

    sink_ddl = """
        CREATE TABLE result_table(
            b INT,
            result BIGINT
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'result_topic',
            'properties.bootstrap.servers' = 'node-1:9092',
            'format' = 'csv'
        )
    """

    t_env.execute_sql(source_ddl)
    t_env.execute_sql(sink_ddl)
    t_env.execute_sql("INSERT INTO result_table SELECT b, COUNT(b) FROM source_table GROUP BY b")
    t_env.execute("Kafka_Flink_job")


if __name__ == '__main__':
    log_processing()
But when I execute it, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o5.executeSql.
: org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.result_table' doesn't support consuming update changes which is produced by node GroupAggregate(groupBy=[b], select=[b, COUNT(b) AS EXPR$1])
I am able to write data into a Kafka topic with a simple SELECT statement. But as soon as I add the GROUP BY clause, the exception above is thrown. I followed Flink's documentation on the use of the Table API with SQL for Python: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/common.html#sql
Any help is highly appreciated, I am very new to Stream Processing and Flink. Thank you!
Using a GROUP BY clause will generate an updating stream, which is not supported by the Kafka connector as of Flink 1.11. On the other hand, when you use a simple SELECT statement without any aggregation, the result stream is append-only (this is why you're able to consume it without issues).
Flink 1.12 is very close to being released, and it includes a new upsert Kafka connector (FLIP-149, if you're curious) that will allow you to do this type of operation also in PyFlink (i.e. the Python Table API).
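For reference, once on Flink 1.12 the sink DDL could look roughly like this with the upsert-kafka connector (a sketch, assuming b is used as the primary key and CSV is kept for both key and value):

# Sketch of an upsert-kafka sink for Flink 1.12+ (not available in 1.11).
upsert_sink_ddl = """
    CREATE TABLE result_table(
        b INT,
        result BIGINT,
        PRIMARY KEY (b) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'result_topic',
        'properties.bootstrap.servers' = 'node-1:9092',
        'key.format' = 'csv',
        'value.format' = 'csv'
    )
"""
t_env.execute_sql(upsert_sink_ddl)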

How override data in aws glue?

Consider this code:
val inputTable = glueContext
.getCatalogSource(database = "my_db", tableName = "my_table")
.getDynamicFrame()
glueContext.getSinkWithFormat(
connectionType = "s3",
options = JsonOptions(Map("path" -> "s3://my_out_path")),
format = "orc", transformationContext = ""
).writeDynamicFrame(inputTable)
When I run this code twice, new ORC files are added alongside the old ones in "s3://my_out_path". Is there a way to always overwrite the output path?
Note
The data being written is not partitioned.
Yes, you can use Spark to overwrite the content. You can still read your data with Glue methods, but then convert it to a Spark DataFrame and overwrite the files:
datasink = DynamicFrame.toDF(inputTable)
datasink.write \
    .format("orc") \
    .mode("overwrite") \
    .save("s3://my_out_path")

AWS Glue PySpark replace NULLs

I am running an AWS Glue job to load a pipe delimited file on S3 into an RDS Postgres instance, using the auto-generated PySpark script from Glue.
Initially, it complained about NULL values in some columns:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
After some googling and reading on SO, I tried to replace the NULLs in my file by converting my AWS Glue DynamicFrame to a Spark DataFrame, executing fillna(), and converting back to a DynamicFrame.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog", table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = custom_df2.fromDF()
applymapping1 = ApplyMapping.apply(
    frame=custom_df3, mappings=[("id", "string", "id", "int"), ........more code
References:
https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md#3-there-are-some-transforms-that-i-cannot-figure-out
How to replace all Null values of a dataframe in Pyspark
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna
Now, when I run my job, it throws the following error:
Log Contents:
Traceback (most recent call last):
File "script_2017-12-20-22-02-13.py", line 23, in <module>
custom_df3 = custom_df2.fromDF()
AttributeError: 'DataFrame' object has no attribute 'fromDF'
End of LogType:stdout
I am new to Python and Spark and have tried a lot, but can't make sense of this. Appreciate some expert help on this.
I tried changing my reconvert command to this:
custom_df3 = glueContext.create_dynamic_frame.fromDF(frame = custom_df2)
But still got the error:
AttributeError: 'DynamicFrameReader' object has no attribute 'fromDF'
UPDATE:
I suspect this is not about NULL values. The message "Can't get JDBC type for null" seems not to refer to a NULL value, but to some data/type that JDBC is unable to decipher.
I created a file with only 1 record, no NULL values, changed all Boolean types to INT (and replaced values with 0 and 1), but still get the same error:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
UPDATE:
Make sure DynamicFrame is imported (from awsglue.context import DynamicFrame), since fromDF / toDF are part of DynamicFrame.
Refer to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
You are calling .fromDF on the wrong class. It should look like this:
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(custom_df2, glueContext, 'label')
For this error, pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null", you should drop the null fields.
I was getting similar errors while loading into Redshift DB tables. After using the command below, the issue was resolved:
Loading = DropNullFields.apply(frame=resolvechoice3, transformation_ctx="Loading")
In Pandas, and for a Pandas DataFrame, fillna() is used to fill null values with other specified values. DropNullFields, on the other hand, drops all null fields in a DynamicFrame whose type is NullType; these are fields whose value is missing or null in every record of the DynamicFrame data set.
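As a side note, if a blanket -1 is not appropriate for every column, Spark's fillna also accepts per-column defaults; a small sketch (the column names here are made up):

# Sketch: per-column null replacement ("amount" and "status" are hypothetical columns).
custom_df2 = custom_df.fillna({"amount": -1, "status": "unknown"})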
In your specific situation, you need to make sure you are using the write class for the appropriate dataset.
Here is the edited version of your code:
from awsglue.dynamicframe import DynamicFrame

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog", table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, 'your_label')
applymapping1 = ApplyMapping.apply(
    frame=custom_df3, mappings=[("id", "string", "id", "int"), ........more code
This is what you are doing: 1. Read the file into a DynamicFrame, 2. Convert it to a DataFrame, 3. Replace null values, 4. Convert back to a DynamicFrame, and 5. ApplyMapping. You were getting the error because your step 4 was wrong and you were feeding a DataFrame to ApplyMapping, which does not work; ApplyMapping is designed for DynamicFrames.
I would suggest reading your data as a DynamicFrame and sticking to the same data type. It would look like this (one way to do it):
from awsglue.dynamicframe import DynamicFrame

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog", table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = DropNullFields.apply(frame=datasource0)
applymapping1 = ApplyMapping.apply(
    frame=custom_df, mappings=[("id", "string", "id", "int"), ........more code