Glue job runs indefinitely without finishing code execution - pyspark

I have a glue job that reads in data from an RDS postgres instance (via data catalog) and writes to s3 in partitioned and parquet format. That works fine. I added to the end of that script code to run a crawler on the s3 path that was written to, so that the data catalog has the new partitions updated. But this code never runs. I added a logging statement immediately after the line that writes the dynamic frame, and this logging statement is never output. The job just runs indefinitely until time out or I stop the run.
import ...
...
sc = SparkContext()
gluecontext = GlueContext(sc)
log4jLogger=sc._jvm.org.apache.log4j
log=log4jLogger.LogManager.getLogger(__name__)
log.warn("test of logger")
DyF = gluecontext.create_dynamic_frame_from_catalog(database='db-we-want', table_name='table-of-value')
create_paritions(DyF)
log.warn("about to write dynamic frame with data of interest")
gluecontext.write_dyanmic_frame_from_options(frame=DyF, connection_type='s3', connection_options={'path': 's3://some-bucket/some-prefix'}, format='parquet')
log.warn('Attempting to start crawler')
glue_client = boto3.client('glue', region_name='us-east-1')
glue_client.start_crawler(Name='some-crawler')
I expect to have the crawler start and to see the log statements after the first glue context writes. I see objects in s3 for the glue context dynamic frame, but the log statement immediately after doesn't write. The crawler does not start. There are no errors in the logs and the job continues to run indefinitely.
EDIT:
I was able to solve this issue. The problem was that the RDS connection subnet was public. However glue jobs do not have a public IP. They need access to an NAT gateway. By switching the connection to use a private subnet with an NAT gateway the job was able to succeed. The error was a boto3 time out on trying to connect to glue resources.

Related

Dynamically change VPC inside Glue job

Hi I am having issues with VPC settings. When I use a connection (that has VPC attached) in Glue job I can read data from SQL server but then I can't write data into target Snowflake server due to "timeout". By timeout I mean that job doesn't fail because of any reason but timeout. No errors in logs etc.
If I remove the connection from the same job and replace the SQL server data frame with some dummy spark data frame it all writes to Snowflake without any issue.
For connection to SQL Server I use data catalog and for snowflake I use Spark jdbc connector (2 jar files added to job).
I am thinking about connecting to VPC dynamically from the job script itself, then pull the data into data frame, then disconnect from VPC and write data frame to the target. Does anyone think that it is possible? I didn't find any mentions in documentation TBH.

AWS Glue : Unable to process data from multiple sources S3 bucket and postgreSQL db with AWS Glue using Scala-Spark

For my requirement, I need to join data present in PostgreSQL db(hosted in RDS) and file present in S3 bucket. I have created a Glue job(spark-scala) which should connect to both PostgreSQL, S3 bucket and complete processing.
But Glue job encounters connection timeout while connecting to S3(below is error msg). It is successfully fetching data from PostgreSQL.
There is no permission related issue with S3 because I am able to read/write from same S3 bucket/path using different job. The exception/issue happens only if I try to connect both postgreSQL and S3 in one glue job/script.
In Glue job, glue context is created using SparkContext as object. I have tried creating two different sparkSession, each for S3 and postgreSQL db but this approach didnt work. Same timeout issue encountered.
Please help me in resolving the issue.
Error/Exception from log:
ERROR[main] glue.processLauncher (Logging.scala:logError(91)):Exception in User Class
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to emp_bucket.s3.amazonaws.com:443
[emp_bucket.s3.amazonaws.com/] failed : connect timed out
This is fixed.
Issue was with security group. Only TCP traffic was allowed earlier.
As part of the fix traffic was opened for all. Also, added HTTPS rule in inbound rules as well.

Is it possible to wait until an EMR cluster is terminated?

I'm trying to write a component that will start up an EMR cluster, run a Spark pipeline on that cluster, and then shut that cluster down once the pipeline completes.
I've gotten as far as creating the cluster and setting permissions to allow my main cluster's worker machines to start EMR clusters. However, I'm struggling with debugging the created cluster and waiting until the pipeline has concluded. Here is the code I have now. Note I'm using Spark Scala, but this is very close to standard Java code:
val runSparkJob = new StepConfig()
.withName("Run Pipeline")
.withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
.withHadoopJarStep(
new HadoopJarStepConfig()
.withJar("/path/to/jar")
.withArgs(
"spark-submit",
"etc..."
)
)
// Create a cluster and run the Spark job on it
val clusterName = "REDACTED Cluster"
val createClusterRequest =
new RunJobFlowRequest()
.withName(clusterName)
.withReleaseLabel(Configs.EMR_RELEASE_LABEL)
.withSteps(enableDebugging, runSparkJob)
.withApplications(new Application().withName("Spark"))
.withLogUri(Configs.LOG_URI_PREFIX)
.withServiceRole(Configs.SERVICE_ROLE)
.withJobFlowRole(Configs.JOB_FLOW_ROLE)
.withInstances(
new JobFlowInstancesConfig()
.withEc2SubnetId(Configs.SUBNET)
.withInstanceCount(Configs.INSTANCE_COUNT)
.withKeepJobFlowAliveWhenNoSteps(false)
.withMasterInstanceType(Configs.MASTER_INSTANCE_TYPE)
.withSlaveInstanceType(Configs.SLAVE_INSTANCE_TYPE)
)
val newCluster = emr.runJobFlow(createClusterRequest)
I have two concrete questions:
The call to emr.runJobFlow returns immediately upon submitting the result. Is there any way that I can make it block until the cluster is shut down or otherwise wait until the workflow has concluded?
My cluster is actually not coming up and when I go to the AWS Console -> EMR -> Events view I see a failure:
Amazon EMR Cluster j-XXX (REDACTED...) has terminated with errors at 2019-06-13 19:50 UTC with a reason of VALIDATION_ERROR.
Is there any way I can get my hands on this error programmatically in my Java/Scala application?
Yes, it is very possible to wait until an EMR cluster is terminated.
There is are waiters that will block execution until the cluster (i.e. job flow) gets to a certain state.
val newCluster = emr.runJobFlow(createClusterRequest);
val describeRequest = new DescribeClusterRequest()
.withClusterId(newCluster.getClusterId())
// Wait until terminated
emr.waiters().clusterTerminated().run(new WaiterParameters(describeRequest))
Also, if you want to get the status of the cluster (i.e. job flow), you can call the describeCluster function of the EMR client. Check out the linked documentation as you can get state and status information about the cluster to determine if it's successful or erred.
val result = emr.describeCluster(describeRequest)
Note: Not the best Java-er so the above is my best guess and how it would work based on the documentation but I have not tested the above.

Intermittant file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, and the intermediate results are also saved in Avro form in GCS with the final output going to Cloud SQL.
Fairly often the jobs will fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run and partway through those files would disappear and be replaced by the new ones. I have changed my script that runs the files to clear the work area before starting the jobs, but still have problems where sometimes it starts reading and all the parts haven't been written fully.
I could change the code to simply store the intermediate files on the cluster, tho I like having them available outside for diagnosing other problems. I could also just write to both locations with the cluster for working and GCS for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all run on the same cluster. The cluster is started, each job run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, and the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even tho my immediate problem is (I believe) fixed I still would like to know what's happening and why GCS seems out of sync for a bit.

Spark: long delay between jobs

So we are running spark job that extract data and do some expansive data conversion and writes to several different files. Everything is running fine but I'm getting random expansive delays between resource intensive job finish and next job start.
In below picture, we can see that job that was scheduled at 17:22:02 took 15 min to finish, which means I'm expecting next job to be scheduled around 17:37:02. However, next job was scheduled at 22:05:59, which is +4 hours after job success.
When I dig into next job's spark UI it show <1 sec scheduler delay. So I'm confused to where does this 4 hours long delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on about how IO ops are handled in Spark is bit unexpected. (It makes sense to that file write essentially does "collect" behind the curtain before it writes considering ordering and/or other operations.) But I'm bit discomforted by the fact that I/O time is not included in job execution time. I guess you can see it in "SQL" tab of spark UI as queries are still running even with all jobs being successful but you cannot dive into it at all.
I'm sure there are more ways to improve but below two methods were sufficient for me:
reduce file count
set parquet.enable.summary-metadata to false
I/O operations often come with significant overhead that will occur on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occur on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced similar issue when writing parquet data on s3 with pyspark on EMR 5.5.1. All workers would finish writing data in _temporary bucket in output folder & Spark UI would show that all tasks have completed. But Hadoop Resource Manager UI would not release resources for the application neither mark it as complete. On checking s3 bucket, it seemed like spark driver was moving the files 1 by 1 from _temporary directory to output bucket which was extremely slow & all the cluster was idle except Driver node.
Solution:
The solution that worked for me was to use committer class by AWS ( EmrOptimizedSparkSqlParquetOutputCommitter ) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the spark job on EMR 5.20.0 using above solution, it did not create any _temporary directory & all the files were directly written to the output bucket, hence job finished very quickly.
Fore more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html