HadoopDataSource: Skipping Partition {} as no new files detected # s3: - scala

So, I have an S3 folder with several subfolders acting as partitions (based on the date of creation). I have a Glue Table for those partitions and can see the data using Athena.
Running a Glue Job and trying to access the Catalog I get the following error:
HadoopDataSource: Skipping Partition {} as no new files detected # s3:...
The line that gives me problems is the following:
glueContext.getCatalogSource(database = "DB_NAME", tableName = "TABLE_NAME", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame().toDF()
I want to be able to access all the data in those S3 subfolders at any point, since it is updated regularly.
I'm thinking the problem is the Glue Job Bookmark not detecting new files, but this is not running directly as part of a Job but as part of a library used by a Job.
Removing "transformationContext" or changing its value to empty hasn't worked.

So the Hadoop output you are getting is not an error but just a simple log that the partition is empty.
But the partition that is getting logged, {}, seems to be off. Can you check that?
In addition, could you run the job with bookmark disabled, to make sure that this is not the cause of the problem?
I also found this unresolved GitHub issue, maybe you can comment there too, so that the issue gets some attention.


Debezium Connector tries to open old log file

I have a debezium connector that works fine, for a limited time. These errors occur in log file:
Caused by: java.sql.SQLException: ORA-00308: cannot open archived log '+RECO/XXXXXXXX/ARCHIVELOG/2022_01_04/thread_1_seq_53874.3204.1093111215'
ORA-17503: ksfdopn:2 Failed to open file +RECO/XXXXXXXX/ARCHIVELOG/2022_01_04/thread_1_seq_53874.3204.1093111215
ORA-15012: ASM file '+RECO/XXXXXX/ARCHIVELOG/2022_01_04/thread_1_seq_53874.3204.1093111215' does not exist
I've learnt in this database log files are deleted daily. Is my connector trying to read an old log file, which does not exist anymore? How can I tell my connector to check only last 12 hours, for example. Or should I do something in database side?
I've learnt in this database log files are deleted daily. Is my connector trying to read an old log file, which does not exist anymore?
It is fine to delete archive logs that are no longer needed, but it's critical that you make sure that you are not deleting logs that the Oracle Connector still requires in order to perform mining. In your particular case, the connector still required thread_1_seq_53874.3204.1093111215 but the log is no longer on the file system and therefore the connector will stop with an error. This error happens with any other connector such as MySQL if you remove the binlogs before the connector is done reading them.
How can I tell my connector to check only last 12 hours, for example.
You cannot.
The way the Debezium connectors are designed is that they're meant to read all changes from the logs in chronological order to guarantee that there is no change data event loss. If a log were to be deleted that was needed and we did not throw an error, then you would have gaps where changes from the source database would not be represented as change events and so your consumers wouldn't be kept in sync.
Or should I do something in database side
Archive logs need to be retained for as long as they're needed by the connector. The latency of the Oracle connector depends on the volatility of your database and also on a number of other factors, such as the performance of the database server hardware (disk and CPU), the size of your redo logs, etc.
Some environments may not be able to keep archive logs available in the default destination location for extended periods of time due to space constraints. This is why we introduced a way to set up Oracle to write archive logs to a secondary destination location that is capable of retaining the logs for a longer period of time, often via a network mount. You can then explicitly tell the connector to use that archive destination name rather than the first valid/default location of the system.

Reading a file from local file system after reading it from hadoop file system

I am trying to read a file from my local EMR file system. It is there as a file under the folder /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I added a file://// prefix to the file path as well because I have seen that work for others, but that still did not work. So I also try to read directly from the hadoop file system where it is stored in the folder: /emr/CNSMR_ACCNT_BAL/myFile.csv because I thought it was maybe checking by default in hdfs. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, the application master is by default created on one of your worker (CORE) nodes, not on the master node.
When you say file:///emr/myFile.csv, the file exists on a local file system (I'm assuming that of the master node). Your program, however, looks for the file on the node where the application master runs, and that is evidently not your master node; otherwise you would not get an error.
2nd problem:
When you try to access a file in HDFS using java.io.File, it won't be able to access that file, because java.io.File only sees the local file system.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use the HDFS URI scheme: hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml sets fs.defaultFS, you don't need the namenode and port; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv.
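As a rough illustration of the FileSystem approach, here is a minimal sketch. It assumes the Hadoop client libraries are on the classpath; the object name HdfsRead and the helper readLines are made up for this example and are not part of any library.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Hypothetical helper, not a library API.
object HdfsRead {
  /** Reads all lines of the file at `uri` through the Hadoop FileSystem API.
    * The scheme of `uri` (hdfs://, file://, ...) decides which file system is used.
    */
  def readLines(uri: String, conf: Configuration = new Configuration()): List[String] = {
    // FileSystem.get returns a cached, shared instance, so it is not closed here.
    val fs = FileSystem.get(new URI(uri), conf)
    val in = fs.open(new Path(uri))
    try Source.fromInputStream(in, "UTF-8").getLines().toList // force the read before closing
    finally in.close()
  }
}
```

With fs.defaultFS configured, a call such as HdfsRead.readLines("hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv") should resolve against the cluster's namenode. The same helper also accepts a file:// URI, but that still reads from whichever node the code happens to run on.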
So what's the better option for accessing a file in a Hadoop cluster?
The answer depends on your use case, but in most cases putting it in HDFS is much better, because you don't have to worry about where your application master is. Every node has access to HDFS.
Hope that resolves your problem.

Hadoop FileUtils not able to write files on local(Unix) filesystem from Scala

I'm trying to write file to local FileSystem using FileSystem library of org.apache.hadoop.fs. Below is my one liner code inside the big scala code that should be doing this, but it's not.
fs.copyToLocalFile(false, hdfsSourcePath, new Path(newFile.getAbsolutePath), true)
The value of newFile is:
val newFile = new File(s"${localPath}/fileName.dat")
localPath is just a variable containing the full path on local disk.
hdfsSourcePath is the full path on HDFS location.
The job executes properly but I don't see the files created on local. I'm running it through Spark engine in cluster mode, that's why I used the copyToLocalFile method which overloads the 4th argument of useRawLocalFileSystem and set it to true. Using this, we can avoid getting the files being written on the executor node.
Any ideas?
I used the copyToLocalFile method which overloads the 4th argument of useRawLocalFileSystem and set it to true. Using this, we can avoid getting the files being written on the executor node.
I think you got this point wrong. In cluster mode the driver itself runs on a node inside the cluster, so the local file system is that node's file system. useRawLocalFileSystem only prevents checksum files from being written; it does not make the files appear on the machine that submits the job, which is probably what you expected.
The best you can do is to save files to HDFS and retrieve them explicitly after the job finishes.

How can I identify processed files in Data flow Job

How can I identify processed files in Data flow Job? I am using a wildcard to read files from cloud storage. but every time when the job runs, it re-read all files.
This is a batch Job and following is sample reading TextIO that I am using.
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, which is the Cloud Storage command line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid re-analyzing files that were already processed, you could do either of the following:
Define a Streaming job, and use TextIO's watchForNewFiles functionality. You would have to leave your job to run for as long as you want to keep processing files.
Find a way to provide your pipeline with files that have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again you can use this list as a blacklist for files that you don't need to run again.
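The bookkeeping behind the second option could look roughly like the sketch below. It is not Beam code, only the part that decides which files still need to be read. It assumes Scala 2.13 and a plain text manifest with one file name per line; ProcessedFiles, load, pending and record are hypothetical names, and in a real job the manifest would more likely live in Cloud Storage or a database than on local disk.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardOpenOption}
import scala.jdk.CollectionConverters._

// Hypothetical helper for tracking which input files earlier runs have handled.
object ProcessedFiles {
  /** Loads the names recorded by earlier runs; empty if no manifest exists yet. */
  def load(manifest: Path): Set[String] =
    if (Files.exists(manifest))
      Files.readAllLines(manifest, StandardCharsets.UTF_8).asScala
        .map(_.trim).filter(_.nonEmpty).toSet
    else Set.empty[String]

  /** Keeps only the matched files that have not been processed before. */
  def pending(matched: Seq[String], processed: Set[String]): Seq[String] =
    matched.filterNot(processed.contains)

  /** Appends the files handled in this run to the manifest. */
  def record(manifest: Path, newlyProcessed: Seq[String]): Unit = {
    if (newlyProcessed.nonEmpty) {
      Files.write(manifest, newlyProcessed.asJava, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND)
    }
    ()
  }
}
```

The result of pending is what you would hand to the pipeline as the list of files to read, and record would be called only after the job has finished successfully, so that a failed run does not mark files as done.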
Let me know in the comments if you want to work out a solution around one of these two options.

spark-scala checkpointing cleanup

I am running a spark application in 'local' mode. It's checkpointing correctly to the directory defined in the checkpointFolder config. However, there are two issues that I am seeing that are causing some disk space issues.
1) As we have multiple users running the application, the checkpoint folder on the server is created by the first user executing it, which causes other users' runs to fail due to a permissions issue on the OS. Is there a way to provide a relative path in the checkpointFolder, for example checkpointFolder=~/spark/checkpoint?
2) I have used the spark.worker.cleanup.enabled=true config to cleanup the checkpoint folder after the run, but don't see that happening. Is there an alternate way of cleaning it up through the app, instead of resorting to some cron job?
Hope the following is sensible:
1) You may create a unique folder for each run, like /tmp/spark_checkpoint_1578032476801
2a) You may just delete the folder at the end of the app.
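2b) If you use HDFS for checkpointing, you can use code like this:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
// Recursively deletes fsPath on the file system that its URI resolves to.
def cleanFS(sc: SparkContext, fsPath: String): Boolean = {
  val fs = FileSystem.get(new URI(fsPath), sc.hadoopConfiguration)
  fs.delete(new Path(fsPath), true) // true = recursive
}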
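For points 1) and 2a) on a local disk, something along these lines might work. It is only a sketch using the JDK's java.nio.file API; CheckpointDirs, createUnique and deleteRecursively are names invented here. Note that the JVM does not expand ~ in paths, so a per-user location would have to be built from the user.home system property instead.

```scala
import java.io.IOException
import java.nio.file.{FileVisitResult, Files, Path, Paths, SimpleFileVisitor}
import java.nio.file.attribute.BasicFileAttributes

// Hypothetical helper for per-run checkpoint directories on a local disk.
object CheckpointDirs {
  /** Creates a fresh directory named spark_checkpoint_<user>_<random> under `base`.
    * It is owned by the current user, so runs by different users do not collide.
    */
  def createUnique(base: Path = Paths.get(System.getProperty("java.io.tmpdir"))): Path = {
    val user = System.getProperty("user.name", "unknown")
    Files.createTempDirectory(base, s"spark_checkpoint_${user}_")
  }

  /** Deletes `root` and everything below it; does nothing if it does not exist. */
  def deleteRecursively(root: Path): Unit = {
    if (Files.exists(root)) {
      Files.walkFileTree(root, new SimpleFileVisitor[Path] {
        override def visitFile(file: Path, attrs: BasicFileAttributes): FileVisitResult = {
          Files.delete(file)
          FileVisitResult.CONTINUE
        }
        // Directories are removed after their contents have been visited.
        override def postVisitDirectory(dir: Path, exc: IOException): FileVisitResult = {
          Files.delete(dir)
          FileVisitResult.CONTINUE
        }
      })
    }
    ()
  }
}
```

In the application you would pass the created path (as a string) to sc.setCheckpointDir at startup and call deleteRecursively in a finally block once the job is done.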
Check this answer out!
PySpark: fully cleaning checkpoints
I was facing the same issue and it is solved in the above link!