I am working with Azure Databricks and PySpark 2.4.3, trying to build a robust approach to importing files from Blob storage into cluster storage. Things mostly work, but parsing is not raising errors the way I expect.
I have a 7 GB CSV file that I know contains a number of records with issues that cause rows to be skipped (found by reconciling the record count in the output Parquet file written from the DataFrame against the source CSV). I am attempting to use the badRecordsPath option, but no output is being generated there (that I can find). Can anyone share advice on how to troubleshoot file loading when there is bad data, and on how to create a robust process that will handle parsing errors non-permissively in the future?
One issue already tackled is embedded newlines, where I've found the wholeFile and multiLine options have helped, but I am now having trouble getting insight into which records are not being accepted.
The Python code that I am using to load the file looks like this:
myDf = spark.read.format("csv")\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .option("wholeFile", "true")\
    .option("multiLine", "true")\
    .option("ignoreLeadingWhiteSpace", "true")\
    .option("ignoreTrailingWhiteSpace", "true")\
    .option("parserLib", "univocity")\
    .option("quote", '"')\
    .option("escape", '"')\
    .option("badRecordsPath", "/tmp/badRecordsPath")\
    .load(BLOB_FILE_LOCATION)
What I see is that about half a million records out of more than 10 million are being dropped. I am currently unable to easily tell which records are dropped, that failures are occurring at all, or what they are (short of exporting and comparing the data, which would be OK for a one-time load but is not acceptable for the production system). I've also tried the other read modes without luck; it always behaves as if DROPMALFORMED were set, which is not the case (even when setting mode to "FAILFAST" as an experiment).
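For reference, here is a diagnostic variant I have been experimenting with: read in PERMISSIVE mode with an explicit schema plus a corrupt-record column so rejected rows can be inspected. This is only a sketch; the column names in the schema are placeholders, not my real layout.
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema: the real column names/types come from the source file.
diagSchema = StructType(
    [StructField(c, StringType(), True) for c in ["col1", "col2", "col3"]]
    + [StructField("_corrupt_record", StringType(), True)]
)

badDf = spark.read.format("csv")\
    .schema(diagSchema)\
    .option("header", "true")\
    .option("multiLine", "true")\
    .option("quote", '"')\
    .option("escape", '"')\
    .option("mode", "PERMISSIVE")\
    .option("columnNameOfCorruptRecord", "_corrupt_record")\
    .load(BLOB_FILE_LOCATION)

# Cache before filtering: Spark 2.3+ rejects queries that reference only the
# internal corrupt-record column of an uncached CSV/JSON DataFrame.
badDf.cache()
badRows = badDf.filter("_corrupt_record IS NOT NULL")
print(badRows.count())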
Many thanks for any insight / advice.
Related
I would like to ask for your help.
I have been working with Databricks.
We developed some scripts and they are running as streaming jobs.
Let's suppose that we have two jobs running and writing data to one general local dataset (LDS).
This means notebook1 and notebook2 write data to the same LDS.
Each notebook reads data from a different origin and writes it to the same LDS in a standard format. To avoid problems we made use of partitions in the LDS.
This means that in this case the LDS has one partition for notebook1 and another partition for notebook2.
This implementation has been working well for almost 5 months.
However, today we faced the following error:
com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: No file found in the directory: dbfs:/mnt/streaming/streaming1/_delta_log.
I have been looking for information on how to solve it, and the solutions I found are:
Solution 1: This explains some reasons why the situation can happen and suggests either using a new checkpoint directory, or setting the Spark property spark.sql.files.ignoreMissingFiles to true in the cluster's Spark config.
Using a new checkpoint directory is not possible for us because of the requirements we need to satisfy: a new checkpoint would mean reprocessing all of the data that has already been processed.
Why? In summary, we receive updates from a database that are saved into a Delta table containing the raw data, and that table is where we consume from, so creating a new checkpoint or deleting the existing one would mean consuming the whole dataset again.
That only leaves us the option of setting spark.sql.files.ignoreMissingFiles. My question here is: if we set this property, would we be processing the data from the beginning, or would the stream resume from the last checkpoint?
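For reference, this is how we would set that property at the session level (it can also go in the cluster's Spark config):
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")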
Solution 2: I found a similar case here, but I didn't fully understand it. The suggestion is to change the parent directory (we already have something similar, which does not solve our problem) and also to pass the directory in the start() option.
Our main stream looks like this:
spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.option("maxFilesPerTrigger", 250) \
.option("maxBytesPerTrigger", 536870912)\
.option("failOnDataLoss", "true")\
.load(DATA_PATH)\
.filter(expr("_change_type not in ('delete', 'update_preimage')"))\
.writeStream\
.queryName(streamQueryName)\
.foreachBatch(MainFunctionstoprocess)\
.option("checkpointLocation", checkpointLocation)\
.option("mergeSchema", "true")\
.trigger(processingTime='1 seconds')\
.start()
Does anyone have an idea of how we could solve this problem without deleting the checkpoints, so that we can resume from the last checkpoint where it failed, or some way to go back to an earlier checkpoint so that we only need to reprocess part of the data?
I'm using zeppelin and spark, and I'd like to take a 2TB file from S3 and run transformations on it in Spark, and then send it up to S3 so that I can work with the file in Jupyter notebook. The transformations are pretty straightforward.
I'm reading the file as a parquet file. I think it's about 2TB, but I'm not sure how to verify.
It's about 10M rows and 5 columns, so it's pretty big.
I tried to do my_table.write.parquet(s3path) and I tried my_table.write.option("maxRecordsPerFile", 200000).parquet(s3path). How do I come up with the right way to write a big parquet file?
These are the points you could consider...
1) maxRecordsPerFile setting:
With
my_table.write.parquet(s3path)
Spark writes a single file out per task.
The number of files saved equals the number of partitions of the RDD/DataFrame being saved, so this can result in ridiculously large files (of course you can repartition your data before saving, but repartitioning shuffles the data across the network).
To limit the number of records per file (set numberOfRecordsPerFile to whatever you wish):
my_table.write.option("maxRecordsPerFile", numberOfRecordsPerFile).parquet(s3path)
This avoids generating huge files.
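For example, a PySpark sketch combining both ideas (the numbers are only illustrative; tune them for your data):
(my_table
    .repartition(200)                        # optional: shuffle into a chosen number of partitions first
    .write
    .option("maxRecordsPerFile", 200000)     # cap rows per output file
    .parquet(s3path))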
2) If you are using AWS EMR (EMRFS), this could be another point to consider:
emr-spark-s3-optimized-committer
When the EMRFS S3-optimized committer is not used:
When using the S3A file system.
When using an output format other than Parquet, such as ORC or text.
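If I read the EMR documentation correctly, the committer is toggled with the property below (it is enabled by default on recent EMR releases); treat the exact key as something to verify for your EMR version:
spark.conf.set("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")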
3) Use compression, the v2 output committer algorithm, and other Spark configurations:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
  .config("spark.hadoop.parquet.enable.summary-metadata", false)
  .config("spark.sql.parquet.mergeSchema", false)
  .config("spark.sql.parquet.filterPushdown", true) // for reading purposes
  .config("mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.sql.parquet.compression.codec", "snappy")
  .getOrCreate()
4) Fast upload and other properties, in case you are using s3a:
.config("spark.hadoop.fs.s3a.fast.upload", "true")
.config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
.config("spark.hadoop.fs.s3a.fast.upload.active.blocks", "4")
.config("spark.hadoop.fs.s3a.connection.timeout", "100000")
.config("spark.hadoop.fs.s3a.attempts.maximum", "10")
.config("fs.s3a.connection.ssl.enabled", "true")
The S3A connector will write blocks incrementally, but the (obsolete) version shipping with Spark built against Hadoop 2.7.x doesn't handle it very well. If you can, update all hadoop-* JARs to 2.8.5 or 2.9.x.
The option fs.s3a.multipart.size controls the size of each block. There is a limit of 10,000 blocks per upload, so the maximum file you can upload is that size * 10,000. For very large files, use a bigger value than the default of 64M.
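As a quick back-of-the-envelope sketch of that limit (the part size is illustrative):
# Largest single object ~= multipart part size * 10,000 parts.
part_size_mb = 128                                    # e.g. fs.s3a.multipart.size = 128M
max_object_tb = part_size_mb * 10000 / (1024 * 1024)
print(f"~{max_object_tb:.2f} TB per object")          # ~1.22 TB with 128M parts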
I am trying to measure how long it takes to read and write Parquet files in Amazon S3 (under a specific partition).
For that I wrote a script that simply reads the files and then writes them back:
val df = sqlContext.read.parquet(path + "p1.parquet/partitionBy=partition1")
df.write.mode("overwrite").parquet(path + "p1.parquet/partitionBy=partition1")
However, I get a NullPointerException. I tried adding df.count in between, but got the same error.
The reason for the error is that Spark only reads the data when it is going to be used. This results in Spark reading data from the file at the same time as trying to overwrite the file. This causes an issue since data can't be overwritten while reading.
I'd recommend saving to a temporary location as this is for timing purposes. An alternative would be to use .cache() on the data when reading, perform an action to force the read (as well as actually cache the data), and then overwrite the file.
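A PySpark sketch of the two workarounds described above (the paths are placeholders):
df = spark.read.parquet(src_path)

# Option A: time the write against a temporary location instead of the source path.
df.write.mode("overwrite").parquet(tmp_path)

# Option B: cache and materialize first, then overwrite the original path.
df.cache()
df.count()   # forces the read, so the cached data rather than the source files backs the write
df.write.mode("overwrite").parquet(src_path)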
I am stuck with the following problem. I have around 30,000 JSON files stored in S3 inside a particular bucket. These files are very small; each one is only 400-500 KB, but there are a lot of them.
I want to create DataFrame based on all these files. I am reading JSON files using wildcard as follows:
var df = sqlContext.read.json("s3n://path_to_bucket/*.json")
I also tried this approach since json(...) is deprecated:
var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
The problem is that it takes a very long time to create df. I was waiting 4 hours and the Spark job was still running.
Is there any more efficient approach to collect all these JSON files and create a DataFrame based on them?
UPDATE:
Or, at the very least, is it possible to read the last 1000 files instead of reading all of them? I found out that one can pass options via sqlContext.read.format("json").options, but I cannot figure out how to read only the N newest files.
If you can get the names of the 1000 most recently modified files into a simple list, you can simply call:
sqlContext.read.format("json").json(filePathsList: _*)
Please note that the .option call(s) are used to configure how the files are parsed (for example, schema-related options), not which files are read.
Unfortunately, I haven't used S3 before, but I think you can use the same logic in the answer to this question to get the last modified file names:
How do I find the last modified file in a directory in Java?
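For S3 specifically, a hedged sketch of that idea in Python using boto3 (the bucket and prefix names are placeholders):
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# List every object under the prefix, then keep the 1000 most recently modified JSON keys.
objects = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="path_to_bucket/"):
    objects.extend(page.get("Contents", []))

json_objects = [o for o in objects if o["Key"].endswith(".json")]
newest = sorted(json_objects, key=lambda o: o["LastModified"], reverse=True)[:1000]
paths = ["s3n://my-bucket/" + o["Key"] for o in newest]

df = sqlContext.read.json(paths)  # PySpark's json() accepts a list of paths in Spark 2.x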
You are loading something like 13 GB of data. Are you sure it is just creating the DataFrame that takes so long? Maybe the rest of the application is running and the UI is attributing the time to that stage.
Try just loading the data and printing the first row of the DataFrame.
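Something like this, as a quick check:
df = sqlContext.read.json("s3n://path_to_bucket/*.json")
print(df.first())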
Anyway, what is the configuration of the cluster?
I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I end up losing track of how much of my data has been written and how much I still need to write.
I am using Parquet because the partitioning substantially speeds up my querying in the future. As well, I must write the data as some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking with Parquet or another format) are greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
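A hedged PySpark sketch that puts both suggestions together with the append write from the question (the builder-time config keys are my translation of the Scala snippets above; data, validPartnerIds and saveDestination come from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
    .config("spark.sql.parquet.compression.codec", "snappy")
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .getOrCreate())

(data
    .filter(validPartnerIds(col("partnerID")))
    .write
    .mode("append")
    .partitionBy("partnerID", "year", "month", "day")
    .parquet(saveDestination))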
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
If your data is not already organized by the partition columns, it will be spread across all of your Spark partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data on your partition columns before writing, so that all of the data for each output file lives in the same partition:
data
.filter(validPartnerIds($"partnerID"))
.repartition([optional integer,] "partnerID","year","month","day")
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition