I want to merge existing data in HDFS with new data arriving from an RDD. (I mean merging the actual contents, not just telling the files apart by name.)
I found out there is no way to control the output file names in the rdd.saveAsTextFile API, so I cannot keep both by simply saving them under different names.
I tried to merge them with Hadoop's FileUtil.copyMerge function, but I'm using Hadoop 3, where this API is no longer supported.
Related
I regularly write data to a Parquet file that I use for my data analysis, and I want to ensure that the data in my Parquet file is not duplicated. The command I use to do this is:
df.write.parquet('my_directory/', mode='overwrite')
Does this ensure that all my non-duplicated data will not be deleted accidentally at some point?
Cheers
Overwrite, as the name implies, rewrites all the data at the path that you specify.
"Rewrite" here means that the data available in the df is written to the path after any old files at that path are removed. You can think of it as a DELETE and LOAD scenario: you read all the records from the data source (say, Oracle), apply your transformations, delete the existing Parquet files, and write the new content of the DataFrame.
DataFrame.write supports several modes for writing the content to the target.
mode – specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default case): Throw an exception if data already exists.
If your intention is to add new data to the Parquet files, you have to use append, but this brings a new challenge: duplicates, if you are dealing with changing data.
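One way to handle the duplicate problem with append is to drop the rows whose key is already stored before writing. Below is a minimal PySpark-style sketch, not a definitive implementation. The function name append_without_duplicates and the key_cols parameter are hypothetical, and it assumes the target path already holds Parquet data and that key_cols uniquely identifies a record.

```python
def append_without_duplicates(spark, new_df, path, key_cols):
    """Append only the rows of new_df whose key is not already stored at path.

    Assumptions (not from the original post): `path` already contains
    Parquet data, and `key_cols` uniquely identifies a record.
    """
    # Read only the key columns of the data that is already stored.
    existing_keys = spark.read.parquet(path).select(*key_cols)

    # Remove duplicates inside the new batch, then keep only the rows
    # whose key has no match in the existing data (left anti join).
    fresh = new_df.dropDuplicates(key_cols).join(
        existing_keys, on=key_cols, how="left_anti"
    )

    # Append the remaining rows; existing files are left untouched.
    fresh.write.mode("append").parquet(path)
    return fresh
```

It would be called as append_without_duplicates(spark, df, 'my_directory/', ['id']), where 'id' stands in for whatever your key column is. Note that this only filters out keys that already exist; it does not update a stored record whose values have changed.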
Does this ensure that all my non-duplicated data will not be deleted accidentally at some point?
No. mode='overwrite' only ensures that if data already exists in the target directory, then the existing data would be deleted and new data would be written (analogous to truncate and load in RDBMS tables).
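If you want to ensure there are no record-level duplicates, the easiest thing to do is to deduplicate the DataFrame before writing it (this removes duplicates within df itself; the existing files are still replaced by the overwrite):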
df1 = df.dropDuplicates()
df1.write.parquet('my_directory/', mode='overwrite')
I have multiple parquet files in different directories and would like to read them in sequence by parameterization in Scala.
The problem is the schema information is not standard and column names vary drastically.
For example, what is called load_date in one directory may be called load_dt in a parquet file from another directory.
So I'm being forced to use a different read.parquet().select statement for each directory (there are more than 30).
Is there a way I can use the same statement and switch the schema information based on a parameter of some sort? Maybe a client name or ID?
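One possible approach is to keep a per-client table of column aliases and apply the renames after a single generic read. Here is a rough sketch in Python; the same idea should carry over to Scala, since withColumnRenamed exists on DataFrames there too. The client names, the COLUMN_ALIASES table, and both function names are made up for illustration.

```python
# Hypothetical alias table: client -> {column name in its files: standard name}.
COLUMN_ALIASES = {
    "client_a": {"load_date": "load_date"},
    "client_b": {"load_dt": "load_date"},
}


def rename_plan(columns, aliases):
    """Return (old, new) pairs for the columns that actually need renaming."""
    return [
        (col, aliases[col])
        for col in columns
        if col in aliases and aliases[col] != col
    ]


def read_standardized(spark, path, client):
    """Read one directory and rename its columns to the standard names."""
    df = spark.read.parquet(path)
    for old, new in rename_plan(df.columns, COLUMN_ALIASES[client]):
        df = df.withColumnRenamed(old, new)
    return df
```

With this, every directory goes through the same read_standardized(spark, path, client) call, and only the alias table changes per client.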
I am implementing the file source in Spark Structured Streaming and want to process the same file name again if the file has been modified, that is, when the file is updated. Currently Spark will not process the same file name again once it has been processed. This seems limited compared to Spark Streaming with DStreams. Is there a way to do this? The Spark Structured Streaming documentation doesn't cover this anywhere; it only processes new files with different names.
I believe this is somewhat of an anti-pattern, but you may be able to dig through the checkpoint data and remove the entry for that original file.
Try looking for the original file name in the /checkpoint/sources// files and delete the file or the entry. That might cause the stream to pick up the file name again. I haven't tried this myself.
If this is a one-time manual update, I would just change the file name to something new and drop it in the source directory. This approach won't be maintainable or automated.
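If you do go the rename route, the copy step could look something like the sketch below. It assumes the source directory is on a local filesystem (for HDFS you would need an HDFS client instead), and the function name republish and the timestamp-based naming scheme are just illustrative choices.

```python
import shutil
import time
from pathlib import Path


def republish(modified_file, source_dir):
    """Copy a modified file into the stream's source directory under a new name."""
    src = Path(modified_file)
    # A nanosecond timestamp makes the new name differ from the original one.
    target = Path(source_dir) / f"{src.stem}-{time.time_ns()}{src.suffix}"
    # copyfile copies the contents only, so the copy gets a fresh modification time.
    shutil.copyfile(src, target)
    return target
```

Calling republish("data.csv", "/path/to/source_dir") puts a copy named like data-<timestamp>.csv into the directory, which the file source should then see as a new file.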
I know parquet files store metadata, but is it possible to add custom metadata to a parquet file using Spark, preferably from Scala?
The idea is that I store many similarly structured parquet files in Hadoop storage, and each has a uniquely named source (a String field, also present as a column in the parquet file). I'd like to access this information without the overhead of actually reading the parquet file, and possibly even remove this redundant column from the file.
I really don't want to put this info in the filename, so my best option for now is to read the first row of each parquet file and take the String value of its source column.
It works, but I was just wondering if there is a better way.
I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation is done frequently (every hour), I need to append those files to the existing source file.
I found that FileUtil provides the copyMerge function, but it doesn't allow appending one file to another.
Thank you for your help
You can do this with two methods:
sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")
Or, as @Pushkr has proposed:
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"),..)).coalesce(1).saveAsTextFile("path/newSource")
If you don't want to create a new source and would rather overwrite the same source every hour, you can use a DataFrame with save mode overwrite (see "How to overwrite the output directory in spark").
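For reference, the first method might look like this in PySpark. This is a sketch under the assumption that textFile is given the input paths as one comma-separated string; the helper name merge_text_files is made up.

```python
def merge_text_files(sc, input_paths, output_path):
    """Read several text inputs as one RDD and save them as a single part file."""
    # textFile takes one string; several paths are joined with commas.
    merged = sc.textFile(",".join(input_paths)).coalesce(1)
    # saveAsTextFile creates a directory and fails if output_path already exists.
    merged.saveAsTextFile(output_path)
    return merged
```

For example: merge_text_files(sc, ["path/source", "path/file1", "path/file2"], "path/newSource"). The result is a directory path/newSource containing one part file, not a single file with that name.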