I have multiple parquet files in different directories and would like to read them in sequence, driven by a parameter, in Scala.
The problem is that the schema information is not standardized and the column names vary drastically.
For example, what is called load_date in one directory can be called load_dt in a parquet file from another directory.
So I'm forced to use a different read.parquet().select statement for each directory (there are more than 30).
Is there a way I can use the same statement and switch the schema information based on a parameter of some sort, maybe a client name or ID?
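A minimal sketch of one way to parameterize this in Scala, assuming a per-client mapping from source column names to your standard names (the client keys and mapping contents below are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("parquet-mapping").getOrCreate()

// hypothetical mapping: client ID -> (source column name -> standard column name)
val columnMappings: Map[String, Map[String, String]] = Map(
  "clientA" -> Map("load_date" -> "load_date", "cust_id" -> "customer_id"),
  "clientB" -> Map("load_dt"   -> "load_date", "custid"  -> "customer_id")
)

// one read statement for every directory; only the mapping switches per client
def readClient(clientId: String, path: String) = {
  val mapping = columnMappings(clientId)
  spark.read.parquet(path)
    .select(mapping.map { case (src, std) => col(src).as(std) }.toSeq: _*)
}

With this in place, readClient("clientB", "hdfs:///data/clientB/") returns a dataframe with the standardized column names, so the rest of the pipeline stays identical across directories.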
I want to merge existing data in HDFS with new data coming in from an RDD (not by filename, but by the actual data inside the files).
I found out there is no way to control the output file names in the rdd.saveAsTextFile API, so I cannot keep both simply by giving them different names.
I tried to merge them with Hadoop's FileUtil.copyMerge function, but I'm on Hadoop 3, where this API is no longer supported.
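A minimal sketch of one workaround, assuming plain text records and hypothetical paths: read the existing data back as an RDD, union it with the incoming RDD, deduplicate on the records themselves, and write the result to a fresh directory instead of trying to control individual file names:

// sc is an existing SparkContext; the incoming RDD is stubbed out here
val newRdd = sc.parallelize(Seq("record-1", "record-2"))

// merge by the actual record contents, not by file name
val existing = sc.textFile("hdfs:///data/current")
val merged = existing.union(newRdd).distinct()

// write to a new directory, then swap it in place of hdfs:///data/current
merged.saveAsTextFile("hdfs:///data/current_merged")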
I'm a newcomer to GCP, I'm learning every day, and I'm loving this platform.
I'm using GCP's Dataprep to join several CSV files (with the same column structure), clean up some data, and write to BigQuery.
I created a storage bucket to put all 60 CSV files in. In Dataprep, can I define a dataset to be the union of all these files, or do I have to create a dataset for each file?
Thank you very much for your time and attention.
If you have all your files inside a directory in GCS, you can import that directory as a single dataset. The process is the same as importing single files. Make sure, though, that the column structure is exactly the same for all the files inside the directory.
If you create a separate dataset for each file, you have more flexibility in the structure they can have when you use the UNION page to concatenate them.
However, if your use case is just to load all the files (~60) into a single table in BigQuery without any transformation, I would suggest just using a BigQuery load job. You can use a wildcard in the Cloud Storage URI to specify the files you want. Currently, BigQuery load jobs are free of charge, so this would be a very cost-effective solution compared to using Dataprep.
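For reference, a sketch of such a load job with the bq command-line tool, assuming hypothetical bucket, dataset, and table names and CSV files with a header row:

bq load --source_format=CSV --skip_leading_rows=1 --autodetect my_dataset.my_table "gs://my-bucket/csv-files/*.csv"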
I am regularly uploading data to a parquet file which I use for my data analysis, and I want to ensure that the data in my parquet file is not duplicated. The command I use to do this is:
df.write.parquet('my_directory/', mode='overwrite')
Does this ensure that all my non-duplicated data will not be deleted accidentally at some point?
Cheers
Overwrite, as the name implies, rewrites the whole data set at the path you specify.
That is, the data available in the df is written to the path after removing any old files already present there. You can think of this as a DELETE and LOAD scenario: you read all the records from the data source (say, Oracle), apply your transformations, delete the existing parquet files, and write the new content of the dataframe.
DataFrame.write supports a list of modes for writing the content to the target.
mode – specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default): Throw an exception if data already exists.
If your intention is to add new data to the parquet, then you have to use append, but this brings a new challenge: duplicates, if you are dealing with changing data.
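A minimal sketch of that pattern in Scala (the equivalent calls exist in the Python API), assuming hypothetical paths and that newBatch is the incoming dataframe; new batches are appended, and a deduplicated copy is periodically rewritten to a separate directory so no job reads from and overwrites the same path at once:

// append the new batch to the existing parquet data (newBatch is a hypothetical dataframe)
newBatch.write.mode("append").parquet("my_directory/")

// periodically rewrite a deduplicated copy to a separate path
val deduped = spark.read.parquet("my_directory/").dropDuplicates()
deduped.write.mode("overwrite").parquet("my_directory_deduped/")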
Does this ensure that all my non-duplicated data will not be deleted accidentally at some point?
No. mode='overwrite' only ensures that if data already exists in the target directory, then the existing data would be deleted and new data would be written (analogous to truncate and load in RDBMS tables).
If you want to ensure there are no record-level duplicates, the easiest thing to do is this:
df1 = df.dropDuplicates()
df1.write.parquet('my_directory/', mode='overwrite')
I know parquet files store metadata, but is it possible to add custom metadata to a parquet file, using Scala (preferably) with Spark?
The idea is that I store many similarly structured parquet files in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the parquet file). I'd like to access this information without the overhead of actually reading the parquet data, and possibly even remove this redundant column from the parquet.
I really don't want to put this info in a filename, so my best option right now is just to read the first row of each parquet and use the source column as a String field.
It works, but I was just wondering if there is a better way.
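One possible alternative, sketched below with hypothetical paths, column, and key names: Spark can attach key/value metadata to a column through its StructField metadata, which is persisted in the parquet footer alongside the schema, and spark.read.parquet resolves the schema from the footers without scanning the rows, so the value can be recovered cheaply:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder.appName("metadata-demo").getOrCreate()
val df = spark.range(5).toDF("id")   // stand-in for the real data

// attach a custom key/value to the column's metadata ("source" key is hypothetical)
val md = new MetadataBuilder().putString("source", "clientA").build()
df.withColumn("id", df("id").as("id", md)).write.parquet("path/to/tagged")

// recover the value from the schema alone, without reading any rows
val source = spark.read.parquet("path/to/tagged").schema("id").metadata.getString("source")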
I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation is done frequently (every hour), I need to append those multiple files to the source file.
I found that FileUtil provides a copyMerge function, but it doesn't allow appending one file to another.
Thank you for your help
You can do this in two ways. sc.textFile accepts a comma-separated list of paths, so:
sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")
Or as #Pushkr has proposed
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"),..)).coalesce(1).saveAsTextFile("path/newSource")
If you don't want to create a new source but instead overwrite the same source every hour, you can use DataFrames with save mode overwrite (How to overwrite the output directory in spark).
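A minimal sketch of that DataFrame variant, assuming text files and hypothetical paths; writing to a staging directory first avoids reading from and overwriting the same path in a single job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("hourly-merge").getOrCreate()

// read the existing merged data plus the new hourly files (paths are hypothetical)
val merged = spark.read.text("path/source", "path/file1", "path/file2")

// write out a single file to a staging directory, then swap it in place of path/source
merged.coalesce(1).write.mode("overwrite").text("path/newSource")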