Storing & reading custom metadata in parquet files using Spark / Scala

I know Parquet files store metadata, but is it possible to add custom metadata to a Parquet file from Spark, preferably using Scala?
The idea is that I store many similarly structured Parquet files in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the Parquet file). I'd like to access this information without the overhead of actually reading the Parquet data, and possibly even remove this redundant column from the file.
I really don't want to put this info in a filename, so my best option right now is to read the first row of each Parquet file and use the value of the source column.
It works, but I was wondering if there is a better way.
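If only the footer metadata is needed, it can be read without scanning any row data at all. Below is a minimal sketch using parquet-mr's `ParquetFileReader` (it assumes parquet-hadoop is on the classpath; the `"source"` key is a hypothetical name for the custom entry you would have written):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import scala.jdk.CollectionConverters._

def readSourceFromFooter(path: String): Option[String] = {
  val conf = new Configuration()
  val reader =
    ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), conf))
  try {
    // Key/value metadata lives in the file footer, so only a few bytes
    // are read instead of the whole row data.
    val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData.asScala
    kv.get("source")
  } finally reader.close()
}
```

This reads only the footer of each file, which is far cheaper than reading the first row through Spark.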

Related

Spark Parquet Examine Metadata

One of Parquet's key features is metadata, including custom metadata.
However, I have been completely unable to read this metadata from Spark.
I have parquet files that contain file-level metadata describing the data within. How can I access that metadata from Spark?
I'm currently using Scala for my Spark applications and reading the files into a dataframe using spark.read.parquet.
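Spark does not expose raw Parquet footer key/value pairs through the DataFrame API, but per-column metadata attached through Spark's own schema does survive a Parquet round trip, because Spark stores its schema (including field metadata) in the footer. A sketch, where `"fileLevelInfo"` and the path are hypothetical names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.MetadataBuilder
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("metadata-demo").getOrCreate()
import spark.implicits._

// Attach metadata to a column before writing…
val meta = new MetadataBuilder().putString("fileLevelInfo", "sensor-A").build()
val df = Seq((1, "x")).toDF("id", "payload")
  .withColumn("id", col("id").as("id", meta))
df.write.mode("overwrite").parquet("/tmp/metadata-demo")

// …and read it back from the schema after spark.read.parquet.
val readBack = spark.read.parquet("/tmp/metadata-demo")
val info = readBack.schema("id").metadata.getString("fileLevelInfo")
// info should be "sensor-A"
```

For metadata written by other tools under arbitrary footer keys, dropping down to parquet-mr's `ParquetFileReader` is the usual route.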

How to save and merge RDD data in scala?

I want to merge existing data in HDFS with new data arriving from an RDD (matching by the actual data inside the files, not by filename).
I found out there is no way to control output file names in the rdd.saveAsTextFile API, so I cannot keep both simply by saving them under different names.
I tried to merge them with Hadoop's FileUtil.copyMerge function, but I'm using Hadoop 3, where this API has been removed.
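Since FileUtil.copyMerge was removed in Hadoop 3, one option is to reimplement its behavior with the FileSystem API. A minimal sketch (paths and error handling are simplified; the function name is my own):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Concatenate every part file under srcDir into a single dstFile.
def copyMerge(srcDir: String, dstFile: String): Unit = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val out = fs.create(new Path(dstFile))
  try {
    // listStatus order is not guaranteed; sort for a stable result.
    val parts = fs.listStatus(new Path(srcDir)).map(_.getPath)
      .filter(p => !p.getName.startsWith("_")) // skip _SUCCESS markers
      .sortBy(_.getName)
    for (part <- parts) {
      val in = fs.open(part)
      try IOUtils.copyBytes(in, out, conf, false)
      finally in.close()
    }
  } finally out.close()
}
```

This only makes sense for formats where plain byte concatenation is valid (text, CSV); it is not safe for Parquet or Avro files.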

How to convert hadoop avro, parquet, as well as text file to csv without spark

I have HDFS versions of avro, parquet, and text files. Unfortunately, I can't use Spark to convert them to CSV. I saw from an earlier SO question that this doesn't seem to be possible: How to convert HDFS file to csv or tsv.
Is this possible, and if so, how do I do this?
This will help you to read Avro files (just avoid schema evolution/modifications...).
Example.
As for Parquet, you can use parquet-mr; take a look at ParquetReader.
Example: ignore the Spark usage there; they only use it to create a Parquet file that is later read with ParquetReader.
Hope it helps
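To make the Parquet part concrete, here is a sketch of Parquet → CSV without Spark, using parquet-mr's example Group API (assumes parquet-hadoop is on the classpath; CSV quoting/escaping and nested types are not handled):

```scala
import java.io.PrintWriter
import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.example.GroupReadSupport

def parquetToCsv(src: String, dst: String): Unit = {
  val reader =
    ParquetReader.builder(new GroupReadSupport(), new Path(src)).build()
  val out = new PrintWriter(dst)
  try {
    var group: Group = reader.read()
    while (group != null) {
      val fieldCount = group.getType.getFieldCount
      // getValueToString renders each primitive value as text;
      // nested or repeated fields need more careful handling.
      val row = (0 until fieldCount).map(i => group.getValueToString(i, 0))
      out.println(row.mkString(","))
      group = reader.read()
    }
  } finally { reader.close(); out.close() }
}
```

The same shape works for Avro with AvroParquetReader, or with the Avro DataFileReader for plain .avro files.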

Pyspark & HDFS: Add new dataframe column to existing parquet files in hdfs

Let me first start with my scenario:
I have a huge dataframe stored in HDFS. I load the dataframe in a Spark session
and create a new column without changing any of the existing content. After this, I want to store the dataframe back to the original directory in HDFS.
Now, I know I can practically do this with Spark's write operation in the fashion df.write.parquet("my_df_path", mode="overwrite"). Since the data is immense, I'm investigating whether there is, so to speak, a column-wise append mode or method that does not write the complete dataframe back but only the difference to the stored data. The final target is to save both memory and computational effort for the HDFS system.
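Parquet files are immutable, so there is no column-wise append: the whole dataset has to be rewritten. What can be avoided is the unsafe pattern of overwriting the directory you are still reading from. A common workaround, sketched in Scala with hypothetical paths, is to write to a temporary location and swap directories afterwards:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("add-column").getOrCreate()

val srcPath = "/data/my_df_path"
val tmpPath = "/data/my_df_path_tmp"

// Add the new column and write to a temporary directory first.
val updated = spark.read.parquet(srcPath).withColumn("new_col", lit(0))
updated.write.mode("overwrite").parquet(tmpPath)

// Swap directories only after the write has fully succeeded.
val fs = new Path(srcPath)
  .getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(srcPath), true)
fs.rename(new Path(tmpPath), new Path(srcPath))
```

This doubles the storage transiently but avoids the failure mode where an overwrite of the source directory destroys the data mid-read.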

Read multiple parquet files with different schema in scala

I have multiple parquet files in different directories and would like to read them in sequence, parameterized, in Scala.
The problem is that the schema information is not standard and column names vary drastically.
For example: what might be called load_date in one directory can be called load_dt in a parquet file from another directory.
So I'm being forced to use a different read.parquet().select statement for each directory (there are more than 30).
Is there a way I can use the same statement and switch schema information based on a parameter of some sort? Maybe a client name or ID?
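One way to parameterize this is to keep a per-client map from source column names to canonical names and fold it over the dataframe with withColumnRenamed. A sketch, where the client ids and column names are hypothetical:

```scala
import org.apache.spark.sql.DataFrame

// Per-client mapping: source column name -> canonical column name.
val renames: Map[String, Map[String, String]] = Map(
  "clientA" -> Map("load_date" -> "load_date"),
  "clientB" -> Map("load_dt"   -> "load_date")
)

// Apply every rename for the given client to the dataframe.
def normalize(df: DataFrame, client: String): DataFrame =
  renames.getOrElse(client, Map.empty).foldLeft(df) {
    case (d, (from, to)) => d.withColumnRenamed(from, to)
  }

// Usage, with pathFor as a hypothetical directory lookup:
// val df = normalize(spark.read.parquet(pathFor(client)), client)
```

The map could equally be loaded from a config file, so adding a 31st directory means adding an entry rather than another select statement.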