Read part files from hdfs into data frame using pySpark

I have multiple files stored in an HDFS location as follows:
/user/project/202005/part-01798
/user/project/202005/part-01799
There are 2000 such part files. Each file has the following format:
{'Name':'abc','Age':28,'Marks':[20,25,30]}
{'Name':...}
and so on. I have 2 questions:
1) How can I check whether these are multiple files or multiple partitions of the same file?
2) How can I read these into a data frame using PySpark?

Since these files are in one directory and are named part-xxxxx, you can safely assume they are multiple part files of the same dataset. If they were partitions of a partitioned dataset, they would be saved under a path like /user/project/date=202005/*
You can pass the directory "/user/project/202005" as input to Spark as shown below, assuming these are CSV files:
df = spark.read.csv('/user/project/202005/*',header=True, inferSchema=True)
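The sample records in the question look like one JSON-style object per line rather than CSV, so the JSON reader may be a better fit. The following is a minimal sketch under that assumption. The function names are hypothetical, `spark` is assumed to be an existing SparkSession, and the single-quoted strings are expected to parse because Spark's JSON reader allows single quotes by default (option `allowSingleQuotes`).

```python
import ast


def read_json_lines(spark, path="/user/project/202005"):
    """Read every part file under `path` into one DataFrame.

    `spark` is an existing SparkSession (assumption). The JSON reader
    expects one object per line and merges all part files in the directory.
    """
    return spark.read.json(path)


def rows_per_file(df):
    """Count rows per input file, to see how records spread over part files."""
    # Imported here so the module loads even where PySpark is not installed.
    from pyspark.sql.functions import input_file_name

    return df.groupBy(input_file_name().alias("file")).count()


def parse_record(line):
    """Parse one sample line locally, e.g. to inspect a file before loading."""
    return ast.literal_eval(line)
```

Calling `read_json_lines(spark)` and then `printSchema()` on the result shows whether Name, Age and Marks were inferred as expected. `rows_per_file` is one way to confirm that all 2000 part files contributed rows to the same DataFrame.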

Related

How do I create a metadata file in HDFS when writing a Parquet file as output from a Dataframe in PySpark?

I have a Spark transformation program which reads 2 Parquet files and creates one final Dataframe which is then written to a Parquet file in another directory in HDFS.
Is there a way to create a metadata/schema file for the Parquet output in the same directory as the Parquet file in HDFS?
We require this metadata/schema file for another processing step.

Assuming the consumer of the meta file is not the consumer of the Parquet file (otherwise a meta file is redundant, because the schema is embedded in the Parquet format), you could use the schema property of the DataFrame and write it to a file as a string.
Note that you cannot write this meta file to the same path as the Parquet file, because you will get an error when you try to read the Parquet file back. You could write it to the parent directory instead.
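A minimal PySpark sketch of that idea follows. The helper names and the ".schema.json" suffix are hypothetical, and `spark` and `df` are assumed to be an existing SparkSession and the DataFrame being written. It relies on `df.schema.json()`, which returns the schema as a JSON string, and places the output beside the Parquet directory rather than inside it.

```python
import posixpath


def schema_file_path(parquet_path, suffix=".schema.json"):
    """Return a sibling path next to (not inside) the Parquet directory."""
    parent, name = posixpath.split(parquet_path.rstrip("/"))
    return posixpath.join(parent, name + suffix)


def schema_as_json(df):
    """Serialize the DataFrame schema as a JSON string."""
    return df.schema.json()


def write_schema_file(spark, df, parquet_path):
    """Write the schema as single-partition text output beside the data."""
    target = schema_file_path(parquet_path)
    # saveAsTextFile creates a directory at `target` holding one part file.
    spark.sparkContext.parallelize([schema_as_json(df)], 1).saveAsTextFile(target)
    return target
```

Because `saveAsTextFile` writes a directory, the downstream job would read the part file inside it. If a single plain file is required, the JSON string could instead be written through whatever filesystem client the cluster provides.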

Read multiple parquet files with different schema in scala

I have multiple Parquet files in different directories and would like to read them in sequence, parameterized, in Scala.
The problem is that the schema is not standardized and column names vary drastically.
For example, what is called load_date in one directory can be called load_dt in a Parquet file from another directory.
So I'm forced to use a different read.parquet().select statement for each directory (there are more than 30).
Is there a way to use the same statement and switch the schema information based on a parameter of some sort, such as a client name or ID?

Storing & reading custom metadata in parquet files using Spark / Scala

I know Parquet files store metadata, but is it possible to add custom metadata to a Parquet file using Spark, preferably from Scala?
The idea is that I store many similarly structured Parquet files in Hadoop storage, and each has a uniquely named source (a String field, also present as a column in the Parquet file). I'd like to access this information without the overhead of actually reading the Parquet data, and possibly even remove this redundant column from the file.
I really don't want to put this info in a filename, so my best option for now is to read the first row of each Parquet file and use the source column as the String field.
It works, but I was just wondering if there is a better way.

Append/concatenate two files using spark/scala

I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation is done frequently (every hour), I need to append those files to the source file.
I found that FileUtil provides the copyMerge function, but it doesn't allow appending two files.
Thank you for your help

You can do this in two ways. textFile accepts a comma-separated list of paths in a single string:
sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")
Or, as @Pushkr has proposed:
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"),..)).coalesce(1).saveAsTextFile("path/newSource")
If you don't want to create a new source and would rather overwrite the same source every hour, you can use a DataFrame with save mode overwrite (see "How to overwrite the output directory in spark").
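The DataFrame save modes can be sketched as follows. This is a PySpark illustration rather than Scala (the DataFrame reader and writer calls have the same names in both), the function name is hypothetical, and `spark` is assumed to be an existing SparkSession. It uses mode "append" instead of "overwrite", on the assumption that adding new part files to the source directory is acceptable. Overwriting a path that the same job also reads from generally fails, so append sidesteps that.

```python
def append_text_files(spark, new_paths, source_path):
    """Append the lines of `new_paths` to the dataset stored at `source_path`.

    mode("append") adds new part files to the existing directory and leaves
    the existing ones untouched, so the source is never read and rewritten
    in the same job.
    """
    # read.text accepts a list of paths and yields a single `value` column.
    lines = spark.read.text(list(new_paths))
    lines.write.mode("append").text(source_path)
    return lines
```

A call such as `append_text_files(spark, ["path/file1", "path/file2"], "path/source")` could run every hour. The source then remains a directory of part files, which Spark reads back as one dataset.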

Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not a single text file but a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?

Well, I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file should look something like part-00000. This is because your RDD is distributed across your cluster in what are called partitions. When you call save (any of the save functions), it writes one file per partition. So by calling coalesce(1) you are saying you want one partition.
Hope this helps.

After you finish provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebook for your cluster at: https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebooks with examples of how to do this.
Specifically, for Scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
CSV is not natively supported by Spark, so there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
Hope this helps!
