I have multiple files stored in an HDFS location, as follows:
/user/project/202005/part-01798
/user/project/202005/part-01799
There are 2000 such part files. Each file is of the format
{'Name':'abc','Age':28,'Marks':[20,25,30]}
{'Name':...}
and so on. I have two questions:
1) How can I check whether these are multiple files or multiple partitions of the same file?
2) How can I read these into a data frame using PySpark?
Because these files sit in one directory and are named part-xxxxx, you can safely assume they are multiple part files of the same dataset. If they were partitions, they would be saved under a path like /user/project/date=202005/*
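You can pass the directory "/user/project/202005" to Spark as input. Assuming the files are CSV, that looks like the line below; for the JSON-style lines shown in the question, spark.read.json with the same path would be the matching reader.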
df = spark.read.csv('/user/project/202005/*',header=True, inferSchema=True)
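Since the sample lines in the question look like one JSON object per line rather than CSV, here is a minimal PySpark sketch covering both questions. The helper names partition_columns and read_part_files are made up for illustration, and the read relies on Spark's JSON reader accepting single-quoted strings (the allowSingleQuotes option, which as far as I know is on by default and is set explicitly here).

```python
import re

# Hive-style partition directories look like key=value, e.g. date=202005.
_PARTITION_SEGMENT = re.compile(r"^([^=/]+)=([^=/]*)$")


def partition_columns(path):
    """Return the key=value directory segments found in a path.

    An empty dict means the path has no Hive-style partition directories,
    so the part-xxxxx files under it are just pieces of one dataset.
    """
    found = {}
    for segment in path.strip("/").split("/"):
        match = _PARTITION_SEGMENT.match(segment)
        if match:
            found[match.group(1)] = match.group(2)
    return found


def read_part_files(spark, directory):
    """Read every part file in a directory as one DataFrame (JSON lines)."""
    # Set explicitly because the sample uses 'Name' rather than "Name".
    return spark.read.option("allowSingleQuotes", "true").json(directory)
```

With a live SparkSession, read_part_files(spark, "/user/project/202005") followed by printSchema() should list the Name, Age and Marks columns if the files really are JSON lines.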
Related
I have a Spark transformation program which reads 2 Parquet files and creates one final Dataframe which is then written to a Parquet file in another directory in HDFS.
Is there a way to create a meta data/Schema file of the Parquet in the same directory as the parquet in HDFS?
We require this metadata/schema file for another processing.
Assuming the consumer of the meta file is not the consumer of the parquet file (as then a meta file is redundant as the schema is embedded in parquet format), you could use the schema property on the dataframe and write that to a file as a string.
Note that you cannot write this meta file to the same path as the parquet file, because you will get an error when you try to read the parquet file back. You could write it to the parent directory instead.
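As a rough PySpark sketch of that idea: serialise the schema with df.schema.json() and write it beside the Parquet directory rather than inside it. The function names and the ".schema.json" suffix are assumptions chosen for illustration, and saveAsTextFile is just one way to get the string onto HDFS (it creates a directory with a single part file).

```python
import posixpath


def schema_sidecar_path(parquet_path, suffix=".schema.json"):
    """Build a sibling path next to (not inside) the Parquet output directory."""
    parent, name = posixpath.split(parquet_path.rstrip("/"))
    return posixpath.join(parent, name + suffix)


def save_schema(spark, df, parquet_path):
    """Write the DataFrame schema as JSON text beside the Parquet output."""
    target = schema_sidecar_path(parquet_path)
    # df.schema.json() serialises the StructType to a JSON string.
    # saveAsTextFile creates a directory at `target` holding one part file.
    spark.sparkContext.parallelize([df.schema.json()], 1).saveAsTextFile(target)
    return target
```

A consumer that does use Spark can rebuild the schema from that text with StructType.fromJson(json.loads(text)); other consumers can parse it as ordinary JSON.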
I have multiple parquet files in different directories and would like to read them in sequence by parameterization in Scala.
The problem is the schema information is not standard and column names vary drastically.
For example, what is called load_date in one directory might be called load_dt in a parquet file from another directory.
So I'm being forced to use a different read.parquet().select statement for each directory (there are more than 30).
Is there a way I can use the same statement and switch schema information based on a parameter of some sort? Maybe a client name or ID?
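One possible approach is a lookup table keyed by client that maps each source column name to a standard name, applied right after the read. The sketch below is in PySpark (the same idea carries over to Scala with withColumnRenamed); the client keys and the "loaddate" alias are hypothetical, only load_dt and load_date come from the question.

```python
# Hypothetical per-client aliases: source column name -> standard name.
COLUMN_ALIASES = {
    "client_a": {"load_dt": "load_date"},
    "client_b": {"loaddate": "load_date"},
}


def rename_plan(columns, client):
    """Pair every column with its standard name for the given client."""
    aliases = COLUMN_ALIASES.get(client, {})
    return [(name, aliases.get(name, name)) for name in columns]


def read_standardised(spark, path, client):
    """Read a Parquet directory and rename its columns to the standard names."""
    df = spark.read.parquet(path)
    for old, new in rename_plan(df.columns, client):
        if old != new:
            df = df.withColumnRenamed(old, new)
    return df
```

The mapping could just as well live in a config file or a small table, so that adding a 31st directory means adding one entry rather than another select statement.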
I know parquet files store metadata, but is it possible to add custom metadata to a parquet file using Spark, preferably from Scala?
The idea is that I store many similar structured parquet files in a Hadoop storage, but each has a uniquely named source (String field, also present as column in the parquet file), however, I'd like to access this information without creating the overhead of actually reading the parquet and possibly even removing this redundant column from the parquet.
I really don't want to put this info in a filename, so my best option is now just to read the first line of each parquet and use the source column as String field.
It works, but I was just wondering if there is a better way.
I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation runs frequently (every hour), I need to append those files to the source file each time.
I found that FileUtil provides a copyMerge function, but it doesn't allow appending two files.
Thank you for your help
You can do this with two methods:
sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")
Or, as @Pushkr has proposed:
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"),..)).coalesce(1).saveAsTextFile("path/newSource")
If you don't want to create a new source each time and would rather overwrite the same source every hour, you can use a DataFrame with save mode overwrite (see "How to overwrite the output directory in Spark").
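For reference, the first method could be sketched in PySpark as below. The function name is made up for illustration; it assumes the target directory does not exist yet and is different from the source, since saveAsTextFile refuses to write into an existing directory.

```python
def merge_into_new_source(sc, source, new_files, target):
    """Union the current source with the new files and write one part file."""
    # textFile accepts several paths as a single comma-separated string.
    paths = ",".join([source] + list(new_files))
    # coalesce(1) funnels everything into one partition, hence one output file.
    sc.textFile(paths).coalesce(1).saveAsTextFile(target)
    return paths
```

Called as merge_into_new_source(sc, "path/source", ["path/file1", "path/file2"], "path/newSource"), this reads the three inputs together and leaves a single part file under path/newSource.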
I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob storage:
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not as a single text file but as a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?
Well, I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file should look something like part-00000. This is because your RDD is distributed across your cluster in what are called partitions. When you call save (any of the save functions), Spark writes one file per partition. So by calling coalesce(1) you're telling Spark you want a single partition.
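For PySpark users, the same filter-then-coalesce pattern might look like the sketch below. The function names are invented for illustration, and unlike the Scala one-liner in the question, the predicate skips rows with fewer than seven fields instead of failing on them.

```python
def has_single_char_field(line, index=6):
    """True when the CSV field at `index` (0-based) is exactly one character."""
    fields = line.split(",")
    # Guard against short rows instead of raising IndexError.
    return len(fields) > index and len(fields[index]) == 1


def save_single_file(rdd, target):
    """Write the matching rows as one part file inside the `target` directory."""
    rdd.filter(has_single_char_field).coalesce(1).saveAsTextFile(target)
```

Keep in mind that coalesce(1) pushes all the data through one task, so it is only reasonable when the filtered output is small.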
Hope this helps.
After you have finished provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebook for your cluster at: https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebooks with examples of how to do this.
Specifically, for Scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
There is no built-in way to write an RDD to a CSV file (the built-in CSV data source in Spark 2.0 and later works on DataFrames, not RDDs). However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
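One caveat with joining fields by a plain comma is that values which themselves contain commas or quotes are not escaped. A small PySpark-side sketch that uses Python's csv module for the quoting (the function name to_csv_line is made up for illustration):

```python
import csv
import io


def to_csv_line(fields):
    """Format one row as a CSV line, quoting fields that contain commas or quotes."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(fields)
    return buffer.getvalue()
```

It would be used as rdd.map(to_csv_line).saveAsTextFile(path); the output is still a directory of part files, as with any saveAsTextFile call.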
Hope this helps!