Append columns to existing CSV file in HDFS - scala

I am trying to append columns to an existing CSV file in HDFS.
Script1:
someDF1.repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .mode("append")
  .option("sep", "\t")
  .option("header", "true")
  .save("folder/test_file.csv")
Error:
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory.
Any suggestions on the mistake would be helpful

CSV files don't support schema evolution. So what you have to do is read the existing data from the target path and then add the new column to that dataframe with some default value.
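For example, a minimal sketch of that first read (assuming the existing data sits at the path from the question and was written with the same separator and header options):
val dfWithExistingData = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("folder/test_file.csv")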
val oldDF = dfWithExistingData.withColumn("new_col", lit(null).cast("string")) // cast the null literal; the CSV writer typically rejects a NullType column
You can then union or merge this dataframe with the new dataset.
val targetData = oldDF.union(newDF)
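If there is any chance the two dataframes list their columns in a different order, unionByName (available since Spark 2.3) is a safer way to merge them, since plain union matches columns by position:
// Matches columns by name instead of by position
val targetData = oldDF.unionByName(newDF)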
You can then write the data back to your target path in overwrite mode.
targetData
.repartition(1)
.write
.format("com.databricks.spark.csv")
.mode("overwrite")
.option("sep", "\t")
.option("header","true")
.save("folder")
Alternative: you can switch to another file format that supports schema evolution, e.g. Parquet, to avoid doing the above process.
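For illustration, a rough sketch of the Parquet route (the path here is hypothetical): appended files may carry extra columns, and the mergeSchema option reconciles old and new schemas at read time.
// Append the new data, which contains the extra column, to the existing Parquet path
newDF.write.mode("append").parquet("folder_parquet")

// Read everything back with schema merging, so the new column shows up (null for the old rows)
val merged = spark.read.option("mergeSchema", "true").parquet("folder_parquet")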

Related

Table or view not found when reading existing delta table

I am new to Delta Lake. I was trying a simple example:
Create a dataframe from a CSV.
Save it as a Delta table.
Read it again.
It works fine. I can see the files are created in the default spark-warehouse folder.
But the next time I just want to read the saved table, so I comment out the code for the first two steps and re-run the program. I get:
AnalysisException: Table or view not found
val transHistory = spark.read
  .option("header", "true")
  .option("inferSchema", true)
  .csv(InputPath + "trainHistory.csv")
transHistory.write.format("delta").mode(SaveMode.Overwrite).saveAsTable("transactionshistory")
val transHistoryTable = spark.read.format("delta").table("transactionshistory")
transHistoryTable.show(10)
I am using Delta Lake 0.8.0, Spark 3.0, and Scala 2.12.13.

Group Cassandra Rows Then Write As Parquet File Using Spark

I need to write Cassandra partitions as Parquet files. Since I cannot share and use the sparkSession inside a foreach function, I first call the collect method to gather all the data in the driver program and then write the Parquet files to HDFS, as below.
Thanks to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
I am able to get my partitioned rows. I want to write each group of partitioned rows into a separate Parquet file as the partition is read from the Cassandra table. I also tried sparkSQLContext, but that method writes task results as temporary files; I think I will only see the Parquet files after all the tasks are done.
Is there any convenient method for this?
val keyedTable: CassandraTableScanRDD[(Tuple2[Int, Date], MyCassandraTable)] = getTableAsKeyed()
keyedTable.groupByKey
  .collect
  .foreach(f => {
    import sparkSession.implicits._
    val items = f._2.toList
    val key = f._1
    val baseHDFS = "hdfs://mycluster/parquet_test/"
    val ds = sparkSession.sqlContext.createDataset(items)
    ds.write
      .option("compression", "gzip")
      .parquet(baseHDFS + key._1 + "/" + key._2)
  })
Why not use Spark SQL everywhere and rely on the built-in functionality of Parquet to write data by partitions, instead of creating the directory hierarchy yourself?
Something like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table", "keyspace").load()
data.write
.option("compression", "gzip")
.partitionBy("col1", "col2")
.parquet(baseHDFS)
In this case, it will create a separate nested directory for every combination of col1 & col2 values, with names like ${column}=${value}. Then, when you read, you can restrict the read to specific values only.
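For example, a hedged sketch of such a restricted read, reusing the baseHDFS path from the question (the filter value is made up); Spark prunes the ${column}=${value} directories that don't match:
import org.apache.spark.sql.functions.col

// Only the directories under col1=42 are scanned, thanks to partition pruning
val onePartition = spark.read.parquet(baseHDFS).filter(col("col1") === 42)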

How can I split a dataframe into different dataframes and save each one in a different file?

var df = sparkSession.read
  .option("delimiter", delimiter)
  .option("header", true) // Use first line of all files as header
  // .schema(customSchema)
  .option("inferSchema", "true") // Automatically infer data types
  .format("csv")
  .load(filePath)

df.show()
df.write.partitionBy("outlook").csv("output/weather.csv")
But the output was saved without that column's values. For example:
hot,high,false,yes
cool,normal,true,yes
The expected output for the overcast file is:
overcast,hot,high,false,yes
overcast,cool,normal,true,yes
When you partition your data to write it, Spark creates subfolders following the HDFS partitioning conventions. Here you'll get a subfolder for each "outlook" value found in the dataset. All the files in the "outlook=overcast" subdirectory only concern the records for which the outlook is overcast. So there is no need to store the outlook column in your data; its value would be the same across all the files in a given subdirectory.
When reading your data back through Hive or Spark, for instance, you'll have to specify that the outlook subdirectories are indeed partitions, so that a logical column can be used for projection, grouping, filtering, or whatever you want to do.
In Spark this can be expressed by specifying the basePath option:
val df = spark.read.option("basePath", "output/weather.csv").csv("output/weather.csv/*")
If you really need to store the outlook column in each file then maybe partitioning is not what you need.
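That said, if you do want the value physically repeated in every file, one possible workaround, sketched here with a hypothetical outlook_copy column, is to duplicate the column before writing, so the copy lands in the files while the original still drives the directory layout:
import org.apache.spark.sql.functions.col

// "outlook_copy" (a made-up name) is written into each file; "outlook" becomes the directory name
df.withColumn("outlook_copy", col("outlook"))
  .write
  .partitionBy("outlook")
  .csv("output/weather.csv")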

Load file with schema information and dynamically apply to data file using Spark

I don't want to use the inferSchema and header options. The only way is to read a file containing only the column headers and use it dynamically to create a dataframe.
I am using Spark 2 to load a single CSV file with my user-defined schema, but I want to handle this dynamically: once I provide the path of only the schema file, it should read it, use it as the headers for the data, and convert the data to a dataframe with the schema provided in the schema file.
Suppose the folder I have provided contains 2 files. One file has only the data (a header is not compulsory); the second file has the schema (column names). So I have to read the schema file first, then the file containing the data, apply the schema to the data file, and show it in a dataframe.
Small example, schema.txt contains:
Custid,Name,Product
while the data file have:
1,Ravi,Mobile
From your comments I'm assuming the schema file only contains the column names and is formatted like a CSV file (with the column names as the header and without any data rows). The column types will be inferred from the actual data file and are not specified by the schema file.
In this case, the easiest solution would be to read the schema file as a CSV, setting header to true. This will give an empty dataframe, but with the correct header. Then read the data file and change the default column names to the ones in the schema dataframe.
val schemaFile = ...
val dataFile = ...
val colNames = spark.read.option("header", true).csv(schemaFile).columns
val df = spark.read
.option("header", "false")
.option("inferSchema", "true")
.csv(dataFile)
.toDF(colNames: _*)
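As a quick sanity check against the small example above (schemaFile and dataFile are the placeholder paths you would fill in), the resulting dataframe should carry the names from schema.txt and the single row from the data file:
df.printSchema() // Custid, Name, Product, with types inferred from the data file
df.show()        // one row: 1, Ravi, Mobile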

How to read a CSV file and then save it as JSON in Spark Scala?

I am trying to read a CSV file that has around 7 million rows and 22 columns.
How do I save it as a JSON file after reading the CSV into a Spark dataframe?
Read the CSV file as a dataframe:
val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
val df = spark.read.csv("path to csv")
Now you can perform some operations on df and save it as JSON:
df.write.json("output path")
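If the CSV actually has a header row, you would probably also want the header option (and, optionally, inferSchema) so the 22 columns keep their names instead of becoming _c0 ... _c21; a sketch under that assumption:
val dfWithHeader = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path to csv")

dfWithHeader.write.json("output path")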
Hope this helps!