In Spark, how do I write a header to a file when the DataFrame has no rows? - pyspark

I want to write a header to a file even when the DataFrame has no rows. Currently, when I write an empty DataFrame, the file is created but it has no header in it.
I am writing the DataFrame with these settings and this command:
Dataframe.repartition(1) \
.write \
.format("com.databricks.spark.csv") \
.option("ignoreLeadingWhiteSpace", False) \
.option("ignoreTrailingWhiteSpace", False) \
.option("header", "true") \
.save('/mnt/Bilal/Dataframe');
I want the header row in the file, even if there is no data row in a dataframe.

If you just want a header-only file, you can use foldLeft to replace each column with an empty string and save that as your CSV. I have not used PySpark, but this is how it can be done in Scala. Most of the code should be reusable; you will just have to convert it to PySpark.
import org.apache.spark.sql.functions.lit
val path = "/user/test"
val newdf = df.columns.foldLeft(df) { (tempdf, cols) =>
  tempdf.withColumn(cols, lit(""))
}
Create a method for writing the header file:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // Build the full path of the header file
  val fileName = "yourfileName.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  try {
    // mkString avoids the trailing comma that a per-column loop would leave
    writer.write(colNames.mkString(",") + "\n")
  } finally {
    writer.close()
  }
}
Call it on your DataFrame:
createHeaderFile(path, newdf.columns)

I had the same problem in PySpark. When the DataFrame was empty (e.g. after a .filter() transformation), the output was one empty CSV without a header.
So I created a custom method that checks whether the output is a single empty CSV. If so, it adds only the header.
import glob
import os
def add_header_in_one_empty_csv(exported_path, columns):
    # Spark writes part files into the output directory; collect the CSV ones
    list_of_csv_files = glob.glob(os.path.join(exported_path, '*.csv'))
    if len(list_of_csv_files) == 1:
        csv_file = list_of_csv_files[0]
        # A zero-byte file means Spark wrote neither rows nor a header
        if os.path.getsize(csv_file) == 0:
            with open(csv_file, 'w') as f:
                f.write(','.join(columns) + '\n')
Example:
# Create a dummy Dataframe
df = spark.createDataFrame([(1,2), (1, 4), (3, 2), (1, 4)], ("a", "b"))
# Filter in order to create an empty Dataframe
filtered_df = df.filter(df['a']>10)
# Write the df without rows and no header
filtered_df.write.csv('output.csv', header='true')
# Add the header
add_header_in_one_empty_csv('output.csv', filtered_df.columns)
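Joining the column names with ',' works for simple names, but a name that itself contains a comma or a quote would need quoting. A minimal sketch using Python's csv module, assuming the output lives on a local filesystem (not HDFS or DBFS); the helper name write_header_if_empty is hypothetical and not part of Spark:

```python
import csv
import os


def write_header_if_empty(csv_path, columns):
    """Write a header row to csv_path only when the file is empty or missing.

    Hypothetical helper: assumes csv_path is a local filesystem path.
    Returns True if a header was written, False otherwise.
    """
    if os.path.exists(csv_path) and os.path.getsize(csv_path) > 0:
        return False
    # newline='' lets the csv module control line endings itself
    with open(csv_path, 'w', newline='') as f:
        csv.writer(f).writerow(columns)
    return True
```

It could be called on the single part file found in the output directory, in the same place where the header is joined by hand.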

The same problem occurred to me. What I did was use pandas to store empty DataFrames instead.
from os.path import join
if df.count() == 0:
    df.coalesce(1).toPandas().to_csv(join(output_folder, filename_output), index=False)
else:
    df.coalesce(1).write.format("csv").option("header","true").mode('overwrite').save(join(output_folder, filename_output))

Related

How to remove footer from file while reading file in spark scala

I am trying to remove the footer from a file while reading it. Is there any option like "footer" = "true"?
The best approach is to use Unix tools to remove the footer from the file:
sed -i '$ d' foo.txt
If you want to do it the Spark way, you can first create the DataFrame, then convert it into an RDD and remove the last row.
Let's say df is your DataFrame after reading the file:
val cnt = df.count()
val rdd = df.rdd // convert df to rdd
// RDD without footer: zipWithIndex is zero-based, so the footer has index cnt - 1
val rddWithoutFoot = rdd.zipWithIndex().filter(x => x._2 < cnt - 1)
  .map(x => x._1)
// DataFrame without footer
val dfWithoutFoot = spark.createDataFrame(rddWithoutFoot, df.schema)
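Because zipWithIndex numbers rows from 0, the footer is the row at index count - 1. The same index logic in plain Python, as a sketch only; drop_footer is a hypothetical name and the list stands in for the RDD:

```python
def drop_footer(rows):
    """Return rows without the last element, using zero-based indices."""
    count = len(rows)
    # Keep indices 0 .. count - 2; index count - 1 is the footer
    return [row for index, row in enumerate(rows) if index < count - 1]
```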

dataframe.select, select dataframe columns from file

I am trying to create a child DataFrame from a parent DataFrame, but I have more than 100 columns to select.
So, in the select statement, can I give the columns from a file?
val Raw_input_schema=spark.read.format("text").option("header","true").option("delimiter","\t").load("/HEADER/part-00000").schema
val Raw_input_data=spark.read.format("text").schema(Raw_input_schema).option("delimiter","\t").load("/DATA/part-00000")
val filtered_data = Raw_input_data.select(all_cols)
How can I pass the column names from a file in all_cols?
I assume you would read the file from HDFS or from a shared config file? The reason is that on a cluster this code would be executed on an individual node.
In this case I would approach it with the following piece of code:
import scala.io.Source
import org.apache.spark.sql.functions.col
val lines = Source.fromFile("somefile.name.csv").getLines
val cols = lines.flatMap(_.split(",")).map(col(_)).toArray
val df3 = df2.select(cols: _*)
Essentially, you just have to provide an array of columns and use the : _* notation to pass a variable number of arguments.
Finally, this worked for me:
import org.apache.spark.sql.Column
val Raw_input_schema=spark.read.format("csv").option("header","true").option("delimiter","\t").load("headerFile").schema
val Raw_input_data=spark.read.format("csv").schema(Raw_input_schema).option("delimiter","\t").load("dataFile")
// Option 1: keep the column names as strings
val filtered_file = sc.textFile("filter_columns_file").map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList
val final_df = Raw_input_data.select(filtered_file.head, filtered_file.tail: _*)
// Option 2: wrap each column name in a Column
val filtered_cols = sc.textFile(filterFile).map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList.map(x => new Column(x))
val final_df2 = Raw_input_data.select(filtered_cols: _*)
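For PySpark users, the part that reads the column names can be sketched in plain Python. This assumes the file is on the local filesystem of the driver and is tab-delimited; read_column_names is a hypothetical helper name:

```python
def read_column_names(path, delimiter="\t"):
    """Read column names from a text file with one or more names per line."""
    names = []
    with open(path) as f:
        for line in f:
            # Strip whitespace and skip empty fields from blank lines
            # or trailing delimiters
            names.extend(
                name.strip() for name in line.split(delimiter) if name.strip()
            )
    return names
```

The resulting list could then be unpacked into a select call, for example df.select(*names) in PySpark.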

Create DataFrame / Dataset using Header and Data in two different directories

I am getting the input as CSV files in two directories: the first directory has one file with the header record, and the second directory has the data files. I want to create a DataFrame/Dataset from them.
One way I can do this is to create a case class, split the data files by delimiter, attach the schema, and create the DataFrame.
What I am looking for is a way to read the header file and the data files and create the DataFrame. I saw a solution using Databricks, but my organization restricts the use of Databricks; below is the code I came across. Can anyone help me with a solution that does not use Databricks?
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
You can do it like this
val schema=spark
.read
.format("csv")
.option("header","true")
.option("delimiter",",")
.load("C:\\spark\\programs\\empheaders.csv")
.schema
val data=spark
.read
.format("csv")
.schema(schema)
.option("delimiter",",")
.load("C:\\spark\\programs\\empdata.csv")
Because your header CSV file does not contain any data, there is no point in inferring the schema from it.
So just get the field names by reading it.
val headerRDD = sc.parallelize(Seq(("Name,Age,Sal"))) //Assume this line is in your Header CSV
val header = headerRDD.flatMap(_.split(",")).collect
//headerRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at parallelize at command-2903591155643047:1
//header: Array[String] = Array(Name, Age, Sal)
Then read the data CSV file.
Either map each line to a case class or a tuple. Convert the data to a DataFrame by passing the header array.
val dataRdd = sc.parallelize(Seq(("Tom,22,500000"),("Rick,40,1000000"))) //Assume these lines are in your data CSV file
val data = dataRdd.map(_.split(",")).map(x => (x(0),x(1).toInt,x(2).toDouble)).toDF(header: _*)
//dataRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[72] at parallelize at command-2903591155643048:1
//data: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 1 more field]
Result:
data.show()
+----+---+---------+
|Name|Age|      Sal|
+----+---+---------+
| Tom| 22| 500000.0|
|Rick| 40|1000000.0|
+----+---+---------+
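The idea of pairing a header file with a separate data file can also be shown without Spark. A minimal sketch with Python's csv module, assuming both files are small local CSVs; read_with_separate_header is a hypothetical name, and all values stay strings because no schema is inferred:

```python
import csv


def read_with_separate_header(header_path, data_path, delimiter=","):
    """Combine a header-only CSV with a data-only CSV into a list of dicts."""
    with open(header_path, newline="") as f:
        # The first line of the header file holds the field names
        header = next(csv.reader(f, delimiter=delimiter))
    with open(data_path, newline="") as f:
        # Skip blank lines, which csv.reader yields as empty lists
        return [
            dict(zip(header, row))
            for row in csv.reader(f, delimiter=delimiter)
            if row
        ]
```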

Scala csv file read and display the data in new column

I am new to Scala. I need to read data from a CSV file that has two columns, Name and Marks. Based on the Marks column, I want to show the result in a third column: Pass or Fail (less than 35 is Fail, more than 35 is Pass).
The data looks like this:
Name,Marks
x,10
y,50
z,80
Result should be:
Name Marks Result
x    10    Fail
y    50    Pass
z    80    Pass
You can read the CSV file with a header, then add a column using when and otherwise to assign different values depending on the marks.
import org.apache.spark.sql.functions.when
import spark.implicits._
val df = spark.read.option("header", true).csv("/path/to/csv") // read csv
val df2 = df.withColumn("Result", when($"Marks" < 35, "Fail").otherwise("Pass"))
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local")
.appName("").config("spark.sql.warehouse.dir", "C:/temp").getOrCreate()
val df = spark.read.option("header",true).csv("file path")
val resul = df.withColumn("Result", when(col("Marks").cast("Int")>=35, "PASS").otherwise("FAIL"))
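The question leaves a mark of exactly 35 unspecified; both snippets above treat it as a pass. The rule itself, as a plain Python sketch (result_for and pass_mark are hypothetical names, and marks are cast from text the way the second snippet casts the column):

```python
def result_for(marks, pass_mark=35):
    """Return "Pass" when marks reach pass_mark, otherwise "Fail"."""
    # CSV values arrive as strings, so cast before comparing
    return "Pass" if int(marks) >= pass_mark else "Fail"
```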

Spark: convert a CSV to RDD[Row]

I have a .csv file, which contains 258 columns in following structure.
["label", "index_1", "index_2", ... , "index_257"]
Now I want to transform this .csv file into an RDD[Row]:
val data_csv = sc.textFile("~/test.csv")
val rowRDD = data_csv.map(_.split(",")).map(p => Row( p(0), p(1).trim, p(2).trim))
If I do the transform in this way, I have to write down 258 columns specifically. So I tried:
val rowRDD = data_csv.map(_.split(",")).map(p => Row( _ => p(_).trim))
and
val rowRDD = data_csv.map(_.split(",")).map(p => Row( x => p(x).trim))
But these two do not work either and report this error:
error: missing parameter type for expanded function ((x$2) => p(x$2).trim)
Can anyone tell me how to do this transform? Thanks a lot.
You should use sqlContext instead of sparkContext:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", true)
.load("~/test.csv")
This will create a DataFrame. Calling .rdd on df should give you an RDD[Row]:
val rdd = df.rdd
Rather than reading it as a text file, read CSV files with spark-csv.
In your case:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.option("quote", "\"") // Character used to quote fields
.option("ignoreLeadingWhiteSpace", true) // Trim whitespace before each value
.load("cars.csv")
This loads the data as a DataFrame, which you can then easily convert to an RDD.
Hope this helps!
Apart from the other answers, which are correct, the way to do what you're trying to do is to use Row.fromSeq inside the map function.
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Array((1 to 258).toArray, (1 to 258).toArray) )
.map(Row.fromSeq(_))
This will turn your RDD into an RDD of Row:
Array[org.apache.spark.sql.Row] = Array([1,2,3,4,5,6,7,8,9,10...
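The point of Row.fromSeq is that the row is built from the whole sequence, so no column index has to be written out by hand. The same idea in plain Python, as a sketch only; line_to_row is a hypothetical name and a tuple stands in for Row:

```python
def line_to_row(line, delimiter=","):
    """Split one CSV line into a tuple of trimmed fields, whatever the width."""
    # The comprehension covers every field, so 3 or 258 columns work alike
    return tuple(field.strip() for field in line.split(delimiter))
```

A naive split like this does not handle quoted fields that contain the delimiter, which is one reason the CSV reader is the safer route.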