How to remove a footer from a file while reading it in Spark Scala

I am trying to remove the footer (last line) from a file while reading it. Is there an option like "footer" = "true"?

The simplest approach is to strip the footer with Unix before Spark reads the file:
sed -i '$ d' foo.txt
If you want to do it the Spark way, you can first read the file into a dataframe, convert it to an RDD, and drop the last row.
Let's say df is your dataframe after the file read:
val cnt = df.count()
val rdd = df.rdd // convert the dataframe to an RDD
// RDD without the footer: keep every row except the last one (index cnt - 1)
val rddWithoutFooter = rdd.zipWithIndex()
  .filter { case (_, idx) => idx < cnt - 1 }
  .map { case (row, _) => row }
// Dataframe without the footer
val dfWithoutFooter = spark.createDataFrame(rddWithoutFooter, df.schema)
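Putting it together, a minimal end-to-end sketch (assuming a CSV at the hypothetical path /data/foo.txt with a header row and a single trailing footer line):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("drop-footer").getOrCreate()

// read with the header option; the footer is still present as the last data row
val df = spark.read.option("header", "true").csv("/data/foo.txt")

val cnt = df.count()
val rddWithoutFooter = df.rdd.zipWithIndex()
  .filter { case (_, idx) => idx < cnt - 1 }
  .map { case (row, _) => row }

val dfWithoutFooter = spark.createDataFrame(rddWithoutFooter, df.schema)
dfWithoutFooter.show()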

Related

In Spark, how to write a header in a file if there are no rows in the dataframe?

I want to write a header to a file even if there are no rows in the dataframe. Currently, when I write an empty dataframe to a file, the file is created but it has no header in it.
I am writing the dataframe using these settings and command:
Dataframe.repartition(1) \
.write \
.format("com.databricks.spark.csv") \
.option("ignoreLeadingWhiteSpace", False) \
.option("ignoreTrailingWhiteSpace", False) \
.option("header", "true") \
.save('/mnt/Bilal/Dataframe');
I want the header row in the file even if there are no data rows in the dataframe.
If you want just a header file, you can use foldLeft to blank out each column and save that as your CSV, or write the header file directly. I have not used PySpark, but this is how it can be done in Scala; the majority of the code should be reusable, you will just have to convert it to PySpark.
import org.apache.spark.sql.functions.lit

val path = "/user/test"
val newdf = df.columns.foldLeft(df) { (tempdf, colName) =>
  tempdf.withColumn(colName, lit(""))
}
Then create a method for writing the header file:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // build the full header file path
  val fileName = "yourfileName.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  // write the column names as a single comma-separated line
  writer.write(colNames.mkString(","))
  writer.write("\n")
  writer.close()
}
Call it with your DataFrame's columns:
createHeaderFile(path, newdf.columns)
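A minimal sketch of how the pieces can fit together, assuming you also write the data itself without a header into the same (hypothetical) output directory, so the separate header file ends up alongside the part files:
val outputPath = "/user/test" // hypothetical output directory

// write the data part files without a header (may produce zero data rows)
df.write
  .format("csv")
  .option("header", "false")
  .mode("overwrite")
  .save(outputPath)

// always write the header file, even when df is empty
createHeaderFile(outputPath, df.columns)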
I had the same problem, in PySpark. When the dataframe was empty (e.g. after a .filter() transformation), the output was one empty CSV without a header.
So I created a custom method which checks whether the output is a single empty CSV; if so, it adds just the header.
import glob
import os

def add_header_in_one_empty_csv(exported_path, columns):
    list_of_csv_files = glob.glob(os.path.join(exported_path, '*.csv'))
    if len(list_of_csv_files) == 1:
        csv_file = list_of_csv_files[0]
        # only add the header if the single CSV file is empty
        if os.path.getsize(csv_file) == 0:
            with open(csv_file, 'w') as f:
                header = ','.join(columns)
                f.write(header + '\n')
Example:
# Create a dummy Dataframe
df = spark.createDataFrame([(1,2), (1, 4), (3, 2), (1, 4)], ("a", "b"))
# Filter in order to create an empty Dataframe
filtered_df = df.filter(df['a']>10)
# Write the df without rows and no header
filtered_df.write.csv('output.csv', header='true')
# Add the header
add_header_in_one_empty_csv('output.csv', filtered_df.columns)
The same problem occurred to me. What I did was use pandas to store empty dataframes instead.
from os.path import join

if df.count() == 0:
    df.coalesce(1).toPandas().to_csv(join(output_folder, filename_output), index=False)
else:
    df.coalesce(1).write.format("csv").option("header", "true").mode('overwrite').save(join(output_folder, filename_output))

Convert header (column names) to new dataframe

I have a dataframe with headers, for example outputDF. I now want to take outputDF.columns and create a new dataframe with just one row, which contains the column names.
I then want to union both these dataframes and write the result with option("header", "false"), which Spark can then write to HDFS.
How do I do that?
Below is an example of what I tried:
val df = spark.read.csv("path")
val newDf = df.columns.toSeq.toDF
val unionDf = df.union(newDf)
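One way to do this is sketched below: build a one-row dataframe holding the column names, cast the data columns to string so the schemas match, and union the header row on top (outputDF and the output path come from the question; everything else is illustrative):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// one-row dataframe whose single row contains the column names of outputDF
val headerSchema = StructType(outputDF.columns.map(c => StructField(c, StringType)))
val headerRow = spark.sparkContext.parallelize(Seq(Row.fromSeq(outputDF.columns.toSeq)))
val headerDf = spark.createDataFrame(headerRow, headerSchema)

// cast the data columns to string so the union is schema-compatible
val dataAsString = outputDF.select(outputDF.columns.map(c => col(c).cast("string").as(c)): _*)
val withHeaderRow = headerDf.union(dataAsString)

// write without Spark's own header; the first row already carries the column names
withHeaderRow.write.option("header", "false").csv("/some/hdfs/path")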

dataframe.select, select dataframe columns from file

I am trying to create a child dataframe from a parent dataframe, but I have more than 100 columns to select.
So, in the select statement, can I supply the column names from a file?
val Raw_input_schema=spark.read.format("text").option("header","true").option("delimiter","\t").load("/HEADER/part-00000").schema
val Raw_input_data=spark.read.format("text").schema(Raw_input_schema).option("delimiter","\t").load("/DATA/part-00000")
val filtered_data = Raw_input_data.select(all_cols)
How can I pass the column names from the file into all_cols?
I assume you would read the file from HDFS or from a shared config file? The reason is that on a cluster this code would be executed on individual nodes.
In that case I would approach it with the next piece of code:
import scala.io.Source
import org.apache.spark.sql.functions.col

val lines = Source.fromFile("somefile.name.csv").getLines
val cols = lines.flatMap(_.split(",")).map(col(_)).toArray
val df3 = df2.select(cols: _*)
Essentially, you just have to provide an array of Columns (or column-name strings) and use the : _* notation to pass them as a variable number of arguments.
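If the column-list file actually lives on HDFS rather than on the driver's local filesystem, one alternative is to read it through Spark itself and collect the names to the driver. A sketch, with a hypothetical path:
import org.apache.spark.sql.functions.col

// read the column-list file via Spark so HDFS paths work too
val colNames = spark.read.textFile("/config/filter_columns.csv")
  .collect()
  .flatMap(_.split(","))
  .map(_.trim)

val filtered_data = Raw_input_data.select(colNames.map(col(_)): _*)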
Finally, this worked for me:
val Raw_input_schema=spark.read.format("csv").option("header","true").option("delimiter","\t").load("headerFile").schema
val Raw_input_data=spark.read.format("csv").schema(Raw_input_schema).option("delimiter","\t").load("dataFile")
val filtered_file = sc.textFile("filter_columns_file").map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList
//or
val filtered_file = sc.textFile(filterFile).map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList.map(x => new Column(x))
val final_df=Raw_input_data.select(filtered_file.head, filtered_file.tail: _*)
//or
val final_df = Raw_input_data.select(filtered_file:_*)'

Spark Scala - How to apply transformation logic on a generic set of columns defined in a file

I am using Spark with Scala, version 1.6.
I have 2 files: one is a schema file which has hundreds of column names separated by commas, and the other is a .gz file which contains the data.
I am trying to read the data using the schema file and apply different transformation logic on a set of a few columns.
I tried running sample code, but I have hardcoded the column numbers in the attached pic.
I also want to write a UDF which could read any set of columns, apply a transformation such as replacing a special character, and give the output.
I would appreciate any suggestions.
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf

val rdd1 = sc.textFile("../inp2.txt")
val rdd2 = rdd1.map(line => line.split("\t")(1)).toDF("_1") // column number hardcoded here
val replaceUDF = udf { s: String => s.replace(".", "") }
rdd2.withColumn("replace", replaceUDF('_1)).show
You can read the field-name file with plain Scala code to create the list of column names, and then build the dataframe with a schema derived from those names:
import scala.io.Source
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// this reads the schema file and creates a list of column names
val line = Source.fromFile("path to file").getLines().toList.head
val columnNames = line.split(",")
// read the text file as an RDD of Rows and convert to a dataframe using those names
val rdd1 = sc.textFile("../inp2.txt")
val rows = rdd1.map(line => Row(line.split("\t"): _*))
val schema = StructType(columnNames.map(name => StructField(name, StringType)))
val df = spark.createDataFrame(rows, schema)
This creates a dataframe with the column names that you have in the separate file.
Hope this helps!
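To address the second part of the question (applying a transformation to an arbitrary set of columns), here is a minimal sketch, assuming the columns to transform are listed comma-separated in another hypothetical file and that the transformation is removing a "." character:
import scala.io.Source
import org.apache.spark.sql.functions.udf

// hypothetical file listing the columns that need the transformation
val colsToClean = Source.fromFile("cols_to_clean.txt").getLines().toList.head.split(",")

// UDF that strips the special character (null-safe)
val stripDots = udf { s: String => if (s == null) null else s.replace(".", "") }

// apply the same UDF to every listed column via foldLeft
val cleanedDf = colsToClean.foldLeft(df) { (tempDf, colName) =>
  tempDf.withColumn(colName, stripDots(tempDf(colName)))
}
cleanedDf.show()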

Spark Scala reduceByKey dataframe operation

I'm trying to do a count in Scala with a dataframe. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_ + _, 1)
I want to put the data in a dataframe, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b") // give the columns names for ease of use
df
  .groupBy("a")
  .count()
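Note that count() counts rows per key, while the original reduceByKey(_ + _) sums the values. If summing is what you want, a sketch of the equivalent DataFrame aggregation:
import org.apache.spark.sql.functions.sum

// sum column "b" per key "a", the DataFrame equivalent of reduceByKey(_ + _)
df
  .groupBy("a")
  .agg(sum("b").as("total"))
  .show()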