How do I skip a header from CSV files in Spark? - scala

Suppose I give three file paths to a Spark context to read, and each file has a schema in its first row. How can we skip the schema lines from the headers?
val rdd=sc.textFile("file1,file2,file3")
Now, how can we skip header lines from this rdd?

data = sc.textFile('path_to_data')
header = data.first()  # extract header
data = data.filter(lambda row: row != header)  # filter out header

If there were just one header line in the first record, then the most efficient way to filter it out would be:
rdd.mapPartitionsWithIndex {
  (idx, iter) => if (idx == 0) iter.drop(1) else iter
}
Of course this doesn't help if there are many files, each with its own header line inside. In that case you can union three RDDs built this way (see the sketch after the Python equivalent below). You could also just write a filter that matches only lines that could be a header. This is quite simple, but less efficient.
Python equivalent:
from itertools import islice
rdd.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
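For the multi-file case, here is a minimal sketch of the union approach mentioned above, using the three paths from the question and assuming an existing SparkContext sc. Reading each file separately guarantees its header is the first line of partition 0 of its own RDD:
val paths = Seq("file1", "file2", "file3")
val withoutHeaders = paths
  .map { path =>
    // each file gets its own RDD, so its header is the first line of partition 0
    sc.textFile(path).mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }
  }
  .reduce(_ union _)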

In Spark 2.0 a CSV reader is built into Spark, so you can easily load a CSV file as follows:
spark.read.option("header","true").csv("filePath")

From Spark 2.0 onwards you can use SparkSession to get this done as a one-liner:
val spark = SparkSession.builder.config(conf).getOrCreate()
and then, as @SandeepPurohit said:
val dataFrame = spark.read.format("csv").option("header","true").load(csvfilePath)
I hope it solves your question!
P.S.: SparkSession is the new entry point introduced in Spark 2.0 and can be found under the org.apache.spark.sql package
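Putting the pieces together, a minimal self-contained sketch of the SparkSession route, using the file paths from the question (the master and appName settings are placeholders, not from the original answer):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")          // placeholder: use your own cluster settings
  .appName("skip-csv-header")  // placeholder application name
  .getOrCreate()

// header=true tells the reader to use the first line as column names instead of data.
val dataFrame = spark.read
  .option("header", "true")
  .csv("file1", "file2", "file3")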

In PySpark you can use the DataFrame reader and set header to True:
df = spark.read.csv(dataPath, header=True)

Working in 2018 (Spark 2.3)
Python
df = (spark.read
    .option("header", "true")
    .format("csv")
    .schema(myManualSchema)
    .load("mycsv.csv"))
Scala
val myDf = spark.read
  .option("header", "true")
  .format("csv")
  .schema(myManualSchema)
  .load("mycsv.csv")
P.S.: myManualSchema is a predefined schema I wrote myself; you can skip that part of the code.
UPDATE 2021
The same code works for Spark 3.x
df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("mycsv.csv"))

You could load each file separately, filter it with file.zipWithIndex().filter(_._2 > 0), and then union all the file RDDs.
If the number of files is too large, a chain of unions could throw a StackOverflowError.
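A minimal sketch of this per-file approach, using the three paths from the question and assuming an existing SparkContext sc; sc.union builds a single UnionRDD rather than a deeply nested chain of unions, which helps avoid the stack overflow mentioned above:
val paths = Seq("file1", "file2", "file3")

// zipWithIndex pairs every line with its position in the file,
// so filtering index > 0 drops the header line of each file.
val perFile = paths.map { path =>
  sc.textFile(path).zipWithIndex().filter(_._2 > 0).map(_._1)
}

val combined = sc.union(perFile)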

Use the filter() method in PySpark to remove the header, by filtering out lines that start with the first column name:
# Read the file (change the call for other file formats)
contentRDD = sc.textFile(<filepath>)
# Filter out lines that start with the header's first column name
filterDD = contentRDD.filter(lambda l: not l.startswith(<first column name>))
# Check your result
for i in filterDD.take(5): print(i)

Alternatively, you can use the spark-csv package (in Spark 2.0 this is more or less available natively as the csv format). Note that this expects the header in each file (as you desire):
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField('lat', DoubleType(), True),
    StructField('lng', DoubleType(), True)])
df = sqlContext.read.format('com.databricks.spark.csv'). \
    options(header='true',
            delimiter="\t",
            treatEmptyValuesAsNulls=True,
            mode="DROPMALFORMED").load(input_file, schema=schema)

It's an option that you pass to the reader:
val context = new org.apache.spark.sql.SQLContext(sc)
val data = context.read.option("header", "true").csv("<path>")

You can simply filter out the header row using the filter() transformation in PySpark:
rdd = sc.textFile('StudentData.csv')
header = rdd.first()
rdd = rdd.filter(lambda x: x != header)
rdd.collect()

Steps to filter the header from a dataset in an RDD in Spark:
def filter_header(line):
    if line[0] != 'header_column_first_column_name':
        return True

filtered_daily_show = daily_show.filter(filter_header)
filtered_daily_show.take(5)
1. Load the data into an RDD.
2. Create another RDD from the first one by filtering out the header (RDDs in Spark are immutable).
3. Execute the transformations by invoking an action.

// Find the header of each file lying in the directory
import java.io.{BufferedReader, InputStreamReader}

val fileNameHeader = sc.binaryFiles("E:\\sss\\*.txt", 1).map {
  case (fileName, stream) =>
    val header = new BufferedReader(new InputStreamReader(stream.open())).readLine()
    (fileName, header)
}.collect().toMap

val fileNameHeaderBr = sc.broadcast(fileNameHeader)

// Now let's skip the header. mapPartitions ensures the header
// can only be the first line of a partition.
sc.textFile("E:\\sss\\*.txt", 1).mapPartitions(iter =>
  if (iter.hasNext) {
    val firstLine = iter.next()
    println(s"Comparing with firstLine $firstLine")
    if (firstLine == fileNameHeaderBr.value.head._2)
      new WrappedIterator(null, iter)
    else
      new WrappedIterator(firstLine, iter)
  } else {
    iter
  }
).collect().foreach(println)

// Wraps an iterator so that a non-header first line is not lost
// after it has already been consumed by the check above.
class WrappedIterator(firstLine: String, iter: Iterator[String]) extends Iterator[String] {
  var isFirstIteration = true

  override def hasNext: Boolean = {
    if (isFirstIteration && firstLine != null) {
      true
    } else {
      iter.hasNext
    }
  }

  override def next(): String = {
    if (isFirstIteration) {
      println(s"For the first time $firstLine")
      isFirstIteration = false
      if (firstLine != null) {
        firstLine
      } else {
        println(s"Every time $firstLine")
        iter.next()
      }
    } else {
      iter.next()
    }
  }
}

For Python developers, tested with Spark 2.0. Let's say you want to remove the first 14 rows:
sc = spark.sparkContext
lines = sc.textFile("s3://folder_location_of_csv/")
parts = lines.map(lambda l: l.split(","))
parts = parts.zipWithIndex().filter(lambda tup: tup[1] > 14).map(lambda x: x[0])
withColumn is a DataFrame function, so the following will not work in the RDD style used above:
parts.withColumn("index", monotonically_increasing_id()).filter(index > 14)

Related

How to remove header from CSV file using scala [duplicate]

Unable to filter CSV columns stored in dataframe in spark 2.2.0

I am reading a CSV file from my local machine using Spark and Scala and storing it into a dataframe (called df). I have to select only a few columns from df, with new alias names, and save them to a new dataframe newDf. I have tried to do so but I am getting the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`history_temp.time`' given input columns: [history_temp.time, history_temp.poc]
Below is the code written to read the CSV file from my local machine.
import org.apache.spark.sql.SparkSession

object DataLoadConversion {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")
    val spark = SparkSession.builder().master("local").appName("DataConversion").getOrCreate()
    val df = spark.read.format("com.databricks.spark.csv")
      .option("quote", "\"")
      .option("escape", "\"")
      .option("delimiter", ",")
      .option("header", "true")
      .option("mode", "FAILFAST")
      .option("inferSchema", "true")
      .load("file:///C:/Users/an/Desktop/ct_temp.csv")
    df.show(5) // Up to this point the code works fine
    val newDf = df.select("history_temp.time", "history_temp.poc")
Below is the code I tried, but it is not working:
    // val newDf = df.select($"history_temp.time", $"history_temp.poc")
    // val newDf = df.select("history_temp.time", "history_temp.poc")
    // val newDf = df.select(df("history_temp.time").as("TIME"))
    // val newDf = df.select(df.col("history_temp.time"))
    // df.select(df.col("*")) // This is working
    newDf.show(10)
  }
}
From the looks of it, your column name format is the issue here. I am guessing they are just regular StringType columns, but when you have something like history_temp.time, Spark interprets the dot as access to a field inside a struct column, which is not the case here. I would rename all of the columns and replace "." with "_"; then you can run the same select and it should work. You can use foldLeft to replace all "." with "_" like below.
val replacedDf = df.columns.foldLeft(df) { (newdf, colname) =>
  newdf.withColumnRenamed(colname, colname.replace(".", "_"))
}
With that done you can select from replacedDf with the below
val newDf = replacedDf.select("history_temp_time", "history_temp_poc")
Let me know how it works out for you.
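As an alternative sketch (not from the original answer): Spark also lets you keep the dotted column names if you escape them with backticks, so the dot is not interpreted as struct-field access:
import org.apache.spark.sql.functions.col

// Backticks make the analyzer treat the whole string as a single column name
// rather than as field `time` inside a struct column `history_temp`.
val newDf = df.select(
  col("`history_temp.time`").as("TIME"),
  col("`history_temp.poc`").as("POC")
)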

Task not serializable when iterating through dataframe, scala

Below is my code, and I am trying to iterate through each row:
val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", true) // Use first line of all files as header
  .option("delimiter", TILDE)
  .option("inferSchema", "true") // Automatically infer data types
  .load(fileName._2)

val accGrpCountsIds: DataFrame = df.groupBy("accgrpid").count()
LOGGER.info(s"DataFrame Count - ${accGrpCountsIds.count()}")
accGrpCountsIds.show(3)

// switch based on file names and update the model.
accGrpCountsIds.foreach(accGrpRow => {
  val accGrpId = accGrpRow.getLong(0)
  val rowCount = accGrpRow.getInt(1)
})
When I try to iterate through the dataframe above using foreach, I get a task not serializable error. How can I do this?
Do you have anything else in your foreach that you didn't share, or is that all you do and it doesn't work?
accGrpCountsIds.foreach(accGrpRow => {
  val accGrpId = accGrpRow.getLong(0)
  val rowCount = accGrpRow.getInt(1)
})
Also, you may find this useful:
Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
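As a hedged sketch of the usual fix (an assumption about the cause, not stated in the original answer): this error typically means the closure passed to foreach captures something non-serializable from the enclosing class, such as the LOGGER above. One simple workaround is to collect the small grouped result to the driver and iterate there, so nothing from the enclosing class has to be shipped to the executors:
// Collect the grouped counts (one row per accgrpid) to the driver and loop locally.
accGrpCountsIds.collect().foreach { accGrpRow =>
  val accGrpId = accGrpRow.getLong(0)
  val rowCount = accGrpRow.getLong(1) // groupBy().count() yields a Long column
  // ... update the model on the driver ...
}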

Make the first line of csv as header of the table in scala [duplicate]

skip header of csv while reading multiple files into rdd in scala

I am trying to read multiple CSVs from a path into an RDD. The path contains many CSVs. Is there a way I can skip the headers while reading all the CSVs into the RDD, or use spotsRDD to omit the headers, without having to deal with each CSV individually and then union them?
val path ="file:///home/work/csvs/*"
val spotsRDD= sc.textFile(path)
println(spotsRDD.count())
Thanks
It is a pity that you are using Spark 1.0.0.
You could use CSV Data Source for Apache Spark, but that library requires Spark 1.3+ (and, by the way, it was inlined into Spark 2.x).
But we can analyse it and implement something similar.
When we look into com/databricks/spark/csv/DefaultSource.scala there is
val useHeader = parameters.getOrElse("header", "false")
and then in com/databricks/spark/csv/CsvRelation.scala there is
// If header is set, make sure firstLine is materialized before sending to executors.
val filterLine = if (useHeader) firstLine else null
baseRDD().mapPartitions { iter =>
  // When using header, any input line that equals firstLine is assumed to be header
  val csvIter = if (useHeader) {
    iter.filter(_ != filterLine)
  } else {
    iter
  }
  parseCSV(csvIter, csvFormat)
}
So if we assume that the header line appears only once in the RDD (i.e. no data row equals it), we can do something like the example below:
CSV example file:
Latitude,Longitude,Name
48.1,0.25,"First point"
49.2,1.1,"Second point"
47.5,0.75,"Third point"
scala> val csvDataRdd = sc.textFile("test.csv")
csvDataRdd: org.apache.spark.rdd.RDD[String] = test.csv MapPartitionsRDD[24] at textFile at <console>:24
scala> val header = csvDataRdd.first
header: String = Latitude,Longitude,Name
scala> val csvDataWithoutHeaderRdd = csvDataRdd.mapPartitions{iter => iter.filter(_ != header)}
csvDataWithoutHeaderRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitions at <console>:28
scala> csvDataWithoutHeaderRdd.foreach(println)
49.2,1.1,"Second point"
48.1,0.25,"First point"
47.5,0.75,"Third point"
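The same idea extends to the multi-file case from the question, assuming every CSV under the path shares exactly the same header line (a sketch using the path from the question):
val spotsRDD = sc.textFile("file:///home/work/csvs/*")

// Grab one header on the driver and filter it out in every partition.
val header = spotsRDD.first()
val spotsWithoutHeader = spotsRDD.mapPartitions(iter => iter.filter(_ != header))

println(spotsWithoutHeader.count())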