Why doesn't removing the header from a CSV file work? - Scala

object test {
case class Caserne(x: String, y: String, Name: String, Description: String)
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("BankDataAnalysis").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext= new SQLContext(sc)
import sqlContext.implicits._
// load caserne data
val caserneTxt = sc.parallelize(
IOUtils.toString(
new URL("http://donnees.ville.montreal.qc.ca/dataset/c69e78c6-e454-4bd9-9778-e4b0eaf8105b/resource/f6542ad1-31f5-458e-b33d-1a028fab3e98/download/casernessim.csv"),
Charset.forName("utf8")).split("\n"))
val header = caserneTxt.first()
val caserne = caserneTxt.map(s => s.split(",")).filter(s => s != header).map(
s => Caserne(s(0),
s(1),
s(2).replaceAll("[^\\d]", "").trim(),
s(3).replaceAll("""<(?!\/?a(?=>|\s.*>))\/?.*?>""", " ").trim()
)).toDF()
caserne.registerTempTable("caserne")
sqlContext.sql("Select * from caserne").show()
}
}
I have to remove the header from the CSV file. I used filter(s => s != header), but it didn't work. Thank you for your help.

Try using:
val rows = data.filter(s=> header(s,"X") != "X")
Reference: How do I convert csv file to rdd
I found this convenient method
val header = caserneTxt.first()
val no_header = caserneTxt.filter(_(0) != header(0))

One way would be to take one of the header keys and filter it out of the DataFrame, something like below:
dataFrame.filter(row => row.getAs[String]("description") != "description").show
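For context on why the filter in the question has no effect: after map(s => s.split(",")) each element is an Array[String], while header is a String, so the two are never equal and every row passes the filter. The following is a minimal sketch of that behaviour using plain Scala collections in place of the RDD; the sample lines and the object name are made up for illustration, and the same lambdas should apply to an RDD[String].

```scala
object HeaderFilterSketch {
  // Hypothetical sample data standing in for the downloaded CSV lines.
  val lines: Seq[String] = Seq(
    "X,Y,Name,Description",
    "1,2,Caserne 5,foo",
    "3,4,Caserne 9,bar")

  val header: String = lines.head

  // Same shape as the question: split first, then compare Array[String] with String.
  // Written with equals so it compiles without warnings; it behaves like s != header.
  // An array is never equal to a string, so no row is removed.
  val broken: Seq[Array[String]] =
    lines.map(_.split(",")).filter(s => !s.equals(header))

  // Filter the raw lines before splitting, so a String is compared with a String.
  val fixed: Seq[Array[String]] =
    lines.filter(_ != header).map(_.split(","))

  def main(args: Array[String]): Unit = {
    println(broken.length) // 3: the header row is still there
    println(fixed.length)  // 2: the header row is gone
  }
}
```

Filtering on the whole line compares every character, which avoids dropping data rows that merely start with the same character as the header.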

Related

Migrate code Scala to databricks notebook

I am trying to get this code running in a Databricks notebook. It has already been tested and works in an IDE, but I cannot get it working when I change the structure of the code.
import java.io.{BufferedReader, InputStreamReader}
import java.text.SimpleDateFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
object TestUnit {
val dateFormat = new SimpleDateFormat("yyyyMMdd")
case class Averages (cust: String, Num: String, date: String, credit: Double)
def main(args: Array[String]): Unit = {
val inputFile = "s3a://tfsdl-ghd-wb/raidnd/Cleartablet.csv"
val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv"
val fileSystem = getFileSystem(inputFile)
val inputData = readCSVFileLines(fileSystem, inputFile, skipHeader = true)
.toSeq
val filtinp = inputData.filter(x => x.nonEmpty)
.map(x => x.split(","))
.map(x => Revenue(x(6), x(5), x(0), x(8).toDouble))
// Create output writer
val writer = new PrintWriter(new File(outputFile))
// Header for output CSV file
writer.write("Date,customer,number,Credit,Average Credit/SKU\n")
filtinp.foreach{x =>
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
}
// Write row to output csv file
writer.write(s"${x.day},${x.customer},${x.number},${x.credit},${avgcredit1},${avgcredit2}\n")
writer.close() // close the writer
}
}

Remove header from CSV while reading from a txt or csv file in Spark Scala

I am trying to remove the header from a given input file, but I couldn't make it work.
This is what I have written. Can someone help me remove the header from the txt or csv file?
import org.apache.spark.{SparkConf, SparkContext}
object SalesAmount {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(getClass.getName).setMaster("local")
val sc = new SparkContext(conf)
val salesRDD = sc.textFile(args(0),2)
val salesPairRDD = salesRDD.map(rec => {
val fieldArr = rec.split(",")
(fieldArr(1), fieldArr(3).toDouble)
})
val totalAmountRDD = salesPairRDD.reduceByKey(_+_).sortBy(_._2,false)
val discountAmountRDD = totalAmountRDD.map(t => {
if (t._2 > 1000) (t._1,t._2 * 0.9)
else t
})
discountAmountRDD.foreach(println)
}
}
Skipping the first row when manually parsing text files using the RDD API is a bit tricky:
val salesPairRDD =
salesRDD
.mapPartitionsWithIndex((i, it) => if (i == 0) it.drop(1) else it)
.map(rec => {
val fieldArr = rec.split(",")
(fieldArr(1), fieldArr(3).toDouble)
})
The header line will be the first item in the first partition, so mapPartitionsWithIndex is used to iterate over the partitions and to skip the first item if the partition index is 0.
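As a self-contained variant of the snippet above, the sketch below runs the same mapPartitionsWithIndex step on a small in-memory RDD. The sample lines, the object name and the dropHeader helper are assumptions made for illustration; in the question the data would come from sc.textFile(args(0), 2) instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SkipHeaderSketch {
  // Drops the first element of partition 0 only; other partitions pass through unchanged.
  def dropHeader(rdd: RDD[String]): RDD[String] =
    rdd.mapPartitionsWithIndex((i, it) => if (i == 0) it.drop(1) else it)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SkipHeaderSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample lines standing in for sc.textFile(args(0), 2).
      val salesRDD = sc.parallelize(Seq(
        "id,customer,date,amount",
        "1,alice,2020-01-01,700.0",
        "2,bob,2020-01-02,1200.0",
        "3,alice,2020-01-03,400.0"), 2)

      // Same parsing as in the question, applied after the header is gone.
      val totals = dropHeader(salesRDD)
        .map { rec =>
          val fieldArr = rec.split(",")
          (fieldArr(1), fieldArr(3).toDouble)
        }
        .reduceByKey(_ + _)

      totals.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This only removes one line, the first line of the first partition. If the input is a directory of several files that each carry their own header, the remaining headers would still reach the parsing step, so a filter on the header text may be a better fit there.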

issue when using scala filter function in rdd

I started learning Scala and Apache Spark. I have an input file, shown below, without a header.
0,name1,33,385 - first record
1,name2,26,221 - second record
The columns are: unique-id, name, age, friends
1) When I try to keep only the records whose age is not 26, the code below does not work.
def parseLine(x : String) =
{
val line = x.split(",").filter(x => x._2 != "26")
}
I also tried the following. In both cases it prints all the values, including 26.
val friends = line(2).filter(x => x != "26")
2) When I try index x._3, it reports an index out of bounds error.
val line = x.split(",").filter(x => x._3 != "221")
Why is index 3 a problem here?
Please find the complete sample code below.
package learning
import org.apache.spark._
import org.apache.log4j._
object Test1 {
def main(args : Array[String]): Unit =
{
val sc = new SparkContext("local[*]", "Test1")
val lines = sc.textFile("D:\\SparkScala\\abcd.csv")
Logger.getLogger("org").setLevel(Level.ERROR)
val testres = lines.map(parseLine)
testres.take(10).foreach(println)
}
def parseLine(x : String) =
{
val line = x.split(",").filter(x => x._2 != "33")
//val line = x.split(",").filter(x => x._3 != "307")
val age = line(1)
val friends = line(3).filter(x => x != "307")
(age,friends)
}
}
How can I filter by age or friends in a simple way here?
Why is index 3 not working here?
The issue is that you are trying to filter on the array representing a single line and not on the RDD that contains all the lines.
A possible version could be the following (I also created a case class to hold the data coming from the CSV):
package learning
import org.apache.spark._
import org.apache.log4j._
object Test2 {
// A structured representation of a CSV line
case class Person(id: String, name: String, age: Int, friends: Int)
def main(args : Array[String]): Unit = {
val sc = new SparkContext("local[*]", "Test1")
Logger.getLogger("org").setLevel(Level.ERROR)
sc.textFile("D:\\SparkScala\\abcd.csv") // RDD[String]
.map(line => parse(line)) // RDD[Person]
.filter(person => person.age != 26) // filter out people of 26 years old
.take(10) // collect 10 people from the RDD
.foreach(println)
}
def parse(x : String): Person = {
// Split the CSV string by comma into an array of strings
val line = x.split(",")
// After extracting the fields from the CSV string, create an instance of Person
Person(id = line(0), name = line(1), age = line(2).toInt, friends = line(3).toInt)
}
}
Another possibility would be to use flatMap() and Option[] values instead. In this case you can operate on a single line directly, for instance:
package learning
import org.apache.spark._
import org.apache.log4j._
object Test3 {
// A structured representation of a CSV line
case class Person(id: String, name: String, age: Int, friends: Int)
def main(args : Array[String]): Unit = {
val sc = new SparkContext("local[*]", "Test1")
Logger.getLogger("org").setLevel(Level.ERROR)
sc.textFile("D:\\SparkScala\\abcd.csv") // RDD[String]
.flatMap(line => parse(line)) // RDD[Person] -- you don't need to filter anymore, the flatMap does it for you now
.take(10) // collect 10 people from the RDD
.foreach(println)
}
def parse(x : String): Option[Person] = {
// Split the CSV string by comma into an array of strings
val line = x.split(",")
// After extracting the fields from the CSV string, create an instance of Person only if it's not 26
line(2) match {
case "26" => None
case _ => Some(Person(id = line(0), name = line(1), age = line(2).toInt, friends = line(3).toInt))
}
}
}
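To make the difference between filtering the fields of one line and filtering the records concrete, here is a small sketch in plain Scala (no Spark needed). It uses the two sample records from the question; the object and value names are made up for illustration, and equals is used in the character comparison only so that the example compiles without warnings.

```scala
object ArrayFilterSketch {
  val record = "1,name2,26,221"
  val fields: Array[String] = record.split(",") // Array("1", "name2", "26", "221")

  // filter on the array tests each field on its own: it removes the field "26",
  // not the record, leaving Array("1", "name2", "221").
  val withoutAgeField: Array[String] = fields.filter(f => f != "26")

  // The filtered array has only 3 elements, so withoutAgeField(3) would throw
  // ArrayIndexOutOfBoundsException.

  // filter on a String walks over its characters. A Char is never equal to the
  // String "26", so nothing is removed and the value is still "26".
  val stillTwentySix: String = fields(2).filter(c => !c.equals("26"))

  // Filtering the collection of records is what drops whole lines.
  val records: Seq[String] = Seq("0,name1,33,385", "1,name2,26,221")
  val kept: Seq[Array[String]] = records.map(_.split(",")).filter(f => f(2) != "26")

  def main(args: Array[String]): Unit = {
    println(withoutAgeField.mkString(",")) // 1,name2,221
    println(stillTwentySix)                // 26
    println(kept.map(_.mkString(",")))     // List(0,name1,33,385)
  }
}
```

The same predicate f => f(2) != "26" can be passed to RDD.filter after the split, which is the string-based counterpart of the Person-based filter shown above.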

Union Dataframes based on condition in spark(scala)

I have a folder containing 4 subfolders, each of which contains parquet files:
Folder -> A.parquet, B.parquet, C.parquet, D.parquet (subfolders). I want to union data frames based on the file names I pass to the method.
I am doing it with this code:
val df = listDirectoriesGetWantedFile(folderPath,sqlContext,A,B)
def listDirectoriesGetWantedFile(folderPath: String, sqlContext: SQLContext, str1: String, str2: String): DataFrame = {
var df: DataFrame = null
val sb = new StringBuilder
sb.setLength(0)
var done = false
val path = new Path(folderPath)
if (fileSystem.isDirectory(path)) {
var files = fileSystem.listStatus(path)
for (file <- files) {
if (file.getPath.getName.contains(str) && !done) {
sb.append(file.getPath.toString())
sb.append(",")
done = true
} else if (file.getPath.getName.contains(str2)) {
sb.append(file.getPath.toString())
}
}
}
But I need to split sb and then union the data frames, and I cannot find a way to do that. How can I approach this?
If I understand your question, you could simply do something like this:
def listDirectoriesGetWantedFile(path: String,
sqlContext: SQLContext,
folder1: String,
folder2: String): DataFrame = {
val df1 = sqlContext.read.parquet(s"$path/$folder1")
val df2 = sqlContext.read.parquet(s"$path/$folder2")
df1.union(df2)
}
EDIT
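By using the Hadoop FileSystem, you can check whether the paths of your folders exist. So you may try something like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SQLContext}
def listDirectoriesGetWantedFile(path: String, sqlContext: SQLContext, folders: Seq[String]): DataFrame = {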
val conf = new Configuration()
val fs = FileSystem.get(conf)
val existingFolders = folders
.map(folder => new Path(s"$path/$folder"))
.filter(fs.exists(_))
.map(_.toString)
if (existingFolders.isEmpty) {
sqlContext.emptyDataFrame
} else {
sqlContext.read.parquet(existingFolders: _*)
}
}
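If the paths are already collected in a comma-separated string, as with the StringBuilder in the question, one possible way to "split and union" is sketched below. The object and method names are hypothetical, union is assumed to be available (Spark 2.x and later; older versions use unionAll), and all inputs are assumed to share the same schema.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object UnionPathsSketch {
  // Splits a comma-separated list of parquet paths (for example sb.toString),
  // reads each path into a DataFrame and unions the results pairwise.
  def unionByPaths(sqlContext: SQLContext, commaSeparated: String): DataFrame =
    commaSeparated
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)                        // ignore empty entries such as a trailing comma
      .map(p => sqlContext.read.parquet(p))      // one DataFrame per path
      .reduceOption((a, b) => a.union(b))        // None when no path was given
      .getOrElse(sqlContext.emptyDataFrame)
}
```

Since read.parquet accepts several paths at once, passing the split paths as varargs, as in the EDIT above, is probably the shorter option; the explicit reduce is mainly useful when each DataFrame needs its own processing before the union.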

Make RDD from LIST[Row] In scala(in spark)

I'm writing some code with Scala and Spark and want to create a CSV file from an RDD or a List[Row].
I wanted to process the 'ListRDD' data in parallel, so I thought the output would be more than one file.
val conf = new SparkConf().setAppName("Csv Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val logFile = "data.csv "
val rawdf = sqlContext.read.format("com.databricks.spark.csv")....
val rowRDD = rawdf.map { row =>
Row(
row.getAs( myMap.ID).toString,
row.getAs( myMap.Dept)
.....
)
}
val df = sqlContext.createDataFrame(rowRDD, mySchema)
val MapRDD = df.map { x => (x.getAs[String](myMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
myClass.myFunction( ListRDD)
in myClass..
def myFunction(ListRDD: RDD[(String, List[Row])]) = {
var rows: RDD[Row]
ListRDD.foreach( row => {
rows.add? gather? ( make(row._2)) // make( row._2) will return List[Row]
})
rows.saveAsFile(" path") // it's my final goal
}
def make( list: List[Row]) : List[Row] = {
data processing from List[Row]
}
I tried to create an RDD from the List with sc.parallelize(list), but somehow nothing works. Any idea how to create RDD data from the make function?
If you want to make an RDD from a List[Row], here is a way to do so:
// Assuming list is your List[Row]; a List is a Seq, so it can be passed to makeRDD directly
val newRDD: RDD[Row] = sc.makeRDD(list)
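For the myFunction case in the question, the lists already live inside an RDD, so creating a new RDD from within foreach is not an option: the SparkContext is only usable on the driver and RDDs cannot be nested. A possible alternative, sketched below under the assumption that make is a pure function on List[Row], is to flatten the lists with flatMap. The body of make here is a placeholder that returns its input unchanged, and the sample rows are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

object FlattenRowsSketch {
  // Placeholder for the question's make(); the real processing would go here.
  def make(list: List[Row]): List[Row] = list

  // Applies make to every group and flattens the resulting lists into one RDD[Row].
  def myFunction(listRDD: RDD[(String, List[Row])]): RDD[Row] =
    listRDD.flatMap { case (_, rows) => make(rows) }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("FlattenRowsSketch").setMaster("local[2]"))
    try {
      // Hypothetical grouped data standing in for ListRDD.
      val listRDD = sc.parallelize(Seq(
        ("a", List(Row("a", "sales"), Row("a", "hr"))),
        ("b", List(Row("b", "it")))))
      myFunction(listRDD).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The resulting RDD[Row] can then be written with saveAsTextFile, or turned back into a DataFrame with sqlContext.createDataFrame(rows, mySchema) and saved through the CSV writer; either way Spark writes one part file per partition.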