Spark read csv with commented headers - scala

I have the following file which I need to read using spark in scala -
#Version: 1.0
#Fields: date time location timezone
2018-02-02 07:27:42 US LA
2018-02-02 07:27:42 UK LN
I am currently trying to extract the fields using the following the -
spark.read.csv(filepath)
I am new to spark+scala and wanted to know know is there a better way to extract fields based on the # Fields row at the top of the file.

You should be using sparkContext's textFile api to read the text file and then filter the header line
val rdd = sc.textFile("filePath")
val header = rdd
.filter(line => line.toLowerCase.contains("#fields:"))
.map(line => line.split(" ").tail)
.first()
That should be it.
Now if you want to create a dataframe then you should parse it to form schema and then filter the data lines to form Rows. And finally use SQLContext to create a dataframe
import org.apache.spark.sql.types._
val schema = StructType(header.map(title => StructField(title, StringType, true)))
val dataRdd = rdd.filter(line => !line.contains("#")).map(line => Row.fromSeq(line.split(" ")))
val df = sqlContext.createDataFrame(dataRdd, schema)
df.show(false)
This should give you
+----------+--------+--------+--------+
|date |time |location|timezone|
+----------+--------+--------+--------+
|2018-02-02|07:27:42|US |LA |
|2018-02-02|07:27:42|UK |LN |
+----------+--------+--------+--------+
Note: if the file is tab delimited, instead of doing
line.split(" ")
you should be using \t
line.split("\t")

Sample input file "example.csv"
#Version: 1.0
#Fields: date time location timezone
2018-02-02 07:27:42 US LA
2018-02-02 07:27:42 UK LN
Test.scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession.Builder
import org.apache.spark.sql._
import scala.util.Try
object Test extends App {
// create spark session and sql context
val builder: Builder = SparkSession.builder.appName("testAvroSpark")
val sparkSession: SparkSession = builder.master("local[1]").getOrCreate()
val sc: SparkContext = sparkSession.sparkContext
val sqlContext: SQLContext = sparkSession.sqlContext
case class CsvRow(date: String, time: String, location: String, timezone: String)
// path of your csv file
val path: String =
"sample.csv"
// read csv file and skip firs two lines
val csvString: Seq[String] =
sc.textFile(path).toLocalIterator.drop(2).toSeq
// try to read only valid rows
val csvRdd: RDD[(String, String, String, String)] =
sc.parallelize(csvString).flatMap(r =>
Try {
val row: Array[String] = r.split(" ")
CsvRow(row(0), row(1), row(2), row(3))
}.toOption)
.map(csvRow => (csvRow.date, csvRow.time, csvRow.location, csvRow.timezone))
import sqlContext.implicits._
// make data frame
val df: DataFrame =
csvRdd.toDF("date", "time", "location", "timezone")
// display dataf frame
df.show()
}

Related

I don't know how to do the same using parquet file

Link to (data.csv) and (output.csv)
import org.apache.spark.sql._
object Test {
def main(args: Array[String]) {
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val tempDF=spark.read.csv("data.csv")
tempDF.coalesce(1).write.parquet("Parquet")
val rdd = sc.textFile("Parquet")
I Convert data.csv into optimised parquet file and then loaded it and now i want to do all the transformation on parquet file just like i did on csv file given below and then save it as a parquet file.Link of (data.csv) and (output.csv)
val header = rdd.first
val rdd1 = rdd.filter(_ != header)
val resultRDD = rdd1.map { r =>
val Array(country, values) = r.split(",")
country -> values
}.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))
import spark.sqlContext.implicits._
val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
dataSet.coalesce(1).write.option("header","true").csv("output")
}
case class CountryAgg(country: String, values: String)
}
I reckon, you are trying to add up corresponding elements from the array based on Country. I have done this using DataFrame APIs, which makes the job easier.
Code for your reference:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("path", "/path/to/input/data.csv")
.load()
val df1 = df.select(
$"Country",
(split($"Values", ";"))(0).alias("c1"),
(split($"Values", ";"))(1).alias("c2"),
(split($"Values", ";"))(2).alias("c3"),
(split($"Values", ";"))(3).alias("c4"),
(split($"Values", ";"))(4).alias("c5")
)
.groupBy($"Country")
.agg(
sum($"c1" cast "int").alias("s1"),
sum($"c2" cast "int").alias("s2"),
sum($"c3" cast "int").alias("s3"),
sum($"c4" cast "int").alias("s4"),
sum($"c5" cast "int").alias("s5")
)
.select(
$"Country",
concat(
$"s1", lit(";"),
$"s2", lit(";"),
$"s3", lit(";"),
$"s4", lit(";"),
$"s5"
).alias("Values")
)
df1.repartition(1)
.write
.format("csv")
.option("delimiter",",")
.option("header", "true")
.option("path", "/path/to/output")
.save()
Here is the output for your reference.
scala> df1.show()
+-------+-------------------+
|Country| Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
| China| 218;239;234;209;75|
| India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+
P.S.:
You can change the output format to parquet/orc or anything you wish.
I have repartitioned df1 into 1 partition just so that you could get a single output file. You can choose to repartition or not based
on your usecase
Hope this helps.
You could just read the file as parquet and perform the same operations on the resulting dataframe:
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")
If you need an rdd you can then just call:
val rdd = parquetFileDF.rdd
The you can proceed with the transformations as before and write as parquet like you have in your question.

spark Scala RDD to DataFrame Date format

Would you be able to help in this spark prob statement
Data -
empno|ename|designation|manager|hire_date|sal|deptno
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30
Code:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|") (fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4).toDate,fields(5).toFloat,fields(6).toInt)
})
Problem Statement - This is not working -fields(4).toDate , whats is the alternative or what is the usage ?
What i have tried ?
tried replacing it to - to_date(col(fields(4)) , "yyy-MM-dd") - Not working
2.
Step 1.
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
(fields(0),fields(1),fields(2),fields(3),fields(4),fields(5),fields(6))
})
Now this tuples are all strings
Step 2.
mySchema = StructType(StructField(empno,IntegerType,true), StructField(ename,StringType,true), StructField(designation,StringType,true), StructField(manager,IntegerType,true), StructField(hire_date,DateType,true), StructField(sal,DoubleType,true), StructField(deptno,IntegerType,true))
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This is also not working and gives error related to types. to solve this i changed the step 1 as
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)
Now this is giving error for the date type column and i am again at the main problem.
Use Case - use textFile Api, convert this to a dataframe using custom schema (StructType) on top of it.
This can be done using the case class but in case class also i would be stuck where i would need to do a fields(4).toDate (i know i can cast string to date later in code but if the above problem solutionis possible)
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts).show(2, false)
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|empno|ename|designation|manager|hire_date |sal |deptno |ts |
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|7369 |SMITH|CLERK |9902 |2010-12-17|800.00 |20 |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN |9698 |2011-02-20|1600.00|30 |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
enriched_df: Unit = ()
There are multiple ways to cast your data to proper data types.
First : use InferSchema
val df = spark.read .option("delimiter", "\\|").option("header", true) .option("inferSchema", "true").csv(path)
df.printSchema
Some time it doesn't work as expected. see details here
Second : provide your own Datatype conversion template
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), (**"hire_date", "date"**) , ("sal", "double")).toDF("columnName", "columnType")
rawDF.printSchema
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
val column = c.asInstanceOf[String]
val typ = t.asInstanceOf[String]
col(column).cast(typ)
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
convertedDF.show()
convertedDF.printSchema
Third : Case Class
Create schema from caseclass with ScalaReflection and provide this customized schema while loading DF.
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
case class MySchema(empno: int, ename: String, hire_date: Date, sal: Double)
val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").load(path)
rawDF.printSchema
Hope this will help.

Spark Structured Stream null in output dataset

i run a scala code which aggregates data and print output to the console. Unfortunately, i got a nulls after group operation. Current output:
|Id |Date | Count |
|null|null | 35471|
I realised, that the bottle neck is the point, when i group data - when i try to use column other than numeric, output returns nulls. Any advice will be welcome - i lost hours to find solution.
My code:
// create schema
val sensorsSchema = new StructType()
.add("SensorId", IntegerType)
.add("Timestamp", TimestampType)
.add("Value", DoubleType)
.add("State", StringType)
// read streaming data from csv...
// aggregate streaming data
val streamAgg = streamIn
.withColumn("Date", to_date(unix_timestamp($"Timestamp", "dd/MM/yyyy").cast(TimestampType)))
.groupBy("SensorId", "Date")
.count()
// write streaming data...
I change the code - now works perfect:
/****************************************
* STREAMING APP
* 1.0 beta
*****************************************
* read data from csv (local)
* and save as parquet (local)
****************************************/
package tk.streaming
import org.apache.spark.SparkConf
import org.apache.spark.sql._
// import org.apache.spark.sql.functions._
case class SensorsSchema(SensorId: Int, Timestamp: String, Value: Double, State: String, OperatorId: Int)
object Runner {
def main(args: Array[String]): Unit = {
// Configuration parameters (to create spark session and contexts)
val appName = "StreamingApp" // app name
val master = "local[*]" // master configuration
val dataDir = "/home/usr_spark/Projects/SparkStreaming/data"
val refreshInterval = 30 // seconds
// initialize context
val conf = new SparkConf().setMaster(master).setAppName(appName)
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
// TODO change file source to Kafka (must)
// read streaming data
val sensorsSchema = Encoders.product[SensorsSchema].schema
val streamIn = spark.readStream
.format("csv")
.schema(sensorsSchema)
.load(dataDir + "/input")
.select("SensorId", "Timestamp", "State", "Value") // remove "OperatorId" column
// TODO save result in S3 (nice to have)
// write streaming data
import org.apache.spark.sql.streaming.Trigger
val streamOut = streamIn.writeStream
.queryName("streamingOutput")
.format("parquet")
.option("checkpointLocation", dataDir + "/output/checkpoint")
.option("path", dataDir + "/output")
.start()
streamOut.awaitTermination() // start streaming data
}
}

error: value show is not a member of String

If in this case I want to show the header . Why I cannot write in the third line header.show()?
What I have to do to view the content of the header variable?
val hospitalDataText = sc.textFile("/Users/bhaskar/Desktop/services.csv")
val header = hospitalDataText.first() //Remove the header
If you want a DataFrame use DataFrameReader and limit:
spark.read.text(path).limit(1).show
otherwise just println
println(header)
Unless of course you want to use cats Show. With cats add package to spark.jars.packages and
import cats.syntax.show._
import cats.instances.string._
sc.textFile(path).first.show
If you use sparkContext (sc.textFile), you get an RDD. You are getting the error because header is not a dataframe but a rdd. And show is applicable on dataframe or dataset only.
You will have to read the textfile with sqlContext and not sparkContext.
What you can do is use sqlContext and show(1) as
val hospitalDataText = sqlContext.read.csv("/Users/bhaskar/Desktop/services.csv")
hospitalDataText.show(1, false)
Updated for more clarification
sparkContext would create rdd which can be seen in
scala> val hospitalDataText = sc.textFile("file:/test/resources/t1.csv")
hospitalDataText: org.apache.spark.rdd.RDD[String] = file:/test/resources/t1.csv MapPartitionsRDD[5] at textFile at <console>:25
And if you use .first() then the first string of the RDD[String] is extracted as
scala> val header = hospitalDataText.first()
header: String = test1,26,BigData,test1
Now answering your comment below, yes you can create dataframe from header string just created
Following will put the string in one column
scala> val sqlContext = spark.sqlContext
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext#3fc736c4
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> Seq(header).toDF.show(false)
+----------------------+
|value |
+----------------------+
|test1,26,BigData,test1|
+----------------------+
If you want each string in separate columns you can do
scala> val array = header.split(",")
array: Array[String] = Array(test1, 26, BigData, test1)
scala> Seq((array(0), array(1), array(2), array(3))).toDF().show(false)
+-----+---+-------+-----+
|_1 |_2 |_3 |_4 |
+-----+---+-------+-----+
|test1|26 |BigData|test1|
+-----+---+-------+-----+
You can even define the header names as
scala> Seq((array(0), array(1), array(2), array(3))).toDF("col1", "number", "text2", "col4").show(false)
+-----+------+-------+-----+
|col1 |number|text2 |col4 |
+-----+------+-------+-----+
|test1|26 |BigData|test1|
+-----+------+-------+-----+
More advanced approach would be to use sqlContext.createDataFrame with Schema defined

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issues since the elements in myFile1 RDD are now array type.
How can I solve this issue?
Update - as of Spark 1.6, you can simply use the built-in csv data source:
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
For Spark version < 1.6:
The easiest way is to use spark-csv - include it in your dependencies and follow the README, it allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (with the cost of an extra scan of the data).
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x=>x.split(";")).map {
case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
I have given different ways to create DataFrame from text file
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = SparkContext(conf)
raw text file
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(",")).map{case Array(a,b,c) =>
(a,b.toInt,c)}.toDF("name","age","city")
fileToDf.foreach(println(_))
spark session without schema
import org.apache.spark.sql.SparkSession
val sparkSess =
SparkSession.builder().appName("SparkSessionZipsExample")
.config(conf).getOrCreate()
val df = sparkSess.read.option("header",
"false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
spark session with schema
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName,
StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header",
"false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
using sql context
import org.apache.spark.sql.SQLContext
val fileRdd =
sc.textFile("C:\\vikas\\spark\\Interview\\text.txt").map(_.split(",")).map{x
=> org.apache.spark.sql.Row(x:_*)}
val sqlDf = sqlCtx.createDataFrame(fileRdd,schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into a RDD of a case class. For example, you have to do:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
You will not able to convert it into data frame until you use implicit conversion.
val sqlContext = new SqlContext(new SparkContext())
import sqlContext.implicits._
After this only you can convert this to data frame
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
val df = spark.read.textFile("abc.txt")
case class Abc (amount:Int, types: String, id:Int) //columns and data types
val df2 = df.map(rec=>Amount(rec(0).toInt, rec(1), rec(2).toInt))
rdd2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A txt File with PIPE (|) delimited file can be read as :
df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines=lines.split(",")).map(arrays=>(ararys(0),arrays(1))).toDF("id","name").show
You can read a file to have an RDD and then assign schema to it. Two common ways to creating schema are either using a case class or a Schema object [my preferred one]. Follows the quick snippets of code that you may use.
Case Class approach
case class Test(id:String,name:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach since case class has a limitation of max 22 fields and this will be a problem if your file has more than 22 fields!