Load a .csv file into a DataFrame in Scala for analysis

I have a 100,000-line CSV file that I need to load into a DataFrame in Scala. It has a header row, only one sheet of data and five columns. I am new to Scala and IntelliJ, and I would appreciate any help I can get with this.

Below is sample code that you can use to load the CSV; you can add more options based on your needs.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object LoadCsv extends App {

  val spark = SparkSession.builder()
    .appName("Load CSV")
    .master("local[*]")
    .getOrCreate()

  // Adjust the column names and types to match your five columns.
  val csvSchema = StructType(Array(
    StructField("col1", StringType),
    StructField("col2", DateType),
    StructField("col3", DoubleType),
    StructField("col4", DateType),
    StructField("col5", DoubleType)
  ))

  val df = spark.read
    .schema(csvSchema)
    .option("header", "true")
    .option("sep", ",")
    .csv("path to csv in your IntelliJ project")
}
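Once the DataFrame is loaded, a quick sanity check confirms the schema was applied and the rows came through; a minimal sketch, to be placed right after the read inside the object:
df.printSchema()               // confirm the declared schema was applied
df.show(5, truncate = false)   // preview the first few rows
println(df.count())            // should be close to 100,000 (header excluded)
If the two DateType columns come back null, note that the CSV reader expects dates in yyyy-MM-dd format by default; a different pattern can be set with .option("dateFormat", "...").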

Related

Except and ExceptAll functions for Apache Spark's Dataset are giving an empty DataFrame during streaming

I am using two Datasets, both batch; the left one has a few rows that are not present in the right one.
Running except on them gives the correct output, i.e. the rows that are in the left one but not in the right one.
When I repeat the same thing in streaming, with the left side as a streaming source and the right side as a batch source, I get an empty DataFrame, and dataframe.writeStream then throws a None.get error.
package exceptPOC

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object exceptPOC {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("pocExcept")
      .setMaster("local")
      .set("spark.driver.host", "localhost")

    val sparkSession =
      SparkSession.builder().config(sparkConf).getOrCreate()

    val schema = new StructType()
      .add("sepal_length", DoubleType, true)
      .add("sepal_width", DoubleType, true)
      .add("petal_length", DoubleType, true)
      .add("petal_width", DoubleType, true)
      .add("species", StringType, true)

    var df1 = sparkSession.readStream
      .option("header", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("delimiter", ",")
      .schema(schema)
      .csv(
        "some/path/exceptPOC/streamIris" // contains iris1
      )

    var df2 = sparkSession.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("delimiter", ",")
      .schema(schema)
      .csv(
        "/some/path/exceptPOC/iris2.csv"
      )

    val exceptDF = df1.except(df2)

    exceptDF.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
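For what it's worth, except relies on a distinct-style set operation under the hood, and such operations are not supported on streaming Datasets, which is consistent with the empty result and the None.get failure at write time. One workaround is to express the same intent as a stream-static left outer join (which is supported) and keep only the streaming rows that found no match. A rough sketch, not a drop-in fix (it differs from except for duplicate rows and null join keys):
import org.apache.spark.sql.functions.{col, lit}

// All five iris columns act as the comparison key.
val joinCols = Seq("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

// Tag the static side so unmatched streaming rows can be identified after the join.
val df2Tagged = df2.withColumn("matched", lit(true))

val exceptDF = df1
  .join(df2Tagged, joinCols, "left_outer") // stream-static left outer join is supported
  .filter(col("matched").isNull)           // keep rows with no match in the batch side
  .drop("matched")

exceptDF.writeStream
  .format("console")
  .start()
  .awaitTermination()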

Unable to create multiple files using foreachBatch in spark (This Code Works Now)

I want to save files to multiple destinations using foreachBatch. The code runs fine, but foreachBatch isn't behaving the way I want.
Kindly help me with this if you have any clue.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.streaming._
import org.apache.spark.storage.StorageLevel

object multiDestination {

  val spark = SparkSession.builder()
    .master("local")
    .appName("Writing data to multiple destinations")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    val mySchema = StructType(Array(
      StructField("Id", IntegerType),
      StructField("Name", StringType)
    ))

    val askDF = spark
      .readStream
      .format("csv")
      .option("header", "true")
      .schema(mySchema)
      .load("/home/amulya/Desktop/csv/")
    //println(askDF.show())
    println(askDF.isStreaming)

    askDF.writeStream.foreachBatch { (askDF: DataFrame, batchId: Long) =>
      askDF.persist()
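The snippet above is cut off at the persist() call, but the usual shape of a foreachBatch body that writes to several destinations is: persist the micro-batch, write it once per sink, then unpersist. A hedged sketch of how it typically continues (the output paths, formats and checkpoint location below are placeholders, not taken from the original post):
askDF.writeStream
  .option("checkpointLocation", "/home/amulya/Desktop/checkpoint") // hypothetical path
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist() // reuse the same micro-batch for both writes
    batchDF.write.mode("append").format("csv").save("/home/amulya/Desktop/out/csv")         // hypothetical sink 1
    batchDF.write.mode("append").format("parquet").save("/home/amulya/Desktop/out/parquet") // hypothetical sink 2
    batchDF.unpersist()
    () // foreachBatch expects a Unit-returning function
  }
  .start()
  .awaitTermination()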

Is it possible to put records that aren't the same length as the header into a bad_record directory?

I am reading a file into a DataFrame like this:
val df = spark.read
  .option("sep", props.inputSeperator)
  .option("header", "true")
  .option("badRecordsPath", "/mnt/adls/udf_databricks/error")
  .csv(inputLoc)
The file is set up like this:
col_a|col_b|col_c|col_d
1|first|last|
2|this|is|data
3|ok
4|more||stuff
5|||
Now, Spark will read all of this as acceptable data. However, I want 3|ok to be marked as a bad record because its size does not match the header size. Is this possible?
val a = spark.sparkContext.textFile(pathOfYourFile)
val size = a.first.split("\\|").length // number of columns in the header row
a.filter(i => i.split("\\|", -1).size != size).saveAsTextFile("/mnt/adls/udf_databricks/error") // rows whose width differs from the header
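If you also want to keep working with the well-formed rows after splitting them out, the same column-count check can feed the DataFrame reader directly. A small sketch, assuming Spark 2.2+ (where spark.read.csv accepts a Dataset[String]):
import spark.implicits._ // for .toDS() on the RDD of lines

val good = a.filter(i => i.split("\\|", -1).size == size) // rows matching the header width

val goodDF = spark.read
  .option("sep", props.inputSeperator) // same separator as in your original read
  .option("header", "true")
  .csv(good.toDS())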
The option below is supported by the Databricks implementation of Spark. I don't see a schema mapping in your code; could you add one and try?
.option("badRecordsPath", "/mnt/adls/udf_databricks/error")
Change your code like below:
val customSchema = StructType(Array(
  StructField("col_a", StringType, true),
  StructField("col_b", StringType, true),
  StructField("col_c", StringType, true),
  StructField("col_d", StringType, true)))

val df = spark.read
  .option("sep", props.inputSeperator)
  .option("header", "true")
  .option("badRecordsPath", "/mnt/adls/udf_databricks/error")
  .schema(customSchema)
  .csv(inputLoc)
For more details, you can refer to the Databricks docs on badRecordsPath.
Thanks,
Karthick

How to read .csv files using spark streaming and write to parquet file using Scala?

I'm trying to read files with a Spark 2.1.0 streaming program. The CSV files are stored in a directory on my local machine, and I am trying to use writeStream to write Parquet to a new location on my local machine. But whenever I try, it either errors on the .parquet call or I get blank folders.
This is my code :
case class TDCS_M05A(TimeInterval: String, GantryFrom: String, GantryTo: String,
                     VehicleType: Integer, SpaceMeanSpeed: Integer, CarTimes: Integer)

object Streamingcsv {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Streamingcsv")
      .config("spark.master", "local")
      .getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.types._

    val schema = StructType(
      StructField("TimeInterval", DateType, false) ::
      StructField("GantryFrom", StringType, false) ::
      StructField("GantryTo", StringType, false) ::
      StructField("VehicleType", IntegerType, false) ::
      StructField("SpaceMeanSpeed", IntegerType, false) ::
      StructField("CarTimes", IntegerType, false) :: Nil)

    import org.apache.spark.sql.Encoders
    val usrschema = Encoders.product[TDCS_M05A].schema

    val csvDF = spark.readStream
      .schema(usrschema) // Specify schema of the csv files
      .csv("/home/hduser/IdeaProjects/spark2.1/data/*.csv")

    val query = csvDF.select("GantryFrom").where("CarTimes > 0")
    query
      .writeStream
      .outputMode("append")
      .format("parquet")
      .option("checkpointLocation", "checkpoint")
      .start("/home/hduser/IdeaProjects/spark2.1/output/")
      //.parquet("/home/hduser/IdeaProjects/spark2.1/output/")
      //.start()
    query.awaitTermination()
  }
}
I referred to the page "How to read a file using sparkstreaming and write to a simple file using Scala?"
but it doesn't work. Please help me, thanks.
You should make sure the checkpoint directory exists (or create it) before you start. You should also assign the streaming query returned by start() to its own val, separate from your DataFrame.
If you don't need it, don't wrap everything in an object with a main(args) method, so that you can directly access other methods on the query, like
query.lastProgress
query.stop
(to name a few).
Change the second half of your code to look like this.
import org.apache.spark.sql.streaming.OutputMode

val csvQueriedDF = csvDF.select("GantryFrom").where("CarTimes > 0")

val query = csvQueriedDF
  .writeStream
  .outputMode(OutputMode.Append())
  .format("parquet")
  .option("checkpointLocation", "/home/hduser/IdeaProjects/spark2.1/partialTempOutput")
  .option("path", "/home/hduser/IdeaProjects/spark2.1/output/")
  .start()
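One small addition: when this runs as a standalone application (rather than in the spark-shell), you will usually also want to block on the query so the process does not exit while the stream is running:
query.awaitTermination()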
Best of Luck!

Reading TSV into Spark Dataframe with Scala API

I have been trying to get the Databricks library for reading CSVs to work. I am trying to read a TSV created by Hive into a Spark DataFrame using the Scala API.
Here is an example that you can run in the spark shell (I made the sample data public so it will work for you):
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val sqlContext = new SQLContext(sc)
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")
The documentation says you can specify the delimiter but I am unclear about how to specify that option.
All of the option parameters are passed in the option() function as below:
val segments = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .load("s3n://michaeldiscenza/data/test_segments")
With Spark 2.0+, use the built-in CSV connector to avoid the third-party dependency and get better performance:
val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
You may also try inferSchema and check the resulting schema:
val df = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("sep", "\t")
  .option("header", "true")
  .load(tmp_loc)

df.printSchema()
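Keep in mind that inferSchema makes an extra pass over the data to sample the types, so for large files it is usually faster to declare the schema up front. A minimal sketch (the column names and types here are placeholders, since the real layout of the file isn't shown):
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Hypothetical column layout; replace with the real TSV columns.
val tsvSchema = StructType(Seq(
  StructField("segment_id", IntegerType),
  StructField("segment_name", StringType)
))

val segments = spark.read
  .option("sep", "\t")
  .option("header", "true") // drop this if the file has no header row
  .schema(tsvSchema)
  .csv("s3n://michaeldiscenza/data/test_segments")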