Using Scala and SparkSql and importing CSV file with header [duplicate] - scala

This question already has answers here:
Spark - load CSV file as DataFrame?
(14 answers)
Closed 5 years ago.
I'm very new to Spark and Scala(Like two hours new), I'm trying to play with a CSV data file but I cannot do it as I'm not sure how to deal with "Header row", I have searched internet for the way to load it or to skip it but I don't really know how to do that.
I'm pasting my code That I'm using, please help me.
object TaxiCaseOne{
case class NycTaxiData(Vendor_Id:String, PickUpdate:String, Droptime:String, PassengerCount:Int, Distance:Double, PickupLong:String, PickupLat:String, RateCode:Int, Flag:String, DropLong:String, DropLat:String, PaymentMode:String, Fare:Double, SurCharge:Double, Tax:Double, TripAmount:Double, Tolls:Double, TotalAmount:Double)
def mapper(line:String): NycTaxiData = {
val fields = line.split(',')
val data:NycTaxiData = NycTaxiData(fields(0), fields(1), fields(2), fields(3).toInt, fields(4).toDouble, fields(5), fields(6), fields(7).toInt, fields(8), fields(9),fields(10),fields(11),fields(12).toDouble,fields(13).toDouble,fields(14).toDouble,fields(15).toDouble,fields(16).toDouble,fields(17).toDouble)
return data
}def main(args: Array[String]) {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
// Use new SparkSession interface in Spark 2.0
val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.getOrCreate()
val lines = spark.sparkContext.textFile("../nyc.csv")
val data = lines.map(mapper)
// Infer the schema, and register the DataSet as a table.
import spark.implicits._
val schemaData = data.toDS
schemaData.printSchema()
schemaData.createOrReplaceTempView("data")
// SQL can be run over DataFrames that have been registered as a table
val vendor = spark.sql("SELECT * FROM data WHERE Vendor_Id == 'CMT'")
val results = teenagers.collect()
results.foreach(println)
spark.stop()
}
}

If you have a CSV file you should use spark-csv to read the csv files rather than using textFile
val spark = SparkSession.builder().appName("test val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.getOrCreate()
val df = spark.read
.format("csv")
.option("header", "true") //This identifies first line as header
.csv("../nyc.csv")
You need a spark-core and spark-sql dependency to work with this
Hope this helps!

Related

.csv not a SequenceFile error on Select Hive Query

I am quite a newbie to Spark and Scala ;)
Code summary :
Reading data from CSV files --> Creating A simple inner join on 2 Files --> Writing data to Hive table --> Submitting the job on the cluster
Can you please help to identify what went wrong.
The code is not really complex.
The job is executed well on cluster.
Therefore when I try to visualize data written on hive table I am facing issue.
hive> select * from Customers limit 10;
Failed with exception java.io.IOException:java.io.IOException: hdfs://m01.itversity.com:9000/user/itv000666/warehouse/updatedcustomers.db/customers/part-00000-348a54cf-aa0c-45b4-ac49-3a881ae39702_00000.c000 .csv not a SequenceFile
object LapeyreSparkDemo extends App {
//Getting spark ready
val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","Spark for Lapeyre")
//Creating Spark Session
val spark = SparkSession.builder()
.config(sparkConf)
.enableHiveSupport()
.config("spark.sql.warehouse.dir","/user/itv000666/warehouse")
.getOrCreate()
Logger.getLogger(getClass.getName).info("Spark Session Created Successfully")
//Reading
Logger.getLogger(getClass.getName).info("Data loading in DF started")
val ordersSchema = "orderid Int, customerName String, orderDate String, custId Int, orderStatus
String, age String, amount Int"
val orders2019Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv0006666/lapeyrePoc/orders2019.csv")
.load
val newOrder = orders2019Df.withColumnRenamed("custId", "oldCustId")
.withColumnRenamed("customername","oldCustomerName")
val orders2020Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv000666/lapeyrePoc/orders2020.csv")
.load
Logger.getLogger(getClass.getName).info("Data loading in DF complete")
//processing
Logger.getLogger(getClass.getName).info("Processing Started")
val joinCondition = newOrder.col("oldCustId") === orders2020Df.col("custId")
val joinType = "inner"
val joinData = newOrder.join(orders2020Df, joinCondition, joinType)
.select("custId","customername")
//Writing
spark.sql("create database if not exists updatedCustomers")
joinData.write
.format("csv")
.mode(SaveMode.Overwrite)
.bucketBy(4, "custId")
.sortBy("custId")
.saveAsTable("updatedCustomers.Customers")
//Stopping Spark Session
spark.stop()
}
Please let me know in case more information required.
Thanks in advance.
This is the culprit
joinData.write
.format("csv")
Instead used this and it worked.
joinData.write
.format("Hive")
Since I am writing data to hive table (orc format), the format should be "Hive" and not csv.
Also, do not forget to enable hive support while creating spark session.
Also, In spark 2, bucketby & sortby is not supported. Maybe it does in Spark 3.

Spark Scala Write dataframe to MongoDB

I am attempting to write my transformed data frame into MongoDB using this as a guide
https://docs.mongodb.com/spark-connector/master/scala/streaming/
So far, my reading of data frame from MongoDB works perfectly. As shown below.
val mongoURI = "mongodb://000.000.000.000:27017"
val Conf = makeMongoURI(mongoURI,"blog","articles")
val readConfigintegra: ReadConfig = ReadConfig(Map("uri" -> Conf))
val sparkSess = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
.getOrCreate()
// Uses the ReadConfig
val df3 = sparkSess.sqlContext.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.articles")))
However, writing this data frame to MongoDB seems to prove more difficult.
//reads data from mongo and does some transformations
val data = read_mongo()
data.show(20,false)
data.write.mode("append").mongo()
For the last line, I receive the following error.
Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.output.uri' or 'spark.mongodb.output.database' property
This seems confusing to me as I set this within my spark Session in the code blocks above.
val sparkSess = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
.getOrCreate()
Can you spot anything I'm doing wrong?
My answer is pretty much parallels how I read it but uses writeConfig instead.
data.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.vectors")))

Getting all records as NULL while reading a Simple JSON file in Spark scala

I am new to Scala and I am learning Spark with scala.
Problem ->
I am having Simple JSON file having 20 fields and 100 records.
I have created a code which reads the JSON file and save it as csv. So everytime I am reading the JSON file I am getting the null values in the dataframe.
CODE ->
sql_c = SQLContext(SparkContext.getOrCreate())
df = sql_c.read.format("json").load("data.json")
df.show()
I am getting all the records as NULL.
Please help me out and thanks in advance.
Try with the spark session.
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
val df = spark.read.json("data.json")
df.show()

spark dataframe write to file using scala

I am trying to read a file and add two extra columns. 1. Seq no and 2. filename.
When I run spark job in scala IDE output is generated correctly but when I run in putty with local or cluster mode job is stucks at stage-2 (save at File_Process). There is no progress even i wait for an hour. I am testing on 1GB data.
Below is the code i am using
object File_Process
{
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("yarn")
.appName("File_Process")
.getOrCreate()
def main(arg:Array[String])
{
val FileDF = spark.read
.csv("/data/sourcefile/")
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
val query = dataframefinal.write
.mode("overwrite")
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.save("/data/text_file/")
spark.stop()
}
If I remove logic to add seq_no, code is working fine.
code for creating seq no is
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow =>Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
Thanks in advance.

Reading TSV into Spark Dataframe with Scala API

I have been trying to get the databricks library for reading CSVs to work. I am trying to read a TSV created by hive into a spark data frame using the scala api.
Here is an example that you can run in the spark shell (I made the sample data public so it can work for you)
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val sqlContext = new SQLContext(sc)
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")
The documentation says you can specify the delimiter but I am unclear about how to specify that option.
All of the option parameters are passed in the option() function as below:
val segments = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load("s3n://michaeldiscenza/data/test_segments")
With Spark 2.0+ use the built-in CSV connector to avoid third party dependancy and better performance:
val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
You May also try to inferSchema and check for schema.
val df = spark.read.format("csv")
.option("inferSchema", "true")
.option("sep","\t")
.option("header", "true")
.load(tmp_loc)
df.printSchema()