Reading CSV files to create a DataFrame - Scala

I am trying to read a CSV file to create a DataFrame (https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html)
Using:
spark-1.3.1-bin-hadoop2.6
spark-csv_2.11-1.1.0
Code:
import org.apache.spark.sql.SQLContext

object test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.csvFile("filename.csv")
    ...
  }
}
Error:
value csvFile is not a member of org.apache.spark.sql.SQLContext
I was trying to do as advised here: Spark - load CSV file as DataFrame?
But sqlContext doesn't seem to recognize the csvFile method of the CsvContext class.
Any advice would be appreciated!

I am also facing some issues with CSV (without spark-csv), but here are a few things you can check:
Build the Spark shell with the spark-csv library using sbt assembly.
Add the spark-csv dependency to the pom.xml of your Maven project.
Use the load/save methods of the DataFrame API.
SPARK-CSV GITHUB
Refer to the spark-csv GitHub README.md page and you will be up and running :)
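For reference, a minimal sketch of the load approach on Spark 1.3.x, assuming the spark-csv package is on the classpath (for example via --packages) and that its Scala version matches your Spark build:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Spark 1.3 generic load method: name the data source and pass its options
    val df = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "filename.csv", "header" -> "true"))

    df.printSchema()
  }
}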

Related

Loop and process multiple HDFS files in spark/scala

I have multiple files in my HDFS folder and I want to loop over them and run my Scala transformation logic on each one.
I am using the script below, which works fine in my development environment with local files, but it fails when I run it against my HDFS environment. Any idea where I am going wrong?
val files = new File("hdfs://172.X.X.X:8020/landing/").listFiles.map(_.getName).toList
files.foreach { file =>
  print(file)
  val event = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/" + file)
  event.show(false)
}
Can someone correct it or suggest alternative solution please.
You should use the Hadoop FileSystem API to list HDFS files; java.io.File only sees the local filesystem.
Code:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Get a FileSystem handle for the HDFS namenode
val fs = FileSystem.get(new URI("hdfs://172.X.X.X:8020/"), spark.sparkContext.hadoopConfiguration)

// List everything under /landing and read each file as JSON
fs.globStatus(new Path("/landing/*")).toList.foreach { f =>
  val event = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/" + f.getPath.getName)
  event.show(false)
}
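Alternatively, Spark's readers accept glob patterns, so you could skip the explicit listing and read the whole directory in one call (a sketch, assuming all files share the same JSON layout):
val events = spark.read
  .option("multiline", "true")
  .json("hdfs://172.X.X.X:8020/landing/*")

events.show(false)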

Execute code scala from spark in Zeppelin

I would like to run Scala code in Zeppelin on a Spark cluster.
For example:
This is the code of "HelloWorldScala.scala", stored on HDFS:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HelloWorldScala {
  def main(arg: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("myApp_Enrico")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    val aList = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val aRdd = spark.sparkContext.parallelize(aList)
    println("********* HELLO WORLD AND HELLO SPARK!! ******")
    println("Print even numbers")
    aRdd.filter(x => x % 2 == 0).map(x => x * 2).collect().foreach(println)
  }
}
I would like to import the HelloWorldScala file in Zeppelin and run its main, but I see the error:
Error code Zeppelin
Unfortunately, you can't import a single file in Zeppelin. You can pack your Scala files into a .jar library and add it via the spark.jars property (set in the Spark interpreter settings); after that you can import your library with a line like import your.library.packages.YourClass and use its non-private functions. If you are not familiar with jar packaging and the spark.jars property, just read a bit more about them first.
UPDATE: alternatively, load the artifact dynamically with the %dep interpreter and then import it:
%dep
z.load("your_package_group:artifact:version")
%spark
import com.yourpackage.HelloWorldScala
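Having imported the class, you can then call its main method from the same paragraph (a sketch, reusing the import above):
HelloWorldScala.main(Array.empty[String])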

spark error: spark.read.format("org.apache.spark.csv")

I am getting the following error after running this command from spark-shell:
scala> val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswas7917/src_files/movies_data_srcfile_sess06_01.csv")
<console>:21: error: not found: value spark
val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswas7917/src_files/movies_data_srcfile_sess06_01.csv")
Do I need to import something explicitly?
Please help with the complete command set.
Thanks.
It seems like you are using an old version of Spark. You need to use Spark 2.x or higher and import the implicits as
import spark.implicits._
And then
val df1 = spark.read.format("csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("path")
You aren't even getting a SparkSession. It seems you are using an older version of Spark, so you should use the SQLContext, and you also need to include the external Databricks CSV library when you start spark-shell...
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
and then from within the spark shell...
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")
You can see more info about it here

How to read a file using sparkstreaming and write to a simple file using Scala?

I'm trying to read a file using a Scala Spark Streaming program. The file is stored in a directory on my local machine, and I am trying to write it out as a new file, also on my local machine. But whenever I write my stream and store it as Parquet, I end up with empty folders.
This is my code :
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("local[*]")
.appName("StreamAFile")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
import spark.implicits._
val schemaforfile = new StructType().add("SrNo",IntegerType).add("Name",StringType).add("Age",IntegerType).add("Friends",IntegerType)
val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")
file.writeStream.format("parquet").start("C:\\Users\\roswal01\\Desktop\\streamed")
spark.stop()
Is there anything missing in my code, or anything I've got wrong?
I also tried reading the file from an HDFS location, but the same code ends up not creating any output folders on HDFS either.
You've made a mistake here:
val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")
The csv() function should take a directory path as an argument. It will scan that directory and read all new files as they are moved into it.
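For example, a sketch pointing readStream at a directory to watch (the directory name here is just an assumption):
val file = spark.readStream
  .schema(schemaforfile)
  .csv("C:\\SparkScala\\input\\") // a directory to watch, not a single file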
For checkpointing, you should add
.option("checkpointLocation", "path/to/HDFS/dir")
For example:
val query = file.writeStream.format("parquet")
.option("checkpointLocation", "path/to/HDFS/dir")
.start("C:\\Users\\roswal01\\Desktop\\streamed")
query.awaitTermination()

Scala Word count jar is not running in spark

I'm very new to both Scala and Spark. I added the Scala IDE to Eclipse Luna and created a Maven project in Eclipse. I was able to run the program within Eclipse using the Run As configuration option and got the output successfully. But when I create a jar for the following program and try to run it from the Spark shell, I get the following error.
error: ';' expected but 'class' found.
Command used to run the jar:
spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local /home/cloudera/workspace/sparkwc1.jar hdfs://localhost:8020/kirthi3/dataset.txt hdfs://localhost:8020/kirthi3/sparkwco
The word count program which I tried:
package com.kirthi.spark.proj.sparkexamples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordsCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Word Count")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(args(0))
    val words = textFile.flatMap(line => line.split(","))
    val counts = words.map(word => (word, 1))
    val wordcount = counts.reduceByKey(_ + _)
    val wordcount_sorted = wordcount.sortByKey()
    wordcount_sorted.foreach(println)
    wordcount_sorted.saveAsTextFile(args(1))
  }
}
Kindly help me out with this, as I am stuck on my first Spark program.
I am using cloudera quickstart CDH 5.5
As shown in the comments, you ran the above command at the Scala REPL prompt, whereas you should run it from a regular Linux shell.
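For example (a sketch; exit the REPL first, then run spark-submit from the shell prompt):
scala> :quit

$ spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local \
    /home/cloudera/workspace/sparkwc1.jar \
    hdfs://localhost:8020/kirthi3/dataset.txt \
    hdfs://localhost:8020/kirthi3/sparkwco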