Create DataFrame with header using header and data file - Scala

I have two files, data.csv and headers.csv. I want to create a DataFrame in Spark/Scala with headers.
var data = spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(data_path)
Can you help me customize the above lines to do this?

You can read headers.csv using the method above, then use the schema of the headers DataFrame to read data.csv, as below:
val headersDF = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("path to headers.csv")

val schema = headersDF.schema

val dataDF = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("path to data.csv")
I hope this answer is helpful.
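If you are on Spark 2.x, the same idea works with the built-in CSV reader, so the Databricks package is not needed. A minimal sketch, assuming a `SparkSession` named `spark` and placeholder paths:

```scala
// Spark 2.x: CSV support is built in, no external format string required.
// Read only the header file to capture the column names (and types, if any).
val headersDF = spark.read
  .option("header", "true")
  .csv("path to headers.csv")

// Reuse that schema when reading the header-less data file.
val dataDF = spark.read
  .schema(headersDF.schema)
  .csv("path to data.csv")
```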

Related

How to read only new files in spark

I'm reading CSV files with Spark and Scala; the files come from another Spark Streaming job. How can I read only the new files?
val df = spark.read
  .schema(test_raw)
  .option("header", "true")
  .option("sep", ",")
  .csv(path)
  .cache()
df.registerTempTable("test")
I resolved the problem by adding a checkpoint on the DataFrame, like this:
val df = spark.read
  .schema(test_raw)
  .option("header", "true")
  .option("sep", ",")
  .csv(path)
  .checkpoint()
  .cache()

Create DataFrame / Dataset using Header and Data in two different directories

I am getting the input as CSV files in two directories: the first directory has one file with the header record, and the second directory has the data files. From these, I want to create a DataFrame/Dataset.
One way I could do this is to create a case class, split the data files by the delimiter, attach the schema, and create the DataFrame.
What I am looking for is a way to read the header file and the data files and create a DataFrame. I saw a solution using Databricks, but my organization restricts the use of Databricks; below is the code I came across. Can anyone help me with a solution that does not use Databricks?
val headersDF = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("path to headers.csv")

val schema = headersDF.schema

val dataDF = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("path to data.csv")
You can do it like this:
val schema = spark
  .read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("C:\\spark\\programs\\empheaders.csv")
  .schema

val data = spark
  .read
  .format("csv")
  .schema(schema)
  .option("delimiter", ",")
  .load("C:\\spark\\programs\\empdata.csv")
Because your header CSV file contains no data, there is no point in inferring a schema from it. Just get the field names by reading it.
val headerRDD = sc.parallelize(Seq("Name,Age,Sal")) // Assume this line is in your header CSV
val header = headerRDD.flatMap(_.split(",")).collect
//headerRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at parallelize
//header: Array[String] = Array(Name, Age, Sal)
Then read the data CSV file.
Map each line either to a case class or to a tuple, then convert the data to a DataFrame by passing the header array.
val dataRdd = sc.parallelize(Seq("Tom,22,500000", "Rick,40,1000000")) // Assume these lines are in your data CSV file
val data = dataRdd.map(_.split(",")).map(x => (x(0), x(1).toInt, x(2).toDouble)).toDF(header: _*)
//dataRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[72] at parallelize
//data: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 1 more field]
Result:
data.show()
+----+---+---------+
|Name|Age| Sal|
+----+---+---------+
| Tom| 22| 500000.0|
|Rick| 40|1000000.0|
+----+---+---------+
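Outside of this toy example, you would read the header and data from the actual files rather than from parallelize. A sketch of the same technique, assuming hypothetical paths and a single-line header file:

```scala
// Needed for toDF on an RDD of tuples (implicit in spark-shell).
import spark.implicits._

// Read the one-line header file and split it into column names.
val header = sc.textFile("path to headers.csv").first.split(",")

// Read the data file, split each line, convert the typed fields,
// and attach the header names via toDF.
val data = sc.textFile("path to data.csv")
  .map(_.split(","))
  .map(x => (x(0), x(1).toInt, x(2).toDouble))
  .toDF(header: _*)
```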

Trying to create Data frame from a file with delimiter '|'

I want to load a text file that uses the delimiter "|" into a DataFrame in Spark.
One way is to create an RDD and use toDF to create the DataFrame, but I was wondering if I can create the DF directly.
As of now I am using the command below:
val productsDF = sqlContext.read.text("/user/danishdshadab786/paper2/products/")
For Spark 2.x:
val df = spark.read.format("csv")
  .option("delimiter", "|")
  .load("/user/danishdshadab786/paper2/products/")
For Spark < 2.0:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .load("/user/danishdshadab786/paper2/products/")
You can add more options, like option("header", "true") for reading headers, in the same statement.
You can specify the delimiter in the 'read' options:
spark.read
  .option("delimiter", "|")
  .csv("/user/danishdshadab786/paper2/products/")

Spark - CSV text loading parsing error

I am using the following code to load a CSV file that has free-form text/notes in it.
val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("parserLib", "UNIVOCITY")
  .load(dataPath)
  .na.drop()
Notes are not in any specific format. During loading I am getting this error:
com.univocity.parsers.common.TextParsingException: Error processing input: null
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'.
I'd appreciate any help. Thanks.
I don't have the privilege to comment on the question, so I'm adding an answer.
Since you are already doing na.drop(), you may use option("mode", "DROPMALFORMED") as well.
val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")
  .option("parserLib", "UNIVOCITY")
  .load(dataPath)
  .na.drop()
By the way, the Databricks spark-csv functionality is built into Spark 2.0+.
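With Spark 2.0+ the same read can therefore be written against the built-in CSV reader. A sketch, keeping the options from the answer above and assuming a SparkSession named spark:

```scala
// Spark 2.0+: no external package needed; univocity is the default parser,
// so the parserLib option can be dropped.
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED") // silently drops rows the parser cannot read
  .csv(dataPath)
  .na.drop()
```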

Save values in spark

I am trying to read and write data from my local folder, but the output is not in the format I expect.
val data = sc.textFile("/user/cts367689/datagen.txt")
val a = data.map(line => (line.split(",")(0).toInt + line.split(",")(4).toInt, line.split(",")(3), line.split(",")(2)))
a.saveAsTextFile("/user/cts367689/sparkoutput")
Output:
(526,female,avil)
(635,male,avil)
(983,male,paracetamol)
(342,female,paracetamol)
(158,female,avil)
How can I save the output as below? I need to remove the brackets.
Expected Result:
526,female,avil
635,male,avil
983,male,paracetamol
342,female,paracetamol
158,female,avil
val a = data.map(
  line =>
    (line.split(",")(0).toInt + line.split(",")(4).toInt) + "," +
    line.split(",")(3) + "," +
    line.split(",")(2)
)
Try doing this instead of returning the values in parentheses: wrapping them in () makes a tuple, and tuples print with the surrounding brackets.
Spark can handle unstructured files, and you are using one of those functions. For CSV (comma-separated values) files there are good libraries that do the same; you can have a look at this link.
For your question, the answer is shown below.
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "false")
    .load("/user/cts367689/datagen.txt");
df.select("id", "gender", "name").write()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("/user/cts367689/sparkoutput");
Use:
val a = data.map(line =>
  line.split(",")(0).toInt + line.split(",")(4).toInt + "," +
  line.split(",")(3) + "," +
  line.split(",")(2)
)
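As a small refinement, each line can be split once instead of five times. A sketch of the same logic:

```scala
// Split each line a single time and reuse the resulting fields,
// building a plain String so no brackets appear in the saved output.
val a = data.map { line =>
  val f = line.split(",")
  (f(0).toInt + f(4).toInt) + "," + f(3) + "," + f(2)
}
```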