Unwanted quotes appearing in data while saving dataframe as file - scala

I first read a delimited file and index its rows using zipWithIndex. Next, I'm trying to write the dataframe created from the resulting RDD[Row] to a csv file using Scala.
This is my code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val FileDF = spark.read.csv(inputfilepath)
// prepend (index + SEED + 1) as the first column; SEED is a numeric offset defined elsewhere
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2 + SEED + 1) +: indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(StructField("UniqueRowIdentifier", LongType) +: FileDF.schema.fields)
val dataframenew = spark.createDataFrame(rdd, FileDFWithSeqNo)
dataframenew.write.format("com.databricks.spark.csv").option("delimiter", "|").save("C:\\Users\\path\\Desktop\\IndexedOutput")
where dataframenew is the final dataframe.
The input data looks like this:
0|0001|10|1|6001825851|0|0|0000|0|003800543||2017-03-02 00:00:00|95|O|473|3.74|0.05|N|||5676|6001661630||473|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230979
0|0001|10|1|6001825853|0|0|0000|0|003811455||2017-03-02 00:00:00|95|O|90|15.14|0.55|N|||1080|6001661630||90|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230980
0|0001|10|1|6001825854|0|0|0000|0|003812898||2017-03-02 00:00:00|95|O|15|7.60|1.33|N|||720|6001661630||15|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230981
I'm zipping with index to get a unique identifier for each row.
But this gives me an output file with data like:
1001,"0|0001|10|1|6001825851|0|0|0000|PS|0|0.0000||0||20170303|793358|793358115230979",cabc
While the expected output should be:
1001,0|0001|10|1|6001825851|0|0|0000|PS|0|0.0000||0||20170303|793358|793358115230979,cabc
Why are the extra quotes getting added into the data and how can I eliminate this?
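Most likely the quotes come from the CSV writer itself: a field is wrapped in quotes whenever its value contains the output delimiter or the quote character, and since the data here is full of | characters, those fields get quoted on write. One possible way to suppress this (a sketch; option support varies across spark-csv versions, so treat it as something to try rather than a guaranteed fix) is to disable quoting when writing:

// sketch: ask the writer never to wrap field values in quotes
dataframenew.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("quoteMode", "NONE")
  .save("C:\\Users\\path\\Desktop\\IndexedOutput")

Some versions instead honor setting the quote character itself to one that never occurs in the data, e.g. .option("quote", "\u0000"), to the same effect.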

Related

Scala Spark: Order changes when writing a DataFrame to a CSV file

I have two data frames which I am merging using union. After performing the union, printing the final dataframe with df.show() shows the records in the intended order (the first dataframe's records on top, followed by the second dataframe's records). But when I write this final data frame to the csv file, the records from the first data frame, which I want at the top of the csv file, lose their position: the first data frame's records get mixed with the second dataframe's records. Any help would be appreciated.
Below is the code sample:
import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for toDF (assuming a Spark 2.x spark-shell session)

val intVar = 1
val myList = List(("hello", intVar))
val firstDf = myList.toDF()
val secondDf: DataFrame = testRdd.toDF() // testRdd is defined elsewhere
val finalDF = firstDf.union(secondDf)
finalDF.show() // prints the dataframe with firstDf records on the top followed by the secondDf records
val outputFilePath = "/home/out.csv"
finalDF.coalesce(1).write.csv(outputFilePath) // the first Df records are getting mixed with the second Df records
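Neither union nor coalesce(1) gives Spark an ordering obligation to honor on write, so one possible workaround (a sketch; sourceOrder is a hypothetical helper column introduced only for ordering) is to make the intended order explicit and sort on it before writing:

import org.apache.spark.sql.functions.lit

// tag each source with an explicit sort key, union, sort, then drop the key
val orderedDF = firstDf.withColumn("sourceOrder", lit(1))
  .union(secondDf.withColumn("sourceOrder", lit(2)))
  .sort("sourceOrder")
  .drop("sourceOrder")
orderedDF.coalesce(1).write.csv(outputFilePath)

Even then it is worth verifying the written order on your Spark version, since a CSV write makes no formal ordering guarantee.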

Spark scala- How to apply transformation logic on a generic set of columns defined in a file

I am using Spark (Scala) version 1.6.
I have 2 files: one is a schema file which has hundreds of column names separated by commas, and the other is a .gz file which contains the data.
I am trying to read the data using the schema file and apply different transformation logic on a small set of columns.
I tried running some sample code, but I have hardcoded the column numbers in it (see below).
I also want to write a UDF which could take any set of columns and apply a transformation, like replacing a special character, and give the output.
Any suggestions are appreciated.
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._ // needed for toDF and the '_1 column syntax

val rdd1 = sc.textFile("../inp2.txt")
// column index 1 is hardcoded here; this is what I want to make generic
val rdd2 = rdd1.map(line => line.split("\t")(1)).toDF
val replaceUDF = udf { s: String => s.replace(".", "") }
rdd2.withColumn("replace", replaceUDF('_1)).show
You can read the field-name file with plain Scala code and create a list of column names:
import scala.io.Source

// this reads the file and creates a list of column names
val line = Source.fromFile("path to file").getLines().toList.head
val columnNames = line.split(",")
// read the text file as an RDD of Rows and build a DataFrame with that schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val rdd1 = sc.textFile("../inp2.txt")
val rowRdd = rdd1.map(line => Row.fromSeq(line.split("\t")))
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))
val df = sqlContext.createDataFrame(rowRdd, schema)
This creates a dataframe with columns names that you have in a separate file.
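To then apply a transformation to a configurable set of those columns, one option (a sketch; columnsToClean is a hypothetical list of column names, which could be read from a file in the same way as columnNames) is to fold over the list with withColumn:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// replaces "." in every column named in columnsToClean, leaving the other columns untouched
val replaceUDF = udf { s: String => if (s == null) null else s.replace(".", "") }
def cleanColumns(df: DataFrame, columnsToClean: Seq[String]): DataFrame =
  columnsToClean.foldLeft(df)((acc, c) => acc.withColumn(c, replaceUDF(col(c))))

// e.g. val cleaned = cleanColumns(df, Seq("col1", "col2"))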
Hope this helps!

Scala: How to merge the multiple CSV files in data frame

I am writing the code below to load a csv file into an RDD. I want to union multiple csv files and store the result in a single RDD variable. I am able to store the data of one csv file in an RDD; kindly help me with how to union multiple csv files and store them in a single RDD variable.
val Rdd = spark.sparkContext.textFile("File1.csv").map(_.split(","))
I am expecting something like
val Rdd = spark.sparkContext.textFile("File1.csv").map(_.split(",")) union spark.sparkContext.textFile("File2.csv").map(_.split(","))
If you have a large number of files, I would suggest:
val rdd = List("file1", "file2", "file3", "file4", "file5")
.map(spark.sparkContext.textFile(_))
.reduce(_ union _)
Or if you only know you have 0 or more files:
val rdd = getListOfFilenames()
.map(spark.sparkContext.textFile(_))
.foldLeft(spark.sparkContext.emptyRDD[String])(_ union _)
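As a side note, textFile also accepts a comma-separated list of paths (and glob patterns), so for a small, fixed set of files the union can be avoided altogether, for example:

// comma-separated paths and globs follow Hadoop FileInputFormat semantics
val rdd = spark.sparkContext.textFile("File1.csv,File2.csv").map(_.split(","))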

Obtaining one column of a RDD[Array[String]] and converting it to dataset/dataframe

I have a .csv file that I read in to a RDD:
val dataH = sc.textFile(filepath).map(line => line.split(",").map(elem => elem.trim))
I would like to iterate over this RDD in order and compare adjacent elements; the comparison depends on only one column of the data structure. Since it is not possible to iterate over an RDD directly, the idea is to first convert that column of the RDD to either a Dataset or a DataFrame.
You can convert an RDD to a Dataset like this (which doesn't work if my structure is RDD[Array[String]]):
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(conf) // conf is a SparkConf created elsewhere
val sqc = new SQLContext(sc)
import sqc.implicits._
val lines = sqc.createDataset(dataH)
How do I obtain just the one column that I am interested in from dataH and thereafter create a dataset just from it?
I am using Spark 1.6.0.
You can just map your Array to the desired index, e.g. :
dataH.map(arr => arr(0)).toDF("col1")
Or, safer (lift avoids an exception when the index is out of bounds; getOrElse supplies an empty string in that case):
dataH.map(arr => arr.lift(0).getOrElse("")).toDF("col1")

Timestamp issue when loading CSV to dataframe

I am trying to load a csv file into a distributed dataframe (ddf) while supplying a schema. The ddf gets loaded, but the timestamp column shows only null values. I believe this happens because Spark expects the timestamp in a particular format. So I have two questions:
1) How do I give Spark the format, or make it detect the format (like "MM/dd/yyyy' 'HH:mm:ss")?
2) If 1 is not an option, how do I convert the field (assuming I imported it as a String) to a timestamp?
For Q2 I have tried the following:
def convert(row: org.apache.spark.sql.Row): org.apache.spark.sql.Row = {
  import org.apache.spark.sql.Row
  val format = new java.text.SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  // getTimestamp (not shown here) is meant to parse the string in column 3 using `format`
  val d1 = getTimestamp(row(3))
  Row(row(0), row(1), row(2), d1)
}
val rdd1 = df.map(convert)
val df1 = sqlContext.createDataFrame(rdd1,schema1)
The last step doesn't work, as there are null values which don't let it finish. I get errors like:
java.lang.RuntimeException: Failed to check null bit for primitive long value.
The sqlContext.load, however, is able to load the csv without any problems.
val df = sqlContext.load("com.databricks.spark.csv", schema, Map("path" -> "/path/to/file.csv", "header" -> "true"))
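For Q2, one possible approach (a sketch; eventTime is a placeholder column name, and it assumes the column is declared as a String in the schema) is to convert the string column with unix_timestamp, which takes a SimpleDateFormat pattern and has been available since Spark 1.5:

import org.apache.spark.sql.functions.{col, unix_timestamp}

// unix_timestamp parses the string with the given pattern (yielding null when parsing fails);
// the cast turns the resulting epoch seconds into a proper timestamp column
val withTs = df.withColumn("eventTime", unix_timestamp(col("eventTime"), "MM/dd/yyyy HH:mm:ss").cast("timestamp"))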