Scala: How to merge multiple CSV files into a data frame - scala

I am writing the code below to read a CSV file into an RDD. I want to union multiple CSV files and store the result in a single RDD variable. I am able to store the data of one CSV file in an RDD; kindly help me union multiple CSV files and store them in a single RDD variable.
val Rdd = spark.sparkContext.textFile("File1.csv").map(_.split(","))
I am expecting something like
val Rdd = spark.sparkContext.textFile("File1.csv").map(_.split(",")) union spark.sparkContext.textFile("File2.csv").map(_.split(","))

If you have a large number of files, I would suggest:
val rdd = List("file1", "file2", "file3", "file4", "file5")
.map(spark.sparkContext.textFile(_))
.reduce(_ union _)
Or if you only know that you have zero or more files:
val rdd = getListOfFilenames()
.map(spark.sparkContext.textFile(_))
.foldLeft(spark.sparkContext.emptyRDD[String])(_ union _)
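As a side note, SparkContext.textFile also accepts a comma-separated list of paths (and glob patterns), so for a small, fixed set of files something like this sketch would also work (the file names here are just placeholders):
val rdd = spark.sparkContext
  .textFile("File1.csv,File2.csv,File3.csv") // all listed files end up in one RDD
  .map(_.split(","))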

Related

Spark scala- How to apply transformation logic on a generic set of columns defined in a file

I am using Spark (Scala) version 1.6.
I have 2 files: one is a schema file which has hundreds of column names separated by commas, and the other is a .gz file which contains the data.
I am trying to read the data using the schema file and apply different transformation logic on a few of the columns.
I tried running some sample code, but I have hardcoded the column numbers in the attached pic.
I also want to write a UDF which could read any set of columns, apply a transformation such as replacing a special character, and give the output.
Appreciate any suggestions.
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf

val rdd1 = sc.textFile("../inp2.txt")
val rdd2 = rdd1.map(line => line.split("\t")(1)).toDF
val replaceUDF = udf { s: String => s.replace(".", "") }
rdd2.withColumn("replace", replaceUDF('_1)).show
You can read the field-name file with plain Scala code and create a list of column names as follows:
import scala.io.Source
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// this reads the schema file and creates a list of column names
val line = Source.fromFile("path to file").getLines().toList.head
val columnNames = line.split(",")

// read the text file as an RDD of Rows and convert it to a DataFrame with that schema
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))
val rowRdd = sc.textFile("../inp2.txt").map(line => Row.fromSeq(line.split("\t")))
val df = sqlContext.createDataFrame(rowRdd, schema)
This creates a dataframe whose column names come from the separate schema file.
Hope this helps!
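For the generic UDF part, here is a rough sketch: a UDF that strips a special character, applied to an arbitrary list of columns with foldLeft (the names in columnsToClean are just placeholders for whichever columns you want to transform):
import org.apache.spark.sql.functions.{udf, col}

val replaceUDF = udf { s: String => if (s == null) null else s.replace(".", "") }
val columnsToClean = Seq("colA", "colB") // placeholder column names
val cleaned = columnsToClean.foldLeft(df) { (tmp, c) =>
  tmp.withColumn(c, replaceUDF(col(c))) // overwrite each listed column with its cleaned value
}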

Spark Scala reduceByKey dataframe operation

I'm trying to do a count in Scala with a dataframe. My data has 3 columns and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a dataframe, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b") // give the columns names for ease of use
df
  .groupBy('a)
  .count()
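If the goal is actually the reduceByKey(_ + _) sum rather than a row count, the DataFrame equivalent would be along these lines (a sketch, assuming the values to sum live in column "b"):
import org.apache.spark.sql.functions.sum

df
  .groupBy('a)
  .agg(sum("b"))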

Spark scala copying dataframe column to new dataframe

I have an empty dataframe with schema already created.
I'm trying to add columns from a new dataframe to the existing columns of this dataframe in a for loop.
k schema - |ID|DATE|REPORTID|SUBMITTEDDATE|
for(data <- 0 to range-1){
val c = df2.select(substring(col("value"), str(data)._2, str(data)._3).alias(str(data)._1)).toDF()
//c.show()
k = c.withColumn(str(data)._1, c(str(data)._1))
}
k.show()
But the k dataframe has just one column, while it should have all 4 columns populated with values.
I think the last line in the for loop is replacing existing columns in the dataframe.
Can somebody help me with this?
Thanks!!
Add your logic and conditions and create a new dataframe:
val dataframe2 = dataframe1.select("A", "B", "C")
Copying just a few columns of one dataframe to another is not directly possible in Spark.
There are, however, a few alternatives to achieve the same result:
1. Join both dataframes on some join condition.
2. Convert both dataframes to JSON and do an RDD union:
val rdd = df1.toJSON.union(df2.toJSON)
val dfFinal = spark.read.json(rdd)
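A sketch of option 1 (the "ID" join key and the selected columns here are placeholders; use whatever key actually relates the two dataframes):
val dfFinal = df1.join(df2.select("ID", "REPORTID"), Seq("ID")) // inner join on the shared key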

Unwanted quotes appearing in data while saving dataframe as file

I first read a delimited file with multiple rows and index the rows using zipWithIndex.
Next I'm trying to write that dataframe, which is created from an RDD[Row], to a CSV file using Scala.
This is my code :
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}

val FileDF = spark.read.csv(inputfilepath)
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)).++(FileDF.schema.fields))
val dataframenew = spark.createDataFrame(rdd, FileDFWithSeqNo)
dataframenew.write.format("com.databricks.spark.csv").option("delimiter", "|").save("C:\\Users\\path\\Desktop\\IndexedOutput")
where dataframenew is the final dataframe.
Input data is like :
0|0001|10|1|6001825851|0|0|0000|0|003800543||2017-03-02 00:00:00|95|O|473|3.74|0.05|N|||5676|6001661630||473|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230979
0|0001|10|1|6001825853|0|0|0000|0|003811455||2017-03-02 00:00:00|95|O|90|15.14|0.55|N|||1080|6001661630||90|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230980
0|0001|10|1|6001825854|0|0|0000|0|003812898||2017-03-02 00:00:00|95|O|15|7.60|1.33|N|||720|6001661630||15|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230981
I'm zipping with index to get a unique identifier for each row.
But this gives me an output file with data like :
1001,"0|0001|10|1|6001825851|0|0|0000|PS|0|0.0000||0||20170303|793358|793358115230979",cabc
While the expected output should be :
1001,0|0001|10|1|6001825851|0|0|0000|PS|0|0.0000||0||20170303|793358|793358115230979,cabc
Why are the extra quotes getting added into the data and how can I eliminate this?
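Most likely the whole pipe-delimited line is being read into a single column (spark.read.csv defaults to a comma delimiter), so when it is written back out with "|" as the delimiter the writer quotes any field that already contains "|". One possible workaround, sketched below, is to disable quoting on write by setting the quote character to something that never occurs in the data (worth verifying against your spark-csv version):
dataframenew.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("quote", "\u0000") // effectively disables quoting
  .save("C:\\Users\\path\\Desktop\\IndexedOutput")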

How can I save an RDD into HDFS and later read it back?

I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD into the HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?
It is possible.
An RDD has saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can later parse them.
Reading can be done with the textFile function from SparkContext and then a .map to strip the parentheses and parse the values.
So:
Version 1:
rdd.saveAsTextFile("hdfs:///test1/")
// later, in other program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map (x => {
// here remove () and parse long / strings
})
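A sketch of that parsing step, assuming each line looks like "(123,some text)":
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { x =>
  val stripped = x.stripPrefix("(").stripSuffix(")") // remove the surrounding parentheses
  val Array(k, v) = stripped.split(",", 2)           // split on the first comma only
  (k.toLong, v)
}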
Version 2:
rdd.saveAsObjectFile("hdfs:///test1/")
// later, in other program - watch, you have tuples out of the box :)
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/part-*")
I would recommend using a DataFrame if your RDD is in a tabular format. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
A DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query, whereas an RDD (Resilient Distributed Dataset) is more of a black box or core abstraction of data that cannot be optimized.
You can go from a DataFrame to an RDD via the rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
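A quick round-trip sketch (assuming the (Long, String) RDD from the question and that the SQL implicits are in scope):
import sqlContext.implicits._

val df = rdd.toDF("id", "value")                                // RDD[(Long, String)] -> DataFrame
val backToRdd = df.rdd.map(r => (r.getLong(0), r.getString(1))) // DataFrame -> RDD[(Long, String)]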
The following is an example of creating and storing a DataFrame in CSV and Parquet format in HDFS:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("Spark-HDFS-Read-Write")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // needed for toDF below
val hdfs = "hdfs:///"
val df = Seq((1, "Name1")).toDF("id", "name")
// Writing file in CSV format
df.write.format("com.databricks.spark.csv").mode("overwrite").save(hdfs + "user/hdfs/employee/details.csv")
// Writing file in PARQUET format
df.write.format("parquet").mode("overwrite").save(hdfs + "user/hdfs/employee/details")
// Reading CSV files from HDFS
val dfIncsv = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load(hdfs + "user/hdfs/employee/details.csv")
// Reading PARQUET files from HDFS
val dfInParquet = sqlContext.read.parquet(hdfs + "user/hdfs/employee/details")