RDD data into multiple rows in spark-scala

I have a fixed-width text file (sample) with data
2107abc2018abn2019gfh
where all the rows' data is combined into a single row.
I need to read the text file, split the data according to a fixed row length of 7,
generate multiple rows, and store them in an RDD:
2107abc
2018abn
2019gfh
where 2107 is one column and abc is another column.
Will the logic be applicable for a huge data file, like 1 GB or more?

I'm assuming that you have an RDD[String] and want to extract both columns from your data. First you can split the line at length 7 and then again at length 4; that gives you the columns separated. Below is the code for the same.
//creating a sample RDD from the given string
val rdd = sc.parallelize(Seq("""2107abc2018abn2019gfh"""))
//first split at length 7, then split each 7-character chunk at length 4 to separate the two columns
val res = rdd.flatMap(_.grouped(7).map(_.grouped(4).toSeq)).map(x => (x(0), x(1)))
//print the rdd
res.foreach(println)
//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
If you want, you can also convert your RDD to a dataframe for further processing.
//convert to DF
val df = res.toDF("col1","col2")
//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
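The question also asks about reading an actual file and scaling to 1 GB or more. A minimal sketch of the same split logic applied to a file read with sc.textFile (the path is hypothetical):
//sc.textFile reads the file as a distributed RDD[String], one element per line
val fileRdd = sc.textFile("/path/to/fixed_width_sample.txt")
//same 7-then-4 split as above, applied to every line
val parsed = fileRdd.flatMap(_.grouped(7)).map(chunk => (chunk.take(4), chunk.drop(4)))
parsed.foreach(println)
The work stays distributed across partitions, so the file size itself is not a problem; the one caveat is that each physical line is handled inside a single task, so if the whole file really is one line (as in the sample), that line is processed by a single executor.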

Related

Convert a DataFrame to RDD and Split the RDD into the same number of Columns as DataFrame Dynamically

I am trying to convert a DataFrame into an RDD and split it into a specific number of columns, based on the number of columns in the DataFrame, dynamically and elegantly.
i.e.
This is sample data from a Hive table named employee:
Id Name Age State City
123 Bob 34 Texas Dallas
456 Stan 26 Florida Tampa
val temp_df = spark.sql("Select * from employee")
val temp2_rdd = temp_df.rdd.map(x => (x(0), x(1), x(2), x(3)))
I am looking to generate temp2_rdd dynamically, based on the number of columns from the table.
It should not be hard-coded as I did.
Since the maximum size of a tuple in Scala is 22, is there any other collection that can hold the RDD efficiently?
Coding Language : Spark Scala
Please advise.
Instead of extracting and transforming each element by index, you can use the toSeq method of the Row object.
val temp_df = spark.sql("Select * from employee")
// RDD[List[Any]]
val temp2_rdd = temp_df.rdd.map(_.toSeq.toList)
// RDD[List[String]]
val temp3_rdd = temp_df.rdd.map(_.toSeq.map(_.toString).toList)
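As a quick usage sketch (assuming the employee table shown above), the list-based rows can then be processed without knowing the column count in advance:
//each record is a List whose length matches the number of table columns
temp3_rdd.foreach(row => println(row.mkString(",")))
//123,Bob,34,Texas,Dallas
//456,Stan,26,Florida,Tampa
//positional access still works for any number of columns
val ids = temp3_rdd.map(_.head)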

How to add a list or array of strings as a column to a Spark Dataframe

So, I have n number of strings that I can keep either in an array or in a list like this:
val checks = Array("check1", "check2", "check3", "check4", "check5")
val checks: List[String] = List("check1", "check2", "check3", "check4", "check5")
Now, I have a spark dataframe df and I want to add a column with the values present in this List/Array. (It is guaranteed that the number of items in my List/Array will be exactly equal to the number of rows in the dataframe, i.e. n.)
I tried doing:
df.withColumn("Value", checks)
But that didn't work. What would be the best way to achieve this?
You need to add it as an array column as follows:
val df2 = df.withColumn("Value", array(checks.map(lit):_*))
If you want a single value for each row, you can get the array element:
import org.apache.spark.sql.expressions.Window

val df2 = df.withColumn("Value", array(checks.map(lit): _*))
  .withColumn("rn", row_number().over(Window.orderBy(lit(1))) - 1)
  .withColumn("Value", expr("Value[rn]"))
  .drop("rn")

Scala Spark: Order changes when writing a DataFrame to a CSV file

I have two dataframes which I am merging using union. After performing the union, printing the final dataframe with df.show() shows that the records are in the intended order (the first dataframe's records on top, followed by the second dataframe's records). But when I write this final dataframe to a CSV file, the records from the first dataframe, which I want at the top of the CSV file, lose their position: the first dataframe's records get mixed with the second dataframe's records. Any help would be appreciated.
Below is the code sample:
val intVar = 1
val myList = List(("hello",intVar))
val firstDf = myList.toDF()
val secondDf: DataFrame = testRdd.toDF()
val finalDF = firstDf.union(secondDf)
finalDF.show() // prints the dataframe with firstDf records on the top followed by the secondDf records
val outputFilePath = "/home/out.csv"
finalDF.coalesce(1).write.csv(outputFilePath) //the firstDf records are getting mixed with the secondDf records
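A common way to make the intended order survive the write (a sketch, not necessarily how the original poster solved it) is to tag each source dataframe before the union and sort on that tag just before writing:
import org.apache.spark.sql.functions.lit
//tag each source so the intended order can be recovered after the union
val taggedFirst = firstDf.withColumn("src_order", lit(1))
val taggedSecond = secondDf.withColumn("src_order", lit(2))
taggedFirst.union(taggedSecond)
  .orderBy("src_order")   //firstDf rows first, then secondDf rows
  .drop("src_order")
  .coalesce(1)
  .write.csv(outputFilePath)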

How to convert Array[String] into spark Dataframe to save CSV file format? [duplicate]

This question already has answers here: How to create DataFrame from Scala's List of Iterables?
Code that I'm using to parse the CSV
val listOfNames = List("Ramesh", "Suresh", "Ganesh") //the list of names is built dynamically
val separator = listOfNames.mkString(",") //produces a single string: "Ramesh,Suresh,Ganesh"
sc.parallelize(Array(separator)).toDF().write.csv("path")
Getting output:
"Ramesh,Suresh,Ganesh" //hence the entire list ends up in a single column of the CSV
Expected output:
Ramesh, Suresh, Ganesh //each name in a separate column of the CSV
The output should be a single row, with each string in its own column, comma separated.
If I try to change anything, I get the error "CSV data sources do not support array of string data type".
How do I solve this?
If you are looking to convert your list of size n into a Spark dataframe that holds n rows with a single column, the solution looks like this:
import sparkSession.sqlContext.implicits._
val listOfNames = List("Ramesh","Suresh","Ganesh")
val df = listOfNames.toDF("names")
df.show(false)
output:
+------+
|names |
+------+
|Ramesh|
|Suresh|
|Ganesh|
+------+
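The question, however, asks for a single row with each name in its own column. A minimal sketch of that variant (the col0, col1, ... column names are made up for illustration):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val listOfNames = List("Ramesh", "Suresh", "Ganesh")
//one StringType column per element of the list
val schema = StructType(listOfNames.indices.map(i => StructField(s"col$i", StringType)))
val singleRowDf = spark.createDataFrame(sc.parallelize(Seq(Row.fromSeq(listOfNames))), schema)
singleRowDf.write.csv("path") //writes one line: Ramesh,Suresh,Ganesh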

How to divide dataset in two parts based on filter in Spark-scala

Is it possible to divide a DF into two parts using a single filter operation? For example,
let's say df has the records below:
UID Col
1 a
2 b
3 c
If I do
val df1 = df.filter($"uid" <=> 2)
can I save the filtered and non-filtered records into different RDDs in a single operation?
df1 would have the records where uid = 2,
df2 would have the records with uid 1 and 3.
If you're interested only in saving data, you can add an indicator column to the DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6 these are Parquet, text, and JSON):
dfWithInd.write.partitionBy("ind").parquet(...)
It will create two separate directories (ind=false, ind=true) on write.
In general though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See How to split a RDD into two or more RDDs?
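If the goal is instead to keep both halves as separate DataFrames in the program, the usual workaround is two complementary filters over a cached source; a minimal sketch:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
//cache so the source is computed only once even though it is filtered twice
df.cache()
val matched = df.filter($"uid" <=> 2)       //uid = 2
val notMatched = df.filter(!($"uid" <=> 2)) //uid 1 and 3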