Spark Scala read CSV which has a comma in the data - scala

My CSV file, which is inside a zip file, contains the following data:
"Potter, Jr",Harry,92.32,09/09/2018
John,Williams,78,01/02/1992
I read it using the Spark Scala CSV reader. If I use
.option("quote", "\"")
.option("escape", "\"")
I don't get a fixed number of columns in the output: line 1 yields 5 columns and line 2 yields 4. The desired output should have 4 columns only. Is there any way to read it as a DF or RDD?
Thanks,
Ash

For the given input data, I was able to read the data using:
val input = spark.read.csv("input_file.csv")
This gave me a DataFrame with 4 string columns.
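For reference, a minimal sketch (assuming a SparkSession named spark and the sample data unzipped to input_file.csv; the column names are illustrative, not from the post) that makes the quote/escape options from the question explicit and should still yield exactly 4 string columns here:
val parsed = spark.read
  .option("quote", "\"")   // default quote character; keeps "Potter, Jr" in one field
  .option("escape", "\"")  // a doubled quote inside a quoted field becomes a literal quote
  .csv("input_file.csv")
  .toDF("fname", "lname", "value", "dt")

parsed.printSchema()
// root
//  |-- fname: string (nullable = true)
//  |-- lname: string (nullable = true)
//  |-- value: string (nullable = true)
//  |-- dt: string (nullable = true)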

Check this.
val df = spark.read.csv("in/potter.txt").toDF("fname","lname","value","dt")
df.show()
+----------+--------+-----+----------+
| fname| lname|value| dt|
+----------+--------+-----+----------+
|Potter, Jr| Harry|92.32|09/09/2018|
| John|Williams| 78|01/02/1992|
+----------+--------+-----+----------+

Related

Load Text Files and store it in Dataframe using Pyspark

I am migrating a Pig script to PySpark and I am new to PySpark, so I am stuck at data loading.
My Pig script looks like:
Bag1 = LOAD '/refined/em/em_results/202112/' USING PigStorage('\u1') AS
(PAYER_SHORT: chararray
,SUPER_PAYER_SHORT: chararray
,PAID: double
,AMOUNT: double
);
I want something similar to this in PySpark. Currently I have tried this:
df = spark.read.format("csv").load("/refined/em/em_results/202112/*")
I am able to read the text file with this, but the values come out in a single column instead of being separated into different columns. Please find some sample values below:
+------------------------------+
|_c0                           |
+------------------------------+
|AZZCMMETAL2021/1211FGPP7491764|
|AZZCMMETAL2021/1221HEMP7760484|
+------------------------------+
Output should look like this:
_c0   _c1   _c2     _c3 _c4 _c5 _c6 _c7
AZZCM METAL 2021/12 11  FGP P   7   491764
AZZCM METAL 2021/12 11  HEM P   7   760484
Any clue how to achieve this? Thanks!!
Generally, Spark takes the comma (,) as the default separator; in your case you have to provide space as your separator.
df = spark.read.csv(file_path, sep =' ')
This resolves the issue: instead of "\\u1", I used "\u0001". Please find the answer below.
df = spark.read.option("sep","\u0001").csv("/refined/em/em_results/202112/*")

RDD data into multiple rows in spark-scala

I have a fixed-width text file (sample) with data
2107abc2018abn2019gfh
where all the rows' data are combined into a single row. I need to read the text file and split the data according to a fixed record length of 7, generating multiple rows and storing them in an RDD:
2107abc
2018abn
2019gfh
where 2107 is one column and abc is another column.
Will the logic be applicable for a huge data file, like 1 GB or more?
I'm assuming that you have an RDD[String] and you want to extract both columns from your data. First you can split the line at length 7 and then again at length 4; that separates your columns. Below is the code for the same.
//creating a sample RDD from the given string
val rdd = sc.parallelize(Seq("""2107abc2018abn2019gfh"""))
//Now first split at length 7 then again split at length 4 and create dataframe
val res = rdd.flatMap(_.grouped(7).map(x=>x.grouped(4).toSeq)).map(x=> (x(0),x(1)))
//print the rdd
res.foreach(println)
//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
If you want, you can also convert your RDD to a DataFrame for further processing.
//convert to DF (needs import spark.implicits._ when not in the spark-shell)
val df = res.toDF("col1","col2")
//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
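As a hedged follow-up on the 1 GB question (the path sample.txt and the spark-shell context are assumptions, not from the post), the same grouped/split logic can be applied to a file read with sc.textFile. Note that textFile splits records on newlines, so a file that truly is one huge line arrives as a single record handled by one task:
//read the file from disk; assumes spark-shell and a hypothetical path sample.txt
val fileRdd = sc.textFile("sample.txt")
//cut each line into 7-character records, then split each record into 4 + 3 characters
val parsed = fileRdd
  .flatMap(_.grouped(7))
  .map(rec => (rec.take(4), rec.drop(4)))
//needs import spark.implicits._ when not in the spark-shell
val parsedDf = parsed.toDF("col1", "col2")
parsedDf.show()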

How to convert Array[String] into spark Dataframe to save CSV file format? [duplicate]

This question already has answers here:
How to create DataFrame from Scala's List of Iterables?
(5 answers)
Closed 4 years ago.
The code I'm using to parse the CSV:
val ListOfNames = List("Ramesh","Suresh","Ganesh") // dynamically populated list of names
val Seperator = ListOfNames.map(x => x.split(",")) // or mkString(",")
sc.parallelize(Array(Seperator)).toDF().csv("path")
The output I'm getting:
"Ramesh,Suresh,Ganesh" // the entire list ends up in a single column in the CSV
Expected output:
Ramesh, Suresh, Ganesh // each name in its own column in the CSV
The output should be a single row, with each string in its own column, comma separated.
If I try to change anything, it says "CSV data sources do not support array of string data type".
How to solve this?
If you are looking to convert your list of size n to a Spark dataframe that holds n rows with only one column, then the solution will look like below:
import sparkSession.sqlContext.implicits._
val listOfNames = List("Ramesh","Suresh","Ganesh")
val df = listOfNames.toDF("names")
df.show(false)
output:
+------+
|names |
+------+
|Ramesh|
|Suresh|
|Ganesh|
+------+
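If instead you are after the expected output from the question, i.e. a single row with each name in its own column, one possible sketch (the column names name_0, name_1, ... are my own, and spark is an assumed SparkSession) builds a schema from the list and wraps the list in a single Row:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val listOfNames = List("Ramesh","Suresh","Ganesh")
//one string column per name: name_0, name_1, ...
val schema = StructType(listOfNames.indices.map(i => StructField(s"name_$i", StringType)))
//a single Row holding all the names
val singleRowDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(listOfNames: _*))),
  schema)
singleRowDf.show(false)
//+------+------+------+
//|name_0|name_1|name_2|
//+------+------+------+
//|Ramesh|Suresh|Ganesh|
//+------+------+------+
singleRowDf.write.csv("path")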

Spark dataframe explode column

Every row in the dataframe contains a CSV-formatted string (line) plus another simple string, so what I'm trying to get at the end is a dataframe composed of the fields extracted from the line string together with category.
So I proceeded as follows to explode the line string:
val df = stream.toDF("line","category")
.map(x => x.getString(0))......
In the end I manage to get a new dataframe composed of the line fields, but I can't carry the category over to the new dataframe.
I can't join the new dataframe with the initial one, since the common field id was not a separate column at first.
Sample of input :
line | category
"'1';'daniel';'dan#gmail.com'" | "premium"
Sample of output:
id | name | email | category
1 | "daniel"| "dan#gmail.com"| "premium"
Any suggestions, thanks in advance.
If the structure of the strings in the line column is fixed as mentioned in the question, then the following simple solution should work: the inbuilt split function splits the string into an array, and then the elements of the array are selected and aliased to get the final dataframe.
import org.apache.spark.sql.functions._
df.withColumn("line", split(col("line"), ";"))
.select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
.show(false)
which should give you
+---+--------+---------------+--------+
|id |name |email |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'dan#gmail.com'|premium |
+---+--------+---------------+--------+
I hope the answer is helpful
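As a hedged follow-up (not part of the original answer): the split values above still carry the surrounding single quotes. If you want those stripped as well, one option is to remove them with regexp_replace before splitting:
import org.apache.spark.sql.functions._

df.withColumn("line", split(regexp_replace(col("line"), "'", ""), ";"))
  .select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
  .show(false)
//yields: 1 | daniel | dan#gmail.com | premium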

How do you write a dataframe/RDD as a custom-delimiter (ctrl-A delimited) file in spark scala?

I am working on a POC in which I need to create a dataframe and then save it as a ctrl-A delimited file.
My query to create the intermediate result is below:
val grouped = results
  .groupBy("club_data", "student_id_add", "student_id")
  .agg(
    sum(results("amount").cast(IntegerType)).as("amount"),
    count("amount").as("cnt"))
  .filter((length(trim($"student_id")) > 1) && ($"student_id").isNotNull)
Saving the result to a text file:
grouped.select($"club_data", $"student_id_add", $"amount",$"cnt").rdd.saveAsTextFile("/amit/spark/output4/")
Output :
[amit,DI^A356035,581,1]
It saves the data as comma separated, but I need to save it as ctrl-A separated.
I tried option("delimiter", "\u0001"), but it seems it's not supported by dataframe/rdd.
Is there any function which helps?
If you have a dataframe, you can use Spark CSV to write it as a CSV with a delimiter, as below.
df.write.mode(SaveMode.Overwrite).option("delimiter", "\u0001").csv("outputCSV")
With an older version of Spark:
df.write
.format("com.databricks.spark.csv")
.option("delimiter", "\u0001")
.mode(SaveMode.Overwrite)
.save("outputCSV")
You can read it back as below:
spark.read.option("delimiter", "\u0001").csv("outputCSV").show()
If you have an RDD, then you can use the mkString() function on the RDD and save with saveAsTextFile():
rdd.map(r => r.mkString("\u0001")).saveAsTextFile("outputCSV")
Hope this helps!
df.rdd.map(x => x.mkString("\u0001")).saveAsTextFile("file:/home/iot/data/stackOver")
Convert the rows to text before saving:
grouped.select($"club_data", $"student_id_add", $"amount", $"cnt").rdd.map(row => row.mkString("\u0001")).saveAsTextFile("/amit/spark/output4/")