Condense several columns read with Spark CSV - scala

I have data like the following in a CSV file:
ColumnA,1,2,3,2,1
"YYY",242,34234,232,322,432
"ZZZ",16,435,363,3453,3434
I want to read it with https://github.com/databricks/spark-csv
I would like to read this into a DataFrame and condense all the columns except the first one into a Seq.
So I would like to obtain something like this from it:
MyCaseClass("YYY", Seq(242,34234,232,322,432))
MyCaseClass("ZZZ", Seq(16,435,363,3453,3434))
I'm not sure how to obtain that.
I tried reading like this, where url is the location of the file:
val rawData = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(url)
Then, I am mapping it into the values that I want.
The problem is that I get the error:
The header contains a duplicate entry: '1'
So how can I condense all the fields except the first into a Seq using spark-csv?
EDIT
I cannot change the format of the input.

You can do this by mapping over the rows. Also, as Pawel's comment points out, duplicate column names are not allowed. So you can do something like:
val dataFrame = yourCSV_DataFrame
dataFrame.map { row =>
  // keep the first field, collect every remaining field into a Seq
  Row(row(0), row.toSeq.tail)
}
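If the goal is to end up with MyCaseClass instances, another way around the duplicate-header error is to skip header parsing entirely and drop the header row yourself. A minimal sketch, assuming Spark 1.x with spark-csv and the sqlContext and url from the question ("ColumnA" is the literal first header value in the sample data):
case class MyCaseClass(id: String, values: Seq[Int])

// read without a header so the duplicate column names never cause an error
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .load(url)

// keep the first field as the id, turn everything after it into a Seq[Int]
val condensed = raw
  .map(row => MyCaseClass(row.getString(0), row.toSeq.tail.map(_.toString.toInt)))
  .filter(_.id != "ColumnA")   // drop the row that used to be the header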

Related

Spark Dataframe from a different data format

I have this data set, for which I need to create a Spark DataFrame in Scala. The data is a column in a CSV file; the column name is dataheader.
dataheader
"{""date_time"":""1999/05/22 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""113"",""activityTimeWindowMilliseconds"":20000,""ec"":""event1"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
"{""date_time"":""1999/05/23 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""114"",""activityTimeWindowMilliseconds"":20000,""ec"":""event2"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
I was able to read the csv file -
val df_tmp = spark
.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("quoteMode", "ALL")
.option("delimiter", ",")
.option("escape", "\"")
//.option("inferSchema","true")
.option("multiline", "true")
.load("D:\\dataFile.csv")
I tried to split the data into separate columns in a dataframe but did not succeed.
One thing I noticed in the data is that both keys and values are enclosed in doubled double quotes: ""key1"":""value1"".
If you want to get at the fields inside the dataheader column, you need to parse it and write the result into a new CSV file.
It is clearly a string in JSON format.
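A minimal sketch of that parsing step, assuming Spark 2.1+ (for from_json) and the df_tmp from the question; the schema below only lists a few of the JSON fields and can be extended as needed. Note that the sample rows shown contain a malformed logType value (the closing quotes are missing), which from_json will turn into null unless the source data is fixed.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// describe only the fields you need; anything not listed is ignored
val jsonSchema = new StructType()
  .add("date_time", StringType)
  .add("cust_id", StringType)
  .add("msgId", StringType)
  .add("ec", StringType)

val parsed = df_tmp
  .withColumn("parsed", from_json($"dataheader", jsonSchema))
  .select("parsed.*")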

How to add a file name to a column in a data frame as multiple files are merged together?

How can I add a file_name column to a dataframe as the data is being loaded into it? I want the file_name to show for every record in the dataframe.
I did some research on this, and found something that seems like it should work, but it actually doesn't load any file names, only the data in the files themselves.
import org.apache.spark.sql.functions._
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/corp/ABC*.gz")
df.withColumn("file_name", input_file_name)
What is wrong with my code here? Thanks.
The input_file_name function creates a string column for the file name of the current Spark task.
import org.apache.spark.sql.functions.input_file_name
val df= spark.read
.option("delimiter", "|")
.option("header", "false")
.csv("mnt/rawdata/2019/01/01/corp/")
.withColumn("file_name", input_file_name())

Spark, Scala not able to create view appropriately after reading from file

I am using Spark and Scala on JDK 1.8. I am new to Scala.
I am reading a text file (pat1.txt) that looks like :
Now I am reading that file from my scala code as :
val sqlContext = SparkSession.builder().getOrCreate()
sqlContext.read
.format(externalEntity.getExtractfileType)
.option("compression", externalEntity.getCompressionCodec)
.option("header", if (externalEntity.getHasHeader.toUpperCase == "Y") "true" else "false")
.option("inferSchema", "true")
.option("delimiter", externalEntity.getExtractDelimiter)
.load(externalEntity.getFilePath)
.createOrReplaceTempView(externalEntity.getExtractName)
And then I run a query from my Scala code:
val queryResult = sqlContext.sql(myQuery)
and the output is generated as:
queryResult
.repartition(LteGenericExtractEntity.getNumberOfFiles.toInt)
.write.format("csv")
.option("compression", LteGenericExtractEntity.getCompressionCodec)
.option("delimiter", LteGenericExtractEntity.getExtractDelimiter)
.option("header", "true"")
.save(s"${outputDirectory}/${extractFileBase}")
Now when the 'myQuery' above is
select * from PAT1
the program generates output with an extra line containing "value" that was not part of the file. Basically, the program is not able to identify the ","-separated columns in the input file, and in the output it creates one column under a header named "value". So the output file looks like:
If I change 'myQuery' as :
select p1.FIRST_NAME, p1.LAST_NAME,p1.HOBBY from PAT1 p1
It throws exception as:
My input can be in any format (for example text or CSV, possibly compressed), and the output will always be .csv.
I am having a hard time understanding how to change the read part so that the created view has the columns set up appropriately. Can I get help with that?
This looks like a CSV file, but with a .txt extension.
You could try the following:
Read this file as CSV with extra options, for example spark.read.option("inferSchema", "true").option("header", "true").csv("path/to/file")
Alternatively, read the file as CSV and rename the columns of the DataFrame explicitly:
sqlContext.read.format("csv")   // format "text" would produce a single "value" column
  .option("compression", "none")
  .option("delimiter", ",")
  .option("header", "true")
  .load("/tmp/pat1")
  .toDF("first_name", "last_name", "hobby")

Spark scala- How to apply transformation logic on a generic set of columns defined in a file

I am using Spark 1.6 with Scala.
I have two files: one is a schema file that has hundreds of column names separated by commas, and the other is a .gz file that contains the data.
I am trying to read the data using the schema file and apply different transformation logic on a small set of columns.
I tried running some sample code, but I have hardcoded the column numbers, as shown below.
I also want to write a UDF that could take any set of columns, apply a transformation such as replacing a special character, and return the output.
I would appreciate any suggestions.
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf

val rdd1 = sc.textFile("../inp2.txt")
// hardcoded to column index 1 for now
val rdd2 = rdd1.map(line => line.split("\t")(1)).toDF("_1")
val replaceUDF = udf { s: String => s.replace(".", "") }
rdd2.withColumn("replace", replaceUDF('_1)).show
You can read the field-name file with plain Scala code and create a list of column names:
import scala.io.Source
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// read the schema file and create the list of column names
val line = Source.fromFile("path to file").getLines().toList.head
val columnNames = line.split(",")

// read the data file as an RDD of Rows and convert it to a DataFrame
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))
val rdd1 = sc.textFile("../inp2.txt")
val rowRdd = rdd1.map(line => Row.fromSeq(line.split("\t")))
val df = sqlContext.createDataFrame(rowRdd, schema)
This creates a DataFrame whose column names come from the separate schema file.
Hope this helps!
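For the second part of the question (applying a transformation such as stripping a special character to an arbitrary set of columns), one common pattern is to fold over a list of column names with withColumn. A minimal sketch, assuming the df and columnNames from the answer above; columnsToClean is a hypothetical subset you would supply yourself:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// replace "." with nothing; swap in whatever transformation you need
val cleanUDF = udf { s: String => if (s == null) null else s.replace(".", "") }

// hypothetical subset of columns the transformation should touch
val columnsToClean: Seq[String] = Seq(columnNames(0), columnNames(1))

val cleaned: DataFrame = columnsToClean.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, cleanUDF(col(name)))
}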

How do you write a DataFrame/RDD as a custom-delimited (ctrl-A delimited) file in Spark Scala?

I am working on a POC in which I need to create a DataFrame and then save it as a ctrl-A delimited file.
My query to create the intermediate result is below:
val grouped = results
  .groupBy("club_data", "student_id_add", "student_id")
  .agg(sum(results("amount").cast(IntegerType)).as("amount"), count("amount").as("cnt"))
  .filter((length(trim($"student_id")) > 1) && $"student_id".isNotNull)
Saving the result to a text file:
grouped.select($"club_data", $"student_id_add", $"amount",$"cnt").rdd.saveAsTextFile("/amit/spark/output4/")
Output :
[amit,DI^A356035,581,1]
It saves the data comma-separated, but I need to save it ctrl-A separated.
I tried option("delimiter", "\u0001"), but it seems that is not supported by the DataFrame/RDD API.
Is there any function which helps?
If you have a DataFrame, you can use spark-csv to write it as a CSV with a custom delimiter, as below.
df.write.mode(SaveMode.Overwrite).option("delimiter", "\u0001").csv("outputCSV")
With an older version of Spark:
df.write
.format("com.databricks.spark.csv")
.option("delimiter", "\u0001")
.mode(SaveMode.Overwrite)
.save("outputCSV")
You can read it back as below:
spark.read.option("delimiter", "\u0001").csv("outputCSV").show()
If you have an RDD, you can use the mkString() function on each row and save with saveAsTextFile():
rdd.map(r => r.mkString("\u0001")).saveAsTextFile("outputCSV")
Hope this helps!
df.rdd.map(x => x.mkString("\u0001")).saveAsTextFile("file:/home/iot/data/stackOver")
Convert the rows to text before saving:
grouped.select($"club_data", $"student_id_add", $"amount", $"cnt").rdd.map(row => row.mkString("\u0001")).saveAsTextFile("/amit/spark/output4/")