Spark Dataframe from a different data format - scala

I have this data set, for which I need to create a Spark DataFrame in Scala. The data is a single column in a CSV file; the column name is dataheader.
dataheader
"{""date_time"":""1999/05/22 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""113"",""activityTimeWindowMilliseconds"":20000,""ec"":""event1"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
"{""date_time"":""1999/05/23 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""114"",""activityTimeWindowMilliseconds"":20000,""ec"":""event2"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
I was able to read the csv file -
val df_tmp = spark
.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("quoteMode", "ALL")
.option("delimiter", ",")
.option("escape", "\"")
//.option("inferSchema","true")
.option("multiline", "true")
.load("D:\\dataFile.csv")
I tried to split the data into separate columns in a DataFrame but did not succeed.
One thing I noticed in the data is that both keys and values are enclosed in doubled double quotes, e.g. ""key1"":""value1"".

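The dataheader column is a string in JSON format; its quotes are doubled only because of CSV escaping.
To get at the individual fields you need to parse that string as JSON. The parsed columns can then be written out, for example to a new CSV file.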
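One way to do that parsing, as a minimal sketch rather than a definitive solution, is to let the CSV reader undo the quote doubling (escape set to the quote character) and then apply from_json with an explicit schema. The sample file, the three-field subset of the schema, and the names raw and parsed are assumptions made for illustration. Cells whose JSON is malformed (for instance a value missing its closing quote) come back as nulls in the default permissive mode.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{BooleanType, LongType, StringType, StructType}

val spark = SparkSession.builder().appName("parse-json-column").master("local[*]").getOrCreate()

// Hypothetical sample: a header line plus one JSON cell, quoted the CSV way
// (wrapped in double quotes, inner double quotes doubled).
val cell = """{"cust_id":"cust1","timestamp":944248234000,"activityDetectedInLastTimeWindow":true}"""
val csvLine = "\"" + cell.replace("\"", "\"\"") + "\""
val path = Files.createTempFile("dataFile", ".csv")
Files.write(path, ("dataheader\n" + csvLine + "\n").getBytes("UTF-8"))

// Using the quote character as the escape character makes the reader
// turn "" back into " inside a quoted field.
val raw = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv(path.toString)

// Assumed subset of the fields; extend with the remaining keys as needed.
val schema = new StructType()
  .add("cust_id", StringType)
  .add("timestamp", LongType)
  .add("activityDetectedInLastTimeWindow", BooleanType)

// Parse the JSON string into a struct, then flatten the struct into columns.
val parsed = raw
  .select(from_json(col("dataheader"), schema).as("j"))
  .select("j.*")

parsed.show(truncate = false)
```

If the schema is not known up front, spark.read.json over the string column is an alternative, at the cost of an extra pass to infer the types.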

How to add a file name to a column in a data frame as multiple files are merged together?

How can I add a file_name column to a dataframe, as data is loading into the frame? So, I want the file_name to show for every record in the dataframe.
I did some research on this, and found something that seems like it should work, but it actually doesn't load any file names, only the data in the files themselves.
import org.apache.spark.sql.functions._
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/corp/ABC*.gz")
df.withColumn("file_name", input_file_name)
What is wrong with my code here? Thanks.
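The input_file_name function creates a string column holding the name of the file the current Spark task is reading. Note that withColumn returns a new DataFrame instead of modifying df, and the code in the question discards that result. Assign it to a value, or chain the call onto the read: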
import org.apache.spark.sql.functions.input_file_name
val df= spark.read
.option("delimiter", "|")
.option("header", "false")
.csv("mnt/rawdata/2019/01/01/corp/")
.withColumn("file_name", input_file_name())

Spark Dataframe to TXT file without carriage return

I am trying to save a Spark DataFrame as a text file. While doing this, I need specific column and row delimiters. I am unable to get the row delimiter working. Any help would be greatly appreciated.
Below is the sample code for reference.
//option -1
spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "\\§")
df.coalesce(1)
.map(_.mkString("\u00B6"))
.write
.option("encoding", "US-ASCI")
.mode(SaveMode.Overwrite).text(FileName)
//option-2
df.coalesce(1)
.write.mode(SaveMode.Overwrite)
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("encoding", "US-ASCI")
.option("multiLine", false)
.option("delimiter", "\u00B6")
.option("lineSep", "\u00A7")
.csv(FileName1)
Below is my input and output for reference:
Input:
Test1,Test2,Test2
Pqr,Rsu,Lmn
one,two,three
Output:
Test1¶Test2¶Test2§Pqr¶Rsu¶Lmn§one¶two¶three
From Spark 2.4.0, the "lineSep" option can be used to write json and text files with a custom line separator (cf. DataFrameWriter spec). The option is ignored in earlier Spark versions, and in Spark 2.4 it is also ignored by the csv format.
import spark.implicits._ // provides the String encoder needed by map
val df = spark.createDataFrame(Seq(("Test1","Test2","Test2"), ("one","two","three")))
df.map(_.mkString("\u00B6"))
.coalesce(1)
.write
.option("lineSep", "\u00A7")
.text(FileName)
Output with Spark 2.4.*:
Test1¶Test2¶Test2§one¶two¶three
Output with Spark 2.3.* and lower (the "lineSep" option is ignored):
Test1¶Test2¶Test2
one¶two¶three
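For Spark versions that ignore "lineSep", one possible workaround for small results is to join the rows on the driver and write a single string yourself. The sketch below uses plain Scala collections as a stand-in for the collected rows; the helper name render and the temporary output file are hypothetical, and the approach only makes sense when the data fits in driver memory.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical helper: joins columns with colSep and rows with rowSep.
def render(rows: Seq[Seq[String]], colSep: String, rowSep: String): String =
  rows.map(_.mkString(colSep)).mkString(rowSep)

// Stand-in for the collected DataFrame rows, e.g.
// df.collect().toSeq.map(_.toSeq.map(String.valueOf))
val rows = Seq(
  Seq("Test1", "Test2", "Test2"),
  Seq("Pqr", "Rsu", "Lmn"),
  Seq("one", "two", "three")
)

// U+00B6 (pilcrow) between columns, U+00A7 (section sign) between rows.
val text = render(rows, "\u00B6", "\u00A7")

// Write as UTF-8: neither separator can be represented in US-ASCII.
val out = Files.createTempFile("rows", ".txt")
Files.write(out, text.getBytes(StandardCharsets.UTF_8))
```

For the three input rows from the question this yields the single line shown as the desired output there.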

Spark, Scala not able to create view appropriately after reading from file

I am using Spark and Scala on JDK 1.8. I am new to Scala.
I am reading a text file (pat1.txt) that looks like:
I read that file from my Scala code as follows:
val sqlContext = SparkSession.builder().getOrCreate()
sqlContext.read
.format(externalEntity.getExtractfileType)
.option("compression", externalEntity.getCompressionCodec)
.option("header", if (externalEntity.getHasHeader.toUpperCase == "Y") "true" else "false")
.option("inferSchema", "true")
.option("delimiter", externalEntity.getExtractDelimiter)
.load(externalEntity.getFilePath)
.createOrReplaceTempView(externalEntity.getExtractName)
I then run a query from my Scala code:
val queryResult = sqlContext.sql(myQuery)
and the output is written as:
queryResult
.repartition(LteGenericExtractEntity.getNumberOfFiles.toInt)
.write.format("csv")
.option("compression", LteGenericExtractEntity.getCompressionCodec)
.option("delimiter", LteGenericExtractEntity.getExtractDelimiter)
.option("header", "true")
.save(s"${outputDirectory}/${extractFileBase}")
Now when the 'myQuery' above is
select * from PAT1
The program generates the following output (notice the extra line with "value" that was not part of the file). Basically, the program does not recognize the comma-separated columns in the input file, and the output contains a single column under a header named "value". The output file looks like:
If I change 'myQuery' as :
select p1.FIRST_NAME, p1.LAST_NAME,p1.HOBBY from PAT1 p1
It throws an exception:
My input can be in any format (for example text or CSV, possibly compressed) and the output will always be .csv.
I am having a hard time understanding how to change the read part so that the created view has the proper columns. Can I get help with that?
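This looks like a CSV file with a .txt extension. The text format reads every line into a single string column named value, which is presumably why the view has only a "value" column and no FIRST_NAME column.
You could try either of the following:
Read the file as CSV with extra options, e.g. spark.read.option("inferSchema", "true").option("header", "true").csv("path/to/file")
Read the file with the generic reader as you did, but with the csv format, and then name the columns of the DataFrame:
sqlContext.read.format("csv")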
.option("compression", "none")
.option("delimiter", ",")
.option("header", "true")
.load("/tmp/pat1")
.toDF("first_name", "last_name", "hobby")
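Putting the first suggestion together with the temp view from the question, a minimal sketch could look like the following. The contents of pat1.txt are not shown in the question, so the sample row written here is a made-up stand-in; only the column names are taken from the query in the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("txt-as-csv").master("local[*]").getOrCreate()

// Hypothetical stand-in for pat1.txt: comma-separated, with a header line.
val path = Files.createTempFile("pat1", ".txt")
Files.write(path, "FIRST_NAME,LAST_NAME,HOBBY\nJane,Doe,chess\n".getBytes("UTF-8"))

// The csv format splits each line on the delimiter; the file extension does not matter.
spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .load(path.toString)
  .createOrReplaceTempView("PAT1")

// The named columns now resolve because the view has three columns instead of one.
val result = spark.sql("select p1.FIRST_NAME, p1.LAST_NAME, p1.HOBBY from PAT1 p1")
result.show()
```

In the question's code this would correspond to externalEntity.getExtractfileType returning "csv" for such files, which is an assumption about how that configuration object is populated.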

How do you write a dataframe/RDD with custom delimiter (ctrl-A delimited) file in spark scala?

I am working on a POC in which I need to create a DataFrame and then save it as a Ctrl-A delimited file.
My query to create the intermediate result is below:
val grouped = results.groupBy("club_data","student_id_add","student_id").agg(sum(results("amount").cast(IntegerType)).as("amount"),count("amount").as("cnt")).filter((length(trim($"student_id")) > 1) && ($"student_id").isNotNull)
Saving result in text file
grouped.select($"club_data", $"student_id_add", $"amount",$"cnt").rdd.saveAsTextFile("/amit/spark/output4/")
Output :
[amit,DI^A356035,581,1]
It saves the data comma-separated, but I need it Ctrl-A separated.
I tried option("delimiter", "\u0001"), but it does not seem to be supported by DataFrame/RDD.
Is there any function that helps?
If you have a dataframe you can use Spark-CSV to write as a csv with delimiter as below.
df.write.mode(SaveMode.Overwrite).option("delimiter", "\u0001").csv("outputCSV")
With older versions of Spark:
df.write
.format("com.databricks.spark.csv")
.option("delimiter", "\u0001")
.mode(SaveMode.Overwrite)
.save("outputCSV")
You can read back as below
spark.read.option("delimiter", "\u0001").csv("outputCSV").show()
If you have an RDD, then you can use mkString() on each row and save with saveAsTextFile():
rdd.map(r => r.mkString("\u0001")).saveAsTextFile("outputCSV")
Hope this helps!
df.rdd.map(x=>x.mkString("^A")).saveAsTextFile("file:/home/iot/data/stackOver")
Convert the rows to text before saving (going through .rdd, since saveAsTextFile is an RDD method):
grouped.select($"club_data", $"student_id_add", $"amount",$"cnt").rdd.map(row => row.mkString("\u0001")).saveAsTextFile("/amit/spark/output4/")

Spark-Scala Malformed Line Issue

I have a control-A delimited file which I am trying to convert to parquet format. However in the file there is a String field with a single " in it.
Reading the data like below:
val dataframe = sparkSession.sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", datasetDelimiter)
.option("header", "false")
.option("mode","FAILFAST")
//.option("mode", "DROPMALFORMED")
.option("treatEmptyValuesAsNulls","true")
.option("nullValue"," ")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.schema(schema)
.load(fileLocation)
dataframe
As you can see, there is just an opening double quote in the data and no closing double quote. This results in a Malformed Line exception. While reading, I have explicitly specified the delimiter as \u0001. Is there any way to convert such data to Parquet without losing any data?
You can set the quote option to empty String:
.option("quote", "")
// or, equivalently, .option("quote", "\u0000")
That would tell Spark to treat " as any other non-special character.
(tested with Spark 2.1.0)
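As a rough end-to-end sketch of that suggestion, the snippet below reads a Ctrl-A delimited file containing a field that starts with an unbalanced double quote and writes it out as Parquet. The two-column schema, the sample rows, and the temporary paths are assumptions made for illustration; the built-in csv reader is used in place of com.databricks.spark.csv.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("ctrl-a-to-parquet").master("local[*]").getOrCreate()

// Hypothetical sample: two Ctrl-A delimited rows; the second description
// starts with a double quote that is never closed.
val input = Files.createTempFile("ctrl_a", ".txt")
Files.write(input, "1\u0001plain text\n2\u0001\"unterminated text\n".getBytes("UTF-8"))

// Assumed schema for the sample above.
val schema = new StructType()
  .add("id", StringType)
  .add("description", StringType)

val df = spark.read
  .option("delimiter", "\u0001")
  .option("header", "false")
  .option("quote", "")       // disable quoting so " is ordinary data
  .option("mode", "FAILFAST") // still fail loudly on rows that are truly malformed
  .schema(schema)
  .csv(input.toString)

// Write to a fresh directory; the default save mode fails if the path already exists.
val outDir = Files.createTempDirectory("parquet_out").resolve("data")
df.write.parquet(outDir.toString)
```

With quoting disabled the quote character is kept as part of the value, so no data is dropped on the way to Parquet.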