How to add a file name to a column in a data frame as multiple files are merged together? - scala

How can I add a file_name column to a DataFrame as the data is loaded into it? I want the file_name to show for every record in the DataFrame.
I did some research on this, and found something that seems like it should work, but it actually doesn't load any file names, only the data in the files themselves.
import org.apache.spark.sql.functions._
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/corp/ABC*.gz")
df.withColumn("file_name", input_file_name)
What is wrong with my code here? Thanks.

The input_file_name function creates a string column with the file name of the current Spark task. Note also that withColumn does not modify df in place; it returns a new DataFrame, so the result of df.withColumn(...) needs to be assigned (or chained, as below).
import org.apache.spark.sql.functions.input_file_name
val df = spark.read
.option("delimiter", "|")
.option("header", "false")
.csv("mnt/rawdata/2019/01/01/corp/")
.withColumn("file_name", input_file_name())

Related

Spark Dataframe from a different data format

I have this data set, for which I need to create a Spark DataFrame in Scala. The data is a column in a CSV file; the column name is dataheader.
dataheader
"{""date_time"":""1999/05/22 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""113"",""activityTimeWindowMilliseconds"":20000,""ec"":""event1"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
"{""date_time"":""1999/05/23 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""114"",""activityTimeWindowMilliseconds"":20000,""ec"":""event2"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
I was able to read the csv file -
val df_tmp = spark
.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("quoteMode", "ALL")
.option("delimiter", ",")
.option("escape", "\"")
//.option("inferSchema","true")
.option("multiline", "true")
.load("D:\\dataFile.csv")
I tried to split the data into separate columns in a DataFrame but did not succeed.
One thing I noticed in the data is that both keys and values are enclosed in doubled double quotes: ""key1"":""value1""
The dataheader values are clearly strings in JSON format. If you want the fields inside them as separate columns, you need to parse the JSON (and, if required, write the result out to a new CSV file).
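One way to do that parsing (a minimal sketch, assuming Spark 2.2+ and the df_tmp read above) is to hand the dataheader strings to spark.read.json and let Spark infer one column per JSON field:
import spark.implicits._
// Treat the dataheader column as a Dataset[String] of JSON documents; Spark infers one column per field.
// Rows that are not valid JSON (e.g. the missing closing quote after type123 above) end up in a
// _corrupt_record column under the default PERMISSIVE mode.
val parsed = spark.read.json(df_tmp.select("dataheader").as[String])
parsed.printSchema()
parsed.show(false)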

Spark Dataframe to TXT file without carriage return

I am trying to save a Spark DataFrame as a text file. While doing this, I need a specific column delimiter and row delimiter. I am unable to get the row delimiter working. Any help would be greatly appreciated.
Below is the sample code for reference.
//option -1
spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "\\§")
df.coalesce(1)
.map(_.mkString("\u00B6"))
.write
.option("encoding", "US-ASCI")
.mode(SaveMode.Overwrite).text(FileName)
//option-2
df.coalesce(1)
.write.mode(SaveMode.Overwrite)
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("encoding", "US-ASCI")
.option("multiLine", false)
.option("delimiter", "\u00B6")
.option("lineSep", "\u00A7")
.csv(FileName1)
Below is my input and output for reference:
Input:
Test1,Test2,Test2
Pqr,Rsu,Lmn
one,two,three
Output:
Test1¶Test2¶Test2§Pqr¶Rsu¶Lmn§one¶two¶three
Since Spark 2.4.0, the "lineSep" option can be used to write json and text files with a custom line separator (cf. the DataFrameWriter documentation). The option is ignored in earlier Spark versions and by the csv format.
import spark.implicits._ // provides the Encoder[String] required by df.map below
val df = spark.createDataFrame(Seq(("Test1","Test2","Test2"), ("one","two","three")))
df.map(_.mkString("\u00B6"))
.coalesce(1)
.write
.option("lineSep", "\u00A7")
.text(FileName)
Output with Spark 2.4.*:
Test1¶Test2¶Test2§one¶two¶three
Output with Spark 2.3.* and lower (the "lineSep" option is ignored):
Test1¶Test2¶Test2
one¶two¶three
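If you are stuck on Spark 2.3 or lower, one possible workaround (a sketch, assuming the df and FileName from above) is to glue the rows together yourself before writing, since the output is coalesced to a single partition anyway:
import org.apache.spark.sql.SaveMode
import spark.implicits._
// Build the row strings, then join each partition's rows with the custom record separator.
// With coalesce(1) there is exactly one partition, so the whole file becomes a single physical line
// (the text source still appends one trailing newline).
df.map(_.mkString("\u00B6"))
.coalesce(1)
.mapPartitions(rows => Iterator(rows.mkString("\u00A7")))
.write
.mode(SaveMode.Overwrite)
.text(FileName)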

Spark, Scala not able to create view appropriately after reading from file

I am using Spark and Scala on JDK 1.8. I am new to Scala.
I am reading a text file (pat1.txt) that looks like:
Now I am reading that file from my Scala code as:
val sqlContext = SparkSession.builder().getOrCreate()
sqlContext.read
.format(externalEntity.getExtractfileType)
.option("compression", externalEntity.getCompressionCodec)
.option("header", if (externalEntity.getHasHeader.toUpperCase == "Y") "true" else "false")
.option("inferSchema", "true")
.option("delimiter", externalEntity.getExtractDelimiter)
.load(externalEntity.getFilePath)
.createOrReplaceTempView(externalEntity.getExtractName)
And then I run a query from my Scala code:
val queryResult = sqlContext.sql(myQuery)
and the output is generated as:
queryResult
.repartition(LteGenericExtractEntity.getNumberOfFiles.toInt)
.write.format("csv")
.option("compression", LteGenericExtractEntity.getCompressionCodec)
.option("delimiter", LteGenericExtractEntity.getExtractDelimiter)
.option("header", "true"")
.save(s"${outputDirectory}/${extractFileBase}")
Now when the 'myQuery' above is
select * from PAT1
The program generates output as below (notice the extra header "value" that was not part of the file). Basically the program is not able to identify the ","-separated columns in the input file, and in the output it creates one column under a header named "value". So the output file looks like:
If I change 'myQuery' as :
select p1.FIRST_NAME, p1.LAST_NAME,p1.HOBBY from PAT1 p1
It throws an exception:
My input can be in any format (e.g. text or csv, possibly compressed) and the output will always be in .csv.
I am having a hard time understanding how to change the read part so that the created view has the columns set up appropriately. Can I get help on that?
This looks like a CSV file, but with a .txt extension. The single "value" column in your output is what Spark's text format produces, which suggests the file is being read as text rather than as CSV.
You could try the following:
Read the file as CSV with extra options, e.g. spark.read.option("inferSchema", "true").option("header", "true").csv("path/to/file")
Or read it with the csv source and name the columns explicitly (the text source would keep everything in a single value column and ignore the delimiter and header options):
sqlContext.read.format("csv")
.option("compression", "none")
.option("delimiter", ",")
.option("header", "true")
.load("/tmp/pat1")
.toDF("first_name", "last_name", "hobby")

Create DataFrame / Dataset using Header and Data in two different directories

I am getting the input files as CSV. I get two directories: the first directory has one file with the header record and the second directory has the data files. From these I want to create a DataFrame/Dataset.
One way I can do this is to create a case class, split the data files by the delimiter, attach the schema, and create the DataFrame.
What I am looking for is to read the header file and the data files and create the DataFrame. I saw a solution using the databricks CSV package, but my organization has a restriction on using databricks, and below is the code I came across. Can one of you help me with a solution that does not use databricks?
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
You can do it like this
val schema=spark
.read
.format("csv")
.option("header","true")
.option("delimiter",",")
.load("C:\\spark\\programs\\empheaders.csv")
.schema
val data=spark
.read
.format("csv")
.schema(schema)
.option("delimiter",",")
.load("C:\\spark\\programs\\empdata.csv")
Because your header CSV file does not contain any data, there is no point in inferring a schema from it.
So just get the field names by reading it.
import spark.implicits._ // needed for toDF on the RDD below
val headerRDD = sc.parallelize(Seq("Name,Age,Sal")) //Assume this line is in your Header CSV
val header = headerRDD.flatMap(_.split(",")).collect
//headerRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at parallelize at command-2903591155643047:1
//header: Array[String] = Array(Name, Age, Sal)
Then read the data CSV file.
Either map each line to a case class or a tuple. Convert the data to a DataFrame by passing the header array.
val dataRdd = sc.parallelize(Seq(("Tom,22,500000"),("Rick,40,1000000"))) //Assume these lines are in your data CSV file
val data = dataRdd.map(_.split(",")).map(x => (x(0),x(1).toInt,x(2).toDouble)).toDF(header: _*)
//dataRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[72] at parallelize at command-2903591155643048:1
//data: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 1 more field]
Result:
data.show()
+----+---+---------+
|Name|Age| Sal|
+----+---+---------+
| Tom| 22| 500000.0|
|Rick| 40|1000000.0|
+----+---+---------+
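To apply this to the original two-directory layout (a sketch; the two paths are the placeholders from the question), read the header file just for its column names and use them to rename the data columns:
// Read the single header file only to pick up its column names
val headerCols = spark.read
.option("header", "true")
.csv("path to headers.csv")
.columns
// Read the data directory without a header row; columns stay StringType unless you also infer or supply a schema
val dataDF = spark.read
.option("header", "false")
.csv("path to data.csv")
.toDF(headerCols: _*)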

Read .csv data in european format with Spark

I am currently making my first attempts with Apache Spark.
I would like to read a .csv file with an SQLContext object, but Spark won't provide the correct results, as the file is a European one (comma as decimal separator and semicolon as value separator).
Is there a way to tell Spark to follow a different .csv syntax?
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Foo")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("inferSchema","true")
.load("data.csv")
df.show()
A row in the corresponding .csv looks like this:
04.10.2016;12:51:00;1,1;0,41;0,416
Spark interprets the entire row as a column. df.show() prints:
+--------------------------------+
|Col1;Col2,Col3;Col4;Col5 |
+--------------------------------+
| 04.10.2016;12:51:...|
+--------------------------------+
In previous attempts to get it working, df.show() was even printing more row content where it now says '...', but eventually cut the row at the comma in the third column.
You can just read it as text and split by ;, or set a custom delimiter on the CSV reader, as in .option("delimiter", ";").
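Here is a sketch of that second suggestion, including one way to handle the decimal commas afterwards (the column name Col3 is taken from the printed header and may need adjusting):
import org.apache.spark.sql.functions.{col, regexp_replace}
val raw = sqlContext.read
.format("csv")
.option("header", "true")
.option("delimiter", ";")
.load("data.csv")
// The CSV reader expects a dot as decimal separator, so values like "1,1" have to be converted before casting
val fixed = raw.withColumn("Col3", regexp_replace(col("Col3"), ",", ".").cast("double"))
fixed.show()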