Create DataFrame / Dataset using Header and Data in two different directories - scala

I am getting the input file as CSV. Here I get two directories, first directory will have one file with header record and second directory will have data files. Here, I want to create a Dataframe/Dataset.
One way I can do is creating case class and split the data files by delimiter and attached the schema and create dataFrame.
What I am looking is read Header file and data file and create dataFrame. I saw a solution using databricks but my organization has restriction to use the databricks and below is the code which I come across. Can one you help me the solution without using databricks.
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")

You can do it like this
val schema=spark
.read
.format("csv")
.option("header","true")
.option("delimiter",",")
.load("C:\\spark\\programs\\empheaders.csv")
.schema
val data=spark
.read
.format("csv")
.schema(schema)
.option("delimiter",",")
.load("C:\\spark\\programs\\empdata.csv")

Because in your header CSV file you don't have any data there is no point in inferring the schema out of it.
So just get the field names by reading it.
val headerRDD = sc.parallelize(Seq(("Name,Age,Sal"))) //Assume this line is in your Header CSV
val header = headerRDD.flatMap(_.split(",")).collect
//headerRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at parallelize at command-2903591155643047:1
//header: Array[String] = Array(Name, Age, Sal)
Then read the data CSV file.
Either map each line to a case class or a tuple. Convert the data to a DataFrame by passing the header array.
val dataRdd = sc.parallelize(Seq(("Tom,22,500000"),("Rick,40,1000000"))) //Assume these lines are in your data CSV file
val data = dataRdd.map(_.split(",")).map(x => (x(0),x(1).toInt,x(2).toDouble)).toDF(header: _*)
//dataRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[72] at parallelize at command-2903591155643048:1
//data: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 1 more field]
Result:
data.show()
+----+---+---------+
|Name|Age| Sal|
+----+---+---------+
| Tom| 22| 500000.0|
|Rick| 40|1000000.0|
+----+---+---------+

Related

How to add a file name to a column in a data frame as multiple files are merged together?

How can I add a file_name column to a dataframe, as data is loading into the frame? So, I want the file_name to show for every record in the dataframe.
I did some research on this, and found something that seems like it should work, but it actually doesn't load any file names, only the data in the files themselves.
import org.apache.spark.sql.functions._
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/corp/ABC*.gz")
df.withColumn("file_name", input_file_name)
What is wrong with my code here? Thanks.
The input_file_name function creates a string column for the file name of the current Spark task.
import org.apache.spark.sql.functions.input_file_name
val df= spark.read
.option("delimiter", "|")
.option("header", "false")
.csv("mnt/rawdata/2019/01/01/corp/")
.withColumn("file_name", input_file_name())

Create dataframe with header using header and data file

I have two files data.csv and headers.csv. I want to create dataframe in Spark/Scala with headers.
var data = spark.sqlContext.read.format(
"com.databricks.spark.csv").option("header", "true"
).option("inferSchema", "true").load(data_path)
Can you help me customizing above lines to do this?
you can read the headers.csv by using the above method and use the schema of headers dataframe to read the data.csv as below
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
I hope the answer is helpful

How to do custom partition in spark dataframe with saveAsTextFile

I have created data in Spark and then performed a join operation, finally I have to save the output to partitioned files.
I am converting data frame into RDD and then saving as text file that allows me to use multi-char delimiter. My question is to how use dataframe columns as custom partition in this case.
I can not use below option for custom partition because it does not support multi-char delimiter:
dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode")
.format("csv")
.option("delimiter", "^")
.option("nullValue", "")
.option("codec", "gzip")
.save("s3://trfsdisu/SPARK/FinancialLineItem/output")
To use multi-char delimiter I have converted this in RDD like below code:
dfMainOutput.rdd.map(x=>x.mkString("|^|")).saveAsTextFile("dir path to store")
But in above option how would I do custom partition based on the columns "DataPartiotion" and "StatementTypeCode"?
Do I have to convert back to again from RDD to a dataframe?
Here is my code that i have tried
val dfMainOutput = df1result.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition_1").as("DataPartition_1"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").alias("StatementtypeCode"),
when($"LineItemName_1".isNotNull, $"LineItemName_1").otherwise($"LineItemName").as("LineItemName"),
when($"LocalLanguageLabel_1".isNotNull, $"LocalLanguageLabel_1").otherwise($"LocalLanguageLabel").as("LocalLanguageLabel"),
when($"FinancialConceptLocal_1".isNotNull, $"FinancialConceptLocal_1").otherwise($"FinancialConceptLocal").as("FinancialConceptLocal"),
when($"FinancialConceptGlobal_1".isNotNull, $"FinancialConceptGlobal_1").otherwise($"FinancialConceptGlobal").as("FinancialConceptGlobal"),
when($"IsDimensional_1".isNotNull, $"IsDimensional_1").otherwise($"IsDimensional").as("IsDimensional"),
when($"InstrumentId_1".isNotNull, $"InstrumentId_1").otherwise($"InstrumentId").as("InstrumentId"),
when($"LineItemSequence_1".isNotNull, $"LineItemSequence_1").otherwise($"LineItemSequence").as("LineItemSequence"),
when($"PhysicalMeasureId_1".isNotNull, $"PhysicalMeasureId_1").otherwise($"PhysicalMeasureId").as("PhysicalMeasureId"),
when($"FinancialConceptCodeGlobalSecondary_1".isNotNull, $"FinancialConceptCodeGlobalSecondary_1").otherwise($"FinancialConceptCodeGlobalSecondary").as("FinancialConceptCodeGlobalSecondary"),
when($"IsRangeAllowed_1".isNotNull, $"IsRangeAllowed_1").otherwise($"IsRangeAllowed".cast(DataTypes.StringType)).as("IsRangeAllowed"),
when($"IsSegmentedByOrigin_1".isNotNull, $"IsSegmentedByOrigin_1").otherwise($"IsSegmentedByOrigin".cast(DataTypes.StringType)).as("IsSegmentedByOrigin"),
when($"SegmentGroupDescription".isNotNull, $"SegmentGroupDescription").otherwise($"SegmentGroupDescription").as("SegmentGroupDescription"),
when($"SegmentChildDescription_1".isNotNull, $"SegmentChildDescription_1").otherwise($"SegmentChildDescription").as("SegmentChildDescription"),
when($"SegmentChildLocalLanguageLabel_1".isNotNull, $"SegmentChildLocalLanguageLabel_1").otherwise($"SegmentChildLocalLanguageLabel").as("SegmentChildLocalLanguageLabel"),
when($"LocalLanguageLabel_languageId_1".isNotNull, $"LocalLanguageLabel_languageId_1").otherwise($"LocalLanguageLabel_languageId").as("LocalLanguageLabel_languageId"),
when($"LineItemName_languageId_1".isNotNull, $"LineItemName_languageId_1").otherwise($"LineItemName_languageId").as("LineItemName_languageId"),
when($"SegmentChildDescription_languageId_1".isNotNull, $"SegmentChildDescription_languageId_1").otherwise($"SegmentChildDescription_languageId").as("SegmentChildDescription_languageId"),
when($"SegmentChildLocalLanguageLabel_languageId_1".isNotNull, $"SegmentChildLocalLanguageLabel_languageId_1").otherwise($"SegmentChildLocalLanguageLabel_languageId").as("SegmentChildLocalLanguageLabel_languageId"),
when($"SegmentGroupDescription_languageId_1".isNotNull, $"SegmentGroupDescription_languageId_1").otherwise($"SegmentGroupDescription_languageId").as("SegmentGroupDescription_languageId"),
when($"SegmentMultipleFundbDescription_1".isNotNull, $"SegmentMultipleFundbDescription_1").otherwise($"SegmentMultipleFundbDescription").as("SegmentMultipleFundbDescription"),
when($"SegmentMultipleFundbDescription_languageId_1".isNotNull, $"SegmentMultipleFundbDescription_languageId_1").otherwise($"SegmentMultipleFundbDescription_languageId").as("SegmentMultipleFundbDescription_languageId"),
when($"IsCredit_1".isNotNull, $"IsCredit_1").otherwise($"IsCredit".cast(DataTypes.StringType)).as("IsCredit"),
when($"FinancialConceptLocalId_1".isNotNull, $"FinancialConceptLocalId_1").otherwise($"FinancialConceptLocalId").as("FinancialConceptLocalId"),
when($"FinancialConceptGlobalId_1".isNotNull, $"FinancialConceptGlobalId_1").otherwise($"FinancialConceptGlobalId").as("FinancialConceptGlobalId"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise((concat(col("FFAction"), lit("|!|"))).as("FFAction")))
.filter(!$"FFAction".contains("D"))
val dfMainOutputFinal = dfMainOutput.select(concat_ws("|^|", columns.map(c => col(c)): _*).as("concatenated"))
dfMainOutputFinal.write.partitionBy("DataPartition_1","StatementTypeCode")
.format("csv")
.option("codec", "gzip")
.save("s3://trfsdisu/SPARK/FinancialLineItem/output")
This can be done by using concat_ws, this function works similarly to mkString but can be performed on directly on dataframe. This makes the conversion step to rdd redundant and the df.write.partitionBy() method can be used. A small example that will concatenate all available columns,
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("01", "20000", "45.30"), ("01", "30000", "45.30"))
.toDF("col1", "col2", "col3")
val df2 = df.select($"DataPartiotion", $"StatementTypeCode",
concat_ws("|^|", df.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))
This will give you a resulting dataframe like this,
+--------------+-----------------+------------------+
|DataPartiotion|StatementTypeCode| concatenated|
+--------------+-----------------+------------------+
| 01| 20000|01|^|20000|^|45.30|
| 01| 30000|01|^|30000|^|45.30|
+--------------+-----------------+------------------+

Spark: convert a CSV to RDD[Row]

I have a .csv file, which contains 258 columns in following structure.
["label", "index_1", "index_2", ... , "index_257"]
Now I wanna transform this .csv file to a RDD[Row]:
val data_csv = sc.textFile("~/test.csv")
val rowRDD = data_csv.map(_.split(",")).map(p => Row( p(0), p(1).trim, p(2).trim))
If I do the transform in this way, I have to write down 258 columns specifically. So I tried:
val rowRDD = data_csv.map(_.split(",")).map(p => Row( _ => p(_).trim))
and
val rowRDD = data_csv.map(_.split(",")).map(p => Row( x => p(x).trim))
But these two also not working and report error:
error: missing parameter type for expanded function ((x$2) => p(x$2).trim)
Can anyone tell me how to do this transform? Thanks a lot.
you should use sqlContext instead of sparkContext as
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", true)
.load(("~/test.csv")
this will create dataframe. calling .rdd on df should give you RDD[Row]
val rdd = df.rdd
Rather reading as a textFile read CSV files with the spark-csv
In your case
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.option("quote", "\"") //escape the quotes
.option("ignoreLeadingWhiteSpace", true) // escape space before your data
.load("cars.csv")
This loads data as a dataframe, now you can easily change it to RDD.
Hope this helps!
Apart from the other answers that are correct, the correct way to do what you're trying to do is to use Row.fromSeq inside the map function.
val rdd = sc.parallelize(Array((1 to 258).toArray, (1 to 258).toArray) )
.map(Row.fromSeq(_))
This will turn your rdd to type Row:
Array[org.apache.spark.sql.Row] = Array([1,2,3,4,5,6,7,8,9,10...

Save values in spark

i am trying to read and write data from the my local folder, but my data's not identical.
val data =sc.textFile("/user/cts367689/datagen.txt")
val a=data.map(line=>(line.split(",")(0).toInt+line.split(",")(4).toInt,line.split(",")(3),line.split(",")(2)))
a.saveAsTextFile("/user/cts367689/sparkoutput")
Output:
(526,female,avil)
(635,male,avil)
(983,male,paracetamol)
(342,female,paracetamol)
(158,female,avil)
how can i save output as below,need to remove brackets.
Expected Result:
526,female,avil
635,male,avil
983,male,paracetamol
342,female,paracetamol
158,female,avil
val a = data.map (
line =>
(line.split(",")(0).toInt + line.split(",")(4).toInt) + "," +
line.split(",")(3) + "," +
line.split(",")(2)
)
Try doing this instead of returning it in (). That makes a tuple.
spark has capability of handling unstructured files. you are using one those functions.
for CSV(comma separated values) file there are some good libraries to do the same.
you can have look at this link
for your question , answer is as shown below.
import org.apache.spark.sql.SQLContext
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "false")
.load("/user/cts367689/datagen.txt");
df.select("id", "gender", "name").write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("/user/cts367689/sparkoutput");
use:
val a = data.map(line => line.split(",")(0).toInt+line.split(",")(4).toInt+","+line.split(",")(3)+","+line.split(",")(2))