Format csv file with column creation in Spark scala - scala

I have a csv file, as below
It has 6 rows with top row as header, while header read as "Students Marks"
dataframe is treating them as one columns, now i want to separate both columns with data. "student" and "marks" are separated by space.
df.show()
_______________
##Student Marks##
---------------
A 10;20;10;20
A 20;20;30;10
B 10;10;10;10
B 20;20;20;10
B 30;30;30;20
Now i want to transform this csv table into two columns, with student and Marks, Also for every student the marks with add up, something like below
Student | Marks
A | 30;40;40;30
B | 60;60;60;40
I have tried with below but it is throwing an error
df.withColumn("_tmp", split($"Students Marks","\\ ")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2")).drop("_tmp")

You can read the csv file with the delimiteryou want and calculate result as below
val df = spark.read
.option("header", true)
.option("delimiter", " ")
.csv("path to csv")
After You get the dataframe df
val resultDF = df.withColumn("split", split($"Marks", ";"))
.withColumn("a", $"split"(0))
.withColumn("b", $"split"(1))
.withColumn("c", $"split"(2))
.withColumn("d", $"split"(3))
.groupBy("Student")
.agg(concat_ws(";", array(
Seq(sum($"a"), sum($"b"), sum($"c"), sum($"d")): _*)
).as("Marks"))
resultDF.show(false)
Output:
+-------+-------------------+
|Student|Marks |
+-------+-------------------+
|B |60.0;60.0;60.0;40.0|
|A |30.0;40.0;40.0;30.0|
+-------+-------------------+

Three Ideas. The first one is to read the file, split it by space and then create the dataFrame:
val df = sqlContext.read
.format("csv")
.option("header", "true")
.option("delimiter", " ")
.load("your_file.csv")
The second one is to read the file to dataframe and split it:
df.withColumn("Student", split($"Students Marks"," ").getItem(0))
.withColumn("Marks", split($"Students Marks"," ").getItem(1))
.drop("Students Marks")
The last one is your solution. It should work, but when you use the select, you don't use $"_tmp", therefore, it should work without the .drop("_tmp")
df.withColumn("_tmp", split($"Students Marks"," "))
.select($"_tmp".getItem(0).as("Student"),$"_tmp".getItem(1).as("Marks"))

Related

Split file with Space where column data is also having space

Hi I have data file which is having space as delimiter and also the data each column also contain spaces..How can i split it using spark program using scala:
Sample data Filed:
student.txt
3 columns:
Name
Address
Id
Name Address Id
Abhi Rishii Bangalore,Karnataka 1234
Rinki siyty Hydrabad,Andra 2345
Output Data frame should be:
|Name |City |State |Id--|
+-------------+------+-----------+-----+
| Abhi Rishii|Bangalore|Karnataka|1234|
| Rinki siyty|Hydrabad |Andra |2345|
+----+-----+-----------+---------+-----+
Your file is a tab delimited file.
You can use Spark's csv reader to read this file directly into a dataframe.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
var studentDf = spark.read.format("csv") // Use "csv" for both TSV and CSV
.option("header", "true")
.option("delimiter", "\t") // Set delimiter to tab .
.load("student.txt")
.withColumn("_tmp", split($"Address", "\\,"))
.withColumn($"_tmp".getItem(0).as("City"))
.withColumn($"_tmp".getItem(1).as("State"))
.drop("_tmp")
.drop("Address")
studentDf .show()
|Name |City |State |Id--|
+-------------+------+-----------+-----+
| Abhi Rishii|Bangalore|Karnataka|1234|
| Rinki siyty|Hydrabad |Andra |2345|
+----+-----+-----------+---------+-----+

Create DataFrame / Dataset using Header and Data in two different directories

I am getting the input file as CSV. Here I get two directories, first directory will have one file with header record and second directory will have data files. Here, I want to create a Dataframe/Dataset.
One way I can do is creating case class and split the data files by delimiter and attached the schema and create dataFrame.
What I am looking is read Header file and data file and create dataFrame. I saw a solution using databricks but my organization has restriction to use the databricks and below is the code which I come across. Can one you help me the solution without using databricks.
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
You can do it like this
val schema=spark
.read
.format("csv")
.option("header","true")
.option("delimiter",",")
.load("C:\\spark\\programs\\empheaders.csv")
.schema
val data=spark
.read
.format("csv")
.schema(schema)
.option("delimiter",",")
.load("C:\\spark\\programs\\empdata.csv")
Because in your header CSV file you don't have any data there is no point in inferring the schema out of it.
So just get the field names by reading it.
val headerRDD = sc.parallelize(Seq(("Name,Age,Sal"))) //Assume this line is in your Header CSV
val header = headerRDD.flatMap(_.split(",")).collect
//headerRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at parallelize at command-2903591155643047:1
//header: Array[String] = Array(Name, Age, Sal)
Then read the data CSV file.
Either map each line to a case class or a tuple. Convert the data to a DataFrame by passing the header array.
val dataRdd = sc.parallelize(Seq(("Tom,22,500000"),("Rick,40,1000000"))) //Assume these lines are in your data CSV file
val data = dataRdd.map(_.split(",")).map(x => (x(0),x(1).toInt,x(2).toDouble)).toDF(header: _*)
//dataRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[72] at parallelize at command-2903591155643048:1
//data: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 1 more field]
Result:
data.show()
+----+---+---------+
|Name|Age| Sal|
+----+---+---------+
| Tom| 22| 500000.0|
|Rick| 40|1000000.0|
+----+---+---------+

Drop column(s) in spark csv data frame

I have a dataframe to which i do concatenation to its all fields.
After concatenation it becomes another dataframe and finally I write its output to csv file with partitioned on two of its columns. One of its column is present in first dataframe which I do not want to include in the final output.
Here is my code:
val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition".cast(DataTypes.StringType)).as("DataPartition"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
Here I am concatenating and creating another dataframe:
val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))
This is what i have tried
dfMainOutputFinal
.drop("DataPartition")
.write
.partitionBy("DataPartition","StatementTypeCode")
.format("csv")
.option("header","true")
.option("encoding", "\ufeff")
.option("codec", "gzip")
.save("path to csv")
Now i dont want DataPartition column in my output .
I am doing partition based on DataPartition so i am not getting but because DataPartition is present in the main data frame I am getting it in the output.
QUESTION 1: How can ignore a columns from Dataframe
QUESTION 2: Is there any way to add "\ufeff" in the csv output file before writing my actual data so that my encoding format will become UTF-8-BOM.
As per the suggested answer
This is what i have tried
val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.filter(_ != "DataPartition").fieldNames.map(c => col(c)): _*).as("concatenated"))
But getting below error
<console>:238: error: value fieldNames is not a member of Seq[org.apache.spark.sql.types.StructField]
val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.filter(_ != "DataPartition").fieldNames.map(c => col(c)): _*).as("concatenated"))
Below is the question if i have to remove two columns in final output
val dfMainOutputFinal = dfMainOutput.select($"DataPartition","PartitionYear",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition","PartitionYear").map(c => col(c)): _*).as("concatenated"))
Question 1:
The columns you use in df.write.partitionBy() will not be added to the final csv file. They are automatically ignored since the data is encoded in the file structure. However, if what you mean is to remove it from the concat_ws (and thereby from the file), it is possible to do with a small change:
concat_ws("|^|",
dfMainOutput.schema.fieldNames
.filter(_ != "DataPartition")
.map(c => col(c)): _*).as("concatenated"))
Here the column DataPartition is filtered away before the concatenation.
Question 2:
Spark does not seem to support UTF-8 BOM and it seems to cause problems when reading in files with the format. I can't think of any easy way to add the BOM bytes to each csv file other than writing a script to add them after Spark has finished. My recommendation would be to simply use normal UTF-8 formatting.
dfMainOutputFinal.write.partitionBy("DataPartition","StatementTypeCode")
.format("csv")
.option("header", "true")
.option("encoding", "UTF-8")
.option("codec", "gzip")
.save("path to csv")
Additionally, according to the Unicode standard, BOM is not recommended.
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
QUESTION 1: How can ignore a columns from Dataframe
Ans:
val df = sc.parallelize(List(Person(1,2,3), Person(4,5,6))).toDF("age", "height", "weight")
df.columns
df.show()
+---+------+------+
|age|height|weight|
+---+------+------+
| 1| 2| 3|
| 4| 5| 6|
+---+------+------+
val df_new=df.select("age", "height")
df_new.columns
df_new.show()
+---+------+
|age|height|
+---+------+
| 1| 2|
| 4| 5|
+---+------+
df: org.apache.spark.sql.DataFrame = [age: int, height: int ... 1 more field]
df_new: org.apache.spark.sql.DataFrame = [age: int, height: int]
QUESTION 2: Is there any way to add "\ufeff" in the csv output file
before writing my actual data so that my encoding format will become
UTF-8-BOM.
Ans:
String path= "/data/vaquarkhan/input/unicode.csv";
String outputPath = "file:/data/vaquarkhan/output/output.csv";
getSparkSession()
.read()
.option("inferSchema", "true")
.option("header", "true")
.option("encoding", "UTF-8")
.csv(path)
.write()
.mode(SaveMode.Overwrite)
.csv(outputPath);
}

How to do custom partition in spark dataframe with saveAsTextFile

I have created data in Spark and then performed a join operation, finally I have to save the output to partitioned files.
I am converting data frame into RDD and then saving as text file that allows me to use multi-char delimiter. My question is to how use dataframe columns as custom partition in this case.
I can not use below option for custom partition because it does not support multi-char delimiter:
dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode")
.format("csv")
.option("delimiter", "^")
.option("nullValue", "")
.option("codec", "gzip")
.save("s3://trfsdisu/SPARK/FinancialLineItem/output")
To use multi-char delimiter I have converted this in RDD like below code:
dfMainOutput.rdd.map(x=>x.mkString("|^|")).saveAsTextFile("dir path to store")
But in above option how would I do custom partition based on the columns "DataPartiotion" and "StatementTypeCode"?
Do I have to convert back to again from RDD to a dataframe?
Here is my code that i have tried
val dfMainOutput = df1result.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition_1").as("DataPartition_1"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").alias("StatementtypeCode"),
when($"LineItemName_1".isNotNull, $"LineItemName_1").otherwise($"LineItemName").as("LineItemName"),
when($"LocalLanguageLabel_1".isNotNull, $"LocalLanguageLabel_1").otherwise($"LocalLanguageLabel").as("LocalLanguageLabel"),
when($"FinancialConceptLocal_1".isNotNull, $"FinancialConceptLocal_1").otherwise($"FinancialConceptLocal").as("FinancialConceptLocal"),
when($"FinancialConceptGlobal_1".isNotNull, $"FinancialConceptGlobal_1").otherwise($"FinancialConceptGlobal").as("FinancialConceptGlobal"),
when($"IsDimensional_1".isNotNull, $"IsDimensional_1").otherwise($"IsDimensional").as("IsDimensional"),
when($"InstrumentId_1".isNotNull, $"InstrumentId_1").otherwise($"InstrumentId").as("InstrumentId"),
when($"LineItemSequence_1".isNotNull, $"LineItemSequence_1").otherwise($"LineItemSequence").as("LineItemSequence"),
when($"PhysicalMeasureId_1".isNotNull, $"PhysicalMeasureId_1").otherwise($"PhysicalMeasureId").as("PhysicalMeasureId"),
when($"FinancialConceptCodeGlobalSecondary_1".isNotNull, $"FinancialConceptCodeGlobalSecondary_1").otherwise($"FinancialConceptCodeGlobalSecondary").as("FinancialConceptCodeGlobalSecondary"),
when($"IsRangeAllowed_1".isNotNull, $"IsRangeAllowed_1").otherwise($"IsRangeAllowed".cast(DataTypes.StringType)).as("IsRangeAllowed"),
when($"IsSegmentedByOrigin_1".isNotNull, $"IsSegmentedByOrigin_1").otherwise($"IsSegmentedByOrigin".cast(DataTypes.StringType)).as("IsSegmentedByOrigin"),
when($"SegmentGroupDescription".isNotNull, $"SegmentGroupDescription").otherwise($"SegmentGroupDescription").as("SegmentGroupDescription"),
when($"SegmentChildDescription_1".isNotNull, $"SegmentChildDescription_1").otherwise($"SegmentChildDescription").as("SegmentChildDescription"),
when($"SegmentChildLocalLanguageLabel_1".isNotNull, $"SegmentChildLocalLanguageLabel_1").otherwise($"SegmentChildLocalLanguageLabel").as("SegmentChildLocalLanguageLabel"),
when($"LocalLanguageLabel_languageId_1".isNotNull, $"LocalLanguageLabel_languageId_1").otherwise($"LocalLanguageLabel_languageId").as("LocalLanguageLabel_languageId"),
when($"LineItemName_languageId_1".isNotNull, $"LineItemName_languageId_1").otherwise($"LineItemName_languageId").as("LineItemName_languageId"),
when($"SegmentChildDescription_languageId_1".isNotNull, $"SegmentChildDescription_languageId_1").otherwise($"SegmentChildDescription_languageId").as("SegmentChildDescription_languageId"),
when($"SegmentChildLocalLanguageLabel_languageId_1".isNotNull, $"SegmentChildLocalLanguageLabel_languageId_1").otherwise($"SegmentChildLocalLanguageLabel_languageId").as("SegmentChildLocalLanguageLabel_languageId"),
when($"SegmentGroupDescription_languageId_1".isNotNull, $"SegmentGroupDescription_languageId_1").otherwise($"SegmentGroupDescription_languageId").as("SegmentGroupDescription_languageId"),
when($"SegmentMultipleFundbDescription_1".isNotNull, $"SegmentMultipleFundbDescription_1").otherwise($"SegmentMultipleFundbDescription").as("SegmentMultipleFundbDescription"),
when($"SegmentMultipleFundbDescription_languageId_1".isNotNull, $"SegmentMultipleFundbDescription_languageId_1").otherwise($"SegmentMultipleFundbDescription_languageId").as("SegmentMultipleFundbDescription_languageId"),
when($"IsCredit_1".isNotNull, $"IsCredit_1").otherwise($"IsCredit".cast(DataTypes.StringType)).as("IsCredit"),
when($"FinancialConceptLocalId_1".isNotNull, $"FinancialConceptLocalId_1").otherwise($"FinancialConceptLocalId").as("FinancialConceptLocalId"),
when($"FinancialConceptGlobalId_1".isNotNull, $"FinancialConceptGlobalId_1").otherwise($"FinancialConceptGlobalId").as("FinancialConceptGlobalId"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise((concat(col("FFAction"), lit("|!|"))).as("FFAction")))
.filter(!$"FFAction".contains("D"))
val dfMainOutputFinal = dfMainOutput.select(concat_ws("|^|", columns.map(c => col(c)): _*).as("concatenated"))
dfMainOutputFinal.write.partitionBy("DataPartition_1","StatementTypeCode")
.format("csv")
.option("codec", "gzip")
.save("s3://trfsdisu/SPARK/FinancialLineItem/output")
This can be done by using concat_ws, this function works similarly to mkString but can be performed on directly on dataframe. This makes the conversion step to rdd redundant and the df.write.partitionBy() method can be used. A small example that will concatenate all available columns,
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("01", "20000", "45.30"), ("01", "30000", "45.30"))
.toDF("col1", "col2", "col3")
val df2 = df.select($"DataPartiotion", $"StatementTypeCode",
concat_ws("|^|", df.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))
This will give you a resulting dataframe like this,
+--------------+-----------------+------------------+
|DataPartiotion|StatementTypeCode| concatenated|
+--------------+-----------------+------------------+
| 01| 20000|01|^|20000|^|45.30|
| 01| 30000|01|^|30000|^|45.30|
+--------------+-----------------+------------------+

Read .csv data in european format with Spark

I am currently doing my first attempts with Apache Spark.
I would like to read a .csv File with an SQLContext object, but Spark won't provide the correct results as the File is a european one (comma as decimal separator and semicolon used as value separator).
Is there a way to tell Spark to follow a different .csv syntax?
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Foo")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("inferSchema","true")
.load("data.csv")
df.show()
A row in the relating .csv looks like this:
04.10.2016;12:51:00;1,1;0,41;0,416
Spark interprets the entire row as a column. df.show() prints:
+--------------------------------+
|Col1;Col2,Col3;Col4;Col5 |
+--------------------------------+
| 04.10.2016;12:51:...|
+--------------------------------+
In previous attempts to get it working df.show() was even printing more row-content where it now says '...' but eventually cutting the row at the comma in the third col.
You can just read as Test and split by ; or set a custom delimiter to the CSV format as in .option("delimiter",";")