Drop column(s) in spark csv data frame - scala

I have a dataframe to which i do concatenation to its all fields.
After concatenation it becomes another dataframe and finally I write its output to csv file with partitioned on two of its columns. One of its column is present in first dataframe which I do not want to include in the final output.
Here is my code:
val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition".cast(DataTypes.StringType)).as("DataPartition"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
Here I am concatenating and creating another dataframe:
val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))
This is what i have tried
dfMainOutputFinal
.drop("DataPartition")
.write
.partitionBy("DataPartition","StatementTypeCode")
.format("csv")
.option("header","true")
.option("encoding", "\ufeff")
.option("codec", "gzip")
.save("path to csv")
Now i dont want DataPartition column in my output .
I am doing partition based on DataPartition so i am not getting but because DataPartition is present in the main data frame I am getting it in the output.
QUESTION 1: How can ignore a columns from Dataframe
QUESTION 2: Is there any way to add "\ufeff" in the csv output file before writing my actual data so that my encoding format will become UTF-8-BOM.
As per the suggested answer
This is what i have tried
val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.filter(_ != "DataPartition").fieldNames.map(c => col(c)): _*).as("concatenated"))
But getting below error
<console>:238: error: value fieldNames is not a member of Seq[org.apache.spark.sql.types.StructField]
val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.filter(_ != "DataPartition").fieldNames.map(c => col(c)): _*).as("concatenated"))
Below is the question if i have to remove two columns in final output
val dfMainOutputFinal = dfMainOutput.select($"DataPartition","PartitionYear",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition","PartitionYear").map(c => col(c)): _*).as("concatenated"))

Question 1:
The columns you use in df.write.partitionBy() will not be added to the final csv file. They are automatically ignored since the data is encoded in the file structure. However, if what you mean is to remove it from the concat_ws (and thereby from the file), it is possible to do with a small change:
concat_ws("|^|",
dfMainOutput.schema.fieldNames
.filter(_ != "DataPartition")
.map(c => col(c)): _*).as("concatenated"))
Here the column DataPartition is filtered away before the concatenation.
Question 2:
Spark does not seem to support UTF-8 BOM and it seems to cause problems when reading in files with the format. I can't think of any easy way to add the BOM bytes to each csv file other than writing a script to add them after Spark has finished. My recommendation would be to simply use normal UTF-8 formatting.
dfMainOutputFinal.write.partitionBy("DataPartition","StatementTypeCode")
.format("csv")
.option("header", "true")
.option("encoding", "UTF-8")
.option("codec", "gzip")
.save("path to csv")
Additionally, according to the Unicode standard, BOM is not recommended.
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

QUESTION 1: How can ignore a columns from Dataframe
Ans:
val df = sc.parallelize(List(Person(1,2,3), Person(4,5,6))).toDF("age", "height", "weight")
df.columns
df.show()
+---+------+------+
|age|height|weight|
+---+------+------+
| 1| 2| 3|
| 4| 5| 6|
+---+------+------+
val df_new=df.select("age", "height")
df_new.columns
df_new.show()
+---+------+
|age|height|
+---+------+
| 1| 2|
| 4| 5|
+---+------+
df: org.apache.spark.sql.DataFrame = [age: int, height: int ... 1 more field]
df_new: org.apache.spark.sql.DataFrame = [age: int, height: int]
QUESTION 2: Is there any way to add "\ufeff" in the csv output file
before writing my actual data so that my encoding format will become
UTF-8-BOM.
Ans:
String path= "/data/vaquarkhan/input/unicode.csv";
String outputPath = "file:/data/vaquarkhan/output/output.csv";
getSparkSession()
.read()
.option("inferSchema", "true")
.option("header", "true")
.option("encoding", "UTF-8")
.csv(path)
.write()
.mode(SaveMode.Overwrite)
.csv(outputPath);
}

Related

I need to skip three rows from the dataframe while loading from a CSV file in scala

I am loading my CSV file to a data frame and I can do that but I need to skip the starting three lines from the file.
I tried .option() command by giving header as true but it is ignoring the only first line.
val df = spark.sqlContext.read
.schema(Myschema)
.option("header",true)
.option("delimiter", "|")
.csv(path)
I thought of giving header as 3 lines but I couldn't find the way to do that.
alternative thought: skip those 3 lines from the data frame
Please help me with this. Thanks in Advance.
A generic way to handle your problem would be to index the dataframe and filter the indices that are greater than 2.
Straightforward approach:
As suggested in another answer, you may try adding an index with monotonically_increasing_id.
df.withColumn("Index",monotonically_increasing_id)
.filter('Index > 2)
.drop("Index")
Yet, that's only going to work if the first 3 rows are in the first partition. Moreover, as mentioned in the comments, this is the case today but this code may break completely with further versions or spark and that would be very hard to debug. Indeed, the contract in the API is just "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive". It is therefore not very sage to assume that they will always start from zero. There might even be other cases in the current version in which that does not work (I'm not sure though).
To illustrate my first concern, have a look at this:
scala> spark.range(4).withColumn("Index",monotonically_increasing_id()).show()
+---+----------+
| id| Index|
+---+----------+
| 0| 0|
| 1| 1|
| 2|8589934592|
| 3|8589934593|
+---+----------+
We would only remove two rows...
Safe approach:
The previous approach will work most of the time though but to be safe, you can use zipWithIndex from the RDD API to get consecutive indices.
def zipWithIndex(df : DataFrame, name : String) : DataFrame = {
val rdd = df.rdd.zipWithIndex
.map{ case (row, i) => Row.fromSeq(row.toSeq :+ i) }
val newSchema = df.schema
.add(StructField(name, LongType, false))
df.sparkSession.createDataFrame(rdd, newSchema)
}
zipWithIndex(df, "index").where('index > 2).drop("index")
We can check that it's safer:
scala> zipWithIndex(spark.range(4).toDF("id"), "index").show()
+---+-----+
| id|index|
+---+-----+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
+---+-----+
You can try this option
df.withColumn("Index",monotonically_increasing_id())
.filter(col("Index") > 2)
.drop("Index")
You may try changing wrt to your schema.
import org.apache.spark.sql.Row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Read CSV
val file = sc.textFile("csvfilelocation")
//Remove first 3 lines
val data = file.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(3) else iter }
//Create RowRDD by mapping each line to the required fields
val rowRdd = data.map(x=>Row(x(0), x(1)))
//create dataframe by calling sqlcontext.createDataframe with rowRdd and your schema
val df = sqlContext.createDataFrame(rowRdd, schema)

Format csv file with column creation in Spark scala

I have a csv file, as below
It has 6 rows with top row as header, while header read as "Students Marks"
dataframe is treating them as one columns, now i want to separate both columns with data. "student" and "marks" are separated by space.
df.show()
_______________
##Student Marks##
---------------
A 10;20;10;20
A 20;20;30;10
B 10;10;10;10
B 20;20;20;10
B 30;30;30;20
Now i want to transform this csv table into two columns, with student and Marks, Also for every student the marks with add up, something like below
Student | Marks
A | 30;40;40;30
B | 60;60;60;40
I have tried with below but it is throwing an error
df.withColumn("_tmp", split($"Students Marks","\\ ")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2")).drop("_tmp")
You can read the csv file with the delimiteryou want and calculate result as below
val df = spark.read
.option("header", true)
.option("delimiter", " ")
.csv("path to csv")
After You get the dataframe df
val resultDF = df.withColumn("split", split($"Marks", ";"))
.withColumn("a", $"split"(0))
.withColumn("b", $"split"(1))
.withColumn("c", $"split"(2))
.withColumn("d", $"split"(3))
.groupBy("Student")
.agg(concat_ws(";", array(
Seq(sum($"a"), sum($"b"), sum($"c"), sum($"d")): _*)
).as("Marks"))
resultDF.show(false)
Output:
+-------+-------------------+
|Student|Marks |
+-------+-------------------+
|B |60.0;60.0;60.0;40.0|
|A |30.0;40.0;40.0;30.0|
+-------+-------------------+
Three Ideas. The first one is to read the file, split it by space and then create the dataFrame:
val df = sqlContext.read
.format("csv")
.option("header", "true")
.option("delimiter", " ")
.load("your_file.csv")
The second one is to read the file to dataframe and split it:
df.withColumn("Student", split($"Students Marks"," ").getItem(0))
.withColumn("Marks", split($"Students Marks"," ").getItem(1))
.drop("Students Marks")
The last one is your solution. It should work, but when you use the select, you don't use $"_tmp", therefore, it should work without the .drop("_tmp")
df.withColumn("_tmp", split($"Students Marks"," "))
.select($"_tmp".getItem(0).as("Student"),$"_tmp".getItem(1).as("Marks"))

How can I make a Dataframe in Spark from a String instead of a file? [duplicate]

This question already has answers here:
Can I read a CSV represented as a string into Apache Spark using spark-csv
(3 answers)
Closed 3 years ago.
At the moment, I am making a dataframe from a tab separated file with a header, like this.
val df = sqlContext.read.format("csv")
.option("header", "true")
.option("delimiter", "\t")
.option("inferSchema","true").load(pathToFile)
I want to do exactly the same thing but with a String instead of a file. How can I do that?
To the best of my knowledge, there is no built in way to build a dataframe from a string. Yet, for prototyping purposes, you can create a dataframe from a Seq of Tuples.
You could use that to your advantage to create a dataframe from a string.
scala> val s ="x,y,z\n1,2,3\n4,5,6\n7,8,9"
s: String =
x,y,z
1,2,3
4,5,6
7,8,9
scala> val data = s.split('\n')
// Then we extract the first element to use it as a header.
scala> val header = data.head.split(',')
scala> val df = data.tail.toSeq
// converting the seq of strings to a DF with only one column
.toDF("X")
// spliting the string
.select(split('X, ",") as "X")
// extracting each column from the array and renaming them
.select( header.indices.map( i => 'X.getItem(i).as(header(i))) : _*)
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
ps: if you are not in the spark REPL make sure to write this import spark.implicits._ so as to use toDF().

How to do custom partition in spark dataframe with saveAsTextFile

I have created data in Spark and then performed a join operation, finally I have to save the output to partitioned files.
I am converting data frame into RDD and then saving as text file that allows me to use multi-char delimiter. My question is to how use dataframe columns as custom partition in this case.
I can not use below option for custom partition because it does not support multi-char delimiter:
dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode")
.format("csv")
.option("delimiter", "^")
.option("nullValue", "")
.option("codec", "gzip")
.save("s3://trfsdisu/SPARK/FinancialLineItem/output")
To use multi-char delimiter I have converted this in RDD like below code:
dfMainOutput.rdd.map(x=>x.mkString("|^|")).saveAsTextFile("dir path to store")
But in above option how would I do custom partition based on the columns "DataPartiotion" and "StatementTypeCode"?
Do I have to convert back to again from RDD to a dataframe?
Here is my code that i have tried
val dfMainOutput = df1result.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition_1").as("DataPartition_1"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").alias("StatementtypeCode"),
when($"LineItemName_1".isNotNull, $"LineItemName_1").otherwise($"LineItemName").as("LineItemName"),
when($"LocalLanguageLabel_1".isNotNull, $"LocalLanguageLabel_1").otherwise($"LocalLanguageLabel").as("LocalLanguageLabel"),
when($"FinancialConceptLocal_1".isNotNull, $"FinancialConceptLocal_1").otherwise($"FinancialConceptLocal").as("FinancialConceptLocal"),
when($"FinancialConceptGlobal_1".isNotNull, $"FinancialConceptGlobal_1").otherwise($"FinancialConceptGlobal").as("FinancialConceptGlobal"),
when($"IsDimensional_1".isNotNull, $"IsDimensional_1").otherwise($"IsDimensional").as("IsDimensional"),
when($"InstrumentId_1".isNotNull, $"InstrumentId_1").otherwise($"InstrumentId").as("InstrumentId"),
when($"LineItemSequence_1".isNotNull, $"LineItemSequence_1").otherwise($"LineItemSequence").as("LineItemSequence"),
when($"PhysicalMeasureId_1".isNotNull, $"PhysicalMeasureId_1").otherwise($"PhysicalMeasureId").as("PhysicalMeasureId"),
when($"FinancialConceptCodeGlobalSecondary_1".isNotNull, $"FinancialConceptCodeGlobalSecondary_1").otherwise($"FinancialConceptCodeGlobalSecondary").as("FinancialConceptCodeGlobalSecondary"),
when($"IsRangeAllowed_1".isNotNull, $"IsRangeAllowed_1").otherwise($"IsRangeAllowed".cast(DataTypes.StringType)).as("IsRangeAllowed"),
when($"IsSegmentedByOrigin_1".isNotNull, $"IsSegmentedByOrigin_1").otherwise($"IsSegmentedByOrigin".cast(DataTypes.StringType)).as("IsSegmentedByOrigin"),
when($"SegmentGroupDescription".isNotNull, $"SegmentGroupDescription").otherwise($"SegmentGroupDescription").as("SegmentGroupDescription"),
when($"SegmentChildDescription_1".isNotNull, $"SegmentChildDescription_1").otherwise($"SegmentChildDescription").as("SegmentChildDescription"),
when($"SegmentChildLocalLanguageLabel_1".isNotNull, $"SegmentChildLocalLanguageLabel_1").otherwise($"SegmentChildLocalLanguageLabel").as("SegmentChildLocalLanguageLabel"),
when($"LocalLanguageLabel_languageId_1".isNotNull, $"LocalLanguageLabel_languageId_1").otherwise($"LocalLanguageLabel_languageId").as("LocalLanguageLabel_languageId"),
when($"LineItemName_languageId_1".isNotNull, $"LineItemName_languageId_1").otherwise($"LineItemName_languageId").as("LineItemName_languageId"),
when($"SegmentChildDescription_languageId_1".isNotNull, $"SegmentChildDescription_languageId_1").otherwise($"SegmentChildDescription_languageId").as("SegmentChildDescription_languageId"),
when($"SegmentChildLocalLanguageLabel_languageId_1".isNotNull, $"SegmentChildLocalLanguageLabel_languageId_1").otherwise($"SegmentChildLocalLanguageLabel_languageId").as("SegmentChildLocalLanguageLabel_languageId"),
when($"SegmentGroupDescription_languageId_1".isNotNull, $"SegmentGroupDescription_languageId_1").otherwise($"SegmentGroupDescription_languageId").as("SegmentGroupDescription_languageId"),
when($"SegmentMultipleFundbDescription_1".isNotNull, $"SegmentMultipleFundbDescription_1").otherwise($"SegmentMultipleFundbDescription").as("SegmentMultipleFundbDescription"),
when($"SegmentMultipleFundbDescription_languageId_1".isNotNull, $"SegmentMultipleFundbDescription_languageId_1").otherwise($"SegmentMultipleFundbDescription_languageId").as("SegmentMultipleFundbDescription_languageId"),
when($"IsCredit_1".isNotNull, $"IsCredit_1").otherwise($"IsCredit".cast(DataTypes.StringType)).as("IsCredit"),
when($"FinancialConceptLocalId_1".isNotNull, $"FinancialConceptLocalId_1").otherwise($"FinancialConceptLocalId").as("FinancialConceptLocalId"),
when($"FinancialConceptGlobalId_1".isNotNull, $"FinancialConceptGlobalId_1").otherwise($"FinancialConceptGlobalId").as("FinancialConceptGlobalId"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise((concat(col("FFAction"), lit("|!|"))).as("FFAction")))
.filter(!$"FFAction".contains("D"))
val dfMainOutputFinal = dfMainOutput.select(concat_ws("|^|", columns.map(c => col(c)): _*).as("concatenated"))
dfMainOutputFinal.write.partitionBy("DataPartition_1","StatementTypeCode")
.format("csv")
.option("codec", "gzip")
.save("s3://trfsdisu/SPARK/FinancialLineItem/output")
This can be done by using concat_ws, this function works similarly to mkString but can be performed on directly on dataframe. This makes the conversion step to rdd redundant and the df.write.partitionBy() method can be used. A small example that will concatenate all available columns,
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("01", "20000", "45.30"), ("01", "30000", "45.30"))
.toDF("col1", "col2", "col3")
val df2 = df.select($"DataPartiotion", $"StatementTypeCode",
concat_ws("|^|", df.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))
This will give you a resulting dataframe like this,
+--------------+-----------------+------------------+
|DataPartiotion|StatementTypeCode| concatenated|
+--------------+-----------------+------------------+
| 01| 20000|01|^|20000|^|45.30|
| 01| 30000|01|^|30000|^|45.30|
+--------------+-----------------+------------------+

Read a CSV file with , as delim and numeric data also contain , separator to create RDD in Spark using Scala [duplicate]

This question already has answers here:
How do I convert csv file to rdd
(12 answers)
Closed 5 years ago.
I have a csv file with one of the columns containing value enclosed in double quotes. This column also has commas in it. How do I read this type of columns in CSV in Spark using Scala into an RDD. Column values enclosed in double quotes should be read as Integer type as they are values like Total assets, Total Debts.
example records from csv is
Jennifer,7/1/2000,0,,0,,151,11,8,"25,950,816","5,527,524",51,45,45,45,48,50,2,,
John,7/1/2003,0,,"200,000",0,151,25,8,"28,255,719","6,289,723",48,46,46,46,48,50,2,"4,766,127,272",169
I would suggest you to read with SQLContext as a csv file as it has well tested mechanisms and flexible apis to satisfy your needs
You can do
val dataframe =sqlContext.read.csv("path to your csv file")
Output would be
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
| _c0| _c1|_c2| _c3| _c4| _c5|_c6|_c7|_c8| _c9| _c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17| _c18|_c19|
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
| Jennifer|7/1/2000| 0|null| 0|null|151| 11| 8|25,950,816|5,527,524| 51| 45| 45| 45| 48| 50| 2| null|null|
|Afghanistan|7/1/2003| 0|null|200,000| 0|151| 25| 8|28,255,719|6,289,723| 48| 46| 46| 46| 48| 50| 2|4,766,127,272| 169|
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
Now you can change the header names, change the required columns to integers and do a lot of things
You can even change it to rdd
Edited
If you prefer to read in RDD and stay in RDD, then
Read the file with sparkContext as a textFile
val rdd = sparkContext.textFile("/home/anahcolus/IdeaProjects/scalaTest/src/test/resources/test.csv")
Then split the lines with , by ignoring , in between "
rdd.map(line => line.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)", -1))
#ibh this is not Spark or Scala specific stuff. In Spark you will read file the usual way
val conf = new SparkConf().setAppName("app_name").setMaster("local")
val ctx = new SparkContext(conf)
val file = ctx.textFile("<your file>.csv")
rdd.foreach{line =>
// cleanup code as per regex below
val tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
// side effect
val myObject = new MyObject(tokens)
mylist.add(myObject)
}
See this regex also.