Spark CSV reader: garbled Japanese text and handling multiline records (Scala)

In my Spark job (Spark 2.4.1), I am reading CSV files on S3. These files contain Japanese characters. They can also contain the ^M character (\u000D), so I need to parse them as multiline.
First I used the following code to read the CSV files:
implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
  def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
    dataFrameReader.option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
      .schema(schema)
      .csv(s3Path)
  }
}
But when I read the DataFrame using this method, all the Japanese characters are garbled.
After some tests I found that if I read the same S3 file using spark.sparkContext.textFile(path), the Japanese characters are decoded properly.
So I tried it this way:
implicit class SparkSessionImplicits(spark: SparkSession) {
  def readTeradataCSV(schema: StructType, s3Path: String) = {
    import spark.sqlContext.implicits._
    spark.read.option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .schema(schema)
      .csv(spark.sparkContext.textFile(s3Path).map(str => str.replaceAll("\u000D", " ")).toDS())
  }
}
Now the encoding issue is fixed. However, multiline parsing doesn't work properly: lines are broken at the ^M character, even though I tried to replace ^M using str.replaceAll("\u000D", " ").
Any tips on how to read Japanese characters using the first method, or how to handle multiline records using the second method?
UPDATE:
This encoding issue happens only when the app runs on the Spark cluster. When I run the app locally, reading the same S3 file, the encoding works just fine.
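One possible explanation for the local-versus-cluster difference, which nothing in this post confirms, is that some step decodes the bytes with the JVM default charset rather than UTF-8, and that this default differs between the two environments. The standalone sketch below (plain Scala, no Spark; the object and method names are made up for illustration) shows how UTF-8 bytes decoded with a different charset turn into garbage, and prints the default charset so that it can be compared on the driver and on the executors.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper object for illustration; not part of Spark.
object CharsetMismatchDemo {
  // Decode raw bytes with an explicitly chosen charset.
  def decode(bytes: Array[Byte], charset: Charset): String =
    new String(bytes, charset)

  def main(args: Array[String]): Unit = {
    val original = "日本語"
    val utf8Bytes = original.getBytes(StandardCharsets.UTF_8)

    // Decoding with the charset the bytes were written in restores the text.
    println(decode(utf8Bytes, StandardCharsets.UTF_8))
    // Decoding the same bytes with another charset yields garbled characters.
    println(decode(utf8Bytes, StandardCharsets.ISO_8859_1))
    // The charset this JVM uses when none is specified explicitly.
    println(s"JVM default charset: ${Charset.defaultCharset()}")
  }
}
```

If the default charset reported on the cluster is not UTF-8, that would be worth ruling out first. Passing -Dfile.encoding=UTF-8 through spark.executor.extraJavaOptions and spark.driver.extraJavaOptions is one way to change it; whether that resolves this particular case is an assumption.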

Some things are in the code but not (yet) in the docs. Did you try explicitly setting your line separator, thus avoiding the "multiline" workaround that ^M forces on you?
From the Spark unit tests in "TextSuite", branch 2.4:
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
def testLineSeparator(lineSep: String): Unit = {
  test(s"SPARK-23577: Support line separator - lineSep: '$lineSep'") {
    ...
  }
}
// scalastyle:off nonascii
Seq("|", "^", "::", "!!!#3", 0x1E.toChar.toString, "아").foreach { lineSep =>
  testLineSeparator(lineSep)
}
// scalastyle:on nonascii
From the source code for CSV options parsing, branch 3.0:
https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
  require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
  require(sep.length == 1, "'lineSep' can contain only 1 character.")
  sep
}
val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map { lineSep =>
  lineSep.getBytes(charset)
}
So it looks like CSV does not support multi-character strings as line delimiters, only single characters, because it relies on some Hadoop library. I hope that's fine in your case.
The matching JIRAs are:
SPARK-21289 Text based formats do not support custom end-of-line delimiters ...
SPARK-23577 specific to the text datasource, fixed in 2.4.0
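As background on why the second approach in the question still breaks records: textFile goes through Hadoop's line reader, which by default treats \r, \n and \r\n all as line terminators, so the input has already been split at ^M before the map with replaceAll runs. The plain-Scala sketch below imitates the two splitting behaviours on an in-memory string. The sample data and function names are invented for illustration, and the code does not call Spark.

```scala
// Hypothetical demo object; it only imitates line splitting on a string.
object LineSplitDemo {
  // Two records; the first one contains an embedded carriage return (^M).
  val raw: String = "1\u0001foo\rbar\n2\u0001baz\n"

  // Splits only on LF, as a reader with the line separator fixed to LF would.
  def splitOnLfOnly(text: String): Seq[String] =
    text.split("\n").toSeq

  // Splits on CRLF, CR or LF, imitating a default line reader.
  def splitOnAnyNewline(text: String): Seq[String] =
    text.split("\r\n|\r|\n").toSeq

  def main(args: Array[String]): Unit = {
    // The embedded ^M stays inside the first record.
    println(splitOnLfOnly(raw).size)
    // The first record is cut in two at the ^M.
    println(splitOnAnyNewline(raw).size)
  }
}
```

With the separator fixed to "\n", the embedded ^M stays inside the record and can be replaced afterwards. In Spark versions where the CSV reader accepts lineSep, as in the branch-3.0 excerpt above, this would correspond to .option("lineSep", "\n").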

If your data is enclosed in double quotes, you can use the escape option.
df = (spark.read
      .option("header", "false")
      .csv("******", multiLine=True, escape='"'))

Related

Read a dataframe from csv/json/parquet depending on the argument given in spark

So, I'm reading a csv file into a dataframe in Spark (scala) using the following code:
val dataframe = spark.read
  .option("sep", args(0))
  .option("encoding", "UTF-8")
  .schema(sch)
  .csv(args(1))
where args(0) is a runtime argument specifying the delimiter in my csv (comma, tab etc...), and args(1) is the S3 path from where the csv is read.
I want to generalize this input so that, depending on a third argument args(2), I can read into my dataframe, either csv, or json, or parquet formats with the schema sch.
What would be the best approach to achieve this?
You can use .format to specify the input file format (csv/json/parquet/etc.), and load the file using .load.
val dataframe = args(2) match {
  case "csv" =>
    spark.read
      .format(args(2))
      .option("sep", args(0))
      .option("encoding", "UTF-8")
      .schema(sch)
      .load(args(1))
  case _ =>
    spark.read
      .format(args(2))
      .option("encoding", "UTF-8")
      .schema(sch)
      .load(args(1))
}
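The two branches above differ only in the options they set, so the per-format options can also be computed separately and applied in a single read call. A minimal sketch, assuming the same three formats as in the question: readerOptions is a hypothetical helper name, Parquet is assumed to need no encoding option, and the Spark call appears only in a comment so that the snippet runs without a Spark dependency.

```scala
// Hypothetical helper object; not part of Spark.
object ReaderOptionsDemo {
  // Returns the reader options to use for a given input format.
  def readerOptions(format: String, sep: String): Map[String, String] =
    format.toLowerCase match {
      case "csv"     => Map("sep" -> sep, "encoding" -> "UTF-8")
      case "json"    => Map("encoding" -> "UTF-8")
      case "parquet" => Map.empty[String, String]
      case other     => throw new IllegalArgumentException(s"Unsupported format: $other")
    }

  def main(args: Array[String]): Unit = {
    println(readerOptions("csv", ","))
    // With Spark on the classpath, the result could be applied as:
    //   spark.read.format(fmt).options(readerOptions(fmt, sep)).schema(sch).load(path)
  }
}
```

Failing fast on an unknown format is a design choice of this sketch; the original code instead passes any other value straight to .format.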

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val file = sc.sequenceFile[LongWritable, String](src)
val jsonRecs = file.map(record => new String(record._2))
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format to the sequence files (a timestamp, a tab character, then the JSON). The problem is that textFile() returns an RDD[String] instead of an RDD[(LongWritable, String)] like the sequenceFile() method.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[(LongWritable, String)]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use the following code to read it into a DataFrame, where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
As with the header option, there are several other options you can provide depending on your requirements. Check this for more details.
Thanks for the responses. It's not a CSV, but I guess it could be. It's just the text output of running this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text files for my use case. This code ends up setting the variable 'file' to an RDD[(String, String)] in both cases, and I can work with that.
val file = if (inputType.equalsIgnoreCase("text")) {
  sc.textFile(src).map { line =>
    val parts = line.split("\t") // split once instead of twice per line
    (parts(0), parts(1))
  }
} else { // Default to assuming sequence files are input
  sc.sequenceFile[String, String](src)
}
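One caveat with splitting on every tab is that a line without a tab throws an exception, and a JSON payload that itself contains a tab is cut short. A small sketch of a more defensive split, assuming the "timestamp, tab, JSON" layout described in the question; splitRecord is a hypothetical helper name, not something from the original code.

```scala
// Hypothetical helper object; plain Scala, no Spark required.
object RecordSplitDemo {
  // Splits at the first tab only; returns None when the line has no tab.
  def splitRecord(line: String): Option[(String, String)] =
    line.split("\t", 2) match {
      case Array(timestamp, json) => Some((timestamp, json))
      case _                      => None
    }

  def main(args: Array[String]): Unit = {
    println(splitRecord("1554076800\t{\"a\":1}"))
    println(splitRecord("line without a tab"))
  }
}
```

In the text branch this could replace the map with something like sc.textFile(src).flatMap(line => RecordSplitDemo.splitRecord(line)), which silently drops lines that do not match; whether dropping is acceptable depends on the use case.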

skip header of csv while reading multiple files into rdd in scala

I am trying to read multiple CSVs into an RDD from a path that contains many CSV files. Is there a way to skip the headers while reading all the CSVs into the RDD? Or can I drop the headers from spotsRDD without having to use filter, or without handling each CSV individually and then taking the union?
val path ="file:///home/work/csvs/*"
val spotsRDD= sc.textFile(path)
println(spotsRDD.count())
Thanks
It is a pity that you are using Spark 1.0.0.
You could use the CSV Data Source for Apache Spark, but this library requires Spark 1.3+; by the way, it was inlined into Spark 2.x.
But we can analyse it and implement something similar.
When we look into com/databricks/spark/csv/DefaultSource.scala, there is
val useHeader = parameters.getOrElse("header", "false")
and then in the com/databricks/spark/csv/CsvRelation.scala there is
// If header is set, make sure firstLine is materialized before sending to executors.
val filterLine = if (useHeader) firstLine else null
baseRDD().mapPartitions { iter =>
  // When using header, any input line that equals firstLine is assumed to be header
  val csvIter = if (useHeader) {
    iter.filter(_ != filterLine)
  } else {
    iter
  }
  parseCSV(csvIter, csvFormat)
So if we assume that no data row in the RDD is identical to the header line, we can do something like the example below:
CSV example file:
Latitude,Longitude,Name
48.1,0.25,"First point"
49.2,1.1,"Second point"
47.5,0.75,"Third point"
scala> val csvDataRdd = sc.textFile("test.csv")
csvDataRdd: org.apache.spark.rdd.RDD[String] = test.csv MapPartitionsRDD[24] at textFile at <console>:24
scala> val header = csvDataRdd.first
header: String = Latitude,Longitude,Name
scala> val csvDataWithoutHeaderRdd = csvDataRdd.mapPartitions{iter => iter.filter(_ != header)}
csvDataWithoutHeaderRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitions at <console>:28
scala> csvDataWithoutHeaderRdd.foreach(println)
49.2,1.1,"Second point"
48.1,0.25,"First point"
47.5,0.75,"Third point"
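Because the filter compares every line with the header, it also removes the repeated header lines when several files with the same header are read through one glob path. The sketch below shows the same filter on a plain iterator, which is what mapPartitions hands to the function; the object name and sample lines are made up for illustration, and it assumes that all files share exactly the same header line.

```scala
// Hypothetical demo object; plain Scala, no Spark required.
object HeaderFilterDemo {
  // Drops every line equal to the header. A data row identical to the
  // header would be dropped too.
  def dropHeaderLines(lines: Iterator[String], header: String): Iterator[String] =
    lines.filter(_ != header)

  def main(args: Array[String]): Unit = {
    val header = "Latitude,Longitude,Name"
    // Lines as they might arrive when two files with the same header are read together.
    val lines = Seq(
      header,
      "48.1,0.25,\"First point\"",
      header,
      "49.2,1.1,\"Second point\""
    )
    dropHeaderLines(lines.iterator, header).foreach(println)
  }
}
```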

Spark Reading Compressed with Special Format

I have a .gz file. I need to read this file and add the time and the file name to it. I have some problems and would appreciate recommendations on the following points.
Because the file is compressed, the first line is not read in the proper format, I think due to an encoding problem. I tried the code below, but it does not work:
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
The file has a special format and I need to read it into a DataFrame using a regex. The only way I found is to read it as an RDD and map each line through the regex. Is there any way to read it directly into a DataFrame and pass the regex?
val Test_special_format_RawData = sc.textFile("file://" + filename.toString())
  .map(line ⇒ line.replace("||", "|NA|NA"))
  .map(line ⇒ if (line.takeRight(1) == "|") line + "NA" else line)
  .map { x ⇒ regex_var.findAllIn(x).toArray }
import hiveSqlContext.implicits._
val Test_special_format_DF = Test_special_format_RawData.filter { x ⇒ x.length == 30 }
  .filter { x ⇒ x(0) != header(0) }
  .map { x ⇒ (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7),
    x(8), x(9), x(10), x(11), x(12), x(13), x(14),
    x(15), x(16), x(17), x(18), x(19)) }.toDF()
val Test_special_format_Tranformed_Data = Test_special_format_DF.withColumn("FileName", lit(filename.getName))
  .withColumn("rtm_insertion_date", lit(RTM_DATE_FORMAT.format(Cal.getInstance().getTime())))
Can I ignore a delimiter that appears between special characters? For example, if a "|" pipe appears between ^~ and ^~, can it be ignored?
Sometimes the DataFrame columns arrive with the wrong data types. How can we handle this problem in order to apply data quality checks?
When I insert into Hive from Spark using the DataFrame, can I specify a rejection directory for rows that cannot be handled? Below is the code I used:
Test_special_format_Tranformed_Data.write.partitionBy("rtm_insertion_date")
  .mode(SaveMode.Append).insertInto("dpi_Test_special_format_source")
Sample of the file is here
I will answer my own question regarding the file format issue. The solution is to override the default extension of the gzip codec.
import org.apache.hadoop.io.compress.GzipCodec
class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.tmp"
}
Now we just register this codec by setting spark.hadoop.io.compression.codecs on SparkConf:
val conf = new SparkConf()
// Custom Codec that process .gz.tmp extensions as a common Gzip format
conf.set("spark.hadoop.io.compression.codecs", "smx.ananke.spark.util.codecs.TmpGzipCodec")
val sc = new SparkContext(conf)
val data = sc.textFile("s3n://my-data-bucket/2015/09/21/13/*")
I found this solution at this link.
Regarding the malformed records, there are two solutions:
1. Define a case class and then check whether each record pattern-matches this case class.
2. Parse the RDD line by line, but this requires an update to the spark-csv library.
Regarding the delimiter issue, it requires using an RDD with a regex.
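The case-class idea for malformed records can be sketched as follows. This is only an illustration under stated assumptions: the three-field Record layout, the pipe delimiter without the ^~ quoting from the question, and the names Record and parse are all hypothetical, and the code runs on plain Scala collections rather than on an RDD.

```scala
// Hypothetical demo object; plain Scala, no Spark required.
object RejectDemo {
  // Assumed record layout with three fields; the real file has more.
  final case class Record(id: String, name: String, value: String)

  // Right holds a well-formed record, Left keeps the raw rejected line.
  def parse(line: String): Either[String, Record] =
    line.split("\\|", -1) match { // limit -1 keeps trailing empty fields
      case Array(id, name, value) => Right(Record(id, name, value))
      case _                      => Left(line)
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq("1|a|x", "broken line", "2|b|")
    val parsed = lines.map(parse)
    val good = parsed.collect { case Right(record) => record }
    val rejected = parsed.collect { case Left(raw) => raw }
    println(good)
    println(rejected)
  }
}
```

Applied to an RDD, the Left side could be written to a separate rejection directory (for example with saveAsTextFile) while the Right side continues to the Hive insert; that split is a suggestion, not something the original answer describes.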

How does Spark read a file whose name begins with an underscore?

When I use Spark to parse log files, I notice that if the first character of the file name is _, the result is empty. Here is my test code:
SparkSession spark = SparkSession
    .builder()
    .appName("TestLog")
    .master("local")
    .getOrCreate();
JavaRDD<String> input = spark.read().text("D:\\_event_2.log").javaRDD();
System.out.println("size : " + input.count());
If I rename the file to event_2.log, the code runs correctly.
I found that the text function is defined as:
@scala.annotation.varargs
def text(paths: String*): Dataset[String] = {
  format("text").load(paths : _*).as[String](sparkSession.implicits.newStringEncoder)
}
I think it could be due to _ being Scala's placeholder. How can I avoid this problem?
This has nothing to do with Scala. Spark uses the Hadoop input API to read files, which ignores every file whose name starts with an underscore (_) or a dot (.).
I don't know how to disable this in Spark, though.
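The rule described above can be restated as a small predicate. This is a simplified illustration, not the actual Hadoop or Spark code, and isSkipped is a hypothetical name; it can serve as a quick check for whether an input file name would be ignored.

```scala
// Hypothetical demo object; a simplified restatement of the rule above.
object HiddenFileDemo {
  // True when the file name starts with "_" or ".", i.e. would be ignored.
  def isSkipped(fileName: String): Boolean =
    fileName.startsWith("_") || fileName.startsWith(".")

  def main(args: Array[String]): Unit = {
    Seq("_event_2.log", "event_2.log", ".hidden").foreach { name =>
      println(s"$name skipped: ${isSkipped(name)}")
    }
  }
}
```

Given that rule, the simplest workaround is the one the question already found: rename or copy the file so that its name does not start with an underscore or a dot before reading it.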