Spark Reading Compressed with Special Format - scala

I have a .gz file. I need to read it and add the file name and a timestamp to its records. I have a few problems and need your recommendations on the following points.
Because the file is compressed, the first line is not read in the proper format, I think due to an encoding problem. I tried the code below, but it is not working:
import scala.io.Codec
import java.nio.charset.CodingErrorAction

implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
The file has a special format and I need to read it into a DataFrame using a regex. The only way I found is to read it as an RDD and map it through the regex. Is there any way to read it directly into a DataFrame and pass the regex?
val Test_special_format_RawData = sc.textFile("file://" + filename.toString())
  .map(line => line.replace("||", "|NA|NA"))
  .map(line => if (line.takeRight(1) == "|") line + "NA" else line)
  .map { x => regex_var.findAllIn(x).toArray }

import hiveSqlContext.implicits._

val Test_special_format_DF = Test_special_format_RawData.filter { x => x.length == 30 }
  .filter { x => x(0) != header(0) }
  .map { x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7),
               x(8), x(9), x(10), x(11), x(12), x(13), x(14),
               x(15), x(16), x(17), x(18), x(19)) }.toDF()

val Test_special_format_Tranformed_Data = Test_special_format_DF
  .withColumn("FileName", lit(filename.getName))
  .withColumn("rtm_insertion_date", lit(RTM_DATE_FORMAT.format(Cal.getInstance().getTime())))
Can I ignore a delimiter that appears between special characters, for example ignore a "|" pipe that comes between ^~ and ^~?
Sometimes the DataFrame columns are received with the wrong data types. How can we handle this problem in order to apply data quality checks?
When I try to insert into Hive from Spark using a DataFrame, can I specify a rejection directory for unhandled error rows? Below is the code I used:
Test_special_format_Tranformed_Data.write.partitionBy("rtm_insertion_date")
.mode(SaveMode.Append).insertInto("dpi_Test_special_format_source")
A sample of the file is here.

I will answer my own question regarding the file format issue. The solution is to override the default file extension handled by the gzip codec.
import org.apache.hadoop.io.compress.GzipCodec

class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.tmp"
}
Now we just register this codec by setting spark.hadoop.io.compression.codecs on the SparkConf:
val conf = new SparkConf()
// Custom codec that processes the .gz.tmp extension as the common gzip format
conf.set("spark.hadoop.io.compression.codecs", "smx.ananke.spark.util.codecs.TmpGzipCodec")
val sc = new SparkContext(conf)
val data = sc.textFile("s3n://my-data-bucket/2015/09/21/13/*")
I found this solution at this link.
Regarding the malformed records, there are two possible solutions:
Map each row into a case class and then check whether it matches that case class or not (see the sketch after this list).
Parse the RDD line by line, but this requires an update to the spark-csv library.
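A minimal sketch of the first approach, assuming a hypothetical RawRecord case class that covers only a few of the 30 extracted fields:

case class RawRecord(id: String, name: String, value: String)   // hypothetical column names

// Keep rows that fit the case class; set the rest aside for data quality checks.
val parsed = Test_special_format_RawData.map { fields =>
  if (fields.length == 30) Some(RawRecord(fields(0), fields(1), fields(2))) else None
}
val goodRows = parsed.filter(_.isDefined).map(_.get)   // rows matching the case class
val badRowCount = parsed.filter(_.isEmpty).count()     // rows that did not match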
Regarding the delimiter issue, it requires using an RDD with a regex.

Related

Read a dataframe from csv/json/parquet depending on the argument given in spark

So, I'm reading a csv file into a dataframe in Spark (scala) using the following code:
val dataframe = spark.read
  .option("sep", args(0))
  .option("encoding", "UTF-8")
  .schema(sch)
  .csv(args(1))
where args(0) is a runtime argument specifying the delimiter in my csv (comma, tab etc...), and args(1) is the S3 path from where the csv is read.
I want to generalize this input so that, depending on a third argument args(2), I can read into my dataframe, either csv, or json, or parquet formats with the schema sch.
What would be the best approach to achieve this?
You can use .format to specify the input file format (csv/json/parquet/etc.), and load the file using .load.
val dataframe = args(2) match {
  case "csv" =>
    spark.read
      .format(args(2))
      .option("sep", args(0))
      .option("encoding", "UTF-8")
      .schema(sch)
      .load(args(1))
  case _ =>
    spark.read
      .format(args(2))
      .option("encoding", "UTF-8")
      .schema(sch)
      .load(args(1))
}
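As a variant (just a sketch, not required), the duplication between the two branches could be folded into a single reader by applying the separator option only for CSV:

val baseReader = spark.read
  .format(args(2))
  .option("encoding", "UTF-8")
  .schema(sch)

val dataframe = (if (args(2) == "csv") baseReader.option("sep", args(0)) else baseReader)
  .load(args(1))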

Spark scala read multiple files from S3 using Seq(paths)

I have a scala program that reads json files into a DataFrame using DataFrameReader, using a file pattern like "s3n://bucket/filepath/*.json" to specify files. Now I need to read both ".json" and ".json.gz" (gzip) files into the dataframe.
The current approach uses a wildcard, like this:
session.read().json("s3n://bucket/filepath/*.json")
I want to read both json and json-gzip files, but I have not found documentation for the wildcard pattern expression. I was tempted to compose a more complex wildcard, but the lack of wildcard documentation motivated me to consider another approach.
Reading the documentation for Spark, it says that the DataFrameReader has these relevant methods,
json(path: String): DataFrame
json(paths: String*): DataFrame
Which would produce code more like this:
// spark.isInstanceOf[SparkSession]
// val reader: DataFrameReader = spark.read
val df: DataFrame = spark.read.json(path: String)
// or
val df: DataFrame = spark.read.json(paths: String*)
I need to read json and json-gzip files, but I may need to read other filename formats. The second method (above) accepts a Scala Seq(uence), which means I could provide a Seq to which I could later add other filename wildcards.
// session.isInstanceOf[SparkSession]
val s3json: String = "s3n://bucket/filepath/*.json"
val s3gzip: String = "s3n://bucket/filepath/*.json.gz"
val paths: Seq[String] = Seq(s3json, s3gzip)
val df: DataFrame = session.read().json(paths)
Please comment on this approach: is it idiomatic?
I have also seen examples of the last line with the splat operator ("_*") added to the paths sequence. Is that needed? Can you explain what the ": _*" part does?
val df: DataFrame = session.read().json(paths: _*)
Example of the splat operator use are here:
How to read multiple directories in s3 in spark Scala?
How to pass a list of paths to spark.read.load?
Adding further to blackbishop's answer, you can use val df = spark.read.json(paths: _*) for reading files from entirely independent buckets/folders.
val paths = Seq("s3n://bucket1/filepath1/","s3n://bucket2/filepath/2")
val df = spark.read.json(paths: _*)
The : _* syntax expands the Seq into the variable arguments (varargs) expected by json(paths: String*).
You can use brace expansions in your path to include the 2 extensions:
val df = spark.read.json("s3n://bucket/filepath/{*.json,*.json.gz}")
If your bucket contains only .json and .json.gz files, you can actually read all the files:
val df = spark.read.json("s3n://bucket/filepath/")

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val file = sc.sequenceFile[LongWritable, String](src)
val jsonRecs = file.map { case (_, json) => json }
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format to the sequence files (a timestamp, a tab character, then the JSON). But the problem is that textFile() returns an RDD[String] instead of an RDD[(LongWritable, String)] like the sequenceFile() method.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[(LongWritable, String)]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use the following code to read it into a DataFrame, where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Like the header option, there are multiple other options you can provide depending on your requirements; check this for more details.
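For example, a minimal sketch with a few commonly used CSV options (the separator and file name here are only placeholders):

val df = spark.read
  .option("header", "false")       // no header row in the file
  .option("sep", "\t")             // tab-separated columns
  .option("inferSchema", "true")   // let Spark guess the column types
  .csv("file.txt")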
Thanks for the responses. It's not a CSV but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text files in my use case. This code ends up setting the variable 'file' to an RDD[(String, String)] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
  sc.textFile(src).map { line =>
    val parts = line.split("\t")
    (parts(0), parts(1))
  }
} else { // Default to assuming sequence files are input
  sc.sequenceFile[String, String](src)
}
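With file set up this way, the rest of the pipeline from the question should work unchanged for either input type; a minimal sketch, assuming the value side of each pair holds the JSON payload:

val jsonRecs = file.map { case (_, json) => json }
val df = sqlContext.read.json(jsonRecs)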

Transformations and Actions in Apache Spark

I have Scala code that takes multiple input files from HDFS using wildcards, and each file goes into a function where it is processed individually.
import de.l3s.boilerpipe.extractors.KeepEverythingExtractor
val data = sc.wholeTextFiles("hdfs://localhost:port/akshat/folder/*/*")
val files = data.map { case (filename, content) => filename}
def doSomething(file: String): (String, String) = {
  // logic of processing a single file comes here
  val logData = sc.textFile(file)
  val c = logData.toLocalIterator.mkString
  val d = KeepEverythingExtractor.INSTANCE.getText(c)
  val e = sc.parallelize(d.split("\n"))
  val recipeName = e.take(10).last
  val prepTime = e.take(18).last
  (recipeName, prepTime)
}
// How are transformations and actions applied here?
I am stuck at how to apply further transformations and actions so that all my input files are mapped through the function doSomething and all the output from each of the input files is stored in a single file using saveAsTextFile.
So if my understanding is correct, you have an RDD of pairs and you wish to transform it some more and then save the output for each key to a unique file. Transforming it further is relatively easy: mapValues lets you write transformations on just the value, and any other transformation will also work on RDDs of pairs.
Saving the output to a unique file for each key, however, is a bit trickier. One option would be to find a Hadoop OutputFormat that does what you want and then use saveAsHadoopFile; another option would be to use foreach and write the code to output each key/value pair as desired.
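Along those lines, a rough sketch (not a definitive implementation) that keeps the per-file work inside an ordinary pair-RDD transformation, so no SparkContext calls happen inside the function; the output path is a placeholder:

// data comes from sc.wholeTextFiles, so it is an RDD of (filename, content) pairs
def processContent(content: String): (String, String) = {
  val text = KeepEverythingExtractor.INSTANCE.getText(content)
  val lines = text.split("\n")
  (lines(9), lines(17))   // recipe name and prep time, as in the question
}

val results = data.mapValues(processContent)
results.coalesce(1).saveAsTextFile("hdfs://localhost:port/akshat/output")   // placeholder path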

spark textfile load file instead of lines

In Spark, we can use textFile to load a file into lines and then do some operations on these lines, as follows.
val lines = sc.textFile("xxx")
val counts = lines.filter(line => line.contains("a")).count()
However, in my situation, I would like to load the file into blocks, because the data in the file looks like the block below. Blocks are separated by an empty line in the file.
user: 111
book: 222
comments: like it!
Therefore, I hope the textFile function or some other solution can help me load the file as blocks, which might look something like this:
val blocks = sc.textFile("xxx", 3 line)
Has anyone faced this situation before? Thanks.
I suggest you implement your own file reader function for HDFS. Look at the textFile function: it is built on top of the hadoopFile function and uses TextInputFormat:
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
But this TextInputFormat can be customized via Hadoop properties, as described in this answer. In your case the delimiter could be:
conf.set("textinputformat.record.delimiter", "\n\n")
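For example, a sketch of how the custom delimiter might be wired in (the input path "xxx" is the placeholder from the question):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n\n")   // blank line separates blocks

val blocks = sc
  .newAPIHadoopFile("xxx", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString }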