spark textfile load file instead of lines - scala

In Spark, we can use textFile to load a file into lines and then do some operations on those lines, as follows:
val lines = sc.textFile("xxx")
val counts = lines.filter(line => line.contains("a")).count()
However, in my situation, I would like to load the file into blocks, because the data in the file looks like the block below and blocks are separated by an empty line.
user: 111
book: 222
comments: like it!
Therefore, I hope the textFile function, or some other solution, can help me load the file as blocks, perhaps with something like the following (pseudocode):
val blocks = sc.textFile("xxx", 3 line)
Has anyone faced this situation before? Thanks.

I suggest you implement your own file reader function for HDFS. Look at the textFile function: it's built on top of the hadoopFile function and it uses TextInputFormat:
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
But this TextInputFormat can be customized via Hadoop properties, as described in this answer. In your case the delimiter could be:
conf.set("textinputformat.record.delimiter", "\n\n")
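As a minimal sketch of that approach, assuming the placeholder path "xxx" from the question and using newAPIHadoopFile so the delimiter is set per job rather than globally:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the job's Hadoop configuration and use a blank line as the record delimiter
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")

// Each record is now a whole block (user/book/comments) rather than a single line
val blocks = sc.newAPIHadoopFile("xxx", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }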

Related

Scala: Reading HDFS file as Stream

I would like to read an HDFS file in Scala. It is a text file, and I want to insert a default value for a field in each line. How do I read the HDFS file as a stream, line by line?
I got this code:
val hdfs = FileSystem.get(new URI("hdfs://df:port/"), new Configuration())
val path = new Path("/dir/fileNm")
val stream = hdfs.open(path)
Stream.cons(stream.read, Stream.continually( stream.read))
But this reads byte by byte. readLine() is deprecated. How do I read a line at a time?
I am using Scala version 2.11.8.
Thanks,
Revathy.
You can use scala.io.Source:
val source = Source.fromInputStream(stream)
source.getLines() // Iterator[String]
I think you should do something similar to this:
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
readLines.takeWhile(_ != null).foreach(line => println(line))
Pipe the contents to another function that splits on the newline character, then use that line stream as you normally would. Sometimes you have to do the work yourself.
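Putting the pieces together, here is a minimal sketch (the URI "hdfs://df:port/", the path, and the appended default value are placeholders taken from the question):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val hdfs = FileSystem.get(new URI("hdfs://df:port/"), new Configuration())
val stream = hdfs.open(new Path("/dir/fileNm"))
val source = Source.fromInputStream(stream)
try {
  // getLines() is lazy, so the file is consumed one line at a time
  source.getLines().foreach(line => println(s"$line\tdefaultValue"))
} finally {
  source.close()
  stream.close()
}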

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val file = sc.sequenceFile[LongWritable, String](src)
val jsonRecs = file.map { case (_, value) => value }
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format to the sequence files (a timestamp, a tab character, then the JSON). But the problem is that textFile() returns an RDD[String] instead of an RDD[(LongWritable, String)] like the sequenceFile() method does.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[(LongWritable, String)]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use the following code to read a CSV file into a DataFrame, where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Like the header option, there are multiple options you can provide depending on your requirements. Check this for more details.
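For example, since the files described in the question are tab-separated, a hedged variation could pass a separator and let Spark infer the schema (both options are illustrative additions, not from the original answer):

val df = spark.read
  .option("header", "false")
  .option("sep", "\t")
  .option("inferSchema", "true")
  .csv("file.txt")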
Thanks for the responses. It's not a CSV but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text files in my use case. This code ends up setting the variable 'file' to an RDD[(String, String)] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
  sc.textFile(src).map(line => (line.split("\t")(0), line.split("\t")(1)))
} else { // Default to assuming sequence files are input
  sc.sequenceFile[String, String](src)
}
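With file set either way, the rest of the pipeline from the question should work unchanged, something like:

val jsonRecs = file.map { case (_, json) => json }
val df = sqlContext.read.json(jsonRecs)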

How to read many files and assign each file to the next variable?

I am a beginner in Scala, and I have the following question:
How do I read more than one CSV file and assign each file to its own variable?
I know how to read one file:
val file_1 = sc.textFile("/Users/data/urls_20170225")
I also know how to read many files:
val file_2 = sc.textFile("/Users/data/urls_*")
But the second way assigns all the data to one variable, file_2, which is not what I want. I am looking for an elegant way to do this in Spark Scala.
Spark has no API to load multiple files into multiple RDDs. What you can do is load them one by one into a List of RDDs. Below is sample code.
import java.io.File
import org.apache.spark.rdd.RDD

// `spark` is assumed to be an existing SparkSession
def main(arg: Array[String]): Unit = {
  val dir = """F:\Works\SO\Scala\src\main\resource"""
  val startsWith = """urls_""" // we will use this as the wildcard
  val fileList: List[File] = getListOfFiles(new File(dir))
  val filesRDD: List[RDD[String]] = fileList.collect({
    case file: File if file.getName.startsWith(startsWith) => spark.sparkContext.textFile(file.getPath)
  })
}

// Get all the individual file paths
def getListOfFiles(dir: File): List[File] = dir.listFiles.filter(_.isFile).toList
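If each RDD also needs to remain addressable by its source file, one hedged variation (byName is an illustrative name, not part of the original answer) is to key the collection by file name:

// Map of file name -> RDD, so each file's data can still be looked up individually
val byName: Map[String, RDD[String]] = fileList.collect {
  case file: File if file.getName.startsWith(startsWith) =>
    file.getName -> spark.sparkContext.textFile(file.getPath)
}.toMap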

Spark Reading Compressed with Special Format

I have a .gz file. I need to read this file and add the time and the file name to it. I have some problems and need your help recommending an approach for the following points.
Because the file is compressed, the first line is not read in the proper format. I think this is due to an encoding problem; I tried the code below, but it is not working:
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
The file has a special format and I need to read it into a DataFrame using a regex. The only way I found is to read it as an RDD and map it through the regex. Is there any way to read it directly into a DataFrame and pass the regex?
val Test_special_format_RawData = sc.textFile("file://" + filename.toString())
  .map(line => line.replace("||", "|NA|NA"))
  .map(line => if (line.takeRight(1) == "|") line + "NA" else line)
  .map { x => regex_var.findAllIn(x).toArray }

import hiveSqlContext.implicits._

val Test_special_format_DF = Test_special_format_RawData.filter { x => x.length == 30 }
  .filter { x => x(0) != header(0) }
  .map { x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7),
    x(8), x(9), x(10), x(11), x(12), x(13), x(14),
    x(15), x(16), x(17), x(18), x(19)) }.toDF()

val Test_special_format_Tranformed_Data = Test_special_format_DF.withColumn("FileName", lit(filename.getName))
  .withColumn("rtm_insertion_date", lit(RTM_DATE_FORMAT.format(Cal.getInstance().getTime())))
Can I ignore a delimiter that appears between special characters, for example ignoring a "|" pipe when it comes between ^~ and ^~?
Sometimes the DataFrame column types are received with the wrong data types. How can we handle this problem and apply data quality checks?
When I insert into Hive from Spark using a DataFrame, can I specify a rejection directory for rows that fail? Below is the code I used:
Test_special_format_Tranformed_Data.write.partitionBy("rtm_insertion_date")
.mode(SaveMode.Append).insertInto("dpi_Test_special_format_source")
Sample of the file is here
I will answer my own question regarding the file format issue. The solution is to override the default file extension for the gzip codec.
import org.apache.hadoop.io.compress.GzipCodec
class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.tmp"
}
Now we just register this codec by setting spark.hadoop.io.compression.codecs on the SparkConf:
val conf = new SparkConf()
// Custom Codec that process .gz.tmp extensions as a common Gzip format
conf.set("spark.hadoop.io.compression.codecs", "smx.ananke.spark.util.codecs.TmpGzipCodec")
val sc = new SparkContext(conf)
val data = sc.textFile("s3n://my-data-bucket/2015/09/21/13/*")
I found this solution at this link.
Regarding the malformed records, there are two solutions, as follows:
Parse each row into a case class and check whether it pattern-matches that case class or not (see the sketch below).
Parse the RDD line by line, but this requires an update to the spark-csv library.
Regarding the delimiter issue, it requires using an RDD with a regex.
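As a minimal sketch of the first option (Record, the column count of 30, and the rejected output are assumptions based on the snippet above, not the asker's actual code):

// Wrap each successfully parsed row in a case class; rows that don't match go to a rejected RDD
case class Record(cols: Array[String])

val rawLines = sc.textFile("file://" + filename.toString())
val parsed = rawLines.map { line =>
  val cols = regex_var.findAllIn(line).toArray
  if (cols.length == 30) Right(Record(cols)) else Left(line)
}
val good     = parsed.collect { case Right(r) => r }
val rejected = parsed.collect { case Left(l)  => l } // could be written out to a rejection directory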

Transformations and Actions in Apache Spark

I have Scala code that takes multiple input files from HDFS using wildcards, and each file goes into a function where processing takes place for each file individually.
import de.l3s.boilerpipe.extractors.KeepEverythingExtractor

val data = sc.wholeTextFiles("hdfs://localhost:port/akshat/folder/*/*")
val files = data.map { case (filename, content) => filename }

def doSomething(file: String): (String, String) = {
  // logic of processing a single file comes here
  val logData = sc.textFile(file)
  val c = logData.toLocalIterator.mkString
  val d = KeepEverythingExtractor.INSTANCE.getText(c)
  val e = sc.parallelize(d.split("\n"))
  val recipeName = e.take(10).last
  val prepTime = e.take(18).last
  (recipeName, prepTime)
}
// How are transformations and actions applied here?
I am stuck at how to apply further transformations and actions so that all my input files are mapped according to the function doSomething and the output from each of the input files is stored in a single file using saveAsTextFile.
So if my understanding is correct, you have an RDD of pairs and you wish to transform it some more, and then save the output for each key to a unique file. Transforming it some more is relatively easy: mapValues will let you write transformations on just the value, and any other transformation will also work on RDDs of pairs.
Saving the output to a unique file for each key, however, is a bit trickier. One option would be to find a Hadoop OutputFormat that does what you want and then use saveAsHadoopFile; another option would be to use foreach and write the code to output each key/value pair as desired.
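A minimal sketch of the first option, assuming a hypothetical RDD[(String, String)] called pairs and using Hadoop's MultipleTextOutputFormat so each key becomes its own output file (the output path and the trim transformation are placeholders, not the asker's actual logic):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Writes each key's records to a file named after the key
class KeyAsFileNameFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

pairs
  .mapValues(_.trim) // any further per-value transformation
  .saveAsHadoopFile("/output/path", classOf[String], classOf[String],
    classOf[KeyAsFileNameFormat])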