Loading files in a loop in spark - scala

I have n files in a directory, all with the same .txt extension, and I want to load them in a loop and then make a separate DataFrame for each of them.
I have read this, but in my case all my files have the same extension and I want to iterate over them one by one and make a DataFrame for every file.
I started by counting the files in the directory with the following line of code:
sc.wholeTextFiles("/path/to/dir/*.txt").count()
but I don't know how I should proceed further.
Please guide me.
I am using Spark 2.3 and Scala.
Thanks.

wholeTextFiles returns a pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): rdd.RDD[(String, String)]
You can map over the RDD; the key is the path of the file and the value is the content of the file.
sc.wholeTextFiles("/path/to/dir/*.txt").take(2)
sc.wholeTextFiles("/path/to/dir/*.txt").map { case (path, content) => /* some logic on path and content */ }
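If you need a separate DataFrame per file, a minimal sketch (assuming a SparkSession named spark; the dfByFile name is just illustrative) is to collect the file paths from the keys and read each one on its own:
val paths = sc.wholeTextFiles("/path/to/dir/*.txt").keys.collect()
// one DataFrame per file, keyed by its path
val dfByFile: Map[String, org.apache.spark.sql.DataFrame] =
  paths.map(p => p -> spark.read.text(p)).toMap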

You could use the Hadoop FileSystem API to get the list of files under the directory, then iterate over it and load each file into a different DataFrame.
Something like the below:
// Hadoop FS
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoop_fs = FileSystem.get(sc.hadoopConfiguration)
// Get the list of files under the directory
val fs_status = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (fs_status.hasNext) {
  val filepath = fs_status.next.getPath.toString
  // toDF() needs import spark.implicits._ in scope
  val df = sc.textFile(filepath).toDF()
}
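Note that df above is overwritten on each pass through the loop; if you want to keep every DataFrame around, one possible sketch (the dfs map is purely illustrative) is to accumulate them keyed by file name:
import scala.collection.mutable

val dfs = mutable.Map[String, org.apache.spark.sql.DataFrame]()
val files = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (files.hasNext) {
  val p = files.next.getPath
  dfs(p.getName) = sc.textFile(p.toString).toDF()
}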

Related

Compare two different RDDs using scala

I have two RDDs: one read from a file on HDFS and the other created from a string, as shown below:
val txt=sc.textFile("/tmp/textFile.txt")
val str="This\nfile is\nallowed"
val strRDD=sc.parallelize(List(str))
Now, I want to compare the data in these two RDDs:
txt.subtract(strRDD)
OR
strRDD.subtract(txt)
The result should be an empty RDD but that is not the case. Can someone please explain how I should compare the data of these two RDDs?
The values of the two RDDs you've created look the same but are not. This is evident if you count the non-empty elements in each RDD:
txt.collect().count(!_.isEmpty)
//res0: Int = 3
strRDD.collect().count(!_.isEmpty)
//res1: Int = 1
"The result should be an empty RDD but that is not the case."
That is why the results of txt.subtract(strRDD) and strRDD.subtract(txt) are not the empty RDDs you expected.
val txt=sc.textFile("/tmp/textFile.txt") gives each line of the file as a separate element in the txt RDD, whereas
val str="This\nfile is\nallowed"
val strRDD=sc.parallelize(List(str)) gives a single \n-separated element in the strRDD RDD.
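If the goal is an element-by-element comparison, a small sketch (the strRDD2 name is just illustrative) is to split the string on newlines first, so both RDDs hold one line per element; the subtracts are then empty when the contents match:
val strRDD2 = sc.parallelize(str.split("\n"))
txt.subtract(strRDD2).isEmpty()    // true if every line of the file is in the string
strRDD2.subtract(txt).isEmpty()    // true if every line of the string is in the file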
I hope the explanation is clear

How to split 1 RDD into 6 parts in a performant manner?

I have built a Spark RDD where each element of this RDD is a JAXB Root Element representing an XML Record.
I want to split this RDD so as to produce 6 RDDs from this set. Essentially this job simply converts the hierarchical XML structure into 6 sets of flat CSV records. I am currently passing over the same RDD six times to do this.
xmlRdd.cache()
val rddOfTypeA = xmlRdd.map { iterate over XML Object and create Type A }
rddOfTypeA.saveAsTextFile("s3://...")
val rddOfTypeB = xmlRdd.map { iterate over XML Object and create Type B }
rddOfTypeB.saveAsTextFile("s3://...")
val rddOfTypeC = xmlRdd.map { iterate over XML Object and create Type C }
rddOfTypeC.saveAsTextFile("s3://...")
val rddOfTypeD = xmlRdd.map { iterate over XML Object and create Type D }
rddOfTypeD.saveAsTextFile("s3://...")
val rddOfTypeE = xmlRdd.map { iterate over XML Object and create Type E }
rddOfTypeE.saveAsTextFile("s3://...")
val rddOfTypeF = xmlRdd.map { iterate over XML Object and create Type F }
rddOfTypeF.saveAsTextFile("s3://...")
My input dataset is 35 million records split into 186 files of 448 MB each, stored in Amazon S3. My output directory is also on S3. I am using Spark on EMR.
With a six-node m4.4xlarge cluster it takes 38 minutes to finish this splitting and write the output.
Is there an efficient way to achieve this without walking over the RDD six times?
The easiest solution (from a Spark developer's perspective) is to do the map and saveAsTextFile per RDD on a separate thread.
What is not widely known (and hence rarely exploited) is that SparkContext is thread-safe and so can be used to submit jobs from separate threads.
With that said, you could do the following (using the simplest Scala solution, Future, which is not necessarily the best choice since a Future starts its computation at instantiation time, not when you tell it to):
xmlRdd.cache()

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val f1 = Future {
  val rddOfTypeA = xmlRdd.map { map xml to csv }
  rddOfTypeA.saveAsTextFile("s3://...")
}
val f2 = Future {
  val rddOfTypeB = xmlRdd.map { map xml to csv }
  rddOfTypeB.saveAsTextFile("s3://...")
}
...
Future.sequence(Seq(f1, f2)).onComplete { ... }
That could cut the time for doing the mapping and saving, but would not cut the number of scans over the dataset. That should not be a big deal anyway since the dataset is cached and hence in memory and/or disk (the default persistence level is MEMORY_AND_DISK in Spark SQL's Dataset.cache).
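One practical note (a sketch, not part of the answer above): onComplete is non-blocking, so in a batch job you usually want the driver to wait until all the background saves have finished, for example with Await:
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// block the driver until every save submitted from the futures has completed
Await.result(Future.sequence(Seq(f1, f2 /* ..., f6 */)), Duration.Inf)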
Depending on your requirements regarding the output paths, you can solve it with a simple partitionBy clause and the standard DataFrameWriter.
Instead of multiple maps, design a function that takes an element of xmlRdd and returns a Seq of tuples. The general structure would be like this:
def extractTypes(value: T): Seq[(String, String)] = {
  val a: String = extractA(value)
  val b: String = extractB(value)
  ...
  val f: String = extractF(value)
  Seq(("A", a), ("B", b), ..., ("F", f))
}
flatMap, convert to Dataset and write:
import spark.implicits._ // for toDF

xmlRdd.flatMap(extractTypes _)
  .toDF("id", "value")
  .write
  .partitionBy("id")
  .option("escapeQuotes", "false")
  .csv(...)
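With this approach, partitionBy("id") writes one subdirectory per type under the output path (id=A/, id=B/, ..., id=F/), so you still end up with six separate sets of CSV files while scanning the cached RDD only once.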

Read multiple *.tar.gz files as input for Spark in Scala [duplicate]

This question already has answers here:
How to read gz files in Spark using wholeTextFiles
I intend to apply linear regression on a dataset. It works fine when I use a subset of the data in *.txt format, as below:
// how could I read 26 *.tar.gz compressed files into a DataFrame?
val inputpath = "/Users/jasonzhu/Downloads/a.txt"
val rawDF = sc.textFile(inputpath).toDF()
val df = se.kth.spark.lab1.task2.Main.body(sqlContext, rawDF)
val splitDf = df.randomSplit(Array(0.95, 0.05), seed = 42L)
val (obsDF, testDF) =(splitDf(0).cache(), splitDf(1))
val maxIter = 6
val regParam = 0.07
val elasticNetParam = 0.1
println(s"maxIter=${maxIter}, regParam=${regParam}, elasticNetParam=${elasticNetParam}")
val myLR = new LinearRegression()
.setMaxIter(maxIter)
.setRegParam(regParam)
.setElasticNetParam(elasticNetParam)
val lrStage = 0
val pipeline = new Pipeline().setStages(Array(myLR))
val pipelineModel: PipelineModel = pipeline.fit(obsDF)
val lrModel = pipelineModel.stages(lrStage).asInstanceOf[LinearRegressionModel]
val trainingSummary = lrModel.summary
//print rmse of our model
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
//do prediction - print first k
val predictedDF = pipelineModel.transform(testDF)
predictedDF.show(5, false)
After this spike, I intend to apply the linear regression model to the whole dataset, which resides in 26 *.tar.gz files. I'd like to know how I should read these compressed files into a Spark DataFrame and consume them efficiently by taking advantage of Spark's parallelism. Thanks!
The textFile() method can take wildcards as well. From the documentation:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
Start with an empty RDD, then run a loop that reads each file as an RDD and adds it to the initial RDD with a union in each iteration, as in the sketch below.
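A minimal sketch of that loop (the /path/to/archives paths are hypothetical placeholders, and a SparkSession named spark is assumed for toDF):
import spark.implicits._

val paths = (1 to 26).map(i => s"/path/to/archives/part-$i.tar.gz")
var combined = sc.emptyRDD[String]
for (p <- paths) {
  combined = combined.union(sc.textFile(p))
}
val rawDF = combined.toDF()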

Open files in spark with given timestamp

I use newAPIHadoopFile in my Scala class to read text files from HDFS, as below:
val conf = new SparkConf
val sc = new SparkContext(conf)
val hc = new Configuration(sc.hadoopConfiguration)
val dataFilePath = "/data/sample"
val input = sc.newAPIHadoopFile(dataFilePath, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hc)
But now I just need to open files within a range of timestamp.
Any idea on how I could do that?
Thanks,
Jeff
If your files contain the timestamp directly in the filename, it's pretty easy:
val path = "/hdfs/some_dir/2016-07-*/*"
val data = sqlContext.read.json(path) // or sc.textFile(path) for plain text
data.count() // number of rows in all files matching the pattern
This will read all directories for July 2016 and all files in those directories. You can even do pattern matching on filenames, e.g. val path = "/hdfs/some_dir/2016-07-01/file-*.json"
Is this helpful, or are you looking for filtering by system timestamp?
Edit:
In case you need to filter using system timestamp:
val path = "/hdfs/some_dir/"
val now: Long = System.currentTimeMillis / 1000
var files = new java.io.File(path).listFiles.filter(_.lastModified >= now)
Or you can construct more complex date filtering like selecting date in a "human" way. It should be easy now.
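Since the question's directory is on HDFS (where java.io.File will not work), a sketch of the same idea using the Hadoop FileSystem API could look like this, reusing sc, hc and the input format classes from your snippet (getModificationTime returns milliseconds since the epoch; the 24-hour window is just an example):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val cutoff = System.currentTimeMillis - 24L * 60 * 60 * 1000
val recentFiles = fs.listStatus(new Path("/hdfs/some_dir"))
  .filter(_.getModificationTime >= cutoff)
  .map(_.getPath.toString)

// newAPIHadoopFile accepts a comma-separated list of paths
val input = sc.newAPIHadoopFile(recentFiles.mkString(","),
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hc)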

spark: read multiple textfiles and spill out the first line of each file?

How can I read multiple files (> 1000 files) in Spark and, say, print out only the first line of each file?
I was reading the link
How to read multiple text files into a single RDD?
which mentioned that I can read multiple files (say 3 files) in Spark using the following syntax:
val fs = sc.textFile("a.txt,b.txt,c.txt")
But fs seems to glue all the files together.
One approach is to use hadoopFile with TextInputFormat:
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

val input: String = ???

val firstLines = sc.hadoopFile(
    input, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .flatMap {
    case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
  }
Since the keys produced by TextInputFormat are the byte offset of each line from the beginning of its file, you should get exactly what you want.
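Hypothetical usage, with input pointing at a directory pattern like the one in the question:
val input = "/path/to/dir/*.txt"  // placeholder for the 1000+ files
// ...build firstLines as above, then:
firstLines.take(10).foreach(println)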