How to get an element from a List based on its name? - Scala

I have a path that contains subpaths, and each subpath contains files:
path="/data"
I implemented two functions to get the CSV files from each subpath:
import java.io.File

def getListOfSubDirectories(directoryName: String): Array[String] = {
  new File(directoryName)
    .listFiles
    .filter(_.isDirectory)
    .map(_.getName)
}

def getListOfFiles(dir: String, extensions: List[String]): List[File] = {
  val d = new File(dir)
  d.listFiles.filter(_.isFile).toList.filter { file =>
    extensions.exists(file.getName.endsWith(_))
  }
}
Each subpath contains 5 CSV files: contextfile.csv, datafile.csv, datesfiles.csv, errors.csv, testfiles. My problem is that I will work with each file in a separate DataFrame, so I need to get the name of the file that goes with the right DataFrame; for example, I want the name of the file that concerns context (i.e. contextfile.csv). I worked like this, but on each iteration the ordering of the List changes:
val dir = getListOfSubDirectories(path)
for (sup_path <- dir) {
  val Files = getListOfFiles(path + "//" + sup_path, List(".csv"))
  val filename_context = Files(1).toString
  val filename_datavalue = Files(0).toString
  val filename_error = Files(3).toString
  val filename_testresult = Files(4).toString
}
Any help is appreciated, thanks.

I solved it with a simple filter:
val filename_context = Files.filter(f => f.getName.contains("context")).last.toString
val filename_datavalue = Files.filter(f => f.getName.contains("data")).last.toString
val filename_error = Files.filter(f => f.getName.contains("error")).last.toString
val filename_testresult = Files.filter(f => f.getName.contains("test")).last.toString
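If a keyword might be missing in some subfolder, a slightly safer variant avoids the exception that .last throws on an empty list. This is only a sketch; findByKeyword is a hypothetical helper, not part of the original code:

// Hypothetical helper: path of the first file whose name contains the keyword, or None if absent.
def findByKeyword(files: List[File], keyword: String): Option[String] =
  files.find(_.getName.contains(keyword)).map(_.toString)

val contextFile = findByKeyword(Files, "context")   // Option[String]
// contextFile.getOrElse(sys.error("no context file in this subfolder"))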

Related

Create RDD in Scala and Spark

def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder()
    .master("local")
    .appName("SparkAndHive")
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse 2")
    .enableHiveSupport()
    .getOrCreate()

  GeoSparkSQLRegistrator.registerAll(spark.sqlContext)

  val sparkConf: SparkConf = new SparkConf().setAppName("Spark RDD foreach Example").setMaster("local[2]").set("spark.executor.memory", "2g")

  def displayFiles(files: Array[File], a: util.List[String], b: util.List[String]): Unit = {
    for (filename <- files) { // If a sub directory is found,
      if (filename.isDirectory) if (filename.getName.contains("fire")) {
        rds.add(filename.getAbsolutePath)
        println(filename.getAbsolutePath)
      }
      else if (filename.getName.contains("water")) {
        rdd.add(filename.getAbsolutePath)
        println(filename.getAbsolutePath)
      }
      else {
        displayFiles(filename.listFiles, a, b)
      }
    }
  }

  val files = new File("C://folder").listFiles
  val list1 = new util.ArrayList[String]
  val list2 = new util.ArrayList[String]
  displayFiles(files, list1, list2)

  val a = Seq(list1)
  println(a)
  val b = Seq(list2)
  println(b)

  val rdd1 = spark.sparkContext.parallelize(Seq(a))
  rdd1.foreach(println)
  val rdd2 = spark.sparkContext.parallelize(Seq(b))
  rdd2.foreach(println)
}
I printed the list of subdirectory paths that end with _fire and _water.
Then I created two lists to store the paths ending with _fire in one list and _water in the other.
I created the RDDs for all the directories that are stored in both lists using a foreach loop.
When I declare a variable for the foreach loop and print it, it shows an empty list.
Question: How do I get all the RDDs into a single RDD, i.e., one for _fire and another for _water?
You can create them more directly. The issue in your code is that displayFiles isn't actually returning anything, or modifying list1 or list2. Therefore, those lists will be empty, and so will a and b.
Instead, you can try something like:
val sc = spark.sparkContext
val basePath = "C://folder/"
val rddWater = sc.textFile(basePath + "*_water")
val rddFire = sc.textFile(basePath + "*_fire")
The above will get all the contents of the files whose paths match the wildcards/globs. Or, if you also need the file path corresponding to each record, you can use sc.wholeTextFiles:
val rddWater = sc.wholeTextFiles(basePath + "*_water")
val rddFire = sc.wholeTextFiles(basePath + "*_fire")
// inspect contents using rddWater.collect() or rddFire.collect()
This site has more examples: https://sparkbyexamples.com/apache-spark-rdd/spark-read-multiple-text-files-into-a-single-rdd/
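If you prefer to keep a recursive directory walk instead of globs, a minimal sketch of a version that actually fills the collections passed to it could look like this (my own variant with renamed helpers, not the original code; it collects directory paths, one RDD of paths per category):

import java.io.File
import scala.collection.mutable.ListBuffer

// Walk the directory tree and append matching directory paths to the buffers passed in.
def collectDirs(files: Array[File], fireDirs: ListBuffer[String], waterDirs: ListBuffer[String]): Unit =
  for (f <- files if f.isDirectory) {
    if (f.getName.contains("fire")) fireDirs += f.getAbsolutePath
    else if (f.getName.contains("water")) waterDirs += f.getAbsolutePath
    else collectDirs(f.listFiles, fireDirs, waterDirs)
  }

val fireDirs = ListBuffer.empty[String]
val waterDirs = ListBuffer.empty[String]
collectDirs(new File("C://folder").listFiles, fireDirs, waterDirs)

// One RDD of paths per category (parallelize accepts any Seq, so the buffers work directly).
val rddFirePaths = spark.sparkContext.parallelize(fireDirs)
val rddWaterPaths = spark.sparkContext.parallelize(waterDirs)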

Nested for loop in Scala not iterating throughout the file?

I have a first file with this data:
A,B,C
B,E,F
C,N,P
And a second file with the data below:
A,B,C,YES
B,C,D,NO
C,D,E,TRUE
D,E,F,FALSE
E,F,G,NO
I need every record in the first file to be compared against all records in the second file, but it only happens for the first record.
Below is the code:
import scala.io.Source.fromFile

object TestComparision {
  def main(args: Array[String]): Unit = {
    val lines = fromFile("C:\\Users\\nreddy26\\Desktop\\Spark\\PRI.txt").getLines
    val lines2 = fromFile("C:\\Users\\nreddy26\\Desktop\\Spark\\LKP.txt").getLines
    var l = 0
    var cnt = 0
    for (line <- lines) {
      for (line2 <- lines2) {
        val cols = line.split(",").map(_.trim)
        println(s"${cols(0)}|${cols(1)}|${cols(2)}")
        val cols2 = line2.split(",").map(_.trim)
        println(s"${cols2(0)}|${cols2(1)}|${cols2(2)}|${cols2(3)}")
      }
    }
  }
}
As rightly suggested by @Luis, get the lines in List form by using toList. getLines returns an Iterator, and lines2 is fully consumed during the first iteration of the outer loop, so the inner loop never runs again; a List can be traversed repeatedly:
val lines = fromFile("C:\\Users\\nreddy26\\Desktop\\Spark\\PRI.txt").getLines.toList
val lines2 = fromFile("C:\\Users\\nreddy26\\Desktop\\Spark\\LKP.txt").getLines.toList
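The underlying reason, as a small illustration with made-up values (not the question's data): an Iterator can only be traversed once, while a List can be traversed again and again.

val it = Iterator(1, 2, 3)
it.foreach(_ => ())   // first traversal consumes the iterator
println(it.hasNext)   // false: a second (nested) pass over it would do nothing

val xs = List(1, 2, 3)
xs.foreach(_ => ())
println(xs.nonEmpty)  // true: the List can be walked again, which is what the nested loop needs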

Looping through a Map in Spark Scala

In this code we have two files: athletes.csv, which contains names, and twitter.test, which contains tweet messages. We want to find, for every line in twitter.test, the names that match a name in athletes.csv. We applied a map function to store the names from athletes.csv, and we want to check each of those names against every line in the test file.
import java.nio.charset.CodingErrorAction

import scala.io.{Codec, Source}

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of Strings to Strings, and populate it from athletes.csv.
    var athleteInfo: Map[String, String] = Map()
    //var movieNames:Map[Int, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      var fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    return athleteInfo
  }

  def parseLine(line: String): String = {
    var athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.toString().contains(k)) {
        hello = k
      }
    }
    return hello
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    var athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    var hello = new String()
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x, 1)).reduceByKey(_ + _)
    // mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
But the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return a value from the for comprehension inside your map:
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
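If you then want the (name, 1) pairs counted per name, as in the expected output, a possible follow-up (my sketch, assuming container is the RDD produced by the yield version above) is to flatten and reduce:

// Each element of `container` is a collection of (name, 1) pairs, so flatten first, then count per name.
val counts = container
  .flatMap(pairs => pairs)   // RDD[(String, Int)]
  .reduceByKey(_ + _)        // (name, number of tweets mentioning that name)

counts.collect().foreach(println)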
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the athlete-name data is, you may end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
import spark.implicits._ // needed for the 'symbol column syntax used below
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
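If the athletes DataFrame is small, you can also hint the broadcast explicitly (a sketch; Spark may already pick a broadcast join on its own based on the table size):

// broadcast comes from org.apache.spark.sql.functions, already imported above.
val withAthleteBroadcast = exploded.join(broadcast(athletes), 'word === 'name)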

Union DataFrames based on a condition in Spark (Scala)

I have a folder that consists of 4 subfolders containing parquet files:
Folder -> A.parquet, B.parquet, C.parquet, D.parquet (subfolders). My requirement is to union the DataFrames based on the file names I provide to the method.
I am doing it with this code:
val df = listDirectoriesGetWantedFile(folderPath, sqlContext, "A", "B")
def listDirectoriesGetWantedFile(folderPath: String, sqlContext: SQLContext, str1: String, str2: String): DataFrame = {
  var df: DataFrame = null
  val sb = new StringBuilder
  sb.setLength(0)
  var done = false
  val path = new Path(folderPath)
  if (fileSystem.isDirectory(path)) {
    var files = fileSystem.listStatus(path)
    for (file <- files) {
      if (file.getPath.getName.contains(str1) && !done) {
        sb.append(file.getPath.toString())
        sb.append(",")
        done = true
      } else if (file.getPath.getName.contains(str2)) {
        sb.append(file.getPath.toString())
      }
    }
  }
But then I need to split the sb and union the DataFrames, and I cannot find a solution for that. How can I approach and solve this?
If I understand your question, you could simply do something like this:
def listDirectoriesGetWantedFile(path: String,
                                 sqlContext: SQLContext,
                                 folder1: String,
                                 folder2: String): DataFrame = {
  val df1 = sqlContext.read.parquet(s"$path/$folder1")
  val df2 = sqlContext.read.parquet(s"$path/$folder2")
  df1.union(df2)
}
EDIT
By using the Hadoop FileSystem API, you can check whether the folder paths exist. So you may try something like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def listDirectoriesGetWantedFile(path: String, sqlContext: SQLContext, folders: Seq[String]): DataFrame = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)

  val existingFolders = folders
    .map(folder => new Path(s"$path/$folder"))
    .filter(fs.exists(_))
    .map(_.toString)

  if (existingFolders.isEmpty) {
    sqlContext.emptyDataFrame
  } else {
    sqlContext.read.parquet(existingFolders: _*)
  }
}
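For example, with the folder layout from the question, a call could look like this (a sketch; the subfolder names are taken from the question):

// Union only the A and B subfolders; folders that do not exist are filtered out by the check above.
val df = listDirectoriesGetWantedFile(folderPath, sqlContext, Seq("A.parquet", "B.parquet"))
df.show()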

Iterating through files in Scala to create values based on the file names

I think there may be a simple solution to this; I was wondering if anybody knew how to iterate over a set of files and output a value based on each file's name.
My problem is that I want to read in a set of graph edges for each month and then create a separate graph per month.
Currently I've done this the long way, which is fine for one year's worth, but I'd like a way to automate it.
You can see my code below, which hopefully shows clearly what I am doing.
// Load vertex data
val vertices = sc.textFile("D:~vertices.csv")
  .map(line => line.split(","))
  .map(parts => (parts.head.toLong, parts.tail))

// Define function for creating edges from a CSV file
def EdgeMaker(file: RDD[String]): RDD[Edge[String]] = {
  file.flatMap { line =>
    if (!line.isEmpty && line(0) != '#') {
      val lineArray = line.split(",")
      if (lineArray.length < 3) { // skip malformed lines (need src, dst and ID)
        None
      } else {
        val srcId = lineArray(0).toInt
        val dstId = lineArray(1).toInt
        val ID = lineArray(2).toString
        Array(Edge(srcId.toInt, dstId.toInt, ID))
      }
    } else {
      None
    }
  }
}
//make graphs -This is where I want automation, so I can iterate through a
//folder of edge files and output corresponding monthly graphs.
val edgesJan = EdgeMaker(sc.textFile("D:~edges2011Jan.txt"))
val graphJan = Graph(vertices, edgesJan)
val edgesFeb = EdgeMaker(sc.textFile("D:~edges2011Feb.txt"))
val graphFeb = Graph(vertices, edgesFeb)
val edgesMar = EdgeMaker(sc.textFile("D:~edges2011Mar.txt"))
val graphMar = Graph(vertices, edgesMar)
val edgesApr = EdgeMaker(sc.textFile("D:~edges2011Apr.txt"))
val graphApr = Graph(vertices, edgesApr)
val edgesMay = EdgeMaker(sc.textFile("D:~edges2011May.txt"))
val graphMay = Graph(vertices, edgesMay)
val edgesJun = EdgeMaker(sc.textFile("D:~edges2011Jun.txt"))
val graphJun = Graph(vertices, edgesJun)
val edgesJul = EdgeMaker(sc.textFile("D:~edges2011Jul.txt"))
val graphJul = Graph(vertices, edgesJul)
val edgesAug = EdgeMaker(sc.textFile("D:~edges2011Aug.txt"))
val graphAug = Graph(vertices, edgesAug)
val edgesSep = EdgeMaker(sc.textFile("D:~edges2011Sep.txt"))
val graphSep = Graph(vertices, edgesSep)
val edgesOct = EdgeMaker(sc.textFile("D:~edges2011Oct.txt"))
val graphOct = Graph(vertices, edgesOct)
val edgesNov = EdgeMaker(sc.textFile("D:~edges2011Nov.txt"))
val graphNov = Graph(vertices, edgesNov)
val edgesDec = EdgeMaker(sc.textFile("D:~edges2011Dec.txt"))
val graphDec = Graph(vertices, edgesDec)
Any help or pointers on this would be much appreciated.
You can use SparkContext's wholeTextFiles to keep the filename, and use that String for naming/filtering your values and output:
val fileLoad = sc.wholeTextFiles("hdfs:///..Path").map { case (filename, content) => ... }
The Spark Context textFile only reads the data, but does not keep the file name.
----EDIT----
Sorry, I seem to have misunderstood the question; you can load multiple files using
sc.wholeTextFiles("~/path/file[0-5]*,/anotherPath/*.txt").map { case (filename, content) => ... }
The asterisk * should load all files in the path, assuming they are all supported input file types.
This read will concatenate all your files into one single large RDD, avoiding repeated calls (because for each call you have to specify the path and filename, which I think is what you want to avoid).
Reading with the filename allows you to group by the file name and apply your graph function to each group.
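Putting that together with the question's EdgeMaker, a minimal sketch of the automation might look like the following. The month-extraction regex and the collect() over whole file contents are my assumptions for illustration (collect() pulls every file's content to the driver, so this only suits modestly sized edge files):

import org.apache.spark.graphx.{Edge, Graph}

// One (filename, content) pair per monthly edge file.
val monthlyGraphs = sc.wholeTextFiles("D:~edges2011*.txt")
  .collect()
  .map { case (filename, content) =>
    // e.g. ".../edges2011Jan.txt" -> "Jan" (assumed naming pattern from the question)
    val month = filename.replaceAll(""".*edges2011(\w+)\.txt""", "$1")
    val edges = EdgeMaker(sc.parallelize(content.split("\\r?\\n").toSeq))
    month -> Graph(vertices, edges)
  }
  .toMap

// Then, for example: monthlyGraphs("Jan").numEdges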