I have a directory structure on S3 looking like this:
foo
|-base
|  |-2017
|     |-01
|        |-04
|           |-part1.orc, part2.orc ....
|-A
|  |-2017
|     |-01
|        |-04
|           |-part1.orc, part2.orc ....
|-B
|  |-2017
|     |-01
|        |-04
|           |-part1.orc, part2.orc ....
Meaning that for directory foo I have multiple output tables (base, A, B, etc.) in a path based on the timestamp of a job.
I'd like to left join them all, based on a timestamp and the master directory, in this case foo. This would mean reading each output table base, A, B, etc. into new, separate input tables on which a left join can be applied, all with the base table as the starting point.
Something like this (not working code!)
val dfs: Seq[DataFrame] = spark.read.orc("foo/*/2017/01/04/*")
val base: DataFrame = spark.read.orc("foo/base/2017/01/04/*")
val result = dfs.foldLeft(base)((l, r) => l.join(r, 'id, "left"))
Can someone point me in the right direction on how to get that sequence of DataFrames? It might even be worth considering the reads as lazy, or sequential, thus only reading the A or B table when the join is applied to reduce memory requirements.
Note: the directory structure is not final, meaning it can change if that fits the solution.
From what I understand, Spark uses the underlying Hadoop API to read in data files. So the inherited behavior is to read everything you specify into one single RDD/DataFrame.
To achieve what you want, you can first get a list of directories with:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{ FileSystem, Path }
val path = "foo/"
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val paths: Array[String] = fs.listStatus(new Path(path))
  .filter(_.isDirectory)
  .map(_.getPath.toString)
Then load them into separate DataFrames:
val dfs: Array[DataFrame] = paths.
map(path => spark.read.orc(path + "/2017/01/04/*"))
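From there, a hedged continuation (assuming each table has an id column to join on, as the question's pseudocode suggests) could exclude the base folder from the listing and fold left joins over the rest:
import org.apache.spark.sql.DataFrame

// skip the base folder in the listing; everything else gets joined onto it
val otherPaths = paths.filterNot(_.endsWith("/base"))

val base: DataFrame = spark.read.orc(path + "base/2017/01/04/*")

val result: DataFrame = otherPaths
  .map(p => spark.read.orc(p + "/2017/01/04/*"))
  .foldLeft(base)((acc, df) => acc.join(df, Seq("id"), "left"))
Note that spark.read.orc only builds a logical plan here; the actual column data is not read until an action is executed on result, which also addresses the question's concern about lazy reads.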
Here's a straightforward solution to what (I think) you're trying to do, without using extra features like Hive or built-in partitioning abilities:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{ FileSystem, Path, PathFilter }
import spark.implicits._

// load base
val baseDF = spark.read.orc("foo/base/2017/01/04").as("base")

// create or use existing Hadoop FileSystem - this should use the actual config and path
val fs = FileSystem.get(new URI("."), new Configuration())

// find all other subfolders under foo/
val otherFolderPaths = fs.listStatus(new Path("foo/"), new PathFilter {
  override def accept(path: Path): Boolean = path.getName != "base"
}).map(_.getPath)

// use foldLeft to join all, using the DF aliases to find the right "id" column
val result = otherFolderPaths.foldLeft(baseDF) { (df, path) =>
  df.join(spark.read.orc(s"$path/2017/01/04").as(path.getName), $"base.id" === $"${path.getName}.id", "left")
}
I have to list all files whose names contain a timestamp greater than that of a particular file, in Scala. Below is an example.
Files available:
log_20200601T123421.log
log_20200601T153432.log
log_20200705T093425.log
log_20200803T049383.log
Condition file:
log_20200601T123421.log - I need to list all file names whose timestamp in the name is greater than or equal to 20200601T123421. The result would be:
Output list:
log_20200601T153432.log
log_20200705T093425.log
log_20200803T049383.log
How can I achieve this in Scala? I was trying with Apache Commons IO, but I couldn't find a greater-than-or-equal NameFileFilter for it.
Perhaps the following code snippet could be a starting point:
import java.io.File

// plain lexicographic comparison on the names works here because the timestamp is zero-padded and fixed-width
def getListOfFiles(dir: File): List[File] =
  dir.listFiles.filter(x => x.getName > "log_20200601T123421.log").toList

val files = getListOfFiles(new File("/tmp"))
For the extended task to collect files from different sub-directories:
import java.io.File
def recursiveListFiles(f: File): Array[File] = {
val these = f.listFiles
these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
}
val files = recursiveListFiles(new File("/tmp")).filter(x => x.getName > "log_20200601T123421.log")
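If you want to be explicit about comparing timestamps rather than whole file names, a small variant (assuming every name follows the log_<yyyyMMdd'T'HHmmss>.log pattern from the question) strips the prefix and suffix first; it keeps files strictly greater than the condition file, matching the expected output:
import java.io.File

val threshold = "20200601T123421"

// extract the timestamp part of a name like log_20200601T123421.log
def timestampOf(f: File): String =
  f.getName.stripPrefix("log_").stripSuffix(".log")

val newer: List[File] = new File("/tmp")
  .listFiles()
  .filter(f => f.getName.startsWith("log_") && timestampOf(f) > threshold)
  .toList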
I have code for tokenizing a string.
But that tokenization method uses some data which is loaded when my application starts.
val stopwords = getStopwords();
val tokens = tokenize("hello i am good",stopwords)
def tokenize(string:String,stopwords: List[String]) : List[String] = {
val splitted = string.split(" ")
// I use these stopwords to filter my splitted array,
// then I return the remaining items.
}
Now I want to make the tokenize method a Spark UDF, so I can use it to create a new column in DataFrame transformations.
I have created simple UDFs before, but they had no dependencies like this one, which needs data to be read from a text file, etc.
Can someone tell me how to do this kind of operation?
This is what I have tried, and it's working:
val moviesDF = Seq(
  ("kingdomofheaven"),
  ("enemyatthegates"),
  ("salesinfointheyearofdecember")
).toDF("column_name")
val tokenizeUDF: UserDefinedFunction = udf(tokenize(_: String): List[String])
moviesDF.withColumn("tokenized", tokenizeUDF(col("column_name"))).show(100, false)
def tokenize(name: String): List[String] = {
val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()
val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
val maxLengthWord: Int = wordFreqMap.keys.maxBy(_.length).length
.................
.................
}
It's giving me the expected output:
+----------------------------+--------------------------+
|columnname |tokenized |
+----------------------------+--------------------------+
|kingdomofheaven |[kingdom, heaven] |
|enemyatthegates |[enemi, gate] |
|salesinfointheyearofdecember|[sale, info, year, decemb]|
+----------------------------+--------------------------+
Now my question is, will it work when it's deployed? Currently I am running it locally. My main concern is that this function reads from a file to get information like stopwords, word frequencies, etc. to make the tokenization possible. So will registering it like this work properly?
At this point, if you deploy this code, Spark will try to serialize your DataProviderUtil; you would need to mark that class as Serializable. Another possibility is to declare your logic inside an object. Functions inside objects are treated as static functions and are not serialized.
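For instance, a minimal sketch of the second option, assuming DataProviderUtil and the tokenize logic are your own code: keep everything in a top-level object, so the UDF only references the object and the loaded data never has to be serialized from the driver.
import org.apache.spark.sql.functions.{col, udf}

object Tokenizer {

  // loaded lazily, once per JVM (driver or executor), the first time they are touched
  lazy val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
  lazy val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()

  def tokenize(name: String): List[String] = {
    // ... your existing splitting/stemming logic using stopWords and wordFreqMap ...
    ???
  }

  val tokenizeUDF = udf(tokenize _)
}

// usage:
// moviesDF.withColumn("tokenized", Tokenizer.tokenizeUDF(col("column_name"))).show(false)
One caveat for deployment: whatever files DataProviderUtil reads must also be reachable from the executors (for example shipped with spark-submit --files or placed on a shared filesystem), because the lazy vals will be evaluated there as well.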
I have a data-set which contains the (child, parent) entities. I need to find the ultimate parent of every child from the data-set. My data-set has 1.3 million records. Sample data is given below.
c-1, p-1
p-1, p-2
p-2, p-3
p-3, p-4
In the above sample data the ultimate parent of c-1 is p-4, ultimate parent of p-1 is p-4 and so on.
Sometimes, to find the ultimate parent of a child, I need to traverse multiple levels recursively.
This is what I have tried so far:
1) I tried to create a Spark DataFrame and recursively find the parent of every child, but this approach takes a very long time.
2) I tried to create a UDF which can be applied to every row of the data-set, but I would need to look up the DataFrame (the lookup data-set) inside the UDF, and Spark does not support using a DataFrame inside a UDF. So this approach did not help me either.
Any suggestions on to how to approach this problem?
To address both of the problems you cited, GraphX's Pregel API, which is how you can implement recursive CTE-like traversals in Spark, could come to your rescue.
Here is a sample code below.
import scala.util.hashing.MurmurHash3

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// setup & call the Pregel API
// (setMsg, sendMsg and mergeMsg are the vertex program, message sender and message
// combiner passed to pregel; they are defined in the article linked below)
def calcTopLevelHierarcy(vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any, (Int, Any, String, Int, Int))] = {

  // create the vertex RDD
  // primary key, root, path
  val verticesRDD = vertexDF
    .rdd
    .map { x => (x.get(0), x.get(1), x.get(2)) }
    .map { x => (MurmurHash3.stringHash(x._1.toString).toLong, (x._1.asInstanceOf[Any], x._2.asInstanceOf[Any], x._3.asInstanceOf[String])) }

  // create the edge RDD
  // top-down relationship
  val EdgesRDD = edgeDF.rdd.map { x => (x.get(0), x.get(1)) }
    .map { x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong, "topdown") }

  // create graph
  val graph = Graph(verticesRDD, EdgesRDD).cache()

  val pathSeperator = """/"""

  // initialize id, level, root, path, iscyclic, isleaf
  val initialMsg = (0L, 0, 0.asInstanceOf[Any], List("dummy"), 0, 1)

  // add more dummy attributes to the vertices - id, level, root, path, isCyclic, existing value of current vertex to build path, isleaf, pk
  val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1))

  val hrchyRDD = initialGraph.pregel(initialMsg,
    Int.MaxValue,
    EdgeDirection.Out)(
    setMsg,
    sendMsg,
    mergeMsg)

  // build the path from the list
  val hrchyOutRDD = hrchyRDD.vertices.map { case (id, v) => (v._8, (v._2, v._3, pathSeperator + v._4.reverse.mkString(pathSeperator), v._5, v._7)) }

  hrchyOutRDD
}
In the method calcTopLevelHierarcy() you can pass in DataFrames (which addresses your second point).
Here is a very good link with some sample code. Please take a look.
Hope this helps.
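For the concrete (child, parent) case in the question, here is a minimal, hypothetical sketch of the same Pregel idea (the names, the string edge attribute and the one-level-per-superstep propagation are my own illustrative choices, not taken from the linked code); it assumes an existing SparkSession named spark:
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph}

val sc = spark.sparkContext

// the (child, parent) pairs from the question
val pairs = Seq(("c-1", "p-1"), ("p-1", "p-2"), ("p-2", "p-3"), ("p-3", "p-4"))

// GraphX needs Long vertex ids, so assign one to every distinct entity
val ids: Map[String, Long] =
  pairs.flatMap { case (c, p) => Seq(c, p) }.distinct.zipWithIndex
    .map { case (name, i) => name -> i.toLong }.toMap

// vertex attribute = (own name, current best guess of its ultimate parent)
val vertices = sc.parallelize(ids.toSeq.map { case (name, id) => (id, (name, name)) })
// edges point parent -> child so the root name can flow downwards
val edges = sc.parallelize(pairs.map { case (c, p) => Edge(ids(p), ids(c), "parentOf") })

val graph = Graph(vertices, edges)

// each superstep, a vertex adopts the root reported by its parent and forwards its own
// current root to its children; Pregel stops once no vertex changes any more
val resolved = graph.pregel("", activeDirection = EdgeDirection.Out)(
  (_, attr, msg) => if (msg.isEmpty) attr else (attr._1, msg), // vprog: adopt the incoming root
  triplet => Iterator((triplet.dstId, triplet.srcAttr._2)),    // sendMsg: parent's root -> child
  (a, _) => a                                                  // mergeMsg: each child has one parent
)

// (entity, ultimate parent): (c-1,p-4), (p-1,p-4), (p-2,p-4), (p-3,p-4), (p-4,p-4)
resolved.vertices.map { case (_, (name, root)) => (name, root) }.collect().foreach(println)
Because the root only advances one level per superstep, the number of Pregel iterations grows with the depth of the hierarchy, which is fine for shallow hierarchies but worth keeping in mind for very deep chains.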
I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
The StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
val name = getName(student.StudentId)
val course = getCourse(student.StudentId)
Array(name, course, student.City)
}.mapPartitions{data =>
val stringWriter = new StringWriter();
val csvWriter =new CSVWriter(stringWriter);
csvWriter.writeAll(data.toList)
Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName}
}
def getCourse(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course}
}
Problem
File gets generated but the values are object representations instead of String value.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with RDD API, looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit; Using an if without an else means that for some inputs there's no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())
// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
  val details = studentB.value(student.StudentId)
  Array(details.StudentName, details.Course, student.City)
}
.map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did
// save as text file
resultStrings.saveAsTextFile(outputPath)
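Note that the snippet above assumes studentRDD and studentdetailsRDD already hold parsed records with named fields, as the original code implies; a minimal, hypothetical sketch of that parsing step (case class names and file paths are placeholders) could look like this:
case class Student(StudentId: String, City: String)
case class StudentDetails(StudentId: String, StudentName: String, Course: String)

val studentRDD = sc.textFile("Student.csv")
  .filter(!_.startsWith("StudentId"))                 // drop the header line
  .map(_.split(",")).map(a => Student(a(0), a(1)))

val studentdetailsRDD = sc.textFile("StudentDetails.csv")
  .filter(!_.startsWith("StudentId"))
  .map(_.split(",")).map(a => StudentDetails(a(0), a(1), a(2)))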
Spark has great support for joins and for writing to files. The join takes only one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq((101,"NDLS"),
(102,"Mumbai")
).toDF("id", "city")
val df2 = Seq((101,"ABC","C001"),
(102,"XYZ","C002")
).toDF("id", "name", "course")
val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created, containing a single part file, which is the final result.
I have read this topic:
Iterate over fields in typesafe config
and made some changes, but I still don't know how to iterate over conf files in the Play framework.
Providers=[{1234 : "CProduct"},
{12345 : "ChProduct"},
{123 : "SProduct"}]
This is my conf file, called providers.conf. The question is how I can iterate over the entries and create a dropdown box from them. I would like to get them as a Map[Int, String] if possible.
I know I have to read them like this:
val config = ConfigFactory.load("providers.conf").getConfigList("Providers")
I can read the conf file like that, but I need to use the values from a template, and to do that I need to convert them to a HashMap or a List, or whatever is convenient.
Cheers,
I'm not sure if this is the most efficient way to do this, but this works:
1) Let's get our config:
val config = ConfigFactory.load()
scala> val providers = config.getConfigList("providers")
providers: java.util.List[_ <: com.typesafe.config.Config] = [Config(SimpleConfigObject({"id":"1234","name":" Product2"})), Config(SimpleConfigObject({"id":"4523","name":"Product1"})), Config(SimpleConfigObject({"id":"432","name":" Product3"}))]
2) For this example, introduce a Provider entity:
case class Provider(id: String, name: String)
3) Now let's convert the list of configs to providers:
import scala.collection.JavaConversions._
providers.map(conf => Provider(conf.getString("id"), conf.getString("name"))).toList
res27: List[Provider] = List(Provider(1234, Product2), Provider(4523,Product1), Provider(432, Product3))
We need to explicitly convert it with toList because, by default, the Java List converts to a Buffer.
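If you really want the Map[Int, String] mentioned in the question (for example to feed a dropdown in a Play template), a hedged follow-up, assuming the same id/name shape used above and that the ids are numeric, is to fold the providers into a Map:
import scala.collection.JavaConversions._

val providerMap: Map[Int, String] =
  providers.map(conf => conf.getString("id").trim.toInt -> conf.getString("name").trim).toMap
providerMap can then be passed to the template and rendered as the options of a select element.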
Here is my solution for that:
val config = ConfigFactory.load("providers.conf").getConfigList("Providers")
var providerlist = new java.util.ArrayList[model.Provider]
val providers = (0 until config.size())
providers foreach { count =>
  val iterator = config.get(count).entrySet().iterator()
  while (iterator.hasNext()) {
    val entry = iterator.next()
    val p = new Provider(entry.getKey(), entry.getValue().render())
    providerlist.add(p)
  }
}
println(providerlist.get(0).providerId+providerlist.get(0).providerName)
println(providerlist.get(33).providerId+providerlist.get(33).providerName)
and my Provider class:
package model
case class Provider(providerId: String, providerName: String)