Write Parquet files from Spark RDD to dynamic folders - scala

Given the following snippet (Spark version: 1.5.2):
rdd.toDF().write.mode(SaveMode.Append).parquet(pathToStorage)
which saves RDD data to flattened Parquet files, I would like my storage to have a structure like:
country/
year/
yearmonth/
yearmonthday/
The data itself contains a country column and a timestamp one, so I started with this method. However, since I only have a timestamp in my data, I can't partition the whole thing by year/yearmonth/yearmonthday as those are not columns per se...
And this solution seemed pretty nice, except I can't get to adapt it to Parquet files...
Any idea?
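For illustration, partitioning by derived columns would look roughly like this (a hypothetical sketch, not the approach taken in the answer below; it assumes a timestamp-typed timestamp column and that date_format and DataFrameWriter.partitionBy are available in this Spark version):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
// derive the folder-level columns from the timestamp, then let Spark lay out the directories
val withPartitionCols = rdd.toDF()
  .withColumn("year", date_format(col("timestamp"), "yyyy"))
  .withColumn("yearmonth", date_format(col("timestamp"), "yyyyMM"))
  .withColumn("yearmonthday", date_format(col("timestamp"), "yyyyMMdd"))
withPartitionCols.write
  .mode(SaveMode.Append)
  .partitionBy("country", "year", "yearmonth", "yearmonthday")
  .parquet(pathToStorage)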

I figured it out. In order for the path to be dynamically linked to the RDD, one first has to create a tuple from the rdd:
rdd.map(model => (model.country, model))
Then, the records will all have to be parsed, to retrieve the distinct countries:
val countries = rdd
  .map { case (country, model) => country }
  .distinct()
  .collect()
Now that the countries are known, the records can be written according to their distinct country:
countries.foreach { country =>
  val countryRDD = rdd
    .filter { case (c, model) => c == country }
    .map(_._2)
  countryRDD.toDF().write.parquet(pathToStorage + "/" + country)
}
Of course, the whole collection has to be parsed twice, but it is the only solution I found so far.
Regarding the timestamp, you will just have to do the same process with a 3-tuple (the third element being something like 20160214); in the end I went with the current timestamp.
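A rough sketch of that 3-tuple variant (hypothetical; it assumes model.timestamp is a java.util.Date and that the day string becomes the second folder level):
val dayFormat = new java.text.SimpleDateFormat("yyyyMMdd")
// key every record by (country, day) as a 3-tuple, then write each distinct pair to its own folder
val keyed = rdd.map(model => (model.country, dayFormat.format(model.timestamp), model))
keyed.map { case (country, day, _) => (country, day) }.distinct().collect().foreach {
  case (country, day) =>
    val subset = keyed.filter { case (c, d, _) => c == country && d == day }.map(_._3)
    subset.toDF().write.mode(SaveMode.Append).parquet(pathToStorage + "/" + country + "/" + day)
}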

Related

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD.
val linesWithAge = lines.filter(line => line.contains("Age=")) // this filters out data which doesn't have an age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // here I am trying to split the data where there is a "
When I split on quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
I also need to do this for every line in the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array and I've now successfully managed to assign it to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None; hence the result of the for-comprehension will be None and the record will be dropped by the flatMap operation.
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML
// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key of your line (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs[Float](line,"Age"), getValueForKeyAs[Float](line,"Reputation")))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  val value = res(1).split("\"")(1)
  // caveat: asInstanceOf does not convert the String; for Float you would need an explicit value.toFloat
  value.asInstanceOf[A]
}

Skipping rows with missing fields while loading a text file into a Spark Context

I need to load a tab delimited file into Spark Context. However some of the fields are missing values, and I need to filter out those rows. I am using the following code. However if the field is completely missing (e.g. one less tab in the row) this code throws an exception. What is a better way to achieve this?
val RDD = sc.textFile("file.txt").map(_.split("\t"))
.filter(_(0).nonEmpty)
.filter(_(1).nonEmpty)
.filter(_(2).nonEmpty)
.filter(_(3).nonEmpty)
.filter(_(4).nonEmpty)
.filter(_(5).nonEmpty)
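One defensive variant (a sketch, assuming six expected fields) is to split with a negative limit so trailing empty fields are kept, and to check the array length before testing individual fields:
val RDD = sc.textFile("file.txt")
  .map(_.split("\t", -1)) // -1 keeps trailing empty fields instead of dropping them
  .filter(fields => fields.length >= 6 && fields.take(6).forall(_.nonEmpty))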
I have found this to work nicely for large datasets:
val allRecords: RDD[Either[(String, String, String, String), Array[String]]] = sc.textFile("file.txt")
  .map(_.split("\t"))
  .map {
    case Array(name, address, phone, country) => Left((name, address, phone, country))
    case badArray => Right(badArray)
  }
val goodRecords = allRecords.collect { case Left(r) => r }
You can read your file as a dataframe and use DataFrameNaFunctions
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load("file.txt")
val cleanDF = df.na.drop()
See the spark-csv library documentation for details, just in case.

SPARK - Use RDD.foreach to Create a Dataframe and execute actions on the Dataframe

I am new to Spark and am trying to figure out a better way to achieve the following scenario.
There is a database table containing 3 fields - Category, Amount, Quantity.
First I try to pull all the distinct Categories from the database.
val categories:RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
Now for each category I want to execute the Pipeline which essentially creates dataframes from each category and apply some Machine Learning.
categories.foreach(executePipeline)
def execute(category: String): Unit = {
  val dfCategory = sqlCtxt.read.jdbc(JDBC_URL, "SELECT * FROM TABLE_NAME WHERE CATEGORY=" + category)
  dfCategory.show()
}
Is it possible to do something like this ? Or is there any better alternative ?
// You could get all your data with a single query and convert it to an rdd
val data = sqlCtxt.read.jdbc(JDBC_URL, "SELECT * FROM TABLE_NAME").rdd
// then group the data by category
val groupedData = data.groupBy(row => row.getAs[String]("category"))
// then you get an RDD[(String, Iterable[org.apache.spark.sql.Row])]
// and you can iterate over it and execute your pipeline
groupedData.map { case (categoryName, items) =>
  // executePipeline(categoryName, items)
}
Your code would fail with a TaskNotSerializable exception, since you're trying to use the SQLContext (which isn't serializable) inside the execute method; that method would have to be serialized and sent to the workers to be executed on each record in the categories RDD.
Assuming the number of categories is limited, meaning the list of categories isn't too large to fit in your driver's memory, you should collect the categories to the driver and iterate over that local collection using foreach:
val categoriesRdd: RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
val categories: Seq[String] = categoriesRdd.collect()
categories.foreach(executePipeline)
Another improvement would be reusing the dataframe that you loaded instead of performing another query, using a filter for each category:
def executePipeline(singleCategoryDf: DataFrame) { /* ... */ }
categories.foreach { cat =>
  val filtered = df.filter(col(CATEGORY) === cat)
  executePipeline(filtered)
}
NOTE: to make sure the re-use of df doesn't reload it for every execution, make sure you cache() it before collecting the categories.
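A minimal sketch of that ordering:
df.cache() // cache df so the collect below materializes it and every per-category filter reuses it
val categories: Seq[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString).collect()
categories.foreach(cat => executePipeline(df.filter(col(CATEGORY) === cat)))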

Apache Spark map and reduce with passing values

I have a simple map and reduce job over an RDD loaded from Cassandra.
The code looks something like this
sc.cassandraTable("app","channels").select("id").toArray.foreach((o) => {
val orders = sc.cassandraTable("fam", "table")
.select("date", "f2", "f3", "f4")
.where("id = ?", o("id")) # This o("id") is the ID i want later append to the finished list
val month = orders
.map( oo => {
var total_revenue = List(oo.getIntOption("f2"), oo.getIntOption("f3"), oo.getIntOption("f4")).flatten.reduce(_ + _)
(getDateAs("hour", oo.getDate("date")), total_revenue)
})
.reduceByKey(_ + _)
})
So this code sums up the revenue and returns something like this:
(2014-11-23 18:00:00, 12412)
(2014-11-23 19:00:00, 12511)
Now I want to save this back to a Cassandra table revenue_hour, but I need the ID somehow in that list, something like this:
(2014-11-23 18:00:00, 12412, "CH1")
(2014-11-23 19:00:00, 12511, "CH1")
How can I make this work with more than just a (key, value) list? How can I pass along more values that should not be transformed, but just passed through to the end, so I can save them back to Cassandra?
Maybe you could use a class and carry it through the flow. I mean, define a RevenueHour class:
case class RevenueHour(date: java.util.Date, revenue: Long, id: String)
Then build an intermediate RevenueHour in the map phase and another one in the reduce phase.
val map: RDD[(Date, RevenueHour)] = orders.map(row =>
  (
    getDateAs("hour", row.getDate("date")),
    RevenueHour(
      row.getDate("date"),
      List(row.getIntOption("f2"), row.getIntOption("f3"), row.getIntOption("f4")).flatten.reduce(_ + _),
      row.getString("id")
    )
  )
).reduceByKey((o1: RevenueHour, o2: RevenueHour) => RevenueHour(getDateAs("hour", o1.date), o1.revenue + o2.revenue, o1.id))
I use the o1 RevenueHour because both o1 and o2 will have the same key and the same id (because of the where clause earlier).
Hope it helps.
The approach presented in the question sequences the processing of the data: it iterates over an array of ids and applies a Spark job to only a (potentially small) subset of the data at a time.
Without knowing the relation between the 'channels' and 'table' data, I see two options to fully utilize Spark's ability to process data in parallel:
Option 1
If the data in the 'table' table (called "orders" from here on) contains the full set of ids that we require in the report, we can apply the reporting logic to the whole table:
Based on the question, we will use this C* schema:
CREATE TABLE example.orders (
  id text,
  date timestamp,
  f2 decimal,
  f3 decimal,
  f4 decimal,
  PRIMARY KEY (id, date)
);
It makes it a lot easier to access Cassandra data if we provide a case class that represents the schema of the table:
case class Order(id: String, date: Long, f2: Option[BigDecimal], f3: Option[BigDecimal], f4: Option[BigDecimal]) {
  lazy val total = List(f2, f3, f4).flatten.sum
}
Then we can define an RDD based on the Cassandra table. When we provide the case class as the type, the spark-cassandra connector can directly perform the conversion for our convenience:
val ordersRDD = sc.cassandraTable[Order]("example", "orders").select("id", "date", "f2", "f3", "f4")
val revenueByIDPerHour = ordersRDD.map{order => ((order.id, getDateAs("hour", order.date)), order.total)}.reduceByKey(_ + _)
And finally save back to Cassandra:
revenueByIDPerHour.map{ case ((id,date), revenue) => (id, date, revenue)}
.saveToCassandra("example","revenue", SomeColumns("id", "date", "total"))
Option 2
If the ids contained in the ("app","channels") table should be used to filter the set of ids (e.g. valid ids), then we can join the ids from this table with the orders. The job will be similar to the previous one, with the addition of:
val idRDD = sc.cassandraTable("app", "channels").select("id").map(row => (row.getString("id"), ()))
val ordersRDD = sc.cassandraTable[Order]("example", "orders").select("id", "date", "f2", "f3", "f4")
val validOrders = idRDD.join(ordersRDD.map(order => (order.id, order)))
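From there, the same aggregation and save as in Option 1 apply (a sketch, assuming the getDateAs helper and the revenue table shown above):
validOrders
  .map { case (id, (_, order)) => ((id, getDateAs("hour", order.date)), order.total) }
  .reduceByKey(_ + _)
  .map { case ((id, date), revenue) => (id, date, revenue) }
  .saveToCassandra("example", "revenue", SomeColumns("id", "date", "total"))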
These two ways illustrate how to work with Cassandra and Spark, making use of the distributed nature of Spark's operations. It should also be considerably faster than executing a query for each ID in the 'channels' table.

How do I iterate RDD's in apache spark (scala)

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"]:
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
Now I want to iterate over each of those occurrences to do something with every filename and content.
I can't seem to find any documentation on how to do this however.
So what I want is this:
foreach occurrence-in-the-rdd {
  // do stuff with the array found at location n of the RDD
}
You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
val testRDD = sc.parallelize(testData, 2)
// Print the RDD of arrays.
testRDD.collect().foreach(a => println(a.size))
// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)
// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))
// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)
// Print each remaining array.
bigRDD.collect().foreach(a => {
  a.foreach(e => print(e + " "))
  println()
})
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print-loop.) It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control on the number of elements you get but not where they came from -- defined as the "first" ones which is a concept dealt with by various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: using random sampling that you can control, but both letting you specify the exact number of results and returning an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())
The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
the txtRDD will now only contain files that have the extension ".txt"
And if you want to word count those files you can say
// split the documents into words in one long list
val words = txtRDD flatMap { case (id, text) => text.split("\\s+") }
// give each word a count of 1
val wordsT = words map (x => (x, 1))
// sum up the counts for each word
val wordCount = wordsT reduceByKey ((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
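A sketch of that pattern (the per-partition setup here is a cheap stand-in; imagine it loading a large model or annotation pipeline instead):
val processedRDD = txtRDD.mapPartitions { iter =>
  // expensive setup happens once per partition, not once per element
  val stopWords = Set("the", "a", "an", "and", "of") // stand-in for heavy initialization
  iter.map { case (id, text) => (id, text.split("\\s+").count(w => !stopWords.contains(w.toLowerCase))) }
}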
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I am afraid I have no knowledge of Scala, so everything I have to offer is Java code. However, it should not be very difficult to translate it into Scala.
JavaRDD<String> res = file.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> t) throws Exception {
        ArrayList<String> tmpRes = new ArrayList<>();
        while (t.hasNext()) {
            String element = t.next(); // advance the iterator for every input element
            // placeholder work: replace with whatever should happen per "filename"/"content" pair
            tmpRes.add(element);
        }
        return tmpRes; // ArrayList implements Iterable<String>
    }
}).cache();
What wholeTextFiles returns is a pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
Or convert the pair RDD to an RDD of just the values first:
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.map(t => t._2)
.collect
.foreach { x => println(x)}
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is better suited to small files.
for (element <- YourRDD) {
  // do what you want with element in each iteration; if you want the index of the element,
  // simply use a counter variable in this loop, starting from 0
  println(element._1) // this will print all filenames
}