Having in mind the following dataset:
I would like to obtain
As you can see, the idea is basically to follow the path indicated by the ACTUAL_ID column until it is null (if it wasn't already).
I tried to use a UDF to which I passed the full initial DataFrame and then recursively looked up the value I want, but it seems it is not possible to pass DataFrames to UDFs. I also looked into replacing the value of a row, but that does not seem to be possible either.
My latest attempt:
def calculateLatestImdate(df: DataFrame, lookupId: String): String = {
  val foundId = df.filter($"ID" === lookupId).select($"ACTUAL_ID").first.getAs[String]("ACTUAL_ID")
  if (foundId == null || foundId == "") {
    lookupId
  } else {
    calculateLatestImdate(df, foundId)
  }
}
val calculateLatestImdateUdf = udf((df:DataFrame, s:String) => {
calculateLatestImdate(df,s)
})
val df = sc.parallelize(Seq(("1", "", "A"), ("2", "3", "B"), ("3", "6", "C"), ("4", "5", "D"), ("5", "", "E"), ("6", "", "F"))).toDF("ID","ACTUAL_ID", "DATA")
val finalDf = df.withColumn("FINAL_ID", when(isEmpty($"ACTUAL_ID"), $"ID").otherwise(calculateLatestImdateUdf(df, $"ACTUAL_ID")))
This looked a bit like a graph problem to me so I worked up an answer using Scala and graphframes. It makes use of the connectedComponents algorithm and the outDegrees method of the graphframe. I've made an assumption that the end of each tree is unique as per your sample data but this assumption needs to be checked. I'd be interested to see what the performance is like with more data, but let me know what you think of the solution.
The complete script:
// NB graphframes had to be installed separately with the right Scala version
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
// Create the test data
// Vertices dataframe
val v2 = sqlContext.createDataFrame(List(
( 1, 0, "A" ), ( 2, 3, "B" ), ( 3, 6, "C" ),
( 4, 5, "D" ), ( 5, 0, "E" ), ( 6, 0, "F" )
)).toDF("id", "actual_id", "data")
// Edge dataframe
val e2 = sqlContext.createDataFrame(List(
(2, 3, "is linked to"),
(3, 6, "is linked to"),
(4, 5, "is linked to")
)).toDF("src", "dst", "relationship")
// Create the graph frame
val g2 = GraphFrame(v2, e2)
print(g2)
// The connected components adds a component id to each 'group'
sc.setCheckpointDir("/tmp/graphframes-example-connected-components")
val components = g2.connectedComponents.run() // doesn't work on Spark 1.4
display(components)
// "end" of tree nodes have no outDegree, so add that in to the component df
val endOfTree = components.join(g2.outDegrees, Seq("id"), "left")
.select("component", "data")
.where("outDegree is null")
endOfTree.show()
components.as("c").join(endOfTree.as("t"), $"c.component" === $"t.component")
.select($"c.id", $"c.component", $"t.data")
.orderBy("id")
.show()
My results:
If your data is already in a DataFrame, it's easy to generate the edges DataFrame from your original with just a select and a where filter, e.g.:
// Create the GraphFrame from the dataframe
val v2 = df
val e2 = df
.select("id", "actual_id")
.withColumn("rel", lit("is linked to"))
.where("actual_id > 0")
.toDF("src", "dst", "rel")
val g2 = GraphFrame(v2, e2)
print(g2)
g2.vertices.show()
g2.edges.show()
I believe I have found an answer to my problem.
def calculateLatestId(df: DataFrame): DataFrame = {
  var joinedDf = df.as("df1")
    .join(df.as("df2"), $"df1.ACTUAL_ID" === $"df2.ID", "outer")
    .withColumn("FINAL_ID",
      when($"df2.ID".isNull, $"df1.ID")
        .when($"df2.ACTUAL_ID".isNotNull, $"df2.ACTUAL_ID")
        .otherwise($"df2.ID"))
    .select($"df1.*", $"FINAL_ID")
    .filter($"df1.ID".isNotNull)

  val differentIds = joinedDf.filter($"df1.ACTUAL_ID" =!= $"FINAL_ID")
  joinedDf = joinedDf.withColumn("ACTUAL_ID", $"FINAL_ID").drop($"FINAL_ID")

  if (differentIds.count > 0) {
    calculateLatestId(joinedDf)
  } else {
    joinedDf.as("df1")
      .join(joinedDf.as("df2"), $"df1.ACTUAL_ID" === $"df2.ID", "inner")
      .select($"df1.ID", $"df2.*")
      .drop($"df2.ID")
  }
}
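For reference, a quick usage sketch with the sample data from the question. The join-based recursion above appears to expect a missing ACTUAL_ID to be null rather than an empty string, so I blank the empty strings out first (just a sketch, untested):
import org.apache.spark.sql.functions.when
import spark.implicits._ // for $ and toDF (already in scope in spark-shell)

val df = Seq(("1", "", "A"), ("2", "3", "B"), ("3", "6", "C"),
             ("4", "5", "D"), ("5", "", "E"), ("6", "", "F"))
  .toDF("ID", "ACTUAL_ID", "DATA")
  // when without otherwise yields null for rows where ACTUAL_ID is empty
  .withColumn("ACTUAL_ID", when($"ACTUAL_ID" =!= "", $"ACTUAL_ID"))

val resolved = calculateLatestId(df)
resolved.show()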
I believe the performance can be improved somehow, probably by reducing the number of rows after each iteration and doing some sort of join + clean-up at the end.
Related
I am very new to Scala and learning to work with RDDs. I have two csv files which have the following headers and data:
csv1.txt
id,"location", "zipcode"
1, "a", "12345"
2, "b", "67890"
3, "c" "54321"
csv2.txt
"location_x", "location_y", trip_hrs
"a", "b", 1
"a", "c", 3
"b", "c", 2
"a", "b", 1
"c", "b", 2
Basically, csv1 data is a distinct set of locations and zip codes, whereas csv2 data has the trip duration between location_x and location_y.
The common piece of information in these two data sets is location in csv1 and location_x in csv2 even though they have different header names.
I would like to create two RDDs with one containing the data from csv1 and the other from csv2.
Then I would like to join these two RDDs and return the location, zipcode, and sum of all trip times from that location as shown below:
("a", "zipcode", 5)
("b", "zipcode", 2)
("c", "zipcode", 2)
I was wondering if one of you can assist me with this problem. Thanks.
I will give you the code (a complete app in IntelliJ) with some explanations. I hope it can be helpful.
Please read the Spark documentation for the explicit details.
working-with-key-value-pairs
This problem can also be solved with Spark DataFrames; you can try that yourself (a rough sketch follows the RDD app below).
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Joining {

  val spark = SparkSession
    .builder()
    .appName("Joining")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "Joining")           // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext

  val path = "/home/cloudera/files/tests/"

  def main(args: Array[String]): Unit = {

    Logger.getRootLogger.setLevel(Level.ERROR)

    try {
      // read the files
      val file1 = sc.textFile(s"${path}join1.csv")
      val header1 = file1.first // extract the header of the file
      val file2 = sc.textFile(s"${path}join2.csv")
      val header2 = file2.first // extract the header of the file

      val rdd1 = file1
        .filter(line => line != header1)        // to leave out the header
        .map(line => line.split(","))           // split the lines => Array[String]
        .map(arr => (arr(1).trim, arr(2).trim)) // to make up a pairRDD with arr(1) as key and zipcode as value

      val rdd2 = file2
        .filter(line => line != header2)
        .map(line => line.split(","))                 // split the lines => Array[String]
        .map(arr => (arr(0).trim, arr(2).trim.toInt)) // to make up a pairRDD with arr(0) as key and trip_hrs as value

      val joined = rdd1 // join the pairRDDs by their keys
        .join(rdd2)
        .cache()        // cache joined in memory

      joined.foreach(println) // checking data
      println("**************")
      // ("c",("54321",2))
      // ("b",("67890",2))
      // ("a",("12345",1))
      // ("a",("12345",3))
      // ("a",("12345",1))

      val result = joined.reduceByKey({ case ((zip, time), (zip1, time1)) => (zip, time + time1) })

      result.map({ case (id, (zip, time)) => (id, zip, time) }).foreach(println) // checking output
      // ("b","67890",2)
      // ("c","54321",2)
      // ("a","12345",5)

      // To have the opportunity to view the web console of Spark: http://localhost:4041/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}
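As mentioned above, the same result can be obtained with DataFrames. A rough sketch, reusing spark and path from the app above and assuming the CSV headers are cleaned up (no stray quotes or spaces); untested:
import org.apache.spark.sql.functions.sum

val locationsDF = spark.read.option("header", "true").option("inferSchema", "true")
  .csv(s"${path}join1.csv")
val tripsDF = spark.read.option("header", "true").option("inferSchema", "true")
  .csv(s"${path}join2.csv")

// total trip hours per starting location
val summed = tripsDF.groupBy("location_x").agg(sum("trip_hrs").as("total_trip_hrs"))

locationsDF
  .join(summed, locationsDF("location") === summed("location_x"))
  .select("location", "zipcode", "total_trip_hrs")
  .show()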
If you can already read the CSV files into RDDs, the trips can be summarized and then joined with the locations:
val tripsSummarized = trips
.map({ case (location, _, hours) => (location, hours) })
.reduceByKey((hoursTotal, hoursIncrement) => hoursTotal + hoursIncrement)
val result = locations
.map({ case (_, location, zipCode) => (location, zipCode) })
.join(tripsSummarized)
.map({case (location, (zipCode, hoursTotal)) => (location, zipCode, hoursTotal) })
If locations without trips are required, "leftOuterJoin" can be used.
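A minimal sketch of that variant, assuming the same locations and tripsSummarized RDDs as above; locations with no trips get a total of 0:
val resultWithAllLocations = locations
  .map({ case (_, location, zipCode) => (location, zipCode) })
  .leftOuterJoin(tripsSummarized) // yields (location, (zipCode, Option[hoursTotal]))
  .map({ case (location, (zipCode, hoursOpt)) => (location, zipCode, hoursOpt.getOrElse(0)) })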
TL;DR: I need this Spark constant:
val False : Column = lit(1) === lit(0)
Any idea how to make it prettier?
Problem Context
I want to filter a DataFrame using a collection of conditions. For example:
case class Condition(column: String, value: String)
val conditions = Seq(
Condition("name", "bob"),
Condition("age", 18)
)
val personsDF = Seq(
("bob", 30),
("anna", 20),
("jack", 18)
).toDF("name", "age")
When applying my collection to personsDF I expect:
val expected = Seq(
("bob", 30),
("jack", 18)
)
To do so, I am creating a filter from the collection and applying it to the DataFrame:
val conditionsFilter = conditions.foldLeft(initialValue) {
case (cumulatedFilter, Condition(column, value)) =>
cumulatedFilter || col(column) === value
}
personsDF.filter(conditionsFilter)
Pretty sweet, right?
But to do so, I need the neutral element of the OR operator, which is False. Since False doesn't exist in Spark, I used:
val False : Column = lit(1) === lit(0)
Any idea how to do this without tricks?
You can just do:
val False : Column = lit(false)
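A quick sketch of how that slots into the fold from the question (assuming the usual imports; untested):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// lit(false) is the neutral element for ||, so it is a safe seed for the fold
val conditionsFilter: Column = conditions.foldLeft(lit(false)) {
  case (cumulatedFilter, Condition(column, value)) =>
    cumulatedFilter || col(column) === value
}

personsDF.filter(conditionsFilter).show()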
This should be your initialValue, right? You can avoid that by using head and tail:
val buildCondition = (c: Condition) => col(c.column) === c.value
val initialValue = buildCondition(conditions.head)
val conditionsFilter = conditions.tail.foldLeft(initialValue)(
  (cumulatedFilter, condition) => cumulatedFilter || buildCondition(condition)
)
Even shorter, you could use reduce (note that reduce throws on an empty collection, so the lit(false) seed remains the safer choice if conditions can be empty):
val buildCondition = (c: Condition) => col(c.column) === c.value
val conditionsFilter = conditions.map(buildCondition).reduce(_ or _)
I am writing unit tests for a Spark method which takes multiple DataFrames as input parameters and returns one DataFrame. The code for the Spark method looks like this:
class Processor {
  def process(df1: DataFrame, df2: DataFrame): DataFrame = {
    // process and return the resulting data frame
  }
}
The existing code for the corresponding unit test is as follows:
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalatest.{FlatSpec, Matchers}
class TestProcess extends FlatSpec with DataFrameSuiteBase with Matchers {

  val p: Processor = new Processor

  "process()" should "return only one row" in {
    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()

    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()

    val result = p.process(df1DF, df2DF)
  }

  it should "return spark row" in {
    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()

    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()

    val result = p.process(df1DF, df2DF)
  }
}
This code works fine, but it has the problem that the code to create the RDDs and DataFrames is repeated in each test method. When I try to create the RDDs outside the test methods or inside a BeforeAndAfterAll() method, I get an error about sc not being available. It seems like the Spark Testing Base library initializes the sc and spark variables only inside test methods.
I would like to know if there is any way I can avoid writing this duplicate code.
Updated code after switching from FlatSpec to WordSpec:
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalamock.scalatest.MockFactory
import org.scalatest.{Matchers, WordSpec}
class TestProcess extends WordSpec with DataFrameSuiteBase with Matchers {

  val p: Processor = new Processor

  "process()" should {

    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()

    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()

    val result = p.process(df1DF, df2DF)

    "return only one row" in {
      result.count should equal(1)
    }

    "return spark row" in {
      // assertions to check whether a row containing 'spark' in the last column is in the result
    }
  }
}
Use WordSpec instead of FlatSpec, as it allows common initialization to be grouped before the test clauses, as in
"process()" should {
df1RDD = sc.parallelize(Seq("a", 12, 98999),Seq("b", 42, 99))
df1DF = spark.createDataFrame(df1RDD).toDF()
df2RDD = sc.parallelize(Seq("X", 12, "foo", "spark"), Seq("Z", 42, "bar", "storm"))
df2DF = spark.createDataFrame(df2RDD).toDF()
"return only one row" in {
....
}
"return spark row" in {
....
}
}
EDIT: Also, the following two lines of code hardly justify using a library (spark-testing-base):
val spark = SparkSession.builder.master("local[1]").getOrCreate
val sc = spark.sparkContext
Add these to the top of your class, and you're all set with the SparkContext and all, and no NPEs.
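For illustration, a minimal self-contained sketch along those lines, without spark-testing-base (the Processor stub is assumed from the question; untested):
import org.apache.spark.sql.SparkSession
import org.scalatest.{Matchers, WordSpec}

class TestProcess extends WordSpec with Matchers {

  // Plain local SparkSession instead of the one provided by spark-testing-base
  val spark = SparkSession.builder.master("local[1]").getOrCreate()
  import spark.implicits._

  val p: Processor = new Processor

  "process()" should {
    // Shared fixtures, built once for all clauses in this block
    val df1DF = Seq(("a", 12, 98999), ("b", 42, 99)).toDF()
    val df2DF = Seq(("X", 12, "foo", "spark"), ("Z", 42, "bar", "storm")).toDF()
    lazy val result = p.process(df1DF, df2DF)

    "return only one row" in {
      result.count should equal(1)
    }
  }
}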
EDIT: I just confirmed with my own test that spark-testing-base does not work well with WordSpec. If you still want to use it, consider opening a bug report with the library author, as this is definitely an issue with spark-testing-base.
I have a data frame with two columns: id and value. I want to update the value column based on another map.
df.collect.foreach({
df[value] = if (df[id] != 'unknown') mapper.value(df[id]) else df[value]
})
Is this the correct way of doing it?
I tried :
import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo
val mappingPath = "s3://.../"
val input = sc.textFile(mappingPath)
The input is a list of JSON documents, one per line, which I am mapping to the POJO class CountryInfo using MappingUtils, which takes care of the JSON parsing and conversion:
val MappingsList = input.map(x=> {
val countryInfo = MappingUtils.getCountryInfoString(x);
(countryInfo.getItemId(), countryInfo)
}).collectAsMap
MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo]
def showCountryInfo(x: Option[CountryInfo]) = x match {
case Some(s) => s
}
val events = sqlContext.sql("select itemId from EventList")

val itemList = events.map(row => {
  val itemId = row.getAs[String](0)
  val countryInfo = showCountryInfo(MappingsList.get(itemId))
  val country = if (countryInfo.getCountry() == "unknown") "US" else countryInfo.getCountry()
  val language = countryInfo.getLanguage()
  Row(itemId, country, language)
})
But I keep getting this error:
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:220)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:205)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:211)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:207)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:304)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
I am using Spark 1.6
Your question is a bit ambiguous.
Don’t collect large RDDs unnecessarily.
When a collect operation is issued on a RDD, the dataset is copied to
the driver, i.e. the master node. A memory exception will be thrown if
the dataset is too large to fit in memory; take or takeSample can be
used to retrieve only a capped number of elements instead.
The way you are doing it with the collect method is not correct (if it is a large DataFrame it may lead to an OOM).
1) To update any column or add a new column, you can use withColumn:
DataFrame withColumn(java.lang.String colName, Column col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
2) To check a condition based on another data structure, you can use when/otherwise syntax, as shown below:
Apache Spark, add an "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame
Example:
import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
"""{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
"""{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
"""{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))
val makeSIfTesla = udf {(make: String) =>
if(make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
The above can also be achieved like this:
val rdd = sc.parallelize(
List( (2012,"Tesla","S"), (1997,"Ford","E350"), (2015,"Chevy","Volt"))
)
val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
val dataframe = rdd.toDF()
dataframe.foreach(println)
dataframe.map(row => {
val row1 = row.getAs[String](1)
val make = if (row1.toLowerCase == "tesla") "S" else row1
Row(row(0),make,row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
I have the data below, which needs to be processed using Spark (Scala) in such a way that I only get the id of a person who visited "Walmart" but not "Bestbuy". A store might appear multiple times because a person can visit a store any number of times.
Input Data:
id, store
1, Walmart
1, Walmart
1, Bestbuy
2, Target
3, Walmart
4, Bestbuy
Output Expected:
3, Walmart
I have got the output using DataFrames and running SQL queries on the SQL context. But is there any way to do this using groupByKey/reduceByKey etc. without DataFrames? Can someone help me with the code? After map -> groupByKey, a ShuffledRDD has been formed and I am facing difficulty in filtering the CompactBuffer!
The code with which I got it using sqlContext is below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Person(id: Int, store: String)
val people = sc.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(p => Person(p(0).trim.toInt, p(1).trim))
people.registerTempTable("people")
val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in('Walmart','Bestbuy') group by id having count(distinct store)=1) sample on people.id=sample.id and people.store='Walmart'")
The code which I am trying now is this, but I am struck after the third step:
val data = sc.textFile("examples/src/main/resources/people.txt")
.map(x=> (x.split(",")(0),x.split(",")(1)))
.filter(!_.filter("id"))
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.map{case (x,y) =>
val url = y.flatMap(x=> x.split(",")).toList
if (!url.contains("Bestbuy") && url.contains("Walmart")){
x.map(x=> (x,y))}}
If I do dataFiltered.collect(), I am getting:
Array[Any] = Array(Vector((3,Walmart)), (), ())
Please help me work out how to extract the output after this step.
To filter an RDD, just use RDD.filter:
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.filter {
// keep only lists that contain Walmart but do not contain Bestbuy:
case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}
dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))
// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }
result.foreach(println) // prints: (3, Walmart)
I also tried it another way and it worked out:
val data = sc.textFile("examples/src/main/resources/people.txt")
.filter(!_.filter("id"))
.map(x=> (x.split(",")(0),x.split(",")(1)))
data.cache()
val dataWalmart = data.filter{case (x,y) => y.contains("Walmart")}.distinct()
val dataBestbuy = data.filter{case (x,y) => y.contains("Bestbuy")}.distinct()
val result = dataWalmart.subtractByKey(dataBestbuy)
data.unpersist()
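For reference, with the sample data this should keep only the id that visited Walmart but never Bestbuy (a quick check, untested):
result.collect().foreach(println) // expected: (3,Walmart)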