Merging two tree sets - scala

I am trying to get the union of two sets. They are basically binary trees (but not guaranteed to be balanced). This is the code:
class MyNonEmptySet extends MySet {
  def union(that: MyNonEmptySet): MySet = {
    ((left union right) union that) incl elem
  }
}
class MyEmptySet extends MySet {
  def union(that: MyNonEmptySet): MySet = that
}
For smaller data sets the union works fine, but when the data is a bit larger, union doesn't ever return. It just goes on. I want to understand what is going wrong. If it is not returning, it should at least run out of memory (stack overflow exception), right? How can I rectify this?
#EDIT1
It works if I change the parentheses in the implementation of NonEmptySet:
(left union (right union that)) incl elem
I don't understand why. Both should give the same result, right? Why does one method take forever (but does not run out of memory) and the other work instantly for the same data?

The reason that a binary tree is a good data structure is that it is sorted so you can do fast searches in log n time.
Looks like you do not use a sorted binary tree.
Your second algorithm works but all the work is done by
incl elem
That is rather slow.
The first algorithm has a recursive step that is doing a union of itself, but it will never leave the recursive step.
There are great tree set algorithms in Scala, I would just use one of those.
The right way to merge binary trees is to use red-black trees, but that is non-trivial:
https://www.wikiwand.com/en/Red%E2%80%93black_tree
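As a minimal sketch of the "just use one of those" suggestion above, assuming the elements are Ints (or anything else with an implicit Ordering), the standard library's immutable TreeSet already provides a union. The value names here are made up for illustration.

```scala
import scala.collection.immutable.TreeSet

// Two sorted sets with one element (3) in common.
val firstSet = TreeSet(5, 1, 3)
val secondSet = TreeSet(4, 3, 2)

// ++ builds the union; firstSet.union(secondSet) is equivalent.
val merged = firstSet ++ secondSet

println(merged.toList) // prints List(1, 2, 3, 4, 5)
```

The result stays sorted and duplicates are dropped, so no hand-written recursion over left and right is needed.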

How to encode recursive types with constraint for a typesafe avro library

Since I'm kinda really stumped right now with this issue I thought I'd ask here.
So here's the problem. I'm currently trying to write a Library to represent Avro Schemas in a typesafe manner that should then later allow to structurally query a given runtime value of a schema. E.g. Does my schema contain a field of a given name within a certain path? Is the schema flat (contains no nestable types except at top level)? etc.
You can find the complete specification of Avro schemas here: https://avro.apache.org/docs/1.8.2/spec.html
Now I have some troubles deciding on a representation of the schema within my code. Right now I'm using an ADT like this because it makes decoding the AvroSchema (which is JSON) really easy with Circe so you can somewhat ignore things like the Refined Types for this issue.
https://gist.github.com/GrafBlutwurst/681e365ecbb0ecad2acf4044142503a9 Please note that this is not the exact implementation. I have one that is able to decode schemas correctly but is a pain to query afterwards.
Anyhow I was wondering:
1) Does anyone have a good idea how to encode the type restriction on Avro unions? Avro unions cannot contain other unions directly, but can for example contain records, which then again can contain unions. So union -> union is not allowed but union -> record -> union is OK.
2) Would using fixpoint recursion in the form of Fix, Free and CoFree make the querying easier later? I'm somewhat on the fence since I have no experience using these yet.
Thanks!
PS: Here's some more elaboration on why Refined is in there. In the end I want to enable some very specific uses, e.g. this pseudocode (I'm not quite sure if it is at all possible yet):
refine[Schema Refined IsFlat](schema) //because it's flat I know it can only be a Recordtype with Fields of Primitives or Optionals (encoded as Union [null, primitive])
.folder { //wonky name
case AvroInt(k, i) => k + " -> " + i.toString
case AvroString(k, s) => k + " -> " + s
//etc...
} // should result in a function List[Vector[Byte]] => Either[Error,List[String]]
Basically, given a schema and assuming it satisfies the IsFlat constraint, provide a function that decodes records and converts them into string lists.
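One possible way to express the restriction from point 1 in plain Scala types, offered only as a sketch: give every non-union schema type a common marker trait and let the union hold only members of that trait. All names below (AvroType, NonUnionType, UnionType, RecordType, isFlat, and so on) are hypothetical and are not taken from the gist or from any Avro library, and only three primitives are modeled.

```scala
// Hypothetical ADT: every schema type is an AvroType, but only non-union
// types may appear directly inside a union.
sealed trait AvroType
sealed trait NonUnionType extends AvroType

case object NullType extends NonUnionType
case object IntType extends NonUnionType
case object StringType extends NonUnionType

// A record may contain any type, including unions.
final case class RecordType(name: String, fields: List[(String, AvroType)])
    extends NonUnionType

// A union may only contain non-union types, so UnionType(List(UnionType(...)))
// does not type-check, while union -> record -> union does.
final case class UnionType(members: List[NonUnionType]) extends AvroType

def isPrimitive(t: AvroType): Boolean = t match {
  case NullType | IntType | StringType => true
  case _                               => false
}

// "Flat" as described in the question: a top-level record whose fields are
// primitives or unions of primitives.
def isFlat(schema: AvroType): Boolean = schema match {
  case RecordType(_, fields) =>
    fields.forall {
      case (_, UnionType(members)) => members.forall(isPrimitive)
      case (_, other)              => isPrimitive(other)
    }
  case _ => false
}

val flatSchema = RecordType(
  "user",
  List("id" -> IntType, "nick" -> UnionType(List(NullType, StringType)))
)
val nestedSchema = RecordType("wrapper", List("inner" -> flatSchema))
```

With this encoding the union restriction is checked by the compiler, while structural queries such as isFlat remain ordinary recursive functions over the ADT. Whether Fix-style recursion schemes would make such queries easier is a separate design question that this sketch does not address.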

Caching Large Dataframes in Spark Effectively

I am currently working on 11,000 files. Each file will generate a data frame which will be unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))
files.foreach( filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on 11,000 files. Then I added the following line after df1=df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem, but each iteration gets slower than the one before. Can somebody please suggest what should be done to avoid the StackOverflowError without the slowdown?
Thanks!
The issue is that Spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then performs the transformations on it (e.g. withColumn), then unionAll with the previous dataframe, etc.
The unionAll is just another such transformation, and the tree becomes very long (with 11K unionAll you have an execution tree of depth 11K). Processing a tree that deep can lead to a stack overflow.
The caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps, and therefore the stack overflow would simply arrive later.
You can go back to RDD for an iterative process (your example actually is not iterative but purely parallel; you can simply save each separate dataframe along the way and then convert to RDD and use RDD union).
Since your case seems to be just unioning a bunch of dataframes without true iterations, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs, etc.). This would change the depth from O(N) to O(log N), where N is the number of unions.
Lastly, you can write the dataframe to disk and read it back. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). After the read, the lineage of the dataframe is just the file read itself. The cost, of course, is the writing and reading of the file.
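The "union in a tree manner" idea can be sketched without Spark. The helper below, pairwiseReduce, is a hypothetical name and not a Spark API; Plan, Leaf and UnionNode are a toy stand-in for a lineage tree, used only so that the depth can be measured. With real dataframes the combine function would presumably be something like (l, r) => l.union(r).

```scala
// Toy stand-in for a dataframe lineage: either a source or a union of two plans.
sealed trait Plan
final case class Leaf(name: String) extends Plan
final case class UnionNode(left: Plan, right: Plan) extends Plan

def depth(p: Plan): Int = p match {
  case Leaf(_)         => 1
  case UnionNode(l, r) => 1 + math.max(depth(l), depth(r))
}

// Combines neighbours pairwise, then pairs of pairs, and so on, so the
// resulting tree has logarithmic rather than linear depth.
def pairwiseReduce[A](items: Vector[A])(combine: (A, A) => A): A = {
  require(items.nonEmpty, "need at least one item")
  @annotation.tailrec
  def loop(level: Vector[A]): A =
    if (level.size == 1) level.head
    else loop(level.grouped(2).map { pair =>
      // An odd element at the end of a level is carried over unchanged.
      if (pair.size == 2) combine(pair(0), pair(1)) else pair(0)
    }.toVector)
  loop(items)
}

val leaves: Vector[Plan] = Vector.tabulate(1024)(i => Leaf(s"df$i"))

// One-after-another union, as in the question's loop.
val chained = leaves.tail.foldLeft(leaves.head)((acc, next) => UnionNode(acc, next))
// Tree-shaped union.
val balanced = pairwiseReduce(leaves)((l, r) => UnionNode(l, r))

println(depth(chained))  // 1024
println(depth(balanced)) // 11
```

In this toy model the chained version grows one level per union, while the pairwise version adds one level per halving of the list.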

Multiple types in a list?

Rephrasing of my questions:
I am writing a program that implements a data mining algorithm. In this program I want to save the input data which is supposed to be mined. Imagine the input data to be a table with rows and columns. Each row is going to be represented by an instance of my Scala class (the one in question). The columns of the input data can be of different types (Integer, Double, String, whatnot), and which types they are will change depending on the input data. I need a way to store a row inside my Scala class instance. Thus I need an ordered collection (like a special List) that can hold (many) different types as elements, and it must be possible that the type is only determined at runtime. How can I do this? A Vector or a List requires that all elements be of the same type. A Tuple can hold different types (which can be determined at runtime if I am not mistaken), but only up to 22 elements, which is too few.
Bonus (not sure if I am asking too much now):
I would also like the rows' columns to be named and accessible by name. However, I think this problem can easily be solved by using two lists. (Although I just read about this issue somewhere, I forgot where, and I think it was solved more elegantly.)
It might be good to have my collection to be random access (so "Vector" rather than "List").
Having linear algebra (matrix multiplication etc.) capabilities would be nice.
Even more bonus: If I could save matrices.
Old phrasing of my question:
I would like to have something like a data.frame as we know it from R in Scala, but I am only going to need one row. This row is going to be a member in a class. The reason for this construct is that I want methods related to each row to be close to the data itself. Each data row is also supposed to have meta data about itself and it will be possible to give functions so that different rows will be manipulated differently. However I need to save rows somehow within the class. A List or Vector comes to mind, but they only allow to be all Integer, String, etc. - but as we know from data.frame, different columns (here elements in Vector or List) can be of different type. I also would like to save the name of each column to be able to access the row values by column name. That seems the smallest issue though. I hope it is clear what I mean. How can I implement this?
DataFrames in R are heterogeneous lists of homogeneous column vectors:
> df <- data.frame(c1=c(r1=1,r2=2), c2=c('a', 'b')); df
c1 c2
r1 1 a
r2 2 b
You could think of each row as a heterogeneous list of scalar values:
> as.list(df['r1',])
$c1
[1] 1
$c2
[1] a
An analogous implementation in scala would be a tuple of lists:
scala> val df = (List(1, 2), List('a', 'b'))
df: (List[Int], List[Char]) = (List(1, 2),List(a, b))
Each row could then just be a tuple:
scala> val r1 = (1, 'a')
r1: (Int, Char) = (1,a)
If you want to name all your variables, another possibility is a case class:
scala> case class Row (col1:Int, col2:Char)
defined class Row
scala> val r1 = Row(col1=1, col2='a')
r1: Row = Row(1,a)
Hope that helps bridge the R to scala divide.
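Tuples and case classes fix the column types at compile time. If the column types are only known at runtime and there may be more than 22 columns, as the question describes, one alternative (not part of the answer above, just a sketch) is to wrap each value in a small sealed type and keep the column names next to the values. Cell, IntCell, DoubleCell, StringCell and DataRow are hypothetical names.

```scala
// Each cell records which of the supported runtime types it holds.
sealed trait Cell
final case class IntCell(value: Int) extends Cell
final case class DoubleCell(value: Double) extends Cell
final case class StringCell(value: String) extends Cell

// A row keeps column names and cells in two parallel, random-access Vectors.
final case class DataRow(names: Vector[String], cells: Vector[Cell]) {
  require(names.size == cells.size, "one name per cell")

  // Access by position.
  def apply(index: Int): Cell = cells(index)

  // Access by column name; None if the column does not exist.
  def get(name: String): Option[Cell] = {
    val index = names.indexOf(name)
    if (index >= 0) Some(cells(index)) else None
  }
}

val exampleRow = DataRow(
  Vector("name", "age", "score"),
  Vector(StringCell("Ada"), IntCell(36), DoubleCell(9.5))
)

// Pattern matching recovers the concrete type when it is needed.
val described = exampleRow.cells.map {
  case IntCell(v)    => s"int $v"
  case DoubleCell(v) => s"double $v"
  case StringCell(v) => s"string $v"
}
```

The sealed trait means the compiler can check that every supported cell type is handled wherever a row is processed.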

How to obtain the symmetric difference between two DataFrames?

In the SparkSQL 1.6 API (scala) Dataframe has functions for intersect and except, but not one for difference. Obviously, a combination of union and except can be used to generate difference:
df1.except(df2).union(df2.except(df1))
But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.
You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
Seriously though, UNION, INTERSECT and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box, most likely because it is trivial to implement using the other three and there is not much to optimize there.
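That the two formulations agree can be checked on plain Scala Sets, which is only an analogy: Sets never hold duplicates, whereas dataframes can, so this sketch says nothing about how duplicate rows are treated. The function names are made up for illustration.

```scala
// (x except y) union (y except x)
def symDiffViaExcept[A](x: Set[A], y: Set[A]): Set[A] =
  x.diff(y).union(y.diff(x))

// (x union y) except (x intersect y)
def symDiffViaIntersect[A](x: Set[A], y: Set[A]): Set[A] =
  x.union(y).diff(x.intersect(y))

val leftSet = Set(1, 2, 3)
val rightSet = Set(3, 4)

println(symDiffViaExcept(leftSet, rightSet))    // Set(1, 2, 4)
println(symDiffViaIntersect(leftSet, rightSet)) // Set(1, 2, 4)
```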
Why not the below?
df1.except(df2)
If you are looking for a PySpark solution, you should use subtract().
Also, unionAll is deprecated in 2.0, use union() instead.
df1.union(df2).subtract(df1.intersect(df2))
Notice that the EXCEPT (or MINUS which is just an alias for EXCEPT) de-dups results. So if you expect "except" set (the diff you mentioned) + "intersect" set to be equal to original dataframe, consider this feature request that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as
SELECT a, b, c
FROM tab1 t1
LEFT OUTER JOIN tab2 t2
  ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
I think it could be more efficient using a left join and then filtering out the nulls.
df1.join(df2, Seq("some_join_key", "some_other_join_key"),"left")
.where(col("column_just_present_in_df2").isNull)
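What "left join, then keep the rows where the right side is null" computes can be illustrated on plain Scala collections. This is a sketch of the semantics only, not Spark code; leftAntiJoin is a hypothetical helper name, and null handling in real join keys is ignored here.

```scala
// Keeps every left row whose key has no match on the right.
// Duplicates on the left are preserved, unlike with a set-style except.
def leftAntiJoin[A, K](left: Seq[A], right: Seq[A])(key: A => K): Seq[A] = {
  val rightKeys = right.map(key).toSet
  left.filterNot(row => rightKeys.contains(key(row)))
}

val leftRows = Seq("a" -> 1, "a" -> 1, "b" -> 2, "c" -> 3)
val rightRows = Seq("b" -> 20)

val onlyInLeft = leftAntiJoin(leftRows, rightRows)(_._1)
println(onlyInLeft) // List((a,1), (a,1), (c,3))
```

Note that the duplicate ("a", 1) rows survive, which is the difference from the de-duplicating EXCEPT mentioned above.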

Why doesn't "((left union right) union other)" behave associatively?

The code in the following gist is lifted almost verbatim out of a lecture in Martin Odersky's Functional Programming Principles in Scala course on Coursera:
https://gist.github.com/aisrael/7019350
The issue occurs in line 38, within the definition of union in class NonEmpty:
def union(other: IntSet): IntSet =
// The following expression doesn't behave associatively
((left union right) union other) incl elem
With the given expression, ((left union right) union other), largeSet.union(Empty) takes an inordinate amount of time to complete with sets with 100 elements or more.
When that expression is changed to (left union (right union other)), then the union operation finishes relatively instantly.
ADDED: Here's an updated worksheet that shows how even with larger sets/trees with random elements, the expression ((left ∪ right) ∪ other) can take forever but (left ∪ (right ∪ other)) will finish instantly.
https://gist.github.com/aisrael/7020867
The answer to your question is very much connected to relational databases and the smart choices they make. When a database "unions" tables, a smart controller system will make some decisions around things like "How large is Table A? Would it make more sense to join A & B first, or A & C?" when the user writes:
A Join B Join C
Anyhow, you can't expect the same behavior when you are writing the code by hand, because you have specified the exact order you want using parentheses. None of those smart decisions can happen automatically. (Though in theory they could, and that's why Oracle, Teradata and MySQL exist.)
Consider a ridiculously large example:
Set A - 1 Billion Records
Set B - 500 Million Records
Set C - 10 Records
For argument's sake, assume that the union operator takes O(N) comparisons, where N is the size of the SMALLEST of the two sets being joined. This is reasonable: each key can be looked up in the other set as a hashed retrieval:
A & B runtime = O(N) runtime = 500 Million
(let's assume the class is just smart enough to use the smaller of the two for lookups)
so
(A & B) & C
Results in:
O(N) 500 million + O(N) 10 = 500,000,010 comparisons
Again pointing to the fact that it was forced to compare 1 Billion records to 500 Million records FIRST, per inner parenthesis, then - pull in 10 more.
But consider this:
A & (B & C)
Well now something amazing happens:
(B & C) runtime O(N) = 10 record comparisons (each of the 10 C records is checked against B for existence)
then
A & (result) = O(N) = 10
Total = 20 comparisons
Notice that once (B & C) was completed, we only had to bump 10 records against 1 billion!
Both examples produce the exact same result: one in 20 comparisons, the other in 500,000,010!
To summarize, this problem illustrates in just a small way some of the complex thinking that goes into database design and the smart optimization that happens in that software. These things do not always happen automatically in programming languages unless you've coded them that way or are using a library of some sort. You could, for example, write a function that takes several sets and intelligently decides the union order. But the issue becomes unbelievably complex if other set operations have to be mixed in. Hope this helps.
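The arithmetic in this answer can be reproduced with a few lines of Scala. The cost model is the answer's own assumption (combining two sets costs as many comparisons as the smaller set has records, and the result is treated as no larger than the smaller input); Estimate and combine are hypothetical names.

```scala
// size: assumed number of records; comparisons: accumulated cost so far.
final case class Estimate(size: Long, comparisons: Long)

// Assumed cost model: one comparison per record of the smaller input,
// and the result is taken to be no larger than the smaller input.
def combine(x: Estimate, y: Estimate): Estimate = {
  val smaller = math.min(x.size, y.size)
  Estimate(smaller, x.comparisons + y.comparisons + smaller)
}

val setA = Estimate(1000000000L, 0L) // 1 billion records
val setB = Estimate(500000000L, 0L)  // 500 million records
val setC = Estimate(10L, 0L)         // 10 records

val leftFirst = combine(combine(setA, setB), setC)  // (A & B) & C
val rightFirst = combine(setA, combine(setB, setC)) // A & (B & C)

println(leftFirst.comparisons)  // 500000010
println(rightFirst.comparisons) // 20
```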
Associativity is not about performance. Two expressions may be equivalent by associativity but one may be vastly harder than the other to actually compute:
(23 * (14/2)) * (1/7)
Is the same as
23 * ((14/2) * (1/7))
But if it were me evaluating the two, I'd reach the answer (23) in a jiffy with the second one, but take longer if I forced myself to work with just the first one.
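For the tree sets in the question, the difference in work can be observed by counting how often each variant of union is entered. The sketch below is modeled on the IntSet from the question but is not the code from the gist; the Counter class and the two function names are hypothetical instrumentation added for illustration.

```scala
// Unbalanced binary search tree of Ints.
sealed trait IntSet {
  def incl(x: Int): IntSet
  def toList: List[Int]
}

case object Empty extends IntSet {
  def incl(x: Int): IntSet = NonEmpty(x, Empty, Empty)
  def toList: List[Int] = Nil
}

final case class NonEmpty(elem: Int, left: IntSet, right: IntSet) extends IntSet {
  def incl(x: Int): IntSet =
    if (x < elem) NonEmpty(elem, left.incl(x), right)
    else if (x > elem) NonEmpty(elem, left, right.incl(x))
    else this
  def toList: List[Int] = left.toList ::: elem :: right.toList
}

// Counts how many times a union function is entered.
final class Counter { var calls: Long = 0L }

// ((left union right) union other) incl elem
def unionLeftNested(s: IntSet, other: IntSet, c: Counter): IntSet = {
  c.calls += 1
  s match {
    case Empty             => other
    case NonEmpty(e, l, r) => unionLeftNested(unionLeftNested(l, r, c), other, c).incl(e)
  }
}

// (left union (right union other)) incl elem
def unionRightNested(s: IntSet, other: IntSet, c: Counter): IntSet = {
  c.calls += 1
  s match {
    case Empty             => other
    case NonEmpty(e, l, r) => unionRightNested(l, unionRightNested(r, other, c), c).incl(e)
  }
}

val sample = List(5, 3, 8, 1, 4, 7, 9, 2, 6, 10)
  .foldLeft(Empty: IntSet)((set, x) => set.incl(x))

val leftCounter = new Counter
val rightCounter = new Counter
val viaLeft = unionLeftNested(sample, Empty, leftCounter)
val viaRight = unionRightNested(sample, Empty, rightCounter)

println(viaLeft.toList == viaRight.toList) // true
println(s"left-nested calls: ${leftCounter.calls}, right-nested calls: ${rightCounter.calls}")
```

The right-nested variant only ever recurses into subtrees of the original set, so it is entered once per node and once per empty child. The left-nested variant first builds left union right and then has to walk that freshly built tree again at every level, which is where the extra calls come from. Both produce the same elements.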