How do I subtract an RDD[(Key,Object)] from another one? - scala

I want to change the format of my data, from RDD(Label:String,(ID:String,Data:Array[Double])) to an RDD Object with the label, id and data as components.
But when I print my RDD consecutively twice, the references of objects change :
class Data_Object(private val id:String, private var vector:Vector) extends Serializable {
var label = ""
...
}
First print
(1,ms3.Data_Object#35062c11)
(2,ms3.Data_Object#25789aa9)
Second print
(2,ms3.Data_Object#6bf5d886)
(1,ms3.Data_Object#a4eb65)
I think that explains why the subtract method doesn't work. So can I use subtract with objects as values, or do I return to my classic model ?

Unless you specify otherwise, objects in Scala (and Java) are compared using reference equality (i.e. their memory address). They are also printed out according to this address, hence the Data_Object#6bf5d886 and so on.
Using reference equality means that two Data_Object instances with identical properties will NOT compare as equal unless they are exactly the same object. Also, their references will change from one run to the next.
Particularly in a distributed system like Spark, this is no good - we need to be able to tell whether two objects in two different JVMs are the same or not, according to their properties. Until this is fixed, RDD operations like subtract will not give the results you expect.
Fortunately, this is usually easy to fix in Scala/Spark - define your class as a case class. This automatically generates equals and hashcode and toString methods derived from all of the properties of the class. For example:
case class Data_Object(id:String, label:String, vector:Vector)
If you want to compare your objects according to only some of the properties, you'll have to define your own equals and hashcode methods, though. See Programming in Scala, for example.

Related

MATLAB - How to compare if two objects are the same or different

In C++ I would just compare the memory addresses of both objects. How would I do something similar in MATLAB?
Worst Case would be to have a static variable that iterates in each constructor and every object gets the current value as ID. But is there a better solution?
Thank you in advance.
#Edit:
I'd like to extend this question by assuming I have some given/not changeable classes inheriting handle and overloading eq. If I want to compare two objects of this class can I somehow cast both instances to handle and use the implementation of eq of the super class?
To test that two handle objects a and b refer to the same instance, you only need to use a == b. This is the same as eq(a, b). This is the defined behaviour of == for handle objects. I.e., for handle objects, == tests for equality of instances, not equality of the values within the instances. This is different from value objects.
For this to work you need to be using handle objects (classdef myObject < handle) because it doesn't make sense to test instances of value objects.
N.B. if you also need to get some kind of instance identifier for a handle object, then you need to do something like you describe using a persistent variable. Here's an example. In that case I would make that a base class for all your objects, so you wouldn't have to copy the same code into each class. But that's unnecessary if all you want to do is test two instances.

Scala: Serializing/deserializing a few elements of a class

Consider the following toy class:
class myGiantClass(){
val serializableElement = ...
// lots of other variables and methods here
}
// main program
val listOfGiantObjects: List[myGiantClass] = ....
What I need is to serialize/deserialize listOfGiantObjects. The issue is that myGiantClass contains lots of junk objects and variables which I don't/can't serialize/deserialize. Instead the only element of the myGiantClass that I want to serialize is serializableElement inside each object of listOfGiantObjects.
So after deserialize, listOfGiantObjects is expected to contain a bunch of myGiantClass objects which contain only serializableElement (the rest set to default).
Any ideas?
Of course there are two approaches (or defaults): all elements should be serialized by default, or none.
Within the "all" scenario, you could take a look at the #transient annotation, for marking fields that should not be serialized.
It may seem an unoptimal approach in case of a large number of elements that should not be serialized. However, it does communicate what you are trying to achieve. Moreover, you could arrange your code using composition or inner classes to better define the scope of serialization.
At last resort, ad-hoc serializaion with custom attributes is a way (e.g., to implement the none-by-default scenario).

Statically decide if a Scala class is immutable

I just attended a Scala-lecture at a summer school. The lecturer got the following question:
- "Is there any way for the compiler to tell if a class is immutable?"
The lecturer responded
- "No, there isn't. It would be very nice if it could."
I was surprised. Isnt't it just to check if the class contains any var-members?
What is immutable?
Checking to see if the object only contains val fields is an overapproximation of immutability - the object may very well contain vars, but never assign different values in them. Or the segments of the program assigning values to vars may be unreachable.
According to the terminology of Chris Okasaki, there are immutable data structures and functional data structures.
An immutable data structure (or a class) is a data structure which, once constructed in memory, never changes its components and values - an example of this is a Scala tuple.
However, if you define the immutability of an object as the immutability of itself and all the objects reachable through references from the object, then a tuple may not be immutable - it depends on what you later instantiate it with. Sometimes there is not enough information about the program available at compile time to decide if a given data structure is immutable in the sense of containing only vals. And the information is missing due to polymorphism, whether parametric, subtyping or ad-hoc (type classes).
This is the first problem with deciding immutability - lack of static information.
A functional data structure is a data structure on which you can do operations whose outputs depend solely on the inputs for a given state. An example of such a data structure is a search tree which caches the last item looked up by storing it in a mutable field. Even though every lookup will write the last item searched into the mutable field, so that if the item is looked up again the search doesn't have to be repeated, the outputs of the lookup operation for such a data structure always remain the same given that nobody inserts new items into it. Another example of a functional data structure are splay trees.
In a general imperative programming model, to check if an operation is pure, that is - do the outputs depend solely on inputs, is undecidable. Again, one could use a technique such as abstract interpretation to provide a conservative answer, but this is not an exact answer to the question of purity.
This is the second problem with deciding if something having vars is immutable or functional (observably immutable) - undecidability.
I think the problem is that you need to ensure that all your vals don’t have any var members either. And this you cannot. Consider
class Base
case class Immutable extends Base { val immutable: Int = 0 }
case class Mutable extends Base { var mutable: Int = _ }
case class Immutable_?(b: Base)
Even though Immutable_?(Immutable) is indeed immutable, Immutable_?(Mutable) is not.
If you save a mutable object in a val the object itself is still mutable. So you would have to check if each class you use in a val is immutable.
case class Mut(var mut:Int)
val m = Mut(1)
println(m.toString)
m.mut = 3
println(m.toString)
In addition to what others have said, take a look at effect systems and discussion about supporting one in Scala.
It is not quite as easy since you could have vals that are linked to other mutable classes or, even harder to detect, that calls methods in other classes or objects that are mutable.
Also, you could very well have a immutable class that in fact has vars (to be more efficient for example...).
I guess you could have something that checks if a class looks like it is immutable or not though, but it sounds like it could be pretty confusing.
You can have a class, which can be instantiated to an object, and this object can be mutable or immutable.
Example: A class may contain a List[_], which, at runtime, can be a List[Int] or a List[StringBuffer]. So two different objects of a class could be either mutable, or immutable.

What is exactly the point of auto-generating getters/setters for object fields in Scala?

As we know, Scala generates getters and setters automatically for any public field and make the actual field variable private. Why is it better than just making the field public ?
For one this allows swapping a public var/val with a (couple of) def(s) and still maintain binary compatibility. Secondly it allows overriding a var/val in derived classes.
First, keeping the field public allows a client to read and write the field. Since it's beneficial to have immutable objects, I'd recommend to make the field read only (which you can achieve in Scala by declaring it as "val" rather than "var").
Now back to your actual question. Scala allows you to define your own setters and getters if you need more than the trivial versions. This is useful to maintain invariants. For setters you might want to check the value the field is set to. If you keep the field itself public, you have no chance to do so.
This is also useful for fields declared as "val". Assume you have a field of type Array[X] to represent the internal state of your class. A client could now get a reference to this array and modify it--again you have no chance to ensure the invariant is maintained. But since you can define your own getter you can return a copy of the actual array.
The same argument applies when you make a field of a reference type "final public" in Java--clients can't reset the reference but still modify the object the reference points to.
On a related note: accessing a field via getters in Scala looks like accessing the field directly. The nice thing about this is that it allows to make accessing a field and calling a method without parameters on the object look like the same thing. So if you decide you don't want to store a value in a field anymore but calculate it on the fly, the client does not have to care because it looks like the same thing to him--this is known as the Uniform Access Principle
In short: the Uniform Access Principle.
You can use a val to implement an abstract method from a superclass. Imagine the following definition from some imaginary graphics package:
abstract class circle {
def bounds: Rectangle
def centre: Point
def radius: Double
}
There are two possible subclasses, one where the circle is defined in terms of a bounding box, and one where it's defined in terms of the centre and radius. Thanks to the UAP, details of the implementation can be completely abstracted away, and easily changed.
There's also a third possibility: lazy vals. These would be very useful to avoid recalculating the bounds of our circle again and again, but it's hard to imagine how lazy vals could be implemented without the uniform access principle.

How do I use classes?

I'm fairly new to programming, and there's one thing I'm confused by. What is a class, and how do I use one? I understand a little bit, but I can't seem to find a full answer.
By the way, if this is language-specific, then I'm programming in PHP.
Edit: There's something else I forgot to say. Specifically, I meant to ask how defining functions are used in classes. I've seen examples of PHP code where functions are defined inside classes, but I can't really understand why.
To be as succinct as possible: a class describes a collection of data that can perform actions on itself.
For example, you might have a class that represents an image. An object of this class would contain all of the data necessary to describe the image, and then would also contain methods like rotate, resize, crop, etc. It would also have methods that you could use to ask the object about its own properties, like getColorPalette, or getWidth. This as opposed to being able to directly access the color pallette or width in a raw (non-object) data collection - by having data access go through class methods, the object can enforce constraints that maintain consistency (e.g. you shouldn't be able to change the width variable without actually changing the image data to be that width).
This is how object-oriented programming differs from procedural programming. In procedural programming, you have data and you have functions. The functions act on data, but there's no "ownership" of the data, and no fundamental connection between the data and the functions which make use of it.
In object-oriented programming, you have objects which are data in combination with actions. Each type of data has a defined set of actions that it can perform on itself, and a defined set of properties that it allows functions and other objects to read and write in a defined, constraint-respecting manner.
The point is to decouple parts of the program from each other. With an Image class, you can be assured that all of the code that manipulates the image data is within the Image class's methods. You can be sure that no other code is going to be mucking about with the internals of your images in unexpected ways. On the other hand, code outside your image class can know that there is a defined way to manipulate images (resize, crop, rotate methods, etc), and not have to worry about exactly how the image data is stored, or how the image functions are implemented.
Edit: And one more thing that is sometimes hard to grasp is the relationship between the terms "class" and "object". A "class" is a description of how to create a particular type of "object". An Image class would describe what variables are necessary to store image data, and give the implementation code for all of the Image methods. An Image object, called an "instance" of an image class, is a particular use of that description to store some actual data. For example, if you have five images to represent, you would have five different image "objects", all of the same Image "class".
Classes is a term used in the object oriented programming (OOP) paradigm. They provide abstraction, modularity and much more to your code. OOP is not language specific, other examples of languages supporting it are C++ and Java.
I suggest youtube to get an understanding of the basics. For instance this video and other related lectures.
Since you are using PHP I'll use it in my code examples but most everything should apply.
OOP treats everything as an object, which is a collection of methods (functions) and variables. In most languages objects are represented in code as classes.
Take the following code:
class person
{
$gender = null;
$weight = null;
$height = null;
$age = null;
$firstName = null;
$lastName = null;
function __CONSTRUCT($firstName, $lastName)
{
//__CONSTRUCT is a special method that is called when the class is initialized
$this->firstName = $firstName;
$this->lastName = $lastName;
}
}
This is a valid (if not perfect) class when you use this code you'll first have to initailize an instance of the class which is like making of copy of it in a variable:
$steve = new person('Steve', 'Jobs');
Then when you want to change some property (not technicaly the correct word as there are no properties in PHP but just bear with me in this case I mean variable). We can access them like so:
$steve->age = 54;
Note: this assumes you are a little familiar with programming, which I guess you are.
A class is like a blueprint. Let's suppose you're making a game with houses in it. You'd have a "House" class. This class describes the house and says what can it do and what can be done to it. You can have attributes, like height, width, number of rooms, city where it is located, etc. You can also have "methods" (fancy name for functions inside a class). For example, you can have a "Clean()" method, which would tell all the people inside the house to clean it.
Now suppose someone is playing your game and clicks the "make new house" button. You would then create a new object from that class. In PHP, you'd write "$house = new House;", and now $house has all the attributes and methods of a class.
You can make as many houses as you want, and they will all have the same properties, which you can then change. For example, if the people living in a house decide to add one more room, you could write "$house->numberOfRooms++;". If the default number of rooms for a house was 4, this house would have 5 rooms, and all the others would have 4. As you can see, the attributes are independent from one instance to another.
This is the basics; there is a lot more stuff about classes, like inheritance, access modifiers, etc.
Now, you may ask yourself why is this useful. Well, the point of Object Oriented Programming (OOP) is to think of all the things in the program as independent objects, trying to design them so they can be used regardless of context. For example, your house may be a standalone variable, may be inside an array of houses. If you have a "Person" class with a "residence" attribute, then your house may be that attribute.
This is the theory behind classes and objects. I suggest you look around for examples of code. If you want, you can look at the classes I made for a Pong game I programmed. It's written in Python and may use some stuff you don't understand, but you will get the basic idea. The classes are here.
A class is essentially an abstraction.
You have built-in datatypes such as "int" or "string" or "float", each of which have certain behavior, and operations that are possible.
For example, you can take the square root of a float, but not of a string. You can concatenate two strings, or you can add two integers. Each of these data types represent a general concept (integers, text or numbers with a fixed number of significant digits, which may or may not be fractional)
A class is simply a user-defined datatype that can represent some other concept, including the operations that are legal on it.
For example, we could define a "password" class which implements the behavior expected of a password. That is, we should be able to take a text string and create a password from it. (If I type 'secret02', that is a legal password). It should probably perform some verification on this input string, making sure that it is at least N characters long, and perhaps that it is not a dictionary word. And it should not allow us to read the password. (A password is usually represented as ****** on the screen). Instead, it should simply allow us to compare the password to other passwords, to see if it is identical.
If the password I just typed is the same as the one I originally signed up with, I should be allowed to log in. But what the password actually is, is not something the application I'm logging in to should know. So our password class should define a comparison function, but not a "display" function.
A class basically holds some data, and defines which operations are legal on that data. It creates an abstraction.
In the password example, the data is obviously just a text string internally, but the class allows only a few operations on this data. It prevents us from using the password as a string, and instead only allows the specific operations that would make sense for a password.
In most languages, the members of a class can be either private or public. Anything that is private can only be accessed by other members of the class. That is how we would implement the string stored inside the password class. It is private, so it is still visible to the operations we define in the class, but code outside the class can not just access the string inside a password. They can only access the public members of the class.
A class is a form of structure you could think of, such as int, string and so forth that an instance can be made from using object oriented programming language. Like a template or blueprint the class takes on the structure. You write this structure with every association to the class. Something from a class would be used as an object instance in the Main() method where all the sysync programming steps take place.
This is why you see people write code like Car car = new Car();to draw out a new object from a class. I personally do not like this type of code, its very bad and circular and does not explain which part is the class syntax (arrangement). Too bad many programmers use this syntax and it is difficult for beginners to understand what they are perceiving.
Think of this as,
CarClass theCar = new CarClass(); //
The class essentially takes on the infinitely many forms. You can write properties that describe the CarClass and every car generated will have these. To get them from the property that "gets" what (reads) and "sets" what (writes) data, you simply use the dot operator on the object instance generates in the Main() and state the descriptive property to the actual noun. The class is the noumenon (a word for something like math and numbers, you cannot perceive it to the senses but its a thought like the #1). Instead of writing each item as a variable the class enables us to write a definition of the object to use.
With the ability to write infinitely many things there is great responsibility! Like "Hello World!" how this little first statement says much about our audience as programmers.
So
CarClass theCar = new CarClass(); //In a way this says this word "car" will be a car
theCar.Color = red; //Given the instance of a car we can add that color detail.
Now these are only implementations of the CarClass, not how to build one.
You must be wondering what are some other terms, a field, constructor, and class level methods and why we use them and indexing.
A field is another modifier on a property. These tend to be written on a private class level so nothing from the outside affects it and tends to be focused on the property itself for functionality. It is in another region where you declare it usually with an underscore in front of it. The field will add constraints necessary to maintain data integrity meaning that you will prevent people from writing values that make no sense in the context. (Like real like measurements in the negative... that is just not real.)
The Constructor
The easiest way to describe a constructor is to make claims to some default values on the object properties where the constructor scope is laid. In example a car has a color, a max speed, a model and a company. But what should these values be and should some be used in millions of copies from the CarClass or just a few? The constructor enables one to do this, to generate copies by establishing a basic quality. These values are the defaults assigned to a property in a constructor block. To design a constructor block type ctor[tab][tab]. Inside this simply refer to those properties you write above and place an assigned value on it.
Color = “Red”;
If you go to the main() and now use the car.Color property in any writing output component such as a the console window or textbox you should see the word “Red”. The details are thus implicit and hidden. Instead of offering every word from a book you simply refer to the book then the computer gets the remaining information. This makes code scripts compact and easy to use.
The Class level method should explain how to do some process over and over. Typically a string or some writing you can format some written information for a class and format it with placeholders that are in the writing to display that are represented with your class properties. It makes sense when you make an object instance then need to use the object to display the details in a .ToString() form. The class object instance in a sense can also contain information like a book or box. When we write .ToString() with a ToString override method at class level it will print your custom ToString method and how it should explain the code. You can also write a property .ToString() and read it. This below being a string should read fine as it is...
Console.Writeline(theCar.Color);
Once you get many objects, one at a time you can put them in a list that allows you to add or remove them. Just wait...
Here's a good page about Classes and Objects:
http://ficl.sourceforge.net/oo_in_c.html
This is a resource which I would kindly recommend
http://www.cplusplus.com/doc/tutorial/
not sure why, but starting with C++ to apply OOP might be natural prior of any other language, the above link helped me a lot when I started at least.
Classes are a way programmers mark their territory on code.
They are supposedly necessary for writing big projects.
Linus and his team must have missed that memo developing the linux kernel.
However, they can be good for organization and categorizing code I guess.
It makes it easier to navigate code in an ide such as visual studio with the object browsers.
Here are some usage demonstrations of classes in 31 languages on rosettacode
First of all back to the definitions:
Class definition:
Abstract definition of something, an user-type, a blueprint;
Has States / Fields / Properties (what an object knows) and Methods / Behaviors / Member Functions (what an object does);
Defines objects behavior and default values;
Object definition:
Instance of a Class, Repository of data;
Has a unique identity: the property of an object that distinguishes it from other objects;
Has its own states: describes the data stored in the object;
Exhibits some well defined behavior: follows the class’s description;
Instantiation:
Is the way of instantiate a class to create an object;
Leaves the object in a valid state;
Performed by a constructor;
To use a class you must instantiate the class though a contructor. In PHP a straight-forward example could be:
<?php
class SampleClass {
function __construct() {
print "In SampleClass constructor\n";
}
}
// In SampleClass constructor
$obj = new SampleClass ();
?>