How do I change how keys are serialized in Scalding? - scala

I am grouping by a custom type in my scalding job:
typedPipe
.map(someMapper)
.groupBy(_.nonPrimitiveField)
.sum
.write(sink)
In my output, the keys show up as the toString output, which is not useful. How can I make scalding use a custom serializer for these keys?
My current workaround is to call toTypedPipe and explicitly call my serialization function in the mappers, but this seems wasteful.
The sink is a TypedTsv[(Key, Value)], where Key is the type of the field that I would like to serialize differently.

Well, Tsv is a text format, so, in the end of the day, everything becomes a string.
The simplest way would be to just override .toString on your Key type, or wrap it into another object with .toString overridden. Or, just replace it with a String as a final step (I think, that's what you are already doing anyway). I am not sure what you mean when you say it is "wasteful". It does not add an extra step to the flow if that's your concern, and the conversion to string would have to happen in any case, so that cost is fixed.
typedPipe.
.map(someMapper)
.groupBy(x => beautifulString(x.nonPrimitiveField))
.sum
.write(sink)

Related

Expected parameter scala.Option<Timestamp> vs. Actual argument Timestamp

This is probably a very silly question, but I have a case class which takes as a parameter Option[Timestamp]. The reason this is necessary is because sometimes the timestamp isn't included. However, for testing purposes I'm making an object where I pass in
Timestamp.valueOf("2016-01-27 22:27:32.596150")
But, it seems I can't do this as this is an actual Timestamp, and it's expecting an Option.
How do I convert to Option[Timestamp] from Timestamp. Further, why does this cause a problem to begin with? Isn't the whole benefit of Option that it could be there or not?
Thanks in advance,
Option indicates the possibility of a missing value, but you still need to construct an Option[Timestamp] value. There are two subtypes for Option - None when there is no value, and Some[T] which contains a value of type T.
You can create one directly using Some:
Some(Timestamp.valueOf("2016-01-27 22:27:32.596150"))
or Option.apply:
Option(Timestamp.valueOf("2016-01-27 22:27:32.596150"))

In Scala, is it possible to simultaneously extend a library and have a default conversion?

For example in the following article
http://www.artima.com/weblogs/viewpost.jsp?thread=179766
Two separate examples are given:
Automatic string conversion
Addition of append method
Suppose I want to have automatic string conversion AND a new append method. Is this possible? I have been trying to do both at the same time but I get compile errors. Does that mean the two implicits are conflicting?
You can have any number of implicit conversions from a class provided that each one can be unambiguously determined depending on usage. So the array to string and array to rich-array-class-containing-append is fine since String doesn't have an append method. But you can't convert to StringBuffer which has append methods which would interfere with your rich array append.

Examples of using some Scala Option methods

I have read the blog post recommended me here. Now I wonder what some those methods are useful for. Can you show examples of using forall (as opposed to foreach) and toList of Option?
map: Allows you to transform a value "inside" an Option, as you probably already know for Lists. This operation makes Option a functor (you can say "endofunctor" if you want to scare your colleagues)
flatMap: Option is actually a monad, and flatMap makes it one (together with something like a constuctor for a single value). This method can be used if you have a function which turns a value into an Option, but the value you have is already "wrapped" in an Option, so flatMap saves you the unwrapping before applying the function. E.g. if you have an Option[Map[K,V]], you can write mapOption.flatMap(_.get(key)). If you would use a simple map here, you would get an Option[Option[V]], but with flatMap you get an Option[V]. This method is cooler than you might think, as it allows to chain functions together in a very flexible way (which is one reason why Haskell loves monads).
flatten: If you have a value of type Option[Option[T]], flatten turns it into an Option[T]. It is the same as flatMap(identity(_)).
orElse: If you have several alternatives wrapped in Options, and you want the first one that holds actually a value, you can chain these alternatives with orElse: steakOption.orElse(hamburgerOption).orElse(saladOption)
getOrElse: Get the value out of the Option, but specify a default value if it is empty, e.g. nameOption.getOrElse("unknown").
foreach: Do something with the value inside, if it exists.
isDefined, isEmpty: Determine if this Option holds a value.
forall, exists: Tests if a given predicate holds for the value. forall is the same as option.map(test(_)).getOrElse(true), exists is the same, just with false as default.
toList: Surprise, it converts the Option to a List.
Many of the methods on Option may be there more for the sake of uniformity (with collections) rather than for their usefulness, as they are all very small functions and so do not spare much effort, yet they serve a purpose, and their meanings are clear once you are familiar with the collection framework (as is often said, Option is like a list which cannot have more than one element).
forall checks a property of the value inside an option. If there is no value, the check pass. For example, if in a car rental, you are allowed one additionalDriver: Option[Person], you can do
additionalDriver.forall(_.hasDrivingLicense)
exactly the same thing that you would do if several additional drivers were allowed and you had a list.
toList may be a useful conversion. Suppose you have options: List[Option[T]], and you want to get a List[T], with the values of all of the options that are Some. you can do
for(option <- options; value in option.toList) yield value
(or better options.flatMap(_.toList))
I have one practical example of toList method. You can find it in scaldi (my Scala dependency injection framework) in Module.scala at line 72:
https://github.com/OlegIlyenko/scaldi/blob/f3697ecaa5d6e96c5486db024efca2d3cdb04a65/src/main/scala/scaldi/Module.scala#L72
In this context getBindings method can return either Nil or List with only one element. I can retrieve it as Option with discoverBinding. I find it convenient to be able to convert Option to List (that either empty or has one element) with toList method.

How to workaround the XmlSerialization issue with type name aliases (in .NET)?

I have a collection of objects and for each object I have its type FullName and either its value (as string) or a subtree of inner objects details (again type FullName and value or a subtree).
For each root object I need to create a piece of XML that will be xml de-serializable.
The problem with XmlSerializer is that i.e. following object
int age = 33;
will be serialized to
<int>33</int>
At first this seems to be perfectly ok, however when working with reflection you will be using System.Int32 as type name and int is an alias and this
<System.Int32>33</System.Int32>
will not deserialize.
Now additional complexity comes from the fact that I need to handle any possible data type.
So solutions that utilize System.Activator.CreateInstance(..) and casting won't work unless I go down the path of code gen & compilation as a way of achieving this (which I would rather avoid)
Notes:
A quick research with .NET Reflector revealed that XmlSerializer uses internal class TypeScope, look at its static constructor to see that it initializes an inner HashTable with mappings.
So the question is what is the best (and I mean elegant and solid) way to workaround this sad fact?
I don't exactly see, where your problem originally stems from. XmlSerializer will use the same syntax/mapping for serializing as for deserializing, so there is no conflict when using it for both serializing and deserializing.
Probably the used type-tags are some xml-standard thing, but not sure about that.
I guess the problem is more your usage of reflection. Do you instantiate your
imported/deserialized objects by calling Activator.CreateInstance ?
I would recommend the following instead, if you have some type Foo to be created from the xml in xmlReader:
Foo DeserializedObject = (Foo)Serializer(typeof(Foo)).Deserialize(xmlReader);
Alternatively, if you don't want to switch to the XmlSerializer completely, you could do some preprocessing of your input. The standard way would then be to create some XSLT, in which you transform all those type-elements to their aliases or vice versa. then before processing the XML, you apply your transformation using System.Xml.Xsl.XslCompiledTransform and use your (or the reflector's) mappings for each type.
Why don't you serialize all the field's type as an attribute?
Instead of
<age>
<int>33</int>
</age>
you could do
<age type="System.Int32">
33
</age>

Using alternative comparison in HashSet

I stumbled across this problem when creating a HashSet[Array[Byte]] to use in a kind of HatTrie.
Apparently, the standard equals() method on arrays checks for identity. How can I provide the HashSet with an alternative Comparator that uses .deepEquals() for checking if an element is contained in the set?
Basically, I want this test to pass:
describe ("A HashSet of Byte Array") {
it("must contain arrays that are equivalent to one that has been added") {
val set = new HashSet[Array[Byte]]()
set += "ab".getBytes("UTF-8")
set must contain ("ab".getBytes("UTF-8"))
}
}
I cannot feasibly wrap the Array[Byte] into another object because there's a lot of them. Short of writing a new HashSet implementation for this purpose is there anything I can do?
Mutable data structures, such as Arrays, are contra-indicated for usage in places where the hash code is used. This is because the data structure can change, thus changing the hash code of the data, thus making access to the data inaccurate.
For instance, let's say I have a binary tree to store elements based on their hash code. If the hash is even, I store the data on the left side, if odd on the right side. Then I divide the hash by two, and repeat the process until the hash is 0, at which point I store the data in the node.
Now, I use this structure as base for HashSet, and then store an array on it. The array has an even hash code, so it goes to the left side of the tree. Let's ignore it's exact position.
Later, I change the array, and then look it up on the set. Now the hash code is odd, and I go look on the right side of the tree, and thus can't find it, even though it is stored int he tree -- just on the other side.
So, don't use arrays with hash-based collections. Which doesn't answer your question, of course.
As for your question, you'd have to subclass HashSet, and then override the equals method. I don't know if HashSet is final or descendent from a sealed class, so I don't know if this is viable.
Another option would be creating an alternate comparision method -- not named equals or "==", based specifically on deepEquals, and then using the Pimp My Class method to add it to HashSet.
Edit
I did mean subclass HashSet, but I did not pay enough attention to the question. I thought you were comparing the entire HashSet, instead of just using contains. You could do this:
class MyHashSet[A] extends scala.collection.mutable.HashSet[A] {
override def contains(elem: A): Boolean = elem match {
case arr : Array[_] => this.elements exists (arr deepEquals _)
case _ => super.contains(elem)
}
}
This isn't actually working here, as the first case is not being followed. I'm really lost here, as simple tests on REPL seems to indicate it ought to work. I'm thinking it might have something to do with boxing, but I'm not real clear on what -- or I'd have it working. :-)