Is there any simple way to extract multiple values from Guava's HashCode?

With Guava, hashing can be as simple as
byte[] byteHash = Hashing.md5().hashBytes(aByteArray).asBytes();
but seemingly only as long as all you want is a byte[] (possibly converted to a hex string), or a single int or long. But in one place I need two longs, and in another I need five ints from SHA-1.
I can see some solutions like reading from new DataInputStream(new ByteArrayInputStream(byteHash)), using a ByteBuffer, or converting manually from the byte[]. However, all of them are extremely ugly (e.g. swallowing an impossible IOException) and long (and also inefficient, but this doesn't bother me here).
So is there any simple way to extract multiple (non-byte) values from Guava's HashCode?

There's nothing built in to HashCode for this, no.
Doing what you need with ByteBuffer seems really easy though, and neither long nor especially inefficient:
ByteBuffer buf = ByteBuffer.wrap(byteHash);
long l1 = buf.getLong();
long l2 = buf.getLong();
(I suppose an asReadOnlyByteBuffer() method could avoid the need for cloning a byte array, but I don't know if that's really necessary.)
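For the five-int case from the question, the same idea can be sketched with the buffer's int and long views. This is only a minimal sketch, not something Guava provides: the class HashWords and its methods toInts and toLongs are hypothetical names, and MessageDigest stands in for Hashing.sha1() just so the example has no dependencies. Note that ByteBuffer reads big-endian by default, whereas HashCode.asInt() and asLong() are documented as little-endian, so call order(ByteOrder.LITTLE_ENDIAN) on the buffer if the values need to agree with those methods.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not part of Guava.
final class HashWords {
    /** Reads the hash as consecutive big-endian ints (5 for a 20-byte SHA-1). */
    static int[] toInts(byte[] hash) {
        int[] out = new int[hash.length / Integer.BYTES];
        ByteBuffer.wrap(hash).asIntBuffer().get(out);
        return out;
    }

    /** Reads the hash as consecutive big-endian longs; trailing bytes are ignored. */
    static long[] toLongs(byte[] hash) {
        long[] out = new long[hash.length / Long.BYTES];
        ByteBuffer.wrap(hash).asLongBuffer().get(out);
        return out;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Stand-in for Hashing.sha1().hashBytes(...).asBytes()
        byte[] sha1 = MessageDigest.getInstance("SHA-1")
                .digest("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(Arrays.toString(toInts(sha1)));  // five ints
        System.out.println(Arrays.toString(toLongs(sha1))); // two longs (first 16 bytes)
    }
}
```

Wrapping does not copy the array, and no checked IOException is involved, which addresses the main complaints about the DataInputStream approach.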

Related

Scodec: Using vectorOfN with a vlong field

I am playing around with the Bitcoin blockchain to learn Scala and some useful libraries.
Currently I am trying to decode and encode Blocks with scodec, and my problem is that the vectorOfN function takes its size as an Int. How can I use a long field for the size while still preserving the full value range?
In other words, is there a vectorOfLongN function?
This is my code which would compile fine if I were using vintL instead of vlongL:
object Block {
implicit val codec: Codec[Block] = {
("header" | Codec[BlockHeader]) ::
(("numTx" | vlongL) >>:~
{ numTx => ("transactions" | vectorOfN(provide(numTx), Codec[Transaction]) ).hlist })
}.as[Block]
}
You may assume that appropriate Codecs for the BlockHeader and the Transactions are implemented. Actually, vlong is used as a simplification for this question, because Bitcoin uses its own codec for variable-sized ints.
I'm not a scodec specialist, but my common sense suggests that this is not possible: Scala's Vector, being a subtype of GenSeqLike, is limited to a length of type Int and an apply that accepts an Int index. As far as I understand, this limitation comes from the underlying JVM platform, where an array cannot have more than Integer.MAX_VALUE elements, i.e. around 2^31 (see also the "Criticism of Java" wiki article). Although Vector could in theory have worked around this limitation, that was not done. So it makes no sense for vectorOfN to support a Long size either.
In other words, if you want something like this, you should probably start by creating your own Vector-like class that supports Long indices and works around the JVM limitations.
You may want to take a look at scodec-stream, which comes in handy when all of your data is not available immediately or does not fit into memory.
Basically, you would use your usual codecs.X and turn it into a StreamDecoder via scodec.stream.decode.many(normal_codec). This way you can work with the data through scodec without the need to load it into memory entirely.
A StreamDecoder then offers methods like decodeInputStream alongside scodec's usual decode.
(I used it a while ago in a slightly different context – parsing data sent by a client to a server – but it looks like it would apply here as well).

Is string concatenation in scala as costly as it is in Java?

In Java, it's a common best practice to do string concatenation with StringBuilder due to the poor performance of appending strings using the + operator. Is the same practice recommended for Scala or has the language improved on how java performs its string concatenation?
Scala uses Java strings (java.lang.String), so its string concatenation is the same as Java's: the same thing is taking place in both. (The runtime is the same, after all.) Scala also has its own StringBuilder class that "provides an API compatible with java.lang.StringBuilder"; see http://www.scala-lang.org/api/2.7.5/scala/StringBuilder.html.
But in terms of "best practices", I think most people would generally consider it better to write simple, clear code than maximally efficient code, except when there's an actual performance problem or a good reason to expect one. The + operator doesn't really have "poor performance", it's just that s += "foo" is equivalent to s = s + "foo" (i.e. it creates a new String object), which means that, if you're doing a lot of concatenations to (what looks like) "a single string", you can avoid creating unnecessary objects — and repeatedly recopying earlier portions from one string to another — by using a StringBuilder instead of a String. Usually the difference is not important. (Of course, "simple, clear code" is slightly contradictory: using += is simpler, using StringBuilder is clearer. But still, the decision should usually be based on code-writing considerations rather than minor performance considerations.)
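To make the difference concrete, here is a minimal Java sketch of the two styles described above. The class ConcatDemo and its method names are made up for illustration. Both methods return the same string; the first creates a new String on every iteration, while the second appends into one growing buffer.

```java
// Hypothetical demo class contrasting += with StringBuilder.
final class ConcatDemo {
    /** s += p is s = s + p: each pass allocates a new String and recopies the old contents. */
    static String joinWithPlus(String[] parts) {
        String s = "";
        for (String p : parts) {
            s += p;
        }
        return s;
    }

    /** One builder is reused; characters are copied into its buffer and once more in toString(). */
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinWithPlus(parts).equals(joinWithBuilder(parts)));
    }
}
```

For a handful of pieces the two are interchangeable; the builder only starts to matter when the loop runs many times over a string that keeps growing.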
Scala's String concatenation works the same way as Java's does.
val x = 5
"a"+"b"+x+"c"
is translated to
(new StringBuilder()).append("ab").append(BoxesRunTime.boxToInteger(x)).append("c").toString()
StringBuilder is scala.collection.mutable.StringBuilder. That's the reason why the value appended to the StringBuilder is boxed by the compiler.
You can check this behavior by decompiling the bytecode with javap.
I want to add: if you have a sequence of strings, then there is already a method to create a new string out of them (all items, concatenated). It's called mkString.
Example: (http://ideone.com/QJhkAG)
val example = Seq("11111", "2222", "333", "444444")
val result = example.mkString
println(result) // prints "111112222333444444"
Scala uses java.lang.String as the type for strings, so it is subject to the same characteristics.

Is there a substring proxy in scala?

Using strings as String objects is pretty convenient for many string processing tasks.
I need to extract some substrings for processing, and Scala's String class provides that functionality. But it is rather expensive: a new String object is created every time the substring function is used. Using tuples (string : String, start : Int, stop : Int) solves the performance problem, but makes the code much more complicated.
Is there any library for creating string proxies that store the original string and the range bounds and are compatible with other string functions?
Java 7u6 and later now implement #substring as a copy, not a view, making this answer obsolete.
If you're running your Scala program on the Sun/Oracle JVM, you shouldn't need to perform this optimization, because java.lang.String already does it for you.
A string is stored as a reference to a char array, together with an offset and a length. Substrings share the same underlying array, but with a different offset and/or length.
Look at the implementation of String (in particular substring(int beginIndex, int endIndex)): it's already represented as you wish.
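On JVMs where substring copies (see the note about Java 7u6 above), one way to get a proxy without a third-party library is java.nio.CharBuffer, which can wrap a range of an existing string as a read-only CharSequence without copying the characters. The sketch below is an assumption about what would fit the question, not a drop-in replacement for String: the class SubstringView and its method view are hypothetical names, and the result only works with APIs that accept a CharSequence (for example Pattern.matcher or StringBuilder.append), not with those that require a String.

```java
import java.nio.CharBuffer;

// Hypothetical helper: a read-only window onto part of a String.
final class SubstringView {
    /** Returns a view of s[start, end) that shares the original string's characters. */
    static CharSequence view(String s, int start, int end) {
        // The buffer's position is start and its limit is end;
        // length() and charAt() are relative to that window.
        return CharBuffer.wrap(s, start, end);
    }

    public static void main(String[] args) {
        CharSequence v = view("hello world", 6, 11);
        System.out.println(v.length());
        System.out.println(v.charAt(0));
        System.out.println(v); // the copy happens only here, in toString()
    }
}
```

The view keeps the whole original string reachable for as long as the view itself is alive, which is the same memory trade-off the old shared-array substring had.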

Scala: read and save all elements of an Iterable

I have an Iterable[T] that is really a stream of unknown length, and want to read it all and save it into something that is still an instance of Iterable. I really do have to read it and save it; I can't do it in a lazy way. The original Iterable can have a few thousand elements, at least. What's the most efficient/best/canonical way? Should I use an ArrayBuffer, a List, a Vector?
Suppose xs is my Iterable. I can think of doing these possibilities:
xs.toArray.toIterable // Ugh?
xs.toList // Fast?
xs.copyToBuffer(anArrayBuffer)
Vector(xs: _*) // There's no toVector, sadly. Is this construct as efficient?
EDIT: I see by the questions I should be more specific. Here's a strawman example:
def f(xs: Iterable[SomeType]) { // xs might be a stream, though I can't be sure
val allOfXS = <xs all read in at once>
g(allOfXS)
h(allOfXS) // Both g() and h() take an Iterable[SomeType]
}
This is easy. A few thousand elements is nothing, so it hardly matters unless it's a really tight loop. So the flippant answer is: use whatever you feel is most elegant.
But, okay, let's suppose that this is actually in some tight loop, and you can predict or have benchmarked your code enough to know that this is performance-limiting.
Your best performance for an immutable solution will likely be a Vector, used like so:
Vector() ++ xs
In my hands, this can copy a 10k iterable about 4k-5k times per second. List is about half the speed.
If you're willing to try a mutable solution under the hood, xs.toArray.toIterable usually takes the cake with about 10k copies per second. ArrayBuffer is about the same speed as List.
If you actually know the size of the target (i.e. size is O(1) or you know it from somewhere else), you can shave off another 20-30% of the execution speed by allocating just the right size and writing a while loop.
If it's actually primitives, you can gain a factor of 10 by writing your own specialized Iterable-like-thing that acts on arrays and converts to regular collections via the underlying array.
Bottom line: for a great blend of power, speed, and flexibility, use Vector() ++ xs in most situations. xs.toIndexedSeq defaults to the same thing, with the benefit that if it's already a Vector it will take no time at all (and it chains nicely without parens), and the drawback that you are relying on a convention, not a specification, for the behavior (and it takes 1-3 more characters to type).
How about Stream.force?
Forces evaluation of the whole stream and returns it.
This is hard. An Iterable's methods are defined in terms of its iterator, but that gets overridden by subtraits. For instance, IndexedSeq methods are usually defined in terms of apply.
There is the question of why you want to copy the Iterable, but I suppose you might be guarding against the possibility of it being mutable. If you do not want to copy it, then you need to rephrase your question.
If you are going to copy it, and you want to be sure all elements are copied in a strict manner, you could use .toList. That will not copy a List, but a List does not need to be copied. For anything else, it will produce a new copy.

Does SqlDataReader have an equivalent to Get*(int index) with a string key?

I'm trying to use a SqlDataReader (I'm quite aware of the beauty of Linq, etc, but the application I'm building is partly a Sql Generator, so Linq doesn't fit my needs). Unfortunately, I'm not sure what the best practices are when using SqlDataReader. I use code like the following in several places in my code:
using (SqlDataReader reader = ...)
{
int ID = reader.GetInt32(0);
int tableID = reader.GetInt32(1);
string fieldName = reader[2] as string;
...//More, similar code
}
But it feels very unstable. If the database changes (which is actually extremely unlikely in this case) the code breaks. Is there an equivalent to SqlDataReader's GetInt32, GetString, GetDecimal, that takes a column name instead of an index? What's considered best practice in this case? What's fastest? These parts of my code are the most time intensive portions of my code (I've profiled it a few times) and so speed is important.
[EDIT]
I'm aware of using the indexer with a string, I misworded the above. I'm running into slow runtime. My code works fine, but I am looking for any way I can steal back a few seconds inside these loops. Would accessing by string slow me down? I know that the db-access is the primary time intensive operation, there's nothing I can do about that, so I want to cut back the processing time for each element accessed.
[EDIT]
I've decided to just run with GetOrdinal unless someone has more concrete examples. I'll run efficiency tests later and will try to remember to post the results once I have them.
The indexer property takes a string key, so you can do the following.
reader["text_column"] as string;
Convert.ToInt32(reader["numeric_column"]);
Additional suggestion
If you're concerned about the string lookup being slow, and assuming numeric lookup is quicker, you could try using GetOrdinal to find the column indices before looping through a large result set.
int textColumnIndex = reader.GetOrdinal("text_column");
int numericColumnIndex = reader.GetOrdinal("numeric_column");
while (reader.Read())
{
string text = reader[textColumnIndex] as string;
int number = Convert.ToInt32(reader[numericColumnIndex]);
}