What is the most mature library for building a Data Analytics Pipeline in Java/Scala for Hadoop?

I found many options recently and am interested in how they compare, primarily in maturity and stability.
Crunch - https://github.com/cloudera/crunch
Scrunch - https://github.com/cloudera/crunch/tree/master/scrunch
Cascading - http://www.cascading.org/
Scalding - https://github.com/twitter/scalding
FlumeJava
Scoobi - https://github.com/NICTA/scoobi/

As I'm a developer of Scoobi, don't expect an unbiased answer.
First of all, FlumeJava is an internal Google project that provides an (awesomely productive) abstraction on top of MapReduce (not Hadoop though). They released a paper about it, which is what projects like Scoobi and Crunch are based on.
If your only criterion is maturity, I guess Cascading is your best bet.
However, if you're looking for the (imho superior) FlumeJava style abstraction, you'll want to pick between (S)crunch and Scoobi.
The biggest difference, superficial as it may be, is that Crunch is written in Java with Scala bindings (Scrunch), while Scoobi is written in Scala with Java bindings (scoobij). They're both really solid choices, and you won't go wrong whichever you choose. I'm sure there's quite a similar story with Crunch, but Scoobi is being used in real projects and is under continual development. We're very active in fixing bugs and implementing features.
Anyway, they're both great projects with great people behind them and were both released within days of each other. They provide the same abstraction (with a similar API), so switching between the two won't be an issue in the slightest. My recommendation is to give them both a try and see what works for you. There's no lock-in in either project, so you don't need to commit :)
And if you have any feedback for either project, please be sure to provide it :)
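To give a rough feel for the FlumeJava-style abstraction mentioned above, here is a minimal word-count sketch. It is only an analogy: it uses plain java.util.stream and runs in-process, and it is not the Crunch, Scoobi or FlumeJava API. The class and method names (WordCountSketch, wordCount) are made up for illustration. The point is just that the job reads as a chain of collection-style operations rather than hand-written mapper and reducer classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java streams standing in for the
// "distributed collection" style. No Hadoop, Crunch or Scoobi involved.
public class WordCountSketch {

    // Split each line into words, drop empty tokens, then group and count.
    // A TreeMap keeps the result sorted by word so the output is stable.
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b c");
        System.out.println(wordCount(lines)); // prints {a=2, b=2, c=1}
    }
}
```

In the real libraries the same shape of program would be planned into MapReduce jobs instead of being evaluated locally, so treat this purely as a picture of the programming model.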

I'm a big Scoobi fan myself and I've used it in production. I like the way it allows you to write type-safe Hadoop programs in a very idiomatic Scala way. If that is not necessarily your thing and you like the Cascading model but are scared off by the huge amount of boilerplate code you'd have to write, Twitter has recently open sourced its own Scala abstraction layer on top of Cascading called Scalding.
Announcement: https://dev.twitter.com/blog/scalding
GitHub: https://github.com/twitter/scalding
I guess it's all a matter of taste at this point since feature-wise most of the frameworks are very close to one another.

Scalding also has the advantage of significant open source projects built atop it, such as Matrix API and Algebird.
Here are some examples:
http://sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
Cascalog was released almost two years before Scalding, and arguably has more advanced features for building robust workflows:
https://github.com/nathanmarz/cascalog/wiki

Related

Understanding Hadoop Packages and Classes

I have been using CDH and HDP for a while (both in the pseudo-distributed mode) on a VM as well as installing natively on Ubuntu. Although my question is probably relevant to all Projects within the Apache Hadoop Ecosystem, let me ask this specifically in the context of Avro.
What is the best way to go about figuring out what the different packages, and the classes within them, do? I usually end up referring to the Javadoc for the project (Avro in this case), but the overviews for packages and classes end up being awfully inadequate.
For example, take two of the Avro packages: org.apache.avro.specific and org.apache.avro.generic. These are used for creating Specific and Generic Readers and Writers (respectively), but I'm not 100% sure what these are for. I have used the Specific package in cases where I used Avro code generation, and the Generic one when I didn't want to use code generation. However, I am not sure if that is the only reason for using one vs. the other.
Another example: the Encoder\Decoder classes are used for low-level SerDe, DatumReader\DatumWriter for a "medium-level" SerDe, while most application-layer interactions with Avro will probably use Generic\Specific Readers\Writers. Without having struggled through the pain of using these classes, how is a user to know what to use for what?
Is there a better way to get a good overview of each package and the classes within it (clearly the Javadoc is not thorough enough)?
PS: I have similar questions for essentially all other Hadoop projects (Hive, HBase etc.) - the Javadocs seem to be grossly inadequate overall. I just wonder what other developers end up doing to figure these out.
Any inputs would be great.

I download the source code and skim through it to get an idea of what it does. If there is Javadoc, I read that too. I tend to concentrate on the interfaces that I need and move on from there; that way I put everything into context, and it makes it easier to figure out the usage. I use the call hierarchy and the type hierarchy views a lot.
These are very general guidelines, and ultimately it is the time you spend with the project that will make you understand it.
The Hadoop ecosystem is growing quickly, and changes are introduced on a monthly basis. That's why the Javadoc is not so good. Another reason is that Hadoop software tends to lean towards the infrastructure and not towards the end user. People developing tools will spend time learning the APIs and internals, while everybody else is kind of supposed to be blissfully ignorant of all those and just use some high-level domain-specific language for the tool.

What library should I use for accessing Riak from Scala?

For a project I'm using both Scala and Riak (two things I have never worked with before ;) ).
Google searches seem to suggest using Riakki. However, it seems like that particular library hasn't been maintained since 2009 and doesn't even compile on my system. There is a more up-to-date fork on GitHub that does seem to work with more recent Scala versions. But Riakki seems to depend on Jiak, which has been deprecated since February of last year.
Seems like the only reasonable choice would be to use the official Riak Java-library from Scala. That's certainly possible, but I'd like to do things the Scala-way as I'm trying to learn the language. Having to interface with a Java-style API might ruin a bit of the fun. Writing my own wrapper sounds like it will be too much work.
tl;dr: I want to use Riak from Scala. What are other people using?
edit: just found Ryu (can't link to it - annoying limit on amount of hyperlinks per question for new users). Doesn't seem all that mature though.

StackMob recently open-sourced Scaliak.
Scaliak is a scala-ified version of the High-Level Riak Java Client w/ a Functional Twist. It is currently being used in production at StackMob.
Scaliak is currently feature incomplete vs. the original High-Level Riak Java Client. What is currently supported are mostly features being used in production (there have been a few features implemented and subsequently not used).
There is also Raiku which states that it is async.

I'm in the same bucket - excuse the bad pun - although I have some experience with Scala. I'm thinking of using the official Java client.
When you are toiling up a steep learning curve, you don't need to be dealing with incomplete and potentially wobbly APIs. In my experience, using Java APIs from Scala is minimally painful.
I think there'll be enough delight in playing with our new Riak toy that we'll forget about whatever un-Scala-ish foibles the Java API inflicts upon us. All the best.

I'm the author of yet another Scala Riak client, simply called riak-scala-client. It is based on Akka and Spray, it is not built on top of the existing Java client, and most importantly it is completely non-blocking.
Check it out at http://riak.scalapenos.com and let me know what you think.

Bringing Scala into my company

Now I know that this one is actually not a very technical question, but it has been bothering me for some time. We are using a lot of C++ and PHP at our company, and some of our developers are really hoping for a new and modern language to come along and help us become more productive. I have been talking about what Scala can do, and the other coders seem to be gaining some interest in the language. The tough job is: how do you convince your boss to consider Scala as a language for the company? I saw the presentation "Sneaking Scala into your company", but it deals with the situation where you are using Java at your company, which we don't.
How do you fight off the usual "that is just esoteric stuff" and "we can already do that in $LANGUAGE" arguments? I was planning to give a talk about Scala, and since I don't have much time, I need ideas on how to get people interested in the language rather than setting off reactions like "currying? we can already do something like this with boost::bind".
How did you guys do it?
Regards,
raichoo
EDIT: Gave my talk yesterday, people were very excited. My company is going to give it a try! Thanks for all your suggestions.

If you don't already have killer arguments, what are you basing your reasoning on that Scala will make your company more productive?
Don't decide you like something and then hunt for reasons to use it at work. Let the reasons speak for themselves.
"A hammer looking for nails"

Using it to do some stuff around the side, such as data migrations, testing and similar things, will make sure the necessary experience is built up and can give it some exposure.
ScalaTest is really nice to help with acceptance/integration testing. (Yes, I know it is nice for unit testing, but I do not see that immediately happening with C++/PHP target code, and it would probably be unwise).
Proof of Concept and other Prototypes are great for 2 reasons
1) It showcases the capabilities
2) You are certain they will be thrown away if you have to reimplement them in C++/PHP
Now a bad time to introduce Scala would be when you REALLY need it : hopes will be high, it will not immediately work as intended, hopes are dashed and everybody will blame Scala. As a result it will be burnt for a long time in the organisation.
Sooner or later some suit will think it was his idea to introduce Scala and use it on a formal project. If that project is moderately successful, then it is sold.
These kinds of changes are complicated people issues, and the harder you push, the harder the push-back you will face. On the other hand, the persistent mind can move mountains.

Redo some of your work-related code in Scala and compare KLOC, code structure and performance. If it looks and works better, show it to your peers and your managers.
In other words:
Talk is cheap. Show me the code.
-- Torvalds, Linus (2000-08-25)

In the case of our company (and I assume many companies share the same scenario), the move to Scala (from Java) was initiated by tech people who 1. wanted to be more productive writing code (living in the 21st century, using modern approaches), 2. wanted less trouble building concurrent applications (the Actors concept promoted by Scala is way simpler than Java thread-based concurrency), and 2.1 wanted a simpler way of building scalable staged event-driven architectures.
In our company, the transition to Scala was more or less simple, because Scala was literally sold to business people as a library for Java :) -> from their POV, we're still using the same platform (JVM), application servers, etc., but developers are having more fun at work and are therefore more inspired and work more efficiently.

Maybe you could pitch Scala by showing off the suite of tools that is used for development? For example, if you are not already using Eclipse in your company, show your execs a demo of what a modern IDE can do for your productivity.
There is a book called "Fearless Change" (Linda Rising) that describes a pattern language for "powerless leaders" (I LOVE that role title!). SE-radio had a really motivating interview with the author: http://www.se-radio.net/podcast/2009-06/episode-139-fearless-change-linda-rising. Listen up on that interview to collect a few non-technical strategies that can help you in this struggle!

I haven't used Scala yet for any real business code, but I know people who have.
One group used it to write a tool to analyze log files. So they didn't use it for mission-critical business code, but for a non-critical tool to support the project.
Another person I know is an architect, and he just went and wrote some Scala code on his own for some production code without telling his manager. After the code was deployed successfully, he did tell him. One of the things he mentioned is that because Scala runs on the JVM, the people who support the application don't even notice: to them, Scala is just another library that's included with the application (they were already used to the JVM). Of course, this approach is risky, and not everybody will be in a position or be willing to do this.
You could start small - use it as your personal preferred scripting language for small things that you need yourself. Tell your fellow developers about it and make them enthusiasts too. If they also start using it then you can step it up to make some side code for your project (such as for example that log analyser tool).

This isn't a really easy task. I would concentrate on the fact that you will be able to produce code, and therefore products, faster and with higher quality. Those are always the two reasons the business wants to hear and will listen to.
Maybe you can show an example of 1-2 very small projects you did in your company with C++/PHP and compare the effort, quality etc. with a similar or identical implementation in Scala? This would be very impressive and should also convince people who are not on the coding side.

There was a very good talk at Scala Days 2010 by David Copeland:
Sneaking Scala into your organisation
The executive summary: Testing. You can use Scala for testing without affecting release code.

Are there any disadvantages of using C# 3.0 features?

I like C# 3.0 features, especially lambda expressions, auto-implemented properties and, in suitable cases, implicitly typed local variables (the var keyword), but when my boss found out that I was using them, he asked me not to use any C# 3.0 features at work. I was told that these features are not standard, that they are confusing for most developers, and that their usefulness is doubtful. I was restricted to using only C# 2.0 features, and he is also considering forbidding anonymous methods.
Since we are targeting .NET Framework 3.5, I cannot see any reason for these restrictions. In my opinion, maybe the only disadvantage is that my few co-workers and the boss (also a programmer) would have to learn some basics of C# 3.0 which should not be difficult. What do you think about it? Is my boss right and am I missing something? Are there any good reasons for such a restriction in a development company where C# is a main programming language?

I have had a similar experience (asked not to use generics, because they may be confusing to my colleagues).
The fact is that we now use generics and none of my colleagues is having a problem with them. They may not have grasped how to create generic classes, but they sure do understand how to use them.
My opinion on that is that any developer can learn how to use these language features. They may seem advanced at first but as people get used to them the shock of newness lessens.
The main argument for using these features (or any new language features) is that this is a simple and easy way to help my colleagues advance their skills, rather than stagnating.
As for your particular problem, not using lambdas: many of the updates to the BCL have overloads that take delegates as parameters, and these are in many cases most easily expressed as lambdas. Not using them this way ignores some of the new and updated uses of the BCL.
Regarding your peers not being able to learn lambdas: I found that Jon Skeet's C# in Depth deals with how they evolved from delegates in a manner that was easy to follow and a real eye-opener. I would recommend you get a copy for your boss and colleagues.

Your boss is going to need to understand that language (and other) improvements are designed to give developers more capabilities and make them more efficient in completing the task at hand, and that if he is not going to allow them for unknown reasons, then:
The development team isn't producing at its greatest potential.
The company isn't benefiting from increased efficiency/productivity.
Like others have said, developers aren't worth their salt if they can't keep up with some of the latest improvements in the language that they use on a daily basis. I suspect your boss hasn't done much coding lately and it is his inability to understand the latest language improvements that has motivated this decision.

I was told that these features are not standard, that they are confusing for most developers, and that their usefulness is doubtful. I was restricted to using only C# 2.0 features, and he is also considering forbidding anonymous methods.
Presumably roughly translates to your boss meaning...
These features are confusing for me, and I don't find them useful because I don't understand them.
Which is fairly symptomatic of the Blub paradox (well, or just sheer laziness). Either way there's no merit in what he's saying, and you should start looking for another job if he continues down that road.

If the project is strictly C# 3+ from now on, then you would not break the build by including these items. However, before using them you should be aware of the following:
You can't use them if the project lead gets to make the decision and votes no.
Other than that, you should use them where it makes the code significantly easier to maintain.
You should not use them in ways that are confusing, or unnecessary in the sense that they do not significantly improve the maintainability of the code. This does mean you should not use them where the code is effectively the same or barely improved.

If Microsoft didn't define the standard and these were features that they added to a non-Microsoft language, I would say your boss might have a point. However, since Microsoft defines the language and uses these very features in implementing significant parts of .NET 3.5 (and 4.0), I'd say that you'd be foolish to ignore them. You may not choose to use some of them -- var, for instance, may not be acceptable in all environments due to coding standards -- but a blanket policy of avoiding new features seems unreasonable.
The trickier bit is when should you start using new features, because they can be confusing and may delay development. In general, I choose to use new language features and platform elements on new projects. I often avoid using them on projects that are currently in development when the feature/framework enhancement comes out, deferring until the next project. On a long project, I might introduce them at a significant milestone if the amount of rearchitecting is small or the feature is worth the changes. Normally, I'd wait until the project is due for significant changes anyway and then evaluate if refactoring to newer features is warranted.

The jury is still out on the long-term consequences of some features, but if their main rationale is 'it is confusing to other developers' or something similar, then I would be concerned about the quality of the talent.

I like C# 3.0 features, especially lambda expressions, auto-implemented properties and, in suitable cases, implicitly typed local variables (the var keyword), but when my boss found out that I was using them, he asked me not to use any C# 3.0 features at work. I was told that these features are not standard, that they are confusing for most developers, and that their usefulness is doubtful.
He's got a point.
Following that line of thought, let's make a rule against generic collections since List<T> doesn't make any sense (angle brackets? wtf?).
While we're at it, let's eliminate all interfaces (when are you ever gonna need a class without any implementation?).
Hell, let's go ahead and eliminate inheritance since it's so tricky these days (is-a? has-a? can't we all just be friends?).
And use of recursion is grounds for dismissal (Foo() invokes Foo()? Surely you must be joking!).
Errrm... back to reality.
It's not that C# 3.0 features are confusing to programmers, it's that the features are confusing to your boss. He's familiar with one technology and stubbornly refuses to part with it. You're about to enter the Twilight Zone Blub Paradox:
Programmers get very attached to their favorite languages, and I don't want to hurt anyone's feelings, so to explain this point I'm going to use a hypothetical language called Blub. Blub falls right in the middle of the abstractness continuum. It is not the most powerful language, but it is more powerful than Cobol or machine language.
And in fact, our hypothetical Blub programmer wouldn't use either of them. Of course he wouldn't program in machine language. That's what compilers are for. And as for Cobol, he doesn't know how anyone can get anything done with it. It doesn't even have x (Blub feature of your choice).
As long as our hypothetical Blub programmer is looking down the power continuum, he knows he's looking down. Languages less powerful than Blub are obviously less powerful, because they're missing some feature he's used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.
When we switch to the point of view of a programmer using any of the languages higher up the power continuum, however, we find that he in turn looks down upon Blub. How can you get anything done in Blub? It doesn't even have y.
C# 3.0 isn't hard. Sure, you can abuse it, but it isn't hard or confusing to any programmer with more than a week of C# 3.0 experience. Your boss's skills have just fallen behind and he wants to bring the rest of the team down to his level. DON'T LET HIM!
Continue using anonymous funcs, the var keyword, auto-properties, and what have you to your heart's content. You won't lose your job over it. If he gets pissy about it, laugh it off.

Like it or not, if you plan on using LINQ in any situation, you're going to have to utilize some of the C# 3.0 language specs.
Your boss is going to have to warm up to them if he wants to utilize the feature sets you get from 3.5, which are numerous and worth your time investing in.
Also, from my experience in leading teams, I've found that using the 3.0 specs has actually helped the devs' readability and understanding of the code base. There's about a week's worth of time spent by the dev trying to understand what the syntax means, but once they get it, they much prefer the new way over the old way.

Perhaps you can do a presentation once a week on each feature to everyone and get some of the developers on your side to help convince management of the benefits.
I recently moved from a bleeding-edge C# house to a C# house that was running mostly on .NET 1.1 and some 2.0 projects, using mostly only 1.1 features. Luckily, management stays away from the code. Most of the developers love all the new features in the newer frameworks; they just don't have the time or inclination to figure them out by themselves. Once I managed to show them how they can make their own lives easier, they started using them by themselves, and we have migrated several projects to gain the new language features and better tooling.

Some people are just afraid of change, because maybe you'll make them all look stupid using fancy new technologies. It could also be that your boss doesn't want the team learning new things instead of getting work done the old-fashioned way.
The var keyword can certainly be abused, but in most cases it reduces redundant code. LINQ is the main thing you want from .NET 3.5 because of the huge saving in the amount of code you have to write. Your boss should be encouraging you to use it. Also, the base class libraries now take delegates as parameters, so you will be limiting yourself a lot by not using them. Lambdas are just some fancy syntactic sugar to make delegates cleaner.
I would refer you to Effectively Integrating into Software Development Teams and Leading by Example. Two really great articles on how to deal with teams that are afraid of change.
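The "lambdas are just a cleaner way to write delegates" point can be sketched in code. C# is not used here; the example below is a Java analogue, where the verbose form is an anonymous class and the compact form is a lambda (in C# the corresponding pair would be an anonymous method written with the delegate keyword and a lambda expression). The class and method names (LambdaSugarSketch, sortVerbose, sortWithLambda) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration: the same callback written two ways.
public class LambdaSugarSketch {

    // Older style: the comparison callback is spelled out as an anonymous class.
    public static List<String> sortVerbose(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });
        return copy;
    }

    // Lambda style: same ordering by length, with far less ceremony.
    public static List<String> sortWithLambda(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort((a, b) -> Integer.compare(a.length(), b.length()));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("lambda", "var", "linq");
        System.out.println(sortVerbose(names));    // prints [var, linq, lambda]
        System.out.println(sortWithLambda(names)); // prints [var, linq, lambda]
    }
}
```

Both methods produce the same result; the only thing that changes is how much code a reader has to get through to see the actual comparison.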

Real-World QVT

QVT (Query View Transformation) is an OMG specification of a Model-to-Model transformation language. Some tools already implement it (Eclipse, androMDA). I'm wondering whether it is really used in real-world cases. Will it ever take off and be used to tackle real-world problems? Is anybody using the QVT language?

From observing the MDD community for our own projects, I'd guess that QVT will eventually pick up. Currently ATL and Kermeta seem to be very popular, and from looking at the postings in the groups not only in academia.
There's an implementation of Declarative QVT now (see the M2M Eclipse group for the announcement), which will be very interesting for us. We've been using the ModelMorf prototype, but it was a prototype and had a very long turnaround time. I hope that with the integration of dQVT into the Eclipse tool chain we'll be able to use it for our own projects (a SoftEng tool, see http://rcos.iist.unu.edu, sorry, academic of course :).
I guess the pain of doing Model-driven development by hand/with man-power is not high enough yet...once the tools really increase the order of magnitude of productivity, that'll change.

It seems QVT is being used for Model Driven Security applications. It is a good choice because of its clearly defined semantics and provability. This is still research, however. France Telecom is experimenting with QVT. They want to use it for database migrations and a generative approach for applications.
http://smartqvt.elibel.tm.fr/events/QVT%20Experimentations%20at%20France%20Telecom.pdf
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4159881
