I have been using CDH and HDP for a while (both in the pseudo-distributed mode) on a VM as well as installing natively on Ubuntu. Although my question is probably relevant to all Projects within the Apache Hadoop Ecosystem, let me ask this specifically in the context of Avro.
What is the best way to go about figuring out what the different packages, and the classes within them, do? I usually end up referring to the Javadoc for the project (Avro in this case), but the overviews for packages and classes end up being awfully inadequate.
For example, take two of the Avro packages: org.apache.avro.specific and org.apache.avro.generic. These are used for creating Specific and Generic Readers and Writers (respectively), but I'm not 100% sure what these are for. I have used the Specific package in cases where I have used Avro code generation, and the Generic one when I don't want to use code generation. However, I am not sure if that is the only reason for using one vs. the other.
Another example: the Encoder/Decoder classes are used for low-level SerDe, the DatumReader/DatumWriter for a "medium-level" SerDe, while most application-layer interactions with Avro will probably use Generic/Specific Readers/Writers. Without having struggled through the pain of using these classes, how is a user to know what to use for what?
Is there a better way to get a good overview of each package (clearly the javadoc is not well documented) and the classes within the package?
PS: I have similar questions for essentially all other Hadoop Projects (Hive, HBASE etc.) - the Javadocs seem to be grossly inadequate overall. I just wonder what other developers end up doing to figure these out.
Any inputs would be great.
I download the source code and skim through it to get an idea of what it does. If there is Javadoc, I read that too. I tend to concentrate on the interfaces that I need and move on from there; that way I put everything into context, which makes it easier to figure out the usage. I use the call hierarchy and the type hierarchy views a lot.
These are very general guidelines, and ultimately it is the time you spend with the project that will make you understand it.
The Hadoop ecosystem is growing quickly and changes are introduced on a monthly basis. That's why the Javadoc is not so good. Another reason is that Hadoop software tends to lean towards the infrastructure and not towards the end user. People developing tools will spend time learning the APIs and internals, while everybody else is kind of supposed to be blissfully ignorant of all those and just use some high-level domain-specific language for the tool.
Now firstly I realise the title is extremely broad, so let me describe the use case.
Background:
I'm currently teaching myself Scala+Gradle (because I like the flexibility and power of Gradle and its much more legible build files).
When learning a new language, it's often best to build applications that you can actually use. Being primarily a PHP (with Symfony) programmer and formerly a Java programmer, I know many patterns that could carry across from both paradigms.
Use Case:
I'm writing an application where I am experimenting with a Provider+Interface (trait) layout. The goal is to define traits that encompass all the expected functionality for any particular type of component, e.g. a ConfigReaderTrait with a YamlConfigReader as a provider. Theoretically, the advantage of this would be to allow me to switch out core mechanisms or even architectural components with minimal effort, which allows for a great deal of R&D and experimenting.
PHP Symfony Influence
Now currently I work as a pure PHP dev, and as such have been introduced to Symfony, which has a brilliant Dependency Injection framework where the dependencies are defined in YAML files and can be delegated to subdirectories. I like this because, unlike with SBT, I am unfazed by using different languages for different purposes (e.g. Groovy with Gradle for build scripts) and I want to maintain a separation of concerns.
That is, each type of interface/trait or bundle of related functionality should be able to have its own DI config, and I would prefer it to be separate from the Scala code itself.
Now for Scala....
Obviously things are not the same across different languages, and if you don't embrace the differences you may as well go back to the previous language and leave things at that.
That said, I am not yet convinced by the DI frameworks I see for scala.
Guice, for example, is really a modified Java framework (which is fine because Scala can use Java libs, but because the two languages don't follow entirely the same paradigm, it feels as though Scala's capabilities are not leveraged).
MacWire annoyed me a bit, because you have to define the dependencies in the files where you use them, which does not assist in my interface/provider concept.
SubCut so far seems to be the best suited to what I would expect.
SubCut so far seems to be the best suited to what I would expect.
But while going through all of this (and bear in mind this is all in the research phase, I haven't used any of them yet) it seemed that DI in Scala is still very scattered and in its infancy. By that I mean that there are different implementations with different applications, but not one flexible enough or powerful enough to compare to Symfony's DI, particularly not for my application.
Comments? Thoughts?
My 5 cents:
I have actually stopped using dependency injection frameworks after switching to Scala from Java.
The language allows for a few nice ways of doing it without a framework (multiple parameter lists and currying, as well as mixins for doing injection the way the 'cake pattern' does), and I find myself more and more just using constructor or method parameter based injection, as it clearly documents what dependencies a given piece of logic has and where it got those dependencies from.
It's also fairly easy to create different modules (sets of implementations, or factories for implementations) using Scala objects and then select between those at runtime. This gives you the guarantee that the code won't compile unless there is an implementation available, as opposed to the big frameworks in Java-land that will fail at runtime, effectively pushing a compile-time problem into runtime.
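As a rough illustration of the constructor-injection and "module" ideas above, here is a minimal sketch. It is written in Python rather than Scala, so the compile-time guarantee mentioned above does not carry over; a type checker would be needed for anything similar. All names (ConfigReader, JsonConfigReader, InMemoryConfigReader, Greeter, the two module functions) are hypothetical and only loosely echo the ConfigReaderTrait idea from the question.

```python
import json
from typing import Callable, Dict, Protocol


class ConfigReader(Protocol):
    """The 'trait': anything with a read(key) method qualifies."""

    def read(self, key: str) -> str: ...


class JsonConfigReader:
    """One provider: reads values from a JSON document."""

    def __init__(self, text: str) -> None:
        self._data = json.loads(text)

    def read(self, key: str) -> str:
        return self._data[key]


class InMemoryConfigReader:
    """Another provider: values passed directly, handy for experiments."""

    def __init__(self, **values: str) -> None:
        self._data = dict(values)

    def read(self, key: str) -> str:
        return self._data[key]


class Greeter:
    """The dependency is visible in the constructor signature."""

    def __init__(self, config: ConfigReader) -> None:
        self._config = config

    def greet(self) -> str:
        return f"Hello, {self._config.read('name')}"


# "Modules" are plain functions that do the wiring by hand.
def production_module() -> Greeter:
    return Greeter(JsonConfigReader('{"name": "prod"}'))


def testing_module() -> Greeter:
    return Greeter(InMemoryConfigReader(name="test"))


# Selecting a module at run time is an ordinary dictionary lookup.
MODULES: Dict[str, Callable[[], Greeter]] = {
    "prod": production_module,
    "test": testing_module,
}
```

Calling MODULES["test"]().greet() wires up the in-memory provider; swapping providers means changing one wiring function, with no container or reflection involved.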
This also removes the 'magic' of how dependencies are created and wired (reflection, runtime weaving, macros, XML, binding context to thread local etc). I think this makes it much easier for new developers to jump into a project and understand how the codebase is interconnected.
Regarding declaring implementations in non-code formats like XML: I have found that projects rarely or never change those files without making a new release, so they might as well be code, with all the benefits that brings (IDE support, performance, type checking).
I've recently been learning about inversion of control through dependency injection, and using Castle Windsor. I like it. I get it. The lightbulb over my head is burning brightly. But I have a nagging concern about the logging facility and the ILogger interface. What is that really doing there? Is it for me, or just for Windsor itself?
Since ILogger is intended to abstract away the differences between log4net and NLog and whatever other logging frameworks it supports, it has to represent the lowest common denominator between them. The various frameworks are similar, but not necessarily identical. If there were some fantastic feature in one, but not in the other, it would either have to be left out of ILogger, or the ILogger implementations for the other logging frameworks would have to have no-op implementations of it, or something else that's not very satisfying.
Long before Windsor, I was a fan of log4net, and a lot of my favorite libraries use it, like NHibernate. So if I'm building a new application, I'll use log4net. I'm willing to commit to it, and I consider it a stable dependency -- as stable a dependency as needing System.Web for example. I would not write my components to use ILogger, I would write them to use ILog. But I get the impression that Windsor expects me to use ILogger for my own logging. Isn't that saddling my project with a dependency on Windsor, when I shouldn't have any dependency on my IoC container?
I see the point of Windsor having the logging facility so it can log its own operations using whichever logging framework the project wants to use. That seems perfectly sensible. But if I don't use ILogger for my own code, and just go straight to log4net's ILog, what am I giving up? Will I regret this?
The obvious response is that I might want to change logging frameworks in six months. But I won't. log4net is mature and stable. It's a project with a limited and very well-defined scope, which it implements nearly perfectly. It can be considered "finished". At most, I might need to write a custom appender to handle messages. (Maybe I want to write them onto a postcard and drop them in the mail for some reason.) But that's easily done within the log4net framework, and I would use it just like any other log4net appender. I would be no more likely to change logging frameworks than I would be to change web platforms.
It seems as if you have already made your mind up to use log4net directly. This is perfectly reasonable, as you consider it to be a stable dependency. I have just switched from log4net to NLog for our projects, so it does happen that you may change logging frameworks in future, which is where an abstraction has its advantages.
Another consideration when thinking about using a logging abstraction (other than losing functionality specific to a particular logging framework), is the extra overhead of learning the abstraction. Does this make the code more or less complex for developers to pick up?
In our case, we found NLog was so easy to install and configure directly, that we decided to lose our custom logging abstraction and switch from log4net (which we found a bit verbose in its xml configuration compared to NLog).
A repeating theme in my development work has been the use of or creation of an in-house plug-in architecture. I've seen it approached many ways - configuration files (XML, .conf, and so on), inheritance frameworks, database information, libraries, and others. In my experience:
A database isn't a great place to store your configuration information, especially co-mingled with data
Attempting this with an inheritance hierarchy requires knowledge about the plug-ins to be coded in, meaning the plug-in architecture isn't all that dynamic
Configuration files work well for providing simple information, but can't handle more complex behaviors
Libraries seem to work well, but the one-way dependencies have to be carefully created.
As I seek to learn from the various architectures I've worked with, I'm also looking to the community for suggestions. How have you implemented a SOLID plug-in architecture? What was your worst failure (or the worst failure you've seen)? What would you do if you were going to implement a new plug-in architecture? What SDK or open source project that you've worked with has the best example of a good architecture?
A few examples I've been finding on my own:
Perl's Module::Pluggable and IOC for dependency injection in Perl
The various Spring frameworks (Java, .NET, Python) for dependency injection.
An SO question with a list for Java (including Service Provider Interfaces)
An SO question for C++ pointing to a Dr. Dobbs article
An SO question regarding a specific plugin idea for ASP.NET MVC
These examples seem to play to various language strengths. Is a good plugin architecture necessarily tied to the language? Is it best to use tools to create a plugin architecture, or to do it on one's own following models?
This is not an answer as much as a bunch of potentially useful remarks/examples.
One effective way to make your application extensible is to expose its internals as a scripting language and write all the top level stuff in that language. This makes it quite modifiable and practically future proof (if your primitives are well chosen and implemented). A success story of this kind of thing is Emacs. I prefer this to the eclipse style plugin system because if I want to extend functionality, I don't have to learn the API and write/compile a separate plugin. I can write a 3 line snippet in the current buffer itself, evaluate it and use it. Very smooth learning curve and very pleasing results.
One application which I've extended a little is Trac. It has a component architecture which in this situation means that tasks are delegated to modules that advertise extension points. You can then implement other components which would fit into these points and change the flow. It's a little like Kalkie's suggestion above.
Another one that's good is py.test. It follows the "best API is no API" philosophy and relies purely on hooks being called at every level. You can override these hooks in files/functions named according to a convention and alter the behaviour. You can see the list of plugins on the site to see how quickly/easily they can be implemented.
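The "hooks named according to a convention" idea can be sketched in a few lines. This is not how py.test itself is implemented (it relies on the pluggy library); it is only a minimal illustration of the principle, and the hook name myapp_format_title and the two example plugins are made up for the purpose.

```python
import types


def call_hook(plugins, name, *args):
    """Call the hook `name` on every plugin that defines it.

    A plugin is any object (typically a module) that may or may not
    have a callable attribute with the hook's name. Plugins that do
    not implement the hook are silently skipped.
    """
    results = []
    for plugin in plugins:
        hook = getattr(plugin, name, None)
        if callable(hook):
            results.append(hook(*args))
    return results


# Two stand-in "plugins". In a real application these would be modules
# discovered on disk; SimpleNamespace keeps the sketch self-contained.
shout = types.SimpleNamespace(myapp_format_title=lambda title: title.upper())
quiet = types.SimpleNamespace()  # implements no hooks at all
```

Here call_hook([shout, quiet], "myapp_format_title", "hello") collects one result, from the only plugin that defines the hook. A plugin author writes a function with the right name and nothing else: no subclassing and no registration call.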
A few general points.
Try to keep your non-extensible/non-user-modifiable core as small as possible. Delegate everything you can to a higher layer so that the extensibility increases. That also leaves less to correct in the core if you make bad choices.
Related to the above point is that you shouldn't make too many decisions about the direction of your project at the outset. Implement the smallest needed subset and then start writing plugins.
If you are embedding a scripting language, make sure it's a full one in which you can write general programs and not a toy language just for your application.
Reduce boilerplate as much as you can. Don't bother with subclassing, complex APIs, plugin registration and stuff like that. Try to keep it simple so that it's easy, and not just possible, to extend. This will let your plugin API be used more and will encourage end users, not just plugin developers, to write plugins. py.test does this well. Eclipse, as far as I know, does not.
In my experience I've found there are really two types of plug-in Architectures.
One follows the Eclipse model which is meant to allow for freedom and is open-ended.
The other usually requires plugins to follow a narrow API because the plugin will fill a specific function.
To state this in a different way, one allows plugins to access your application while the other allows your application to access plugins.
The distinction is subtle, and sometimes there is no distinction... you want both for your application.
I do not have a ton of experience with the Eclipse model of opening up your app to plugins (the article in Kalkie's post is great). I've read a bit on the way Eclipse does things, but nothing more than that.
Yegge's properties blog talks a bit about how the use of the properties pattern allows for plugins and extensibility.
Most of the work I've done has used a plugin architecture to allow my app to access plugins, things like time/display/map data, etc.
Years ago I would create factories, plugin managers and config files to manage all of it and let me determine which plugin to use at runtime.
Now I usually just have a DI framework do most of that work.
I still have to write adapters to use third party libraries, but they usually aren't that bad.
One of the best plug-in architectures that I have seen is implemented in Eclipse. Instead of having an application with a plug-in model, everything is a plug-in. The base application itself is the plug-in framework.
http://www.eclipse.org/articles/Article-Plug-in-architecture/plugin_architecture.html
I'll describe a fairly simple technique that I have used in the past. This approach uses C# reflection to help in the plugin loading process. This technique can be modified so it is applicable to C++, but you lose the convenience of being able to use reflection.
An IPlugin interface is used to identify classes that implement plugins. Methods are added to the interface to allow the application to communicate with the plugin. For example the Init method that the application will use to instruct the plugin to initialize.
To find plugins the application scans a plugin folder for .Net assemblies. Each assembly is loaded. Reflection is used to scan for classes that implement IPlugin. An instance of each plugin class is created.
(Alternatively, an Xml file might list the assemblies and classes to load. This might help performance but I never found an issue with performance).
The Init method is called for each plugin object. It is passed a reference to an object that implements the application interface: IApplication (or something else named specific to your app, eg ITextEditorApplication).
IApplication contains methods that allow the plugin to communicate with the application. For instance, if you are writing a text editor, this interface would have an OpenDocuments property that allows plugins to enumerate the collection of currently open documents.
This plugin system can be extended to scripting languages, eg Lua, by creating a derived plugin class, eg LuaPlugin that forwards IPlugin functions and the application interface to a Lua script.
This technique allows you to iteratively implement your IPlugin, IApplication and other application-specific interfaces during development. When the application is complete and nicely refactored you can document your exposed interfaces and you should have a nice system for which users can write their own plugins.
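The answer above describes the technique in C# terms. As a loose analogue, here is a minimal Python sketch of the same steps: scan a folder, load each file, find the plugin classes by introspection, instantiate them and call their init method with the application object. The Application class, the IS_PLUGIN marker and the function name load_plugins are assumptions made for this sketch; Python has no direct equivalent of "implements IPlugin", so an opt-in class attribute stands in for the interface check.

```python
import importlib.util
import inspect
from pathlib import Path


class Application:
    """Hypothetical stand-in for the IApplication interface."""

    def __init__(self):
        self.open_documents = []


def load_plugins(folder, app):
    """Load every .py file in `folder` and initialise its plugins.

    A class counts as a plugin if it is defined in the loaded file and
    sets IS_PLUGIN = True (an assumption of this sketch). Each plugin
    is instantiated and its init(app) method is called.
    """
    plugins = []
    for path in sorted(Path(folder).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for _, cls in inspect.getmembers(module, inspect.isclass):
            defined_here = cls.__module__ == module.__name__
            if defined_here and getattr(cls, "IS_PLUGIN", False):
                plugin = cls()
                plugin.init(app)
                plugins.append(plugin)
    return plugins
```

A plugin file then only needs a class with IS_PLUGIN = True and an init(self, app) method. Note that loading a file executes its code, so the plugin folder has to be trusted.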
I once worked on a project that had to be so flexible in the way each customer could set up the system that the only good design we found was to ship the customer a C# compiler!
If the spec is filled with words like:
Flexible
Plug-In
Customisable
Ask lots of questions about how you will support the system (and how support will be charged for, as each customer will think their case is the normal case and should not need any plug-ins), as in my experience:
The support of customers (or front-line support people) writing Plug-Ins is a lot harder than the Architecture.
Usually I use MEF. The Managed Extensibility Framework (or MEF for short) simplifies the creation of extensible applications. MEF offers discovery and composition capabilities that you can leverage to load application extensions.
If you are interested read more...
In my experience, the two best ways to create a flexible plugin architecture are scripting languages and libraries. These two concepts are in my mind orthogonal; the two can be mixed in any proportion, rather like functional and object-oriented programming, but find their greatest strengths when balanced. A library is typically responsible for fulfilling a specific interface with dynamic functionality, whereas scripts tend to emphasise functionality with a dynamic interface.
I have found that an architecture based on scripts managing libraries seems to work the best. The scripting language allows high-level manipulation of lower-level libraries, and the libraries are thus freed from any specific interface, leaving all of the application-level interaction in the more flexible hands of the scripting system.
For this to work, the scripting system must have a fairly robust API, with hooks to the application data, logic, and GUI, as well as the base functionality of importing and executing code from libraries. Further, scripts are usually required to be safe in the sense that the application can gracefully recover from a poorly-written script. Using a scripting system as a layer of indirection means that the application can more easily detach itself in case of Something Bad™.
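The point about recovering gracefully from a poorly written script can be sketched as follows. This is only an illustration that uses Python itself as the scripting language: run_script is a hypothetical helper, and the name app under which the script sees the application object is an assumption. Catching exceptions lets the host carry on after a broken script, but it is not a security sandbox, because the script still runs with the full privileges of the host process.

```python
def run_script(source, api):
    """Run a plugin script, exposing `api` to it under the name `app`.

    Returns None on success, or a short error message if the script
    fails to compile or raises while running. Either way the host
    application keeps running.
    """
    namespace = {"app": api}
    try:
        code = compile(source, "<plugin script>", "exec")
        exec(code, namespace)
    except Exception as exc:
        return f"script failed: {exc!r}"
    return None
```

A host might call run_script(text, some_api_object) for each script and log any non-None result instead of crashing.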
The means of packaging plugins depends largely on personal preference, but you can never go wrong with a compressed archive with a simple interface, say PluginName.ext in the root directory.
I think you need to first answer the question: "What components are expected to be plugins?"
You want to keep this number to an absolute minimum or the number of combinations which you must test explodes. Try to separate your core product (which should not have too much flexibility) from plugin functionality.
I've found that the IoC (Inversion of Control) principle (read: Spring Framework) works well for providing a flexible base, to which you can add specialization to make plugin development simpler.
You can scan the container for the "interface as a plugin type advertisement" mechanism.
You can use the container to inject common dependencies which plugins may require (i.e. ResourceLoaderAware or MessageSourceAware).
The Plug-in Pattern is a software pattern for extending the behaviour of a class with a clean interface. Often the behaviour of classes is extended by class inheritance, where the derived class overrides some of the virtual methods of the class. A problem with this solution is that it conflicts with implementation hiding. It also leads to situations where derived classes become gathering places for unrelated behaviour extensions. Scripting is also used to implement this pattern, as mentioned above: "Make internals as a scripting language and write all the top level stuff in that language. This makes it quite modifiable and practically future proof". Another approach, also mentioned above, is to have scripts manage libraries: the scripting language allows high-level manipulation of lower-level libraries.
Is there a general procedure for programming extensibility capability into your code?
I am wondering what the general procedure is for adding extension-type capability to a system you are writing so that functionality can be extended through some kind of plugin API rather than having to modify the core code of a system.
Do such things tend to be dependent on the language the system was written in, or is there a general method for allowing for this?
I've used event-based APIs for plugins in the past. You can insert hooks for plugins by dispatching events and providing access to the application state.
For example, if you were writing a blogging application, you might want to raise an event just before a new post is saved to the database, and provide the post HTML to the plugin to alter as needed.
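A minimal sketch of such an event hook, using the blogging example above. The names EventBus, before_post_save and save_post are made up for illustration, and a plain list stands in for the database.

```python
from collections import defaultdict


class EventBus:
    """Maps event names to the plugin handlers subscribed to them."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event, handler):
        self._handlers[event].append(handler)

    def dispatch(self, event, payload):
        """Pass the payload through each handler in subscription order.

        Every handler returns the (possibly modified) payload, which is
        then handed to the next one.
        """
        for handler in self._handlers[event]:
            payload = handler(payload)
        return payload


def save_post(bus, html, database):
    """Core application code: raise the event, then store the result."""
    html = bus.dispatch("before_post_save", html)
    database.append(html)
    return html
```

A plugin would call bus.subscribe("before_post_save", my_filter) at start-up; the core code never needs to know which plugins exist.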
This is generally something that you'll have to expose yourself, so yes, it will be dependent on the language your system is written in (though often it's possible to write wrappers for other languages as well).
If, for example, you had a program written in C, for Windows, plugins would be written for your program as DLLs. At runtime, you would manually load these DLLs, and expose some interface to them. For example, the DLLs might expose a gimme_the_interface() function which could accept a structure filled with function pointers. These function pointers would allow the DLL to make calls, register callbacks, etc.
If you were in C++, you would use the DLL system, except you would probably pass an object pointer instead of a struct, and the object would implement an interface which provided functionality (accomplishing the same thing as the struct, but less ugly). For Java, you would load class files on-demand instead of DLLs, but the basic idea would be the same.
In all cases, you'll need to define a standard interface between your code and the plugins, so that you can initialize the plugins, and so the plugins can interact with you.
P.S. If you'd like to see a good example of a C++ plugin system, check out the foobar2000 SDK. I haven't used it in quite a while, but it used to be really well done. I assume it still is.
I'm tempted to point you to the Design Patterns book for this generic question :p
Seriously, I think the answer is no. You can't write extensible code by default; it will be both hard to write/extend and awfully inefficient (Mozilla started with the idea of being very extensible, used XPCOM everywhere, and later realized it was a mistake and started to remove it where it didn't make sense).
What makes sense is to identify the pieces of your system that can be meaningfully extended and support a proper API for these cases (e.g. language support plug-ins in an editor). You'd use the relevant patterns, but the specific implementation depends on your platform/language choice.
IMO, it also helps to use a dynamic language - makes it possible to tweak the core code at run time (when absolutely necessary). I appreciated that Mozilla's extensibility works that way when writing Firefox extensions.
I think there are two aspects to your question:
The design of the system to be extendable (the design patterns, inversion of control and other architectural aspects) (http://www.martinfowler.com/articles/injection.html). And, at least to me, yes these patterns/techniques are platform/language independent and can be seen as a "general procedure".
Now, their implementation is language and platform dependent (for example in C/C++ you have the dynamic library stuff, etc.)
Several 'frameworks' have been developed to give you a programming environment that provides you pluggability/extensibility but as some other people mention, don't get too crazy making everything pluggable.
In the Java world, a good specification to look at is OSGi (http://en.wikipedia.org/wiki/OSGi), which has several implementations, the best one IMHO being Equinox (http://www.eclipse.org/equinox/).
Find out what minimum requirements you want to put on a plugin writer. Then make one or more interfaces that the writer must implement for your code to know when and where to execute the code.
Make an API the writer can use to access some of the functionality in your code.
You could also make a base class the writer must inherit. This will make wiring up the API easier. Then use some kind of reflection to scan a directory, and load the classes you find that match your requirements.
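One way a required base class can make the wiring easier is sketched below. Note that this swaps in a different mechanism from the directory scan described above: instead of searching for classes by reflection, the base class records every subclass at the moment it is defined. PluginBase, its registry and the Reverse example are hypothetical names used only for this sketch.

```python
class PluginBase:
    """Base class every plugin must inherit from."""

    # Filled in automatically as subclasses are defined.
    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        PluginBase.registry[cls.__name__] = cls

    def run(self, text):
        """The one method the application calls on each plugin."""
        raise NotImplementedError


class Reverse(PluginBase):
    """Example plugin: importing this class is enough to register it."""

    def run(self, text):
        return text[::-1]
```

The application can then iterate over PluginBase.registry once the plugin modules have been imported; the import step itself could still be a directory scan.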
Some people also make a scripting language for their system, or implement an interpreter for a subset of an existing language. This is also a possible route to go.
Bottom line is: When you get the code to load, only your imagination should be able to stop you.
Good luck.
If you are using a compiled language such as C or C++, it may be a good idea to look at plugin support via scripting languages. Both Python and Lua are excellent languages that are used to script a large number of applications (Civ4 and Blender use Python, Supreme Commander uses Lua, etc).
If you are using C++, check out the Boost.Python library. Otherwise, Python ships with headers that can be used in C, and does a fairly good job documenting the Python/C API. The documentation seemed less complete for Lua, but I may not have been looking hard enough. Either way, you can offer a fairly solid scripting platform without a terrible amount of work. It still isn't trivial, but it provides you with a very good base to work from.