Related
I have been using CDH and HDP for a while (both in the pseudo-distributed mode) on a VM as well as installing natively on Ubuntu. Although my question is probably relevant to all Projects within the Apache Hadoop Ecosystem, let me ask this specifically in the context of Avro.
What is the best way to go about figuring out what the different packages and the classes within the packages do. I usually end up referring to the Javadoc for the project (Avro in this case) but the overviews for packages and classes end up being awfully inadequate.
For e.g. Take two of the Avro packages: org.apache.avro.specific and org.apache.avro.generic These are used for creating Specific and Generic Readers and Writers (respectively) but I'm not a 100% sure what these are for. I have used the Specific Package for in cases when I have used Avro Code Generation and the Generic ones when I don't want to use code generation. However, I am not sure if that is the only reason for using one vs. the other.
Another example: The Encoder\Decoder Classes are used for low-level SerDe, the DatumReader\DatumWrite for a "medium-level" Serde while most application layer interactions with Avro will probably use Generic\Specific Readers\Writers. Without having struggled through the pain of using these classes, how is a user to know what to use for what?
Is there a better way to get a good overview of each package (clearly the javadoc is not well documented) and the classes within the package?
PS: I have similar questions for essentially all other Hadoop Projects (Hive, HBASE etc.) - the Javadocs seem to be grossly inadequate overall. I just wonder what other developers end up doing to figure these out.
Any inputs would be great.
I download the source code and skim through it to get the idea what it does. If there is javadoc, I read that too. I tend to concentrate on the interfaces that I need and move on from there, that way I put everything into context and it makes it easier to figure out the usage. I use the call hierarchy and the type hierarchy views a lot.
These are very general guidelines, and ultimately it is the time you spend with the project that will make you understand it.
Hadoop ecosystem is quickly growing and changes are introduced on monthly bases. that's why javadoc is not so good. Another reason is that hadoop software tends to lean towards the infrastructure and not towards the end user. People developing tools will spend time learning the APIs and internals while everybody else is kinda supposed to be blissfully ignorant of all those, and just use some high level domain specific language for the tool.
I've seen it in the context of classes. I suspect it means that the class could use being broken down into logical subunits, but I can't find a good definition. Could you give some examples?
Thanks for the help.
Edit: I love the smart replies, but I'm obviously referring to "monolithic" within a software context. I know about monoliths, megaliths, dolmens, and all the stone-related contexts. Gee, I have enough of them in my country...
Interesting question. I don't think there are any formal definitions of what a monolithic class is, but you've got the idea. A class that contains multiple components that are logically unconnected, or pointlessly coupled, is a monolithic class.
If you've read The Pragmatic Programmer, which I strongly recommend, you can define a monolithic class as an anti-pattern that goes against almost everything from that book.
As for examples, you'll find more in the realm of chip and OS design, where there are formal definitions of monolithic chips/kernels, which are similar to a monolithic class. Here are some examples, although each of them can be argued against being on this list:
JOGL - Java bindings for OpenGL. This could be arguable, and with good reason.
Most academic projects - For obvious reasons.
If you started programming alone, rather than joining a team, then chances are you can open one of your first projects, and there will be a class that is monolithic.
If you look up the etymology of the word you'll see it comes from the Greek monos (single) and lithos (stone). In the context of software as you mention it, it describes a single-tiered application in which the code for the user interface and the data access are combined into a single program from a single platform.
"Monolithic" is a term that has been used to flame succesful software. This link exposes the assumptions inherent in the term, and their limited usefulness.
The basic assumption is that a system works better if it is built from software components that each have an individual, well-defined task. Intuitively, this seems right. If each component works, the entire system must work, right?
In reality, it's not that easy. A larger, compositional (non-monolithic) system can miss a critical function, even when there is no single component to blame. This happens when the architectural design fails to allocate a function to any specific component. This can happen especially if it's a function which doesn't map cleanly to a single component.
Now Linux (to continue with the linked example) in reality is not monolithic. It has a modular userspace on top of a monolithic kernel, a userspace that comes with many separate utilities. Except when it doesn't.
My definition of a Monolithic design in software development, is a design which requires additional functionality to be added to a single indivisible block of code.
PRO:
Everything is in one place, and therefore easy to find
Can be simpler, given there less relations to consider (can also be more complex see cons)
CONS:
Over time as functionality is added the complexity of the system may exponentially increase, to the point new features are extremely hard or impossible to implement
Can make it difficult for multiple developers to work with e.g Entity Framework EDMX files have the entire database in a single file which can be extremely difficult for multiple developers to work on.
Reduced re-usability, by definition it does not have smaller components which can be then reused and re-purposed to solve other problems, unless a complete copy of the code is made and then modified.
A monolithic architecture is a model of software structure which is created as one piece where all Rails tools (ActionMailer, ActiveJob, ActionCable, etc.) can be gathered together with the code that these tools applies. The tools are not connected with each other but they are also not autonomous.
If one feature needs changes, it will influence the work of the whole process and other features because they are parts of one process.
Let’s recall what Ruby on Rails is, what it can offer, its pros and cons. Its most important benefit is that it is easy to work with.
If you write rails new you immediately get a new application at once, then you can create any REST API you want and use Rails helpers and generators, which makes development even easier.
If you need to send emails in your Rails app, then use Rails ActionMailer. When you need to do some hard processing, ActiveJob will help you. With Rails 5 you will also be able to use websockets out of the box. Thus, it will be easy to create chats or make your application more interactive.
In case you use correct DSL syntax, you can use all that and even more immediately. Moreover, you don’t have to know everything about the internal implementation of these tools, consider it’s DSL, and receive the expected result.
It means something is the opposite of modular. A modular application can have parts, referred to as modules, replaced without requiring replacement of the entire application. Whereas a monolithic application, after having a part fixed or upgraded, must be replaced in it's entirety.
From Wikipedia: "Modularity is desirable, in general, as it supports reuse of parts of the application logic and also facilitates maintenance by allowing repair or replacement of parts of the application without requiring wholesale replacement."
So in the context of a monolithic class, all its features are self-contained and if you want to add or alter a feature to the class you would need to alter/add code in the class and recompile it. Conversely a modular class exposes access to functionality which is implemented externally. For example a "Calculator" class may use a separate "Add" class for actually adding numbers; call a "Multiply" function from a separate library; or even call an "Amortize" function from a web service. As long as the each of these functional parts can be altered externally from the class, it is modular.
I am in a Web Scripting class at school and am working on my first assignment. I tend to overdo things and delve deeper into my subject than what is required in my classes. Right now I am researching CGI.pm to do my HTTP requests and it says there are two programming styles for CGI.pm:
An object-oriented style
A function-oriented style
Unless I overlooked the clear answer or am not knowledgeable enough to discern the answer for myself from the documentation provided at: http://perldoc.perl.org/CGI.html I just don't know what the pros and cons are of using these two different styles.
With that being said what are the pros and cons of using the two different styles? Which one is more commonly used? As far as using object-oriented style it says I can only use one CGI object at the time. Why is that?
Thanks for all your help. You have all made studying Computer Science very enjoyable, satisfying, and rewarding for me. =D
Behind the scenes, CGI.pm is doing the same thing despite the styles. The functional interface actually uses a secret object that you don't see.
For many small-scale CGI projects, you're probably never going to need more than one CGI object at a time, so the functional interface is fine. This might be the more common style, but only because most people make small scripts for very specific tasks. If you have a lot of other stuff going on, you might not like CGI.pm importing a long list (and it is long) of function names into your script. Some of the function names might clash with those other modules want to import.
I, however, always use the object-oriented interface. I don't have to worry about name collisions, and it's apparent where any method came from since you see its object. It's also easy to pass the object as arguments to other parts of large applications, etc.
Some people might complain about the extra typing, but that's never been the slow part of programming for me. I've been doing Perl for a long time and I don't mind the syntax. However, I only use CGI to get the input and maybe send the output. I don't mess with any of the HTML stuff.
When it talks about one CGI.pm object at a time, it's referring to access to the input. Once you've read STDIN, for instance, another CGI.pm object won't be able to read that. You can have as many objects as you like though. They just won't share data and the first one gets all of POST data.
You can actually use a mixture though. You can import some things, like :html, but still use the OO interface to deal with the input.
I strongly recommend using the object interface.
Will it be absolutely required for your classwork? No, in fact it is arguably overkill for even small production projects.
However, if you are serious about learning to use CGI.pm for larger scale projects you will need to learn the object method. If you reach the point of needing two objects you will have to use the object interface. Programming, like most everything else, gets better with practice. Practicing now on relatively easier problems will help you be ready for more complex ones.
In fact I'd recommend it as a general rule in programming (although there are exceptions) that if faced with two methods of using a particular tool making a habit of using the one most likely to be used in production code and/or the one that is the correct answer for more of the problem space.
I recently wrote a small tool to generate a class for each tier I hand write for the boring "forms over data" work where I spend almost 90% of my time (depressing I know) ... more on this as the economy improves ;)
My question is this - will using this tool instead of hand typing all this code from day to day actually hurt me as a developer? I feel like I will always be making changes to this tool and thus I "should" stay on top of the patterns used/ choices made/ etc... but some small part of me feels like I might lose my edge ... am I wrong?
If the tool can spit the code out without thought, then it probably saves you lots of thoughtless typing.
Writing the tool in the first place requires thinking, so I'd guess you'd be more "on the edge" maintaining and writing the tool.
That's good! Of course writing a tool to do all the job for you is impossible and wrong.
But automating repeatable tasks is always good - and sometimes writing specific types of code is repeatable.
It is even encouraged in the "Pragmatic Programmer" book.
Make sure that in the source control you have checked in a code generator and not its output (unless you have to modify the code later by hand)!
You are most definitely not wrong. I use code generators anywhere I can - I currently use CodeSmith to create my DAO's by looking at the database.
What edge are you afraid of losing? In my mind going to code generation is actually giving you an edge.
Larry Wall (of Perl fame) describes the three cardinal virtues of programming as Laziness, Impatience, and Hubris.
Congratulations! You have shown good laziness, in that you have identified some work you can pass off to an automated process and done so. (Bad laziness leads to cutting corners, procrastination, and generally postponing rather than eliminating work.) If you can successfully palm off some work onto another program, you are spending less time on annoying triviality and more on accomplishing things and learning.
Generate what you can. Code generation is one of the best tools I've picked up over the last 2 or 3 years. Typing the same code over and over (or copy and pasting it) is prone to error.
Spending less time doing something by having something/someone else do it, and more time researching better ways to do it will generally lead to doing it in a better way.
This doesn't have to just apply to programming....
Your code generator (at least in principle - I haven't looked at it myself) is The Right Thing, at least as far as it goes.
The next step would be to see whether you can, instead of generating all this redundant code, create a base class whose functionality matches the generated code and then derive your application code from it. Using inheritance rather than generation will allow you to benefit from improvements without needing to re-run the generator on all your projects. Perhaps more importantly, if you customize the generated code, the customizations would be lost if you re-run the generator, but customizations in a derived class will be preserved when the base class is changed.
No. Why do you think IDE's are so popular. Imagine if all the people who use Visual Studio had to programmatically create the GUI's without help from the IDE, it'd be terrible. I would be willing to bet most people who use VisualStudio won't know how to manualy create the forms they're creating in the IDE. But there's nothing wrong with that.
I believe in code generation wherever possible to remove the rote tasks of programming. You will not lose your edge, you will probably become a better programmer because you will spend more time working on the important and interesting stuff.
BTW, your tool sounds interesting. Have you released it anywhere?
Code generation is fine as long as you understand what you are generating. Physicists use calculators because they understand the formulas they are automating and realize that their precious time is better spent on important tasks.
Code generation is one of those invaluable DO:s that The Pragmatic Programmer advocates. I truly recommend that book. Here's a Pragmatic Programmer quick ref.
Its almost hypocritical not to code generate. Here we are automating all of these tasks that were traditionally done by hand... and yet many of us still hand crank all of our code, even if it can be easily generated.
My only experience with code generation is the macros of Common Lisp. They are used all the time. Everything that automats repetitive tasks is beneficial; that is what programming is about.
Read the story of Mac.
Imagine that each time you made a change to the tool and regenerated your code, that you made that design change by hand on all of your modules.
Since I've started generating code and gotten up to speed, I've found that I rarely get bugs in the generated code.
I find that writing code gen does help me learn the nuances of good architecture. You start seeing common patterns as opposed to a narrow view of your design. That said, don't use code gen as a substitute for good object-oriented code, and don't love your code gen so much you ignore new technologies. For example, if you're in .NET and are writing code-gen for data access, you'd better have a good excuse for not using Linq to SQL or NHibernate. Similarly, Dynamic Data can help in many forms-on-data scenarios. So, my advice: spike new stuff and code gen as needed.
My 2cents on code gen is that it is also critical for use in refactoring. I have found that partial classes and a good file comparison utility (Araxis or BeyondCompare) are essential.
Keep your generated code in one file and the custom Tweaks you made for that class in another file.
This practice will allow you to make those comprehensive framework changes implemented quickly and will also help you move to a new paradigm while easily being able to save your custom logic.
CodeSmith FTW!
While build servers are great to make sure all your code compiles, it doesn't address the differences in signatures with your stored procs or the like. If you routinely run the code gen you can more easily identify when those changes occur. A unit test will tell you the SP is wrong, code gen will tell you how to make it right.
i have a big web application running in perl CGI. It's running ok, it's well written, but as it was done in the past, all the html are defined hardcoded in the CGI calls, so as you could imagine, it's hard to mantain, improve and etc. So now i would like to start to add some templating and integrate with a framework (catalyst or CGI::application). My question is: Somebody here has an experience like that? There is any things that i must pay attention for? I'm aware that with both frameworks i can run native CGI scripts, so it's good because i can run both (CGI native ad "frameworked" code) together without any trauma. Any tips?
Write tests first (for example with Test::WWW::Mechanize). Then when you change things you always know if something breaks, and what it is that breaks.
Then extract HTML into templates, and commonly used subs into modules. After that it's a piece of cake to switch to a framework.
In general, go step by step so that you always have a working application.
Extricate the HTML from the processing logic in the CGI script. Identify all code that affects the HTML output, as these are candidates for becoming template variables. Separate that into a HTML file, with the identified parts marked with template variables. Eventually you will be able to refactor the page such that all processing is done at the start of the code and the HTML template just called up at the end of all processing.
In this kind of situation, rewriting from scratch basically, the old code is useful for A) testing, and B) design details. Ideally you'd make a set of tests, for all the basic functionality that you want to replicate, or at least tests that parse the final result pages so you can see the new code is returning the same information for the same inputs.
Design details within the code might be useless, depending on how much the framework handles automatically. If you have a good set of tests, and a straightforward conversion works well, you're done. If the behavior of the new doesn't match the old, you probably need to dig deeper into the "why?" and that'll probably be something odd looking, that doesn't make sense at first glance.
One thing to remember to do first is, find out if anyone has made something similar in the framework you're using. You could save yourself a LOT of time and money.
Here is how I did it using Python instead of Perl, but that should not matter:
Separated out HTML and code into distinct files. I used a template engine for that.
Created functions from the code which rendered a template with a set of parameters.
Organized the functions (which I termed views, inspired by Django) in a sensible way. (Admin views, User views, etc.) The views all follow the same calling convention!
Refactored out the database and request stuff so that the views would only contain view specific code (read: Handling GET, POST requests, etc. but nothing low-level!). Relied heavily on existing libraries for that.
I am here at the moment. :-) The next obvious step is of course:
Write a dispatcher which maps URLs to your views. This will also lead to nicer URLs and nicer 404- and error handling of course.
One of the assumptions that frameworks make is that the urls map to the code. For example in a framework you'll often see the following:
http://app.com/docs/list
http://app.com/docs/view/123
Usually though the old CGI scripts don't work like that, you're more likely to have something like:
http://app.com/docs.cgi?action=view&id=123
To take advantage of the framework you may well need to change all the urls. Whether you can do this, and how you keep old links working, may well form a large part of your decision.
Also frameworks provide support for some sort of ORM (object relational mapper) which abstracts the database calls and lets you only deal with objects. For Catalyst this is usually DBIx::Class. You should evaluate what the cost of switching to this will be.
You'll probably find that you want to do a complete rewrite, with the old code as a reference platform. This may be much less work than you expect. However start with a few toy sites to get a feel for whichever framework/orm/template you decide to go with.