How to fine-tune GitHub Copilot?

We can fine-tune language models like BERT or GPT-3.
Can I fine-tune the GitHub Copilot model?
I have already looked through the examples at https://copilot.github.com/ but can't find the details.
I would really appreciate hearing from anyone who has fine-tuned GitHub Copilot.

The OpenAI API offers the "Davinci Codex" machine learning model on a pay-per-hit basis, similar to the non-coding version of the Davinci model.
OpenAI would have to enable the fine-tuning option for Davinci Codex as well. Once they do, you will be able to use it via API calls.
After checking that prerequisite, I think you could link the OpenAI API to your local installation of GitHub Copilot via some code changes; in theory that should be possible.
The first step is probably to fork the Copilot VSCode extension so that it calls the OpenAI Codex API (or to write an entirely custom extension that inserts text into your code).
Then you would point it to your fine-tuned version of the model. To learn about fine-tuning OpenAI models, look at their documentation:
https://beta.openai.com/docs/guides/fine-tuning
Note that they also have an openai CLI that lets you do most of the data-loading and fine-tuning tasks.
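For illustration, here is roughly what that flow looked like with the beta-era openai Python client (the pre-1.0 API this documentation describes; the key, file names, and model IDs below are placeholders):

    import openai

    openai.api_key = "sk-..."  # placeholder API key

    # Training data is a JSONL file of prompt/completion pairs, e.g.
    # {"prompt": "def add(a, b):", "completion": " return a + b"}
    upload = openai.File.create(file=open("train.jsonl", "rb"),
                                purpose="fine-tune")

    # Start a fine-tune against a base model that supports tuning.
    job = openai.FineTune.create(training_file=upload.id, model="davinci")

    # After the job finishes (poll until its status is "succeeded"),
    # fetch it again and query the resulting model by name.
    job = openai.FineTune.retrieve(job.id)
    completion = openai.Completion.create(model=job.fine_tuned_model,
                                          prompt="def fibonacci(n):",
                                          max_tokens=64)
    print(completion.choices[0].text)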
Unfortunately, at the moment you can only fine-tune the non-coding versions of OpenAI's models; hopefully they will make Codex available soon.

There does not seem to be a client-facing feature allowing you to fine-tune Copilot directly.
Here are two illustrations of why this feature is, for now (Q2 2022), missing.
The Copilot feature page initially included this:
How will GitHub Copilot get better over time?
GitHub Copilot doesn’t actually test the code it suggests, so the code may not even compile or run. GitHub Copilot can only hold a very limited context, so even single source files longer than a few hundred lines are clipped and only the immediately preceding context is used. And GitHub Copilot may suggest old or deprecated uses of libraries and languages. You can use the code anywhere, but you do so at your own risk.
As Tomek Korbak explains on Twitter:
Actually, Copilot's completions will always be optimised for humans' liking, not necessarily the compiler's liking.
That's because the language model training objective (predicting the next token in text) is great at capturing short-term dependencies (which explains the human feel of generated snippets).
But it struggles to capture long-term, global, semantic properties of generated sequences such as compilability. And there's no easy way of including compilability as a signal for their training.
The standard way -- fine-tuning language models using RL with compilability as a reward -- notoriously leads to catastrophic forgetting: less diverse and less accurate completions.
Tomek references "Energy-Based Models for Code Generation under Compilability Constraints (pdf)"
Our solution (KL-DPG) boosts compilability rate of generated sequences from 55% to 70%.
RL fine-tuning can do better but at a cost of catastrophic forgetting.
Overall, energy-based models (EBMs) turn out to be great at expressing weird, sequence-level constraints that would be super hard to express as normalised priors for autoregressive language models.
EBMs provide a way of injecting our structured, symbolic knowledge into large language models without breaking them down or sacrificing their uncanny abilities.
The space of further applications in controllable generation is huge.
So not so easy.
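To make the compilability signal concrete, here is a minimal, hypothetical sketch (my own illustration, not the paper's KL-DPG method): a sequence-level compilability check, plus a KL-style penalty against the base model, the term whose absence invites the catastrophic forgetting described above.

    def compiles(snippet: str) -> bool:
        # Sequence-level property: does the whole snippet byte-compile?
        try:
            compile(snippet, "<generated>", "exec")
            return True
        except SyntaxError:
            return False

    def reward(snippet: str, logp_tuned: float, logp_base: float,
               beta: float = 0.1) -> float:
        # Compilability bonus minus a penalty for drifting away from the
        # base model's distribution over the same snippet.
        return float(compiles(snippet)) - beta * (logp_tuned - logp_base)

    print(reward("def f(x): return x", logp_tuned=-1.0, logp_base=-1.2))  # 0.98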
Tanishq Mathew Abraham explains in "Coding with GitHub Copilot":
I wonder if the GitHub team might also develop a way of perhaps fine-tuning GitHub Copilot to specific use-cases.
For example, there may be specific GitHub Copilot models for fastai, JAX, etc. They would be fine-tuned on the source code of these libraries and on codebases that use them.
But making sure that the tool does not provide outdated suggestions would still be a challenge.
I don’t think it would be possible to provide suggestions for a brand-new library that does not have enough codebases using it to train on.
Additionally, for situations like fastai where there are older APIs and newer APIs, when fine-tuning a model, the codebases using the older APIs would have to be filtered out.

No, not at all. GitHub Copilot's model is not stored on the client's system, and no access to the model is provided. Now that they have started charging for the service, it is all the more obvious that the project isn't and won't be open-sourced.

Related

Any PyTorch tools to monitor a neural network's training?

Are there any tools to monitor a network's training in PyTorch, like TensorBoard in TensorFlow?
PyTorch 1.1.0 supports TensorBoard natively with torch.utils.tensorboard. The API is very similar to tensorboardX. See the documentation for more details.
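A minimal sketch (the run directory and loss values are placeholders):

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter("runs/demo")  # event files land in ./runs/demo
    for step, loss in enumerate([0.9, 0.6, 0.4]):
        writer.add_scalar("train/loss", loss, step)
    writer.close()
    # Then inspect with: tensorboard --logdir runs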
I am using tensorboardX. It supports most (if not all) of the features of TensorBoard. I am using the Scalar, Images, Distributions, Histograms and Text. I haven't tried the rest, like audio and graph, but the repo also contains examples for those use cases. The installation can be done easily with pip. It's all explained in the README file of the repo.
There are also other github repos which implement a wrapper for PyTorch (and other languages/frameworks) to tensorboard. As far as I know they support fewer functionalities. But have a look at:
Crayon
Tensorboard-Logger
I have asked this question before on the forums. TensorBoard is very convenient for TensorFlow, and it is also part of the library/framework itself. However, PyTorch didn't take the same approach. But there is a library called visdom, released by Facebook, that helps you log training information. It gives you the flexibility of logging information the way you want. While this means a lot of flexibility, it also means you need to write some extra code to make things work.
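For example, a minimal loss-curve sketch (assuming a visdom server is already running via python -m visdom.server; the values are placeholders):

    import numpy as np
    import visdom

    vis = visdom.Visdom()  # connects to localhost:8097 by default
    losses = np.array([0.9, 0.6, 0.4])
    vis.line(Y=losses, X=np.arange(len(losses)),
             opts=dict(title="train loss"))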
Following up on blckbird's answer, I'm also a big fan of Tensorboard-PyTorch. However I also found that its API is relatively low level and I was writing a lot of similar code over and over to do the logging. So (shameless plug) I've written a small package on top of it to automate monitoring network training experiments with minimal code. Hopefully someone else finds it helpful. pytorch-monitor
Minetorch helped me a lot in the past two Kaggle competitions. I think it's ready for others to use. It has built-in TensorBoard and Matplotlib support, and many other features that make the work easy, including:
Logger
Tensorboard supported
Matplotlib (to generate PNG files)
Auto resume training
Auto best model saving
Hook points for customization
...
It's still in development, so any issues or PRs are very welcome :)

Understanding Hadoop Packages and Classes

I have been using CDH and HDP for a while (both in the pseudo-distributed mode) on a VM as well as installing natively on Ubuntu. Although my question is probably relevant to all Projects within the Apache Hadoop Ecosystem, let me ask this specifically in the context of Avro.
What is the best way to go about figuring out what the different packages, and the classes within them, do? I usually end up referring to the Javadoc for the project (Avro in this case), but the overviews for packages and classes end up being awfully inadequate.
For example, take two of the Avro packages: org.apache.avro.specific and org.apache.avro.generic. These are used for creating Specific and Generic Readers and Writers (respectively), but I'm not 100% sure what they are for. I have used the specific package in cases where I use Avro code generation, and the generic one when I don't want to use code generation. However, I am not sure if that is the only reason for using one vs. the other.
Another example: the Encoder/Decoder classes are used for low-level SerDe, the DatumReader/DatumWriter for a "medium-level" SerDe, while most application-layer interactions with Avro will probably use Generic/Specific Readers/Writers. Without having struggled through the pain of using these classes, how is a user to know what to use for what?
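For concreteness, here is roughly how I understand those layers to stack up, illustrated with Avro's Python binding (a sketch; the parse function is spelled Parse in some versions of the library):

    import io
    import avro.schema
    from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

    schema = avro.schema.parse(
        '{"type": "record", "name": "User",'
        ' "fields": [{"name": "name", "type": "string"}]}')

    # Low level: Encoder/Decoder move raw bytes.
    buf = io.BytesIO()
    encoder = BinaryEncoder(buf)

    # Medium level: DatumWriter/DatumReader map records to the schema.
    DatumWriter(schema).write({"name": "Ada"}, encoder)

    buf.seek(0)
    print(DatumReader(schema).read(BinaryDecoder(buf)))  # {'name': 'Ada'}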
Is there a better way to get a good overview of each package (clearly the javadoc is not well documented) and the classes within the package?
PS: I have similar questions for essentially all other Hadoop projects (Hive, HBase, etc.) - the Javadocs seem grossly inadequate overall. I just wonder what other developers end up doing to figure these out.
Any inputs would be great.
I download the source code and skim through it to get the idea what it does. If there is javadoc, I read that too. I tend to concentrate on the interfaces that I need and move on from there, that way I put everything into context and it makes it easier to figure out the usage. I use the call hierarchy and the type hierarchy views a lot.
These are very general guidelines, and ultimately it is the time you spend with the project that will make you understand it.
The Hadoop ecosystem is growing quickly, and changes are introduced on a monthly basis; that's why the Javadoc is not so good. Another reason is that Hadoop software tends to lean towards the infrastructure rather than the end user. People developing tools will spend time learning the APIs and internals, while everybody else is more or less supposed to stay blissfully ignorant of all that and just use some high-level domain-specific language for the tool.

Using Feature Toggling and IoC in lieu of Branching Code -- Good or Bad Idea?

Our clients get to choose when to upgrade. So, my team literally has to maintain and support dozens of versions of our software product. As you can imagine, that results in a lot of branching and merging as hot fixes and service packs have to be propagated across all these flavors.
I'm not happy with the situation. The obvious solution is simply not to maintain so many different versions of our product, but that option is not available to me. So, I'm exploring creative ways to lower the team's maintenance load.
I'm considering using a mix of Feature Toggling and IoC to implement n-number of versions of our software product. The idea is that I could use a single code base for my product and manage behaviors and features via configuration management, in lieu of having to propagate code across multiple branches. Is this a reasonable approach, or am I just trading one problem for another?
That sounds reasonable, in that this would be the way I'd address such a problem in a greenfield environment.
Let's not call it a Feature Toggle, though. As the name implies, a Feature Toggle is an on/off switch, which may not be what you need.
Sometimes, an upgrade also involves changed behaviour in existing features. That implies that you're probably going to need something more sophisticated than an on/off switch.
The Strategy pattern is a more flexible way of modelling variation in behaviour. Each Strategy can represent a particular version of a particular behaviour, and if you don't want the behaviour at all, you can provide a Null Object implementation. In other words, a Feature Toggle can be implemented with a Strategy.
You can inject the Strategies into your application kernel using Dependency Injection, and you could make the choice of Strategies configurable via a configuration system. Most DI Containers I've heard about (on .NET and Java) support file-based configuration.
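A minimal sketch of that combination (Python for brevity; all names are illustrative and not tied to any particular DI container):

    from abc import ABC, abstractmethod

    class DiscountStrategy(ABC):  # hypothetical feature with per-version behaviour
        @abstractmethod
        def apply(self, price: float) -> float: ...

    class LegacyDiscount(DiscountStrategy):  # behaviour shipped in old versions
        def apply(self, price: float) -> float:
            return price * 0.95

    class TieredDiscount(DiscountStrategy):  # behaviour after an upgrade
        def apply(self, price: float) -> float:
            return price * (0.90 if price > 100 else 0.97)

    class NoDiscount(DiscountStrategy):  # Null Object: the feature is "off"
        def apply(self, price: float) -> float:
            return price

    STRATEGIES = {"legacy": LegacyDiscount,
                  "tiered": TieredDiscount,
                  "off": NoDiscount}

    def build_discount(config: dict) -> DiscountStrategy:
        # In practice this choice would come from your DI container's
        # file-based configuration, one profile per supported version.
        return STRATEGIES[config.get("discount", "off")]()

    print(build_discount({"discount": "legacy"}).apply(200.0))  # 190.0

Moving a client between product versions then becomes a configuration change rather than a branch merge.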
This essentially describes an add-in architecture.
Now, even for a greenfield application, this is no easy feat to pull off. If you have a headless system, it's not that hard, but once you have UI involved, you start to realise that you're going to need to componentise the UI architecture as well, so that you can plug in UI elements via Strategies.
On a decade-old code base, this would be what I'd call an 'interesting challenge', to say the least.

Criteria for selecting a library for Enterprise usage

What are your criteria for selecting an (open source) library (or framework) for enterprise usage?
Some libraries are pretty small and can be easily checked for security flaws or tested for performance. But most libraries are too big to be reviewed before you can start to use them.
When I think of how I select a library, most of the selection process is just gut feeling. When I try to be more specific, these are the first criteria that come to mind:
How many developers are working on the project? My feeling is that more developers will find more bugs and security issues. In addition it will be harder to introduce security issues intentionally.
How good is the support? Compared to closed source libraries, I've got the feeling that the support of open source is often much better since you have a community around the globe which will be available whenever you need them.
How widespread is the library? Are there any books about it on the market? Which other projects are using it?
What are your criteria? Feel free to edit this note as community wiki.
For me, it depends on whether or not it is paid for. In your case, you give the impression you are looking at open source libraries.
In that specific case, I'll look at test coverage. Regardless of the number of contributors, if there aren't any unit tests that I can run myself (and enhance with tests for my own use cases where they fall outside the provided coverage), then that's a massive issue for me.
It's not that I don't appreciate the work that is done already in providing the library, but code in projects like this should have unit tests already with good coverage in order to gain traction.
If a library has no unit tests, then I would start searching for it on search engines, actively seeking out negative replies. People who have negative feelings about the code and can crystallize the objective basis for those feelings, in terms of how the code failed them, will provide more valuable feedback than the masses saying "it works great".
Now for a commercial piece of code, it's completely different. At that point, I'd start looking at the company and its support staff as a whole, and use that (along with tests of your own to see if the library is right for you) to determine whether or not to use that company's offering.
Quite often with open source libraries you cannot get reliable support. In such situations your best bet is to fix it yourself, which involves the following requirements.
You need to have the ability to read often messy and undocumented code.
The technical ability to ask the right questions of the right people -- i.e., these people aren't being paid to fix problems, and they will only answer you if you make it easy enough for them.
Then you need the ability to fix the bug and get the patch accepted -- because if the patch isn't accepted .....
With this in mind, I would be inclined to get a commercial library, or a dual-licensed library, so that I could pay to get a competent engineer (motivated by the money I pay his company) to fix my problem.

How do I plan an enterprise level web application?

I'm at a point in my freelance career where I've developed several web applications for small to medium sized businesses that support things such as project management, booking/reservations, and email management.
I like the work but find that eventually my applications get to a point where the overhead for maintenance is very high. I look back at code I wrote 6 months ago and find I have to spend a while just relearning how I originally coded it before I can make a fix or add a feature. I do try to practice using frameworks (I've used Zend Framework before, and am considering Django for my next project).
What techniques or strategies do you use to plan out an application that is capable of handling a lot of users without breaking and still keeping the code clean enough to maintain easily?
If anyone has any books or articles they could recommend, that would be greatly appreciated as well.
Although there are certainly good articles on that topic, none of them is a substitute for real-world experience.
Maintainability is not something you can plan straight ahead, except on very small projects. It is something you need to take care of during the whole project. In fact, creating loads of classes and infrastructure code in advance can produce code that is even harder to understand than naive spaghetti code.
So my advice is to clean up your existing projects by continuously refactoring them. Look at the parts that were a pain to change, and strive for simpler solutions that are easier to understand and to adjust. If the code is too bad even for that, consider rewriting it from scratch.
Don't start new projects and expect them to succeed just because you read some more articles or used a new framework. Instead, identify the failures of your existing projects and fix their specific problems. Whenever you need to change your code, ask yourself how to restructure it to support similar changes in the future. This is what you need to do anyway, because there will be similar changes in the future.
By doing those refactorings you'll stumble across various specific questions you can ask and read articles about. That way you'll learn more than by just asking general questions and reading general articles about maintenance and frameworks.
Start cleaning up your code today. Don't defer it to your future projects.
(The same is true for documentation. Everyone's first docs are very bad. After several months they turn out to be too verbose and filled with unimportant stuff. So complement the documentation with solutions to the problems you really had, because chances are good that next year you'll be confronted with a similar problem. Those experiences will improve your writing style more than any "how to write well" style guide.)
I'd honestly recommend looking at Martin Fowler's Patterns of Enterprise Application Architecture. It discusses a lot of ways to make your application more organized and maintainable. In addition, I would recommend using unit testing to give you better comprehension of your code. Kent Beck's book on Test-Driven Development is a great resource for learning how to address change to your code through unit tests.
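As a tiny illustration of that test-first style (pytest-style; in practice you would write the test before the implementation, and the booking helper here is hypothetical, echoing the kinds of apps mentioned in the question):

    def reserve(slots: set, slot: str) -> bool:
        # Book a slot if it is free; refuse double bookings.
        if slot in slots:
            return False
        slots.add(slot)
        return True

    def test_reserve_rejects_double_booking():
        slots = set()
        assert reserve(slots, "2015-01-01T10:00")
        assert not reserve(slots, "2015-01-01T10:00")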
To improve the maintainability you could:
If you are the sole developer, then adopt a coding style and stick to it. That will give you confidence later when navigating through your own code about what you could possibly have done and what you absolutely wouldn't have. Being confident about where to look, what to look for, and what not to look for will save you a lot of time.
Always take time to bring documentation up to date. Include the task in the development plan; include that time in the plan as part of any change or new feature.
Keep documentation balanced: some high-level diagrams, meaningful comments. The best comments tell what cannot be read from the code itself, like the business reasons or "whys" behind certain chunks of code.
Include in the plan the effort to keep the code structure, folder names, namespaces, and object, variable, and routine names up to date and reflective of what they actually do. This will go a long way towards improving maintainability. Always call a spade a "spade". Avoid large chunks of code; structure them by the means available within your language of choice, and give chunks meaningful names.
Aim for low coupling and high cohesion. Make sure you are up to date with techniques for achieving these: design by contract, dependency injection, aspects, design patterns, etc.
From a task-management point of view, you should estimate more time and charge a higher rate for non-continuous pieces of work. Do not hesitate to make the customer aware that you need extra time for small non-continuous changes spread over time, as opposed to bigger continuous projects and ongoing maintenance, since the administration and analysis overhead is greater (you need to manage and analyse each change, including its impact on the existing system, separately). One benefit your customer will get is greater life expectancy of the system. The other is accurate documentation that preserves their option to seek someone else's help should they decide to do so. Both protect the customer's investment and are strong selling points.
Use source control if you don't already.
Keep a detailed log of everything done for the customer, plus any important communication (a simple computer- or paper-based CMS). Refresh your memory before each assignment.
Keep a log of open issues, ideas, and suggestions per customer; again, refresh your memory before beginning an assignment.
Plan ahead how post-implementation support is going to be conducted, and discuss it with the customer. Make sure your systems are easy to maintain. Plan for parameterisation, monitoring tools, and built-in sanity checks. Sell post-implementation support to the customer as part of the initial contract.
Expand by hiring, even if you need someone just to provide that post-implementation support and do the admin bits.
Recommended reading:
"Code Complete" by Steve Mcconnell
Anything on design patterns belongs on the recommended-reading list as well.
The most important advice I can give, having helped grow an old web application into an extremely highly available, high-demand web application, is to encapsulate everything. In particular:
Use good MVC principles and frameworks to separate your view layer from your business logic and data model.
Use a robust persistence layer so that your business logic is not coupled to your data model.
Plan for statelessness and asynchronous behaviour.
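A rough sketch of that separation (framework-agnostic; all names are illustrative):

    class BookingRepository:  # persistence layer: the only code touching storage
        def __init__(self, store: dict):
            self._store = store  # stand-in for a real database

        def save(self, booking_id: str, data: dict) -> None:
            self._store[booking_id] = data

    class BookingService:  # business logic: no view or storage details
        def __init__(self, repo: BookingRepository):
            self._repo = repo

        def book(self, booking_id: str, slot: str) -> dict:
            record = {"id": booking_id, "slot": slot}
            self._repo.save(booking_id, record)
            return record

    def booking_view(service: BookingService, request: dict) -> str:
        # View layer, and stateless: everything it needs arrives with the
        # request, so any server in a pool can handle it.
        record = service.book(request["id"], request["slot"])
        return "Booked {id} for {slot}".format(**record)

    print(booking_view(BookingService(BookingRepository({})),
                       {"id": "42", "slot": "10:00"}))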
Here is an excellent article on how eBay tackles these problems
http://www.infoq.com/articles/ebay-scalability-best-practices
Use a framework / MVC system. The more organised and centralised your code is, the better.
Try using Memcache. PHP has a built-in extension for it; it takes about ten minutes to set up and another twenty to put into your application. You can cache whatever you want in it. I cache all my database records in it, for every application. It does wonders.
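The same pattern in Python, as a minimal sketch using the pymemcache client (any memcache client will do; the DB helper is a stand-in):

    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def fetch_user_from_db(user_id: int) -> dict:
        # Stand-in for a real database query.
        return {"id": user_id, "name": "example"}

    def get_user(user_id: int) -> dict:
        key = "user:%d" % user_id
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit: skip the database
        record = fetch_user_from_db(user_id)
        cache.set(key, json.dumps(record), expire=300)  # cache for 5 minutes
        return record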
I would recommend using a source control system such as Subversion if you aren't already.
You should consider using SharePoint. It's an environment that is already designed to do everything you have mentioned, and it has many other features you may not have thought about (but may need in the future :-) ).
Here's some information from the official site.
There are two different SharePoint environments you can use: Windows SharePoint Services (WSS) or Microsoft Office SharePoint Server (MOSS). WSS is free and ships with Windows Server 2003, while MOSS isn't free but has many more features and covers almost all your enterprise's needs.