Predict Class Probabilities in Spark RandomForestClassifier - scala

I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predicted probabilities from the models, but I only see the prediction classes instead of the probabilities. According to this issue link, the issue is resolved, and it leads to this GitHub pull request and this. However, it seems it was only resolved in version 1.5. I'm using AWS EMR, which provides Spark 1.4.1, and I still have no idea how to get the predicted probabilities. If anyone knows how to do it, please share your thoughts or solutions. Thanks!

I have already answered a similar question before.
Unfortunately, with MLlib you can't get the probabilities per instance for classification models up to and including version 1.4.1.
There are JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic, which are IN PROGRESS as I'm writing this answer. Nevertheless, the issue seems to have been on hold since November 2014:
There is currently no way to get the posterior probability of a prediction with a Naive Bayes model during prediction. This should be made available along with the label.
And here is a note from @sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.
Reference : source.
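For reference, a rough sketch of that hack for a multinomial mllib NaiveBayesModel might look like the following. This is my own illustration, not code from the mailing list: pi holds the log class priors and theta the log conditional probabilities, so the unnormalised log posterior of class i for a feature vector x is pi(i) + theta(i) . x.

import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: compute the unnormalised log posteriors from the model's
// internal arrays, then exponentiate and normalise to get class probabilities.
def naiveBayesPosteriors(model: NaiveBayesModel, features: Vector): Array[Double] = {
  val x = features.toArray
  val logPosteriors = model.labels.indices.map { i =>
    model.pi(i) + model.theta(i).zip(x).map { case (logProb, xj) => logProb * xj }.sum
  }
  val maxLog = logPosteriors.max                       // subtract the max for numerical stability
  val unnormalised = logPosteriors.map(lp => math.exp(lp - maxLog))
  val total = unnormalised.sum
  unnormalised.map(_ / total).toArray
}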
This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.
Concerning AWS, there is not much you can do about that right now. A solution might be to fork the emr-bootstrap-actions for Spark and configure it for your needs; then you'll be able to install Spark on AWS using the bootstrap step.
Nevertheless, this might seem a little complicated.
There are some things you might need to consider:
Update the spark/config.file to install your Spark 1.5. Something like:
+3 1.5.0 python s3://support.elasticmapreduce/spark/install-spark-script.py s3://path.to.your.bucket.spark.installation/spark/1.5.0/spark-1.5.0.tgz
The file listed above must be a proper build of Spark located in a specified S3 bucket you own, for the time being.
To build your Spark, I advise reading the examples section about building-spark-for-emr and also the official documentation. That should be about it! (I hope I haven't forgotten anything.)
EDIT : Amazon EMR release 4.1.0 offers an upgraded version of Apache Spark (1.5.0). You can check here for more details.

Unfortunately this isn't possible with version 1.4.1. If you can't upgrade, you could extend the random forest class and copy some of the code I added in that pull request, but be sure to switch back to the regular version once you are able to upgrade.
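If extending the pipeline classes is too invasive, another possible stop-gap on 1.4.x is to train with the older mllib.tree.RandomForest API and approximate class probabilities from the per-tree votes. Here is a rough sketch (not the code from the pull request; it yields vote fractions, which is not exactly what Spark 1.5 computes, but is often close enough in practice):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Approximate class probabilities as the fraction of trees voting for each class.
def predictProbabilities(model: RandomForestModel,
                         features: Vector,
                         numClasses: Int): Array[Double] = {
  val votes = new Array[Double](numClasses)
  model.trees.foreach { tree =>
    votes(tree.predict(features).toInt) += 1.0
  }
  votes.map(_ / model.trees.length)
}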

Spark 1.5.0 is now supported natively on EMR with the emr-4.1.0 release! No more need to use the emr-bootstrap-actions, which btw only work on 3.x AMIs, not emr-4.x releases.
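Once you are on emr-4.1.0 / Spark 1.5.0, RandomForestClassifier is a ProbabilisticClassifier, so the fitted model emits a probability column out of the box. A minimal sketch (the dataset path and column names are just the standard example values, so adapt them to your data):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.util.MLUtils

object RandomForestProbabilities {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf-probabilities"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Sample libsvm file shipped with the Spark distribution; swap in your own data.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
    val Array(train, test) = data.randomSplit(Array(0.7, 0.3))

    // StringIndexer attaches the label metadata the tree classifiers expect.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setProbabilityCol("probability") // default column name, set explicitly for clarity

    val model = new Pipeline().setStages(Array(labelIndexer, rf)).fit(train)

    // "probability" holds a Vector with one entry per class.
    model.transform(test).select("prediction", "probability").show(5)
  }
}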

Related

How to validate ksql script?

I would like to know if there is a way to check whether a .ksql script is syntactically correct.
I know that you can send a POST request to the server; however, this would also execute the contained ksql commands. I would love to have some kind of endpoint where you can pass your statement and it returns either an error code or an OK, like:
curl -XPOST <ksqldb-server>/ksql/validate -d '{"ksql": "<ksql-statement>"}'
My question aims at a way to check the syntax in an automated fashion without the need to clean up everything afterwards.
Thanks for your help!
Note: I am also well aware that I could run everything separately using, e.g., a docker-compose file and tear everything down again. This however is quite resource-heavy and harder to maintain.
One option could be to use the ksql test runner (see here: https://docs.ksqldb.io/en/latest/how-to-guides/test-an-app/) and look at the errors to check whether the statement is valid. Let me know if it works for your scenario.
By now I've found a way to test for my use case. I had a ksqldb cluster already in place with all the other systems needed for the Kafka ecosystem (Zookeeper, Broker,...). I had to compromise a little bit and go through the effort of deploying everything, but here is my approach:
Use proper naming (let it be prefixed with test or whatever suits your use case) for your streams, tables,... The queries' sink property should include the prefixed topic in order to find it easily, or you can simply assign a QUERY_ID (https://github.com/confluentinc/ksql/pull/6553).
Deploy the streams, tables,... to your ksqldb server using its REST API. Since I was programming in Python, I made use of the ksql package from pip (https://github.com/bryanyang0528/ksql-python).
Cleanup the ksqldb server by filtering for the naming that you assigned to the ksql resources and run the corresponding DROP or TERMINATE statements. Consider also that you will have dependencies that result in multiple streams/tables reusing a topic. The statements can be looked up in the official developer guide (https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/quick-reference/).
If you had errors in step 2, step 3 should have cleaned up the leftovers so that you can adjust your ksql scripts until they run through smoothly.
Note: I could not make any assumptions about what the streams etc. look like. If you can, I would prefer the ksql-test-runner.
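To make steps 2 and 3 above a bit more concrete, here is a rough sketch of the deploy-then-clean-up loop against the ksqlDB REST API's /ksql endpoint. The server address, the test_ prefix and the stream definition are all illustrative, and this is a plain-REST equivalent of what the Python ksql package does for you:

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Sketch: POST statements to /ksql, check the status code, and drop the
// test-prefixed resources afterwards. All names here are illustrative.
object KsqlSmokeTest {
  val server = "http://localhost:8088"

  def postKsql(statement: String): (Int, String) = {
    val conn = new URL(s"$server/ksql").openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/vnd.ksql.v1+json")
    conn.setDoOutput(true)
    val escaped = statement.replace("\"", "\\\"")
    val body = "{\"ksql\": \"" + escaped + "\", \"streamsProperties\": {}}"
    val writer = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
    writer.write(body)
    writer.close()
    val code = conn.getResponseCode
    val stream = Option(if (code < 400) conn.getInputStream else conn.getErrorStream)
    (code, stream.map(Source.fromInputStream(_).mkString).getOrElse(""))
  }

  def main(args: Array[String]): Unit = {
    // Step 2: deploy a prefixed stream; a non-2xx answer means the statement is invalid.
    val (code, response) = postKsql(
      "CREATE STREAM test_orders (id INT, amount DOUBLE) WITH (kafka_topic='test_orders', value_format='JSON', partitions=1);")
    println(s"$code -> $response")

    // Step 3: clean up everything carrying the test_ prefix.
    postKsql("DROP STREAM IF EXISTS test_orders DELETE TOPIC;")
  }
}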

How do I structure my project folders in Eclipse for a Cucumber project which has sprint-wise delivery

I am trying to create an automation framework using Cucumber and trying to replicate a real-time scenario (sprint-wise delivery).
How do I structure my folders/source folders/packages in eclipse? Below is the structure which I am about to follow but I am not quite convinced if it is right.
I am trying to structure in such a way that when I give the command
"mvn test -Dcucumber.options="src\test\resources\sprint1\features", then it should run all the features under sprint1, similarly for sprint2 and so on.
Any suggestions or inputs would be helpful.
P.S.: Since I am new to Cucumber, a detailed explanation of the folder structure for real-time sprint-wise delivery would be much appreciated.
Thanks :)
I would not consider the file structure you are thinking of.
The reason is that after a while, it doesn't matter when a feature was added to the system. So organizing features based on time is a bad idea.
If you still need to be able to run the features for a specific sprint, consider using tags instead. That would allow you only to run the features connected to the sprint you are interested in.
I would not do that either, because after a while it doesn't matter in which sprint a piece of functionality was added. It should still pass all executions, even if it is 27 sprints old.
If this organization is bad, how should you do it instead?
This is a question where a lot of people have a lot of opinions and the debate can get very heated.
My take is that it is important to make sure that the code is easy to use. By that I mean easy to navigate and understand for a new developer. If you want, think of usability in any other product.
Given this, I would organize the features by functional area, in different packages: a package for each area, one for viewing products, one for ordering products, one for paying, etc.
I would also try to take a step further and organize the source code in a similar way.
But I would never organize using a temporal approach as you are thinking of.
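To make the functional grouping concrete, a layout along these lines (the area names are purely illustrative) is what I have in mind:

src/test/resources/features/browse_products/
src/test/resources/features/ordering/
src/test/resources/features/payment/

The step definition packages under src/test/java can then mirror the same areas.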
You should not organize your tests per sprint, because a particular sprint ends at a particular time. If you want to run some feature files together on a temporary basis (until the sprint is over), you can add tags at the top of the feature files.
For example:
You have the following 2 feature files:
src/test/resources/sprint1/file1.feature
src/test/resources/sprint1/file2.feature
Just add "#sprint1" on top of each feature as shown below:
# 1. file1.feature
@sprint1
Feature: sprint1 : features : file1
  Scenario: Some scenario desc..
    Given ....
    When ....
    Then ....
# 2. file2.feature
@sprint1
Feature: sprint1 : features : file2
  Scenario: Some scenario desc..
    Given ....
    When ....
    Then ....
Now, to run both of these files, you need to execute the following command in your command prompt:
cucumber --tags @sprint1
By executing this command, all the feature files that contain the "@sprint1" tag will run. After the sprint is over, you can delete this extra tag from the feature files.
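Since the question drives the suite through Maven (cucumber-jvm), the equivalent invocation should look something like the following, depending on your cucumber-jvm version:
mvn test -Dcucumber.options="--tags @sprint1"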

How to merge two SonarQube instances

Currently we are running two SonarQube instances in different locations, and we are planning to merge these two instances into one. Is it possible to merge two SonarQube instances?
This feature does not exist currently, and we don't expect to work on this in the near future.
The best advice we can give you is:
Keep your biggest SQ instance
For every project of your other instance, use the "sonar.projectDate" analysis parameter to rebuild your analysis history on the biggest instance. For example:
Check out the code for version 1.0 and run an analysis with "-Dsonar.projectDate=2010-12-21"
Check out the code for version 1.1 and run an analysis with "-Dsonar.projectDate=2011-08-13"
...etc.
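For example, with the sonar-runner (or the equivalent -D properties on the Maven plugin), each pass would look something like:
sonar-runner -Dsonar.projectVersion=1.0 -Dsonar.projectDate=2010-12-21
(sonar.projectVersion is shown as well, since you normally want the recorded version to match the checkout.)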

Why does Rational Team Concert change the files' last modified attribute?

I'm having some issues with the installation of Rational Team Concert on my server.
The thing is that when I upload some changes to the server (of any kind), it changes the last modified attribute of the file, but it shouldn't.
Is there a way to avoid this behavior?
Thank you in advance!
This is something that we have tried to add to RTC SCM (and we still plan to). However, we found that it needs to be an option on load/update.
There are numerous details and discussions available in this work item on jazz.net.
Regarding the timestamp, setting aside the fact that relying on it in a version control tool isn't always considered a best practice (see "What's the equivalent of use-commit-times for git?"), it is actually a complex issue:
an SCM loader wouldn't use just the timestamp to determine which files have changed (Task 179263)
you can have various requirements for that timestamp (like in Defect 159043, where the requested behavior is for the file timestamp of the modified file on disk to be that of when it was delivered, not when it was accepted). The variable JAZZ_CCM_SKIP_MOD_TIME=true is mentioned there, so check whether that could improve your specific case.
it is all based on the assumption that the timestamp is correctly set by the local workstation, which isn't always true, as illustrated in Task 77201

ADO.NET Data Services [DataWebKey]

I am using VS 2008 (Professional edition) with SP1. I am new to ADO.NET Data Services. I am watching Mike Taulty's videos.
He used the [DataWebKey] attribute to specify the key field, and he referenced the namespace
Microsoft.Data.Web. To follow that example I am trying to reference the same assembly, but it is not found on my system.
How to fix it?
Looks like you're not the first person to come across this.
http://mtaulty.com/CommunityServer/blogs/mike_taultys_blog/archive/2008/05/19/10424.aspx
Apparently you should use the DataServiceKey attribute, which is in System.Data.Services.Common.
I see that Mike's videos mostly date from mid-2008. ADO.NET Data Services has changed since then, which may be why you're unable to find the right reference.
I think you're better off trying to find some more recent material, preferably from the last 6 months.