How to get StratifiedKFold in Scala Spark MLLib - scala

I searched a bit and, not finding anything, wrote a StratifiedKFold method that can be used with Scala and Spark (MLlib). I am posting the answer below.

A Java version of stratified k-fold for a JavaRDD:
/*
Stratified k-folds. This ensures that each category is split fairly between the training and testing samples.
For example, with 3 folds and a category '0' that has 3 samples, each of the 3 output folds trains on 2 samples and tests on 1.
The split is as even as possible: splitting 11 samples into 3 folds gives [4, 4, 3].
If a category has fewer than k samples, some folds will contain no samples of that category. Such a split can give inaccurate results, but it is only logged as a warning, not rejected.
*/
private static Tuple2<JavaRDD<LabeledPoint>, JavaRDD<LabeledPoint>>[] stritifyKFolds(JavaRDD<LabeledPoint> data, Broadcast<Integer> kBC){
JavaRDD<List<List<LabeledPoint>>> foldedLP = data.mapToPair( //map to (key,list) pair.
lp -> {
List<LabeledPoint> list = new ArrayList<>();
list.add(lp);
return new Tuple2<>(lp.label(), list);
}
).reduceByKey( //aggregate
(List<LabeledPoint> list1, List<LabeledPoint> list2) -> {
list1.addAll(list2);
return list1; //put the LabeledPoint with the same key into one list.
}
).values() //get list only after aggregate.
.map( //split each list into K folds.
list -> {
//shuffle and then put into different folds.
Collections.shuffle(list);
int total = list.size();
if(total < kBC.value()){
log.warn("category size {} is less than the number of folds {}; this will break stratification and lead to badly split folds", total, kBC.value());
}
//assign each element into folds.
List<List<LabeledPoint>> keyFolds = new ArrayList<>();
int avg = total/kBC.value(); //average number of elements per fold.
int remain = total % kBC.value(); //elements left over after giving each fold avg; they go one each to the leftmost folds.
for(int i=0, index = 0, count=0; i<kBC.value(); i++, index += count){
//get current folds count
count = (i<remain) ? avg+1 : avg;
keyFolds.add(new ArrayList<>(list.subList(index, index+count))); //copy, because subList returns a non-serializable view of the backing list.
}
return keyFolds;
}
);
foldedLP.persist(StorageLevel.MEMORY_AND_DISK_SER());
Tuple2<JavaRDD<LabeledPoint>, JavaRDD<LabeledPoint>>[] result = new Tuple2[kBC.value()];
//initialize folds, each fold is a tuple of (training RDD, testing RDD).
for(int i=0; i<kBC.value(); i++){
final int ii = i; //must be final variable if used in lambda.
JavaRDD<LabeledPoint> training_i = foldedLP.flatMap(
list -> {
List<LabeledPoint> noII = new ArrayList<>();
for(int j=0; j<kBC.value(); j++){
if(j != ii){
noII.addAll(list.get(j)); //get the all except ii_th list iterator.
}
}
return noII.iterator();
}
);
JavaRDD<LabeledPoint> testing_i = foldedLP.flatMap(
list -> list.get(ii).iterator() //get the ii_th list iterator.
);
result[i] = new Tuple2<>(training_i, testing_i);
}
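//foldedLP is deliberately left persisted: the fold RDDs are lazy, and recomputing foldedLP after an unpersist would reshuffle each category differently for every fold.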
return result;
}
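The even split described in the comment (11 samples into 3 folds gives [4, 4, 3]) can be checked in isolation. Below is a minimal sketch in plain Java with no Spark dependency; the names FoldSizes and foldSizes are hypothetical and not part of the answer above.

```java
import java.util.Arrays;

class FoldSizes {
    // Sizes of k folds for `total` samples; sizes differ by at most one,
    // and the larger folds come first.
    static int[] foldSizes(int total, int k) {
        int avg = total / k;     // every fold gets at least this many
        int remain = total % k;  // this many folds get one extra element
        int[] sizes = new int[k];
        for (int i = 0; i < k; i++) {
            sizes[i] = (i < remain) ? avg + 1 : avg;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(foldSizes(11, 3))); // prints [4, 4, 3]
        System.out.println(Arrays.toString(foldSizes(2, 3)));  // prints [1, 1, 0]
    }
}
```

The second call shows the case the warning is about: a category with fewer samples than folds leaves at least one fold empty for that category.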

def StratifiedKFold(nSamples: Int, k: Int, labels: List[Int],shuffle: Boolean = false): (Map[Int,List[List[Int]]],Int)= {
var idxs = (0 until nSamples).toArray
val unqLabels = labels.distinct
val noOfLabels = unqLabels.length
val idxsbylabel = idxs.groupBy { x => labels(x) }
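var stratifiedidxs: Map[Int,List[List[Int]]] = Map.empty //start with an empty map.
for ( i <- unqLabels){ //iterate over the actual label values, which need not be 0 until noOfLabels.
val labelsgroup_i_arr = if(shuffle) scala.util.Random.shuffle(idxsbylabel(i).toList).toArray else idxsbylabel(i).toArray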
val noOfParts = if(labelsgroup_i_arr.length%k==0) labelsgroup_i_arr.length/k else (labelsgroup_i_arr.length/k)+1
val labelsgroup_i_lst = List.concat(labelsgroup_i_arr)
stratifiedidxs = stratifiedidxs + (i -> labelsgroup_i_lst.grouped(noOfParts).toList)
}
(stratifiedidxs,noOfLabels)
}

Related

Split One Row into many based on splitting string in multiple cells

I am trying to split one row into many based on the strings in two cells. It is similar to the question
LINQ to separate column value of a row to different rows in .net
but I need to split on both the Product and Cost columns rather than on the Product column only.
SNo.  Product                    Cost
1     colgate,closeup,pepsodent  50,100,150
2     rin,surf                   100
into
SNo.  Product    Cost
1     colgate    50
1     closeup    100
1     pepsodent  150
2     rin        100
2     surf       100
I'm using LINQ to Objects with Entity Framework.
Try the following. Since you have not presented any model, the names may be inaccurate.
var loaded = ctx.Products.ToList();
var query =
from p in loaded
from sp in p.Product.Split(',').Zip(p.Cost.Split(','), (prod, cost) => (prod, cost)) // lambda parameters renamed so they do not clash with the range variable p
select new
{
Sno = p.Sno,
Product = sp.prod,
Cost = sp.cost
};
var splitted = query.ToList();
Using @SvyatoslavDanyliv's naming, here is an answer:
var loaded = ctx.Products.ToList();
var query =
from p in loaded
from sp in p.Product.Split(',').Zip(p.Cost.Split(','), (prod, cost) => (prod, cost)) // lambda parameters renamed so they do not clash with the range variable p
select new
{
Sno = p.Sno,
Product = sp.prod,
Cost = sp.cost
};
var splitted = query.ToList();
It feels a bit complicated to me. I would prefer using an extension method to create a variant of Zip that repeats the last element of a shorter sequence to match the longer sequence:
public static class EnumerableExt {
public static IEnumerable<(T1 First,T2 Second)> ZipExtend<T1,T2>(this IEnumerable<T1> s1, IEnumerable<T2> s2) {
var s1e = s1.GetEnumerator();
var s2e = s2.GetEnumerator();
T1 s1eLast = default;
T2 s2eLast = default;
bool has_s2 = false;
if (s1e.MoveNext()) {
do {
s1eLast = s1e.Current;
if (s2e.MoveNext()) {
s2eLast = s2e.Current;
has_s2 = true;
}
else if (!has_s2)
yield break;
yield return (s1eLast, s2eLast);
} while (s1e.MoveNext());
if (has_s2)
while (s2e.MoveNext())
yield return (s1eLast, s2e.Current);
}
yield break;
}
}
Then the answer is:
var query =
from p in loaded
from pr in p.Product.Split(',').ZipExtend(p.Cost.Split(','))
select new
{
Sno = p.Sno,
Product = pr.First,
Cost = pr.Second
};
var splitted = query.ToList();

Google/OR-Tools Get Duration And Distance

I'm trying to understand the solution calls in the MVRP examples.
I have two matrices, duration and distance, that were returned by calls to Google.
My solution is based on distance, but since I already have the duration data, I want to find the index to use to look up the matching duration.
Unfortunately, I'm not completely sure what is going on under the hood of the routing calls, so I'm hoping for a simple, fast answer on how to do the lookup and which index to use.
For simplicity's sake I will show the Google example rather than my own code and highlight what I'm looking for:
public string PrintSolution()
{
// Inspect solution.
string ret = "";
long maxRouteDistance = 0;
for (int i = 0; i < _data.Drivers; ++i)
{
ret += $"Route for Vehicle {i}:";
ret += Environment.NewLine;
long routeDistance = 0;
var index = _routing.Start(i);
while (_routing.IsEnd(index) == false)
{
ret += $"{_manager.IndexToNode((int) index)} -> ";
var previousIndex = index;
index = _solution.Value(_routing.NextVar(index));
long legDistance = _routing.GetArcCostForVehicle(previousIndex, index, i);
//LOOKING FOR
//long legDuration = ??? which index do I use here to look up my duration matrix, which is indexed the same way as the distance matrix?
ret += " leg distance: " + legDistance;
routeDistance += legDistance;
}
ret += $"{_manager.IndexToNode((int) index)}";
ret += Environment.NewLine;
ret += $"Distance of the route: {routeDistance}m";
ret += Environment.NewLine;
ret += Environment.NewLine;
maxRouteDistance = Math.Max(routeDistance, maxRouteDistance);
}
ret += $"Maximum distance of the routes: {maxRouteDistance}m";
ret += Environment.NewLine;
return ret;
}
@Mizux
Disclaimer: this is a simplification, but it should help you understand.
In OR-Tools Routing there is a primal "hidden" dimension without a name, but you can retrieve its cost using RoutingModel::GetArcCostForVehicle().
For any "regular" dimension you can inspect the CumulVar at each node,
e.g. supposing you have created two dimensions using RoutingModel::AddDimension() named "Distance" and "Duration".
Note: CumulVar is an accumulator, so if you want the "arc cost" you'll need something like dim.CumulVar(next_index) - dim.CumulVar(index).
Then in your print function you can use:
public string PrintSolution()
{
...
RoutingDimension distanceDimension = _routing.GetMutableDimension("Distance");
RoutingDimension durationDimension = _routing.GetMutableDimension("Duration");
for (int i = 0; i < _manager.GetNumberOfVehicles(); ++i)
{
while (_routing.IsEnd(index) == false)
{
...
IntVar distanceVar = distanceDimension.CumulVar(index);
IntVar durationVar = durationDimension.CumulVar(index);
long distance = _solution.Value(distanceVar);
long duration = _solution.Value(durationVar);
...
}
}
}
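The two ways of getting a per-leg value mentioned in this thread (reading the matrix directly, or differencing an accumulator) can be sketched without OR-Tools. The plain-Java example below is only an illustration: LegLookup, legsFromMatrix and legsFromCumuls are hypothetical names, the node numbers stand for what IndexToNode would return for each visited index, and the two results agree only under the assumption that the dimension adds no slack or waiting time between nodes.

```java
import java.util.Arrays;

class LegLookup {
    // Per-leg values read straight from a matrix, given the visited
    // node numbers in route order (assumed: one matrix row/column per node).
    static long[] legsFromMatrix(long[][] matrix, int[] nodes) {
        long[] legs = new long[Math.max(0, nodes.length - 1)];
        for (int i = 0; i + 1 < nodes.length; i++) {
            legs[i] = matrix[nodes[i]][nodes[i + 1]];
        }
        return legs;
    }

    // Per-leg values recovered from accumulated values:
    // leg = cumul(next) - cumul(current).
    static long[] legsFromCumuls(long[] cumuls) {
        long[] legs = new long[Math.max(0, cumuls.length - 1)];
        for (int i = 0; i + 1 < cumuls.length; i++) {
            legs[i] = cumuls[i + 1] - cumuls[i];
        }
        return legs;
    }

    public static void main(String[] args) {
        long[][] duration = {
            {0, 5, 9},
            {4, 0, 3},
            {7, 2, 0}
        };
        int[] route = {0, 2, 1, 0};      // made-up route: depot, node 2, node 1, depot
        long[] cumuls = {0, 9, 11, 15};  // made-up accumulated duration at each stop
        System.out.println(Arrays.toString(legsFromMatrix(duration, route))); // prints [9, 2, 4]
        System.out.println(Arrays.toString(legsFromCumuls(cumuls)));          // prints [9, 2, 4]
    }
}
```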

How to shuffle the order of a list from snapshot.docs from a Stream in Firestore [duplicate]

I've looked everywhere on the web (the Dart website, Stack Overflow, forums, etc.) and I can't find an answer.
Here is my problem: I need to write a function, in Dart, that prints a list, provided as an argument, in random order.
I've tried maps, Sets and lists; I've tried assert and sort; I've looked at the Random class in Dart's math library... nothing does what I want.
Can someone help me with this?
Here are some drafts:
var element03 = query('#exercice03');
var uneliste03 = {'01':'Jean', '02':'Maximilien', '03':'Brigitte', '04':'Sonia', '05':'Jean-Pierre', '06':'Sandra'};
var alluneliste03 = new Map.from(uneliste03);
assert(uneliste03 != alluneliste03);
print(alluneliste03);
var ingredients = new Set();
ingredients.addAll(['Jean', 'Maximilien', 'Brigitte', 'Sonia', 'Jean-Pierre', 'Sandra']);
var alluneliste03 = new Map.from(ingredients);
assert(ingredients != alluneliste03);
//assert(ingredients.length == 4);
print(ingredients);
var fruits = <String>['bananas', 'apples', 'oranges'];
fruits.sort();
print(fruits);
There is a shuffle method in the List class. The method shuffles the list in place. You can call it without an argument or pass a random number generator instance:
var list = ['a', 'b', 'c', 'd'];
list.shuffle();
print('$list');
The collection package comes with a shuffle function/extension that also supports specifying a sub range to shuffle:
void shuffle (
List list,
[int start = 0,
int end]
)
Here is a basic shuffle function. Note that the resulting shuffle is not cryptographically strong. It uses Dart's Random class, which produces pseudorandom data not suitable for cryptographic use.
import 'dart:math';
List shuffle(List items) {
var random = new Random();
// Go through all elements.
for (var i = items.length - 1; i > 0; i--) {
// Pick a pseudorandom number according to the list length
var n = random.nextInt(i + 1);
var temp = items[i];
items[i] = items[n];
items[n] = temp;
}
return items;
}
main() {
var items = ['foo', 'bar', 'baz', 'qux'];
print(shuffle(items));
}
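For comparison, the same Fisher-Yates loop can be sketched in Java. This is an illustration only, not part of the Dart answer; ShuffleSketch is a hypothetical name, and the generator is passed in so that a seeded java.util.Random gives a repeatable order. Like the Dart version, it is not suitable for cryptographic use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ShuffleSketch {
    // In-place Fisher-Yates shuffle: walk from the last index down to 1 and
    // swap each position with a random position at or below it.
    static <T> void shuffle(List<T> items, Random random) {
        for (int i = items.size() - 1; i > 0; i--) {
            int n = random.nextInt(i + 1); // 0 <= n <= i
            Collections.swap(items, i, n);
        }
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>(Arrays.asList("foo", "bar", "baz", "qux"));
        shuffle(items, new Random());
        System.out.println(items); // same four elements, in a pseudorandom order
    }
}
```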
You can use shuffle() with the cascade operator (two dots), as Vinoth Vino said.
List cities = ["Ankara","London","Paris"];
List mixed = cities..shuffle();
print(mixed);
// e.g. [London, Paris, Ankara]

Many to many - get appropriate records from table - EF LINQ

I have two tables with a many-to-many relationship. Let's call them tables A and B.
I also have a List<List<int>> TagIdList with ids of elements of table B.
How can I find every element of table A that has all the TagIdList[i] elements? I need just the ids from table A, so it doesn't have to return whole TASKS rows.
Example:
A: TASKS:
id: 1,2,3,4,5,6
B: TAGS:
id: 1,2,3,4
A-B links:
1-2; 1-3; 2-1; 2-2; 5-3; 5-4; 6-1; 6-6;
List<List<int>> TagIdList //(ids from TAGS)
TagIdList[0]= {2,3}
TagIdList[1]= {1}
TagIdList[2]= {2,6}
Result: (ids from TASKS)
i=0; -> 1
i=1; -> 2,6
i=2; -> null
I've tried:
List<int> tags = model.TagIdList[i].IdList; //I've got it from my View
List<TASKS> tasks = myEntity.TASKS.Where(t => t.TAGS == tags).ToList();
And I can't get the tasks because of an error: Unable to create a constant value of type. Only primitive types are supported in this context.
Any ideas?
Your problem is here: myEntity.TASKS.Where(t => t.TAGS == tags)
If I understand the question correctly, you need something like this:
A compare method for two lists of int (taken from here):
public static bool ScrambledEquals<T>(IEnumerable<T> list1, IEnumerable<T> list2) {
var cnt = new Dictionary<T, int>();
foreach (T s in list1) {
if (cnt.ContainsKey(s)) {
cnt[s]++;
} else {
cnt.Add(s, 1);
}
}
foreach (T s in list2) {
if (cnt.ContainsKey(s)) {
cnt[s]--;
} else {
return false;
}
}
return cnt.Values.All(c => c == 0);
}
Then use this method inside the LINQ expression (note that it matches only tasks whose tag ids are exactly the ids in tags):
myEntity.TASKS.AsEnumerable().Where(t => ScrambledEquals<int>(t.TAGS.Select(tag=>tag.id).ToList(),tags))
I found a solution. It may not be ideal, but it works.
List<int> tags = model.TagIdList[i].IdList;
List<List<int>> whole_list = new List<List<int>>();
foreach (var t in tags) //for every given tag, collect the ids of the tasks that have it
{
var temp = myEntity.TAGS.Find(t).TASKS.Select(task => task.Id).ToList();
whole_list.Add(temp);
}
//now take the first list from whole_list and compare it with the other lists
List<int> collection = whole_list[0]; //collection it's tasks
whole_list.Remove(collection);
//keep only the part that is common to all lists in whole_list
foreach (List<int> k in whole_list)
{
var temp = collection.Intersect(k);
collection = temp.ToList();
}
//collection is now what I wanted for TagIdList[i]: every task that has all the tags in TagIdList[i].IdList
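The intersection idea in this solution can be sketched independently of Entity Framework. The plain-Java example below is an illustration under stated assumptions: TagIntersection, tasksWithAllTags and exampleLinks are hypothetical names, and the map from tag id to task ids is built by hand from the A-B links listed in the question.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TagIntersection {
    // Ids of the tasks that carry every tag in `tags`: intersect the
    // task-id sets of the individual tags. Unknown tags match no task.
    static Set<Integer> tasksWithAllTags(Map<Integer, Set<Integer>> tasksByTag, List<Integer> tags) {
        Set<Integer> result = null;
        for (Integer tag : tags) {
            Set<Integer> tasks = tasksByTag.getOrDefault(tag, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(tasks); // copy so the input map is never modified
            } else {
                result.retainAll(tasks);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    // The task-tag links from the question, stored as tag id -> task ids.
    static Map<Integer, Set<Integer>> exampleLinks() {
        int[][] links = {{1, 2}, {1, 3}, {2, 1}, {2, 2}, {5, 3}, {5, 4}, {6, 1}, {6, 6}};
        Map<Integer, Set<Integer>> tasksByTag = new HashMap<>();
        for (int[] link : links) { // link[0] = task id, link[1] = tag id
            tasksByTag.computeIfAbsent(link[1], k -> new TreeSet<>()).add(link[0]);
        }
        return tasksByTag;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> tasksByTag = exampleLinks();
        System.out.println(tasksWithAllTags(tasksByTag, Arrays.asList(2, 3))); // prints [1]
        System.out.println(tasksWithAllTags(tasksByTag, Arrays.asList(1)));    // prints [2, 6]
        System.out.println(tasksWithAllTags(tasksByTag, Arrays.asList(2, 6))); // prints []
    }
}
```

The three calls reproduce the expected results from the question for TagIdList[0], TagIdList[1] and TagIdList[2].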

Comparing consecutive elements in a queue

I have a queue of elements, sorted by date. I need to extract the first n elements, which have the same date and add them to a temporary ArrayList, from which I choose one of them and scrap the others. After that I need to continue doing the same thing for the next n elements of the queue with the same date (extract them to the temp list and so on) until I have no more items in the queue.
// some notes to help you understand the code
PriorityQueue<Results> r, size(4), elementsEqualByTime(1=2,3=4);
List<Comments> c, size(2);
ArrayList temp;
if (c.size() != r.size() && resultIter.hasNext()) {
//first iteration will compare element 0 to itself -> 100% true
ResultObject r2 = resultIter.next();
ResultObject r1 = r2;
while (resultIter.hasNext() && r1.getTime().equals(r2.getTime())) {
temp.add(r1);
//we add the matching elements before we continue
r1 = r2;
temp.add(r1);
if (resultIter.hasNext()) {
//after we add the 2 matching elements we continue
r2 = resultIter.next();
}
}
//use the items in temp
temp.clear();
}
Right now it works for the 1st set of elements, but on the 2nd iteration it adds no elements to the temp ArrayList. I'd appreciate help with this solution, but am also open to different suggestions.
boolean Check (List<Element> elements,Element element)
{
for(Element element1:elements)
if(element1.equals(element))
return true;
return false;
}
void Stuff()
{
// some notes to help you understand the code
PriorityQueue<Element> r = new PriorityQueue<Element>();
List<Element> c;
List<Element> temp = new ArrayList<Element>();
for(Element element:r)
{
if(!Check(temp, element))
{
// do stuff with temp
temp = new ArrayList<Element>();
}
temp.add(element);
}
}
while (commentIter.hasNext()) {
Comment c1 = null;
temp.add(arrayQueue[0]);
for (int i = 1; i < arrayQueue.length; i++) {
if (!arrayQueue[i].getTime().equals(arrayQueue[i - 1].getTime())) {
c1 = commentIter.next();
//do stuff with the results
temp = new HashSet<ResultObject>();
}
temp.add(arrayQueue[i]);
}
if (!temp.isEmpty()) {
c1 = commentIter.next();
//do stuff with the results
}
temp = new HashSet<ResultObject>();
}
That's a tested solution which works.
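As an alternative sketch of the grouping step described in the question: a PriorityQueue's iterator is not guaranteed to visit elements in priority order, so the example below removes elements with poll(), which always returns the current head. The Result class, its time and name fields, and the method groupsByTime are hypothetical stand-ins for the asker's types; choosing one element per group is left to the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class GroupByTime {
    // Hypothetical element type: only the time matters for grouping.
    static class Result {
        final long time;
        final String name;
        Result(long time, String name) { this.time = time; this.name = name; }
    }

    // Drains the queue and returns runs of elements that share the same time.
    static List<List<Result>> groupsByTime(PriorityQueue<Result> queue) {
        List<List<Result>> groups = new ArrayList<>();
        List<Result> temp = new ArrayList<>();
        while (!queue.isEmpty()) {
            Result r = queue.poll(); // head of the queue, i.e. the earliest time left
            if (!temp.isEmpty() && temp.get(0).time != r.time) {
                groups.add(temp);          // the previous run is complete
                temp = new ArrayList<>();
            }
            temp.add(r);
        }
        if (!temp.isEmpty()) {
            groups.add(temp);              // do not forget the last run
        }
        return groups;
    }

    // Builds a queue ordered by time from the given times (helper for the demo).
    static PriorityQueue<Result> queueOf(long... times) {
        PriorityQueue<Result> queue =
            new PriorityQueue<>(Comparator.comparingLong((Result r) -> r.time));
        for (int i = 0; i < times.length; i++) {
            queue.add(new Result(times[i], "r" + i));
        }
        return queue;
    }

    public static void main(String[] args) {
        for (List<Result> group : groupsByTime(queueOf(2, 1, 2, 1))) {
            System.out.println("time " + group.get(0).time + ": " + group.size() + " element(s)");
        }
    }
}
```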