Generate a list of weighted random letters - flutter

I'm trying to create a word game and want to present letters to a user for them to build words. Given that, I have a source list of available letters named lettersList that for now is just the 26 letters of the English alphabet.
For the user, I only want them to have say 5 letters to build a word with. To generate that list of 5 letters, I have the following:
var randomList = new List.generate(5, (_) => lettersList[Random().nextInt(lettersList.length)]).toList();
This sort of works since I don't mind duplicates, but I want certain letters to appear more than others such as vowels and the most common consonants.
So the only solution I could think of was to augment my lettersList with extra copies of the characters I want to show up more often (e.g., add 5 more instances of the letter E and 3 more instances of the letter N to lettersList) and change my code to use shuffle instead.
lettersList.shuffle();
return lettersList.take(5).toList();
So even though that works, I'm just curious, is there a better or more efficient way to do this?

The way you have suggested (using multiples of some letters) isn't a bad idea per se, but it doesn't have a lot of flexibility.
What you really want to be able to do is to have a weighted set of values for each letter, and then choose between them.
A simple way of doing this would be to just define the weights for each letter in the alphabet, i.e.
{ 'a': 1, "b": 0.8, "c": 1.2 ... }
Then, to get a random distribution, you could use random.nextDouble() * <sum of all the weights>. This would result in a number between 0 and the sum total; all you need to do is figure out which letter that position corresponds to. You could do that by starting at 0 and walking through the letters while keeping a running total of the weights: the first letter whose weight pushes the running total above the random double is the one to pick.
You could then wrap this up in a class, potentially doing some initialization of defaults. You can check it out on dartpad but I've included it below as well.
This class handles the random distribution in a generic way:
import 'dart:math';
import 'package:collection/collection.dart';
class WeightedRandom<T> {
WeightedRandom(Map<T, double> allWeights)
: _totalWeight = allWeights.values.sum,
_allWeightsList = allWeights.entries.toList(growable: false);
final double _totalWeight;
final Random _random = Random.secure();
final List<MapEntry<T, double>> _allWeightsList;
T getNext() {
final weightedRandom = _random.nextDouble() * _totalWeight;
double totalSoFar = 0;
for (final entry in _allWeightsList) {
if (weightedRandom < totalSoFar + entry.value) {
return entry.key;
}
totalSoFar += entry.value;
}
return _allWeightsList.last.key;
}
}
And this one makes it letter-specific with the added bonus of setting defaults:
class RandomWeightedLetter extends WeightedRandom<String> {
static final _defaultWeights = Map.fromEntries(
List.generate(26, (ind) => MapEntry(String.fromCharCode(ind + 97), 1.0)));
RandomWeightedLetter._(Map<String, double> allWeights): super(allWeights);
factory RandomWeightedLetter(Map<String, double> specialWeights) {
for (final entry in specialWeights.entries) {
assert(entry.key.length == 1 &&
entry.key.codeUnits.first >= 97 &&
entry.key.codeUnits.first <= 122);
}
// Copy the defaults so the shared static map is never mutated.
final allWeights = {..._defaultWeights, ...specialWeights};
return RandomWeightedLetter._(allWeights);
}
}
You can use it in a pretty simple way, i.e.:
void main() {
final random = RandomWeightedLetter({'f': 26});
final counts = Map.fromEntries(
List.generate(26, (ind) => MapEntry(String.fromCharCode(ind + 97), 0)));
const rounds = 100000;
for (int i = 0; i < rounds; ++i) {
final randomLetter = random.getNext();
counts[randomLetter] = counts[randomLetter]! + 1;
}
print(counts.map((key, value) => MapEntry(key, value / (rounds / random._totalWeight))));
}
(which prints out something like this, showing that the distribution works):
{a: 0.95421, b: 0.99144, c: 0.9894000000000001, d: 0.98634, e: 0.99297, f: 26.18646, g: 0.9679800000000001, h: 1.03479, i: 0.9741, j: 0.9945, k: 1.02408, l: 0.9639, m: 0.98481, n: 0.9537, o: 1.0098, p: 0.99093, q: 1.00827, r: 0.97971, s: 1.0251000000000001, t: 1.02204, u: 0.97104, v: 1.01286, w: 0.98634, x: 0.94911, y: 1.04142, z: 1.0047}
Then, to do exactly what you want, you could simply do this:
final random = RandomWeightedLetter(...);
final randomList = List.generate(5, (_) => random.getNext());
Note that this isn't particularly optimized. It iterates through the weights on every call to go from the random double to the letter, but this is probably good enough for a fairly small set of potential values. If you were going to have a ton of potential values, you'd want to do something smarter: a simple way would be to pre-calculate the 'buckets' (the cumulative upper bound for each potential value) once, and then use a binary search or something like a balanced tree to find which bucket the random value falls into.
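As a rough illustration of the pre-calculated buckets idea, here is a minimal sketch. It is written in Python rather than Dart, and the WeightedPicker class and its pick method are hypothetical names made up for this example, not part of the answer's code. It stores the running totals once and uses a binary search instead of a linear scan:

```python
import bisect
import itertools
import random


class WeightedPicker:
    """Picks keys with probability proportional to their weight (illustrative)."""

    def __init__(self, weights, rng=None):
        self._keys = list(weights)
        # Running totals: _cumulative[i] is the upper bound of key i's bucket.
        self._cumulative = list(
            itertools.accumulate(weights[key] for key in self._keys)
        )
        self._total = self._cumulative[-1]
        self._rng = rng or random.Random()

    def pick(self):
        target = self._rng.random() * self._total
        # Binary search for the first bucket whose upper bound exceeds target.
        index = bisect.bisect_right(self._cumulative, target)
        # Guard against floating-point rounding at the very top of the range.
        return self._keys[min(index, len(self._keys) - 1)]


picker = WeightedPicker({"a": 1.0, "b": 0.8, "e": 5.0}, random.Random(42))
hand = [picker.pick() for _ in range(5)]
print(hand)
```

Each pick is then O(log n) instead of O(n). In Python specifically, random.choices(population, weights=..., k=5) from the standard library covers the same use case without a custom class.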


AssemblyScript - Linear Nested Class Layout

I'm working on a linear data layout where components are alongside each other in memory. Things were going ok until I realized I don't have a way for making offsetof and changetype calls when dealing with nested classes.
For instance, this works as intended:
class Vec2{
x:u8
y:u8
}
const size = offsetof<Vec2>() // 2 -- ok
const ptr = heap.alloc(size)
changetype<Vec2>(ptr).x = 7 // memory = [7,0] -- ok
Naturally this approach fails when nesting classes
class Player{
position:Vec2
health:u8
}
const size = offsetof<Player>() //5 -- want 3, position is a pointer
const ptr = heap.alloc(size)
changetype<Player>(ptr).position.x = 7 //[0,0,0,0,0] -- want [7,0,0], instead accidentally changed pointer 0
The goal is for the memory layout to look like this:
| Player 1 | Player 2 | ...
| x y z h | x y z h |
Ideally I'd love to be able to create 'value-type' fields, or if this isn't a thing, are there alternative approaches?
I'm hoping to avoid extensive boilerplate whenever writing a new component, i.e. manual size calculation and doing a changetype for each field at its offset, etc.
In case anybody is interested I'll post my current solution here. The implementation is a little messy but is certainly automatable using custom scripts or compiler transforms.
Goal: Create a linear proxy for the following class so that the main function behaves as expected:
class Foo {
position: Vec2
health: u8
}
export function main(): void {
const ptr = heap.alloc(FooProxy.size)
const foo = changetype<FooProxy>(ptr)
foo.health = 3
foo.position.x = 9
foo.position.y = 10
}
Solution: calculate offsets and alignments for each field.
class TypeMetadataBase{
get align():u32{return 0}
get offset():u32{return 0}
}
class TypeMetadata<T> extends TypeMetadataBase{
get align():u32{return alignof<T>()}
get offset():u32{return offsetof<T>()}
constructor(){
super()
if(this.offset == 0)
throw new Error('offset shouldnt be zero, for primitive types use PrimitiveMetadata')
}
};
class PrimitiveMetadata<T> extends TypeMetadataBase{
get align():u32{return sizeof<T>()}
get offset():u32{return sizeof<T>()}
};
class LinearSchema{
metadatas:StaticArray<TypeMetadataBase>
size:u32
offsets:StaticArray<u32>
constructor(metadatas:StaticArray<TypeMetadataBase>){
let align:u32 = 0
const offsets = new StaticArray<u32>(metadatas.length)
for (let i = 0; i < metadatas.length; i++){
if(metadatas[i].align !== 0)
while(align % metadatas[i].align !== 0)
align++
offsets[i] = align
align += metadatas[i].offset
}
this.offsets = offsets
this.metadatas = metadatas
this.size = align
}
}
class Vec2 {
x: u8
y: u8
}
class FooSchema extends LinearSchema{
constructor(){
super([
new PrimitiveMetadata<u8>(),
new TypeMetadata<Vec2>(),
])
}
}
const schema = new FooSchema()
class FooProxy{
static get size():u32{return schema.size}
set health(value:u8){store<u8>(changetype<usize>(this) + schema.offsets[0],value)}
get health():u8{return load<u8>(changetype<usize>(this) + schema.offsets[0])}
get position():Vec2{return changetype<Vec2>(changetype<usize>(this) + schema.offsets[1])}
}
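The offset calculation in LinearSchema is easier to follow in isolation. The following is only a sketch in Python, not AssemblyScript; the layout function and the (size, align) pairs are assumptions chosen to mirror the Foo example, where health is a u8 and position is a Vec2 of two u8 fields:

```python
def layout(fields):
    """Return (offsets, total_size) for fields given as (size, align) pairs."""
    offsets = []
    cursor = 0
    for size, align in fields:
        if align > 1:
            # Round the cursor up to the next multiple of the alignment.
            cursor = (cursor + align - 1) // align * align
        offsets.append(cursor)
        cursor += size
    return offsets, cursor


# health: u8 (size 1, align 1), position: Vec2 of two u8 (size 2, align 1)
offsets, size = layout([(1, 1), (2, 1)])
print(offsets, size)  # [0, 1] 3
```

With a 4-byte field after a 1-byte field, the same function pads the second offset up to 4, which is what the while loop in the constructor above does one step at a time.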

Multiple word search using trie in dart

I'm trying to implement trie search in Flutter. And here's the entire trie.dart file.
The idea is something like this: say I have a list of recipe names:
Burger
French Fries
Ice Cream
Garlic Parmesan Butter
Now I need to search using a prefix, so if the user searches for bur it'll show Burger. But if someone writes Garlic Butter I need to return Garlic Parmesan Butter. So, basically, if the search query has multiple words I need to show the correct name.
Here's the part where I get all words with prefix:
List<String> getAllWordsWithPrefix(String prefix) {
StringBuffer fullPrefix = new StringBuffer();
return _getAllWordsWithPrefixHelper(prefix, _head, fullPrefix);
}
List<String> _getAllWordsWithPrefixHelper(
String prefix, _TrieNode node, StringBuffer fullPrefix) {
if (prefix.length == 0) {
String pre = fullPrefix.toString();
return _collect(
new StringBuffer(pre.substring(0, max(pre.length - 1, 0))), node, []);
}
for (_TrieNode child in node.children) {
if ((child.char == prefix.substring(0, 1)) ||
(!_isCaseSensitive &&
child.char.substring(0, child.char.length).toLowerCase() ==
prefix.substring(0, 1).toLowerCase())) {
fullPrefix.write(child.char);
return _getAllWordsWithPrefixHelper(
prefix.substring(1), child, fullPrefix);
}
}
return [];
}
And finally, I'm using the trie search in the following way (I thought this might help in some way):
class Search {
final Map<String, RecipeModel> _map = Map.fromIterable(
Store.instance.getAllRecipes(),
// ignore: non_constant_identifier_names
// key: (recipe) => RecipeModel().recipeName!,
key: (recipe) => recipe.recipeName!,
);
late final Trie trie;
Search() {
// This will be O[n]
trie = Trie.list(_map.keys.toList());
}
late RecipeModel recipe;
RecipeModel returnRecipe(String? suggestion) {
if (suggestion == null) return recipe;
// This will be O(1) instead of O(n) [better]
final RecipeModel? found = _map[suggestion];
return found ?? recipe;
}
List<String> returnSuggestions(String prefix) {
//will return O[W*L] ad-hoc search was O[n^2]
return trie.getAllWordsWithPrefix(prefix);
}
}
First, your TrieNode is using a List, which means you have to do a for-loop O(N) search for each char. It would be faster to use a Map.
(You could also use a 26-element array instead of a map, if you know that you'll only have English letters: [0] = 'A', [1] = 'B', ... [25] = 'Z')
Part I: garlic butter
Now I need to search using a prefix, so if the user searches for bur it'll show Burger. But if someone writes Garlic Butter I need to return Garlic Parmesan Butter. So, basically, if the search query has multiple words I need to show the correct name.
bur is easy to do, as that's what a Trie data structure is for, but the garlic butter one is harder. You probably want to look into how to implement a fuzzy search or "did you mean...?" type algorithm:
Fuzzy search algorithm (approximate string matching algorithm)
However, maybe this is more complex than you want.
Here are some alternatives I thought of:
Option #1
Change your TrieNode to store a String variable stating the "standard" word. So the final TrieNode of garlic parmesan butter and garlic butter would both have an instance variable that stores garlic butter (the main standard/common word you want to use).
Your TrieNode would be like this:
class _TrieNode {
//...other vars...
String? _standardWord; // Nullable: only set on the node of a word's last char.
}
Then your add method would look like this:
String standardWord = 'garlic butter';
trie.addWord('garlic butter', standardWord);
trie.addWord('garlic parmesan butter', standardWord);
Internally, your add method would set the standardWord of the last char's TrieNode to the 2nd param of addWord.
Internally, your find method would return the standardWord, which was set in the last char's TrieNode.
Then you only have to test for garlic butter and don't have to worry about parmesan.
Does that make sense?
Of course, you can also flip this around so that it always returns garlic parmesan butter instead:
String standardWord = 'garlic parmesan butter';
trie.addWord('garlic butter', standardWord);
trie.addWord('garlic parmesan butter', standardWord);
This requires that you know all phrases that it could be in advance, and then you add them to your Trie, with all common phrases pointing to the same standard word.
Option #2
Split your phrase/sentence into words, based on the spaces between words.
So your Trie would look like this:
trie.addWord('garlic');
trie.addWord('parmesan');
trie.addWord('butter');
When you have all of the matching words, you then need to write an algorithm (logic) to piece them together in a meaningful way.
Here's an example:
Set<String> wordsFromTrie = {};
List<String> words = "garlic parmesan butter".split(RegExp(r'\s+'));
for (final word in words) {
// findWord returns null when there is no (unambiguous) match, so skip those.
final found = trie.findWord(word);
if (found != null) wordsFromTrie.add(found);
}
if(wordsFromTrie.contains("garlic") && wordsFromTrie.contains("butter")) {
print("Garlic parmesan butter!");
}
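One possible way to piece the words together is to require that every word of the query is a prefix of some word in the recipe name. The sketch below is in Python and uses a plain linear scan instead of a Trie, purely to show the matching rule; matches and search are hypothetical helper names, and the recipe list is the one from the question:

```python
def matches(query, name):
    """True if every query word is a prefix of some word in name."""
    name_words = name.lower().split()
    return all(
        any(word.startswith(q) for word in name_words)
        for q in query.lower().split()
    )


recipes = ["Burger", "French Fries", "Ice Cream", "Garlic Parmesan Butter"]


def search(query):
    return [recipe for recipe in recipes if matches(query, recipe)]


print(search("bur"))            # ['Burger']
print(search("Garlic Butter"))  # ['Garlic Parmesan Butter']
```

In a Trie-based version, the inner any(...) check would be replaced by a prefix lookup per query word, followed by intersecting the sets of recipes each word belongs to.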
Option #3
Use some type of Regex matching instead of a Trie, so for example, your Regex would be RegExp(r"garlic.*butter",caseSensitive: false) and this would match all garlic butter regardless of what's in the middle.
This will be a bit slower, as you'll have a bunch of if-statements with Regexes. You could make it faster by first testing someString.startsWith('garlic') (after stripping spaces and lower-casing).
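A minimal sketch of that idea, in Python rather than Dart: the query words are escaped and joined with ".*" so that anything may appear between them. query_to_regex is a hypothetical helper name introduced only for this example:

```python
import re


def query_to_regex(query):
    """Build a case-insensitive pattern such as garlic.*butter from a query."""
    parts = [re.escape(word) for word in query.split()]
    return re.compile(".*".join(parts), re.IGNORECASE)


pattern = query_to_regex("garlic butter")
print(pattern.pattern)                                  # garlic.*butter
print(bool(pattern.search("Garlic Parmesan Butter")))   # True
print(bool(pattern.search("Butter Garlic Sauce")))      # False
```

Note that, unlike Option #2, this keeps the word order significant: "butter garlic" would not match "Garlic Parmesan Butter".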
Option #4
Use a combination of #1 and #3.
This would mean you'd have a special RegexTrieNode that you would add.
trie.addWord('garlic',RegExp(r'[^b]+'),'butter');
When you hit RegexTrieNode it would continue to match each char until it stops matching. When it stops matching, you would need to go to the next child, butter, for the rest of the matching.
It's quite complicated, and it won't work for all Regexes, but it's doable.
Part II: bur => burger
Basically, once you reach the end of bur, you need to keep going down (or up?) the Trie for TrieNodes that only have 1 child.
Why only 1 child? Well, if you have burrito and burger, which does bur match? It's ambiguous.
Here's some example code, roughly based on your code. It uses null-safety. If you don't have that enabled, then remove all ? from the code.
import 'dart:math';
void main() {
var trie = Trie();
trie.addWord("burger");
// All results will be "burger".
print(trie.findWord("b"));
print(trie.findWord("bur"));
print(trie.findWord("burger"));
// This will be null, but fixable
// if want to allow longer strings.
print(trie.findWord("burgerme"));
}
class Trie {
TrieNode head = TrieNode();
void addWord(String? word) {
if(word == null || word.isEmpty) {
return;
}
var currentNode = head;
// Rune is a unicode codepoint (unicode char).
word.runes.forEach((rune) {
var childNode = currentNode.children[rune];
if(childNode == null) {
childNode = TrieNode();
currentNode.children[rune] = childNode; // Add to parent.
}
currentNode = childNode; // Next node.
});
// Last node is the last char.
currentNode.endOfWord = true;
}
String? findWord(String partial) {
var word = StringBuffer();
var currentNode = head;
for (final rune in partial.runes) {
var childNode = currentNode.children[rune];
if(childNode == null) {
// Not found. A plain for-loop is used because a return inside a
// forEach callback would only exit the callback, not findWord.
return null;
}
word.writeCharCode(rune);
currentNode = childNode; // Next node.
}
// "burgerme" returns null above, at the first rune with no matching child.
// This logic allows "bur" to match to "burger".
while(!currentNode.endOfWord) {
// Ambiguous: "bur" could match either "burger" or "burrito".
if(currentNode.children.length != 1) {
return null; // Don't know.
}
var onlyChild = currentNode.children.entries.first;
word.writeCharCode(onlyChild.key);
currentNode = onlyChild.value; // Next node.
}
return word.toString();
}
}
class TrieNode {
Map<int,TrieNode> children = {};
bool endOfWord = false;
}

How to shuffle the order of a list from snapshot.docs from a Stream in Firestore [duplicate]

I've looked everywhere on the web (the Dart website, Stack Overflow, forums, etc.) and I can't find an answer.
Here is my problem: I need to write a function, in Dart, that prints a list, provided as an argument, in random order.
I've tried maps, sets, and lists. I've tried assert and sort, and I've looked at the Random class in Dart's math library, but nothing does what I want.
Can someone help me with this?
Here is a draft:
var element03 = query('#exercice03');
var uneliste03 = {'01':'Jean', '02':'Maximilien', '03':'Brigitte', '04':'Sonia', '05':'Jean-Pierre', '06':'Sandra'};
var alluneliste03 = new Map.from(uneliste03);
assert(uneliste03 != alluneliste03);
print(alluneliste03);
var ingredients = new Set();
ingredients.addAll(['Jean', 'Maximilien', 'Brigitte', 'Sonia', 'Jean-Pierre', 'Sandra']);
var alluneliste03 = new Map.from(ingredients);
assert(ingredients != alluneliste03);
//assert(ingredients.length == 4);
print(ingredients);
var fruits = <String>['bananas', 'apples', 'oranges'];
fruits.sort();
print(fruits);
There is a shuffle method in the List class. The method shuffles the list in place. You can call it without an argument or provide a random number generator instance:
var list = ['a', 'b', 'c', 'd'];
list.shuffle();
print('$list');
The collection package comes with a shuffle function/extension that also supports specifying a sub range to shuffle:
void shuffle (
List list,
[int start = 0,
int end]
)
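To make the sub-range behaviour concrete, here is a small sketch of a Fisher-Yates shuffle restricted to the half-open range from start to end. It is written in Python, and shuffle_range is a hypothetical function for illustration only; it is not the collection package's implementation:

```python
import random


def shuffle_range(items, start=0, end=None, rng=None):
    """Shuffle items[start:end] in place and leave the rest untouched."""
    rng = rng or random.Random()
    if end is None:
        end = len(items)
    # Walk backwards through the range, swapping each slot with a random
    # earlier (or same) slot that is still inside the range.
    for i in range(end - 1, start, -1):
        j = rng.randint(start, i)  # randint is inclusive on both ends
        items[i], items[j] = items[j], items[i]
    return items


letters = list("abcdefgh")
shuffle_range(letters, 2, 6, random.Random(7))
print(letters)  # 'a', 'b', 'g', 'h' keep their positions
```

Passing a seeded generator, as done here, makes the result reproducible, which is handy in tests.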
Here is a basic shuffle function. Note that the resulting shuffle is not cryptographically strong. It uses Dart's Random class, which produces pseudorandom data not suitable for cryptographic use.
import 'dart:math';
List shuffle(List items) {
var random = new Random();
// Go through all elements.
for (var i = items.length - 1; i > 0; i--) {
// Pick a pseudorandom number according to the list length
var n = random.nextInt(i + 1);
var temp = items[i];
items[i] = items[n];
items[n] = temp;
}
return items;
}
main() {
var items = ['foo', 'bar', 'baz', 'qux'];
print(shuffle(items));
}
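You can use shuffle() with the cascade operator (two dots), like Vinoth Vino said. Note that the cascade returns the same list, shuffled in place, so mixed and cities refer to the same object.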
List cities = ["Ankara","London","Paris"];
List mixed = cities..shuffle();
print(mixed);
// [London, Paris, Ankara]

How to only emit consistent calculations?

I'm using reactive programming to do a bunch of calculations. Here is a simple example that tracks two numbers and their sum:
static void Main(string[] args) {
BehaviorSubject<int> x = new BehaviorSubject<int>(1);
BehaviorSubject<int> y = new BehaviorSubject<int>(2);
var sum = Observable.CombineLatest(x, y, (num1, num2) => num1 + num2);
Observable
.CombineLatest(x, y, sum, (xx, yy, sumsum) => new { X = xx, Y = yy, Sum = sumsum })
.Subscribe(i => Console.WriteLine($"X:{i.X} Y:{i.Y} Sum:{i.Sum}"));
x.OnNext(3);
Console.ReadLine();
}
This generates the following output:
X:1 Y:2 Sum:3
X:3 Y:2 Sum:3
X:3 Y:2 Sum:5
Notice how the second output result is "incorrect" because it is showing that 3+2=3. I understand why this is happening (x is updated before the sum is updated) but I want my output calculations to be atomic/consistent: no value should be emitted until all dependent calculations are complete. My first approach was this...
Observable.When(sum.And(Observable.CombineLatest(x, y)).Then((s, xy) => new { Sum = s, X = xy[0], Y = xy[1] } ));
This seems to work for my simple example. But my actual code has LOTS of calculated values and I couldn't figure out how to scale it. For example, if there was a sum and squaredSum, I don't know how to wait for each of these to emit something before taking action.
One method that should work (in-theory) is to timestamp all the values I care about, as shown below.
Observable
.CombineLatest(x.Timestamp(), y.Timestamp(), sum.Timestamp(), (xx, yy, sumsum) => new { X = xx, Y = yy, Sum = sumsum })
.Where(i=>i.Sum.Timestamp>i.X.Timestamp && i.Sum.Timestamp>i.Y.Timestamp)
// do the calculation and subscribe
This method could work for very complicated models. All I have to do is ensure that no calculated value is emitted that is older than any core data value. I find this to be a bit of a kludge. It didn't actually work in my console app. When I replaced Timestamp with a custom extension that assigned a sequential int64 it did work.
What is a simple, clean way to handle this kind of thing in general?
=======
I'm making some progress here. This waits for a sum and sumSquared to emit a value before grabbing the data values that triggered the calculation.
var all = Observable.When(sum.And(sumSquared).And(Observable.CombineLatest(x, y)).Then((s, q, data)
=> new { Sum = s, SumSquared = q, X = data[0], Y = data[1] }));
This should do what you want:
Observable.CombineLatest(x, y, sum)
.DistinctUntilChanged(list => list[2])
.Subscribe(list => Console.WriteLine("{0}+{1}={2}", list[0], list[1], list[2]));
It waits until the sum has been updated, which means that all its sources must have been updated too.
Your problem isn't that x is updated before the sum is updated per se. It's really about the way that you've constructed your query.
You've effectively created two queries: Observable.CombineLatest(x, y, (num1, num2) => num1 + num2) & Observable.CombineLatest(x, y, sum, (xx, yy, sumsum) => new { X = xx, Y = yy, Sum = sumsum }). Since each of them subscribes to x, you've created two subscriptions, meaning that when x updates, two lots of updates occur.
You need to avoid creating two subscriptions.
If you write your code like this:
BehaviorSubject<int> x = new BehaviorSubject<int>(1);
BehaviorSubject<int> y = new BehaviorSubject<int>(2);
Observable
.CombineLatest(x, y, (num1, num2) => new
{
X = num1,
Y = num2,
Sum = num1 + num2
})
.Subscribe(i => Console.WriteLine($"X:{i.X} Y:{i.Y} Sum:{i.Sum}"));
x.OnNext(3);
...then you correctly get this output:
X:1 Y:2 Sum:3
X:3 Y:2 Sum:5
I've started to get my head around this some more. Here is a more detailed example of what I'm trying to accomplish. This is some code that validates a first and last name, and should only generate a whole name when both parts are valid. As you can see I'm trying to use a bunch of small independently defined functions, like "firstIsValid", and then compose them together to calculate something more complex.
It seems like the challenge I'm facing here is trying to correlate inputs and outputs in my functions. For example, "firstIsValid" generates an output that says some first name was valid, but doesn't tell you which one. In option 2 below, I'm able to correlate them using Zip.
This strategy won't work if a validation function does not generate one output for each input. For example, if the user is typing web addresses and we're trying to validate them on the web, maybe we'd do a Throttle and/or Switch. There might be 10 web addresses for a single "webAddressIsValid". In that situation, I think I have to include the output with the input. Maybe have an IObservable of (string, bool) pairs, where the string is the web address and the bool is whether it is valid or not.
static void Main(string[] args) {
var first = new BehaviorSubject<string>(null);
var last = new BehaviorSubject<string>(null);
var firstIsValid = first.Select(i => string.IsNullOrEmpty(i) || i.Length < 3 ? false : true);
var lastIsValid = last.Select(i => string.IsNullOrEmpty(i) || i.Length < 3 ? false : true);
// OPTION 1 : Does not work
// Output: bob smith, bob, bob roberts, roberts
// firstIsValid and lastIsValid are not in sync with first and last
//var whole = Observable
// .CombineLatest(first, firstIsValid, last, lastIsValid, (f, fv, l, lv) => new {
// First = f,
// Last = l,
// FirstIsValid = fv,
// LastIsValid = lv
// })
// .Where(i => i.FirstIsValid && i.LastIsValid)
// .Select(i => $"{i.First} {i.Last}");
// OPTION 2 : Works as long as every change in a core data value generates one calculated value
// Output: bob smith, bob roberts
var firstValidity = Observable.Zip(first, firstIsValid, (f, fv) => new { Name = f, IsValid = fv });
var lastValidity = Observable.Zip(last, lastIsValid, (l, lv) => new { Name = l, IsValid = lv });
var whole =
Observable.CombineLatest(firstValidity, lastValidity, (f, l) => new { First = f, Last = l })
.Where(i => i.First.IsValid && i.Last.IsValid)
.Select(i => $"{i.First.Name} {i.Last.Name}");
whole.Subscribe(i => Console.WriteLine(i));
first.OnNext("bob");
last.OnNext("smith");
last.OnNext(null);
last.OnNext("roberts");
first.OnNext(null);
Console.ReadLine();
}
Another approach here. Each value gets a version number (like a timestamp). Any time a calculated value is older than the data (or other calculated values it relies upon) we can ignore it.
public class VersionedValue {
static long _version;
public VersionedValue() { Version = Interlocked.Increment(ref _version); }
public long Version { get; }
}
public class VersionedValue<T> : VersionedValue {
public VersionedValue(T value) { Value = value; }
public T Value { get; }
public override string ToString() => $"{Value} {Version}";
}
public static class ExtensionMethods {
public static IObservable<VersionedValue<T>> Versioned<T>(this IObservable<T> values) => values.Select(i => new VersionedValue<T>(i));
public static VersionedValue<T> AsVersionedValue<T>(this T obj) => new VersionedValue<T>(obj);
}
static void Main(string[] args) {
// same as before
//
var whole = Observable
.CombineLatest(first.Versioned(), firstIsValid.Versioned(), last.Versioned(), lastIsValid.Versioned(), (f, fv, l, lv) => new {
First = f,
Last = l,
FirstIsValid = fv,
LastIsValid = lv
})
.Where(i => i.FirstIsValid.Version > i.First.Version && i.LastIsValid.Version > i.Last.Version)
.Where(i => i.FirstIsValid.Value && i.LastIsValid.Value)
.Select(i => $"{i.First.Value} {i.Last.Value}");
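The version-number rule itself can be shown without Rx. The following Python sketch only illustrates the filter condition; Versioned and is_consistent are hypothetical names, and a simple global counter stands in for Interlocked.Increment:

```python
import itertools

# Global, monotonically increasing version counter (single-threaded sketch).
_versions = itertools.count(1)


class Versioned:
    """Wraps a value with the version at which it was produced."""

    def __init__(self, value):
        self.value = value
        self.version = next(_versions)


def is_consistent(derived, *sources):
    """A derived value is usable only if it is newer than all its sources."""
    return all(derived.version > source.version for source in sources)


x, y = Versioned(1), Versioned(2)
total = Versioned(x.value + y.value)
print(is_consistent(total, x, y))  # True

x = Versioned(3)                   # x changes; total is now stale
print(is_consistent(total, x, y))  # False

total = Versioned(x.value + y.value)
print(is_consistent(total, x, y))  # True
```

The stale combination (x = 3 with the old sum) is exactly the "3+2=3" case from the first example, and it is the one the version check filters out.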

Extract numbers and store them in variable in Scala and Spark

I have a file like below:
0; best wrap ear market pair pair break make
1; time sennheiser product better earphone fit
1; recommend headphone pretty decent full sound earbud design
0; originally buy work gym work well robust sound quality good clip
1; terrific sound great fit toss mine profuse sweater headphone
0; negative experienced sit chair back touch chair earplug displace hurt
...
I want to extract the number at the start of each line and store it in a variable for each document. I've tried:
var grouped_with_wt = data.flatMap({ (line) =>
val words = line.split(";").split(" ")
words.map(w => {
val a =
(line.hashCode(),(vocab_lookup.value(w), a))
})
}).groupByKey()
The expected output is:
(1453543,(best,0),(wrap,0),(ear,0),(market,0),(pair,0),(break,0),(make,0))
(3942334,(time,1),(sennheiser,1),(product,1),(better,1),(earphone,1),(fit,1))
...
After generating the above results, I use them in this code to generate the final results:
val Beta = DenseMatrix.zeros[Int](V, S)
val Beta_c = grouped_with_wt.flatMap(kv => {
kv._2.map(wt => {
Beta(wt._1,wt._2) +=1
})
})
final results:
1 0
1 0
1 0
1 0
...
This code doesn't work. Can anybody help me? I'd like code along the lines of the above.
val inputRDD = sc.textFile("input dir ")
val outRDD = inputRDD.map(r => {
val tuple = r.split(";")
val key = tuple(0)
val words = tuple(1).trim().split(" ")
val outArr = words.map(w => {
new Tuple2(w,key)
})
(r.hashCode, outArr.mkString(","))
})
outRDD.saveAsTextFile("output dir")
output
(-1704185638,(best,0),(wrap,0),(ear,0),(market,0),(pair,0),(pair,0),(break,0),(make,0))
(147969209,(time,5),(sennheiser,5),(product,5),(better,5),(earphone,5),(fit,5))
(1145947974,(recommend,1),(headphone,1),(pretty,1),(decent,1),(full,1),(sound,1),(earbud,1),(design,1))
(838871770,(originally,4),(buy,4),(work,4),(gym,4),(work,4),(well,4),(robust,4),(sound,4),(quality,4),(good,4),(clip,4))
(934228708,(terrific,5),(sound,5),(great,5),(fit,5),(toss,5),(mine,5),(profuse,5),(sweater,5),(headphone,5))
(659513416,(negative,-3),(experienced,-3),(sit,-3),(chair,-3),(back,-3),(touch,-3),(chair,-3),(earplug,-3),(displace,-3),(hurt,-3))
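The per-line transformation can also be looked at outside Spark. Below is a small Python sketch of the same split-and-pair step; parse_line is a hypothetical helper, and zlib.crc32 is used as a stand-in document id only because it is stable across runs (it will not produce the same numbers as Scala's hashCode):

```python
import zlib


def parse_line(line):
    """Split 'label; word word ...' into (doc_id, [(word, label), ...])."""
    label_text, _, text = line.partition(";")
    label = int(label_text.strip())
    # Pair every word of the document with the document's label.
    pairs = [(word, label) for word in text.split()]
    doc_id = zlib.crc32(line.encode("utf-8"))
    return doc_id, pairs


doc_id, pairs = parse_line("1; time sennheiser product better earphone fit")
print(pairs)
# [('time', 1), ('sennheiser', 1), ('product', 1), ('better', 1), ('earphone', 1), ('fit', 1)]
```

Keeping the label as an integer rather than a string makes it directly usable as a column index in the Beta matrix from the question.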