Solving Project Euler 25 in a functional manner using Scala

I am currently using Project Euler to learn Scala.
I am stuck on problem 25 with a java.lang.OutOfMemoryError exception.
Here is the question:
What is the index of the first term in the Fibonacci sequence to contain 1000 digits?
What I have come up with:
def fibonacciIndex(numOfDigits: Int): Option[Int] = {
  lazy val fibs: LazyList[Int] = 0 #:: fibs.scanLeft(1)(_ + _)
  fibs.find(_.toString.length == numOfDigits)
}
I am trying to do this in a purely functional way.
Has anyone got some suggestions to improve the memory usage?
Thanks in advance!
EDIT: I was overflowing the Int type. Solved like so:
def fibonacciIndex(numOfDigits: Int): Int = {
  lazy val bigFibs: LazyList[BigInt] = BigInt(0) #:: bigFibs.scanLeft(BigInt(1))(_ + _)
  bigFibs.indexWhere(_.toString.length == numOfDigits)
}

Let's run your program in parts.
lazy val fibs: LazyList[Int] = 0 #:: fibs.scanLeft(1)(_ + _)
fibs.take(100).foreach(println)
It prints:
0
1
1
2
3
5
8
13
21
34
55
89
144
...
433494437
701408733
1134903170
1836311903
-1323752223
512559680
-811192543
...
As you can see, the Int type overflows, so a value never has more than 10 digits (11 characters counting the - sign) and the predicate can never match 1000 digits; the program just keeps computing the next value forever.
Additionally, LazyList memoizes: it is lazy, but it remembers every value it has computed. So each new number computed here
fibs.find(_.toString.length == numOfDigits)
is also retained. It continues until you run out of memory.
To fix that, replace LazyList with e.g. an Iterator or Iterable, and store the numbers as BigInts.
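A minimal sketch of that approach: an Iterator produces each value on demand and retains nothing, so memory use stays constant no matter how far you iterate.
def fibonacciIndex(numOfDigits: Int): Int =
  Iterator
    .iterate((BigInt(0), BigInt(1))) { case (a, b) => (b, a + b) } // successive Fibonacci pairs
    .map(_._1)                                                     // 0, 1, 1, 2, 3, 5, ...
    .indexWhere(_.toString.length == numOfDigits)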

Related

Pyspark Cosine similarity Invalid argument, not a string or column

I am trying to calculate the cosine distance between a title column and a headline column using a pre-trained BERT model, as shown below.
title: Dance Gavin Dance bass player Tim Feerick dead at 34
headline: Prince Harry and Meghan Markle make secret visit to see Queen ahead of Invictus Games
title_array: ["Dance Gavin Dance bass player Tim Feerick dead at 34"]
headline_array: ["Prince Harry and Meghan Markle make secret visit to see Queen ahead of Invictus Games"]
arrayed: ["Dance Gavin Dance bass player Tim Feerick dead at 34", "Prince Harry and Meghan Markle make secret visit to see Queen ahead of Invictus Games"]
from sentence_transformers import SentenceTransformer
import numpy as np
from pyspark.sql.types import FloatType
import pyspark.sql.functions as f
from pyspark.sql.functions import udf

# downloading bert
model = SentenceTransformer('bert-base-nli-mean-tokens')

@udf(FloatType())
def cosine_similarity(sentence_embeddings, ind_a, ind_b):
    s = sentence_embeddings
    return np.dot(s[ind_a], s[ind_b]) / (np.linalg.norm(s[ind_a]) * np.linalg.norm(s[ind_b]))

# udf_bert = udf(cosine_similarity, FloatType())
'''
s0 = "our president is a good leader he will not fail"
s1 = "our president is not a good leader he will fail"
s2 = "our president is a good leader"
s3 = "our president will succeed"
sentences = [s0, s1, s2, s3]
sentence_embeddings = model.encode(sentences)
s = sentence_embeddings
print(f"{s0} <--> {s1}: {cosine_similarity(sentence_embeddings, 0, 1)}")
print(f"{s0} <--> {s2}: {cosine_similarity(sentence_embeddings, 0, 2)}")
print(f"{s0} <--> {s3}: {cosine_similarity(sentence_embeddings, 0, 3)}")
'''
test_df = test_df.withColumn("Similarities", cosine_similarity(model.encode(test_df.arrayed), 0, 1))
As the commented-out example shows, the function takes the embeddings of an array of strings and computes the cosine distance between two of them.
When I run the function directly on the sample texts, it works. But when I register it as a UDF and call it with my dataframe, I get the error below:
TypeError Traceback (most recent call last)
<command-757165186581086> in <module>
26 '''''
27
---> 28 test_df = test_df.withColumn("Similarities", f.lit(cosine_similarity(model.encode(test_df.arrayed), 0, 1)))
/databricks/spark/python/pyspark/sql/udf.py in wrapper(*args)
197 @functools.wraps(self.func, assigned=assignments)
198 def wrapper(*args):
--> 199 return self(*args)
200
201 wrapper.__name__ = self._name
/databricks/spark/python/pyspark/sql/udf.py in __call__(self, *cols)
177 judf = self._judf
178 sc = SparkContext._active_spark_context
--> 179 return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
180
181 # This function is for improving the online help system in the interactive interpreter.
/databricks/spark/python/pyspark/sql/column.py in _to_seq(sc, cols, converter)
60 """
61 if converter:
---> 62 cols = [converter(c) for c in cols]
63 return sc._jvm.PythonUtils.toSeq(cols)
64
/databricks/spark/python/pyspark/sql/column.py in <listcomp>(.0)
60 """
61 if converter:
---> 62 cols = [converter(c) for c in cols]
63 return sc._jvm.PythonUtils.toSeq(cols)
64
/databricks/spark/python/pyspark/sql/column.py in _to_java_column(col)
44 jcol = _create_column_from_name(col)
45 else:
---> 46 raise TypeError(
47 "Invalid argument, not a string or column: "
48 "{0} of type {1}. "
TypeError: Invalid argument, not a string or column: [-0.29246375 0.02216947 0.610355 -0.02230968 0.61386955 0.15291359]
The input of a UDF is a Column or a column name; that's why Spark is complaining Invalid argument, not a string or column: [-0.29246375 0.02216947 0.610355 -0.02230968 0.61386955 0.15291359]. Here model.encode runs on the driver, and its NumPy result is passed into the UDF call. You need to pass only arrayed as a column, refer to the model inside your UDF, and wrap the scalar indices with f.lit so they become Columns too. Something like this:
@udf(FloatType())
def cosine_similarity(arrayed, ind_a, ind_b):
    from sentence_transformers import SentenceTransformer
    import numpy as np
    model = SentenceTransformer('bert-base-nli-mean-tokens')
    s = model.encode(arrayed)
    return float(np.dot(s[ind_a], s[ind_b]) / (np.linalg.norm(s[ind_a]) * np.linalg.norm(s[ind_b])))

test_df = test_df.withColumn("Similarities", cosine_similarity(test_df.arrayed, f.lit(0), f.lit(1)))
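Note that this sketch instantiates the model inside the UDF, so it can be loaded repeatedly on the executors and will be slow at scale; caching the model per worker or switching to a pandas UDF that encodes a whole batch at a time are common ways to amortize that cost.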

Is it possible to disable JVM JIT loop optimisation

I have some pretty trivial Scala code:
def main(): Int = {
  var i: Int = 0
  var limit = 0
  while (limit < 1000000000) {
    i = inc(i)
    limit = limit + 1
  }
  i
}

def inc(i: Int): Int = i + 1
I am playing with JVM JIT method inlining of the inc method. With inlining enabled I get surprisingly good numbers (2 s vs 4 ns), and what I would like to make sure of, or at least validate, is that no loop optimisation took place in the meantime. I took a look at the machine code, which seems OK:
0x000000010b22a4d6: mov 0x8(%rbp),%r10d ; implicit exception: dispatches to 0x000000010b22a529
0x000000010b22a4da: cmp $0xf8033d43,%r10d ; {metadata('IncWhile$')}
0x000000010b22a4e1: jne L0001 ;*iload_3
; - IncWhile$::main#4 (line 7)
0x000000010b22a4e3: cmp $0x3b9aca00,%ebx
0x000000010b22a4e9: jge L0000 ;*if_icmpge
; - IncWhile$::main#7 (line 7)
0x000000010b22a4eb: sub %ebx,%r14d
0x000000010b22a4ee: add $0x3b9aca00,%r14d ;*iadd
; - IncWhile$::inc#2 (line 16)
; - IncWhile$::main#12 (line 9)
0x000000010b22a4f5: mov $0x3b9aca00,%ebx ;*if_icmpge
; - IncWhile$::main#7 (line 7)
L0000: mov $0xffffff65,%esi
I also checked Flight Recorder and didn't find anything suspicious, but as I am not a regular user I would like to double-check with someone more experienced.
The code can be found on GitHub.
Of course, the loop has been optimized. No regular CPU can execute 1 billion operations in just a few nanoseconds.
There are many loop optimizations in HotSpot - do you want to disable all of them? E.g.
-XX:LoopUnrollLimit=0
-XX:+UseCountedLoopSafepoints
-XX:-UseLoopPredicate
-XX:-PartialPeelLoop
-XX:-LoopUnswitching
-XX:-LoopLimitCheck
etc. To disable most loop optimizations, use
-XX:LoopOptsCount=0
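For example, one way to compare runs (a sketch, assuming the scala runner and the IncWhile object from the question; -J passes a flag through to the underlying JVM):
scala -J-XX:LoopOptsCount=0 IncWhile
Timing this against a run without the flag shows how much of the speed-up comes from loop optimizations rather than from inlining alone.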

Spark: All RDD data not getting saved to Cassandra table

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 get stored in the Cassandra table.
Below is the Code snippet:
val states = sc.textFile("state.txt") // list of all the 50 states of the USA
var n = 0
val statesRDD = states.map { a =>
  n = n + 1
  (n, a)
}
scala> statesRDD.count
res2: Long = 50
cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);
statesRDD.saveToCassandra("brs","state", SomeColumns("state_id","state_name"))
// this statement saves only 28 rows out of 50, not sure why!!!!
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me in finding where the issue is?
Edit:
I understood why only 28 rows get stored in Cassandra: I made the first column the PRIMARY KEY, and in my code n is incremented up to 28 in one task while the other task starts again from 1 and goes up to 22 (50 rows in total). Rows with duplicate keys overwrite each other, so only the 28 distinct keys survive.
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map { a =>
  n += 1
  (n, a)
}
I tried making n an accumulator variable as well (viz. val n = sc.accumulator(0, "Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what causes n not to be updated past 28. Also, in what ways can I create a counter to use when building an RDD?
There are some misconceptions about distributed systems embedded in your question. The real heart of this is "How do I keep a counter in a distributed system?"
The short answer is: you don't. What your original code does is something like this:
Task One {
  var x = 0
  record 1: x = 1
  record 2: x = 2
}

Task Two {
  var x = 0
  record 20: x = 1
  record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
For most use cases the "counter" question can be replaced with "How can I get a unique identifier per record in a distributed system?"
For this, most users end up using a UUID, which can be generated on independent machines with an infinitesimal chance of collision.
If the question is "How can I get a monotonically increasing unique identifier?" then you can use zipWithUniqueId, which does not count consecutively but does generate increasing ids, or zipWithIndex for consecutive indices, as in the sketch below.
If you just want the numbers to start from 1, it's best to do it on the local system.
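A minimal sketch of the zipWithIndex route for the original problem (assuming the same states RDD and table): zipWithIndex assigns every record a consecutive 0-based Long index that is consistent across partitions, at the cost of an extra Spark job to compute partition sizes.
val statesRDD = states.zipWithIndex.map { case (state, idx) =>
  ((idx + 1).toInt, state) // state_id 1..50; toInt matches the int column
}
statesRDD.saveToCassandra("brs", "state", SomeColumns("state_id", "state_name"))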
Edit: Why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition { it => it.foreach(y => x += 1); println(x) }
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
The accumulators combine their state after finishing their tasks, which means you can't use them as a global distributed counter.

Last rune of golang unicode/norm iterator not being read

I'm using the golang.org/x/text/unicode/norm package to iterate over runes in a []byte. I chose this approach because I need to inspect each rune and maintain information about the sequence of runes. The last call to iter.Next() does not read the last rune; it reports 0 bytes read for it.
Here is the code:
package main

import (
    "fmt"
    "unicode/utf8"

    "golang.org/x/text/unicode/norm"
)

func main() {
    var (
        n   int
        r   rune
        it  norm.Iter
        out []byte
    )
    in := []byte(`test`)
    fmt.Printf("%s\n", in)
    fmt.Println(in)
    it.Init(norm.NFD, in)
    for !it.Done() {
        ruf := it.Next()
        r, n = utf8.DecodeRune(ruf)
        fmt.Printf("bytes read: %d. val: %q\n", n, r)
        buf := make([]byte, utf8.RuneLen(r))
        utf8.EncodeRune(buf, r)
        out = norm.NFC.Append(out, buf...)
    }
    fmt.Printf("%s\n", out)
    fmt.Println(out)
}
This produces the following output:
test
[116 101 115 116]
bytes read: 1. val: 't'
bytes read: 1. val: 'e'
bytes read: 1. val: 's'
bytes read: 0. val: '�'
tes�
[116 101 115 239 191 189]
It is possible this is a bug in golang.org/x/text/unicode/norm and its Init() function.
The package's tests and examples that I can see all use InitString. So, as a workaround, if you change:
it.Init(norm.NFD, in)
to:
it.InitString(norm.NFD, `test`)
things will work as expected.
I would suggest opening a bug report, but beware that since this package lives under the "/x" repositories, the Go developers consider it experimental.
(BTW, I used the Go debugger to help track down what's going on, though its usability is far from the kind of debugger I'd like to see.)

How to calculate a Mod b in Casio fx-991ES calculator

Does anyone know how to calculate a mod b on the Casio fx-991ES calculator? Thanks.
This calculator does not have a modulo function. However, there is quite a simple way to compute a modulo using the display mode ab/c (instead of the traditional d/c).
How to switch display mode to ab/c:
Go to settings (Shift + Mode).
Press arrow down (to view more settings).
Select ab/c (number 1).
Now do your calculation (in COMP mode), like 50 / 3, and you will see 16 2/3; the numerator 2 is the mod. Or try 54 / 7, which gives 7 5/7 (mod is 5).
If you don't see any fraction, then the mod is 0, as in 50 / 5 = 10 (mod is 0).
Beware that the fractional part is shown in reduced form, so 60 / 8 results in 7 1/2. The remainder is 1/2, which is 4/8, so the mod is 4.
EDIT:
As @lawal correctly pointed out, this method is a little bit tricky for negative numbers, because the sign of the result is negative.
For example, -121 / 26 = -4 17/26, so the mod is -17, which is +9 in mod 26. Alternatively, you can add the modulo base to the computation for negative numbers: -121 / 26 + 26 = 21 9/26 (mod is 9).
EDIT2: As @simpatico pointed out, this method will not work for numbers that exceed the calculator's precision. If you want to compute, say, 200^5 mod 391, some tricks from algebra are needed. For example, using the rule
(A * B) mod C = ((A mod C) * B) mod C, we can write:
200^5 mod 391 = (200^3 * 200^2) mod 391 = ((200^3 mod 391) * 200^2) mod 391 = 98
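The intermediate values now stay within the calculator's precision: 200^3 mod 391 = 140 (since 8,000,000 = 391 × 20,460 + 140), and then (140 × 200^2) mod 391 = 5,600,000 mod 391 = 98 (since 5,600,000 = 391 × 14,322 + 98).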
As far as I know, that calculator does not offer a mod function.
You can, however, compute it by hand in a fairly straightforward manner.
Ex.
(1) 50 mod 3
(2) 50 / 3 = 16.66666667
(3) 16.66666667 - 16 = 0.66666667
(4) 0.66666667 * 3 = 2
Therefore 50 mod 3 = 2
Things to note:
On line (3), we got the "minus 16" by taking the result from line (2) and ignoring everything after the decimal point. The 3 in line (4) is the same 3 as in line (1).
Hope that helped.
Edit
As a result of some trials you may get x.99991, which you should then round up to x+1.
You need the ÷R key: 10 ÷R 3 = 1.
This will display both the remainder and the quotient.
There is a switch a^b/c.
If you want to calculate 491 mod 12, enter 491, press a^b/c, then enter 12. You will get 40, 11, 12. The middle value is the answer, i.e. 11.
Similarly, to calculate 41 mod 12, enter 41 a^b/c 12. You will get 3, 5, 12, and the answer is 5 (the middle one). The mod is always the middle value.
You can calculate A mod B (for positive numbers) using this:
Pol( -Rec( 1/2πr , 2πr × A/B ) , Y ) ( πr - Y ) B
Then press [CALC], and enter your values for A and B, and any value for Y.
/ indicates using the fraction key, and r means radians ( [SHIFT] [Ans] [2] )
Type the normal division first and then press Shift + S<->D.
Here's how I usually do it. For example, to calculate 1717 mod 2:
Take 1717 / 2. The answer is 858.5.
Now take 858 and multiply it by the mod base (2) to get 1716.
Finally, subtract that from the original number: 1717 - 1716 = 1.
So 1717 mod 2 is 1.
To sum this up, all you have to do is multiply the digits before the decimal point by the mod base, then subtract the result from the original number.
Note: Math error means a mod m = 0
It all falls back to the definition of modulus: it is the remainder. For example, 7 mod 3 = 1,
because 7 = 3(2) + 1, in which 1 is the remainder.
To do this process on a simple calculator, do the following:
Take the dividend (7) and divide it by the divisor (3); note the answer and discard all the decimals (7 / 3 = 2.3333333, so only worry about the 2). Now multiply this number by the divisor (3) and subtract the result from the original dividend.
So 2 * 3 = 6, and 7 - 6 = 1, thus 1 is 7 mod 3.
Calculate x / y (your actual numbers here) and press the a b/c key, which is the third one below the Shift key.
Simply divide the numbers; the calculator gives you both the decimal form and the mixed-fraction form (swap between them using S<->D).
For example: 11 / 3 gives you 3.666667 and 3 2/3.
Here the 2 from 2/3 is your mod value.
Similarly, 89 / 6 gives you 14.833333 and 14 5/6.
Here the 5 from 5/6 is your mod value.