Problem:
Suppose I have a text file containing data like
TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT
And I want to find occurrences of character 'A', 'T', 'AAA' , etc. in it.
My Approach
val source = scala.io.Source.fromFile(filePath)
val lines = source.getLines().filter(char => char != '\n')
for (line <- lines) {
val aList = line.filter(ele => ele == 'A')
println(aList)
}
This will give me output like
AAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAA
My Question
How can I find the total count of occurrences of 'A', 'T', 'AAA', etc. here? Can I use map/reduce functions for that? How?
There is even a shorter way:
lines.map(_.count(_ == 'A')).sum
This counts all A of each line, and sums up the result.
By the way there is no filter needed here:
val lines = source.getLines()
And as Leo C mentioned in his comment, if you start with Source.fromFile(filePath) it can be just like this:
source.count(_ == 'A')
As SoleQuantum mentions in his comment, he wants to call count more than once. The problem here is that source is a BufferedSource, which is not a Collection but just an Iterator, and an Iterator can only be used (iterated) once.
So if you want to use the source more than once, you have to convert it to a Collection first.
Your example:
val stream = Source.fromResource("yourdata").mkString
stream.count(_ == 'A') // 48
stream.count(_ == 'T') // 65
Remark: String is a Collection of Chars.
For more information check: iterators
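The single-use behaviour can be seen in a small sketch. The object name IteratorOnce and the sample data are made up for illustration; the data merely stands in for the file contents.

```scala
object IteratorOnce {
  // Made-up sample data standing in for the file contents.
  val data = "TATTGC\nAAGAGT"

  // An Iterator is consumed by the first traversal ...
  val it = data.iterator
  val firstPass = it.count(_ == 'A')  // 4
  val secondPass = it.count(_ == 'T') // 0, the iterator is already exhausted

  // ... whereas a String (a strict collection of Chars) can be traversed repeatedly.
  val as = data.count(_ == 'A') // 4
  val ts = data.count(_ == 'T') // 4
}
```

Reusing an iterator after a call like count is not something the API supports, so the second pass should be read as a demonstration of the pitfall rather than as a pattern to rely on.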
And here is the solution to get the count for all Chars:
stream.toSeq
.filterNot(_ == '\n') // filter new lines
.groupBy(identity) // group by each char > HashMap(T -> TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT, A -> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, G -> GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG, C -> CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
.view.mapValues(_.length) // count each group
.toMap // Map(T -> 65, A -> 48, G -> 36, C -> 61)
Or as suggested by jwvh:
stream
.filterNot(_ == '\n')
.groupMapReduce(identity)(_ => 1)(_ + _)
This is Scala 2.13, let me know if you have problems with your Scala version.
Ok after the last update of the question:
stream.toSeq
.filterNot(_ == '\n') // filter new lines
.foldLeft(("", Map.empty[String, Int])){case ((a, m), c ) =>
if(a.contains(c))
(a + c, m)
else
(s"$c",
m.updated(a, m.get(a).map(_ + 1).getOrElse(1)))
}._2 // you only want the Map -> HashMap( -> 1, CCCC -> 1, A -> 25, GGG -> 1, AA -> 4, GG -> 3, GGGGG -> 1, AAA -> 5, CCC -> 1, TTTT -> 1, T -> 34, CC -> 9, TTT -> 4, G -> 22, CCCCC -> 1, C -> 31, TT -> 7)
Short explanation:
The solution uses a foldLeft.
The initial value is a pair:
a String that holds the actual characters (none to start)
a Map with the Strings and their count (empty at the start)
We have 2 main cases:
the character is the same as the ones already in the current String.
Just append the character to the current String.
the character is different. Update the Map with the current String; the new character now becomes the current String.
Quite complex, let me know if you need more help.
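As written, the fold above counts the empty starting String once (the " -> 1" entry in the output) and never records the final run, because the Map is only updated when the character changes. A possible variant that handles both ends is sketched below; RunCounts and runCounts are hypothetical names, and like the original it joins runs across line breaks because newlines are dropped first.

```scala
object RunCounts {
  // Counts maximal runs of identical characters,
  // e.g. "AAATTA" gives Map("AAA" -> 1, "TT" -> 1, "A" -> 1).
  def runCounts(s: String): Map[String, Int] = {
    val (lastRun, counts) =
      s.filterNot(_ == '\n').foldLeft(("", Map.empty[String, Int])) {
        case ((run, m), c) =>
          // Same character (or nothing collected yet): extend the current run.
          if (run.isEmpty || run.head == c) (run + c, m)
          // Different character: record the finished run and start a new one.
          else (c.toString, m.updated(run, m.getOrElse(run, 0) + 1))
      }
    // Record the run that was still open when the input ended.
    if (lastRun.isEmpty) counts
    else counts.updated(lastRun, counts.getOrElse(lastRun, 0) + 1)
  }
}
```

For example, RunCounts.runCounts("AAATTA\nAT") yields Map("AAA" -> 1, "TT" -> 1, "AA" -> 1, "T" -> 1).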
Since scala.io.Source.fromFile(filePath) produces a stream of chars, you can use the count(Char => Boolean) function directly on your source object.
val source = scala.io.Source.fromFile(filePath)
val result = source.count(_ == 'A')
You can use the partition method and then just call length on the first part of the result.
val y = x.partition(_ == 'A')._1.length
You can get the count by doing the following:
lines.flatten.filter(_ == 'A').size
In general regular expressions are a very good tool to find sequences of characters in a string.
You can use the r method, defined with an implicit conversion over strings, to turn a string into a pattern, e.g.
val pattern = "AAA".r
Using it is then fairly easy. Assuming your sample input
val input =
"""TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT"""
Counting the number of occurrences of a pattern is straightforward and very readable:
pattern.findAllIn(input).size // returns 4
The iterator returned by regular expression operations can also be used for more complex operations via the matchData method, e.g. printing the index of each match:
pattern. // this code would print the following lines
findAllIn(input). // 98
matchData. // 125
map(_.start). // 131
foreach(println) // 165
You can read more on Regex in Scala on the API docs (here for version 2.13.1)
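One caveat worth keeping in mind: findAllIn reports non-overlapping matches, so a run such as AAAAA counts as a single AAA. If overlapping occurrences should count as well, a sliding window is one option. This is a minimal sketch, with OccurrenceCount and its method names made up for illustration and a non-empty needle assumed:

```scala
object OccurrenceCount {
  // Non-overlapping matches, as findAllIn reports them.
  // Regex.quote makes sure the needle is treated as a literal string.
  def nonOverlapping(needle: String, text: String): Int =
    scala.util.matching.Regex.quote(needle).r.findAllIn(text).size

  // Overlapping matches: compare every window of needle.length characters.
  // Assumes needle is non-empty (sliding requires a positive window size).
  def overlapping(needle: String, text: String): Int =
    text.sliding(needle.length).count(_ == needle)
}
```

With these definitions, OccurrenceCount.nonOverlapping("AAA", "AAAAA") is 1, while OccurrenceCount.overlapping("AAA", "AAAAA") is 3.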
I am trying to compare the members of a date tuple wrapped in IO with those of a plain tuple:
d1 :: (Integer, Int, Int) and d2 :: IO (Integer, Int, Int).
Is it possible to compare these two tuples?
I've tried something like that:
import Data.Time.Clock
import Data.Time.Calendar
import Data.Time.LocalTime
-- helper functions for tuples
x_fst (x,_,_) = x
x_snd (_,x,_) = x
x_trd (_,_,x) = x
getDate :: IO (Integer, Int, Int)
getDate = do
now <- getCurrentTime
tiz <- getCurrentTimeZone
let zoneNow = utcToLocalTime tiz now
let date@(year, month, day) = toGregorian $ localDay zoneNow
return $ date -- here I will return an IO tuple -> IO (Integer, Int, Int)
compareDates :: a -> IO (Integer, Int, Int) -> IO Bool
compareDates d1 d2 = do
let year1 = x_fst d1
let year2 = x_fst d2
let month1 = x_snd d1
let month2 = x_snd d2
let day1 = x_trd d1
let day2 = x_trd d2
return $ (year1 == year2 && month1 == month2 && day1 == day2)
But I get the message that I can't compare an IO tuple with a normal tuple:
Couldn't match expected type `(Integer, Integer, Integer)`
with actual type `IO (Integer, Int, Int)`
In the second argument of `compareDates`, namely `date`
Is there a way around it? I would appreciate any help.
Thanks.
With the help of the comment / chat section I got it to work with the following code:
getDate :: IO Day
getDate = do
now <- getCurrentTime
tz <- getCurrentTimeZone
return . localDay $ utcToLocalTime tz now
main = do
d2 <- getDate
return $ fromGregorian 2019 6 15 == d2
Any hints on how to draw a branching schema in the spirit of the attached image are welcome.
Note that I would like to do it in graphviz for fast editing and future changes.
I made an attempt to imitate the famous git branching strategy from http://nvie.com/posts/a-successful-git-branching-model/ using GraphViz.
This is the original picture:
And this is the result:
The code:
strict digraph g{
rankdir="TB";
nodesep=0.5;
ranksep=0.25;
splines=line;
forcelabels=false;
// general
node [style=filled, color="black",
fontcolor="black", fontname="Consolas", fontsize=8 ];
edge [arrowhead=vee, color="black", penwidth=2];
// branch names
node [fixedsize=false, penwidth=0, fillcolor=none, shape=none, width=0, height=0, margin="0.05"];
subgraph {
rank=sink;
me [label="master", group="master"];
}
subgraph {
rank=sink;
de [label="develop", group="develop"];
}
// tags
node [shape=cds, fixedsize=false, fillcolor="#C6C6C6", penwidth=1, margin="0.11,0.055"]
t1 [label="0.1"]
t2 [label="0.2"]
t3 [label="1.0"]
// graph
node [width=0.2, height=0.2, fixedsize=true, label="", margin="0.11,0.055", shape=circle, penwidth=2, fillcolor="#FF0000"]
// branches
node [group="master", fillcolor="#27E4F9"];
m1;
m2;
m3;
m4;
subgraph {
rank=source;
ms [label="", width=0, height=0, penwidth=0];
}
m1 -> m2 -> m3 -> m4;
ms -> m1 [color="#b0b0b0", style=dashed, arrowhead=none ];
m4 -> me [color="#b0b0b0", style=dashed, arrowhead=none ];
node [group="hotfixes", fillcolor="#FD5965"];
h1;
node [group="release", fillcolor="#52C322"];
r1;
r2;
r3;
r4;
r5;
r1 -> r2 -> r3 -> r4;
node [group="develop", fillcolor="#FFE333"];
d1;
d2;
d3;
d4;
d5;
d6;
d7;
d8;
d9;
d10;
d1 -> d2 -> d3 -> d4 -> d5 -> d6 -> d7 -> d8 -> d9 -> d10;
d10 -> de [color="#b0b0b0", style=dashed, arrowhead=none ];
node [group="feature 1", fillcolor="#FB3DB5"];
fa1;
fa2;
fa3;
fa4;
fa5;
fa6;
subgraph fas1 {
fa1 -> fa2 -> fa3;
}
subgraph fas2 {
fa4 -> fa5 -> fa6;
}
node [group="feature 2", fillcolor="#FB3DB5"];
fb1;
fb2;
fb3;
fb4;
subgraph{ rank=same; fa6; fb4; } // hack
subgraph{ rank=same; fa1; fb1; } // hack
fb1 -> fb2 -> fb3 -> fb4;
// nodes
m1 -> d1;
m1 -> h1;
h1 -> m2;
h1 -> d5;
d3 -> fa1;
fa3 -> d6;
d6 -> r1;
r2 -> d7;
r4 -> d8;
r4 -> m3;
d9 -> r5;
r5 -> m4;
r5 -> d10;
d7 -> fa4;
fa6 -> d9;
d3 -> fb1;
fb4 -> d9;
// tags connections
edge [color="#b0b0b0", style=dotted, len=0.3, arrowhead=none, penwidth=1];
subgraph {
rank="same";
m1 -> t1;
}
subgraph {
rank="same";
m2 -> t2 ;
}
subgraph {
rank="same";
m3 -> t3;
}
}
Hope this helps someone.
This particular diagram was made with Inkscape, so it will be difficult to match it exactly with Graphviz's output.
Here's how you may match some of it with graphviz:
Use a different group attribute for each branch in order to get straight lines for each branch (here's another example of using group, and one using weight)
Define the branches in the right order to have them appear from top to bottom
Use shape, style, width and height to make some nodes stand out, and to hide others
Use some \n newline cheating to have labels on top of the nodes (you may also try labelloc="t", or using xlabel instead of label)
digraph g{
rankdir="LR";
pad=0.5;
nodesep=0.6;
ranksep=0.5;
forcelabels=true;
node [width=0.12, height=0.12, fixedsize=true,
shape=circle, style=filled, color="#909090",
fontcolor="deepskyblue", fontname="Arial bold", fontsize=14 ];
edge [arrowhead=none, color="#909090", penwidth=3];
node [group="release3"];
s3 [label="release 3\n\n", width=0.03, height=0.03, shape=box];
r30 [label=" R3.0\n\n\n"];
e3 [label="", width=0.03, height=0.03, shape=box];
e3f [label="", width=0.03, height=0.03, shape=circle, color="#b0b0b0"];
s3 -> r30 -> e3;
e3 -> e3f [color="#b0b0b0", style=dashed];
node [group="release2"];
s2 [label="release 2\n\n", width=0.03, height=0.03, shape=box];
b2 [label="", width=0.03, height=0.03, shape=box];
r20 [label=" R2.0\n\n\n"];
e2 [label="", width=0.03, height=0.03, shape=box];
e2f [label="", width=0.03, height=0.03, shape=circle, color="#b0b0b0"];
s2 -> b2 -> r20 -> e2;
e2 -> e2f [color="#b0b0b0", style=dashed];
node [group="release1"];
s1 [label="release 1\n\n", width=0.03, height=0.03, shape=box];
ttest [label=" test\n\n\n"];
b1 [label="", width=0.03, height=0.03, shape=box];
r10 [label=" R1.0\n\n\n"];
r11 [label=" R1.1\n\n\n"];
e1 [label="", width=0.03, height=0.03, shape=box];
e1f [label="", width=0.03, height=0.03, shape=circle, color="#b0b0b0"];
s1 -> ttest -> b1 -> r10 -> r11 -> e1;
e1 -> e1f [color="#b0b0b0", style=dashed];
b1 -> s2;
b2 -> s3;
}
I've got a function that creates an Async workflow, and the function takes 10 arguments in curried style, e.g.
let createSequenceCore a b c d e f g h i j =
async {
...
}
I want to create another function to start that workflow, so I've got
let startSequenceCore a b c d e f g h i j =
Async.StartImmediate (createSequenceCore a b c d e f g h i j)
Is there any way I can get rid of those redundant parameters? I tried the << operator, but that only lets me remove one.
let startSequenceCore a b c d e f g h i =
Async.StartImmediate << (createSequenceCore a b c d e f g h i)
(I added Haskell and Scala to this question even though the code itself is F#, as really what I want is just how to do this kind of currying, which would apply to any; I'd think a Haskell or Scala answer would be easily portable to F# and could well be marked as the correct answer).
NOTE Reasonably well showing that there is not an easy solution to this could also get the bounty.
UPDATE geesh I'm not going to give 100 points to an answer that argues with the question rather than answering it, even if it's the highest voted, so here:
I've got a function that creates an Async workflow, and the function takes 4 arguments in curried style, e.g.
let createSequenceCore a b c d =
async {
...
}
I want to create another function to start that workflow, so I've got
let startSequenceCore a b c d =
Async.StartImmediate (createSequenceCore a b c d)
Is there any way I can get rid of those redundant parameters? I tried the << operator, but that only lets me remove one.
let startSequenceCore a b c =
Async.StartImmediate << (createSequenceCore a b c)
10 arguments sounds like too many... How about you'd create a record with 10 properties instead, or maybe a DU where you don't need all 10 in every case? Either way, you'd end up with a single argument that way and normal function composition works as expected again.
EDIT: When you actually need it, you can create a more powerful version of the << and >> operators thusly:
let (<.<) f = (<<) (<<) (<<) f
let (<..<) f = (<<) (<<) (<.<) f
let (<...<) f = (<<) (<<) (<..<) f
let flip f a b = f b a
let (>.>) f = flip (<.<) f
let (>..>) f = flip (<..<) f
let (>...>) f = flip (<...<) f
and then you can just write:
let startSequenceCore =
Async.StartImmediate <...< createSequenceCore
or
let startSequenceCore =
createSequenceCore >...> Async.StartImmediate
P.S.: The argument f is there, so that the type inference infers generic args as opposed to obj.
As already mentioned by @Daniel Fabian, 10 arguments is way too many. In my experience even 5 arguments is too many and the code becomes unreadable and error prone. Having such functions usually signals a bad design. See also Are there guidelines on how many parameters a function should accept?
However, if you insist, it's possible to make it point-free, although I doubt it gains any benefit. I'll give an example in Haskell, but I believe it'd be easy to port to F# as well. The trick is to nest the function composition operator:
data Test = Test
deriving (Show)
createSequenceCore :: Int -> Int -> Int -> Int -> Int
-> Int -> Int -> Int -> Int -> Int -> Test
createSequenceCore a b c d e f g h i j = Test
-- the original version
startSequenceCore :: Int -> Int -> Int -> Int -> Int
-> Int -> Int -> Int -> Int -> Int -> IO ()
startSequenceCore a b c d e f g h i j =
print (createSequenceCore a b c d e f g h i j)
-- and point-free:
startSequenceCore' :: Int -> Int -> Int -> Int -> Int
-> Int -> Int -> Int -> Int -> Int -> IO ()
startSequenceCore' =
(((((((((print .) .) .) .) .) .) .) .) .) . createSequenceCore
Replacing f with (f .) lifts a function to work one argument inside, as we can see by adding parentheses to the type of (.):
(.) :: (b -> c) -> ((a -> b) -> (a -> c))
See also this illuminating blog post by Conal Elliott: Semantic editor combinators
You could tuple the arguments to createSequenceCore:
let createSequenceCore(a, b, c, d, e, f, g, h, i, j) =
async {
...
}
let startSequenceCore =
createSequenceCore >> Async.StartImmediate
I am assuming you just want to write clean code as opposed to allow currying one parameter at a time.
Just write your own composeN function.
let compose4 g f x0 x1 x2 x3 =
g (f x0 x1 x2 x3)
let startSequenceCore =
compose4 Async.StartImmediate createSequenceCore
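Since the question also asks about Scala, the same composeN idea could look roughly like this there. This is only a sketch: ComposeN, compose4, create and start are hypothetical names, and create and the lambda passed to compose4 are simple stand-ins for createSequenceCore and Async.StartImmediate.

```scala
object ComposeN {
  // Hypothetical helper: feed the result of a curried 4-argument function f into g.
  def compose4[A, B, C, D, R, S](g: R => S)(f: A => B => C => D => R): A => B => C => D => S =
    a => b => c => d => g(f(a)(b)(c)(d))

  // Stand-in for createSequenceCore; the real one would build the workflow.
  val create: Int => Int => Int => Int => Int =
    a => b => c => d => a + b + c + d

  // Stand-in for startSequenceCore; the lambda plays the role of Async.StartImmediate.
  val start: Int => Int => Int => Int => String =
    compose4((r: Int) => s"started $r")(create)
}
```

Here ComposeN.start(1)(2)(3)(4) evaluates to "started 10". As with the F# version, one helper is needed per arity, which is the price of not naming the parameters.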
This is mainly a logic problem I guess...
I use this smtlib formula:
(declare-fun a () Bool)
(declare-fun b () Bool)
(declare-fun c () Bool)
(declare-fun d () Bool)
(assert (xor (and a (xor b c)) d))
Which is a term of this structure (in my opinion, at least):
      XOR
     |   |
    AND  d
   |   |
   a  XOR
     |   |
     b   c
My guess: The resultSet would look like this:
{ab, ac, d}
But it is this, using Scala^Z3's ctx.checkAndGetAllModels():
{ab, d, ac, ad, abcd}
Why are ad and abcd in there?
Is it possible to get only the results I would expect?
Using Scala (without Z3) to show that there are, in fact, more solutions to the constraint:
val tf = Seq(true, false)
val allValid =
for(a <- tf; b <- tf; c <- tf; d <- tf;
if((a && (b ^ c)) ^ d)) yield (
(if(a) "a" else "") + (if(b) "b" else "") +
(if(c) "c" else "") + (if(d) "d" else ""))
allValid.mkString("{ ", ", ", " }")
Prints:
{ abcd, ab, ac, ad, bcd, bd, cd, d }
So unless I'm missing something, the question is, why does it not find all solutions? Now here is the answer to that one. (Spoiler alert: "getAllModels" doesn't really get all models.) First, let's reproduce what you observed:
import z3.scala._
val ctx = new Z3Context("MODEL" -> true)
val a = ctx.mkFreshConst("a", ctx.mkBoolSort)
val b = ctx.mkFreshConst("b", ctx.mkBoolSort)
val c = ctx.mkFreshConst("c", ctx.mkBoolSort)
val d = ctx.mkFreshConst("d", ctx.mkBoolSort)
val cstr0 = ctx.mkXor(b, c)
val cstr1 = ctx.mkAnd(a, cstr0)
val cstr2 = ctx.mkXor(cstr1, d)
ctx.assertCnstr(cstr2)
Now, if I run: ctx.checkAndGetAllModels.foreach(println(_)), I get:
d!3 -> false
a!0 -> true
c!2 -> false
b!1 -> true

d!3 -> true // this model is problematic
a!0 -> false

d!3 -> false
a!0 -> true
c!2 -> true
b!1 -> false

d!3 -> true
a!0 -> true
c!2 -> false
b!1 -> false

d!3 -> true
a!0 -> true
c!2 -> true
b!1 -> true
Now, the problem is that the second model is an incomplete model. Z3 can return it because the constraint is satisfied whatever the values of b and c are (b and c are don't-care variables). The current implementation of checkAndGetAllModels simply negates the model to prevent repetition; in this case, it will ask for another model such that (not (and d (not a))) holds. This prevents all other models with these two values from being returned. In a sense, the incomplete model actually represents four valid, completed models.
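To make the "four completed models" concrete, here is a plain Scala sketch (no Z3 involved) that expands a partial assignment into every total assignment agreeing with it. CompleteModels and its members are hypothetical names; the variable list is assumed to be known up front, which is exactly the information the low-level API lacks.

```scala
object CompleteModels {
  // Assumption: the full set of variables is known in advance.
  val vars = Seq("a", "b", "c", "d")

  // The constraint from the question: (a AND (b XOR c)) XOR d
  def holds(m: Map[String, Boolean]): Boolean =
    (m("a") && (m("b") ^ m("c"))) ^ m("d")

  // Expand a partial model into every total model that agrees with it:
  // fixed variables keep their value, don't-care variables take both values.
  def completions(partial: Map[String, Boolean]): Seq[Map[String, Boolean]] =
    vars.foldLeft(Seq(Map.empty[String, Boolean])) { (acc, v) =>
      val choices = partial.get(v).map(Seq(_)).getOrElse(Seq(true, false))
      for (m <- acc; value <- choices) yield m + (v -> value)
    }

  // Same naming scheme as the enumeration above: list the variables that are true.
  def name(m: Map[String, Boolean]): String = vars.filter(m).mkString

  // The incomplete model reported by Z3: d is true, a is false, b and c unspecified.
  val expanded: Seq[String] =
    completions(Map("a" -> false, "d" -> true)).map(name)
  // expanded == Seq("bcd", "bd", "cd", "d")
}
```

Together with ab, ac, ad and abcd, these four names account for all eight solutions found by the brute-force enumeration.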
By the way, what happens if you use the DSL of ScalaZ3 with the findAll function is that all models will be completed with default values when they are incomplete (and before they are used to compute the next one). In the context of the DSL we can do this, because we know the set of variables that appear in the formula. In this case, it's harder to guess how the model should be completed. One option would be for ScalaZ3 to remember which variables were used. A better solution would be for Z3 to have an option to always return values for don't-care variables, or perhaps simply to list all don't-care variables in a model.