Spark: All RDD data not getting saved to Cassandra table - scala

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 get stored in the Cassandra table.
Below is the code snippet:
val states = sc.textFile("state.txt")
// list of all the 50 states of the USA
var n = 0
val statesRDD = states.map { a =>
  n = n + 1
  (n, a)
}
scala> statesRDD.count
res2: Long = 50
cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);
statesRDD.saveToCassandra("brs","state", SomeColumns("state_id","state_name"))
// this statement saves only 28 rows out of 50, not sure why
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me find where the issue is?
Edit:
I understood why only 28 rows are getting stored in Cassandra: I made the first column the PRIMARY KEY, and it looks like in my code n is incremented only up to a maximum of 28 and then starts again from 1 up to 22 (50 in total).
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map { a =>
  n += 1
  (n, a)
}
I tried making n an accumulator variable as well (viz. val n = sc.accumulator(0, "Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
[Stage 2:> (0 + 0) / 2]
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what is causing n to stop updating after the value 28. Also, what are the ways in which I can create a counter that I can use when creating an RDD?

There are some misconceptions about distributed systems embedded in your question. The real heart of it is: "How do I have a counter in a distributed system?"
The short answer is: you don't. What your original code example does is effectively this:
Task One {
  var x = 0
  record 1: x = 1
  record 2: x = 2
}
Task Two {
  var x = 0
  record 20: x = 1
  record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
For most use cases the "counter" question can be replaced with "How can I get a Unique Identifier per Record in a distributed system?"
For this, most users end up using a UUID, which can be generated on independent machines with an infinitesimal chance of collision.
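As a minimal sketch (assuming the states RDD from the question; java.util.UUID needs no coordination between executors):
import java.util.UUID

// each executor generates its ids locally; no shared counter is needed
val statesWithIds = states.map(name => (UUID.randomUUID().toString, name))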
If the question is instead "How can I get a monotonically increasing unique identifier?"
then you can use zipWithIndex, which assigns consecutive indices in RDD order (at the cost of an extra job to count partition sizes), or zipWithUniqueId, which assigns unique but non-consecutive ids without that extra pass.
If you just want the rows numbered, it's best to do it on the local system.
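A minimal sketch of the zip approach, assuming the same states file and the spark-cassandra-connector setup from the question:
// zipWithIndex assigns consecutive 0-based indices in RDD order
// (it runs an extra Spark job first to count the partition sizes)
val states = sc.textFile("states.txt")
val statesRDD = states.zipWithIndex.map { case (name, idx) => (idx.toInt + 1, name) }
statesRDD.saveToCassandra("brs", "state", SomeColumns("state_id", "state_name"))
// zipWithUniqueId skips the counting job, but its ids are unique rather than consecutive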
Edit: Why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition { it => it.foreach(y => x += 1); println(x) }
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
The accumulators combine their state after finishing their tasks, which means you can't use them as a global distributed counter.

Related

altair bar chart for grouped data sorting NOT working

  Reservation_branch_code | ON_ACCOUNT | Rcount
0 1101                    | 170        | 5484
1 1103                    | 101        | 5111
2 1118                    | 1          | 232
3 1121                    | 0          | 27
4 1126                    | 90         | 191
I would like the chart sorted by "Rcount", with "Reservation_branch_code" on the x axis.
The code below gives me a chart that is not sorted by Rcount:
base = alt.Chart(df1).transform_fold(
    ['Rcount', 'ON_ACCOUNT'],
    as_=['column', 'value']
)
bars = base.mark_bar().encode(
    # x='Reservation_branch_code:N',
    x='Reservation_branch_code:O',
    y=alt.Y('value:Q', stack=None),  # stack=None enables layered bars
    color=alt.Color('column:N', scale=alt.Scale(range=["#f50520", "#bab6b7"])),
    tooltip=alt.Tooltip(['ON_ACCOUNT', 'Rcount']),
    # order=alt.Order('value:Q')
)
text = base.mark_text(
    align='center',
    color='black',
    baseline='middle',
    dx=0, dy=-8,  # nudges the text upward so it doesn't sit on top of the bar
).encode(
    x='Reservation_branch_code:N',
    y='value:Q',
    text=alt.Text('value:Q', format='.0f')
)
rule = alt.Chart(df1).mark_rule(color='blue').encode(
    y='mean(Rcount):Q'
)
(bars + text + rule).properties(width=790, height=330)
I sorted the data in the dataframe and used that df in the Altair chart, but the x axis is still not sorted by the Rcount column. Thanks.
You can pass a list with the sort order:
import altair as alt
from vega_datasets import data

source = data.barley()

alt.Chart(source).mark_bar().encode(
    x='sum(yield):Q',
    y=alt.Y('site:N', sort=source['site'].unique().tolist())
)

Strange 64-bit time format from a game, can you recognize it?

I captured this 64-bit time format from a game and am trying to understand it.
I cannot use a date delta because every now and then the value totally changes and even becomes negative, as seen below.
v1:int64=-5990085973098618987; //2021-01-25 13:30:00
v2:int64=-5990085973321147595; //4 mins later
v3:int64=6140958949625363349; //7 mins later
v4:int64=6140958948894898101; //11 mins later
v5:int64=-174740204032730139; //16 mins later
v6:int64=-174740204054383467; //18 mins later
v7:int64=-6490439358095090795; //23 mins later
I tried splitting the 64-bit value into two 32-bit containers to get the low and high parts; still strange values.
I also tried using pdouble(#value)^ to get a float interpretation of the 64-bit data; still strange values.
So I'm kind of running out of ideas; maybe it's some kind of bitfield data, or something else is going on.
hipart: -1394675573 | lopart: 1466441621 | hex: acdef08b|57681f95 | swap: -7701322112560996692
hipart: -1394675573 | lopart: 1243913013 | hex: acdef08b|4a249b35 | swap: 3862721007994330796
hipart: 1429803424 | lopart: -458425451 | hex: 553911a0|e4acfb95 | swap: -7639322244965910187
hipart: 1429803424 | lopart: -1188890699 | hex: 553911a0|b922f7b5 | swap: -5334757052947285675
hipart: -40684875 | lopart: -760849435 | hex: fd9332b5|d2a65be5 | swap: -1919757392230050819
hipart: -40684875 | lopart: -782502763 | hex: fd9332b5|d15bf495 | swap: -7641381711494605827
hipart: -1511173174 | lopart: -1467540587 | hex: a5ed53ca|a8871b95 | swap: -7702413578668347995
Any ideas welcomed, thanks in advance
//mbs
--EDIT: So far, thanks to Martin Rosenau we are able to encode like this:
func mulproc_nfsw(i:int64;key:uint32):int64;
begin
if (blnk i) or (blnk key) then exit;
p:pointer=#i;
hi:uint32=uint32(p+4)^; //30864159 (hex: 01d6f31f)
lo:uint32=uint32(p)^; //748455936 (hex: 2c9c8800)
hi64:int64=hi*key; //0135b55a acdef08b <-- keep
lo64:int64=lo*key; //1d566e0b a65f2800 <-- keep
q:pointer=#result; //-5990085971773806592
uint32(q+4)^:=hi64; //acdef08b
uint32(q)^:=lo64; //a65f2800
end;
func encode_time_nfsw(j:juncture):int64;
begin
if blnk j then exit; //input: '2021-01-25 13:37:07'
key:uint32=$A85A2115; //encode key
ft:int64=j.filetime; //hex: 01d6f31f 2c9c8800
result:=mulproc_nfsw(ft,key);
end;
--EDIT2: Finally, thanks to fpiette we are able to decode also:
func decode_time_nfsw(i:int64):juncture;
begin
if blnk i then exit; //input: -5990085971773806592
key:uint32=$3069263D; //decode key
ft:int64=mulproc_nfsw(i,key);
result.setfiletime(ft);
end;
I checked my suspicion that the high and the low 32 bits are simply multiplied by A85A2115 (hex):
We get a FILETIME structure. Then we perform a 32x32->32 bit multiplication (meaning we throw away the high 32 bits of the 64-bit result) on the high and the low dword independently.
Example:
25 Jan 2021 13:37:07 (and some milliseconds)
Unencrypted FILETIME:
High dword = 1D6F31F (hex)
Low dword = 2C9CA481 (hex)
Multiplication
High dword: 1D6F31F * A85A2115 = 135B55AACDEF08B (hex)
Low dword: 2C9CA481 * A85A2115 = 1D5680CA57681F95 (hex)
Now only take the low 32 bits of the results:
High dword: ACDEF08B (hex)
Low dword: 57681F95 (hex)
Unfortunately, I don't know how to do the "reverse operation"; I did it by searching for the result in a loop with the following pseudo-code:
encryptedValue = 57681F95 (hex)
originalValue = 0
product = 0
while product not equal to encryptedValue
    // 32-bit addition discarding carry:
    product = product + A85A2115 (hex)
    originalValue = originalValue + 1
end_of_while_loop
We get the following results:
25 Jan 2021 13:37:07 => acdef08b|57681f95
25 Jan 2021 13:40:51 => acdef08b|4a249b35
25 Jan 2021 13:45:07 => 553911a0|e4acfb95
25 Jan 2021 13:49:03 => 553911a0|b922f7b5
25 Jan 2021 13:53:53 => fd9332b5|d2a65be5
25 Jan 2021 13:55:50 => fd9332b5|d15bf495
25 Jan 2021 14:00:39 => a5ed53ca|a8871b95
Addendum
The reverse operation seems to be done by multiplying with 3069263D (hex) (and only using the low 32 bits).
Encrypting:
2C9CA481 * A85A2115 = 1D5680CA57681F95
=> Result: 57681F95
Decrypting:
57681F95 * 3069263D = 10876CAF2C9CA481
=> Result: 2C9CA481
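Why a single multiplication undoes the encryption: the encode key is odd, so it has a multiplicative inverse modulo 2^32, and 3069263D (hex) is exactly that inverse. A sketch in Scala (BigInt used only for the modular arithmetic):
val m         = BigInt(2).pow(32)
val encodeKey = BigInt("A85A2115", 16)
val decodeKey = encodeKey.modInverse(m)     // 3069263D (hex)
val original  = BigInt("2C9CA481", 16)
val encrypted = original * encodeKey % m    // 57681F95 (hex)
val decrypted = encrypted * decodeKey % m   // 2C9CA481 (hex) - round-trips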

How to traverse M*N grid in KDB

How do you traverse an m*n grid in q (kdb+), where you can move up, down, or diagonally, to find how many possible ways the end point can be reached?
Like below: a recursion tree starting from the origin, where each node, e.g. (0 1), (1 1), (1 0), spawns up to three children, until every leaf reaches (2 2).
One way of doing it is to use .z.s to recursively call the enclosing function with different arguments, summing the results to give the total number of paths.
f:{
  // when you reach a wall, there is only one way to the corner, so return a valid path
  if[any 1=(x;y);:1];
  // otherwise spawn 3 paths - one up, one right and one diagonal
  :.z.s[x-1;y] + .z.s[x;y-1] + .z.s[x-1;y-1]
  }
q)f[2;2]
3
q)f[2;3]
5
q)f[3;3]
13
If you are travelling along the edges and not the squares you can change the first line to:
if[any 0=(x;y);:1];
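For readers not on q, the same recursion transcribed to Scala (a sketch; the edge-walking variant would test against 0 instead of 1):
// when either coordinate hits a wall there is only one way to the corner
def f(x: Int, y: Int): Long =
  if (x == 1 || y == 1) 1L
  else f(x - 1, y) + f(x, y - 1) + f(x - 1, y - 1)

f(3, 3)  // 13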
A closed form solution is just finding the Delannoy number, which could be implemented like this when you are travelling along edges:
d:{
  k:1+min(x;y);
  f:{prd 1+til x};
  comb:{[f;m;n] f[m] div f[n]*f[m-n]}[f];
  (sum/) (2 xexp til k) * prd (x;y) comb/:\: til k
  }
q)d[3;3]
63f
This is much quicker for larger boards, as I think the complexity of the first solution is O(3^(m+n)) while the complexity of the second is O(m*n):
q)\t f[7;7]
13
q)\t f[10;10]
1924
q)\t d[7;7]
0
q)\t d[100;100]
1

An issue with argument "sortv" of function seqIplot()

I'm trying to plot individual sequences by means of the function seqIplot() in TraMineR. These individual sequences represent work trajectories reported by the school's former graduates via a web questionnaire.
Using argument "sortv", I'd like to sort my sequences according to the order of the levels of one covariate, the year of graduation, named "PROMO".
"PROMO" is a factor variable contained in a data frame named "covariates.seq", gathering covariates together:
str(covariates.seq)
'data.frame': 733 obs. of 6 variables:
 $ ID_SQ           : Factor w/ 733 levels "1","2","3","5",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SEXE            : Factor w/ 2 levels "Féminin","Masculin": 1 1 1 1 2 1 1 2 2 1 ...
 $ PROMO           : Factor w/ 6 levels "1997","1998",..: 1 2 2 4 4 3 2 2 2 2 ...
 $ DEPARTEMENT     : Factor w/ 10 levels "BC","GCU","GE",..: 1 4 7 8 7 9 9 7 7 4 ...
 $ NIVEAU_ADMISSION: Factor w/ 2 levels "En Premier Cycle",..: NA 1 1 1 1 1 NA 1 1 1 ...
 $ FILIERE_SECTION : Factor w/ 4 levels "Cursus Classique",..: NA 4 2 NA 1 1 NA NA 4 3 ...
I'm also using "SEXE", the graduates' gender, as a grouping variable. To plot the individual sequences, then, my command is as follows:
seqIplot(sequences, group = covariates.seq$SEXE,
         sortv = covariates.seq$PROMO,
         cex.axis = 0.7, cex.legend = 0.7)
I expected that, by using a process time axis (with the year of graduation as sequence-dependent origin), sorting the sequences according to the order of the levels of "PROMO" would give a plot with groups of sequences from the longest (for the older graduates) to the shortest (for the younger graduates).
But I've got an issue: in the output plot, the sequences don't appear to be correctly sorted according to the levels of "PROMO". With "sortv = covariates.seq$PROMO" as in the command above, the plot doesn't show groups of sequences from the longest to the shortest as expected; it looks like the plot obtained without the "sortv" argument (see the figures below).
Without using argument "sortv"
Using "sortv = covariates.seq$PROMO"
Note that I have 733 individual sequences in my object "sequences", created as follows:
labs <- c("En poste","Au chômage (d'au moins 6 mois)", "Autre situation
(d'au moins 6 mois)","En poursuite d'études (thèse ou hors
thèse)", "En reprise d'études / formation (d'au moins 6 mois)")
codes <- c("En poste", "Au chômage", "Autre situation", "En poursuite
d'études", "En reprise d'études / formation")
sequences <- seqdef(situations, alphabet = labs, states = codes, left =
NA, right = "DEL", missing = NA,
cnames = as.character(seq(0,7400/365,1/365)),
xtstep = 365)
The values of the covariates are sorted in the same order as the individual sequences. The covariate "PROMO" doesn't contain any missing value.
Something's going wrong, but what?
Thank you in advance for your help,
Best,
Arnaud.
Using a factor as sortv argument in seqIplot works fine as illustrated by the example below:
sdc <- c("aabbccdd","bbbccc","aaaddd","abcabcab")
sd <- seqdecomp(sdc, sep="")
seq <- seqdef(sd)
fac <- factor(c("2000","2001","2001","2000"))
par(mfrow=c(1,3))
seqIplot(seq, with.legend=FALSE)
seqIplot(seq, sortv=fac, with.legend=FALSE)
seqlegend(seq)

POS tagging in Scala

I tried to POS tag a sentence in Scala using the Stanford parser, like below:
import edu.stanford.nlp.parser.lexparser.LexicalizedParser
import edu.stanford.nlp.trees.Tree

val lp: LexicalizedParser = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
lp.setOptionFlags("-maxLength", "50", "-retainTmpSubcategories")
val s = "I love to play"
val parse: Tree = lp.apply(s)
val taggedWords = parse.taggedYield()
println(taggedWords)
I got the error type mismatch; found: java.lang.String, required: java.util.List[_ <: edu.stanford.nlp.ling.HasWord] on the line val parse: Tree = lp.apply(s).
I don't know whether this is the right way of doing it or not. Are there any other easy ways of POS tagging a sentence in Scala?
You might like to consider the FACTORIE toolkit (http://github.com/factorie/factorie). It is a general library for machine learning and graphical models that happens to include an extensive suite of natural language processing components (tokenization, token normalization, morphological analysis, sentence segmentation, part-of-speech tagging, named entity recognition, dependency parsing, mention finding, coreference).
Furthermore it is written entirely in Scala, and it is released under the Apache License.
Documentation is currently sparse, but will be improving in the coming months.
For example, once Maven-based installation is finished you can type at the command line:
bin/fac nlp --pos1 --parser1 --ner1
to launch a socket-listening multi-threaded NLP server. Then query it by piping plain text to its socket number:
echo "Mr. Jones took a job at Google in New York. He and his Australian wife moved from New South Wales on 4/1/12." | nc localhost 3228
The output is then
1 1 Mr. NNP 2 nn O
2 2 Jones NNP 3 nsubj U-PER
3 3 took VBD 0 root O
4 4 a DT 5 det O
5 5 job NN 3 dobj O
6 6 at IN 3 prep O
7 7 Google NNP 6 pobj U-ORG
8 8 in IN 7 prep O
9 9 New NNP 10 nn B-LOC
10 10 York NNP 8 pobj L-LOC
11 11 . . 3 punct O
12 1 He PRP 6 nsubj O
13 2 and CC 1 cc O
14 3 his PRP$ 5 poss O
15 4 Australian JJ 5 amod U-MISC
16 5 wife NN 6 nsubj O
17 6 moved VBD 0 root O
18 7 from IN 6 prep O
19 8 New NNP 9 nn B-LOC
20 9 South NNP 10 nn I-LOC
21 10 Wales NNP 7 pobj L-LOC
22 11 on IN 6 prep O
23 12 4/1/12 NNP 11 pobj O
24 13 . . 6 punct O
Of course there is a programmatic API to all this functionality as well.
import cc.factorie._
import cc.factorie.app.nlp._
val doc = new Document("Education is the most powerful weapon which you can use to change the world.")
DocumentAnnotatorPipeline(pos.POS1).process(doc)
for (token <- doc.tokens)
println("%-10s %-5s".format(token.string, token.posLabel.categoryValue))
will output:
Education NN
is VBZ
the DT
most RBS
powerful JJ
weapon NN
which WDT
you PRP
can MD
use VB
to TO
change VB
the DT
world NN
. .
I found a very simple way to do POS tagging in Scala.
Step 1
Download the Stanford tagger version 3.2.0 from the link below:
http://nlp.stanford.edu/software/stanford-postagger-2013-06-20.zip
Step 2
Add the stanford-postagger jar from that folder to your project, and also place the english-left3words-distsim.tagger file from the models folder in your project.
Then, with the code below, you can POS tag a sentence in Scala:
import edu.stanford.nlp.tagger.maxent.MaxentTagger

val tagger = new MaxentTagger("english-left3words-distsim.tagger")
val art_con = "My name is Rahul"
val tagged = tagger.tagString(art_con)
println(tagged)
Output: My_PRP$ name_NN is_VBZ Rahul_NNP
I believe the API of the Stanford Parser has changed somewhat, as it does sometimes. apply has the signature, public Tree apply(java.util.List<? extends HasWord> words), and this is what you see in the error message.
What you should use now is parse, which has the signature public Tree parse(java.lang.String sentence).
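Applied to the original snippet, the fix looks like this (a sketch against the same englishPCFG model):
import edu.stanford.nlp.parser.lexparser.LexicalizedParser
import edu.stanford.nlp.trees.Tree

val lp: LexicalizedParser =
  LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
// parse takes a plain String and tokenizes it internally; apply wants a pre-tokenized List[HasWord]
val parse: Tree = lp.parse("I love to play")
println(parse.taggedYield())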