list of unique elements from textstat_simil with loop - corpus

This is a followup question from a response here.
Below, I can generate jaccard similarity for adjacent pairs (1993 then 1994). How can I use as.list() to return a list of unique items for each comparison? Here's what I'm working with:
library("quanteda")
data_corpus_inaugural$president <- paste(data_corpus_inaugural$President,
  data_corpus_inaugural$FirstName,
  sep = ", "
)
head(data_corpus_inaugural$president, 10)
## [1] "Washington, George" "Washington, George" "Adams, John"
## [4] "Jefferson, Thomas" "Jefferson, Thomas" "Madison, James"
## [7] "Madison, James" "Monroe, James" "Monroe, James"
## [10] "Adams, John Quincy"
simpairs <- lapply(unique(data_corpus_inaugural$president), function(x) {
  dfmat <- corpus_subset(data_corpus_inaugural, president == x) %>%
    dfm(remove_punct = TRUE)
  df <- data.frame()
  years <- sort(dfmat$Year)
  for (i in seq_along(years)[-length(years)]) {
    sim <- textstat_simil(
      dfm_subset(dfmat, Year %in% c(years[i], years[i + 1])),
      method = "jaccard"
    )
    df <- rbind(df, as.data.frame(sim))
  }
  df
})
do.call(rbind, simpairs)
## document1 document2 jaccard
## 1 1789-Washington 1793-Washington 0.09250399
## 2 1801-Jefferson 1805-Jefferson 0.20512821
## 3 1809-Madison 1813-Madison 0.20138889
## 4 1817-Monroe 1821-Monroe 0.29436202
## 5 1829-Jackson 1833-Jackson 0.20693928
## 6 1861-Lincoln 1865-Lincoln 0.14055885
## 7 1869-Grant 1873-Grant 0.20981595
## 8 1885-Cleveland 1893-Cleveland 0.23037543
## 9 1897-McKinley 1901-McKinley 0.25031211
## 10 1913-Wilson 1917-Wilson 0.21285564
## 11 1933-Roosevelt 1937-Roosevelt 0.20956522
## 12 1937-Roosevelt 1941-Roosevelt 0.20081549
## 13 1941-Roosevelt 1945-Roosevelt 0.18740157
## 14 1953-Eisenhower 1957-Eisenhower 0.21566976
## 15 1969-Nixon 1973-Nixon 0.23451777
## 16 1981-Reagan 1985-Reagan 0.24381368
## 17 1993-Clinton 1997-Clinton 0.24199623
## 18 2001-Bush 2005-Bush 0.24170616
## 19 2009-Obama 2013-Obama 0.24739195
So, for example, how would I retrieve the items in 1789-Washington that are not in 1793-Washington, and vice versa? ... And retrieve unique items for each pair of adjacent texts?
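The operation being asked for is a set difference between the features of two documents, taken in both directions. Here is a rough, language-neutral sketch of that idea in Python, assuming each document has already been reduced to a list of tokens. The toy `docs` dictionary and the helper names `unique_items` and `adjacent_unique` are made up for illustration and are not part of quanteda.

```python
def unique_items(tokens_a, tokens_b):
    """Return (items only in a, items only in b), each as a sorted list."""
    a, b = set(tokens_a), set(tokens_b)
    return sorted(a - b), sorted(b - a)


def adjacent_unique(docs):
    """For each adjacent pair of documents (in insertion order),
    list the items that appear on one side only."""
    names = list(docs)
    result = {}
    for first, second in zip(names, names[1:]):
        only_first, only_second = unique_items(docs[first], docs[second])
        result[(first, second)] = {first: only_first, second: only_second}
    return result


# Toy data, not the real inaugural texts.
docs = {
    "1789-Washington": "fellow citizens of the senate".split(),
    "1793-Washington": "fellow citizens i am again called".split(),
}
pairs = adjacent_unique(docs)
print(pairs[("1789-Washington", "1793-Washington")])
```

The same two-way difference would have to be computed once per adjacent pair inside the existing loop, next to the similarity score.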


How to convert output of Emboss:Palindrome into gff/bed file (perl)

I am sorry to ask this kind of stupid question, but I could not figure it out by myself. I learned Perl a while ago and I am a little lost.
I want to convert this kind of output :
Palindromes of: seq1
Sequence length is: 24
Start at position: 1
End at position: 24
Minimum length of Palindromes is: 6
Maximum length of Palindromes is: 12
Maximum gap between elements is: 6
Number of mismatches allowed in Palindrome: 0
Palindromes:
1 aaaaaaaaaaa 11
|||||||||||
24 ttttttttttt 14
Palindromes of: seq2
Sequence length is: 15
Start at position: 1
End at position: 15
Minimum length of Palindromes is: 6
Maximum length of Palindromes is: 12
Maximum gap between elements is: 6
Number of mismatches allowed in Palindrome: 0
Palindromes:
1 aaaaaac 7
|||||||
15 ttttttg 9
Into a gff or bed file :
seq1 1 24
seq2 1 15
I found a perl module to do it : https://metacpan.org/pod/Bio::Tools::GFF
This is my little script :
#!/usr/bin/perl
use strict;
use warnings 'all';
use Bio::Tools::EMBOSS::Palindrome;
use Bio::Tools::GFF;
my $filename = "truc.pal";
# a simple script to turn palindrome output into GFF3
my $parser = Bio::Tools::EMBOSS::Palindrome->new(-file => $filename);
my $out = Bio::Tools::GFF->new(-gff_version => 3,
                               -file => ">$filename.gff");
while( my $seq = $parser->next_seq ) {
    for my $feat ( $seq->get_SeqFeatures ) {
        $out->write_feature($feat);
    }
}
This is the result :
##gff-version 3
seq1 palindrome similarity 14 24 . - 1 allowed_mismatches=0;end=24;maximum gap=6;maximum_length=12;minimum_length=6;seqlength=24;start=1
seq2 palindrome similarity 9 15 . - 1 allowed_mismatches=0;end=15;maximum gap=6;maximum_length=12;minimum_length=6;seqlength=15;start=1
The issue is: I want the result to contain the start and the end of the whole palindrome, with the specific positions in the last column.
Example of what I want:
##gff-version 3
seq1 palindrome similarity 1 24 . - 1 mismatches=0;gap_positions=11-14;gap_size=3
seq2 palindrome similarity 1 15 . - 1 mismatches=0;gap_positions=7-9;gap_size=2
Thank you in advance.
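One way to get exactly those columns is to parse the report text directly instead of going through the feature objects. Below is a minimal sketch in Python, assuming the report always looks like the sample above (two "arm" lines per palindrome, one palindrome per sequence). The function name `palindrome_to_gff` is hypothetical, the fixed columns (`.`, `-`, `1`) are simply copied from the desired output, and `gap_size` is computed as the difference of the two inner positions because that is what the example shows (14 - 11 = 3).

```python
import re

HEADER = re.compile(r"^Palindromes of:\s*(\S+)")
MISMATCH = re.compile(r"^Number of mismatches allowed in Palindrome:\s*(\d+)")
ARM = re.compile(r"^(\d+)\s+([A-Za-z]+)\s+(\d+)$")


def palindrome_to_gff(report):
    """Turn palindrome report text into GFF3-style lines, one per palindrome."""
    out = ["##gff-version 3"]
    seq_id, mismatches, first_arm = None, "0", None
    for raw in report.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            seq_id, first_arm = header.group(1), None
            continue
        mismatch = MISMATCH.match(line)
        if mismatch:
            mismatches = mismatch.group(1)
            continue
        arm = ARM.match(line)
        if not arm or seq_id is None:
            continue
        start, end = int(arm.group(1)), int(arm.group(3))
        if first_arm is None:
            first_arm = (start, end)  # e.g. "1 aaaaaaaaaaa 11"
            continue
        # Second arm, e.g. "24 ttttttttttt 14": outer position first, inner last.
        positions = first_arm + (start, end)
        gap_left, gap_right = first_arm[1], end
        attrs = "mismatches=%s;gap_positions=%d-%d;gap_size=%d" % (
            mismatches, gap_left, gap_right, gap_right - gap_left)
        out.append("\t".join([seq_id, "palindrome", "similarity",
                              str(min(positions)), str(max(positions)),
                              ".", "-", "1", attrs]))
        first_arm = None
    return out


SAMPLE = """Palindromes of: seq1
Sequence length is: 24
Number of mismatches allowed in Palindrome: 0
Palindromes:
1 aaaaaaaaaaa 11
|||||||||||
24 ttttttttttt 14
"""
print("\n".join(palindrome_to_gff(SAMPLE)))
```

The same logic should be straightforward to port back to Perl with three regular expressions and a small state variable.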

Trouble with evalfis()

I'm having trouble with the fuzzy inference system. After I reinstalled Windows and MATLAB on my laptop, running my code suddenly produced this message: "Error: File: readfis.m Line: 1 Column: 1
The input character is not valid in MATLAB statements or expressions."
C = evalfis(B, fis_name);
Even when I try to run the example codes, the same message appears.
At first I thought it could be related to the reinstallation, but after I copied the .m file to another PC and tried to run it, the result was the same, and I don't know why.
Later I opened the readfis.m file named in the error; it looked strange to me, and I have put it below. However, when I searched the MATLAB files (path C:\Program Files\MATLAB\R2014a\toolbox\fuzzy\fuzzy) and opened the evalfis.m file there, it was radically different and looked like a usual .m file. I copied that .m file from the fuzzy toolbox into the working directory, replacing the other .m file, and after doing that I could run the code successfully.
Here is the code of the REPLACING evalfis.m file:
function [output,IRR,ORR,ARR] = evalfis(input, fis, numofpoints);
% EVALFIS Perform fuzzy inference calculations.
%
% Y = EVALFIS(U,FIS) simulates the Fuzzy Inference System FIS for the
% input data U and returns the output data Y. For a system with N
% input variables and L output variables,
% * U is a M-by-N matrix, each row being a particular input vector
% * Y is M-by-L matrix, each row being a particular output vector.
%
% Y = EVALFIS(U,FIS,NPts) further specifies number of sample points
% on which to evaluate the membership functions over the input or output
% range. If this argument is not used, the default value is 101 points.
%
% [Y,IRR,ORR,ARR] = EVALFIS(U,FIS) also returns the following range
% variables when U is a row vector (only one set of inputs is applied):
% * IRR: the result of evaluating the input values through the membership
% functions. This is a matrix of size Nr-by-N, where Nr is the number
% of rules, and N is the number of input variables.
% * ORR: the result of evaluating the output values through the membership
% functions. This is a matrix of size NPts-by-Nr*L. The first Nr
% columns of this matrix correspond to the first output, the next Nr
% columns correspond to the second output, and so forth.
% * ARR: the NPts-by-L matrix of the aggregate values sampled at NPts
% along the output range for each output.
%
% Example:
% fis = readfis('tipper');
% out = evalfis([2 1; 4 9],fis)
% This generates the response
% out =
% 7.0169
% 19.6810
%
% See also READFIS, RULEVIEW, GENSURF.
% Kelly Liu, 10-10-97.
% Copyright 1994-2005 The MathWorks, Inc.
ni = nargin;
if ni<2
disp('Need at least two inputs');
output=[];
IRR=[];
ORR=[];
ARR=[];
return
end
% Check inputs
if ~isfis(fis)
error('The second argument must be a FIS structure.')
elseif strcmpi(fis.type,'sugeno') & ~strcmpi(fis.impMethod,'prod')
warning('Fuzzy:evalfis:ImplicationMethod','Implication method should be "prod" for Sugeno systems.')
end
[M,N] = size(input);
Nin = length(fis.input);
if M==1 & N==1,
input = input(:,ones(1,Nin));
elseif M==Nin & N~=Nin,
input = input.';
elseif N~=Nin
error(sprintf('%s\n%s',...
'The first argument should have as many columns as input variables and',...
'as many rows as independent sets of input values.'))
end
% Check the fis for empty values
checkfis(fis);
% Issue warning if inputs out of range
inRange = getfis(fis,'inRange');
InputMin = min(input,[],1);
InputMax = max(input,[],1);
if any(InputMin(:)<inRange(:,1)) | any(InputMax(:)>inRange(:,2))
warning('Fuzzy:evalfis:InputOutOfRange','Some input values are outside of the specified input range.')
end
% Compute output
if ni==2
numofpoints = 101;
end
[output,IRR,ORR,ARR] = evalfismex(input, fis, numofpoints);
...And this is the content of the REPLACED evalfis.m file:
## Copyright (C) 2011-2014 L. Markowsky <lmarkov@users.sourceforge.net>
##
## This file is part of the fuzzy-logic-toolkit.
##
## The fuzzy-logic-toolkit is free software; you can redistribute it
## and/or modify it under the terms of the GNU General Public License
## as published by the Free Software Foundation; either version 3 of
## the License, or (at your option) any later version.
##
## The fuzzy-logic-toolkit is distributed in the hope that it will be
## useful, but WITHOUT ANY WARRANTY; without even the implied warranty
## of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with the fuzzy-logic-toolkit; see the file COPYING. If not,
## see <http://www.gnu.org/licenses/>.
## -*- texinfo -*-
## @deftypefn {Function File} {@var{output} =} evalfis (@var{user_input}, @var{fis})
## @deftypefnx {Function File} {@var{output} =} evalfis (@var{user_input}, @var{fis}, @var{num_points})
## @deftypefnx {Function File} {[@var{output}, @var{rule_input}, @var{rule_output}, @var{fuzzy_output}] =} evalfis (@var{user_input}, @var{fis})
## @deftypefnx {Function File} {[@var{output}, @var{rule_input}, @var{rule_output}, @var{fuzzy_output}] =} evalfis (@var{user_input}, @var{fis}, @var{num_points})
##
## Return the crisp output(s) of an FIS for each row in a matrix of crisp input
## values.
## Also, for the last row of @var{user_input}, return the intermediate results:
##
## @table @var
## @item rule_input
## a matrix of the degree to which
## each FIS rule matches each FIS input variable
## @item rule_output
## a matrix of the fuzzy output for each (rule, FIS output) pair
## @item fuzzy_output
## a matrix of the aggregated output for each FIS output variable
## @end table
##
## The optional argument @var{num_points} specifies the number of points over
## which to evaluate the fuzzy values. The default value of @var{num_points} is
## 101.
##
## @noindent
## Argument @var{user_input}:
##
## @var{user_input} is a matrix of crisp input values. Each row
## represents one set of crisp FIS input values. For an FIS that has N inputs,
## an input matrix of z sets of input values will have the form:
##
## @example
## @group
## [input_11 input_12 ... input_1N] <-- 1st row is 1st set of inputs
## [input_21 input_22 ... input_2N] <-- 2nd row is 2nd set of inputs
## [ ... ] ...
## [input_z1 input_z2 ... input_zN] <-- zth row is zth set of inputs
## @end group
## @end example
##
## @noindent
## Return value @var{output}:
##
## @var{output} is a matrix of crisp output values. Each row represents
## the set of crisp FIS output values for the corresponding row of
## @var{user_input}. For an FIS that has M outputs, an @var{output} matrix
## corresponding to the preceding input matrix will have the form:
##
## @example
## @group
## [output_11 output_12 ... output_1M] <-- 1st row is 1st set of outputs
## [output_21 output_22 ... output_2M] <-- 2nd row is 2nd set of outputs
## [ ... ] ...
## [output_z1 output_z2 ... output_zM] <-- zth row is zth set of outputs
## @end group
## @end example
##
## @noindent
## The intermediate result @var{rule_input}:
##
## The matching degree for each (rule, input value) pair is specified by the
## @var{rule_input} matrix. For an FIS that has Q rules and N input variables,
## the matrix will have the form:
## @example
## @group
## in_1 in_2 ... in_N
## rule_1 [mu_11 mu_12 ... mu_1N]
## rule_2 [mu_21 mu_22 ... mu_2N]
## [ ... ]
## rule_Q [mu_Q1 mu_Q2 ... mu_QN]
## @end group
## @end example
##
## @noindent
## Evaluation of hedges and "not":
##
## Each element of each FIS rule antecedent and consequent indicates the
## corresponding membership function, hedge, and whether or not "not" should
## be applied to the result. The index of the membership function to be used is
## given by the positive whole number portion of the antecedent/consequent
## vector entry, the hedge is given by the fractional portion (if any), and
## "not" is indicated by a minus sign. A "0" as the integer portion in any
## position in the rule indicates that the corresponding FIS input or output
## variable is omitted from the rule.
##
## For custom hedges and the four built-in hedges "somewhat," "very,"
## "extremely," and "very very," the membership function value (without the
## hedge or "not") is raised to the power corresponding to the hedge. All
## hedges are rounded to 2 digits.
##
## For example, if "mu(x)" denotes the matching degree of the input to the
## corresponding membership function without a hedge or "not," then the final
## matching degree recorded in @var{rule_input} will be computed by applying
## the hedge and "not" in two steps. First, the hedge is applied:
##
## @example
## @group
## (fraction == .05) <=> somewhat x <=> mu(x)^0.5 <=> sqrt(mu(x))
## (fraction == .20) <=> very x <=> mu(x)^2 <=> sqr(mu(x))
## (fraction == .30) <=> extremely x <=> mu(x)^3 <=> cube(mu(x))
## (fraction == .40) <=> very very x <=> mu(x)^4
## (fraction == .dd) <=> <custom hedge> x <=> mu(x)^(dd/10)
## @end group
## @end example
##
## After applying the appropriate hedge, "not" is calculated by:
## @example
## minus sign present <=> not x <=> 1 - mu(x)
## minus sign and hedge present <=> not <hedge> x <=> 1 - mu(x)^(dd/10)
## @end example
##
## Hedges and "not" in the consequent are handled similarly.
##
## @noindent
## The intermediate result @var{rule_output}:
##
## For either a Mamdani-type FIS (that is, an FIS that does not have constant or
## linear output membership functions) or a Sugeno-type FIS (that is, an FIS
## that has only constant and linear output membership functions),
## @var{rule_output} specifies the fuzzy output for each (rule, FIS output) pair.
## The format of rule_output depends on the FIS type.
##
## For a Mamdani-type FIS, @var{rule_output} is a @var{num_points} x (Q * M)
## matrix, where Q is the number of rules and M is the number of FIS output
## variables. Each column of this matrix gives the y-values of the fuzzy
## output for a single (rule, FIS output) pair.
##
## @example
## @group
## Q cols Q cols Q cols
## --------------- --------------- ---------------
## out_1 ... out_1 out_2 ... out_2 ... out_M ... out_M
## 1 [ ]
## 2 [ ]
## ... [ ]
## num_points [ ]
## @end group
## @end example
##
## For a Sugeno-type FIS, @var{rule_output} is a 2 x (Q * M) matrix.
## Each column of this matrix gives the (location, height) pair of the
## singleton output for a single (rule, FIS output) pair.
##
## @example
## @group
## Q cols Q cols Q cols
## --------------- --------------- ---------------
## out_1 ... out_1 out_2 ... out_2 ... out_M ... out_M
## location [ ]
## height [ ]
## @end group
## @end example
##
## @noindent
## The intermediate result @var{fuzzy_output}:
##
## The format of @var{fuzzy_output} depends on the FIS type ('mamdani' or
## 'sugeno').
##
## For either a Mamdani-type FIS or a Sugeno-type FIS, @var{fuzzy_output}
## specifies the aggregated fuzzy output for each FIS output.
##
## For a Mamdani-type FIS, the aggregated @var{fuzzy_output} is a
## @var{num_points} x M matrix. Each column of this matrix gives the y-values
## of the fuzzy output for a single FIS output, aggregated over all rules.
##
## @example
## @group
## out_1 out_2 ... out_M
## 1 [ ]
## 2 [ ]
## ... [ ]
## num_points [ ]
## @end group
## @end example
##
## For a Sugeno-type FIS, the aggregated output for each FIS output is a 2 x L
## matrix, where L is the number of distinct singleton locations in the
## @var{rule_output} for that FIS output:
##
## @example
## @group
## singleton_1 singleton_2 ... singleton_L
## location [ ]
## height [ ]
## @end group
## @end example
##
## Then @var{fuzzy_output} is a vector of M structures, each of which has an index and
## one of these matrices.
##
## @noindent
## Examples:
##
## Seven examples of using evalfis are shown in:
## @itemize @bullet
## @item
## cubic_approx_demo.m
## @item
## heart_disease_demo_1.m
## @item
## heart_disease_demo_2.m
## @item
## investment_portfolio_demo.m
## @item
## linear_tip_demo.m
## @item
## mamdani_tip_demo.m
## @item
## sugeno_tip_demo.m
## @end itemize
##
## @seealso{cubic_approx_demo, heart_disease_demo_1, heart_disease_demo_2, investment_portfolio_demo, linear_tip_demo, mamdani_tip_demo, sugeno_tip_demo}
## @end deftypefn
## Author: L. Markowsky
## Keywords: fuzzy-logic-toolkit fuzzy inference system fis
## Directory: fuzzy-logic-toolkit/inst/
## Filename: evalfis.m
## Last-Modified: 20 Aug 2012
function [output, rule_input, rule_output, fuzzy_output] = ...
evalfis (user_input, fis, num_points = 101)
## If evalfis was called with an incorrect number of arguments, or
## the arguments do not have the correct type, print an error message
## and halt.
if ((nargin != 2) && (nargin != 3))
puts ("Type 'help evalfis' for more information.\n");
error ("evalfis requires 2 or 3 arguments\n");
elseif (!is_fis (fis))
puts ("Type 'help evalfis' for more information.\n");
error ("evalfis's second argument must be an FIS structure\n");
elseif (!is_input_matrix (user_input, fis))
puts ("Type 'help evalfis' for more information.\n");
error ("evalfis's 1st argument must be a matrix of input values\n");
elseif (!is_pos_int (num_points))
puts ("Type 'help evalfis' for more information.\n");
error ("evalfis's third argument must be a positive integer\n");
endif
## Call a private function to compute the output.
## (The private function is also called by gensurf.)
[output, rule_input, rule_output, fuzzy_output] = ...
evalfis_private (user_input, fis, num_points);
endfunction

An issue with argument "sortv" of function seqIplot()

I'm trying to plot individual sequences by means of function seqIplot() in TraMineR. These individual sequences represent work trajectories, completed by former school's graduates via a WEB questionnaire.
Using argument "sortv", I'd like to sort my sequences according to the order of the levels of one covariate, the year of graduation, named "PROMO".
"PROMO" is a factor variable contained in a data frame named "covariates.seq", gathering covariates together:
str(covariates.seq)
'data.frame': 733 obs. of 6 variables:
$ ID_SQ : Factor w/ 733 levels "1","2","3","5",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SEXE : Factor w/ 2 levels "Féminin","Masculin": 1 1 1 1 2 1 1 2 2 1 ...
$ PROMO : Factor w/ 6 levels "1997","1998",..: 1 2 2 4 4 3 2 2 2 2 ...
$ DEPARTEMENT : Factor w/ 10 levels "BC","GCU","GE",..: 1 4 7 8 7 9 9 7 7 4 ...
$ NIVEAU_ADMISSION: Factor w/ 2 levels "En Premier Cycle",..: NA 1 1 1 1 1 NA 1 1 1 ...
$ FILIERE_SECTION : Factor w/ 4 levels "Cursus Classique",..: NA 4 2 NA 1 1 NA NA 4 3 ...
I'm also using "SEXE", the graduates' gender, as a grouping variable. To plot the individual sequences this way, my command is as follows:
seqIplot(sequences, group = covariates.seq$SEXE,
sortv = covariates.seq$PROMO,
cex.axis = 0.7, cex.legend = 0.7)
I expected that, by using a process time axis (with the year of graduation as sequence-dependent origin), sorting the sequences according to the order of the levels of "PROMO" would give a plot with groups of sequences from the longest (for the older graduates) to the shortest (for the younger graduates).
But I've got an issue: in the output plot, the sequences don't appear to be correctly sorted according to the levels of "PROMO". Indeed, by using "sortv = covariates.seq$PROMO" as in the command above, the plot doesn't show groups of sequences from the longest to the shortest, as expected. It looks like the plot obtained without using the argument "sortv" (see Figures below).
[Figure: index plot without using argument "sortv"]
[Figure: index plot using "sortv = covariates.seq$PROMO"]
Note that I have 733 individual sequences in my object "sequences", created as follows:
labs <- c("En poste", "Au chômage (d'au moins 6 mois)",
          "Autre situation (d'au moins 6 mois)",
          "En poursuite d'études (thèse ou hors thèse)",
          "En reprise d'études / formation (d'au moins 6 mois)")
codes <- c("En poste", "Au chômage", "Autre situation",
           "En poursuite d'études", "En reprise d'études / formation")
sequences <- seqdef(situations, alphabet = labs, states = codes,
                    left = NA, right = "DEL", missing = NA,
                    cnames = as.character(seq(0, 7400/365, 1/365)),
                    xtstep = 365)
The values of the covariates are sorted in the same order as the individual sequences. The covariate "PROMO" doesn't contain any missing value.
Something's going wrong, but what?
Thank you in advance for your help,
Best,
Arnaud.
Using a factor as sortv argument in seqIplot works fine as illustrated by the example below:
sdc <- c("aabbccdd","bbbccc","aaaddd","abcabcab")
sd <- seqdecomp(sdc, sep="")
seq <- seqdef(sd)
fac <- factor(c("2000","2001","2001","2000"))
par(mfrow=c(1,3))
seqIplot(seq, with.legend=FALSE)
seqIplot(seq, sortv=fac, with.legend=FALSE)
seqlegend(seq)

Spark: All RDD data not getting saved to Cassandra table

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 are stored in the Cassandra table.
Below is the code snippet:
val states = sc.textFile("state.txt")
// list of all the 50 states of the USA
var n = 0 // corrected to var
val statesRDD = states.map { a =>
  n = n + 1
  (n, a)
}
scala> statesRDD.count
res2: Long = 50
cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);
statesRDD.saveToCassandra("brs","state", SomeColumns("state_id","state_name"))
// this statement saves only 28 rows out of 50, not sure why!!!!
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me in finding where the issue is?
Edit:
I understood why only 28 rows are getting stored in Cassandra: I have made the first column a PRIMARY KEY, and it looks like in my code n is incremented up to a maximum of 28 and then starts again from 1 and runs to 22 (50 in total).
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map { a =>
  n += 1
  (n, a)
}
I tried making n an accumulator variable as well (viz. val n = sc.accumulator(0,"Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
[Stage 2:> (0 + 0) / 2]
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what is causing n to stop being updated after the value 28. Also, what are the ways in which I can create a counter that I can use for creating an RDD?
There are some misconceptions about distributed systems embedded inside your question. The real heart of this is "How do I have a counter in a distributed system?"
The short answer is you don't. For example what you've done in your code example originally is something like this.
Task One {
var x = 0
record 1: x = 1
record 2: x = 2
}
Task Two {
var x = 0
record 20: x = 1
record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
For most use cases the "counter" question can be replaced with "How can I get a Unique Identifier per Record in a distributed system?"
For this most users end up using a UUID, which can be generated on independent machines with infinitesimal chances of conflicts.
If the question is "How can I get a monotonically increasing unique identifier?"
then you can use zipWithUniqueId, which will not count but will generate unique ids that increase within each partition; zipWithIndex gives consecutive indices if you need them.
If you just want them numbered to start with, it's best to do it on the local system.
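The effect can be reproduced without a cluster. The following is a plain-Python simulation, not Spark code: it assumes two partitions of 28 and 22 records (the split suggested by the output in the question), and the function names `per_task_counter` and `zip_with_index` are invented for this sketch. The second function shows the general idea of numbering records by first counting each partition and then offsetting, which is one way to get consecutive indices across partitions.

```python
def per_task_counter(partitions):
    """Mimic the original code: every task starts its own counter at 0."""
    keyed = []
    for partition in partitions:
        n = 0  # each task works on its own copy of the closure variable
        for record in partition:
            n += 1
            keyed.append((n, record))
    return keyed


def zip_with_index(partitions):
    """Assign consecutive global indices: count each partition first,
    then number the records starting from that partition's offset."""
    sizes = [len(p) for p in partitions]
    offsets = [sum(sizes[:i]) for i in range(len(sizes))]
    return [(offset + i, record)
            for offset, partition in zip(offsets, partitions)
            for i, record in enumerate(partition)]


states = ["state-%02d" % i for i in range(50)]  # placeholder names
partitions = [states[:28], states[28:]]         # assumed 28 / 22 split

duplicated = per_task_counter(partitions)
distinct_keys = {key for key, _ in duplicated}
indexed = zip_with_index(partitions)

print(len(duplicated), len(distinct_keys))      # 50 records, 28 distinct keys
print(len({key for key, _ in indexed}))         # 50 distinct keys
```

With a primary key on the counter column, the 22 records whose keys repeat overwrite earlier rows, which is consistent with only 28 rows surviving in the table.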
Edit: Why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition{ it => it.foreach(y => x+= 1); println(x)}
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
The accumulators combine their state after finishing their tasks, which means you can't use them as a global distributed counter.

How does one extract a unified-diff style patch subset?

Every time I want to take a subset of a patch, I'm forced to write a script to extract only the indices that I want.
e.g. I have a patch that applies to subdirectories 'yay' and 'foo'.
Is there a way to create a new patch from, or apply only, a subset of a patch? i.e. create a new patch from the existing patch that takes only the indices under subdirectory 'yay', or only the indices that are not under subdirectory 'foo'?
If I have a patch like (excuse the pseudo-patch below):
Index : foo/bar
yada
yada
- asdf
+ jkl
yada
yada
Index : foo/bah
blah
blah
- 28
+ 29
blah
blah
blah
Index : yay/team
go
huskies
- happy happy
+ joy joy
cougars
suck
How can I extract or apply only the 'yay' subdirectory like:
Index : yay/team
go
huskies
- happy happy
+ joy joy
cougars
suck
I know if I script up a solution I'll be re-inventing the wheel...
Take a look at the filterdiff utility, which is part of patchutils.
For example, if you have the following patch:
$ cat example.patch
diff -Naur orig/a/bar new/a/bar
--- orig/a/bar 2009-12-02 12:41:38.353745751 -0800
+++ new/a/bar 2009-12-02 12:42:17.845745951 -0800
@@ -1,3 +1,3 @@
4
-5
+e
6
diff -Naur orig/a/foo new/a/foo
--- orig/a/foo 2009-12-02 12:41:32.845745768 -0800
+++ new/a/foo 2009-12-02 12:42:25.697995617 -0800
@@ -1,3 +1,3 @@
1
2
-3
+c
diff -Naur orig/b/baz new/b/baz
--- orig/b/baz 2009-12-02 12:41:42.993745756 -0800
+++ new/b/baz 2009-12-02 12:42:37.585745735 -0800
@@ -1,3 +1,3 @@
-7
+z
8
9
Then you can run the following command to extract the patch for only the files in the a directory:
$ cat example.patch | filterdiff -i 'new/a/*'
--- orig/a/bar 2009-12-02 12:41:38.353745751 -0800
+++ new/a/bar 2009-12-02 12:42:17.845745951 -0800
@@ -1,3 +1,3 @@
4
-5
+e
6
--- orig/a/foo 2009-12-02 12:41:32.845745768 -0800
+++ new/a/foo 2009-12-02 12:42:25.697995617 -0800
@@ -1,3 +1,3 @@
1
2
-3
+c
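If patchutils is not available, the same kind of selection can be approximated in a few lines. This is only a sketch of the idea, not a description of how filterdiff works internally: it assumes every per-file section starts with a `--- ` line immediately followed by a `+++ ` line, that paths contain no spaces, and it matches the glob against the `+++` path. The function name `filter_patch` is made up for the example. A removed line whose text happens to start with `-- ` directly followed by an added line starting with `++ ` would confuse it.

```python
import fnmatch


def filter_patch(patch_text, pattern):
    """Keep only the per-file sections whose '+++' path matches the glob."""
    lines = patch_text.splitlines(keepends=True)
    kept, keep = [], False
    for i, line in enumerate(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if line.startswith("--- ") and nxt.startswith("+++ "):
            # Header of a new per-file section: decide once whether to keep it.
            path = (nxt[4:].split() or [""])[0]
            keep = fnmatch.fnmatchcase(path, pattern)
        elif line.startswith("diff "):
            # Command line between sections; dropped, as in the output above.
            keep = False
        if keep:
            kept.append(line)
    return "".join(kept)


example = (
    "diff -Naur orig/a/bar new/a/bar\n"
    "--- orig/a/bar\t2009-12-02\n"
    "+++ new/a/bar\t2009-12-02\n"
    "@@ -1,3 +1,3 @@\n"
    " 4\n-5\n+e\n 6\n"
    "diff -Naur orig/b/baz new/b/baz\n"
    "--- orig/b/baz\t2009-12-02\n"
    "+++ new/b/baz\t2009-12-02\n"
    "@@ -1,3 +1,3 @@\n"
    "-7\n+z\n 8\n 9\n"
)
print(filter_patch(example, "new/a/*"), end="")
```

For anything beyond a quick one-off, the dedicated tool is the safer choice, since it handles the corner cases this sketch ignores.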
Here's my quick and dirty Perl solution.
perl -ne '@a = split /^Index :/m, join "", <>; END { for(@a) {print "Index :", $_ if (m, yay/team,)}}' < foo.patch
In response to sigjuice's request in the comments, I'm posting my script solution. It isn't 100% bullet proof, and I'll probably use filterdiff instead.
base_usage_str = r'''
    python %prog index_regex patch_file
    description:
        Extracts all indices from a patch-file matching 'index_regex'
    e.g.
        python %prog '^evc_lib' p.patch > evc_lib_p.patch
            Will extract all indices which begin with evc_lib.
        -or-
        python %prog '^(?!evc_lib)' p.patch > not_evc_lib_p.patch
            Will extract all indices which do *not* begin with evc_lib.
    authors:
        Ross Rogers, 2009.04.02
'''
import re, os, sys
from optparse import OptionParser
def main():
    parser = OptionParser(usage=base_usage_str)
    (options, args) = parser.parse_args(args=sys.argv[1:])
    if len(args) != 2:
        parser.print_help()
        if len(args) == 0:
            sys.exit(0)
        else:
            sys.exit(1)
    (index_regex, patch_file) = args
    sys.stderr.write('Extracting patches for indices found by regex:%s\n' % index_regex)
    #print 'user_regex',index_regex
    user_index_match_regex = re.compile(index_regex)
    # Index: verification/ring_td_cte/tests/mmio_wr_td_target.e
    # --- sw/cfg/foo.xml 2009-04-30 17:59:11 -07:00
    # +++ sw/cfg/foo.xml 2009-05-11 09:26:58 -07:00
    index_cre = re.compile(r'''(?:(?<=^)|(?<=\n))(--- (?:.*\n){2,}?(?![ @\+\-]))''')
    patch_file = open(patch_file, 'r')
    all_patch_sets = index_cre.findall(patch_file.read())
    patch_file.close()
    for file_edit in all_patch_sets:
        # extract index subset
        index_path = re.compile(r'\+\+\+ (?P<index>[\w_\-/\.]+)').search(file_edit).group('index').strip()
        if user_index_match_regex.search(index_path):
            sys.stderr.write("Index regex matched index: " + index_path + "\n")
            # write without an extra newline; works on Python 2 and 3
            sys.stdout.write(file_edit)
if __name__ == '__main__':
    main()