I know this question has been asked before, and I have looked at the answers, but they don't seem to apply to my case. I've written a binomial mixture model in WinBUGS that includes a covariate. The model is syntactically correct and the data loads fine, but I get the error 'multiple definitions of node lambda[1]'. Lambda is not defined in the data, and it is subscripted to provide an estimate for every year for which there is data.
My code is the following:
model {
#Priors
for (k in 1:7) {
lambda[k]~dnorm(0,0.01)
p[k]~dunif(0,1)
}
alpha0~dunif(-10,10)
alpha1~dunif(-10,10)
beta0~dunif(-10,10)
beta1~dunif(-10,10)
#Likelihood
#Ecological model for true abundance (process model)
for (k in 1:7) { #Loop over years
lambda[k]<-exp(alpha.lam[k])
for (i in 1:R) { #Loop over R sites
N[i,k]~dpois(lambda[k]) #Abundance
log(lambda[k])<-alpha0+alpha1*well[i,k]
#Observation model for replicated counts
for (j in 1:T) { #Loop over repeated counts within a year
y[i,j,k] ~dbin(p[k],N[i,k]) #Detection
p[k]<-exp(lp[k])/(1+exp(lp[k]))
lp[k]<-beta0+beta1*well[i,k]
#Assess model fit using Chi-squared discrepancy
#Compute fit statistic "E" for observed data
eval[i,j,k]<-p[k]*N[i,k] #Expected values
E[i,j,k]<-pow((y[i,j,k]-eval[i,j,k]),2)/(eval[i,j,k]+0.5)
#Generate replicate data and compute fit stats for them
y.new[i,j,k]~dbin(p[k],N[i,k])
E.new[i,j,k]<-pow((y.new[i,j,k]-eval[i,j,k]),2)/(eval[i,j,k]+0.5)
}#j
}#i
}#k
#Derived and other quantities
for(k in 1:7) {
totalN[k]<-sum(N[,k]) #Total pop. size across all sites
mean.abundance[k]<-exp(alpha.lam[k])
}
fit<-sum(E[,,])
fit.new<-sum(E.new[,,])
}
#Data
list(R=669,T=3)
Can anyone tell me why I'm getting this error? Many thanks in advance.
I'm estimating last mile delivery costs in a large urban network using by-route distances. I have over 8000 customer agents and over 100 retail store agents plotted in a GIS map using lat/long coordinates. Each customer receives deliveries from its nearest store (by route). The goal is to get two distance measures in this network for each store:
d0_bar: the average distance from a store to all of its assigned customers
d1_bar: the average distance between all customers common to a single store
I've written a startup function with a simple foreach loop to assign each customer to a store based on by-route distance (customers have a parameter, "customer.pStore" of Store type). This function also adds, in turn, each customer to the store agent's collection of customers ("store.colCusts"; it's an array list with Customer type elements).
Next, I have a function that iterates through the store agent population and calculates the two average distance measures above (d0_bar & d1_bar) and writes the results to a txt file (see code below). The code works, fortunately. However, the problem is that with such a massive dataset, the process of iterating through all customers/stores and retrieving distances via the openstreetmap.org API takes forever. It's been initializing ("Please wait...") for about 12 hours. What can I do to make this code more efficient? Or, is there a better way in AnyLogic of getting these two distance measures for each store in my network?
Thanks in advance.
//for each store, record all customers assigned to it
for (Store store : stores)
{
distancesStore.print(store.storeCode + "," + store.colCusts.size() + "," + store.colCusts.size()*(store.colCusts.size()-1)/2 + ",");
//calculates average distance from store j to customer nodes that belong to store j
double sumFirstDistByStore = 0.0;
int h = 0;
while (h < store.colCusts.size())
{
sumFirstDistByStore += store.distanceByRoute(store.colCusts.get(h));
h++;
}
distancesStore.print((sumFirstDistByStore/store.colCusts.size())/1609.34 + ",");
//calculates average of distances between all customer nodes belonging to store j
double custDistSumPerStore = 0.0;
int loopLimit = store.colCusts.size();
int i = 0;
while (i < loopLimit - 1)
{
int j = 1;
while (j < loopLimit)
{
custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
j++;
}
i++;
}
distancesStore.print((custDistSumPerStore/(loopLimit*(loopLimit-1)/2))/1609.34);
distancesStore.println();
}
Firstly a few simple comments:
Have you tried timing a single distanceByRoute call? E.g. can you try running store.distanceByRoute(store.colCusts.get(0)); just to see how long a single call takes on your system. Routing is generally pretty slow, but it would be good to know what the speed limit is.
The first simple change is to use Java parallelism. Instead of using this:
for (Store store : stores)
{ ...
use this:
stores.parallelStream().forEach(store -> {
...
});
This will process the stores entries in parallel using the standard Java streams API.
It also looks like the second loop, where the average distance between customers is calculated, doesn't take mirroring into account; that is, the distance a->b is equal to b->a. Hence, for example, 4 customers require 6 calculations: 1->2, 1->3, 1->4, 2->3, 2->4, 3->4. With 4 customers, however, your second while loop performs 9 calculations: i=0, j in {1,2,3}; i=1, j in {1,2,3}; i=2, j in {1,2,3}, which seems wrong unless I am misunderstanding your intention.
Generally, for long-running operations it is a good idea to include some traceln calls to show progress with associated timing.
Please have a look at the above and post your results. With more information, additional performance improvements may be possible.
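The pair counting in the comment about mirroring can be checked with a small sketch. It is written in Python rather than AnyLogic Java, and route_distance is a hypothetical stand-in for distanceByRoute (here just straight-line distance on made-up points), so it only illustrates the loop bounds, not the routing call.

```python
from itertools import combinations
from math import hypot


def route_distance(a, b):
    # Hypothetical stand-in for distanceByRoute: straight-line distance.
    return hypot(a[0] - b[0], a[1] - b[1])


def mean_pairwise_distance(points):
    """Average distance over unique unordered pairs (i < j)."""
    pairs = list(combinations(range(len(points)), 2))
    if not pairs:
        return 0.0, 0
    total = sum(route_distance(points[i], points[j]) for i, j in pairs)
    return total / len(pairs), len(pairs)


def original_loop_calls(n):
    """Count distance calls made by the loop bounds in the question."""
    calls = 0
    i = 0
    while i < n - 1:
        j = 1  # restarts at 1 for every i, as in the posted code
        while j < n:
            calls += 1
            j += 1
        i += 1
    return calls


# Made-up coordinates for four customers of one store.
customers = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
mean_dist, unique_pairs = mean_pairwise_distance(customers)
```

In the Java loop, the equivalent change would be to start the inner index at j = i + 1 instead of 1, so that each unordered pair is visited once and no customer is compared with itself.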
I'm training doc2vec, and using callbacks trying to see if alpha is decreasing over training time using this code:
import multiprocessing
import os

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec
from gensim.test.utils import get_tmpfile


class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch.'''

    def __init__(self, path_prefix):
        self.path_prefix = path_prefix
        self.epoch = 0
        os.makedirs(self.path_prefix, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = get_tmpfile(
            '{}_epoch{}.model'.format(self.path_prefix, self.epoch)
        )
        model.save(savepath)
        print(
            "Model alpha: {}".format(model.alpha),
            "Model min_alpha: {}".format(model.min_alpha),
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch"
        )
        self.epoch += 1


def train():
    workers = multiprocessing.cpu_count()*4
    # DocIter is my own corpus iterator (not shown here)
    model = Doc2Vec(
        DocIter(),
        vec_size=600, alpha=0.03, min_alpha=0.00025, epochs=20,
        min_count=10, dm=1, hs=1, negative=0, workers=workers,
        callbacks=[EpochSaver("./checkpoints")]
    )
    print(
        "HS", model.hs, "Negative", model.negative, "Epochs",
        model.epochs, "Workers: ", model.workers,
        "Model alpha: {}".format(model.alpha)
    )
While training, I see that alpha is not changing over time. On each callback I see alpha = 0.03.
Is it possible to check whether alpha is decreasing? Or is it really not decreasing at all during training?
One more question:
How can I benefit from all my cores while training doc2vec?
As we can see, no core is loaded beyond roughly 30%.
The model.alpha property only holds the initially-configured starting-alpha – it's not updated to the effective learning-rate through training.
So, even if the value is being decreased properly (and I expect that it is), you wouldn't see it in the logging you've added.
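To get a feel for what the effective rate looks like, here is a minimal sketch that assumes the learning rate falls linearly from alpha to min_alpha as training progresses. The helper effective_alpha is hypothetical and not part of gensim; it only approximates the schedule so that it can be printed next to the epoch number.

```python
def effective_alpha(start_alpha, min_alpha, epoch, epochs):
    """Approximate learning rate at the start of `epoch` (0-based),
    assuming a linear decay from start_alpha to min_alpha."""
    progress = epoch / epochs
    return start_alpha - (start_alpha - min_alpha) * progress


# Values taken from the question: alpha=0.03, min_alpha=0.00025, epochs=20.
schedule = [effective_alpha(0.03, 0.00025, e, 20) for e in range(20)]
for e, a in enumerate(schedule):
    print("epoch {:2d}: approx alpha {:.5f}".format(e + 1, a))
```

A call like this could replace the model.alpha print inside on_epoch_end, using self.epoch and model.epochs, if an approximate value is enough for logging.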
Separate observations about your code:
In gensim versions at least through 3.5.0, maximum training throughput is most often reached with some value for workers between 3 and the number of cores, but usually not the full number of cores (if it's higher than 12) or larger. So workers=multiprocessing.cpu_count()*4 is likely to be much slower than what you could achieve with a lower number.
If your corpus is large enough to support 600-dimensional vectors, and you are discarding words with fewer than min_count=10 examples, negative sampling may work faster and get better results than the hs mode. (The pattern in published work seems to be to prefer negative sampling with larger corpora.)
When working on connected components in big data, I find it very difficult to merge them in Spark.
The data structure in my research can be simplified to RDD[Array[Int]]. For example:
RDD[Array(1,2,3), Array(1,4), Array(5,6), Array(5,6,7,8), Array(9), Array(1)]
The objective is to merge any two Arrays that have a non-empty intersection, ending up with Arrays that share no elements. After merging, it should therefore be:
RDD[Array(1,2,3,4), Array(5,6,7,8), Array(9)]
The problem resembles finding connected components, as in the Pregel framework for graph algorithms. One solution is to first find the edges between pairs of Arrays using a cartesian product and then merge them. However, in my case there are 300K Arrays with a total size of 1G, so the time and memory complexity would be roughly 300K*300K. When I run the program in Spark on my Mac Pro, it gets completely stuck.
Basically, it is like:
Thanks
Here is my solution. It might not be elegant, but it works for small data. Whether it can be applied to large data still needs to be verified.
def mergeCanopy(canopies: RDD[Array[Int]]): Array[Array[Int]] = {
  /*
  try to merge two canopies
  */
  val s = Set[Array[Int]]()
  val c = canopies.aggregate(s)(mergeOrAppend, _ ++ _)
  return c.toArray
}

def mergeOrAppend(disjoint: Set[Array[Int]], cluster: Array[Int]): Set[Array[Int]] = {
  var disjoints = disjoint
  for (clus <- disjoint) {
    // merge into the first existing cluster that shares an element
    if (clus.toSet.&(cluster.toSet) != Set()) {
      disjoints += (clus.toSet ++ cluster.toSet).toArray
      disjoints -= clus
      return disjoints
    }
  }
  // no overlap found: keep the cluster as a new disjoint set
  disjoints += cluster
  return disjoints
}
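For comparison, the same merge can be written with a union-find (disjoint-set) structure over the elements, which avoids comparing every pair of arrays. This is a different technique from the aggregate-based code above, and it is a single-machine Python sketch rather than Spark code; merge_overlapping is a hypothetical name, and whether the approach fits 300K arrays on a cluster is not something this sketch shows.

```python
def merge_overlapping(arrays):
    """Merge lists that share any element, using union-find over elements."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for arr in arrays:
        for value in arr:
            find(value)  # register every element, including singletons
        for value in arr[1:]:
            union(arr[0], value)  # all elements of one array share a root

    groups = {}
    for value in list(parent):
        groups.setdefault(find(value), []).append(value)
    return sorted(sorted(group) for group in groups.values())


# The example input from the question.
merged = merge_overlapping([[1, 2, 3], [1, 4], [5, 6], [5, 6, 7, 8], [9], [1]])
```

Because membership is tracked per element, an array that bridges two existing groups joins both of them, regardless of the order in which the arrays arrive.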
I am new to word2vec. Using this method, I am trying to form clusters based on words extracted by word2vec from the abstracts of scientific publications. To this end, I first retrieved sentences from the abstracts via stanfordNLP and put each sentence on its own line in a text file. The text file required by deeplearning4j word2vec was then ready to process (http://deeplearning4j.org/word2vec).
Since the texts come from scientific fields, there are a lot of mathematical terms or brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I have run word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
@Override
public String preProcess(String sentence) {
//System.out.println(sentence.toLowerCase());
return sentence.toLowerCase();
}
});
// Split on white spaces in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
.minWordFrequency(5)
.iterations(1)
.layerSize(100)
.seed(42)
.windowSize(5)
.iterate(iter)
.tokenizerFactory(t)
.build();
log.info("Fitting Word2Vec model....");
vec.fit();
log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file containing many words with their related vector values in each row, as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
As a subsequent step, this text file has been used to form some clusters via k-means in spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(2,101).map(_.toDouble))).cache()
val numberOfClusters = 10
val numberOfInterations = 100
//We use KMeans object provided by MLLib to run
val modell = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfInterations)
modell.clusterCenters.foreach(println)
//Get cluster index for each buyer Id
val AltCompByCluster = rawData.map {
row=>
(modell.predict(Vectors.dense(row.split(' ').slice(2,101)
.map(_.toDouble))),row.split(',').slice(0,1).head)
}
AltCompByCluster.foreach(println)
Running the Scala code above gave me 10 clusters based on the word vectors produced by word2vec. However, when I checked the clusters, no obvious common words appeared; that is, I could not get reasonable clusters as I expected. Given this problem, I have a few questions:
1) In some word2vec tutorials I have seen that no data cleaning is done. In other words, prepositions etc. are left in the text. So how should I apply a cleaning procedure when using word2vec?
2) How can I visualize the clustering results in an explanatory way?
3) Can I use word2vec word vectors as input to neural networks? If so which neural network (convolutional, recursive, recurrent) method would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.
This is an extension of my previous question: https://dsp.stackexchange.com/questions/28095/choosing-low-pass-filter-parameters
I am recording people from an overhead camera. I have tracks of each person's head from some software. I want to remove the periodicity in the tracks that is due to head wobbling.
I apply a low-pass Butterworth filter. I want the starting point and ending point of the filtered tracks to be the same as those of the unfiltered tracks.
Data:
K>> [xcor_i,ycor_i ]
ans =
-101.7000 -77.4040
-102.4200 -77.4040
-103.6600 -77.4040
-103.9300 -76.6720
-103.9900 -76.5130
-104.0000 -76.4780
-105.0800 -76.4710
-106.0400 -77.5660
-106.2500 -77.8050
-106.2900 -77.8570
-106.3000 -77.8680
-106.3000 -77.8710
-107.7500 -78.9680
-108.0600 -79.2070
-108.1200 -79.2590
-109.9500 -80.3680
-111.4200 -80.6090
-112.8200 -81.7590
-113.8500 -82.3750
-115.1500 -83.2410
-116.1500 -83.4290
-116.3700 -83.8360
-117.5000 -84.2910
-117.7400 -84.3890
-118.8800 -84.7770
-119.8400 -85.2270
-121.1400 -85.3250
-123.2200 -84.9800
-125.4700 -85.2710
-127.0400 -85.7000
-128.8200 -85.7930
-130.6500 -85.8130
-132.4900 -85.8180
-134.3300 -86.5500
-136.1700 -87.0760
-137.6500 -86.0920
-138.6900 -86.9760
-140.3600 -87.9000
-142.1600 -88.4660
-144.7200 -89.3210
Code (answer by @SleuthEye):
dataOut_x = xcor_i(1)+filter(b,a,xcor_i-xcor_i(1));
dataOut_y = ycor_i(1)+filter(b,a,ycor_i-ycor_i(1));
Output:
In the above example, the endpoint (to the left) is different for the filtered and unfiltered tracks. How can I ensure it is the same?
Your question is pretty ambiguous and doesn't really ask anything specific. I'm assuming you want your filtered data to start at the same point as the measured data, but are unsure why this is not happening already and how to make it happen.
A low pass filter is a filter which lowers the effect of rapid changes. One way of doing this, and the method which appears to be used here, is by using a rolling average. A rolling average is simply an average (mean) of the previous data points. It looks like you are using a rolling average of 5 data points. Therefore you need five points of raw data before your filter will give you a single data point.
-101.7000 -77.4040 }
-102.4200 -77.4040 } }
-103.6600 -77.4040 } }
-103.9300 -76.6720 } }
-103.9900 -76.5130 } Filter point 1. }
-104.0000 -76.4780 } Filter point 2.
-105.0800 -76.4710
-106.0400 -77.5660
-106.2500 -77.8050
-106.2900 -77.8570
-106.3000 -77.8680
-106.3000 -77.8710
In order to solve this problem, you could just append the first data point to the data set four times, as this means that the filter will produce the same number of points. This is a pretty rough solution, however, as you are creating new data. This could be achieved quite simply, for example if your dataset is called myArray:
firstEntry = myArray(1,:);
myNewArray = [firstEntry; firstEntry; firstEntry; firstEntry; myArray];
This will create four data points equal to your first data point, which should then allow you to apply the low pass filter to your data, and have it start at the same point.
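The padding idea can be sketched as follows, assuming a plain 5-point moving average as described in this answer (not the Butterworth filter from the question). It is written in Python rather than MATLAB, and the function names are hypothetical.

```python
def moving_average(values, window=5):
    """Causal moving average: each output averages the last `window` samples."""
    return [
        sum(values[k - window + 1:k + 1]) / window
        for k in range(window - 1, len(values))
    ]


def padded_moving_average(values, window=5):
    """Prepend the first sample window-1 times so the output keeps the input length."""
    padded = [values[0]] * (window - 1) + list(values)
    return moving_average(padded, window)


# First few x coordinates from the question's data.
x = [-101.70, -102.42, -103.66, -103.93, -103.99, -104.00, -105.08]
unpadded = moving_average(x)          # shorter than the input
smoothed = padded_moving_average(x)   # same length as the input
```

Without padding, the output is four samples shorter than the input; with padding, the first filtered value is the average of five copies of the first sample, so it coincides with the first raw point.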
Hope this helps, although it's worth bearing in mind that filtering ALWAYS results in a loss of data.
Because you don't want to implement it but want someone else to:
The theory as above is correct, but instead you need to add 2 values at the end of your vectors:
x_last = xcor_i(end);
y_last = ycor_i(end);
xcor_i = [xcor_i;x_last;x_last];
ycor_i = [ycor_i;y_last;y_last];
This gives the following:
As you can see the ends are pretty close to being the same now.