Is there an efficient way to combine dcast, melt and merge on a large dataset in R?

I have 4 CSV files with 63,290 observations and 6 columns each. I convert them to data.table format and try to use dcast and melt on each to create a count variable, but I get this error message: cannot allocate vector of size 8.0Gb
I will then have to merge the 4 datasets into one for my econometric analysis.
Does anyone have an idea of how I could complete this task?
For the dcast and melt part, I use the following code for the first dataset data1:
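library(data.table)
data1_dt <- as.data.table(read.csv("data1.csv"))
# Count "niu" records per id/regime/status/centre/year (rows) and month (columns)
new_DT <- dcast(data1_dt, id + regime + status + centre + year ~ month, value.var = "niu", fun.aggregate = length, drop = FALSE)
# treat2 = 1 when the monthly counts in a row sum to 2 or more
new_DT1 <- new_DT[, treat2 := ifelse(rowSums(.SD) >= 2, 1, 0), .SDcols = patterns("^[0-9]+$")][]
# Back to long format: one row per month, with the count in treat1
new_DT2 <- melt(new_DT1, measure.vars = patterns("^[0-9]+$"), variable.name = "month", value.name = "treat1")
rm("new_DT", "new_DT1")
data1new <- as.data.frame(new_DT2)
write.csv(data1new, file = "c:/RDATA2022/data1new.csv", row.names = FALSE)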
But I could not proceed because of the error message.
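One plausible reason for an 8 Gb allocation from only 63,290 rows is drop = FALSE, which makes dcast keep every combination of the formula variables rather than only the observed ones. The following is a rough Python sketch of that arithmetic; the level counts are hypothetical placeholders, not values from the question's data, and 4 bytes per cell assumes integer counts.

```python
from math import prod

def dense_cast_bytes(levels_per_key, n_value_columns, bytes_per_cell=4):
    """Estimate the memory of a fully crossed wide table (every key combination kept)."""
    n_rows = prod(levels_per_key)  # one row per combination of key levels
    return n_rows * n_value_columns * bytes_per_cell

# Hypothetical level counts for id, regime, status, centre, year (assumptions).
levels = [60000, 3, 4, 20, 5]
estimate = dense_cast_bytes(levels, n_value_columns=12)
print(f"{estimate / 1e9:.1f} GB for the crossed table")
```

If an estimate like this is far above the available RAM, casting with the default drop = TRUE, or counting with a grouped aggregation instead of a wide cast, may avoid the allocation.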

Related

Efficient load CSV coordinate format (COO) input to local matrix spark

I want to convert CSV coordinate format (COO) data into a local matrix. Currently I'm first converting them to CoordinateMatrix and then converting to LocalMatrix. But is there a better way to do this?
Example data:
0,5,5.486978435
0,3,0.438472867
0,0,6.128832321
0,7,5.295923198
0,1,7.738270234
Code:
var loadG = sqlContext.read.option("header", "false").csv("file.csv").rdd.map("mapfunctionCreatingMatrixEntryOutOfRow")
var G = new CoordinateMatrix(loadG)
var matrixG = G.toBlockMatrix().toLocalMatrix()
A LocalMatrix is stored on a single machine and hence does not make use of Spark's strengths. In other words, using Spark here seems a bit wasteful, although it is still possible.
The easiest way to get the CSV file to a LocalMatrix is to first read the CSV with Scala, not Spark:
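import scala.io.Source
// Assumes the mllib variant of SparseMatrix, matching the CoordinateMatrix used in the question
import org.apache.spark.mllib.linalg.SparseMatrix

// Parse each line into a (row, column, value) triple
val entries = Source.fromFile("data.csv").getLines()
  .map(_.split(","))
  .map(a => (a(0).toInt, a(1).toInt, a(2).toDouble))
  .toSeq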
The SparseMatrix variant of the LocalMatrix has a method for reading COO formatted data. The number of rows and columns needs to be specified to use it. Since the matrix is sparse, this should in most cases be done by hand, but it's possible to take the highest indices in the data as follows:
val numRows = entries.map(_._1).max + 1
val numCols = entries.map(_._2).max + 1
Then create the matrix:
val matrixG = SparseMatrix.fromCOO(numRows, numCols, entries)
The matrix will be stored in CSC format on the machine. Printing the example input above will yield the following output:
1 x 8 CSCMatrix
(0,0) 6.128832321
(0,1) 7.738270234
(0,3) 0.438472867
(0,5) 5.486978435
(0,7) 5.295923198
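To make the CSC layout concrete, here is a small Python sketch of how COO triples map to the three CSC arrays (column pointers, row indices, values). It is an illustration of the format only, not Spark's implementation, and it assumes there are no duplicate coordinates; the function name is made up for this example.

```python
def coo_to_csc(num_rows, num_cols, entries):
    """Convert (row, col, value) triples into CSC arrays: col_ptrs, row_indices, values."""
    ordered = sorted(entries, key=lambda e: (e[1], e[0]))  # column-major order
    col_ptrs = [0] * (num_cols + 1)
    for _, col, _ in ordered:
        col_ptrs[col + 1] += 1          # count the entries in each column
    for c in range(num_cols):
        col_ptrs[c + 1] += col_ptrs[c]  # running total gives each column's start offset
    row_indices = [row for row, _, _ in ordered]
    values = [value for _, _, value in ordered]
    return col_ptrs, row_indices, values

# The example data from the question
entries = [(0, 5, 5.486978435), (0, 3, 0.438472867), (0, 0, 6.128832321),
           (0, 7, 5.295923198), (0, 1, 7.738270234)]
num_rows = max(r for r, _, _ in entries) + 1
num_cols = max(c for _, c, _ in entries) + 1
col_ptrs, row_indices, values = coo_to_csc(num_rows, num_cols, entries)
```

Column j's entries sit in positions col_ptrs[j] up to, but not including, col_ptrs[j + 1], so empty columns simply repeat the previous offset.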

How do I merge or combine error rates?

Let's say I have a dataset that has 9 continuous columns of data and 4 columns of categorical data. In Matlab, I separate the columns into two groups and do training/testing (naïve bayes) on them separately and determine that the continuous columns have an error rate of 0.45 and the categorical columns have an error rate of 0.33. My question is - how do I determine the combined error?
EDIT - Simple pseudocode overview added:
for x = 1:num_iterations
Mdl_NB1 = fitcnb(TrainingSet_Con,TrainingTargets,'Distribution','normal');
Mdl_NB2 = fitcnb(TrainingSet_Dis,TrainingTargets,'Distribution','mn');
[NB1_label,NB1_Posterior,NB1_Cost] = predict(Mdl_NB1,TestPoint_Con);
[NB2_label,NB2_Posterior,NB2_Cost] = predict(Mdl_NB2,TestPoint_Dis);
NB1_cumulLoss = NB1_cumulLoss + resubLoss(Mdl_NB1);
NB2_cumulLoss = NB2_cumulLoss + resubLoss(Mdl_NB2);
end
NB1_avg_score = NB1_cumulLoss/num_iterations
NB2_avg_score = NB2_cumulLoss/num_iterations
total_avg_score = ???
The three obvious choices, in principle, are:
(A+B) / 2
A * B
(A*(CountA/TotalCount)) + (B*(CountB/TotalCount))
But not sure if any of these are right, in this case.
This does not make sense; you are effectively building two separate models. So either build one model with all columns (maybe with 'Distribution','mvmn') or combine both models into one with something like
Mdl_Ens = fitcnb([NB1_Posterior; NB2_Posterior],TrainingTargets,'Distribution','normal');
NEns_cumulLoss = NEns_cumulLoss + resubLoss(Mdl_Ens);
to actually build a single model out of the output of the two models based on a subset of the columns each.
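Because naive Bayes treats features as conditionally independent given the class, the predictions of two naive Bayes models trained on disjoint column groups can also be combined per test point, rather than averaging their error rates. A minimal Python sketch, assuming both models share the same class prior and return posteriors in the same class order (the function name and the numbers are illustrative, not from the question):

```python
def combine_posteriors(post_a, post_b, prior):
    """Combine per-class posteriors of two naive Bayes models on disjoint features.

    Uses P(c | a, b) proportional to P(c | a) * P(c | b) / P(c).
    """
    unnormalised = [pa * pb / p for pa, pb, p in zip(post_a, post_b, prior)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Illustrative posteriors for one test point from the two models
combined = combine_posteriors([0.6, 0.4], [0.7, 0.3], prior=[0.5, 0.5])
predicted_class = max(range(len(combined)), key=combined.__getitem__)
```

The combined error is then the fraction of test points whose predicted class differs from the true label.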

How to apply word2vec for k-means clustering?

I am new to word2vec. Using this method, I am trying to form clusters of words extracted by word2vec from the abstracts of scientific publications. To this end, I first retrieved sentences from the abstracts via stanfordNLP and put each sentence on its own line in a text file. The text file required by deeplearning4j word2vec was then ready to process (http://deeplearning4j.org/word2vec).
Since the texts come from scientific fields, there are a lot of mathematical terms or brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I have run word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
@Override
public String preProcess(String sentence) {
//System.out.println(sentence.toLowerCase());
return sentence.toLowerCase();
}
});
// Split on white spaces in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
.minWordFrequency(5)
.iterations(1)
.layerSize(100)
.seed(42)
.windowSize(5)
.iterate(iter)
.tokenizerFactory(t)
.build();
log.info("Fitting Word2Vec model....");
vec.fit();
log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file in which each row contains a word followed by its vector values, as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
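A row in this layout can be read by taking the first token as the word and all remaining tokens as vector components. A small Python sketch, assuming whitespace-separated rows with the word first (the sample row is truncated to three components for illustration, and the function name is made up):

```python
def parse_vector_line(line):
    """Split one row of the word-vector file into (word, vector)."""
    tokens = line.split()
    word = tokens[0]
    vector = [float(t) for t in tokens[1:]]  # every remaining token is one component
    return word, vector

word, vector = parse_vector_line(
    "pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117"
)
```

With layerSize(100), each parsed vector should have 100 components; checking that length is a quick sanity test before clustering.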
As a subsequent step, this text file has been used to form some clusters via k-means in spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(2,101).map(_.toDouble))).cache()
val numberOfClusters = 10
val numberOfInterations = 100
//We use KMeans object provided by MLLib to run
val modell = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfInterations)
modell.clusterCenters.foreach(println)
//Get cluster index for each buyer Id
val AltCompByCluster = rawData.map {
row=>
(modell.predict(Vectors.dense(row.split(' ').slice(2,101)
.map(_.toDouble))),row.split(',').slice(0,1).head)
}
AltCompByCluster.foreach(println)
Using the Scala code above, I obtained 10 clusters based on the word vectors produced by word2vec. However, when I checked the clusters, no obvious common words appeared; that is, I could not get the reasonable clusters I expected. Given this problem, I have a few questions:
1) From some tutorials for word2vec I have seen that no data cleaning is made. In other words, prepositions etc. are left in the text. So how should I apply cleaning procedure when applying word2vec?
2) How can I visualize the clustering results in an explanatory way?
3) Can I use word2vec word vectors as input to neural networks? If so which neural network (convolutional, recursive, recurrent) method would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.

Filtering by Column in Matlab for a list or variety of values

I've written code that separates the data in a text file into the required format, filters it, and averages the output (in this case, the value in the fourth column).
I am trying to filter the data in column one for a list of values at the same time, with no strict pattern to the values, e.g. 1001, 1007, 1048, 1192, 1200, ...
Currently my code only filters by a single value (1001). Is there a way of incorporating a list of values into this function?
C_f = C(C(:,1) == 1001 , :);
Any help would be much appreciated!
See if this is what you want,
val = [1000 1001];
ind = ismember(C(:,1),val);
C_f = C(ind,:)
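The same membership filter, followed by the averaging of the fourth column that the question describes, can be sketched in Python as follows. The sample rows are invented for illustration, and the function name is hypothetical.

```python
def filter_rows(rows, wanted):
    """Keep the rows whose first column is one of the wanted values."""
    wanted_set = set(wanted)  # set lookup keeps each membership test cheap
    return [row for row in rows if row[0] in wanted_set]

# Invented sample data: first column is the ID, fourth column is the value to average
C = [[1001, 2.0, 3.0, 10.0], [1002, 2.5, 3.1, 20.0], [1007, 1.0, 0.5, 30.0]]
C_f = filter_rows(C, [1001, 1007, 1048, 1192, 1200])
average_col4 = sum(row[3] for row in C_f) / len(C_f)
```

Values in the list that never occur in column one (1048, 1192 and 1200 here) are simply ignored, which matches how a membership test behaves.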

IDL and MatLab getting strange values from NetCDF file

I have a NetCDF file, which contains data representing total precipitation across the globe over several months (so it's stored in a three dimensional array). I first ensured that the data was sensible, and the way it was formed, both in XConv and ncdump. All looks sensible - values vary from very small (~10^-10 - this makes sense, as this is model data, and effectively represents zero) to about 5x10^-3.
The problems start when I try to handle this data in IDL or MatLab. The arrays generated in these programs are full of huge negative numbers such as -4x10^4, with occasional huge positive numbers, such as 5000. Strangely, looking at a plot of the data in MatLab with respect to latitude and longitude (at a specific time), the pattern of rainfall looks sensible, but the values are just completely wrong.
In IDL, I'm reading the file in to write it to a text file so it can be handled by some software that takes very basic text files. Here's the code I'm using:
PRO nao_heaps
address = '/Users/levyadmin/Downloads/'
file_base = 'output'
ncid = ncdf_open(address + file_base + '.nc')
MONTHS=['january','february','march','april','may','june','july','august','september','october','november','december']
varid_field = ncdf_varid(ncid, "tp")
varid_lon = ncdf_varid(ncid, "longitude")
varid_lat = ncdf_varid(ncid, "latitude")
varid_time = ncdf_varid(ncid, "time")
ncdf_varget,ncid, varid_field, total_precip
ncdf_varget,ncid, varid_lat, lats
ncdf_varget,ncid, varid_lon, lons
ncdf_varget,ncid, varid_time, time
ncdf_close,ncid
lats = reform(lats)
lons = reform(lons)
time = reform(time)
total_precip = reform(total_precip)
total_precip = total_precip*1000. ;put in mm
noLats=(size(lats))(1)
noLons=(size(lons))(1)
noMonths=(size(time))(1)
; the data may not be an integer number of years (otherwise we could make this next loop cleaner)
av_precip=fltarr(noLons,noLats,12)
for month=0, 11 do begin
year = 0
while ( (year*12) + month lt noMonths ) do begin
av_precip(*,*,month) = av_precip(*,*,month) + total_precip(*,*, (year*12)+month )
year++
endwhile
av_precip(*,*,month) = av_precip(*,*,month)/year
endfor
fname = address + file_base + '.dat'
OPENW,1,fname
PRINTF,1,'longitude'
PRINTF,1,lons
PRINTF,1,'latitude'
PRINTF,1,lats
for month=0,11 do begin
PRINTF,1,MONTHS(month)
PRINTF,1,av_precip(*,*,month)
endfor
CLOSE,1
END
Anyone have any ideas why I'm getting such strange values in MatLab and IDL?!
AH! Found the answer. The data in this NetCDF file is stored in packed form, with an offset and a scale factor, to keep the size of the file to a minimum. To get the correct values, I simply need to:
total_precip = offset + (scale_factor * total_precip) ;put into correct range
At present I'm getting the scale factor and offset from ncdump, and hard coding them into my IDL program, but does anyone know how I can get them dynamically in my IDL code..?
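The unpacking step described above can be sketched in Python as follows. By convention the two numbers are stored as variable attributes named scale_factor and add_offset; the packed values and attribute values below are hypothetical and chosen only to show the arithmetic.

```python
def unpack(packed_values, scale_factor, add_offset):
    """Convert packed integers to physical values: offset + scale_factor * packed."""
    return [add_offset + scale_factor * v for v in packed_values]

# Hypothetical packed values and attributes (assumptions, not from the file in question)
raw = [-32000, 0, 5000]
unpacked = unpack(raw, scale_factor=1e-7, add_offset=0.0032)
```

Any unit conversion, such as the multiplication by 1000 to get millimetres, should come after this step, since it applies to the physical values rather than the packed integers.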