Read file and represent rows as Vectors - scala

I have a dataset that contains DocID, WordID and frequency (count) as shown below.
Note that the first three numbers represent 1. the number of documents, 2. the
number of words in the vocabulary and 3. the total number of words in the collection.
189
1430
12300
1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
1 96 1
2 105 1
2 108 1
3 133 3
What I want to do is to read the data (ignore the first three lines), combine the words per document and finally represent each document as a vector that contains the frequency of the wordID.
Based on the above dataset the representation of documents 1, 2 and 3 will be (note that vocab_size can be extracted by the second line of the data):
val data = Array(
Vectors.sparse(vocab_size, Seq((2, 1.0), (39, 1.0), (42, 3.0), (77, 1.0), (95, 1.0), (96, 1.0))),
Vectors.sparse(vocab_size, Seq((105, 1.0), (108, 1.0))),
Vectors.sparse(vocab_size, Seq((133, 3.0))))
The problem is that I am not quite sure how to read the .txt.gz file as an RDD and create an Array of sparse vectors as described above. Please note that I actually want to pass the data array to the PCA transformer.

Something like this should do the trick:
sc.textFile("path/to/file").flatMap(r => r.split(' ') match {
  case Array(doc, word, freq) => Some((doc.toInt, (word.toInt, freq.toDouble)))
  case _ => None
}).groupByKey().mapValues(a => Vectors.sparse(vocab_size, a.toSeq))
Note that the groupByKey method will load all the values for each document into memory, so you might want to use one of its variants, reduceByKey or aggregateByKey, instead (I would have, but I don't know which methods you have on your sparse vectors, although you probably have something to merge them together).
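To make the parse-and-group logic easier to follow, here is a minimal sketch of the same idea in plain Python, without Spark. The function name rows_to_sparse and the shortened sample list are made up for illustration; the dictionary of (wordID, frequency) pairs only stands in for the sparse vectors.

```python
from collections import defaultdict

def rows_to_sparse(lines):
    """Group 'doc word freq' lines into {doc_id: [(word_id, freq), ...]}."""
    docs = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            # The three header lines hold a single number each, so skip them.
            continue
        doc, word, freq = parts
        docs[int(doc)].append((int(word), float(freq)))
    return dict(docs)

# Hypothetical, shortened copy of the sample file from the question.
sample = ["189", "1430", "12300",
          "1 2 1", "1 39 1", "1 42 3",
          "2 105 1", "2 108 1", "3 133 3"]

vocab_size = int(sample[1])  # the second header line is the vocabulary size
sparse_docs = rows_to_sparse(sample)
print(vocab_size, sparse_docs)
```

Each value in sparse_docs has the shape that Vectors.sparse expects as its second argument.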

Performing random trials in pyspark

I have been learning pyspark recently and wanted to apply it to one of my problems. Basically I want to perform random trials on each record in a dataframe. My dataframe is structured as below.
order_id,order_date,distribution,quantity
O1,D1,3 4 4 5 6 7 8 ... ,10
O2,D2,1 6 9 10 12 16 18 ..., 20
O3,D3,7 12 15 16 18 20 ... ,50
Here the distribution column holds 100 percentile points, with the values separated by spaces.
I want to loop through each row in the dataframe, randomly select a point in the distribution, add that many days to order_date, and store the result in a new column, arrival_date.
At the end I want to get avg(quantity) by arrival_date. So my final dataframe should look like
arrival_date,qty
A1,5
A2,10
What I have achieved so far is below
import datetime
import random

df = spark.read.option("header", True).csv("/tmp/test.csv")

def randSample(row):
    order_id = row.order_id
    quantity = int(row.quantity)
    data = []
    for i in range(1, 20):  # 19 trials per record
        n = random.randint(0, 99)  # index of one of the 100 percentile points
        randnum = int(float(row.distribution.split(" ")[n]))
        arrival_date = datetime.datetime.strptime(row.order_date.split(" ")[0], "%Y-%m-%d") + datetime.timedelta(days=randnum)
        data.append((arrival_date, quantity))
    return data

finalRDD = df.rdd.map(randSample)
The calculations look correct; however, finalRDD is structured as a list of lists, as below
[
[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
]
Each list inside the main list is a single record, and each tuple inside a nested list is one trial of that record.
Basically I want the final output as flattened records, so that I can compute the average.
[
(),
(),
(),
]
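The flattening asked for here can be sketched in plain Python as follows. In Spark, the RDD method flatMap (used in place of map) concatenates the per-record lists in the same way. The dates and quantities below are invented for illustration only.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical output of the per-record trials: one inner list per record.
nested = [
    [("2020-01-03", 10), ("2020-01-05", 10)],
    [("2020-01-03", 20), ("2020-01-07", 20)],
]

# Flatten the list of lists into a single list of (arrival_date, quantity).
flat = list(chain.from_iterable(nested))

# Average quantity per arrival date: accumulate [sum, count] per key.
totals = defaultdict(lambda: [0, 0])
for arrival_date, quantity in flat:
    totals[arrival_date][0] += quantity
    totals[arrival_date][1] += 1
avg_qty = {d: s / n for d, (s, n) in totals.items()}
print(flat)
print(avg_qty)
```

Once the records are flat, the average is an ordinary group-by on the first tuple element.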

Find the count of word pairs in kdb+

I have a file which contains multiple rows of item codes as follows. There are 1 million rows similar to these
1. 123,134,256,345,789.....
2. 123,256,345,678,789......
.
.
I would like to find the count of all the pair of words/items per row in the file using q in kdb+. i.e. any two pair of words that occur in the same row can be considered a word pair.
e.g:
(123,134),(123,256),(134,256), (123,345) (123,789), (134,789) are some of the word pairs in row 1
(123,256),(123,345),(123,345),(678,789),(345,789) are some of the word pairs in row 2
word/item pair count
`123,134----1
123,256---2
345,789---2`
I am reading the file using read0 and have been able to convert each line into list using vs and using count each group to count the number of words, but now I want to find the count of all the word pairs per row in the file.
Thanks in advance for your help
I'm not 100% sure I understand your definition of a word pair. Perhaps you could expand a little if my logic doesn't match what you were looking for.
In the example below, I've created a 5x5 matrix of symbols for testing, selected the distinct pairs of values from each row, and then checked how many rows each of these pairs appears in.
Please double check against your own results.
q)test:5 cut`$string 25?5
q)test
2 0 1 0 0
2 4 4 2 0
1 0 0 3 4
2 1 1 4 4
3 0 3 4 0
q)count each group raze {l[where(count'[l:distinct distinct each asc'[x cross x:distinct x]])>1]} each test
0 2| 2
1 2| 2
0 1| 2
2 4| 2
0 4| 3
1 3| 1
1 4| 2
0 3| 2
3 4| 2
To add some other cases to Matthew's answer above, if what you want is to break the list down into pairs in this way:
l:"a,b,c,d,e,f,g"
becomes
"a,b"
"b,c"
"c,d"
"d,e"
"e,f"
"f,g"
so only taking valid pairs, you could use something like this:
f:{count each group b flip 0 1+\:til 1+count[b:","vs x]-1}
q)f l
,"a" ,"b"| 1
,"b" ,"c"| 1
,"c" ,"d"| 1
,"d" ,"e"| 1
,"e" ,"f"| 1
,"f" ,"g"| 1
where we're splitting the input list on ",", then using indexing to get a list of each element and the element directly to its right, then grouping the resulting list of pairs to count the distinct pairs. If you want to split it so l becomes
"a,b"
"c,d"
"e,f"
then you could use this:
g:{count each group b flip 0 1+\:2*til count[b:","vs x]div 2}
q)g l
,"a" ,"b"| 1
,"c" ,"d"| 1
,"e" ,"f"| 1
Which uses a similar approach, starting with the even-positioned elements and getting those to their right, and repeating as above.
You can easily apply these to the rows read with read0:
r:read0`:file.txt
f each r
will output a dictionary of the counts of each pair for each row, and this can be summed to give the total count of each word pair with each method throughout the file.
Hope this helps. It's still not clear what you mean by pairs, so if neither my answer nor Matthew's is of use, you could edit in a more complete explanation of what you'd like and we can help with that.
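For readers less familiar with q, the two splitting schemes above (overlapping neighbours and non-overlapping pairs) can be sketched in Python like this. The function names adjacent_pairs and disjoint_pairs are made up for illustration.

```python
from collections import Counter

def adjacent_pairs(line):
    """Count each element paired with the element directly to its right."""
    items = line.split(",")
    return Counter(zip(items, items[1:]))

def disjoint_pairs(line):
    """Count non-overlapping pairs: (1st, 2nd), (3rd, 4th), ..."""
    items = line.split(",")
    return Counter(zip(items[0::2], items[1::2]))

l = "a,b,c,d,e,f,g"
print(adjacent_pairs(l))
print(disjoint_pairs(l))

# Counters can be added, which gives totals across several rows.
total = adjacent_pairs("a,b,c") + adjacent_pairs("b,c,d")
print(total)
```

Adding the per-row Counters plays the same role as summing the per-row dictionaries in q.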
If you want to consider all possible pairs of items in each row then this may be of help. The following function can be used to give distinct combinations, where x is the size of the list and y is the length of the combination:
q)comb:{$[x=y;enlist til x;1=y;flip enlist til x;.z.s[x;y],.z.s[x;y-1],'x-:1]}
q)comb[3;2]
0 1
0 2
1 2
From here we can index into each list to get the pairs, then raze to give a single list of all pairs, group to get the indices where each pair occurs and then count the number of indices in each group:
q)a
123 134 256 345 789
123 256 345 678 789
q)count each group raze{x comb[count x;2]}'[a]
123 134| 1
123 256| 2
134 256| 1
...
345 789| 2
...
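The same all-pairs count can be sketched in Python with itertools.combinations. This is an illustration under the assumption that a pair is unordered and that an item repeated within a row counts only once, hence the sorted(set(row)) step.

```python
from collections import Counter
from itertools import combinations

# The two sample rows from the question.
rows = [
    [123, 134, 256, 345, 789],
    [123, 256, 345, 678, 789],
]

# For each row, take every 2-item combination and count it across rows.
pair_counts = Counter(
    pair
    for row in rows
    for pair in combinations(sorted(set(row)), 2)
)
print(pair_counts[(123, 134)], pair_counts[(123, 256)], pair_counts[(345, 789)])
```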

Rows without repetitions - MATLAB

I have a matrix (4096x4) containing all possible combinations of four values taken from a pool of 8 numbers.
...
3 63 39 3
3 63 39 19
3 63 39 23
3 63 39 39
...
I am only interested in the rows of the matrix that contain four unique values. In the above section, for example, the first and last row should be removed, giving us -
...
3 63 39 19
3 63 39 23
...
My current solution feels inelegant-- basically, I iterate across every row and add it to a result matrix if it contains four unique values:
result = [];
for row = 1:size(matrix,1)
    if length(unique(matrix(row,:)))==4
        result = cat(1,result,matrix(row,:));
    end
end
Is there a better way ?
Approach #1
diff and sort based approach that must be pretty efficient -
sortedmatrix = sort(matrix,2)
result = matrix(all(diff(sortedmatrix,[],2)~=0,2),:)
Breaking it down to few steps for explanation
Sort along the columns, so that the duplicate values in each row end up next to each other. We used sort for this task.
Find the difference between consecutive elements, which will catch those duplicate after sorting. diff was the tool for this purpose.
Any row with at least one zero contains duplicate values. Put the other way, a row with no zeros contains no duplicates, which is what we want in the output. all does the job here, giving us a logical array of such rows.
Finally, we have used matrix indexing to select those rows from matrix to get the expected output.
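The sort-then-compare-neighbours steps above can be sketched in plain Python as follows. This is only an illustration of the idea, not a translation of the vectorised MATLAB code, and the function name rows_without_repeats is made up.

```python
def rows_without_repeats(matrix):
    """Keep rows whose sorted values have no equal neighbours."""
    keep = []
    for row in matrix:
        s = sorted(row)  # duplicates end up next to each other
        # Equal neighbours correspond to a zero in MATLAB's diff output.
        if all(a != b for a, b in zip(s, s[1:])):
            keep.append(row)
    return keep

matrix = [
    [3, 63, 39, 3],
    [3, 63, 39, 19],
    [3, 63, 39, 23],
    [3, 63, 39, 39],
]
result = rows_without_repeats(matrix)
print(result)
```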
Approach #2
This could be an experimental bsxfun based approach as it won't be memory-efficient -
matches = bsxfun(@eq,matrix,permute(matrix,[1 3 2]))
result = matrix(all(all(sum(matches,2)==1,2),3),:)
Breaking it down to few steps for explanation
Find a logical array of matches for every element against all others in the same row with bsxfun.
Look for "non-duplicity" by summing those matches along dim-2 of matches and then finding all ones elements along dim-2 and dim-3 getting us the same indexing array as had with our previous diff + sort based approach.
Use the binary indexing array to select the appropriate rows from matrix for the final output.
Approach #3
Taking help from MATLAB File-exchange's post combinator
and assuming you have the pool of 8 values in an array named pool8, you can directly get result like so -
result = pool8(combinator(8,4,'p'))
combinator(8,4,'p') basically gets us the indices for 8 elements taken 4 at once and without repetitions. We use these indices to index into the pool and get the expected output.
This will work for a pool of finite size. Create an IsUnique array, go through each number in the pool, count the number of times it comes up in each row, and keep IsUnique at 1 only if the number is found at most once. Next, find the positions where IsUnique is still 1 and extract those rows.
matrix = [3,63,39,3;3,63,39,19;3,63,39,23;3,63,39,39;3,63,39,39;3,63,39,39];
IsUnique = ones(size(matrix,1),1);
pool = [3,63,39,19,23,6,7,8];
for NumberInPool = 1:8
    Temp = sum((matrix == pool(NumberInPool))')';
    IsUnique = IsUnique .* (Temp<2);
end
UniquePositions = find(IsUnique==1);
result = matrix(UniquePositions,:)

Delete rows from a cell given a specific condition

I have a cell type big-variable sorted out by FIRM (A(:,2)) and I want to erase all the rows in which the same firm doesn't appear at least 3 times in a row. In this example, A:
FIRM
1997 'ABDR' 0,56 464 1641 19970224
1997 'ABDR' 0,65 229 9208 19970424
1997 'ABDR' 0,55 125 31867 19970218
1997 'ABD' 0,06 435 8077 19970311
1997 'ABD' 0,00 150 44994 19970804
1997 'ABFI' 2,07 154 46532 19971209
I would keep only A:
1997 'ABDR' 0,56 464 1641 19970224
1997 'ABDR' 0,65 229 9208 19970424
1997 'ABDR' 0,55 125 31867 19970218
Thanks a lot.
Notes:
I used fopen and textscan to import the csv file.
I made changes to some variables so that all of them fit in a cell-type variable
I converted some number elements into strings
F_x=num2cell(Data{:,x});
I got new variable just with year
F_ya=max(0,fix(log10(F_y)+1)-4);
F_yb=fix(F_y./10.^F_ya);
F_yc = num2cell(F_yb);
Create new cell A w/ variables I need
A=[F_5C Data{:,1} Data{:,2} Data{:,3} Data{:,4} F_xa F_xb];
Meaning that within the cell I have some variables that are strings and others that are numbers.
I'm going to assume that your names are stored in a cell array. As such, your names would actually be:
names = {'ABDR', 'ABDR', 'ABDR', 'ABD', 'ABD', 'ABFI'};
We can then use strcmpi. This function compares two strings and returns true if they match and false otherwise. The comparison is case insensitive, so ABDR would be the same as abdr.
You would call strcmpi like so:
v = strcmpi(str1, str2);
Alternatively str2 can be a cell array. How this would work is that it would take a single string str1 and compare with each string in each cell of the cell array. It would then return a logical vector that is the same size as str2 which indicates whether we have a match at this particular location or not.
As such, we can go through each element of names and see how many matches we have overall with the entire names cell array. We can then figure out which locations we need to select by checking to see if we have at least 3 matches or more per name in the names array. In other words, we simply sum up the logical vector for each string within names and filter those that sum up to 3 or more. We can use cellfun to help us perform this. As such:
sums = cellfun(@(x) sum(strcmpi(x,names)), names);
Doing this thus gives:
sums =
3 3 3 2 2 1
Now, we need those locations that have three or more. As such:
locations = sums >= 3
locations =
1 1 1 0 0 0
As such, these are the rows that you can use to filter out your matrix. This is also a logical vector. Assuming that A contains your data, you would simply do A(locations,:) to filter out all those rows that have occurrences of three or more times for a particular name. I really don't know how you constructed A, so I'm assuming it's like a 2D matrix. If you put in the code that you used to construct this matrix, I'll modify my post to get it working for you. In any case, what's important is locations. This tells you what rows you need to select to match your criteria.
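The count-then-filter idea can also be sketched in Python. The rows below are a simplified stand-in for A holding only the year and the firm name, and the lower() call mirrors the case-insensitive comparison of strcmpi.

```python
from collections import Counter

# Simplified, hypothetical stand-in for A: (year, firm) per row.
A = [
    (1997, "ABDR"), (1997, "ABDR"), (1997, "ABDR"),
    (1997, "ABD"), (1997, "ABD"), (1997, "ABFI"),
]
names = [firm for _, firm in A]

# How often each name occurs overall (case-insensitive).
counts = Counter(name.lower() for name in names)
sums = [counts[name.lower()] for name in names]

# Logical mask of the rows whose firm occurs at least 3 times.
locations = [s >= 3 for s in sums]
filtered = [row for row, keep in zip(A, locations) if keep]
print(sums)
print(filtered)
```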

Using SUM and UNIQUE to count occurrences of value within subset of a matrix

So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way by [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I fix it so that I could get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u=unique(x(:,1))
res=arrayfun(@(y)length(x(x(:,1)==y & x(:,2)==1)),u)
Taking apart that last line:
arrayfun(fun,array) applies fun to each element in the array, and puts it in a new array, which it returns.
Here fun is the anonymous function @(y)length(x(x(:,1)==y & x(:,2)==1)), which finds the length of the portion of x where the condition x(:,1)==y & x(:,2)==1 holds (this is called logical indexing). So for each of the unique elements, it counts the rows in x where the first column is the unique element and the second is one.
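The same per-unique-value count can be sketched in plain Python, using the matrix from the question as a list of (column 1, column 2) tuples:

```python
# The sample matrix from the question, one tuple per row.
x = [(20, 2), (20, 2), (30, 2), (30, 1), (40, 1), (40, 1)]

# Unique values of column 1, sorted as MATLAB's unique would return them.
u = sorted({first for first, _ in x})

# For each unique value, count the rows where column 2 equals 1.
res = [(v, sum(1 for first, second in x if first == v and second == 1))
       for v in u]
print(res)
```

Values that never appear with a 1, such as 20 here, are kept with a count of 0.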
Try this (as specified in this answer):
>> [c,~,d] = unique(a(a(:,2)==1))
c =
30
40
d =
1
2
2
>> counts = accumarray(d(:),1,[],@sum)
counts =
1
2
>> res = [c,counts]
Suppose you have an array of various integers in 'array'.
The tabulate function will sort the unique values and count the occurrences.
table = tabulate(array)
Look for the counts in column 2 of table.