Count words in Scala and create a dictionary - scala

I have a CSV file with a type and a description text:
type ; text
0 ; hello world
0 ; hello text 2
1 ; text1
1 ; text
2 ; world base
2 ; Hey you
2 ; test
I want to create a dictionary and produce another CSV file structured like this, with one line per type and the frequency of each word in the descriptions:
type ; hello ; world ; text ; 2 ; text1 ; base ; hey ; you ; test
0 ; 2 ; 1 ; 1 ; 1 ; 0 ; 0 ; 0 ; 0 ; 0
1 ; 0 ; 0 ; 1 ; 0 ; 1 ; 0 ; 0 ; 0 ; 0
2 ; 0 ; 1 ; 0 ; 0 ; 0 ; 1 ; 1 ; 1 ; 1
I have tons of lines on my csv file with many Strings, this is just an example.
I am just starting to work with spark and scala these days. Any help is needed.
Thanks

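Try:
import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"column" syntax; assumes a SparkSession named spark
// lower/trim so that "Hey" and "hey" count as the same word and padding around ";" is ignored
df.withColumn("text", explode(split(lower(trim($"text")), "\\s+")))
  .groupBy("type")
  .pivot("text")
  .count()
  .na.fill(0)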
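To make the logic of the pivot concrete, here is a minimal plain-Python sketch of the same computation (it is not Spark code): split each description into lower-cased words, count them per type, and emit one row per type over the full vocabulary. The function name word_counts_by_type and the inline sample rows are assumptions made for illustration.

```python
from collections import Counter


def word_counts_by_type(rows):
    """Return (vocabulary, {type: Counter}) for an iterable of (type, text) rows."""
    counts = {}
    vocab = {}  # dict used as an ordered set: words in order of first appearance
    for typ, text in rows:
        counter = counts.setdefault(typ, Counter())
        for word in text.lower().split():
            vocab.setdefault(word, None)
            counter[word] += 1
    return list(vocab), counts


# Sample rows copied from the question.
rows = [
    ("0", "hello world"), ("0", "hello text 2"),
    ("1", "text1"), ("1", "text"),
    ("2", "world base"), ("2", "Hey you"), ("2", "test"),
]
vocab, counts = word_counts_by_type(rows)
print(" ; ".join(["type"] + vocab))
for typ in sorted(counts):
    # A Counter returns 0 for words that never occurred for this type.
    print(" ; ".join([typ] + [str(counts[typ][w]) for w in vocab]))
```

This keeps everything in memory, so it only illustrates the idea; for a large file the Spark version is the one to run.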


How often does the value w change and how does it change step by step

function [w]=example3(v)
w=[0];
for x =v
for y=v
w(x,y)=x+y;
end
end
end
An example:
v=[5 2];
[w]=example3(v)
0 0 0 0 0
0 4 0 0 7
0 0 0 0 0
0 0 0 0 0
0 7 0 0 10
I have this code and I'm trying to figure out how often the value of w changes. More than that, I want to see, step by step, how the value of w changes (maybe just for the first few iterations).
As suggested by others, you can set a breakpoint on the line where w is assigned and inspect its value each time execution stops there.
If you want to do it programmatically, you could add this:
function [w]=example3(v)
w=[0];
idx = 0; % Count number of cycles
for x = v
    for y = v
        w(x,y) = x+y;
        idx = idx + 1;
        if idx < 6 % Only display first 5 values
            disp(['x: ' num2str(x) ' - y: ' ...
                num2str(y) ' - w: ' num2str(w(x,y))])
        end
    end
end
end
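As for how often w changes: assuming v is a row vector, the nested loops assign to w once per (x, y) pair, so numel(v)^2 times in total. The following Python sketch replays the loops for v = [5 2] and records each assignment; the name trace_assignments and the dictionary standing in for the auto-growing MATLAB matrix are assumptions for illustration.

```python
def trace_assignments(v):
    """Replay the nested loops, recording every assignment to w."""
    w = {}      # sparse stand-in for the MATLAB matrix: (row, col) -> value
    steps = []  # one (x, y, value) entry per assignment, in execution order
    for x in v:
        for y in v:
            w[(x, y)] = x + y
            steps.append((x, y, x + y))
    return w, steps


w, steps = trace_assignments([5, 2])
for x, y, value in steps:
    print(f"x: {x} - y: {y} - w: {value}")
print("assignments:", len(steps))
```

For v = [5 2] this gives four assignments, which matches the four non-zero entries in the example output.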

Write a logical vector in Matlab

Let's say T=1:20 and P=[2 6 9 11 15 19].
How can I build a logical vector that marks which elements of T are in P?
The answer I want is: flag= [0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0].
Use ismember, which is made for exactly this task:
ismember(T,P)
You can define a logical vector flag the size of T, then use P as an index vector of the flag to raise to true:
T=1:20 ; P=[2 6 9 11 15 19] ;
flag = false(size(T)) ;
flag(P) = true ;
flag =
0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0
For the fun of it, an alternative to Hoki's answer:
T(P) = 0;
flag = ~T
This sets all elements of T at the positions in P to zero, and then checks whether each value in T is 0 or not. This of course has the downside that it overwrites T. Note: I would go for Hoki's answer!
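For comparison, the membership test behind ismember(T,P) can be sketched in a few lines of Python. This is only an illustration of the idea, not MATLAB code; the variable names mirror the question and the set called members is an assumption.

```python
T = list(range(1, 21))          # same values as T = 1:20
P = [2, 6, 9, 11, 15, 19]

members = set(P)                # set lookup keeps each membership test cheap
flag = [int(t in members) for t in T]
print(flag)
```

Note that this tests values, like ismember does. The index-based answers (flag(P) = true and T(P) = 0) treat P as positions, which gives the same result here only because T is 1:20.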

Merge and create variable in merged dataset in same step

I am merging two datasets in SAS, "set_x" and "set_y", and would like to create a variable "E" in the resulting merged dataset "matched":
* Create set_x *;
data set_x ;
input merge_key A B ;
datalines;
1 24 25
2 25 25
3 30 32
4 32 32
5 20 32
6 1 1
;
run;
* Create set_y *;
data set_y ;
input merge_key C D ;
datalines;
1 1 1
2 2 1
3 1 1
4 2 1
5 1 1
7 1 1
;
run;
* Merge and create E *;
data matched unmatched_x unmatched_y ;
merge set_x (in=x) set_y (in=y) ;
by merge_key ;
if x=y then output matched;
else if x then output unmatched_x ;
else if y then output unmatched_y ;
IF C = 2 THEN DO ;
E = A ;
END;
ELSE DO ;
E = floor(B - D) ;
END ;
run ;
However, in the resulting table "matched", the values of E are missing. E is correctly calculated if I only output matched values (i.e. using if x=y;).
Is it possible to create "E" in the same data step if outputting unmatched as well as matched observations?
You output the rows before E is computed, and E is then reset to missing when the next iteration starts. Compute E before the output statements:
data matched unmatched_x unmatched_y ;
    merge set_x (in=x) set_y (in=y) ;
    by merge_key ;
    IF C = 2 THEN DO ;
        E = A ;
    END ;
    ELSE DO ;
        E = floor(B - D) ;
    END ;
    if x=y then output matched ;
    else if x then output unmatched_x ;
    else if y then output unmatched_y ;
run ;
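The effect of the ordering can be checked with a small Python sketch that mimics the corrected data step on the sample data: E is computed first, and only then is the row routed to one of the three outputs. The function name merge_and_route, the use of None as a stand-in for a SAS missing value, and the assumption of one row per merge_key in each dataset are all mine.

```python
import math


def merge_and_route(set_x, set_y):
    """Mimic the corrected data step: compute E, then route the row."""
    matched, unmatched_x, unmatched_y = [], [], []
    for key in sorted(set(set_x) | set(set_y)):
        in_x, in_y = key in set_x, key in set_y
        a, b = set_x.get(key, (None, None))
        c, d = set_y.get(key, (None, None))
        # Compute E before choosing an output list.
        if c == 2:
            e = a
        elif b is not None and d is not None:
            e = math.floor(b - d)
        else:
            e = None  # stands in for a SAS missing value
        row = {"merge_key": key, "A": a, "B": b, "C": c, "D": d, "E": e}
        if in_x and in_y:
            matched.append(row)
        elif in_x:
            unmatched_x.append(row)
        else:
            unmatched_y.append(row)
    return matched, unmatched_x, unmatched_y


# merge_key -> (A, B) and merge_key -> (C, D), copied from the question.
set_x = {1: (24, 25), 2: (25, 25), 3: (30, 32), 4: (32, 32), 5: (20, 32), 6: (1, 1)}
set_y = {1: (1, 1), 2: (2, 1), 3: (1, 1), 4: (2, 1), 5: (1, 1), 7: (1, 1)}
matched, unmatched_x, unmatched_y = merge_and_route(set_x, set_y)
print([row["E"] for row in matched])
```

With this ordering every matched row carries a value for E, while the two unmatched rows (keys 6 and 7) keep a missing E because one of B or D is absent.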

How to count number of 1 and 0 in the matrix?

I have an image from which I cut out a single column. I then converted it to logical, so the column contains only 0s and 1s.
Suppose my values in this column are
1111000110000000000000011111111
I want to count the length of each block of ones or each block of zeros.
The result would be
1 - 4 (first 1)
0 - 3 (first 0)
1 - 2
and so on...
I only know how to count over the entire column, not for each distinct block. Can anyone please help me?
Let vec be a row vector (1-by-n) of zeros and ones, then you can use the following code
rl = ( find( vec ~= [vec(2:end), vec(end)+1] ) );
data = vec( rl );
rl(2:end) = rl(2:end) - rl(1:end-1);
rl gives the length of each run of consecutive zeros or ones, while data tells you, for each run, whether it consists of zeros or ones.
This question is closely related to run-length encoding.
Demo:
vec = [1 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1];
rl = ( find( vec ~= [vec(2:end), vec(end)+1] ) );
data = vec( rl ),
rl(2:end) = rl(2:end) - rl(1:end-1),
data =
1 0 1 0 1
rl =
4 3 2 14 8
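The same run-length idea can be sketched in Python with itertools.groupby, which groups consecutive equal values. This is an illustrative equivalent rather than a translation of the MATLAB code; the function name run_lengths is an assumption, and the vector is built from the run lengths reported in the demo above.

```python
from itertools import groupby


def run_lengths(bits):
    """Return a list of (value, length) pairs, one per run of equal values."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(bits)]


# Vector with runs of 4 ones, 3 zeros, 2 ones, 14 zeros and 8 ones.
vec = [1] * 4 + [0] * 3 + [1] * 2 + [0] * 14 + [1] * 8
runs = run_lengths(vec)
for value, length in runs:
    print(f"{value} - {length}")
```

Each pair holds the block value first and the block length second, which is the "1 - 4", "0 - 3" layout asked for in the question.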

Count number of occurrences per unique id

I am new to the command line. I have a long space-delimited text file (samp.txt) with the following columns. Help with awk/sed/perl is appreciated.
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
For each unique id, I want the number of occurrences, the count from the 6th column (6th column = Pass), the number of He (8th column) and the number of Ho (9th column). I would like to get a result like this:
Id CountId Countpass CountHe CountHO
cm|371443199 3 3 2 1
cm|367079424 2 2 0 2
awk 'NR > 1 {ids[$1]++; pass[$1] = "?"; he[$1] += $8; ho[$1] += $9} END {OFS = "\t"; print "Id", "CountId", "Countpass", "CountHe", "CountHO"; for (id in ids) {print id, ids[id], pass[id], he[id], ho[id]}}' inputfile
Broken out onto multiple lines:
awk 'NR > 1 {            # skip the header line
    ids[$1]++;
    pass[$1] = "?";      # I'm not sure what you want here
    he[$1] += $8;
    ho[$1] += $9
}
END {
    OFS = "\t";
    print "Id", "CountId", "Countpass", "CountHe", "CountHO";
    for (id in ids) {    # note: iteration order is unspecified in awk
        print id, ids[id], pass[id], he[id], ho[id]
    }
}' inputfile
You seem to have a typo in your input, where you put ...98 instead of ...99. Assuming this is the case, your other information and expected output make sense.
The code below uses an array to store the ids, which preserves their original order.
use strict;
use warnings;
use feature 'say'; # to enable say()
my $hdr = <DATA>; # remove header
my %hash;
my @keys;
while (<DATA>) {
    my ($id,$pos,$re,$va,$cn,$sf,$sr,$he,$ho,$nc) = split;
    $id =~ s/^c\K/m/;
    $hash{$id}{he} += $he;
    $hash{$id}{ho} += $ho;
    $hash{$id}{pass}{$sf}++;
    $hash{$id}{count}++;
    push @keys, $id if $hash{$id}{count} == 1;
}
say join "\t", qw(Id CountId Countpass CountHe CountHO);
for my $id (@keys) {
    say join "\t", $id,
        $hash{$id}{count},                  # occurrences of id
        scalar keys %{ $hash{$id}{pass} },  # the number of unique passes
        @{$hash{$id}}{qw(he ho)};
}
__DATA__
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
Output:
Id CountId Countpass CountHe CountHO
cm|371443199 3 2 2 1
cm|367079424 2 2 0 2
cm|371443198 1 1 1 0
Note: I made the output tab-delimited for easier post-processing. If you want it pretty instead, use printf to get fixed-width fields.
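For readers who prefer to see the aggregation spelled out, here is a minimal Python sketch of the per-id counts. It rests on assumptions that neither answer confirms: Countpass is taken to mean the number of rows whose 6th column starts with "Pass", the ids are kept exactly as they appear in the input (no cm| rewrite and no typo correction), and the function name summarize is hypothetical.

```python
def summarize(lines):
    """Return {id: [count, pass_count, he_sum, ho_sum]} in first-seen order."""
    summary = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 9:
            continue  # ignore blank or truncated lines
        row = summary.setdefault(fields[0], [0, 0, 0, 0])
        row[0] += 1                                  # occurrences of the id
        row[1] += int(fields[5].startswith("Pass"))  # assumed meaning of Countpass
        row[2] += int(fields[7])                     # He column
        row[3] += int(fields[8])                     # Ho column
    return summary


# Sample input copied from the question.
sample = """Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
""".splitlines()

summary = summarize(sample)
print("\t".join(["Id", "CountId", "Countpass", "CountHe", "CountHO"]))
for key, values in summary.items():
    print("\t".join([key] + [str(v) for v in values]))
```

Because Python dictionaries keep insertion order, the ids come out in the order they first appear, the same property the Perl version gets from its separate array of keys.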