I currently have three data sets in SAS 9.3.
Data set "Main" contains SKU IDs and customer IDs, as well as various other variables such as week.
Customer_ID week var2 var3 SKU_ID
1 1 x x 1
1 2 x x 1
1 3 x x 1
1 1 x x 2
1 2 x x 2
2 1 x x 1
2 2 x x 1
2 3 x x 1
2 1 x x 2
2 2 x x 2
Data set "standard" contains the standard location for each Customer_ID.
Data set "overrides" contains an override location (if applicable) for a certain SKU for certain customers. Thus, it contains SKU_ID, customer_id and location.
standard data set
customer_id location
1 A
1 A
2 C
2 C
override dataset
customer_id sku_id location
1 1 A
1 2 B
When I merge all of the data sets, this is what I get:
Customer_ID week var2 var3 SKU_ID location
1 1 x x 1 A
1 2 x x 1 A
1 3 x x 1 A
1 1 x x 2 B
1 2 x x 2 A
2 1 x x 1 C
2 2 x x 1 C
2 3 x x 1 C
versus what I want it to look like:
Customer_ID week var2 var3 SKU_ID location
1 1 x x 1 A
1 2 x x 1 A
1 3 x x 1 A
1 1 x x 2 B
1 2 x x 2 B
2 1 x x 1 C
2 2 x x 1 C
2 3 x x 1 C
proc sort data=overrides; by Location SKU_ID; run;
Proc sort data= main; by Location SKU_ID;
run;
Proc sort data= Standard; by Location;
run;
data Loc_Standard No_LOC;
Merge Main(in = a) Standard(in = b);
by Location;
if a and b then output Loc_standard;
else if b then output No_LOC;
run;
/*overwrites standard location if an override for a sku exist*/
Data Loc_w_overrides;
Merge Loc_standard overrides;
by Location SKU_ID;
run;
That is how SAS combines datasets. While every dataset still has observations to contribute to a BY group, the values are read in the order the datasets appear in the MERGE statement, so a later dataset overwrites an earlier one. But once one dataset runs out of new observations for the BY group, SAS does not read from it again, so the value read from the other dataset is no longer replaced.
Either drop the original variable and just use the value from the second dataset, which basically sets up a one-to-many merge.
Or rename the override variable and add your own logic for when to apply the override.
I am not sure how you are getting the result you posted, since you do not have any standards for CUSTOMER_ID=2 in your posted data. If the values of location do not depend on customer_id, then why is that variable in the standards and override datasets?
Perhaps you meant that the standards dataset only has SKU_ID and location?
data main_w_standards;
merge main standards;
by sku_id ;
run;
proc sort data=main_w_standards;
by customer_id sku_id;
run;
data main_w_overrides;
merge main_w_standards overrides(in=in2 rename=(location=override));
by customer_id SKU_ID;
if in2 then location=override;
drop override;
run;
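The "apply the override only when one exists" logic can also be sketched outside SAS. The following is a minimal Python illustration, not a replacement for the data step above. It assumes, as the question states, that the standard location is keyed by customer_id and the override by (customer_id, sku_id). The helper name locate and the dictionary layout are made up for the example, and the var2/var3 placeholder columns are left out.

```python
# Rows of "Main" from the question as (customer_id, week, sku_id).
main = [
    (1, 1, 1), (1, 2, 1), (1, 3, 1), (1, 1, 2), (1, 2, 2),
    (2, 1, 1), (2, 2, 1), (2, 3, 1), (2, 1, 2), (2, 2, 2),
]

# Assumption: one standard location per customer.
standard = {1: "A", 2: "C"}

# Assumption: at most one override per (customer_id, sku_id) pair.
overrides = {(1, 1): "A", (1, 2): "B"}


def locate(customer_id, sku_id):
    """Return the override location if one exists, else the customer's standard."""
    return overrides.get((customer_id, sku_id), standard.get(customer_id))


# Every row gets its location from the lookup, however many weeks it spans.
result = [(cust, week, sku, locate(cust, sku)) for cust, week, sku in main]

for row in result:
    print(row)
```

Because the lookup runs for every row, both weeks of customer 1 / SKU 2 get location B, which is the behaviour the plain MERGE fails to give once the override dataset runs out of observations for the BY group.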
Why not UPDATE the STANDARD (loc) with the OVERRIDE (oride) and then merge with the customer data?
data loc;
input customer_id Sku_id location:$1.;
cards;
1 1 A
1 2 A
;;;;
proc print;
data oride;
input customer_id sku_id location:$1.;
cards;
1 1 A
1 2 B
;;;;
run;
proc print;
data locoride;
update loc oride;
by cu: sk:;
run;
I have a csv file with a type and a description text
type ; text
0 ; hello world
0 ; hello text 2
1 ; text1
1 ; text
2 ; world base
2 ; Hey you
2 ; test
In fact, I want to create a dictionary and produce another CSV file structured like this, with a single line for each type and the frequency of each word in the description:
type ; hello ; world ; text ; 2 ; text1 ; base ; hey ; you ; test
0 ; 2 ; 1 ; 1 ; 1 ; 0 ; 0 ; 0 ; 0 ; 0
1 ; 0 ; 0 ; 1 ; 0 ; 1 ; 0 ; 0 ; 0 ; 0
2 ; 0 ; 1 ; 0 ; 0 ; 0 ; 1 ; 1 ; 1 ; 1
I have tons of lines in my CSV file with many strings; this is just an example.
I am just starting to work with Spark and Scala these days. Any help is appreciated.
Thanks
Try:
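import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"col" syntax; assumes a SparkSession named spark

// Split each description into one row per lower-cased word,
// then count words per type and pivot the words into columns.
df.withColumn("text", explode(split(lower(trim($"text")), "\\s+")))
  .groupBy("type")
  .pivot("text")
  .count()
  .na.fill(0)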
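To make it clearer what the explode / groupBy / pivot chain computes, here is a small plain-Python sketch of the same counting on the sample rows from the question. It only illustrates the logic and is not Spark code. Lower-casing the words and ordering the columns by first appearance are assumptions, chosen to match the expected output in the question.

```python
from collections import Counter, defaultdict

# Sample rows from the question as (type, text).
rows = [
    (0, "hello world"), (0, "hello text 2"),
    (1, "text1"), (1, "text"),
    (2, "world base"), (2, "Hey you"), (2, "test"),
]

# "explode" + "groupBy(type)" + "count": one word counter per type.
counts = defaultdict(Counter)
for type_, text in rows:
    counts[type_].update(text.lower().split())

# "pivot": every distinct word becomes a column (first-seen order here).
vocab = []
for _, text in rows:
    for word in text.lower().split():
        if word not in vocab:
            vocab.append(word)

print(" ; ".join(["type"] + vocab))
for type_ in sorted(counts):
    # A Counter returns 0 for missing words, which plays the role of na.fill(0).
    cells = [type_] + [counts[type_][w] for w in vocab]
    print(" ; ".join(str(c) for c in cells))
```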
I want to run an experiment in which I build a list of many randomly generated sequences, each of which contains every digit from 0 to 9 inclusive. That is, the generating function keeps producing random digits and appending them to a list of integers while at least one digit has not yet appeared in the list.
The intention of the experiment is to try to make some generalizations about things like the expected number of digits in such a sequence, how long a sequence can get (can my program loop indefinitely and never find that last digit?), and other things I find interesting.
I am using Perl for the experiment.
The idea seemed simple at first, I sat down, created a list, and figured I can just make a loop that runs an arbitrary amount of times (I decided to choose 100 times), which calls a function generate_sequence(input: none, output: list of numbers that contains at least 1 of every digit) and adds it to the list.
I quickly realized that I struggle to cleanly specify what it means, programmatically, to generate a list of numbers that contains one of every digit.
My original attempt was to make a list of digits(0..9), and as I generate numbers, I would search the list for that digit if it is in the list, and remove it. This way, it would generate numbers until the list of digits "still needed" is empty. This approach seems unappealing and can involve a lot of redundant tasks such as checking whether the digit generated is in the list of digits needed every single time a number is generated...
Is there a more elegant solution to such a problem? I am really unhappy with the way I am approaching the function.
In general, I need a function F that accepts nothing and returns a list of randomly generated numbers that contains every digit 0..9; that is, it stops as soon as every digit from 0 to 9 inclusive has been generated.
Thanks ahead of time.
Well, the problem is that if you 'roll randomly' you don't actually know how many iterations you're going to need; in theory there is no upper bound.
If you're doing it in Perl, you're probably much better off using the List::Util module and shuffle: feed it a list of the elements you want to shuffle.
E.g.
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw( shuffle );
my @shuffled = shuffle( 0..9 );
print @shuffled;
You could reproduce this quite easily, but why bother when List::Util has been core since 5.7.3?
However it does sound like you're trying to generate a list, that might contain repeats, until you hit a terminate condition.
I'm not entirely sure why, but that would be best done using a hash and counting occurrences (and terminating when your 'keys' is complete).
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my @list_of_numbers;
while ( keys %seen < 10 ) {
    my $gen = int rand( 10 );
    $seen{$gen}++;
    push( @list_of_numbers, $gen );
}
print @list_of_numbers;
Note - there's actually an extremely small chance of this rolling extremely long sequences, because of the nature of 'random' - it means in theory you might have a very long 'streak' of not rolling a 6.
For bonus points in %seen you have a frequency spread of your generated numbers.
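On the "how long can it get" part of the question: this is the classic coupon collector setting, so the expected length and the chance of a long run can be computed directly instead of estimated by simulation. The sketch below is in Python rather than Perl and assumes uniformly random, independent digits. The function names are made up for the illustration.

```python
from fractions import Fraction
from math import comb

DIGITS = 10


def expected_length(n=DIGITS):
    """Expected number of draws until all n digits have appeared: n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))


def prob_longer_than(draws, n=DIGITS):
    """Probability that at least one digit is still missing after `draws` draws.

    Inclusion-exclusion over the set of digits that never appear.
    """
    return sum(
        (-1) ** (k + 1) * comb(n, k) * Fraction(n - k, n) ** draws
        for k in range(1, n + 1)
    )


print(float(expected_length()))       # average sequence length
print(float(prob_longer_than(100)))   # chance that a sequence exceeds 100 digits
```

The expected length comes out at roughly 29.3 digits. The tail probability shrinks geometrically with the number of draws but never reaches exactly zero, which is the "very long streak" caveat above in numbers.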
A python implementation:
from random import randint

s = set(range(10))

def f():
    result = []
    t = set()
    while 1:
        n = randint(0, 9)
        result.append(n)
        t.add(n)
        if t == s:
            return result
For example:
for i in range(10):
    print(len(f()))
20
34
69
22
23
25
20
29
30
32
This should work (python):
import random
nums = []
# keep drawing until every digit 0..9 has appeared at least once
while any(i not in nums for i in range(10)):
    nums.append(random.randrange(10))
or more specific to what you are trying to do:
import random
lengths = []
for _ in range(1000):
    nums = []
    # keep drawing until every digit 0..9 has appeared at least once
    while any(i not in nums for i in range(10)):
        nums.append(random.randrange(10))
    lengths.append(len(nums))
This approach counts the iterations needed to fill a dictionary of digits:
import random
c = 0
d = dict()
while len(d) < 10:
    d[random.randint(0, 9)] = 1
    c += 1
print(c)
Wrote this before you switched to just Perl...
from random import randrange

def F():
    todo = set(range(10))
    nums = []
    while todo:
        r = randrange(10)
        nums.append(r)
        todo.discard(r)
    return nums
>>> F()
[8, 2, 2, 3, 1, 0, 3, 9, 3, 4, 7, 4, 7, 5, 0, 9, 5, 5, 6]
Another:
def F():
    done = 0  # bitmask: bit r is set once digit r has been seen
    nums = []
    while done < 1023:  # 1023 == 0b1111111111, i.e. all ten bits set
        r = randrange(10)
        nums.append(r)
        done |= 1 << r
    return nums
In Clojure, I am keeping track of both the random list and the existing values, thus avoiding a search on the growing list.
(defn random-list [up-to]
  (loop [n [] tries []]
    (if (> (count n) (dec up-to))
      tries
      (let [i (rand-int up-to) n-tries (conj tries i)]
        (if (some #{i} n)
          (recur n n-tries)
          (recur (conj n i) n-tries))))))
We can define similar functions:
(defn random-list-to-10 []
(random-list 10))
(random-list-to-10)
; [3 6 9 0 8 0 5 7 3 8 1 8 4 3 4 2]
We can also take only a few random elements:
(take 5 (random-list 10))
; (6 1 0 9 5)
Here is a possible Perl implementation that counts the iterations needed to fill a hash with the 10 digits:
#!/usr/bin/perl
my $count = 0;
my %dict = ();
while (scalar keys %dict < 10) {
$dict{int(rand(10))} = 1;
$count ++;
}
print $count;
(see online demo)
In clojure (though probably not the most elegant):
(loop [n []
       s (set (range 0 10))]
  (if (= s (set n))
    n
    (recur (conj n (rand-int 10)) s)))
Sample output:
user=> (loop [n [] s (set (range 0 10))] (if (= s (set n)) n (recur (conj n (rand-int 10)) s)))
[0 6 2 8 5 2 0 0 9 3 0 3 0 1 7 5 0 4]
user=> (loop [n [] s (set (range 0 10))] (if (= s (set n)) n (recur (conj n (rand-int 10)) s)))
[2 1 7 7 3 2 8 8 4 7 5 0 1 3 0 3 0 4 0 0 3 7 3 4 5 8 1 3 8 5 3 5 5 9 4 0 2 1 2 7 8 3 9 7 8 6]
user=> (loop [n [] s (set (range 0 10))] (if (= s (set n)) n (recur (conj n (rand-int 10)) s)))
[7 1 8 3 1 1 0 6 8 4 9 7 0 0 2 7 4 0 1 1 8 8 4 3 9 8 4 2 8 3 2 8 4 6 0 9 9 7 2 3 0 3 0 4 2 4 0 5]
user=> (loop [n [] s (set (range 0 10))] (if (= s (set n)) n (recur (conj n (rand-int 10)) s)))
[9 1 9 0 9 5 3 0 3 8 4 0 1 6 3 0 1 8 0 3 8 3 5 4 3 9 8 8 8 8 2 2 8 9 9 3 9 2 5 1 1 3 4 6 3 1 4 0 2 6 7]
user=> (loop [n [] s (set (range 0 10))] (if (= s (set n)) n (recur (conj n (rand-int 10)) s)))
[4 1 5 5 5 5 2 2 5 5 3 1 5 3 5 1 4 2 4 2 3 1 4 7 1 9 3 8 0 8 4 0 9 3 4 9 9 1 8 8 0 6]
user=> (loop [n [] s (set (range 0 10))] (if (= s (set n)) n (recur (conj n (rand-int 10)) s)))
[0 4 0 9 1 8 4 8 6 6 6 9 8 4 9 0 9 3 3 7 6 1 4 3 8 1 1 4 9 5 1 4 1 2]
user=>
I am merging two datasets in SAS, "set_x" and "set_y", and would like to create a variable "E" in the resulting merged dataset "matched":
* Create set_x *;
data set_x ;
input merge_key A B ;
datalines;
1 24 25
2 25 25
3 30 32
4 32 32
5 20 32
6 1 1
;
run;
* Create set_y *;
data set_y ;
input merge_key C D ;
datalines;
1 1 1
2 2 1
3 1 1
4 2 1
5 1 1
7 1 1
;
run;
* Merge and create E *;
data matched unmatched_x unmatched_y ;
merge set_x (in=x) set_y (in=y) ;
by merge_key ;
if x=y then output matched;
else if x then output unmatched_x ;
else if y then output unmatched_y ;
IF C = 2 THEN DO ;
E = A ;
END;
ELSE DO ;
E = floor(B - D) ;
END ;
run ;
However, in the resulting table "matched", the values of E are missing. E is correctly calculated if I only output matched values (i.e. using if x=y;).
Is it possible to create "E" in the same data step if outputting unmatched as well as matched observations?
You output the observation before E is computed, so the value written to "matched" is still missing; E is then reset to missing when the next iteration starts. Compute E before the OUTPUT statements:
data matched unmatched_x unmatched_y ;
merge set_x (in=x) set_y (in=y) ;
by merge_key ;
IF C = 2 THEN DO ;
E = A ;
END;
ELSE DO ;
E = floor(B - D) ;
END ;
if x=y then output matched;
else if x then output unmatched_x ;
else if y then output unmatched_y ;
run ;
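The ordering issue is not specific to SAS. As a rough illustration only, here is the same "compute first, then route the row" flow in Python, using the data from the question. The dictionaries, the use of None for SAS missing values, and the variable names are all assumptions made for the sketch.

```python
import math

# merge_key -> (A, B) and merge_key -> (C, D), copied from the question.
set_x = {1: (24, 25), 2: (25, 25), 3: (30, 32), 4: (32, 32), 5: (20, 32), 6: (1, 1)}
set_y = {1: (1, 1), 2: (2, 1), 3: (1, 1), 4: (2, 1), 5: (1, 1), 7: (1, 1)}

matched, unmatched_x, unmatched_y = [], [], []

for key in sorted(set_x.keys() | set_y.keys()):
    in_x, in_y = key in set_x, key in set_y
    a, b = set_x.get(key, (None, None))  # None stands in for a SAS missing value
    c, d = set_y.get(key, (None, None))

    # Step 1: compute E while the row is still "in hand".
    if c == 2:
        e = a
    elif b is not None and d is not None:
        e = math.floor(b - d)
    else:
        e = None  # arithmetic on a missing value stays missing

    row = {"merge_key": key, "A": a, "B": b, "C": c, "D": d, "E": e}

    # Step 2: only now write the row to its destination.
    if in_x and in_y:
        matched.append(row)
    elif in_x:
        unmatched_x.append(row)
    else:
        unmatched_y.append(row)

print([r["E"] for r in matched])
```

Appending the row before step 1 would store it without a usable E, which is what happens in the original data step.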
Consider a set,
S = {1,2,3,4,5,6,7}
I am trying to come up with a function which takes S as the input and gives me ALL possible arrays:
[ 1 ~ 2 ; 1 ~ 3 ; 1 ~ 4 ; 1 ~ 5 ; . . . ; 6 ~ 7 ]
[ 1 2 ~ 3 ; 1 2 ~ 4 ; ...; 2 3 ~ 1 ; 2 3 ~ 4....; 5 6 ~ 7]
.
.
.
[ 2 3 4 5 6 7 ~ 1 ; 1 3 4 5 6 7 ~ 2 ; ... ; 1 2 3 4 5 6 ~ 7 ]
Note that '~' acts as a delimiter placed between the elements of a k-combination, such that the set appearing before the delimiter is unique within each array.
For example, we want both 7-combinations
[ 2 3 4 5 6 7 ~ 1 ] and [ 1 2 3 4 5 6 ~ 7 ].
But we want only one of
[ 1 2 3 4 5 6 ~ 7 ] and [ 1 3 4 5 6 2 ~ 7 ].
My Code :
clear all
for k = 1:7
Set = nchoosek(1:7,k);
for i = 1:length(Set)
A = setdiff(1:7,Set(i,:));
P = nchoosek( A , 2 ); % trialing it for only A~B where B has only 2elements
L = length( P );
S = repmat( Set ( i,: ) , L,1);
for j = 1:L
S1(j,:) = setdiff( S(j,:) , P(j,:) );
W(j,:) = [ S1(j,:) , 0 , P(j,:) ];
end
W1(i,k) = {W};
end
end
This, however, produces an error at k=2.
Any ideas on how to make this work, and efficiently?
I think I can outline how to achieve this.
To get the subset (for A), use setdiff:
s = 1:7
b = 4
tmp = setdiff(s,b)
For the permutation, use randperm:
t2 = randperm(length(tmp))
A = tmp(t2)
For the specific subsets, just pick the first n entries of A.
Put the whole thing in some loops to create the set you describe.
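For comparison, a deterministic enumeration (rather than random permutations) could look like the sketch below. It is written in Python instead of MATLAB. It assumes the part after '~' is always a single element, and that the order of the elements before '~' does not matter, which is how I read the examples in the question. The function name is hypothetical.

```python
from itertools import combinations


def split_arrangements(items, left_size):
    """Yield (left, right) pairs: `left` is a sorted subset of the given size,
    `right` is a single element of `items` that is not in `left`."""
    for left in combinations(items, left_size):
        for right in items:
            if right not in left:
                yield left, right


S = range(1, 8)

# One list per left-hand size 1..6, mirroring the rows in the question.
all_arrays = {m: list(split_arrangements(S, m)) for m in range(1, 7)}

print(all_arrays[1][:3])    # first few "a ~ b" entries
print(len(all_arrays[6]))   # number of "six elements ~ one element" entries
```

Since combinations already returns each subset once, in sorted order, duplicates such as [1 2 3 4 5 6 ~ 7] versus [1 3 4 5 6 2 ~ 7] cannot occur. In MATLAB the same structure would be a loop over the rows of nchoosek(1:7, m) with setdiff for the remaining elements.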