I have a dataset like this:
ID PersonID ClassID Attended Converted
1 1 1 1 0
2 1 1 1 1
3 1 1 1 1
4 2 1 1 1
5 3 2 0 0
6 3 2 1 1
7 4 2 1 0
I'm building a report that groups by ClassID (actually I'm using a parameter that allows grouping on a few different columns, but for simplicity I'm just using ClassID here). I need to do a calculation in each group footer. To do the calculation, I need to count records with PersonIDs unique to that group. The catch is that, in one case, these records also need to match a criterion. E.g.:
X = [Count of records where Converted = 1 with distinct PersonID]
Y = [Count of records where Attended = 1]
Then I need to display the quotient as a percentage:
(X/Y)*100
So the final report would look something like this:
ID PersonID Attended Converted
CLASS 1 GROUP
1 1 1 0
2 1 1 1
3 1 1 1
4 2 1 1
Percent= 2/4 = 50%
CLASS 2 GROUP
5 3 0 0
6 3 1 1
7 4 1 0
Percent= 1/2 = 50%
Notice in Class 1 Group, there are 3 records with Converted = 1 but 'X' (the numerator) is equal to 2 because of the duplicate PersonID. How can I calculate this in Crystal Reports?
I had to create a few different formulas to make this work with the help of this site.
First I created a function called fNull as suggested by that site, that is just blank. I was wondering if just typing null in its place would do the job but didn't get to testing it. Next I created formulas to evaluate if a row was attended and if a row was converted.
fTrialsAttended:
//Allows row to be counted if AttendedTrial is true
if {ConversionData.AttendedTrial} = true
    then CStr({ConversionData.PersonID})
    else {@fNull}
fTrialsConverted:
//Allows row to be counted if Converted is true
if {ConversionData.Converted} = true
    then CStr({ConversionData.PersonID})
    else {@fNull}
Note that I'm returning the PersonID if attended or converted is true. This lets me do the distinct count in the next formula (X from the original question):
fX:
DistinctCount({@fTrialsConverted}, {ConversionData.ClassID})
This is placed in the group footer. Again, remember that @fTrialsConverted returns the PersonID of converted trials (or fNull, which won't be counted). One thing I don't understand is why I had to explicitly include the group-by field (ClassID) when the formula is already in the group footer, but without it the count ran across all groups. Next, Y was just a straight count.
fY:
//Counts the number of trials attended in the group
Count({@fTrialsAttended}, {ConversionData.ClassID})
And finally a formula to calculate the percentage:
if {@fY} = 0 then 0
else ({@fX}/{@fY})*100
The last thing I'll share is I wanted to also calculate the total across all groups in the report footer. Counting total Y was easy, it's the same as the fY formula except you leave out the group by parameter. Counting total X was trickier because I need the sum of the X from each group and Crystal can't sum another sum. So I updated my X formula to also keep a running total in a global variable:
fX:
//Counts the number of converted trials in the group, distinct to a personID
whileprintingrecords;
Local NumberVar numConverted := DistinctCount({@fTrialsConverted}, {@fGroupBy});
global NumberVar rtConverted := rtConverted + numConverted; //Add to global running total
numConverted; //Return this value
Now I can use rtConverted in the footer for the calculation. This led to one other bewildering thing that took me a couple of hours to figure out: rtConverted was not being treated as a global variable until I explicitly added the global keyword, despite all the documentation I've seen saying global is the default. Once I figured that out, it all worked great.
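For readers who want to check the expected numbers outside Crystal Reports, here is a minimal Python sketch of the same calculation. It is not Crystal syntax; the function name and the tuple layout are assumptions made for illustration, and the rows are the sample data from the question.

```python
from collections import defaultdict

# Sample rows from the question: (ID, PersonID, ClassID, Attended, Converted)
rows = [
    (1, 1, 1, 1, 0),
    (2, 1, 1, 1, 1),
    (3, 1, 1, 1, 1),
    (4, 2, 1, 1, 1),
    (5, 3, 2, 0, 0),
    (6, 3, 2, 1, 1),
    (7, 4, 2, 1, 0),
]

def conversion_percent_by_class(rows):
    """Return {ClassID: (X, Y, percent)} as defined in the question."""
    class_ids = set()
    converted_people = defaultdict(set)  # ClassID -> distinct PersonIDs with Converted = 1
    attended_count = defaultdict(int)    # ClassID -> number of rows with Attended = 1
    for _id, person_id, class_id, attended, converted in rows:
        class_ids.add(class_id)
        if converted == 1:
            converted_people[class_id].add(person_id)
        if attended == 1:
            attended_count[class_id] += 1
    result = {}
    for class_id in sorted(class_ids):
        x = len(converted_people[class_id])
        y = attended_count[class_id]
        # Same divide-by-zero guard as the percentage formula
        result[class_id] = (x, y, 0.0 if y == 0 else x / y * 100)
    return result

percent_by_class = conversion_percent_by_class(rows)
print(percent_by_class)
```

Using a set per class plays the role of DistinctCount over a formula that returns the PersonID only for converted rows.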
I have the following DataFrame df that represents a graph with nodes A, B, C and D. Each node belongs to a group 1 or 2:
src dst group_src group_dst
A B 1 1
A B 1 1
B A 1 1
A C 1 2
C D 2 2
D C 2 2
I need to calculate the distinct number of nodes and the number of edges per group. The result should be the following:
group nodes_count edges_count
1 2 3
2 2 2
The edge A->C is not considered because the nodes belong to different groups.
I do not know how to stack the columns group_src and group_dst in order to group by unique column group. Also I do not know how to calculate the number of edges inside the group.
df
.groupBy("group_src","group_dst")
.agg(countDistinct("srcId","dstId").as("nodes_count"))
I think it may be necessary to use two steps:
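// Assumes spark.implicits._ is in scope for the $"..." column syntax
import org.apache.spark.sql.functions.{count, countDistinct}

// Edges whose endpoints are in the same group, counted per group
val edges = df.filter($"group_src" === $"group_dst")
  .groupBy($"group_src".as("group"))
  .agg(count("*").as("edges_count"))
// Stack (src, group_src) on top of (dst, group_dst), then count distinct nodes per group
val nodes = df.select($"src".as("id"), $"group_src".as("group"))
  .union(df.select($"dst".as("id"), $"group_dst".as("group")))
  .groupBy("group").agg(countDistinct($"id").as("nodes_count"))
nodes.join(edges, "group")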
You can accomplish "stacking" of columns by using .union() after selecting specific columns.
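To make the two-step idea concrete without a Spark session, here is a small plain-Python sketch that performs the same stacking and counting on the sample edges from the question. The function name is hypothetical and this is only an illustration of the logic, not Spark code.

```python
from collections import defaultdict

# Edge list from the question: (src, dst, group_src, group_dst)
edges = [
    ("A", "B", 1, 1),
    ("A", "B", 1, 1),
    ("B", "A", 1, 1),
    ("A", "C", 1, 2),
    ("C", "D", 2, 2),
    ("D", "C", 2, 2),
]

def group_stats(edges):
    """Return {group: (nodes_count, edges_count)}."""
    nodes = defaultdict(set)       # group -> distinct node ids ("stacked" src and dst)
    edge_count = defaultdict(int)  # group -> edges with both endpoints in that group
    for src, dst, g_src, g_dst in edges:
        nodes[g_src].add(src)
        nodes[g_dst].add(dst)
        if g_src == g_dst:
            edge_count[g_src] += 1
    return {g: (len(nodes[g]), edge_count[g]) for g in sorted(nodes)}

stats = group_stats(edges)
print(stats)
```

One difference worth noting: this sketch reports 0 edges for a group that has nodes but no internal edges, whereas an inner join of nodes and edges would drop such a group.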
Is it possible in OpenRefine to fill down blank cells with a counter instead of copying the top non-blank value?
Here is the example as typed text (imagine this as a column from top to bottom):
1
1
blank
1
blank
blank
blank
blank
blank
1
I would like to see the column filled as follows (again, imagine top to bottom):
1
1
2
1
2
3
4
5
6
1
Thanks, help is very much appreciated.
It's not really simple. You have to:
1 Replace the blanks with something else, such as an "x"
2 Create a unique record for the entire dataset
3 Use this Jython script:
import itertools
data = row['record']['cells']['YOUR COLUMN NAME']['value']
x = itertools.count(2)
liste = []
for i, el in enumerate(data):
    if data[i] == "x":
        # Placeholder cell: emit the next counter value
        liste.append(x.next())
    else:
        # Real value: restart the counter and keep the value
        x = itertools.count(2)
        liste.append(el)
return ",".join([str(x) for x in liste])
4 Use Blank down to clear duplicates
5 Split the first multivalued cell.
Here is a screencast of the operations described above.
If you know a little Python, you can also transform your file using pandas. I do not know what is the most elegant way to do it, but this script should work.
import itertools
import pandas as pd

x = itertools.count(2)

def set_x():
    # Restart the counter at 2
    global x
    x = itertools.count(2)

set_x()

def increase(value):
    # Blank cell: return the next counter value
    if not value:
        return next(x)
    # Non-blank cell: restart the counter and keep the value
    else:
        set_x()
        return value

data = pd.read_csv("your_file.csv", na_values=['nan'], keep_default_na=False)
data['column 1'] = data['column 1'].apply(lambda row: increase(row))
print(data)
data.to_csv("final_file.csv")
Here are two simple solutions using GREL.
Use records
You could move the column to the beginning, telling OpenRefine to use the numbers as records. You might need to transform the column to text to really convince OpenRefine to use it as records.
Then either add a new column or transform the existing one with the following expression.
1 + row.index - row.record.fromRowIndex
Use record markers
In case you don't want to use records or don't have a static number, you can create a similar setup. Imagine you have an incomplete counter like in the following table and want to fill it.
Origin   Desired
1        1
(blank)  2
1        1
2        2
(blank)  3
1        1
To fill the missing cells, first add a new column based on your original column using the following expression and name it record_row_index.
if(isNonBlank(value), row.index, "")
After that fill down the original column and the new column record_row_index.
Then create a new column based on the original filled column using the following expression.
value + row.index - cells["record_row_index"].value
Hint: the expression is expecting both columns to be of type number.
If one of them is of type text, you can either transform the column beforehand or use toNumber() in the expression.
The following table shows how these operations are working together.
Origin   Origin filled  row.index  record_row_index  Desired
1        1              0          0                 1 + 0 - 0 = 1
(blank)  1              1          0                 1 + 1 - 0 = 2
1        1              2          2                 1 + 2 - 2 = 1
2        2              3          3                 2 + 3 - 3 = 2
(blank)  2              4          3                 2 + 4 - 3 = 3
1        1              5          5                 1 + 5 - 5 = 1
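The arithmetic in the table can be checked with a short Python sketch. This is not GREL; it only mirrors the three steps (fill down, remember the row index of the last non-blank cell, then compute value + row.index - record_row_index). The function name and the use of None for blank cells are assumptions made for illustration.

```python
def fill_counter(column):
    """Replace blanks (None) with a counter that continues from the last non-blank value."""
    filled = []
    last_value = None    # plays the role of the filled-down original column
    marker_index = None  # plays the role of record_row_index
    for index, value in enumerate(column):
        if value is not None:
            last_value, marker_index = value, index
        if last_value is None:
            filled.append(None)  # leading blanks have nothing to fill from
        else:
            filled.append(last_value + index - marker_index)
    return filled

origin = [1, None, 1, 2, None, 1]
print(fill_counter(origin))
```

Applied to the column from the question (1, 1, blank, 1, five blanks, 1), the same function yields the requested 1, 1, 2, 1, 2, 3, 4, 5, 6, 1.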
I have two tables
table 1 (orders) columns: (date,symbol,qty)
table 2 (marketData) columns: (date,symbol,close price)
I want to add the close for T+0 to T+5 to table 1.
{[nday]
value "temp0::update date",string[nday],":mdDates[DateInd+",string[nday],"] from orders";
value "temp::temp0 lj 2! select date",string[nday],":date,sym,close",string[nday],":close from marketData";
table1::temp
} each (1+til 5)
I'm sure there is a better way to do this, but I get a 'loop error when I try to run this function. Any suggestions?
See here for common errors. Your loop error is because you're setting views with value, not globals. Inside a function value evaluates as if it's outside the function so you don't need the ::.
That said there's lots of room for improvement, here's a few pointers.
You don't need value at all in your case. E.g. the first line can be reduced to the following (I'm assuming mdDates is some kind of function you're just dropping in to work out the date from an integer, and DateInd some kind of global):
{[nday]
temp0:update date:mdDates[nday;DateInd] from orders;
....
} each (1+til 5)
In this bit it just looks like you're trying to append something to the column name:
select date",string[nday],":date
Remember that tables are flipped dictionaries... you can mess with their column names via the keys, as illustrated (very noddily) below:
q)t:flip `a`b!(1 2; 3 4)
q)t
a b
---
1 3
2 4
q)flip ((`$"a","1"),`b)!(t`a;t`b)
a1 b
----
1 3
2 4
You can also use functional select, which is much neater IMO:
q)?[t;();0b;((`$"a","1"),`b)!(`a`b)]
a1 b
----
1 3
2 4
Seems like you wanted to have p0 to p5 columns with prices corresponding to date+0 to date+5 dates.
Using the over adverb to iterate over the day offsets (0 to 4 in this example; extend the list to cover T+5):
q)orders:([] date:(2018.01.01+til 5); sym:5?`A`G; qty:5?10)
q)data:([] date:20#(2018.01.01+til 10); sym:raze 10#'`A`G; price:20?10+10.)
q)delete d from {c:`$"p",string[y]; (update d:date+y from x) lj 2!(`d`sym,c )xcol 0!data}/[ orders;0 1 2 3 4]
date sym qty p0 p1 p2 p3 p4
---------------------------------------------------------------
2018.01.01 A 0 10.08094 6.027448 6.045174 18.11676 1.919615
2018.01.02 G 3 13.1917 8.515314 19.018 19.18736 6.64622
2018.01.03 A 2 6.045174 18.11676 1.919615 14.27323 2.255483
2018.01.04 A 7 18.11676 1.919615 14.27323 2.255483 2.352626
2018.01.05 G 0 19.18736 6.64622 11.16619 2.437314 4.698096
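For comparison, the same shift-and-join idea can be sketched in plain Python with a dictionary keyed on (date, sym). The data below is hypothetical, the function name is an assumption, and, like the q snippet above, it adds calendar days rather than trading days.

```python
from datetime import date, timedelta

def add_forward_closes(orders, market_data, offsets=range(5)):
    """Attach columns p0..pN holding the close on date + N for each order."""
    # Keyed lookup, analogous to the keyed table used with lj
    close_by_key = {(row["date"], row["sym"]): row["close"] for row in market_data}
    result = []
    for order in orders:
        enriched = dict(order)
        for n in offsets:
            key = (order["date"] + timedelta(days=n), order["sym"])
            enriched[f"p{n}"] = close_by_key.get(key)  # None when no price exists
        result.append(enriched)
    return result

# Hypothetical sample data, for illustration only
start = date(2018, 1, 1)
market_data = [
    {"date": start + timedelta(days=i), "sym": sym, "close": base + i}
    for sym, base in (("A", 10.0), ("G", 20.0))
    for i in range(10)
]
orders = [
    {"date": start, "sym": "A", "qty": 3},
    {"date": start + timedelta(days=2), "sym": "G", "qty": 7},
]

enriched_orders = add_forward_closes(orders, market_data)
for row in enriched_orders:
    print(row)
```

If T+n is meant in trading days, the date shift would need to go through a trading calendar (which is presumably what mdDates does in the question) instead of timedelta.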
I'm working on a database that basically looks like this (in its simplest form):
{Phase} {Code} {Qty}
Example:
{Qty} for {Phase}="R" and {Code}="Nat" = 0
{Qty} for {Phase}="F" and {Code}="Nat" = 5
{Qty} for {Phase}="R" and {Code}="Int" = 10
{Qty} for {Phase}="F" and {Code}="Int" = 15
I am trying to get a result to show me the Qty for phase "R" and Code "Nat" (where R is <> 0) otherwise give me the qty for phase "F". So for the above example I would get an answer = 5 for Nat (because qty for phase R is 0) and an answer of 10 where the code is Int (because qty for phase R <> 0)
I have used three formula fields to do this:
1: if ({PHASE}="F" and {CODE}="NAT") then {QTY} else 0
2: if ({PHASE}="R" and {Code}="NAT") then {QTY} else 0
3: if {2} = 0 then {1} else {2}
Formula fields 1 & 2 come up with the correct amounts. However formula field {3} returns both phases. For example Code "Int" Phase "R" shows as qty = 25 instead of qty = 10.
How do I get around this?
You need to group by {table.code} because this is not a one-row calculation, but needs to be a calculation over each code for 2+ phases (meaning 2+ rows of data).
Create a formula with two variables that will store the values of each phase, F and R. This formula needs to go in the Details section of the report.
whileprintingrecords;
numbervar Fqty;
numbervar Rqty;
if {table.phase}="F" then Fqty:={table.qty}
else Rqty:={table.qty};
Now, in the group footer, you can reference both quantity values via the variables.
whileprintingrecords;
numbervar Fqty;
numbervar Rqty;
//Use the phase R quantity unless it is 0; otherwise fall back to phase F
if Rqty=0 then Fqty else Rqty
And you're done. Don't forget to reset the two variables in the group header so you don't carry the quantity over between different codes.
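As a quick check of the intended result against the example in the question, here is a minimal Python sketch of the per-code selection. It is not Crystal syntax, and the function name is hypothetical.

```python
# Rows from the question's example: (Phase, Code, Qty)
phase_rows = [
    ("R", "Nat", 0),
    ("F", "Nat", 5),
    ("R", "Int", 10),
    ("F", "Int", 15),
]

def preferred_qty_by_code(rows):
    """For each code, return the phase R quantity unless it is 0, else the phase F quantity."""
    by_code = {}
    for phase, code, qty in rows:
        # Starting each code at 0 mirrors resetting the variables in the group header
        by_code.setdefault(code, {"R": 0, "F": 0})[phase] = qty
    return {code: q["R"] if q["R"] != 0 else q["F"] for code, q in by_code.items()}

preferred = preferred_qty_by_code(phase_rows)
print(preferred)
```

The per-code dictionary corresponds to the two variables kept per group, and the final expression corresponds to the group footer formula.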
So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way by [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I fix it so that I could get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u=unique(x(:,1))
res=arrayfun(@(y)length(x(x(:,1)==y & x(:,2)==1)),u)
Taking apart that last line:
arrayfun(fun,array) applies fun to each element in the array, and puts it in a new array, which it returns.
The function applied here is the anonymous function @(y)length(x(x(:,1)==y & x(:,2)==1)), which returns the number of elements of x selected by the condition x(:,1)==y & x(:,2)==1 (this is called logical indexing). So for each unique value, it counts the rows of x whose first column equals that value and whose second column is 1.
Try this (as specified in this answer):
>>> [c,~,d] = unique(a(a(:,2)==1))
c =
30
40
d =
1
2
2
>>> counts = accumarray(d(:),1,[],@sum)
counts =
1
2
>>> res = [c,counts]
Suppose you have an array of various integers in 'array'.
The tabulate function will sort the unique values and count the occurrences.
table = tabulate(array)
Look for the count of each unique value in column 2 of table.
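For readers without MATLAB, the counting itself can be sketched in Python with collections.Counter. This is only an illustration of the logic on the matrix from the question; the function name is an assumption. Unlike the unique/accumarray approach above, it also keeps keys whose count is zero.

```python
from collections import Counter

# Matrix from the question, one (column 1, column 2) pair per row
x = [(20, 2), (20, 2), (30, 2), (30, 1), (40, 1), (40, 1)]

def count_ones_per_key(matrix):
    """Count rows whose second column is 1, for each unique first-column value."""
    counts = Counter({key: 0 for key, _ in matrix})  # every unique key starts at 0
    counts.update(key for key, flag in matrix if flag == 1)
    return sorted(counts.items())

ones_per_key = count_ones_per_key(x)
print(ones_per_key)
```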