I want to create a formula for a table that looks like the one below, but I don't know what formulas to use to scan these records.
ID  UDOfficer  DATE
1   6          Jan
1   7          Jan
1   9          Jan
2   6          June
3   6          April
4   6          May
5   5          Dec
6   7          Nov
7   6          April
7   4          April
What I want to create:
A formula to put in a crosstab's columns that captures UDOfficer = 6 in one column and all others in a second, with one catch: if an ID has records for UDOfficer 6, 7, and 9, the "all others" column must not count an ID that was already counted under UDOfficer 6.
Expected output crosstab:
DATE  UDOFF6  UDOFFOTHER
JAN   1       0
APR   2       0
MAY   1       0
JUN   1       0
NOV   0       1
DEC   0       1
You can use grouping together with if-else formulas to meet your requirement.
First, create a group on DATE.
Then create two formulas and place them in the Details section to count the occurrences.
Create a formula UDOFF6 and write the code below:
if UDOfficer =6
then 1
else 0
Create another formula, UDOFFOTHER, and write the code below:
if UDOfficer <> 6
then 1
else 0
Take the sum of both formulas in the DATE group footer and your problem will be solved.
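For anyone who wants to sanity-check the counting rule outside Crystal Reports, here is a minimal pandas sketch of the same logic (pandas is used purely for illustration; the column names follow the sample data above):

import pandas as pd

df = pd.DataFrame({
    "ID":        [1, 1, 1, 2, 3, 4, 5, 6, 7, 7],
    "UDOfficer": [6, 7, 9, 6, 6, 6, 5, 7, 6, 4],
    "DATE": ["Jan", "Jan", "Jan", "June", "April", "May",
             "Dec", "Nov", "April", "April"],
})

# An ID counts under UDOFF6 if it has at least one UDOfficer = 6 record;
# otherwise it counts once under UDOFFOTHER.
has6 = df.groupby("ID")["UDOfficer"].transform(lambda s: (s == 6).any())

udoff6 = df[has6 & (df["UDOfficer"] == 6)]
other = df[~has6]

out = pd.DataFrame({
    "UDOFF6":     udoff6.groupby("DATE")["ID"].nunique(),
    "UDOFFOTHER": other.groupby("DATE")["ID"].nunique(),
}).fillna(0).astype(int)
print(out)

Counting distinct IDs per DATE (nunique) is what keeps an ID from being counted again under "all others" once it has been counted for UDOfficer 6, matching the expected crosstab above.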
I have a date-level promotion data frame that looks something like this:
ID  Date    Promotions  Converted to customer
1   2-Jan   2           0
1   10-Jan  3           1
1   14-Jan  3           0
2   10-Jan  19          1
2   10-Jan  8           0
2   10-Jan  12          0
Now I want to see how many promotions it took to convert someone into a customer.
For example, it took (2 + 3) promotions to convert ID 1 into a customer, and (19) to convert ID 2.
E.g.:
ID  Promotions
1   5
2   19
I am unable to think of an idea to solve it. Can you please help me?
@Corralien and @mozway have helped with the solution in Python, but I am unable to implement it in PySpark because of the huge dataframe size (>1 TB).
You can use:
prom = (df.groupby('ID')['Promotions'].cumsum()
.where(df['Converted to customer'].eq(1))
.dropna().astype(int))
out = df.loc[prom.index, ['ID', 'Date']].assign(Promotion=prom)
print(out)
# Output
ID Date Promotion
1 1 10-Jan 5
3 2 10-Jan 19
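Since the question mentions needing this in PySpark for the >1 TB case, here is a window-function sketch of the same cumulative-sum idea. It assumes the Spark DataFrame is named sdf and carries an explicit ordering column row_order (both names are assumptions; "2-Jan"-style strings do not sort chronologically, so a real ordering column is required):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# running total of Promotions per ID, in row order
w = (Window.partitionBy("ID")
           .orderBy("row_order")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

out = (sdf.withColumn("Promotion", F.sum("Promotions").over(w))
          .filter(F.col("Converted to customer") == 1)
          .select("ID", "Date", "Promotion"))

This is a sketch of the approach, not a solution tested at that scale.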
Use one groupby to generate a mask to hide the rows, then one groupby.sum for the sum:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
.apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
)
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1 5
2 19
Name: Promotions, dtype: int64
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
ID Number
0 1 5
1 2 19
If you potentially have groups without conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
.apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
)
out = (df[~mask]
.groupby('ID', as_index=False)
.agg(**{'Number': ('Promotions', 'sum'),
'Converted': ('Converted to customer', 'max')
})
)
Output:
ID Number Converted
0 1 5 1
1 2 19 1
2 3 39 0
Alternative input:
ID Date Promotions Converted to customer
0 1 2-Jan 2 0
1 1 10-Jan 3 1
2 1 14-Jan 3 0
3 2 10-Jan 19 1
4 2 10-Jan 8 0
5 2 10-Jan 12 0
6 3 10-Jan 19 0 # this group has
7 3 10-Jan 8 0 # no conversion
8 3 10-Jan 12 0 # to customer
You want to compute something by ID, so a groupby on ID seems appropriate, e.g.
data.groupby("ID").apply(agg_fct)
Now write a separate function agg_fct which computes the result for a dataframe consisting of only one ID.
Assuming data are ordered by Date, I guess that
def agg_fct(df):
    index_of_conv = df["Converted to customer"].argmax()
    # include the converting row itself, so ID 1 yields 2 + 3 = 5
    return df.iloc[0:index_of_conv + 1, df.columns.get_loc("Promotions")].sum()
would be fine. You might want to make some adjustments in case of a customer who has never been converted.
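A minimal sketch of such an adjustment, assuming a never-converted ID should yield NaN (that choice is an assumption):

import numpy as np

def agg_fct(df):
    conv = df["Converted to customer"]
    if not conv.eq(1).any():       # this ID never converts
        return np.nan              # assumed desired behaviour
    index_of_conv = conv.argmax()  # first row with a 1
    # sum promotions up to and including the converting row
    return df["Promotions"].iloc[:index_of_conv + 1].sum()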
Disclaimer: I am very new to the Q language so please excuse my silly question.
I have a function that currently takes 2 parameters (date;sym). It runs fine for 1 sym and 1 date; however, I need to run it for multiple syms and dates, which will take forever one call at a time.
How do I create a loop that run the function on every sym, and on every date?
In Python, it is straightforward:
for date in datelist:
    for sym in symlist:
        func(date, sym)
How can I do something similar in Q? And how can I dynamically change the output table names and append them to 1 single table?
Currently, I am using the following:
output: raze .[function] peach paralist
where paralist is a list of parameter pairs: ((2020.06.01;`ABC);(2020.06.01;`XYZ)), but imho this is nowhere near efficient.
What would be the best way to achieve this in Q?
I'll generalize everything: suppose you have a function foo which operates on an atom dt with a vector s:
q)foo:{[dt;s] dt +\: s}
q)dt:10?10
q)s:100?10
q)dt
8 1 9 5 4 6 6 1 8 5
q)s
4 9 2 7 0 1 9 2 1 8 8 1 7 2 4 5 4 2 7 8 5 6 4 1 3 3 7 8 2 1 4 2 8 0 5 8 5 2 8..
q)foo[;s] each dt
12 17 10 15 8 9 17 10 9 16 16 9 15 10 12 13 12 10 15 16 13 14 12 9 11 11 ..
5 10 3 8 1 2 10 3 2 9 9 2 8 3 5 6 5 3 8 9 6 7 5 2 4 4 ..
13 18 11 16 9 10 18 11 10 17 17 10 16 11 13 14 13 11 16 17 14 15 13 10 12 12 ..
9 14 7 12 5 6 14 7 6 13 13 6 12 7 9 10 9 7 12 13 10 11 9 6 8 8 ..
The solution is to project the symList over the function in question, then use each (or peach) for the date variable.
If your function requires an atomic date and sym, then you can just create a new function to implement this
q)bar:{[x;y] foo[x;] each y};
datelist:`date$10?10
symlist:10?`IBM`MSFT`GOOG
function:{0N!(x;y)}
{.[function;x]} peach datelist cross symlist
cross will return all combinations of sym and date
Is this what you need?
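For readers coming from the Python loop in the question, cross builds the same pair list that itertools.product would (a side-by-side sketch, not part of the q answer):

from itertools import product

datelist = ["2020.06.01", "2020.06.02"]
symlist = ["ABC", "XYZ"]

# q's `datelist cross symlist` yields the same combinations
pairs = list(product(datelist, symlist))
# [('2020.06.01', 'ABC'), ('2020.06.01', 'XYZ'),
#  ('2020.06.02', 'ABC'), ('2020.06.02', 'XYZ')]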
Try using the each-both adverb (') twice:
raze function'[datelist]'[symlist]
peach and each won't work here. They are not operators but anonymous functions with two parameters: each is k){x'y}. That is why the statement function each list1 each list2 is invalid, while function'[list1]'[list2] works.
From reading your response to another answer, you are looking to save the results with unique names, yes? Take a look at this solution, using set to save and get to retrieve.
q)t:flip enlist each `colA`colB!(100;`name)
q)t
colA colB
---------
100 name
q)f:{[date;sym]tblName:`$string[date],string sym;tblName set update date:date,sym:sym from t}
q)newTbls:f'[.z.d+til 3;`AAA`BBB`CCC]
q)newTbls
`2020.09.02AAA`2020.09.03BBB`2020.09.04CCC
q)get each newTbls
+`colA`colB`date`sym!(,100;,`name;,2020.09.02;,`AAA)
+`colA`colB`date`sym!(,100;,`name;,2020.09.03;,`BBB)
+`colA`colB`date`sym!(,100;,`name;,2020.09.04;,`CCC)
q)get first newTbls
colA colB date sym
------------------------
100 name 2020.09.02 AAA
Does this meet your needs?
This could be a stab in the dark, but why not create an HDB rather than all these variables (output20191005ABC, output20191006ABC, etc.), given you want to append them to one table?
Below I have outlined how to create a date-partitioned HDB called outputHDB with one table, outputTbl. I created the HDB by running a function by date and sym and then upserting the resulting rows to disk.
C:\Users\Matthew Moore>mkdir outputHDB
C:\Users\Matthew Moore>cd outputHDB
// can change the outputHDB as desired
// start q
h:hopen `::6789; // as a demo I connected to another hdb process and extracted some data per sym / date over IPC
hdbLoc:hsym `$"C:/Users/Matthew Moore/outputHDB";
{[d;sl]
{[d;s]
//output:yourFunc[date;sym];
// my func as a demo, I'm grabbing rows where price = max price by date and by sym from another hdb using the handle h
output:{[d;s]
h({[d;s] select from trades where date = d, sym = s, price = max price};d;s)
}[d;s];
// HDB Part
path:` sv (hdbLoc;`$string d;`outputTbl;`);
// change `outputTbl to desired table name
// dynamically creates the save location and then upserts one syms data directly to disk
// e.g. `:C:/Users/Matthew Moore/outputHDB/2014.04.21/outputTbl/
// extra / at the end saves the table splayed, i.e. each column is its own file within the outputTbl directory
path upsert .Q.en[`:.;output];
// .Q.en enumerates syms in a table which is required when saving a table splayed
}[d;] each sl;
// applies the parted attribute to the sym column on disk, this speeds up querying for on disk data
@[` sv (hdbLoc;`$string d;`outputTbl;`);`sym;`p#];
}[;`AAPL`CSCO`DELL`GOOG`IBM`MSFT`NOK`ORCL`YHOO] each dateList:2014.04.21 2014.04.22 2014.04.23 2014.04.24 2014.04.25;
Now that the HDB has been created, you can load it from disk and query it with qSQL:
q)\l .
q)select from outputTbl where date = 2014.04.24, sym = `GOOG
date sym time src price size
------------------------------------------------------------
2014.04.24 GOOG 2014.04.24D13:53:59.182000000 O 46.43 2453
I was getting some wacky values from the localtime function in Perl. The following is some code for which I get incorrect values.
In particular, this code is meant to determine the weekday for the first of each year.
#!/usr/bin/perl
use strict 'vars';
use Time::Local;
use POSIX qw(strftime);
mytable();
sub mytable {
print "Year" . " "x4 . "Jan 1st (localtime)" . " "x4 . "Jan 1st (Gauss)\n";
foreach my $year ( 1964 .. 2017 )
{
my $janlocaltime = evalweekday( 1,1,$year);
my $jangauss = gauss($year);
my $diff = $jangauss - $janlocaltime;
printf "%4s%10s%-12s ",$year,"",$janlocaltime;
printf "%12s",$jangauss;
printf " <----- ERROR: off by %2s", $diff if ( $diff != 0 );
print "\n";
}
}
sub evalweekday {
## Using "localtime"
my ($day,$month,$year) = @_;
my $epoch = timelocal(0,0,0, $day,$month-1,$year-1900);
my $weekday = ( localtime($epoch) ) [6];
return $weekday;
}
sub gauss {
## Alternative approach
my ($year) = @_;
my $weekday =
( 1 + 5 * ( ( $year - 1 ) % 4 )
+ 4 * ( ( $year - 1 ) % 100 )
+ 6 * ( ( $year - 1 ) % 400 )
) % 7;
return $weekday;
}
Here is the output which shows the years with incorrect values:
Year Jan 1st (localtime) Jan 1st (Gauss)
1964 2 3 <----- ERROR: off by 1
1965 4 5 <----- ERROR: off by 1
1966 5 6 <----- ERROR: off by 1
1967 6 0 <----- ERROR: off by -6
1968 1 1
1969 3 3
1970 4 4
1971 5 5
1972 6 6
1973 1 1
1974 2 2
1975 3 3
1976 4 4
1977 6 6
1978 0 0
1979 1 1
1980 2 2
1981 4 4
1982 5 5
1983 6 6
1984 0 0
1985 2 2
1986 3 3
1987 4 4
1988 5 5
1989 0 0
1990 1 1
1991 2 2
1992 3 3
1993 5 5
1994 6 6
1995 0 0
1996 1 1
1997 3 3
1998 4 4
1999 5 5
2000 6 6
2001 1 1
2002 2 2
2003 3 3
2004 4 4
2005 6 6
2006 0 0
2007 1 1
2008 2 2
2009 4 4
2010 5 5
2011 6 6
2012 0 0
2013 2 2
2014 3 3
2015 4 4
2016 5 5
2017 0 0
In fact, the errors seem to extend as far back as 1900, but I just haven't verified that they are in fact wrong prior to 1964.
perl --version returns the following:
This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)
Copyright 1987-2013, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
I'm not sure whether it's relevant, but my operating system is macOS Sierra Version 10.12.3.
I've read through the documentation, but I don't see anything (or I'm being blind) regarding values returned prior to 1968. I've also tried to do a websearch but am not pulling up anything beyond the typical misunderstandings of array values and the numbering of months and days of the year.
Could someone help me out and explain what I'm getting wrong? Or, if this is an issue with my version of Perl, let me know what I can do to fix it.
This is likely to do with how negative epoch values are handled in Time::Local. Have a look at perldoc Time::Local, section "Negative Epoch Values".
On my Linux box (perl 5.20), your code demonstrates the issue nicely. If you print out the epoch value, you can see the problem: the epoch returned by timelocal becomes huge instead of more negative:
Year Epoch Jan 1st (localtime) Jan 1st (Gauss)
1964 2966342400 2 3 <----- ERROR: off by 1
1965 2997964800 4 5 <----- ERROR: off by 1
1966 3029500800 5 6 <----- ERROR: off by 1
1967 3061036800 6 0 <----- ERROR: off by -6
1968 -63185400 1 1
1969 -31563000 3 3
1970 -27000 4 4
1971 31509000 5 5
1972 63045000 6 6
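Not part of the Perl fix, but a quick cross-check in Python (whose datetime.date does not go through the C epoch for this range) confirms that the Gauss column holds the correct values:

from datetime import date

def gauss(year):
    # Gauss's formula for the weekday of Jan 1 (0 = Sunday)
    return (1 + 5 * ((year - 1) % 4)
              + 4 * ((year - 1) % 100)
              + 6 * ((year - 1) % 400)) % 7

for year in range(1964, 2018):
    # date.weekday() uses 0 = Monday, so shift to 0 = Sunday
    assert (date(year, 1, 1).weekday() + 1) % 7 == gauss(year), year
print("Gauss formula matches for 1964-2017")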
Why don't you try using the DateTime library instead:
use DateTime;
my $dt = DateTime->new(
year => 1966, # Real Year
day => 1, # 1-31
month => 1, # 1-12
hour => 0, # 0-23
second => 0, # 0-59
);
print $dt->dow . "\n";
6
6 = Saturday, which matches the Wikipedia view: Jan 1, 1966 was a Saturday.
I have a dataframe that contains a series of dates, e.g.:
0 2014-06-17
1 2014-05-05
2 2014-01-07
3 2014-06-29
4 2014-03-15
5 2014-06-06
7 2014-01-29
What I would like to do is convert these dates to integers by the month, since all the values are within the same year. So I would like to get something like this:
0 6
1 5
2 1
3 6
4 3
5 6
7 1
Is there a quick way to do this with Pandas?
EDIT: Answered by jezrael. Thanks so much!
Use the dt.month accessor:
print (df)
Dates
0 2014-06-17
1 2014-05-05
2 2014-01-07
3 2014-06-29
4 2014-03-15
5 2014-06-06
6 2014-01-29
print (df.Dates.dt.month)
0 6
1 5
2 1
3 6
4 3
5 6
6 1
Name: Dates, dtype: int64
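One hedge: this assumes the Dates column is already datetime64. If it holds strings, convert it first with pd.to_datetime (a small sketch; the column name Dates follows the printed frame above):

import pandas as pd

df["Dates"] = pd.to_datetime(df["Dates"])  # parse strings like "2014-06-17"
print(df["Dates"].dt.month)                # integer months, 1-12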
I have an SSRS report displaying data on a per-day basis in a matrix. I am using the left side (up-to-down) to display the total of all entities grouped by day. I am using the top side (left-to-right) to display a break-down of the different types of entries that are summed up in that row.
eg:
dataset:
day typ cnt amount exp
Mon 1 3 001000 400
Tue 1 4 000200 400
Wed 1 0 000000 400
Thu 1 1 000020 400
Fri 1 5 002100 400
Mon 2 2 001000 200
Tue 2 0 000000 200
Wed 2 2 005000 200
Thu 2 0 000000 200
Fri 2 20 250000 200
Output:
|| day cnt amount exp || typ cnt amount typ cnt amount
|| Mon 5 002000 600 || 1 3 001000 2 2 001000
up Tue 4 000200 600 up 1 4 000200 2 0 000000
dwn Wed 2 005000 600 dwn 1 0 000000 2 2 005000
|| Thu 1 000020 600 || 1 1 000020 2 0 000000
|| Fri 25 252100 600 || 1 5 002100 2 20 250000
The caveat is that I want to essentially sum-by-distinct-type the exp column (expected amount).
Normally I would sum/group everything in my query, but one requirement of this report is to display each individual entry on a Detail page (in addition to the output I described above), and the query is already prohibitively heavy.
Hopefully my formatted output is not too hard to decipher; the left side (surrounded by || up dwn ||) groups on day with sum(cnt), sum(amount), sumDistinct(exp), and the right side is the "matrix", grouped on typ. The sumDistinct(exp) (distinct by the typ column) is the part I am having trouble with.
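To make the intended sumDistinct(exp) concrete, here is the same aggregation in pandas (illustration only; the report itself is SSRS):

import pandas as pd

df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"] * 2,
    "typ": [1] * 5 + [2] * 5,
    "exp": [400] * 5 + [200] * 5,
})

# count exp once per distinct (day, typ) pair, then sum within each day
exp_distinct = (df.drop_duplicates(["day", "typ"])
                  .groupby("day")["exp"].sum())
print(exp_distinct)  # 400 + 200 = 600 for every day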
Normally I would sum/ group everything in my query, but one
requirement of this report is to display each individual entry on a
Detail page (in addition to the output I described above) and the
query is already prohibitively heavy.
If I have something like this, I define a "Subreport" at the end of the first report, separated by a page break. I am not sure what program you are using to build your reports; however, it should allow you to create a report with all the details, save it, and add it to your main report as a sub-report.