TSE_to_STS application - TraMineR

I have 12 categorical sequences in TSE format. On the help page of this function, tmax is specified as 12, based on the example sequence data used. How would I change this value if the maximum time length is 292 for one sequence and smaller than 292 for the others? Assume one of the sequences ends at time 25. With tmax=292, any state after time 25 will be repeated until 292, which I believe is wrong. I would like to stop the sequence at time 25 and fill everything to its right with voids.

TSE_to_STS is a function provided by the TraMineRextras package. It converts time-stamped event sequences into state sequences. The resulting state sequences are in STS form, i.e., organized in a table with each sequence in a different row and the states in successive columns. tmax is used to determine the number of columns of this table; therefore, it should be set to the maximal state sequence length.
To end a sequence at time 25, for example, you need to insert an end-of-sequence event at time 25. TSE_to_STS cannot guess when the sequence ends.
============ example
Below I illustrate how to proceed using the actcal.tse data that ships with TraMineR. I consider the data for ids 2 and 4 and assume id 2 was observed up to the 8th month and id 4 up to the 10th month.
library(TraMineR)        ## provides actcal.tse, seqdef and seqiplot
library(TraMineRextras)  ## provides TSE_to_STS

data(actcal.tse)
## Consider the data for ids 2 and 4 and
## insert an "endobs" event to indicate the end of observation
subset <- rbind(actcal.tse[2:4,], data.frame(id=2, time=8, event="endobs"),
                actcal.tse[7:9,], data.frame(id=4, time=10, event="endobs"))
subset
##    id time       event
## 2   2    0  NoActivity
## 3   2    4       Start
## 4   2    4    FullTime
## 1   2    8      endobs
## 7   4    0 LowPartTime
## 8   4    9    Increase
## 9   4    9    PartTime
## 11  4   10      endobs

## Define the list of events of interest
events <- c("PartTime", "NoActivity", "FullTime", "LowPartTime", "endobs")
## Each event drops all the other events
stm <- seqe2stm(events, dropList=list(PartTime=events[-1], NoActivity=events[-2],
                                      FullTime=events[-3], LowPartTime=events[-4],
                                      endobs=events[-5]))
mysts <- TSE_to_STS(subset, id=1, timestamp=2, event=3,
                    stm=stm, tmin=1, tmax=12, firstState="None")
## Replace "endobs" with NAs; seqdef treats trailing NAs as voids
mysts[mysts=="endobs"] <- NA
seq <- seqdef(mysts)
seqiplot(seq)
The plot shows the different lengths of the two resulting state sequences.


How serializing foreign keyed table works internally in kdb

I have a keyed table (the referenced table) linked via a foreign key to a referencing table, and I serialize both tables using the set operator.
q)kt:([sym:`GOOG`AMZN`FB]; px:20 30 40);
q)`:/Users/uts/db/kt set kt
q)t:([] sym:`kt$5?`GOOG`AMZN`FB; vol:5?10000)
q)`:/Users/uts/db/t set t
Then I remove these tables from memory:
q)delete kt,t from `.
Now I deserialize the table t in memory:
t:get `:/Users/uts/db/t
If I run meta t after this, it fails, expecting kt as the foreign key.
If I print t, it shows, as expected, the index values in column sym of table t.
So the questions arise:
Since kdb stores the meta of each table (i.e., c, t, f, a) and its corresponding values on disk, how does the serialization of table t work internally?
How (i.e., in which binary format) are these values stored in the file t?
-rw-r--r-- 1 uts staff 100 Apr 13 23:09 t
tl;dr A foreign key is stored as a vector of 4-byte indices into the key column of the referenced table, plus the name of the table the foreign key refers to.
As far as I know, Kx has never documented its file formats, and yet some useful information relevant to your question can be deduced right from a q console session.
Let me modify your example a bit to make things simpler.
q)show kt:([sym:`GOOG`AMZN`FB]; px:20 30 40)
sym | px
----| --
GOOG| 20
AMZN| 30
FB | 40
q)show t:([] sym:`kt$`GOOG`GOOG`AMZN`FB`FB)
sym
----
GOOG
GOOG
AMZN
FB
FB
I left only one column - sym - in t because vol is not relevant to the question. Let's save t without any data first:
q)`:/tmp/t set 0#t
`:/tmp/t
q)hcount `:/tmp/t
30
Now we know that it takes 30 bytes to represent t when it's empty. Let's see if there's a pattern when we start adding rows to t:
q){`:/tmp/t set x#t;`cnt`size!(x;hcount[`:/tmp/t] - 30)} each til[11], 100 1000 1000000
cnt     size
---------------
0       0
1       4
2       8
3       12
4       16
5       20
6       24
7       28
8       32
9       36
10      40
100     400
1000    4000
1000000 4000000
We can see that adding one row increases the size of t by four bytes. What can these 4 bytes be? Could they be a representation of the symbol itself? No: if they were, renaming a sym value in kt would affect the size of t on disk, but it doesn't:
q)update sym:`$50#.Q.a from `kt where sym=`GOOG
`kt
q)1#t
sym
--------------------------------------------------
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
q)`:/tmp/t set 1#t
`:/tmp/t
q)hcount `:/tmp/t
34
Still 34 bytes. It should be clear by now that the 4 bytes are an index, but an index into what? Is it an index into a column that must be called exactly sym? Apparently not:
q)kt:`foo xcol kt
q)t
sym
--------------------------------------------------
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
AMZN
FB
FB
There's no column called sym in kt any longer but t hasn't changed at all! We can go even further and change the type of foo (ex sym) in kt:
q)update foo:-1 -2 -3.0 from `kt
`kt
q)t
sym
---
-1
-1
-2
-3
-3
Not only did it change t, it changed its meta too:
q)meta t
c  | t f a
---| -----
sym| f kt
q)/ ^------- used to be s
I hope it's clear now that kdb stores a 4-byte index into the key column of the referenced table plus the name of that table (but not the key column's name!). If the referenced table is missing, kdb can't reconstruct the original data and displays the bare indices. If a referencing table needs to be sent over the wire, the indices are replaced with the actual values so that the receiving side can see the real data.
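As a side note on the original example: since only the table name is stored, restoring kt into memory under its original name lets kdb resolve the indices again. A minimal sketch, assuming the paths from the question:
q)kt:get `:/Users/uts/db/kt   / load the referenced table back first
q)t:get `:/Users/uts/db/t     / the 4-byte indices can now be resolved
q)meta t                      / should no longer fail; the f column shows kt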

SPSS - Create dummy for top volume months within customer grouping

I need to create a dummy for the top purchase months within each customer ID. That is, if a month belongs to one of the four months within the year in which the customer purchased the most, it is marked with 1, otherwise 0.
The data contain customer ID, order date and volume; the dummy is the new variable to create.
This code creates some sample data:
data list free/ID volume (2f4).
begin data
1 100 1 500 2 1 2 2 2 3 2 90 1 600 1 90 1 870 2 9 2 8 2 10
end data.
Using the sample data in the question, this code will create a new variable containing the dummy according to your definition. Note the descending ranking, (D), so that ranks 1 through 4 correspond to the four largest volumes:
RANK VARIABLES=volume (D) BY ID /RANK.
compute high4=(Rvolume<=4).
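To check the result, a simple listing of the variables works (Rvolume is the rank variable that the RANK command above creates by default):
LIST ID volume Rvolume high4.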

Get consecutive sequence number in ireport

I need to display a row-number sequence within each group.
I have used $V{PAGE_COUNT} with evaluation time set to Now.
The report output I am getting is:
Group A
1.
2
3
4
...........
page ends ......
Group A
1
2
3
4
page ends ---------
Group B
1
2
3
4
5
page ends....
But my requirement is
Group A
1.
2
3
4
...........
page ends
Group A
5
6
7
8
9
page ends .......
Group B
1
2
3
4
5
page ends....
I need all rows of the same group to be numbered in one continuous sequence, and the sequence to restart from 1 when the group changes.
You should use the GroupName_COUNT variable in this case.
A quote from the JasperReports Ultimate Guide:
When declaring a report group, the engine automatically creates a count variable that
calculates the number of records that make up the current group (that is, the number of
records processed between group ruptures).
The name of this variable is derived from the name of the group it corresponds to,
suffixed with the _COUNT sequence. It can be used like any other report variable, in any
report expression, even in the current group expression, as shown in the BreakGroup
group of the /demo/samples/jasper sample.
More info is here: Data Grouping
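For illustration, here is a minimal JRXML sketch using the BreakGroup group named in the quote above (the band geometry is arbitrary). A text field in the detail band prints the automatic count variable with the default evaluation time (Now):
<detail>
    <band height="15">
        <textField>
            <reportElement x="0" y="0" width="50" height="15"/>
            <textFieldExpression class="java.lang.Integer"><![CDATA[$V{BreakGroup_COUNT}]]></textFieldExpression>
        </textField>
    </band>
</detail>
Because BreakGroup_COUNT resets with its group rather than with the page, the numbering continues across page breaks and restarts only when the group changes.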

Add values from multiple columns in pivot table

I have created a pivot table from an Excel spreadsheet which has many columns and many rows. Here is my requirement.
The Pivot Table has
Row Labels --> Individual Names
Column Labels --> Types of Products
Now I have 4 regions like AP, EMEA, CALA, & US in the Excel spreadsheet.
I need to get the value of Sum of (AP + EMEA + CALA + US) for each type of product for the respective individual name.
For example,
Clarke would have sold 4 of the Type 1 product in AP, 10 in EMEA, 4 in CALA and 7 in US,
and 12 of the Type 2 product in AP, 16 in EMEA, 8 in CALA and 5 in US.
I need a pivot table which looks like:
        Type 1  Type 2  Type 3  Type 4
Clarke      25      41       0       0
Marsh       11      20      12       6
How do I get this?
Like this: [screenshot of the resulting PivotTable]
Grand totals for rows and columns are optional, and further Types/Names etc. may be added to the source data without other configuration changes, except that (i) the source range may need extending (PivotTable Tools > Data > Change Data Source) and (ii) the PivotTable will need to be refreshed (right-click the data area and choose Refresh).

Calculating change in leaders for baseball stats in MSSQL

Imagine I have an MSSQL 2005 table (bbstats) that updates weekly, showing
various cumulative categories of baseball accomplishments for a team.
week 1
Player  H   SO  HR
Sammy   7   11  2
Ted     14  3   0
Arthur  2   15  0
Zach    9   14  3

week 2
Player  H   SO  HR
Sammy   12  16  4
Ted     21  7   1
Arthur  3   18  0
Zach    12  18  3
I wish to highlight textually where there has been a change of leader in each category. So after week 2 there would be nothing to report for hits (H); Zach has joined Arthur for most strikeouts (SO) at 18; and Sammy is the new leader in home runs (HR) with 4.
So I would want to set up a process something like:
a) save the past data (week 1) as table bbstatsPrior,
b) update bbstats with the new results - I do not need assistance with this,
c) compare the two tables for the player(s), with ties, holding the max value in each column,
and report only where they differ,
d) move on to the next column and repeat.
In any real world example there would be significantly more columns to calculate for
Thanks
Responding to Brent's comments: I am really after any changes in the leaders for each category.
So I would have something like
select top 1 with ties player
from bbstatsPrior
order by H desc
and
select top 1 with ties player,H
from bbstats
order by H desc
I then want to compare the players from each query (do I need temp tables?). If they differ, I want to output the result of the second select statement. For the H category, Ted is the leader in both tables, but for other categories there are changes between the weeks.
I can then loop through the columns using
select name from sys.all_columns sc
where sc.object_id=object_id('bbstats') and name <>'player'
If the number of stats doesn't change often, you could easily write a single query to get this data: join bbstats to bbstatsPrior where bbstatsPrior.week < bbstats.week and bbstats.week = #weekNumber, then do a simple comparison between bbstats.Hits and bbstatsPrior.Hits to get the difference.
If the stats change often, you could use dynamic SQL to do this for all columns that match a certain pattern, or that are in a list of columns based on sys.columns for that table.
You could add a column for each stat column to designate the leader using a correlated subquery to find the max value for that column and see if it's equal to the current record.
This might get you started, but I'd recommend posting what you currently have to achieve this and the community can help you from there.
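For instance, here is a minimal sketch of step (c) for a single category (H), built from the TOP 1 WITH TIES queries in the question (table and column names as shown above; SQL Server 2005 syntax):
WITH newLeaders AS (
    SELECT TOP 1 WITH TIES player, H
    FROM bbstats
    ORDER BY H DESC
),
oldLeaders AS (
    SELECT TOP 1 WITH TIES player
    FROM bbstatsPrior
    ORDER BY H DESC
)
-- report the new leaders only when the leader set has changed
SELECT player, H
FROM newLeaders
WHERE EXISTS (SELECT player FROM newLeaders
              EXCEPT
              SELECT player FROM oldLeaders);
Repeating this block for each stat column is exactly what the dynamic SQL approach over sys.all_columns would generate.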