I am having difficulty achieving this functionality in SPSS. The data set is formatted like this (apologies for the Excel format).
In this example, the AGGREGATE function was used to combine cases that share the same value of the break variable; in other words, CITY (Tampa in the example) is the break variable.
Unfortunately, each entry for Tampa gives 10 unique temperatures, one for each day, so the first entry for Tampa covers days 0-10 and the second covers days 10-20; both provide useful information. I can't figure out how to use the AGGREGATE function to create new variables so these days aren't lost. I want to be able to run tests on the mean temperature in Tampa over days 0-20, relative to days 0-20 in other cities.
My current syntax is:
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=CITY
/Temp=Max(Temp).
But this doesn't create the variables, and I'm not sure where to start on that end. I checked the SPSS manual and didn't see this as an option within AGGREGATE. Any idea what function might allow this?
If I understand right, you are trying to reorganize all the information for each CITY into one line, not to aggregate it. What you are looking for is the restructure command CASESTOVARS.
First we'll create some fake data to demonstrate on:
data list list/City (a10) temp1 to temp10 (10f6).
begin data
Tampa 10 11 12 13 14 15 16 17 18 19
Boston 20 21 22 23 24 25 26 27 28 29
Tampa 30 31 32 33 34 35 36 37 38 39
NY 40 41 42 43 44 45 46 47 48 49
Boston 50 51 52 53 54 55 56 57 58 59
end data.
CASESTOVARS needs an index variable (e.g., the row number within each city). In your example the data doesn't have an index, so the following commands will create one:
sort cases by CITY.
if $casenum=1 or city<>lag(city) IndVar=1.
if city=lag(city) IndVar=lag(IndVar)+1.
format IndVar(f2).
Now we can restructure:
sort cases by CITY IndVar.
casestovars /id=CITY /index=IndVar /separator="_" /groupby=index.
This will also work if you have more rows per city.
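With the fake data above, the restructured file should look roughly like this (abridged; exact variable names depend on the separator, and cities with fewer rows than the maximum get system-missing values for the extra index):

City    temp1_1 temp2_1 ... temp10_1 temp1_2 temp2_2 ... temp10_2
Boston  20      21      ... 29       50      51      ... 59
NY      40      41      ... 49       .       .       ... .
Tampa   10      11      ... 19       30      31      ... 39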
Important note: my artificial index (IndVar) doesn't necessarily reflect the original order of rows in your file. If your file really doesn't contain an index and isn't ordered so that the first row holds the first measurements, and so on, the restructured file will accordingly not be ordered either: the earlier measurements might appear to the left or to the right of the later ones, according to their order in the original file. To avoid this, you should try to define a real index and use it in CASESTOVARS.
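For example, if each row carried a variable marking the first day it covers (a hypothetical FirstDay variable holding 0, 10, 20, ...), a stable index could be computed from the data itself rather than from the row order:

* FirstDay is hypothetical: 0 for days 0-10, 10 for days 10-20, and so on.
compute IndVar = trunc(FirstDay / 10) + 1.
format IndVar (f2).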
Run EXECUTE or Transform > Run Pending Transformations to see the results of the AGGREGATE command.
I did dropDuplicates on a DataFrame with the subset Region, store, and id.
The DataFrame contains some other columns like latitude, longitude, address, Zip, Year, Month...
When I count the derived DataFrame, I get a constant value, but when I count a selected year, say 2018, I get different counts each time I run df.count().
Could anyone please explain why this is happening?
Df.dropDuplicates("region", "store", "id")
Df.createOrReplaceTempView("Df")
spark.sql("select * from Df").count()
This count is constant whenever I run it, but if I put a WHERE clause on Year or Month inside, I get varying counts.
E.g.:
spark.sql("select * from Df where Year = 2018").count()
This statement gives different values on each execution.
Intermediate output:
Region  store  objectnr  latitude  longitude  newid  month  year  uid
Abc     20     4572      46.6383   8.7383     1      4      2018  0
Sgs     21     1425      47.783    6.7282     2      5      2019  1
Efg     26     1277      48.8293   8.2727     3      7      2019  2
Output:
Region  store  objectnr  latitude  longitude  newid  month  year  uid
Abc     20     4572      46.6383   8.7383     1277   4      2018  0
Sgs     21     1425      47.783    6.7282     1425   5      2019  1
Efg     26     1277      48.8293   8.2727     1277   7      2019  2
So here newid gets the value of objectnr. When newid comes out the same, I need to assign the latest objectnr to newid, considering the year and month.
The line
Df.dropDuplicates("region","store","id")
creates a new DataFrame; it does not modify the existing one. DataFrames are immutable.
To solve your issue, you need to save the output of the dropDuplicates call into a new DataFrame, as shown below:
val Df2 = Df.dropDuplicates("region","store","id")
Df2.createOrReplaceTempView("Df2")
spark.sql("select * from Df2").count()
In addition, you may get different counts when applying the filter Year = 2018 because the Year column is not part of the three columns you used to drop the duplicates. Apparently you have data in your DataFrame that share the same values in those three columns but differ in Year. Dropping duplicates is not a deterministic process, as it depends on the ordering of your data, which can vary on every run of your code.
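If you also need the deduplication itself to be deterministic, and to keep the latest row per group as you describe, a window function is one option instead of dropDuplicates. This is only a sketch, assuming the year and month columns from your sample output:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank the rows within each (region, store, id) group, newest first.
val w = Window
  .partitionBy("region", "store", "id")
  .orderBy(col("year").desc, col("month").desc)

// Keep only the newest row per group, then drop the helper column.
val Df2 = Df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

Df2.createOrReplaceTempView("Df2")
spark.sql("select * from Df2 where Year = 2018").count()  // stable across runs

Because the ordering inside each group is now explicit, the kept row no longer depends on how the data happens to be partitioned on a given run.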
I have 2 data sources:
1) Eligible Devices to be changed to a newer model
2) Devices actually changed to a new model
These could be, for example, computer devices that after a certain time are required to be changed to a newer version, i.e., at the end of life of a product.
As this is a blended data source, I cannot apply an LOD calculation.
The calculation i am trying to achieve is:
Jan 2017: There are 100 eligible devices but only 85 actually got refreshed, hence there are 15 devices carried forward to the next month.
Feb 2017: There are 200 eligible devices but only 160 actually got refreshed, plus 15 from Jan, so the total opening balance for Feb = 200 + 15 = 215, and then 160 is deducted, i.e., 55.
The same process continues for all the other months and years.
The challenge:
Let's say eligible - actual is named Diff. For the first month, this number should just be eligible - actual; from the second month onward it should be eligible - actual plus the balance from the previous month, i.e., a lookup.
How do I write a calc that shows the plain calculation for the first month and then, from the next month onwards, the previous month's balance plus eligible - actual?
There would be month and year columns in filters; if I remove any year or month from the filter, the carry-forward from that period will not be accounted for.
Sounds like RUNNING_SUM([Diff]) should do the trick, as long as you set the partitioning and addressing correctly.
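As a sketch (the field names are assumptions, since the blend's actual fields aren't shown), define Diff as its own calculated field and then wrap it in a running sum:

// Diff: eligible minus actually refreshed, per month
SUM([Eligible Devices]) - SUM([Refreshed Devices])

// Balance: cumulative shortfall carried forward through the current month
RUNNING_SUM([Diff])

In the first month the running sum is just Diff; every later month adds its own Diff to the previous balance, which is exactly the behavior described. Set Compute Using to your Year/Month dimensions so the sum runs across time, and note that filtering out earlier months removes their carry-forward, so hide marks rather than filter them if the balance must survive.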
I tried to perform two requests to the Reporting API:
startDate=2016-01-01, endDate=2016-08-26, ga:users, ga:yearMonth
startDate=2016-01-01, endDate=2016-08-26, ga:users, ga:yearMonth, ga:year
The metric results do not match. Why?
Example on https://ga-dev-tools.appspot.com/query-explorer/
Result for request one:
ga:yearMonth ga:users
201601 1372
201602 1701
201603 1980
201604 1779
201605 1465
201606 1336
201607 1402
201608 1595
Result for request two:
ga:year ga:yearMonth ga:users
2016 201601 1372
2016 201602 1525
2016 201603 1761
2016 201604 1531
2016 201605 1239
2016 201606 1084
2016 201607 1157
2016 201608 1365
This answer may be useful to someone having the same problem. Whenever there is a mismatch between data from the API and data on the dashboard, do the following things:
Make sure you are using the right parameters for both of them (the same metrics and dimensions).
If after step one there is still a mismatch, it's probably because sampling has kicked in internally at Google; this happens because even a small query can require heavy computation. To confirm that sampling is being applied, look for a samplingSpaceSizes field in the response.
To avoid sampling, loop over the dates and query each day independently.
In your case it is most probably a sampling issue because of the huge date range (and because your GA account has lots of data), so instead of querying a bigger range, loop over the date range, as sketched below.
Also remember it may take up to 48 hours for fresh data to be processed. To check whether your data is fully processed, look for an isDataGolden field in the response: if it is present, the data is processed and the results will match; if it is absent, some of your data has not been processed yet.
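A rough sketch of that per-day loop against the Analytics Reporting API v4 (the view ID is a placeholder, and analytics is assumed to be an authorized service object built with googleapiclient.discovery.build('analyticsreporting', 'v4', ...)):

from datetime import date, timedelta

def users_per_day(analytics, view_id, start, end):
    """Query ga:users one day at a time to stay under sampling thresholds."""
    totals = {}
    day = start
    while day <= end:
        d = day.isoformat()
        response = analytics.reports().batchGet(body={
            'reportRequests': [{
                'viewId': view_id,
                'dateRanges': [{'startDate': d, 'endDate': d}],
                'metrics': [{'expression': 'ga:users'}],
            }]
        }).execute()
        data = response['reports'][0]['data']
        # samplingSpaceSizes only appears when the response is sampled;
        # isDataGolden indicates the data is fully processed.
        if 'samplingSpaceSizes' in data:
            print(d, 'is still sampled')
        totals[d] = int(data['totals'][0]['values'][0])
        day += timedelta(days=1)
    return totals

# Usage (assuming an authorized service object):
# counts = users_per_day(analytics, 'XXXXXXXX', date(2016, 1, 1), date(2016, 8, 26))

Keep in mind that ga:users deduplicates visitors, so per-day counts will not simply sum to the monthly figures; the loop is for checking where sampling distorts the numbers.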
Have you checked for sampling? The date range you're working with is on the large side, so you might consider testing with a smaller range to see if the totals become more consistent.
Another thing to consider is that the Users metric can be pre-calculated or calculated on the fly; see Google's documentation on how the Users metric is computed for more information.
I have a dataset like this:
time value
1990 22
1991 31
1992 21
1993 7
1994 32
And I have a macro variable containing several observation numbers.
%put &p; returns: 1 4 5
I want to use this macro variable &p to select the matching observations in their original order.
The result should be this:
time value
1990 22
1993 7
1994 32
data result;
set indata;
if _N_ in (&p);
run;
_N_ is an automatic variable containing the iteration number of the current DATA step. Effectively, it's the number of the current observation for simple cases like this; see the SAS documentation on automatic variables for more.
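A self-contained version of the same idea, with the macro variable defined via %LET for the demonstration:

%let p = 1 4 5;

data indata;
  input time value;
  datalines;
1990 22
1991 31
1992 21
1993 7
1994 32
;
run;

data result;
  set indata;
  if _N_ in (&p);  /* keep observations 1, 4 and 5 */
run;

proc print data=result noobs;
run;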
I have Crystal Reports generating the following output in my Details section:
Cats Group
Number How Old
________________________
12 0-30 days old
32 0-30 days old
34 31-60 days old
Dogs Group
Number How Old
________________________
22 over 61 days old
123 0-30 days old
but I need the above info in a table format:
Group 0-30 days old 31-60 days old over 61 days old
______________________________________________________
Cats 2 1 0
Dogs 1 0 1
thanks
You need a Cross-Tab:
Open the Cross-Tab Expert.
Drag the "How Old" group into Columns.
Drag the animal type into Rows.
Drag the animal field into Summarized Fields and set Count as the summarize option.