SAS PROC SQL - Concatenate variable values into a single value by group

I have a data set which contains 'factor' values and corresponding 'response' values:
data inTable;
input fact $ val $;
datalines;
a 1
a 2
a 3
b 4
b 5
b 6
c 7
d 8
e 9
e 10
f 11
;
run;
I want to aggregate the response values by factor, i.e. I need to get:
a 1 2 3
b 4 5 6
c 7
d 8
e 9 10
f 11
I know perfectly well how to implement this in a data step running a loop through values and applying CATX (posted here). But can I do the same with PROC SQL, using a combination of GROUP BY and some character analog of SUM() or CATX()?
Thanks for the help,
Dmitry

The data step is the appropriate tool to use in SAS if you want to apply any sort of logic that carries lots of values forward from previous rows.
Any SQL solution would be extremely unwieldy - you would need to join the input table to itself n times, where n is the maximum number of distinct values for any of your factors, and you would also need to define a sequential key preserving the row order to use for the join.
A list of aggregation functions you can use in proc sql is available here:
http://support.sas.com/kb/25/279.html
Although a few of these do work with character variables, there is no aggregation function for string concatenation.
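For reference, here is a minimal data step sketch of that approach (assuming the input is already sorted by fact, as in the datalines above; the result length of $200 is an arbitrary assumption):
data outTable;
  set inTable;
  by fact;
  length vals $200;
  retain vals;
  /* start a fresh list on the first row of each factor, otherwise append */
  if first.fact then vals = val;
  else vals = catx(' ', vals, val);
  /* write one row per factor, on its last observation */
  if last.fact then output;
  keep fact vals;
run;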

Related

Select cases if value is greater than mean of group

Is there a way to include means of entire variables in Select Cases If syntax?
I have a dataset with three groups of n=20 each (grouping variable grp with values 1, 2, or 3) and results of a pre and post evaluation (variables pre and post). I want to select, for every group, only the 10 cases where the pre value is higher than the mean of that value in the group.
In pseudocode:
select if pre-value > mean(grp)
So if the mean in group 1 is 15, that's what all values from group one cases should be compared to. But at the same time if group 2's mean is 20, that is what values from cases in group 2 should be compared to.
Right now I only see the MEAN(arg1,arg2,...) function in the Select Cases If window, but no possibility to get the mean of an entire variable, much less with an additional condition (like group).
Is there a way to do this with Select Cases If syntax, or otherwise?
You need to create a new variable that contains the mean of the group (so all lines in each group will have the same value in this variable, the group mean). You can then compare each line to this value.
First I'll create some example data to demonstrate on:
data list list /grp pre_value.
begin data
1 3
1 6
1 8
2 1
2 4
2 9
3 55
3 43
3 76
end data.
Now you can calculate the group mean and select:
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=grp /GrpMean=MEAN(pre_value).
select if pre_value > GrpMean.
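If you also need to keep only the 10 highest qualifying cases per group (as the question mentions), one hedged follow-up sketch uses RANK after the SELECT IF above (the variable name preRank is an assumption):
RANK VARIABLES=pre_value (D) BY grp /RANK INTO preRank.
select if preRank <= 10.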

Create sample value for failure records spark

I have a scenario where my dataframe has 3 columns: a, b, and c. I need to validate whether the length of all the columns is equal to 100. Based on the validation I am creating status columns a_status, b_status, and c_status with values 5 (Success) and 10 (Failure). In failure scenarios I need to update the count and create new columns a_sample, b_sample, c_sample with some 5 failure sample values separated by ",". For creating the sample columns I tried this:
df = df.select(df.columns.toList.map(col(_)) :::
  df.columns.toList.map(x =>
    lit(getSample(df.select(x, x + "_status").filter(x + "_status=10").select(x).take(5)))
      .alias(x + "_sample")): _*)
The getSample method just takes an array of rows and concatenates them into a string. This works fine for a limited number of columns and a small data size; however, if the number of columns is > 200 and the data is > 1 million rows, it has a huge performance impact. Is there an alternate approach for the same?
While the details of your problem statement are unclear, you can break up the task into two parts:
Transform data into a format where you identify several different types of rows you need to sample.
Collect sample by row type.
The industry jargon for a "row type" is stratum/strata, and the way to do (2) without collecting data to the driver, which you don't want to do when the data is large, is via stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it doesn't work with exact row counts but with fractions. If you absolutely must get a sample with an exact number of rows, there are two strategies:
Oversample by fraction and then filter unneeded rows, e.g., using the row_number() window function followed by a filter such as 'row_num <= n.
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
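For illustration, here is a minimal Scala sketch of strategy (1) for a single column, using a_status from the question; the fractions, seed, and cap of 5 rows are placeholder assumptions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Oversample the failure stratum (a_status = 10) by a fraction assumed to be
// somewhat above the target rate; sampleBy works on fractions, not row counts.
val fractions = Map(10 -> 0.01, 5 -> 0.0)
val sampled = df.stat.sampleBy("a_status", fractions, seed = 42L)

// Trim to at most 5 rows per stratum with row_number() over a window.
val w = Window.partitionBy(col("a_status")).orderBy(rand(42L))
val firstN = sampled
  .withColumn("row_num", row_number().over(w))
  .filter(col("row_num") <= 5)
  .drop("row_num")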
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).

Unpivot data in Tableau Converting Rows to Columns

I have this data in Tableau:
KPI_NAME Value Date
------------------------
A 2 1-Jan
B 4 1-Jan
A 6 2-Jan
B 7 2-Jan
and I want it like this:
A B Date
------------------------
2 4 1-Jan
6 7 2-Jan
So I want to convert each distinct value in the column KPI_NAME to a separate column. This can be done in the visualization part of Tableau, but I want to do it in the data preparation because I want to use the result in a calculated field.
Any help is appreciated.
Most Tableau functionality is designed to consume more granular, flattened, and tidy data in the form of your first set. As such, the data prep functionality has a feature to pivot column values into rows. I don't believe the reverse functionality is built into the data prep capability in the same way.
Not knowing your end use case, a potential workaround would be to create a calculated field with an IF statement that returns the value when the record is listed as A, and otherwise returns NULL. Although you will still have the same number of records, you should be able to perform many of the calculations available with this type of data structure.
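For example, such a calculated field might look like this (a sketch; the field names follow the sample data above):
// Calculated field "A" (name assumed)
IF [KPI_NAME] = 'A' THEN [Value] END
This keeps one row per record but populates the field only on A rows, so an aggregate such as SUM() or MAX() per Date recovers the pivoted value.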
Alternatively, you could perform your pivot outside of Tableau.

Difference between Matlab JOIN vs. INNERJOIN

In SQL, JOIN and INNER JOIN mean the same thing. In Matlab, they are different commands. Just from perusing the documentation thus far, they appear on the surface to fulfill the same general function, with possible differences in the details, as controlled by parameters. I am slogging through the individual examples and may (or may not) find the fundamental difference. However, I feel that the difference should not be a subtlety that users have to ferret out of the examples. These are two separate commands, and the documentation should make it clear up front why they are both needed. Would anyone be able to chime in about the key difference? Perhaps it could become a request to place it front and centre in the documentation.
I've empirically characterized the difference between JOIN and INNERJOIN (some would refer to this as reverse engineering). I'll summarize from the perspective of one who is comfortable with SQL. As I am new to SQL-like operations in Matlab, I've only been able to test drive it to a limited degree, but the INNERJOIN appears to join records in the same manner as SQL. Since SQL is a pretty open language, the behavioural specification of INNERJOIN is readily available, and I won't dwell on that. It's Matlab's JOIN that I need to suss out.
In short, from my testing, Matlab's JOIN seems to "join" the rows in the two operand tables in a manner more like Excel's VLOOKUP than any of the JOINs in SQL. In general, the main differences from SQL joins seem to be (i) that the right hand table cannot have repeating values in the columns used for matching up rows between the two tables, and (ii) all combinations of values in the key columns of the left hand table must show up in the right hand table.
Here is the empirical testing. First, prepare the test tables:
a=array2table([
1 2
3 4
5 4
],'VariableNames',{'col1','col2'})
b=array2table([
4 7
4 8
6 9
],'VariableNames',{'col2','col3'})
c=array2table([
2 10
4 8
6 9
],'VariableNames',{'col2','col3'})
d=array2table([
2 10
4 8
6 9
6 11
],'VariableNames',{'col2','col3'})
a2=array2table([
1 2
3 4
5 4
20 99
],'VariableNames',{'col1','col2'})
Here are the tests:
>> join(a,b)
Error using table/join (line 111)
The key variable for B must have unique values.
>> join(a,c)
ans = col1 col2 col3
____ ____ ____
1 2 10
3 4 8
5 4 8
>> join(a,d)
Error using table/join (line 111)
The key variable for B must have unique values.
>> join(a2,c)
Error using table/join (line 130)
The key variable for B must contain all values in the key
variable for A.
The first thing to notice is that JOIN is not a symmetric operation with respect to the two tables.
It seems that the 2nd table argument is used as a lookup table. Unlike SQL joins, Matlab throws an error if it can't find a match in the 2nd table [see join(a2,c)]. This is somewhat hinted at in the documentation, though not entirely clearly. For example, it says that the key values must be common to both tables, but join(a,c) clearly shows that the tables do not have to have exactly the same key values. On the contrary, just as one would expect of a lookup table, entries in the 2nd table that aren't matched do not throw errors.
Another difference from SQL joins is that records that cause the key values to repeat in the 2nd table are not allowed in Matlab's JOIN [see join(a,b) & join(a,d)]. In contrast, the fields used for matching records between tables aren't even referred to as keys in SQL, and hence can have non-unique values in either of the two tables. The disallowance of repeated key values in the 2nd table is consistent with the view of the 2nd table as a lookup table. On the other hand, repetition of key values is permitted in the 1st table.
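For contrast, a quick check with INNERJOIN on the same tables (a hedged expectation based on SQL inner-join semantics rather than a verified result):
>> innerjoin(a,b)   % matches each col2==4 row of a with both col2==4 rows of b,
                    % so the expected result is 4 rows: (3,4,7), (3,4,8), (5,4,7), (5,4,8);
                    % unmatched rows (a: col2==2, b: col2==6) are dropped, with no error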

SQL Server 2008: Pivot column with no aggregate function workaround

Yes I know, this question has been asked MANY times, but after reading all the posts I found that there wasn't an answer that fits my need. So, here's my question. I would like to take a column of values and pivot them into rows of 6 columns.
I want to take this:
G
081278
12
00123535
John Doe
123456
And turn it into this:
Letter  Date    Code  Ammount   Name      Account
G       081278  12    00123535  John Doe  123456
I have 110000 values in this one column in one table called TempTable. I need all the values displayed because each row is an entity unto itself. For instance, there is one unique entry for each of the Letter, Date, Code, Ammount, Name, and Account columns. I understand that an aggregate function is required, but is there a workaround that will allow me to get this desired result?
Just use a MAX aggregate
If one row = one column (per group of 6 rows) then MAX of a single value = that row value.
However, the data you've posted is insufficient. I don't see anything to:
associate the 6 rows per group
distinguish whether a row is "Letter" or "Name"
There is no implicit row order or number to rely upon to generate the groups
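To show the principle anyway, here is a hedged sketch that assumes three columns the posted data does not have: a group key grp, a position column pos (1..6) within each group, and the value column val:
-- Hypothetical: grp, pos, and val do not exist in the posted TempTable
SELECT grp,
       MAX(CASE WHEN pos = 1 THEN val END) AS Letter,
       MAX(CASE WHEN pos = 2 THEN val END) AS [Date],
       MAX(CASE WHEN pos = 3 THEN val END) AS Code,
       MAX(CASE WHEN pos = 4 THEN val END) AS Ammount,
       MAX(CASE WHEN pos = 5 THEN val END) AS [Name],
       MAX(CASE WHEN pos = 6 THEN val END) AS Account
FROM TempTable
GROUP BY grp;
Because each group/position pair holds exactly one row, MAX simply returns that row's value.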
Unfortunately, the maximum number of columns in a SQL Server 2008 SELECT statement is 4,096, as per MSDN Max Capacity.
Instead of using a pivot, you might consider dynamic SQL to get what you want to do.
Declare @SQLColumns nvarchar(max),@SQL nvarchar(max)
select @SQLColumns=(select ''''+ColName+''',' from TableName for XML Path(''))
set @SQLColumns=left(@SQLColumns,len(@SQLColumns)-1)
set @SQL='Select '+@SQLColumns
exec sp_ExecuteSQL @SQL,N''