Combine multiple rows into single row in Google Data Prep - google-cloud-dataprep

I have a table which has multiple payload values in separate rows. I want to combine those rows into a single row to have all the data together. Table looks something like this.
+------------+--------------+------+----+----+----+----+
| Date | Time | User | D1 | D2 | D3 | D4 |
+------------+--------------+------+----+----+----+----+
| 2020-04-15 | 05:39:45 UTC | A | 2 | | | |
| 2020-04-15 | 05:39:45 UTC | A | | 5 | | |
| 2020-04-15 | 05:39:45 UTC | A | | | 8 | |
| 2020-04-15 | 05:39:45 UTC | A | | | | 7 |
+------------+--------------+------+----+----+----+----+
And I want to convert it to something like this.
+------------+--------------+------+----+----+----+----+
| Date | Time | User | D1 | D2 | D3 | D4 |
+------------+--------------+------+----+----+----+----+
| 2020-04-15 | 05:39:45 UTC | A | 2 | 5 | 8 | 7 |
+------------+--------------+------+----+----+----+----+
I tried "set" and "aggregate" but they didn't work as I wanted them to and I am not sure how to go forward.
Any help would be appreciated.
Thanks.

tl;dr:
Use the fill() function to fill the empty values in each of the D1-D4 columns within the wanted group (i.e. the columns Date + Time + User), then dedup/aggregate to your heart's content.
Long version:
The quickest way to do this is with a window function called "fill()".
For each field in a column, this function says:
"Look down. Look up. Find the closest non-empty value, and copy it!"
You can of course limit its sight (look only 3 rows above, for example), but this example doesn't need the limitation, so your fill function will look like this:
FILL($col, -1, -1)
Here "$col" references all the chosen columns, and each "-1" means "unlimited sight".
Finally, the "~" in the column selection says "from column D1 to column D4".
So the function will look like this:
[screenshot omitted]
Which in turn will make your columns look like this:
[screenshot omitted]
Now you can use the "dedup" transformation to remove any duplicates, so that only one copy of each "group" remains.
Alternatively, if you still want to use "group by", you can do that as well.
Hope this helps =]
P.S.
There are more ways to do this; they entail using the "pivot" transformation and array unnesting. But in the process you'll lose your columns' names and will need to rename them.
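For readers who want to see the idea outside Dataprep, here is a minimal Python sketch of the same logic. It is not Dataprep code: it collapses the "fill, then dedup" steps into one pass by taking the first non-empty value per column within each Date + Time + User group. The function name `fill_and_dedup` is made up for illustration, and the sketch assumes each D column holds at most one non-empty value per group.

```python
from itertools import groupby

KEY_COLS = ("Date", "Time", "User")
VALUE_COLS = ("D1", "D2", "D3", "D4")

# Sample rows from the question; None stands for an empty cell.
rows = [
    {"Date": "2020-04-15", "Time": "05:39:45 UTC", "User": "A", "D1": 2, "D2": None, "D3": None, "D4": None},
    {"Date": "2020-04-15", "Time": "05:39:45 UTC", "User": "A", "D1": None, "D2": 5, "D3": None, "D4": None},
    {"Date": "2020-04-15", "Time": "05:39:45 UTC", "User": "A", "D1": None, "D2": None, "D3": 8, "D4": None},
    {"Date": "2020-04-15", "Time": "05:39:45 UTC", "User": "A", "D1": None, "D2": None, "D3": None, "D4": 7},
]


def fill_and_dedup(rows):
    """Return one merged row per (Date, Time, User) group."""
    def key_of(row):
        return tuple(row[col] for col in KEY_COLS)

    merged_rows = []
    # groupby needs its input sorted by the grouping key.
    for key, group in groupby(sorted(rows, key=key_of), key=key_of):
        group = list(group)
        merged = dict(zip(KEY_COLS, key))
        for col in VALUE_COLS:
            # Assumption: at most one non-empty value per column per group,
            # so "first non-empty" is the value a fill would propagate.
            merged[col] = next((r[col] for r in group if r[col] is not None), None)
        merged_rows.append(merged)
    return merged_rows


combined = fill_and_dedup(rows)
print(combined)
```

If a group could contain several different non-empty values in the same column, "first non-empty" would silently drop the others, so an explicit aggregation rule would be needed in that case.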

Related

SPSS group by rows and concatenate string into one variable

I'm trying to export SPSS metadata to a custom format using SPSS syntax. The dataset with value labels contains one or more labels for the variables.
However, now I want to concatenate the value labels into one string per variable. For example for the variable SEX combine or group the rows F/Female and M/Male into one variable F=Female;M=Male;. I already concatenated the code and labels into a new variable using Compute CodeValueLabel = concat(Code,'=',ValueLabel).
So the starting point for the source dataset looks like this:
+--------------+------+----------------+------------------+
| VarName | Code | ValueLabel | CodeValueLabel |
+--------------+------+----------------+------------------+
| SEX | F | Female | F=Female |
| SEX | M | Male | M=Male |
| ICFORM | 1 | Yes | 1=Yes |
| LIMIT_DETECT | 0 | Too low | 0=Too low |
| LIMIT_DETECT | 1 | Normal | 1=Normal |
| LIMIT_DETECT | 2 | Too high | 2=Too high |
| LIMIT_DETECT | 9 | Not applicable | 9=Not applicable |
+--------------+------+----------------+------------------+
The goal is to get a dataset something like this:
+--------------+-------------------------------------------------+
| VarName | group_and_concatenate |
+--------------+-------------------------------------------------+
| SEX | F=Female;M=Male; |
| ICFORM | 1=Yes; |
| LIMIT_DETECT | 0=Too low;1=Normal;2=Too high;9=Not applicable; |
+--------------+-------------------------------------------------+
I tried using CASESTOVARS, but that creates several separate variables rather than one single string variable. I'm starting to suspect that I'm running up against the limits of what SPSS can do, although maybe it's possible using some AGGREGATE or OMS trickery. Any ideas on how to do this?
First, I recreate your example data to demonstrate on:
data list list/varName CodeValueLabel (2a30).
begin data
"SEX" "F=Female"
"SEX" "M=Male"
"ICFORM" "1=Yes"
"LIMIT_DETECT" "0=Too low"
"LIMIT_DETECT" "1=Normal"
"LIMIT_DETECT" "2=Too high"
"LIMIT_DETECT" "9=Not applicable"
end data.
Now to work:
* sorting to make sure all labels are bunched together.
sort cases by varName CodeValueLabel.
string combineall (a300).
* adding ";" .
compute combineall=concat(rtrim(CodeValueLabel), ";").
* if this is the same varname as last row, attach the two together.
if $casenum>1 and varName=lag(varName)
combineall=concat(rtrim(lag(combineall)), " ", rtrim(combineall)).
exe.
*now to select only relevant lines - first I identify them.
match files /file=* /last=selectthis /by varName.
*now we can delete the rest.
select if selectthis=1.
exe.
NOTE: make combineall wide enough to contain all the values of your most populated variable.
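As a cross-check of the expected result, the same grouping can be sketched in plain Python. This is only an illustration of the group-and-concatenate logic, not SPSS syntax; the function name `group_and_concatenate` is made up here. Unlike the syntax above, it joins the labels without a space after each ";", which matches the target shown in the question.

```python
# (VarName, CodeValueLabel) pairs from the question.
pairs = [
    ("SEX", "F=Female"),
    ("SEX", "M=Male"),
    ("ICFORM", "1=Yes"),
    ("LIMIT_DETECT", "0=Too low"),
    ("LIMIT_DETECT", "1=Normal"),
    ("LIMIT_DETECT", "2=Too high"),
    ("LIMIT_DETECT", "9=Not applicable"),
]


def group_and_concatenate(pairs):
    """Append each label plus ';' to the running string of its variable."""
    combined = {}
    for var_name, label in pairs:
        # dicts keep insertion order, so variables appear in first-seen order.
        combined[var_name] = combined.get(var_name, "") + label + ";"
    return combined


result = group_and_concatenate(pairs)
for var_name, labels in result.items():
    print(var_name, labels)
```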

Lead on datastage for timestamp?

I need to use a Lead function in DataStage on a timestamp column, so that each row also shows the next row's timestamp minus 10 minutes. Has anyone done this before? When I tried, I found that I can't put strings into the timestamp column.
How can I solve this problem?
Input
Id | date. |
1 | 01/04/2016 13:45:25|
2 | 10/04/2016 01:25:36|
3 | 26/10/2017 22:35:13|
Output
Id| date. | Befor. |
1 | 01/04/2016 13:45:25 | 10/04/2016 01:15:36|
2 | 10/04/2016 01:25:36 | 26/10/2017 22:35:13|
3 | 26/10/2017 22:35:13 |null |

Combine multiple rows of the same column into one row

I have the following data from my database:
+------+-----------+-------------+---------------+
| ID | SomeValue | SomeDate | SomeOtherDate |
+------+-----------+-------------+---------------+
| 123 | 12345 | 01.01.2017 | 01.01.2018 |
| 123 | 54321 | 01.01.2017 | 01.01.2019 |
| 123 | 25314 | 01.01.2017 | 01.01.2020 |
+------+-----------+-------------+---------------+
I want the following format in Crystal Reports:
+------+---------------+---------------+---------------+
| ID | SomeValue2018 | SomeValue2019 | SomeValue2020 |
+------+---------------+---------------+---------------+
| 123 | 12345 | 54321 | 25314 |
+------+---------------+---------------+---------------+
How can I do this, if it's even possible? I've tried multiple examples but can't seem to make it work. I was able to create the headings successfully.
It is difficult to make Crystal Reports evaluate things horizontally. The entire system is designed to evaluate vertically. (Things on the top of a report are evaluated before things below them.)
However, you might be able to get a CrossTab to do this. First you'd want to make a new Formula field for your columns. It would be structured something like:
"SomeValue" + Year({#SomeOtherDate})
When you design the crosstab, you'll want to set ID as the Row, your new formula from above as the Column, and the Summarized Field will be SomeValue. You'll also want to suppress the Grand Totals in the Customize Style.
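To make the row/column/summarized-field roles concrete, here is a small Python sketch of what such a crosstab computes. It is not Crystal Reports code; the function name `crosstab` is made up, the dates are assumed to be strings in the `dd.mm.yyyy` form shown in the question, and summing is assumed as the summary operation.

```python
# (ID, SomeValue, SomeOtherDate) taken from the question's table.
records = [
    (123, 12345, "01.01.2018"),
    (123, 54321, "01.01.2019"),
    (123, 25314, "01.01.2020"),
]


def crosstab(records):
    """Row = ID, column = 'SomeValue' + year, cell = sum of SomeValue."""
    table = {}
    for row_id, value, other_date in records:
        year = other_date.split(".")[-1]      # "01.01.2018" -> "2018"
        column = "SomeValue" + year           # mirrors the formula field
        cells = table.setdefault(row_id, {})
        cells[column] = cells.get(column, 0) + value
    return table


pivoted = crosstab(records)
print(pivoted)
```

With one record per ID and year, as in the sample data, the sum is just the original value.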

Finding the last seven days in a time series

I have a spreadsheet with column A which holds a timestamp and updates daily. Column B holds a value. Like the following:
+--------------------+---------+
| 11/24/2012 1:14:21 | $487.20 |
| 11/25/2012 1:14:03 | $487.20 |
| 11/26/2012 1:14:14 | $487.20 |
| 11/27/2012 1:14:05 | $487.20 |
| 11/28/2012 1:13:56 | $487.20 |
| 11/29/2012 1:13:57 | $487.20 |
| 11/30/2012 1:13:53 | $487.20 |
| 12/1/2012 1:13:54 | $492.60 |
+--------------------+---------+
What I am trying to do is get the average of the last 7, 14, 30 days.
I've been playing with the GoogleClock() function to filter the dates in column A, but I can't find a way to subtract 7 days from TODAY. I suspect FILTER will also help, but I am a little bit lost.
There are a few ways to go about this; one way is to return an array of values with a QUERY function (this assumes a header row in row 1, and you want the last 7 dates):
=QUERY(A2:B;"select B order by A desc limit 7";0)
and you can wrap this in whatever aggregation function you like:
=AVERAGE(QUERY(A2:B;"select B order by A desc limit 7";0))
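The logic of that formula (sort by timestamp descending, keep the first n rows, average them) can be sketched in Python as follows. This is an illustration only, not spreadsheet code; the function name `average_of_latest` is made up, and the timestamps are assumed to use the month/day/year format shown in the question.

```python
from datetime import datetime

# (timestamp, value) rows from the question, with the "$" stripped.
data = [
    ("11/24/2012 1:14:21", 487.20),
    ("11/25/2012 1:14:03", 487.20),
    ("11/26/2012 1:14:14", 487.20),
    ("11/27/2012 1:14:05", 487.20),
    ("11/28/2012 1:13:56", 487.20),
    ("11/29/2012 1:13:57", 487.20),
    ("11/30/2012 1:13:53", 487.20),
    ("12/1/2012 1:13:54", 492.60),
]


def average_of_latest(rows, n):
    """Average the values of the n most recent rows (by timestamp)."""
    parsed = [(datetime.strptime(ts, "%m/%d/%Y %H:%M:%S"), value) for ts, value in rows]
    parsed.sort(key=lambda pair: pair[0], reverse=True)   # newest first
    latest = [value for _, value in parsed[:n]]
    if not latest:
        raise ValueError("no rows to average")
    return sum(latest) / len(latest)


avg_7 = average_of_latest(data, 7)
print(round(avg_7, 2))
```

Like the QUERY formula, this takes the last n rows rather than the last n calendar days; the two agree only when there is exactly one row per day.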

Copying a range remotely

I've got a table named FOO with the column ("Porc" |- 3 7 15 50 15 7 3) and I'm copying the numbers to another table, shown below. I'm doing it the hard way, cell for cell, but I was wondering if there is a way to copy that range of the remote table (A2 to the bottom) in a single command.
| Pr (%) | ROE de A | ROE de B |
|--------+----------+----------|
| 3 | -11.43 | -34.29 |
| 7 | 0. | -11.43 |
| 15 | 3.43 | 0. |
| 50 | 12. | 17.14 |
| 15 | 20.57 | 34.29 |
| 7 | 24. | 41.14 |
| 3 | 30.86 | 54.86 |
|--------+----------+----------|
| Média | 11.86 | 16.41 |
| Desvio | 8.37 | 17.61 |
#+TBLFM: @2$1=remote(FOO, A2)::@3$1=remote(FOO, A3)::@4$1=remote(FOO, A4)::etc
Thanks
It seems your answer is in the org-mode manual:
$3 = remote(FOO, @@#$2)
copy column 2 from table FOO into column 3 of the current table. For the second example, table FOO must have at least as many rows as the current table. Inefficient for large number of rows.
A Kind of Corollary: Copying all fields in a given row
So just as:
$3 = remote(FOO, @@#$2)
copies all the fields from a given column (column 2) into column 3 of the new table, then:
@3 = remote(FOO, @1$$#)
copies all the fields from a given row (row 1) into row 3.
There's something about how the standard reference form @r$c interacts with the @# and $# notation that makes this seem a bit abstruse. For example, this is all the org manual has to say about this remote reference syntax:
@# and $# can be used to get the row or column number of the field where the formula result goes.
Umm…?
I'm posting this example here because I found it all a bit mystifying, and I hope it helps someone else save a few minutes when dealing with rows and tables in the awesome org-mode.
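One way to picture what the two formulas do is as a per-field substitution: for every field being filled, the current-row (or current-column) placeholder inside remote(...) is replaced by that field's own row (or column) number before the lookup. The Python below is a toy model of that reading, not org-mode's implementation; the helper names are made up, and it ignores hlines and header rows.

```python
def remote(table, row, col):
    """1-based field lookup in another table, like a row/column reference."""
    return table[row - 1][col - 1]


def copy_column(source, target, source_col, target_col):
    """Column formula: each target row r looks up row r of the source column."""
    for r in range(1, len(target) + 1):
        # The row placeholder is replaced by r, the row being filled.
        target[r - 1][target_col - 1] = remote(source, r, source_col)


def copy_row(source, target, source_row, target_row):
    """Row formula: each target column c looks up column c of the source row."""
    for c in range(1, len(target[target_row - 1]) + 1):
        # The column placeholder is replaced by c, the column being filled.
        target[target_row - 1][c - 1] = remote(source, source_row, c)


foo = [[3, "a"], [7, "b"], [15, "c"]]

current = [[None, None, None] for _ in range(3)]
copy_column(foo, current, source_col=1, target_col=3)
print([row[2] for row in current])

wide = [[None, None] for _ in range(3)]
copy_row(foo, wide, source_row=1, target_row=3)
print(wide[2])
```

This also shows why the source table needs at least as many rows as the current one: the lookup for row r fails if the source has no row r.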