Scalding: Create list from column in Pipe


I need to take a pipe that has a column of labels with associated values, and pivot that pipe so that there is a column for each label with the correct values in each column. So, for example, if I have this:
Id Label Value
1 Red 5
1 Blue 6
2 Red 7
2 Blue 8
3 Red 9
3 Blue 10
I need to turn it into this:
ID Red Blue
1 5 6
2 7 8
3 9 10
I know how to do this using the pivot command, but I have to explicitly know the values of the labels. How can I dynamically read the labels from the "label" column into a list that I can then pass into the pivot command? I have tried to create the list with:
pipe.groupBy('id) {_.toList('label) }
but I get a type mismatch saying it found a Symbol but is expecting (cascading.tuple.Fields, cascading.tuple.Fields). Also, from reading online, it sounds like using toList is frowned upon. The number of distinct values in 'label is finite and not that big (maybe 30-50 items), but it may differ depending on which sample of data I am working with.
Any suggestions you have would be great. Thanks very much!

I think you're on the right track; you just need to map the desired values to Symbols:
val newHeaders = lines
.map(_.split(" "))
.map(a=>a(1))
.distinct
.map(f=>Symbol(f))
.toList
The Execution type will help you combine this with the subsequent pivot, for performance reasons.
Note that I'm using a TypedPipe for the lines variable.
If you want your code to be super-concise, you could combine the first two map calls into one, but it's just a stylistic choice:
map(_.split(" ")(1))

Try using Execution to get the list of values from the data. More info on executions: https://github.com/twitter/scalding/wiki/Calling-Scalding-from-inside-your-application
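As a language-neutral illustration of the two-pass idea in the answers above (first collect the distinct labels, then pivot on them), here is a minimal Python sketch. It is not Scalding code and does not use the Execution API; the rows mirror the question's example, and distinct_labels and pivot are hypothetical helper names.

```python
from collections import OrderedDict

# Sample rows mirroring the question: (id, label, value).
rows = [
    (1, "Red", 5), (1, "Blue", 6),
    (2, "Red", 7), (2, "Blue", 8),
    (3, "Red", 9), (3, "Blue", 10),
]

def distinct_labels(rows):
    """Pass 1: collect labels in first-seen order, without duplicates."""
    return list(OrderedDict.fromkeys(label for _, label, _ in rows))

def pivot(rows, labels):
    """Pass 2: build one output row per id, with one column per label."""
    by_id = {}
    for row_id, label, value in rows:
        by_id.setdefault(row_id, {})[label] = value
    # An (id, label) pair that never occurs in the input becomes None.
    return [(row_id, *(cols.get(l) for l in labels))
            for row_id, cols in by_id.items()]

labels = distinct_labels(rows)
table = pivot(rows, labels)
print(["ID"] + labels)
for row in table:
    print(row)
```

The point of the split is that the label list is materialised once, before the pivot runs, so the pivot never needs the labels hard-coded.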

Related

How to put multiple where statements into function on kdb+

I'm trying to write a function using kdb+ which will look at the list, and find the values that simply meet two conditions.
Let's call the list DR (for data range). And I want a function that will combine these two conditions
"DR where (DR mod 7) in 2"
and
"DR where (DR.dd) in 1"
I'm able to apply them one at a time, but I really need to combine them into one function. I was hoping I could do this
"DR where (DR.dd mod 7) in 2 and DR where (DR.dd) in 1"
but this obviously didn't work. Any advice?

You can utilize the and function to help with this, which is the same as &:
q)dr:.z.d+til 100
q)and
&
q)2=dr mod 7
10000001000000100000010000001000000100000010000001000000100000010000001000000..
q)1=dr.dd
00000000000000000000000001000000000000000000000000000000100000000000000000000..
q)(1=dr.dd)&2=dr mod 7
00000000000000000000000000000000000000000000000000000000100000000000000000000..
q)dr where(1=dr.dd)&2=dr mod 7
2021.02.01 2021.03.01
It's necessary to wrap the first part in brackets because kdb evaluates code from right to left. The format changes slightly in a where clause: the brackets aren't needed there, because each clause between the commas is parsed separately. It is essentially doing the same thing as the code above, though.
q)t:([]date:dr)
q)select from t where 1=date.dd,2=date mod 7
date
----------
2021.02.01
2021.03.01

You could also do this using min to achieve similar results, like so:
DR where min(1=DR.dd;2=DR mod 7)
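For readers less familiar with q, the mask-combining idea can be sketched in Python. This is only an illustration under assumptions: the start date 2021-01-04 is an arbitrary stand-in for .z.d, and the modulo is taken on the day count from 2000-01-01, which is how q represents dates internally.

```python
from datetime import date, timedelta

KDB_EPOCH = date(2000, 1, 1)  # q dates are day counts from 2000.01.01

start = date(2021, 1, 4)      # assumed start date, standing in for .z.d
dr = [start + timedelta(days=i) for i in range(100)]

# One boolean mask per condition, the same length as dr.
is_first_of_month = [d.day == 1 for d in dr]
mod7_is_2 = [(d - KDB_EPOCH).days % 7 == 2 for d in dr]

# Element-wise AND of the two masks, then keep the dates where it is true.
both = [a and b for a, b in zip(is_first_of_month, mod7_is_2)]
selected = [d for d, keep in zip(dr, both) if keep]
print(selected)
```

Each condition yields a full-length mask, and the filter is applied once to the combined mask rather than once per condition.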

How to import dates correctly from this .csv file into Matlab?

I have a .csv file with the first column containing dates, a snippet of which looks like the following:
date,values
03/11/2020,1
03/12/2020,2
3/14/20,3
3/15/20,4
3/16/20,5
04/01/2020,6
I would like to import this data into Matlab (I think the best way would probably be using the readtable() function). My goal is to bring the dates into Matlab as a datetime array. As you can see above, the problem is that the dates in the original .csv file are not consistently formatted. Some of them are in the format mm/dd/yyyy and some of them are mm/dd/yy.
Simply calling data = readtable('myfile.csv') on the .csv file results in the following, which is not correct:
'03/11/2020' 1
'03/12/2020' 2
'03/14/0020' 3
'03/15/0020' 4
'03/16/0020' 5
'04/01/2020' 6
Does anyone know a way to automatically account for this type of data in the import?
Thank you!
My version: Matlab R2017a
EDIT ---------------------------------------
Following the suggestion of Max, I have tried specifying some of the input options for the read command using the following:
T = readtable('example.csv',...
'Format','%{dd/MM/yyyy}D %d',...
'Delimiter', ',',...
'HeaderLines', 0,...
'ReadVariableNames', true)
which results in:
date values
__________ ______
03/11/2020 1
03/12/2020 2
NaT 3
NaT 4
NaT 5
04/01/2020 6
and you can see that this is not working either.

If you are sure none of the dates involved go back more than 100 years, you can easily apply the pivot method that was in use in the last century (before the Y2K bug warned the world of the danger of the method).
Dates used to be coded with 2-digit years only, on the understanding that 87 actually meant 1987. A user (or a computer) would add the missing century automatically.
In your case, you can read the full table and parse the dates; it is then easy to detect which dates are inconsistent. Identify them, correct them, and you are good to go.
With your example:
a = readtable(tfile) ; % read the file
dates = datetime(a.date) ; % extract first column and convert to [datetime]
idx2change = dates.Year < 2000 ; % Find which dates were in the short format
dates.Year(idx2change) = dates.Year(idx2change) + 2000 ; % Correct truncated years
a.date = dates % reinject corrected [datetime] array into the table
yields:
a =
date values
___________ ______
11-Mar-2020 1
12-Mar-2020 2
14-Mar-2020 3
15-Mar-2020 4
16-Mar-2020 5
01-Apr-2020 6
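The same year-correction logic can be sketched outside MATLAB. The Python snippet below is only an illustration of the idea, not a replacement for the MATLAB code: parse_mdy is a hypothetical helper, and the assumption that every 2-digit year belongs to the 2000s is the same one the answer makes.

```python
from datetime import date

# The date column from the question, with mixed 4-digit and 2-digit years.
raw = ["03/11/2020", "03/12/2020", "3/14/20", "3/15/20", "3/16/20", "04/01/2020"]

def parse_mdy(text, century=2000):
    """Parse month/day/year, widening a 2-digit year by the assumed century."""
    month, day, year = (int(part) for part in text.split("/"))
    if year < 100:  # short format such as 3/14/20
        year += century
    return date(year, month, day)

dates = [parse_mdy(s) for s in raw]
for d in dates:
    print(d.isoformat())
```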

Instead of specifying the format explicitly (as I also suggested before), one should use import options; in the case of a CSV file, use delimitedTextImportOptions:
opts = delimitedTextImportOptions('NumVariables',2,...% how many variables per row?
'VariableNamesLine',1,... % is there a header? If yes, on which line are the variable names?
'DataLines',2,... % on which line does the actual data start?
'VariableTypes',{'datetime','double'})% the data types the variables should be read as
readtable('myfile.csv',opts)
This neat little feature recognizes the datetime format automatically, because it knows the column must be a datetime object =)

Find rows where string contains certain character at specific place

I have a field in my database, that contains 10 characters:
E.g.: 1234567891
I want to look for the rows where the field has, e.g., the numbers 8 and 9 in places 5 and 6.
So for example,
if the rows are
a) 1234567891
b) 1234897891
c) 1234877891
I only want b) returned in my select.
The type of the field is string/character varying.
I have tried using:
where field like '%89%'
but that won't work, because I need it to be 89 at a specific place in the string.

The fastest solution would be
WHERE substr(field, 5, 2) = '89'
If the positions are not adjacent, you end up with two conditions joined with AND.

You should be able to match any single character using the underscore (_) wildcard, so you can use it as follows:
where field like '____89%'
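To check how the two predicates behave on the three sample rows, here is a small runnable sketch using Python's built-in sqlite3 module. SQLite is an assumption made only so the example is self-contained; the question does not name its database, and the table name t and column name name are hypothetical. Both substr (1-based start position) and the _ wildcard in LIKE are standard enough to behave the same way here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, field TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", "1234567891"), ("b", "1234897891"), ("c", "1234877891")],
)

# substr is 1-based: start at position 5 and take 2 characters.
by_substr = [row[0] for row in conn.execute(
    "SELECT name FROM t WHERE substr(field, 5, 2) = '89' ORDER BY name")]

# Four underscores skip positions 1-4, then '89' must follow; % matches the rest.
by_like = [row[0] for row in conn.execute(
    "SELECT name FROM t WHERE field LIKE '____89%' ORDER BY name")]

conn.close()
print(by_substr, by_like)
```

Row a) also contains "89" at positions 8 and 9, which is why an unanchored '%89%' pattern matches it too.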

Range operator [3..max?] for selecting elements from an array [duplicate]

How can I get the array element range of first to second last?
For example,
$array = 1,2,3,4,5
$array[0] - will give me the first (1)
$array[-2] - will give me the second last (4)
$array[0..2] - will give me first to third (1,2,3)
$array[0..-2] - I'm expecting to get first to second last (1,2,3,4) but I get 1,5,4 ???
I know I can do long hand and go for($x=0;$x -lt $array.count;$x++), but I was looking for the square bracket shortcut.

You just need to calculate the end index, like so:
$array[0..($array.length - 2)]
Do remember to check that you actually have at least two entries in your array first, otherwise you'll find yourself getting duplicates in the result.
An example of such a duplicate would be:
@(1)[0..-1]
which, from an array containing a single 1, gives the following output:
1
1

There might be a situation where you are processing a list, but you don't know the length. Select-object has a -skiplast parameter.
(1,2,3,4,5 | select -skiplast 2)
1
2
3

As mentioned earlier, the best solution here is:
$array[0..($array.length - 2)]
The problem you ran into with $array[0..-2] can be explained by the nature of the "0..-2" expression and the range operator ".." in PowerShell. If you evaluate just the "0..-2" part in PowerShell, you will see that the result is an array of the numbers from 0 down to -2.
>> 0..-2
0
-1
-2
So $array[0..-2] in PowerShell is the same as $array[0,-1,-2]. That's why you get 1, 5, 4 instead of 1, 2, 3, 4.
It could be kind of counterintuitive at first especially if you have some Python or Ruby background, but you need to take it into account when using PowerShell.
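Since the answer contrasts this with Python, the behaviour can be mimicked there. In the sketch below, ps_range is a hypothetical helper that imitates how PowerShell's ".." expands into an explicit list of indices, counting down when the end is smaller than the start; it is an illustration only, not PowerShell's implementation.

```python
def ps_range(start, end):
    """Imitate PowerShell's start..end: inclusive, counting down if end < start."""
    step = 1 if end >= start else -1
    return list(range(start, end + step, step))

array = [1, 2, 3, 4, 5]

# 0..-2 expands to the index list 0, -1, -2 before any indexing happens.
indices = ps_range(0, -2)
picked = [array[i] for i in indices]

# Computing the end index explicitly gives first to second last.
first_to_second_last = [array[i] for i in ps_range(0, len(array) - 2)]

print(indices, picked, first_to_second_last)
```

A Python slice such as array[0:-1] treats -1 as an end position, whereas here every generated number is used as an individual index.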

Robert Westerlund's answer is excellent.
I just saw the approach below on the "Everything you wanted to know about arrays" page and wanted to try it out.
I like it because it describes exactly what the goal is: end one short of the upper bound.
$array[0..($array.GetUpperBound(0) - 1)]
1
2
3
4
I used this variation of your original attempt to uninstall all but the latest version from Get-InstalledModule. It's really short, but not perfect: if there are more than 9 items, it still returns just 8. You can use a larger negative number to cover that, though.
$array[-9..-2]
1
2
3
4

Looping through a set of variables sharing the same prefix

I routinely work with student exam files, where each response to an exam item is recorded in points. I want to transform that variable into 1 or 0 (effectively reducing each item to Correct or Incorrect).
Every dataset has the same nomenclature, where the variable is prefixed with points_ and then followed by an identification number (e.g., points_18616). I'm using the following syntax:
RECODE points_18616 (0=Copy) (SYSMIS=SYSMIS) (ELSE=1) INTO Binary_18616.
VARIABLE LABELS Binary_18616 'Binary Conversion of Item_18616'.
EXECUTE.
So I end up creating this syntax for each variable, and since every dataset is different, it gets tedious. Is there a way to loop through a dataset and perform this transformation on all variables that are prefixed with points_?

Here is a way to do it:
First I'll create a little fake data to demonstrate on:
data list list/points_18616 points_18617 points_18618 points_18619 (4f2).
begin data
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 9
end data.
* the following code will create a list of all the relevant variables in a new file.
SPSSINC SELECT VARIABLES MACRONAME="!list" /PROPERTIES PATTERN = "points_*".
* now we'll use the created list in a macro to loop your syntax over all the vars.
define !doList ()
!do !lst !in(!eval(!list))
RECODE !lst (0=Copy) (SYSMIS=SYSMIS) (ELSE=1) INTO !concat("Binary", !substr(!lst,7)).
VARIABLE LABELS !concat("Binary", !substr(!lst,7)) !concat("'Binary Conversion of Item",!substr(!lst,7) ,"'.").
!doend
!enddefine.
!doList.
EXECUTE.
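The logic the macro applies (find every variable with the points_ prefix, recode it, and store the result under a matching Binary_ name) can also be sketched in plain Python. This is only an illustration of the loop, not SPSS syntax: the record dictionary and the helper names to_binary and add_binary_columns are hypothetical, and None stands in for SYSMIS.

```python
def to_binary(points):
    """Recode one score: missing stays missing, 0 stays 0, anything else is 1."""
    if points is None:  # None plays the role of SYSMIS
        return None
    return 0 if points == 0 else 1

def add_binary_columns(record, prefix="points_"):
    """Add a Binary_<id> entry for every key that starts with the prefix."""
    out = dict(record)
    for name, value in record.items():
        if name.startswith(prefix):
            out["Binary_" + name[len(prefix):]] = to_binary(value)
    return out

record = {"student": "s1", "points_18616": 4, "points_18617": 0, "points_18618": None}
result = add_binary_columns(record)
print(result)
```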
