Scala dataframe calling function inside withColumn? - scala

Here is what I am trying to do. I have two tables that have exactly the same column names.
The tables look somewhat like this:
-----------
A B C D
-----------
1 2 3 4
5 6 3 4
7 8 3 4
The logic I need is: compare the A, B, C, D columns in Table1 with those in Table2. If A and B match each other, return a new column with value 0, else return 1. If C from Table1 is 3, return 0, else return 1. Only one value should be returned for each row, with priority C > D > A = B.
I joined the two tables (DataFrames), which results in combinedDf. This is how I join them: Table1.join(Table2, table1("A") === table2("A"))
So here is what I did:
def func(A: mutable.WrappedArray[String], B: mutable.WrappedArray[String], C: String, D: String) = {
  if (C == "3") "0"
  else if (D == "4") "1"
  else if ((0 to A.length - 1).exists(i => A(i) == B(i))) "0"
  else "1"
}
For this function I want to put the A, B columns from Table1 into one array and the A, B columns from Table2 into another array, and then run a loop to check for equality. (I need the arrays because in the real case I have a varying number of columns to compare.)
And here is how I call the function.
combinedDf.withColumn("returnVal",func(array(col("table1.A"),col("table1.B")),
array(col("table2.A"),col("table2.B")),col("table1.C"),col("table1.D")))
But it just doesn't work. Even though I put the columns inside arrays using the array function, it is still telling me there is a type mismatch.
Error Message: <console:67>: error: type mismatch; found:org.apache.spark.Column required: String
Thanks in advance!

You can try this. However, help me understand one thing: why do you need to combine the DataFrames, and what do you mean by "if A and B match"? (My assumption is per row; am I right?) If your A, B, C, D columns are strings, change Integer to String.
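import org.apache.spark.sql.functions.{col, udf}

// Returns "0" or "1" with priority C > D > A == B.
def func(A: Integer, B: Integer, C: Integer, D: Integer): String = {
  if (C == 3) "0"
  else if (D == 4) "1"
  else if (A == B) "0"
  else "1"
}

// Wrap the plain Scala function in a UDF so it can accept Column arguments.
val udfFunc = udf(func _)

combinedDf.withColumn("returnVal",
  udfFunc(col("table1.A"), col("table1.B"), col("table1.C"), col("table1.D")))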
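The priority rule itself (C first, then D, then the column comparison) can be checked outside Spark. Below is a minimal plain-Python sketch of that logic for a variable number of compared columns. The name return_val and the sample rows are made up for illustration, and the "any pair matches" test mirrors the `exists` call in the question rather than a confirmed requirement.

```python
def return_val(left, right, c, d):
    """Return "0" or "1" using the priority C > D > pairwise column match."""
    if c == "3":
        return "0"
    if d == "4":
        return "1"
    # Mirrors the question's `exists`: one matching pair is enough.
    # Swap any() for all() if every compared column has to match.
    return "0" if any(l == r for l, r in zip(left, right)) else "1"


# Hypothetical rows: (table1 values, table2 values, C, D)
rows = [
    (["1", "2"], ["1", "9"], "3", "4"),  # C is "3", so C decides
    (["5", "6"], ["5", "0"], "7", "4"),  # D is "4", so D decides
    (["7", "8"], ["7", "1"], "7", "7"),  # falls through, one pair matches
    (["7", "8"], ["0", "1"], "7", "7"),  # falls through, no pair matches
]
results = [return_val(*row) for row in rows]
print(results)
```

In Spark the same function body would still need to be wrapped in a UDF, as in the answer above, because withColumn expects a Column expression and not a plain value.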

Related

UDF function to check whether my input dataframe has duplicate columns or not using pyspark

I need to return boolean False if my input dataframe has duplicate columns with the same name. I wrote the code below. It identifies the duplicate columns in the input dataframe and returns them as a list. But when I call this function it must return a boolean value, i.e. if my input dataframe has duplicate columns with the same name it must return False.
#udf('string')
def get_duplicates_cols(df, df_cols):
    duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
    for i in duplicate_col_index:
        df_cols[i] = df_cols[i] + '_duplicated'
    df2 = df.toDF(*df_cols)
    cols_to_remove = [c for c in df_cols if '_duplicated' in c]
    return cols_to_remove
duplicate_cols = udf(get_duplicates_cols,BooleanType())
You don't need any UDF; you simply need a Python function. The check will run in Python, not in the JVM. So, as @Santiago P said, you can use checkDuplicate only:
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
Assuming that you pass the data frame to the function.
@udf(returnType=BooleanType())
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
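Since df.columns is an ordinary Python list of names, the check can be tried without a Spark session. The sketch below works on such a list directly. The helper names has_unique_columns and duplicated_columns are hypothetical and not part of PySpark.

```python
from collections import Counter


def has_unique_columns(columns):
    """Return False as soon as any column name occurs more than once."""
    return len(set(columns)) == len(columns)


def duplicated_columns(columns):
    """List each name that occurs more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name, n in counts.items() if n > 1]


cols = ["id", "name", "id", "price", "price"]  # stand-in for df.columns
print(has_unique_columns(cols))   # False
print(duplicated_columns(cols))   # ['id', 'price']
```

With a real DataFrame, the call would be has_unique_columns(df.columns).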

kdb apply function in select by row

I have a table
t: flip `S`V ! ((`$"|A|B|"; `$"|B|C|D|"; `$"|B|"); 1 2 3)
and some dicts
t1: 4 10 15 20 ! 1 2 3 5;
t2: 4 10 15 20 ! 0.5 2 4 5;
Now I need to add a column with values based on the substrings in S and the function below (which is a bit pseudocode because I am stuck here).
f:{[s;v];
if[`A in "|" vs string s; t:t1;];
else if[`B in "|" vs string s; t:t2;];
k: asc key t;
:t k k binr v;
}
problems are that s and v are passed in as full column vectors when I do something like
update l:f[S,V] from t;
How can I make this an operation that works by row?
How can I make this a vectorized function?
Thanks
You will want to use the each-both adverb to apply a function over two columns by row.
In your case:
update l:f'[S;V] from t;
To help with your pseudocode function, you might want to use $, the if-else operator, e.g.
f:{[s;v]
t:$["A"in ls:"|"vs string s;t1;"B"in ls;t2;()!()];
k:asc key t;
:t k k binr v;
};
You've not mentioned a final else clause in your pseudocode but $ expects one hence the empty dictionary at the end.
Also note that in your table the column S has been cast to symbols. vs expects a string to split, so I've had to use the string operation. This could be removed if you are able to redefine your original table.
Hope this helps!
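For readers who do not use q, the row-wise lookup can be sketched in Python. This is only an illustration of the intended logic: lookup, T1 and T2 are stand-ins for f, t1 and t2, and bisect_left is used as the counterpart of binr (index of the first key that is greater than or equal to v).

```python
from bisect import bisect_left

T1 = {4: 1, 10: 2, 15: 3, 20: 5}
T2 = {4: 0.5, 10: 2, 15: 4, 20: 5}


def lookup(s, v):
    """Pick a dictionary from the tokens in s, then look up the first key >= v."""
    parts = s.split("|")
    if "A" in parts:
        table = T1
    elif "B" in parts:
        table = T2
    else:
        return None  # no matching token, like the empty dictionary in the q version
    keys = sorted(table)
    i = bisect_left(keys, v)  # first key >= v
    return table[keys[i]] if i < len(keys) else None


S = ["|A|B|", "|B|C|D|", "|B|"]
V = [1, 2, 3]
# zip pairs the columns row by row, which is what each-both does in q
result = [lookup(s, v) for s, v in zip(S, V)]
print(result)
```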

q - internal state in while loop

In q, I am trying to call a function f on an incrementing argument id while some condition is not met.
The function f creates a table of random length (between 1 and 5) with a column identifier which is dependent on the input id:
f:{[id] len:(1?1 2 3 4 5)0; ([] identifier:id+til len; c2:len?`a`b`c)}
Starting with id=0, f should be called while (count f[id])>1, i.e. until a table of length 1 is produced. The id should be incremented at each step.
With the "repeat" adverb I can do the while condition and the starting value:
{(count x)>1} f/0
but how can I keep incrementing the id?
Not entirely sure if this will fix your issue but I was able to get it to work by incrementing id inside the function and returning it with each iteration:
q)g:{[id] len:(1?1 2 3 4 5)0; id[0]:([] identifier:id[1]+til len; c2:len?`a`b`c);@[id;1;1+]}
In this case id is a 2 element list where the first element is the table you are returning (initially ()) and the second item is the id. By amending the exit condition I was able to make it stop whenever the output table had a count of 1:
q)g/[{not 1=count x 0};(();0)]
+`identifier`c2!(,9;,`b)
10
If you just need the table you can run first on the output of the above expression:
q)first g/[{not 1=count x 0};(();0)]
identifier c2
-------------
3 a
The issue with the function f is that when using over and scan the output of each iteration becomes the input for the next. In your case the function works on a numeric value, but on the second pass it would get passed a table.
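The same idea, carrying both the latest table and the next id through every iteration, can be sketched in Python. This is an illustration under assumptions, not a translation of the q session: make_table, step and iterate_until_single are hypothetical names, and the table is modelled as a list of dicts.

```python
import random


def make_table(start_id, rng):
    """Stand-in for f: a table of 1 to 5 rows whose identifiers start at start_id."""
    length = rng.randint(1, 5)
    return [{"identifier": start_id + i, "c2": rng.choice("abc")}
            for i in range(length)]


def step(state, rng):
    """Stand-in for g: take (table, id) and return (new table, id + 1)."""
    _, current_id = state
    return (make_table(current_id, rng), current_id + 1)


def iterate_until_single(rng):
    state = ([], 0)  # same starting state as (();0)
    while len(state[0]) != 1:  # same exit test as not 1=count x 0
        state = step(state, rng)
    return state


table, next_id = iterate_until_single(random.Random(0))
print(table, next_id)
```

Because the state that comes out has the same shape as the state that goes in, the loop can keep feeding its own output back in, which is the property over and scan rely on.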

How can I filter my array of numbers in Matlab/Octave?

I have a very trivial example where I'm trying to filter by matching a String:
A = [0:1:999];
B = A(int2str(A) == '999');
This
A(A > 990);
works
This
int2str(5) == '5'
also works
I just can't figure out why I cannot put the two together. I get an error about nonconformant arguments.
int2str(A) produces a very long char array (of size 1 x 4996) containing the string representations of all those numbers (including spacing) appended together end to end.
int2str(A) == '999'
So, in the statement above, you're trying to compare a matrix of size 1 x 4996 with another of size 1 x 3. This, of course, fails as the two either need to be of the same size, or at least one needs to be a scalar, in which case scalar expansion rules apply.
A(A > 990);
The above works because of logical indexing rules, the result will be the elements from the indices of A for which that condition holds true.
int2str(5) == '5'
This only works because the result of the int2str call is a 1 x 1 matrix ('5') and you're comparing it to another matrix of the same size. Try int2str(555) == '55' and it'll fail with the same error as above.
I'm not sure what result you expected from the original statements, but maybe you're looking for this:
A = [0:1:999];
B = int2str(A(A == 999)) % outputs '999'
I am not sure that the int2str() conversion is what you are looking for. (Also, why do you need to convert numbers to strings and then carry out a char comparison?)
Suppose you have a simpler case:
A = 1:3;
strA = int2str(A)
strA =
1 2 3
Note that this is a 1x7 char array. Thus, comparing it against a scalar char:
strA == '2'
ans =
0 0 0 1 0 0 0
Now, you might wanna transpose A and carry out the comparison:
int2str(A')=='2'
ans =
0
1
0
however, this approach will not work if the number of digits of each number is not the same because lower numbers will be padded with spaces (try creating A = 1:10 and comparing against '2').
Then, create a cell array of string without whitespaces and use strcmp():
csA = arrayfun(@int2str,A','un',0)
csA =
'1'
'2'
'3'
strcmp('2',csA)
Turning the string into a number, rather than the other way around, should be much faster and is also correct. Try
B = A(A == str2double ('999'));
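The difference between the two directions of conversion is not specific to Matlab/Octave. As a rough Python sketch (the variable names are illustrative only), converting the target once and comparing numbers gives the same result as converting every element to a string, with a single conversion instead of one per element:

```python
A = list(range(1000))
target = "999"

# Convert the string once, then compare numbers element by element.
by_number = [a for a in A if a == int(target)]

# Convert every element to its own string, then compare strings.
by_string = [a for a in A if str(a) == target]

print(by_number, by_string)
```

The failing statement in the question corresponds to neither of these: it compares one long concatenated string against a three-character string.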

Excel - Combine multiple columns into one column

I have multiple lists that are in separate columns in excel. What I need to do is combine these columns of data into one big column. I do not care if there are duplicate entries, however I want it to skip row 1 of each column.
Also, what if ROW1 has headers from January to December, and the columns have different lengths and need to be combined into one big column?
ROW1| 1 2 3
ROW2| A D G
ROW3| B E H
ROW4| C F I
should combine into
A
B
C
D
E
F
G
H
I
The first row of each column needs to be skipped.
Try this. Click anywhere in your range of data and then use this macro:
Sub CombineColumns()
    Dim rng As Range
    Dim iCol As Integer
    Dim lastCell As Integer
    Set rng = ActiveCell.CurrentRegion
    lastCell = rng.Columns(1).Rows.Count + 1
    For iCol = 2 To rng.Columns.Count
        Range(Cells(1, iCol), Cells(rng.Columns(iCol).Rows.Count, iCol)).Cut
        ActiveSheet.Paste Destination:=Cells(lastCell, 1)
        lastCell = lastCell + rng.Columns(iCol).Rows.Count
    Next iCol
End Sub
You can combine the columns without using macros. Type the following function in the formula bar:
=IF(ROW()<=COUNTA(A:A),INDEX(A:A,ROW()),IF(ROW()<=COUNTA(A:B),INDEX(B:B,ROW()-COUNTA(A:A)),IF(ROW()>COUNTA(A:C),"",INDEX(C:C,ROW()-COUNTA(A:B)))))
The statement uses 3 IF functions, because it needs to combine 3 columns:
For column A, the function compares the row number of a cell with the total number of cells in A column that are not empty. If the result is true, the function returns the value of the cell from column A that is at row(). If the result is false, the function moves on to the next IF statement.
For column B, the function compares the row number of a cell with the total number of non-empty cells in the range A:B. If the result is true, the function returns the value from column B at row ROW()-COUNTA(A:A), i.e. the position counted from the top of column B. If false, the function moves on to the next IF statement.
For column C, the function checks whether the row number of a cell exceeds the total number of non-empty cells in the range A:C. If the result is true, the function returns a blank cell and does no further calculation. If false, the function returns the value from column C at row ROW()-COUNTA(A:B).
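The index arithmetic behind the nested IFs may be easier to follow outside a spreadsheet. The Python sketch below reproduces it for any number of columns, assuming each column is a contiguous list with no blank cells. stacked_value and combine are hypothetical helper names, and combine additionally skips the first row of each column as the question asks.

```python
def stacked_value(row, columns):
    """Value the nested IF would show at 1-based `row`, or "" past the end."""
    offset = 0  # number of cells consumed by the columns already passed
    for column in columns:
        if row <= offset + len(column):
            return column[row - offset - 1]
        offset += len(column)
    return ""


def combine(columns, skip_header=True):
    """Stack all columns into one list, optionally dropping each first row."""
    start = 1 if skip_header else 0
    return [value for column in columns for value in column[start:]]


columns = [
    ["1", "A", "B", "C"],
    ["2", "D", "E", "F"],
    ["3", "G", "H", "I"],
]
print(combine(columns))
print(stacked_value(5, columns))
```

Columns of different lengths work the same way, because the offset is built from each column's own length.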
I created an example spreadsheet here of how to do this with simple Excel formulae, and without use of macros (you will need to make your own adjustments for getting rid of the first row, but this should be easy once you figure out how my example spreadsheet works):
https://docs.google.com/a/umich.edu/spreadsheet/ccc?key=0AuSyDFZlcRtHdGJOSnFwREotRzFfM28tWElpZ1FaR2c#gid=0
Not sure if this completely helps, but I had an issue where I needed a "smart" merge. I had two columns, A & B. I wanted to move B over only if A was blank. See below. It is based on a selection Range, which you could use to offset the first row, perhaps.
Private Sub MergeProjectNameColumns()
    Dim rngRowCount As Integer
    Dim i As Integer
    'Loop through column C and simply copy the text over to B if it is not blank
    rngRowCount = Range(dataRange).Rows.Count
    ActiveCell.Offset(0, 0).Select
    ActiveCell.Offset(0, 2).Select
    For i = 1 To rngRowCount
        If (Len(RTrim(ActiveCell.Value)) > 0) Then
            Dim currentValue As String
            currentValue = ActiveCell.Value
            ActiveCell.Offset(0, -1) = currentValue
        End If
        ActiveCell.Offset(1, 0).Select
    Next i
    'Now delete the unused column
    Columns("C").Select
    selection.Delete Shift:=xlToLeft
End Sub
Function Concat(myRange As Range, Optional myDelimiter As String) As String
    Dim r As Range
    Application.Volatile
    For Each r In myRange
        If Len(r.Text) Then
            Concat = Concat & IIf(Concat <> "", myDelimiter, "") & r.Text
        End If
    Next
End Function