PySpark: pass multiple columns in pandas_udf

My problem is similar to this one, but instead of udf I need to use pandas_udf.
I have a Spark DataFrame with many columns (the number of columns varies) and I need to apply a custom function to them (for example, sum). I know I can hard-code column names, but that does not work when the number of columns varies.
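A minimal sketch of one common approach, assuming Spark 3.x (with pyarrow installed) and a hypothetical DataFrame df whose selected columns are all numeric: bundle the columns into a struct, which the pandas UDF receives as a single pandas DataFrame, so the same function works however many columns there are.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

cols = df.columns  # any list of numeric column names; can vary at runtime

@pandas_udf("double")
def row_sum(batch: pd.DataFrame) -> pd.Series:
    # the struct column arrives as a pandas DataFrame with one field per column
    return batch.sum(axis=1)

result = df.withColumn("total", row_sum(F.struct(*cols)))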

Related

PySpark Feature Transformation: QuantileTransformer with uniform distribution of the output

(See the scikit-learn documentation for QuantileTransformer.)
What it essentially does is normalize the data so that each data point falls into a bucket between 0 and 1 (a percentile rank?), and I assume each of these buckets would have an equal number of data points.
I would like to use this quantile transformation with PySpark. There is a QuantileDiscretizer in PySpark, but it doesn't do exactly what I am looking for. It also returns fewer buckets than requested in the input parameters: the line of code below returns only 81 distinct buckets on a data set with millions of rows, where min(col_1) is 0 and max(col_1) is 20000.
discretizer_1 = QuantileDiscretizer(numBuckets=100, inputCol="col_1", outputCol="result")
So is there a way I can uniformly normalize my data, either using QuantileDiscretizer or otherwise using PySpark?
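For what it's worth, one way to get a uniform value in [0, 1] for every row without bucketing at all is the percent_rank window function; a rough sketch (assumes a DataFrame df with a numeric column col_1):

from pyspark.sql import functions as F
from pyspark.sql import Window

# percent_rank gives each row its percentile rank in [0, 1],
# i.e. a uniform quantile transform of col_1
w = Window.orderBy(F.col("col_1"))
df_uniform = df.withColumn("col_1_uniform", F.percent_rank().over(w))

Note that a window with no partitioning pulls all rows into a single partition, so this only scales so far; for very large data, QuantileDiscretizer with a smaller relativeError may get closer to the requested number of buckets.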

How do I map and remap string values to Int or Double in Scala

I have a data file with several columns. I am performing some mathematical computations on the values; for that purpose, I want to map my non-integer columns to Int, and after the operations on the values I want to map them back.
Following are my column values:
atom_id,molecule_id,element,type,charge
d100_1,d100,c,22,-0.128
d100_10,d100,h,3,0.132
d100_11,d100,c,29,0.002
d100_12,d100,c,22,-0.128
d100_13,d100,c,22,-0.128
Suppose I want to map only two columns and then remap only those columns' values. I have searched for methods and found StringIndexer, but it maps all of the columns of the DF; I need to map only specific columns and then remap the values of those specific columns. Any help will be appreciated.
//edited Part
I have the following columns in my DataFrame
ind1,inda,logp,lumo,mutagenic,element
1,0,4.23,-1.246,yes,c
1,0,4.62,-1.387,yes,b
0,0,2.68,-1.034,no,h
1,0,6.26,-1.598,yes,c
1,0,2.4,-3.172,yes,a
Basically I am writing code for synthetic data generation based on the given input data, so I want to use the column values (ind1, inda, logp, lumo, mutagenic, element) one row at a time; after applying some math functions to a row, I will get a row of 6 values, each representing the corresponding column value.
Now the problem is that all column values are of type double except mutagenic and element. I want to map the mutagenic and element columns to double values (for example, yes to 0 and no to 1) so that I can use them, and then, when I receive the output row, reverse-map the generated mutagenic value back to the corresponding string value using that mapping.
I hope this is clearer this time.
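For illustration only, here is a rough PySpark sketch of one common pattern (the Scala spark.ml API is essentially the same): run StringIndexer on just the chosen columns, do the numeric work, then use IndexToString with the fitted model's labels to map back. The DataFrame df and the column names below are assumptions taken from the example above.

from pyspark.ml.feature import StringIndexer, IndexToString

# index only the chosen string column; all other columns are left untouched
model = StringIndexer(inputCol="mutagenic", outputCol="mutagenic_idx").fit(df)
indexed = model.transform(df)

# ... numeric computations / synthetic-row generation happen here ...

# map the generated index back to its original string label
back = IndexToString(inputCol="mutagenic_idx", outputCol="mutagenic_out",
                     labels=model.labels)
restored = back.transform(indexed)

# the element column can be handled the same way with a second indexer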

How can I convert one column data to a vector using Spark Scala

I am using Spark with Scala to process data. I have one question I couldn't figure out. I have a DataFrame with a single column:
data
1
2
3
4
5
I want it as a single vector:
[1.0,2.0,3.0,4.0,5.0]
How can I implement it? I tried
df.collect().toVector and rdd.foreach, but every time it returns an array of vectors [1.0], [2.0], [3.0], [4.0], [5.0], not one single vector.
This is happening because when you collect a DataFrame you get an Array of Row objects. You need to extract the values from the rows.
df.collect().map(x => x.getDouble(0)).toVector

Looking for a way to Calculate Frequency distribution of a dataframe in spark/scala

I want to calculate the frequency distribution (return the most common element in each column and the number of times it appears) of a dataframe using Spark and Scala. I've tried using the DataFrameStatFunctions library, but after I filter my dataframe for only numeric-type columns, I can't apply any functions from the library. Is the best way to do this to create a UDF?
You can use:
val newDF = df.groupBy("columnName").count()
newDF.show()
It will show you the frequency count for each unique entry; ordering that result by the count column puts the most common value first (see the sketch below).
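If what you need is specifically the most common element per column, one option is to order each per-column count by frequency; a sketch (PySpark shown; the Scala calls are analogous, assuming a DataFrame df):

from pyspark.sql import functions as F

# for each column, show its most frequent value and how often it appears
for c in df.columns:
    df.groupBy(c).count().orderBy(F.desc("count")).show(1)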

When to use a cell, matrix, or table in Matlab

I am fairly new to MATLAB and I am trying to figure out when it is best to use cells, tables, or matrices to store sets of data and then work with the data.
What I want is to store data that has multiple lines including strings and numbers, and then work with the numbers.
For example, a line would look like:
'string 1', time, number1, number2
I know a matrix works best if all elements are numbers, but when I use a cell I keep having to convert the numbers or strings to a matrix in order to work with them. I am running MATLAB 2012, so maybe that is part of the problem. Any help is appreciated. Thanks!
Use a matrix when:
the tabular data has a uniform type (all are floating points like double, or integers like int32);
& either the amount of data is small, or is big and has static (predefined) size;
& you care about the speed of accessing data, or you need matrix operations performed on data, or some function requires the data organized as such.
Use a cell array when:
the tabular data has heterogeneous type (mixed element types, "jagged" arrays etc.);
| there's a lot of data and has dynamic size;
| you need only indexing the data numerically (no algebraic operations);
| a function requires the data as such.
Same argument for structs, only the indexing is by name, not by number.
Not sure about tables; I don't think they are offered by the language itself; they might be a UDT that I don't know of...
Later edit
These three types may be combined, in the sense that cell arrays and structs may have matrices, cell arrays, and structs as elements (because they're heterogeneous containers). In your case, you might have two approaches, depending on how you need to access the data:
if you access the data mostly by row, then an array of N structs (one struct per row) with 4 fields (one field per column) would be the most effective in terms of performance;
if you access the data mostly by column, then a single struct with 4 fields (one field per column) would do; the first field would be a cell array of strings for the first column, the second field would be a cell array of strings or a 1D matrix of doubles depending on how you want to store your dates, and the rest of the fields are 1D matrices of doubles.
Concerning tables: I always used matrices or cell arrays until I had to do database-related things such as joining datasets by a unique key; the only way I found to do this was by using tables. It takes a while to get used to them, and it's a bit annoying that some functions that work on cell arrays don't work on tables, and vice versa. MATLAB could have done a better job explaining when to use one or the other, because it's not super clear from the documentation.
The situation that you describe seems to be as follows:
You have several columns, each consisting of a single data type, and all columns have an equal number of rows.
This seems to match exactly the recommended situation for using a table:
T = table(var1,...,varN) creates a table from the input variables, var1,...,varN. Variables can be of different sizes and data types, but all variables must have the same number of rows.
Actually I don't have much experience with tables, but if you can't figure it out you can always switch to using 1 cell array for the first column, and a matrix for all others (in your example).