Scala: wrapper for Breeze DenseMatrix for column and row referencing - scala

I am new to Scala. Looking at it as an alternative to MATLAB for some applications.
I would like to program in Scala a wrapping class in order to be able to assign column names ("QuantityQ" && "QuantityP" -> Range) and row names (dates -> Range) to Breeze DenseMatrices (http://www.scalanlp.org/) in order to reference columns and rows.
The usage should resemble Python Pandas or Scala Saddle (http://saddle.github.io).
Saddle is very interesting but its usage is limited to 2D matrices. A huge limitation.
My Ideas:
Columns:
I thought a Map would do the job for columns, but that may not be the best implementation.
Rows:
For rows, I could maintain a separate Breeze vector with timestamps and provide methods that convert dates into timestamps, doing the number crunching through Breeze. This comes with a loss of generality, as a user may want to give arbitrary string names to rows.
Concerning dates, I use nscala-time (a Scala wrapper for Joda-Time).
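To make the idea above concrete, here is a rough, language-neutral sketch. It is written in Python rather than Scala, and a plain nested list stands in for the Breeze DenseMatrix. The class name LabeledMatrix and its methods are hypothetical. One dictionary per axis maps labels to positions, so rows can be labelled with dates or with arbitrary strings:

```python
from datetime import date


class LabeledMatrix:
    """Hypothetical wrapper: a 2D list of numbers plus row and column labels."""

    def __init__(self, data, row_labels, col_labels):
        if len(data) != len(row_labels):
            raise ValueError("one label per row is required")
        if any(len(row) != len(col_labels) for row in data):
            raise ValueError("one label per column is required")
        self.data = data
        # Label -> position maps; any hashable label works (strings, dates, ...).
        self.row_index = {label: i for i, label in enumerate(row_labels)}
        self.col_index = {label: j for j, label in enumerate(col_labels)}

    def at(self, row_label, col_label):
        """Look up a single cell by its two labels."""
        return self.data[self.row_index[row_label]][self.col_index[col_label]]

    def column(self, col_label):
        """Return a copy of one column, selected by label."""
        j = self.col_index[col_label]
        return [row[j] for row in self.data]

    def row(self, row_label):
        """Return a copy of one row, selected by label."""
        return list(self.data[self.row_index[row_label]])


m = LabeledMatrix(
    data=[[10.0, 1.5], [12.0, 1.7]],
    row_labels=[date(2014, 1, 1), date(2014, 1, 2)],
    col_labels=["QuantityQ", "QuantityP"],
)
```

Keeping the labels in a generic map instead of a numeric timestamp vector avoids the loss of generality mentioned above, at the cost of doing range queries on dates outside the numeric library.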
What are the drawbacks of my implementation?
Would you design the data structure differently?
Thank you for your help.

Related

Using dplyr correctly to combine shared values of a row to a new column of a table

How do I combine data from two tables based on certain shared values from the row?
I already tried using the which function and it didn't work.
I think you will have the best luck using the dplyr package. Specifically, you can use right_join(). You can write it like this: right_join(df1, df2, by = "specification")
This keeps every row of df2 and adds the columns from df1 wherever the "specification" values match.
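If it helps to see what a right join does row by row, here is a small sketch in plain Python (not R, and not dplyr's implementation). The function right_join and the sample tables are made up for illustration. It assumes the two tables share no column names other than the key, so it does not add the .x/.y suffixes that dplyr would:

```python
def right_join(left_rows, right_rows, by):
    """Keep every row of right_rows; add matching columns from left_rows."""
    # Index the left table by key so each lookup is a dictionary access.
    left_by_key = {}
    for row in left_rows:
        left_by_key.setdefault(row[by], []).append(row)
    left_cols = [c for c in (left_rows[0] if left_rows else {}) if c != by]

    joined = []
    for right in right_rows:
        matches = left_by_key.get(right[by])
        if matches:
            for left in matches:
                joined.append({**left, **right})
        else:
            # No match on the left: left-hand columns are filled with None.
            joined.append({**{c: None for c in left_cols}, **right})
    return joined


df1 = [{"specification": "A", "price": 10}, {"specification": "B", "price": 20}]
df2 = [{"specification": "B", "weight": 2.5}, {"specification": "C", "weight": 4.0}]
result = right_join(df1, df2, by="specification")
```

Here the row for "C" is kept with price set to None, and the row for "A" is dropped because it does not appear in df2.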
For future reference it would be a lot of help if you included a screenshot of code just so it is easier to know exactly what you are asking.
Anyway, let me know if this answers your question!

XGBoost4J - Scala dataframe to sparse dmatrix

What is the most efficient and scalable way to convert a scala dataframe to a sparse dmatrix for XGBoost4J?
Say I have a dataframe train with columns row_index, column_index, and value, it would be something like
new DMatrix(train.select("row_index"), train.select("column_index"), train.select("Value"), DMatrix.SparseType.CSR, n_col)
However the above code results in a type mismatch because DMatrix expects Array[Long].
train.select(F.collect_list("row_index")).first().getList[Long](0) seems like a possible option but it doesn't seem to be memory friendly and scalable.
I am doing this on Databricks, so solutions in the other supported languages (Python, SQL, Scala) are welcome.
The answer was to use sparse vectors by row rather than trying to create sparse matrix or dmatrix.
train.rdd.map(r => (r.getInt(0), (r.getInt(1), r.getInt(2).toDouble))).groupByKey().map(r => (r._1, Vectors.sparse(n_col, r._2.toSeq))).toDF
I tested scoring a sample of the data in R using Matrix::sparseMatrix and xgboost::dmatrix and the results matched up.
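To show what the groupByKey step is doing without needing a cluster, here is a plain-Python sketch of the same grouping. The function name to_sparse_rows and the sample triples are hypothetical. Each (row_index, column_index, value) triple is collected under its row, giving one sparse (index, value) list per row:

```python
from collections import defaultdict


def to_sparse_rows(triples):
    """Group (row, column, value) triples into one sparse row per row index."""
    rows = defaultdict(list)
    for row_index, column_index, value in triples:
        rows[row_index].append((column_index, float(value)))
    # Sort the entries of each row by column index.
    return {r: sorted(entries) for r, entries in rows.items()}


triples = [(0, 2, 1), (1, 0, 3), (0, 0, 5)]
sparse_rows = to_sparse_rows(triples)
```

In the Spark version, each of these per-row lists is what gets passed to Vectors.sparse together with n_col.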

Pyspark improving performance for multiple column operations

I have written a class which performs standard scaling over grouped data.
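# Assumed imports (not shown in the original post)
from functools import reduce
from pyspark.sql import functions as F

class Scaler:
    .
    .
    .
    .
    def __transformOne__(self, df_with_stats, newName, colName):
        # Scale one column: (value - group mean) / (group stddev + tol)
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName)-F.col(f'avg({colName})'))/(F.col(f'stddev_samp({colName})')+self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(....) #calculate stats here by doing a groupby and then do a join
        # Apply __transformOne__ once per column, then keep the original column order
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]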
The idea is to save the means and standard deviations in columns and simply do a column subtraction/division on the column I want to scale. This part is done in the function transformOne. So basically it's an arithmetic operation on one column.
If I want to scale multiple columns, I just call the function transformOne multiple times, but a bit more efficiently using functools.reduce (see the function transform). The class works fast enough for a single column, but when I have multiple columns it takes too much time.
I have no idea about the internals of Spark, so I'm a complete newbie. Is there a way I can improve this computation over multiple columns?
My solution makes a lot of calls to the withColumn function, so I changed it to use select instead of withColumn. There is a substantial difference in the physical plans of the two approaches. For my application, the run time improved from 15 minutes to 2 minutes using select. More information about this in this SO post.
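A minimal sketch of the select-based variant, not my exact code: build one expression per column up front and hand them all to a single selectExpr call. The helper name scaling_expressions is hypothetical, and it assumes the statistics columns are named avg(col) and stddev_samp(col) as in the class above:

```python
def scaling_expressions(all_columns, scaled_columns, tol):
    """Build one SQL expression per output column for a single select."""
    scaled = set(scaled_columns)
    exprs = []
    for name in all_columns:
        if name in scaled:
            # Backticks are needed because the stats columns contain parentheses.
            exprs.append(
                f"(`{name}` - `avg({name})`) / (`stddev_samp({name})` + {tol}) AS `{name}`"
            )
        else:
            # Columns that are not scaled pass through unchanged.
            exprs.append(f"`{name}`")
    return exprs


exprs = scaling_expressions(["id", "x", "y"], ["x", "y"], tol=0.001)
# With a real DataFrame, all columns are then computed in one projection:
# scaled_df = df_with_stats.selectExpr(*exprs)
```

Because everything goes through one projection, the plan no longer grows by one step per scaled column the way it does with chained withColumn calls.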

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should work indeed, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. Indeed, the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component denormalizes your value column (column 2) based on the id column (column 1), joining the values with a specific delimiter (";" in my case), which results in 2 rows as shown.
tMap: splits the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend doesn't accept storing Array objects, make sure to store the split String in Object format. Then cast that object to an array on the right side of the Map.
That approach should give you the expected result.
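The two steps can be mimicked outside Talend to check the logic. This is a plain-Python sketch, not Talend code; the function names and the sample rows are made up. The first function plays the role of tDenormalize, the second the split done in tMap:

```python
def denormalize(rows, delimiter=";"):
    """Join all values that share an id into one delimited string."""
    grouped = {}
    for key, value in rows:
        grouped.setdefault(key, []).append(str(value))
    return {key: delimiter.join(values) for key, values in grouped.items()}


def split_columns(denormalized, delimiter=";"):
    """Split each delimited string back into separate columns."""
    return [[key] + joined.split(delimiter) for key, joined in denormalized.items()]


rows = [(1, "a"), (1, "b"), (2, "c"), (2, "d")]
wide = split_columns(denormalize(rows))
```

Note that the column positions depend purely on the order of the values within each id, so every id needs the same number of entries for the columns to line up.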
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger input you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with 1 as id, and 6 entries with 2 as id). 6 columns are expected meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a column with a key.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
For example: "Identifier, field name, value"
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited
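To illustrate why the extra field-name column helps, here is a plain-Python sketch of the pivot (not Talend code; the function pivot and the sample records are hypothetical). Because each value carries its field name, an id with a missing entry still lands in the right columns:

```python
def pivot(records):
    """Pivot (identifier, field name, value) records into one row per identifier."""
    fields = []  # field names in order of first appearance
    table = {}   # identifier -> {field name: value}
    for identifier, field, value in records:
        if field not in fields:
            fields.append(field)
        table.setdefault(identifier, {})[field] = value
    header = ["id"] + fields
    # Missing fields are filled with None instead of shifting the columns.
    body = [[ident] + [vals.get(f) for f in fields] for ident, vals in table.items()]
    return header, body


records = [(1, "name", "Ann"), (1, "city", "Oslo"), (2, "name", "Bob")]
header, body = pivot(records)
```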

To make a variable or column name an object in Spark

In Spark with Scala, is there an easy way to automatically turn each variable or column of imported data into an object, so that we can write something like column_a.contains("something") inside .map()?
It looks like you are coming from R. Spark is row-oriented, not column-oriented. If you want to do a contains, for example, you would first filter the rows and then apply a map to the result, or use collect and do both operations at once, but this is a bit harder to get right.
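The row-oriented pattern looks like this in a plain-Python sketch (no Spark involved; the sample rows and the column name n are made up). Each row is handled as a whole record: first filter on the field, then map over what is left:

```python
rows = [
    {"column_a": "something good", "n": 1},
    {"column_a": "other", "n": 2},
    {"column_a": "has something", "n": 3},
]

# Step 1: keep only the rows whose column_a contains the substring.
matching = filter(lambda row: "something" in row["column_a"], rows)
# Step 2: map over the remaining rows.
result = list(map(lambda row: row["n"] * 10, matching))

# The same thing in a single pass, filtering and mapping together.
result_single_pass = [row["n"] * 10 for row in rows if "something" in row["column_a"]]
```

The single-pass form is only loosely analogous to doing both operations at once in Spark; it is here to show that the two versions give the same rows.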