Spark convert dataframe to RowMatrix - scala

Say I have a dataframe resulting from a sequence of transformations. It looks like the following:
id matrixRow
0 [1,2,3]
1 [4,5,6]
2 [7,8,9]
Each row actually corresponds to a row of a matrix.
How can I convert the matrixRow column of the dataframe to a RowMatrix?

After numerous tries, here's one solution:
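import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// Assumes matrixRow (column index 1) is stored as array<double>
val rdd = df.rdd.map(
row => Vectors.dense(row.getAs[Seq[Double]](1).toArray) // read the second column as Seq[Double], convert it to an Array, then build a dense Vector
)
val mat = new RowMatrix(rdd)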

Related

Converting dictionary to a Data Frame in Pyspark

Description: How can I convert a dictionary dataset to a DataFrame in PySpark?
Error / unexpected result: I tried
df = spark.createDataFrame([Row(**i) for i in bounds])
but get:
TypeError: Can not infer schema for type: <class 'str'>
This code:
rdd = sc.parallelize(bounds)
rdd.map(lambda x: (x,)).toDF().show()
and other attempts give unexpected results.
Expected result:
My DataSet:
Your input to createDataFrame() has an incorrect format. It should look like this:
[("price", {"q1":1, "q3": 3, "upper": 10, "lower":2} ),
("carAge", {"q1":1, "q3": 3, "upper": 11, "lower":1})]
This is a list of tuples (a list of lists would also work), where each tuple has two elements: the first is a string and the second is a dictionary. Each tuple holds the data for one row of the future Spark dataframe, and the two elements per tuple mean the dataframe will have two columns.
To bring your dictionary data into the format above, use this line of code:
[(x, dct[x]) for x in dct.keys()]
where dct is your original dictionary (the one shown under My DataSet).
Then, you can create spark dataframe as follows:
df = (spark.createDataFrame([(x, dct[x]) for x in dct.keys()],
schema=["Colums", "dct_col"]))
This dataframe will have only two columns. The second column, "dct_col", is the dictionary column, and you can extract "q1", "q3", and the other keys into their own columns as follows:
df_expected_result = (df
.withColumn("q1", df.dct_col["q1"])
.withColumn("q3", df.dct_col["q3"])
.withColumn("lower", df.dct_col["lower"])
.withColumn("upper", df.dct_col["upper"]))
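The reshaping step in this answer can be tried in plain Python, without Spark. This is a minimal sketch: the dictionary below is a hypothetical stand-in for the original data (which was only shown as an image), and the "name" key is an illustrative assumption, not part of the answer.

```python
# Hypothetical stand-in for the original dictionary
dct = {
    "price": {"q1": 1, "q3": 3, "upper": 10, "lower": 2},
    "carAge": {"q1": 1, "q3": 3, "upper": 11, "lower": 1},
}

# One (key, inner_dict) tuple per future row: the two-column shape
# that the answer passes to createDataFrame()
rows = [(name, stats) for name, stats in dct.items()]

# Alternative shape: flatten each inner dict so every statistic
# becomes its own field ("name" is an illustrative key)
flat_rows = [{"name": name, **stats} for name, stats in dct.items()]

print(rows)
print(flat_rows)
```

The first shape matches the list of tuples described above. The second is one possible way to get a field per statistic before the data ever reaches Spark.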

How to add a list or array of strings as a column to a Spark Dataframe

So, I have n number of strings that I can keep either in an array or in a list like this:
val checks = Array("check1", "check2", "check3", "check4", "check5")
val checks: List[String] = List("check1", "check2", "check3", "check4", "check5")
Now, I have a Spark dataframe df and I want to add a column with the values present in this List/Array. (It is guaranteed that the number of items in my List/Array will be exactly equal to the number of rows in the dataframe, i.e. n.)
I tried doing:
df.withColumn("Value", checks)
But that didn't work. What would be the best way to achieve this?
You need to add it as an array column as follows (array and lit come from org.apache.spark.sql.functions):
val df2 = df.withColumn("Value", array(checks.map(lit):_*))
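If you want a single value for each row, you can number the rows and pick the matching array element. Note that ordering the window by a constant does not define a deterministic row order and moves all rows into a single partition, so which row receives which element is not guaranteed:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, expr, lit, row_number}
val df2 = df.withColumn("Value", array(checks.map(lit):_*))
.withColumn("rn", row_number().over(Window.orderBy(lit(1))) - 1)
.withColumn("Value", expr("Value[rn]"))
.drop("rn")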

Create new DataFrame with new rows depending in number of a column - Spark Scala

I have a DataFrame with the following data:
num_cta | n_lines
110000000000| 2
110100000000| 3
110200000000| 1
With that information, I need to create a new DF with a different number of rows depending on the value in the n_lines column.
For example, for the first row of my DF (110000000000), the value of the n_lines column is 2. The result would have to be something like the following:
num_cta
110000000000
110000000000
For the whole example DataFrame shown above, the result would have to be something like this:
num_cta
110000000000
110000000000
110100000000
110100000000
110100000000
110200000000
Is there a way to do that, i.e. to repeat a row n times depending on the value of a column?
Regards.
One approach would be to expand n_lines into an array with a UDF and explode it:
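import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._ // for toDF and the $"..." column syntax
val df = Seq(
("110000000000", 2),
("110100000000", 3),
("110200000000", 1)
).toDF("num_cta", "n_lines")
// UDF that turns n into an array of n placeholder elements
val fillArr = udf(
(n: Int) => Array.fill(n)(1)
)
// explode emits one output row per array element
val df2 = df.withColumn("arr", fillArr($"n_lines")).
withColumn("a", explode($"arr")).
select($"num_cta")
df2.show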
+------------+
| num_cta|
+------------+
|110000000000|
|110000000000|
|110100000000|
|110100000000|
|110100000000|
|110200000000|
+------------+
There is no off-the-shelf way of doing this. However, you can iterate over the dataframe and return, for each row, a list of num_cta values whose length equals the corresponding n_lines.
Something like
import spark.implicits._
case class Output(num_cta: String) // output dataframe schema
case class Input(num_cta: String, n_lines: Int) // input dataframe 'df' schema
val result = df.as[Input].flatMap(x => {
List.fill(x.n_lines)(Output(x.num_cta))
}).toDF
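Both answers boil down to the same replicate-and-flatten logic, which can be checked in plain Python without Spark. This is only a sketch of that logic: the list of tuples is a hypothetical in-memory stand-in for the dataframe, and replicate is an illustrative helper name.

```python
# Hypothetical in-memory stand-in for the (num_cta, n_lines) rows
rows = [
    ("110000000000", 2),
    ("110100000000", 3),
    ("110200000000", 1),
]

def replicate(rows):
    """Repeat each num_cta n_lines times, preserving input order."""
    return [num_cta for num_cta, n_lines in rows for _ in range(n_lines)]

result = replicate(rows)
for num_cta in result:
    print(num_cta)
```

A row whose n_lines is 0 contributes nothing to the result, which is also what exploding an empty array does.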

Spark scala copying dataframe column to new dataframe

I have an empty dataframe with schema already created.
I'm trying to fill its existing columns with columns taken from another dataframe in a for loop.
k schema - |ID|DATE|REPORTID|SUBMITTEDDATE|
for(data <- 0 to range-1){
val c = df2.select(substring(col("value"), str(data)._2, str(data)._3).alias(str(data)._1)).toDF()
//c.show()
k = c.withColumn(str(data)._1, c(str(data)._1))
}
k.show()
But the k dataframe ends up with just one column, when it should have all 4 columns populated with values.
I think the last line in the for loop is replacing the existing columns in the dataframe.
Can somebody help me with this?
Thanks!!
Add your logic and conditions and create a new dataframe:
val dataframe2 = dataframe1.select("A","B","C")
Copying a few columns from one dataframe into another is not directly possible in Spark.
There are, however, a few alternatives that achieve the same result:
1. Join both dataframes on some join condition.
2. Convert both dataframes to JSON and union them:
val rdd = df1.toJSON.union(df2.toJSON)
val dfFinal = spark.read.json(rdd)

Spark: Computing correlations of a DataFrame with missing values

I currently have a DataFrame of doubles with approximately 20% of the data being null values. I want to calculate the Pearson correlation of one column with every other column and return the columnId's of the top 10 columns in the DataFrame.
I want to filter out nulls using pairwise deletion, similar to R's pairwise.complete.obs option in its Pearson correlation function. That is, if one of the two vectors in any correlation calculation has a null at an index, I want to remove that row from both vectors.
I currently do the following:
val df = ... //my DataFrame
val cols = df.columns
df.registerTempTable("dataset")
val target = "Row1"
val mapped = cols.map {colId =>
val results = sqlContext.sql(s"SELECT ${target}, ${colId} FROM dataset WHERE (${colId} IS NOT NULL AND ${target} IS NOT NULL)")
(results.stat.corr(colId, target) , colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
This runs very slowly, as every single map iteration is its own job. Is there a way to do this efficiently, perhaps using Statistics.corr in MLlib, as per this SO question (Spark 1.6 Pearson Correlation)?
There are "na" functions on DataFrame: see the DataFrameNaFunctions API.
They work in the same way as DataFrameStatFunctions.
You can drop the rows containing a null in either of your two dataframe columns with the following syntax:
myDataFrame.na.drop("any", Seq(target, colId))
If you want to drop rows containing a null in any of the columns, then it is:
myDataFrame.na.drop("any")
By limiting the dataframe to the two columns you care about first, you can use the second form and avoid the extra verbosity.
As such your code would become:
val df = ??? //my DataFrame
val cols = df.columns
val target = "Row1"
val mapped = cols.map {colId =>
val resultDF = df.select(target, colId).na.drop("any")
(resultDF.stat.corr(target, colId) , colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
Hope this helps you.
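To make clear what pairwise deletion computes for a single pair of columns, here is a small plain-Python sketch, independent of Spark. The function name pairwise_pearson and the use of None for missing values are assumptions made for illustration. It mirrors the select-then-drop-nulls step above and is not a replacement for stat.corr.

```python
import math

def pairwise_pearson(xs, ys):
    """Pearson correlation using only positions where both values are present."""
    # Pairwise deletion: drop an index if either column is missing there
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

# Only the first two positions are complete in both columns
r = pairwise_pearson([1.0, 2.0, None, 4.0], [2.0, 4.0, 6.0, None])
print(r)
```

Each column pair is filtered separately, so different pairs may be computed over different sets of rows. This is the behaviour of R's pairwise.complete.obs that the question asks for.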