How to get the first row data of each list? - scala

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
I get the probability column using:
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
The result is:
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get the first element of each row's probability, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
How can this be done?
The data comes from a standard random forest model; the DataFrame above was created with val Data = predictions.select("docID", "probability"). The schema of predictions is:
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
I want to get the first value of the "probability" column.

You can use the Column.apply method to get the n-th item of an array column; in this case, the first item (index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the DataFrame into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
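If indexing with $"probability"(0) does not resolve because probability is an ML vector (as the schema above suggests) rather than a SQL array, a small UDF is one alternative. This is only a sketch, assuming Spark 2.x and the org.apache.spark.ml.linalg.Vector type:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Hypothetical helper: extract element 0 from each probability vector
val firstProb = udf((v: Vector) => v(0))
val proVal = Data.select(firstProb($"probability")).as[Double].collect()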

Related

Scala: read CSV of documents, create cosine similarity

I'm reading in dozens of documents. They seem to be read into both RDDs and DataFrames as rows of string columns:
This is the schema:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)...
|-- _c58: string (nullable = true)
|-- _c59: string (nullable = true)
This is the head of the df:
| _c1           | _c2           | ...
| V1            | V2            | ...
| This text ... | This is an... | ...
I'm trying to create a cosine similarity matrix using this:
val contentRDD = spark.sparkContext.textFile("...documents_vector.csv").toDF()
val Row(coeff0: Matrix) = Correlation.corr(contentRDD, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
This is another way I was doing it:
val df_4 = spark.read.csv("/document_vector.csv")
Here, features would be the name of the column created by converting the single row of 59 columns into a single column of 59 rows.
Is there a way to map each new element in the csv to a new row to complete the cosine similarity matrix? Is there another way I should be doing this?
Thank you to any who consider this.
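
One possible direction (a sketch only, not a definitive answer): concatenate the string columns into one text column per document, build TF-IDF vectors, L2-normalise them, and compute pairwise dot products, which equal cosine similarities for normalised vectors. The column and variable names (text, docId, similarities) are assumptions for illustration:
import org.apache.spark.ml.feature.{Tokenizer, HashingTF, IDF, Normalizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical preprocessing: one text column and an id per document
val docs = df_4
  .withColumn("text", concat_ws(" ", df_4.columns.map(col): _*))
  .withColumn("docId", monotonically_increasing_id())

val tokens   = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val tf       = new HashingTF().setInputCol("words").setOutputCol("tf").transform(tokens)
val tfidf    = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(tf).transform(tf)
val features = new Normalizer().setInputCol("tfidf").setOutputCol("features").setP(2.0).transform(tfidf)

// Cosine similarity of L2-normalised vectors is just their dot product
val dot = udf((a: Vector, b: Vector) => a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum)

val similarities = features.as("a").crossJoin(features.as("b"))
  .select($"a.docId" as "i", $"b.docId" as "j", dot($"a.features", $"b.features") as "cosine")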

Scala Spark: How to extract nested column names from a parquet file and add a prefix to them

The idea is to read a parquet file into a DataFrame, then extract all column names and types from its schema. If there are nested columns, I would like to add a "prefix" before the column name.
Note that a nested column can have properly named sub-columns, but it can also be just an array of arrays with no column name other than "element".
val dfSource: DataFrame = spark.read.parquet("path.parquet")
val dfSourceSchema: StructType = dfSource.schema
Example of dfSourceSchema (Input):
|-- exCar: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: binary (nullable = true)
|-- exProduct: string (nullable = true)
|-- exName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exNameOne: string (nullable = true)
| | |-- exNameTwo: string (nullable = true)
Desired output :
( (exCar.prefix.prefix, binary), (exProduct, String), (exName.prefix.exNameOne, String), (exName.prefix.exNameTwo, String) )
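
There is no answer attached here, but one way to approach it is a recursive walk over the schema. The following is only a sketch under the assumptions implied by the desired output: each array level contributes a literal "prefix" segment, struct fields extend the dotted path with their own name, and leaf types are reported as-is. The function name flatten is made up for illustration:
import org.apache.spark.sql.types._

def flatten(dataType: DataType, path: String): Seq[(String, DataType)] = dataType match {
  case StructType(fields) =>
    // Struct fields extend the dotted path with their own name
    fields.toSeq.flatMap(f => flatten(f.dataType, if (path.isEmpty) f.name else s"$path.${f.name}"))
  case ArrayType(elementType, _) =>
    // Array levels contribute a "prefix" segment, matching the desired output
    flatten(elementType, s"$path.prefix")
  case leaf =>
    Seq(path -> leaf)
}

val pairs = flatten(dfSourceSchema, "")
pairs.foreach { case (name, t) => println(s"($name, ${t.simpleString})") }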

How to assign a value to an existing empty DataFrame column in Scala?

I am reading a CSV file that ends with a trailing | delimiter, so in Spark 1.6 the load method creates a last column in the DataFrame with no name and no values.
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "|").option("header", "true").load("filepath")
val df2 = df.withColumnRenamed(df.columns(83), "Invalid_Status").drop(df.col("Invalid_Status"))
The result I expected:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- Invalid_Status: string (nullable = true)
but the actual output is:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- : string (nullable = true)
with no values in that column, so I have to drop it and create the new column again.

It is not completely clear whether you want to just rename the column to Invalid_Status or to drop the column entirely. What I understand is that you are trying to operate (rename/drop) on the last column, which has no name.
I will try to help you with both solutions.
To rename the column while keeping its values (blanks) as they are:
val df2 = df.withColumnRenamed(df.columns.last,"Invalid_Status")
To drop the last column without knowing its name, use:
val df3 = df.drop(df.columns.last)
And then add the "Invalid_Status" column with a default value:
import org.apache.spark.sql.functions.lit
val requiredDf = df3.withColumn("Invalid_Status", lit("Any_Default_Value"))

How to compute statistics on a streaming dataframe for different type of columns in a single query?

I have a streaming dataframe having three columns time, col1,col2.
+-----------------------+-------------------+--------------------+
|time |col1 |col2 |
+-----------------------+-------------------+--------------------+
|2018-01-10 15:27:21.289|0.4988615628926717 |0.1926744113882285 |
|2018-01-10 15:27:22.289|0.5430687338123434 |0.17084552928040175 |
|2018-01-10 15:27:23.289|0.20527770821641478|0.2221980020202523 |
|2018-01-10 15:27:24.289|0.130852802747647 |0.5213147910202641 |
+-----------------------+-------------------+--------------------+
The datatype of col1 and col2 is variable. It could be a string or numeric datatype.
So I have to calculate statistics for each column.
For a string column, calculate only the valid count and invalid count.
For a timestamp column, calculate only the min and max.
For a numeric column, calculate the min, max, mean, and standard deviation.
I have to compute all the statistics in a single query.
Right now, I compute them with three separate queries, one per column type.

Enumerate the cases you want and select. For example, if the stream is defined as:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
val schema = StructType(Seq(
  StructField("v", TimestampType),
  StructField("x", IntegerType),
  StructField("y", StringType),
  StructField("z", DecimalType(10, 2))
))
val df = spark.readStream.schema(schema).format("csv").load("/tmp/foo")
then the statistics query would be:
val stats = df.select(df.dtypes.flatMap {
  case (c, "StringType") =>
    Seq(count(c) as s"valid_${c}", count("*") - count(c) as s"invalid_${c}")
  case (c, t) if Seq("TimestampType", "DateType") contains t =>
    Seq(min(c), max(c))
  case (c, t) if (Seq("FloatType", "DoubleType", "IntegerType") contains t) || t.startsWith("DecimalType") =>
    Seq(min(c), max(c), avg(c), stddev(c))
  case _ => Seq.empty[Column]
}: _*)
// root
// |-- min(v): timestamp (nullable = true)
// |-- max(v): timestamp (nullable = true)
// |-- min(x): integer (nullable = true)
// |-- max(x): integer (nullable = true)
// |-- avg(x): double (nullable = true)
// |-- stddev_samp(x): double (nullable = true)
// |-- valid_y: long (nullable = false)
// |-- invalid_y: long (nullable = false)
// |-- min(z): decimal(10,2) (nullable = true)
// |-- max(z): decimal(10,2) (nullable = true)
// |-- avg(z): decimal(14,6) (nullable = true)
// |-- stddev_samp(z): double (nullable = true)
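To actually run this on the stream, the aggregation has to be started as a streaming query. A minimal sketch, assuming a console sink is acceptable for inspection:
// Streaming aggregations without a watermark must use "complete" output mode
val query = stats.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()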

Convert array of vectors to DenseVector

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.

This can be done with a UDF:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a DataFrame with the following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)
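As a follow-up illustration (reusing the df and Vector names from the snippet above, which are assumptions from this answer), the converted vector for a single id can then be collected like this:
// Pull the dense vector for one specific id into the driver
val vecFor401310732094 = df
  .filter($"id" === 401310732094L)
  .select("hashValues")
  .head()
  .getAs[Vector](0)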