Scala: read CSV of documents, create cosine similarity

I'm reading in dozens of documents. They seem to be read into both RDDs and DataFrames as rows of string columns:
This is the schema:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)...
|-- _c58: string (nullable = true)
|-- _c59: string (nullable = true)
This is the head of the df:
           _c1|           _c2| ...
            V1|            V2| ...
 This text ...| This is an...| ...
I'm trying to create a cosine similarity matrix using this:
val contentRDD = spark.sparkContext.textFile("...documents_vector.csv").toDF()
val Row(coeff0: Matrix) = Correlation.corr(contentRDD, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
This is another way I was doing it:
val df_4 = spark.read.csv("/document_vector.csv")
Here, features would be the name of the column created by converting the single row of 59 columns into a single column of 59 values.
Is there a way to map each new element in the csv to a new row to complete the cosine similarity matrix? Is there another way I should be doing this?
Thank you to any who consider this.
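One way to get a "features" column like that is to cast the string columns to doubles and let VectorAssembler pack each row into a single vector, which Correlation.corr can then consume. The following is only a minimal sketch, assuming every _c* column actually holds a numeric value and using a placeholder path:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Every column is read as a string (_c0 ... _c59).
val raw = spark.read.csv("/document_vector.csv")

// Cast each column to double so the assembler can consume it.
val numeric = raw.columns.foldLeft(raw)((df, c) => df.withColumn(c, col(c).cast("double")))

// Pack the columns of each row into one vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(numeric.columns)
  .setOutputCol("features")
  .setHandleInvalid("skip")

val features = assembler.transform(numeric)

val Row(coeff0: Matrix) = Correlation.corr(features, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
Note that Correlation.corr gives a Pearson correlation matrix, not cosine similarity; MLlib's RowMatrix.columnSimilarities() computes cosine similarities, but between columns, so the documents would have to be laid out as columns (or the matrix transposed) first. And if the columns actually hold raw document text, as the head above suggests, the text would first need to be turned into numeric vectors (for example with Tokenizer plus HashingTF/IDF) before any similarity can be computed.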

Related

Divide a dataframe into batches in Spark

I need to run a set of transformations on the dataframe in batches of hours, and the number of hours should be parameterized so it can be changed - for example, run the transformations on 3 hours of the dataframe, then on the next 2 hours. In other words, there should be a step with a parameterized number of hours for each transformation.
The signature of the transformation looks like this:
def transform(wordsFeed: DataFrame)(filesFeed: DataFrame): Unit
So I want to do this division into batches and then call transform on that data feed. But I can't use groupBy, as it would turn the dataframe into a grouped dataset, while I need to preserve all the columns in the schema. How can I do that? This is roughly what I tried:
val groupedDf = df.srcHours.groupBy($"event_ts")
transform(keywords)(groupedDf)
Data schema looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- hashed_user_id: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
The main reason to introduce this batching is that there's too much data to process at once.
Note: I still want to use batch data processing and not streaming in this case.
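One simple batch-style approach (a sketch, not a definitive solution) is to collect the distinct hour buckets, slice them into groups of the requested size, and filter the dataframe once per group; filtering keeps the full schema, and transform from the question can be called on each slice. It assumes event_hour identifies the hour buckets and that the list of distinct hours is small enough to collect to the driver:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def runInHourBatches(keywords: DataFrame, df: DataFrame, batchHours: Int): Unit = {
  // Distinct hour buckets, sorted so the batches follow chronological order.
  val hours = df.select("event_hour").distinct().collect().map(_.getLong(0)).sorted

  // Process `batchHours` hours at a time; the filter preserves all columns.
  hours.grouped(batchHours).foreach { batch =>
    val slice = df.filter(col("event_hour").isin(batch: _*))
    // transform(wordsFeed)(filesFeed) is the curried method from the question.
    transform(keywords)(slice)
  }
}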

How is the VectorAssembler used with Spark's Correlation util?

I'm trying to correlate a couple of columns of a dataframe in Spark Scala by piping the columns of the original dataframe into the VectorAssembler followed by the Correlation util. For some reason the VectorAssembler seems to be producing empty vectors, as seen below. Here's what I have so far.
val numericalCols = Array(
"price", "bedrooms", "bathrooms",
"sqft_living", "sqft_lot"
)
val data: DataFrame = HousingDataReader(spark)
data.printSchema()
/*
...
|-- price: decimal(38,18) (nullable = true)
|-- bedrooms: decimal(38,18) (nullable = true)
|-- bathrooms: decimal(38,18) (nullable = true)
|-- sqft_living: decimal(38,18) (nullable = true)
|-- sqft_lot: decimal(38,18) (nullable = true)
...
*/
println("total record:"+data.count()) //total record:21613
val assembler = new VectorAssembler().setInputCols(numericalCols)
.setOutputCol("features").setHandleInvalid("skip")
val df = assembler.transform(data).select("features","price")
df.printSchema()
/*
|-- features: vector (nullable = true)
|-- price: decimal(38,18) (nullable = true)
*/
df.show
/* THIS IS ODD
+--------+-----+
|features|price|
+--------+-----+
+--------+-----+
*/
println("df row count:" + df.count())
// df row count:21613
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head //ERROR HERE
println("Pearson correlation matrix:\n" + coeff1.toString)
This ends up with the following exception:
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:73)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:84)
at
It looks like at least one of your feature columns always contains a null value. setHandleInvalid("skip") drops any row that contains a null in one of the features, which is why the assembled dataframe comes out empty. Try filling the null values with 0 (na.fill(0) on the Scala DataFrame) and check the result; that should solve your issue.
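A sketch of that suggestion in Scala, reusing the names from the question (if na.fill does not touch the decimal columns in your Spark version, casting them to double first is an alternative):
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// Replace nulls in the feature columns with 0 so that
// setHandleInvalid("skip") no longer drops every row.
val filled = data.na.fill(0.0, numericalCols)

val assembler = new VectorAssembler()
  .setInputCols(numericalCols)
  .setOutputCol("features")
  .setHandleInvalid("skip")

val df = assembler.transform(filled).select("features", "price")

val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff1.toString)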

How to assign a value to an existing empty dataframe column in Scala?

I am reading a CSV file which has a trailing | delimiter, so in Spark 1.6 the load method creates a last column in the dataframe with no name and no values.
df.withColumnRenamed(df.columns(83),"Invalid_Status").drop(df.col("Invalid_Status"))
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter","|").option("header","true").load("filepath")
val df2 = df.withColumnRenamed(df.columns(83),"Invalid_Status").
This is the result I expected:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- Invalid_Status: string (nullable = true)
but the actual output is
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- : string (nullable = true)
with no values in the column, so I have to drop this column and then create a new column.
It is not completely clear whether you want to just rename the column to Invalid_Status or drop the column entirely. What I understand is that you are trying to operate (rename/drop) on the last column, which has no name.
But I will try to help you with both solutions.
To rename the column while keeping its (blank) values as they are:
val df2 = df.withColumnRenamed(df.columns.last,"Invalid_Status")
To drop the last column without knowing its name, use:
val df3 = df.drop(df.columns.last)
And then add the "Invalid_Status" column with default values:
val requiredDf = df3.withColumn("Invalid_Status", lit("Any_Default_Value"))
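Put together with the read from the question, and with the import that lit needs (the default value is just a placeholder):
import org.apache.spark.sql.functions.lit

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("header", "true")
  .load("filepath")

// Drop the unnamed trailing column created by the trailing delimiter,
// then add Invalid_Status with a default value.
val requiredDf = df
  .drop(df.columns.last)
  .withColumn("Invalid_Status", lit("Any_Default_Value"))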

How to compare 2 JSON schemas using pyspark?

I have 2 JSON schemas as below -
df1.printSchema()
# root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
df2.printSchema()
#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
How can I compare these 2 schemas and highlight the differences using pyspark, given that I am using pyspark-sql to load the data from the JSON files into DataFrames?
While it is not clear what you mean by "compare", the following code will give you the fields (StructField) that are in df2 and not in df1.
set(df2.schema.fields) - set(df1.schema.fields)
A set will take your list and remove the duplicates.
I find the following one-line code useful and neat. It also gives you the two-directional differences at the column level:
set(df1.schema).symmetric_difference(set(df2.schema))

How to get the first element of each list?

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
and I get the probability column using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is:
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get the first element of the list in each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
How can this be done?
The data is standard random forest output; Data above is created with val Data = predictions.select("docID", "probability")
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column - in this case the first element (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
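If probability is an ML vector column rather than an array (which is what the vector type in the printed schema suggests), the (0) indexing may not resolve; in that case a small UDF can pull out the first element. A sketch assuming Spark 2.x, where the ML pipeline uses org.apache.spark.ml.linalg.Vector:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Extracts the first element of an ML vector column.
val firstElement = udf { v: Vector => v(0) }

val proVal = Data.select(firstElement($"probability").alias("prob0")).as[Double].collect()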