Deequ - How to put validation on a subset of dataset? - scala

I have a use case where I want to apply certain validations to the subset of the data that satisfies a specific condition.
For example, I have a DataFrame with four columns: colA, colB, colC, and colD.
df.printSchema
root
|-- colA: string (nullable = true)
|-- colB: integer (nullable = false)
|-- colC: string (nullable = true)
|-- colD: string (nullable = true)
I want to add a validation that, wherever colA == "x" and colB > 20, the combination of colC and colD is unique (basically, hasUniqueness(Seq("colC", "colD"), Check.IsOne)).
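One way to approach this, sketched below under assumptions, is to filter the DataFrame down to the rows of interest first and run the Deequ check on that subset only. The sample rows, the app name, and the check description are made up for illustration; the snippet is written spark-shell style and assumes Deequ is on the classpath.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("subset-check").getOrCreate()
import spark.implicits._

// Made-up sample data with the same column names as in the question.
val df = Seq(
  ("x", 25, "c1", "d1"),
  ("x", 30, "c1", "d2"),
  ("y", 50, "c1", "d1") // does not satisfy the condition, so it is ignored by the check
).toDF("colA", "colB", "colC", "colD")

// Keep only the rows the validation should apply to.
val subset = df.filter($"colA" === "x" && $"colB" > 20)

// Run the uniqueness check on the filtered rows only.
val result = VerificationSuite()
  .onData(subset)
  .addCheck(
    Check(CheckLevel.Error, "uniqueness on filtered rows")
      .hasUniqueness(Seq("colC", "colD"), Check.IsOne))
  .run()

println(result.status)
```

Filtering up front keeps the check itself unchanged, at the cost of one verification run per subset. Deequ also seems to allow attaching a row filter to some constraints through a where(...) clause on the check; whether hasUniqueness supports it is worth confirming against the Deequ version in use.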

Related

scala: read csv of documents, create cosine similarity

I'm reading in dozens of documents. They seem to be read into both RDD and DFs as a string of columns:
This is the schema:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)...
|-- _c58: string (nullable = true)
|-- _c59: string (nullable = true)
This is the head of the df:
_c1| _c2|..........
V1 V2
This text ... This is an...
I'm trying to create a cosine similarity matrix using this:
val contentRDD = spark.sparkContext.textFile("...documents_vector.csv").toDF()
val Row(coeff0: Matrix) = Correlation.corr(contentRDD, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
This is another way I was doing it:
val df_4 = spark.read.csv("/document_vector.csv")
Where features would be the name of the column created by converting the single row of 59 columns into a single column of 59 rows, named features.
Is there a way to map each new element in the csv to a new row to complete the cosine similarity matrix? Is there another way I should be doing this?
Thank you to any who consider this.

Scala Spark : How to extract nested column names from parquet file and adding prefix to it

The idea is to read a parquet file into a DataFrame, then extract all column names and types from its schema. For nested columns, I would like to add a "prefix" before the column name.
Note that a nested column can either have properly named sub-columns, or it can be just an array of arrays whose inner level has no column name other than "element".
val dfSource: DataFrame = spark.read.parquet("path.parquet")
val dfSourceSchema: StructType = dfSource.schema
Example of dfSourceSchema (Input):
|-- exCar: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: binary (nullable = true)
|-- exProduct: string (nullable = true)
|-- exName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exNameOne: string (nullable = true)
| | |-- exNameTwo: string (nullable = true)
Desired output :
((exCar.prefix.prefix, binary), (exProduct, String), (exName.prefix.exNameOne, String), (exName.prefix.exNameTwo, String))
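A minimal sketch of one way to produce that output is a recursive walk over the schema. The helper name flatten and its prefix parameter are hypothetical; the sketch assumes every array level contributes one "prefix" segment and that only leaf fields are reported. Map types are not descended into and would show up as a single leaf.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: returns (path, typeName) pairs for every leaf field.
// Each array level appends one `prefix` segment to the path.
def flatten(dt: DataType, path: String, prefix: String = "prefix"): Seq[(String, String)] =
  dt match {
    case s: StructType =>
      s.fields.toSeq.flatMap { f =>
        val fieldPath = if (path.isEmpty) f.name else s"$path.${f.name}"
        flatten(f.dataType, fieldPath, prefix)
      }
    case a: ArrayType =>
      flatten(a.elementType, s"$path.$prefix", prefix)
    case leaf =>
      Seq(path -> leaf.typeName)
  }

// Schema rebuilt by hand from the example in the question.
val schema = StructType(Seq(
  StructField("exCar", ArrayType(ArrayType(BinaryType))),
  StructField("exProduct", StringType),
  StructField("exName", ArrayType(StructType(Seq(
    StructField("exNameOne", StringType),
    StructField("exNameTwo", StringType)))))
))

val flattened = flatten(schema, "")
flattened.foreach(println)
```

With a real file, the call would be flatten(dfSourceSchema, ""). The type names come from DataType.typeName, so they are lowercase ("string", "binary") rather than "String" as in the desired output.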

Divide dataframe into batches Spark

I need to run a set of transformations on batches of hours of the dataframe. And the number of hours should be parameterized so it could be changed - for example, run transformations on 3 hours of dataframe, then next 2 hours. In such a way, there should be a step with a parameterized number of hours for each transformation.
The signature of the transformation looks like this:
def transform(wordsFeed: DataFrame)(filesFeed: DataFrame): Unit
So I want to do this division into batches and then call a transform on this datafeed. But I can't use groupBy as it would change dataframe into grouped dataset while I need to preserve all the columns in the schema. How can I do that?
val groupedDf = df.srcHours.groupBy($"event_ts")
transform(keywords)(groupedDf)
Data schema looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- hashed_user_id: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
The main reason to introduce this batching is that there's too much data to process at once.
Note: I still want to use batch data processing and not streaming in this case
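One possible approach that keeps every column is to skip groupBy entirely: compute the time range of event_ts once, split it into consecutive windows of the configured lengths, and filter the original DataFrame per window. The sketch below assumes this is acceptable; hourWindows and processInBatches are hypothetical helper names, and the step lengths cycle through the supplied sequence.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, max, min}

// Hypothetical helper: splits the range from `start` to `end` into consecutive
// windows whose lengths in hours cycle through `hoursPerStep` (Seq(3, 2) gives 3h, 2h, 3h, ...).
// Each window is (inclusive start, exclusive end); the last window covers `end`.
def hourWindows(start: Timestamp, end: Timestamp, hoursPerStep: Seq[Int]): Seq[(Timestamp, Timestamp)] = {
  require(hoursPerStep.nonEmpty && hoursPerStep.forall(_ > 0), "step sizes must be positive")
  val hourMs = 3600L * 1000L
  def stepMs(i: Int): Long = hoursPerStep(i % hoursPerStep.size) * hourMs
  Iterator.iterate((start.getTime, 0)) { case (t, i) => (t + stepMs(i), i + 1) }
    .takeWhile { case (t, _) => t <= end.getTime }
    .map { case (t, i) => (new Timestamp(t), new Timestamp(t + stepMs(i))) }
    .toList
}

// Hypothetical helper: applies `f` to each batch; the batch keeps the full schema of `df`.
def processInBatches(df: DataFrame, hoursPerStep: Seq[Int])(f: DataFrame => Unit): Unit = {
  val bounds = df.agg(min("event_ts"), max("event_ts")).head()
  if (!bounds.isNullAt(0)) {
    hourWindows(bounds.getTimestamp(0), bounds.getTimestamp(1), hoursPerStep).foreach {
      case (from, until) =>
        f(df.filter(col("event_ts") >= lit(from) && col("event_ts") < lit(until)))
    }
  }
}

// Usage with the names from the question (assumed to be in scope):
// processInBatches(df, Seq(3, 2))(batch => transform(keywords)(batch))
```

Each batch is an ordinary DataFrame, so transform can be called on it unchanged. Every filter rescans the source, so caching df or partitioning the underlying data by event_date or event_hour may be worth considering.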

How to assign a value to an existing empty DataFrame column in Scala?

I am reading a CSV file in which every line ends with a trailing | delimiter. As a result, in Spark 1.6 the load method creates a last column in the DataFrame with no name and no values.
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter","|").option("header","true").load("filepath")
val df2 = df.withColumnRenamed(df.columns(83),"Invalid_Status")
df.withColumnRenamed(df.columns(83),"Invalid_Status").drop(df.col("Invalid_Status"))
I expected this result:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- Invalid_Status: string (nullable = true)
but the actual output is:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- : string (nullable = true)
The unnamed column has no values, so I have to drop it and then create a new column.
It is not completely clear whether you want to just rename the column to Invalid_Status or drop it entirely. As I understand it, you are trying to operate (rename/drop) on the last column, which has no name.
Here are both solutions.
To rename the column and keep its (blank) values as they are:
val df2 = df.withColumnRenamed(df.columns.last,"Invalid_Status")
To drop the last column without knowing its name, use:
val df3 = df.drop(df.columns.last)
And then add the "Invalid_Status" column with a default value:
import org.apache.spark.sql.functions.lit
val requiredDf = df3.withColumn("Invalid_Status", lit("Any_Default_Value"))

How to get the first row data of each list?

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
and I get the probability column using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
The result is:
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get only the first value of each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
How can this be done?
The data comes from a standard random forest pipeline; the Data used above is defined as val Data = predictions.select("docID", "probability")
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column, in this case the first element (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
By the way, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the DataFrame into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
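Note that the printed schema shows probability as a vector rather than an array, and element access through Column.apply may be rejected for vector columns. A sketch of an alternative, assuming Spark 2.x or later where the column holds org.apache.spark.ml.linalg.Vector values (on Spark 1.6 the type would be org.apache.spark.mllib.linalg.Vector), is a small UDF. The sample rows below are stand-ins copied from the question, and firstElement is a hypothetical name.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("first-probability").getOrCreate()
import spark.implicits._

// Stand-in for the question's Data: an ID column and a vector-typed probability column.
val Data = Seq(
  ("583190715ccb64f503a", Vectors.dense(0.49128147201958017, 0.5087185279804199)),
  ("58326da75fc764ad200", Vectors.dense(0.42143416087939345, 0.5785658391206066))
).toDF("ID", "probability")

// Hypothetical UDF that returns the first element of each vector.
val firstElement = udf((v: Vector) => v(0))

val proVal = Data.select(firstElement($"probability")).as[Double].collect()
proVal.foreach(println)
```

The UDF only reads index 0, so it works for dense and sparse vectors alike.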