Deequ - How to put validation on a subset of dataset?

Deequ - How to put validation on a subset of dataset? - scala

I have a usecase where I want to put certain validations on subset of data that satisfies a specific condition.
For example, I have a dataframe which has 4 columns. colA, colB, colC, colD
df.printSchema
root
|-- colA: string (nullable = true)
|-- colB: integer (nullable = false)
|-- colC: string (nullable = true)
|-- colD: string (nullable = true)
I want to put a validation that, wherever "colA == "x" and colB > 20" , combination of "colC and colD" should be unique. ( basically, hasUniqueness(Seq("colC", "colD"), Check.IsOne)

Related

scala: read csv of documents, create cosine similarity

I'm reading in dozens of documents. They seem to be read into both RDD and DFs as a string of columns:
This is the schema:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)...
|-- _c58: string (nullable = true)
|-- _c59: string (nullable = true)
This is the head of the df:
_c1| _c2|..........
V1 V2
This text ... This is an...
I'm trying to create a cosine similarity matrix using this:
val contentRDD = spark.sparkContext.textFile("...documents_vector.csv").toDF()
val Row(coeff0: Matrix) = Correlation.corr(contentRDD, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
This is another way I was doing it:
val df_4 = spark.read.csv("/document_vector.csv")
Where features would be the name of the column created by converting the single row of 59 columns into a single column of 59 rows, named features.
Is there a way to map each new element in the csv to a new row to complete the cosine similarity matrix? Is there another way I should be doing this?
Thank you to any who consider this.

Scala Spark : How to extract nested column names from parquet file and adding prefix to it

The idea is to read a parquet file into dataFrame. Then, extract all column name's and type's from it's schema. If we have a nested columns, i would like to add a "prefix" before the column name.
Considering that we can have a nested column with sub column named properly, and we can have also a nested column with just an array of array without column name but "element".
val dfSource: DataFrame = spark.read.parquet("path.parquet")
val dfSourceSchema: StructType = dfSource.schema
Example of dfSourceSchema (Input):
|-- exCar: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: binary (nullable = true)
|-- exProduct: string (nullable = true)
|-- exName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exNameOne: string (nullable = true)
| | |-- exNameTwo: string (nullable = true)
Desired output :
((exCar.prefix.prefix,binary)),(exProduct, String), (exName.prefix.exNameOne, String), (exName.prefix.exNameTwo, String) )

Divide dataframe into batches Spark

I need to run a set of transformations on batches of hours of the dataframe. And the number of hours should be parameterized so it could be changed - for example, run transformations on 3 hours of dataframe, then next 2 hours. In such a way, there should be a step with a parameterized number of hours for each transformation.
The signature of the transformation looks like this:
def transform(wordsFeed: DataFrame)(filesFeed: DataFrame): Unit
So I want to do this division into batches and then call a transform on this datafeed. But I can't use groupBy as it would change dataframe into grouped dataset while I need to preserve all the columns in the schema. How can I do that?
val groupedDf = df.srcHours.groupBy($"event_ts")
transform(keywords)(groupedDf)
Data schema looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- hashed_user_id: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
The main reason to introduce this batching is that there's too much data to process at once.
Note: I still want to use batch data processing and not streaming in this case

How to add assign value to empty dataframe existing column in scala?

I am reading a csv file which has | delimiter at last , while load method make last column in dataframe with no name and no values in Spark 1.6
df.withColumnRenamed(df.columns(83),"Invalid_Status").drop(df.col("Invalid_Status"))
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter","|").option("header","true").load("filepath")
val df2 = df.withColumnRenamed(df.columns(83),"Invalid_Status").
I expected result
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- Invalid_Status: string (nullable = true)
but actual output is
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- : string (nullable = true)
with no value in column so I have to drop this column and again make new column.

It is not completely clear what you want to do, to just rename the column to Invalid_Status or to drop the column entirely. What I understand is, you are trying to operate (rename/drop) on the last column which has no name.
But I will try to help you with both the solution -
To Rename the column with same values (blanks) as it is:
val df2 = df.withColumnRenamed(df.columns.last,"Invalid_Status")
Only To Drop the last column without knowing its name, use:
val df3 = df.drop(df.columns.last)
And then add the "Invalid_Status" column with default values:
val requiredDf = df3.withColumn("Invalid_Status", lit("Any_Default_Value"))

How to get the first row data of each list?

my DataFrame like this :
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
and I get the column of probability using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is :
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get the first column of data for each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
how can this be done?
The input is standard random forest input, above the input is val Data = predictions.select("docID", "probability")
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column

You can use the Column.apply method to get the n-th item on an array column - in this case the first column (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Deequ - How to put validation on a subset of dataset? - scala

Related

scala: read csv of documents, create cosine similarity

Scala Spark : How to extract nested column names from parquet file and adding prefix to it

Divide dataframe into batches Spark

How to add assign value to empty dataframe existing column in scala?

How to get the first row data of each list?

Categories

Resources