Calculate hash over a whole column from pyspark dataframe - pyspark

I have a big data frame (approximately 40 million rows) which looks like this:
|col A | col B |
|------|-------|
|valA1 | valB1 |
|valA2 | valB2 |
I want to compare columns between two data frames that live in different workspaces, and I am not allowed to bring both of them into the same environment. What I want is to create a hash value for every column, so I can compare it against the corresponding column from the other data frame.
The easy approach would be to concatenate all values from a column and then hash the resulting string, but because of the size of the data frame I cannot do this.
So far I tried this version, but it takes too long:
import hashlib
from pyspark.sql.functions import col, collect_list

hashlib.sha256(
    ''.join(map(str, df.agg(collect_list(col("colName"))).first()[0])).encode('utf-8')
).hexdigest()
and also this, which takes just as long:
def compute_hash(df):
    hasher = hashlib.sha256()
    dataCollect = df.rdd.toLocalIterator()
    for row in dataCollect:
        hasher.update(row['colName'].encode('utf-8'))
    return hasher.hexdigest()
Is this achievable in Spark in a reasonable time?

You don't need to hash the whole string at once.
Example using sha256 from the hashlib library:
import hashlib

column = ['valA1', 'valA2', 'valA3']
hasher = hashlib.sha256()
for row in column:
    hasher.update(row.encode('utf-8'))
print(hasher.hexdigest())
# >>> 68f900960718b4881107929da0918e0e9f50599b12ebed3ec70066e55c3ec5f4
The update method processes the data incrementally as you feed it, so you never have to materialize one giant concatenated string.
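As a quick sanity check (my own addition, not part of the original answer), feeding the values one at a time produces exactly the same digest as hashing the concatenated string in one call, as long as they are fed in the same order:

import hashlib

values = ['valA1', 'valA2', 'valA3']

incremental = hashlib.sha256()
for v in values:
    incremental.update(v.encode('utf-8'))

one_shot = hashlib.sha256(''.join(values).encode('utf-8'))

assert incremental.hexdigest() == one_shot.hexdigest()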

The solution was grouping by a column whose values are evenly distributed. This way Spark triggers parallel execution for every value of "columnToGroupBy" and generates a dataframe containing, in the first column, all the values of "columnToGroupBy" and, in the second column, a hash over the concatenated values of "colToHash" corresponding to that value of "columnToGroupBy". For example, if we had this table:
|columnToGroupBy | colToHash |
|----------------|-----------|
|ValueA1         | ValB1     |
|ValueA2         | ValB2     |
|ValueA1         | ValB3     |
|ValueA2         | ValB4     |
After applying this aggregation:
from pyspark.sql.functions import array_sort, col, collect_set, concat_ws, md5

df.groupby("columnToGroupBy").agg(md5(concat_ws(",", array_sort(collect_set(col("colToHash"))))).alias("hash"))
We would get the following dataframe:
|columnToGroupBy | hash             |
|----------------|------------------|
|ValueA1         | md5(ValB1,ValB3) |
|ValueA2         | md5(ValB2,ValB4) |
This new dataframe has a small number of rows, equal to the number of distinct values in "columnToGroupBy", so you can easily generate a hash for the whole column by collecting all the values from the "hash" column, concatenating them, and hashing the result.
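A minimal sketch of that final step (my own illustration; it assumes the aggregated column is aliased as "hash" and sorts by the group key so the final digest is deterministic):

import hashlib
from pyspark.sql.functions import array_sort, col, collect_set, concat_ws, md5

per_group = (df.groupby("columnToGroupBy")
               .agg(md5(concat_ws(",", array_sort(collect_set(col("colToHash"))))).alias("hash"))
               .orderBy("columnToGroupBy"))

# The aggregated frame is small (one row per distinct group value),
# so collecting it to the driver is cheap.
column_hash = hashlib.sha256(
    ''.join(row['hash'] for row in per_group.collect()).encode('utf-8')
).hexdigest()

Run the same steps in both workspaces and compare the two resulting column_hash values.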

Related

Java code to check if a pcollection is empty

I am trying to write a pipeline to insert/update/delete a MySQL table based on Pub/Sub messages. While inserting into a particular table, I have to check whether the data exists in another table and do the insert only when the data is available in that other table.
I have to stop the insertion process when there is no data in the other table (PCollection).
PCollection recordCount = windowedMatchedCollection.apply(Combine.globally(new CountElements()).withoutDefaults());
This line does not seem to help. Any inputs on this, please?
It's a little unclear exactly what you're trying to do, but this should be achievable with counting elements. For example, suppose you have
# A PCollection of (table, new_row) KVs.
new_data = ...
# A PCollection of (table, old_row) KVs.
old_data = ...
You could then do
rows_per_old_table = old_data | beam.CombinePerKey(
    beam.combiners.CountCombineFn())
and use this to filter out your data with a side input.
def maybe_filter_row(table_and_row, old_table_count):
    # table_and_row comes from the PCollection new_data
    # old_table_count is the side input as a Map
    table = table_and_row[0]
    if old_table_count.get(table) > 0:
        yield table_and_row

new_data_to_update = new_data | beam.FlatMap(
    maybe_filter_row,
    old_table_count=beam.pvalue.AsMap(rows_per_old_table))
Now your new_data_to_update will contain only that data for tables that had a non-zero number of rows in old_data.
If you're trying to do this in a streaming fashion, everything would have to be windowed, including old_data, and it would filter out only those things that have data in that same window. You could instead do something like
# Create a PCollection containing the set of tables in new_data, per window.
tables_to_consider = (
    new_data
    | beam.GroupByKey()
    | beam.MapTuple(lambda table, rows: table))

rows_per_old_table = tables_to_consider | beam.ParDo(
    SomeDoFnLookingUpCurrentSizeOfEachTable())

# Continue as before.
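The DoFn above is just a placeholder name from the answer; here is a rough sketch (my own, not from the original answer) of the shape such a DoFn could take, where query_row_count is a hypothetical helper, e.g. a SELECT COUNT(*) against the MySQL table:

import apache_beam as beam

class LookupCurrentTableSize(beam.DoFn):
    """Emits (table, current_row_count) for each table name it receives."""

    def process(self, table):
        # query_row_count is a hypothetical lookup; replace it with your
        # actual query against the backing store.
        yield (table, query_row_count(table))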

How to sum values of an entire column in pyspark

I have a data frame with 900 columns and I need the sum of each column in pyspark, so it will be 900 values in a list. Please let me know how to do this. The data has around 280 million rows, all binary data.
Assuming you already have the data in a Spark DataFrame, you can use the sum SQL function, together with DataFrame.agg.
For example:
from pyspark.sql import functions as F

sdf = spark.createDataFrame([[1, 3], [2, 4]], schema=['a', 'b'])
sdf.agg(F.sum(sdf.a), F.sum(sdf.b)).collect()
# Out: [Row(sum(a)=3, sum(b)=7)]
Since in your case you have quite a few columns, you can use a list comprehension to avoid naming columns explicitly.
sums = sdf.agg(*[F.sum(sdf[c_name]) for c_name in sdf.columns]).collect()
Notice how you need to unpack the arguments from the list using the * operator.
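Since the question asks for the 900 sums as a plain list, a small follow-up (my own addition, reusing the sums variable from above) flattens the single collected Row:

# sums holds one Row with one field per column; Row behaves like a tuple,
# so it can be converted into a plain Python list of the per-column sums.
sums_as_list = list(sums[0])
# For the toy frame above this gives [3, 7].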

Extract and Replace values from duplicates rows in PySpark Data Frame

I have duplicate rows in a PySpark data frame that may contain the same data or have missing values.
The code that I wrote is very slow and does not run in a distributed way.
Does anyone know how to retain the single unique values from duplicate rows in a PySpark dataframe, in a way that runs distributed and with fast processing time?
I have written complete PySpark code and this code works correctly.
But the processing time is really slow and it's not practical to use it on a Spark cluster.
'''
# Columns of duplicate rows of the DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)
    # Match duplicates using stdname/stdaddress and create an RDD
    fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname']))
                              .where(sf.col("stdaddress") == row_value['stdaddress']))
                           .rdd.map(fill_duplicates))
    # Creating feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                                (sf.col("stdaddress") == row_value['stdaddress'])))
                                      .rdd.map(fill_duplicated_columns_extract)).first())
    # Creating a DF using the previous RDD
    # This DF stores the values of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(col(column)).distinct().rdd.toLocalIterator()
                          if value[column] != ""])
            if len(col_value) >= 1:
                # The non-null, non-empty distinct value of the column is kept here
                col_value = col_value[0]
                # The distinct value is written back to replace any rows in the
                # PySpark DF that were empty for this duplicate group.
                df_dedup = (df_dedup
                            .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                        & (sf.col("stdaddress") == row_value['stdaddress']),
                                                        col_value)
                                        .otherwise(df_dedup[column])))
        except:
            print("None")
'''
There are no error messages, but the code runs very slowly. I want a solution that fills the empty values of duplicate rows in a PySpark DF with the unique non-empty value from the matching rows; filling with the mode of the values would also work (see the sketch after the snippet below).
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
try:
# distinct() was replaced by isNOTNULL().limit(1).take(1) to improve the speed of the code and extract values of the row.
col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
df_dedup = (df_dedup
.withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
& (sf.col("stdaddress")== row_value['stdaddress'])
,col_value)
.otherwise(df_dedup[column])))
"""

Iterate through a dataframe and dynamically assign ID to records based on substring [Spark][Scala]

Currently I have an input file (millions of records) where every record contains a 2-character identifier. Multiple lines in this input file will be concatenated into a single record in the output file, and how this is determined is based SOLELY on the sequential order of the identifier.
For example, the records would begin as below
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record:
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the identifier.
I am unsure how to approach this situation using Scala/Spark.
My strategy is to:
1. Load the input file into a dataframe.
2. Create an Identifier column based on a substring of the record.
3. Create a new column, TempID, and a variable x that is set to 0.
4. Iterate through the dataframe: if Identifier == "1A", then x = x + 1 and TempID = x.
5. Then create a UDF to concat records with the same TempID.
To summarize my question:
How would I iterate through the dataframe, check the value of the Identifier column, and assign a TempID whose value increases by 1 whenever the Identifier column is 1A?
This is dangerous. The issue is that Spark is not guaranteed to keep the same order among elements, especially since they might cross partition boundaries. So when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not skip Spark entirely and run it as regular Scala code as a preprocessing step before getting to Spark?
My recommendation would be to either look into writing a custom input format/data source, or perhaps use "1A" as a record delimiter, similar to this question.
First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or - almost, details to follow) using Window Functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) using these "counts" as the "group id" that ties all records of each group together, and group by it:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")

val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
  .withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
  .withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
  .groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
  .orderBy($"groupId").drop("groupId") // remove the groupId column
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF:
import scala.collection.mutable
import org.apache.spark.sql.Row

val getSortedValues = udf { (input: mutable.Seq[Row]) => input
  .map { case Row(id: Long, v: String) => (id, v) }
  .sortBy(_._1)
  .map(_._2)
}
Then, replace the row .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these rows:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way we necessarily preserve the order (with the price of sorting these small lists).

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * unpacks the generator so each column's aggregation is passed as a separate argument, so the return value will be 1 row x N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list of column names], [minimum frequency (default = 1%)])
This returns a dataframe with the frequent values of each column, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from here: check number of unique values in each column of a matrix in spark