Is it possible to find the hash (preferably SHA-256) of a full PySpark dataframe? I don't want the hash of individual rows or columns. I know a function exists in PySpark for column-level hash calculation: from pyspark.sql.functions import sha2
The requirement is to partition a big dataframe by year, find the hash value for each year (each small dataframe), and persist the result in a table.
Input:
(Product, Quantity, Store, SoldDate)
Read the data into a dataframe, partition it by SoldDate, calculate the hash for each partition, and write the result to a file/table.
Output:
(Date, hash)
The reason I am doing this is that the process runs daily, and I then need to check whether the hash has changed for any previous dates.
A file-level MD5 would be possible, but I don't want to generate files; I want to calculate the hash on the fly for the partitions/small dataframes based on dates.
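One possible approach, shown here only as a minimal sketch with the Scala DataFrame API (sha2, concat_ws, collect_list and sort_array are equally available in pyspark.sql.functions): hash every row, then combine the row hashes per SoldDate. The "|" separator and the simple null handling are assumptions.

import org.apache.spark.sql.functions._

// 1. Hash every row: concatenate all columns (implicitly cast to string) with a
//    separator and apply SHA-256. Note that concat_ws skips nulls, a simplification here.
val withRowHash = df.withColumn(
  "row_hash",
  sha2(concat_ws("|", df.columns.map(col): _*), 256)
)

// 2. Per SoldDate: sort the row hashes so the result is independent of row order,
//    concatenate them, and hash that again to get a single value per date.
val hashPerDate = withRowHash
  .groupBy(col("SoldDate"))
  .agg(sha2(concat_ws("", sort_array(collect_list(col("row_hash")))), 256).as("hash"))

hashPerDate then has the (Date, hash) shape described above and can be written to a table; on the next daily run the same computation can be compared against the persisted values.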
Related
I have a df1, built from distinct values, containing two columns: date and value. There is a df2 that has multiple columns but also contains the date and value columns. For each distinct value from df1, I want to filter df2 so that the records before the corresponding date from df1 are dropped. It would be rather easy for a single distinct value: I could filter by value and then use gt(lit(date)). However, I have over 500 such distinct pairs in df1. A single operation takes around 20 minutes, so using a loop is computationally not feasible. Perhaps somebody can advise me on a better methodology here.
I have tried multiple methodologies, but nothing has worked yet.
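One alternative to looping over the 500 pairs is to express all the filters as a single join. Below is only a minimal sketch using the Scala DataFrame API (the same join, broadcast and where calls exist in PySpark); the column names date and value are taken from the question.

import org.apache.spark.sql.functions._

// Rename df1's date so it does not collide with df2's date after the join.
val cutoffs = df1.withColumnRenamed("date", "cutoff_date")

// Join each df2 row to its per-value cutoff date and drop rows before the cutoff.
// df1 has only ~500 rows, so broadcasting it keeps the join cheap.
val filtered = df2
  .join(broadcast(cutoffs), Seq("value"))
  .where(col("date") >= col("cutoff_date"))
  .drop("cutoff_date")

This way the comparison for all 500 (date, value) pairs happens in one pass over df2 instead of 500 separate filter jobs.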
I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip, and dst_port, and I also want to have the latest value from the received_time column of the original dataframe.
I want a dataframe with the distinct src_ip values, their count, and the latest received_time in a new column named last_seen.
I know how to use .withColumn, and I think .map() could be used here as well.
Since I'm relatively new to this field, I really don't know how to proceed. I could really use your help to get this task done.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(count("received_time").as("row_count"), max(col("received_time")).as("max_received_time"))
The above calculates the count of received timestamps for each combination of the groupBy columns, as well as the maximum timestamp for that combination.
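If what is actually needed is one row per distinct src_ip with its count and the latest received_time exposed as last_seen (as described in the question), a minimal sketch along the same lines could be (the last_seen column name is taken from the question):

import org.apache.spark.sql.functions._

// One row per source IP: how many records were seen and when the latest one arrived.
val perSrcIp = df
  .groupBy(col("src_ip"))
  .agg(
    count(lit(1)).as("count"),                  // number of rows for this src_ip
    max(col("received_time")).as("last_seen")   // latest received_time for this src_ip
  )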
I have a situation where I have a list of rows with multiple columns and have to group them by a lookup key, which is a column present in all rows.
I have to sum up the values of a given column (say x) for all the rows with a given lookup key (say lookup_1), in Scala only, using groupBy, aggregate, or any other method available.
I can't convert to a dataframe and use aggregate.
I am sorry that I can't post code, because policy doesn't allow it.
Your quick help or guidance would be appreciated!
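Since no code can be posted, here is only a minimal sketch of the groupBy/sum pattern on plain Scala collections; the Map-based row representation and the column names (lookup, x) are assumptions, not the actual types involved:

// Assumed row representation: a Map from column name to value.
val rows: Seq[Map[String, Any]] = Seq(
  Map("lookup" -> "lookup_1", "x" -> 10),
  Map("lookup" -> "lookup_1", "x" -> 5),
  Map("lookup" -> "lookup_2", "x" -> 7)
)

// Group the rows by the lookup key, then sum column x within each group.
val sumsByLookup: Map[Any, Int] = rows
  .groupBy(_("lookup"))
  .map { case (key, group) => key -> group.map(_("x").asInstanceOf[Int]).sum }
// sumsByLookup: Map(lookup_1 -> 15, lookup_2 -> 7)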
I want to collect a specific Row from a Spark 1.6 DataFrame which originates from a partitioned Hive table (the table is partitioned by a String column named date and saved as Parquet).
A record is unambiguously identified by date, section, and sample.
In addition, I have the following constraints:
date and section are Strings, sample is a Long
date is unique and the Hive table is partitioned by date, but there may be more than one file on HDFS for each date
section is also unique across the dataframe
sample is unique for a given section
So far I use this query, but it takes quite a long time to execute (~25 seconds using 10 executors):
sqlContext.table("mytable")
  .where($"date" === date)
  .where($"section" === section)
  .where($"sample" === sample)
  .collect()(0)
I also tried replacing collect()(0) with take(1)(0), which is not faster.
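For reference, one variant I am considering is combining the filters and adding limit(1), so that Spark may stop once the single matching record is found; this is only a sketch and I have not verified that it is actually faster:

val row = sqlContext.table("mytable")
  .where($"date" === date && $"section" === section && $"sample" === sample)
  .limit(1)   // at most one record matches, so allow Spark to stop early
  .collect()
  .head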
What is the difference between CHECKSUM_AGG() and CHECKSUM()?
CHECKSUM calculates a hash for one or more values in a single row and returns an integer.
CHECKSUM_AGG is an aggregate function that takes a single integer value from multiple rows and calculates an aggregated checksum for each group.
They can be used together to checksum multiple columns in a group:
SELECT category, CHECKSUM_AGG(CHECKSUM(*)) AS checksum_for_category
FROM yourtable
GROUP BY category
CHECKSUM_AGG performs a checksum across all the values being aggregated, producing a single value.
It's typically used to see whether a collection of values (the group) has changed.
CHECKSUM is intended to build a hash index based on an expression or column list.
One example of using CHECKSUM is to store the hash value for an entire row in a column for later comparison.