Fastest way to get specific record in Spark DataFrame - scala

I want to collect a specific Row in a Spark 1.6 DataFrame which originates from a partitioned Hive table (the table is partitioned by a String column named date and saved as Parquet).
A record is unambiguously identified by date, section, sample.
In addition, I have the following constraints:
date and section are Strings, sample is Long
date is unique and the Hive table is partitioned by date, but there may be more than one file on HDFS for each date
section is also unique across the dataframe
sample is unique for a given section
So far I use this query, but it takes quite a long time to execute (~25 seconds using 10 executors):
sqlContext.table("mytable")
.where($"date"=== date)
.where($"section"=== section)
.where($"sample" === sample)
.collect()(0)
I also tried replacing collect()(0) with take(1)(0), which is not faster.
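For reference, the same lookup can be written with the three predicates combined and first() instead of collect()(0); this is only a sketch of an equivalent formulation, not a guaranteed speed-up (first() is just shorthand for take(1)(0), and the literal filter on the date partition column is what lets Hive partition pruning restrict the scan to that partition's files):

import org.apache.spark.sql.Row
import sqlContext.implicits._  // for the $"..." column syntax

// date, section and sample are the lookup values from the question
val row: Row = sqlContext.table("mytable")
  .where($"date" === date && $"section" === section && $"sample" === sample)
  .first()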

Related

Spark hash of full dataframe

Is it possible to find the hash (preferably SHA-256) value of a full PySpark dataframe? I don't want to find the hash of individual rows or columns. I know a function exists in PySpark for column-level hash calculation: from pyspark.sql.functions import sha2
The requirement is to partition a big dataframe by year and, for each year (a smaller dataframe), find the hash value and persist the result in a table.
Input
(Product, Quantity, Store, SoldDate)
Read the data in a dataframe, partition by SoldDate, calculate the hash for each partition and write to a file/table.
Output:
(Date, hash)
The reason I am doing this is that I have to run this process daily and then check whether the hash changed for any previous dates.
A file-level MD5 would be possible, but I don't want to generate files; I want to calculate the hash on the fly for the partitions/small dataframes based on dates.
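One way to sketch this (shown in Scala here, though the same functions exist in pyspark.sql.functions) is to hash each row with sha2 over a concatenation of its columns, then combine the sorted row hashes per SoldDate into one hash per date. The column names are the ones from the example input, and it assumes each date group is small enough for collect_list:

import org.apache.spark.sql.functions._

// df: DataFrame with the example columns Product, Quantity, Store, SoldDate
val perRow = df.withColumn(
  "row_hash",
  sha2(concat_ws("|", col("Product"), col("Quantity"), col("Store"), col("SoldDate")), 256)
)

// One hash per SoldDate: sort the row hashes so the result does not depend on row order,
// then hash their concatenation. Assumes each date's rows fit in a collect_list.
val perDate = perRow
  .groupBy(col("SoldDate"))
  .agg(sha2(concat_ws("", sort_array(collect_list(col("row_hash")))), 256).as("hash"))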

Union with Redshift native tables and external tables (Spectrum)

If I have a view that contains a union between a native table and external table like so (pseudocode):
create view vwPageViews as
select from PageViews
union all
select from PageViewsHistory
PageViews has data for the last 2 years. The external table has data older than 2 years.
If a user selects from the view with filters for the last 6 months, how does RS Spectrum handle it? Does it read the entire external table even though none of its rows will be returned (and accordingly cost us money for all of it)? (Assume the S3 files are Parquet-based.)
ex.
Select from vwPageViews where MyDate >= '01/01/2021'
What's the best approach for querying both cold and historical data using RS and Spectrum? Thanks!
How this will happen on Spectrum will depend on whether or not you have provided partitions for the data in S3. Without partitions (and a where clause on the partition) the Spectrum engines in S3 will have to read every file to determine if the needed data is in any of them. The cost of this will depend on the number and size of the files AND what format they are in. (CSV is more expensive than Parquet for example.)
The way around this is to partition the data in S3 and to have a WHERE clause on the partition value. This will exclude files from needing to be read when they don't match on the partition value.
The rub is in providing the WHERE clause for the partition, as this will likely be less granular than the date or timestamp you are using in your base data. For example, if you partition on YearMonth (YYYYMM) and want a day-level WHERE clause, you will need two parts to the WHERE clause - WHERE date_col >= '2015-07-12' AND part_col >= 201507. How to produce both WHERE conditions will depend on your solution around Redshift.
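As a sketch of deriving the coarser partition predicate from the day-level bound (in Scala, since the main thread uses it; date_col and part_col are the hypothetical column names from the example above):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Given a day-level lower bound, emit both predicates:
// one on the real date column and one on the YYYYMM partition column.
def pageViewPredicates(lowerBound: LocalDate): String = {
  val partitionBound = lowerBound.format(DateTimeFormatter.ofPattern("yyyyMM"))
  s"WHERE date_col >= '$lowerBound' AND part_col >= $partitionBound"
}

// pageViewPredicates(LocalDate.of(2015, 7, 12))
// => WHERE date_col >= '2015-07-12' AND part_col >= 201507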

How to populate a Spark DataFrame column based on another column's value?

I have a use-case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1", "col2", "col3", "col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip, dst_port, and I also want the latest value from the received_time column of the original dataframe.
I want a dataframe with distinct src_ip values, their count, and the latest received_time in a new column called last_seen.
I know how to use .withColumn, and I think .map() could also be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get this task done.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
import org.apache.spark.sql.functions.{col, count, max}

val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
The above calculates, for each group-by key, the count of received timestamps (row_count) as well as the maximum timestamp for that group (max_received_time).
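The question actually asks for distinct src_ip values with their count and the latest received_time exposed as last_seen; a minimal variant of the same pattern, grouping on src_ip only (column names taken from the question), could look like:

import org.apache.spark.sql.functions.{col, count, max}

val lastSeen = df
  .groupBy(col("src_ip"))
  .agg(
    count("received_time").as("count"),
    max(col("received_time")).as("last_seen")
  )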

How to load specific rows and columns from an Excel sheet through PySpark to a HIVE table?

I have an Excel file with 4 worksheets. In each worksheet the first 3 rows are blank, i.e. the data starts from row number 4 and continues for thousands of rows.
Note: As per the requirement I am not supposed to delete the blank rows.
My goals are below:
1) read the Excel file in Spark 2.1
2) ignore the first 3 rows, and read the data from the 4th row to row number 50. The file has more than 2000 rows.
3) convert all the worksheets from the Excel file to separate CSVs, and load them into existing HIVE tables.
Note: I have the flexibility of writing separate code for each worksheet.
How can I achieve this?
I can create a DataFrame to read a single file and load it to HIVE, but I guess my requirement needs more than that.
You could for instance use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki).
There you have the following options:
1) use Hive directly to read the Excel files and then CTAS to a table in CSV format
You would need to deploy the HadoopOffice Excel Serde:
https://github.com/ZuInnoTe/hadoopoffice/wiki/Hive-Serde
Then you need to create the table (see the documentation for all the options; the example reads from Sheet1 and skips the first 3 lines):
create external table ExcelTable(<INSERTHEREYOURCOLUMNSPECIFICATION>)
ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde'
STORED AS
  INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat'
  OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat'
LOCATION '/user/office/files'
TBLPROPERTIES(
  "hadoopoffice.read.simple.decimalFormat"="US",
  "hadoopoffice.read.sheet.skiplines.num"="3",
  "hadoopoffice.read.sheet.skiplines.allsheets"="true",
  "hadoopoffice.read.sheets"="Sheet1",
  "hadoopoffice.read.locale.bcp47"="US",
  "hadoopoffice.write.locale.bcp47"="US"
)
Then do CTAS into a CSV format table:
create table CSVTable ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' AS Select * from ExcelTable;
2) use Spark
Depending on the Spark version you have different options:
for Spark 1.x you can use the HadoopOffice FileFormat, and for Spark 2.x the Spark2 DataSource (the latter also includes support for Python). See the how-tos in the wiki linked above.
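As a rough sketch of option 2 with the Spark2 DataSource (this assumes Spark 2.x with a SparkSession named spark and the spark-hadoopoffice-ds package on the classpath; the option keys below are modelled on the TBLPROPERTIES in the Hive example above and should be checked against the HadoopOffice wiki for your version):

// Sketch only: read Sheet1, skipping the first 3 rows of every sheet, then write CSV.
// The hadoopoffice.* option keys are assumptions mirroring the Hive TBLPROPERTIES above.
val excelDf = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("hadoopoffice.read.locale.bcp47", "US")
  .option("hadoopoffice.read.sheets", "Sheet1")
  .option("hadoopoffice.read.sheet.skiplines.num", "3")
  .option("hadoopoffice.read.sheet.skiplines.allsheets", "true")
  .load("/user/office/files")

// Hypothetical output path; from here the CSV can be loaded into the existing HIVE table.
excelDf.write.mode("overwrite").csv("/user/office/csv_out")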

Ordered Data with Cassandra RandomPartitioner

I have about a billion pieces of data that I would like to store in Cassandra. The data items are ordered by time, and one of the main queries I'll be doing is to find the items between two time ranges, in order. I'd really prefer to use the RandomPartitioner, if at all possible. Is there a way to do this in Cassandra?
At first, since I'm coming from SQL, I assumed I should create each event as a row, but then it occurred to me that I was thinking about it the wrong way and I should really use columns. Columns in Cassandra seem to be ordered, but I'm confused as to just how ordered they are. If I use a time as the column name, is there a way for me to get all of the columns from one time to another in order?
Another thing I looked at was the 0.7 feature of secondary indices, but I've had trouble finding documentation for whether I can use these to view a range of things in order.
All I want is the Cassandra equivalent of this SQL: "Select * from Stuff where date > X and date < Y order by date asc". How can I do this?
The partitioner only affects the distribution of keys around the ring, not the order of columns within a key. Columns are always ordered according to the Column Comparator defined for the column family.
You can call get_slice with a SlicePredicate that specifies a SliceRange to get all the columns of a key within a range.
To model your data, you can create 1 row for each day (or suitable time shard) and have a column for each piece of data. Something like:
"yyyy-mm-dd" : {                          # key, one for each day
    timeStampMillis1:dataid1 : "value1"   # one column for each piece of data
    timeStampMillis2:dataid2 : "value2"
    timeStampMillis3:dataid3 : "value3"
}
The column names should be binary, using the binary comparator. The first 8 bytes are the timestamp, while the rest of the bytes are the id of the data.
Assuming X and Y are on the same day, to find all items between X and Y, do a get_slice on the day key with a SlicePredicate whose SliceRange specifies a start of X and a finish of Y+1. Both start and finish are 8-byte byte arrays.
To find data over multiple days, read from multiple keys.
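For the byte layout described above, here is a small sketch (in Scala, independent of any particular Cassandra client) of building the composite column name and the 8-byte big-endian bounds used for the SliceRange; the helper names are hypothetical:

import java.nio.ByteBuffer

// Column name layout from the answer above: first 8 bytes = timestamp (big-endian),
// remaining bytes = the id of the data.
def columnName(timestampMillis: Long, dataId: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(8 + dataId.length)
  buf.putLong(timestampMillis) // ByteBuffer is big-endian by default
  buf.put(dataId)
  buf.array()
}

// SliceRange bounds for "all columns between X and Y": 8-byte prefixes with no data id,
// using Y + 1 as the finish as described above.
def sliceBounds(x: Long, y: Long): (Array[Byte], Array[Byte]) =
  (ByteBuffer.allocate(8).putLong(x).array(),
   ByteBuffer.allocate(8).putLong(y + 1).array())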