Spark Streaming and Spark SQL considerations - Scala

I am using Spark Streaming (Scala) and receive records of customer calls to a call center through Kafka every 20 minutes. The records are converted to an RDD and later to a DataFrame so that I can use Spark SQL. I have a business use case where I want to identify all the customers who called more than 3 times in the last two hours.
What would be the best approach? Should I keep inserting the records from every batch into a Hive table and run a separate script that keeps querying for customers who made 3 calls in the last two hours, or is there a better way that uses Spark's in-memory capabilities?
Thanks.

For this kind of use case you can get the result with Spark alone (Hive is not required). You must have some unique customer ID, so you can prepare a query such as
ROW_NUMBER() OVER(PARTITION BY cust_id ORDER BY time DESC) as call_count
and then filter to the last two hours of data with call_count = 3 to get the expected result (a row numbered 3 exists only for customers with at least three calls in that window). You can then schedule this Spark script with crontab or any other autorun method.
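As a rough illustration of the "filter to the last two hours, then count per customer" idea, here is a minimal sketch in plain Python with an in-memory SQLite table standing in for the Spark DataFrame. It swaps the ROW_NUMBER() approach for a simpler GROUP BY ... HAVING count. The table name calls, the column call_time, and the helper frequent_callers are assumptions made for this example, not names from the question.

```python
import sqlite3
from datetime import datetime, timedelta


def frequent_callers(calls, now, window=timedelta(hours=2), min_calls=3):
    """Return sorted cust_ids with MORE than `min_calls` calls in the window ending at `now`.

    `calls` is an iterable of (cust_id, call_time) pairs; the schema is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (cust_id TEXT, call_time TEXT)")
    # Store timestamps as ISO-8601 strings so they compare correctly as text.
    conn.executemany(
        "INSERT INTO calls VALUES (?, ?)",
        [(cust, ts.isoformat(timespec="seconds")) for cust, ts in calls],
    )
    cutoff = (now - window).isoformat(timespec="seconds")
    rows = conn.execute(
        """
        SELECT cust_id
        FROM calls
        WHERE call_time >= ? AND call_time <= ?
        GROUP BY cust_id
        HAVING COUNT(*) > ?
        ORDER BY cust_id
        """,
        (cutoff, now.isoformat(timespec="seconds"), min_calls),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


now = datetime(2024, 1, 1, 12, 0)
sample = [
    ("A", datetime(2024, 1, 1, 10, 5)),
    ("A", datetime(2024, 1, 1, 10, 30)),
    ("A", datetime(2024, 1, 1, 11, 0)),
    ("A", datetime(2024, 1, 1, 11, 45)),   # 4 calls inside the window
    ("B", datetime(2024, 1, 1, 9, 0)),     # outside the window
    ("B", datetime(2024, 1, 1, 10, 10)),
    ("B", datetime(2024, 1, 1, 11, 0)),
    ("B", datetime(2024, 1, 1, 11, 30)),   # only 3 calls inside the window
    ("C", datetime(2024, 1, 1, 11, 50)),
]
print(frequent_callers(sample, now))
```

A SELECT of the same shape could presumably be run through spark.sql() against a temporary view holding the last two hours of batches, but that wiring is not shown here.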


Create Dataframe in PySpark with a UDF GROUPED_MAP per group

I need to train an ML model per client_id. I have around 100,000 clients, so the same number of models. I use Spark with a GROUPED_MAP UDF to train one ML model per client in parallel.
df.groupBy('client_id').apply(train_ml_model)
This works very well and distributes each training job per client_id to a worker node.
This UDF only works on a single DataFrame that has a column to group by. The problem is that my data must first be queried from a warehouse per client_id, so there is no single DataFrame yet.
In the current setup, the main DataFrame must be created first, but this takes many hours. A query like SELECT data from datawarehouse WHERE ID IN (client_IDs) collects the data for all client IDs. I use spark.read.load() for this.
There are several options to overcome this long loading time.
Option 1:
Is there a way to do this Spark SQL load inside the GROUPED_MAP function? That is, run a SELECT data FROM datawarehouse per client_id on the worker nodes and retrieve only the data for that specific client_id.
If I try that, I receive an error saying that the JDBC functionality cannot work on a worker node.
Option 2:
First load the main DataFrame using the parameters lowerBound, upperBound and numPartitions to speed up the query. But this will take several hours as well.
Are there other options in Spark?
Can you please share the full load command? Are you using .option("dbtable", "SELECT data from datawarehouse WHERE ID IN (client_IDs)") while loading? This filters at the source and sends only the required rows to Spark.
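One way to combine "filter at the source" with a parallel load is to give the JDBC reader one WHERE predicate per partition, each covering a chunk of client IDs. The sketch below only builds the predicate strings in plain Python. The helper name build_predicates, the chunk size, and the assumption that IDs are integers are all illustrative, and whether this is actually faster depends on the warehouse.

```python
def build_predicates(client_ids, chunk_size):
    """Split client IDs into chunks and return one SQL predicate per chunk.

    Assumes integer IDs; int() also guards against injecting arbitrary text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ids = sorted(set(int(i) for i in client_ids))
    predicates = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        id_list = ", ".join(str(i) for i in chunk)
        predicates.append(f"ID IN ({id_list})")
    return predicates


preds = build_predicates([3, 1, 2, 5, 4], chunk_size=2)
print(preds)

# Hypothetical use with PySpark (url and props are placeholders, not from the question):
# df = spark.read.jdbc(url, "datawarehouse", predicates=preds, properties=props)
# Each predicate becomes one partition, so each task reads only its own clients.
```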

Good way to ensure data is ready to be queried while Glue partition is created?

We have queries that run on a schedule every several minutes and join a few different Glue tables (via Athena) before coming up with some results. For the table in question, we have Glue Crawlers set up and partitions based on snapshot_date and a couple of other columns.
In the query, we grab the latest snapshot_date and use only data from that snapshot_date. The data in S3 gets updated and put into the right folder a few times a day, but it looks like sometimes, if we query right as the data in S3 is being updated, we end up with empty results. Our guess is that the query tries to access the new snapshot_date partition while Glue is still setting up the data.
Is there a built-in way to ensure that our Glue partitions are ready before we start querying them? So far, we have considered building artificial time "buffers" into our query around when we expect the snapshot_date partition data to be written and the Glue update to be complete, but I'm aware that this is really brittle and depends on the exact timing.

Get Unique rows in SELECT query using JPA in POSTGRESQL

I am working on a Spring Batch application in Spring Boot that will run as two different instances, and I have a scenario in which I have to retrieve unique rows from a table. By unique I mean one row per instance. For example,
id language
1 java
2 python
if I have two rows and call a SELECT query with a limit of one, the first instance should get id 1 and the second instance should get id 2. So far I have tried the JPA lock @Lock(value = LockModeType.PESSIMISTIC_WRITE). This doesn't work: each time I get the same row. I have also tried JdbcTemplate with SELECT * FROM some_table LIMIT 1 FOR UPDATE SKIP LOCKED. This is not working either. My Postgres version is 10.3. Is there a way to achieve this?
The number of instances of my application might grow in the future, so I want to handle this as well.
Thanks in advance.
You want each instance to process a different partition of your table. In this case, I would recommend using a partitioned step.
For example, you can partition the table by even/odd IDs and make each instance process one partition. This is IMO better than locking the table and using LIMIT 1 to force each instance to read one row (this won't work, as you mentioned, and even if it did, it would perform very poorly).
You can find a sample job of how to partition a table here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/resources/jobs/partitionJdbcJob.xml along with the corresponding partitioner here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/java/org/springframework/batch/sample/common/ColumnRangePartitioner.java
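To make the partitioning idea concrete, here is a small Python sketch of splitting an ID column into contiguous ranges, one per instance. It is written in the spirit of a column-range partitioner but is not a copy of the linked Java class. The function name column_ranges and its parameters are assumptions for illustration.

```python
def column_ranges(min_id, max_id, grid_size):
    """Split the inclusive ID range [min_id, max_id] into at most `grid_size` chunks.

    Returns a list of (start, end) pairs; each worker would then run something like
    SELECT ... WHERE id BETWEEN start AND end on its own pair.
    """
    if grid_size <= 0:
        raise ValueError("grid_size must be positive")
    target_size = (max_id - min_id) // grid_size + 1
    ranges = []
    start = min_id
    while start <= max_id:
        # Clamp the last chunk so it never runs past max_id.
        end = min(start + target_size - 1, max_id)
        ranges.append((start, end))
        start += target_size
    return ranges


print(column_ranges(1, 10, 3))
```

Because the ranges do not overlap, adding more instances later only means calling the function with a larger grid_size; no row locking is needed for instances to avoid each other's rows.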

Spark Streaming dropDuplicates

Spark 2.1.1 (Scala API), streaming JSON files from an S3 location.
I want to deduplicate incoming records based on an ID column (“event_id”) found in the JSON of every record. I do not care which record is kept, even if the duplication of the record is only partial. I am using append mode because the data is merely being enriched/filtered via the spark.sql() method, with no group by/window aggregations. I then use append mode to write Parquet files to S3.
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not effective in long-running production). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
That error seems odd as I am doing no aggregation (unless dropDuplicates or sparkSQL counts as an aggregation?).
I know that duplicates won’t occur more than 3 days apart, so I tried again with a watermark (using .withWatermark() immediately before the dropDuplicates). However, it seems to want to wait until the 3 days are up before writing the data (i.e. since today is July 24, only data up to the same time on July 21 is written to the output).
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?
Thanks
In my case, I have achieved this in two ways with DStreams:
One way:
1) load tmp_data (containing 3 days of unique data, see below)
2) receive batch_data and do a leftOuterJoin with tmp_data
3) filter the result of step 2 and output the new unique data
4) update tmp_data with the new unique data from step 2's result and drop old data (more than 3 days old)
5) save tmp_data to HDFS or wherever
6) repeat the steps above again and again
Another way:
1) create a table in MySQL with a UNIQUE INDEX on event_id
2) receive batch_data and just save event_id + event_time + whatever to MySQL
3) MySQL will skip duplicates (for example, when the rows are written with INSERT IGNORE)
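The first approach (keep a few days of already-seen IDs, filter each batch against them, then evict old entries) can be sketched in plain Python as follows. This is only an in-memory illustration of the bookkeeping, not DStream code. The class name RecentIdFilter and the choice to measure "old" against the newest event time seen are assumptions of the sketch.

```python
from datetime import datetime, timedelta


class RecentIdFilter:
    """Keeps event IDs seen within `retention` and drops repeats from new batches."""

    def __init__(self, retention=timedelta(days=3)):
        self.retention = retention
        self.seen = {}  # event_id -> event_time of the first record kept

    def process_batch(self, batch):
        """`batch` is an iterable of (event_id, event_time); returns the new unique records."""
        fresh = []
        for event_id, event_time in batch:
            if event_id in self.seen:
                continue  # duplicate of something already emitted, or repeated within this batch
            self.seen[event_id] = event_time
            fresh.append((event_id, event_time))
        if self.seen:
            # Evict IDs older than the retention period, relative to the newest event seen.
            cutoff = max(self.seen.values()) - self.retention
            self.seen = {k: v for k, v in self.seen.items() if v >= cutoff}
        return fresh


dedup = RecentIdFilter()
t0 = datetime(2017, 7, 21)
print(dedup.process_batch([("a", t0), ("b", t0), ("a", t0)]))
print(dedup.process_batch([("a", t0 + timedelta(days=1)), ("c", t0 + timedelta(days=1))]))
```

In a real job the seen dictionary would correspond to tmp_data and would have to be persisted between batches (step 5), which this sketch leaves out.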
The solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a Hive table after dropping duplicates within the batch and performing a left anti join against the previous few days' worth of data in the target Hive table.

hadoop - large database query

Situation: I have a Postgres DB that contains a table with several million rows and I'm trying to query all of those rows for a MapReduce job.
From the research I've done on DBInputFormat, Hadoop might try and use the same query again for a new mapper and since these queries take a considerable amount of time I'd like to prevent this in one of two ways that I've thought up:
1) Limit the job to only run 1 mapper that queries the whole table and call it good.
or
2) Somehow incorporate an offset in the query so that if Hadoop does try to use a new mapper it won't grab the same stuff.
I feel like option (1) seems more promising, but I don't know if such a configuration is possible. Option (2) sounds nice in theory, but I have no idea how I would keep track of the mappers being created, or whether it is possible at all to detect that and reconfigure.
Help is appreciated. I'm mainly looking for a way to pull all of the DB table data without having several copies of the same query running, because that would be a waste of time.
DBInputFormat essentially does already do your option 2. It does use LIMIT and OFFSET in its queries to divide up the work. For example:
Mapper 1 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100
Mapper 2 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100 OFFSET 100
Mapper 3 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100 OFFSET 200
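To illustrate how such per-mapper queries divide a table, here is a small Python sketch that generates them from a row count and a number of mappers. It mirrors the pattern shown above rather than Hadoop's actual implementation. The function name split_queries and the rule that the last mapper takes the remainder are assumptions of the sketch.

```python
def split_queries(base_query, order_by, total_rows, num_mappers):
    """Return one LIMIT/OFFSET query per mapper so that no row is read twice."""
    if num_mappers <= 0:
        raise ValueError("num_mappers must be positive")
    chunk = total_rows // num_mappers
    queries = []
    for i in range(num_mappers):
        offset = i * chunk
        # The last mapper also picks up whatever is left after integer division.
        limit = total_rows - offset if i == num_mappers - 1 else chunk
        queries.append(
            f"{base_query} ORDER BY {order_by} LIMIT {limit} OFFSET {offset}"
        )
    return queries


for q in split_queries("SELECT field1, field2 FROM mytable", "keyfield", 300, 3):
    print(q)
```

The ORDER BY on a stable key is what keeps the slices disjoint; without a deterministic ordering, two mappers could see overlapping rows.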
So if you have proper indexes on the key field, you probably shouldn't mind that multiple queries are being run. Where you do get some possible rework is with speculative execution. Sometimes Hadoop will schedule multiple copies of the same task and simply use the output from whichever finishes first. If you wish, you can turn this off by setting the following property:
mapred.map.tasks.speculative.execution=false
However, all of this is out the window if you don't have a sensible key for which you can efficiently do these ORDER, LIMIT, OFFSET queries. That's where you might consider using your option number 1. You can definitely do that configuration. Set the property:
mapred.map.tasks=1
Technically, the InputFormat gets "final say" over how many Map tasks are run, but DBInputFormat always respects this property.
Another option to consider is a utility called Sqoop, which is built for transferring data between relational databases and Hadoop. This would make it a two-step process, however: first copy the data from Postgres to HDFS, then run your MapReduce job.