Create a DataFrame in PySpark with a GROUPED_MAP UDF per group - pyspark

I need to train an ML model per client_id. I have around 100,000 clients, so the same number of models. I use Spark with the GROUPED_MAP UDF to train an ML model per client in parallel.
df.groupBy('client_id').apply(train_ml_model)
This works very well and splits the training job for each client_id onto a worker node.
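For reference, a minimal sketch of this setup; the result schema and the training logic inside train_ml_model are hypothetical placeholders:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Hypothetical output schema: one summary row per client.
result_schema = "client_id string, score double"

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def train_ml_model(pdf):
    # pdf holds all rows for a single client_id; train the model here
    # and return rows conforming to result_schema.
    client_id = pdf["client_id"].iloc[0]
    score = 0.0  # placeholder for real training/evaluation logic
    return pd.DataFrame({"client_id": [client_id], "score": [score]})

results = df.groupBy("client_id").apply(train_ml_model)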
This UDF only works on a single DataFrame with a column to do the groupBy on. But the problem is that my data must first be queried from a data warehouse per client_id, so there is no single DataFrame yet.
In the current setup, the main DataFrame must be created first, but this takes many hours. There is a query like SELECT data FROM datawarehouse WHERE ID IN (client_IDs) to collect the data for all client IDs. I use spark.read.load() for this.
There are several options to overcome this long loading time.
Option 1:
Is there a way to use this Spark SQL loading inside the GROUPED_MAP functionality, so that each worker node runs a SELECT data FROM datawarehouse for its own client_id and only retrieves the data for that specific client?
If I try that, I receive an error that the JDBC functionality cannot be used on a worker node.
Option 2:
This option would be to first load the main DataFrame using the lowerBound, upperBound and numPartitions parameters to speed up the query. But this also takes several hours.
Are there other options possible in Spark?

Can you please share the full load command? Are you using .option("dbtable", "SELECT data FROM datawarehouse WHERE ID IN (client_IDs)") while loading? This will filter at the source and only send the required rows to Spark.
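Along these lines, a sketch of a JDBC load that pushes the filter down to the warehouse and parallelizes the read; the URL, credentials, bounds and partition count are hypothetical and depend on your warehouse and driver:

# Wrap the query as an aliased subquery so the filter runs at the source,
# and split the read over numPartitions parallel JDBC connections.
pushdown_query = "(SELECT data, ID FROM datawarehouse WHERE ID IN (client_IDs)) AS src"

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/db")  # hypothetical URL
    .option("dbtable", pushdown_query)
    .option("user", "user")
    .option("password", "password")
    .option("partitionColumn", "ID")   # must be a numeric, date or timestamp column
    .option("lowerBound", "1")         # only used to compute the partition ranges
    .option("upperBound", "100000")
    .option("numPartitions", "64")
    .load()
)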

Related

Loading dataframe from dynamodb - aws glue pyspark

I'm trying to read the records from a DynamoDB table. I have tried to use a dynamic frame, but since I have 8 million records in the table, filtering takes too long. I don't need to load all 8 million records into a DataFrame. Instead of applying a filter on the dynamic frame, I want to know whether there is an option to load the DataFrame by passing a query, so that only the few records I need are loaded and it works faster.
You can load a DataFrame by passing a query to spark.sql(), but before that you will have to run an AWS Glue crawler on the DynamoDB table so that a corresponding table is created in the AWS Glue Data Catalog. You can then use this catalog table to read the data into a Spark DataFrame directly.
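For example (the database and table names are hypothetical, and this assumes the Spark session is configured to use the Glue Data Catalog as its metastore):

# After the crawler has created the catalog table, query it with Spark SQL
# so that only the rows you need end up in the DataFrame.
df = spark.sql("""
    SELECT *
    FROM my_glue_database.my_dynamodb_table  -- names created by the crawler
    WHERE some_column = 'some_value'         -- hypothetical filter
""")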

Can you use a data flow sink as a source in the same data flow?

I am trying to load sales data into the database using Azure Synapse Analytics pipelines, and the strategy is as follows (the scenario is made up):
Load the students' data into the table Students.
Load the students' class information into the table StudentsClasses. In this data flow I need to join the data with the Students table (obviously, the new student data must already have been loaded into Students by this join step).
Can I have these two processes in the same data flow with sink ordering? Or does the sink ordering not define the source read ordering? (That is, are the source reads and transformations done in parallel, and only the writes follow the ordering?)
Edit: This is an example data flow that I want to implement:
source3 and sink1 are the same table. What I want is to first populate sink1, then read it back as source3 and join it with source2. Can this be implemented using sink ordering? Or will source3 be empty regardless of the sink ordering?
Yes, you can use multiple sources and sinks in a single data flow, reference the same table across a join, and order the sink writes using the Custom sink ordering property.
I am using an inline dataset here, but you can use any type.
Use the inline dataset to store the result in sink1. In source3, use the same inline dataset to join with source2.
Make sure you set the sink order correctly. If the order is wrong, or if a transformation encounters no data, the data flow will still publish without errors, but the pipeline run will fail.
Refer to the MS doc: Sink ordering

Good way to ensure data is ready to be queried while Glue partition is created?

We have queries that run on a schedule every several minutes that join a few different Glue tables (via Athena) before coming up with some results. For the table in question, we have Glue crawlers set up and partitions based on snapshot_date and a couple of other columns.
In the query, we grab the latest snapshot_date and use only data from that snapshot_date. The data in S3 gets updated and put into the right folder a few times a day, but it looks like sometimes, if we query right as the data in S3 is being updated, we end up with empty results because the query tries to access the new snapshot_date partition while Glue is still getting the data set up(?)
Is there a built-in way to ensure that our Glue partitions are ready before we start querying them? So far, we have considered building artificial time "buffers" into our query around when we expect the snapshot_date partition data to be written and the Glue update to be complete, but I'm aware that that's really brittle and depends on the exact timing.

spark streaming and spark sql considerations

I am using Spark Streaming (Scala) and receiving records of customer calls to a call center through Kafka every 20 minutes. The records are converted to an RDD and later to a DataFrame in order to use Spark SQL. I have a business use case where I want to identify all the customers who called more than 3 times in the last two hours.
What would be the best approach? Should I keep inserting all the records received in every batch into a Hive table and run a separate script that keeps querying who made 3 calls in the last two hours, or is there a better way using Spark's in-memory capabilities?
Thanks.
For this kind of use case you can get the result using Spark alone (there is no need for Hive). You must have some unique customer id, so you can prepare a query like
ROW_NUMBER() OVER(PARTITION BY cust_id ORDER BY time DESC) as call_count
and then apply a filter for the last 2 hours of data with call_count = 3 to get the expected result. You can then schedule this Spark script with crontab or any other scheduler.
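A minimal PySpark sketch of this approach (the same window logic applies in Scala); calls_df, cust_id and call_time are assumed names for the batch DataFrame and its columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each customer's calls from most recent to oldest.
w = Window.partitionBy("cust_id").orderBy(F.col("call_time").desc())

customers = (
    calls_df
    # keep only the last two hours of data
    .filter(F.expr("call_time >= current_timestamp() - INTERVAL 2 HOURS"))
    .withColumn("call_count", F.row_number().over(w))
    # a row with call_count = 3 exists only for customers with at least 3 calls
    # (use 4 here if you strictly need "more than 3")
    .filter(F.col("call_count") == 3)
    .select("cust_id")
    .distinct()
)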

Group by In HBase

I know almost nothing about HBase. Sorry for the basic questions.
Imagine I have a table of 100 billion rows with 10 int columns, one datetime column, and one string column.
Does HBase allow querying this table and grouping the result by a key (even a composite key)?
If so, does it have to run a map/reduce job to it?
How do you feed it the query?
Can HBase in general perform real-time like queries on a table?
Data aggregation in HBase intersects with the "real time analytics" need. While HBase is not built for this type of functionality, there is a lot of demand for it, so a number of ways to do it exist or are being developed.
1) Register the HBase table as an external table in Hive and do the aggregations there. Data will be accessed via the HBase API, which is not that efficient. "Configuring Hive with HBase" is a discussion of how this can be done.
This is the most powerful way to group HBase data. It does imply running MR jobs, but they are run by Hive, not by HBase.
2) You can write your own MR job working with the HBase data sitting in HFiles in HDFS. This is the most efficient way, but it is not simple, and the data you process will be somewhat stale. It is the most efficient because the data is not transferred via the HBase API; instead it is accessed directly from HDFS in a sequential manner.
3) The next version of HBase will contain coprocessors, which will be able to do aggregations inside specific regions. You can think of them as a kind of stored procedure from the RDBMS world.
4) An in-memory, inter-region MR job, parallelized within one node, is also planned for future HBase releases. It will enable somewhat more advanced analytical processing than coprocessors.
FAST RANDOM READS = PRE-PREPARED data sitting in HBase!
Use HBase for what it is...
1. A place to store a lot of data.
2. A place from which you can do super fast reads.
3. A place where SQL is not going to do you any good (use Java).
Although you can read data from HBase and do all sorts of aggregates right in Java data structures before you return your aggregated result, it's best to leave the computation to MapReduce. From your questions, it seems as if you want the source data for the computation to sit in HBase. If this is the case, the route you want to take is to have HBase as the source data for a MapReduce job, do the computations on that, and return the aggregated data. But then again, why would you read from HBase to run a MapReduce job? Just leave the data sitting in HDFS/Hive tables, run MapReduce jobs on them, and THEN load the data into HBase tables "pre-prepared" so that you can do super fast random reads from it.
Once you have the preaggregated data in HBase, you can use Crux http://github.com/sonalgoyal/crux to further drill, slice and dice your HBase data. Crux supports composite and simple keys, with advanced filters and group by.