Several groupBy aggregations by different columns in PySpark

I have a big table in Databricks with 500 columns, and I want to compute a groupBy for each column, like this:
list_grp = [df_ref_crt.groupBy('SEGMENT',colum).count().withColumnRenamed(colum,"new_name").withColumn('variable',lit(colum)) for colum in df_ref_crt.columns]
The code works but is too slow, so my question is whether there is a faster way to do this.
(The purpose of this is to calculate PSI, and this is the reference dataset; of course there is another groupBy that I have to do with respect to the same variables and period.)
thanks....

You can improve Databricks performance by implementing the suggestions below; they should improve your execution time:
Use a larger cluster: a general cluster configuration with two workers of four cores each takes forever to do anything. It's actually not any more expensive to use a large cluster for a workload than it is to use a smaller one; it's just faster.
Note that you’re renting the cluster for the length of the workload. So, if you spin up that two worker cluster and it takes an hour, you’re paying for those workers for the full hour. However, if you spin up a four worker cluster and it takes only half an hour, the cost is actually the same!
Delta Cache: If you're using regular clusters, be sure to use the L series or E series on Azure Databricks. These have fast SSDs and caching enabled by default (a short sketch of enabling the cache explicitly follows below).
You can refer to the official Microsoft documentation to learn more about optimization.
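Regarding the Delta cache point above, here is a minimal sketch (assuming an Azure Databricks notebook where the spark session is predefined); on L/E-series workers the cache is usually already on by default, but it can be switched on explicitly for a session:

// Minimal sketch: enable the Databricks disk cache ("Delta cache") for this session.
// Assumes a Databricks notebook where `spark` (a SparkSession) is predefined.
spark.conf.set("spark.databricks.io.cache.enabled", "true")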

Related

Spark performance: local faster than cluster (very uneven executor load)

Let me start off by saying that I'm relatively new to Spark, so if I'm saying something that doesn't make sense, please correct me.
Summarising the problem: no matter what I do, at certain stages one executor does all the computation, which makes cluster execution slower than local, single-processor execution.
Full story:
I've written a Spark 1.6 application which consists of a series of maps, filters, joins and a short GraphX part. The app uses only one data source - a CSV file. For development I created a mockup dataset consisting of 100,000 rows (7 MB), with all of the fields containing random data with uniform distribution (randomly sorted in the file as well). The joins are self inner joins on a PairRDD on various fields (the dataset has duplicate keys, with ~200 duplicates per key imitating real data), leading to a Cartesian product within each key. Then I perform a number of map and filter operations on the result of the joins, store it as an RDD of custom-class objects and save everything as a graph at the end.
I developed the code on my laptop and ran it there, which took about 5 minutes (Windows machine, local file). To my surprise, when I deployed the jar onto the cluster (master yarn, cluster mode, CSV file in HDFS) and submitted it, the code took 8 minutes to execute.
I've run the same experiment with smaller data, and the results were 40 seconds locally and 1.1 minutes on the cluster.
When I looked at the history server I saw that two stages are particularly long (almost 4 minutes each), and in these stages there is one task that takes >90% of the time. I ran the code multiple times and it was always the same task that took so long, even though it was deployed on a different data node each time.
To my surprise, when I opened the executors tab I saw that one executor does almost all of the work (in terms of time spent) and executes most of the tasks. In the screenshot provided, the second most 'active' executor had 50 tasks, but that's not always the case (in a different submission the second busiest executor had 15 tasks, and the leading one 95).
Moreover, I saw that 3.9 minutes of the time is spent on computation (second screenshot), mostly on the joined data shortly after a map. I thought that the data might not be partitioned equally and that one executor had to perform all the computation. Therefore, I tried to partition the PairRDD manually (using .partitionBy(new HashPartitioner(40))) right before the join (similar execution time) or right after the join (execution even slower).
What could be the issue? Any help will be appreciated.
It's hard to tell without seeing your queries and understanding your Dataset, I'm guessing you didn't include it either because it's very complex or sensitive? So this is a little bit of a shot in the dark, however this looks a lot like a problem we dealt with on my team at work. My rough guess at what is happening is that during one of your joins, you have a key space that has a high cardinality, but very uneven distribution. In our case, we were joining on sources of web traffic, which while we have thousands of possible sources of traffic, the overwhelming majority of the traffic comes from just a few. This caused a problem when we joined. The keys would be distributed evenly among the executors, however since maybe 95% of the data shared maybe 3 or 4 keys, a very small number of executors were doing most of the work.
When you find a join that suffers from this, the thing to do is to pick the smaller of the two datasets and explicitly perform a broadcast join. (Spark normally will try to do this, but it's not always perfect at being able to tell when it should.)
To do this, let's say you have two DataFrames. One of them has two columns, number and stringRep, where number has one row for each integer from 0 to 10,000 and stringRep is just a string representation of that, so "one", "two", "three", etc. We'll call this numToString.
The other DataFrame has some key column called kind to join against number in numToString, some other irrelevant data, and 100,000,000 rows. We'll call this DataFrame ourData. Then let's say that the distribution of the 100,000,000 rows in ourData is: 90% have kind == 1, 5% have kind == 2, and the remaining 5% are distributed pretty evenly amongst the remaining 99,998 numbers. When you run the following code:
// Hypothetical loaders for the two example DataFrames described above
val numToString: DataFrame = loadNumToString()
val ourData: DataFrame = loadOurData()
val joined = ourData.join(numToString).where(ourData("kind") === numToString("number"))
...it is very likely that Spark will send 90% of the data (that which has kind == 1) to one executor, 5% of the data (that which has kind == 2) to another executor, and the remaining 5% smeared across the rest, leaving two executors with huge partitions and the rest with very tiny ones.
The way around this as I mentioned before is to explicitly perform a broadcast join. What this does is take one DataFrame and distribute it entirely to each node. So you would do this instead:
import org.apache.spark.sql.functions.broadcast

val joined = ourData.join(broadcast(numToString)).where(ourData("kind") === numToString("number"))
...which would send numToString to each executor. Assuming that ourData was evenly partitioned beforehand, the data should remain evenly partitioned across executors. This might not be your problem, but it does sound a lot like a problem we were having. Hope it helps!
More information on broadcast joins can be found here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html

Overhead of using many partitions in Postgres

In the following link, the creator of a tool I use (Airflow) suggests creating partitions for daily snapshots of dimension tables. I am wondering about the overhead of doing something like this in Postgres.
I am using Postgres 10's built-in partitioning for several tables, but mostly at a monthly or yearly level for facts. I have never tried implementing daily partitions for dimensions before, and it seems scary. It would simplify things for me in several areas, though, in case I need to rerun old tasks.
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
Simple. With dimension snapshots where a new partition is appended at each ETL schedule. The dimension table becomes a collection of dimension snapshots where each partition contains the full dimension as-of a point in time. “But only a small percentage of the data changes every day, that’s a lot of data duplication!”. That’s right, though typically dimension tables are negligible in size in proportion to facts. It’s also an elegant way to solve SCD-type problematic by its simplicity and reproducibility. Now that storage and compute are dirt cheap compared to engineering time, snapshoting dimensions make sense in most cases.
While the traditional type-2 slowly changing dimension approach is conceptually sound and may be more computationally efficient overall, it’s cumbersome to manage. The processes around this approach, like managing surrogate keys on dimensions and performing surrogate key lookup when loading facts, are error-prone, full of mutations and hardly reproducible.
I have worked with systems with different levels of partitioning.
Generally any partitioning is OK as long as you have check constraints on the partitions which allow the query planner to find the relevant partitions for a query (or you will have to query a specific partition directly in some special cases). Otherwise you will see sequential scans over all partitions even for simple queries.
Daily partitions are completely OK, do not worry. I even worked with a data collector based on PG which needed partitions for every 5 minutes of data because it collected several TB per day.
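As a minimal sketch of what a daily dimension-snapshot partition can look like with Postgres 10 declarative range partitioning (the table, columns and JDBC connection details here are hypothetical, not taken from the question):

import java.sql.DriverManager

// Hypothetical connection; adjust the URL and credentials for your warehouse.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/dw", "etl", "secret")
val stmt = conn.createStatement()

// Parent table partitioned by snapshot date (Postgres 10+ declarative partitioning).
stmt.execute(
  """CREATE TABLE IF NOT EXISTS dim_customer (
    |  snapshot_date date   NOT NULL,
    |  customer_id   bigint NOT NULL,
    |  attributes    jsonb
    |) PARTITION BY RANGE (snapshot_date)""".stripMargin)

// One partition per daily ETL run; the range bounds act like check constraints,
// so the planner can prune partitions when a query filters on snapshot_date.
stmt.execute(
  """CREATE TABLE IF NOT EXISTS dim_customer_2019_01_01
    |  PARTITION OF dim_customer
    |  FOR VALUES FROM ('2019-01-01') TO ('2019-01-02')""".stripMargin)

conn.close()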
The number of partitions becomes a bigger problem only when you have several thousand or tens of thousands of partitions - at that scale you move to a different level of problems.
For example, you will have to set a proper max_locks_per_transaction to be able to work with them, because even a simple select over the parent table places an AccessShareLock on all partitions, which is not exactly nice, but PG inheritance works this way.
Plus there is higher planning time per query - in our data warehouse we sometimes see planning times of several minutes for queries that take only seconds to execute, which is a bit crappy... but it is hard to do anything about it because the current PG planner works this way.
But the pros still outweigh the cons, so I highly recommend using whatever partitioning granularity you need.

Cassandra as replacement to PostgreSQL

Is Cassandra with multiple nodes a good choice as a replacement for single-node PostgreSQL? The data being stored is a time series. It is already tens of gigabytes and is expected to grow. The database should be integrated into a pipeline with Apache Spark as the source and possibly as the destination for results.
What is needed:
1) redundancy: one node failure shouldn't stop the system (all data should be available)
2) speed: more nodes - less time per single insert/select for one client
3) concurrency: more nodes - better speed for simultaneous inserts/selects from different clients
For your points:
1) This is up to you when choosing the keyspace replication factor (RF) and the consistency levels (CL) of your inserts and selects. To be available and consistent you need RF=3 on your keyspace and CL.QUORUM for both inserts and selects to handle the loss of one node (for QUORUM you need RF/2+1 nodes online - integer division - so 3/2+1=2; with RF=5 you would need 5/2+1=3 nodes online, so you can handle the loss of 2). See the small sketch after this list.
2) A single request will be handled by a single node acting as coordinator in your cluster. You do not gain much performance here with single, synchronous requests. If you issue many requests and use async, you will spread your requests across more nodes and gain performance.
3) With more clients you have the same effect - the coordinator will be picked at random (OK, there is the TokenAwarePolicy, which will pick an appropriate coordinator).
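As a tiny sketch of the quorum arithmetic from point 1) (nothing Cassandra-specific, just the integer division spelled out):

// Quorum size under integer division, and how many replicas can be down
// while still satisfying CL.QUORUM for a given replication factor.
def quorum(rf: Int): Int = rf / 2 + 1
def tolerableDown(rf: Int): Int = rf - quorum(rf)

Seq(3, 5).foreach { rf =>
  println(s"RF=$rf -> QUORUM needs ${quorum(rf)} replicas online, tolerates ${tolerableDown(rf)} down")
}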
You've mentioned that you use time series data.
1. Naturally, you can vary the replication factor and consistency level. So yes, Cassandra would be good as a replacement.
2. Inserts would be really fast, as Cassandra writes to memory first. So yes, Cassandra would be good as a replacement.
3. Cassandra has linear horizontal scalability. So yes, Cassandra would be good as a replacement.
The drawback is that Cassandra is a key-value store, so you should model the table structure around your queries. PostgreSQL, as an RDBMS, is more flexible and supports the whole set of SQL operations.
You can read more about some pros and cons of using Cassandra with time series data here and here.

Spark/Scala parallel write to redis

Is it possible to write to Redis in parallel from spark?
(Or: how to write tens of thousands of keys/lists quickly from spark)
Currently, I'm writing to Redis key by key in sequence, and it's taking forever. I need to write about 90,000 lists (of length 2-2000), and speed is extremely important; right now it's taking on the order of 1 hour. Traditional benchmarks of Redis claim thousands of writes per second, but in my pipeline, I'm not anywhere near that.
Any help is appreciated.
A single Redis instance runs in one thread, so operations are inherently sequential. If you have a Redis cluster, then the instance to which a datum is written depends on a hash slot calculated from the key being written. This hash function (amongst other things) ensures that the load gets distributed across all the Redis instances in the cluster. If your cluster has N instances, you can execute (almost) at most N parallel writes, because each cluster instance is still a single thread. A reasonable Spark Redis connector should exploit the cluster efficiently.
Either way, Redis is really quick, especially if you use mass inserts.
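As a hedged sketch of doing the writes from the executors with pipelining (this assumes a single, non-clustered Redis instance and the Jedis client; with a Redis Cluster you would reach for JedisCluster or a connector such as spark-redis instead, and the key/list shape here is hypothetical):

import org.apache.spark.rdd.RDD
import redis.clients.jedis.Jedis

// lists: one Redis key per list, paired with the elements to RPUSH (hypothetical shape).
def writeLists(lists: RDD[(String, Seq[String])], host: String, port: Int): Unit = {
  lists.foreachPartition { partition =>
    // One connection per partition, opened on the executor rather than on the driver.
    val jedis = new Jedis(host, port)
    val pipeline = jedis.pipelined()
    partition.foreach { case (key, values) =>
      pipeline.rpush(key, values: _*) // queue the command instead of waiting for each reply
    }
    pipeline.sync() // flush the pipeline once per partition (a mass insert)
    jedis.close()
  }
}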

Distributing work to multiple cores: Hadoop or Scala's parallel collections?

What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
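A minimal sketch of option B inside a single task, assuming Scala 2.12-style parallel collections and placeholder Document/Result types (on 2.13 the parallel collections live in a separate module and need scala.collection.parallel.CollectionConverters):

// Placeholder types and work function; the real ones come from your pipeline.
case class Document(text: String)
case class Result(length: Int)

def process(doc: Document): Result = Result(doc.text.length)

// Option B: fan the documents handed to this machine out across its local cores.
def processLocally(docs: Seq[Document]): Seq[Result] =
  docs.par.map(process).seq // .par uses a ForkJoinPool sized to the available cores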
Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.
If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.
Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
The answer depends on the following question - is your Scala code capable of fully utilizing all the cores available? Probably, if you have good intrinsic synchronization between the parts of the document to be processed, or some other way to parallelize the algorithm without lock contention, then "B" is the way to go. If so, configure one mapper per node and let your mapper utilize the cores in the best way.
If your gain from parallelization is not that good, and adding more threads (cores) does not improve performance linearly, then "A" can be the better way. The efficiency of "A" also depends on the size of your RAM - you will need enough RAM for 10 mappers per node.
I suspect the ideal solution is somewhere in between. So my suggestion is to develop a mapper which takes the number of threads to use as a parameter, and then run a few tests, increasing the number of threads per mapper while decreasing the number of mappers per node.
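A sketch of that suggestion, reusing the placeholder Document/Result/process from the earlier sketch and again assuming Scala 2.12-style parallel collections; the thread count becomes a mapper parameter you can vary against the number of mappers per node:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

def processLocally(docs: Seq[Document], threads: Int): Seq[Result] = {
  val par = docs.par
  // Run this mapper's work on a pool of the requested size instead of the default one.
  par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threads))
  par.map(process).seq
}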
Hadoop is not very good at processing a lot of small files; it is better at processing a small number of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. I also don't think you should use Hadoop only as a distribution mechanism; that is not what it's made for. You should either use real map/reduce, or build your own system for whatever you are trying to do, but not try to bend Hadoop to your will.