Spark.SQL – Aggregation of separated data in parallel - pyspark

My task is to aggregate data by hour (and store each hour as a row in a DB).
To aggregate one hour, there is no need to know what the other hours contain.
The input is JSON files. An important point is that these files are stored in separate folders, one folder per hour.
I have 2 questions:
What is the right way to aggregate in such a scenario? I'd like to "send" each hour's data to different node(s) and aggregate the hours separately in parallel, so that I end up with a dataframe that contains only the aggregated result of each hour. I understand that simple partitioning doesn't return such a dataframe.
How could I take advantage of the separate folders? Is it worth reading each hour's data separately and then combining everything with union (while preserving the partitioning, like here)? Does that indeed save the "group-by" operation?
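A minimal sketch of the idea behind both questions, written in plain Python rather than Spark so that it runs anywhere: because each folder holds exactly one hour, every folder can be aggregated on its own and the per-hour rows simply concatenated, with no cross-hour grouping step. The folder names, the `value` field, and the sum/count aggregate are all assumptions made for illustration, not part of the original setup.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def aggregate_hour(hour_dir: Path) -> dict:
    """Aggregate every JSON-lines file in one hour folder into a single row."""
    total, count = 0.0, 0
    for file in sorted(hour_dir.glob("*.json")):
        for line in file.read_text().splitlines():
            if line.strip():
                total += json.loads(line)["value"]  # "value" is a hypothetical field
                count += 1
    return {"hour": hour_dir.name, "sum": total, "count": count}


def aggregate_all(root: Path) -> list:
    """Aggregate each hour folder independently, then concatenate the rows."""
    hour_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    with ThreadPoolExecutor() as pool:
        return list(pool.map(aggregate_hour, hour_dirs))


def make_sample(root: Path) -> None:
    """Write a tiny hypothetical data set: two hour folders, one file each."""
    sample = {"2024-01-01T00": [1, 2, 3], "2024-01-01T01": [10, 20]}
    for hour, values in sample.items():
        folder = root / hour
        folder.mkdir()
        lines = "\n".join(json.dumps({"value": v}) for v in values)
        (folder / "part-0.json").write_text(lines)


with tempfile.TemporaryDirectory() as tmp:
    make_sample(Path(tmp))
    rows = aggregate_all(Path(tmp))
    print(rows)
```

The result has one row per hour folder. In Spark itself the hour would probably have to be exposed as a column (for example derived from the file path) before grouping; that part is not shown here.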

Related

Spark binary file and Delta Table

I have binary files (~3 MB each) that I receive in batches of ~20000 files at a time. These files are used downstream for further processing, but I want to process them and store them in Delta tables.
I can do this easily:
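from delta.tables import DeltaTable
from pyspark.sql.functions import expr

df = spark.read.format("binaryFile").load("<path-to-batch>")
df = df.withColumn("id", expr("uuid()"))
dt = DeltaTable.forName(spark, "myTable")
# "a" is the existing table, "b" is the incoming batch
dt.alias("a").merge(
    df.alias("b"),
    "a.path = b.path"
).whenNotMatchedInsert(
    values={"id": "b.id", "content": "b.content"}
).execute()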
This makes the table quite slow already, but later I need to query certain IDs, call collect() on the result, and write the rows individually back to binary files.
Questions:
Would my table benefit from a batch column and partition?
Should I partition by id? I know this is not ideal, but might make querying individual rows easier?
Is there a better way to write the files out again than .collect()? When I select about 1000 specific ids and write them out, I have seen that about 10 minutes is spent just on collect and then less than a minute on writing. I do something like:
for row in df.collect():
    with open(row.id, "wb") as fw:
        fw.write(row.content)
As uuid() returns random values, I'm afraid we cannot use it to compare existing data with new records. (Sorry if I misunderstood the idea)
I don't think partitioning by id will help, as the id column obviously has high cardinality.
Instead of using collect(), which loads all records into the driver, I think it would be better to write the records of the Spark dataframe directly and simultaneously from all the worker nodes into a temporary location on ADLS first, and then aggregate a few data files from that location.
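A rough sketch of what "writing from the workers" could look like. The function below takes an iterator of rows and writes each one to a file, which is the shape of function that PySpark's `DataFrame.foreachPartition` accepts. It is exercised here on a plain local list so it runs without a cluster. The name `write_partition` and the `out_dir` argument are hypothetical, and `out_dir` is assumed to be a location every executor can reach (for example mounted shared storage).

```python
import os
import tempfile
from collections import namedtuple


def write_partition(rows, out_dir):
    """Write each row's binary content to out_dir/<id>; return files written.

    Meant to be called once per partition, on whichever worker holds it.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    for row in rows:
        with open(os.path.join(out_dir, row.id), "wb") as fw:
            fw.write(row.content)
        written += 1
    return written


# Local stand-in for one partition; in Spark these would be Row objects.
Row = namedtuple("Row", ["id", "content"])
sample_partition = [Row("id-1", b"\x00\x01"), Row("id-2", b"\xff")]

out_dir = tempfile.mkdtemp()
n_written = write_partition(iter(sample_partition), out_dir)
print(n_written, sorted(os.listdir(out_dir)))
```

With a real dataframe this would presumably be invoked as `df.foreachPartition(lambda rows: write_partition(rows, out_dir))`, so that no row content ever passes through the driver.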

Apache Spark: Why are many small dataframes so much slower than few big ones?

So I am having the following scenario:
I have a service where I do some calculations on parameters in a dataframe, for example the describe() operation. I get the parameters via an HTTP POST (Array[String] + schema) and read them in via the read.json function on the SQL context.
I can either get them in one big dataframe with 10,000 parameters or in 10,000 small dataframes with just one parameter each. Each has around 12,000 rows with timestamps.
In the end I need to collect the dataframe(s) to send them to a different service for further calculations. It would be easier to do it parameter-wise, because of the way the input is created.
But I figured out that collecting/converting to JSON on the many small dataframes is way more expensive than on the one huge dataframe.
The big dataframe takes about 6 seconds, and all the small ones together take at least 20 seconds. For a single input this does not seem so important, but I want to do it on at least 3,000 of these 10,000-parameter inputs.
Why is that so? It does not seem to be a difference in calculation, but a difference between collecting once and collecting many times.
When you call collect(), Spark must submit a job and send the data to one node.
Let's consider 10 DataFrames of n elements each and one with 10n elements.
One big collect() -> 10n data is sent, one execution plan is generated and one job is created
10 DataFrames -> 10*collect() -> 10*n data is sent, 10 execution plans need to be generated and 10 jobs submitted.
Of course it also depends on hardware and network, i.e. if a small DataFrame fits on one node, it may be faster than sending data over the network.
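The comparison above can be put into a toy cost model. This is only a sketch: the constants are hypothetical units chosen to make the arithmetic easy to follow, not measurements. Both variants move the same amount of data, so the difference comes entirely from paying the fixed per-job overhead ten times instead of once.

```python
def collect_cost(n_jobs, rows_per_job, job_overhead, per_row_cost):
    """Toy model: every job pays a fixed overhead plus a per-row transfer cost."""
    return n_jobs * (job_overhead + rows_per_job * per_row_cost)


# Hypothetical units, chosen only for illustration.
JOB_OVERHEAD = 5   # planning + scheduling one job
PER_ROW_COST = 1   # moving one row to the driver
N = 100            # rows in each small DataFrame

one_big = collect_cost(1, 10 * N, JOB_OVERHEAD, PER_ROW_COST)
ten_small = collect_cost(10, N, JOB_OVERHEAD, PER_ROW_COST)

# The gap is exactly the nine extra job overheads.
print(one_big, ten_small, ten_small - one_big)
```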

Batch version of sessionization-like counting in Spark

I've got log files from various devices showing users and want to create a kind of stateful count of users visiting specific websites for every minute. I can transform the data into the format: ts,websitename,userID,(-)1 (1 for joiners/-1 for leavers).
I'd like to end up with a time series with count per website per ts:
ts1,siteA,34
ts2,siteA,30 <- 4 users left
ts3,siteA,32 <- 2 users joined
The way to do this in Spark Streaming is well described. The most straightforward way IMHO would be to have a time window in Spark Streaming of the desired aggregation time and use updateStateByKey to keep a count per website (not even taking the log ts into account, to keep it simple).
Now the question is how to achieve this in a batch process. More specifically, it's not too hard to use aggregateByKey() and end up with something like:
ts0,siteA,30
ts1,siteA,4
ts2,siteA,-4
ts3,siteA,2
But then how to iterate over that? It doesn't sound very logical, but the only thing I can think of would be to sort the data using sortByKey(), partition it to be sure that all data for a specific site is on one node, and then iterate over every element of the RDD, creating a new RDD with (ts,count).
But e.g. foreach doesn't iterate over the elements sequentially, as far as I understand. Actually, this might not even suit Spark well, as going down to the level of individual records is not really "batch-type" work.
Any help or pointers to specific functions would be greatly appreciated!
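The "sort, keep each site together, then walk the records in order" idea described above amounts to a per-site running sum. Here is a minimal plain-Python sketch of that logic (not Spark code), using the per-timestamp deltas from the example; the function name `running_counts` is made up for illustration.

```python
from collections import defaultdict
from itertools import accumulate


def running_counts(events):
    """Turn (ts, site, delta) records into (ts, site, running_count) records."""
    by_site = defaultdict(list)
    for ts, site, delta in events:
        by_site[site].append((ts, delta))

    result = []
    for site in sorted(by_site):
        ordered = sorted(by_site[site])  # order each site's records by timestamp
        totals = accumulate(delta for _, delta in ordered)
        result.extend((ts, site, total) for (ts, _), total in zip(ordered, totals))
    return result


# Deliberately unordered input, as it might arrive from the logs.
events = [
    ("ts2", "siteA", -4),
    ("ts0", "siteA", 30),
    ("ts3", "siteA", 2),
    ("ts1", "siteA", 4),
]
series = running_counts(events)
print(series)
```

In the Spark DataFrame API the same computation is usually expressed as a sum over a window partitioned by site and ordered by ts, which avoids iterating over individual records by hand.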

Equivalent of collection.groupBy in scalaz-streams

I have a folder which contains multiple files with names such as filetype1_ddMMyyyy_hhmm, filetype2_ddMMyyyy_hhmm
For each day, there could be multiple files with different hours, and I need to parse only the one with the highest hour. In a non-reactive-stream world, the algorithm can be implemented as a groupBy on the date; what's its equivalent in scalaz-stream?
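For reference, the non-streaming version of the algorithm can be sketched as a single pass that keeps only the best file seen so far per group, which is also the shape a streaming fold would take. This is plain Python, not scalaz-stream; grouping by file type as well as date is an assumption, and the sample file names are invented.

```python
def latest_per_day(filenames):
    """For each (filetype, date) group keep the file with the highest hhmm."""
    latest = {}
    for name in filenames:
        filetype, date, hhmm = name.split("_")
        key = (filetype, date)
        # hhmm is fixed-width and zero-padded, so string comparison orders it.
        if key not in latest or hhmm > latest[key].split("_")[2]:
            latest[key] = name
    return sorted(latest.values())


# Hypothetical file names following the filetype_ddMMyyyy_hhmm pattern.
files = [
    "filetype1_01022015_0930",
    "filetype1_01022015_1745",
    "filetype1_02022015_0815",
    "filetype2_01022015_2300",
    "filetype2_01022015_0100",
]
selected = latest_per_day(files)
print(selected)
```

Only one entry per group is held in memory at any time, so the approach does not need the whole listing to be materialized first.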

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
  all datasets have the same keys
  1 million keys (this may later be 10 or 20 million)
dataset columns:
  each dataset has the same columns
  10 to 20 columns
  most columns are numerical values which we need to aggregate on (avg, stddev, and use R to calculate statistics)
  a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
  user can choose which datasets they are interested in (anywhere from 15 to 1000)
  application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
  an entire dataset can be added, dropped, or replaced/updated
  would be cool to be able to add columns. But, if required, can just replace the entire dataset.
  never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
  currently two machines with 24 cores each
  eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
this implies that you can get the stddev of AB (the data sets A and B combined) by doing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|) and E[AB] is (sum(A) + sum(B))/(|A|+|B|)
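A small sketch of how the identity above allows pre-computation: each data set is reduced once to its count, sum, and sum of squares, and any selection of data sets can then be combined by adding those three numbers. The function names and the sample values are made up for illustration, and the result is the population standard deviation.

```python
import math
import statistics


def summarize(values):
    """Reduce a data set to the three numbers needed later: n, sum, sum of squares."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(*summaries):
    """Add summaries component-wise; this is the summary of the concatenation."""
    return tuple(sum(parts) for parts in zip(*summaries))


def mean_and_stddev(summary):
    """Population mean and stddev via sqrt(E[X^2] - (E[X])^2)."""
    n, total, total_sq = summary
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # guard against rounding below 0
    return mean, math.sqrt(variance)


a = [2.0, 4.0, 4.0, 4.0]
b = [5.0, 5.0, 7.0, 9.0]

mean_ab, std_ab = mean_and_stddev(combine(summarize(a), summarize(b)))
print(mean_ab, std_ab)
print(statistics.pstdev(a + b))  # cross-check against a direct computation
```

One caveat: this formula subtracts two potentially large numbers, so it can lose precision when the variance is small relative to the mean.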
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open-source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open-source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give an idea of what size of cluster you need.
If I understand you correctly, you only need to aggregate on a single column at a time.
You can store your data differently for better results.
In HBase that would look something like:
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two-phase read
column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, from which you can compute avg. etc. quite easily.
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since it doesn't sound like your data set needs joins, you should be fine. You can have the indexes set up to find the data set and then either do the math in SQL or in the app.