I have a PySpark DataFrame that maps customers to a list of categories as well as a geo location:
[('customer', 'bigint'), ('category', 'array<int>'), ('geo_location', 'string')]
Each customer can map to more than one category, so I capture the categories as a list.
I'd like to count up the number of customers in each category while preserving the geo information.
Is there a method in PySpark that will easily unpack the list values into a column so I can count them? Alternatively, is there a better pattern in PySpark that accomplishes this?
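One common pattern is to explode the array column so each category gets its own row, then group by category (and geo) and count. A minimal sketch, assuming made-up sample data matching the schema above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical rows matching the (customer, category, geo_location) schema
df = spark.createDataFrame(
    [(1, [10, 20], "US-CA"), (2, [20], "US-NY"), (3, [10], "US-CA")],
    ["customer", "category", "geo_location"],
)

# explode() turns every element of the array into its own row,
# keeping the other columns alongside it
exploded = df.withColumn("category", F.explode("category"))

# count distinct customers per (category, geo_location) pair
counts = (exploded
          .groupBy("category", "geo_location")
          .agg(F.countDistinct("customer").alias("n_customers")))
counts.show()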
I have an RDD with key-value pairs (dummy_key, text).
I need to write a PySpark operation that returns the n most frequent words per category, together with their counts. The word can be extracted from the text (it is the first word in the text). While this is easy without aggregation, I'm having trouble doing it per category.
I would be grateful for any help.
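One way to approach it, assuming the key of each pair is the category and the word of interest is the first token of the text (both assumptions), is to count (category, word) pairs first and then keep the top n per key. A rough sketch:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

n = 3  # how many top words to keep per category

# hypothetical sample: (category, text) pairs
rdd = sc.parallelize([
    ("sports", "goal scored late"),
    ("sports", "goal again"),
    ("sports", "match drawn"),
    ("news", "election results are in"),
])

top_per_category = (
    rdd
    # count occurrences of (category, first word of text)
    .map(lambda kv: ((kv[0], kv[1].split()[0]), 1))
    .reduceByKey(lambda a, b: a + b)
    # regroup by category, carrying (word, count) along
    .map(lambda kc: (kc[0][0], (kc[0][1], kc[1])))
    .groupByKey()
    # keep the n most frequent words in each category
    .mapValues(lambda wc: sorted(wc, key=lambda t: -t[1])[:n])
)

print(top_per_category.collect())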
Hi, I have a query in which I want to pass variable data into the groupBy aggregation.
I tried this, but it is not working:
dd2=(dd1.groupBy("hours").agg({'%s':'%s'})%(columnname1,input1))
Here columnname1 contains 'total', and input1 contains the kind of aggregation required, like mean or stddev.
I want this query to be dynamic.
Try this,
dd2=(dd1.groupBy("hours").agg({'{}'.format(columnname1):'{}'.format(input1)}))
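For what it's worth, agg() accepts a plain Python dict, so the variables can also be passed directly without any string formatting. A minimal, self-contained sketch (the column names and data are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical frame with an "hours" column and a "total" column
dd1 = spark.createDataFrame(
    [(1, 10.0), (1, 20.0), (2, 5.0)],
    ["hours", "total"],
)

columnname1 = "total"  # which column to aggregate
input1 = "mean"        # which aggregation to apply ("mean", "stddev", ...)

# the dict keys and values can be plain variables
dd2 = dd1.groupBy("hours").agg({columnname1: input1})
dd2.show()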
This is my RDD; as you can see, a single element can have multiple genre values:
[['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
['Adventure', 'Children', 'Fantasy'],
['Comedy', 'Romance'],
['Comedy', 'Drama', 'Romance'],
['Comedy']]
I wish to make a list out of the values in the arrays of the RDD, splitting all the values into multiple rows like this:
['Adventure',
'Drama',
'Comedy'] ....
and also make this collection distinct.
So far I have tried this:
RDD.flatMap(lambda x: x).distinct().take(100)
But I'm not sure whether this code takes all the values out of all the arrays and makes a distinct list from them. Does it perform the task I need it to?
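For reference, a quick, self-contained check on the sample data above (SparkContext setup assumed) shows that flatMap unpacks every inner list and distinct removes the duplicates:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

genres = sc.parallelize([
    ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
    ['Adventure', 'Children', 'Fantasy'],
    ['Comedy', 'Romance'],
    ['Comedy', 'Drama', 'Romance'],
    ['Comedy'],
])

# flatMap flattens each inner list into individual elements;
# distinct then removes duplicates across the whole RDD
unique_genres = genres.flatMap(lambda x: x).distinct()
print(unique_genres.collect())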
How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect() which gives me array[Row], but I want to sort it based on a given column index.
I have already written quick-sort logic and it works, but it involves too many for loops and too much boilerplate.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it; once you collect it, you lose the distributed (and parallel) computation. You can use the DataFrame's sort, for example, ascending order by column "col1":
import org.apache.spark.sql.functions.asc

val sorted = dataframe.sort(asc("col1"))
My Mongo schema is as follows:
KEY: Client ID
Value: { Location1: Bitwise1, Location2: Bitwise2, ...}
So the column names would be the names of locations. This data represents the locations a client has been to, and the bitwise value captures the days on which the client was present at that location.
I'd like to run a map-reduce query on the above schema. In it, I'd like to iterate over all the fields (columns) of the Value for a row. How can that be done? Could someone give a small code snippet that explains it clearly? I'm having a hard time finding it on the web.
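Not a native map-reduce, but if the goal is just to see what iterating over the dynamic location fields of the value looks like, a client-side sketch with pymongo might help (the database, collection, and field names below are assumptions, and the bitwise value is assumed to be stored as an integer):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["client_locations"]

# each document is assumed to look like
#   {"_id": <client id>, "value": {"Location1": <bitwise>, "Location2": <bitwise>, ...}}
for doc in coll.find():
    client_id = doc["_id"]
    # iterate over every location field in the value sub-document
    for location, bitmask in doc.get("value", {}).items():
        # e.g. count the set bits = number of days present at this location
        days_present = bin(bitmask).count("1")
        print(client_id, location, days_present)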