Which element does DataFrame.dropDuplicates drop - Scala

If I sort a DataFrame in descending order based on a column and then drop the duplicates using df.dropDuplicates, which element will be removed? The element that was smaller based on the sort?

The dropDuplicates method preserves the first element and removes the others.
So yes, after a descending sort only the largest element (based on the sort) will be preserved and the others removed.
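
As a rough illustration (not from the original question), here is a minimal Scala/Spark sketch of the pattern being discussed; the column names id and score and the sample data are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

// Minimal sketch: local Spark session and hypothetical columns "id" and "score".
val spark = SparkSession.builder().appName("dropDuplicatesDemo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 10), (1, 30), (2, 5), (2, 50)).toDF("id", "score")

// Sort descending by score, then keep one row per id.
// According to the answer above, the first row per id is kept,
// i.e. the one with the largest score after the descending sort.
val deduped = df.sort(desc("score")).dropDuplicates("id")
deduped.show()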

Related

Sort array elements alphabetically PostgreSQL

I know there has to be a way to do this, but my googling was to no avail. I am trying to take a recordset, create an array and sort the elements alphabetically.
How would I accomplish this?
To do this you will need to use the ORDER BY clause in the ARRAY_AGG function.
Example:
SELECT ARRAY_AGG(fullname ORDER BY lastname)
FROM ...
This will return an array with names sorted by last name.

Preserving index in an RDD in Spark

I would like to create an RDD which contains String elements. Alongside these elements I would like a number indicating the index of each element. However, I do not want this number to change if I remove elements, since I want the number to be the original index (thus preserving it). It is also important that the order is preserved in this RDD.
If I use zipWithIndex and thereafter remove some elements, will the indexes change? Which function/structure can I use to have unchanged indexes? I was thinking of creating a Pair RDD, however my input data does not contain indexes.
Answering rather than deleting: my problem was easily solved by zipWithIndex, which fulfilled all my requirements.
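
For illustration (the sample data and names here are hypothetical, not from the question), a small sketch showing that zipWithIndex records the original position and that a later filter leaves those indices untouched:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: local SparkContext with made-up sample data.
val sc = new SparkContext(new SparkConf().setAppName("zipWithIndexDemo").setMaster("local[*]"))

val rdd = sc.parallelize(Seq("a", "b", "c", "d"))

// Pair each element with its original index: RDD[(String, Long)].
val indexed = rdd.zipWithIndex()

// Removing elements afterwards does not change the stored indices,
// so each surviving element still carries its original position.
val filtered = indexed.filter { case (value, _) => value != "b" }
filtered.collect().foreach(println)  // (a,0), (c,2), (d,3)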

How to sort Array[Row] by given column index in Scala

How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect() which gives me array[Row], but I want to sort it based on a given column index.
I have already used quick-sort logic and it's working, but there are too many for loops and all.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it - if you collect it, you lose the distributed (and parallel) computation. You can use the DataFrame's sort, for example ascending order by column "col1":
val sorted = dataframe.sort(asc("col1"))
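
You can then collect the sorted result with sorted.collect(). If you do end up with an Array[Row] on the driver and still want to sort it by a column index, a possible sketch is shown below (the helper name is made up, and it assumes the column at that index holds Int values; swap the getter for getString, getLong, etc. to match the real column type):

import org.apache.spark.sql.Row

// Hypothetical helper: sort an already-collected Array[Row] by a column index.
// Assumes that column holds Int values; adjust the getter to the actual type.
def sortRowsByIntColumn(rows: Array[Row], colIndex: Int): Array[Row] =
  rows.sortBy(_.getInt(colIndex))

// Usage (driver side), e.g. sort the collected rows by the first column:
// val sortedRows = sortRowsByIntColumn(dataframe.collect(), 0)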

default sorting order of columns in cassandra?

I was going through a tutorial where the instructor says that the default ordering of columns within a row is UTF8-type, but he does not touch upon it further.
I don't understand what it means, especially if my columns are of different types such as int, timestamp, etc.
Also, how would I specify the sort order on the columns to be something other than "UTF8-type"?
He is talking about the column names, not the column values.
In old Cassandra versions you could use SuperColumns, which are collections of columns within a row. Something like this:
{ RowKey:
  { SuperColumn1Key: { c1: v, c2: v, ... } },
  { SuperColumn2Key: { c1: v, c2: v, ... } },
  { SuperColumn3Key: { c1: v, c2: v, ... } }
}
It is something similar to what today is a wide row. The comparator could establish both the sorting of supercolumns within a row and the sorting of columns by their name (you could choose two different comparators in a SuperColumnFamily, one for supercolumn sorting and another for column sorting). For example, using a TimeUUID comparator for supercolumns you could retrieve them sorted by time, while UTF8Type gives an "alphabetic" sorting.
Imagine this row with a UTF8Type column comparator:
{ id: {"author":"john", "vote": 3} }
Now let's add a new column, say text. Since the comparator is UTF8, "text" ("a" < "t" < "v") will fall between author and vote:
{ id: {"author":"john", "text": "blablabla", "vote": 3} }
However, I think what you've seen is an old video, since this concept is no longer used in newer versions.
HTH, Carlo
The short answer is: the default clustering order in Cassandra is ascending (ASC).
By default, Cassandra tables with no clustering order specified are optimized for ascending SELECT queries. If you need to run queries in descending order, you can specify a clustering order to store columns on disk in the reverse order of the default.
The official documentation states this a little unclearly for a quick reader (look out for the "default" keyword):
Ordering results
You can order query results to make use of the on-disk sorting of
columns. You can order results in ascending or descending order. The
ascending order will be more efficient than descending. If you need
results in descending order, you can specify a clustering order to
store columns on disk in the reverse order of the default. Descending
queries will then be faster than ascending ones.
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/create_table_r.html?scroll=reference_ds_v3f_vfk_xj__ordering-results

Use distinct and skip in a query

I tried running this:
db.col.find().skip(5).distinct("field1")
But it throws an error.
How can I use them together?
I can use aggregation but results are different:
db.col.aggregate([{$group:{_id:'$field1'}}, {$skip:3},{$sort:{"field1":1}}])
What I want is the links in sorted order, i.e. numbers should come first, then capital letters, and then lowercase letters.
The distinct method must be run on a collection, not on a cursor, and it returns an array. Read this:
http://docs.mongodb.org/manual/reference/method/db.collection.distinct/
So you can't use skip after distinct.
Maybe you should use this query instead:
db.col.aggregate([{$group:{_id:'$field1'}}, {$skip:3},{$sort:{"_id":1}}])
because the field field1 will no longer exist in the result after the first grouping stage.
Also, I think you should sort first and then skip, because in your query you skip 3 unsorted results and then sort them.
(If you provide more information about the structure of your documents and the output you want, things would be clearer and I could correct the answer accordingly.)