I have a situation where I need to fetch only the distinct records that are greater than 0, plus all records with the value 0.
For example, I have a column called mid whose rows are "0,0,1,1,2,3,5,5,3"; I should fetch only "0,0,1,2,5,3".
In short: the distinct records, plus every mid with the value 0.
I have used this:
def distinctMIdCursor = dataSetCollection.distinct("mid",whereObject)
def distinctMIdList = distinctMIdCursor.asList()
but it fetches a result like "0,1,2,5,3".
Actual result: "0,1,2,5,3"
Expected result: "0,0,1,2,5,3"
How can I achieve this? Is there a better way?
You cannot achieve this with distinct alone, because returning duplicates defeats the whole purpose of distinct. Instead, you can run two queries and concatenate the results.
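// Assumes the legacy MongoDB Java driver (import com.mongodb.BasicDBObject)
// Distinct mid values other than 0
def nonZeroDistinctList = dataSetCollection.distinct("mid", new BasicDBObject("mid", new BasicDBObject('$ne', 0)))
// Every document with mid == 0, mapped to its mid value
def allZeroList = dataSetCollection.find(new BasicDBObject("mid", 0)).collect { it.get("mid") }
// Concatenate the two lists, zeros first to match the expected result
def result = allZeroList + nonZeroDistinctList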
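The same "all zeros plus distinct non-zero values" idea can be sketched in plain Python, without a database. This is only an illustration of the logic; the function name distinct_plus_zeros is hypothetical, and the non-zero values come back in first-seen order, which is a choice made here rather than something distinct guarantees.

```python
def distinct_plus_zeros(values):
    """Keep every 0 and only the first occurrence of each non-zero value."""
    # All zeros are kept, duplicates included.
    zeros = [v for v in values if v == 0]

    # Non-zero values are de-duplicated, preserving first-seen order.
    seen = set()
    non_zero_distinct = []
    for v in values:
        if v != 0 and v not in seen:
            seen.add(v)
            non_zero_distinct.append(v)

    return zeros + non_zero_distinct


mids = [0, 0, 1, 1, 2, 3, 5, 5, 3]
print(distinct_plus_zeros(mids))  # [0, 0, 1, 2, 3, 5]
```

The result has the same elements as the expected output in the question; only the order of the non-zero values depends on how the de-duplication is done.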
I have a List, hdtList, which contains the columns of a Hive table:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string
I have another List, partition_columns, which contains two elements: source_system_name, period_year
Using partition_columns, I am trying to find the matching columns in hdtList and move them to the end of the list, as below:
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(x => x != ' ')))
But when I print them with println(notPc.mkString(",") + "," + pc.mkString(","))
the partition columns come out in the wrong order, as below:
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,period_year bigint,source_system_name string
The column period_year comes first and source_system_name comes last. Is there any way to produce the output below, so that the order of the columns in partition_columns is maintained?
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,source_system_name string,period_year bigint
I know there is an option to reverse a List, but I'd like to learn whether I can use a collection that maintains insertion order.
It doesn't matter which collections you use; you only use partition_columns to call contains which doesn't depend on its order, so how could it be maintained?
But your code does maintain order: it's just hdtList's.
Something like
// get is ugly, but safe here
val pc1 = partition_columns.map(x => pc.find(y => y.startsWith(x)).get)
after your code will give you the desired order, though there's probably a more efficient way to do it.
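The reordering described above can be sketched in Python as follows. The function name move_partition_columns_last is hypothetical and the column list is shortened for illustration; the key point is that the moved columns are built by walking partition_columns, so its order is the one that survives.

```python
def move_partition_columns_last(columns, partition_columns):
    """Move the partition columns to the end, in partition_columns order."""
    def name(col):
        # "period_year bigint" -> "period_year"
        return col.split(" ", 1)[0]

    by_name = {name(c): c for c in columns}
    # Non-partition columns keep their original (hdtList) order.
    rest = [c for c in columns if name(c) not in partition_columns]
    # Partition columns follow the order of partition_columns.
    moved = [by_name[p] for p in partition_columns if p in by_name]
    return rest + moved


hdt_list = [
    "forecast_id bigint",
    "period_year bigint",
    "period_num bigint",
    "source_system_name string",
    "year string",
]
partition_columns = ["source_system_name", "period_year"]
print(",".join(move_partition_columns_last(hdt_list, partition_columns)))
```

Building a name-to-column dictionary first avoids scanning the list once per partition column.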
I have a small Anorm query which is returning all the rows in the Service Messages table in my database. I would eventually like to turn each of these rows into JSON.
However, currently all I am doing is iterating through the elements of the first row with the .map function. How can I iterate through all the rows, so that I can manipulate each of them and turn them into a JSON object?
val result = DB.withConnection("my-db") { implicit connection =>
val messagesRaw = SQL("""
SELECT *
FROM ServiceMessages
""").apply;
messagesRaw.map(row =>
println(row[String]("title"))
)
}
Actually, what you are doing does iterate over all the rows (not only the first one), taking the contents of the title column from each row.
In order to collect all titles you need the following trivial modification:
val titles = messagesRaw.map(row =>
row[String]("title")
)
Converting them to json (array) is simple as well:
import play.api.libs.json._
...
Ok(Json.toJson(titles))
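The same pattern (map every row to a value, then serialize the collected list) can be illustrated outside Play with Python's standard library. This is only a sketch: the in-memory SQLite table, its body column, and the sample rows are made up for the example; only the ServiceMessages table name and the title column come from the question.

```python
import json
import sqlite3

# Hypothetical in-memory stand-in for the ServiceMessages table.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute("CREATE TABLE ServiceMessages (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO ServiceMessages VALUES (?, ?)",
    [("Maintenance", "Planned downtime"), ("Outage", "Service unavailable")],
)

rows = conn.execute("SELECT * FROM ServiceMessages ORDER BY rowid").fetchall()

# Collect one value per row, as the modified map does with titles.
titles_json = json.dumps([row["title"] for row in rows])

# Or turn every row into an object and serialize the whole list.
messages = [dict(row) for row in rows]
messages_json = json.dumps(messages)

print(titles_json)
print(messages_json)
```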
I'm filtering integer columns from an input parquet file with the logic below, and I have been trying to extend it with an additional check: if the count for any input column equals the row count of the input parquet file, I want to filter that column out.
Update
The number of columns and their names in the input file are not static; they change every time we get the file.
The objective is to also filter out any column whose count is equal to the input file's row count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
//Get array of structfields
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("integer"))
//Get the column names
val z = df.select(columns.map(x => col(x.name)): _*)
//Get array of string
val m = z.columns
The new logic should be something like this (pseudocode):
val cnt = spark.read.parquet("inputfile").count()
val d = z.column.where column count is not equals cnt
I do not want to pass the column name explicitly to the new condition (val d = ... above), since the column whose count equals the input file count will change.
How do we write the logic for this?
As I understand your question, you are trying to keep the columns whose dataType is integer and whose distinct count is not equal to the row count of another input parquet file. If that is correct, you can add the count check to your existing filter:
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code can stay as it is.
I hope the answer is helpful.
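To make the filter condition concrete, here is a plain-Python sketch of the same rule, not Spark code. The schema dictionary, the sample rows, and the function name columns_to_keep are all hypothetical; the rule itself (integer type and distinct count different from the row count) is the one described above.

```python
def columns_to_keep(rows, schema, row_count):
    """Return integer columns whose distinct count differs from row_count."""
    keep = []
    for name, type_name in schema.items():
        if type_name != "integer":
            continue  # only integer columns are candidates
        distinct_count = len({row[name] for row in rows})
        if distinct_count != row_count:
            keep.append(name)
    return keep


# Hypothetical schema and data standing in for the parquet file.
schema = {"id": "integer", "status": "integer", "label": "string"}
rows = [
    {"id": 1, "status": 0, "label": "a"},
    {"id": 2, "status": 0, "label": "b"},
    {"id": 3, "status": 1, "label": "c"},
]

# "id" has 3 distinct values for 3 rows, so it is filtered out.
print(columns_to_keep(rows, schema, len(rows)))  # ['status']
```

No column name is passed explicitly; every column in the schema is checked against the same condition.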
Jeanr and Ramesh suggested the right approach. Here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))
I need a window function that partitions by some keys (=column names), orders by another column name and returns the rows with top x ranks.
This works fine for ascending order:
def getTopX(df: DataFrame, top_x: String, top_key: String, top_value:String): DataFrame ={
val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
val w = Window.partitionBy(top_keys.head,top_keys.drop(1):_*)
.orderBy(top_value)
val rankCondition = "rn < "+top_x.toString
val dfTop = df.withColumn("rn",row_number().over(w))
.where(rankCondition).drop("rn")
return dfTop
}
But when I try to change it to orderBy(desc(top_value)) or orderBy(top_value.desc) in line 4, I get a syntax error. What's the correct syntax here?
There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).
Putting it all together, we get
.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
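For readers who want to see what the descending window computes, the following plain-Python sketch emulates it without Spark: group by the key columns, sort each group by the value column in descending order, and keep the first x rows. The function name top_x_per_group and the sample rows are hypothetical, and the sketch keeps rows with row number <= x.

```python
from collections import defaultdict


def top_x_per_group(rows, keys, value, x):
    """Emulate row_number() over (partition by keys order by value desc) <= x."""
    groups = defaultdict(list)
    for row in rows:
        # The partition key is the tuple of all key-column values.
        groups[tuple(row[k] for k in keys)].append(row)

    result = []
    for group_rows in groups.values():
        # reverse=True plays the role of .desc on the ordering column.
        ranked = sorted(group_rows, key=lambda r: r[value], reverse=True)
        result.extend(ranked[:x])
    return result


rows = [
    {"k": "a", "v": 1},
    {"k": "a", "v": 3},
    {"k": "a", "v": 2},
    {"k": "b", "v": 5},
]
print(top_x_per_group(rows, ["k"], "v", 2))
```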
Say, for example, we need to order by a column called Date in descending order in the window function. Putting the $ symbol before the column name (available after import spark.implicits._) turns the string into a Column, which lets us use the asc or desc syntax.
Window.orderBy($"Date".desc)
After specifying the column name in double quotes, append .desc to sort in descending order.
Column col = new Column("ts");
col = col.desc();
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col);
I am currently trying to create a concatenated string for each group. This string should be the concatenation of all the occurrences of the field.
For the moment my code looks like this:
grouped = GROUP a by group_field;
b = FOREACH grouped {
unique_field = DISTINCT myfield;
tupl = TOTUPLE(unique_field) ;
FOREACH tupl GENERATE group as id, CONCAT( ? ) as my_new_string;
}
The thing is, I do not know in advance how many distinct fields each group has or what they contain. I don't know what to put in place of the ? to make it work.
TOTUPLE is not doing what you expect: it makes a one-element tuple whose single element is the bag unique_field.
Also, CONCAT only takes two arguments, and they must be explicitly defined. Let's say that you have a schema like A: {A1: chararray, A2: chararray, A3: chararray} and you want to concatenate all the fields together. You would have to do this (which is obviously not ideal): CONCAT(CONCAT(A1, A2), A3).
Anyway, this problem can be easily solved with a Python UDF.
myudfs.py
#!/usr/bin/python
@outputSchema('concated:chararray')
def concat_bag(BAG):
    # Pig passes a bag as a list of tuples; take the first field of each tuple
    return ''.join([str(t[0]) for t in BAG])
This UDF would be used in your script like:
Register 'myudfs.py' using jython as myfuncs;
grouped = GROUP a by group_field;
b = FOREACH grouped {
unique_field = DISTINCT a.myfield;
GENERATE group as id, myfuncs.concat_bag(unique_field);
}
I just noticed the FOREACH tupl GENERATE ... line. That is not valid syntax. The last statement in a nested FOREACH should be a GENERATE.
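The concatenation logic of the UDF can be tried outside Pig with a short Python sketch. It assumes that Pig hands a bag to a Jython UDF as a list of tuples, here single-field tuples; the outputSchema decorator is left out because it only exists inside Pig, and the sample bag is made up.

```python
def concat_bag(bag):
    """Concatenate the first field of every tuple in the bag."""
    # Assumption: each element of the bag is a tuple with one field.
    return "".join(str(t[0]) for t in bag)


# Hypothetical bag, shaped like the result of DISTINCT on one field.
unique_field = [("x",), ("y",), ("z",)]
print(concat_bag(unique_field))  # xyz
```

Bags are unordered in Pig, so the order of the pieces in the resulting string should not be relied on unless the bag is sorted first.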