This question already has answers here:
Calculate average using Spark Scala
(4 answers)
Closed 2 years ago.
I am new to Spark and I couldn't find enough information to understand some things in it. I am trying to write pseudocode in Scala (like the examples at http://spark.apache.org/examples.html).
A file with data is given. Each line has some data: number, course name, credits, and mark.
123 Programming_1 10 75
123 History 5 80
I am trying to compute the average for each student (number). The average is the sum of credits*mark over every course the student took, divided by the sum of credits over those courses, ignoring any line where mark == NULL. Suppose I have a function parseData(line) that turns a line of strings into a record with 4 members: number, courseName, credits, mark.
What I tried until now
data = spark.textFile("hdfs://...")
records = data.map(line => parseData(line))          // parse each line into a record
valid = records.filter(r => r.mark != null)          // ignore lines with mark == NULL
pairs = valid.map(r => (r.number, (r.credits * r.mark, r.credits)))
sums = pairs.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
averages = sums.map { case (number, (weighted, credits)) => (number, weighted.toDouble / credits) }
But I don't know how to read the specific fields and use them to produce the average for each student. Is it possible to use an array?
Once you filter and get the dataframe, you could use something like this:
df.withColumn("product",col("credits")*col("marks"))
.groupBy(col("student"))
.agg(sum("credits").as("sumCredits"),sum("product").as("sumProduct"))
.withColumn("average",col("sumProduct")/col("sumCredits"))
Hope this helps!!
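For reference, the weighted-average logic itself is independent of Spark: per student, accumulate sum(credits*mark) and sum(credits), then divide. A minimal sketch in plain Python, using the line format from the question (the sample lines and the "NULL" sentinel are assumptions for illustration):

```python
from collections import defaultdict

# Sample lines in the question's format: number course credits mark
lines = [
    "123 Programming_1 10 75",
    "123 History 5 80",
]

# number -> [sum of credits*mark, sum of credits]
sums = defaultdict(lambda: [0, 0])
for line in lines:
    number, course, credits, mark = line.split()
    if mark == "NULL":  # skip lines with a missing mark
        continue
    c, m = int(credits), int(mark)
    sums[number][0] += c * m
    sums[number][1] += c

averages = {num: weighted / credits for num, (weighted, credits) in sums.items()}
```

This is exactly what the `reduceByKey` (or the DataFrame `groupBy`/`agg`) version distributes across the cluster: the per-key state is just the two running sums.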
This question already has answers here:
Spark filter based on the max date record
(3 answers)
Closed 6 months ago.
I'm trying to make a function in Scala to filter the most recent date. I want to keep it general, so whatever dataframe I input, as long as it has the column "date", will return me the most recent line of that dataframe. It's worth noting that my date column is usually defined as a string, in the format yyyy-MM-dd. I'm pretty sure that my code here is flawed, but I guess it illustrates the idea.
def fixDate(table: DataFrame): DataFrame = {
table
.withColumn("date", from_unixtime(unix_timestamp(col("date"), "yyyy-MM-dd"), "yyyyMMdd").cast(IntegerType))
.filter(col("date")===functions.max("date"))
}
It depends a lot on what you want to achieve. If you want all the columns of the rows with the maximum value in the "date" field, then from a performance point of view you'd better split this into two stages: get the maximum value of the column, then filter by it:
val maxValue = df.withColumn("eventTime", unix_timestamp(col("date"), "yyyy-MM-dd").cast(TimestampType))
  .agg(max("eventTime")).collect()(0)(0)
df.withColumn("eventTime", unix_timestamp(col("date"), "yyyy-MM-dd").cast(TimestampType))
  .filter(col("eventTime") === maxValue)
  .drop("eventTime")
If you want any record with the maximum value, then I think the answers from this post will help you: How to get the row from a dataframe that has the maximum value in a specific column?
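One property worth noting, given that the question's date column is a yyyy-MM-dd string: that format sorts lexicographically in date order, so taking the maximum of the raw strings already finds the latest day, before any timestamp conversion. A small Python illustration of that property (the sample rows are made up):

```python
# yyyy-MM-dd strings compare the same way as the dates they encode,
# so max() on the raw strings already identifies the latest day.
rows = [
    ("a", "2021-03-18"),
    ("b", "2022-01-05"),
    ("c", "2022-01-05"),
]

latest = max(d for _, d in rows)
most_recent = [r for r in rows if r[1] == latest]
```

This is why the two-stage approach above works even without the `TimestampType` cast, as long as the strings are strictly in yyyy-MM-dd form.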
This question already has answers here:
Find gaps of a sequence in PostgreSQL tables
(1 answer)
Group rows by an incrementing column in PostgreSQL
(1 answer)
Group sequential integers postgres
(3 answers)
Closed 10 months ago.
I'm sorry for the title, I don't know how to clearly summarize the problem.
That's probably why I couldn't find an answer when searching by myself.
Feel free to improve it.
Anyways, let's say I have a query returning primary id's.
SELECT id FROM ...
Instead of having results presented with one row for each id like this:
id
-----
1
2
3
45
182
183
184
I would like to know if there's any access to some internal state based on the index that would return this:
ranges
---------
1-3
45
182-184
The whole point here is NOT to have a nice presentation, I can do that.
Besides it would add a treatment after having run the query, I want the opposite.
I'd like to know if there exists some kind of shortcut that would speed up the query by not having to return each row individually. Maybe something related to extracting data directly from the indexes used in the WHERE clause.
I'm not aware of a generic SQL way to do that but I would love to know if there's some postgres feature for this.
If the answer is "no", it's ok. I just had to ask...
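As far as I know there is no index-level shortcut for this, but the linked duplicates all use the standard "gaps and islands" technique: id minus its row number is constant within each consecutive run, so grouping by that difference collapses runs into ranges. A sketch of that query, demonstrated here against SQLite (the table and data are made up to match the example; the same SQL works in Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE t(id INTEGER PRIMARY KEY);"
    "INSERT INTO t VALUES (1),(2),(3),(45),(182),(183),(184);"
)

# id - ROW_NUMBER() is constant within each consecutive run of ids,
# so grouping by that difference yields one row per range.
rows = conn.execute("""
    SELECT CASE WHEN MIN(id) = MAX(id) THEN CAST(MIN(id) AS TEXT)
                ELSE MIN(id) || '-' || MAX(id)
           END AS ranges
    FROM (SELECT id, id - ROW_NUMBER() OVER (ORDER BY id) AS grp FROM t)
    GROUP BY grp
    ORDER BY MIN(id)
""").fetchall()
```

This still scans the qualifying rows server-side, so it mainly saves transfer and client-side work rather than index I/O.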
This question already has answers here:
Format a BigDecimal as String with max 2 decimal digits, removing 0 on decimal part
(5 answers)
Java format BigDecimal numbers with comma and 2 decimal place
(2 answers)
Closed 2 years ago.
I am working on a MySQL-backed Jasper report. I have 5 BigDecimal variables in my report. The report calculates the yearly charge of multiple services. I need to add the first 4 and then subtract the last one. Some of these variables can potentially hold fairly large values (8 figures). I do not know how to do calculations in BigDecimal. I have tried making a Double variable and doing the following:
Double.valueOf($V{enc}.doubleValue() + $V{Bed}.doubleValue()+$V{Diet}.doubleValue() + $V{Investigation}.doubleValue() - $V{ref}.doubleValue())
The report does not give any error but leads to 7.269307848E9 as shown below:
254350, 14589, 8122, 3708, 0 -> 7.269307848E9
(these are the values of the original variables; the last number is the result of the calculation)
While searching for the solution, I did come across the following:
($V{enc}.add($V{Bed})).toString()
I am not sure how I can apply it in my case, since it adds only two values. I need a whole value, not a decimal like above. By the way, I use iReport 1.2.0, which is a requirement for generating the report.
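BigDecimal's add() and subtract() return new BigDecimals, so they can be chained across all five variables: $V{enc}.add($V{Bed}).add($V{Diet}).add($V{Investigation}).subtract($V{ref}). The arithmetic itself can be sketched with Python's Decimal, which, like BigDecimal, is exact and prints without Double's scientific notation (the four values come from the question; $V{ref} is assumed to be 0):

```python
from decimal import Decimal

# Values of the report variables, as exact decimals
enc, bed, diet, investigation = (Decimal(v) for v in ("254350", "14589", "8122", "3708"))
ref = Decimal("0")

# Chain the operations the way BigDecimal's add()/subtract() chain:
# enc.add(bed).add(diet).add(investigation).subtract(ref)
total = enc + bed + diet + investigation - ref
result = str(total)  # whole value, no exponent notation
```

The point is that staying in the decimal type end to end both preserves exactness for 8-figure values and sidesteps the 7.269307848E9-style formatting that Double.toString produces.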
I will try to explain the problem on an abstract level first:
I have X amount of data as input, which always has a field DATE. Before, the dates that came as input (after some processing) were put in a table as output. Now I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come out as 0, or equivalent.
Example: I have two inputs, one with '18/03/2017' and the other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/03/2018'. So, output '19/03/2017' with every field set to 0, and the same for the 20th and 21st, and so on.
I know to do this programmatically, but on powercenter I do not. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an aggregator, create 365 fields, each holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations (sorting the dates, a union between them) to get the data ready for a joiner. The idea of the joiner is to do a full outer join between the original data and the all-zero data we got from the previous aggregator.
Then a router picks, with one of its groups, the data that had actual dates (and fields without nulls), and with another group the data where all fields are null; those null fields are then given a 0 and finally written to a table.
I am wondering how this can be achieved while, for starters, removing the need to add 365 fields to a date. If I were to do this same process for 10 years instead of one, the task gets ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function that would cut the number of steps needed for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates: a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
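For comparison, the record-generation step described above (every day from the minimum date up to one year later, with missing days zero-filled) is a short loop outside PowerCenter. A Python sketch, treating "one year" as 365 days; the function name, the dict-of-fields shape, and the sample values are made up for illustration:

```python
from datetime import date, timedelta

def fill_missing_dates(input_rows, days=365):
    """input_rows maps date -> field value. Returns a mapping covering
    every day from min(date) through min(date) + days, with 0 for days
    that had no input row."""
    start = min(input_rows)
    filled = {}
    for offset in range(days + 1):
        d = start + timedelta(days=offset)
        filled[d] = input_rows.get(d, 0)
    return filled

rows = {date(2017, 3, 18): 7, date(2018, 3, 18): 9}
filled = fill_missing_dates(rows)
```

Scaling to 10 years is just `days=3650`; nothing in the loop grows, which is exactly the property the aggregator-with-365-fields approach lacks.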
I'm not sure what the aggregator is supposed to do. The same goes for the full outer join; a normal join on a constant port is fine :)
Can you calculate the needed number of 'duplicates' before the joiner? In that case a lookup configured to return 'all rows', combined with a less-than-or-equal predicate, can make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers from 1 to the number of potential duplicates (or more).
I use our time dimension in the warehouse, which has one row per day from 1753-01-01 through the next 200000 days, and a primary integer column with values from 1 and up.
You've identified that you know how to do this programmatically, and to be fair this problem is more suited to that sort of solution... but that doesn't exclude PowerCenter by any means. Just feed the 2 dates into a Java transformation and apply some code to produce all the dates between them, outputting a record for each. The Java transformation is ideal for record generation.
OK... so you could override your source qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your input data comes from a table). I looked up the CONNECT BY syntax here:
SQL to generate a list of numbers from 1 to 100
SELECT (MIN(tablea.DATEFIELD) + levquery.n - 1) AS Port1
FROM tablea, (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) levquery
GROUP BY levquery.n
(Check whether the query works for you - I don't have access to a PC to test it at the minute.)
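CONNECT BY LEVEL is Oracle-specific; on PostgreSQL and most other engines the same 1..365 row generator is written as a recursive CTE (Postgres also offers generate_series). The recursive-CTE form of that inner query, demonstrated here against SQLite purely to show it runs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Portable equivalent of: SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365
rows = conn.execute("""
    WITH RECURSIVE seq(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < 365
    )
    SELECT n FROM seq
""").fetchall()
```

Either generator can then be cross-joined against the source table exactly as in the Oracle query above.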
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
SQL - order by list order
I have written a SQL query passing a list of acct numbers using an IN clause, like:
select acctnum,acctname from account where acctnum IN ('100', '200', '300')
This is the way I am passing the acct numbers in the query, but Oracle returns the data in the order below.
acctnum acctname
200 Bob
100 Aaron
300 Chandler
But I want the data to be displayed in the order in which I pass the acct numbers, i.e. display the record for acct num '100' first, followed by the others.
Hence, I want the result to be like:
acctnum acctname
100 Aaron
200 Bob
300 Chandler
Is there any Oracle function we can use to display the records in the order in which they are passed? Please let me know your opinion.
Thanks.
AFAIK, there is no way of relying on the default order, as the retrieval order depends upon the optimiser and the order of the rows in the table, as well as other factors.
To achieve what you want here you'll need to explicitly order the output, something like:
SELECT acctnum,
acctname
FROM account
WHERE acctnum IN ('100', '200', '300')
ORDER BY (CASE acctnum
WHEN '100' THEN 1
WHEN '200' THEN 2
WHEN '300' THEN 3
END)
Hope this helps,
Ollie
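The CASE-based ORDER BY is standard SQL, so it behaves the same outside Oracle. A quick demonstration using SQLite with the question's data (table layout assumed from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE account(acctnum TEXT, acctname TEXT);"
    "INSERT INTO account VALUES ('200','Bob'),('100','Aaron'),('300','Chandler');"
)

# Map each acctnum to its position in the caller's list, then sort by it
rows = conn.execute("""
    SELECT acctnum, acctname
    FROM account
    WHERE acctnum IN ('100', '200', '300')
    ORDER BY CASE acctnum
                 WHEN '100' THEN 1
                 WHEN '200' THEN 2
                 WHEN '300' THEN 3
             END
""").fetchall()
```

If the list is built dynamically, the CASE arms have to be generated alongside the IN list, since both encode the same caller-supplied order.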