I am learning from this site:
http://naildrivin5.com/scalatour/wiki_pages/ExplcitlyTypedSelfReferences/
trait BaseComponent {
protected var label:Label = _
}
In this case, what does the placeholder _ stand for? What would be the alternative?
The placeholder syntax for a variable assigns the default value of its type to the variable. Assuming Label extends AnyRef, that would be null.
The Scala language specification lays this out explicitly:
4.2 Variable Declarations and Definitions:
A variable definition var x: T = _ can appear only as a member of a template. It introduces a mutable field with type T and a default initial value. The default value depends on the type T as follows:
| default | type T |
|---------|----------------------------------|
| 0 | Int or one of its subrange types |
| 0L | Long |
| 0.0f | Float |
| 0.0d | Double |
| false | Boolean |
| () | Unit |
| null | all other types |
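For the Label field in the question this means the placeholder and an explicit null initializer are equivalent; a minimal sketch, assuming Label is an ordinary class (i.e. a subtype of AnyRef):

class Label

trait BaseComponent {
  // placeholder: the compiler supplies the type's default value
  protected var label: Label = _
}

trait BaseComponentExplicit {
  // explicit alternative: for an AnyRef type the default is null
  protected var label: Label = null
}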
I am new to PostgreSQL.Simple, so please forgive me if this question is dumb.
I am going through this tutorial: Postgresql Data Access with Haskell
I got the first demo program to run.
Now I am taking this function:
retrieveClient :: Connection -> Int -> IO [Only String]
retrieveClient conn cid = query conn "SELECT ticker FROM spot_daily WHERE id = ?" $ (Only cid)
and wish to modify it to return IO [(String, Integer, Float)].
So I wrote:
retrieveClient2 :: Connection -> Float -> IO [(String, Integer, Float)]
retrieveClient2 conn cid = query conn "SELECT (ticker, timestamp, some_val) FROM spot_daily WHERE p_open > ?" $ (Only cid)
main :: IO ()
main = do
conn <- connect localPG
mapM_ print =<< (retrieveClient2 conn 50.0)
and I get this:
MyApp-exe.EXE: Incompatible {errSQLType = "record", errSQLTableOid = Nothing, errSQLField = "row", errHaskellType = "Text", errMessage = "types incompatible"}
It's common in the Haskell world to say, "Search for the type signature you want on Hackage!" but it's not clear to me from the error message what type signature would make GHC happy.
Is there a conversion function for this sort of thing? I tried doing this:
data MyStruct = MyStruct { field1 :: String, field2 :: Integer, field3 :: Float } deriving (Eq, Show)
retrieveClient3 :: Connection -> Int -> IO [Only MyStruct]
retrieveClient3 conn cid = MyStruct (query conn "SELECT ticker FROM spot_daily WHERE id = ?" $ (Only cid))
but that results in a different error.
In response to a comment, here is the schema for spot_daily:
Table "public.spot_daily"
Column | Type | Collation | Nullable | Default
-----------+-----------------------+-----------+----------+----------------------------------------
id | integer | | not null | nextval('spot_daily_id_seq'::regclass)
ticker | character varying(20) | | not null |
epoch | bigint | | not null |
p_open | double precision | | |
p_close | double precision | | |
p_high | double precision | | |
p_low | double precision | | |
synthetic | boolean | | |
Indexes:
"spot_daily_pkey" PRIMARY KEY, btree (id)
PostgreSQL types make a distinction between a "row" and a "record". As written (with parentheses), your SQL query is returning a record, which isn't handled by the FromRow instance for tuples.
SELECT (ticker, timestamp, some_val) FROM spot_daily WHERE p_open > ?
Changing the query (by removing the parentheses) makes it return a row, which postgresql-simple should be able to handle:
SELECT ticker, timestamp, some_val FROM spot_daily WHERE p_open > ?
There is the following table:
case class Thing(name: String, color: Option[String], height: Option[String])
class ThingSchema(t: Tag) extends Table[Thing](t, "things") {
def name = column[String]("name")
def color = column[Option[String]]("color")
def height = column[Option[String]]("height")
def * = (name, color, height) <> (Thing.tupled, Thing.unapply)
}
val things = TableQuery[ThingSchema]
For example, the things table contains the following data:
| name | color | height |
+---------+-----------+--------+
| n1 | green | <null> |
| n1 | green | <null> |
| n1 | <null> | normal |
| n1 | <null> | normal |
| n1 | red | <null> |
| n2 | red | <null> |
I need to get the following result from the above data:
| name | color | height | size |
+---------+-----------+--------+------+
| n1 | green | <null> | 2 |
| n1 | <null> | normal | 2 |
| n1 | red | <null> | 1 |
| n2 | red | <null> | 1 |
To solve this task I use the following grouping queries:
SELECT name, color, null, count(*) AS size
FROM things
GROUP BY name, color
UNION ALL
SELECT name, null, height, count(*) AS size
FROM things
GROUP BY name, height
I've tried to create this query with Slick:
val query1 =
things.groupBy(t => (t.name, t.color))
.map { case ((name, color), g) => (name,color,None, g.size)} //Error#1
val query2 =
things.groupBy(t => (t.name, t.height))
.map { case ((name, height), g) => (name,None,height,g.size)} //Error#1
val query = query1 ++ query2
But the above code doesn't compile, because Slick can't infer the type of the ConstColumn for the None values (see the //Error#1 comments in the code above).
This would work for non-null values (such as numbers or strings), but it doesn't work for nullable values, which are represented as, for example, Option[String] = None.
How can I use ConstColumn for None values in this case?
Here is the link to the same question
I've found another solution for this task. Perhaps it will be useful for someone.
We can use Rep.None[T] or Rep.Some[T] to generate ConstColumn values for optional types.
This example works too:
val query1 =
things.groupBy(t => (t.name, t.color))
.map { case ((name, color), g) =>
(name,color, Rep.None[String], g.size)
}
This approach has two advantages:
1) We can assign it to a more general type. For example:
val field: Rep[String] = ...
val x: (Rep[String], Rep[Option[String]]) = (field, Rep.None[String])
// the following does not compile, because the tuple has type (Rep[String], Option[String])
val y: (Rep[String], Rep[Option[String]]) = (field, None: Option[String])
2) This approach is a bit shorter
The error I'd expect in this situation is a type-mismatch of some kind between Option[String] and None.type at the two //Error points in your code.
What you can do is give a type annotation on the None:
val query1 =
things.groupBy(t => (t.name, t.color))
.map { case ((name, color), g) => (name,color, None: Option[String], g.size)}
That will compile into the SELECT name, color, null, count(*) pattern you're using.
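Applying the same annotation to both halves, the whole UNION ALL from the question can be written roughly like this (a sketch against the ThingSchema above; Rep.None[String] from the other answer would work the same way):

val query1 =
  things.groupBy(t => (t.name, t.color))
    .map { case ((name, color), g) =>
      // the null "height" column, emitted as a typed constant
      (name, color, None: Option[String], g.size)
    }

val query2 =
  things.groupBy(t => (t.name, t.height))
    .map { case ((name, height), g) =>
      // the null "color" column, emitted as a typed constant
      (name, None: Option[String], height, g.size)
    }

// ++ is Slick's UNION ALL
val query = query1 ++ query2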
I have a table with ~300 columns filled with characters (stored as String):
valuesDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| U | C | ...
| U | E | ...
| I | B | ...
| C | U | ...
| ... | ... | ...
I have a Data Summary, which maps the characters onto their actual meaning. It is in this form:
summaryDF:
| Field | Value | ValueDesc |
|------------------|-------|---------------|
| FavouriteBeer | U | Unknown |
| FavouriteBeer | C | Carlsberg |
| FavouriteBeer | I | InnisAndGunn |
| FavouriteBeer | D | DoomBar |
| FavouriteCheese | C | Cheddar |
| FavouriteCheese | E | Emmental |
| FavouriteCheese | B | Brie |
| FavouriteCheese | U | Unknown |
| ... | ... | ... |
I want to programmatically replace the character values of each column in valuesDF with the Value Descriptions from summaryDF. This is the result I'm looking for:
finalDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| Unknown | Cheddar | ...
| Unknown | Emmental | ...
| InnisAndGunn | Brie | ...
| Carlsberg | Unknown | ...
| ... | ... | ...
As there are ~300 columns, I'm not keen to type out withColumn methods for each one.
Unfortunately I'm a bit of a novice when it comes to programming for Spark, although I've picked up enough to get by over the last 2 months.
What I'm pretty sure I need to do is something along the lines of:
valuesDF.columns.foreach { col => ...... } to iterate over each column
Filter summaryDF on Field using col String value
Left join summaryDF onto valuesDF based on current column
withColumn to replace the original character code column from valuesDF with new description column
Assign new DF as a var
Continue loop
However, trying this gave me a Cartesian product error (even though I made sure to define the join as "left").
I also tried and failed to pivot summaryDF (as there are no aggregations to do??) and then join the two dataframes together.
This is the sort of thing I've tried, and I always get a NullPointerException. I know this is really not the right way to do it, and I can see why I'm getting the NullPointerException... but I'm really stuck and reverting to old, silly & bad Python habits in desperation.
var valuesDF = sourceDF
// I converted summaryDF to a broadcasted RDD
// because its small and a "constant" lookup table
summaryBroadcast
.value
.foreach{ x =>
// searchValue = Value (e.g. `U`),
// replaceValue = ValueDescription (e.g. `Unknown`),
val field = x(0).toString
val searchValue = x(1).toString
val replaceValue = x(2).toString
// error catching as the summary data does not map exactly onto the field names
// the joys of business people working in Excel...
try {
// I'm using regexp_replace because I'm lazy
valuesDF = valuesDF
.withColumn(field, regexp_replace(col(field), searchValue, replaceValue))
}
catch {case _: Exception =>
null
}
}
Any ideas? Advice? Thanks.
First, we'll need a function that joins valuesDF with summaryDf on Value and the matching Favourite* / Field pair:
private def joinByColumn(colName: String, sourceDf: DataFrame): DataFrame = {
sourceDf.as("src") // alias it to help selecting appropriate columns in the result
// the join
.join(summaryDf, $"Value" === col(colName) && $"Field" === colName, "left")
// we do not need the original `Favourite*` column, so drop it
.drop(colName)
// select all previous columns, plus the one that contains the match
.select("src.*", "ValueDesc")
// rename the resulting column to have the name of the source one
.withColumnRenamed("ValueDesc", colName)
}
Now, to produce the target result we can iterate on the names of the columns to match:
val result = Seq("FavouriteBeer",
"FavouriteCheese").foldLeft(valuesDF) {
case(df, colName) => joinByColumn(colName, df)
}
result.show()
+-------------+---------------+
|FavouriteBeer|FavouriteCheese|
+-------------+---------------+
| Unknown| Cheddar|
| Unknown| Emmental|
| InnisAndGunn| Brie|
| Carlsberg| Unknown|
+-------------+---------------+
In case a value from valuesDF does not match anything in summaryDf, the resulting cell in this solution will contain null. If you just want to replace it with the Unknown value, then instead of the .select and .withColumnRenamed lines above, use:
.withColumn(colName, when($"ValueDesc".isNotNull, $"ValueDesc").otherwise(lit("Unknown")))
.select("src.*", colName)
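Since the real table has ~300 columns, the column list can also be derived from summaryDf rather than written out by hand; a sketch, assuming every Field value in summaryDf names a column that actually exists in valuesDF:

// Distinct Field names from the summary, kept only if they are real columns
// of valuesDF (guards against stray entries in the business-supplied data).
val columnsToMap: Seq[String] = summaryDf
  .select("Field")
  .distinct()
  .collect()
  .map(_.getString(0))
  .filter(c => valuesDF.columns.contains(c))
  .toSeq

// Fold joinByColumn over every mapped column, as in the two-column example above.
val result = columnsToMap.foldLeft(valuesDF) {
  case (df, colName) => joinByColumn(colName, df)
}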
I have a dataframe like this:
|-----+-----+-------+---------|
| foo | bar | fox | cow |
|-----+-----+-------+---------|
| 1 | 2 | red | blue | // row 0
| 1 | 2 | red | yellow | // row 1
| 2 | 2 | brown | green | // row 2
| 3 | 4 | taupe | fuschia | // row 3
| 3 | 4 | red | orange | // row 4
|-----+-----+-------+---------|
I need to group the records by "foo" and "bar" and then perform some magical computation on "fox" and "cow" to produce "badger", which may insert or delete records:
|-----+-----+-------+---------+---------|
| foo | bar | fox | cow | badger |
|-----+-----+-------+---------+---------|
| 1 | 2 | red | blue | zebra |
| 1 | 2 | red | blue | chicken |
| 1 | 2 | red | yellow | cougar |
| 2 | 2 | brown | green | duck |
| 3 | 4 | red | orange | peacock |
|-----+-----+-------+---------+---------|
(In this example, row 0 has been split into two "badger" values, and row 3 has been deleted from the final output.)
My best approach so far looks like this:
val groups = df.select("foo", "bar").distinct
groups.flatMap(row => {
val (foo, bar) = (row.getString(0), row.getString(1))
val group: DataFrame = df.where(s"foo == '$foo' AND bar == '$bar'")
val rowsWithBadgers: List[Row] = makeBadgersFor(group)
rowsWithBadgers
})
This approach has a few problems:
It's clumsy to match on foo and bar individually. (A utility method can clean that up, so not a big deal.)
It throws an Invalid tree: null\nnull error because of the nested operation in which I try to refer to df from inside groups.flatMap. I don't know how to get around that one yet.
I'm not sure whether this mapping and filtering actually leverages Spark distributed computation correctly.
Is there a more performant and/or elegant approach to this problem?
This question is very similar to Spark DataFrame: operate on groups, but I'm including it here because 1) it's not clear if that question requires addition and deletion of records, and 2) the answers in that question are out-of-date and lacking detail.
I don't see a way to accomplish this with groupBy and a user-defined aggregate function, because an aggregation function aggregates to a single row. In other words,
udf(<records with foo == 'foo' && bar == 'bar'>) => [foo,bar,aggregatedValue]
I need to possibly return two or more different rows, or zero rows after analyzing my group. I don't see a way for aggregation functions to do this -- if you have an example, please share.
A user-defined aggregator can be used.
The single row it returns for each group can contain a list.
You can then explode that list into multiple rows and reconstruct the columns.
The aggregator:
import org.apache.spark.sql.{Encoder, TypedColumn}
import org.apache.spark.sql.Encoders.kryo
import org.apache.spark.sql.expressions.Aggregator
case class StuffIn(foo: BigInt, bar: BigInt, fox: String, cow: String)
case class StuffOut(foo: BigInt, bar: BigInt, fox: String, cow: String, badger: String)
object StuffOut {
def apply(stuffIn: StuffIn): StuffOut = new StuffOut(stuffIn.foo,
stuffIn.bar, stuffIn.fox, stuffIn.cow, "dummy")
}
object MultiLineAggregator extends Aggregator[StuffIn, Seq[StuffOut], Seq[StuffOut]] {
def zero: Seq[StuffOut] = Seq[StuffOut]()
def reduce(buffer: Seq[StuffOut], stuff: StuffIn): Seq[StuffOut] = {
makeBadgersForDummy(buffer, stuff)
}
def merge(b1: Seq[StuffOut], b2: Seq[StuffOut]): Seq[StuffOut] = {
b1 ++: b2
}
def finish(reduction: Seq[StuffOut]): Seq[StuffOut] = reduction
def bufferEncoder: Encoder[Seq[StuffOut]] = kryo[Seq[StuffOut]]
def outputEncoder: Encoder[Seq[StuffOut]] = kryo[Seq[StuffOut]]
}
The call:
import spark.implicits._  // for the $ column syntax and implicit encoders

val multiLineAgg: TypedColumn[StuffIn, Seq[StuffOut]] = MultiLineAggregator.toColumn
val res: DataFrame =
ds.groupByKey(x => (x.foo, x.bar))
.agg(multiLineAgg)
.map(_._2)
.withColumn("value", explode($"value"))
.withColumn("foo", $"value.foo")
.withColumn("bar", $"value.bar")
.withColumn("fox", $"value.fox")
.withColumn("cow", $"value.cow")
.withColumn("badger", $"value.badger")
.drop("value")
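If working with a typed Dataset is acceptable, a shorter alternative to the kryo-based aggregator is groupByKey followed by flatMapGroups, which can emit zero, one, or many StuffOut rows per group directly. A sketch, assuming a hypothetical makeBadgersFor: Seq[StuffIn] => Seq[StuffOut] standing in for the question's "magical computation", and that the implicit product encoder can handle StuffOut's fields:

import org.apache.spark.sql.Dataset
import spark.implicits._  // spark is the active SparkSession

val res: Dataset[StuffOut] =
  ds.groupByKey(x => (x.foo, x.bar))
    .flatMapGroups { (_, rows) =>
      // makeBadgersFor may return more or fewer rows than the group contains
      makeBadgersFor(rows.toSeq)
    }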
I need to return only a portion of the value in a given field.
Example:
A given field returns something like 'AB-1X3.4567', but the desired value is only the '1X3.4567' portion. So for this example I need to remove anything that precedes the pattern of
[0-9,A-Z][0-9,A-Z][0-9,A-Z][.][0-9,A-Z][0-9,A-Z][0-9,A-Z][0-9,A-Z].
What query could I write to do this?
using stuff() and patindex():
create table t (val varchar(32))
insert into t values
('AB-1X3.4567') -- given example
,('1X3.4567AB-1X3.4567') --extra junk on the end
,('1X3.4567') -- goldilocks
,('X3.4567') -- too short
,('AB-1X#.4567') -- # is not [0-9A-Z]
select
val
, str = stuff(val,1,patindex('%[0-9A-Z][0-9A-Z][0-9A-Z][.][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',val)-1,'')
from t
rextester demo: http://rextester.com/ITUJ68634
returns:
+---------------------+---------------------+
| val | str |
+---------------------+---------------------+
| AB-1X3.4567 | 1X3.4567 |
| 1X3.4567AB-1X3.4567 | 1X3.4567AB-1X3.4567 |
| 1X3.4567 | 1X3.4567 |
| X3.4567 | NULL |
| AB-1X#.4567 | NULL |
+---------------------+---------------------+
Your pattern amounts to anything of the form XXX.XXXX, where X is any single digit or letter. In that case we can use RIGHT() and LEN():
DECLARE @value VARCHAR(4000) = 'AB-1X3.4567'
SELECT RIGHT(@value, LEN(@value) - 3)