Why does the d[c] assignment not work here?
d: `a`b!(1;2)
d
a| 1
b| 2
d[`c]: d
'type
[0] d[`c]: d
(PS: it doesn't work with any dictionary as the value, not just the recursive example shown here.)
Your attempted assignment fails because you're trying to add to a "typed" dictionary (the value type being long, in this case). You'll encounter the same error trying to add a key-value pair with, for example, a symbol as the value:
q)d[`c]:`s
'type
[0] d[`c]:`s
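You can confirm the value type that is being enforced (a quick check, using the same d as above):
q)type value d
7h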
You can get around this by using a dictionary without a specified type for the values:
q)d:enlist[`]!enlist(::)
q)d[`a]:12.5
q)d[`b]:d
q)d
| ::
a| 12.5
b| ``a!(::;12.5)
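The values are now stored as a general (mixed) list, so entries of any type can be added (again, checking the same d):
q)type value d
0h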
I have a number of repetitive queries:
select lR, e10, e25, vr from z
Is there a way I can do something like:
features: `lR`e10`e25`vr
select features from z
You could use # like so:
`lR`e10`e25`vr#z
NB: The left argument here must be a list, so to select a single column use the following:
enlist[`vr]#z
Example:
q)t:([]a:`a`b`c;b:til 3;c:0b);
q)`a`b#t
a b
---
a 0
b 1
c 2
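And for the single-column case mentioned above, with the same t:
q)enlist[`a]#t
a
-
a
b
c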
Another approach is to use a functional form (which you can build using parse):
q)0N!parse"select lR, e10, e25, vr from z";
(?;`z;();0b;`lR`e10`e25`vr!`lR`e10`e25`vr)
q)features:`lR`e10`e25`vr
q)?[z;();0b;features!features]
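As a quick sanity check that the two forms agree, here's a toy table standing in for z (an assumption, since z itself isn't shown):
q)t:([]lR:1 2;e10:3 4;e25:5 6;vr:7 8)
q)features:`lR`e10`e25`vr
q)r1:select lR, e10, e25, vr from t
q)r2:?[t;();0b;features!features]
q)r1~r2
1b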
If you use # for this then be aware it will fail on a keyed table.
One possible way of modifying it to work on any table would be something like:
f:{[t;c]
  if[not .Q.qt[t]; '"Input is not a table"];  / .Q.qt is 1b for both keyed and unkeyed tables
  c,:();                                      / ensure the column names are a list
  $[99h = type[t]; c#/:t; c#t]                / keyed table: take from each row (preserves the key); simple table: plain take
  }
So make sure your table is, in fact, a table, make sure columns are a list, and then perform the required # operation.
q)t
a| b c d
-| ------
a| 4 7 10
b| 5 8 11
c| 6 9 12
q)f[t;`b]
a| b
-| -
a| 4
b| 5
c| 6
q)f[0!t;`b]
b
-
4
5
6
q)f[flip 0!t;`b]
'Input is not a table
[0] f[flip 0!t;`b]
^
How do I get columns from a table? If they don't exist it's ok to get them as null columns.
Trying reshape #:
q)d:`a`b!1 2
q)enlist d
a b
---
1 2
q)`a`c#d
a| 1
c|
q)`a`c#enlist d
'c
[0] `a`c#enlist d
^
Why does the reshape # operator not work on a table? It could easily act on each row (which is a dict) and combine the results. So I'm forced to write:
q)`a`c#/:enlist d
a c
---
1
Is it the shortest way?
Any key you try to take (#) which is not present in a dictionary will be assigned a null value of the same type as the first value in the dictionary. Similar behaviour is not available for tables.
q)`a`c#`a`b!(1 2;())
a| 1 2
c| `long$()
q)`b`c#`a`b!(();1 2)
b| 1 2
c| ()
Like you mentioned, the use of each-right (/:) will act on each row of the table, i.e. each dictionary. Instead of using an iterator to split the table into dictionaries, we can act on the dictionary itself. This returns the same output and is slightly faster.
q)d:`a`b!1 2
q)enlist`a`c#d
a c
---
1
q)(`a`c#/:enlist d)~enlist`a`c#d
1b
q)\ts:1000000 enlist`a`c#d
395 864
q)\ts:1000000 `a`c#/:enlist d
796 880
Using
from pyspark.sql import functions as f
and the methods f.agg and f.collect_set, I have created a column colSet within a dataFrame as follows:
+-------+--------+
| index | colSet |
+-------+--------+
| 1|[11, 13]|
| 2| [3, 6]|
| 3| [3, 7]|
| 4| [2, 7]|
| 5| [2, 6]|
+-------+--------+
Now, how is it possible, using Python and PySpark, to select only those rows where, for instance, 3 is an element of the array in the colSet entry (where in general there can be far more than only two entries!)?
I have tried using a udf function like this:
from pyspark.sql.types import BooleanType

isInSet = f.udf(lambda vcol, val: val in vcol, BooleanType())
being called via
dataFrame.where(isInSet(f.col('colSet'), 3))
I also tried removing f.col from the caller and using it in the definition of isInSet instead, but neither worked; I am getting an exception:
AnalysisException: cannot resolve '3' given input columns: [index, colSet]
Any help is appreciated on how to select rows with a certain entry (or, even better, a subset!) given a row with a collect_set result.
Your original UDF is fine, but to use it you need to pass the value 3 as a literal:
dataFrame.where(isInSet(f.col('colSet'), f.lit(3)))
But as jxc points out in a comment, using array_contains is probably a better choice:
dataFrame.where(f.array_contains(f.col('colSet'), 3))
I have not done any benchmarking, but in general using UDFs in PySpark is slower than using built-in functions because of the back-and-forth communication between the JVM and the Python interpreter.
I found a solution today (after failing on Friday evening) without using a UDF method:
[3 in x[0] for x in list(dataFrame.select(['colSet']).collect())]
Hope this helps someone else in the future.
I am able to add and assign the second dictionary (keys s, i) to the one with keys d, t:
d1:`d`t!(.z.d ;.z.t)
d1,:`s`i!`VOD`L
d1
However, the other way round does not work; I am getting a type error:
d2:`s`i!`VOD`L
d2,:`d`t!(.z.d ;.z.t)
d2
When dictionary d2 was created, all of the values were symbols. When you try to update this using d2,: with non-symbol types, kdb throws an error due to the mismatched types. One way to prevent this is to add a null key to your dictionary, which ensures your values can have mixed types:
q)d2:enlist[`]!enlist(::) / add null key
q)d2,:`s`i!`VOD`L
q)d2
| ::
s| `VOD
i| `L
q)d2,:`d`t!(.z.d ;.z.t)
q)d2
| ::
s| `VOD
i| `L
d| 2018.03.25
t| 09:42:52.754
If you investigate a namespace, for example .q or create your own, you will see that the null key exists, ensuring namespaces can contain mixed types.
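For example, creating a namespace of your own shows the same pattern (a small illustration; .ns is just an arbitrary name):
q).ns.x:1 / defining a variable creates the namespace
q).ns / a namespace is a dictionary with the null key already present
| ::
x| 1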
In the first case, (d t) makes a heterogeneous dictionary:
q)d1:`d`t!(.z.d ;.z.t)
q)type value d1
0h
Now if you add and assign any homogeneous or heterogeneous dictionary, it will work.
In the other case, the first dictionary created is homogeneous, and it throws an error when you add & assign a heterogeneous dictionary (or a homogeneous dictionary of another type, for that matter):
q)d2:`s`i!`VOD`L
q)type value d2
11h
q)type value `d`t!(.z.d ;.z.t)
0h
To solve this issue, join the dictionaries and then assign the result (rather than using ,:).
q)d2:`s`i!`VOD`L
q)d2:d2, `d`t!(.z.d ;.z.t)
q)d2
s| `VOD
i| `L
d| 2018.03.25
t| 09:59:17.109
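After the join, the values form a mixed list, so subsequent ,: updates of any type will also work:
q)type value d2
0h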
My question is quite similar to this one: Apache Spark SQL issue: java.lang.RuntimeException: [1.517] failure: identifier expected. But I just can't figure out where my problem lies. I am using SQLite as the database backend. Connecting and simple select statements work fine.
The offending line:
val df = tableData.selectExpr(tablesMap(t).toSeq:_*).map(r => myMapFunc(r))
tablesMap contains the table name as key and an array of strings as expressions. Printed, the array looks like this:
WrappedArray([My Col A], [ColB] || [Col C] AS ColB)
The table name is also included in square brackets since it contains spaces. The exception I get:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: identifier expected
I already made sure not to use any Spark Sql keywords. In my opinion there are 2 possible reasons why this code fails: 1) I somehow handle spaces in column names wrong. 2) I handle concatenation wrong.
I am using a resource file, CSV-like, which contains the expressions I want to be evaluated on my tables. Apart from this file, I want to allow the user to specify additional tables and their respective column expressions at runtime. The file looks like this:
TableName,`Col A`,`ColB`,CONCAT(`ColB`, ' ', `Col C`)
Apparently this does not work. Nevertheless I would like to reuse this file, modified of course. My idea was to map the columns with the expressions from an array of strings, like now, to a sequence of Spark columns. (This is the only solution I could think of, since I want to avoid pulling in all the Hive dependencies just for this one feature.) I would introduce a small syntax for my expressions to mark raw column names with a $ and some keywords for functions like concat and as. But how could I do this? I tried something like this, but it's far, far away from even compiling.
def columnsMapFunc(expr: String): Column = {
  if (expr(0) == '$')
    return expr.drop(1)
  else
    return concat(extractedColumnNames).as(newName)
}
Generally speaking, using names containing whitespace is asking for trouble, but replacing the square brackets with backticks should solve the problem:
val df = sc.parallelize(Seq((1,"A"), (2, "B"))).toDF("f o o", "b a r")
df.registerTempTable("foo bar")
df.selectExpr("`f o o`").show
// +-----+
// |f o o|
// +-----+
// | 1|
// | 2|
// +-----+
sqlContext.sql("SELECT `b a r` FROM `foo bar`").show
// +-----+
// |b a r|
// +-----+
// | A|
// | B|
// +-----+
For concatenation you have to use the concat function:
df.selectExpr("""concat(`f o o`, " ", `b a r`)""").show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// | 1 A|
// | 2 B|
// +----------------------+
but it requires HiveContext in Spark 1.4.0.
In practice I would simply rename the columns after loading the data
df.toDF("foo", "bar")
// org.apache.spark.sql.DataFrame = [foo: int, bar: string]
and use functions instead of expression strings (the concat function is available only in Spark >= 1.5.0; for 1.4 and earlier you'll need a UDF):
import org.apache.spark.sql.functions.concat
df.select($"f o o", concat($"f o o", lit(" "), $"b a r")).show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// | 1 A|
// | 2 B|
// +----------------------+
There is also the concat_ws function, which takes the separator as its first argument:
df.selectExpr("""concat_ws(" ", `f o o`, `b a r`)""")
df.select($"f o o", concat_ws(" ", $"f o o", $"b a r"))