Haskell HDBC Elegance in F#?

I'm struck by Haskell's terseness and elegance. But I work in a .Net house, so I use F# when I can get away with it--I may be the only one of hundreds across the country who uses it.
Does ADO.NET or F# offer something as terse and elegant as HDBC's executeMany? I'm making my way through Real World Haskell. In chapter 21 it offers this example:
ghci> conn <- connectSqlite3 "test1.db"
ghci> stmt <- prepare conn "INSERT INTO test VALUES (?, ?)"
ghci> executeMany stmt [[toSql 5, toSql "five's nice"], [toSql 6, SqlNull]]
ghci> commit conn
ghci> disconnect conn
I'd like to get this elegance and terseness in my F#. I've seen a lot of hype around using parameterized queries to avoid SQL injection attacks. I'm not using them in this case for three reasons:
I find parameterized queries in .Net ugly and burdensome.
My data comes from the corporate office, so it's (mostly) clean.
My table has 34 columns. I despise the idea of parameterizing a query with 34 columns.
Here's my F# code:
module Data

open System
open System.Data
open System.Data.OleDb
open System.Text.RegularExpressions

type Period = Prior | Current

let Import period records db =
    use conn = new OleDbConnection(@"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + db + ";Persist Security Info=False;")
    let execNonQuery s =
        use comm = new OleDbCommand(s, conn)
        comm.ExecuteNonQuery() |> ignore
    let enquote = sprintf "\"%s\""
    let escapeQuotes s = Regex.Replace(s, "\"", "\"\"")
    let join (ss: string[]) = String.Join(",", ss)
    let table =
        match period with
        | Prior -> "tblPrior"
        | Current -> "tblCurrent"
    let statements =
        [| for r in records do
               let vs = r |> Array.map (escapeQuotes >> enquote) |> join
               let vs' = vs + sprintf ",\"14\",#%s#" (DateTime.Now.ToString "yyyy-MM-dd")
               yield sprintf "INSERT INTO %s ( [Field01], [Field02], [Field03], [Field04], [Field05], [Field06], [Field07], [Field08], [Field09], [Field10], [Field11], [Field12], [Field13], [Field14], [Field15], [Field16], [Field17], [Field18], [Field19], [Field20], [Field21], [Field22], [Field23], [Field24], [Field25], [Field26], [Field27], [Field28], [Field29], [Field30], [Field31], [Field32], [Field33], [Field34] ) VALUES (%s)" table vs' |]
    conn.Open()
    execNonQuery (sprintf "DELETE FROM %s" table)
    statements |> Array.iter execNonQuery
I've renamed the fields of the table(s) for security reasons.
Because all the fields on the table are text, I can easily Array.map them to escape and quote the values.
At between 9,000 and 10,000 records per day to import into each of the two tables, I want to do this as efficiently as possible; hence my interest in Haskell's executeMany. Still, I like the idea behind parameterized queries, and I like the way Haskell has implemented them. Is there something equivalent in terseness and elegance in F#?

I agree with @JonnyBoats' comment that, generally, using an F# SQL type provider like SqlDataConnection (LINQ-to-SQL) or SqlEntityConnection (Entity Framework) would be far more elegant than any kind of solution involving building insert-statement strings by hand.
But there is one important qualifier in your question: "At between 9,000 and 10,000 records per day to import to each of the two tables, I want to do this as efficiently as possible." In a scenario like this you'll want to use SqlBulkCopy for efficient bulk inserts (it leverages native database driver features to achieve much faster inserts than you are likely getting with HDBC's executeMany).
Here's a small example that should help you get started with SqlBulkCopy in F#: https://stackoverflow.com/a/8942056/236255. Note that you'll be working with a DataTable to stage the data, which, though old and somewhat awkward to use from F#, is still superior to building insert-statement strings in my opinion.
Update in response to comment
Here's a generalized approach to using SqlBulkCopy which is improved for your scenario (we pass in a column specification separately from the row data, and both are dynamic):
//you must reference System.Data and System.Xml
open System
open System.Data
open System.Data.SqlClient

let bulkLoad (conn: SqlConnection) tableName (columns: list<string * Type>) (rows: list<list<obj>>) =
    use sbc = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null, BatchSize = 500, BulkCopyTimeout = 1200, DestinationTableName = tableName)
    sbc.WriteToServer(
        let dt = new DataTable()
        columns |> List.iter (dt.Columns.Add >> ignore)
        for row in rows do
            let dr = dt.NewRow()
            row |> Seq.iteri (fun i value -> dr.[i] <- value)
            dt.Rows.Add(dr)
        dt)
//example usage:
//note: since you know all your columns are of type string, you could define columns as
//let columns = ["Field1"; "Field2"; "Field3"] |> List.map (fun name -> name, typeof<String>)
let columns = [
    "Field1", typeof<String>
    "Field2", typeof<String>
    "Field3", typeof<String>
]

let rows = [
    ["a"; "b"; "c"]
    ["d"; "e"; "f"]
    ["g"; "h"; "i"]
    ["j"; "k"; "l"]
    ["m"; "n"; "o"]
]
//a little funkiness to transform our list<list<string>> to list<list<obj>>,
//probably not needed in practice because you won't be constructing your lists literally
let rows = rows |> List.map (fun row -> row |> List.map (fun value -> value :> obj))
bulkLoad conn "tblPrior" columns rows
You could get even fancier and more terse using reflection. For example, create a type like
type RowData = { Field1:string; Field2:string; Field3:string }
and write a bulkLoad with a signature that takes a list<'a> argument, reflects over the property names and types of 'a to build the DataTable columns, and likewise uses reflection to iterate over the properties of each row instance to create and add rows to the DataTable. In fact, this question shows how to make a generic ToDataTable method that does exactly that (in C#).

Related

How to explode a struct column with a prefix?

My goal is to explode a Spark struct column (i.e., take the fields from inside the struct and expose them as top-level columns of the dataset), which is already done, but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it, so I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job all right; I use it like this:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that. I then changed my approach: take the schema of the DataFrame, filter the result by the name of the column, and extract the column found from the resulting array. The problem is that this element has type StructField, which again offers no way to extract its inner fields, whereas what I would really like is to get hold of a StructType element, whose fields member does exactly what I want (that is, show me the names of the inner columns, so I can iterate over them and use them in my select, prepending the prefix I want to them). I know no way to convert a StructField to a StructType.
My last resort would be to parse the output of StructField.toString, which contains all the names and types of the inner columns, but that feels really dirty, and I'd rather avoid such a lowly approach.
Any elegant solution to this problem?
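As an aside, the missing link described above does exist in the API: a StructField carries the nested type in its dataType member, which can be pattern matched against StructType. A minimal sketch of reading the inner field names that way (the function name is made up):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Sketch: get the names of the fields nested inside a struct column.
// df.schema(column) returns the StructField for that column; its dataType
// is the nested StructType when the column really is a struct.
def innerFieldNames(df: DataFrame, column: String): Seq[String] =
  df.schema(column).dataType match {
    case StructType(fields) => fields.map(_.name).toSeq // the inner columns
    case _                  => Seq.empty                // not a struct column
  }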
Well, after reading my own question again, I figured out an elegant solution to the problem: I just needed to select all the columns the way I was already doing, and then compare the result back to the original dataframe to figure out which columns were new. Here is the final result; I also made it so that the exploded columns show up in the same place as the original struct one, so as not to break the flow of information:
import org.apache.spark.sql.functions.col

implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    val dfWithAllColumns = df.select("*", column + ".*")
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, and so on, but that is simple; anyone could tweak the parameters. The usage remains the same.
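For instance, a minimal sketch of a parameterized variant (the class and method names here are made up; same imports as above):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

implicit class ExplodeOps(df: DataFrame) {
  // Same logic as explodeStruct above, but the separator between the struct
  // name and the inner field name is a parameter instead of a hard-coded "_".
  def explodeStructWith(column: String, sep: String = "_"): DataFrame = {
    val originalPosition = df.columns.indexOf(column)
    val explodedColumns = df.select("*", column + ".*").columns diff df.columns
    val prefixed = explodedColumns.map(c => col(column + "." + c).as(column + sep + c))
    df.select(df.columns.map(col).patch(originalPosition, prefixed, 1): _*)
  }
}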
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)

How can I convert this select statement to functional form?

I am having a couple of issues putting this into functional form.
select from tableName where i=fby[(last;i);([]column_one;column_two)]
This is what I got:
?[tableName;fby;enlist(=;`i;(enlist;last;`i);(+:;(!;enlist`column_one`column_two;(enlist;`column_one;`column_two))));0b;()]
but I get a type error.
Any suggestions?
Consider using the following function, adapted from the buildQuery function given in the whitepaper on parse trees. This is a pretty useful tool for quickly developing in q. This version is an improvement on the one given in the linked whitepaper, having been extended to handle updates by reference (i.e., update x:3 from `tab).
\c 30 200
tidy:{ssr/[;("\"~~";"~~\"");("";"")] $[","=first x;1_x;x]};
strBrk:{y,(";" sv x),z};
//replace k representation with equivalent q keyword
kreplace:{[x] $[`=qval:.q?x;x;"~~",string[qval],"~~"]};
funcK:{$[0=t:type x;.z.s each x;t<100h;x;kreplace x]};
//replace eg ,`FD`ABC`DEF with "enlist`FD`ABC`DEF"
ereplace:{"~~enlist",(.Q.s1 first x),"~~"};
ereptest:{((0=type x) & (1=count x) & (11=type first x)) | ((11=type x)&(1=count x))};
funcEn:{$[ereptest x;ereplace x;0=type x;.z.s each x;x]};
basic:{tidy .Q.s1 funcK funcEn x};
addbraks:{"(",x,")"};
//the where clause needs to be a list of where clauses, so if there is only one we need to enlist it
stringify:{$[(0=type x) & 1=count x;"enlist ";""],basic x};
//if a dictionary, apply to both keys and values
ab:{$[(0=count x) | -1=type x;.Q.s1 x;99=type x;(addbraks stringify key x),"!",stringify value x;stringify x]};
inner:{[x]
  idxs:2 3 4 5 6 inter ainds:til count x;
  x:#[x;idxs;'[ab;eval]];
  if[6 in idxs;x[6]:ssr/[;("hopen";"hclose");("iasc";"idesc")] x[6]];
  //for select statements within select statements
  //this line has been adjusted
  x[1]:$[-11=type x 1;x 1;$[11h=type x 1;[idxs,:1;"`",string first x 1];[idxs,:1;.z.s x 1]]];
  x:#[x;ainds except idxs;string];
  x[0],strBrk[1_x;"[";"]"]
  };
buildSelect:{[x]
  inner parse x
  };
We can use this to create a functional query that works:
q)n:1000
q)tab:([]sym:n?`3;col1:n?100.0;col2:n?10.0)
q)buildSelect "select from tab where i=fby[(last;i);([]col1;col2)]"
"?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]"
So we have the following as the functional form
?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]
// Applying this
q)?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]
sym col1 col2
----------------------
bah 18.70281 3.927524
jjb 35.95293 5.170911
ihm 48.09078 5.159796
...
Glad you were able to fix your problem with converting your query to functional form.
Generally, when you use parse with an fby in your statement, q will convert that function into its k definition. Usually you should just be able to replace this k code with the q function itself (i.e., change (k){stuff} to fby), and the query will run properly in functional form.
If you check out https://code.kx.com/v2/wp/parse-trees/, it goes into more detail about parse trees and functional form. It also contains a script called buildQuery which will return the functional form of a given query as a string; this can be quite handy and save time when a functional form is complex.
I actually got it myself:
?[tableName;((=;`i;(fby;(enlist;last;`i);(+:;(!;enlist`column_one`column_two;(enlist;`column_one;`column_two)))));(in;`venue;enlist`venueone`venuetwo));0b;()]
The issue was a missing () in the statement. It works fine now.
If someone wants to add a more detailed explanation of how manual parse trees are built and how the generic (k){} function can be replaced with the actual q function, feel free to add your answer and I'll accept and upvote it.

Spark - Create a DataFrame from a list of Rows generated in a loop

I have a loop which generates rows in each iteration. My goal is to create a dataframe, with a given schema, that contains just those rows. I have in mind a set of steps to follow, but I am not able to add a new Row to a List[Row] in each loop iteration.
I am trying the following approach:
var listOfRows = List[Row]()
val dfToExtractValues: DataFrame = ???

dfToExtractValues.foreach { x =>
  //Not really important how the variables are generated here,
  //so to simplify, all the rows will have the same values
  var col1 = "firstCol"
  var col2 = "secondCol"
  var col3 = "thirdCol"
  val newRow = RowFactory.create(col1, col2, col3)
  //This step I am not able to do:
  //listOfRows += newRow -> just for strings
  //listOfRows.add(newRow) -> this add doesn't exist; it is addString
  //listOfRows.aggregate(1)(newRow) -> this is not how aggregate works...
}

val rdd = sc.makeRDD[RDD](listOfRows)
val dfWithNewRows = sqlContext.createDataFrame(rdd, myOriginalDF.schema)
Can someone tell me what I am doing wrong, or what I could change in my approach to generate a dataframe from the rows I'm generating?
Maybe there is a better way to collect the Rows than List[Row], but then I would need to convert that other type of collection into a dataframe.
Can someone tell me what I am doing wrong?
Closures:
First of all, it looks like you skipped over Understanding Closures in the Programming Guide. Any attempt to modify variables passed via a closure is futile. All you can do is modify a copy, and the changes won't be reflected globally.
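A minimal sketch of the pitfall (spark here stands for any SparkSession; the counter is purely illustrative):
var counter = 0
// The closure below is serialized and shipped to executors; each task
// increments its own deserialized copy of `counter`, not the driver's variable.
spark.range(10).foreach(_ => counter += 1)
println(counter) // prints 0 on a cluster; local mode makes no guarantees either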
A variable doesn't make the object mutable:
Following
var listOfRows = List[Row]()
creates a variable. The assigned List is as immutable as ever. If it weren't in the Spark context, you could create a new List and reassign:
listOfRows = newRow :: listOfRows
Note that we prepend, not append; you don't want to append to a list in a loop.
Variables with immutable objects are useful when you want to share data (it is a common pattern in Akka, for example), but they don't have many applications in Spark.
Keep things distributed:
Finally, never fetch data to the driver just to distribute it again. You should also avoid unnecessary conversions between RDDs and DataFrames. It is best to use DataFrame operators all the way:
dfToExtractValues.select(...)
but if you need something more complex, map:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
dfToExtractValues.map(x => ...)(RowEncoder(schema))
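A slightly fuller sketch of that map route, with a made-up output schema and transformation (the RowEncoder usage is the same as above):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical output schema: three string columns, as in the question.
val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType),
  StructField("col3", StringType)))

def transform(dfToExtractValues: DataFrame): DataFrame =
  dfToExtractValues.map { x =>
    // Derive the new Row from the incoming one; no driver-side state involved.
    Row("firstCol", "secondCol", "thirdCol")
  }(RowEncoder(schema))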

Map word ngrams to counts in scala

I'm trying to create a map which goes through all the ngrams in a document and counts how often they appear. Ngrams are sequences of n consecutive words in a sentence (so in the last sentence, (Ngrams, are) is a 2-gram, (are, sequences) is the next 2-gram, and so on). I already have code that creates a document from a file and parses it into sentences. I also have a function to count the ngrams in a sentence, ngramsInSentence, which returns Seq[NGram].
I'm getting stuck syntactically on how to create my counts map. I am iterating through all the ngrams in the document in the for loop, but don't know how to map the ngrams to the count of how often they occur. I'm fairly new to Scala and the syntax is evading me, although I'm clear conceptually on what I need!
def getNGramCounts(document: Document, n: Int): Counts = {
  for (sentence <- document.sentences; ngram <- ngramsInSentence(sentence, n))
  //I need code here to map ngram -> count of how many times ngram appears in the document
}
The types Counts and NGram above are defined as:
type Counts = Map[NGram, Double]
type NGram = Seq[String]
Does anyone know the syntax to map the ngrams from the for loop to a count of how often they occur? Please let me know if you'd like more details on the problem.
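For context, such an extractor is usually a one-liner over sliding windows. A minimal sketch, assuming sentences are plain strings (the question's actual ngramsInSentence isn't shown):
type NGram = Seq[String]

// Hypothetical sketch: split a sentence into words and take every window of
// n consecutive words; sliding(n) yields exactly the overlapping n-grams.
def ngramsInSentence(sentence: String, n: Int): Seq[NGram] =
  sentence.split("\\s+").toSeq.sliding(n).toSeq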
If I'm correctly interpreting your code, this is a fairly common task.
def getNGramCounts(document: Document, n: Int): Counts = {
  val allNGrams: Seq[NGram] = for {
    sentence <- document.sentences
    ngram <- ngramsInSentence(sentence, n)
  } yield ngram

  allNGrams.groupBy(identity).mapValues(_.size.toDouble)
}
The allNGrams variable collects a list of all the NGrams appearing in the document.
You should eventually turn to Streams if the document is big and you can't hold the whole sequence in memory.
The groupBy call creates a Map[NGram, List[NGram]]: it groups your values by identity (the argument to the method defines the criterion for grouping) and collects the corresponding values in a list.
You then only need to map each value (a List[NGram]) to its size to get how many occurrences of each NGram there were.
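A tiny illustration of the two steps on made-up 1-grams (map ordering may differ):
type NGram = Seq[String]

val grams: Seq[NGram] = Seq(Seq("a"), Seq("b"), Seq("a"))
grams.groupBy(identity)
// Map(List(a) -> List(List(a), List(a)), List(b) -> List(List(b)))
grams.groupBy(identity).mapValues(_.size.toDouble)
// Map(List(a) -> 2.0, List(b) -> 1.0)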
I took for granted that:
NGram has the expected correct implementation of equals and hashCode
document.sentences returns a Seq[...]. If not, you should expect allNGrams to be of the corresponding collection type.
UPDATED based on the comments
I wrongly assumed that groupBy(_) was shorthand for grouping by the value itself; use the identity function instead.
I converted the count to a Double.
Appreciate the help - I have the correct code now using the suggestions above. The following returns the desired result:
def getNGramCounts(document: Document, n: Int): Counts = {
  val allNGrams: Seq[NGram] =
    for (sentence <- document.sentences;
         ngram <- ngramsInSentence(sentence, n))
    yield ngram
  allNGrams.groupBy(l => l).map(t => (t._1, t._2.length.toDouble))
}

query to detect ISP violations

I am trying to create a special query with NDepend, but cannot figure it out.
Here's what I'd like to query in a more procedural pseudocode:
var list
foreach type t
    foreach interface i used by t
        var nm = i.numberOfMethods
        var mu = number of i's methods that t actually uses
        if mu / nm < 1
            list.Add(t)
    end foreach
end foreach
return list
It's supposed to list types that don't comply with the Interface Segregation Principle.
Thanks!
The query you ask for can be written this way:
from t in JustMyCode.Types where !t.IsAbstract
from i in t.TypesUsed where i.IsInterface
// Here collect methods of i that are not used
let methodsOfInterfaceUnused = i.Methods.Where(m => !m.IsUsedBy(t))
where methodsOfInterfaceUnused.Count() > 0
select new { t, methodsOfInterfaceUnused }
This query has the peculiarity of matching the same type several times, once for each interface whose methodsOfInterfaceUnused set is not empty. The result is then presented in a way that is easy to read and understand.