casting multiple columns astype - pyspark

Should be an easy to answer question... Am I doing this wrong? Can I not cast multiple columns?
>>> val results2 = results.select( results["HCAHPS Base Score"].cast(IntegerType).as(results["HCAHPS Base Score"]), results["HCAHPS Consistency Score"].cast(IntegerType).as(results["HCAHPS Consistency Score"]) )
File "<stdin>", line 1
val results2 = results.select( results["HCAHPS Base Score"].cast(IntegerType).as(results["HCAHPS Base Score"]), results["HCAHPS Consistency Score"].cast(IntegerType).as(results["HCAHPS Consistency Score"]) )
^
SyntaxError: invalid syntax
The syntax error keeps popping up right around the comma...

Try this. I assume this is PySpark, as the question is tagged pyspark.
results2 = results.select( results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score") )
In Scala, you may try the below.
val results2 = results.select( results("HCAHPS Base Score").cast(IntegerType).as("HCAHPS Base Score"), results("HCAHPS Consistency Score").cast(IntegerType).as("HCAHPS Consistency Score") )
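If more than a couple of columns need the same conversion, a list comprehension keeps the select readable. A minimal PySpark sketch, assuming the two score columns from the question and that every other column should pass through unchanged:
from pyspark.sql.types import IntegerType

# columns to convert (names taken from the question); everything else passes through as-is
score_columns = ["HCAHPS Base Score", "HCAHPS Consistency Score"]
other_columns = [c for c in results.columns if c not in score_columns]

results2 = results.select(
    *other_columns,
    *[results[c].cast(IntegerType()).alias(c) for c in score_columns]
)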

Related

How to union multiple dynamic inputs in Palantir Foundry?

I want to union multiple datasets in Palantir Foundry. The names of the datasets are dynamic, so I would not be able to give the dataset names in transform_df() statically. Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?
I tried looping over the datasets like:
li = ['dataset1_path', 'dataset2_path']

union_df = None
for p in li:
    @transforms_df(
        my_input=Input(p),
        Output(p + "_output")
    )
    def my_compute_function(my_input):
        return my_input

    if union_df is None:
        union_df = my_compute_function
    else:
        union_df = union_df.union(my_compute_function)
But, this doesn't generate the unioned output.
This should work for you with some changes. It is an example of a dynamic dataset built from JSON files; your situation may be only a little different. Here is a generalized way of handling dynamic JSON input datasets that should be adaptable to any dynamic input file type, or to a dataset internal to Foundry that you can specify. This generic example works on a set of JSON files uploaded to a dataset node in the platform and should be fully dynamic. Doing a union after this should be a simple matter.
There's some bonus logging going on here as well.
Hope this helps
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {## enter your dynamic mappings here ##}

    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext

            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                    file_dates.append(data)

            logging.info('info logs:')
            logging.info(file_dates)

            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)

        transforms.append(update_set)

    return transforms


TRANSFORMS = transform_generator()
So this question breaks down into two questions.
How to handle transforms with programmatic input paths
To handle transforms with programmatic inputs, it is important to remember two things:
1st - Transforms determine their inputs and outputs at CI time. This means you can have Python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded into the Python code that generates the transform.
2nd - Your transforms are created once, during CI execution. This means you can't have incremental or special logic that generates different paths whenever the dataset builds.
With these two premises, as in your example or @jeremy-david-gamet's (thanks for the reply, gave you a +1), you can have Python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']

for path in dataset_paths:
    @transform_df(
        Output(f"{path}_output"),
        my_input=Input(path),
    )
    def my_compute_function(my_input):
        return my_input
However, to union them you'll need a second transform that executes the union. You'll need to pass it multiple inputs, so you can use *args or **kwargs for this:
from transforms.api import transform_df, Input, Output
import pyspark.sql

dataset_paths = ['dataset1_path', 'dataset2_path']

all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))


@transform_df(*all_args)
def my_compute_function(*args):
    input_dfs = []
    for arg in args:
        # there are other arguments like ctx in the args list, so we need to check the type.
        # You can also use kwargs for more determinism.
        if isinstance(arg, pyspark.sql.DataFrame):
            input_dfs.append(arg)

    # now that you have your dfs in a list you can union them
    # Note I didn't test this code, but it should be something like this
    ...
How to union datasets with different schemas.
For this part there are plenty of Q&As out there on how to union dataframes with different schemas in Spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row


def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))

    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)

    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
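To fold more than two dataframes into one with this helper, functools.reduce works well. A small sketch, assuming input_dfs is a list of dataframes such as the one collected in the *args example above:
from functools import reduce

unioned_df = reduce(customUnion, input_dfs)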
Since inputs and outputs are determined at CI time, we cannot form true dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of datasets share the same root, the following seems to require minimum maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce

datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]

inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}

kwargs = {
    **{'output': Output('output/folder/path/unioned_dataset')},
    **inputs
}


@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)
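Combining the two ideas, the unioning transform above can tolerate schema differences by passing that flag inside the lambda. A minimal sketch, assuming Spark 3.1+ and the same inputs dictionary as in the transform above:
unioned_df = reduce(
    lambda df1, df2: df1.unionByName(df2, allowMissingColumns=True),
    inputs.values()
)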

How to explode a struct column with a prefix?

My goal is to explode a Spark struct column, i.e. take its fields out of the struct and expose them as regular columns of the dataset (already done), but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it, so I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright; I use it like this:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that. I then changed my approach: take the schema of the DataFrame, filter it by the name of the column, and extract the matching column from the resulting array. The problem is that the element I find has the type StructField, which again offers no way to extract its inner fields, whereas what I would really like is to be handed a StructType element, which has the .getFields method that does exactly what I want (that is, giving me the names of the inner columns, so I can iterate over them and use them in my select, prepending the prefix I want to them). I know no way to convert a StructField to a StructType.
My last attempt would be to parse the output of StructField.toString - which contains all the names and types of the inner columns, although that feels really dirty, and I'd rather avoid that lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution to the problem: I just needed to select all the columns the way I was already doing, and then compare the result back to the original dataframe to figure out which columns were new. Here is the final result. I also arranged it so that the exploded columns show up in the same place as the original struct column, so as not to break the flow of information:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    val dfWithAllColumns = df.select("*", column + ".*")
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, and so on; that is simple enough for anyone to tweak. The usage remains the same.
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame, functions as F


def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    original_columns = df.columns
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)
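A hypothetical usage, mirroring the Scala version above (the column name is the placeholder from the question):
df = explode_struct(df, "myColumn")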

How to use filter dynamically in Scala?

I have a raw log file of about 1 TB, with lines like the ones below.
Test X1 SET WARN CATALOG MAP1,MAP2
INFO X2 SET WARN CATALOG MAPX,MAP2,MAP3
I read the log file using Spark Scala and make an RDD of its lines.
I need to filter only those lines which contain
1. SET
2. INFO
3. CATALOG
I wrote the filter like this:
val filterRdd = rdd.filter(f => f.contains("SET")).filter(f => f.contains("INFO")).filter(f => f.contains("CATALOG"))
Can we do the same if these parameters are assigned to a list, and filter dynamically based on that list without writing so many lines? In this example I use only three restrictions, but in reality it goes up to 15 restriction keywords. Can we do it dynamically?
Something like this could work when you require all words to appear in a line:
val words = Seq("SET", "INFO", "CATALOG")
val filterRdd = rdd.filter(f => words.forall(w => f.contains(w)))
and if you want any:
val filterRdd = rdd.filter(f => words.exists(w => f.contains(w)))

Insert a key using Ruamel

I am using the Ruamel Python library to programmatically edit human-edited YAML files. The source files have keys that are sorted alphabetically.
I'm not sure if this is a basic Python question, or a Ruamel question, but all methods I have tried to sort Ruamel's OrderedDict structure are failing for me.
I am quite confused, for instance, why the following code, based on this recipe, isn't working:
import ruamel.yaml
import collections


def read_file(f):
    with open(f, 'r') as _f:
        return ruamel.yaml.round_trip_load(
            _f.read(),
            preserve_quotes=True
        )


def write_file(f, data):
    with open(f, 'w') as _f:
        _f.write(ruamel.yaml.dump(
            data,
            Dumper=ruamel.yaml.RoundTripDumper,
            explicit_start=True,
            width=1024
        ))


data = read_file('in.yaml')
data = collections.OrderedDict(sorted(data.items(), key=lambda t: t[0]))
write_file('out.yaml', data)
But given this input file:
---
bananas: 1
apples: 2
The following output file is produced:
--- !!omap
- apples: 2
- bananas: 1
I.e. it's turned my file into a YAML ordered map.
Is there an easy way to do this? Also, can I simply insert into the data structure somehow?
If you round-trip a mapping in ruamel.yaml¹, it doesn't get represented as a collections.OrderedDict(); it gets represented as a ruamel.yaml.comments.CommentedMap(). The latter can be a subclass of collections.OrderedDict() depending on which version of Python you are working with (e.g. in Python 2 it uses the much faster C implementation from ruamel.ordereddict).
The representer doesn't automatically interpret "normal" ordered dictionaries (whether from collections or ruamel.ordereddict) as special in round_trip_dump mode. But if you drop the collections:
import ruamel.yaml


def read_file(f):
    with open(f, 'r') as _f:
        return ruamel.yaml.round_trip_load(
            _f.read(),
            preserve_quotes=True
        )


def write_file(f, data):
    with open(f, 'w') as _f:
        ruamel.yaml.dump(
            data,
            stream=_f,
            Dumper=ruamel.yaml.RoundTripDumper,
            explicit_start=True,
            width=1024
        )


data = read_file('in.yaml')
data = ruamel.yaml.comments.CommentedMap(sorted(data.items(), key=lambda t: t[0]))
write_file('out.yaml', data)
your out.yaml will be:
---
apples: 2
bananas: 1
Please note that I also removed an inefficiency in your write_file routine. If you don't specify a stream, all data is first streamed to a StringIO instance (in memory) and then returned, which you then wrote to the stream with _f.write(); it is much more efficient to write directly to the stream.
As for your final question: yes you can insert using:
data.insert(1, 'apricot', 3)
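For example, continuing from the sorted out.yaml above (a sketch reusing the read_file/write_file helpers defined earlier):
data = read_file('out.yaml')
data.insert(1, 'apricot', 3)  # insert key 'apricot' with value 3 at position 1
write_file('out2.yaml', data)
which should produce something like:
---
apples: 2
apricot: 3
bananas: 1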
¹ Disclaimer: I am the author of both ruamel.yaml as well as ruamel.ordereddict.

Haskell HDBC Elegance in F#?

I'm struck by Haskell's terseness and elegance. But I work in a .Net house, so I use F# when I can get away with it--I may be the only one of hundreds across the country who uses it.
Does ADO.NET or F# offer something as terse and elegant as HDBC's executeMany? I'm making my way through Real World Haskell. In chapter 21 it offers this example:
ghci> conn <- connectSqlite3 "test1.db"
ghci> stmt <- prepare conn "INSERT INTO test VALUES (?, ?)"
ghci> executeMany stmt [[toSql 5, toSql "five's nice"], [toSql 6, SqlNull]]
ghci> commit conn
ghci> disconnect conn
I'd like to get this elegance and terseness in my F#. I've seen a lot of hype around using parameterized queries to avoid SQL injection attacks. I'm not using them in this case for three reasons:
I find parameterized queries in .Net ugly and burdensome.
My data comes from the corporate office, so it's (mostly) clean.
My table has 34 columns. I despise the idea of parameterizing a query with 34 columns.
Here's my F# code:
module Data

open System
open System.Data
open System.Data.OleDb
open System.Text.RegularExpressions

type Period = Prior | Current

let Import period records db =
    use conn = new OleDbConnection(@"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + db + ";Persist Security Info=False;")
    let execNonQuery s =
        let comm = new OleDbCommand(s, conn) in
        comm.ExecuteNonQuery() |> ignore
    let enquote = sprintf "\"%s\""
    let escapeQuotes s = Regex.Replace(s, "\"", "\"\"")
    let join (ss:string[]) = String.Join(",", ss)
    let table = match period with
                | Prior -> "tblPrior"
                | Current -> "tblCurrent"
    let statements =
        [| for r in records do
            let vs = r |> Array.map (escapeQuotes >> enquote) |> join
            let vs' = vs + sprintf ",\"14\",#%s#" (DateTime.Now.ToString "yyyy-MM-dd") in
            yield sprintf "INSERT INTO %s ( [Field01], [Field02], [Field03], [Field04], [Field05], [Field06], [Field07], [Field08], [Field09], [Field10], [Field11], [Field12], [Field13], [Field14], [Field15], [Field16], [Field17], [Field18], [Field19], [Field20], [Field21], [Field22], [Field23], [Field24], [Field25], [Field26], [Field27], [Field28], [Field29], [Field30], [Field31], [Field32], [Field33], [Field34] ) VALUES (%s)" table vs' |] in
    do conn.Open()
    execNonQuery (sprintf "DELETE FROM %s" table)
    statements |> Array.iter execNonQuery
I've renamed the fields of the table(s) for security reasons.
Because all the fields on the table are text, I can easily Array.map them to escape and quote the values.
At between 9,000 and 10,000 records per day to import into each of the two tables, I want to do this as efficiently as possible. Hence my interest in Haskell's executeMany. That said, I do like the idea behind parameterized queries, and I like the way Haskell has implemented them. Is there something equivalent in terseness and elegance in F#?
I agree with @JonnyBoats' comment that generally using an F# SQL type provider like SqlDataConnection (LINQ-to-SQL) or SqlEntityConnection (Entity Framework) would be far more elegant than any kind of solution involving building insert statement strings by hand.
But, there is one important qualifier to your question: "At between 9,000 and 10,000 records per day to import to each of the two tables, I want to do this as efficiently as possible." In a scenario like this, you'll want to use SqlBulkCopy for efficient bulk inserts (it leverages native database driver features for much faster inserts than you are likely getting with HDBC's executeMany).
Here's a small example that should help you get started using SqlBulkCopy with F#: https://stackoverflow.com/a/8942056/236255. Note that you'll be working with a DataTable to stage the data, which, though old and somewhat awkward to use from F#, is still superior to building insert statement strings, in my opinion.
Update in response to comment
Here's a generalized approach to using SqlBulkCopy which is improved for your scenario (we pass in a column specification separately from the row data, and both are dynamic):
//you must reference System.Data and System.Xml
open System
open System.Data
open System.Data.SqlClient

let bulkLoad (conn:SqlConnection) tableName (columns:list<string * Type>) (rows: list<list<obj>>) =
    use sbc = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null, BatchSize=500, BulkCopyTimeout=1200, DestinationTableName=tableName)
    sbc.WriteToServer(
        let dt = new DataTable()
        columns
        |> List.iter (dt.Columns.Add >> ignore)
        for row in rows do
            let dr = dt.NewRow()
            row |> Seq.iteri (fun i value -> dr.[i] <- value)
            dt.Rows.Add(dr)
        dt)

//example usage:
//note: since you know all your columns are of type string, you could define columns like
//let columns = ["Field1"; "Field2"; "Field3"] |> List.map (fun name -> name, typeof<String>)
let columns = [
    "Field1", typeof<String>
    "Field2", typeof<String>
    "Field3", typeof<String>
]

let rows = [
    ["a"; "b"; "c"]
    ["d"; "e"; "f"]
    ["g"; "h"; "i"]
    ["j"; "k"; "l"]
    ["m"; "n"; "o"]
]

//a little funkiness to transform our list<list<string>> to list<list<obj>>,
//probably not needed in practice because you won't be constructing your lists literally
let rows = rows |> List.map (fun row -> row |> List.map (fun value -> value :> obj))

bulkLoad conn "tblPrior" columns rows
You could get even fancier / more terse using an approach involving reflection. e.g. create a type like
type RowData = { Field1:string; Field2:string; Field3:string }
and make a bulkLoad with a signature that takes a list<'a> argument such that it reflects over the property names and types of typeof<'a> to build the DataTable Columns, and similarly uses reflection to iterate over all the properties of a row instance to create and add a new row to the DataTable. In fact, this question shows how to make a generic ToDataTable method that does it (in C#).