How do I split() the result of a split()? - pyspark

My PySpark dataframe has a column with values of the form 0000-00-00-00-00-00-000_000.xxxx, where 0 is a digit and x is a letter. The value represents an observation timestamp with some other values mixed in.
In my notebook, I have a cell that attempts to split the column containing the timestamp. For the most part, it works. I get most of the work done with the following:
splitDF = ( df
.withColumn("fn_year", split(df["fn"], "-").getItem(0))
.withColumn("fn_month", split(df["fn"], "-").getItem(1))
.withColumn("fn_day", split(df["fn"], "-").getItem(2))
.withColumn("fn_hour", split(df["fn"], "-").getItem(3))
.withColumn("fn_min", split(df["fn"], "-").getItem(4))
.withColumn("fn_sec", split(df["fn"], "-").getItem(5))
.withColumn("fn_milli", split(df["fn"], "-").getItem(6))
)
I need to extract two values from the string: the 000 preceding the underscore and the 000 following the underscore. I would normally (my usual language / environment is C# / .NET 7, web API stuff) just split the string multiple times using the two delimiters ('_' and '.') and grab the necessary components. I can't get that to work in this case. When I try to pass the split into another split, I get ["", "", "", "", "", "", "", "", ""] for the result (.getItem(x) omitted).
Here's an example of what I thought might work to split on the underscore and then the period:
splitDF = df.withColumn("fn_qc", split(split(df["fn"], "_").getItem(1), ".").getItem(0))

Basically, we split the string on the dash, which returns an array that is reused across all the columns. In the last statement, we split again on the underscore. For the last value you could use a substring, split again on the period, or just replace xxxx if it is a static value...
Hope this helps.
from pyspark.sql.functions import split, col, substring
date_list = [["2023-01-02-03-04-05-666_777.xxxx"], ["2023-12-11-10-09-08-444_333.xxxx"]]
cols = ["fn"]
df = spark.createDataFrame(date_list, cols)
splitDF = df.withColumn("split_on_dash", split(col("fn"), "-")) \
.withColumn("fn_year", col("split_on_dash")[0]) \
.withColumn("fn_month", col("split_on_dash")[1]) \
.withColumn("fn_day", col("split_on_dash")[2]) \
.withColumn("fn_hour", col("split_on_dash")[3]) \
.withColumn("fn_min", col("split_on_dash")[4]) \
.withColumn("fn_sec", col("split_on_dash")[5]) \
.withColumn("fn_milli", split(col("split_on_dash")[6], "_")[0]) \
.withColumn("fn_after_underscore", substring(split(col("split_on_dash")[6], "_")[1], 0, 3))
display(splitDF)
You can select only the required columns later or drop the unnecessary ones...
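As an aside on why the nested split in the question returned an array of empty strings: the second argument of split() is a regular expression, so "." matches every character. Escaping the period makes the original nested approach work as well; a minimal sketch against the same df and fn column:
from pyspark.sql.functions import split, col
# "." is a regex metacharacter, so escape it to split on a literal period.
splitDF = df.withColumn(
    "fn_qc",
    split(split(col("fn"), "_").getItem(1), "\\.").getItem(0)
)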

Related

How to remove rows that do not have values?

I am removing some words from a column. At the end of the day, some rows will be empty because all of their string has been removed; there might be a space, other whitespace, or nothing. How can I remove these rows?
I tried this, but for some reason it does not work for all kinds of rows:
from pyspark.sql.functions import trim, regexp_replace
df = df.withColumn('col1', trim(regexp_replace('col1', '\n', '')))
df = df.filter(df.col1 != '')
The filter you've applied will work for blanks, but not if the column contains whitespace.
Try trim(<column>) != ''.
Example
from pyspark.sql import functions as func

spark.sparkContext.parallelize([('',), (' ',), (' ',)]).toDF(['foo']). \
filter(func.col('foo') != ''). \
count()
# 2
spark.sparkContext.parallelize([('',), (' ',), (' ',)]).toDF(['foo']). \
filter(func.trim(func.col('foo')) != ''). \
count()
# 0
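Applied to the dataframe from the question, the suggested filter would look something like this (a sketch assuming the same df and col1 names):
from pyspark.sql.functions import col, trim
# Keep only rows whose col1 still has non-blank content after trimming.
df = df.filter(trim(col('col1')) != '')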

DB2 extract data between two delimiters

Here is the string I am trying to extract from:
'cn=xyxyxyxyxyx ousy,ou=information services,ou=domain users,dc=corp,dc=xyxyxx,dc=com'
I am trying to extract the string between the first 'ou=' and the second comma. In this case that is
'information services'
Here is what I have so far:
SUBSTR(F_DN, locate('ou=', F_DN)+3, locate(',', F_DN, locate(',', F_DN)+1)+1 ) as role
And this is the result:
'information services,ou=domain users,dc=co'
It seems to locate the first character just fine, but I cannot get the length correct.
Try this:
select regexp_substr(str, 'ou=([^,]+)', 1, 1, '', 1)
from (values 'cn=xyxyxyxyxyx ousy,ou=information services,ou=domain users,dc=corp,dc=xyxyxx,dc=com') t (str);

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names, and they are as expected, but then when I do a SQL join I get an error that it cannot resolve a column name given the inputs. I have simplified the join just to get it to work, but I will need to add in more join conditions, which is why I'm using SQL (I will be adding in: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the 'device_id' field in at least one of the dataframes. I modified your query just a bit and tested it (in Scala). The query below works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you do a 'select *' in the above statement, it will work. But if you try to select 'device_id', you will get the error "Reference 'device_id' is ambiguous". As you can see in the 'test' dataframe definition above, it has two fields with the same name (device_id). To avoid this, I recommend changing the field name in one of the dataframes.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use dataframes or sqlContext
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a. idx_trip_id < b.mnvr_bgn")
The queries above will work for your problem. If your data set is very large, I would recommend not using the '>' or '<' operators in the join condition, as that causes a cross join, which is a costly operation on a large data set. Use them in the WHERE condition instead.
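Since the question is in PySpark, the same rename-and-join can also be written with the DataFrame API there; a sketch using the dataframes defined in the question:
# Rename device_id on one side so the joined result has no duplicate column name.
mnvr_renamed = mnvr_temp_idx_prev.withColumnRenamed('device_id', 'device')
test = raw_data_filtered.join(
    mnvr_renamed,
    raw_data_filtered['device_id'] == mnvr_renamed['device'],
    'inner'
)
Additional conditions such as the mnvr_bgn comparison can be combined into the join expression with &, or applied afterwards in a where() clause as suggested above.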

Psycopg2 insert python dictionary in postgres database

In python 3+, I want to insert values from a dictionary (or pandas dataframe) into a database. I have opted for psycopg2 with a postgres database.
The problem is that I cannot figure out the proper way to do this. I can easily concatenate a SQL string to execute, but the psycopg2 documentation explicitly warns against this. Ideally I wanted to do something like this:
cur.execute("INSERT INTO table VALUES (%s);", dict_data)
and hoped that the execute could figure out that the keys of the dict match the columns in the table. This did not work. From the examples in the psycopg2 documentation I got to this approach:
cur.execute("INSERT INTO table (" + ", ".join(dict_data.keys()) + ") VALUES (" + ", ".join(["%s" for pair in dict_data]) + ");", dict_data)
from which I get a
TypeError: 'dict' object does not support indexing
What is the most pythonic way of inserting a dictionary into a table with matching column names?
Two solutions:
from psycopg2.extensions import AsIs

d = {'k1': 'v1', 'k2': 'v2'}
insert = 'insert into table (%s) values %s'
l = [(c, v) for c, v in d.items()]
columns = ','.join([t[0] for t in l])
values = tuple([t[1] for t in l])
cursor = conn.cursor()
print(cursor.mogrify(insert, ([AsIs(columns)] + [values])))
keys = d.keys()
columns = ','.join(keys)
values = ','.join(['%({})s'.format(k) for k in keys])
insert = 'insert into table ({0}) values ({1})'.format(columns, values)
print(cursor.mogrify(insert, d))
Output:
insert into table (k2,k1) values ('v2', 'v1')
insert into table (k2,k1) values ('v2','v1')
I sometimes run into this issue, especially with respect to JSON data, which I naturally want to deal with as a dict. Very similar... but maybe a little more readable?
def do_insert(rec: dict):
    cols = rec.keys()
    cols_str = ','.join(cols)
    vals = [rec[k] for k in cols]
    vals_str = ','.join(['%s' for i in range(len(vals))])
    sql_str = """INSERT INTO some_table ({}) VALUES ({})""".format(cols_str, vals_str)
    cur.execute(sql_str, vals)
I typically call this type of thing from inside an iterator, usually wrapped in a try/except. Either the cursor (cur) is already defined in an outer scope, or you can amend the function signature and pass a cursor instance in. I rarely insert just a single row. Like the other solutions, this allows for missing cols/values provided the underlying schema allows for it too. As long as the dict underlying the keys view is not modified while the insert is taking place, there is no need to specify keys by name, as the values will be ordered as they are in the keys view.
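A hypothetical calling pattern along those lines (the connection and a records iterable of dicts are assumed here, not part of the original answer):
import psycopg2

cur = conn.cursor()
for rec in records:  # e.g. rows parsed from a JSON file
    try:
        do_insert(rec)
    except psycopg2.Error as exc:
        conn.rollback()  # skip the bad record and keep going
        print(f"insert failed: {exc}")
    else:
        conn.commit()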
[Suggested answer/workaround - better answers are appreciated!]
After some trial/error I got the following to work:
sql = "INSERT INTO table (" + ", ".join(dict_data.keys()) + ") VALUES (" + ", ".join(["%("+k+")s" for k in dict_data]) + ");"
This gives the sql string
"INSERT INTO table (k1, k2, ... , kn) VALUES (%(k1)s, %(k2)s, ... , %(kn)s);"
which may be executed by
with psycopg2.connect(database='deepenergy') as con:
    with con.cursor() as cur:
        cur.execute(sql, dict_data)
Pros/cons?
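One con worth noting: building the column list by joining dict keys puts the keys into the SQL unescaped, which is unsafe if they can ever come from outside input. psycopg2's sql module can compose the identifiers safely; a sketch (the insert_dict name and table argument are just illustrative):
from psycopg2 import sql

def insert_dict(cur, table, rec):
    # Identifiers are quoted by psycopg2; values are bound as named parameters.
    query = sql.SQL("INSERT INTO {} ({}) VALUES ({})").format(
        sql.Identifier(table),
        sql.SQL(", ").join(map(sql.Identifier, rec.keys())),
        sql.SQL(", ").join(map(sql.Placeholder, rec.keys())),
    )
    cur.execute(query, rec)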
Using %(name)s placeholders may solve the problem:
dict_data = {'key1':val1, 'key2':val2}
cur.execute("""INSERT INTO table (field1, field2)
VALUES (%(key1)s, %(key2)s);""",
dict_data)
You can find the usage in the psycopg2 docs under "Passing parameters to SQL queries".
Here is another solution that inserts a dictionary directly.
Product model (has the following database columns):
name
description
price
image
digital - (defaults to False)
quantity
created_at - (defaults to current date)
Solution:
data = {
    "name": "product_name",
    "description": "product_description",
    "price": 1,
    "image": "https",
    "quantity": 2,
}
cur = conn.cursor()
cur.execute(
    "INSERT INTO products (name,description,price,image,quantity) "
    "VALUES(%(name)s, %(description)s, %(price)s, %(image)s, %(quantity)s)", data
)
conn.commit()
conn.close()
Note: the columns to be inserted are specified in the execute statement: ... INTO products (column names to be filled) VALUES ..., data <- the dictionary. Because the placeholders are named (%(name)s), the keys do not need to be in any particular order; each placeholder just has to have a matching key in the dictionary.

Report Builder .rdl Check Array Key Exists

I am making a report and I need to split a comma-separated string into three columns of a table.
string = 'some text, some text, some text'
But the string doesn't always have two commas, i.e.
string = 'some text, some text'
So when I try to get the value for the third column:
=Split(Fields!GLDescription.Value, ", ").GetValue(2)
This code can result in an "#Error" message in the column. I tried to solve this by checking the length, like so:
=IIF(Split(Fields!GLDescription.Value, ", ").Length >= 3, Split(Fields!GLDescription.Value, ", ").GetValue(2), "")
But it still resulted in the same error. Is there any way to check if an array key exists?
The issue, as you've seen, is that SSRS IIf expressions don't short-circuit: both the true and the false parts are always evaluated, so GetValue(2) still runs (and errors) even when the length check fails. I can think of a workaround that will work for 2- and 3-column fields.
Try an expression like:
=IIf(
Split(Fields!GLDescription.Value, ", ").Length = 3
, Mid(
Fields!GLDescription.Value
, InStrRev(Fields!GLDescription.Value, ", ") + 2
, Len(Fields!GLDescription.Value) - InStrRev(Fields!GLDescription.Value, ", ") + 2
)
, "No val 3"
)
It's not bulletproof for all possible situations, but might be enough for your data.