Hexadecimal string to decimal conversion - pyspark

I have 2 pyspark columns consisting of the following hexadecimal values:
Value | 245FC;324EE;
Value_Split | [245FC,324EE]
I would like to convert them to the following decimal numbers:
Value | 148988;206062;
Value_Split | [148988,206062]
I would be happy even if only one of the two columns gets converted.

Use the conv function to convert hexadecimal strings to decimal.
Spark >= 3.1.0
from pyspark.sql import functions as f

df.withColumn('Value_Split', f.transform('Value_Split', lambda v: f.conv(v, 16, 10))) \
    .show(truncate=False)
+----------------+
|Value_Split |
+----------------+
|[148988, 206062]|
+----------------+
Spark >= 2.4.0
df.withColumn('Value_Split', f.expr('transform(Value_Split, v -> conv(v, 16, 10))')) \
    .show(truncate=False)
+----------------+
|Value_Split |
+----------------+
|[148988, 206062]|
+----------------+
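The same idea also works for the semicolon-delimited Value string column: split it, convert each element, and join the parts back together. A sketch for Spark >= 3.1.0 (it drops the trailing ';'; append it with f.concat if it matters):
from pyspark.sql import functions as f

df.withColumn(
    'Value',
    f.concat_ws(';', f.transform(
        # drop the empty element produced by the trailing ';'
        f.filter(f.split('Value', ';'), lambda v: v != ''),
        lambda v: f.conv(v, 16, 10)))
).show(truncate=False)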

Related

pyspark - left join with random row matching the key

I am looking for a way to join 2 dataframes, but with random rows matching the key. This unusual request is due to a very long calculation that generates the positions.
I would like to do a kind of "random left join" in pyspark.
I have a dataframe with an areaID (string) and a count (int). The areaID is unique (around 7k values).
+--------+-------+
| areaID | count |
+--------+-------+
| A | 10 |
| B | 30 |
| C | 1 |
| D | 25 |
| E | 18 |
+--------+-------+
I have a second dataframe with around 1000 precomputed rows per areaID and 2 position columns, x (float) and y (float). This dataframe is around 7 million rows.
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.0 | 0 |
| A | 0.1 | 0.7 |
| A | 0.3 | 1 |
| A | 0.1 | 0.3 |
| ... | | |
| E | 3.15 | 4.17 |
| E | 3.14 | 4.22 |
+--------+------+------+
I would like to end with a dataframe like:
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.1 | 0.32 | < row 1/10 - randomly picked where areaID are the same
| A | 0.0 | 0.18 | < row 2/10
| A | 0.09 | 0.22 | < row 3/10
| ... | | |
| E | 3.14 | 4.22 | < row 1/18
| ... | | |
+--------+------+------+
My first idea is to iterate over each areaID of the first dataframe, filter the second dataframe by that areaID, and sample count rows from it. The problem is that this is quite slow, with 7k load/filter/sample passes.
The second approach is to do an outer join on areaID, shuffle the dataframe (which seems quite complex), apply a rank, and keep the rows where rank <= count, but I don't like the idea of loading a lot of data only to filter it afterwards.
I am wondering if there is a way to do it using a "random" left join? In that case, I would duplicate each row count times and apply it.
Many thanks in advance,
Nicolas
One can interpret the question as stratified sampling of the second dataframe where the number of samples to be taken from each subpopulation is given by the first dataframe.
There is a Spark function for stratified sampling: DataFrame.stat.sampleBy.
from pyspark.sql import functions as F

df1 = ...
df2 = ...

# first calculate the fraction for each areaID, based on the required number
# given in df1 and the number of rows for that areaID in df2
fractionRows = df2.groupBy("areaID").agg(F.count("areaID").alias("count2")) \
    .join(df1, "areaID") \
    .withColumn("fraction", F.col("count") / F.col("count2")) \
    .select("areaID", "fraction") \
    .collect()
fractions = {f[0]: f[1] for f in fractionRows}

# now run the stratified sampling
df2.stat.sampleBy("areaID", fractions).show()
There is a caveat with this approach: since the sampling done by Spark is a random process, the exact number of rows requested in the first dataframe will not always be met.
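A quick way to see how close the sample came to the requested counts is to compare the two directly; a small sketch (the sampled_count name is just illustrative):
sampled = df2.stat.sampleBy("areaID", fractions, seed=42)
sampled.groupBy("areaID").count().withColumnRenamed("count", "sampled_count") \
    .join(df1, "areaID") \
    .show()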
Edit: fractions > 1.0 are not supported by sampleBy. Looking at the Scala code of sampleBy shows why: the function is implemented as a filter with a random variable deciding whether to keep the row or not, so returning multiple copies of a single row cannot work.
A similar idea can still be used to support fractions > 1.0: instead of a filter, a UDF is created that returns an array with one entry per copy of the row that should appear in the result. After applying the UDF, the array column is exploded (which duplicates or drops rows as needed) and then dropped:
from pyspark.sql import functions as F
from pyspark.sql import types as T

fractions = {'A': 1.5, 'C': 0.5}

def ff(stratum, x):
    # build one array entry per copy of the row that should be kept
    fraction = fractions.get(stratum, 0.0)
    ret = []
    while fraction >= 1.0:
        ret.append("x")
        fraction = fraction - 1
    if x < fraction:
        ret.append("x")
    return ret

f = F.udf(ff, T.ArrayType(T.StringType())).asNondeterministic()
seed = 42
df2.withColumn("r", F.rand(seed)) \
    .withColumn("r", f("areaID", F.col("r"))) \
    .withColumn("r", F.explode("r")) \
    .drop("r") \
    .show()
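If the requested counts have to be met exactly, the second approach from the question (a rank over a random ordering) can also be written with a window function. A sketch; it returns exactly count rows per areaID whenever df2 has at least that many, at the price of shuffling all of df2:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("areaID").orderBy(F.rand(seed=42))
exact = df2.withColumn("rn", F.row_number().over(w)) \
    .join(df1, "areaID") \
    .where(F.col("rn") <= F.col("count")) \
    .drop("rn", "count")
exact.show()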

Spark decimal type precision loss

I'm doing some testing of Spark decimal types for currency measures and am seeing some odd precision results when I set the scale and precision as shown below. I want to be sure that I won't have any data loss during calculations, but the example below is not reassuring. Can anyone tell me why this is happening with Spark SQL? Currently on version 2.3.0.
val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""
spark.sql(sql).show
This returns
+----------------+
| val|
+----------------+
|0.33333300000000|
+----------------+
This is a currently open issue; see SPARK-27089. The suggested workaround is to adjust the setting below. I validated that the SQL statement works as expected with this setting set to false.
spark.sql.decimalOperations.allowPrecisionLoss=false
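For example, the option can be set per session; a sketch from PySpark (passing --conf spark.sql.decimalOperations.allowPrecisionLoss=false to spark-submit works as well):
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")

sql = "select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"
spark.sql(sql).show()  # per the answer above, the division now keeps the requested scale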
Use BigDecimal to avoid precision loss. See Double vs. BigDecimal?
example:
scala> val df = Seq(BigDecimal("0.03"),BigDecimal("8.20"),BigDecimal("0.02")).toDS
df: org.apache.spark.sql.Dataset[scala.math.BigDecimal] = [value: decimal(38,18)]
scala> df.select($"value").show
+--------------------+
| value|
+--------------------+
|0.030000000000000000|
|8.200000000000000000|
|0.020000000000000000|
+--------------------+
Using BigDecimal:
scala> df.select($"value" + BigDecimal("0.1")).show
+-------------------+
| (value + 0.1)|
+-------------------+
|0.13000000000000000|
|8.30000000000000000|
|0.12000000000000000|
+-------------------+
If you don't use BigDecimal, there will be a loss of precision. In this case, 0.1 is a Double:
scala> df.select($"value" + lit(0.1)).show
+-------------------+
| (value + 0.1)|
+-------------------+
| 0.13|
| 8.299999999999999|
|0.12000000000000001|
+-------------------+
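The same distinction can be reproduced from PySpark with Python's decimal.Decimal; a sketch (the dec_df name and the DDL schema string are just for illustration):
from decimal import Decimal
from pyspark.sql import functions as F

dec_df = spark.createDataFrame(
    [(Decimal("0.03"),), (Decimal("8.20"),), (Decimal("0.02"),)],
    "value decimal(38,18)")

# decimal literal: exact decimal arithmetic
dec_df.select((F.col("value") + F.lit(Decimal("0.1"))).alias("sum")).show()
# float literal: the double 0.1 turns the addition into double arithmetic, losing precision
dec_df.select((F.col("value") + F.lit(0.1)).alias("sum")).show()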

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a DataFrame in pyspark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
+----+---------------------------+
| ID | words                     |
+----+---------------------------+
| 1  | ['apple','ball','ballon'] |
| 2  | ['cat','camel','james']   |
+----+---------------------------+
I also want to add an ID column, which is not part of the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
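If the ID should start at 1, as in the desired output, pass a start value to enumerate:
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data, start=1)]).show()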
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.append((i, my_data[i]))

df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import Row
from datetime import datetime, timezone

# placeholder values; the keyword arguments become the column names
utc = datetime.now(timezone.utc)
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data = [['apple','ball','ballon'], ['cat','camel','james'], ['none','focus','cake']]

spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+--+
|words                |id|
+---------------------+--+
|[apple, ball, ballon]|0 |
|[cat, camel, james]  |1 |
|[none, focus, cake]  |2 |
+---------------------+--+
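If you prefer to declare the schema explicitly instead of relying on inference (and to start the ID at 1, as in the desired output), a sketch along the same lines:
from pyspark.sql.types import StructType, StructField, LongType, ArrayType, StringType

schema = StructType([
    StructField("ID", LongType(), False),
    StructField("words", ArrayType(StringType()), True),
])
df = spark.createDataFrame([(i, words) for i, words in enumerate(my_data, start=1)], schema)
df.show(truncate=False)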

Read fixed length file with implicit decimal point?

Suppose I have a data file like this:
foo12345
bar45612
I want to parse this into:
+----+-------+
| id| amt|
+----+-------+
| foo| 123.45|
| bar| 456.12|
+----+-------+
Which is to say, I need to select df.value.substr(4,5).alias('amt'), but I want the value to be interpreted as a five digit number where the last two digits are after the decimal point.
Surely there's a better way to do this than "divide by 100"?
from pyspark.sql.functions import substring, concat, lit
from pyspark.sql.types import DoubleType

# sample data
df = sc.parallelize([
    ['foo12345'],
    ['bar45612']]).toDF(["value"])

df = df.withColumn('id', substring('value', 1, 3)) \
    .withColumn('amt', concat(substring('value', 4, 3), lit('.'), substring('value', 7, 2)).cast(DoubleType()))
df.show()
Output is:
+--------+---+------+
| value| id| amt|
+--------+---+------+
|foo12345|foo|123.45|
|bar45612|bar|456.12|
+--------+---+------+
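If the records come from a fixed-length file on disk, spark.read.text produces the same single string column named value, so the parsing above applies unchanged; casting to a decimal type instead of double also keeps the two implied decimal places exact. A sketch (the path is hypothetical):
from pyspark.sql.functions import substring, concat, lit
from pyspark.sql.types import DecimalType

df = spark.read.text("/path/to/fixed_width.txt")  # hypothetical path
df = df.withColumn('id', substring('value', 1, 3)) \
    .withColumn('amt', concat(substring('value', 4, 3), lit('.'), substring('value', 7, 2))
                .cast(DecimalType(7, 2)))
df.show()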

Split string on custom Delimiter in pyspark

I have data with column foo which can be
foo
abcdef_zh
abcdf_grtyu_zt
pqlmn#xl
From here I want to create two columns such that:
Part 1        Part 2
abcdef        zh
abcdf_grtyu   zt
pqlmn         xl
The code I am using for this is
data = data.withColumn("Part 1",split(data["foo"],substring(data["foo"],-3,1))).get_item(0)
data = data.withColumn("Part 2",split(data["foo"],substring(data["foo"],-3,1))).get_item(1)
However, I am getting a "Column is not iterable" error.
The following should work
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import expr
>>> df = sc.parallelize(['abcdef_zh', 'abcdfgrtyu_zt', 'pqlmn#xl']).map(lambda x: Row(x)).toDF(["col1"])
>>> df.show()
+-------------+
| col1|
+-------------+
| abcdef_zh|
|abcdfgrtyu_zt|
| pqlmn#xl|
+-------------+
>>> df.withColumn('part2',df.col1.substr(-2, 3)).withColumn('part1', expr('substr(col1, 1, length(col1)-3)')).select('part1', 'part2').show()
+----------+-----+
| part1|part2|
+----------+-----+
| abcdef| zh|
|abcdfgrtyu| zt|
| pqlmn| xl|
+----------+-----+
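A regexp_extract variant works too, assuming the delimiter is always the third character from the end (the pattern simply treats whatever sits in that position as the separator):
from pyspark.sql.functions import regexp_extract

df.withColumn('part1', regexp_extract('col1', r'^(.*).(..)$', 1)) \
  .withColumn('part2', regexp_extract('col1', r'^(.*).(..)$', 2)) \
  .select('part1', 'part2').show()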