I have a DF like this:
+-----+----+----------+----------+----------+----------+----------+
|    A|   B|2019-01-31|2019-02-28|2019-03-31|2019-04-30|2019-05-31|
+-----+----+----------+----------+----------+----------+----------+
|11125|SL15|      15.0|      12.0|       1.0|       0.0|       0.0|
|20047|SL20|       7.0|      13.0|       1.0|       0.0|       0.0|
|35858|SL25|       8.0|       0.0|       1.0|       0.0|       0.0|
+-----+----+----------+----------+----------+----------+----------+
I am creating a calculated column, new_column, which sums up columns 3 to 7 (five months).
My problem is that the columns have dynamic names.
This month's column names differ from next month's, but the columns are always in the same positions (this is guaranteed upstream in the flow: the column that must be at position 3 will be at position 3 next month, only the names change).
That means every month I will sum up columns 3 to 7.
Example:
If I assign the first variable like this, colH1 = F.col("2019-01-31"), and the second like this, colH2 = F.col("2019-02-28"), I can do mathematical operations between them.
The new calculated column then looks like this: df = df.withColumn('new_column', colH1 + colH2)
What needs to change is the assignment colH1 = F.col("2019-01-31"), because "2019-01-31" is the fixed name of the column at position 3 this month, but that name will change next month. I need the calculation to reference "position 3" instead.
I cannot use df.select, as that operation just picks the exact column, and I am then unable to join it back to the original DF because there are no unique keys for joining.
I hope this explains my need; I have not found a solution for such a case.
Get the column name based on the column index and store it in a variable. In PySpark, df.columns is a plain Python list, so you can get the column name at a given index with the code below:
df.columns[index]
Then wrap the stored names in F.col and pass the variables into the expression below:
df = df.withColumn('new_column', F.col(var1) + F.col(var2))
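Putting it together, a minimal sketch that sums the columns at positions 3 to 7 (0-based list indices 2 to 6) no matter what they are named this month; the variable names here are only illustrative:
import pyspark.sql.functions as F
from functools import reduce

# Columns at positions 3 to 7 -- their names change every month,
# but the positions are guaranteed by the upstream flow.
month_cols = df.columns[2:7]

# Build one sum expression from the positional names and add it as the new column
sum_expr = reduce(lambda a, b: a + b, [F.col(c) for c in month_cols])
df = df.withColumn('new_column', sum_expr)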
I want to convert the prefix from 222.. to 999.. in PySpark.
Expected: a new column new_id with the prefix changed to 999..
I will be using this column for an inner merge between 2 PySpark dataframes.
id            | new_id
--------------|--------------
2222238308750 | 9999938308750
222222579844  | 999999579844
222225701296  | 999995701296
2222250087899 | 9999950087899
2222250087899 | 9999950087899
2222237274658 | 9999937274658
22222955099   | 99999955099
22222955099   | 99999955099
22222955099   | 99999955099
222285678     | 999985678
You can achieve it with something like this:
# First calculate the number of "2"s from the start until some other value is found, e.g. '2223' should give you 3 as the length
# Use that calculated value to repeat the "9" that many times
# Replace the starting "2"s with the calculated "9" string
# Finally drop all the intermediate columns
df.withColumn("len_2", F.length(F.regexp_extract(F.col("value"), r"^2*(?!2)", 0)).cast('int'))\
.withColumn("to_replace_with", F.expr("repeat('9', len_2)"))\
.withColumn("new_value", F.expr("regexp_replace(value, '^2*(?!2)', to_replace_with)")) \
.drop("len_2", "to_replace_with")\
.show(truncate=False)
Output:
+-------------+-------------+
|value |new_value |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|222285678 |999985678 |
+-------------+-------------+
I have used value as the column name; you would have to substitute it with id.
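For example, applied directly to the question's id column (the same logic as above, just with the column renamed):
df.withColumn("len_2", F.length(F.regexp_extract(F.col("id"), r"^2*(?!2)", 0)).cast('int'))\
  .withColumn("to_replace_with", F.expr("repeat('9', len_2)"))\
  .withColumn("new_id", F.expr("regexp_replace(id, '^2*(?!2)', to_replace_with)"))\
  .drop("len_2", "to_replace_with")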
You can try the following:
from pyspark.sql.functions import *
df = df.withColumn("tempcol1", regexp_extract("id", "^2*", 0)).withColumn("tempcol2", split(regexp_replace("id", "^2*", "_"), "_")[1]).withColumn("new_id", concat((regexp_replace("tempcol1", "2", "9")), "tempcol2")).drop("tempcol1", "tempcol2")
The id column is split into two temp columns, one having the prefix and the other the rest of the string. The prefix column values are replaced and concatenated back with the second temp column.
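To see how the split behaves, you can keep the temp columns for a moment and inspect them; this is just a quick sketch, and the values in the comments assume the sample ids above:
# Inspect the temp columns before they are dropped
df.withColumn("tempcol1", regexp_extract("id", "^2*", 0)) \
  .withColumn("tempcol2", split(regexp_replace("id", "^2*", "_"), "_")[1]) \
  .select("id", "tempcol1", "tempcol2") \
  .show(3, truncate=False)
# For id 2222238308750: tempcol1 = "22222" and tempcol2 = "38308750",
# so new_id becomes "99999" + "38308750" = "9999938308750"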
I want to use PySpark to restructure my data so that I can use it for MLlib models. Currently, for each user I have an array of arrays in one column, and I want to convert it into unique columns with the count.
Users | column1
------|-------------------------
user1 | [[name1, 4], [name2, 5]]
user2 | [[name1, 2], [name3, 1]]
should get converted to:
Users | name1 | name2 | name3
------|-------|-------|------
user1 | 4.0   | 5.0   | 0.0
user2 | 2.0   | 0.0   | 1.0
I came up with a method that uses for loops, but I am looking for a way that can utilize Spark because the data is huge. Could you give me any hints? Thanks.
Edit:
All of the unique names should come as individual columns with the score corresponding to each user. Basically, a sparse matrix.
I am working with pandas right now, and the code I'm using to do this is:
data = data.applymap(lambda x: dict(x))  # Convert each array of arrays into a dictionary
columns = list(data)
for i in columns:
    # For each column, use the dictionary to make a new Series and append it to the current dataframe
    data = pd.concat([data.drop([i], axis=1), data[i].apply(pd.Series)], axis=1)
Figured out the answer:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# First explode column1; this turns each inner [name, score] pair into its own row
df = df.withColumn('column1', F.explode_outer(F.col('column1')))
# Then separate the exploded column1 into a name column and a count column
df = df.withColumn("column1_separated", F.col('column1')[0])
df = df.withColumn("count", F.col('column1')[1].cast(IntegerType()))
# Then pivot the df
df = df.groupby('Users').pivot("column1_separated").sum('count')
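Depending on your data you may need one more step; this is an assumption based on the expected output above, which shows 0.0 where a user has no score for a name: the pivot leaves nulls for missing user/name combinations, so fill them with zero.
# Replace the nulls produced by the pivot with 0 so every user has a value for every name
df = df.fillna(0)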
I have a data frame in Scala Spark as follows:
category | score
---------|------
A        | 0.2
A        | 0.3
A        | 0.3
B        | 0.9
B        | 0.8
B        | 1
I would like to add a row-id column, as follows:
category | score | row-id
---------|-------|-------
A        | 0.2   | 0
A        | 0.3   | 1
A        | 0.3   | 2
B        | 0.9   | 0
B        | 0.8   | 1
B        | 1     | 2
Basically I want the row id to increase monotonically for each distinct value in the category column. I already have a sorted dataframe, so all the rows with the same category are grouped together. However, I still don't know how to generate a row_id that restarts when a new category appears. Please help!
This is a good use case for Window aggregation functions
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import df.sparkSession.implicits._
val window = Window.partitionBy('category).orderBy('score)
df.withColumn("row-id", row_number.over(window))
Window functions work somewhat like groupBy, except that instead of each group returning a single value, each row in each group returns a single value. In this case the value is the row's position within the group of rows with the same category. Also, if this is the effect you are trying to achieve, you don't need to have pre-sorted the dataframe by category beforehand.
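Note that row_number is 1-based, while the expected output above starts at 0, so you may want to subtract 1. For reference, a PySpark sketch of the same pattern (assuming the dataframe and column names from the question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Partition by category, order within each partition, and number the rows
window = Window.partitionBy("category").orderBy("score")
df = df.withColumn("row-id", F.row_number().over(window) - 1)  # subtract 1 for a 0-based id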
I'm trying to get the frequency of distinct values in a Spark dataframe column, something like "value_counts" from Python pandas. By frequency I mean the count of the highest-occurring value in a table column (the rank 1 value, rank 2, rank 3, etc.). In the expected output, 1 has occurred 9 times in column a, so it has the topmost frequency.
I'm using Spark SQL but it is not working out, maybe because the reduce operation I have written is wrong.
**Pandas Example**
value_counts().index[1]
**Current Code in Spark**
val x= parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql
(s"select 'ParquetRDD' as TableName,
'$field' as column,
min($field) as min, max($field) as max,
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field) as frequency from peopleRDDtable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is the query below.
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
**Expected output**
TableName  | column | min | max | frequency1
-----------|--------|-----|-----|-----------
ParquetRDD | a      | 1   | 30  | 9
ParquetRDD | b      | 2   | 21  | 5
How do I solve this? Please help.
I could solve the issue by using count($field) instead of approx_count_distinct($field). Then I used the rank analytical function to get the rank-1 value. It worked.
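A sketch of what the fixed frequency query could look like, written here with spark.sql in PySpark for consistency with the other examples in this thread; the table and column names follow the question, but the exact query text is my reconstruction of the described fix, not the original author's code:
# Hypothetical reconstruction: count each distinct value of column "a",
# rank the counts in descending order, and keep the top-ranked count.
freq_df = spark.sql("""
    SELECT value, number_cnt AS frequency1
    FROM (
        SELECT value, number_cnt,
               RANK() OVER (ORDER BY number_cnt DESC) AS rnk
        FROM (
            SELECT a AS value, COUNT(a) AS number_cnt
            FROM peopleRDDtable
            GROUP BY a
        ) counted
    ) ranked
    WHERE rnk = 1
""")
freq_df.show()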
I have a dataframe (inputDF) with 100 columns of decimal data type. I want to create a LabeledPoint using the dataframe (inputDF).
I am able to create the LabeledPoint by hardcoding each column index of the dataframe, which is not an optimal solution.
val outputLabelPoint = inputDF.map(x => new LabeledPoint(0.0, Vectors.dense(x.getAs[Double](0),x.getAs[Double](1),x.getAs[Double](2),x.getAs[Double](3), ...))
How can I create a LabeledPoint from the DataFrame directly, without hardcoding each column index of the dataframe?
Help would be much appreciated.
VectorAssembler may be the transformer you are looking for.
VectorAssembler is a transformer that combines a given list of columns into a single vector column.
BEFORE
id | hour | mobile | userFeatures | clicked
----|------|--------|------------------|---------
0 | 18 | 1.0 | [0.0, 10.0, 0.5] | 1.0
AFTER
id | hour | mobile | userFeatures | clicked | features
----|------|--------|------------------|---------|-----------------------------
0 | 18 | 1.0 | [0.0, 10.0, 0.5] | 1.0 | [18.0, 1.0, 0.0, 10.0, 0.5]
Refer to the example in the Spark Doc for more details.
If you want more help, please describe your column names and how they are generated.
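A minimal sketch of the idea, written in PySpark for consistency with the other examples in this thread (the Scala API is analogous); it assumes inputDF from the question, with every column numeric, and uses the label 0.0 as in the question's own code:
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.regression import LabeledPoint

# Assemble all columns of inputDF into a single vector column called "features",
# without listing the column indices by hand
assembler = VectorAssembler(inputCols=inputDF.columns, outputCol="features")
assembled = assembler.transform(inputDF)

# Build LabeledPoints from the assembled vectors; .toArray() bridges the ml vector
# produced by VectorAssembler to the mllib LabeledPoint constructor
labeledRDD = assembled.select("features").rdd.map(
    lambda row: LabeledPoint(0.0, row["features"].toArray())
)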