Sum columns of a dataset - pyspark

I'm working in PySpark and I have a dataset like this:
I want to create a new df like this, with the corresponding sums:
So I tried this code:
df = df.withColumnRenamed("month_actual_january", "monthjanuary")
fin=df.groupBy(["column1","column2"]).sum()
The problem is that I get the following error:
Attribute sum(column3) contains an invalid character among ,;{}()\n\t=. Please use an alias to rename it
Do you know how to fix this error? Thanks!

Aggregate every non-key column explicitly and give each sum an alias, for example with a list comprehension:
from pyspark.sql.functions import sum
df.groupBy(["column1","column2"]).agg(*[sum(x).alias(f"sum_{x}") for x in df.drop("column1","column2").columns]).show()
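To make the pattern concrete, here is a minimal, self-contained sketch with a made-up toy dataset (column names and values are assumptions, not the asker's data):
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum
spark = SparkSession.builder.getOrCreate()
# Toy frame standing in for the asker's dataset
df = spark.createDataFrame(
    [("a", "x", 1, 10), ("a", "x", 2, 20), ("b", "y", 3, 30)],
    ["column1", "column2", "column3", "column4"])
# Sum every non-key column, aliasing each result so the generated name
# contains none of the characters ,;{}()\n\t= that Spark complains about
value_cols = [c for c in df.columns if c not in ("column1", "column2")]
fin = df.groupBy("column1", "column2").agg(
    *[spark_sum(c).alias(f"sum_{c}") for c in value_cols])
fin.show()
# Resulting columns: column1, column2, sum_column3, sum_column4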

How to remove the first 2 rows using zipWithIndex in Spark Scala

I have two header rows in the file and have to remove them. I tried zipWithIndex, which assigns an index starting from zero, but I get an error when applying a filter condition on it.
val data=spark.sparkContext.textFile(filename)
val s=data.zipWithIndex().filter(row=>row[0]>1) --> throwing error here
Any help here please.
Sample data:
===============
sno,empno,name --> need to remove
c1,c2,c3 ==> need to remove
1,123,ramakrishna
2,234,deepthi
Error: identifier expected but integer literal found
value row of type (String,Long) does not take type parameters.
not found: type <error>
The error comes from row[0]: in Scala, square brackets denote type parameters, so you have to use the tuple accessor instead. zipWithIndex gives you (value, index) pairs, so filter with row._2 > 1 to drop the first two rows (the index is zero-based). Alternatively, add a row number column and remove rows easily by filtering row_num > 2.
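For reference, here is the same idea sketched in PySpark rather than the asker's Scala (the file name is hypothetical):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# zipWithIndex pairs each line with its position; keep everything after the
# two header lines (indices 0 and 1), then strip the index again
data = spark.sparkContext.textFile("employees.csv")  # hypothetical file name
rows = (data.zipWithIndex()
            .filter(lambda pair: pair[1] > 1)
            .map(lambda pair: pair[0]))
print(rows.collect())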

How can I resolve the "need struct type but got struct"

I have a column named probability and I want to create a new column from it. I want to extract values from the probability column, which looks like an array, but when I try to do so I receive an error:
"Can't extract value from probability#52427: need struct type but got struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"
Here is my extraction code:
preds_test = preds.withColumn("newCol", col("probability").getItem(3))
Can someone please tell me what I did wrong?
I figured it out. I used a lambda function. This is my code:
preds_subset = preds.select('CustomerID', 'prediction', probs_churn('probability')).orderBy(asc("probability"))
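The probability column produced by a Spark ML model is a vector type (the struct<type,size,indices,values> in the error message), which is why getItem fails; the values have to be pulled out with a UDF. The self-answer does not show probs_churn, but it is presumably something along these lines (the extracted index and return type are assumptions):
from pyspark.sql.functions import udf, asc
from pyspark.sql.types import DoubleType
# Hypothetical reconstruction of probs_churn: pull one element out of the
# ML vector and return it as a plain double so it can be selected and sorted
probs_churn = udf(lambda v: float(v[1]), DoubleType())  # index 1 is an assumption
preds_subset = preds.select(
    'CustomerID',
    'prediction',
    probs_churn('probability').alias('probability')
).orderBy(asc('probability'))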

How to keep the original column after applying data validation in the same column

I have a task to validate decimal and date fields. I am able to validate the decimal and date fields in place on the same column, but I am not able to keep the old column values.
Input:
id,amt1
1,123
2,321
3,345
4,543
5,789
Current Output:
id,amt1
1,12.3
2,32.1
3,34.5
4,54.3
5,78.9
Expected Output:
id,amt1,original_amt1_values
1,12.3,123
2,32.1,321
3,34.5,345
4,54.3,543
5,78.9,789
Below is the code. I am able to validate the decimal field, but not able to keep the original values. Kindly help me with this; I want to keep the original column in the dataframe itself.
SourceFileDF = SourceFileDF.withColumn("amt1", DecimalConversion(col("amt1")))
DecimalConversion is my UDF and SourceFileDF is my dataframe.
You can use a temporary column name for "amt1" and then rename the columns:
SourceFileDF = SourceFileDF.withColumn("amt1_converted", DecimalConversion(col("amt1")))
SourceFileDF = SourceFileDF.withColumnRenamed("amt1", "original_amt1_values")
SourceFileDF = SourceFileDF.withColumnRenamed("amt1_converted", "amt1")
You can also use select and provide the aliases in a single statement (keeping the id column so the output matches the expected columns):
sourceFileDF.select(
  $"id",
  DecimalConversion($"amt1").as("amt1"),
  $"amt1".as("original_amt1_values")
)
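For reference, a minimal end-to-end sketch of the first approach in PySpark; DecimalConversion is reconstructed here as a stand-in UDF that inserts the decimal point, which is only an assumption about what the asker's UDF does:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "123"), (2, "321"), (3, "345"), (4, "543"), (5, "789")],
    ["id", "amt1"])
# Stand-in for the asker's DecimalConversion UDF (assumed behaviour: "123" -> "12.3")
DecimalConversion = udf(lambda s: s[:-1] + "." + s[-1], StringType())
df = (df.withColumn("amt1_converted", DecimalConversion(col("amt1")))
        .withColumnRenamed("amt1", "original_amt1_values")
        .withColumnRenamed("amt1_converted", "amt1"))
df.show()
# Columns after the renames: id, original_amt1_values, amt1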

DataFrame keying using the pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to apply an example I saw in one of Wes's videos and notebooks to my own data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a data frame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to treat the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do get a result when I use the filepath, but as I understand it (and saw in previous examples) I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex series; I simply did it the wrong way.
The first index level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not a single index value.
Your .size() on the groupby object converts it into a Series. If you put it into a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; you could test for multiple values at once, for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
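A minimal runnable sketch of the same slicing, using a toy frame in place of the asker's csv (the values are made up):
import pandas as pd
df = pd.DataFrame({
    "filePath": ["a.wav"] * 3 + ["b.wav"] * 2,
    "vp": ["Cust_1", "Cust_2", "Cust_1", "Cust_2", "Cust_3"],
    "score": [-2, -80, 25, 18, -56]})
res = df.groupby(["filePath", "vp"]).size()
# Partial indexing on the outer level works directly
print(res["a.wav"])
# Slicing on the inner level needs .xs or boolean indexing
print(res.xs("Cust_2", level=1))
print(res[res.index.get_level_values(1) == "Cust_2"])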

Merge multiple geometry lines into a single geometry line

Now I have 3 lines (LINESTRING) with 3 different geometry values.
Line#1 : "11AABB"
Line#2 : "22CCDD"
Line#3 : "33EEFF"
How can I merge these lines into a single line via a PostGIS function?
I know it involves ST_LineMerge or ST_Union, but I'm not sure how to use them with geometry values.
Could you be more specific?
With the examples you provided, you could try this:
SELECT ST_Union('11AABB'::geometry, ST_Union('22CCDD'::geometry, '33EEFF'::geometry));
Your geometry values are invalid, but this is the correct syntax for the query you want. With the example values you provided you would get this error: "ERROR: parse error - invalid geometry"