I am trying to parse a PySpark column which contains an "=" sign. The two expressions I've created for this purpose work individually:
DF=DF.withColumn("findEqual",instr(columnName,"="))
and also when I create a substring column:
DF=DF.withColumn("parsedString",substring(columnName,2,18))
However, when I combine the two functions:
DF=DF.withColumn("parsedString",2,instr(columnName,"="))
I receive an error:
TypeError: int() argument must be a string or a number, not 'Column'
The issue seems to be that "findEqual" isn't seen by PySpark as an integer, but rather as an "integer object".
Thanks for your help!
You are using functions defined on strings, not on PySpark columns; you can convert them using a udf:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
def instr(x, s):
    return s in x
instr_udf = lambda s: udf(lambda x: instr(x, s), BooleanType())
DF=DF.withColumn("findEqual",instr_udf("=")("columnName"))
and
substring_udf = udf(substring, StringType())
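That substring_udf line is only a sketch; a complete version, assuming a plain Python helper (py_substring is a name made up here) and SQL-style 1-based positions, might look like:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def py_substring(x, start, length):
    # plain Python slicing; positions are 1-based to mirror SQL's substring
    return None if x is None else x[start - 1:start - 1 + length]

substring_udf = lambda start, length: udf(lambda x: py_substring(x, start, length), StringType())

DF = DF.withColumn("parsedString", substring_udf(2, 18)("columnName"))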
I would not recommend using UDFs when equivalent functions already exist in PySpark:
DF=DF.withColumn("findEqual",DF.columnName.like('%=%'))
DF=DF.withColumn("parsedString",DF.columnName[2:18])
This is my logic in PySpark:
df2 = spark.sql(f" SELECT tbl_name, column_name, data_type, current_count FROM {database_name}.{tablename}")
query_df = spark.sql(f"SELECT tbl_name, COUNT(column_name) as `num_cols` FROM {database_name}.{tablename} GROUP BY tbl_name")
df_join = df2.join(query_df,['tbl_name'])
Then I want to add another column to the DataFrame called 'column_case_lower', which indicates whether the column names are lower case, using the islower() function.
I'm using this logic to do the analysis:
df_join.withColumn("column_case_lower",
when((col("column_name").islower()) == 'true'.otherwise('false'))
The error is: unexpected EOF while parsing
I'm expecting something like this: a new column showing true or false for each column name.
islower() can't be applied to a Column type. Use the code below, which uses a UDF instead.
def checkCase(col_value):
    return col_value.islower()
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
checkUDF = udf(lambda z: checkCase(z), BooleanType())
from pyspark.sql.functions import col,when
df.withColumn("new_col", when(checkUDF(col('column_name')) == True,"True")
.otherwise("False")).show()
I have one column in a DataFrame with the format
'[{jsonobject},{jsonobject}]'. Here the length will be 2.
I have to find the length of this array and store it in another column.
I've only worked with pySpark, but the Scala solution would be similar. Assuming the column name is input:
from pyspark.sql import functions as f, types as t
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects"))
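A quick usage sketch on a made-up DataFrame (the JSON strings below are just illustrative, and an active spark session is assumed):
from pyspark.sql import functions as f, types as t

df = spark.createDataFrame(
    [('[{"a": "1"}, {"b": "2"}]',), ('[{"a": "1"}]',)],
    ["input"],
)
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects")).show()
# num_objects is 2 for the first row and 1 for the second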
My Scala dataframe has a column with the datatype array(element: String). I want to display the rows of the dataframe that have the word "hello" in that column.
I have this:
display(df.filter($"my_column".contains("hello")))
I get an error because of a data type mismatch. It says that argument 1 requires string type; however, 'my_column' is of array<string> type.
You can use the array_contains function:
import org.apache.spark.sql.functions._
df.filter(array_contains(df.col("my_column"), "hello")).show
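For reference, the PySpark equivalent (assuming the same column name) would be:
from pyspark.sql.functions import array_contains

df.filter(array_contains(df.my_column, "hello")).show()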
I have a PySpark dataframe, and I wish to get the mean and std for all columns, rename the resulting columns, and cast their type. What is the easiest way to implement this? My current code is below:
test_mean=test.groupby('id').agg({'col1': 'mean',
'col2': 'mean',
'col3':'mean'
})
test_std=test.groupby('id').agg({'col1': 'std',
'col2': 'std',
'col3':'std'
})
## rename the columns one by one
## type cast decimal to float
May I know how to improve it?
Thanks.
You can try with column expressions:
from pyspark.sql import functions as F
expr1 = F.stddev(F.col('col1').cast('integer')).alias('col1')
expr2 = F.stddev(F.col('col2').cast('integer')).alias('col2')

test \
    .groupBy('id') \
    .agg(
        expr1,
        expr2
    )
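Building on that, here is a sketch that generates both the mean and std expressions for every column in one pass, handling the renaming with alias and the decimal-to-float cast with cast (the column list is assumed from the question):
from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']  # columns taken from the question

exprs = []
for c in cols:
    exprs.append(F.mean(F.col(c)).cast('float').alias(c + '_mean'))
    exprs.append(F.stddev(F.col(c)).cast('float').alias(c + '_std'))

test_stats = test.groupBy('id').agg(*exprs)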
I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside the agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the values of each row as input to the agg, but I am not able to reach them.
Any idea how to achieve this?
If you can change the strings in the sequences column to be SQL expressions, then this can be solved: Spark provides a function expr that takes a SQL string and converts it into a column. An example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
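If it helps, the same pattern carries over to PySpark; collect the strings and map them through expr (a sketch assuming keys is a list of grouping column names):
from pyspark.sql.functions import col, expr

# turn each SQL string in the sequences column into a Column expression
seqs = [expr(row.sequences) for row in df2.select("sequences").collect()]

test = df.groupBy(*[col(k) for k in keys]).agg(*seqs)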