I have a dataframe with a boolean column. If this column is set to true, I would like to change the values of various other columns. Is there a way to do this without having to write the following for every column that needs updating: .withColumn(<column that needs to change>, when(<boolean column is true>, <new value>).otherwise(<column that needs to change>))? Something along the lines of:
if (boolean column):
    col(<col1 that needs to change>) = "val1",
    col(<col2 that needs to change>) = "val2",
    ...
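One way to avoid repeating the withColumn call by hand is to keep the target columns and values in a map and fold over it. Below is a minimal Scala sketch (the updates map, the applyWhenFlagged name, and the flagCol parameter are illustrative, not from the original post; the same loop works in PySpark with functools.reduce):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical mapping of columns to the values they should take when the flag is true.
val updates: Map[String, String] = Map("col1" -> "val1", "col2" -> "val2")

def applyWhenFlagged(df: DataFrame, flagCol: String): DataFrame =
  updates.foldLeft(df) { case (acc, (name, value)) =>
    acc.withColumn(name, when(col(flagCol), lit(value)).otherwise(col(name)))
  }
Each step of the fold rewrites one column, leaving it untouched when the flag column is false.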
Related question:
In PySpark, I tried to do this:
df = df.select(F.col("id"),
F.col("mp_code"),
F.col("mp_def"),
F.col("mp_desc"),
F.col("mp_code_desc"),
F.col("zdmtrt06_zstation").alias("station"),
F.to_timestamp(F.col("date_time"), "yyyyMMddHHmmss").alias("date_time_utc"))
df = df.groupBy("id", "mp_code", "mp_def", "mp_desc", "mp_code_desc", "station").min(F.col("date_time_utc"))
But I have an issue:
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
Here is an extract from the PySpark documentation:
GroupedData.min(*cols)
Computes the min value for each numeric column for each group.
New in version 1.3.0.
Parameters: cols : str
In other words, the min function does not support column arguments. It only works with column names (strings) like this:
df.groupBy("x").min("date_time_utc")
# you can also specify several column names
df.groupBy("x").min("y", "z")
Note that if you want to use a column object, you have to use agg:
df.groupBy("x").agg(F.min(F.col("date_time_utc")))
I'm new to PySpark and I'm struggling when I select one column and want to show its type.
If I have a dataframe and want to show the types of all columns, this is what I do:
raw_df.printSchema()
If I want a specific column, I'm doing this, but I'm sure we can do it faster:
new_df = raw_df.select(raw_df.annee)
new_df.printSchema()
Do I have to use select, store my column in a new dataframe, and use printSchema()?
I tried something like this but it doesn't work:
raw_df.annee.printchema()
Is there another way?
Do I have to use select and store my column in a new dataframe and use printSchema()?
Not necessarily - take a look at this code:
raw_df = spark.createDataFrame([(1, 2)], "id: int, val: int")
print(dict(raw_df.dtypes)["val"])
int
The "val" is of course the column name you want to query.
I'm new to Spark and Scala.
We have an external data source feeding us JSON. This JSON has quotes around all values, including number and boolean fields. So by the time I get it into my DataFrame, all the columns are strings. The end goal is to convert these JSON records into properly typed Parquet files.
There are approximately 100 fields, and I need to change several of the types from string to int, boolean, or bigint (long). Further, each DataFrame we process will only have a subset of these fields, not all of them. So I need to be able to handle subsets of columns for a given DataFrame, compare each column to a known list of column types, and cast certain columns from string to int, bigint, and boolean depending on which columns appear in the DataFrame.
Finally, I need the list of column types to be configurable because we'll have new columns in the future and may want to get rid of or change old ones.
So, here's what I have so far:
// first I convert to all lower case for column names
val df = dfIn.toDF(dfIn.columns map(_.toLowerCase): _*)
// Big mapping to change types
// TODO how would I make this configurable?
// I'd like to drive this list from an external config file.
val dfOut = df.select(
  df.columns.map {
    ///// Boolean
    case a @ "a" => df(a).cast(BooleanType).as(a)
    case b @ "b" => df(b).cast(BooleanType).as(b)
    ///// Integer
    case i @ "i" => df(i).cast(IntegerType).as(i)
    case j @ "j" => df(j).cast(IntegerType).as(j)
    // Bigint to Double
    case x @ "x" => df(x).cast(DoubleType).as(x)
    case y @ "y" => df(y).cast(DoubleType).as(y)
    case other => df(other)
  }: _*
)
Is this a good, efficient way to transform this data into the types I want in Scala?
I could use some advice on how to drive this off an external 'config' file where I could define the column types.
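One hedged way to make the mapping configurable is to keep a name-to-type map and fold the casts over the columns that actually appear in the DataFrame. A sketch; the typeOverrides map, its contents, and the castKnownColumns name are illustrative, and in practice the map could be parsed from a properties or JSON file at startup:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{BooleanType, DataType, DoubleType, IntegerType}

// Illustrative mapping; build this from an external config file rather than hard-coding it.
val typeOverrides: Map[String, DataType] = Map(
  "a" -> BooleanType,
  "b" -> BooleanType,
  "i" -> IntegerType,
  "j" -> IntegerType,
  "x" -> DoubleType,
  "y" -> DoubleType
)

def castKnownColumns(df: DataFrame): DataFrame =
  df.select(df.columns.map { name =>
    typeOverrides.get(name) match {
      case Some(t) => df(name).cast(t).as(name)   // cast columns with a configured type
      case None    => df(name)                    // leave everything else untouched
    }
  }: _*)
Columns not in the map pass through unchanged, and map entries for columns missing from a given DataFrame are simply never used, which covers the subset-of-columns case.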
My question evolved into this question. Good answer given there:
Spark 2.2 Scala DataFrame select from string array, catching errors
I am trying to add an extra "tag" column to an HBase table. Tagging is done on the basis of words present in the rows of the table. Say, for example, if "Dark" appears in a certain row, then its tag will be added as "Horror". I have read all the rows from the table into a Spark RDD and have matched them with the words based on which we would tag. A snippet of the code looks like this:
var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
(Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieName"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieSummary"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieActor")))
)
})
Here, "moviesdata" is the columnfamily of the HBase table and "MovieName"&"MovieSummary" & "MovieActor" are column names. "transformedRDD" in the above snippet is of type RDD[String,String,String]. It has been converted into type RDD[String] by:
val arrayRDD: RDD[String] = transformedRDD.map(x => (x._1 + " " + x._2 + " " + x._3))
From this, all words have been extracted by doing this:
val words = arrayRDD.map(x => x.split(" "))
The words we are looking for in the HBase table rows are in a CSV file. One of the columns of the CSV, let's say the "synonyms" column, has the words we look for. Another column in the CSV is a "target_tag" column, which has the words that would be tagged to the row for which there is a match.
I read the CSV with:
val csv = sc.textFile("/tag/moviestagdata.csv")
Reading the synonyms column (it is the second column, hence "p(1)" in the snippet below):
val synonyms = csv.map(_.split(",")).map( p=>p(1))
Reading the target_tag column (it is the third column):
val targettag = csv.map(_.split(",")).map(p=>p(2))
Some rows in synonyms and target_tag have more than one string, separated by "###". The snippet to separate them is this:
val splitsyno = synonyms.map(x => x.split("###"))
val splittarget = targettag.map(x=>x.split("###"))
Now, to match each string from "splitsyno", we need to traverse every row, and a row might have many strings, so to collect every string into a set I did this (an empty set was created beforehand):
splitsyno.map(x => x.foreach(y => set += y))
To match every string with those in "words" created above, I did this:
val check = words.exists(set contains _)
Now, the problem I am facing is that I don't know exactly which strings from which rows in the CSV match which strings from which rows in the HBase table. I need this in order to find the corresponding target string and the HBase table row to add it to. How should I get this done? Any help would be highly appreciated.
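One possible direction, sketched under assumptions rather than a definitive implementation: keep the row identity next to its words (here MovieName is assumed to identify a row; otherwise carry the HBase row key from tuple._1), collect the synonym/target pairs from the CSV to the driver (assuming the CSV is small), broadcast them, and match per row. The names rowWords, tagPairs and rowTags are illustrative:
import org.apache.spark.rdd.RDD

// Keep the row identity (the movie name) next to the words extracted from that row.
val rowWords: RDD[(String, Array[String])] =
  transformedRDD.map { case (name, summary, actor) =>
    (name, (name + " " + summary + " " + actor).split(" "))
  }

// (synonym, target_tag) pairs from the CSV; the "###"-separated synonyms are flattened.
// Note: if target_tag also contains "###"-separated values, split p(2) here as well.
val tagPairs: Array[(String, String)] = csv
  .map(_.split(","))
  .flatMap(p => p(1).split("###").map(syn => (syn, p(2))))
  .collect()

val tagPairsBc = sc.broadcast(tagPairs)

// For every HBase row, emit the tags whose synonym appears among that row's words.
val rowTags: RDD[(String, Seq[String])] = rowWords.mapValues { words =>
  val wordSet = words.toSet
  tagPairsBc.value.collect { case (syn, tag) if wordSet.contains(syn) => tag }.toSeq
}
Because the row identity travels with the words, rowTags tells you exactly which HBase row each matched target tag belongs to, which is the information needed to write the "tag" column back.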
I need a window function that partitions by some keys (=column names), orders by another column name and returns the rows with top x ranks.
This works fine for ascending order:
def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
  val w = Window.partitionBy(top_keys(1), top_keys.drop(1): _*)
    .orderBy(top_value)
  val rankCondition = "rn < " + top_x
  val dfTop = df.withColumn("rn", row_number().over(w))
    .where(rankCondition).drop("rn")
  return dfTop
}
But when I try to change it to orderBy(desc(top_value)) or orderBy(top_value.desc) in line 4, I get a syntax error. What's the correct syntax here?
There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).
Putting it all together, we get
.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
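Applied to the function from the question, the window definition might look like this (a sketch; everything else in getTopX stays as posted):
import org.apache.spark.sql.functions.col

val w = Window
  .partitionBy(top_keys.head, top_keys.tail: _*)   // note: head/tail uses all keys; the original top_keys(1)/drop(1) never uses the first one
  .orderBy(col(top_value).desc)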
For example, if we need to order by a column called Date in descending order in the window function, use the $ symbol before the column name, which enables the asc or desc syntax.
Window.orderBy($"Date".desc)
After specifying the column name in double quotes, append .desc, which will sort in descending order.
Column col = new Column("ts");
col = col.desc();
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col);