I have scoured the Polars docs and cannot find an example of creating a column with a fixed value from a variable. Here is what works in pandas:
df['VERSION'] = version
Thx
Use polars.lit
import polars as pl
version = 6
df.with_column(pl.lit(version).alias('VERSION'))
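For reference, here is a hedged, self-contained sketch (the sample DataFrame is made up); note that newer Polars releases have renamed with_column to with_columns:
import polars as pl

version = 6
df = pl.DataFrame({"a": [1, 2, 3]})  # hypothetical sample frame

# older releases: df.with_column(...); newer releases: df.with_columns(...)
df = df.with_columns(pl.lit(version).alias("VERSION"))
print(df)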
I am encoding categorical data; many columns need to be selected. I have typed them in individually and it works OK, but there is obviously a more elegant way.
import numpy as np
import pandas as pd

dataset = pd.read_csv('train.csv')
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),[2,5,6,7,8,9,10,11,12,13,14,15,16,21,22,23,24,25,27,28,29,30,31,32,33,34,35,39,40,41,42,53,54,55,56,57,58,60,63,64,65,72,73,74,78,79])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
I have tried using (23:34) and I have also tried using slice, but that does not work as it is not that data type.
Which method should I use for selecting a range of columns?
Also, what datatype is it at the point where I am selecting the columns?
I searched but was not able to find a solution for this exact question.
Finally, is this an efficient way to encode categorical data, or should I be looking at an alternative method?
Thanks!
You can use the following workaround:
from sklearn.preprocessing import OrdinalEncoder

# "data" here is your pandas DataFrame (e.g. the frame read from train.csv)
ct = ColumnTransformer(
    transformers=[
        ("ordinal_enc", OrdinalEncoder(), data.loc[:, "col1":"col100"].columns)
    ])
Note that selecting by column names like this requires fitting the transformer on the DataFrame itself rather than on the .values array.
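If you prefer to keep selecting columns by position (as in your original snippet, which works on the NumPy array), a hedged sketch is to build the index list from Python ranges instead of typing every index; the ranges below are illustrative, not your full list:
# illustrative only: build the categorical column indices from ranges
# instead of typing each one (cat_idx is a hypothetical name)
cat_idx = list(range(2, 17)) + list(range(21, 36)) + list(range(39, 43))

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), cat_idx)],
    remainder='passthrough')
x = np.array(ct.fit_transform(x))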
Hi, I am trying to add a column to my Spark DataFrame, calculating its value based on an existing DataFrame column. I am writing the code below.
val df1=spark.sql("select id,dt1,salary frm dbdt1.tabledt1")
val df2=df1.withColumn("new_date",WHEN (month(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))
IN (01,02,03)) THEN
CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))-1,'-'),
substr(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy'))),3,4))
.otherwise(CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy'))),'-')
,SUBSTR(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy')))+1,3,4))))
But it always shows the error: unclosed character literal. Can someone please guide me on how I should add this new column or modify the existing code?
There is incorrect syntax in many places. First, I suggest you look at a few Spark SQL examples online and also at the org.apache.spark.sql.functions API documentation, because your uses of WHEN, CONCAT, and IN are all incorrect.
Scala strings are enclosed in double quotes; you appear to be using SQL string syntax.
'dd-MM-yyyy' should be "dd-MM-yyyy"
To reference a column dt1 on DataFrame df1 you can use one of the following:
df1("dt1")
col("dt1") // if you import org.apache.spark.sql.functions.col
$"dt1" // if you import spark.implicits._ locally
For example:
from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy")
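For the overall shape of the expression (when/otherwise wrapped around concat), here is a hedged sketch of the same fiscal-year logic written against the PySpark API; the functions have the same names in the Scala DataFrame API, and the table and column names are taken from the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, concat, from_unixtime, lit, month, substring, to_date,
    unix_timestamp, when, year,
)

spark = SparkSession.builder.getOrCreate()
df1 = spark.sql("select id, dt1, salary from dbdt1.tabledt1")

# assuming dt1 can be parsed to a date; adapt this line to dt1's actual format
d = to_date(from_unixtime(unix_timestamp(col("dt1"))))

df2 = df1.withColumn(
    "new_date",
    # Jan-Mar belongs to the fiscal year that started the previous calendar
    # year, producing labels such as "2016-17"
    when(
        month(d).isin(1, 2, 3),
        concat((year(d) - 1).cast("string"), lit("-"),
               substring(year(d).cast("string"), 3, 2)),
    ).otherwise(
        concat(year(d).cast("string"), lit("-"),
               substring((year(d) + 1).cast("string"), 3, 2)),
    ),
)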
Spark version 1.6.0, Scala version 2.10.5.
I have a Spark SQL DataFrame df like this:
+----------------+--------------------------------+
| address        | attributes                     |
+----------------+--------------------------------+
| 1314 44 Avenue | Tours, Mechanics, Shopping     |
| 115 25th Ave   | Restaurant, Mechanics, Brewery |
+----------------+--------------------------------+
From this DataFrame, I would like to get the values below:
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMap instead and found this, so I tried to put this into a variable using:
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String
I was thinking that if I can get an output using explode, I can use distinct to get the unique values after flattening them.
How can I get the desired output?
I strongly recommend you use a Spark 2.x version. In Cloudera, when you issue "spark-shell", it launches the 1.6.x version; however, if you issue "spark2-shell", you get the 2.x shell. Check with your admin.
But if you need an RDD solution with Spark 1.6, try this.
import spark.implicits._
import scala.collection.mutable

// sample DataFrame with an array-typed "attributes" column
val df = Seq(("1314 44 Avenue", Array("Tours", "Mechanics", "Shopping")),
  ("115 25th Ave", Array("Restaurant", "Mechanics", "Brewery"))).toDF("address", "attributes")

// flatten each array of attributes, then keep only the distinct values
df.rdd.flatMap( x => x.getAs[mutable.WrappedArray[String]]("attributes") ).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attribute" column is not an array, but comma separated string, then use the below one which gives you same results
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
The problem is that withColumn expects a String as its first argument (the name of the added column), but you're passing it a Column here: df.withColumn(df("attributes"), ...).
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String - to make it a column you can use df("columnName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes Column, since you want the distinct elements on that one.
I want to remove an instance or row with missing values.
It's simple to do using the Impute widget, but now I want to do it in the Python Script widget.
How do I do this?
Write this in Python Script widget:
import numpy as np
from Orange.preprocess import impute
drop_instances = impute.DropInstances()
var = in_data.domain.attributes[0]  # choose the variable you want to check
mask = drop_instances(in_data, var)
out_data = in_data[[np.logical_not(mask)]]
If you need more information, you are welcome to ask in a comment below!
I'm trying to import the ecoinvent 3.4 cutoff database, so I wrote:
import brightway2 as bw
[...]
fpei34 = r'C:\Users\Me\Anaconda3\ecoinvent 3.4_cutoff_ecoSpold02\MasterData'
if 'ecoinvent 3.4 cutoff' in bw.databases:
print("Database has already been imported")
else:
ei34 = bw.SingleOutputEcospold1Importer(fpei34, 'ecoinvent 3.4 cutoff')
ei34.apply_strategies()
ei34.statistics()
I get the following error:
NameError: name 'filename' is not defined.
It also indicates the problem occurs at the line where SingleOutputEcospold1Importer is used.
Do you know what mistake I made and how I could fix the code?
There are at least two problems with your code:
- You are using the SingleOutputEcospold1Importer importer. The ecoinvent database has moved to ecoSpold2 files as of ecoinvent v3, so you should use the SingleOutputEcospold2Importer instead.
- Your filepath fpei34 seems to point at the MasterData folder, not the datasets themselves. You should be looking for a folder named datasets with over 14000 spold files instead (see the sketch below).
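Putting both points together, here is a hedged sketch of the corrected import (the datasets path below is an assumption; point it at wherever your ecoSpold2 datasets folder actually lives):
import brightway2 as bw

# assumed location of the ecoSpold2 "datasets" folder (not MasterData)
fpei34 = r'C:\Users\Me\Anaconda3\ecoinvent 3.4_cutoff_ecoSpold02\datasets'

if 'ecoinvent 3.4 cutoff' in bw.databases:
    print("Database has already been imported")
else:
    # ecoSpold2 importer, since ecoinvent v3 releases ship ecoSpold2 files
    ei34 = bw.SingleOutputEcospold2Importer(fpei34, 'ecoinvent 3.4 cutoff')
    ei34.apply_strategies()
    ei34.statistics()
    ei34.write_database()  # persist the database once the statistics look right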