Polars dataframe columns selection - python-polars

In pandas we can take columns from a dataframe using df_new = df[["A","B"]]. How can I do something similar using Polars?
I tried df_new = df.get_columns(["A","B"]), but it gives an error.

I suggest you read or skim through the Polars introduction here:
https://pola-rs.github.io/polars-book/user-guide/introduction.html. It is nicely written, with some good examples.
You can select columns using Polars' select method:
df_new = df.select(pl.col(["A","B"]))
or
df_new = df.select([pl.col("A"), pl.col("B")])
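For a quick end-to-end check, here is a minimal sketch (the column names and values are made up for illustration):
import polars as pl

# A small DataFrame with three columns
df = pl.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Keep only columns A and B, the Polars equivalent of pandas df[["A", "B"]]
df_new = df.select(pl.col(["A", "B"]))
print(df_new)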

Related

drop all df2.columns from another df (pyspark.sql.dataframe.DataFrame specific)

I have a large DF (pyspark.sql.dataframe.DataFrame) that is the result of multiple joins, plus new columns created using a combination of inputs from different DFs, including DF2.
I want to drop all DF2 columns from DF after I'm done with the join/creating new columns based on DF2's input.
drop() doesn't accept a list, only a string or a Column.
I know that df.drop("col1", "col2", "coln") will work, but I'd prefer not to crowd the code (if I can) by listing those 20 columns.
Is there a better way of doing this in pyspark dataframe specifically?
# Unpack df2's column names so drop() receives them as separate arguments
drop_cols = df2.columns
df = df.drop(*drop_cols)
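A minimal self-contained sketch of the same idea (the DataFrames, column names, and join key below are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "extra"])

# Join and derive whatever is needed from df2's columns...
joined = df.join(df2, on="id", how="left")

# ...then drop every df2 column without listing them by hand.
# Keep the join key, since it also appears in df2.columns.
drop_cols = [c for c in df2.columns if c != "id"]
result = joined.drop(*drop_cols)
result.show()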

issues in creating a new column of tuple from two dataframe columns in pyspark

I'm trying to create a column of tuples based on two other columns in a Spark dataframe.
from pyspark.sql import functions as F

data = [('A', 4, 5),
        ('B', 6, 9)]
columns = ["id", "val1", "val2"]
sdf = spark.createDataFrame(data=data, schema=columns)
sdf.withColumn('values', F.struct(F.col('val1'), F.col('val2'))).show()
What I got is a struct column, displayed as {4, 5} and {6, 9}.
I need the values column to be tuples. So instead of {4, 5} and {6, 9}, I want (4, 5) and (6, 9). Does anyone know what I did wrong? Thanks a lot.
That's not how Spark works.
Spark is a framework developed in Scala and running on the JVM. It is not Python.
PySpark is a set of APIs that calls the Scala methods to run Spark, but from Python.
Therefore, Python types such as tuple do not exist in Spark. You have to use either:
Struct, which is close to a Python dict, or
Array, which is the equivalent of a list (probably what you need if you want something close to a tuple).
The real question is: why do you need tuples?
EDIT: According to your comment, you need tuples because you want to use haversine. But if you use a list (or a Spark Array), for example, it works perfectly fine:
from haversine import haversine

# Use the haversine doc example, but with lists instead of tuples
lyon = [45.7597, 4.8422]
paris = [48.8567, 2.3508]
haversine(lyon, paris)
> 392.2172595594006
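If the goal is to build that list-like column inside Spark itself before collecting, here is a minimal sketch (reusing the question's sample data) that swaps F.struct for F.array; the array comes back to Python as a plain list, which, as shown above, haversine accepts just like a tuple:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame([('A', 4, 5), ('B', 6, 9)], ["id", "val1", "val2"])

# F.array builds an array column, displayed as [4, 5] and [6, 9]
sdf = sdf.withColumn('values', F.array(F.col('val1'), F.col('val2')))
sdf.show()

# Collected rows carry Python lists, e.g. Row(id='A', val1=4, val2=5, values=[4, 5])
print(sdf.collect()[0]['values'])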

How to read partitioned table from BigQuery to Spark dataframe (in PySpark)

I have a BQ table and it's partitioned by the default _PARTITIONTIME. I want to read one of its partitions into a Spark dataframe (PySpark). However, the spark.read API doesn't seem to recognize the partition column. Below is the code (which doesn't work):
table = 'myProject.myDataset.table'
df = spark.read.format('bigquery').option('table', table).load()
df_pt = df.filter("_PARTITIONTIME = TIMESTAMP('2019-01-30')")
The partition is quite large, so I'm not able to read it as a pandas dataframe.
Thank you very much.
Good question
I filed https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/50 to track this.
A workaround is the filter option on read:
df = spark.read.format('bigquery').option('table', table) \
    .option('filter', "_PARTITIONTIME = '2019-01-30'").load()
This should work today.
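For completeness, here is a hedged self-contained sketch of that workaround (the table name is a placeholder, and it assumes the spark-bigquery connector has already been supplied to the session, e.g. via --packages or --jars):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = 'myProject.myDataset.table'
df_pt = (spark.read.format('bigquery')
         .option('table', table)
         # The connector applies this filter when reading from BigQuery,
         # so only the requested partition should be scanned.
         .option('filter', "_PARTITIONTIME = '2019-01-30'")
         .load())
df_pt.show(5)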
Try using the "$" partition decorator: https://cloud.google.com/bigquery/docs/creating-partitioned-tables
So, the table you'd be pulling from is "myProject.myDataset.table$20190130"
table = 'myProject.myDataset.table'
partition = '20190130'
df = spark.read.format('bigquery').option('table', f'{table}${partition}').load()
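If the partition date arrives as a Python date rather than a hard-coded string, a small sketch of building the decorator suffix (the table name is again a placeholder):
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = 'myProject.myDataset.table'
day = date(2019, 1, 30)
partition = day.strftime('%Y%m%d')  # '20190130', the decorator format

df = spark.read.format('bigquery').option('table', f'{table}${partition}').load()
df.show(5)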

Applying transformations with filter or map which one is faster Scala spark

I am trying to do some transformations on a dataset with Spark using Scala. Currently I am using Spark SQL, but I want to shift the code to the native Scala API. I want to know whether to use filter or map for operations like matching values in a column and getting a single column, after the transformation, into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL. Can someone tell me an alternative way to write the same using map or filter on the dataset, and which of the two is faster?
You can read the documentation on the Apache Spark website. Here is the link to the API documentation: https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example:
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you want to rewrite your SQL query with the DataFrame API. Your query reads all columns from table TABLE and filters rows where COLUMN is empty. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as with your SQL. Use the dataFrame.explain(true) method to see what Spark will do.

Spark DataFrame groupBy

I have Spark Java code that looks like this. The code pulls data from an Oracle table using JDBC and displays the groupBy output.
DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
jdbcDF.show();
jdbcDF.groupBy("VA_HOSTNAME").count().show();
Long ll = jdbcDF.count();
System.out.println("ll="+ll);
When I run the code, jdbcDF.show() works, whereas the groupBy and count print nothing, and no errors are thrown.
My column name is correct. I tried printing that column and it worked, but with groupBy it's not working.
Can someone help me with the DataFrame output? I am using Spark 1.6.3.
You can try aggregating explicitly with count (in Java, count comes from org.apache.spark.sql.functions):
import static org.apache.spark.sql.functions.count;

jdbcDF.groupBy("VA_HOSTNAME").agg(count("*")).show();