Maximum number of columns shown when formatting DataFrames in python-polars?

How do I set the maximum number of columns shown when formatting DataFrames?
I found that in rust-polars I need to modify the "POLARS_FMT_MAX_COLS" environment variable.
How can I do this in Python?

You can use polars.Config.set_tbl_cols, which takes the maximum number of columns to display.
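A minimal sketch of that call, assuming polars is installed and importable as polars. The limit of 20 and the 30-column frame are made up for illustration. Setting POLARS_FMT_MAX_COLS through os.environ is an assumption carried over from the rust-polars remark in the question, used here only as a fallback.

```python
import os

MAX_COLS = 20  # illustrative limit, not a recommended value

# The variable named in the question. Setting it from Python is an assumption;
# it also keeps this sketch runnable where polars is not installed.
os.environ["POLARS_FMT_MAX_COLS"] = str(MAX_COLS)

try:
    import polars as pl
except ImportError:
    pl = None  # polars is not available in this environment

if pl is not None:
    # The call named in the answer.
    pl.Config.set_tbl_cols(MAX_COLS)
    # A made-up 30-column frame, wide enough to see the limit take effect.
    wide = pl.DataFrame({f"col_{i}": [i] for i in range(30)})
    print(wide)
```

Printing the same frame before and after the call is a quick way to confirm how many columns are shown.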

Related

Data Factory / Data Flow - conditional split based on a number of records

I need to split a huge dataset into multiple files and each file must not have more than 100 000 rows.
Is this possible with Data Flow and the conditional split?
If you simply want to split by a fixed number of rows, here is a simple test I created.
Declare a parameter inside the data flow to store the row count of your source dataset. If your source dataset is Azure SQL, you can use a Lookup activity to get the max Row_No. If your source dataset is Azure Storage, you can use an Azure Function activity to get the max Row_No. Then pass the value to the parameter.
For this test, I set a static default value.
Then set the Number of partitions expression to $RowCount/10 if you want 10 rows per file.
We can set file names after division here.
My source dataset contains 50 rows, so ADF splits it into 5 files. Judging by the Id column, each file receives 10 rows picked at random.
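The partition arithmetic can be sketched in plain Python. This is not ADF expression syntax; partition_count is a hypothetical helper, and it assumes rows are spread evenly across partitions. It rounds up so that a row count that is not an exact multiple of the limit still stays under the limit per file.

```python
def partition_count(row_count: int, max_rows_per_file: int) -> int:
    """Smallest number of files such that no file exceeds max_rows_per_file."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if max_rows_per_file <= 0:
        raise ValueError("max_rows_per_file must be positive")
    # Integer ceiling division; use at least one partition even for an empty source.
    return max(1, -(-row_count // max_rows_per_file))


print(partition_count(50, 10))            # the 50-row test with 10 rows per file
print(partition_count(250_001, 100_000))  # one row over the limit needs an extra file
```

With the 50-row test this gives 5 partitions, matching the 5 files reported in the answer.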
You can achieve this with two data flows: one to get the row count and another to partition. In the future, this can also be achieved in a single data flow using a cache sink.

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not to the outcome of the process.
You can do this, but it takes a bit of extra work in your Recipe. Set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter the rows on one chosen value of that integer, so that only the rows that drew that value are kept, and your output will contain only the randomized subset. Have the last step of the recipe remove this temporary column.
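The idea can be sketched in plain Python rather than in Dataprep's recipe language. sample_one_in_n is a hypothetical helper, and the row source, the kept value 7, and the seed are all made up for illustration. Each row draws a random integer from 1 to n, and only rows that draw the chosen value are kept, so roughly 1/n of the input survives.

```python
import random


def sample_one_in_n(rows, n=1000, keep=1, seed=None):
    """Keep each row with probability 1/n by matching a random draw against `keep`."""
    if not 1 <= keep <= n:
        raise ValueError("keep must lie between 1 and n")
    rng = random.Random(seed)
    # randint is inclusive on both ends, like a RANDBETWEEN(1, n) column.
    return [row for row in rows if rng.randint(1, n) == keep]


rows = range(200_000)  # stand-in for the real dataset
subset = sample_one_in_n(rows, n=1000, keep=7, seed=42)
print(len(subset))  # close to 200, but not exactly 200
```

The output size is only approximately 1/n of the input, which is the trade-off of this approach compared with taking an exact number of rows.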
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the two functions that generate a random number within a range:
randbetween :
rand :
These methods can be used to slice an "even" portion of your data. As suggested, use randbetween(1,1000), then pick a single value x between 1 and 1000 and keep only the rows where the random number equals x, because that is 1/1000 of the data (a million out of a billion).
Alternatively, if you just want to have a million records in your output, but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows, regardless of how many rows there are,
you can use 2 of these 3 row filtering methods: (top rows / range)
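As a rough plain-Python analogue of the "top rows" and "range" filters (not Dataprep syntax), the sketch below uses hypothetical helpers top_rows and row_range. It assumes rows are numbered from 1 and does not need to know the total size of the input.

```python
from itertools import islice


def top_rows(rows, n):
    """Return the first n rows, stopping early instead of reading the whole input."""
    return list(islice(rows, n))


def row_range(rows, start, stop):
    """Return rows numbered start..stop inclusive, counting from 1."""
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    return list(islice(rows, start - 1, stop))


data = (f"row-{i}" for i in range(1, 11))  # made-up ten-row input
print(top_rows(data, 3))

data = (f"row-{i}" for i in range(1, 11))
print(row_range(data, 4, 6))
```

Because islice stops as soon as it has enough rows, neither helper scans the rest of the input.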
P.S
By understanding the $sourcerownumber metadata parameter (see the in-product documentation), you can filter or keep a portion of the data (as per the first scenario) in one step, that is, without creating an additional column.
BTW, an easy way to discover how-to's in Trifacta is to type what you're looking for in the "search-transformation" pane (accessed via ctrl-k). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn and also, I think that .map() can be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip,src_port,dst_ip,dst_port and received_time, you can try:
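import org.apache.spark.sql.functions.{col, count, max}
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(count("received_time").as("row_count"), max(col("received_time")).as("max_received_time"))
The code above computes, for each combination of the group-by columns, the number of received_time values and the latest (maximum) received_time. To get one row per distinct src_ip, as the question asks, group by src_ip alone and name the maximum last_seen.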
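To make the result of that aggregation concrete, here is a plain-Python sketch of the same count-and-maximum logic. It is not Spark code: count_and_latest is a hypothetical helper, the sample rows are made up, and it assumes every row has a received_time.

```python
def count_and_latest(rows, keys):
    """Map each key tuple to (row count, latest received_time)."""
    stats = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        count, latest = stats.get(key, (0, None))
        seen = row["received_time"]
        if latest is None or seen > latest:
            latest = seen
        stats[key] = (count + 1, latest)
    return stats


# Made-up sample rows; received_time is a plain integer timestamp here.
rows = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.9", "dst_port": 80, "received_time": 100},
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.9", "dst_port": 80, "received_time": 170},
    {"src_ip": "10.0.0.1", "src_port": 5001, "dst_ip": "10.0.0.9", "dst_port": 443, "received_time": 150},
    {"src_ip": "10.0.0.2", "src_port": 6000, "dst_ip": "10.0.0.9", "dst_port": 80, "received_time": 120},
]

per_flow = count_and_latest(rows, ["src_ip", "src_port", "dst_ip", "dst_port"])
per_src_ip = count_and_latest(rows, ["src_ip"])  # one entry per distinct src_ip
print(per_src_ip)
```

Grouping by all four columns yields one entry per flow, while grouping by src_ip alone yields the per-address count and last_seen value the question describes.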

How to find widths of a Flat File using read_fwf() in Pandas?

I have downloaded some data from the Mainframe (.DATA format) and I need to parse it to create a PySpark DataFrame and perform some operations on it. Before doing that, I created a sample file and read it using the read_fwf() function of pandas.
I was able to read the file and create the DataFrame, but I encountered some problems, such as:
Padding of "0" in the first column of some of the rows
Repeated headers while reading the data
I can handle these issues; the key challenge I am facing is identifying the widths of the columns. I currently have 65 columns, but in order to create a PySpark DataFrame I need to know the widths of these columns. Can read_fwf() tell me which widths it is using for each column?
And is there a read_fwf()-like function in PySpark, or would we have to write MapReduce code for it?

PostgreSQL - max number of parameters length change?

Can anyone please help me increase the maximum number of parameters in PostgreSQL? I don't want to do this any other way; I want to keep using a normal query as I do now. But if I am passing 90,000 parameters in an IN query, how can I make it possible to execute this query?
If you believe this page https://msdn.microsoft.com/en-us/library/ms143432.aspx, the number of parameters, for example in a stored procedure, statement, or function, is fixed.