Retrieve binary representation of series? - python-polars

Polars has a pl.Binary datatype with very little documentation. I'd like to get the binary representation of values in a DataFrame, but it appears that the data type and the resulting binary casts are independent. For example:
import polars as pl
pl.Series([1], dtype=pl.UInt16).cast(pl.Binary)[0].hex()
yields b'\x31' which corresponds to the character '1'. So it appears pl.Binary first casts as Utf8. Is there any way to convert the values to a series of Byte arrays representing the underlying dtypes?
Rationale: I have a pandas tool that I use to write dataframes into native BCP native format for SQL Server and bulk-uploads them using bcp.exe. The result is table uploads that are typically 300x faster than pandas.DataFrame.to_sql.
I'd like to implement something similar to polars without having to go through numpy (which my pandas implementation currently does), whereby I can cast to np.void() and perform a sum (fold) across columns to concatenate binary arrays. I'm looking to do something similar wtih polars.

Related

How to write array[] bytea using libpq PQexecParams?

I need help with write array[] bytea using libpq PQexecParams
I'v done a simple version where i'm write a single binary data in a single bytea arg using PQexecParams like this solution Insert Binary Large Object (BLOB) in PostgreSQL using libpq from remote machine
But I have problems with write array[] bytea like this
select * func(array[3, 4], array[bytea, bytea])
Unless you want to read the PostgreSQL source to figure out the binary format for arrays, you will have to use the text format, e.g.
{\\xDEADBEEF,\\x00010203}
The binary format for arrays is defined in array_send in src/backend/utils/adt/arrayfuncs.c. Have a look at the somments and definitions in src/include/utils/array.h as well. Consider also that all integers are sent in “network bit order”.
One way you can examine the binary output format and saving yourself a lot of trouble experimenting is to use binary copy, e.g.
COPY (SELECT ARRAY['\xDEADBEEF'::bytea,'\x00010203'::bytea])
TO '/tmp/file' (FORMAT 'binary');

why pyspark pandas udf grouped map serialization is designed this way?

I am trying to get a concrete understanding on how the pandas UDF grouped map is working. While looking at the code here[1], i see that first the arrow object is converted into pandas series and then pd.concat is applied to create the full data frame.
What is confusing for me is , since arrow has support for Tabular format[2] and an API exist for converting table format to pandas in pyarrow[3] , why is that not being used.
I am pretty sure i am missing out on something very basic, any pointers would be useful ?
[1] https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/sql/pandas/serializers.py#L234-L276
[2] https://arrow.apache.org/docs/cpp/tables.html
[3] https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should work indeed, but I did not try it.
However, I have another solution which doesn't require to change input format and is not too complicated to implement. Indeed the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
tDenormalize component will Denormalize your column value (column 2), based on value on id column (column 1), separating fields with a specific delimiter (";" in my case), resulting as shown in 2 rows.
tMap : split the aggregated column into multiple columns, by using java's String.split() method and spreading the resulting array into multiple columns. The tMap should like like this:
Since Talend doesn't accept to store Array objects, make sure to store the splitted String in Object format. Then, cast that object into Array on the right side of the Map.
That approach should give you the expected result.
IMPORTANT:
tNormalize might shuffle the rows, meaning for bigger input, you might encounter unsorted output. Make sure to sort it if needed or use tDenormalizeSortedRow instead.
tNormalize is similar to an aggregation component meaning it scans the whole input before processing, which results into possible performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with 1 as id, and 6 entries with 2 as id). 6 columns are expected meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a column with a key.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInoutDelimited .
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited

Scala: wrapper for Breeze DenseMatrix for column and row referencing

I am new to Scala. Looking at it as an alternative to MATLAB for some applications.
I would like to program in Scala a wrapping class in order to be able to assign column names ("QuantityQ" && "QuantityP" -> Range) and row names (dates -> Range) to Breeze DenseMatrices (http://www.scalanlp.org/) in order to reference columns and rows.
The usage should resemble Python Pandas or Scala Saddle (http://saddle.github.io).
Saddle is very interesting but its usage is limited to 2D matrices. A huge limitation.
My Ideas:
Columns:
I thought a Map would do the job for colums but that may not be the best implementation.
Rows:
For rows, I could maintain a separate Breeze vector with timestamps and provide methods that convert dates into timestamps, doing the numbercruncing through Breeze. This comes with a loss of generality as a user may want to give whatever string names to rows.
Concerning dates I use nscala-time (a scala wrapper for joda)?
What are the drawbacks of my implementation?
Would you design the data structure differently?
Thank you for your help.

What is the purpose of the input output functions in Postgresql 9.2 user defined types?

I have been implementing user defined types in Postgresql 9.2 and got confused.
In the PostgreSQL 9.2 documentation, there is a section (35.11) on user defined types. In the third paragraph of that section, the documentation refers to input and output functions that are used to construct a type. I am confused about the purpose of these functions. Are they concerned with on-disk representation or only in-memory representation? In the section referred to above, after defining the input and output functions, it states that:
If we want to do anything more with the type than merely store it,
we must provide additional functions to implement whatever operations
we'd like to have for the type.
Do the input and output functions deal with serialization?
As I understand it, the input function is the one which will be used to perform INSERT INTO and the output function to perform SELECT on the type so basically if we want to perform an INSERT INTO then we need a serialization function embedded or invoked in the input or output function. Can anyone help explain this to me?
Types must have a text representation, so that values of this type can be expressed as literals in a SQL query, and returned as results in output columns.
For example, '2013-20-01' is a text representation of a date. It's possible to write VALUES('2013-20-01'::date) in a SQL statement, because the input function of the date type recognizes this string as a date and transforms it into an internal representation (for both using it in memory and storing to disk).
Conversely, when client code issues SELECT date_field FROM table, the values inside date_field are returned in their text representation, which is produced by the type's output function from the internal representation (unless the client requested a binary format for this column).