SQLBindParameter with variable-length strings - DB2

How do you use SQLBindParameter to write an array of strings to a VARCHAR field in DB2 in a memory-efficient way?
The example in the DB2 docs does this
SQLCHAR Description[NUM_PRODS][257] = {
"Aquarium-Glass-25 litres", "Aquarium-Glass-50 litres",
"Aquarium-Acrylic-25 litres", "Aquarium-Acrylic-50 litres",
"Aquarium-Stand-Small", "Aquarium-Stand-Large",
"Pump-Basic-25 litre", "Pump-Basic-50 litre",
"Pump-Deluxe-25 litre", "Pump-Deluxe-50 litre",
"Pump-Filter-(for Basic Pump)",
"Pump-Filter-(for Deluxe Pump)",
"Aquarium-Kit-Small", "Aquarium-Kit-Large",
"Gravel-Colored", "Fish-Food-Deluxe-Bulk",
"Plastic-Tubing"
};
rc = SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, SQL_C_CHAR, SQL_VARCHAR, 257, 0, Description, 257, NULL);
I can get that to work without issues, but it isn't very efficient, since each string is stored using 256 characters (plus a null terminator) regardless of its actual length. More generally, if you had one very long string (say 500 chars) and every other string was one character long, you would still need a two-dimensional array of size [NUM_STRINGS][500], which wastes a lot of memory.
What I would like to do is pass SQLBindParameter an array that looks like
SQLCHAR* Description[NUM_STRINGS];
where each element of the array points to a string. This would be more memory-efficient, since each string only uses the space it needs, but I can't figure out how to get it to work using SQLBindParameter. Any help would be appreciated.
NOTE: General answers for any DB would be great but answers that are specific to DB2 would also be helpful.
The sizes I am working with involve millions of strings with widely varying lengths so being memory-efficient is a significant factor.
I am actually using DB2 on Linux, but for this specific use case, the only example I could find was in the DB2 for z/OS docs. It does work correctly, though, except for the memory-usage issue.
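For what it's worth, here is a hedged sketch (not the method from the docs) of one memory-efficient alternative: data-at-execution, where each row's string lives in its own exactly-sized buffer and is streamed with SQLPutData. It gives up the batched parameter-array insert in exchange for one SQLExecute per row; rc and hstmt are as in the docs example, NUM_STRINGS and Description are placeholders, and some drivers want SQL_LEN_DATA_AT_EXEC(len) instead of SQL_DATA_AT_EXEC.
SQLCHAR *Description[NUM_STRINGS];        /* each element points to its own string */
SQLLEN   lenOrInd = SQL_DATA_AT_EXEC;     /* "value supplied at execute time" */

rc = SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, SQL_C_CHAR, SQL_VARCHAR,
                      257, 0,
                      (SQLPOINTER) 1,     /* token handed back by SQLParamData */
                      0, &lenOrInd);

for (int i = 0; i < NUM_STRINGS; i++) {
    rc = SQLExecute(hstmt);               /* returns SQL_NEED_DATA */
    while (rc == SQL_NEED_DATA) {
        SQLPOINTER token;
        rc = SQLParamData(hstmt, &token); /* which parameter wants data? */
        if (rc == SQL_NEED_DATA)
            SQLPutData(hstmt, Description[i], SQL_NTS);
    }
}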

Related

Limitations in FDW code on passing List* metadata between GetForeignPlan() and BeginForeignScan()

I'm writing an FDW for a non-SQL data source. The platform is Windows 10, C (MS Visual Studio), PostgreSQL 14. My FDW code is modeled on FDW examples I have studied, such as SQLite, JSON, CSV, File, DB2, and others. There is a common practice of storing metadata in a pg List as part of GetForeignPlan() and passing that via fdw_private. This List is then retrieved in BeginForeignScan() and made available to IterateForeignScan().
My question is: how do I share a large amount of metadata through the fdw_private mechanism? Using the pg List macros, I have been unable to store and retrieve more than 5 List cells. I tried passing a single List cell with a JSON string containing all of my metadata, but the string becomes corrupted along the way.
List* mdList = NIL;
char feJSON[MAX_FESTATE_JSON_SIZE];
/// ... some code to format a JSON string into the feJSON buffer.
mdList = list_make1(makeString(feJSON));
return mdList;
I have also used lappend() to try to extend the List to more than 5 cells, but the additional cells' values are not maintained across the callbacks...
#define serializeInt(x) makeConst(INT4OID, -1, InvalidOid, 4, Int32GetDatum((int32)(x)), false, true)
result = list_make5(makeInteger(feState->start), makeInteger(feState->rows), makeString(feState->ltName), makeString(feState->ftName), makeInteger(feState->myTable->npgcols));
result = lappend(result, serializeInt(feState->myTable->ncols));
There is a hint in the pg source plannodes.h suggesting the use of bytea (byte array?) as an alternative to the pg List structure, but I'm not finding any examples for that, so far.
I suspect certain characters in the JSON string may be part of the issue, but I also found that...
#define MAX_FESTATE_JSON_SIZE 2048
List* serializeMetadata(...)
{
List* mdList = NIL;
/// ...stuff...
char smokeTest[MAX_FESTATE_JSON_SIZE + 1];
memset(smokeTest, 'X', MAX_FESTATE_JSON_SIZE);
smokeTest[MAX_FESTATE_JSON_SIZE] = '\0';
mdList = list_make1(makeString(smokeTest));
return mdList;
}
...revealed some truncation of the List as a return value (but it's a pointer!?). So I'm not sure whether casting a bytea* to a List* will help, but that's where I'm headed.
Suggestions are most welcome!
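One hedged observation, based on the PostgreSQL sources rather than on your project: makeString() stores the pointer it is given and does not copy the bytes, so a String node built from a local char[] buffer (feJSON, smokeTest) points at stack memory that is gone by the time the planner copies the plan or BeginForeignScan() reads fdw_private; the List itself is not limited to five cells (list_make5 is just the largest convenience macro). A minimal sketch of that fix, assuming PostgreSQL 14 headers:
#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/value.h"              /* makeString() */

static List *
serializeMetadata(const char *feJSON)
{
    /* pstrdup() copies the JSON into the current memory context, which
       outlives this function, unlike the caller's stack buffer. */
    return list_make1(makeString(pstrdup(feJSON)));
}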

DataFrame keying using the pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to apply an example I saw in one of Wes's videos and notebooks to my data. I have a CSV file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a DataFrame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do get a result when I use the file path, but as I understand it (and saw in previous examples) I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex Series; I simply did it the wrong way.
The index's first level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp, meaning that for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame, you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; you could test for multiple values at once, for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
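For reference, a hedged end-to-end version of the above (the CSV file name is made up):
import pandas as pd

df = pd.read_csv('scores.csv')                    # columns: filePath, vp, score
res = df.groupby(['filePath', 'vp']).size()       # Series with a 2-level MultiIndex

res[r'E:\Audio\7168965711_5601_4.wav']            # partial indexing works on the outer level
# res['Cust_4062144116']                          # KeyError: the inner level alone won't work

pd.DataFrame(res).xs('Cust_4062144116', level=1)  # slice the inner level
res[res.index.get_level_values(1) == 'Cust_4062144116']   # same idea, staying a Series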

Solr search error when dealing with Arabic string

I've been struggling with Solr search for Arabic for several days and have run some experiments. Here is a simple reproduction of the problem.
After I store an Arabic sentence (for now only the single word السوري) in the database and have Solr index it, I query it with q=*:*&wt=python (without the wt part, the response is garbled characters), and the response is:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
The actual word I store there for indexing is encoded another way:
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
As you can tell, there is a one-to-one correspondence \xd8 ↔ \u00d8, but I don't know the name of this encoding, so I cannot convert it. And when I do the search as <>/select/?q=السوري&wt=python, the response is:
{'responseHeader':{'status':0,'QTime':0,'params':{'wt':'python','q':u'\u0627\u0644\u0633\u0648\u0631\u064a'}},'response':{'numFound':0,'start':0,'docs':[]}}
No docs are found, and it seems to use a third encoding, u'\u0627\u0644\u0633\u0648\u0631\u064a'. If I take it and encode('utf8'), it converts back to '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'.
In summary, when it (السوري) is in my code (Python) or in the database (MySQL),
it appears as form1:
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
When it is indexed by Solr, it converts to form2:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
And when I use <>/select/?q=السوري&wt=python to query from the browser (Google Chrome), it becomes form3:
'\u0627\u0644\u0633\u0648\u0631\u064a'
(which can be converted back to form1 with encode('utf8')). But since they are different, the search matches nothing.
Therefore, these three different encodings may be the core problem. Could anyone help me figure this out and solve the search problem?
Thanks in advance.
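For what it's worth, form2 looks like classic mojibake: the UTF-8 bytes of form1 decoded as cp1252 (U+201E and U+02C6 are the cp1252 characters for bytes 0x84 and 0x88). A hedged round-trip to check this:
form2 = u'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
form1 = form2.encode('cp1252')   # back to the raw UTF-8 bytes '\xd8\xa7\xd9\x84...'
form3 = form1.decode('utf8')     # u'\u0627\u0644\u0633\u0648\u0631\u064a', i.e. السوري
assert form3.encode('utf8') == form1
If that assert passes, the stored bytes themselves are fine, which would point at a decoding step (the indexing path or the HTTP layer reading UTF-8 bytes as Latin-1/cp1252) rather than at the word itself.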

Problem while using NSPredicate

SQL query:
select * from test_mart
where replace(replace(replace(replace(replace(replace(lower(name),'+'),'_'),'the '),' the'),'a '),' a')='tariq'
I could fire the above query easily if I were simply using SQLite... but in the current project I am using Core Data, and I am not very familiar with NSPredicate.
The functionality calls for removing all BUT alphanumeric characters, which means removing special characters.
The characters that should be valid in the comparison would be
ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890
But we should not fail the comparison for the following characters
:;,~`!##$%^&*()_-+="'/?.>,<|\
Or for the following words
'the' 'an' 'a'
Some examples:
'Walmart' would be seen as the same payee as 'Wal-Mart'
'The Shoe Store' would be seen as the same payee as 'Shoe Store'
'Domino's Pizza' would be seen as the same payee as 'Dominos Pizza'
'Test Payee;' would be seen as the same payee as 'Test Payee'
Can anyone suggest appropriate predicates/regular expressions?
Thanks
I would have an extra field in the database which would be a processed version of the original with all the irrelevant characters stripped out, then use that for comparisons.
You might want to look at the Soundex algorithm, which may suit your purposes better... Soundex
It seems to me that you would want to normalize your data before it ever gets put into the Core Data store. So if you're given "Wal-Mart", normalize it to "walmart" once, and then save it. Then you won't be doing all of this expensive on-the-fly comparison many, many times.
The normalization would be fairly simple, given your rules:
Strip the words "a", "an", and "the"
Remove punctuation
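A hedged sketch of those two steps, in Swift since no project code was shown (the attribute and function names are made up):
import Foundation

func normalizedPayeeName(_ name: String) -> String {
    var s = name.lowercased()
    // Drop the whole words "the", "an", "a".
    s = s.replacingOccurrences(of: "\\b(the|an|a)\\b", with: "", options: .regularExpression)
    // Keep letters and digits only.
    s = s.replacingOccurrences(of: "[^a-z0-9]", with: "", options: .regularExpression)
    return s
}

// "Wal-Mart" and "The Shoe Store;" become "walmart" and "shoestore".
let predicate = NSPredicate(format: "normalizedName == %@", normalizedPayeeName("Wal-Mart"))
Stored that way, the comparison becomes a plain equality predicate instead of nested replace() calls.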

Generate unique 3 letter/number code and compare to existing ones in PHP/MySQL

I'm making a code-generation script for the UN/LOCODE system, and the database has unique 3-letter/number codes within every country. So, for example, the database contains "EE TLL", EE being the country (Estonia) and TLL the unique code inside Estonia; "AR TLL" can also exist (the country code and the 3-letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations; the user also has the option of entering the 3-letter/number code him/herself (which will automatically be checked against the database before submission).
Finally, neither 0 nor 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've come up with:
1. I'd check from AAA through 999, but then each code would require a new query (slow?).
2. I could store all ~40,000 possibilities in an array and subtract the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually; maybe 40,000 isn't such a big number).
3. I could generate a random code, hope it doesn't exist yet, and check whether it does; if it does, start over. That's just risk-taking.
Is there some magic MySQL query/PHP script that can get me the next available code?
I would go with number 2; it is simple, and 40,000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion should be trivial because you have a total of 34 characters (A-Z, 2-9).
I would go for option 1 (i.e. do a sequential search), adding a table that gives the last assigned code per country (i.e. such that AAA..code are all assigned already). When assigning a new code through the sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give a meaning to A-Z = 0..25, and 2..9 = 26..33. Then, XYZ is the number X*34^2+Y*34+Z == 23*1156+24*34+25 == 27429. This should be doable using standard MySQL functions, in particular using CONV.
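To make the base-34 idea concrete, here is a hedged PHP sketch (the helper names are made up):
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';   // 34 characters, as above

function codeToNumber($code, $allowed) {
    $n = 0;
    for ($i = 0; $i < 3; $i++)
        $n = $n * 34 + strpos($allowed, $code[$i]);
    return $n;
}

function numberToCode($n, $allowed) {
    $code = '';
    for ($i = 0; $i < 3; $i++) {
        $code = $allowed[$n % 34] . $code;
        $n = (int)($n / 34);
    }
    return $code;
}

echo codeToNumber('XYZ', $allowed);                              // 27429, matching the arithmetic above
echo numberToCode(27429, $allowed);                              // XYZ
echo numberToCode(codeToNumber('TLL', $allowed) + 1, $allowed);  // TLM, the next code after TLL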
I went with the 2nd option. I was also able to make a script that tries to match the name as closely as possible: for example, for Tartu it will try to match T**, then TA*, and if possible TAR; if not, it will try TAT, since T is the next letter after R in Tartu.
The code is quite extensive; I'll just post the part that picks the first available code:
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);
$codes = array();
// store all possibilities in a huge array
for ($i = 0; $i < $length; $i++)
    for ($j = 0; $j < $length; $j++)
        for ($k = 0; $k < $length; $k++)
            $codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);
$used = array();
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
    $used[] = $result['code'];
// array_diff() keeps the original keys, so re-index before taking element 0
$remaining = array_values(array_diff($codes, $used));
$code = $remaining[0];
Thanks for your opinion, this will be the key to transport codes all over the world :)