Postgres COPY command - fields with commas, quoted with double quotes

Postgres COPY command - fields with commas, quoted with double quotes - postgresql

I've searched and found a few posts relating to postgres csv imports, but nothing that solves my current problem.
I use the postgres copy command all the time to bring data from hetergeneous data sources into our system. Currently struggling with a 100-million row .csv file, comma-quote delimited. Issue is with rows like so:
009098,0981098094,"something","something else",""this one, well, is a problem"", "another thing"
Fields enclosed in double-quotes with embedded commas. The fields are not correctly parsed and I get the error:
"ERROR: extra data after last expected column"
Usually when this arises I deal with the offending rows ad hoc, but this file is so huge I'm hoping for some more general way to defend against it. Asking for a revised data format is not a possibility.
copy mytable from '/path/to/file.csv' csv header quote '"'

That's malformed CSV. You double a double quote to embed a double quote inside a quote field; for example:
"where","is ""pancakes""","house?"
has three values:
where
is "pancakes"
house?
The row you're having trouble with has stray doubled double quotes:
009098,0981098094,"something","something else",""this one, well, is a problem"", "another thing"
^^ ^^
I don't think there is anything that COPY can do about this as the correct version is ambiguous: should it be "this one, well, is a problem" or should it be """this one, well, is a problem"""?
I think you'll have to fix it by hand. A quick sed one-liner should be able to do the job if you can uniquely identify the broken row.
For reference purposes, the closest thing I've seen to a CSV standard is RFC 4180 and section two has this to say:
5. Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
[...]
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"

Here is code based on the CSV code from The Practice of Programming by Kernighan and Plauger that has been adapted to deal with your weird malformed CSV data. (It wasn't all that hard to do; I already had the main code working and packaged, so I just had to add the CSV output functions and to modify the advquoted() function to handle the weird format in this question.
csv2.h
/*
#(#)File: $RCSfile: csv2.h,v $
#(#)Version: $Revision: 2.1 $
#(#)Last changed: $Date: 2012/11/01 22:23:07 $
#(#)Purpose: Scanner for Comma Separated Variable (CSV) Data
#(#)Author: J Leffler
#(#)Origin: Kernighan & Pike, 'The Practice of Programming'
*/
/*TABSTOP=4*/
#ifndef CSV2_H
#define CSV2_H
#ifdef __cplusplus
extern "C" {
#endif
#ifdef MAIN_PROGRAM
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_csv2_h[] = "#(#)$Id: csv2.h,v 2.1 2012/11/01 22:23:07 jleffler Exp $";
#endif /* lint */
#endif /* MAIN_PROGRAM */
#include <stdio.h>
extern char *csvgetline(FILE *ifp); /* Read next input line */
extern char *csvgetfield(size_t n); /* Return field n */
extern size_t csvnfield(void); /* Return number of fields */
extern void csvreset(void); /* Release space used by CSV */
extern int csvputfield(FILE *ofp, const char *field);
extern int csvputline(FILE *ofp, char **fields, int nfields);
extern void csvseteol(const char *eol);
#ifdef __cplusplus
}
#endif
#endif /* CSV2_H */
csv2.c
/*
#(#)File: $RCSfile: csv2.c,v $
#(#)Version: $Revision: 2.1 $
#(#)Last changed: $Date: 2012/11/01 22:23:07 $
#(#)Purpose: Scanner for Comma Separated Variable (CSV) Data
#(#)Modification: Deal with specific malformed CSV
#(#)Author: J Leffler
#(#)Origin: Kernighan & Pike, 'The Practice of Programming'
*/
/*TABSTOP=4*/
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_csv2_c[] = "#(#)$Id: csv2.c,v 2.1 2012/11/01 22:23:07 jleffler Exp $";
#endif /* lint */
/*
** See RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt).
**
** Specific malformed CSV - see SO 13183644 (http://stackoverflow.com/questions/13183644).
** Data contains malformed CSV fields like: OK,""this is a problem"",OK
** Two (but not three) field quotes at the start extract as "this is a problem" (with the quotes).
*/
#include "csv2.h"
#include <stdlib.h>
#include <string.h>
enum { NOMEM = -2 };
static char *line = 0; /* Input line */
static char *sline = 0; /* Split line */
static size_t maxline = 0; /* Size of line[] and sline[] */
static char **field = 0; /* Field pointers */
static size_t maxfield = 0; /* Size of field[] */
static size_t nfield = 0; /* Number of fields */
static char fieldsep[]= ","; /* Field separator characters */
static char fieldquote = '"'; /* Quote character */
static char eolstr[8] = "\n";
void csvreset(void)
{
free(line);
free(sline);
free(field);
line = 0;
sline = 0;
field = 0;
maxline = maxfield = nfield = 0;
}
static int endofline(FILE *ifp, int c)
{
int eol = (c == '\r' || c == '\n');
if (c == '\r')
{
c = getc(ifp);
if (c != '\n' && c != EOF)
ungetc(c, ifp);
}
return(eol);
}
/* Modified to deal with specific malformed CSV */
static char *advquoted(char *p)
{
size_t i;
size_t j;
if (p[0] == fieldquote && (p[1] != *fieldsep && p[1] != fieldquote))
{
/* Malformed CSV: ""some stuff"" --> "some stuff" */
/* Find "\"\"," or "\"\"\0" to mark end of field */
/* If we don't find it, drop through to 'regular' case */
char *eof = strstr(&p[2], "\"\"");
if (eof != 0 && (eof[2] == *fieldsep || eof[2] == '\0'))
{
p[eof + 1 - p] = '\0';
return(eof + 2);
}
}
for (i = j = 0; p[j] != '\0'; i++, j++)
{
if (p[j] == fieldquote && p[++j] != fieldquote)
{
size_t k = strcspn(p+j, fieldsep);
memmove(p+i, p+j, k); // 1 -> i fixing transcription error
i += k;
j += k;
break;
}
p[i] = p[j];
}
p[i] = '\0';
return(p + j);
}
static int split(void)
{
char *p;
char **newf;
char *sepp;
int sepc;
nfield = 0;
if (line[0] == '\0')
return(0);
strcpy(sline, line);
p = sline;
do
{
if (nfield >= maxfield)
{
maxfield *= 2;
newf = (char **)realloc(field, maxfield * sizeof(field[0]));
if (newf == 0)
return NOMEM;
field = newf;
}
if (*p == fieldquote)
sepp = advquoted(++p);
else
sepp = p + strcspn(p, fieldsep);
sepc = sepp[0];
sepp[0] = '\0';
field[nfield++] = p;
p = sepp + 1;
} while (sepc == ',');
return(nfield);
}
char *csvgetline(FILE *ifp)
{
size_t i;
int c;
if (line == NULL)
{
/* Allocate on first call */
maxline = maxfield = 1;
line = (char *)malloc(maxline); /*=C++=*/
sline = (char *)malloc(maxline); /*=C++-*/
field = (char **)malloc(maxfield*sizeof(field[0])); /*=C++=*/
if (line == NULL || sline == NULL || field == NULL)
{
csvreset();
return(NULL); /* out of memory */
}
}
for (i = 0; (c = getc(ifp)) != EOF && !endofline(ifp, c); i++)
{
if (i >= maxline - 1)
{
char *newl;
char *news;
maxline *= 2;
newl = (char *)realloc(line, maxline); /*=C++=*/
news = (char *)realloc(sline, maxline); /*=C++-*/
if (newl == NULL || news == NULL)
{
csvreset();
return(NULL); /* out of memory */
}
line = newl;
sline = news;
}
line[i] = c;
}
line[i] = '\0';
if (split() == NOMEM)
{
csvreset();
return(NULL);
}
return((c == EOF && i == 0) ? NULL : line);
}
char *csvgetfield(size_t n)
{
if (n >= nfield)
return(0);
return(field[n]);
}
size_t csvnfield(void)
{
return(nfield);
}
int csvputfield(FILE *ofp, const char *ofield)
{
const char escapes[] = "\",\r\n";
if (strpbrk(ofield, escapes) != 0)
{
size_t len = strlen(ofield) + 2;
const char *pos = ofield;
while ((pos = strchr(pos, '"')) != 0)
{
len++;
pos++;
}
char *space = malloc(len+1);
if (space == 0)
return EOF;
char *cpy = space;
pos = ofield;
*cpy++ = '"';
char c;
while ((c = *pos++) != '\0')
{
if (c == '"')
*cpy++ = c;
*cpy++ = c;
}
*cpy++ = '"';
*cpy = '\0';
int rc = fputs(space, ofp);
free(space);
return rc;
}
else
return fputs(ofield, ofp);
}
int csvputline(FILE *ofp, char **fields, int nfields)
{
for (int i = 0; i < nfields; i++)
{
if (i > 0)
putc(',', ofp);
if (csvputfield(ofp, fields[i]) == EOF)
return EOF;
}
return(fputs(eolstr, ofp));
}
void csvseteol(const char *eol)
{
size_t nbytes = strlen(eol);
if (nbytes >= sizeof(eolstr))
nbytes = sizeof(eolstr) - 1;
memmove(eolstr, eol, nbytes);
eolstr[nbytes] = '\0';
}
#ifdef TEST
int main(void)
{
char *in_line;
while ((in_line = csvgetline(stdin)) != 0)
{
size_t n = csvnfield();
char *fields[n]; /* C99 VLA */
printf("line = '%s'\n", in_line);
for (size_t i = 0; i < n; i++)
{
printf("field[%zu] = '%s'\n", i, csvgetfield(i));
printf("field[%zu] = [", i);
csvputfield(stdout, csvgetfield(i));
fputs("]\n", stdout);
fields[i] = csvgetfield(i);
}
printf("fields[0..%zu] = ", n-1);
csvputline(stdout, fields, n);
}
return(0);
}
#endif /* TEST */
Compile the code with -DTEST to create a program with the example main() function. You need a C99 compiler; the code in main() uses a VLA (variable length array). You could avoid that with dynamic memory allocation or with pessimistic (overkill) memory allocation (an array of a few thousand pointers isn't going to kill most systems these days, but few CSV files will have a few thousand fields per line).
Example Data
Based closely on the data in the question.
009098,0981098094,"something","something else",""this one, well, is a problem"", "another thing"
123458,1234561007,"anything","nothing else",""this one, well, is a problem"","dohicky
503458,1234598094,"nothing","everything else","""this one, well, it isn't a problem""","abelone"
610078,1236100794,"everything","anything else","this ""isn't a problem"", he said.","Orcas Rule"
Example Output
line = '009098,0981098094,"something","something else",""this one, well, is a problem"", "another thing"'
field[0] = '009098'
field[0] = [009098]
field[1] = '0981098094'
field[1] = [0981098094]
field[2] = 'something'
field[2] = [something]
field[3] = 'something else'
field[3] = [something else]
field[4] = '"this one, well, is a problem"'
field[4] = ["""this one, well, is a problem"""]
field[5] = ' "another thing"'
field[5] = [" ""another thing"""]
fields[0..5] = 009098,0981098094,something,something else,"""this one, well, is a problem"""," ""another thing"""
line = '123458,1234561007,"anything","nothing else",""this one, well, is a problem"","dohicky'
field[0] = '123458'
field[0] = [123458]
field[1] = '1234561007'
field[1] = [1234561007]
field[2] = 'anything'
field[2] = [anything]
field[3] = 'nothing else'
field[3] = [nothing else]
field[4] = '"this one, well, is a problem"'
field[4] = ["""this one, well, is a problem"""]
field[5] = 'dohicky'
field[5] = [dohicky]
fields[0..5] = 123458,1234561007,anything,nothing else,"""this one, well, is a problem""",dohicky
line = '503458,1234598094,"nothing","everything else","""this one, well, it isn't a problem""","abelone"'
field[0] = '503458'
field[0] = [503458]
field[1] = '1234598094'
field[1] = [1234598094]
field[2] = 'nothing'
field[2] = [nothing]
field[3] = 'everything else'
field[3] = [everything else]
field[4] = '"this one, well, it isn't a problem"'
field[4] = ["""this one, well, it isn't a problem"""]
field[5] = 'abelone'
field[5] = [abelone]
fields[0..5] = 503458,1234598094,nothing,everything else,"""this one, well, it isn't a problem""",abelone
line = '610078,1236100794,"everything","anything else","this ""isn't a problem"", he said.","Orcas Rule"'
field[0] = '610078'
field[0] = [610078]
field[1] = '1236100794'
field[1] = [1236100794]
field[2] = 'everything'
field[2] = [everything]
field[3] = 'anything else'
field[3] = [anything else]
field[4] = 'this "isn't a problem", he said.'
field[4] = ["this ""isn't a problem"", he said."]
field[5] = 'Orcas Rule'
field[5] = [Orcas Rule]
fields[0..5] = 610078,1236100794,everything,anything else,"this ""isn't a problem"", he said.",Orcas Rule
The fields are printed twice, once to test the field extraction, once to test the field printing. You'd simplify the output by removing the printing except for csvputline() to convert your file from malformed CSV to properly formed CSV.

Related

read subject key identifier extension with mbedTLS

The project I have to extend is using mbedTLS and I have to extract the subject key identifier extension from the certificate. I have not found a workable solution so far. mbedTLS does not offer a direct function for this.
I have found mbedtls_x509_get_extbut this seems to be an internal function and very low level. I also do not have the offsets. I have also found the v3_ext storing maybe all extension as an mbedtls_x509_buf. I guess this could be parsed as ASN.1. a) Is the v3_extparsing approach the only option and correct? b) Are there better options?

Since no one had an idea how to do it, I followed the approach to parse the v3_ext field of the mbedtls_x509_crt struct.
#include "mbedtls/x509_crt.h"
#include <mbedtls/oid.h>
#include "mbedtls/platform.h"
const uint8_t SKID[] = {0x55, 0x1D, 0x0E}; //!< Subject key identifer OID.
int load_skid(const char* path, uint8_t* skid[]) {
int err = 0;
mbedtls_x509_crt cert;
mbedtls_x509_buf buf;
mbedtls_asn1_sequence extns;
mbedtls_asn1_sequence *next;
memset(&extns, 0, sizeof(extns));
mbedtls_x509_crt_init(&cert);
size_t tag_len;
if (mbedtls_x509_crt_parse_file(&cert, path)) {
err = 1;
goto exit;
}
buf = cert.v3_ext;
if (mbedtls_asn1_get_sequence_of(&buf.p, buf.p + buf.len, &extns, MBEDTLS_ASN1_CONSTRUCTED | MBEDTLS_ASN1_SEQUENCE)) {
err = 1;
goto exit;
}
next = &extns;
while (next) {
if (mbedtls_asn1_get_tag(&(next->buf.p), next->buf.p + next->buf.len, &tag_len, MBEDTLS_ASN1_OID)) {
err = 1;
goto exit;
}
if (!memcmp(next->buf.p, SKID, tag_len)) {
// get value
// correct length for SEQ TL + SKID OID value = 2 + tag_len
unsigned char *p = next->buf.p + tag_len;
if (mbedtls_asn1_get_tag(&p, p + next->buf.len-2-tag_len, &tag_len, MBEDTLS_ASN1_OCTET_STRING)) {
err = 1;
goto exit;
}
// include OCT TL = 2
if (mbedtls_asn1_get_tag(&p, p + next->buf.len-2, &tag_len, MBEDTLS_ASN1_OCTET_STRING)) {
err = 1;
goto exit;
}
if (tag_len != 20) {
err = 1;
goto exit;
}
memcpy(skid, p, 20);
goto exit;
}
next = next->next;
}
// skid not found
err = 1;
exit:
mbedtls_x509_crt_free(&cert);
mbedtls_asn1_sequence *cur;
next = extns.next;
while (next != NULL) {
cur = next;
next = next->next;
mbedtls_free(cur);
}
return err;
}

how to combine ADRESH and ADRESL on 12 bit ADC

MICRO: PIC18LF47K42
compiler: XC8
application: MPLABX
I am trying to combine the values in of 12 bit ADC. They go into ADRESH and ADRESL. My ADC is set up for right-justify which does formating as so:
ADRESH:(----MSB,x,x,x) ADRESL: (X,X,X,X,X,X,X,LSB)
From inspecting the value in my result register i can tell i dont have a great resolution. Im pretty sure its becasue of how im combining ADRESH and ADRESL. How could i do this?
#include "myIncludes.h"
volatile unsigned char ZCDSoftwareFlag = 0;
volatile unsigned char switchValue = 0;
void main(void)
{
portInit();
triac = 0;
unsigned char result;
adcInit();
while(1)
{
__delay_us(4);
ADCON0bits.GO = 1; //Start conversion
while (ADCON0bits.GO); //Wait for conversion done
result = ADRESH;
result = result << 8;
result = result |ADRESL;
}
}
And heres the ADC init function
void adcInit(void)
{
ADCON0bits.FM = 1; //right-justify
ADCON0bits.CS = 1; //ADCRC Clock
ADPCH = 0x00; //RA0 is Analog channel
ADCON0bits.ON = 1; //Turn ADC On
ADCON0bits.GO = 1; //Start conversion
}

You try to put an 12Bit result in an 8 Bit variable. Switch it to 16Bit.
uint16_t result;
Then you can combine the values:
result = ADRESH;
result = result << 8;
result = result |ADRESL;

Sending strings to BPF Map Space and printing them out

I have a small txt file that I would like to write to BPF here. Here is what my python code looks like for BPF but I am unable to print out anything as of now. I keep ending up with a Failed to load program: Invalid argument with a bunch of register errors. As of now my string basically says hello, world, hi
BPF_ARRAY(lookupTable, char, 512);
int helloworld2(void *ctx)
{
//print the values in the lookup table
#pragma clang loop unroll(full)
for (int i = 0; i < 512; i++) {
char *key = lookupTable.lookup(&i);
if (key) {
bpf_trace_printk("%s\n", key);
}
}
return 0;
}
Here is the Python code:
b = BPF(src_file="hello.c")
lookupTable = b["lookupTable"]
#add hello.csv to the lookupTable array
f = open("hello.csv","r")
file_contents = f.read()
#append file contents to the lookupTable array
b_string1 = file_contents.encode('utf-8')
b_string1 = ctypes.create_string_buffer(b_string1)
lookupTable[0] = b_string1
f.close()
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="helloworld2")
b.trace_print()
I have the error linked in this pastebin since it's so long:
BPF Error
One notable error is the mention of infinite loop detected which is something I would need to check out.

The issue is that i is passed by pointer in bpf_map_lookup_elem, so the compiler can't actually unroll the loop (from its point of view, i may not linearly increase).
Using an intermediate variable is enough to fix this:
BPF_ARRAY(lookupTable, char, 512);
#define MAX_LENGTH 1
int helloworld2(void *ctx)
{
//print the values in the lookup table
#pragma clang loop unroll(full)
for (int i = 0; i < 1; i++) {
int k = i;
char *key = lookupTable.lookup(&k);
if (key) {
bpf_trace_printk("%s\n", key);
}
}
return 0;
}

Unicode characters not shown correctly

I am making a C program that supports many languages. The program send emails using the type WCHAR instead of char. The problem is that when I receive the email and read it, some characters are not shown correctly, even some English ones like e, m, ... This is an example:
<!-- language: lang-c -->
curl_easy_setopt(hnd, CURLOPT_READFUNCTION, payload_source);
curl_easy_setopt(hnd, CURLOPT_READDATA, &upload_ctx);
static const WCHAR *payload_text[]={
L"To: <me#mail.com>\n",
L"From: <me#mail.com>(Example User)\n",
L"Subject: Hello!\n",
L"\n",
L"Message sent\n",
NULL
};
struct upload_status {
int lines_read;
};
static size_t payload_source(void *ptr, size_t size, size_t nmemb, void *userp){
struct upload_status *upload_ctx = (struct upload_status *)userp;
const WCHAR *data;
if ((size == 0) || (nmemb == 0) || ((size*nmemb) < 1)) {
return 0;
}
data = payload_text[upload_ctx->lines_read];
if (data) {
size_t len = wcslen(data);
memcpy(ptr, data, len);
upload_ctx->lines_read ++;
return len;
}
return 0;
}

memcpy() operates on bytes, not on characters. You are not taking into account that sizeof(wchar_t) > 1. It is 2 bytes on some systems and 4 bytes on others. This descrepency makes wchar_t a bad choice when writing portable code. You should be using a Unicode library instead, such as icu or iconv).
You need to take sizeof(wchar_t) into account when calling memcpy(). You also need to take into account that the destination buffer may be smaller than the size of the text bytes you are trying to copy. Keeping track of the lines_read by itself is not enough, you have to also keep track of how many bytes of the current line you have copied so you can handle cases when the current line of text straddles across multiple destination buffers.
Try something more like this instead:
static size_t payload_source(void *ptr, size_t size, size_t nmemb, void *userp)
{
struct upload_status *upload_ctx = (struct upload_status *) userp;
unsigned char *buf = (unsignd char *) ptr;
size_t available = (size * nmemb);
size_t total = 0;
while (available > 0)
{
wchar_t *data = payload_text[upload_ctx->lines_read];
if (!data) break;
unsigned char *rawdata = (unsigned char *) data;
size_t remaining = (wcslen(data) * sizeof(wchar_t)) - upload_ctx->line_bytes_read;
while ((remaining > 0) && (available > 0))
{
size_t bytes_to_copy = min(remaining, available);
memcpy(buf, rawdata, bytes_to_copy);
buf += bytes_to_copy;
available -= bytes_to_copy;
total = bytes_to_copy;
rawdata += bytes_to_copy;
remaining -= bytes_to_copy;
upload_ctx->line_bytes_read += bytes_to_copy;
}
if (remaining < 1)
{
upload_ctx->lines_read ++;
upload_ctx->line_bytes_read = 0;
}
}
return total;
}

C sharp delimiter

In a given sentence i want to split into 10 character string. The last word should not be incomplete in the string. Splitting should be done based on space or , or .
For example:
this is ram.he works at mcity.
now the substring of 10 chars is,
this is ra.
but the output should be,
this is.
Last word should not be incomplete

You can use a regular expression that checks that the character after the match is not a word character:
string input = "this is ram.he";
Match match = Regex.Match(input, #"^.{0,10}(?!\w)");
string result;
if (match.Success)
{
result = match.Value;
}
else
{
result = string.Empty;
}
Result:
this is
An alternative approach is to build the string up token by token until adding another token would exceed the character limit:
StringBuilder sb = new StringBuilder();
foreach (Match match in Regex.Matches(input, #"\w+|\W+"))
{
if (sb.Length + match.Value.Length > 10) { break; }
sb.Append(match.Value);
}
string result = sb.ToString();

Not sure if this is the sort of thing you were looking for. Note that this could be done a lot cleaner, but should get you started ... (may want to use StringBuilder instead of String).
char[] delimiterChars = { ',', '.',' ' };
string s = "this is ram.he works at mcity.";
string step1 = s.Substring(0, 10); // Get first 10 chars
string[] step2a = step1.Split(delimiterChars); // Get words
string[] step2b = s.Split(delimiterChars); // Get words
string sFinal = "";
for (int i = 0; i < step2a.Count()-1; i++) // copy count-1 words
{
if (i == 0)
{
sFinal = step2a[i];
}
else
{
sFinal = sFinal + " " + step2a[i];
}
}
// Check if last word is a complete word.
if (step2a[step2a.Count() - 1] == step2b[step2a.Count() - 1])
{
sFinal = sFinal + " " + step2b[step2a.Count() - 1] + ".";
}
else
{
sFinal = sFinal + ".";
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Postgres COPY command - fields with commas, quoted with double quotes - postgresql

Related

read subject key identifier extension with mbedTLS

how to combine ADRESH and ADRESL on 12 bit ADC

Sending strings to BPF Map Space and printing them out

Unicode characters not shown correctly

C sharp delimiter

Categories

Resources