PySpark: ValueError

I have a dictionary of PySpark RDDs and am trying to convert them to data frames, save them as variables, and then join them. When I attempt to convert one of my RDDs to a data frame I get the following error:
File "./spark-1.3.1/python/pyspark/sql/types.py",
line 986, in _verify_type
"length of fields (%d)" % (len(obj), len(dataType.fields)))
ValueError: Length of object (52) does not match with length of fields (7)
Does anyone know exactly what this means, or can anyone help me with a workaround?

I agree - we need to see more code - obfuscated data is fine.
It seems you are using Spark SQL (sql types) - mapped onto what? HDFS/text?
From the error it would appear that the schema you are creating is incorrect, leading to an error when the DataFrame is created.
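For illustration, here is a minimal sketch (hypothetical column names, Spark 1.x-style SQLContext API) of how that error arises: the schema declares 7 fields, but each row of the RDD carries a different number of values, so the rows fail verification.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(appName="schema-check")
sqlContext = SQLContext(sc)

# Hypothetical schema with 7 string fields
schema = StructType([StructField("c%d" % i, StringType(), True) for i in range(7)])

good_rdd = sc.parallelize([("a", "b", "c", "d", "e", "f", "g")])  # 7 values per row
bad_rdd = sc.parallelize([tuple(["x"] * 52)])                     # 52 values per row

df = sqlContext.createDataFrame(good_rdd, schema)   # works: row length matches the schema
# sqlContext.createDataFrame(bad_rdd, schema)       # fails with the ValueError from the question
#                                                   # (the check may only fire once the rows are verified)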

This was due to passing an incorrect RDD, sorry everyone. I was passing the wrong RDD, which didn't fit the code I was using.


How to use json_extract_path_text?

I am facing an issue with JSON extraction using JSON_EXTRACT_PATH_TEXT in Redshift.
I have two separate JSON columns: one containing the modems the customer is using and the other containing the recharge details.
{"Mnfcr": "Technicolor","Model_Name":"Technicolor ABC1243","Smart_Modem":"Y"}
For the above, I have no issue extracting the model name using JSON_EXTRACT_PATH_TEXT(COLUMN_NAME, 'Model_Name') as model_name.
[{"Date":"2021-12-24 21:42:01","Amt":50.00}]
This one is causing me trouble. I used the same method as above and it did not work; it gave me the error below:
ERROR: JSON parsing error Detail: ----------------------------------------------- error: JSON parsing error code: 8001 context: invalid json object [{"Date":"2021-07-03 17:12:16","Amt":50.00
Can I please get assistance on how to extract this using json_extract_path_text?
One other method I found that worked was to use regexp_substring.
This second string is a JSON array (square brackets), not an object (curly braces). The array contains a single element, which is an object. So you need to extract the object from the array before using JSON_EXTRACT_PATH_TEXT().
The function for this is JSON_EXTRACT_ARRAY_ELEMENT_TEXT().
Putting this all together we get:
JSON_EXTRACT_PATH_TEXT(
JSON_EXTRACT_ARRAY_ELEMENT_TEXT( <column>, 0)
, 'Amt')
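To see why the extra step is needed, it may help to look at the second string outside of Redshift. A small Python sketch (standard json module, purely for illustration) shows that it parses to an array containing a single object, so the element has to be pulled out before the path lookup:
import json

recharge = '[{"Date":"2021-12-24 21:42:01","Amt":50.00}]'

parsed = json.loads(recharge)   # a list (JSON array), not a dict (JSON object)
first = parsed[0]               # same role as JSON_EXTRACT_ARRAY_ELEMENT_TEXT(<column>, 0)
print(first["Amt"])             # same role as JSON_EXTRACT_PATH_TEXT(..., 'Amt'); prints 50.0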
You can use json_extract_path_text as in the example below:
json_extract_path_text(json_columnName, json_keyName) = compareValue
For more, you can refer to this article:
https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html

Postgres array range notation with jOOQ

In my Postgres database I have an array of JSON values (json[]) to keep track of errors in the form of JSON objects. If an update comes in, I want to append the latest error to that array but only keep the latest 5 errors (all SQL is simplified):
update dc
set
error_history =
array_append(dc.error_history[array_length(dc.error_history, 1) - 4:array_length(dc.error_history, 1)+1], cast('{"blah": 6}' as json))
where dc.id = 'f57520db-5b03-4586-8e77-284ed2dca6b1'
;
This works fine in native SQL. I tried to replicate it in jOOQ as follows:
.set(dc.ERROR_HISTORY,
field("array_append(dc.error_history[array_length(dc.error_history, 1) - 4:array_length(dc.error_history, 1) + 1], cast('{0}' as json))", dc.ERROR_HISTORY.getType(), latestError)
);
But it seems the : causes the library to think there is a bind parameter there. The generated SQL is:
array_append(dc.error_history[array_length(dc.error_history, 1) - 4?(dc.error_history, 1) + 1], cast('{0}' as json)
and the error I get is:
nested exception is org.postgresql.util.PSQLException: ERROR: syntax error at or near "$5"
which I totally agree with :D
Is there some way to escape the : in the Java code, or is there a better way to do this?
Edit:
I have also tried to simply remove the first element upon updating using the arrayRemove function, but that didn't work either, because it works by element rather than by index, and Postgres doesn't know how to check JSON elements for equality.
OMG, the answer was really simple :D
field("array_append(dc.error_history[array_length(dc.error_history, 1) - 4 : array_length(dc.error_history, 1) + 1], cast({0} as json))", dc.ERROR_HISTORY.getType(), latestError)
Just add spaces around the colon and it works correctly. Note that I had made another mistake by putting single quotes around {0}, because the latestError object is already of type JSON.

Camel: UTF-8 Encoding is lost after using Group

I'm using Camel 2.14.1 and splitting a huge XML file with Chinese/Japanese characters using group=10000 within the tokenize tag.
Files are created successfully based on the grouping, but the Chinese/Japanese text is converted to junk characters.
I tried enforcing UTF-8 before creating the new XML using "ConvertBodyTo", but the issue persists.
Can someone help me?
I had run into a similar issue while trying to split a CSV file using tokenize with grouping.
Sample CSV file (with delimiter '|'):
CandidateNumber|CandidateLastName|CandidateFirstName|EducationLevel
CAND123C001|Wells|Jimmy|Bachelor's Degree (±16 years)
CAND123C002|Wells|Tom|Bachelor's Degree (±16 years)
CAND123C003|Wells|James|Bachelor's Degree (±16 years)
CAND123C004|Wells|Tim|Bachelor's Degree (±16 years)
The ± character is corrupted after tokenize with grouping. I was initially under the assumption that the problem was with not setting the proper file encoding for the split, but the exchange seems to have the right value for the property CamelCharsetName=ISO-8859-1.
from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
.split(body().tokenize("\n",2,true)).streaming()
.log("body: ${body}");
The same works fine when I don't use grouping:
from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
.split(body().tokenize("\n")).streaming()
.log("body: ${body}");
Thanks to this post, which confirmed that the issue is with grouping.
Looking at GroupTokenIterator in the Camel code base, the problem seems to be the way TypeConverter is used to convert the String to an InputStream:
// convert to input stream
InputStream is =
camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, data);
...
Note: mandatoryConvertTo() has an overload that takes the exchange:
<T> T mandatoryConvertTo(Class<T> type, Exchange exchange, Object value)
As the exchange is not passed as an argument, it always falls back to the default charset set via the system property "org.apache.camel.default.charset".
Potential Fix:
// convert to input stream
InputStream is =
camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, exchange, data);
...
As this fix is in camel-core, another potential option is to use split without grouping and use an AggregationStrategy with completionSize() and completionTimeout().
It would still be great to get this fixed in camel-core, though.

Create ordinal array with multiple groups

I need to categorize a dataset according to different age groups. The categorization depends on whether the Sex is Male or Female. I first subset the data by gender and then use the ordinal function (dataset is from a Matlab example). The following code crashes on the last line when I try to vertically concatenate the subsets:
load hospital;
subset_m=hospital(hospital.Sex=='Male',:);
subset_f=hospital(hospital.Sex=='Female',:);
edges_f=[0 20 max(subset_f.Age)];
edges_m=[0 30 max(subset_m.Age)];
labels_m = {'0-19','20+'};
labels_f = {'0-29','30+'};
subset_m.AgeGroup= ordinal(subset_m.Age,labels_m,[],edges_m);
subset_f.AgeGroup = ordinal(subset_f.Age,labels_f,[],edges_f);
vertcat(subset_m,subset_f);
Error using dataset/vertcat (line 76)
Could not concatenate the dataset variable 'AgeGroup' using VERTCAT.
Caused by:
Error using ordinal/vertcat (line 36)
Ordinal levels and their ordering must be identical.
Edit
It seems that a vital part was missing from the question; here is the answer to the corrected question. You need to use join rather than vertcat, for example:
joinFull = join(subset_f,subset_m,'LeftKeys','LastName','RightKeys','LastName','type','rightouter','mergekeys',true)
Solution to the original problem
It seems like you are actually trying to work with the wrong variable. If I change all instances of hospitalCopy to hospital, then everything works fine for me.
Perhaps you copied hospital and edited it, thus losing the validity of the input.
If you really need to have hospitalCopy, make sure to assign to it directly after load hospital.
If this does not help, try using clear all before running the code and make sure there is no file called 'hospital' in your current directory.

DataFrame keying using the pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to use an example I saw in one of Wes's videos and notebooks on my data. I have a CSV file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a data frame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed in getting a result when I use the filePath, but as I understand it and saw in previous examples, I should be able to use the vp keys as well. Isn't that so?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex series; I simply did it the wrong way.
The index's first level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp, meaning for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame, you can use the .xs method to slice on a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help; something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable but sometimes also more flexible; you could test for multiple values at once, for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
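For completeness, here is a small self-contained sketch (made-up rows in the same shape as the CSV above) showing the partial indexing that works on the outer level and the two options for slicing on the inner level:
import pandas as pd

# Made-up rows in the same shape as the CSV in the question
df = pd.DataFrame({
    "filePath": [r"E:\Audio\7168965711_5601_4.wav"] * 3,
    "vp": ["Cust_9709495726", "Cust_4062144116", "Cust_6073165338"],
    "score": [-2, -70, -40],
})

res = df.groupby(["filePath", "vp"]).size()

# Partial indexing on the first (outer) level works:
print(res[r"E:\Audio\7168965711_5601_4.wav"])

# Slicing on the second (inner) level needs .xs on a DataFrame ...
frame = pd.DataFrame(res)
print(frame.xs("Cust_4062144116", level=1))

# ... or boolean indexing on the Series itself:
print(res[res.index.get_level_values(1) == "Cust_4062144116"])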