How do I better clean extremely messy data with OpenRefine? - data-cleaning

I'm using the latest version of OpenRefine and I am cleaning a CSV file of 33,230 rows. There are 5 columns that I'm working with (Name, Personal_email, Phone_number, Twitter_handle, and Website). I have cleaned the data by adding titles, sorting by ID, applying title case, etc., but I have one major problem that I can't fix, and atomization didn't work. For instance, the Name column is perfect, but for some rows the personal email is in the Phone_number column, or the Twitter handle is in the Website column. How do I fix this without having to manually check each row? Also, how would I add "http" to the beginning of each Website entry without changing the information already in that column? Thanks!

I am not an OpenRefine user, but I have a lot of experience handling messy data with Python and pandas.
In the data-cleaning process, I would first work out the rules implicit in the data and filter out the rows that don't match the expected format, e.g.:
Personal_email must contain '@'.
Phone_number should contain only digits and '-'.
Does Twitter_handle have a length limit?
Does Website start with www?
Second, I would drop those rows if they are not important, or if there are only a few of them compared to the ones kept (5 vs 33,230).
Or, if the mistakes are few, just fix them by hand in the raw file.
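A minimal pandas sketch of that rule-based filtering. The column names come from the question; the validation patterns are illustrative assumptions, not a spec, so adjust them to your data:

```python
import pandas as pd

# Illustrative per-column rules; tighten or loosen the patterns as needed.
RULES = {
    "Personal_email": r"^[A-Za-z0-9_.+-]+@[A-Za-z0-9_.-]+$",
    "Phone_number": r"^\+?[0-9 -]+$",
    "Twitter_handle": r"^@[A-Za-z0-9_]{1,15}$",
    "Website": r"^(https?://)?[A-Za-z0-9_][A-Za-z0-9_.-]*\.[A-Za-z]+$",
}

def invalid_mask(df):
    """True for every row in which at least one column breaks its rule."""
    bad = pd.Series(False, index=df.index)
    for col, pattern in RULES.items():
        bad |= ~df[col].fillna("").str.match(pattern)
    return bad

df = pd.DataFrame({
    "Personal_email": ["s.john1@mail.com", "+1 800 123 456"],
    "Phone_number": ["+1 800 123 456", "s.john1@mail.com"],
    "Twitter_handle": ["@sjohn", "@sjohn"],
    "Website": ["www.jsmith.com", "www.jsmith.com"],
})
print(invalid_mask(df).tolist())  # [False, True] -- the second row has swapped fields
```

With the mask you can inspect `df[invalid_mask(df)]` before deciding whether to drop or repair those rows.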

I believe OpenRefine has all the features required for this task.
It can be solved with the following steps:
Create a new column "mixed_data"
by joining your columns (Name, Personal_email, Phone_number, Twitter_handle, and Website) with a unique separator character that cannot occur inside a value, say "|". A cell in this column might then look like
"Smith, John|s.john1@mail.com|+1 800 123 456|@sjohn|www.jsmith.com"
(screenshot: join menu)
(screenshot: join parameters)
Then create copies of your original columns, based on "mixed_data", using regular expressions:
Personal_email2:
value.find(/(^|[|])([A-Za-z0-9_.-]+[@][A-Za-z0-9_.-]+)([|]|$)/)[0]
Phone_number2: value.find(/(^|[|])([+]?[0-9 -]+)([|]|$)/)[0]
Twitter_handle2: value.find(/(^|[|])([@][A-Za-z0-9_]+)([|]|$)/)[0]
Website2: value.find(/(^|[|])([A-Za-z0-9_][A-Za-z0-9_.-]+)([|]|$)/)[0]
Note: the regular expressions above can be refined further, but in many cases they will be enough.
The resulting columns will look like those in the attached picture:
(screenshot: OpenRefine sample sheet)
Finally, remove the remaining "|" characters in the new columns with the "Edit cells -> Replace" feature.
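The extraction regexes can be sanity-checked outside OpenRefine; here is a minimal Python sketch of the same separator-plus-regex logic (patterns along the lines of the GREL above, with @ for emails and handles):

```python
import re

# Rough Python equivalents of the GREL extraction regexes.
PATTERNS = {
    "Personal_email": r"(^|\|)([A-Za-z0-9_.-]+@[A-Za-z0-9_.-]+)(\||$)",
    "Phone_number": r"(^|\|)(\+?[0-9 -]+)(\||$)",
    "Twitter_handle": r"(^|\|)(@[A-Za-z0-9_]+)(\||$)",
    "Website": r"(^|\|)([A-Za-z0-9_][A-Za-z0-9_.-]+)(\||$)",
}

def extract(mixed, field):
    """Pull one field out of the '|'-joined string, or None if absent."""
    m = re.search(PATTERNS[field], mixed)
    return m.group(2) if m else None

mixed = "Smith, John|s.john1@mail.com|+1 800 123 456|@sjohn|www.jsmith.com"
print(extract(mixed, "Personal_email"))  # s.john1@mail.com
print(extract(mixed, "Website"))         # www.jsmith.com
```

If a pattern grabs the wrong field on your real data, tighten it here first and then port the fix back to the GREL expression.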
Alternatively, just copy and paste the JSON below and apply it in the "Apply Operation History" window:
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"baseColumnName": "Website",
"expression": "join ([cells['Website'].value,cells['Name'].value,cells['Personal_email'].value,cells['Phone_number'].value,cells['Twitter_handle'].value],'|')",
"onError": "keep-original",
"newColumnName": "mixed_data",
"columnInsertIndex": 5,
"description": "Create column mixed_data at index 5 based on column Website using expression join ([cells['Website'].value,cells['Name'].value,cells['Personal_email'].value,cells['Phone_number'].value,cells['Twitter_handle'].value],'|')"
},
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"baseColumnName": "mixed_data",
"expression": "grel:value.find(/(^|[|])([A-Za-z0-9_][A-Za-z0-9_.-]+)([|]|$)/)[0]",
"onError": "set-to-blank",
"newColumnName": "Website2",
"columnInsertIndex": 6,
"description": "Create column Website2 at index 6 based on column mixed_data using expression grel:value.find(/(^|[|])([A-Za-z0-9_][A-Za-z0-9_.-]+)([|]|$)/)[0]"
},
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"baseColumnName": "mixed_data",
"expression": "grel:value.find(/(^|[|])([@][A-Za-z0-9_]+)([|]|$)/)[0]",
"onError": "set-to-blank",
"newColumnName": "Twitter_handle2",
"columnInsertIndex": 6,
"description": "Create column Twitter_handle2 at index 6 based on column mixed_data using expression grel:value.find(/(^|[|])([@][A-Za-z0-9_]+)([|]|$)/)[0]"
},
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"baseColumnName": "mixed_data",
"expression": "grel:value.find(/(^|[|])([+]?[0-9 -]+)([|]|$)/)[0]",
"onError": "set-to-blank",
"newColumnName": "Phone_number2",
"columnInsertIndex": 6,
"description": "Create column Phone_number2 at index 6 based on column mixed_data using expression grel:value.find(/(^|[|])([+]?[0-9 -]+)([|]|$)/)[0]"
},
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"baseColumnName": "mixed_data",
"expression": "grel:value.find(/(^|[|])([A-Za-z0-9_.-]+[@][A-Za-z0-9_.-]+)([|]|$)/)[0]",
"onError": "set-to-blank",
"newColumnName": "Personal_email2",
"columnInsertIndex": 6,
"description": "Create column Personal_email2 at index 6 based on column mixed_data using expression grel:value.find(/(^|[|])([A-Za-z0-9_.-]+[@][A-Za-z0-9_.-]+)([|]|$)/)[0]"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "Personal_email2",
"expression": "value.replace(\"|\",\"\")",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column Personal_email2 using expression value.replace(\"|\",\"\")"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "Phone_number2",
"expression": "value.replace(\"|\",\"\")",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column Phone_number2 using expression value.replace(\"|\",\"\")"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "Twitter_handle2",
"expression": "value.replace(\"|\",\"\")",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column Twitter_handle2 using expression value.replace(\"|\",\"\")"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "Website2",
"expression": "value.replace(\"|\",\"\")",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column Website2 using expression value.replace(\"|\",\"\")"
}
]
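The second part of the question, prepending "http" only to Website values that lack it, is a simple conditional transform; in OpenRefine the same check can be written as a GREL cell transform with if() and startsWith(). A minimal sketch of the logic in Python:

```python
def ensure_http(url):
    """Prepend "http://" only when no scheme is already present."""
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url

print(ensure_http("www.jsmith.com"))      # http://www.jsmith.com
print(ensure_http("https://jsmith.com"))  # https://jsmith.com (unchanged)
```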

Related

How to check multiple conditions in ADF

I have a scenario where I have to check two conditions and, if both are true, execute a set of activities in ADF.
I tried an If Condition activity inside another If Condition, but ADF does not allow it.
So basically my design is: two Lookups to read data, then an If Condition to check condition 1; if that is true, two more Lookups to read data and another If Condition to check condition 2. But it is not working.
Is there any other workaround for this?
I tried an AND condition inside the If Condition activity but it is not working. Please suggest.
Since we cannot nest an If Condition within an If Condition activity, we can evaluate multiple conditions within a single If activity via expression functions such as if(), and(), and or().
The JSON pipeline below is an example:
{
"name": "pipeline7",
"properties": {
"activities": [
{
"name": "If Condition1",
"type": "IfCondition",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "@and(greater(pipeline().parameters.Test2, pipeline().parameters.Test1),greater(pipeline().parameters.Test4, pipeline().parameters.Test3))",
"type": "Expression"
},
"ifFalseActivities": [
{
"name": "Wait2",
"type": "Wait",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"waitTimeInSeconds": 1
}
}
],
"ifTrueActivities": [
{
"name": "Wait1",
"type": "Wait",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"waitTimeInSeconds": 1
}
}
]
}
}
],
"parameters": {
"Test1": {
"type": "int",
"defaultValue": 1
},
"Test2": {
"type": "int",
"defaultValue": 2
},
"Test3": {
"type": "int",
"defaultValue": 3
},
"Test4": {
"type": "int",
"defaultValue": 4
}
},
"annotations": []
}
}
Alternatively, if you don't want to combine the conditions into one expression, you can create another pipeline and call it via an Execute Pipeline activity inside the If activity for the second comparison.

Differentiate dropdown multi select (without options defined) with regular text columns

Is there any way to differentiate columns that are of type dropdown multi select from regular text columns?
This is supposed to be a multi select dropdown list without any options defined:
{
"id": 5414087443146628,
"version": 2,
"index": 2,
"title": "Column3",
"type": "TEXT_NUMBER",
"validation": false,
"width": 150
}
The same question goes for a multi contact list without contact options defined.
If you think of multi-contact and multi-dropdown as new column versions exposed by the various GET requests, it's easier to get the correct values back. For multi-dropdown, use a combination of the query parameters "level=3" and "include=objectValue"; then you'll see the column type change to MULTI_PICKLIST instead of TEXT_NUMBER. (The TEXT_NUMBER value is kept to maintain backwards compatibility.)
So, essentially, your request would look something like GET /sheets/{sheetId}?level=3&include=objectValue.
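As a sketch of that request and the follow-up type check (get_sheet and multi_select_columns are hypothetical helper names, and MULTI_CONTACT_LIST is, as I understand it, the analogous type reported for multi-select contact columns):

```python
import json
import urllib.request

API = "https://api.smartsheet.com/2.0"

def get_sheet(sheet_id, token):
    """GET the sheet with level=3&include=objectValue so multi-select
    columns report their real type instead of TEXT_NUMBER."""
    url = f"{API}/sheets/{sheet_id}?level=3&include=objectValue"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def multi_select_columns(sheet):
    """Titles of columns whose reported type marks them as multi-select."""
    multi = {"MULTI_PICKLIST", "MULTI_CONTACT_LIST"}
    return [c["title"] for c in sheet.get("columns", []) if c.get("type") in multi]

# Filtering works the same on a canned response:
sample = {"columns": [
    {"title": "Description", "type": "TEXT_NUMBER"},
    {"title": "Column3", "type": "MULTI_PICKLIST"},
]}
print(multi_select_columns(sample))  # ['Column3']
```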
To test the scenario you've described, I created the following sheet structure in Smartsheet, where the column names indicate the type of each column:
Then I used Postman to issue a Get Sheet request for that sheet:
GET https://api.smartsheet.com/2.0/sheets/5831916227192708
The columns portion of the API response looks like this:
{
"id": 5831916227192708,
...
"columns": [
{
"id": 1256050323154820,
"version": 0,
"index": 0,
"title": "Description",
"type": "TEXT_NUMBER",
"primary": true,
"validation": false,
"width": 124
},
{
"id": 5759649950525316,
"version": 0,
"index": 1,
"title": "Type=Text/Number",
"type": "TEXT_NUMBER",
"validation": false,
"width": 128
},
{
"id": 1323283741206404,
"version": 0,
"index": 2,
"title": "Type=Dropdown (single select)",
"type": "PICKLIST",
"validation": false,
"width": 111
},
{
"id": 7741495861110660,
"version": 2,
"index": 3,
"title": "Type=Dropdown (multiple select)",
"type": "TEXT_NUMBER",
"validation": false,
"width": 113
},
{
"id": 3048711514285956,
"version": 0,
"index": 4,
"title": "Type=Contact List (single select)",
"type": "CONTACT_LIST",
"validation": false,
"width": 122
},
{
"id": 3992195570132868,
"version": 1,
"index": 5,
"title": "Type=Contact List (multiple select)",
"type": "TEXT_NUMBER",
"validation": false,
"width": 125
}
],
...
}
In this response, we see the following:
If column type is specified as Text/Number, the type attribute value is TEXT_NUMBER
If column type is specified as Dropdown (single select), the type attribute value is PICKLIST
If column type is specified as Dropdown (multiple select), the type attribute value is TEXT_NUMBER
If column type is specified as Contact List (single select), the type attribute value is CONTACT_LIST
If column type is specified as Contact List (multiple select), the type attribute value is TEXT_NUMBER
Therefore, it doesn't seem possible to programmatically differentiate a Dropdown (multiple select) column from a Text/Number column, or a Contact List (multiple select) column from a Text/Number column, based on this column metadata alone. IMO, it seems like a bug that the Dropdown (multiple select) and Contact List (multiple select) column types return type: TEXT_NUMBER. Perhaps someone from Smartsheet can comment here to provide more insight into this behavior.
I did a few tests and level 3 isn't available; https://api.smartsheet.com/2.0/sheets/{sheetId}?level=3 returns:
{
"errorCode": 1018,
"message": "The value '3' was not valid for the parameter 'level'.",
"refId": "1godowa5cigf1"
}
However, when I tried with level 2 I got the info:
https://api.smartsheet.com/2.0/sheets/{sheetId}?level=2&include=objectValue
Results for a multi dropdown list:
{
"id": 5414087443146628,
"version": 2,
"index": 2,
"title": "Column3",
"type": "MULTI_PICKLIST",
"options": [
"a",
"b"
],
"validation": false,
"width": 150
}

Parsing Really Messy Nested JSON Strings

I have a series of deeply nested JSON strings in a PySpark dataframe column. I need to explode and filter based on the contents of these strings, and I would like to add them as columns. I've tried defining the StructTypes, but each time it keeps returning an empty DF.
I tried using json_tuple to parse, but there are no common keys to rejoin the dataframes and the row numbers don't match up. I think it might have to do with some null fields.
The sub fields can be nullable.
Sample JSON:
{
"TIME": "datatime",
"SID": "yjhrtr",
"ID": {
"Source": "Person",
"AuthIFO": {
"Prov": "Abc",
"IOI": "123",
"DETAILS": {
"Id": "12345",
"SId": "ABCDE"
}
}
},
"Content": {
"User1": "AB878A",
"UserInfo": "False",
"D": "ghgf64G",
"T": "yjuyjtyfrZ6",
"Tname": "WE ARE THE WORLD",
"ST": null,
"TID": "BPV 1431: 1",
"src": "test",
"OT": "test2",
"OA": "test3",
"OP": "test34"
},
"Test": false
}
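Two things worth checking in this situation: from_json returns null for any row whose JSON string is malformed (a single missing quote can empty the whole parsed column), and the StructType must match the nesting exactly. As a plain-Python illustration of flattening the nested keys into column names (flatten is a hypothetical helper, not PySpark API):

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names, keeping null leaves."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

sample = '{"TIME": "datatime", "ID": {"Source": "Person", "AuthIFO": {"Prov": "Abc", "IOI": "123"}}, "Test": false}'
row = flatten(json.loads(sample))
print(row["ID.AuthIFO.Prov"])  # Abc
```

In PySpark the equivalent move is selecting nested fields with dotted paths (e.g. col("ID.AuthIFO.Prov")) after parsing with a matching schema.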

Remove Parse-specific fields from database results

Every time I query something from the database I get objects like this:
{
"title": "faq",
"description": "",
"type": "application/pdf",
"size": 122974,
"filename": "faq.pdf",
"order": 0,
"createdAt": "2017-08-17T08:10:33.101Z",
"updatedAt": "2017-08-17T08:10:33.101Z",
"ACL": {
"role:cxcccccc_public": {
"read": true
},
"role:cxaaaaaaa_admin": {
"read": true,
"write": true
}
},
"objectId": "l6L5J1mRpH",
"__type": "Object",
"className": "Document"
}
A lot of these fields are Parse-specific: createdAt, updatedAt, ACL, className...
I didn't insert them when I added the row to the database, but they are still returned when I query.
I want to expose this data through a clean REST API so I want to get rid of them.
Is there a way to get only the data I specified at insert time?
For now I am using lodash, filtering with
data = _.omit(data, ['ACL', '__type', 'className', ...])
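For illustration, the same blacklist idea sketched in Python (strip_parse_fields and the field set are hypothetical names, mirroring what _.omit does above):

```python
# Parse-specific metadata keys to drop from query results.
PARSE_FIELDS = {"createdAt", "updatedAt", "ACL", "objectId", "__type", "className"}

def strip_parse_fields(obj):
    """Return a copy of the object without Parse metadata (a blacklist)."""
    return {k: v for k, v in obj.items() if k not in PARSE_FIELDS}

doc = {"title": "faq", "createdAt": "2017-08-17T08:10:33.101Z", "__type": "Object"}
print(strip_parse_fields(doc))  # {'title': 'faq'}
```

A whitelist of your own schema's fields may be the more robust design, since the platform can introduce new metadata fields at any time.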

Mongodb multilingual query

The Material collection represents user submissions.
The default language is defined by the submitting user; other users can extend the first submission with other languages.
When I request a material I may need more than one language, so I think a single collection with multilanguage support fits well.
My current design is like this (test document):
{
"activities": [ "Ficha de actividades", "Juego Colectivo" ],
"files": [ "Asociacion_Profesiones.zip"],
"areas": ["Literatura", "Ciencias Naturales", "Ciencias Sociales" ],
"authors": [
{ "author": "César", "email": "xxx@gmail.com"},
{ "author": "José", "email": "xxxx@gmail.com"}
],
"desc": "La leche, el agua, el vino y la bebida",
"state": 1,
"date": ISODate("2017-01-10 19:40:39"),
"updated": null,
"id": 1,
"language": "spanish",
"images": [],
"license": "Creative Commons BY-NC-SA",
"downloads": 0,
"popular": false,
"title": "La bebida",
"file": "1.zip",
"translations": [
{"language": "english", "title": "Beverages", "desc": "Milk, water, wine and all related with drinking", "authors": [], "files": [], "file": "1.zip", "images": [], "downloads": 0, "state": 1, "created": null, "updated": null},
{"language": "french", "title": "le boisson", "desc": "du lait, de l'eau, du vin, boire", "authors": [], "files": [], "file": "1.zip", "images": [], "downloads": 0, "state": 1, "created": null, "updated": null}
]
}
I created a text index on the title and description (desc) fields:
db.materials.createIndex({ "title": "text", "desc": "text", "translations.title": "text", "translations.desc": "text"})
I know the user's default language, but I don't know which language they will use when searching.
Imagine a Spanish user wants to search for drinks:
db.materials.find( { $text: { $search: "drink", $language: "es"} } ).explain(true)
It returns our document. It searches for drink, and since "Milk, water, wine and all related with drinking" stems drinking to drink, it works fine.
However, if I look for water:
db.materials.find( { $text: { $search: "water", $language: "es"} } ).explain(true)
It returns nothing: water gets stemmed to wat (the Spanish stemmer strips the -er verb suffix), and "wat" is not found.
Any suggestions to improve my searches?
The only workaround I can imagine is to add language dropdown selector to user searches.
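Short of a language selector, one option is to run the same search once per candidate language and merge the results, since $text accepts only a single $language per query. A pymongo-style sketch (the helper names are hypothetical):

```python
def text_search_filter(term, language):
    """Build the find() filter for a $text search in one language."""
    return {"$text": {"$search": term, "$language": language}}

def search_all_languages(collection, term, languages=("es", "en", "fr")):
    """Run the same $text search once per candidate language and merge
    results by _id, since one stemmer cannot handle another language's terms."""
    seen, results = set(), []
    for lang in languages:
        for doc in collection.find(text_search_filter(term, lang)):
            if doc["_id"] not in seen:
                seen.add(doc["_id"])
                results.append(doc)
    return results

print(text_search_filter("water", "en"))  # {'$text': {'$search': 'water', '$language': 'en'}}
```

The cost is one index scan per language, but for a handful of supported languages that is usually acceptable.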