I'm doing some tests to figure out how best to import data from a CSV file into Neo4j. I have the following data in the Person.csv file (3 headers: Person_ID, Name, and Type):
Person_ID Name Type
HUA001 Jaap Layperson
HUA002 Teems Priest
HUA003 Frank Layperson
I want to import the nodes which are of a particular type (e.g. 'Layperson').
I thought about creating a LOAD CSV command with a WHERE clause (see below), but Neo4j doesn't accept the WHERE in that position. Any ideas how to get this (or a query with a similar result) working?
LOAD CSV WITH HEADERS FROM 'file:///Person.csv' AS row
WHERE row.Type='Layperson'
CREATE (p:Person:Layperson {ID: row.Person_ID, name: row.Name})
You can use WITH and WHERE combined to filter the required rows and pass only the filtered rows on to the next part of the query, which creates the nodes (WHERE cannot follow LOAD CSV directly, but it can follow WITH):
LOAD CSV WITH HEADERS FROM 'file:///Person.csv' AS row
WITH row
WHERE row.Type='Layperson'
CREATE (p:Person:Layperson {ID: row.Person_ID, name: row.Name})
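If you want a quick sanity check after the filtered import (this is just a suggested follow-up, not part of the original answer), a count over the new label works:
MATCH (p:Person:Layperson)
RETURN count(p) AS laypersonCount
The same LOAD CSV ... WITH row WHERE ... pattern can be repeated with row.Type='Priest' to load the other type under its own label.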
Related
I have an Azure Data Factory trigger that fires when a file is placed in blob storage; this trigger starts a pipeline run and passes the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be in the comparison. Not sure how to tackle this task; I've read about the 'derived column' activity, is that the route I should take?
You can select or filter which columns end up in the sink dataset or table by using "Field mapping". You can optionally use a "derived column" transformation, but the "sink transformation" already gives you this by default, set to "Auto mapping"; there you can add or remove the columns that are written to the sink.
In the example below, the column "id" can be treated as equivalent to the "Identity" column in your table, assuming all the files have the same columns:
Once you have modified the mapping as needed, you can confirm it from the "Inspect" tab before the run.
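If the sink is Azure SQL / SQL Server, one way to build the list of sink columns for the comparison, leaving out the identity column, is a metadata query along these lines (a sketch; 'dbo.SinkTable' is a placeholder for your actual table):
SELECT c.name AS column_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID('dbo.SinkTable')  -- placeholder table name
  AND c.is_identity = 0                         -- exclude the identity column from the comparison
ORDER BY c.column_id;
You can then compare this list against the header row of the incoming file.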
Strategy:
Use two ADF pipelines: one to get a list of all files, and another to process each file, copying its content to the matching SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need, "[CustomerID]_[TableName]_[FileID].csv", and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table "A_Inventory".
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table "A_Sales".
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table "B_Inventory".
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table "B_Sales".
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
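For reference, a rough sketch of what the parameterised "AzureSQLTable" dataset might look like in JSON (the parameter name, linked service name, and typeProperties here are illustrative, not copied from the repository linked below):
{
  "name": "AzureSQLTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": { "referenceName": "AzureSqlDatabase", "type": "LinkedServiceReference" },
    "parameters": { "TableName": { "type": "string" } },
    "typeProperties": { "table": "@dataset().TableName" }
  }
}
The "DelimitedTextFile" dataset is parameterised the same way, with a file name parameter referenced as @dataset().FileName in its file path (parameter name again assumed).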
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
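As a rough sketch of the dynamic pieces here (the activity name "Get Metadata1" is an assumption, not taken from the repository linked below):
ForEach Items:                        @activity('Get Metadata1').output.childItems
Execute Pipeline parameter pFileName: @item().name
The Get Metadata activity needs "Child Items" selected in its field list for childItems to be available.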
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity uses a split on the file name to extract the parts we need to compose the destination table name.
In the expression below we split the file name on the "_" separator and then use the first and second parts to compose the destination table name; for "A_inventory_0001.csv" this yields "A_inventory".
@concat(string(split(pipeline().parameters.pFileName, '_')[0]), '_', string(split(pipeline().parameters.pFileName, '_')[1]))
The second activity then copies the file from the source "pFileName" to the destination table "vTableName" using dynamic mapping, i.e. not adding specific column names, since these will be dynamic.
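For reference, a hedged sketch of how the dynamic wiring of this copy activity might look (parameter names follow the datasets above; the exact property layout can differ):
Source dataset "DelimitedTextFile", file name parameter: @pipeline().parameters.pFileName
Sink dataset "AzureSQLTable", table name parameter:      @variables('vTableName')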
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use dynamic mapping: use the available parameters (the file name) and create a JSON with the dynamic mapping in the Mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping
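A rough sketch of what such a dynamic mapping JSON might look like (the column names here are hypothetical; see the linked documentation for the full format):
{
  "type": "TabularTranslator",
  "mappings": [
    { "source": { "name": "ProductCode" }, "sink": { "name": "ProductCode" } },
    { "source": { "name": "Quantity" }, "sink": { "name": "Quantity" } },
    { "source": { "name": "CustomerID" }, "sink": { "name": "CustomerID" } },
    { "source": { "name": "FileID" }, "sink": { "name": "FileID" } }
  ]
}
The CustomerID and FileID values themselves would have to be made available on the source side, for example by deriving them from the file name parameter.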
I have a simple clients table in SQL Server that contains 2 columns: client name and city. In Azure Data Factory, how can I export this table to multiple CSV files so that each file contains only the clients from one city, with the city used as the file name?
I already tried, and succeeded, splitting it into different files using Lookup and ForEach, but the data in each file remains unfiltered by city.
Any ideas, anyone?
You can use Data Flow to achieve that easily.
I made an example for you: I created a table as the source and exported it to multiple CSV files, where each file contains only the clients from one city and is named after that city.
Data Flow Source:
Data Flow sink settings: set the file name option to "as data in column" and use auto mapping.
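The "as data in column" option expects a column that already holds the target file name, so a derived column can supply it before the sink; a minimal sketch in the data flow expression language, assuming the city column is called City:
fileName = concat(City, '.csv')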
Check the output files and data in it:
HTH.
You would need to follow the below flow chart:
LookUp activity: Query: Select distinct city from table
ForEach activity
Items: @activity('LookUp').output.value
a) Copy activity
i) Source: dynamic query: Select * from t1 where city = '@{item().City}'
This should generate a separate file for each city, as needed.
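If the sink dataset's file name is parameterised, the file name for each iteration can be composed with an expression like this (a sketch, reusing the column name from the flow above):
@concat(item().City, '.csv')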
Steps (screenshots omitted):
The batch job can be any number of parallel executions.
Create a parameterised dataset.
Result: I have 2 different Entities, so 2 files are generated.
I have an ADF Copy Data flow and I'm getting the following error at runtime:
My source is defined as follows:
In my data set, the column is defined as shown below:
As you can see from the second image, the column IsLiftStation is defined in the source. Any idea why ADF cannot find the column?
I've had the same error. You can solve this either by selecting all columns (*) in the source and then mapping the ones you want to the sink schema, or by 'clearing' the mapping, in which case the ADF Copy component will auto-map to columns in the sink schema (best if columns have the same names in source and sink). Either of these approaches works.
Unfortunately, clicking the import schema button in the mapping tab doesn't work. It does produce the correct column mappings based on the columns in the source query but I still get the original error 'the column could not be located in the actual source' after doing this mapping.
Could you check whether there is a column named 'ae_type_id' in your schema? If so, could you remove that column and try again? The columns in the schema must be aligned with the columns in the query.
The issue is caused by an incomplete schema in one of the data sources. My solution is:
Step through the data flow, selecting the first schema, and use Import projection.
Go to the flow and Data Preview.
Repeat for each step.
In my case, there were trailing commas in one of the CSV files. This caused automatic column names to be created on import, which allowed me to find and fix the data file.
I have a simple data table.
Schemas>resources>tables>image_data contains the columns
image_id(integer) raw_data(bytea)
How can I export this data to files named based on the image_id? Once I figure that out I want to use the image_id to name the files based on a reference in another table.
So far I've got this:
SELECT image_id, encode(raw_data, 'hex')
FROM resources.image_data;
It generates a CSV with all the images as hex, but I don't know what to do with it.
Data-sample:
"image_id";"encode"
166;"89504e470d0a1a0a0000000d49484452000001e0000003160806000000298 ..."
I need to export data from a table in database A, then import it into an identically-structured table in database B. This needs to be done via phpMyAdmin. Here's the problem: no matter what format I choose for the export (CSV or SQL) ALL columns (including the auto-incremented ID field) get exported. Because there's already data in the table in database B, I can't import the ID field with the new records - I need it to import the records and assign new auto-incremented values to the records. What settings do I need to use in either the export (to be able to choose which columns to export) or the import (to tell it to ignore the ID column in the file)?
Or should I just export as CSV, then open in Excel and delete the ID column? Is there a way to tell phpMyAdmin that it should generate new auto-incremented IDs for the records being imported, without it telling me that there's an incorrect column count in the import file?
EDIT: to clarify, I'm exporting only data, not structure.
Excel is an option for removing the column, and probably the fastest at this point.
But if these databases are on the same server and you have access, you can just do an INSERT INTO databaseB.table (column_list) SELECT column_list FROM databaseA.table.
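For example (table and column names are placeholders; the ID column is simply left out of both lists):
INSERT INTO databaseB.customers (name, city)
SELECT name, city
FROM databaseA.customers;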
You can also run a SELECT statement to get just the desired columns and then export the results; that export link should be available in recent versions of phpMyAdmin.
It is several years since the original question, but this still came out top in a google search so I'll comment on what worked for me:
If I delete the Id column in my CSV and then try to import I get the 'Invalid column count in CSV input on line 1.' error.
But if I keep the Id column and change all of the Id values to NULL in Excel (just typing NULL into the cell), then when I import the file the auto-increment fills in the new records with consecutive numbers (presumably starting with the highest existing record Id + 1).
I'm using PHPMyAdmin 4.7.0
Another way is:
Go to the import menu for that table
Add the CSV file (without an ID column)
Pick CSV in the Format section
In the section where you pick the format of the file (which character separates fields, which character encloses fields, etc.) there's a field called "Column names". Type the names of the columns you ARE including, separated by commas, as in the example below.
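For example, if the table has columns id, name, and city (hypothetical names) and the CSV omits id, the Column names field would contain:
name,city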