I have a text file which looks like the one below.
HDR¶20200101
BDY¶1¶Jimmy
BDY¶1¶Something
TRL¶123
I would like to parse it into a Glue DynamicFrame, filtering out the header and trailer records, and assign ID and Name as the column names. I tried the code below and it doesn't seem to work.
dyf_test = glueContext.create_dynamic_frame.from_options(
    format_options={"withHeader": False, "separator": "¶"},
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://Files/test.gz"],
        "recurse": True,
    },
)
dyf_test = Filter.apply(
    frame=dyf_test,
    f=lambda row: (
        bool(re.match("HDR", row[0]))
        and bool(re.match("TRL", row[0]))
    )
)
Error : com.amazonaws.services.glue.util.FatalException: Unable to parse file: test.gz
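For what it's worth, here is a minimal PySpark sketch of one possible approach (this is not the code above; it assumes a Glue job where spark and glueContext already exist and that every BDY record has the three fields shown in the sample). Reading the file as plain text and splitting on ¶ avoids the CSV reader guessing the column count from the two-field HDR line, and the filter keeps only rows whose first field is neither HDR nor TRL, per the stated intent:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, split

# Read the gzipped file as plain text (one row per line, column "value").
raw = spark.read.text("s3://Files/test.gz")

# Split each line on the ¶ separator into an array column.
parts = raw.select(split(col("value"), "¶").alias("p"))

# Keep only body records (neither HDR nor TRL) and name the columns.
body = (parts
        .filter(~col("p")[0].isin("HDR", "TRL"))
        .select(col("p")[1].alias("ID"), col("p")[2].alias("Name")))

# Convert back to a Glue DynamicFrame if the rest of the job expects one.
dyf_test = DynamicFrame.fromDF(body, glueContext, "dyf_test")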
I have the following binary file (mp3) whose audio I send to a service in Azure to be transcribed.
The following code works in Databricks.
import os
import requests
url = "https://endpoint_service"
headers = {
    'Ocp-Apim-Subscription-Key': 'MyKey',
    'Content-Type': 'audio/mpeg'
}
def send_audio_transcript(url, payload, headers):
    """Send audio.mp3 to an Azure service to be transcribed to text."""
    response = requests.request("POST", url, headers=headers, data=payload)
    return response.json()
full_path = <my_path>file.mp3
with open(full_path, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
send_audio_transcript(url, fileContent, headers)  # the POST request works
But my audio files are in sensitive storage in a data lake, and the only way to access them is via a Spark read.
Looking at the documentation, the way to read a binary file is:
df = spark.read.format("binaryFile").load(full_path)
display(df)
path || modificationTime || length || content
path || sometime || some_lenght || 2dnUwAC
First try:
content = df.content
test_service = send_audio_transcript(url, content, headers)
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Second try (converting Spark to pandas):
pandas_df = df.toPandas()
content = pandas_df["content"]
test_service = send_audio_transcript(url, content, headers)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What is the exact translation into Python/PySpark of:
with open(full_path, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
The content data coming from Spark is not the same as the content data coming from opening the file.
From Spark (and later pandas) you have a pandas Series, but from opening the file you get a bytes object:
with open(full_path, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
print(type(fileContent))  # will return <class 'bytes'>
But from Spark:
input_df = spark.read.format("binaryFile").load(full_path)
pandas_df = input_df.toPandas()
content = pandas_df['content']
print(type(content)) # return <class 'pandas.core.series.Series'>
In your case, to fix the problem, you need to take just the first element of the Series:
content_good = content[0]
print(content_good)  # now you have your <class 'bytes'>, which is what you need
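Putting it together, a minimal sketch of the whole flow (this reuses the url, headers, full_path and send_audio_transcript from the question above; the variable names here are illustrative):
# Read the binary file with Spark and pull out the raw bytes of the first row.
input_df = spark.read.format("binaryFile").load(full_path)
file_content = input_df.select("content").first()["content"]

# requests accepts bytes-like objects, so the payload can be sent as-is
# (wrap it in bytes(...) if the service is picky about the exact type).
result = send_audio_transcript(url, bytes(file_content), headers)
print(result)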
I am trying to create a table in the Apache Flink SQL client. I want to filter my JSON data, which arrives continuously from a Kafka cluster, in Flink.
The JSON looks like this:
{"lat":25.77,"lon":-80.19,"timezone":"America\/New_York",
"timezone_offset":-14400,
"current.dt":1592151550,
"current.sunrise":1592130546,
"current.sunset":1592179999,
"current.temp":302.77,
"current.feels_like":306.9,
"current.pressure":1017,
"current.humidity":78,
"current.dew_point":298.52,
"current.uvi":11.97,
"current.clouds":75,
"current.visibility":16093,
"current.wind_speed":3.6,
"current.wind_deg":60,
"current.weather.0.id":803,
"current.weather.0.main":"Clouds",
"current.weather.0.description":"broken clouds",
"current.weather.0.icon":"04d"}
The part I am interested in:
"current.weather.0.description":"broken clouds"
I want to filter my data whenever the current.weather description is "moderate rain". I tried to create two tables in Flink:
the Rain table, where the whole JSON arrives, and
the ProcessedRain table, where my filtered data will be stored and sent back to another Kafka cluster.
CREATE TABLE Rain (
current.weather.0.description varchar
) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherRawData',
'format.type' = 'json',
'connector.properties.0.key' = 'bootstrap.servers',
'connector.properties.0.value' = 'kafka:9092',
'connector.properties.1.key' = 'group.id',
'connector.properties.1.value' = 'flink-input-group',
'connector.startup-mode' = 'earliest-offset'
);
CREATE TABLE ProcessedRain(
current.weather.0.description varchar
) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherProcessedData',
'format.type' = 'json',
'connector.properties.0.key' = 'bootstrap.servers',
'connector.properties.0.value' = 'kafka:9092',
'connector.properties.1.key' = 'group.id',
'connector.properties.1.value' = 'flink-output-group'
);
The error message I get:
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.SqlParserException: SQL parse failed. Encountered "current" at line 1, column 20. Was expecting one of:
"PRIMARY" ...
"UNIQUE" ...
"WATERMARK" ...
<BRACKET_QUOTED_IDENTIFIER> ...
<QUOTED_IDENTIFIER> ...
<BACK_QUOTED_IDENTIFIER> ...
<IDENTIFIER> ...
<UNICODE_QUOTED_IDENTIFIER> ...
How should my CREATE TABLE statement be written correctly?
I think it should be
CREATE TABLE ProcessedRain (
`current.weather.0.description` VARCHAR
) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'WeatherProcessedData',
'format.type' = 'json',
'connector.properties.bootstrap.servers' = 'kafka:9092',
'connector.properties.group.id' = 'flink-output-group'
);
I have this CSV file, which contains descriptions of several cities:
Cities_information_extract.csv
I can parse this file just fine using Python's pandas.read_csv or R's read.csv. Both return 693 rows and 25 columns.
I am trying, unsuccessfully, to load the CSV using Spark 1.6.0 and Scala.
For this I am using spark-csv and commons-csv (which I have included in the Spark jars path).
This is what I have tried:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
Then I've tried using the univocity parser:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("parserLib", "univocity").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
Inspecting the file, I've noticed several HTML tags in the description fields, with embedded quotes, like:
<div id="someID">
Using Python, I've tried to remove all HTML tags with regular expressions:
import io
import re
pattern = re.compile("<[^>]*>")  # find all html tags <..>
with io.open("Cities_information_extract.csv", "r", encoding="utf-8") as infile:
    text = infile.read()
text = re.sub(pattern, " ", text)
with io.open("cities_info_clean.csv", "w", encoding="utf-8") as outfile:
    outfile.write(text)
Next, I've tried again with the new file, without the HTML tags:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:304)
[...]
And with the univocity parser:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("parserLib", "univocity").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
Both Python and R are able to parse both files correctly, while spark-csv still fails. Any suggestions for correctly parsing this CSV file using Spark/Scala?
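For comparison, a minimal PySpark sketch of the quote/escape handling that often clears up both the "invalid char between encapsulated token and delimiter" and the "EOF reached before encapsulated token finished" errors, which typically point at embedded quotes or at quoted fields spanning several lines. This sketch assumes the CSV reader built into newer Spark versions rather than Spark 1.6 with spark-csv (spark-csv also accepts quote and escape options; multiLine assumes the newer built-in reader), and reuses the path from the question:
# Treat a doubled quote inside a quoted field as an escaped quote (RFC 4180
# style) instead of the backslash default, and allow quoted fields to span lines.
cities_info = (spark.read
    .option("header", "true")
    .option("inferSchema", "false")
    .option("delimiter", ";")
    .option("quote", "\"")
    .option("escape", "\"")
    .option("multiLine", "true")
    .csv("path/to/Cities_information_extract.csv"))

print(cities_info.count())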
I have to read 3 different lines from log files, based on some text, and then output the fields to a CSV file.
Sample log data:
20110607 095826 [.] !! Begin test. Script filename/text.txt
20110607 095826 [.] Full path: filename/test/text.txt
20110607 095828 [.] FAILED: Test Failed()..
I have to read the file name after "!! Begin test. Script". This is my conf file:
filter {
  grok {
    match => { "message" => "%{BASE10NUM:Date}%{SPACE:pat}{BASE10NUM:Number}%{SPACE:pat}[.]%{SPACE:pat}%{SPACE:pat}!! Begin test. Script%{SPACE:pat}%{GREEDYDATA:file}" }
    overwrite => ["message"]
  }
  if "_grokparserfailure" in [tags] {
    drop {}
  }
}
But it's not giving me a single parsed record; it's passing the full log file through in JSON format with no parsed fields.
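As a side note, a quick Python sanity check with an equivalent regular expression can help confirm what the grok pattern needs to capture before wiring it into Logstash (this is only an illustration; the field names and the \S+ for the file path are assumptions, not part of the conf above):
import re

sample = "20110607 095826 [.] !! Begin test. Script filename/text.txt"

# Date, time, the literal "[.]", the literal "!! Begin test. Script", then the file path.
pattern = re.compile(r"^(\d{8})\s+(\d{6})\s+\[\.\]\s+!! Begin test\. Script\s+(\S+)")

match = pattern.match(sample)
if match:
    log_date, log_time, script_file = match.groups()
    print(log_date, log_time, script_file)  # 20110607 095826 filename/text.txt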
I was trying to stream data into BigQuery using tabledata.insert_all:
job_data = {
  kind: 'bigquery#tableDataInsertAllRequest',
  rows: [
    { json: { column_name: value } }
  ]
}
response = execute(
  api_method: bigquery.tabledata.insert_all,
  parameters: {
    projectId: config['project_id'],
    datasetId: DATASET_ID,
    tableId: table_id
  },
  body_object: job_data
)
But I always get the following error message
Google::APIClient::Request Sending API request post https://www.googleapis.com/bigquery/v2/projects/propane-tribute-90023/datasets/development/tables/api_requests_20150414/insertAll {"User-Agent"=>"My Test App/1.0 google-api-ruby-client/0.8.5 Mac OS X/10.9.5\n (gzip)", "Content-Type"=>"application/json", "Accept-Encoding"=>"gzip", "Authorization"=>"Bearer ya29.VgFYvU2nxGDhWiCdS47XRw0J-7GLenRry0Cd3AA2D1RDzMh5gnf-m85I5GeSr9oNW51OuUb9mdwObg", "Cache-Control"=>"no-store"}
Decompressing gzip encoded response (155 bytes)
Decompressed (261 bytes)
Google::APIClient::Request Result: 400 {"Vary"=>"X-Origin", "Content-Type"=>"application/json; charset=UTF-8", "Date"=>"Wed, 15 Apr 2015 03:14:17 GMT", "Expires"=>"Wed, 15 Apr 2015 03:14:17 GMT", "Cache-Control"=>"private, max-age=0", "X-Content-Type-Options"=>"nosniff", "X-Frame-Options"=>"SAMEORIGIN", "X-XSS-Protection"=>"1; mode=block", "Server"=>"GSE", "Alternate-Protocol"=>"443:quic,p=0.5", "Transfer-Encoding"=>"chunked"} => {"error"=>{"errors"=>[{"domain"=>"global", "reason"=>"invalid", "message"=>"No rows present in the request.", "locationType"=>"other", "location"=>"rows"}], "code"=>400, "message"=>"No rows present in the request."}}
Does anyone have the same issue and know how to fix it?
Thanks.
Make sure you are providing appropriate values for all of the column headers present in your table’s schema. Providing separate “json” entries will populate individual rows with the column data you provide. Unless you have already assigned values to the variables named column_name and value, you need to provide those values in the statement following the json declaration.
Sample Ruby syntax for a tabledata.insert_all “rows” operation would look as follows:
body = {
  "rows" => [
    { "json" => { "person_id" => 10, "person_name" => "test" } },
    { "json" => { "person_id" => 11, "person_name" => "test2" } }
  ]
}