spark-csv fails parsing with embedded html and quotes - scala

I have this csv file which contains description of several cities:
Cities_information_extract.csv
I can parse this file just fine using python's pandas.read_csv or R's read.csv. They both return 693 rows with 25 columns.
I am trying, unsuccessfully, to load the csv using Spark 1.6.0 and scala.
For this I am using spark-csv and commons-csv (which I have included in the spark jars path).
This is what I have tried:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
Then I've tried using the univocity parser:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("parserLib", "univocity").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
Inspecting the file, I've noticed the presence of several HTML tags in the description fields, with embedded quotes, like:
<div id="someID">
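To make the problem concrete, the description fields seem to produce records of this shape (a made-up example, only the structure matters):
42;"Paris";"Some intro text <div id="someID">more text</div>";"..."
Once the description field is opened with a quote, the quote right after id= is read as the end of the encapsulated token, and the next character is the s of someID rather than the ; delimiter, which matches the "invalid char between encapsulated token and delimiter" error.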
I've tried, using Python, to remove all the HTML tags with a regular expression:
import io
import re
pattern = re.compile("<[^>]*>")  # match all html tags <..>
with io.open("Cities_information_extract.csv", "r", encoding="utf-8") as infile:
    text = infile.read()
text = re.sub(pattern, " ", text)
with io.open("cities_info_clean.csv", "w", encoding="utf-8") as outfile:
    outfile.write(text)
Next, I've tried again with the new file, without the HTML tags:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:304)
[...]
And with the univocity parser:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("parserLib", "univocity").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
Both python and R are able to parse both files correctly, while spark-csv still fails. Any suggestions for parsing this csv file correctly with Spark and Scala?
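For reference, the next thing I plan to try is spelling out the quote and escape characters explicitly (untested sketch; quote and escape are documented spark-csv options, though I don't know whether the file actually escapes its inner quotes):
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
option("quote", "\"").   // explicitly declare the quote character
option("escape", "\\").  // and the escape character
load("path/to/Cities_information_extract.csv")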

Related

Open binary file data with Spark - ValueError: The truth value of a Series is ambiguous

I have the following binary file (mp3) whose audio I send to a service in Azure to be transcribed.
The following code works in Databricks.
import os
import requests

url = "https://endpoint_service"
headers = {
    'Ocp-Apim-Subscription-Key': 'MyKey',
    'Content-Type': 'audio/mpeg'
}

def send_audio_transcript(url, payload, header):
    """Send audio.mp3 to an Azure service to be transcribed to text."""
    response = requests.request("POST", url, headers=header, data=payload)
    return response.json()

full_path = <my_path>file.mp3
with open(full_path, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()

send_audio_transcript(url, fileContent, headers)  # a POST request; it works
But my audio files are in sensitive storage in a Data Lake and the only way to access them is via a Spark read.
Looking at the documentation, the way to read a binary file is:
df = spark.read.format("binaryFile").load(full_path)
display(df)
path || modificationTime || length || content
path || sometime || some_length || 2dnUwAC
First try:
content = df.content
test_service = send_audio_transcript(url, content , headers)
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Second try (convert Spark to pandas):
pandas_df = df.toPandas()
content = pandas_df["content"]
test_service = send_audio_transcript(url, content , headers)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What is the exact translation in Python/PySpark of:
with open(full_path, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
Your content data coming from Spark is not the same as the content data coming from opening the file.
From Spark (and later pandas) you get a pandas Series, but from opening the file you get a bytes object:
with open(full_path, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
print(type(fileContent))  # will return <class 'bytes'>
but from Spark
input_df = spark.read.format("binaryFile").load(full_path)
pandas_df = input_df.toPandas()
content = pandas_df['content']
print(type(content)) # return <class 'pandas.core.series.Series'>
In your case, to fix the problem you need to take just the first element of the Series.
content_good = content[0]
print(content_good)  # you have your <class 'bytes'>, which is what you need

Read file with New Line in Python

Need help, I receive a file with a new line inside one of the values.
name,age
"Maria",28
"Kevin",30
"Joseph",31
"Faith",20
"Arnel
",21
"Kate",40
How can I identify that line and remove it from the list?
The output should be:
name,age
"Maria",28
"Kevin",30
"Joseph",31
"Faith",20
"Kate",40
This is one approach:
import csv

data = []
with open(filename) as infile:
    reader = csv.reader(infile)
    for line in reader:
        if not line[0].endswith("\n"):
            data.append(line)

with open(filename, "w") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data)
You can also correct the entry using str.strip().
Ex:
import csv

data = []
with open(filename) as infile:
    reader = csv.reader(infile)
    for line in reader:
        if line[0].endswith("\n"):
            line[0] = line[0].strip()
        data.append(line)

with open(filename, "w") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data)

How to execute spark scala script saved in a text file

I have written a word-count Scala script in a text file and saved it in my home directory.
How can I call and execute the script file "wordcount.txt"?
If I try the command spark-submit wordcount.txt, it does not work.
Content of the "wordcount.txt" file:
val text = sc.textFile("/data/mr/wordcount/big.txt")
val counts = text.flatMap(line => line.split(" ").map(word => (word.toLowerCase(), 1)))
  .reduceByKey(_ + _)
  .sortBy(_._2, false)
counts.saveAsTextFile("count_output")
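For what it's worth, a possible route, sketched and untested: spark-submit expects a packaged application (a jar for Scala code), while the interactive shell can preload a script file at startup, e.g.
spark-shell -i wordcount.txt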

org.apache.spark.sql.AnalysisException: cannot resolve '`angelique`' given input columns

I have a table created from a data frame, and I try to run a query as shown below:
val sc = SparkSession.builder()
.master("local")
.appName("Lea")
.getOrCreate()
// example login = angelique
var login:String = (givenName+"."+sn).replaceAll(" ", "")
sc.sql("SELECT login FROM global_temp.users where login="+login).show
Output error:
17/04/26 10:11:01 INFO SparkSqlParser: Parsing command: users
17/04/26 10:11:01 INFO SparkSqlParser: Parsing command: SELECT login FROM global_temp.users where login=angelique
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`angelique`' given input columns: [idExterne, login, password, uid]; line 1 pos 48;
'Project ['login]
+- 'Filter (login#60 = 'angelique)
+- SubqueryAlias users, `global_temp`.`users`
+- Project [_1#50 AS idExterne#59, _2#51 AS login#60, _3#52 AS password#61, _4#53 AS uid#62]
Replace your query with this:
sc.sql(s"SELECT login FROM global_temp.users where login='$login'").show
It will work.
As far as I know, a dot is not permitted in a column name in Spark 2.0.1. I'm not sure it is permitted in a DF name. Maybe you can try to omit or replace the dot in the DF name?
try this
sc.sql(s"SELECT login FROM global_temp.users where login==='$login'").show
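For context, === is the equality operator of the DataFrame Column API, not SQL syntax, so it can't go inside a SQL string. A rough equivalent through that API (a sketch, assuming the global temp view shown in the error output) would be:
import sc.implicits._
sc.table("global_temp.users")
  .filter($"login" === login)  // Column comparison, not SQL text
  .select("login")
  .show()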

How to use Spark BigQuery Connector locally?

For test purposes, I would like to use the BigQuery Connector to write Parquet Avro logs to BigQuery. At the time of writing there is no way to ingest Parquet directly from the UI, so I'm writing a Spark job to do so.
In Scala, for the time being, the job body is the following:
val events: RDD[RichTrackEvent] =
  readParquetRDD[RichTrackEvent, RichTrackEvent](sc, googleCloudStorageUrl)

val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "myproject")

// Output parameters
val projectId = conf.get("fs.gs.project.id")
val outputDatasetId = "logs"
val outputTableId = "test"
val outputTableSchema = LogSchema.schema

// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
  conf, projectId, outputDatasetId, outputTableId, outputTableSchema
)
conf.set(
  "mapreduce.job.outputformat.class",
  classOf[BigQueryOutputFormat[_, _]].getName
)

events
  .mapPartitions {
    items =>
      val gson = new Gson()
      items.map(e => gson.fromJson(e.toString, classOf[JsonObject]))
  }
  .map(x => (null, x))
  .saveAsNewAPIHadoopDataset(conf)
As the BigQueryOutputFormat doesn't find the Google credentials, it falls back on the metadata host to try to discover them, with the following stack trace:
2016-06-13 11:40:53 WARN HttpTransport:993 - exception thrown while executing request
java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:160)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:207)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:72)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:81)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:101)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:89)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.<init>(BigQueryOutputCommitter.java:70)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:102)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:84)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:30)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1135)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
It is of course expected, but it should be able to use my service account and its key, since GoogleCredential.getApplicationDefault() returns appropriate credentials fetched from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
As the connector seems to read credentials from the Hadoop configuration, what are the keys to set so that it reads GOOGLE_APPLICATION_CREDENTIALS? Is there a way to configure the output format to use a provided GoogleCredential object?
If I understand your question correctly - you might want to set:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.email</name>
<name>mapred.bq.auth.service.account.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
Here, the mapred.bq.auth.service.account.keyfile should point to the full file path to the older-style "P12" keyfile; alternatively, if you're using the newer "JSON" keyfiles, you should replace the "email" and "keyfile" entries with the single mapred.bq.auth.service.account.json.keyfile key:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.json.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
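Set on the job's Hadoop configuration, that would look roughly like this (a sketch; the keyfile path and bucket name are placeholders):
val conf = sc.hadoopConfiguration
conf.set("mapred.bq.auth.service.account.enable", "true")
// placeholder path to the service-account JSON keyfile
conf.set("mapred.bq.auth.service.account.json.keyfile", "/path/to/service-account.json")
conf.set("mapred.bq.project.id", "myproject")
// placeholder GCS bucket the connector uses for staging data
conf.set("mapred.bq.gcs.bucket", "my-staging-bucket")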
Also, you might want to take a look at https://github.com/spotify/spark-bigquery, which is a much more civilised way of working with BQ and Spark. The setGcpJsonKeyFile method used in that case takes the same JSON file you'd set for mapred.bq.auth.service.account.json.keyfile when using the BQ connector for Hadoop.