Python 3.5 — Spark 2.0 — AWS Redshift
Check dependencies’ versions
I was trying to set up an EC2 Ubuntu machine to read data from AWS Redshift into Spark using Python. Our system information is as follows:
Python 3.5.2
Spark 2.0.0
Ubuntu 14.04.4
The following resources and articles were very useful along the way:
Spark-Redshift (main library)
Detailed explanation and dependencies from AWS
Bug
Another useful article
Repo to download a lot of dependencies
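With the jars downloaded, one way to make them visible to Spark is to pass them to pyspark at launch. This is a sketch, not a definitive command: the local paths are hypothetical, and the spark-redshift jar name shown is an assumption (pick the build that matches Spark 2.0 and your Scala version); the AWS SDK and JDBC driver versions are the ones discussed below.

```shell
# Launch pyspark with the Redshift-related jars on the classpath.
# The /opt/jars/ paths and the spark-redshift jar version are assumptions;
# adjust them to where you downloaded the dependencies.
pyspark --jars /opt/jars/spark-redshift_2.11-2.0.1.jar,/opt/jars/RedshiftJDBC41-1.2.1.1001.jar,/opt/jars/aws-java-sdk-1.7.4.2.jar
```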
Error > Invalid S3 URI: hostname does not appear to be a valid S3 endpoint
Solution > Use aws-java-sdk-1.7.4.2.jar instead of the latest version, and make sure the spark-redshift version you use matches your Spark version.
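The "Invalid S3 URI" step involves the S3 tempdir that spark-redshift stages data through, and besides the jar fix, Spark also needs S3 credentials for that bucket. One documented way is to set them on the Hadoop configuration; a minimal sketch (the helper name is ours, the property keys are the standard Hadoop s3n ones):

```python
def set_s3_credentials(hadoop_conf, access_key, secret_key):
    """Attach AWS credentials for the s3n:// filesystem that
    spark-redshift uses for its tempdir.

    In PySpark you would pass sc._jsc.hadoopConfiguration() here.
    """
    hadoop_conf.set("fs.s3n.awsAccessKeyId", access_key)
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", secret_key)
```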
Error > AWS Redshift Driver
Solution > Use RedshiftJDBC41-1.2.1.1001.jar; the latest driver versions are not compatible yet.
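Once the jars and credentials are in place, reading a Redshift table follows the spark-redshift data-source API. Below is a sketch, written for Python 3.5 (hence str.format rather than f-strings); the helper names are ours, and the host, table, and bucket values in the usage example are placeholders, not real endpoints.

```python
def redshift_jdbc_url(host, database, user, password, port=5439):
    """Build the JDBC URL expected by the Redshift driver."""
    return ("jdbc:redshift://{host}:{port}/{db}"
            "?user={user}&password={password}").format(
        host=host, port=port, db=database, user=user, password=password)


def read_redshift_table(spark, url, table, tempdir):
    """Read a Redshift table into a Spark DataFrame via spark-redshift.

    `spark` is an existing SparkSession; `tempdir` is an s3n:// path the
    cluster can write to (spark-redshift unloads the data there first).
    """
    return (spark.read
            .format("com.databricks.spark.redshift")
            .option("url", url)
            .option("dbtable", table)
            .option("tempdir", tempdir)
            .load())
```

Usage would look like `url = redshift_jdbc_url("mycluster.abc.us-east-1.redshift.amazonaws.com", "dev", "myuser", "mypassword")` followed by `df = read_redshift_table(spark, url, "events", "s3n://my-bucket/tmp/")`, with all of those values replaced by your own.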
I hope this post is useful for fellow data engineers and data scientists. If you run into other problems while setting this up, please let me know; I may have a solution for them.