This is just a short post, but it’s a prelude to a larger one I’m writing on record linkage using Synapse Spark pools, with this step being a prerequisite.
Synapse Spark comes with a huge number of pre-installed libraries that should cover most of your needs. However, there will always be times when you need to bring in external libraries for a particular piece of work.
This functionality is well supported in Synapse, with decent online guidance here. However, for a non-native Python developer like myself there were a couple of sections that weren’t 100% clear as I imported the packages, so in case there are any other folks out there in a similar boat, here’s what to do (but do read the link above; it’s a great place to start and you might be fine on that alone!)
In this scenario, I needed a Python library called Splink that wasn’t installed on the cluster, indicated by this error –
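The exact message from the screenshot isn’t reproduced here, but in Python a missing package surfaces as a ModuleNotFoundError. As a sketch, you can probe for a module without blowing up your notebook (splink is the package from this post; sys is just a known-good control):

```python
import importlib.util

# Probe whether a module is importable without actually importing it.
# find_spec returns None when the package isn't on the cluster.
for name in ("sys", "splink"):
    spec = importlib.util.find_spec(name)
    print(f"{name}: {'installed' if spec else 'NOT installed'}")
```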
A quick check of installed libraries confirmed that Splink wasn’t installed:
import pkg_resources

for d in pkg_resources.working_set:
    print(d)
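For reference, newer setuptools releases deprecate pkg_resources; the standard-library importlib.metadata produces an equivalent listing if you’d rather avoid the deprecation warning:

```python
from importlib import metadata

# List every distribution visible to the current interpreter,
# sorted by name for easier scanning.
installed = sorted(
    (dist.metadata["Name"] or "", dist.version)
    for dist in metadata.distributions()
)
for name, version in installed:
    print(name, version)
```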
Installing the missing library via a Requirements File
A requirements.txt file is a plain-text file that you upload to the Spark cluster; when the cluster starts, Synapse runs the equivalent of a “pip install” for every package listed in the file. You add your extra packages to this file and restart the cluster (or force apply).
Requirements.txt file contents –
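The screenshot isn’t reproduced here, but the file itself is just one package name per line in standard pip requirements format. A minimal version for this scenario (only splink is needed; the pinning comment is illustrative — use `package==x.y.z` if you want reproducible pools):

```
# requirements.txt - one package per line, optionally pinned (e.g. package==x.y.z)
splink
```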
Then, to upload the file to your cluster, navigate to “Manage”, choose “Spark pools”, and click the three dots on the Spark pool you want to add the package to. From there, upload your requirements file and click “Apply”.
Once applied, you can run the same code as before and verify that the package has been installed.
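As a sketch of that verification (assuming the pool has restarted with the new file applied), the earlier listing can be narrowed to a simple membership check rather than eyeballing the full output:

```python
import pkg_resources

# Collect the installed distribution names and check for the new package.
names = {dist.project_name.lower() for dist in pkg_resources.working_set}
print("splink installed:", "splink" in names)
```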
You can now use your new libraries as needed 🙂
Hope that helps!