Setting up your Python environment for Data Science (DS) can be a tricky task. If you’ve never set up something like that before, you might spend hours fiddling with different commands trying to get the thing to work. But we just want to get right to the DS!
In this tutorial, you will learn how to set up a stable Python data science development environment. You’ll be able to get right down into the DS without fighting package installation on every new project.
(1) Set up Python 3 and Pip
The first step is to install pip, Python’s package manager. On Ubuntu or another Debian-based system:
sudo apt-get install python3-pip
Using pip, we’ll be able to install any Python package that’s indexed in the Python Package Index with a simple pip install your_package. You’ll see soon how we use it to set up our virtual environment too.
Next, we’ll set Python 3 to be the default when running either the pip or python commands from the command line. This makes using Python 3 easier and more convenient. Otherwise, we’d have to remember to type out pip3 and python3 every time!
To force Python 3 to be the default, we’re going to modify the ~/.bashrc file. From the command line, execute the following command to open that file in an editor:
nano ~/.bashrc
Scroll down to the # some more ls aliases section and add the following lines, one for each command:
alias python='python3'
alias pip='pip3'
Save the file and reload your changes:
source ~/.bashrc
Boom! Python 3 is now your default Python! You can run it with a simple python your_program on the command line.
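A quick way to confirm the alias took effect is to ask the interpreter itself. The snippet below is only a sketch: the function name describe_interpreter is made up for illustration, and it assumes you save the code to a file of your choosing and run it with the python command in a fresh terminal.

```python
import sys


def describe_interpreter():
    """Return (major, minor, executable path) for the running interpreter."""
    return sys.version_info.major, sys.version_info.minor, sys.executable


major, minor, path = describe_interpreter()
print(f"Running Python {major}.{minor} from {path}")

if major < 3:
    # Reaching this branch suggests the alias in ~/.bashrc was not picked up.
    print("Still on Python 2: re-check the alias and run 'source ~/.bashrc'.")
```

Keep in mind that shell aliases only apply to interactive shells, so scripts and commands run through sudo will not see them.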
(2) Create a virtual environment
Now we’ll set up a virtual environment. In there, we’ll install all of the Python packages that we need for data science.
We use virtual environments to keep our coding setups separate. Imagine that at some point you want to work on two different projects on your computer, and they require different versions of the same libraries. Having them all in the same working environment gets messy, and you’ll likely run into conflicting library versions. Your DS code for project 1 needs version 1.0 of numpy, but project 2 needs version 1.15. Yikes!
A virtual environment allows us to isolate our working areas to avoid those conflicts.
First, install the relevant packages:
sudo pip3 install virtualenv virtualenvwrapper
Once we have virtualenv and virtualenvwrapper installed, we’ll again need to edit our ~/.bashrc file. Place these three lines right at the bottom:
export WORKON_HOME=$HOME/.virtualenvs
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
source /usr/local/bin/virtualenvwrapper.sh
Save the file and reload your changes:
source ~/.bashrc
Great! Now we can finally create our virtual environment like so:
mkvirtualenv DS
We’ve just created a virtual environment called DS. mkvirtualenv also activates it right away; to enter it again later, do this:
workon DS
Nice! Any library installations that you do while in the DS virtualenv will be isolated in there and never conflict with any other environments! So whenever you wish to run code that depends on libraries installed in the DS environment, enter it first with the workon command and then run your code as normal.
If you need to exit the virtualenv, run this command:
deactivate
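If you are ever unsure whether a script is running inside the DS environment or against the system Python, the interpreter can tell you. The following is a minimal sketch, not part of virtualenvwrapper itself; the helper name in_virtualenv is hypothetical. It relies on the fact that a virtual environment's prefix differs from the base installation's prefix, with a fallback for older virtualenv releases that set sys.real_prefix instead.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases expose sys.real_prefix; newer ones (and the
    # standard venv module) make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print("Inside a virtualenv:", in_virtualenv())
print("Environment prefix:", sys.prefix)
```

After workon DS, the printed prefix should point somewhere under the WORKON_HOME directory configured earlier; after deactivate, it should point back at the system installation.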
(3) Install data science libraries
Now we can install our DS libraries! We’ll go with the most commonly used ones:
numpy: for any work with matrices, especially math operations
scipy: scientific and technical computing
pandas: data handling, manipulation, and analysis
matplotlib: data visualisation
scikit-learn: machine learning (classification, regression, clustering, and more)
Here’s a simple trick to install all of those libraries in one quick shot! Create a requirements.txt file and list all of the packages you wish to install, one per line, like so:
numpy
scipy
pandas
matplotlib
scikit-learn
Once that’s done, just execute this command:
pip install -r requirements.txt
Voila! Pip will go ahead and install all of the packages listed in the file in one shot.
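To double-check that the install worked, you can ask Python which of the listed packages it can actually find. This is a rough sketch rather than an official pip feature: the PACKAGES mapping and the check_packages helper are names invented for this example, and importlib.metadata assumes Python 3.8 or newer. Note that scikit-learn is installed under that name but imported as sklearn.

```python
import importlib.util
from importlib import metadata  # available in Python 3.8+

# pip package name -> name used in "import ..." statements
PACKAGES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
}


def check_packages(packages=PACKAGES):
    """Map each pip name to its installed version, or None if it is missing."""
    report = {}
    for pip_name, import_name in packages.items():
        # find_spec returns None when the module cannot be located.
        if importlib.util.find_spec(import_name) is None:
            report[pip_name] = None
            continue
        try:
            report[pip_name] = metadata.version(pip_name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata was found.
            report[pip_name] = "unknown"
    return report


for name, version in check_packages().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the DS environment; anything reported as NOT INSTALLED was either skipped by pip or installed into a different environment.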
Congratulations, your environment is set up and you’re ready to do data science!
Source: George Seif
Also related: https://community.cloudera.com/t5/Community-Articles/Using-VirtualEnv-with-PySpark/ta-p/245905