Data Scientist's starter pack - Part 1
September 20, 2020 — 8 min
When working on a data science project, many skills may be required, from theoretical to technical ones. In this article, I will mainly focus on some of the most important tools to know and work with, tools that enable better, cleaner code and a faster way of working.
The summary is as follows:
- Visual Studio Code
- Bash commands
- Virtual environment
- Unit Testing
The daily tasks of a data scientist mainly revolve around coding. For this reason, it is important to have a good editor capable of encapsulating many other tools. VS Code does just that! It is a tool developed by Microsoft and is considered one of the most widely used editors. It lets you run, edit, and debug your code, and it supports many other useful extensions such as:
Python: “A Visual Studio Code extension with rich support for the Python language, including features such as IntelliSense, linting, debugging, code navigation, code formatting, Jupyter notebook support, refactoring, variable explorer, test explorer, snippets, and more!”
Gitlens: “GitLens supercharges the Git capabilities built into Visual Studio Code. It helps you to visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more.”
vscode-icons: brings icons to Visual Studio Code for a better representation and visualization of your files
Grammarly: an extension that corrects misspellings, which is very useful when writing articles ;)
Docker: “The Docker extension makes it easy to build, manage, and deploy containerized applications from Visual Studio Code. It also provides one-click debugging of Node.js, Python, and .NET Core inside a container.” If you are unfamiliar with docker, don’t worry about this part.
Black: a Python formatting library, which is installed by default (if not, install it with pip) and automatically formats your code on save. To enable this, add the following lines to the “settings.json” file of VS Code:
```json
"python.formatting.provider": "black",
"editor.formatOnSave": true
```
Here is an example of Black’s code formatting:
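For instance, Black would turn a loosely formatted function like the first one below into its normalized form (a hypothetical snippet, written for illustration):

```python
# Before Black: inconsistent spacing and cramped expressions
def scale(values,factor = 2):
    return [ v*factor for v in values ]

# After Black: normalized spacing; the behavior is unchanged
def scale(values, factor=2):
    return [v * factor for v in values]
```

Note that Black only changes presentation, never behavior: both versions return the same result.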
Better Comments: when writing code, you usually add comments to document your scripts, and this extension helps make them friendlier and easier to read. For instance, you can use the sign ! to turn a comment red and better attract your attention to it.
Here is an example of the highlight generated by the extension:
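As an illustration, and assuming the extension's default tags, comments could be annotated like this (a hypothetical snippet):

```python
# ! Alert: rendered in red, ideal for warnings
# ? Question: rendered in blue, for open points
# TODO: rendered in orange, for pending work
# * Highlight: rendered in green, for key remarks
def parse_age(value):
    # ! Raises ValueError on non-numeric input
    return int(value)
```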
For more info, please visit the official website of VS code.
Bash commands are very useful to quickly navigate your operating system and are efficient for handling files. They also come in handy when working on a virtual machine with no graphical interface, for instance an environment in the cloud. When working with bash commands, it is important to have a nice terminal capable of visually encapsulating many elements. Below are two consoles which I recommend:
iTerm2: it offers a very nice visual terminal on which you can see the current working directory, the Git branch you are on (project dev) and its status (yellow: changes waiting to be committed), as well as the virtual environment you are working in (project_env).
Cmder: an enhanced terminal for Windows that also lets you run Linux commands, which is very useful in some cases.
Do not hesitate to check out Powerlevel10k: it is a great theme to have on your terminal, which enables better styling and a more efficient, flexible way of working.
Here is a brief list of the most used command-lines (on macOS):
| Task | Command |
|---|---|
| Print the current working directory | `pwd` |
| List the contents of the current directory | `ls` |
| List detailed contents of the current directory | `ll` (commonly an alias for `ls -l`) |
| Change directory | `cd newpath` |
| Move back up one level | `cd ..` |
| Browse command history | up/down arrow keys |
| Create a folder | `mkdir folder1` |
| Create several folders | `mkdir folder1 folder2 …` |
| Create a file | `touch fileName.extension` |
| Open a file | `open fileName.extension` |
| Show a file's content | `cat fileName.extension` |
| Move a file | `mv fileOldPath fileNewPath` |
| Copy a file | `cp filePath destinationPath` |
| Copy a folder | `cp -r folderPath destinationPath` |
| Delete a file | `rm filePath.extension` |
| Delete a folder | `rm -r folderPath` |
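Putting a few of these together, a typical session could look like this (file and folder names are hypothetical):

```shell
mkdir demo_project                  # create a folder
cd demo_project                     # move into it
touch notes.txt                     # create an empty file
echo "hello bash" > notes.txt       # write some content into it
cat notes.txt                       # show the content
cp notes.txt backup.txt             # copy the file
ls                                  # list: backup.txt  notes.txt
cd ..                               # move back up one level
rm -r demo_project                  # clean up the folder
```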
When working on a project, you very often want it to be reproducible on other machines and in the cloud. Since the versions of the packages you use keep changing, it is important to set up a private, virtual environment in which you can develop your project and pin the versions of its packages.
virtualenv is a Python library that lets you achieve the isolation and versioning discussed above, as follows:
When you install Python for the first time on your machine, its default environment is called “base” or “global”. It is worth mentioning that newly created virtual environments do not share packages with base, which means that they are initially empty!
```shell
# PIP
cd projectFolder
pip install virtualenv      # if not done before
virtualenv <env>
source <env>/bin/activate

# Anaconda
cd projectFolder
conda create -n <env>
conda activate <env>
```
You then install packages progressively, as needed:

```shell
pip install <packageName>      # PIP
conda install <packageName>    # Anaconda
```

and once your project is stabilized:

```shell
# PIP
pip freeze > packages.txt

# Anaconda
conda env export > packages.yml
```
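The resulting “packages.txt” pins exact versions, so it might look like this (package versions are purely illustrative):

```
numpy==1.19.2
pandas==1.1.2
scikit-learn==0.23.2
```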
If you clone a project that contains a “packages.txt” file and want to run it, you can:
- First create and activate a new virtual environment
- Run the following command to install all the necessary packages in it from the “packages.txt” file :
```shell
# PIP
pip install -r packages.txt

# Anaconda
conda env create -f packages.yml   # no need to create an env beforehand
```
When creating a virtual environment, it is important to select it as your interpreter. VS Code is capable of detecting it and will suggest it for use; otherwise you can select it as follows:
- In settings, click on
Python: Select Interpreter
- Choose it from the list:
Please visit the website of virtualenv for more info.
Unit testing is a software testing technique that runs a sequence of individual tests verifying the correct functioning of a developed module or a part of it. These tests are especially valuable when adopting a test-driven development approach in your project, which consists of first writing the tests that your script should pass before starting its development.
In Python, you can do unit testing via the pytest framework, which is very convenient for both small and large projects.
Suppose we are developing a function f which, for each set of input values, should return an expected result. After writing the tests the function should pass, we develop it and then write a script with the following scheme:
```python
# Import packages
import pytest
import ...

# Create the function
def f(var1, var2, ...):
    return ...

# Unit tests
@pytest.mark.parametrize(
    "var1, var2, ..., expected_result",
    [
        (val1_1, val2_1, ..., expected_result_1),
        (val1_2, val2_2, ..., expected_result_2),
        ...
    ],
)
def test_func(var1, var2, ..., expected_result):
    assert f(var1, var2, ...) == expected_result
```
The script above runs all the tests and checks, one by one, that each output matches its expected value, by running the following command line:
```shell
python -m pytest -v --tb=line --disable-warnings pathtotestfile.py::function
```
- Example: the power function. For the sake of illustration, we will consider a simple function: the power of a number. The unit testing file is the following:
```python
# Import packages
import pytest

# Create the function
def f(a, n):
    return a ** n

# Unit tests
@pytest.mark.parametrize(
    "a, n, c",
    [
        (2, 3, 8),
        (5, 2, 25),
        (4, 2, 16),
    ],
)
def test_func(a, n, c):
    assert f(a, n) == c
```
We run the following command-line:
```shell
python -m pytest -v --tb=line --disable-warnings tests.py::test_func
```
We get the following results:
All three tests have passed. The more tests we write, the more stable the code becomes!
As a data scientist, it is very important to master these technologies in order to make your work more efficient and scalable. Tools like git are indispensable and make collaboration powerful and straight-forward.