Data Scientist's starter pack - Part 1

September 20, 2020 — 8 min

When working on a data science project, many skills may be required, from theoretical to technical ones. In this article, I will mainly focus on some of the most important tools to have and work with: tools that allow better, cleaner coding and a faster way of working.

Table of contents

The summary is as follows:

  1. Visual Studio Code
  2. Bash commands
  3. Virtual environment
  4. Unit Testing

1. Visual Studio Code

The daily tasks of a data scientist are mainly about coding. For this reason, it is important to have a good editor that can integrate many other tools. VS Code does that! It is a tool developed by Microsoft and is considered to be one of the most widely used editors. It lets you run, edit and debug your code, and offers many useful extensions such as:

  • Python: “A Visual Studio Code extension with rich support for the Python language, including features such as IntelliSense, linting, debugging, code navigation, code formatting, Jupyter notebook support, refactoring, variable explorer, test explorer, snippets, and more!”
  • Gitlens: “GitLens supercharges the Git capabilities built into Visual Studio Code. It helps you to visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more.”
  • vscode-icons: brings icons to Visual Studio Code for a clearer view of your files and folders.
  • Grammarly: an extension that corrects spelling mistakes, which is very useful when writing articles ;)
  • Docker: “The Docker extension makes it easy to build, manage, and deploy containerized applications from Visual Studio Code. It also provides one-click debugging of Node.js, Python, and .NET Core inside a container.” If you are unfamiliar with docker, don’t worry about this part.
  • Black: a Python code formatter that can automatically format your code on save. If it is not already installed, you can install it with pip (pip install black). To enable it, add the following lines to the “settings.json” file of VS Code:

    "python.formatting.provider": "black",
    "editor.formatOnSave": true

Here is an example of Black’s code formatting:
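
As a small illustration, here is a hand-written sketch of the kind of changes Black makes. It was not captured from an actual run, and the names scale and config are made up for the example. The unformatted input is shown in the comment, and the formatted result sits below it:

```python
# Before formatting (as typed, with inconsistent spacing and quotes):
#
#   def scale(values,factor = 2) :
#       return [ v*factor for v in values ]
#   config = {'name':'demo','sizes':[1,2,3]}

# After formatting, roughly what Black is expected to produce:
def scale(values, factor=2):
    return [v * factor for v in values]


config = {"name": "demo", "sizes": [1, 2, 3]}

print(scale(config["sizes"]))
```

The behavior of the code is unchanged; only spacing, quoting and blank lines are normalized.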

  • Better Comments: when writing code, you usually add comments to document your scripts, and this extension helps make them friendlier and easier to understand. For instance, you can start a comment with the ! sign to make it red and thus draw your attention to it.

Here is an example of the highlight generated by the extension:

[Screenshot: comments highlighted by Better Comments]
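
In plain text, the annotated comments could look like the sketch below. It assumes the extension's default tags (!, ?, TODO and *), whose colors can be changed in the settings. The function mean is only a made-up example to carry the comments:

```python
def mean(values):
    # ! Alert: raises ZeroDivisionError when values is empty
    # ? Question: should missing values be skipped here?
    # TODO: accept an optional list of weights
    # * Highlight: true division, so mean([1, 2, 3]) is 2.0
    return sum(values) / len(values)


print(mean([1, 2, 3]))
```

The tags are ordinary Python comments, so the code runs the same with or without the extension installed.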

For more info, please visit the official website of VS code.


2. Bash commands

Bash commands are very useful for quickly navigating your operating system and are efficient for handling files. They also come in handy when working on a machine with no graphical interface, such as a virtual machine in the cloud. When working with bash commands, it is important to have a good terminal that can display many pieces of information at a glance. Below are two terminals that I recommend:

  • Mac: iTerm2, which offers a very nice visual terminal where you can find information about the current working directory, git, its branch (project dev) and its status (yellow: waiting for commit), and also the virtual environment in which you are working (project_env).
  • Windows: Cmder, an enhanced terminal that also lets you run Linux commands, which is very useful in some cases.

Do not hesitate to check out Powerlevel10k, a theme for the Zsh shell that gives your terminal a better style and a more efficient and flexible way of working.

Here is a brief list of the most commonly used commands (on macOS):

Action                                            Bash command
Current working directory                         pwd
Listing the content of the current directory      ls
Details on the content of the current directory   ls -l (often aliased as ll)
Change directory                                  cd newpath
Move back up one level                            cd ..
Command history                                   up/down arrow keys
Create a folder                                   mkdir folder1
Create many folders                               mkdir folder1 folder2 …
Create a file                                     touch fileName.extension
Open a file                                       open fileName.extension
Show the content of a file                        cat fileName.extension
Move a file                                       mv fileOldPath fileNewPath
Copy a file                                       cp filePath destinationPath
Copy a folder                                     cp -r folderPath destinationPath
Delete a file                                     rm filePath.extension
Delete a folder                                   rm -r folderPath
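
When the same file operations have to run inside a Python script rather than in the terminal, the standard library offers close equivalents. The sketch below is only an illustration: the file and folder names are made up, and everything happens in a temporary directory so that nothing on disk is touched.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory that is deleted at the end.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    (root / "folder1").mkdir()                           # mkdir folder1
    notes = root / "folder1" / "notes.txt"
    notes.touch()                                        # touch notes.txt
    notes.write_text("hello\n")
    content = notes.read_text()                          # cat notes.txt

    copied = root / "copy.txt"
    shutil.copy(notes, copied)                           # cp notes.txt copy.txt
    moved = root / "moved.txt"
    shutil.move(str(copied), str(moved))                 # mv copy.txt moved.txt
    shutil.copytree(root / "folder1", root / "folder2")  # cp -r folder1 folder2

    moved.unlink()                                       # rm moved.txt
    shutil.rmtree(root / "folder1")                      # rm -r folder1

    listing = sorted(p.name for p in root.iterdir())     # ls
    print(content.strip(), listing)
```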

3. Virtual environment

When working on a project, you very often want it to be reproducible on other machines and in the cloud. Since the versions of the packages you use keep changing, it is important to set up a private, or virtual, environment in which you can develop your project and record the versions of the packages. virtualenv is a Python library that lets you carry out the isolation and versioning discussed above.

When you install Python for the first time on your laptop, its default environment is called “base” or “global”. It is worth mentioning that newly created virtual environments do not communicate with base, which means that they are initially empty!

Create and activate a virtual environment

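#PIP
cd <project_folder>
pip install virtualenv # if not done before
virtualenv <env>
source <env>/bin/activate # on Windows: <env>\Scripts\activate

#Anaconda
cd <project_folder>
conda create -n <env> python # "python" gives the environment its own interpreter and pip
conda activate <env>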
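
To check from Python that the environment is really active, one option is to compare the interpreter's prefixes. This is a minimal sketch, assuming Python 3 and a recent virtualenv (older releases exposed sys.real_prefix instead). The helper names in_virtualenv and active_conda_env are made up for the example. Conda environments are full installations, so they are detected through the CONDA_DEFAULT_ENV variable that conda sets on activation.

```python
import os
import sys


def in_virtualenv():
    """Return True when running inside a venv/virtualenv environment."""
    # In such an environment sys.prefix points at the environment, while
    # sys.base_prefix still points at the Python installation it was made from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def active_conda_env():
    """Return the name of the active conda environment, or None."""
    return os.environ.get("CONDA_DEFAULT_ENV")


print("Interpreter:", sys.executable)
print("Inside a virtualenv:", in_virtualenv())
print("Active conda environment:", active_conda_env())
```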

Install packages progressively as you need them, with pip install <packageName> or conda install <packageName>. Once your project has stabilized, save the list of installed packages and their versions:

#PIP
pip freeze > packages.txt

#Anaconda
conda env export > packages.yml
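
To get a feel for what such a file records, the sketch below lists the packages visible to the current interpreter in the same name==version form. It is only an approximation of pip freeze, which also handles cases such as editable installs, and it assumes Python 3.8 or later for importlib.metadata. The function name frozen_requirements is made up for the example.

```python
from importlib import metadata


def frozen_requirements():
    """Return 'name==version' lines for the packages visible to this interpreter."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata has no name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in frozen_requirements():
    print(line)
```

Run inside a freshly created environment, the list should be very short, which illustrates that the environment starts out nearly empty.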

If you clone a project that comes with a “packages.txt” file, you can run it as follows:

  • First create and activate a new virtual environment
  • Then run the following command to install all the necessary packages in it from the “packages.txt” file:

#PIP
pip install -r packages.txt

#Anaconda
conda env create -f packages.yml #No need to create an env before

VS Code & virtual environments

Once a virtual environment has been created, it’s important to select it as the interpreter. VS Code can usually detect it and will suggest it for use; otherwise, you can choose it as follows:

  • Open the Command Palette (from the settings menu, or with Ctrl+Shift+P / Cmd+Shift+P on macOS)
  • Type: Python: Select Interpreter
  • Choose your environment from the list of interpreters that appears

Please visit the website of virtualenv for more info.


4. Unit Testing

Unit testing is a software testing technique that runs a sequence of individual tests to verify that a developed module, or a part of it, works correctly. These tests are especially useful when adopting a test-driven development approach in your project, which consists of first writing the tests that your script should pass before starting its development. In Python, you can write unit tests with the pytest framework, which is convenient for both small and big projects. Suppose we are developing a function f which, for each set of values (var1, var2, ...), should return an expected_result. After writing the tests that should pass, we develop the function and then write a script with the following scheme:

#Import packages
import pytest
import ...

#Create function
def f(var1, var2,...):
	return ...

#Unit tests
@pytest.mark.parametrize(
	"var1, var2,...,expected_result",
	[
		(val1_1, val2_1,...,expected_result_1),
		(val1_2, val2_2,...,expected_result_2),
		...
	]
)
def test_func(var1, var2,..., expected_result):
	assert f(var1,var2,...)==expected_result

The script above runs every test case (var1_i, var2_i, ...) one by one and checks that f returns the corresponding expected_result_i. To run it, use the following command line:

python -m pytest -v --tb=line --disable-warnings pathtotestfile.py::function

Example: Power function

For the sake of illustration, we will consider a simple function: the power of a number. The unit testing file is the following:

# Import packages
import pytest

# Create function
def f(a, n):
    return a ** n


# Unit tests
@pytest.mark.parametrize(
    "a, n, c",
    [
        (2, 3, 8),
        (5, 2, 25),
        (4, 2, 16),
    ],
)
def test_func(a, n, c):
    assert f(a, n) == c

We run the following command-line:

python -m pytest -v --tb=line --disable-warnings tests.py::test_func

We get the following results:

[Screenshot: pytest output for the three test cases]

The three tests have passed. The more tests we write, the more stable the code becomes!
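
As a possible next step, the same file could be extended with a few edge cases and a check for invalid input. The sketch below is only a suggestion: the test names test_edge_cases and test_invalid_input are made up, and it relies on pytest.approx and pytest.raises, two helpers provided by pytest.

```python
import pytest


def f(a, n):
    return a ** n


@pytest.mark.parametrize(
    "a, n, c",
    [
        (2, 0, 1),     # 2 to the power 0 is 1
        (-3, 3, -27),  # negative base, odd exponent
        (2, -1, 0.5),  # a negative exponent gives a float
    ],
)
def test_edge_cases(a, n, c):
    # pytest.approx avoids spurious failures caused by float rounding
    assert f(a, n) == pytest.approx(c)


def test_invalid_input():
    # A string base is not supported by ** and should raise TypeError
    with pytest.raises(TypeError):
        f("2", 3)
```

These tests can be run like the previous ones, for example with python -m pytest -v tests.py if they are saved in the same file.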

Conclusion

As a data scientist, it is very important to master these tools in order to make your work more efficient and scalable. Tools like git are also indispensable and make collaboration powerful and straightforward.