Data Scientist's starter pack

September 20, 2020 — 10 min

A data science project calls for many skills, from theoretical to technical ones. In this article, I will focus on some of the most important tools to have and work with: tools that allow better, cleaner coding and faster collaboration.

Table of contents

The summary is as follows:

  1. Visual Studio Code
  2. Virtual environment
  3. Bash commands
  4. Git & Github

1. Visual Studio Code


The daily tasks of a data scientist are mainly about coding. For this reason, it is important to have a good editor capable of integrating many other tools. VS Code does that! It is an editor developed by Microsoft and one of the most widely used. It lets you run, edit and debug your code, and offers many useful extensions such as:

  • Python: “A Visual Studio Code extension with rich support for the Python language, including features such as IntelliSense, linting, debugging, code navigation, code formatting, Jupyter notebook support, refactoring, variable explorer, test explorer, snippets, and more!”
  • Gitlens: “GitLens supercharges the Git capabilities built into Visual Studio Code. It helps you to visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more.”
  • vscode-icons: brings icons to Visual Studio Code for clearer file representation and visualization
  • Grammarly: an extension that corrects misspellings, which is very useful when writing articles ;)
  • Docker: “The Docker extension makes it easy to build, manage, and deploy containerized applications from Visual Studio Code. It also provides one-click debugging of Node.js, Python, and .NET Core inside a container.” If you are unfamiliar with Docker, don’t worry about this part.
  • Black: a Python formatting library (install it with pip install black if it is not already available) that can automatically format your code on save. To do so, you can add the following lines to the “settings.json” file of VS Code:

    "python.formatting.provider": "black",
    "editor.formatOnSave": true

Here is an example of Black’s code formatting:

For more info, please visit the official website of VS code.


2. Virtual environment


When working on a project, you very often want it to be reproducible on other machines and in the cloud. Since the versions of the packages you use keep changing, it is important to set up a private, or virtual, environment in which you can develop your project and record the versions of the packages. virtualenv is a Python library that lets you carry out the isolation and versioning discussed above, as follows:

[Figure: virtual environment]

When you install Python for the first time on your laptop, its default environment is called “base” or “global”. It is worth mentioning that newly created virtual environments do not communicate with base, which means that they are initially empty!

Create and activate a virtual environment

cd projectFolder
pip install virtualenv # if not done before
virtualenv <env>
source <env>/bin/activate
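To check that an activated environment really is separate from base, here is a minimal sketch. It rests on a few assumptions of mine: it uses Python's built-in venv module instead of virtualenv, it passes --without-pip so that nothing is downloaded, and it works in a temporary folder with an environment simply named env.

```shell
# Hypothetical scratch location; any project folder works.
project_dir="$(mktemp -d)"
cd "$project_dir"

# Assumption: python3 is on the PATH. --without-pip keeps the sketch offline;
# drop it in a real project so that pip is available inside the environment.
python3 -m venv --without-pip env

# POSIX spelling of "source env/bin/activate".
. env/bin/activate

# Inside an active environment, sys.prefix points at the environment folder
# while sys.base_prefix still points at the base installation.
isolated="$(python -c 'import sys; print(sys.prefix != sys.base_prefix)')"
echo "isolated from base: $isolated"

# Leave the environment and return to base.
deactivate
```

If the two prefixes differ, the interpreter you are running belongs to the environment and not to base.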

Install packages (pip install packageName) progressively when needed and, once your project is stabilized, save their versions:

pip freeze > packages.txt

If you clone a project that ships a “packages.txt” file and want to run it, you can:

  • First create and activate a new virtual environment
  • Run the following command to install all the necessary packages in it from the “packages.txt” file:
pip install -r packages.txt

VS Code & virtual environments

When creating a virtual environment, it’s important to select it as the interpreter. VS Code is usually capable of detecting it and will suggest it for use; otherwise you can choose it as follows:

  • Open the Command Palette (gear icon at the bottom left, or Ctrl+Shift+P / Cmd+Shift+P)
  • Type: Python: Select Interpreter
  • Choose it from the list:
[Figure: interpreters list]

Please visit the website of virtualenv for more info.


3. Bash commands


Bash commands are very useful for navigating quickly through your operating system and are efficient for handling files. They become essential when working on a machine with no graphical interface, such as a virtual machine in the cloud. Here is a brief list of the most used commands (on macOS):

Action                                          | Bash command
------------------------------------------------|-------------------------------------
Current working directory                       | pwd
List the content of the current directory       | ls
Details on the content of the current directory | ls -l (ll is a common alias)
Change directory                                | cd newpath
Move up one level                               | cd ..
Command history                                 | up/down arrow keys
Create a folder                                 | mkdir folder1
Create many folders                             | mkdir folder1 folder2 …
Create a file                                   | touch fileName.extension
Open a file                                     | open fileName.extension
Show the content of a file                      | cat fileName.extension
Move a file                                     | mv fileOldPath fileNewPath
Copy a file                                     | cp filePath destinationPath
Copy a folder                                   | cp -r folderPath destinationPath
Delete a file                                   | rm filePath.extension
Delete a folder                                 | rm -r folderPath
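To see several of these commands working together, here is a small sketch that runs in a throwaway folder created with mktemp. The folder and file names (raw, processed, backup, data.csv) are hypothetical ones picked for the illustration.

```shell
# Work in a temporary directory so nothing on the machine is touched.
sandbox="$(mktemp -d)"
cd "$sandbox"

mkdir raw processed                  # create two folders at once
touch raw/data.csv                   # create an empty file
echo "id,value" > raw/data.csv       # write a header line into it
cat raw/data.csv                     # show the content

cp raw/data.csv processed/data_copy.csv          # copy a file
mv processed/data_copy.csv processed/clean.csv   # rename (move) it
cp -r processed backup               # copy a whole folder

rm raw/data.csv                      # delete a file
rm -r raw                            # delete a folder
ls                                   # remaining entries: backup and processed
```

Note that rm deletes immediately, without passing through a trash folder, so it is worth double-checking the path before running it on real data.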

4. Git & Github


Git is a version control tool used mainly for three reasons:

  • Versioning your code over time
  • Keeping track of the changes made
  • Allowing several people to work in parallel

To do so, git works in three stages:

  1. Working directory: the local folder hosting all the files of your project; more specifically, the folder where git was initialized
  2. Staging area (index): it records each modified file that you add for the next commit
  3. Local git repository: each commit stores a snapshot (a version) of the staged changes, labelled with a message

Before digging into the command-lines of git let’s first see how to install and configure it.

Installing Git

  • Windows: download the Windows Git installer from the following website
  • Mac: most versions of macOS already have git installed.

Configuring Git

git config --global user.name "Name"
git config --global user.email "mail"

Hands-on

  • Creating your first Git repo

Once you have installed and configured git, you can create your first git project using the following command-lines:

mkdir projectFolder
cd projectFolder
git init #initialize git on your project
git status
touch file.md # create new file
open file.md # open the file and modify it
git status # file untracked -> add it to git index
git add file.md
git commit -m "file.md modified ..."
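To watch a file travel through the three stages, here is a minimal sketch. It assumes git is installed and works in a temporary folder; the identity values and the file content are placeholders of mine, set locally so the sketch does not depend on your global configuration.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "# notes" > file.md
# "?? file.md": the file is untracked and lives in the working directory only.
stage_working="$(git status --porcelain)"

git add file.md
# "A  file.md": the file is now recorded in the staging area.
stage_index="$(git status --porcelain)"

git commit -q -m "Add file.md"
# Empty output: the snapshot is stored in the local repository.
stage_repo="$(git status --porcelain)"

printf 'working: %s\nstaged: %s\ncommitted: %s\n' \
  "$stage_working" "$stage_index" "$stage_repo"
```

The --porcelain flag asks git status for a short, stable output, which makes the change of state easy to read at each step.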

In the next paragraph, you will find the graph summing up the relationships between the different stages.

  • Hosting your repository on Github

GitHub is a platform that hosts your project and coordinates collaboration between several contributors. The following steps detail how to use it:

  1. Create an account on GitHub
  2. Create a new repository
  3. Copy its link
  4. Clone it into a folder on your computer
  5. Make the necessary changes and commit them
  6. Push the changes to GitHub

NB1: The first four steps are carried out only once, while the 5th and 6th are iterative (see next paragraph).
NB2: It is also possible to push an existing folder to the GitHub repository using:

git remote add origin gitAdress
git branch -M master
git push -u origin master
  • Pushing and Pulling changes to and from Github

Once you have made all your changes, you will want to push your work to the GitHub repository in order to make it accessible to the other members of your team. Below are the necessary steps to follow:

  • Modify the file in your computer
  • Run:
  git add file.extension #Add it to the git index:
  git commit -m "message" #Commit the changes
  git push #Push the changes to github

You will also want to get the latest changes made by the others, using the following command:

git pull
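The push and pull round trip can be tried without any GitHub account. In the sketch below, a local bare repository stands in for the GitHub one, and two hypothetical collaborators, alice and bob, each have their own clone. All names and paths are assumptions made for the illustration.

```shell
base_dir="$(mktemp -d)"
cd "$base_dir"

# A bare repository plays the role of the GitHub repository here.
git init -q --bare remote.git

# First collaborator: create a commit and push it.
git init -q alice
cd alice
git config user.name "Alice"
git config user.email "alice@example.com"
echo "first line" > README.md
git add README.md
git commit -q -m "Add README"
git branch -M master
git remote add origin "$base_dir/remote.git"
git push -q -u origin master
cd ..

# Second collaborator: clone the shared repository.
# -b master is explicit because the stand-in remote may default to another name.
git clone -q -b master "$base_dir/remote.git" bob

# Alice publishes another change.
cd alice
echo "second line" >> README.md
git add README.md
git commit -q -m "Extend README"
git push -q
cd ..

# Bob gets it with a single pull.
cd bob
git pull -q
bob_lines="$(wc -l < README.md | tr -d ' ')"
echo "lines in Bob's README.md: $bob_lines"
```

With a real GitHub repository, only the address given to git remote add and git clone changes; the sequence of commands stays the same.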

The following graph represents the connection between the local repository and the GitHub one.

[Figure: git]

NB: When pushing to or pulling from GitHub, you will be asked for your credentials.

  • Illustration

In the following GIF, I create a local folder, initialize git in it, create a file called README.md, then add and commit the changes. Later on, I create a GitHub repo, push the local folder to it and check that GitHub got updated:

Collaboration

When working on a complex project, each member of the team is usually assigned to a task or a feature on which he or she can work independently. With that being said, each project can be seen as multiple subprojects handled by different members.
In order to coordinate their work, git uses the notion of branches: the main one is called master and other branches can be merged into it once the work on them is stabilized. Branches can also be used to separate the production version from the development one, in which new features are constantly developed, hence the name. Here is a smart workflow to combine both:

[Figure: git workflow]

To do so, you can use the following bash script, once you are in the git repository:

git branch # get current branch (master)

git branch devBranch # create development branch
git checkout devBranch # switch to development branch

git checkout -b featureBranch devBranch # create feature branch from the development branch and switch to it
git add file.extension # add changed file to staging
git commit -m "message" # commit the staged changes

git checkout devBranch # switch back to development branch
git merge featureBranch # merge featureBranch into devBranch

git push origin featureBranch # push the changes on featureBranch to github
git push origin devBranch # push the changes on devBranch to github

Here is an illustration of branching and merging using the previous project:

When merging branches, some conflicts might occur, especially when two people have worked on the same file. In this case, you should:

  • Open the file raising the conflicts
  • Resolve the conflicts
  • run:
git add .
git commit -m "message"
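To make the conflict steps concrete, here is a sketch that provokes a conflict on purpose and then resolves it. The file name config.txt, its content and the identity values are hypothetical; everything happens in a temporary repository.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "threshold = 1" > config.txt
git add config.txt
git commit -q -m "Initial config"
git branch -M master

# Two branches change the same line in different ways.
git checkout -q -b featureBranch
echo "threshold = 2" > config.txt
git commit -q -am "Feature: threshold 2"

git checkout -q master
echo "threshold = 3" > config.txt
git commit -q -am "Master: threshold 3"

# The merge stops on the conflict and exits with a non-zero status.
git merge featureBranch > /dev/null 2>&1 || echo "conflict detected"
conflicted="$(git diff --name-only --diff-filter=U)"   # files still in conflict
echo "files in conflict: $conflicted"

# Resolve by writing the agreed content (here: keep the feature value),
# then stage the file and commit to conclude the merge.
echo "threshold = 2" > config.txt
git add config.txt
git commit -q -m "Merge featureBranch, keep threshold 2"
```

In practice you would open the conflicting file in your editor: git marks both versions with <<<<<<<, ======= and >>>>>>> lines, and you keep the content you want before staging it.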

NB: branches are also a way to contribute to open-source projects, i.e. projects whose code is publicly available on GitHub.

Ignoring files

In some situations, certain files should stay local. In this case, the relative paths of these files should be added to the .gitignore file, a plain text file that you create at the root of the repository (git init does not create it for you). Git and GitHub are mainly made to handle code versioning, so it makes little sense to send databases to GitHub; their paths are usually added to the .gitignore file.
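Here is a short sketch of how this works in practice. The data folder, the customers.csv dataset and the train.py script are hypothetical names chosen for the example.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q

mkdir data
echo "id,value" > data/customers.csv    # hypothetical local dataset
echo "print('hello')" > train.py        # hypothetical script to version

# One pattern per line, relative to the repository root.
echo "data/" > .gitignore

# Lists .gitignore and train.py as untracked, but nothing under data/.
git status --porcelain

# git check-ignore exits with status 0 when the path matches a pattern.
ignored="no"
if git check-ignore -q data/customers.csv; then
  ignored="yes"
fi
echo "data/customers.csv ignored: $ignored"
```

The .gitignore file itself is usually committed, so that every collaborator ignores the same paths.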

Cheat table

Action                                                         | Bash command
---------------------------------------------------------------|----------------------------------------------
Configure git                                                  | git config --global user.name "John Doe"
                                                               | git config --global user.email "mail"
Initialize git in the working directory                        | git init
Get the status of the git repository                           | git status
Add a file to git’s staging area                               | git add file.extension
Commit the changes made                                        | git commit -m "message"
Roll back to the code version linked to commit_id              | git checkout commit_id
Get the difference between the working and staged versions     | git diff
Get the history of commits                                     | git log
See who last modified each line of a file                      | git blame filePath
Create a new branch                                            | git branch branchName
Switch to a branch                                             | git checkout branchName
Get the difference between master and another branch           | git diff master branchName
Merge a branch into master (run from master)                   | git merge branchName
Delete a branch                                                | git branch -d branchName
Rename a branch                                                | git branch -m branchName newbranchname

For more details, feel free to visit the official documentation of git.


Conclusion

As a data scientist, it is very important to master these tools in order to make your work more efficient and scalable. Tools like git are indispensable and make collaboration powerful and straightforward.