Data Scientist's starter pack
September 20, 2020 — 10 min
When working on a data science project, many skills may be required, from theoretical to technical ones. In this article, I will mainly focus on some of the most important tools to have and work with, tools which allow you to work better and faster.
The summary is as follows:
- Visual Studio Code
- Virtual environment
- Bash commands
- Git & GitHub
The daily tasks of a data scientist are mainly about coding. For this reason, it is important to have a good editor that can integrate many other tools. VS Code does that! It is a tool developed by Microsoft and is one of the most widely used editors. It lets you run, edit and debug your code, and offers many useful extensions such as:
Python: “A Visual Studio Code extension with rich support for the Python language, including features such as IntelliSense, linting, debugging, code navigation, code formatting, Jupyter notebook support, refactoring, variable explorer, test explorer, snippets, and more!”
Gitlens: “GitLens supercharges the Git capabilities built into Visual Studio Code. It helps you to visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more.”
vscode-icons: Bring icons to your Visual Studio Code for better representation and visualization
Grammarly: an extension that corrects spelling mistakes, which is very useful when writing articles ;)
Docker: “The Docker extension makes it easy to build, manage, and deploy containerized applications from Visual Studio Code. It also provides one-click debugging of Node.js, Python, and .NET Core inside a container.” If you are unfamiliar with Docker, don’t worry about this part.
Black: a Python formatting library which may already be installed (if not, install it with pip) and which can automatically format your code on save. To enable this, you can add the following lines to the “settings.json” file of VS Code:
    "python.formatting.provider": "black",
    "editor.formatOnSave": true
Here is an example of Black’s code formatting:
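As a stand-in illustration, the sketch below shows the kind of change Black's default style makes. The function `mean` and its contents are made up for this example; the hand-typed version is kept in a string so that both versions can be compared.

```python
# Hand-typed version, kept as a string so both versions can be compared.
UNFORMATTED = """
def mean( values ):
    total=0
    for v in values : total+=v
    return total/len( values )
"""


# The same function laid out in Black's default style: spaces around
# operators, no padding inside parentheses, one statement per line.
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)


# Formatting only changes the layout, so both versions behave identically.
namespace = {}
exec(UNFORMATTED, namespace)
print(namespace["mean"]([1, 2, 3]) == mean([1, 2, 3]))  # prints True
```

The point to keep in mind is that a formatter never changes what the code does, only how it reads.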
For more info, please visit the official website of VS Code.
When working on a project, you very often want it to be reproducible on other machines and in the cloud. Since the versions of the packages you use keep changing, it is important to set up a private, or virtual, environment in which you can develop your project and record the versions of the packages.
virtualenv is a Python library which lets you carry out the isolation and versioning discussed above.
When you install Python for the first time on your laptop, its default environment is called “base” or “global”. It is worth mentioning that newly created virtual environments do not share packages with base, which means that they are initially empty! The following commands create and activate one:
    cd projectFolder
    pip install virtualenv        # if not done before
    virtualenv <env>
    source <env>/bin/activate
Install packages (pip install packageName) as you need them and, once your project is stable, save their versions:
    pip freeze > packages.txt
If you clone a project that has a “packages.txt” file, you can run it as follows:
- First create and activate a new virtual environment
- Run the following command to install all the necessary packages in it from the “packages.txt” file:
    pip install -r packages.txt
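To check that an environment really matches a frozen file, a small script can compare the pinned versions with what is installed. This is a minimal sketch that assumes the file only contains `name==version` lines as written by pip freeze; the function names (`parse_pinned`, `installed_version`, `check_environment`) are hypothetical helpers, not part of pip or virtualenv.

```python
from importlib import metadata


def parse_pinned(text):
    """Parse 'name==version' lines; skip blank lines, comments and other forms."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(text):
    """Return {package: (pinned, installed)} for every pin that is not satisfied."""
    problems = {}
    for name, pinned in parse_pinned(text).items():
        found = installed_version(name)
        if found != pinned:
            problems[name] = (pinned, found)
    return problems


example = "numpy==1.19.2\n# a comment\n\npandas==1.1.2\n"
print(parse_pinned(example))
```

In practice you would pass the content of “packages.txt” to `check_environment` and expect an empty dictionary once the installation has succeeded.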
When creating a virtual environment, it’s important to select it as the interpreter. VS Code is capable of detecting it and will suggest it for use; otherwise you can choose it as follows:
- In settings, click on Python: Select Interpreter
- Choose it from the list
Please visit the website of virtualenv for more info.
Bash commands are very useful for quickly navigating your operating system and are efficient for handling files. They also come in handy when working on a virtual machine with no graphical interface, such as an environment in the cloud. Here is a brief list of the most used commands (on macOS):
| Task | Command |
|---|---|
| Current working directory | pwd |
| List the content of the current directory | ls |
| Details on the content of the current directory | ls -l |
| Change directory | cd newpath |
| Move back up one level | cd .. |
| Command history | up/down arrow |
| Create a folder | mkdir folder1 |
| Create many folders | mkdir folder1 folder2 … |
| Create a file | touch fileName.extension |
| Open a file | open fileName.extension |
| Show the content of a file | cat fileName.extension |
| Move a file | mv fileOldPath fileNewPath |
| Copy a file | cp filePath destinationPath |
| Copy a folder | cp -r folderPath destinationPath |
| Delete a file | rm filePath.extension |
| Delete a folder | rm -r folderPath |
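Most of these commands have a counterpart in Python's standard library, which is convenient when the same file handling has to happen inside a script. The sketch below is only an illustration: the function `demo_file_operations` and the file names are made up, and everything runs inside a temporary folder so nothing on your machine is touched.

```python
import shutil
import tempfile
from pathlib import Path


def demo_file_operations(root):
    """Mirror a few shell commands with pathlib and shutil inside `root`."""
    root = Path(root)
    folder1 = root / "folder1"
    folder1.mkdir()                              # mkdir folder1
    note = folder1 / "note.txt"
    note.touch()                                 # touch folder1/note.txt
    note.write_text("hello\n")
    content = note.read_text()                   # cat folder1/note.txt
    copy = root / "note_copy.txt"
    shutil.copy(note, copy)                      # cp folder1/note.txt note_copy.txt
    moved = root / "note_moved.txt"
    shutil.move(str(copy), str(moved))           # mv note_copy.txt note_moved.txt
    shutil.copytree(folder1, root / "folder2")   # cp -r folder1 folder2
    moved.unlink()                               # rm note_moved.txt
    shutil.rmtree(folder1)                       # rm -r folder1
    listing = sorted(p.name for p in root.iterdir())  # ls
    return content, listing


with tempfile.TemporaryDirectory() as tmp:
    content, listing = demo_file_operations(tmp)
    print(listing)  # prints ['folder2']
```

Only folder2 is left at the end because the copied file was moved and then deleted, and folder1 was removed.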
Git is a version control tool used mainly for three reasons:
- Versioning your code over time
- Keeping track of the changes made
- Allowing multiple parties to collaborate in parallel
To do so, Git works in three stages:
Working directory: the local folder hosting all the files of your project; more specifically, it is the folder where Git was initialized
Staging directory: it notes and indexes each modified document
Git local repository: each change leads to a new version of the document, or snapshot, which can be taken and labelled using a message.
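To make the three stages concrete, here is a toy model in Python. It is a deliberately simplified sketch and not how Git is implemented: the class `ToyRepo` and its methods are hypothetical, files are plain strings in a dictionary, and `status` only reports files whose content differs from what is staged.

```python
class ToyRepo:
    """Toy model of Git's three stages (an illustration, not real Git)."""

    def __init__(self):
        self.working = {}   # working directory: file name -> current content
        self.index = {}     # staging area: content recorded by add()
        self.commits = []   # local repository: list of (message, snapshot)

    def status(self):
        """Names of files whose working content differs from the staged one."""
        return sorted(
            name for name, content in self.working.items()
            if self.index.get(name) != content
        )

    def add(self, name):
        """Record the current content of a file in the staging area."""
        self.index[name] = self.working[name]

    def commit(self, message):
        """Store a labelled snapshot of everything that is staged."""
        self.commits.append((message, dict(self.index)))
        return len(self.commits) - 1


repo = ToyRepo()
repo.working["file.md"] = "first draft"
print(repo.status())   # prints ['file.md']
repo.add("file.md")
repo.commit("file.md created")
print(repo.status())   # prints []
```

Editing the file again after the commit would make it show up in `status` once more, while the snapshot stored in `commits` stays unchanged.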
Before digging into the Git commands, let’s first see how to install and configure it.
- Windows: download the Windows Git installer from the following website
- Mac: most versions of macOS already have Git installed.
Once Git is installed, set your name and email:
    git config --global user.name "Name"
    git config --global user.email "mail"
- Creating your first Git repo
Once you have installed and configured Git, you can create your first Git project using the following commands:
    mkdir projectFolder
    cd projectFolder
    git init                               # initialize git on your project
    git status
    touch file.md                          # create new file
    open file.md                           # open the file and modify it
    git status                             # file untracked -> add it to git index
    git add file.md
    git commit -m "file.md modified ..."
In the next paragraph, you will find the graph summing up the relationships between the different stages.
- Hosting your repository on GitHub
GitHub is a platform that hosts your project and helps coordinate collaboration between multiple parties. The following steps detail how to use it:
- Create an account on GitHub
- Create a new repository
- Copy its link
- Clone it in a folder on your computer
- Make the necessary changes and commit them
- Push the changes to GitHub
NB1: The first four steps are carried out only once, while the 5th and 6th are iterative (see next paragraph).
NB2: It is also possible to push an existing folder to the GitHub repository using:
    git remote add origin gitAddress
    git branch -M master
    git push -u origin master
- Pushing and Pulling changes to and from GitHub
Once you have made all your changes, you will want to push your work to the GitHub repository in order to make it accessible to the other members of your team. Below are the necessary steps to follow:
- Modify the file on your computer
    git add file.extension      # add it to the git index
    git commit -m "message"     # commit the changes
    git push                    # push the changes to GitHub
You will also want to get the latest changes made by the others, using the following command:
    git pull
The following graph represents the connection between the local repository and the GitHub one.
NB: When pushing to or pulling from GitHub, you will be asked for your credentials.
In the following GIF, I create a local folder, initialize Git in it, create a file called README.md, then add and commit the changes. Later on, I create a GitHub repo, push the local folder to it and check that GitHub got updated:
When working on a complex project, each member of the team is usually assigned a task or a feature on which he or she can work independently. Each project can therefore be seen as multiple subprojects handled by different members.
In order to coordinate their work, Git uses the notion of branches: the main one is called master, and other branches can be merged into it once the work on them is stable.
Branches can also be used to separate the production version from the development one, in which new features are constantly developed, hence the name.
Here is a smart workflow to combine both:
To do so, you can use the following bash script, once you are in the git repository:
    git branch                                  # get current branch (master)
    git checkout -b devBranch                   # create development branch
    git checkout devBranch                      # switch to development branch
    git checkout -b featureBranch devBranch     # create feature branch over the development branch
    git add file.extension                      # add changed file to staging
    git commit -m "message"                     # commit the changes with a message
    git checkout devBranch                      # switch back to development branch
    git merge featureBranch                     # merge featureBranch into devBranch
    git push origin featureBranch               # push the changes on featureBranch to GitHub
    git push origin devBranch                   # push the changes on devBranch to GitHub
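The workflow above can be pictured with a toy model in which a branch is just a named list of commit messages. This is a rough sketch for intuition only: the class `ToyBranches` is hypothetical, and real Git stores a graph of commits and handles conflicts, which this model ignores.

```python
class ToyBranches:
    """Toy model of branching and merging (an illustration, not real Git)."""

    def __init__(self):
        self.branches = {"master": []}   # branch name -> ordered commit messages
        self.current = "master"

    def checkout(self, name, create=False, start=None):
        """Switch to a branch, optionally creating it from `start` first."""
        if create:
            base = start if start is not None else self.current
            self.branches[name] = list(self.branches[base])
        self.current = name

    def commit(self, message):
        """Add a commit to the current branch only."""
        self.branches[self.current].append(message)

    def merge(self, other):
        """Bring over the commits of `other` that the current branch lacks."""
        mine = self.branches[self.current]
        missing = [m for m in self.branches[other] if m not in mine]
        mine.extend(missing)
        return missing


repo = ToyBranches()
repo.checkout("devBranch", create=True)
repo.checkout("featureBranch", create=True, start="devBranch")
repo.commit("add feature")
repo.checkout("devBranch")
repo.merge("featureBranch")
print(repo.branches["devBranch"])   # prints ['add feature']
print(repo.branches["master"])      # prints []
```

The last line shows why the workflow is safe: master stays untouched until the development branch is explicitly merged into it.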
Here is an illustration of branching and merging using the previous project:
When merging branches, some conflicts might occur, especially when two people have worked on the same file. In this case, you should:
- Open the file raising the conflicts
- Resolve the conflicts
- Add and commit the result:
    git add .
    git commit -m "message"
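When a merge conflicts, Git marks the competing versions inside the file with lines starting with <<<<<<<, ======= and >>>>>>>. The sketch below extracts both sides of each conflict so that they can be reviewed; the function `find_conflicts` and the sample file content are made up for the example.

```python
def find_conflicts(text):
    """Return a list of (ours, theirs) line lists, one pair per conflict block."""
    conflicts = []
    ours, theirs, state = [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            ours, theirs, state = [], [], "ours"
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>>") and state == "theirs":
            conflicts.append((ours, theirs))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts


sample = """\
import pandas as pd
<<<<<<< HEAD
df = pd.read_csv("data.csv")
=======
df = pd.read_csv("data.csv", sep=";")
>>>>>>> featureBranch
print(df.head())
"""

for ours, theirs in find_conflicts(sample):
    print("current branch:", ours)
    print("merged branch: ", theirs)
```

Resolving the conflict means keeping one side (or a mix of both) and deleting the three marker lines before adding and committing the file.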
NB: branches can also be seen as a way to contribute to open-source projects, i.e. projects whose code is publicly published on GitHub.
In some situations, certain files should be kept local only. In this case, the relative paths of these files should be added to the .gitignore file at the root of the repository (git init does not create it, so add it yourself if it is missing).
Git and GitHub are mainly made to handle code versioning, so it makes no sense to send databases to GitHub; their paths are usually added to .gitignore.
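As a rough illustration of how ignore patterns select files, the sketch below approximates a few common pattern shapes with Python's fnmatch module. It is not Git's actual matching algorithm (negation with "!" and "**" are not handled, and "*" is allowed to cross "/"), and the function `is_ignored` and the example paths are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_ignored(relative_path, patterns):
    """Approximate check of a relative path against ignore-style patterns."""
    parts = PurePosixPath(relative_path).parts
    for pattern in patterns:
        pattern = pattern.strip()
        if not pattern or pattern.startswith("#"):
            continue  # blank lines and comments select nothing
        if pattern.endswith("/"):
            # Directory pattern: anything below a folder with that name.
            if any(fnmatch(part, pattern[:-1]) for part in parts[:-1]):
                return True
        elif "/" in pattern:
            # Pattern with a path: compare against the whole relative path.
            if fnmatch(relative_path, pattern.lstrip("/")):
                return True
        elif any(fnmatch(part, pattern) for part in parts):
            # Bare name or wildcard: compare against each path component.
            return True
    return False


patterns = ["# local data", "data/", "*.csv", ".env"]
for path in ["data/raw/sales.db", "notebooks/export.csv", "src/train.py"]:
    print(path, is_ignored(path, patterns))
```

With these example patterns, the database under data/ and the exported CSV are ignored while the training script is kept under version control.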
| Task | Command |
|---|---|
| Configure your user name | git config --global user.name "John Doe" |
| Configure your email | git config --global user.email "mail" |
| Initialize git in the working directory | git init |
| Get the status of the git repository | git status |
| Add a file to git’s staging | git add file.extension |
| Commit the changes made | git commit -m "message" |
| Roll back to the code version linked to the commit_id | git checkout commit_id |
| Get the difference between the working directory and the staged version | git diff |
| Get the commit history | git log |
| Get the last person who modified each line of the file | git blame filePath |
| Create a new branch | git branch branchName |
| Switch to a branch | git checkout branchName |
| Get the difference between master and another branch | git diff master branchName |
| Merge a branch into the current branch (e.g. master) | git merge branchName |
| Delete a branch | git branch -d branchName |
| Rename a branch | git branch -m branchName newBranchName |
For more details, feel free to visit the official documentation of git.
As a data scientist, it is very important to master these technologies in order to make your work more efficient and scalable. Tools like Git are indispensable and make collaboration powerful and straightforward.