Dear Data Scientist, Be Organized! | by Alexandre Rosseto Lemos | Feb, 2023

A.I. Black Guy, February 3, 2023


A quick guide to improve your organization skills and, consequently, your performance.

First of all, I have a few simple questions for you:

Have you ever been lost inside your own notebook, not knowing in which order the cells must be executed for the code to run smoothly?

During the development of a project, have you ever lost precious minutes or even hours trying to remember where a previous analysis was and what its result had been?

Have you ever had difficulty locating, quickly and accurately, the data you have been using in your projects?

Do you have difficulty explaining your code, especially after a long period of not using it?

When you had to explain your code or show an analysis to a coworker, did you need a few minutes to remember what you had done or where the results were?

If you answered yes to any of these questions, I’m sorry to inform you, but you most likely have an organization problem.

Don’t be ashamed, this is more common than it seems!

The good news is that problems of this type are simple to solve, although the work is laborious and requires dedication. I have suffered a lot from problems caused by my own lack of organization, so I ended up creating a guide to help me be more organized and methodical in my work.

I like to split my organization into 4 topics:

Data organization

File organization (notebooks)

Notebook structure organization

Code organization

Data science projects involve using data, either to do some analysis on it or to develop Machine Learning models, so data is the foundation of any project in this area.

It seems obvious that the correct data must be selected whenever it is needed. However, this can be difficult in large projects or in projects that extend over long periods of time. During development, several databases can be generated and, if they are not correctly cataloged, they can be confused with one another, leading to wrong decisions.

So, to deal with these problems, the way I like to organize my data follows this structure:

Schematic of the data organization structure (image by author)

Yes, it is a very simple and intuitive structure, yet people (myself included) tend to save all the data in the same folder. Nowadays I like to have a separate folder for each project that I work on and, within each project, separate folders for the raw data, which contains the initial information, and for the processed data, which contains the data after some type of preprocessing has been performed.

It is very useful to save the processed data so you don’t have to run the whole preprocessing pipeline to generate the dataset you will use to make your models or your analysis.
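As a minimal sketch of this layout, the snippet below creates the raw and processed folders for a project and reuses a saved processed file when one exists. The folder names (`data/raw`, `data/processed`), the function names, and the use of CSV files are assumptions made for illustration; the schematic above may use different names.

```python
import csv
from pathlib import Path


def project_data_dirs(root, project):
    """Create <root>/<project>/data/raw and .../data/processed; return both paths."""
    base = Path(root) / project / "data"
    raw, processed = base / "raw", base / "processed"
    for folder in (raw, processed):
        folder.mkdir(parents=True, exist_ok=True)
    return raw, processed


def load_or_build(raw_file, processed_file, preprocess):
    """Reuse the saved processed file if present; otherwise build it and save it."""
    processed_file = Path(processed_file)
    if processed_file.exists():
        # Values read back from CSV are always strings.
        with processed_file.open(newline="") as f:
            return list(csv.DictReader(f))

    with Path(raw_file).open(newline="") as f:
        rows = preprocess(list(csv.DictReader(f)))
    if not rows:
        raise ValueError("preprocess returned no rows; nothing to save")

    # Save the result so the whole pipeline does not have to run again next time.
    with processed_file.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Here `preprocess` stands for whatever pipeline the project uses: any function that takes a list of row dictionaries and returns the transformed list.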

In my experience, it is good practice to keep the versions of the data that were used throughout the project, because you can use them to compare the different results obtained and to guarantee the reproducibility of previous results and analyses. However, some time after the end of the project, it is worth clearing out some of the old files to reduce the amount of storage used.
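One possible way to keep data versions apart is to put a version number in the file name. The sketch below assumes a hypothetical naming scheme, `<stem>_v001.csv`, `<stem>_v002.csv`, and so on; the scheme and both function names are illustrative and are not taken from the article.

```python
import re
from pathlib import Path


def _versions(folder, stem, suffix):
    """Return (number, path) pairs for files named <stem>_v<digits><suffix>, sorted."""
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    found = []
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)


def next_version_path(folder, stem, suffix=".csv"):
    """Return the path for the next version, e.g. <folder>/<stem>_v004.csv."""
    versions = _versions(folder, stem, suffix)
    next_number = versions[-1][0] + 1 if versions else 1
    return Path(folder) / f"{stem}_v{next_number:03d}{suffix}"


def stale_versions(folder, stem, suffix=".csv", keep=2):
    """List all but the newest `keep` versions, oldest first. Nothing is deleted."""
    paths = [path for _, path in _versions(folder, stem, suffix)]
    return paths[: max(len(paths) - keep, 0)]
```

`stale_versions` only reports candidates for cleanup. Deleting them is left as a manual decision, since old versions may still be needed to reproduce earlier results.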

Photo by Maksym Kaharlytskyi on Unsplash

Usually, in a data science project, several steps need to be completed to reach a result. For instance, if you are developing a Machine Learning model for a specific task, some steps are pretty common, such as the development of the database that will be used, the EDA (Exploratory Data Analysis), the development of the preprocessing pipeline, the tuning of the model and the evaluation of the results.

Each of these tasks usually requires a considerable number of lines of code, which makes having them all in the same notebook impractical, due to the enormous amount of information present in a single file. The chances of the notebook becoming confusing and messy with so much information are very high, even when applying the tips that I will talk about in the next topics.

Another point worth mentioning is that the notebook can become so computationally expensive that the kernel can become unstable if you try to run all the cells in sequence, leading to the loss of precious minutes, or hours.

Therefore, the way I have been organizing my files, which has helped me a lot in my projects, has the following format:

Example of file organization structure (image by author)

Basically, I order my notebooks chronologically by development (using numbers in the file names to sort them), I give each notebook a very clear and explicit name that describes what was developed in it, and I create folders to store the generated files so that they are easier to find in the future. In addition to making the information easier to organize, following this structure makes clear which steps were taken to arrive at the result and what reasoning was used.
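To illustrate the numbering idea, here is a small sketch that lists the notebooks of a folder in their development order. It assumes hypothetical file names of the form `<number>_<description>.ipynb` (for example `1_database_creation.ipynb`); the actual names in the figure may differ.

```python
import re
from pathlib import Path

# Assumed naming convention: digits, then "_" or "-", then a description.
NUMBERED = re.compile(r"^(\d+)[_-](.+)\.ipynb$")


def ordered_notebooks(folder):
    """Return (step, description, path) for each numbered notebook, in step order."""
    found = []
    for path in Path(folder).glob("*.ipynb"):
        match = NUMBERED.match(path.name)
        if match:
            description = match.group(2).replace("_", " ")
            found.append((int(match.group(1)), description, path))
    # Sort by the numeric step so that 10 comes after 2, unlike a plain name sort.
    return sorted(found, key=lambda item: item[0])
```

Zero-padding the numbers in the file names (`01_`, `02_`, ...) gives the same order directly in a file browser.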

Again, using this organization structure, or one similar to it, is an extremely simple task to accomplish, requiring only discipline.

The next two steps are the most challenging ones because they require you to write and to think as well as to be disciplined.

What I mean by notebook structure organization is being clear about what you are doing in each part of the notebook. For this topic, Markdown cells will be your best friend, because you can use the different heading styles to differentiate between the topics and subtopics of your code.

For me, this part is almost the same as writing a report, where you have to clearly state what your intentions are with each experiment, your hypotheses, your results and the analysis you are making. The tricky part is that you have to be very clear about your goals and analysis to do a good job, and this requires thinking ahead and being objective.

Being organized in the structure of the notebook makes things much easier to explain, which is a very good thing when you are going to present your analysis to your peers and superiors.

Example of notebook structure organization (image by author)

As you can see, this part can require a lot of work, depending on the objective of the notebook. However, I have found that this work is often rewarded with a clearer sense of the goals that I am pursuing with each analysis and of the results obtained. This has helped me a lot, especially when working on more than one project at the same time.
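A report-like structure can also be prepared before any code is written. The sketch below builds an empty notebook skeleton with one Markdown section per experiment, assuming the standard notebook JSON layout (format version 4). The section labels (Goal, Hypothesis, Results and analysis) and the function names are illustrative choices, not the structure shown in the figure.

```python
import json
from pathlib import Path


def markdown_cell(text):
    """Build a Markdown cell in the notebook JSON layout."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(text=""):
    """Build an unexecuted code cell in the notebook JSON layout."""
    return {"cell_type": "code", "metadata": {}, "execution_count": None,
            "outputs": [], "source": text}


def notebook_skeleton(title, goal, experiments):
    """Return a notebook dict with a title, a goal and one section per experiment."""
    cells = [markdown_cell(f"# {title}\n\n**Goal:** {goal}")]
    for number, name in enumerate(experiments, start=1):
        cells.append(markdown_cell(f"## {number}. {name}\n\n**Hypothesis:** TODO"))
        cells.append(code_cell())
        cells.append(markdown_cell("**Results and analysis:** TODO"))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def save_notebook(notebook, path):
    """Write the notebook dict to disk as JSON."""
    Path(path).write_text(json.dumps(notebook, indent=1), encoding="utf-8")
```

For example, `save_notebook(notebook_skeleton("Model tuning", "Choose hyperparameters", ["Baseline", "Grid search"]), "model_tuning.ipynb")` would produce a file whose headings are already in place, leaving only the TODO parts to fill in.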

The best way to explain this topic is this: write your code as if you were explaining it to someone else. Use plenty of comments throughout your code and always try to be as clear as possible. At the beginning this habit will be a little hard to keep and you will feel a little lazy about commenting your code, but over time you become very good at commenting quickly and effectively, doing it almost automatically. Always try to be as direct as possible, saving both words and the time needed to understand what you wrote.
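As a hypothetical illustration of short, direct comments, the function below fills missing values in a list of ages. The task and the names are invented for this example; the point is that each comment states, in one line, why the step exists.

```python
from statistics import median


def fill_missing_ages(ages):
    # Compute the median from the known ages only, so the gaps do not distort it.
    known = [age for age in ages if age is not None]

    # With no known ages there is nothing to impute from: return a copy unchanged.
    if not known:
        return list(ages)

    # The median is used instead of the mean because it is less sensitive to outliers.
    typical_age = median(known)

    # Keep the original order so the result still lines up with the other columns.
    return [typical_age if age is None else age for age in ages]
```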

Another very important point is to document the functions you develop. It’s no use making an extremely useful function if nobody knows what it’s for! Here, the use of docstrings is highly recommended. Again, be detailed in your explanations, but try not to use too many words.

A template that I have been using basically contains three pieces of information: what the function’s purpose is (in the Info section), what the function’s input variables are (in the Input section) and what the function returns, if anything (in the Output section). Below is an example of a function I made for another article of mine here on Medium: Genetic Algorithm and its practicality in Machine Learning.
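The function from that article is not reproduced in this text. As a stand-in, here is a hypothetical function documented with the same three sections (Info, Input, Output); the function itself and the exact layout of the docstring are assumptions, meant only to show the template in use.

```python
def train_test_indices(n_rows, test_fraction=0.2):
    """
    Info:
        Split row positions into a training part and a test part,
        keeping the original order (no shuffling).

    Input:
        n_rows: int, number of rows in the dataset.
        test_fraction: float between 0 and 1, share of rows kept for testing.

    Output:
        train_idx: list of int, positions of the training rows.
        test_idx: list of int, positions of the test rows.
    """
    n_test = round(n_rows * test_fraction)
    split = n_rows - n_test
    return list(range(split)), list(range(split, n_rows))
```

With the docstring in place, `help(train_test_indices)` shows the purpose, inputs and outputs without anyone having to read the body.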

Photo by Glenn Carstens-Peters on Unsplash
