Dear Data Scientist, Be Organized! | by Alexandre Rosseto Lemos | Feb, 2023

Photo by Matthew Kwong on Unsplash

First of all, I have a few simple questions for you:

Have you ever been lost inside your own notebook, not knowing in which order the cells must be executed for the code to run smoothly?
During the development of a project, have you ever lost precious minutes or even hours trying to remember where a previous analysis was and what its result had been?
Have you ever had difficulty locating, quickly and accurately, the data you have been using in your projects?
Do you have difficulty explaining your code, especially after a long period of not using it?
When you explained your code or showed an analysis to a coworker, did you need a few minutes to remember what you had done or where the results were?

If you answered yes to any of these questions, I’m sorry to inform you, but you most likely have an organization problem.

Don’t be ashamed; this is more common than it seems!

The good news is that problems of this type are simple to solve, although the solution is laborious and requires dedication. I’ve already suffered a lot from problems caused by my own lack of organization, so I ended up creating a guide to help me be more organized and methodical in my work.

I like to split my organization into 4 topics:

1. Data organization
2. File organization (notebooks)
3. Notebook structure organization
4. Code organization

Photo by Nana Smirnova on Unsplash

Data science projects involve using data, either to do some analysis on it or to develop Machine Learning models, so data is the foundation of any project in this area.

Data must be accessed precisely, so that the correct data is selected when it is needed. However, this can be difficult in large projects or in projects that extend over long periods of time. During development, several databases can be generated and, if they are not correctly cataloged, they become confusing and can lead to wrong decisions.

So, to deal with these problems, the way I like to organize my data follows this structure:

Schematic of the data organization structure (image by author)

Yes, it is a very simple and intuitive structure, yet people (and I was included here) tend to save all the data in the same folder. Nowadays I like to have a separate folder for each project that I work on and, for each project, I always have a separate folder for the raw data, which contains the initial information, and the processed data, which contains data after some type of preprocessing is performed.
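As a minimal sketch of this layout, the folders can be created with Python’s standard `pathlib` module. The function name, the project name and the intermediate `data` folder are assumptions made for illustration; only the split between raw and processed data comes from the structure described above.

```python
from pathlib import Path


def create_project_folders(root: Path, project_name: str) -> dict[str, Path]:
    """Create <root>/<project_name>/data/{raw,processed} and return both paths."""
    project_dir = root / project_name
    folders = {
        # Raw data: the initial information, never modified.
        "raw": project_dir / "data" / "raw",
        # Processed data: anything produced by a preprocessing step.
        "processed": project_dir / "data" / "processed",
    }
    for folder in folders.values():
        # exist_ok=True makes the call safe to repeat on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling `create_project_folders(Path.home() / "projects", "churn_model")` would create one project folder with both data folders inside it.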

It is very useful to save the processed data so you don’t have to run the whole preprocessing pipeline to generate the dataset you will use to make your models or your analysis.
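One possible way to apply this idea is a small "load or build" helper: if the processed file already exists it is read directly, otherwise the preprocessing runs once and its result is saved. The sketch below uses only the standard `csv` module; `preprocess` is a hypothetical stand-in for a real pipeline, and the file paths are whatever your project uses.

```python
import csv
from pathlib import Path


def preprocess(rows: list[dict]) -> list[dict]:
    """Hypothetical preprocessing step: drop rows with a missing value."""
    return [row for row in rows
            if all(value not in ("", None) for value in row.values())]


def load_or_build_processed(raw_path: Path, processed_path: Path) -> list[dict]:
    """Return the processed rows, reusing the saved file when it exists."""
    if processed_path.exists():
        # The processed file is already there: skip the pipeline entirely.
        with processed_path.open(newline="") as f:
            return list(csv.DictReader(f))

    with raw_path.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = preprocess(list(reader))

    # Save the result so the next call does not need the raw data again.
    processed_path.parent.mkdir(parents=True, exist_ok=True)
    with processed_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

With a slow pipeline, the saving comes from the first branch: every run after the first one only reads a file.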

In my experience, it is good practice to keep the versions of the data that were used throughout the project: you can use them to compare the different results obtained, and you guarantee that previous results and analyses can be reproduced. However, some time after the project ends, it is good to clear out some of the old files to reduce the amount of storage used.
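Keeping versions is easier when the version is part of the file name. The naming scheme below (`<name>_v<number>_<date>`) is only an assumption, one of many that would work, and both helper functions are hypothetical.

```python
from datetime import date
from pathlib import Path


def versioned_path(folder: Path, name: str, version: int, day: date,
                   suffix: str = ".csv") -> Path:
    """Build a path such as <folder>/customers_v002_2023-02-15.csv."""
    return folder / f"{name}_v{version:03d}_{day.isoformat()}{suffix}"


def next_version(folder: Path, name: str) -> int:
    """Return 1 + the highest version number already saved for this dataset."""
    versions = []
    for path in folder.glob(f"{name}_v*"):
        # The version is the run of digits right after '<name>_v'.
        digits = path.name[len(name) + 2:].split("_")[0]
        if digits.isdigit():
            versions.append(int(digits))
    return max(versions, default=0) + 1
```

Because the version number is zero-padded, the files also sort correctly in a file browser, which helps when deciding which old versions to delete after the project ends.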

Photo by Maksym Kaharlytskyi on Unsplash

Usually, in a data science project, several steps need to be completed to reach a result. For instance, if you are developing a Machine Learning model for a specific task, some steps are very common: the development of the database that will be used, the EDA (Exploratory Data Analysis), the development of the preprocessing pipeline, the tuning of the model and the evaluation of the results.

Each of these tasks usually requires a considerable number of lines of code, which makes keeping them all in the same notebook impractical because of the enormous amount of information in a single file. The chances of the notebook becoming confusing and messy with so much information are very high, even when applying the tips that I will talk about in the next topics.

Another point worth mentioning is that the notebook can become so computationally expensive that the kernel becomes unstable if you try to run all the cells in sequence, costing you precious minutes or hours.

Therefore, the way of organizing my files that I have been using, and that has helped me a lot in my projects, has the following format:

Example of file organization structure (image by author)

Basically, I order my notebooks chronologically by development (using numbers in the file name to sort them), I give each notebook a very clear and explicit name that states what was developed in it, and I create folders to store the generated files so they are easier to find in the future. In addition to making the information easier to organize, this structure makes clear which steps were taken to arrive at the result and what reasoning was used.
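A naming convention like this one can also be checked automatically. The sketch below assumes a hypothetical pattern (a two-digit step number, an underscore and a lowercase description, for example `01_database_creation.ipynb`); the exact pattern is an assumption and should be adapted to your own names.

```python
import re
from pathlib import Path

# Hypothetical convention: two-digit step number, underscore, description.
NOTEBOOK_PATTERN = re.compile(r"^(\d{2})_([a-z0-9_]+)\.ipynb$")


def ordered_notebooks(folder: Path) -> list[Path]:
    """Return the notebooks that follow the convention, sorted by step number."""
    numbered = []
    for path in folder.glob("*.ipynb"):
        match = NOTEBOOK_PATTERN.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]


def misnamed_notebooks(folder: Path) -> list[Path]:
    """Return the notebooks whose names do not follow the convention."""
    return sorted(path for path in folder.glob("*.ipynb")
                  if not NOTEBOOK_PATTERN.match(path.name))
```

Running `misnamed_notebooks` from time to time is a cheap way to catch an `Untitled.ipynb` before it gets lost among the other files.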

Again, using this organization structure, or one similar to it, is an extremely simple task to accomplish, requiring only discipline.

The next two steps are the most challenging ones because they require you to write and to think as well as to be disciplined.

Photo by Kelly Sikkema on Unsplash

What I mean by notebook structure organization is being clear about what you are doing in each part of the same notebook. For this topic, the option to write in Markdown will be your best friend, because you can use the different heading styles to differentiate between topics and subtopics of your code.

For me, this part is almost the same as writing a report, where you have to clearly state what your intentions are with each experiment, your hypotheses, the results and the analysis that you are making. The tricky part is that you have to be very clear about your goals and analysis to do a good job, and this requires thinking ahead and being objective.

Being organized in the structure of the notebook makes things much easier to explain, which is a very good thing when you are going to present your analysis to your peers and superiors.

Example of notebook structure organization (image by author)

As you can see, this part can require a lot of work, depending on the objective of the notebook. However, I have found that this work is often rewarded with a clearer sense of the goals that I am pursuing with each analysis and of the results obtained. This has helped me a lot, especially when working on more than one project at the same time.
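One way to lower the effort is to start every notebook from a skeleton that already contains the Markdown headings. The sketch below writes such a skeleton as plain JSON, assuming the nbformat 4 notebook layout; the section names (Objective, Hypothesis, Results and analysis, Conclusions) are my own assumption, loosely based on the report-like structure described above, and the function names are hypothetical.

```python
import json
from pathlib import Path


def markdown_cell(text: str) -> dict:
    """Build a Markdown cell in the nbformat 4 JSON layout."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source: str = "") -> dict:
    """Build an empty code cell that has not been executed yet."""
    return {"cell_type": "code", "metadata": {}, "source": source,
            "execution_count": None, "outputs": []}


def notebook_skeleton(title: str, experiments: list[str]) -> dict:
    """Return a notebook with one titled, report-like block per experiment."""
    cells = [
        markdown_cell(f"# {title}"),
        markdown_cell("## Objective\n\nWhat question should this notebook answer?"),
    ]
    for name in experiments:
        cells.append(markdown_cell(f"## {name}\n\n**Hypothesis:** ..."))
        cells.append(code_cell())
        cells.append(markdown_cell("**Results and analysis:** ..."))
    cells.append(markdown_cell("## Conclusions"))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def write_skeleton(path: Path, title: str, experiments: list[str]) -> None:
    """Save the skeleton as an .ipynb file."""
    notebook = notebook_skeleton(title, experiments)
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
```

The placeholders ("...") are meant to be filled in by hand; the skeleton only guarantees that every experiment starts with a stated hypothesis and ends with a written analysis.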

Photo by Fahrul Razi on Unsplash

The best way to explain this topic is this: write your code as if you were explaining it to someone else. Use a lot of comments throughout your code and always try to be as clear as possible. At the beginning this habit will be a little hard to keep and you will feel a little lazy about commenting your code, but over time you become very good at commenting things quickly and effectively, almost automatically. Always try to be as direct as possible, saving both words and the time needed to understand what you said.

Another very important point is to document the functions you develop. It’s no use making an extremely useful function if nobody knows what it’s for! Here, the use of docstrings is highly recommended. Again, be very detailed in your explanations of what the functions do, but try not to use too many words.

A template that I have been using contains three pieces of information: what the function’s purpose is (in the Info section), what the function’s input variables are (in the Input section) and what the function returns, if anything (in the Output section). An example of a function documented this way appears in another article of mine here on Medium: Genetic Algorithm and its practicality in Machine Learning.
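As an illustration of the Info / Input / Output template, here is a minimal sketch applied to a hypothetical helper. This is not the function from the article mentioned above; the function itself and its parameter names are invented for the example.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """
    Info:
        Rescale a list of numbers linearly to the interval [lower, upper].

    Input:
        values: list of int or float with at least one element.
        lower: float, smallest value of the output range.
        upper: float, largest value of the output range.

    Output:
        scaled: list of float with the same length as `values`. If all
            inputs are equal, every output is `lower`.
    """
    smallest, largest = min(values), max(values)
    span = largest - smallest
    if span == 0:
        # Avoid a division by zero when every value is identical.
        return [float(lower) for _ in values]
    # Shift to zero, stretch to the target width, then shift to `lower`.
    return [lower + (value - smallest) * (upper - lower) / span
            for value in values]
```

Because the text is a regular docstring, `help(min_max_scale)` shows it in a notebook or terminal, so a coworker can read the three sections without opening the source file.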

The practice of commenting and organizing your code is extremely useful, not only so that you are clear about what you do, but also so that your code can be of use to other people in your organization. Reusable, well-made and organized code is very valuable within a team, as it ends up saving the time of several colleagues when they perform a repeated or similar task.

Another important point is that if you go on vacation or take leave in the middle of a project, well-organized code lets you pick up where you left off much faster than if you had to remember everything you did, or what prompted you to write those lines of code several days earlier.

Photo by Glenn Carstens-Peters on Unsplash

In this article I showed you several benefits of being more organized in your work. Being organized makes you more efficient, clear and objective, in addition to exercising your discipline and thinking, because to write the result of an analysis or a comment in your code, you must first think about what you are going to write. Don’t be surprised if you find yourself rewriting something more than once. Most of the time, that is you exercising your critical sense and looking for a more efficient way to express yourself, which is a very important skill to practice.

If you are early in your career, being more organized will help you a lot. Being able to clearly answer a question asked by a superior, or to clearly explain what your objectives were with an analysis or a piece of code, will add points to your work.

Finally, my tip is to start little by little and adapt the way you organize things until it is the best possible for you. The approaches I’ve described here are ones I adapted for myself and they serve me very well, but you may be better off organizing things differently. Be that as it may, being organized will only bring you benefits.

