Why You Should Consider Using Fortran As A Data Scientist | by Egor Howell | May, 2023

A.I. Black Guy, May 16, 2023

An exploration of the benefits that Fortran can bring to Data Science and Machine Learning

Python is widely considered the gold standard language for Data Science, and almost the entire range of packages, literature, and resources related to Data Science is available in Python. This is not necessarily a bad thing, as it means that there are numerous documented solutions for any data-related problem that you may encounter.

However, with the advent of larger datasets and the rise of more complex models, it may be time to explore other languages. This is where the old-timer, Fortran, may become popular again. Therefore, it is worthwhile for today’s Data Scientists to become aware of it and maybe even try to implement some solutions.

Fortran, short for Formula Translation, originated in the 1950s and was the first widely used high-level programming language. Despite its age, it remains a high-performance computing language and, for some numerical workloads, can be as fast as or faster than C and C++.

Initially designed for scientists and engineers to run large-scale models and simulations in areas such as fluid dynamics and organic chemistry, Fortran is still frequently used today by physicists. I even learned it during my physics undergrad!

Its specialty lies in modelling and simulations, which are essential for numerous fields, including Machine Learning. Therefore, Fortran is perfectly poised to tackle Data Science problems, as that’s exactly what it was invented to do decades ago.

Fortran has several key advantages over other programming languages such as C++ and Python. Here are some of the main points:

Easy to Read: Fortran is a compact language with only five native data types: INTEGER, REAL, COMPLEX, LOGICAL, and CHARACTER. This simplicity makes it easy to read and understand, especially for scientific applications.

High Performance: Fortran is often used to benchmark the speed of high-performance computers.

Large Libraries: Fortran has a wide range of libraries available, mainly for scientific purposes. These libraries provide developers with a vast array of functions and tools for performing complex calculations and simulations.

Historical Array Support: Fortran has had multi-dimensional array support from the beginning, which is essential for Machine Learning and Data Science workloads such as Neural Networks.

Designed for Engineers and Scientists: Fortran was built specifically for pure number crunching, which is different from the more general-purpose use of C/C++ and Python.

However, it is not all sunshine and rainbows. Here are some of Fortran’s drawbacks:

Text operations: Not ideal for characters and text manipulation, so not optimal for natural language processing.

Python has more packages: Even though Fortran has many libraries, it is far from the total number in Python.

Small community: Fortran does not have as large a following as other languages. This means it has less IDE and plugin support and fewer Stack Overflow answers!

Not suitable for many applications: It is explicitly a scientific language, so don’t try to build a website with it!

Homebrew

Let’s quickly go over how to install Fortran on your computer. First, you should install Homebrew (link here), which is a package manager for macOS.

To install Homebrew, simply run the command from their website:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

You can verify Homebrew is installed by running the command brew help. If there are no errors, then Homebrew has been successfully installed on your system.

GCC Compiler

As Fortran is a compiled language, we need a compiler that can compile Fortran source code. Unfortunately, macOS doesn’t ship with a Fortran compiler pre-installed, so we need to install one ourselves.

A popular option is GCC (the GNU Compiler Collection), which you can install through Homebrew: brew install gcc. GCC is a set of compilers for languages such as C, Go, and of course Fortran. Its Fortran compiler is called gfortran, and it can compile all major versions of Fortran, such as 77, 90, 95, 2003, and 2008. The .f90 extension is recommended for Fortran source files, although there is some discussion on this topic.

To verify that gfortran and GCC have been successfully installed, run the command which gfortran. The output should look something like this:

/opt/homebrew/bin/gfortran

The gfortran compiler is by far the most popular; however, there are several other compilers out there. A list of them can be found here.

IDEs & Text Editors

Once we have our Fortran compiler, the next step is to choose an Integrated Development Environment (IDE) or text editor to write our Fortran source code in. This is a matter of personal preference since there are many options available. Personally, I use PyCharm and install the Fortran plugin because I prefer not to have multiple IDEs. Other popular text editors suggested by the Fortran website include Sublime Text, Notepad++, and Emacs.

Running a Program

Before we go onto our first program, it is important to note that I won’t be doing a syntax or command tutorial in this article. Linked here is a short guide that will cover all the basic syntax.

Below is a simple program called example.f90:

GitHub Gist by author.

Here’s how we compile it:

gfortran -o example example.f90

This command compiles the code and creates an executable file named example. You can replace example with any other name you prefer. If you don’t specify a name using the -o flag, the compiler will use a default name, which is typically a.out on most Unix-based operating systems.

Here’s how to run the example executable:

./example

The ./ prefix is included to indicate that the executable is in the current directory. The output from this command will look like this:

Hello world
1

Now, let’s tackle a more ‘real’ problem!

Overview

The knapsack problem is a well-known combinatorial optimization problem that poses:

A set of items, each with a value and a weight, must be packed into a knapsack so as to maximize the total value whilst respecting the weight constraint of the knapsack.

Although the problem sounds simple, the number of candidate solutions increases exponentially with the number of items. When each item is either in or out, n items give 2^n possible packings, so 22 items already mean 4,194,304 combinations. This makes it intractable to solve by brute force beyond a certain number of items.

Heuristic methods such as genetic algorithms can be used to find a ‘good enough’ or ‘approximate’ solution in a reasonable amount of time. If you’re interested in learning how to solve the knapsack problem using the genetic algorithm, check out my previous post:

The knapsack problem has sundry applications in Data Science and Operations Research, including stock management and supply chain efficiency, rendering it important to solve efficiently for business decisions.

In this section, we will see how quickly Fortran can solve the knapsack problem by pure brute-force compared to Python.

Note: We will be focusing on the basic version, which is the 0–1 knapsack problem where each item is either fully in the knapsack or not in at all.
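In symbols, the 0–1 version can be stated as below. The notation is assumed here purely for illustration and does not come from the original code: there are n items, item i has value v_i and weight w_i, W is the knapsack’s weight limit, and x_i is 1 if item i is packed and 0 otherwise.

```latex
\max_{x_1,\dots,x_n} \; \sum_{i=1}^{n} v_i x_i
\quad \text{subject to} \quad
\sum_{i=1}^{n} w_i x_i \le W,
\qquad x_i \in \{0, 1\} \ \text{for } i = 1, \dots, n.
```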

Python

Let’s start with Python.

The following code solves the knapsack problem for 22 items using a brute-force search. Each candidate solution is encoded as a 22-element array in which each element refers to one item and is either 0 (not in) or 1 (in). As each item has only 2 possible states, the total number of combinations is 2^(num_items). We use the itertools.product function to compute the Cartesian product of these 0/1 choices, which generates every possible solution, and then we iterate through them.

GitHub Gist by author.

The output of this code:

Items in best solution:
Item 1: weight=10, value=10
Item 6: weight=60, value=68
Item 7: weight=70, value=75
Item 8: weight=80, value=58
Item 17: weight=170, value=200
Item 19: weight=190, value=300
Item 21: weight=210, value=400
Total value: 1111
Time taken: 13.78832197189331 seconds
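As a rough illustration of the brute-force search described above, here is a minimal sketch on a tiny instance. The four items, their weights and values, the capacity of 50, and the function name brute_force_knapsack are all hypothetical; this is not the 22-item gist, only the same idea at a size that is easy to check by hand.

```python
from itertools import product

# Hypothetical toy data, NOT the 22 items used in the article.
weights = [10, 20, 30, 40]
values = [60, 100, 120, 30]
capacity = 50  # hypothetical weight limit


def brute_force_knapsack(weights, values, capacity):
    """Try every 0/1 assignment and keep the most valuable feasible one."""
    best_value, best_choice = 0, (0,) * len(weights)
    # product((0, 1), repeat=n) yields all 2**n tuples of 0s and 1s.
    for choice in product((0, 1), repeat=len(weights)):
        total_weight = sum(w for w, picked in zip(weights, choice) if picked)
        if total_weight > capacity:
            continue  # over the limit, so not a valid packing
        total_value = sum(v for v, picked in zip(values, choice) if picked)
        if total_value > best_value:
            best_value, best_choice = total_value, choice
    return best_value, best_choice


best_value, best_choice = brute_force_knapsack(weights, values, capacity)
print("Best value:", best_value)
print("Items packed:", [i + 1 for i, picked in enumerate(best_choice) if picked])
```

On this toy instance the search picks items 2 and 3 for a total value of 220. The loop body runs 2^4 = 16 times here, and each extra item doubles that count.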

Fortran

Now, let’s solve the same problem, with exactly the same variables, but in Fortran. Unlike Python with itertools, Fortran does not come with a package for generating permutations and combinations.

Our approach is to use the modulo operator to convert the iteration number into its binary representation. For example, if the iteration number is 6, then 6 modulo 2 is 0, which means the first item is not selected. We then divide the iteration number by 2 (integer division), which shifts the bits to the right, and take the modulo again to get the bit for the next item. Repeating this for every item (so 22 times) decodes one candidate solution, and looping over all 2^22 iteration numbers gives us every possible combination.
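The arithmetic is language-independent, so it can be sketched in a few lines of Python. The helper name decode_selection is hypothetical, and this is only an illustration of the modulo-and-divide idea, not a translation of the Fortran code.

```python
def decode_selection(iteration, num_items):
    """Turn an iteration number into a list of 0/1 flags, one per item."""
    bits = []
    for _ in range(num_items):
        bits.append(iteration % 2)  # remainder: is the current item selected?
        iteration //= 2             # integer division moves on to the next item
    return bits


# Iteration 6 is 110 in binary, read here from the lowest bit upwards.
print(decode_selection(6, 4))  # [0, 1, 1, 0]

# Looping over 0 .. 2**n - 1 visits every combination exactly once.
num_items = 3
all_selections = [decode_selection(k, num_items) for k in range(2 ** num_items)]
print(len(all_selections))  # 8
```

So iteration 6 selects the second and third items only. The Fortran implementation itself is in the author’s gist.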

GitHub Gist by author.

Compile and execute, timing each step with the shell’s time command:

time gfortran -o brute brute_force.f90
time ./brute

Output:

Items in best solution:
Item: 1 Weight: 10 Value: 10
Item: 6 Weight: 60 Value: 68
Item: 7 Weight: 70 Value: 75
Item: 8 Weight: 80 Value: 58
Item: 17 Weight: 170 Value: 200
Item: 19 Weight: 190 Value: 300
Item: 21 Weight: 210 Value: 400
Best value found: 1111
./brute 0.26s user 0.01s system 41% cpu 0.645 total

The Fortran code is ~21 times quicker: about 0.65 seconds of wall-clock time against 13.79 seconds for Python!

Comparison

To get a more visual comparison, we can plot the execution time as a function of the number of items:

Fortran blows Python out of the water!

Even though the compute time for Fortran does increase, its growth is not nearly as large as it is for Python. This truly displays the computational power of Fortran when it comes to solving optimisation problems, which are of critical importance in many areas of Data Science.
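For the Python side, timings per item count could be collected along the following lines. This is only a sketch under stated assumptions: the function names, the generated weights and values, and the capacity rule are all hypothetical, and it is not the script behind the comparison above. The item counts are kept small so that it finishes quickly.

```python
import time
from itertools import product


def brute_force_best_value(weights, values, capacity):
    """Return the best total value over all 2**n packings that fit."""
    best = 0
    for choice in product((0, 1), repeat=len(weights)):
        if sum(w * c for w, c in zip(weights, choice)) <= capacity:
            best = max(best, sum(v * c for v, c in zip(values, choice)))
    return best


def time_brute_force(item_counts):
    """Return {number of items: seconds taken} on hypothetical data."""
    timings = {}
    for n in item_counts:
        # Hypothetical items: item i has weight 10*i and value i.
        weights = [10 * i for i in range(1, n + 1)]
        values = list(range(1, n + 1))
        capacity = sum(weights) // 2  # hypothetical limit: half the total weight
        start = time.perf_counter()
        brute_force_best_value(weights, values, capacity)
        timings[n] = time.perf_counter() - start
    return timings


timings = time_brute_force([4, 8, 12])
for n, seconds in timings.items():
    print(f"{n:>2} items: {seconds:.4f} s")
```

Because each extra item doubles the number of packings to check, the measured times should roughly double per added item once n is large enough for the loop to dominate.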

Although Python has been the go-to for Data Science, languages like Fortran can still provide significant value especially when dealing with optimisation problems due to its inherent number-crunching abilities. It outperforms Python in solving the knapsack problem by brute-force, and the performance gap widens further as more items are added to the problem. Therefore, as a Data Scientist, you might want to consider investing your time in Fortran if you need an edge in computational power to solve your business and industry problems.

The full code used in this article can be found at my GitHub here: