How to Create Your Own CV Dataset Using Satellite Imagery: Wildfires from Space | by Aleksei Rozanov | May, 2024

A.I. Black Guy, May 9, 2024


Collecting images to train CNNs

Unless otherwise noted, all images are by the author, based on Sentinel-2 data.

Have you ever thought that a pet project applying ML to satellite images might significantly strengthen your data science portfolio? Or have you only ever trained models on datasets built by other people, never on your own? If the answer is yes, I have good news for you!

In this article I’ll guide you through the process of creating a Computer Vision (CV) dataset consisting of high-resolution satellite images, so you can use a similar approach and build a solid pet project!

🔥 The problem: wildfire detection (binary classification task).
🛰️ The instrument: Sentinel-2 (10/60 m resolution).
⏰ The time range: 2017/01/01–2024/01/01.
🇬🇧 The area of interest: the UK.
🐍 The Python code: GitHub.

Before acquiring any imagery, it’s vital to know where and when the wildfires were happening. To get such data, we will use the NASA Fire Information for Resource Management System (FIRMS) archive. Based on your requirements, you can select a data source and a region of interest there, submit a request, and get your data in a matter of minutes.

I decided to use MODIS-based data in the form of a CSV file. It comprises many different variables, but we are only interested in latitude, longitude, acquisition time, confidence and type. The last two variables are of particular interest to us. As you may guess, confidence is, roughly speaking, an estimate of how likely it is that a detection is a real fire. So to exclude false alarms I decided to keep only detections with a confidence above 70%. The second important variable is type, which classifies the detected heat source. I was interested only in burning vegetation, so only class 0 is kept. The resulting dataset has 1087 cases of wildfires.

import pandas as pd

df = pd.read_csv('./fires.csv')
df = df[(df.confidence > 70) & (df.type == 0)]  # confident detections of vegetation fires
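To see what this filter keeps, here is a minimal sketch on a few made-up rows, written with the standard library only so it runs without pandas. The sample values are invented for illustration; only the column names follow the variables listed above, and the function name `filter_hotspots` is my own.

```python
import csv
import io

# Made-up rows laid out like the FIRMS columns we care about (illustrative values only).
SAMPLE_CSV = """latitude,longitude,acq_date,confidence,type
55.10,-3.20,2018-06-27,85,0
53.40,-1.90,2019-02-26,64,0
51.60,-0.40,2020-05-12,92,2
57.30,-4.50,2021-04-19,71,0
"""

def filter_hotspots(csv_text, min_confidence=70, keep_type=0):
    """Keep rows with confidence strictly above the threshold and the given type."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in rows
        if int(row["confidence"]) > min_confidence and int(row["type"]) == keep_type
    ]

kept = filter_hotspots(SAMPLE_CSV)
for row in kept:
    print(row["acq_date"], row["latitude"], row["longitude"], row["confidence"])
```

The second row is dropped for low confidence and the third for its type, so two of the four sample rows survive.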

Now we can overlay the hotspots with the shape of the UK.

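import cartopy.crs as ccrs
import matplotlib.pyplot as plt

# Assumed to be defined earlier: gdf holds the filtered hotspots, shape holds the UK boundary
proj = ccrs.PlateCarree()
fig, ax = plt.subplots(subplot_kw=dict(projection=proj), figsize=(16, 9))

shape.geometry.plot(ax=ax, color='black')
gdf.geometry.plot(ax=ax, color='red', markersize=10)

ax.gridlines(draw_labels=True, linewidth=1, alpha=0.5, linestyle='--', color='black')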

The second stage of the work involves my favorite tool, Google Earth Engine (GEE), and its Python package ee (you can check out my other articles illustrating the capabilities of this service).

Under ideal conditions, Sentinel-2 acquires images with a temporal resolution of 5 days and a spatial resolution of 10 m for RGB bands and 20 m for SWIR bands (we will discuss what these are later). However, this doesn’t mean that we have an image of each location every 5 days, since many factors influence image acquisition, including clouds. So there is no chance we will get 1087 images; the number will be much lower.

Let’s create a script that fetches, for each point, a Sentinel-2 image with a cloud percentage lower than 50%. For each pair of coordinates we create a buffer and stretch it to a rectangle, which is later cut out of the larger image. All the images are converted to multidimensional arrays and saved as .npy files.
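A quick back-of-the-envelope calculation shows why the number of usable images must be far below 1087. The sketch below assumes evenly spaced overpasses every 5 days that are unrelated to the fire dates, and it ignores clouds entirely, so treat the result as a rough upper-end guess rather than a prediction.

```python
from datetime import date

START = date(2017, 1, 1)   # start of the study period
END = date(2024, 1, 1)     # end of the study period
REVISIT_DAYS = 5           # nominal Sentinel-2 revisit time
N_HOTSPOTS = 1087          # hotspots left after filtering

period_days = (END - START).days
max_passes_per_site = period_days // REVISIT_DAYS

# Assumption: a given day has a 1-in-5 chance of coinciding with an overpass.
expected_same_day = N_HOTSPOTS / REVISIT_DAYS

print(period_days)               # 2556
print(max_passes_per_site)       # 511
print(round(expected_same_day))  # 217
```

So even before any cloud filtering, only about one hotspot in five can be expected to have a same-day image.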

import datetime
import os
from os.path import join

import ee
import numpy as np
import pandas as pd

ee.Authenticate()
ee.Initialize()

uk = ee.FeatureCollection('FAO/GAUL/2015/level2').filter(
    ee.Filter.eq('ADM0_NAME', 'U.K. of Great Britain and Northern Ireland'))
SBands = ['B2', 'B3', 'B4', 'B11', 'B12']  # blue, green, red, SWIR 1, SWIR 2

# One ee.Geometry.Point per hotspot in df (the filtered FIRMS table)
points = []
for i in range(len(df)):
    points.append(ee.Geometry.Point([df.longitude.values[i], df.latitude.values[i]]))

os.makedirs('./S2', exist_ok=True)  # np.save does not create missing folders

for i in range(len(df)):
    # Search window: the day of the detection
    startDate = pd.to_datetime(df.acq_date.values[i])
    endDate = startDate + datetime.timedelta(days=1)
    # 2.5 km buffer around the hotspot, stretched to a rectangle
    roi = points[i].buffer(2500).bounds()
    S2 = ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')\
        .filterDate(startDate.strftime('%Y-%m-%d'), endDate.strftime('%Y-%m-%d'))\
        .filterBounds(roi)\
        .select(SBands)\
        .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 50))
    n_images = S2.size().getInfo()
    if n_images != 0:
        S2_list = S2.toList(n_images)
        for j in range(n_images):
            img = ee.Image(S2_list.get(j)).select(SBands)
            img = img.reproject('EPSG:4326', scale=10, crsTransform=None)
            # Download the clipped patch as a NumPy array and save it
            array = ee.data.computePixels({
                'expression': img.clip(roi),
                'fileFormat': 'NUMPY_NDARRAY'
            })
            np.save(join('./S2', f'{i}_{j}.npy'), array)
    print(f'Index: {i}/{len(df)-1}\tDate: {startDate}')
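To get a feel for what the 2.5 km buffer stretched to a rectangle looks like in coordinates, here is a rough local approximation in plain Python. It is not what GEE computes on its servers: it assumes a spherical Earth, and the function name `square_bounds` and the example point are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation

def square_bounds(lon, lat, half_side_m=2500):
    """Return (min_lon, min_lat, max_lon, max_lat) of a square centred on a point."""
    dlat = math.degrees(half_side_m / EARTH_RADIUS_M)
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = math.degrees(half_side_m / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)

# Hypothetical hotspot somewhere in southern Scotland.
bounds = square_bounds(-3.2, 55.1)
print(bounds)
```

At UK latitudes the box is noticeably wider in degrees of longitude than in degrees of latitude, even though it covers roughly 5 km in both directions on the ground.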

What are these SWIR bands (in particular, bands 11 and 12)? SWIR stands for Short-Wave Infrared. SWIR bands cover the part of the electromagnetic spectrum with wavelengths ranging from approximately 1.4 to 3 micrometers.

SWIR bands are used in wildfire analysis for several reasons:

Thermal Sensitivity: SWIR bands are sensitive to temperature variations, allowing them to detect heat sources associated with wildfires. So SWIR bands can capture information about the location and intensity of the fire.

Penetration of Smoke: Smoke generated by wildfires can obscure visibility in RGB images (i.e. you simply can’t see “under” the smoke). SWIR radiation penetrates smoke better than visible light, allowing for more reliable fire detection even in smoky conditions.

Discrimination of Burned Areas: SWIR bands can help in identifying burned areas by detecting changes in surface reflectance caused by fire-induced damage. Burned vegetation and soil often exhibit distinct spectral signatures in SWIR bands, enabling the delineation of the extent of the fire-affected area.

Nighttime Detection: SWIR sensors can detect thermal emissions from fires even during nighttime, when visible and near-infrared sensors are ineffective due to lack of sunlight. This enables continuous monitoring of wildfires round the clock.

So if we look at a random image from the collected data, we can see that while it’s hard to tell from the RGB image whether we are looking at smoke or cloud, the SWIR bands clearly show the presence of fire.

Now comes my least favorite part. It’s crucial to go through all of the pictures and check that each one actually shows a wildfire (remember, the confidence threshold was only 70%) and that the picture is generally correct.

For example, images like these (no hotspots are present) were acquired and automatically downloaded to the wildfire folder:

The total number of images after cleaning: 228.

The last stage is getting images without hotspots for our dataset. Since we are building a dataset for a classification task, we need to balance the two classes, so we need at least 200 such pictures.

To do that we will randomly sample points from the territory of the UK (I decided to sample 300):

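import ee
import numpy as np
from shapely.geometry import Point

# polygon: shapely geometry with the UK outline, assumed to be defined earlier
min_x, min_y, max_x, max_y = polygon.bounds
points = []
while len(points) < 300:
    # Draw a point in the bounding box and keep it only if it falls inside the UK
    random_point = Point(np.random.uniform(min_x, max_x), np.random.uniform(min_y, max_y))
    if random_point.within(polygon):
        points.append(ee.Geometry.Point([random_point.x, random_point.y]))
print('Done!')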
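The loop above is a form of rejection sampling: draw points uniformly in the bounding box and throw away the ones that land outside the shape. Here is a dependency-free sketch of the same idea, with a hand-written ray-casting test in place of shapely and a made-up triangle standing in for the UK outline. The helper names are my own.

```python
import random

def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon given by its vertices."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def sample_points(vertices, n_points, seed=0):
    """Draw n_points uniformly inside the polygon by rejection sampling."""
    rng = random.Random(seed)
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    points = []
    while len(points) < n_points:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, vertices):
            points.append((x, y))
    return points

# Hypothetical triangle in lon/lat, standing in for the real UK outline.
triangle = [(-5.0, 50.0), (1.0, 52.0), (-3.0, 58.0)]
samples = sample_points(triangle, 300)
print(len(samples))
```

The thinner the shape is relative to its bounding box, the more draws get rejected, which is why a country with a long, irregular outline like the UK needs many more than 300 attempts.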

Then, using the code written above, we acquire Sentinel-2 images and save them.

Boring stage again. Now we need to make sure that among these points there are no wildfires and no disturbed or incorrect images.

After doing that, I ended up with 242 images like this:

VI. Augmentation.

The final stage is image augmentation. In simple words, the idea is to increase the number of images in the dataset using the ones we already have. In this dataset we will simply rotate each image by 180°, thereby doubling the number of pictures in the dataset!

Now it’s possible to randomly sample the two classes of images and visualize them.
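The rotation itself can be sketched in a couple of lines of NumPy. This is a minimal illustration, assuming each image is an array whose first two axes are rows and columns; the function name `augment_with_rotation` and the toy array are made up, and the real patches would be loaded from the saved .npy files instead.

```python
import numpy as np

def augment_with_rotation(images):
    """Return the original images followed by their 180-degree rotations."""
    rotated = [np.rot90(img, k=2, axes=(0, 1)) for img in images]
    return list(images) + rotated

# Toy 2x3 "image" with 5 bands, standing in for a saved Sentinel-2 patch.
toy = np.arange(2 * 3 * 5).reshape(2, 3, 5)
augmented = augment_with_rotation([toy])

print(len(augmented))                                   # 2
print(np.array_equal(augmented[1][0, 0], toy[-1, -1]))  # True
```

A 180° rotation keeps the image shape, so it works for non-square patches too, which would not be the case for 90° rotations.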

No-WF:

WF:

That’s it, we’re done! As you can see, it’s not that hard to collect a lot of remote sensing data if you use GEE. The dataset we created can now be used to train CNNs of different architectures and compare their performance. In my opinion, it’s a perfect project to add to your data science portfolio, since it tackles a non-trivial and important problem.
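As a possible first step towards training, the two classes can be labelled, shuffled and split into training and validation sets. The sketch below is only one way to do it: the file names are hypothetical, the 80/20 split is an arbitrary choice, and the counts are the cleaned class sizes before augmentation.

```python
import random

def make_split(wf_files, no_wf_files, val_fraction=0.2, seed=42):
    """Label, shuffle and split file names into training and validation lists."""
    samples = [(name, 1) for name in wf_files] + [(name, 0) for name in no_wf_files]
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_fraction)
    return samples[n_val:], samples[:n_val]

# Hypothetical file names; 228 wildfire and 242 no-wildfire images after cleaning.
wf = [f"wf_{i}.npy" for i in range(228)]
no_wf = [f"no_wf_{i}.npy" for i in range(242)]
train, val = make_split(wf, no_wf)
print(len(train), len(val))  # 376 94
```

Splitting before augmenting is the safer order, since otherwise an image and its rotated copy could end up on different sides of the split.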

Hopefully this article was informative and insightful for you!


All my publications on Medium are free and open-access, that’s why I’d really appreciate if you followed me here!

P.S. I’m extremely passionate about (Geo)Data Science, ML/AI and Climate Change. So if you want to work together on a project, please contact me on LinkedIn.

🛰️Follow for more🛰️
