A simple take on effusion detection using chest X-rays

Ankur Sarode
7 min read · Mar 28, 2022

Chest X-rays are the most common form of X-ray imaging. A wide range of medical conditions can be diagnosed using a chest X-ray, making them some of the most important medical images available.

For this task, we are interested only in effusion. Effusion is the medical term for a build-up of fluid (mostly water, sometimes including blood or lymph) in and around the lungs. While in most scenarios this is not very serious, if left untreated for long, it can be fatal.

You can find the notebook and dataset here.

The Data

The dataset has the following 2 classes —

  1. Effusion
  2. NoFinding

The data is structured as follows:

CXR_data/
├── effusion/
└── nofinding/

The classes are already separated into directories as part of the dataset. Here’s a side-by-side comparison of images from the two classes.

[Image: side-by-side comparison of the two X-ray classes]

The image on the left clearly shows some inconsistencies: the lungs are not symmetric, a white fog blocks one of the lungs (which is the effusion), and the edges of the lungs are not clearly defined.

Data Preprocessing

import os
import glob
from skimage import io

DATASET_PATH = './CXR_data/'
disease_cls = ['effusion', 'nofinding']

# Read the first effusion image to inspect its dimensions
effusion_path = os.path.join(DATASET_PATH, disease_cls[0], '*')
effusion = glob.glob(effusion_path)
effusion = io.imread(effusion[0])
effusion.shape
Out[]: (1024, 1024)

Our training images have a shape of (1024, 1024), which is a high resolution for any image and makes this dataset quite heavy.

Since our use case is to detect serious effusion, which is visible over a large section of the image, we can resize our images. The loss of information associated with resizing won’t affect us negatively.

There are 2 things to consider when we load our image dataset:
i) The amount of processing power we have.
ii) The amount of training data we have.

Data Generation

To build any neural network, the first thing we need is good processing power and RAM. Unfortunately, it’s not always possible to train our models on the best of GPUs.

So to counter this, we will not load the entire dataset into memory, but rather create a mechanism where we can choose the number of batches of data to keep in memory.

The ImageDataGeneration class does two things:

  1. Reads data from storage and generates a randomly augmented copy
  2. Holds ‘b’ batches in memory, which can be referenced later

Caching

As we discussed, loading the entire dataset into memory requires a lot of RAM, and processing it requires significant processing power.

To counter this, we’ll introduce an ordered dictionary that stores each batch as a tuple of (X, y) against the index of the batch (the batch number):

self.cached_batches = OrderedDict() ##{batch_number : (X, y)}

At every epoch, the __getitem__() function is called to load batches into memory. In this call, we check whether the batch is already present in cached_batches; if so, we return it directly from the cache without reading from disk. If the batch is not in the cache, we read it from disk and store it in the cache as well.

Whenever we read from or write to the cache, we maintain a maximum cache size; if this size is exceeded, we replace the oldest entry in the cache with the new batch.
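Here is a minimal sketch of what this could look like on top of keras.utils.Sequence. The actual class in the notebook also handles resizing and augmentation; this sketch shows only the caching logic, and parameters like max_cache_size are illustrative.

from collections import OrderedDict
import numpy as np
from skimage import io
from tensorflow.keras.utils import Sequence

class ImageDataGeneration(Sequence):
    def __init__(self, image_paths, labels, batch_size=32, max_cache_size=8):
        self.image_paths = image_paths
        self.labels = labels
        self.batch_size = batch_size
        self.max_cache_size = max_cache_size
        self.cached_batches = OrderedDict()  ##{batch_number : (X, y)}

    def __len__(self):
        return int(np.ceil(len(self.image_paths) / self.batch_size))

    def __getitem__(self, idx):
        # Serve from the cache when possible, skipping the disk read entirely
        if idx in self.cached_batches:
            return self.cached_batches[idx]
        # Otherwise read the batch from disk
        start = idx * self.batch_size
        paths = self.image_paths[start:start + self.batch_size]
        X = np.stack([io.imread(p) for p in paths])
        y = self.labels[start:start + self.batch_size]
        # Evict the oldest entry once the cache is full
        if len(self.cached_batches) >= self.max_cache_size:
            self.cached_batches.popitem(last=False)
        self.cached_batches[idx] = (X, y)
        return X, y

One trade-off to keep in mind: a cached training batch returns the same augmented copy until it is evicted, so a smaller cache means more disk reads but also fresher augmentations.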

This will give us some flexibility over how we want to use our resources and get the best out of our setup.

Augmentation

Having slightly modified copies of our input images is a good idea, considering that we have limited training data. Augmentation is applied only to the training data, while resizing is done for both training and validation.

We’ll define our augmentation datagen as —

from skimage.transform import rescale
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    rotation_range=10,
    width_shift_range=0,
    height_shift_range=0,
    vertical_flip=False,)

def resize_img(img):
    # Min-max normalize to [0, 1], then downscale to 25% of the original size
    img = (img - img.min())/(img.max() - img.min())
    img = rescale(img, 0.25, multichannel=True, mode='constant')
    return img

def augment_img(img, mode):
    # Random augmentation is applied to training images only
    if mode == 'train':
        img = datagen.random_transform(img)
    return img
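In use, resizing happens for every image, while augmentation kicks in only for training mode. Note that rescale(..., multichannel=True) expects a channel axis, so a grayscale X-ray would first need to be stacked to three channels; the stacking below is an assumption on my part (ResNet inputs are typically 3-channel), not code from the notebook.

import numpy as np

img = io.imread(glob.glob(effusion_path)[0])  # (1024, 1024) grayscale
img = np.stack([img] * 3, axis=-1)            # assumed: replicate to 3 channels
img = resize_img(img)                         # -> (256, 256, 3)
img = augment_img(img, mode='train')          # random transform, train mode only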

Now, there are certain limitations to the augmentations we can use on an X-ray image. For example, you will never find an X-ray that is vertically flipped, or one that is inverted (i.e., bones are black and the background is white). So before we apply any augmentation to our images, we need to consider how realistic the transformation is.

We will apply the following transformations after resizing our image to 25% of its initial size:
i) featurewise_center: standardize the image (mean = 0)
ii) featurewise_std_normalization: normalize the image (divide by the standard deviation)
iii) rotation_range of 10 degrees: rotate randomly within the range of -10 to +10 degrees

(Note that in Keras, the two featurewise options only take effect after the generator has been fitted on sample data via datagen.fit().)

The augmented image will have a shape of (256, 256) and may have one or more of the above transformations applied. The reduced size makes our dataset lighter and yields faster computation.

The extent of augmentation is random, and we have introduced an ‘isAugmented’ flag to choose whether or not to apply augmentation at all.

The Model

We’ll be using vanilla resnet_18 for this task. The code for resnet is added as an additional script to the notebook.

Ablation Test

Executing an ablation test for our model with 5% of the training data for 5 epochs yields the following result:

Epoch 5/5
2/2 [==============================] - 0s 66ms/step - loss: 1.2721 - accuracy: 0.9032

At first glance, the result looks promising. The model is overfitting the small subset, which is exactly what we want from an ablation test.

But since we know that our dataset has a significant class imbalance, using accuracy as an evaluation metric will not show us the full picture.
For example, if there are 100 sample images with 99 nofinding samples and 1 effusion sample, a model that classifies every sample as nofinding still achieves an accuracy of 0.99.
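To see this concretely, here is a toy illustration (not the actual dataset) comparing accuracy and ROC-AUC for such a degenerate classifier:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 99 + [1])  # 99 nofinding, 1 effusion
y_pred = np.zeros(100)             # a model that always predicts nofinding

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(roc_auc_score(y_true, y_pred))   # 0.5  -- no better than chance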

Therefore, it makes sense to introduce the ROC-AUC score as an evaluation metric. We can create a callback for our model that evaluates the ROC-AUC score on the validation data after each epoch.
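A minimal sketch of such a callback is below; the exact implementation in the notebook may differ, and val_data being an (X_val, y_val) tuple with one-hot labels is an assumption here.

import numpy as np
from sklearn.metrics import roc_auc_score
from tensorflow.keras.callbacks import Callback

class AucCallback(Callback):
    def __init__(self, val_data):
        super().__init__()
        self.X_val, self.y_val = val_data

    def on_epoch_end(self, epoch, logs=None):
        y_prob = self.model.predict(self.X_val)
        # Assumes the second column is the positive ('effusion') class
        auc = roc_auc_score(self.y_val[:, 1], y_prob[:, 1])
        print(f'Val AUC for epoch{epoch}: {auc}')
        if logs is not None:
            logs['val_auc'] = auc  # lets ModelCheckpoint monitor 'val_auc'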

Running an ablation test with this callback for 5 epochs using 20% of the test data yields the following result:

Epoch 5/5
5/5 [==============================] - 0s 84ms/step - loss: 1.2414 - accuracy: 0.9119 - val_loss: 1.6121 - val_accuracy: 0.8710

Val AUC for epoch4: 0.4444444444444444

Quite evidently, the accuracy is 0.91 for training and 0.87 for validation, but the true story is told by the AUC score of 0.44. We can infer that the model is not learning the underlying pattern and is significantly underfitting.

Model Tuning

1) Weighted Cross-Entropy
To counter the class imbalance present in the data, a good approach is to use a weighted cross-entropy loss.
For every incorrect classification (FN / FP), we penalize the model with a higher weight.

bin_weights[0, 1] = 5 #For False Negative
bin_weights[1, 0] = 5 #For False Positive
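One common way to wire such a weight matrix into Keras is a custom loss that scales each sample’s cross-entropy by the weight of its (true class, predicted class) cell. The notebook’s exact implementation may differ; this is a sketch of the idea, repeating the weight matrix for completeness.

import itertools
import numpy as np
from tensorflow.keras import backend as K

bin_weights = np.ones((2, 2))
bin_weights[0, 1] = 5  # penalize false negatives
bin_weights[1, 0] = 5  # penalize false positives

def w_categorical_crossentropy(y_true, y_pred):
    nb_cl = len(bin_weights)
    # Build a per-sample mask holding bin_weights[true class, predicted class]
    final_mask = K.zeros_like(y_pred[:, 0])
    y_pred_max = K.max(y_pred, axis=1, keepdims=True)
    y_pred_max_mat = K.cast(K.equal(y_pred, y_pred_max), K.floatx())
    for c_p, c_t in itertools.product(range(nb_cl), range(nb_cl)):
        final_mask += bin_weights[c_t, c_p] * y_pred_max_mat[:, c_p] * y_true[:, c_t]
    return K.categorical_crossentropy(y_true, y_pred) * final_mask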

2) Decaying Learning Rate
When we use a fixed learning rate, there is a risk of it being either too small or too large. If it’s too small, convergence takes too long; if it’s too large, the loss keeps oscillating around the minima.

To counter this, we use a decaying learning rate, which starts at a large value (say 0.01) and decays after every epoch. The decay can be linear or exponential depending on our requirements; here we will use an exponential decay function.

new_lr = self.base_lr * (0.7 ** (epoch // self.decay_epoch))
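Wrapped in a Keras callback, the decay could look like the sketch below. The base_lr and decay_epoch defaults are illustrative (though base_lr=0.01 with decay_epoch=1 is consistent with the ‘New LR’ log shown later), not values taken from the notebook.

from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import Callback

class DecayLR(Callback):
    def __init__(self, base_lr=0.01, decay_epoch=1):
        super().__init__()
        self.base_lr = base_lr
        self.decay_epoch = decay_epoch

    def on_epoch_end(self, epoch, logs=None):
        # Exponential decay: multiply by 0.7 every decay_epoch epochs
        new_lr = self.base_lr * (0.7 ** (epoch // self.decay_epoch))
        K.set_value(self.model.optimizer.lr, new_lr)
        print(f'New LR: {new_lr} epoch: {epoch}')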

Final Run

Before we train our model on the complete dataset, it is good practice to set up a checkpoint mechanism to store the best model as we run through the epochs.
If the model starts to overfit after a certain epoch, we don’t want to re-train it, since that would require a significant amount of resources.

from tensorflow.keras.callbacks import ModelCheckpoint

# Save the model with the best validation AUC seen so far
filepath = 'models/best_model.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='val_auc', verbose=1,
                             save_best_only=True, mode='max')

With all this in place, we are good to run our training with the complete dataset for 30 epochs.
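Roughly, the final training call wires all of these pieces together; the generator and callback names below follow the earlier sketches, not necessarily the notebook.

history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=30,
    # AucCallback runs first so 'val_auc' is in logs before the checkpoint
    callbacks=[AucCallback(val_data), checkpoint, DecayLR()],
)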

Training yields the following result —

Epoch 30/30
55/55 [==============================] - 4s 74ms/step - loss: 1.8010 - accuracy: 0.7929 - val_loss: 1.8938 - val_accuracy: 0.7900
New LR: 3.219905755813174e-07 epoch: 29

Validation AUC for epoch29: 0.8590936884299205
Epoch 00030: val_auc did not improve from 0.85909

The accuracy scores are now around 0.79 for both training and validation, which implies no overfitting.
The ROC-AUC score of about 0.86 is a good score for this task, considering that we have a very limited dataset.

Overall, we can say that training looks good, and the model seems generalizable for effusion detection.

Predictions

Let’s predict on a randomly chosen sample from the ‘effusion’ category:

val_model.predict(img[np.newaxis,:])
Out[]: array([[0.13240255, 0.86759746]], dtype=float32)

The model has predicted that the image belongs to the ‘effusion’ category with a probability of about 0.87, which is a confident prediction.

Final Thoughts

Effusion detection using chest X-rays is a challenging task on multiple levels. It’s always possible that other conditions look similar to an effusion on an X-ray.

Training a model that can reliably detect effusions requires much more data, spanning more categories. The data we used here was only a sample, for this simple implementation of a CNN model.

After all, a CNN model can only hint at the presence of an effusion; the final diagnosis is always in the hands of a doctor.
