Note: the code for all of this is available on GitLab here: https://gitlab.ulb.be/lisa/tcgasampler. It may even perhaps work.
The goal
I would like to have a relatively big stock of TCGA whole-slide images, from various tissue locations and cancer types, and at a relatively low resolution, so that I can quickly test algorithms such as tissue segmentation or artifact detection that generally don’t require super-high magnification.
The National Cancer Institute provides an API to access their datasets programmatically, so this is what I’d like to do:
- Get the list of all the diagnostic tissue slides (see the sketch after this list).
- Randomly select N slides.
- Download and save a copy of those slides at a chosen resolution.
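The first step can be done once through the GDC search API and saved as a manifest. Here is a minimal sketch: the files endpoint and its filters/fields/format parameters are the documented API, but the exact filter values (“Diagnostic Slide”, “SVS”) and the output filename are my assumptions about how to select the right files:

import json
import requests

FILES_ENDPT = "https://api.gdc.cancer.gov/files"

# Assumed filter: TCGA files whose experimental strategy is "Diagnostic Slide",
# stored as SVS whole-slide images.
filters = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "files.experimental_strategy", "value": ["Diagnostic Slide"]}},
        {"op": "=", "content": {"field": "files.data_format", "value": ["SVS"]}},
        {"op": "=", "content": {"field": "cases.project.program.name", "value": ["TCGA"]}},
    ],
}
params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name",
    "format": "TSV",       # one row per slide, like a manifest
    "size": "40000",       # large enough to return every matching slide
}
response = requests.get(FILES_ENDPT, params=params)
with open("manifest.tsv", "w") as fp:  # hypothetical manifest filename
    fp.write(response.text)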
Accessing image data
There are basically two ways to access image data for a given slide:
- Downloading the entire whole-slide image using the “data endpoint” of the API:
https://api.gdc.cancer.gov/data/{file_id}
- Directly loading the tiles from their tile server, used by their online slide viewer:
https://portal.gdc.cancer.gov/auth/api/v0/tile/{file_id}?level={level}&x={x}&y={y}
Both options come with their own set of problems.
In the first case, we have to download a full-resolution slide, which is often 1 GB or more, just to extract the low-resolution image from it (and then delete the full WSI to avoid filling up local storage too quickly).
In the second case… we don’t have access to any of the metadata of the slide, including, crucially, the resolution and magnification level. This means that we have to kinda guess which “level” – as understood by the tile server – we should work at. Of course, the “levels” of the tile server are not the same as the “levels” of the image pyramid in the slide file. But at least we only download the data that we want, which is a lot faster.
So I’ve tried both: with the first option, I get exactly the resolution I want for all images, but it takes a ton of time; with the second, it’s faster, but the resolution varies from slide to slide.
The overall pipeline
We start by setting a random seed for repeatability, and then we load and shuffle the file ids from the manifest:
import random

random.seed(RANDOM_SEED)  # fixed seed, so the same "random" slides are drawn on every run
with open(PATH_TO_MANIFEST, 'r') as fp:
    all_tissue_slides = [line.strip().split('\t') for line in fp.readlines()]
rows = all_tissue_slides[1:]  # skip the header row of the manifest
row_ids = list(range(len(rows)))
random.shuffle(row_ids)
We can then iterate through the row_ids to get the file_id and filename for each slide. This will randomly sample through the full TCGA dataset.
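As a sketch, assuming the standard GDC manifest layout where id and filename are the first two columns (N_SLIDES is a hypothetical constant for the sample size):

for idx in row_ids[:N_SLIDES]:
    file_id, filename = rows[idx][0], rows[idx][1]
    # ...download and convert the slide with Option A or Option B below...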
Option A: downloading full slides
In the first option, we use the data endpoint to download the whole-slide file and save it to disk:
import os
import requests

data_endpt = f"{GDC_DATA_ENDPT}/{file_id}"  # https://api.gdc.cancer.gov/data/{file_id}
response = requests.get(data_endpt, headers={"Content-Type": "application/json"})
with open(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"), "wb") as fp:
    fp.write(response.content)  # write the full WSI to a temporary folder
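A side note from me, not from the original pipeline: since these files often weigh 1 GB or more, response.content buffers the whole slide in RAM. A streamed variant, using requests’ standard streaming mode, avoids that:

# stream the slide to disk in 1 MiB chunks instead of holding it all in memory
with requests.get(data_endpt, stream=True) as response:
    response.raise_for_status()
    with open(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"), "wb") as fp:
        for chunk in response.iter_content(chunk_size=1 << 20):
            fp.write(chunk)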
Then we can use our WholeSlide class (available in our openwholeslide package) to extract the image at the target resolution:
from openwholeslide import WholeSlide  # import path assumed from the package name
from skimage.io import imsave          # imsave assumed to come from scikit-image

wsi = WholeSlide(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"))
region = wsi.read_full(resolution=TARGET_RESOLUTION)  # TARGET_RESOLUTION in µm/px
imsave(os.path.join(TCGA_LOCAL_PATH, f"images/{filename}.png"), region.as_ndarray)
wsi.slide.close()
And finally we remove the downloaded file, so that the disk doesn’t fill up:
os.remove(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"))
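One defensive tweak of my own, not in the original code: wrapping the extraction (reusing the names from the snippets above) in try/finally, so the temporary file is deleted even when the conversion crashes on a corrupt slide:

tmp_path = os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}")
try:
    wsi = WholeSlide(tmp_path)
    region = wsi.read_full(resolution=TARGET_RESOLUTION)
    imsave(os.path.join(TCGA_LOCAL_PATH, f"images/{filename}.png"), region.as_ndarray)
    wsi.slide.close()
finally:
    os.remove(tmp_path)  # never leave a 1 GB+ file lying around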
Option B: downloading tiles
Getting a tile from the GDC API requires providing the file id – which we have from the manifest – alongside the level and an “x-y” position. The “x-y” position is a tile index, so that the first tile is (x=0, y=0), then the one to its right is (x=1, y=0), etc. The level is related to the size of the image. It’s not really documented, but from playing a bit with the API it’s clear that the system basically starts from a 1x1 px image at level 0, then doubles in size with each level. At some point it diverges from the expected 2x2, 4x4, 8x8, etc., so that the real aspect ratio is preserved. So for instance, for the image with id 455d9335-c6f3-4966-8b3c-1291e2d31093, we have:
- level 0: 1x1
- level 1: 2x2
- level 2: 4x3
- level 3: 7x6
- level 4: 14x11 …
- level 9: 428x351
We can get some metadata from the API using: https://portal.gdc.cancer.gov/auth/api/v0/tile/metadata/455d9335-c6f3-4966-8b3c-1291e2d31093. It tells us the full size of the image at maximum resolution (here 109559x89665), and the tile size (here 512). So we know that above level 9, we will start to have multiple tiles.
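Those Width/Height/Overlap/TileSize fields look a lot like the standard Deep Zoom pyramid format. Assuming that convention holds (halve at each level down from full resolution, rounding up), a small helper reproduces the level sizes listed above:

import math

def level_dimensions(full_width, full_height, level):
    # the top level shows the image at full resolution;
    # every level below it halves both dimensions (rounding up)
    max_level = math.ceil(math.log2(max(full_width, full_height)))
    scale = 2 ** (max_level - level)
    return math.ceil(full_width / scale), math.ceil(full_height / scale)

print(level_dimensions(109559, 89665, 9))  # (428, 351), as observed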
Let’s try to grab the images at a resolution such that the image is, at most, 2048px in its largest dimension. For that, we need to use level 11: there, this example image is about 1712 px in its largest dimension, while level 12, at about 3424 px, would already be too big. But first, we need to grab the metadata to get the aspect ratio and have an idea of the total size of the resulting image.
import json
import math
import numpy as np

meta_url = f"{GDC_META_ENDPT}/{file_id}"
response_meta = requests.get(meta_url)
metadata = json.loads(response_meta.content)
full_width = int(metadata['Width'])
full_height = int(metadata['Height'])
overlap = int(metadata['Overlap'])     # neighbouring tiles share this many pixels
tile_size = int(metadata['TileSize'])

# scale the full dimensions down so that the largest side is 2048px
if full_width > full_height:
    expected_width = 2048
    expected_height = math.ceil(2048 * full_height / full_width)
else:
    expected_width = math.ceil(2048 * full_width / full_height)
    expected_height = 2048
downloaded_image = np.zeros((expected_height, expected_width, 3), dtype=np.uint8)
We will then use the tile_size to compute the maximum number of tiles per dimension, and slowly fill the image from the tiles. We also keep track, in max_x and max_y, of the real size of the resulting image so that we can crop it out of the temporary black image at the end.
from io import BytesIO
from skimage.io import imread, imsave

tile_url = lambda x, y: f"{GDC_TILE_ENDPT}/{file_id}?level=11&x={x}&y={y}"
max_tiles = (2048 // (tile_size - overlap)) + 1  # upper bound on the tile count per dimension
max_x = 0
max_y = 0
for y in range(max_tiles):
    for x in range(max_tiles):
        response_img = requests.get(tile_url(x, y))
        if response_img.ok:  # requests past the image border return an error
            image = imread(BytesIO(response_img.content))
            startx = x * (tile_size - overlap)
            starty = y * (tile_size - overlap)
            downloaded_image[starty:starty + image.shape[0], startx:startx + image.shape[1]] = image[..., :3]
            max_x = max(max_x, startx + image.shape[1])
            max_y = max(max_y, starty + image.shape[0])
down_path = os.path.join(TCGA_LOCAL_PATH, f"images/{filename}.png")  # same output layout as Option A (assumed)
imsave(down_path, downloaded_image[:max_y, :max_x])  # crop away the unused black border
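The level is hard-coded to 11 here, which is specific to this example slide. Reusing the powers-of-2 reasoning from earlier, a hypothetical helper could pick the right level for any slide from its metadata:

import math

def level_for_max_dim(full_width, full_height, max_dim=2048):
    # highest tile-server level whose largest dimension still fits within max_dim
    max_level = math.ceil(math.log2(max(full_width, full_height)))
    return max_level - math.ceil(math.log2(max(full_width, full_height) / max_dim))

print(level_for_max_dim(109559, 89665))  # 11, the level used above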
Conclusions
With Option A, it took me about 11 hours to get 100 random images at a resolution of 15µm/px – but at least I know their resolution. It took only 2 hours to get the same number of images with Option B… but with varying resolutions.
In the end, I think both options can be useful ways of randomly sampling TCGA data without storing the full WSIs locally. Which, given their combined 12.95 TB size, I’d rather not do.