This repo contains a suite of tools used for training an image segmentation model on UK Lidar data with the aim of detecting archaeological features in the landscape.
The UK National Lidar Programme provides 1m-resolution digital terrain model (DTM) elevation data across a large fraction of the UK, made available by Defra. Because Lidar can penetrate surface vegetation and tree cover, it can reveal topographical features that are not visible in satellite imagery. Lidar has previously been used in regions of dense forest canopy to identify sites of historic or archaeological significance, and its use has been growing.
The raw data from Defra consists of around 400GB across over 5000 tiles, each tile covering a 5km x 5km square in the British National Grid system.
The data for each tile is stored in a directory containing a `.tif` with the raw elevation data (50-100MB per tile) and a subdirectory `index/` containing geospatial metadata:
```
lidarnn_raw/LIDAR-DTM-1m-2022-NY11se/
|-- index/
|-- lidar_used_in_merging_process/
|-- SP52sw-DTM-1m.tif
|-- SP52sw-DTM-1m.tif.xml
`-- SP52sw-DTM-1m.tfw
```
Rather than training a model on raw high-precision elevation data, `data/lidar_helper.py` applies a hillshade filter as part of the image preprocessing step. This serves a dual purpose: it amplifies the local features we expect to be important, and it compresses the data.
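For reference, a hillshade can be computed directly from the elevation gradients. The sketch below is a standard textbook formulation, not necessarily the exact filter used in `data/lidar_helper.py`:

```python
import numpy as np

def hillshade(dtm, azimuth_deg=315.0, altitude_deg=45.0, cellsize=1.0):
    """Compute a simple hillshade (0-255 uint8) from a 2-D DTM array.

    dtm: elevations in metres; cellsize: grid spacing in metres (1 m here).
    """
    az = np.radians(360.0 - azimuth_deg + 90.0)   # compass -> math convention
    alt = np.radians(altitude_deg)
    dy, dx = np.gradient(dtm, cellsize)           # surface gradients
    slope = np.arctan(np.hypot(dx, dy))
    aspect = np.arctan2(dy, -dx)
    shaded = (np.sin(alt) * np.cos(slope)
              + np.cos(alt) * np.sin(slope) * np.cos(az - aspect))
    return (255 * np.clip(shaded, 0, 1)).astype(np.uint8)

# A flat surface gets uniform illumination proportional to sin(altitude)
flat = np.zeros((64, 64))
print(hillshade(flat)[0, 0])  # 180 for a 45-degree sun: 255 * sin(45°) ≈ 180
```

Note the output is a single byte per pixel, versus the float32/float64 elevations in the raw `.tif`, which is where the compression comes from.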
The Historic England Scheduled Monuments dataset covers close to 20,000 scheduled monuments across the UK. Examples include Roman-era sites, barrows and tumuli, castles, earthworks, and the remains of ancient villages. Each element of this dataset is a detailed polygon tracing the perimeter of the feature. This makes it an excellent candidate for a training label, as it precisely masks the geographical features we expect to be present in the Lidar data.
Our task is to build and train a neural network that takes Lidar images as input and outputs a binary mask. The baseline model is based on the U-Net architecture, first developed for biomedical image segmentation. The model can be found in `model/unet.py`.
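In interface terms (a hypothetical sketch; the real network is in `model/unet.py`), the model maps a batch of single-channel hillshade tiles to per-pixel logits of the same spatial size, and a sigmoid plus threshold yields the binary mask:

```python
import torch
from torch import nn

# Stand-in for the U-Net: any module that preserves spatial size fits here.
model = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)

tiles = torch.randn(4, 1, 256, 256)    # batch of hillshade tiles
logits = model(tiles)                  # per-pixel scores, same H x W
mask = torch.sigmoid(logits) > 0.5     # binary segmentation mask
print(logits.shape, mask.dtype)        # torch.Size([4, 1, 256, 256]) torch.bool
```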
The scripts under `data/` perform the bulk of data acquisition and preprocessing.
- `data/lidar_downloader.py` - SFTP interface (using `paramiko`) for connecting to the Defra SFTP server, listing its contents, and downloading data in smaller chunks. For this to work, see How To Run.
- `data/lidar_helper.py` - Helper functions for processing Lidar data into model-ready features. Uses `rasterio` and `geopandas`.
- `data/lidar_plan.py` - Manages the processing pipeline asynchronously using task queues shared between multiple processes. This speeds up the overall work, and allows the pipeline to be run on smaller chunks for testing before scaling up.
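The queue pattern behind `data/lidar_plan.py` can be sketched with the standard library. Stage logic and names here are illustrative, not the actual code:

```python
import multiprocessing as mp

def stage(in_q, out_q):
    """One pipeline stage: consume tasks until a None sentinel arrives."""
    while True:
        task = in_q.get()
        if task is None:          # sentinel: no more work upstream
            out_q.put(None)
            return
        out_q.put(task * 2)       # stand-in for download/unzip/preprocess work

if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=stage, args=(in_q, out_q))
    worker.start()
    for task in range(3):
        in_q.put(task)
    in_q.put(None)                # signal end of work
    results = []
    while (item := out_q.get()) is not None:
        results.append(item)
    worker.join()
    print(results)  # [0, 2, 4]
```

Chaining several such stages (download, unzip, preprocess) gives each one its own process, so slow network I/O overlaps with CPU-bound preprocessing.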
- `data/synthetic_data.py` - Utility for creating synthetic features using real masks with noise. Not a great model for real Lidar data, but handy for verifying model convergence and sanity.
- `util/data_loading.py` - Implementation of a PyTorch `Dataset` class for accessing the features and labels. This file contains implementations of both `LidarDataset` and `LidarDatasetSynthetic`, which means we can swap between real and synthetic data without changing the rest of the training code.
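The swap works because both dataset classes expose the same `(feature, mask)` interface to the `DataLoader`. A toy stand-in (hypothetical, not the real classes) illustrates the contract:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyLidarDataset(Dataset):
    """Stand-in with the same contract as LidarDataset / LidarDatasetSynthetic."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        feature = torch.zeros(1, 32, 32)   # hillshade tile
        mask = torch.zeros(1, 32, 32)      # binary label
        return feature, mask

# The training loop only sees (feature, mask) batches, so either dataset
# class can be passed here without further changes.
loader = DataLoader(ToyLidarDataset(), batch_size=4)
features, masks = next(iter(loader))
print(features.shape)  # torch.Size([4, 1, 32, 32])
```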
- `model/unet.py` - Implementation of U-Net using PyTorch. [Under construction]
- `train.py` - [TODO] Helper functions for training.
- `train.ipynb` - [TODO] Notebook for running training + visualisation. [Under construction]
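Until `train.py` is filled in, a minimal training step for this setup might look roughly like the following sketch. The model and data are placeholders, not the real components:

```python
import torch
from torch import nn

# Placeholder model and batch; a real run would use model/unet.py and
# batches from util/data_loading.py.
model = nn.Conv2d(1, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()     # binary per-pixel mask -> BCE on logits

features = torch.randn(4, 1, 64, 64)   # hillshade tiles
masks = torch.zeros(4, 1, 64, 64)      # binary labels

optimizer.zero_grad()
loss = criterion(model(features), masks)
loss.backward()
optimizer.step()
print(float(loss))                     # scalar loss for this step
```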
If you want to run this code yourself, you will need to download the following data:
- UK National Lidar Programme. Refer to the link for download options. If downloading via SFTP, create the following file named `sftpconfig.json` in the project root:
```json
{
    "SFTP_USER": "",
    "SFTP_HOST": "",
    "SFTP_PASSWORD": "",
    "SFTP_REMOTE_DIRECTORY": ""
}
```
- Historic England Scheduled Monuments. Extract to `{SHAPE_PATH}/monuments/` such that the file `{SHAPE_PATH}/monuments/Scheduled_monuments.shp` exists.
- UK boundaries. This is used to mask out the sea on tiles straddling the coastline. Extract to `{SHAPE_PATH}/gb/`.
```python
from data.lidar_plan import DataPipeline
from data.lidar_downloader import list_files

# Run this the first time - it lists the contents of all DTM files on the remote
# server and saves it to ./ls.txt. This file is subsequently used to manage the task queue.
list_files('sftpconfig.json', 'ls.txt')

pipeline = DataPipeline(
    data_raw_path=RAW_PATH,    # location .zip files will be downloaded to
    data_out_path=OUT_PATH,    # location preprocessed features and masks will be placed
    shape_path=SHAPE_PATH,     # location the monuments and UK boundaries datasets were extracted to
    remote_ls_file='ls.txt',
)

# Run the entire pipeline on the first 500 items. By default it will spin up 1 process
# for downloading, 1 for unzipping, and 2 for preprocessing.
pipeline.run(N=500)
```
See the `if __name__ == '__main__'` section of `data/lidar_plan.py` for example usage.