Noise pollution data cleanup

Feb 8, 2019 · Georg Heiler · 3 min read

Cleaned results as a pandas dataframe

The Austrian government provides a great source of open data on noise pollution. However, it can only easily be explored through their map-based visualization. When you want to download the data and perform custom analytics on it, some data cleaning is required, as you are presented with a hierarchy of files.

getting started

The first step is to download the data from Lärminfo.

Then, after unzipping, there are multiple categories available, represented as a hierarchy of folders per state:

├── flug_2017_noise
│   ├── INSPIRE_2017_FLUGHAEFEN_24H_ZONEN_KT
...
│   └── INSPIRE_2017_FLUGHAEFEN_NACHT_ZONEN_WI
├── industrie_2017_noise
│   ├── INSPIRE_2017_IPPC_24H_ZONEN_NO
....
│   └── INSPIRE_2017_IPPC_NACHT_ZONEN_WI
├── schiene_2017_noise
│   ├── INSPIRE_2017_SCHIENE_24H_ZONEN_BG
....
│   └── INSPIRE_2017_SCHIENE_NACHT_ZONEN_WI
└── strasse_2017_noise
    ├── INSPIRE_2017_STRASSE_24H_ZONEN_BG
    ....
    └── INSPIRE_2017_STRASSE_NACHT_ZONEN_WI

Data cleaning can easily be accomplished using Python. The folder hierarchy needs to be parsed recursively. Note that a generator is constructed rather than a list; this can be more memory efficient, as only the objects actually required during processing need to be held in memory.

from pathlib import Path

def iter_dirs(directory_in_str, glob):
    # lazily yield all paths below the directory that match the glob pattern
    pathlist = Path(directory_in_str).glob(glob)
    for path in pathlist:
        yield str(path)
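
Because iter_dirs is a generator, the directory tree is only traversed as paths are consumed, for example (the folder name here is just an illustration):

shapefiles = iter_dirs('strasse_2017_noise', '**/*.shp')
first_path = next(shapefiles)  # only now is the first matching path resolved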

Each path looks like 2017_STRASSE_NACHT_ZONEN_TI.shp and encodes some attribute values. These are required later on to differentiate between the different layers and need to be retained. parse_attributes_from_path will extract them.

def parse_attributes_from_path(path):
    # the file name encodes year, kind, timing and state, separated by underscores
    file_name = path.split('/')[-1]
    elements = file_name.split('_')
    result = {}
    result['year'] = elements[0]
    result['kind'] = elements[1]
    result['timing'] = elements[2]
    # strip the .shp extension from the last element to obtain the state code
    result['state'] = elements[-1].split('.')[0]
    return result
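
For the example file name above this yields:

parse_attributes_from_path('2017_STRASSE_NACHT_ZONEN_TI.shp')
# {'year': '2017', 'kind': 'STRASSE', 'timing': 'NACHT', 'state': 'TI'}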

Finally, you can start to use the functions defined above and loop over all shapefiles. The loop also relies on a small helper, add_columns_to_df, which is not shown in the post; a minimal sketch, assuming it simply attaches the parsed attributes as constant columns, could look like this:
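
def add_columns_to_df(df, attributes):
    # hypothetical helper: attach every parsed attribute as a constant column
    for name, value in attributes.items():
        df[name] = value
    return df

With the helper in place, the loop itself looks like this: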

import geopandas as gp
from tqdm import tqdm

# c is the author's config object holding BASE_PATH, the directory with the unzipped data
paths = iter_dirs(c.BASE_PATH, '**/*.shp')
tmp_appended_data = []
for path in tqdm(paths):
    print(path)
    attributes_from_filename = parse_attributes_from_path(path)
    df = gp.read_file(path)
    df = add_columns_to_df(df, attributes_from_filename)
    tmp_appended_data.append(df)
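
The loop only collects one GeoDataFrame per shapefile; to obtain a single table they still have to be concatenated, for example like this (df_all is my own name for the result):

import pandas as pd

df_all = gp.GeoDataFrame(pd.concat(tmp_appended_data, ignore_index=True))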

This extracts the attributes from all file paths and concatenates the results. You should end up with something similar to:

DB_LO	ZUST	geometry	kind	state	timing	year
0	45	ALLE	POLYGON ((252412.115130411 374722.80843502, 25...	STRASSE	TI	NACHT	2017
1	45	ALLE	POLYGON ((250225.579655268 374848.450664676, 2...	STRASSE	TI	NACHT	2017
2	45	ALLE	POLYGON ((257144.687224785 375790.285411876, 2...	STRASSE	TI	NACHT	2017
3	45	ALLE	POLYGON ((252474.722654981 374521.47906019, 25...	STRASSE	TI	NACHT	2017
4	45	ALLE	POLYGON ((252519.897499734 376489.588762502, 2...	STRASSE	TI	NACHT	2017
...

summary

Using a few snippets of Python makes obtaining a neatly cleaned dataset feel almost too easy. I was impressed by how quickly the data is read and processed up to the point where everything is concatenated, and found the last step of writing the result to disk rather slow by comparison.

NOTE: I decided to output gzip-compressed CSV files. This is not ideal, but it is easy to generate and flexible, i.e. it allows multiple types of geometry in the same column (POLYGON and MULTIPOLYGON). GeoPackage files would be better suited, though: they can contain, for example, coordinate reference system information or a spatial index, but they do not allow multiple types of geometry in the same column.
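
For reference, writing the concatenated frame to a gzip-compressed CSV can be done directly with pandas (the file name is only an example; the geometry column is serialized as plain text):

df_all.to_csv('noise_2017.csv.gz', compression='gzip', index=False)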

EDIT: Meanwhile, the transformation to a common geometry type has been implemented:

import shapely
from shapely.geometry import MultiPolygon

def convert_polygon_to_multipolygon(raw_geometry):
    if isinstance(raw_geometry, shapely.geometry.polygon.Polygon):
        return MultiPolygon([raw_geometry])
    else:
        # we currently only have MULTIPOLYGON and POLYGON, so a plain else is good enough
        return raw_geometry

df.geometry = df.geometry.apply(convert_polygon_to_multipolygon)

So GeoPackages are now supported as well. They are about 2.7 GB in size, though.
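
With the geometries unified, the GeoPackage export can be done via GeoPandas (file and layer names are my own examples, applied to the concatenated frame from above):

df_all.to_file('noise_2017.gpkg', driver='GPKG', layer='noise')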

The code can be found on GitHub.