author Sascha Wilde <wilde@sha-bang.de>
date Fri, 01 Apr 2022 16:47:53 +0200

# gmaggregate

*Attention:* This is a copy of [gmaggregate](https://heptapod.host/intevation/gemma/gmaggregate).

A log message transformation tool for gauge measurement (gm) imports
in the gemma server.

We noticed that the logging of the gm imports was itself producing a lot
of data by being very verbose and redundant. This has led
to the fact that over 99% of the log messages of all imports
in the gemma server stem from the gm imports ...
hundreds of millions of log lines.

The logging itself is now done in a more compact and aggregated
way, which also increases the readability of the logs.
To get rid of the repetitive old log entries without losing
information, these old logs have to be aggregated in the same
way.
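
The basic idea behind the aggregation can be illustrated with a small
sketch (the message format here is hypothetical; the real tool uses
Ragel-generated matchers over the actual gm import messages):
runs of identical log lines are collapsed into a single line with a
repeat count.

```go
package main

import "fmt"

// aggregate collapses runs of identical log messages into one entry
// annotated with a repeat count -- a simplified version of the idea
// behind the compacted gm import logs.
func aggregate(lines []string) []string {
	var out []string
	for i := 0; i < len(lines); {
		// Find the end of the run of lines identical to lines[i].
		j := i
		for j < len(lines) && lines[j] == lines[i] {
			j++
		}
		if n := j - i; n > 1 {
			out = append(out, fmt.Sprintf("%s (repeated %d times)", lines[i], n))
		} else {
			out = append(out, lines[i])
		}
		i = j
	}
	return out
}

func main() {
	logs := []string{
		"measurement stored",
		"measurement stored",
		"measurement stored",
		"import finished",
	}
	for _, l := range aggregate(logs) {
		fmt.Println(l)
	}
}
```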

Normally, we use SQL or PL/pgSQL scripts for this kind of migration.
We had such a version for this task, but first experiments
showed that its run time would only be acceptable
for small data sets, not for the multi-million-row data sets
of the production system, and that it had little to no potential
for significant improvement. Therefore we re-crafted this tool in Go.

## Build

You need a working Go build environment (tested successfully with 1.17).

```shell
hg clone https://heptapod.host/intevation/gemma/gmaggregate
cd gmaggregate
go build
```

Place the resulting `gmaggregate` binary into the `PATH` of your
database server. It needs execution rights for the `postgres` user.

If you've modified the expressions in [matcher.rl](matcher.rl) you need
an installation of the [Ragel](http://www.colm.net/open-source/ragel/) FSM compiler.
Compile the modified sources with:

```shell
go generate
go build
```

## Usage

`gmaggregate` works in two phases: **filter** and **transfer**.  
The **filter** phase creates a new table in the database in which
the aggregated gm import logs are stored. In this
phase the original logs are __not__ modified. The modifications
are done in the **transfer** phase: the original
log lines associated with the gm imports that led to the
entries in the table created in the first phase are removed
from the database. All other log lines are not
touched. After the deletion of the old lines the content
of the new table is copied back into the log table and
the new table is dropped. All operations in the transfer
phase are encapsulated within a transaction, so that
no harm is done if the execution fails.
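
The transfer phase described above boils down to a single SQL
transaction. A minimal Go sketch, assuming a hypothetical scratch-table
name and join predicate (only the log table `import.import_logs` is
taken from this README; the real statements differ):

```go
package main

import "fmt"

// transferStatements sketches the SQL executed in the transfer phase
// as one transaction: delete the original gm import log lines, copy
// the aggregated lines back, and drop the scratch table.  The scratch
// table name and the import_id predicate are assumptions for
// illustration only.
func transferStatements(scratch string) []string {
	return []string{
		"BEGIN",
		// Remove only the log lines of imports that were aggregated
		// into the scratch table; all other log lines stay untouched.
		fmt.Sprintf("DELETE FROM import.import_logs WHERE import_id IN (SELECT DISTINCT import_id FROM %s)", scratch),
		// Copy the aggregated lines back into the log table.
		fmt.Sprintf("INSERT INTO import.import_logs SELECT * FROM %s", scratch),
		fmt.Sprintf("DROP TABLE %s", scratch),
		"COMMIT",
	}
}

func main() {
	for _, stmt := range transferStatements("aggregated_logs") {
		fmt.Println(stmt + ";")
	}
}
```

Because everything between `BEGIN` and `COMMIT` runs in one
transaction, a failure at any step rolls the database back to its
original state.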

`gmaggregate` runs the two phases **filter** and **transfer**
one right after the other. If you want to run them
separately by hand, you can do this with the `-phase`
flag.

> **CSV export**: For debugging purposes `gmaggregate` supports
> exporting the aggregated log lines as a CSV file. Use the
> `-c` flag to specify the file to write to. When running
> the CSV export the new table in the database is not created,
> so the transfer phase would fail. Therefore you should use
> the CSV export together with the `-phase=filter` flag.

For more options see `gmaggregate --help`.

> Tip: After running the `gmaggregate` migration you should consider running  
> `VACUUM FULL; CLUSTER import.import_logs USING import_logs_import_id;`  
> from a `psql` shell on the database to recover the space
> used by the original log lines and to physically order the data
> in a way corresponding to the process of logging.

## License

This is Free Software covered by the terms of the
GNU Affero General Public License version 3.0 or later.
See [AGPL-3.0.txt](../../LICENSES/AGPL-3.0.txt) for details.