annotate contrib/gmaggregate/README.md @ 5560:f2204f91d286

Join the log lines of imports to the log exports to recover data from them. Used in SR export to extract information that was in the meta JSON but is now only found in the log.
author Sascha L. Teichmann <sascha.teichmann@intevation.de>
date Wed, 09 Feb 2022 18:34:40 +0100
parents 02c2d0edeb2a
5548
02c2d0edeb2a Added gmaggregate tool as contrib.
Sascha L. Teichmann <sascha.teichmann@intevation.de>
# gmaggregate

*Attention:* This is a copy of [gmaggregate](https://heptapod.host/intevation/gemma/gmaggregate).

A log message transformation tool for gauge measurement (gm) imports
in the gemma server.

We recognized that the logging of the gm imports itself produces a lot
of data because it is very verbose and redundant. As a result,
over 99% of the log messages of all imports in the gemma server
stem from the gm imports: hundreds of millions of log lines.

The logging itself is now done in a more compact and aggregated
way, which also increases the readability of the logs.
To get rid of the repetitive old log entries without losing
information, these logs have to be aggregated in the same way,
too.

Normally, we use SQL or PL/pgSQL scripts for this kind of migration.
We had such a script for this task, but first experiments
showed that its run time would only be acceptable
for small data sets, not for the multi-million-line data sets
of the production system. It also had little to no potential
for significant improvement. Therefore we re-crafted this tool in Go.

## Build

You need a working Go build environment (tested successfully with 1.17).

```shell
hg clone https://heptapod.host/intevation/gemma/gmaggregate
cd gmaggregate
go build
```

Place the resulting `gmaggregate` binary into the `PATH` of your
database server. It must be executable by the `postgres` user.

If you've modified the expressions in [matcher.rl](matcher.rl), you need
an installation of the [Ragel](http://www.colm.net/open-source/ragel/) FSM compiler.
Compile the modified sources with:

```shell
go generate
go build
```

## Usage

`gmaggregate` works in two phases: **filter** and **transfer**.
The **filter** phase creates a new table in the database in which
the aggregated logs of the gm imports are stored. In this
phase the original logs are __not__ modified. The modifications
are done in the **transfer** phase. In this phase the original
log lines associated with the gm imports that led to the entries
in the table created in the first phase are removed from the
database. All other log lines are not touched. After the deletion
of the old lines the content of the new table is copied back into
the log table and the new table is dropped. All operations in the
transfer phase are encapsulated within a transaction so that
no harm is done if the execution fails.

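The order of operations in the transfer phase can be sketched in Go as follows. This is a minimal illustration and not the tool's actual code: the table name `aggregated_logs`, the column `import_id`, the SQL statements, and the helper names `execer`, `transfer`, and `fakeTx` are all assumptions made for the example. Only the log table name `import.import_logs` is taken from this README.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// execer is the part of *sql.Tx that the sketch needs. Using an interface
// lets the example run without a PostgreSQL driver.
type execer interface {
	Exec(query string, args ...interface{}) (sql.Result, error)
}

// transferStatements lists the three steps in the order described above.
// The table aggregated_logs and the column import_id are hypothetical.
var transferStatements = []string{
	// 1. Remove the original log lines of the aggregated gm imports.
	`DELETE FROM import.import_logs
	 WHERE import_id IN (SELECT import_id FROM aggregated_logs)`,
	// 2. Copy the aggregated lines back into the log table.
	`INSERT INTO import.import_logs SELECT * FROM aggregated_logs`,
	// 3. Drop the table created by the filter phase.
	`DROP TABLE aggregated_logs`,
}

// transfer runs all statements on tx and stops at the first error.
// The caller is expected to roll back on error and commit on success.
func transfer(tx execer) error {
	for i, stmt := range transferStatements {
		if _, err := tx.Exec(stmt); err != nil {
			return fmt.Errorf("transfer step %d failed: %w", i+1, err)
		}
	}
	return nil
}

// fakeTx records executed statements and fails at a chosen step (0 = never).
type fakeTx struct {
	failAt   int
	executed []string
}

func (f *fakeTx) Exec(query string, _ ...interface{}) (sql.Result, error) {
	if f.failAt == len(f.executed)+1 {
		return nil, errors.New("simulated failure")
	}
	f.executed = append(f.executed, query)
	return nil, nil
}

func main() {
	ok := &fakeTx{}
	fmt.Println(transfer(ok), len(ok.executed))

	bad := &fakeTx{failAt: 2}
	fmt.Println(transfer(bad), len(bad.executed))
}
```

With a real connection, a `*sql.Tx` from `database/sql` satisfies `execer`. The caller would call `Rollback` when `transfer` returns an error and `Commit` otherwise, which is what leaves the log table unchanged if any step fails.
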
`gmaggregate` runs the two phases **filter** and **transfer**
one right after the other. If you want to run them
separately by hand, you can do this with the `-phases`
flag.

> **CSV export**: For debugging purposes `gmaggregate` supports
> exporting the aggregated log lines as a CSV file. Use the
> `-c` flag to specify the file to write it to. When running
> the CSV export the new table in the database is not created,
> so the transfer phase will fail. Therefore you should use
> the CSV export together with the `-phase=filter` flag.

For more options see `gmaggregate --help`.

> Tip: After running the `gmaggregate` migration you should consider running
> `VACUUM FULL; CLUSTER import.import_logs USING import_logs_import_id;`
> from a `psql` shell on the database to recover the space
> used by the original log lines and physically order the data
> in a way corresponding to the process of logging.

## License

This is Free Software covered by the terms of the
GNU Affero General Public License version 3.0 or later.
See [AGPL-3.0.txt](../../LICENSES/AGPL-3.0.txt) for details.