annotate contrib/gmaggregate/README.md @ 5560:f2204f91d286
Join the log lines of imports to the log exports to recover data from them.
Used in SR export to extract information that was in the meta JSON
but is now only found in the log.
author | Sascha L. Teichmann <sascha.teichmann@intevation.de> |
---|---|
date | Wed, 09 Feb 2022 18:34:40 +0100 |
parents | 02c2d0edeb2a |
children |
# gmaggregate

*Attention:* This is a copy of [gmaggregate](https://heptapod.host/intevation/gemma/gmaggregate).

A log message transformation tool for gauge measurement (gm) imports
in the gemma server.


We noticed that the logging of the gm imports is very verbose and
redundant and therefore produces a lot of data itself. As a result,
over 99% of the log messages of all imports in the gemma server
stem from the gm imports: hundreds of millions of log lines.

The logging itself is now done in a more compact and aggregated
way, which also increases the readability of the logs.
To get rid of the repetitive old log entries without losing
information, these logs have to be aggregated in the same way.


Normally, we use SQL or PL/pgSQL scripts for this kind of migration.
We had such a script for this task, but first experiments
showed that its run time would only be acceptable
for small data sets, not for the multi-million-entry data sets
of the production system. It also had little to no potential
for significant improvement. Therefore we re-implemented the tool in Go.

## Build

You need a working Go build environment (tested successfully with 1.17).

```(shell)
hg clone https://heptapod.host/intevation/gemma/gmaggregate
cd gmaggregate
go build
```

Place the resulting `gmaggregate` binary into the `PATH` of your
database server. The `postgres` user needs permission to execute it.

If you've modified the expressions in [matcher.rl](matcher.rl), you need
an installation of the [Ragel](http://www.colm.net/open-source/ragel/) FSM compiler.
Compile the modified sources with:

```(shell)
go generate
go build
```

## Usage

`gmaggregate` works in two phases: **filter** and **transfer**.
The **filter** phase creates a new table in the database in which
the aggregated versions of the gm import logs are stored. In this
phase the original logs are __not__ modified. The modifications
are done in the **transfer** phase. In this phase the original
log lines associated with the gm imports that led to the entries
in the table created in the first phase are removed from the
database. All other log lines are not touched. After the deletion
of the old lines the content of the new table is copied back into
the log table and the new table is dropped. All operations in the
transfer phase are encapsulated within a transaction so that
no harm is done if the execution fails.

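
The transfer phase described above can be sketched roughly as follows. This is only an illustration, not the actual `gmaggregate` code: the table name `import.gm_logs_aggregated`, the `import_id` column, the identical column layout of both tables, and the helper names `transferSteps`, `transfer` and `recorder` are all assumptions made for this example. The point is the ordering (delete, copy back, drop) and that everything happens inside a single transaction which is rolled back on any error.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// aggTable is a HYPOTHETICAL name for the table created by the filter phase.
const aggTable = "import.gm_logs_aggregated"

// execer is the subset of *sql.Tx used below. Having it as an interface
// lets the steps be exercised without a live database.
type execer interface {
	Exec(query string, args ...interface{}) (sql.Result, error)
}

// transferSteps issues the three statements of the transfer phase in order.
// The SQL assumes an import_id column and identical column layouts.
func transferSteps(tx execer) error {
	steps := []string{
		"DELETE FROM import.import_logs WHERE import_id IN (SELECT import_id FROM " + aggTable + ")",
		"INSERT INTO import.import_logs SELECT * FROM " + aggTable,
		"DROP TABLE " + aggTable,
	}
	for _, stmt := range steps {
		if _, err := tx.Exec(stmt); err != nil {
			return fmt.Errorf("transfer failed at %q: %w", stmt, err)
		}
	}
	return nil
}

// transfer runs all steps in one transaction, so a failure leaves the
// original log lines untouched.
func transfer(db *sql.DB) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	// After a successful Commit this Rollback does nothing harmful;
	// it only returns sql.ErrTxDone, which is ignored here.
	defer tx.Rollback()
	if err := transferSteps(tx); err != nil {
		return err
	}
	return tx.Commit()
}

// recorder is a fake execer that records statements instead of running them.
// It fails at statement index failAt (use -1 to never fail).
type recorder struct {
	stmts  []string
	failAt int
}

func (r *recorder) Exec(query string, _ ...interface{}) (sql.Result, error) {
	if len(r.stmts) == r.failAt {
		return nil, errors.New("simulated failure")
	}
	r.stmts = append(r.stmts, query)
	return nil, nil
}

func main() {
	rec := &recorder{failAt: -1}
	if err := transferSteps(rec); err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, stmt := range rec.stmts {
		fmt.Printf("%d: %s\n", i+1, stmt)
	}
}
```

With a real `*sql.DB`, calling `transfer(db)` would run the same statements atomically. The `recorder` only exists so that the ordering can be inspected without a PostgreSQL instance.
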

`gmaggregate` runs the two phases **filter** and **transfer**
one right after the other. If you want to run them
separately by hand, you can do this with the `-phases`
flag.


> **CSV export**: For debugging purposes `gmaggregate` supports
> exporting the aggregated log lines as a CSV file. Use the
> `-c` flag to specify the file to write it to. When running
> the CSV export the new table in the database is not created,
> so the transfer phase will fail. Therefore you should use
> the CSV export together with the `-phase=filter` flag.


For more options see `gmaggregate --help`.

> Tip: After running the `gmaggregate` migration you should consider running
> `VACUUM FULL; CLUSTER import.import_logs USING import_logs_import_id;`
> from a `psql` shell on the database to recover the space
> used by the original log lines and physically order the data
> in a way corresponding to the process of logging.

## License

This is Free Software covered by the terms of the
GNU Affero General Public License version 3.0 or later.
See [AGPL-3.0.txt](../../LICENSES/AGPL-3.0.txt) for details.