annotate contrib/gmaggregate/README.md @ 5584:7ed9e32706d0 surveysperbottleneckid
Merged delault
author | Sascha Wilde <wilde@sha-bang.de> |
---|---|
date | Fri, 01 Apr 2022 16:47:53 +0200 |
parents | 02c2d0edeb2a |
children |
5548
02c2d0edeb2a
Added gmaggregate tool as contrib.
Sascha L. Teichmann <sascha.teichmann@intevation.de>
# gmaggregate

*Attention:* This is a copy of [gmaggregate](https://heptapod.host/intevation/gemma/gmaggregate).

A log message transformation tool for gauge measurement (gm) imports
in the gemma server.

We found that the logging of the gm imports produces a lot of data
by being very verbose and redundant. As a result, over 99% of the
log messages of all imports in the gemma server stem from the
gm imports: hundreds of millions of log lines.

The logging itself is now done in a more compact and aggregated
way, which also improves the readability of the logs.
To get rid of the repetitive old log entries without losing
information, these logs have to be aggregated in the same way.

Normally, we use SQL or PL/pgSQL scripts for this kind of migration.
We had such a script for this task, but first experiments
showed that its run time would only be acceptable
for small data sets, not for the multi-million-line data sets
of the production system. It also had little to no potential
for significant improvement. Therefore we rewrote this tool in Go.

## Build

You need a working Go build environment (tested successfully with Go 1.17).

```shell
hg clone https://heptapod.host/intevation/gemma/gmaggregate
cd gmaggregate
go build
```

Place the resulting `gmaggregate` binary into the `PATH` of your
database server. The `postgres` user needs permission to execute it.

If you have modified the expressions in [matcher.rl](matcher.rl), you need
an installation of the [Ragel](http://www.colm.net/open-source/ragel/) FSM compiler.
Regenerate and compile the modified sources with:

```shell
go generate
go build
```

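As a minimal sketch of the install step described in this section, the binary can be copied with `install`, which sets the permission bits in the same step. The function name `install_gmaggregate` and the default target directory `/usr/local/bin` are assumptions for illustration. Use whichever directory is on the `PATH` of the `postgres` user on your system.

```shell
# Hypothetical helper: copy a freshly built gmaggregate binary into a
# directory on the PATH and make it executable for every user, which
# includes the postgres user.
# $1: path to the built binary
# $2: target directory (assumed default: /usr/local/bin)
install_gmaggregate() {
    src="$1"
    dest_dir="${2:-/usr/local/bin}"
    # Mode 0755: the owner may write; everyone may read and execute.
    install -m 0755 "$src" "$dest_dir/gmaggregate"
}
```

Called as `install_gmaggregate ./gmaggregate` from the build directory, this typically needs root privileges for the default target directory.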
## Usage

`gmaggregate` works in two phases: **filter** and **transfer**.
The **filter** phase creates a new table in the database in which
the aggregated versions of the gm import logs are stored. In this
phase the original logs are __not__ modified. The modifications
are done in the **transfer** phase. In this phase the original
log lines associated with the gm imports that led to the entries
in the table created in the first phase are removed from the
database. All other log lines are not touched. After the deletion
of the old lines, the content of the new table is copied back
into the log table and the new table is dropped. All operations
in the transfer phase are encapsulated in a transaction, so
no harm is done if the execution fails.

By default, `gmaggregate` runs the two phases **filter** and **transfer**
one right after the other. If you want to run them
separately by hand, you can do this with the `-phases`
flag.

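The transfer phase described above can be pictured as a single SQL transaction. The following is only a sketch of that idea, not the code `gmaggregate` actually runs. The staging table name `import.import_logs_aggregated`, the `import_id` column, the assumption that both tables share the same column layout, the database name `gemma`, and the helper name `transfer_sketch` are all hypothetical. Only `import.import_logs` is taken from this README.

```shell
# Hypothetical sketch of the transfer phase as one transaction.
# $1: database name (assumed default: gemma)
transfer_sketch() {
    # ON_ERROR_STOP=1 makes psql abort at the first failing statement.
    # The connection then closes before COMMIT, so PostgreSQL rolls the
    # open transaction back and the original log lines stay untouched.
    psql -v ON_ERROR_STOP=1 -d "${1:-gemma}" <<'SQL'
BEGIN;
-- Remove the verbose original lines of every import that has
-- aggregated entries in the staging table (hypothetical name).
DELETE FROM import.import_logs
 WHERE import_id IN (SELECT import_id FROM import.import_logs_aggregated);
-- Copy the aggregated lines back (assumes an identical column layout).
INSERT INTO import.import_logs
SELECT * FROM import.import_logs_aggregated;
-- DDL is transactional in PostgreSQL, so the drop is rolled back too
-- if anything fails before COMMIT.
DROP TABLE import.import_logs_aggregated;
COMMIT;
SQL
}
```

Because the delete, the copy, and the drop either all take effect or none do, a failure at any step leaves the log table as it was before the run.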
> **CSV export**: For debugging purposes `gmaggregate` supports
> exporting the aggregated log lines as a CSV file. Use the
> `-c` flag to specify the file to write it to. When running
> the CSV export, the new table in the database is not created,
> so the transfer phase will fail. Therefore you should use
> the CSV export together with the `-phase=filter` flag.

For more options see `gmaggregate --help`.

> Tip: After running the `gmaggregate` migration you should consider running
> `VACUUM FULL; CLUSTER import.import_logs USING import_logs_import_id;`
> from a `psql` shell on the database to recover the space
> used by the original log lines and physically order the data
> in a way corresponding to the process of logging.

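The maintenance commands from the tip above can also be run non-interactively. This is a sketch under assumptions: the database name `gemma` and the helper name `reclaim_log_space` are hypothetical, and connection options such as user and host are left to your environment. Each `-c` option is sent to the server as its own request, which matters because `VACUUM` cannot run inside a transaction block.

```shell
# Hypothetical helper: reclaim disk space and re-cluster the log table
# after the migration.
# $1: database name (assumed default: gemma)
reclaim_log_space() {
    # VACUUM cannot run inside a transaction block, so each statement
    # gets its own -c option and is sent as a separate request.
    psql -d "${1:-gemma}" \
        -c 'VACUUM FULL;' \
        -c 'CLUSTER import.import_logs USING import_logs_import_id;'
}
```

Both commands take exclusive locks on the tables they process, so it is safer to run them while no imports are active.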
## License

This is Free Software covered by the terms of the
GNU Affero General Public License version 3.0 or later.
See [AGPL-3.0.txt](../../LICENSES/AGPL-3.0.txt) for details.