annotate kallithea/lib/indexers/__init__.py @ 6477:168cc92c1b53

search: prevent pathname related conditions from removing "stop words"

Before this revision, the pathname related conditions below unintentionally dropped "stop words":

- path:, extension: (for "File contents" or "File names")
- added:, removed:, changed: (for "Commit messages")

Therefore, pathname related conditions containing "this", "a", "you", and so on were completely ignored, even when those words were valid pathname components.

To prevent pathname related conditions from removing "stop words", this revision explicitly specifies "analyzer" for the pathname related fields of SCHEMA and CHGSETS_SCHEMA.

The difference between PATHANALYZER and the default analyzer of TEXT is whether "stop words" are preserved. Tokenization is still applied to pathnames.

This revision requires a full rebuild of the index tables, because the indexing schemas have changed.
author FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
date Mon, 23 Jan 2017 02:17:38 +0900
parents 8b7c0ef62427
children c0b2410d63a5
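
The difference described in the commit message above can be illustrated without Whoosh itself. The following is a minimal standard-library sketch, not the code used by Kallithea: the token pattern is assumed to mirror the default pattern of Whoosh's RegexTokenizer, the stop-word set is a small illustrative subset, and the function names path_like_tokens and default_like_tokens are hypothetical.

```python
import re

# Assumption: this pattern mirrors the default expression of Whoosh's
# RegexTokenizer; it is reproduced here only for illustration.
TOKEN_RE = re.compile(r"\w+(\.?\w+)*")

# Illustrative subset only; this is NOT Whoosh's actual stop-word list.
STOP_WORDS = frozenset(["a", "this", "you", "the", "and"])


def path_like_tokens(text):
    """Tokenize and lowercase, keeping every token (PATHANALYZER-like)."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def default_like_tokens(text):
    """Tokenize, lowercase, then drop stop words (default-TEXT-like)."""
    return [t for t in path_like_tokens(text) if t not in STOP_WORDS]


path = "docs/this/a/README.txt"
print(path_like_tokens(path))     # ['docs', 'this', 'a', 'readme.txt']
print(default_like_tokens(path))  # ['docs', 'readme.txt']
```

With the stop-word step in place, the components "this" and "a" never reach the index, so a condition such as path:this has nothing to match. Keeping every token, as the PATHANALYZER-like chain does, avoids that.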
rev   line source
903
04c9bb9ca6d6 code docs, updates
Marcin Kuzminski <marcin@python-works.com>
parents: 894
diff changeset
1 # -*- coding: utf-8 -*-
1206
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1203
diff changeset
2 # This program is free software: you can redistribute it and/or modify
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1203
diff changeset
3 # it under the terms of the GNU General Public License as published by
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1203
diff changeset
4 # the Free Software Foundation, either version 3 of the License, or
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1203
diff changeset
5 # (at your option) any later version.
1203
6832ef664673 source code cleanup: remove trailing white space, normalize file endings
Marcin Kuzminski <marcin@python-works.com>
parents: 1198
diff changeset
6 #
903
04c9bb9ca6d6 code docs, updates
Marcin Kuzminski <marcin@python-works.com>
parents: 894
diff changeset
7 # This program is distributed in the hope that it will be useful,
04c9bb9ca6d6 code docs, updates
Marcin Kuzminski <marcin@python-works.com>
parents: 894
diff changeset
8 # but WITHOUT ANY WARRANTY; without even the implied warranty of
04c9bb9ca6d6 code docs, updates
Marcin Kuzminski <marcin@python-works.com>
parents: 894
diff changeset
9 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
04c9bb9ca6d6 code docs, updates
Marcin Kuzminski <marcin@python-works.com>
parents: 894
diff changeset
10 # GNU General Public License for more details.
1203
6832ef664673 source code cleanup: remove trailing white space, normalize file endings
Marcin Kuzminski <marcin@python-works.com>
parents: 1198
diff changeset
11 #
903
04c9bb9ca6d6 code docs, updates
Marcin Kuzminski <marcin@python-works.com>
parents: 894
diff changeset
12 # You should have received a copy of the GNU General Public License
1206
a671db5bdd58 fixed license issue #149
Marcin Kuzminski <marcin@python-works.com>
parents: 1203
diff changeset
13 # along with this program. If not, see <http://www.gnu.org/licenses/>.
4116
ffd45b185016 Imported some of the GPLv3'd changes from RhodeCode v2.2.5.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 3960
diff changeset
14 """
5376
0ad053c172fa cleanup: make module self-naming consistent
Mads Kiilerich <madski@unity3d.com>
parents: 5375
diff changeset
15 kallithea.lib.indexers
0ad053c172fa cleanup: make module self-naming consistent
Mads Kiilerich <madski@unity3d.com>
parents: 5375
diff changeset
16 ~~~~~~~~~~~~~~~~~~~~~~
4116
ffd45b185016 Imported some of the GPLv3'd changes from RhodeCode v2.2.5.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 3960
diff changeset
17
4212
24c0d584ba86 General renaming to Kallithea
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 4211
diff changeset
18 Whoosh indexing module for Kallithea
4116
ffd45b185016 Imported some of the GPLv3'd changes from RhodeCode v2.2.5.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 3960
diff changeset
19
4211
1948ede028ef RhodeCode GmbH is not the sole author of this work
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 4208
diff changeset
20 This file was forked by the Kallithea project in July 2014.
1948ede028ef RhodeCode GmbH is not the sole author of this work
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 4208
diff changeset
21 Original author and date, and relevant copyright and licensing information is below:
4116
ffd45b185016 Imported some of the GPLv3'd changes from RhodeCode v2.2.5.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 3960
diff changeset
22 :created_on: Aug 17, 2010
ffd45b185016 Imported some of the GPLv3'd changes from RhodeCode v2.2.5.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 3960
diff changeset
23 :author: marcink
4211
1948ede028ef RhodeCode GmbH is not the sole author of this work
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 4208
diff changeset
24 :copyright: (c) 2013 RhodeCode GmbH, and others.
4208
ad38f9f93b3b Correct licensing information in individual files.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 4187
diff changeset
25 :license: GPLv3, see LICENSE.md for more details.
4116
ffd45b185016 Imported some of the GPLv3'd changes from RhodeCode v2.2.5.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 3960
diff changeset
26 """
ffd45b185016 Imported some of the GPLv3'd changes from RhodeCode v2.2.5.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 3960
diff changeset
27
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
28 import os
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
29 import sys
2102
04d26165c3d9 Whoosh logging is now controlled by the .ini files logging setup
Marcin Kuzminski <marcin@python-works.com>
parents: 1995
diff changeset
30 import logging
5998
037efd94e955 cleanup: get rid of dn as shortcut for os.path.dirname
domruf <dominikruf@gmail.com>
parents: 5997
diff changeset
31 from os.path import dirname
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
32
4175
e9f6b533a8f6 Remove wrong/unnecessary/unfixable comment(s)
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 4116
diff changeset
33 # Add location of top level folder to sys.path
5998
037efd94e955 cleanup: get rid of dn as shortcut for os.path.dirname
domruf <dominikruf@gmail.com>
parents: 5997
diff changeset
34 sys.path.append(dirname(dirname(dirname(os.path.realpath(__file__)))))
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
35
6475
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
36 from whoosh.analysis import RegexTokenizer, LowercaseFilter, IDTokenizer
3062
a08624dd675e Implemented filtering of admin journal based on Whoosh Query language
Marcin Kuzminski <marcin@python-works.com>
parents: 2718
diff changeset
37 from whoosh.fields import TEXT, ID, STORED, NUMERIC, BOOLEAN, Schema, FieldType, DATETIME
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
38 from whoosh.formats import Characters
3915
a42bfe8a9335 moved make-index command to paster_commands module
Marcin Kuzminski <marcin@python-works.com>
parents: 3339
diff changeset
39 from whoosh.highlight import highlight as whoosh_highlight, HtmlFormatter, ContextFragmenter
4186
7e5f8c12a3fc First step in two-part process to rename directories to kallithea.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 4175
diff changeset
40 from kallithea.lib.utils2 import LazyProperty
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
41
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
42 log = logging.getLogger(__name__)
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
43
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
44 # CUSTOM ANALYZER wordsplit + lowercase filter
436
28f19fa562df updated config files,
Marcin Kuzminski <marcin@python-works.com>
parents: 406
diff changeset
45 ANALYZER = RegexTokenizer(expression=r"\w+") | LowercaseFilter()
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
46
6475
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
47 # CUSTOM ANALYZER raw-string + lowercase filter
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
48 #
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
49 # This is useful to:
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
50 # - avoid tokenization
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
51 # - avoid removing "stop words" from text
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
52 # - search case-insensitively
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
53 #
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
54 ICASEIDANALYZER = IDTokenizer() | LowercaseFilter()
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
55
6476
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
56 # CUSTOM ANALYZER raw-string
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
57 #
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
58 # This is useful to:
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
59 # - avoid tokenization
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
60 # - avoid removing "stop words" from text
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
61 #
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
62 IDANALYZER = IDTokenizer()
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
63
6477
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
64 # CUSTOM ANALYZER wordsplit + lowercase filter, for pathname-like text
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
65 #
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
66 # This is useful to:
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
67 # - avoid removing "stop words" from text
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
68 # - search case-insensitively
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
69 #
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
70 PATHANALYZER = RegexTokenizer() | LowercaseFilter()
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
71
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
72 # INDEX SCHEMA DEFINITION
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
73 SCHEMA = Schema(
2388
a0ef98f2520b #453 added ID field in whoosh SCHEMA that solves the issue of reindexing modified files
Marcin Kuzminski <marcin@python-works.com>
parents: 2373
diff changeset
74 fileid=ID(unique=True),
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
75 owner=TEXT(),
6476
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
76 # this field preserves case of repository name for exact matching
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
77 repository_rawname=TEXT(analyzer=IDANALYZER),
6475
caef0be39948 search: make "repository:" condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6474
diff changeset
78 repository=TEXT(stored=True, analyzer=ICASEIDANALYZER),
6477
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
79 path=TEXT(stored=True, analyzer=PATHANALYZER),
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
80 content=FieldType(format=Characters(), analyzer=ANALYZER,
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
81 scorable=True, stored=True),
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
82 modtime=STORED(),
6477
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
83 extension=TEXT(stored=True, analyzer=PATHANALYZER)
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
84 )
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
85
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
86 IDX_NAME = 'HG_INDEX'
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
87 FORMATTER = HtmlFormatter('span', between='\n<span class="break">...</span>\n')
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
88 FRAGMENTER = ContextFragmenter(200)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
89
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
90 CHGSETS_SCHEMA = Schema(
2642
88b0e82bcba4 rename changeset index key to match raw_id rather than path for greater consistency
Indra Talip <indra.talip@gmail.com>
parents: 2640
diff changeset
91 raw_id=ID(unique=True, stored=True),
2693
66c778b8cb54 Extended commit search schema with date of commit
Marcin Kuzminski <marcin@python-works.com>
parents: 2673
diff changeset
92 date=NUMERIC(stored=True),
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
93 last=BOOLEAN(),
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
94 owner=TEXT(),
6476
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
95 # this field preserves case of repository name for exact matching
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
96 # and unique-ness in index table
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
97 repository_rawname=ID(unique=True),
8b7c0ef62427 search: make "repository:" condition work case-insensitively as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6475
diff changeset
98 repository=ID(stored=True, analyzer=ICASEIDANALYZER),
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
99 author=TEXT(stored=True),
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
100 message=FieldType(format=Characters(), analyzer=ANALYZER,
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
101 scorable=True, stored=True),
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
102 parents=TEXT(),
6477
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
103 added=TEXT(analyzer=PATHANALYZER),
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
104 removed=TEXT(analyzer=PATHANALYZER),
168cc92c1b53 search: prevent pathname related conditions from removing "stop words"
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6476
diff changeset
105 changed=TEXT(analyzer=PATHANALYZER),
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
106 )
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
107
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
108 CHGSET_IDX_NAME = 'CHGSET_INDEX'
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
109
3062
a08624dd675e Implemented filtering of admin journal based on Whoosh Query language
Marcin Kuzminski <marcin@python-works.com>
parents: 2718
diff changeset
110 # used only to generate queries in journal
a08624dd675e Implemented filtering of admin journal based on Whoosh Query language
Marcin Kuzminski <marcin@python-works.com>
parents: 2718
diff changeset
111 JOURNAL_SCHEMA = Schema(
6474
2ff913970025 journal: make "username:" filtering condition work as expected
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 6473
diff changeset
112 username=ID(),
3062
a08624dd675e Implemented filtering of admin journal based on Whoosh Query language
Marcin Kuzminski <marcin@python-works.com>
parents: 2718
diff changeset
113 date=DATETIME(),
a08624dd675e Implemented filtering of admin journal based on Whoosh Query language
Marcin Kuzminski <marcin@python-works.com>
parents: 2718
diff changeset
114 action=TEXT(),
6473
73e3599971da journal: make "repository:" filtering condition work as expected (Issue #261)
FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
parents: 5998
diff changeset
115 repository=ID(),
3062
a08624dd675e Implemented filtering of admin journal based on Whoosh Query language
Marcin Kuzminski <marcin@python-works.com>
parents: 2718
diff changeset
116 ip=TEXT(),
a08624dd675e Implemented filtering of admin journal based on Whoosh Query language
Marcin Kuzminski <marcin@python-works.com>
parents: 2718
diff changeset
117 )
a08624dd675e Implemented filtering of admin journal based on Whoosh Query language
Marcin Kuzminski <marcin@python-works.com>
parents: 2718
diff changeset
118
2718
82fb2a161ddf fixes issue #524
Marcin Kuzminski <marcin@python-works.com>
parents: 2693
diff changeset
119
2319
4c239e0dcbb7 fixes issue #454 Search results under Windows include preceeding backslash
Marcin Kuzminski <marcin@python-works.com>
parents: 2109
diff changeset
120 class WhooshResultWrapper(object):
4c239e0dcbb7 fixes issue #454 Search results under Windows include preceeding backslash
Marcin Kuzminski <marcin@python-works.com>
parents: 2109
diff changeset
121 def __init__(self, search_type, searcher, matcher, highlight_items,
4c239e0dcbb7 fixes issue #454 Search results under Windows include preceeding backslash
Marcin Kuzminski <marcin@python-works.com>
parents: 2109
diff changeset
122 repo_location):
556
65b2f150beb7 Added searching for file names within the repository in rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 547
diff changeset
123 self.search_type = search_type
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
124 self.searcher = searcher
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
125 self.matcher = matcher
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
126 self.highlight_items = highlight_items
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
127 self.fragment_size = 200
2319
4c239e0dcbb7 fixes issue #454 Search results under Windows include preceeding backslash
Marcin Kuzminski <marcin@python-works.com>
parents: 2109
diff changeset
128 self.repo_location = repo_location
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
129
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
130 @LazyProperty
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
131 def doc_ids(self):
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
132 docs_id = []
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
133 while self.matcher.is_active():
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
134 docnum = self.matcher.id()
479
149940ba96d9 fixed search chunking bug and optimized chunk size
Marcin Kuzminski <marcin@python-works.com>
parents: 478
diff changeset
135 chunks = [offsets for offsets in self.get_chunks()]
149940ba96d9 fixed search chunking bug and optimized chunk size
Marcin Kuzminski <marcin@python-works.com>
parents: 478
diff changeset
136 docs_id.append([docnum, chunks])
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
137 self.matcher.next()
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
138 return docs_id
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
139
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
140 def __str__(self):
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
141 return '<%s at %s>' % (self.__class__.__name__, len(self.doc_ids))
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
142
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
143 def __repr__(self):
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
144 return self.__str__()
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
145
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
146 def __len__(self):
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
147 return len(self.doc_ids)
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
148
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
149 def __iter__(self):
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
150 """
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
151 Allows iteration over results and lazily generates content
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
152
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
153 *Requires* implementation of ``__getitem__`` method.
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
154 """
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
155 for docid in self.doc_ids:
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
156 yield self.get_full_content(docid)
406
b153a51b1d3b Implemented search using whoosh. Still as experimental option.
Marcin Kuzminski <marcin@python-works.com>
parents:
diff changeset
157
1198
02a7f263a849 fixed issue with latest webhelpers pagination module
Marcin Kuzminski <marcin@python-works.com>
parents: 1183
diff changeset
158 def __getitem__(self, key):
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
159 """
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
160 Slicing of WhooshResultWrapper
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
161 """
1198
02a7f263a849 fixed issue with latest webhelpers pagination module
Marcin Kuzminski <marcin@python-works.com>
parents: 1183
diff changeset
162 i, j = key.start, key.stop
02a7f263a849 fixed issue with latest webhelpers pagination module
Marcin Kuzminski <marcin@python-works.com>
parents: 1183
diff changeset
163
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
164 slices = []
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
165 for docid in self.doc_ids[i:j]:
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
166 slices.append(self.get_full_content(docid))
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
167 return slices
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
168
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
169 def get_full_content(self, docid):
479
149940ba96d9 fixed search chunking bug and optimized chunk size
Marcin Kuzminski <marcin@python-works.com>
parents: 478
diff changeset
170 res = self.searcher.stored_fields(docid[0])
5375
0210d0b769d4 cleanup: pass log strings unformatted - avoid unnecessary % formatting when not logging
Mads Kiilerich <madski@unity3d.com>
parents: 4422
diff changeset
171 log.debug('result: %s', res)
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
172 if self.search_type == 'content':
5997
b313d735d9c8 cleanup: get rid of jn as shortcut for os.path.join
domruf <dominikruf@gmail.com>
parents: 5376
diff changeset
173 full_repo_path = os.path.join(self.repo_location, res['repository'])
2642
88b0e82bcba4 rename changeset index key to match raw_id rather than path for greater consistency
Indra Talip <indra.talip@gmail.com>
parents: 2640
diff changeset
174 f_path = res['path'].split(full_repo_path)[-1]
88b0e82bcba4 rename changeset index key to match raw_id rather than path for greater consistency
Indra Talip <indra.talip@gmail.com>
parents: 2640
diff changeset
175 f_path = f_path.lstrip(os.sep)
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
176 content_short = self.get_short_content(res, docid[1])
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
177 res.update({'content_short': content_short,
2642
88b0e82bcba4 rename changeset index key to match raw_id rather than path for greater consistency
Indra Talip <indra.talip@gmail.com>
parents: 2640
diff changeset
178 'content_short_hl': self.highlight(content_short),
88b0e82bcba4 rename changeset index key to match raw_id rather than path for greater consistency
Indra Talip <indra.talip@gmail.com>
parents: 2640
diff changeset
179 'f_path': f_path
4116
ffd45b185016 Imported some of the GPLv3'd changes from RhodeCode v2.2.5.
Bradley M. Kuhn <bkuhn@sfconservancy.org>
parents: 3960
diff changeset
180 })
2718
82fb2a161ddf fixes issue #524
Marcin Kuzminski <marcin@python-works.com>
parents: 2693
diff changeset
181 elif self.search_type == 'path':
5997
b313d735d9c8 cleanup: get rid of jn as shortcut for os.path.join
domruf <dominikruf@gmail.com>
parents: 5376
diff changeset
182 full_repo_path = os.path.join(self.repo_location, res['repository'])
2718
82fb2a161ddf fixes issue #524
Marcin Kuzminski <marcin@python-works.com>
parents: 2693
diff changeset
183 f_path = res['path'].split(full_repo_path)[-1]
82fb2a161ddf fixes issue #524
Marcin Kuzminski <marcin@python-works.com>
parents: 2693
diff changeset
184 f_path = f_path.lstrip(os.sep)
82fb2a161ddf fixes issue #524
Marcin Kuzminski <marcin@python-works.com>
parents: 2693
diff changeset
185 res.update({'f_path': f_path})
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
186 elif self.search_type == 'message':
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
187 res.update({'message_hl': self.highlight(res['message'])})
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
188
5375
0210d0b769d4 cleanup: pass log strings unformatted - avoid unnecessary % formatting when not logging
Mads Kiilerich <madski@unity3d.com>
parents: 4422
diff changeset
189 log.debug('result: %s', res)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
190
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
191 return res
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
192
479
149940ba96d9 fixed search chunking bug and optimized chunk size
Marcin Kuzminski <marcin@python-works.com>
parents: 478
diff changeset
193 def get_short_content(self, res, chunks):
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
194
479
149940ba96d9 fixed search chunking bug and optimized chunk size
Marcin Kuzminski <marcin@python-works.com>
parents: 478
diff changeset
195 return ''.join([res['content'][chunk[0]:chunk[1]] for chunk in chunks])
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
196
479
149940ba96d9 fixed search chunking bug and optimized chunk size
Marcin Kuzminski <marcin@python-works.com>
parents: 478
diff changeset
197 def get_chunks(self):
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
198 """
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
199 Split the content into chunks around each match,
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
200 without overlapping chunks, so that closely spaced
556
65b2f150beb7 Added searching for file names within the repository in rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 547
diff changeset
201 occurrences are not highlighted twice.
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
202 """
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
203 memory = [(0, 0)]
2673
d5e42c00f3c1 white space cleanup
Marcin Kuzminski <marcin@python-works.com>
parents: 2643
diff changeset
204 if self.matcher.supports('positions'):
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
205 for span in self.matcher.spans():
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
206 start = span.startchar or 0
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
207 end = span.endchar or 0
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
208 start_offseted = max(0, start - self.fragment_size)
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
209 end_offseted = end + self.fragment_size
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
210
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
211 if start_offseted < memory[-1][1]:
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
212 start_offseted = memory[-1][1]
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
213 memory.append((start_offseted, end_offseted,))
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
214 yield (start_offseted, end_offseted,)
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
215
478
7010af6efde5 Reimplemented searching for speed on large files and added paging for search results
Marcin Kuzminski <marcin@python-works.com>
parents: 436
diff changeset
216 def highlight(self, content, top=5):
2640
5f21a9dcb09d create an index for commit messages and the ability to search them and see results
Indra Talip <indra.talip@gmail.com>
parents: 2389
diff changeset
217 if self.search_type not in ['content', 'message']:
556
65b2f150beb7 Added searching for file names within the repository in rhodecode
Marcin Kuzminski <marcin@python-works.com>
parents: 547
diff changeset
218 return ''
3915
a42bfe8a9335 moved make-index command to paster_commands module
Marcin Kuzminski <marcin@python-works.com>
parents: 3339
diff changeset
219 hl = whoosh_highlight(
2389
324b838250c9 UI fixes for searching
Marcin Kuzminski <marcin@python-works.com>
parents: 2388
diff changeset
220 text=content,
1995
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
221 terms=self.highlight_items,
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
222 analyzer=ANALYZER,
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
223 fragmenter=FRAGMENTER,
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
224 formatter=FORMATTER,
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
225 top=top
b6c902d88472 bumbed whoosh to 2.3.X series
Marcin Kuzminski <marcin@python-works.com>
parents: 1824
diff changeset
226 )
631
05528ad948c4 Hacking for git support,and new faster repo scan
Marcin Kuzminski <marcin@python-works.com>
parents: 629
diff changeset
227 return hl
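
The chunking done by get_chunks and get_short_content (lines 193-214 above) can be tried outside Whoosh with a minimal sketch like the one below. It is not part of the annotated file. The names non_overlapping_chunks and short_content are hypothetical, and the sketch assumes that match spans arrive as (startchar, endchar) pairs sorted by position, which is what the loop over self.matcher.spans() appears to rely on.

```python
def non_overlapping_chunks(spans, fragment_size):
    """Yield (start, end) windows around each span, never overlapping.

    `spans` is assumed to be an iterable of (startchar, endchar) pairs
    sorted by start position (hypothetical stand-in for matcher spans).
    """
    last_end = 0  # end of the previously yielded window
    for start, end in spans:
        window_start = max(0, start - fragment_size)
        window_end = end + fragment_size
        # Clamp the start so this window begins where the previous ended.
        if window_start < last_end:
            window_start = last_end
        last_end = window_end
        yield (window_start, window_end)


def short_content(content, chunks):
    """Join the selected windows of `content`, as get_short_content does."""
    return ''.join(content[start:end] for start, end in chunks)


if __name__ == '__main__':
    text = 'abcdefghijklmnopqrstuvwxyz'
    # Two close matches (5-7 and 9-11) and one distant match (20-22).
    matches = [(5, 7), (9, 11), (20, 22)]
    chunks = list(non_overlapping_chunks(matches, fragment_size=3))
    print(chunks)
    print(short_content(text, chunks))
```

With these inputs the second window would normally start at offset 6, inside the first window (2 to 10), so its start is moved to 10. Because of this clamp, the joined excerpt contains each character at most once, and the highlighter does not see the same nearby occurrence twice.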