==============
Message tagger
==============

Mailman has a topics system which works like this: a mailing list
administrator sets up one or more topics, each of which is essentially a named
regular expression.  The topic name can be any arbitrary string, and the name
serves double duty as the *topic tag*.  Each message that flows through the
mailing list has its ``Subject:`` and ``Keywords:`` headers compared against
these regular expressions.  The message then gets tagged with the name of
every topic that matches.

    >>> mlist = create_list('_xtest@example.com')
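
The matching step described above can be sketched in plain Python.  This is
only an illustration under stated assumptions, not the handler's actual code:
``match_topics`` is a hypothetical helper, it uses ``re.search`` with default
flags, and it ignores the fourth element of each topic tuple because its
meaning is not described here.

```python
import re
from email import message_from_string


def match_topics(topics, msg):
    """Return the names of all topics whose pattern matches a header value."""
    # Gather every Subject: and Keywords: value present on the message.
    values = msg.get_all('subject', []) + msg.get_all('keywords', [])
    hits = []
    for name, pattern, _description, _flag in topics:
        # The topic name doubles as the tag recorded for a hit.
        if any(re.search(pattern, value) for value in values):
            hits.append(name)
    return hits


topics = [('bar fight', '.*bar.*', 'catch any bars', False)]
msg = message_from_string("Subject: foobar\nKeywords: barbaz\n\n")
hits = match_topics(topics, msg)
print(hits)
```

A message whose headers contain no ``bar`` would produce an empty list.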

Topics must be enabled for Mailman to do any topic matching, even if topics
are defined.
::

    >>> mlist.topics = [('bar fight', '.*bar.*', 'catch any bars', False)]
    >>> mlist.topics_enabled = False
    >>> mlist.topics_bodylines_limit = 0

    >>> msg = message_from_string("""\
    ... Subject: foobar
    ... Keywords: barbaz
    ...
    ... """)
    >>> msgdata = {}

    >>> from mailman.handlers.tagger import process
    >>> process(mlist, msg, msgdata)
    >>> print(msg.as_string())
    Subject: foobar
    Keywords: barbaz
    <BLANKLINE>
    <BLANKLINE>
    >>> msgdata
    {}

However, once topics are enabled, messages will be tagged.  There are two
artifacts of tagging: an ``X-Topics:`` header is added with the topic name,
and the message metadata gets a ``topichits`` key with a list of matching
topic names.

    >>> mlist.topics_enabled = True
    >>> msg = message_from_string("""\
    ... Subject: foobar
    ... Keywords: barbaz
    ...
    ... """)
    >>> msgdata = {}
    >>> process(mlist, msg, msgdata)
    >>> print(msg.as_string())
    Subject: foobar
    Keywords: barbaz
    X-Topics: bar fight
    <BLANKLINE>
    <BLANKLINE>
    >>> msgdata['topichits']
    ['bar fight']
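
The two artifacts can be pictured with a small stand-alone sketch.
``tag_message`` is a hypothetical helper, and joining several topic names with
a comma and a space is an assumption; the examples in this document only show
a single hit.

```python
from email import message_from_string


def tag_message(msg, msgdata, hits):
    """Record topic hits on the message and in its metadata."""
    if not hits:
        # No hits: leave both the message and the metadata untouched.
        return
    # Assumption: multiple topic names are joined with ", ".
    msg['X-Topics'] = ', '.join(hits)
    msgdata['topichits'] = list(hits)


msg = message_from_string("Subject: foobar\nKeywords: barbaz\n\n")
msgdata = {}
tag_message(msg, msgdata, ['bar fight'])
print(msg['x-topics'])
print(msgdata)
```

When the list of hits is empty, the sketch adds neither the header nor the
metadata key, which mirrors the untagged examples above.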


Scanning body lines
===================

The tagger can also look at a certain number of body lines, but only for
``Subject:`` and ``Keywords:`` header-like lines.  When
``topics_bodylines_limit`` is set to zero, no body lines are scanned.

    >>> msg = message_from_string("""\
    ... From: aperson@example.com
    ... Subject: nothing
    ... Keywords: at all
    ...
    ... X-Ignore: something else
    ... Subject: foobar
    ... Keywords: barbaz
    ... """)
    >>> msgdata = {}
    >>> process(mlist, msg, msgdata)
    >>> print(msg.as_string())
    From: aperson@example.com
    Subject: nothing
    Keywords: at all
    <BLANKLINE>
    X-Ignore: something else
    Subject: foobar
    Keywords: barbaz
    <BLANKLINE>
    >>> msgdata
    {}

But let the tagger scan a few body lines and the matching headers will be
found.

    >>> mlist.topics_bodylines_limit = 5
    >>> msg = message_from_string("""\
    ... From: aperson@example.com
    ... Subject: nothing
    ... Keywords: at all
    ...
    ... X-Ignore: something else
    ... Subject: foobar
    ... Keywords: barbaz
    ... """)
    >>> msgdata = {}
    >>> process(mlist, msg, msgdata)
    >>> print(msg.as_string())
    From: aperson@example.com
    Subject: nothing
    Keywords: at all
    X-Topics: bar fight
    <BLANKLINE>
    X-Ignore: something else
    Subject: foobar
    Keywords: barbaz
    <BLANKLINE>
    >>> msgdata['topichits']
    ['bar fight']

However, scanning stops at the first body line that doesn't look like a
header.

    >>> msg = message_from_string("""\
    ... From: aperson@example.com
    ... Subject: nothing
    ... Keywords: at all
    ...
    ... This is not a header
    ... Subject: foobar
    ... Keywords: barbaz
    ... """)
    >>> msgdata = {}
    >>> process(mlist, msg, msgdata)
    >>> print(msg.as_string())
    From: aperson@example.com
    Subject: nothing
    Keywords: at all
    <BLANKLINE>
    This is not a header
    Subject: foobar
    Keywords: barbaz
    <BLANKLINE>
    >>> msgdata
    {}

When the limit is set to a negative number, all body lines will be scanned.

    >>> mlist.topics_bodylines_limit = -1
    >>> lots_of_headers = '\n'.join(['X-Ignore: zip'] * 100)
    >>> msg = message_from_string("""\
    ... From: aperson@example.com
    ... Subject: nothing
    ... Keywords: at all
    ...
    ... %s
    ... Subject: foobar
    ... Keywords: barbaz
    ... """ % lots_of_headers)
    >>> msgdata = {}
    >>> process(mlist, msg, msgdata)
    >>> # Rather than print out 100 X-Ignore: headers, let's just prove that
    >>> # the X-Topics: header exists, meaning that the tagger did its job.
    >>> print(msg['x-topics'])
    bar fight
    >>> msgdata['topichits']
    ['bar fight']
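
The scanning rules in this section (zero scans nothing, a positive limit caps
the number of lines, a negative limit scans everything, and scanning stops at
the first line that does not look like a header) can be summarized in a short
sketch.  ``scan_body_lines`` and the ``HEADER_LIKE`` pattern are hypothetical;
in particular, the exact test for "looks like a header" is an assumption.

```python
import re

# Assumption: a line "looks like a header" if it starts with a field name
# made of letters, digits, or hyphens, followed by a colon.
HEADER_LIKE = re.compile(r'^([A-Za-z0-9-]+):\s*(.*)$')


def scan_body_lines(body, limit):
    """Return Subject:/Keywords: values found in the leading body lines."""
    if limit == 0:
        return []
    lines = body.splitlines()
    if limit > 0:
        # A positive limit caps how many lines are examined.
        lines = lines[:limit]
    found = []
    for line in lines:
        match = HEADER_LIKE.match(line)
        if match is None:
            # Stop at the first line that is not header-like.
            break
        name, value = match.groups()
        if name.lower() in ('subject', 'keywords'):
            found.append(value)
    return found


body = "X-Ignore: something else\nSubject: foobar\nKeywords: barbaz\n"
print(scan_body_lines(body, 0))
print(scan_body_lines(body, 5))
print(scan_body_lines("This is not a header\n" + body, -1))
```

The values returned by such a scan would then be matched against the topic
patterns in the same way as the real message headers.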


Scanning sub-parts
==================

The tagger will also scan the body lines of text subparts in a multipart
message, using the same rules as if all those body lines lived in a single
text payload.

    >>> msg = message_from_string("""\
    ... Subject: Was
    ... Keywords: Raw
    ... Content-Type: multipart/alternative; boundary="BOUNDARY"
    ...
    ... --BOUNDARY
    ... From: sabo
    ... To: obas
    ...
    ... Subject: farbaw
    ... Keywords: barbaz
    ...
    ... --BOUNDARY--
    ... """)
    >>> msgdata = {}
    >>> process(mlist, msg, msgdata)
    >>> print(msg.as_string())
    Subject: Was
    Keywords: Raw
    Content-Type: multipart/alternative; boundary="BOUNDARY"
    X-Topics: bar fight
    <BLANKLINE>
    --BOUNDARY
    From: sabo
    To: obas
    <BLANKLINE>
    Subject: farbaw
    Keywords: barbaz
    <BLANKLINE>
    --BOUNDARY--
    <BLANKLINE>
    >>> msgdata['topichits']
    ['bar fight']

But the tagger will not descend into non-text parts.

    >>> msg = message_from_string("""\
    ... Subject: Was
    ... Keywords: Raw
    ... Content-Type: multipart/alternative; boundary=BOUNDARY
    ...
    ... --BOUNDARY
    ... From: sabo
    ... To: obas
    ... Content-Type: message/rfc822
    ...
    ... Subject: farbaw
    ... Keywords: barbaz
    ...
    ... --BOUNDARY
    ... From: sabo
    ... To: obas
    ... Content-Type: message/rfc822
    ...
    ... Subject: farbaw
    ... Keywords: barbaz
    ...
    ... --BOUNDARY--
    ... """)
    >>> msgdata = {}
    >>> process(mlist, msg, msgdata)
    >>> print(msg['x-topics'])
    None
    >>> msgdata
    {}
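
One way to picture the sub-part rule is a helper that gathers body lines only
from text parts.  This is a simplified sketch, not the handler's code:
``text_body_lines`` is a hypothetical name, and looking only at the direct
children of a multipart message is an assumption.

```python
from email import message_from_string


def text_body_lines(msg):
    """Collect body lines from text parts, skipping everything else."""
    # A multipart payload is a list of sub-messages; otherwise treat the
    # message itself as the only part.
    parts = msg.get_payload() if msg.is_multipart() else [msg]
    lines = []
    for part in parts:
        # Parts without a Content-Type: header default to text/plain.
        if part.get_content_maintype() == 'text':
            lines.extend(part.get_payload().splitlines())
    return lines


msg = message_from_string(
    'Subject: Was\n'
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\n'
    '\n'
    '--BOUNDARY\n'
    'From: sabo\n'
    '\n'
    'Subject: farbaw\n'
    'Keywords: barbaz\n'
    '\n'
    '--BOUNDARY--\n')
print(text_body_lines(msg))
```

A ``message/rfc822`` part has the main type ``message``, so the sketch skips
it and contributes no lines for scanning.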