Tokenizer
services.tokenizer.core
Core tokenizer components.
Exports:

- AbstractTokenizer
- TokenizerConfig
- TokenList
- LanguageFamily
- TokenType
- CaseHandling
Intended usage:

```python
from services.tokenizer.core import (
    AbstractTokenizer,
    TokenizerConfig,
    TokenList,
    LanguageFamily,
    TokenType,
    CaseHandling,
)
```
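For example, downstream code usually depends on these abstract types rather than a concrete tokenizer. A minimal sketch (illustrative, not part of the package; it assumes `TokenList` supports `len()` like an ordinary list):

```python
from services.tokenizer.core import AbstractTokenizer, TokenList


def count_tokens(tokenizer: AbstractTokenizer, text: str) -> int:
    # Works with any concrete tokenizer that implements the interface.
    tokens: TokenList = tokenizer.tokenize(text)
    return len(tokens)
```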
Modules:

| Name | Description |
| --- | --- |
| base | AbstractTokenizer abstract base class |
| types | TokenizerConfig, enums, and shared types |
Classes:

| Name | Description |
| --- | --- |
| AbstractTokenizer | Abstract base class for all tokenizer implementations. |
| CaseHandling | How to handle character case during tokenization. |
| LanguageFamily | Language families that affect tokenization strategies. |
| TokenType | Types of tokens that can be extracted. |
| TokenizerConfig | Configuration for tokenizer behavior. |
AbstractTokenizer
Bases: ABC
Abstract base class for all tokenizer implementations.
This class defines the core interface that all tokenizer plugins must implement. It provides a clean contract for tokenization operations while allowing for different implementation strategies.
Methods:

| Name | Description |
| --- | --- |
| __init__ | Initialize the tokenizer with configuration. |
| tokenize | Tokenize input text into a list of tokens. |
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| config | TokenizerConfig | Get the current tokenizer configuration. |
Source code in services/tokenizer/core/base.py
__init__(config=None)
Initialize the tokenizer with configuration.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | Optional[TokenizerConfig] | Tokenizer configuration. If None, default config will be used. | None |
Source code in services/tokenizer/core/base.py
config
property
Get the current tokenizer configuration.
tokenize(text)
abstractmethod
Tokenize input text into a list of tokens.
This is the main tokenization method that all implementations must provide.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Input text to tokenize | required |
Returns:

| Type | Description |
| --- | --- |
| TokenList | List of tokens extracted from the input text |
Source code in services/tokenizer/core/base.py
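As a sketch of how the contract might be satisfied, the hypothetical whitespace tokenizer below subclasses AbstractTokenizer. It is illustrative only: it assumes the base class stores the supplied (or default) TokenizerConfig behind the `config` property and that `TokenList` behaves like a list of string tokens.

```python
from typing import Optional

from services.tokenizer.core import AbstractTokenizer, TokenizerConfig, TokenList


class WhitespaceTokenizer(AbstractTokenizer):
    """Hypothetical minimal implementation of the documented interface."""

    def __init__(self, config: Optional[TokenizerConfig] = None) -> None:
        # Let the base class handle defaulting and expose self.config.
        super().__init__(config)

    def tokenize(self, text: str) -> TokenList:
        # Naive whitespace split; a real plugin would honour the full config
        # (case handling, min/max token length, entity extraction, ...).
        tokens = text.split()
        if self.config.strip_whitespace:
            tokens = [t.strip() for t in tokens]
        return tokens
```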
CaseHandling
Bases: Enum
How to handle character case during tokenization.
Source code in services/tokenizer/core/types.py
LanguageFamily
Bases: str, Enum
Language families that affect tokenization strategies.
Source code in services/tokenizer/core/types.py
TokenType
Bases: str, Enum
Types of tokens that can be extracted.
Source code in services/tokenizer/core/types.py
TokenizerConfig
pydantic-model
Bases: BaseModel
Configuration for tokenizer behavior.
Controls all aspects of text tokenization including script handling, social media entity processing, and output formatting.
Social media entity behavior:

- extract_hashtags / extract_mentions: when False, hashtags and @mentions are split into their component words.
- include_urls / include_emails: when False, URLs and emails are excluded completely rather than fragmented.
Fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| fallback_language_family | LanguageFamily | LanguageFamily.MIXED | Default language family when detection fails or mixed content is found. |
| include_punctuation | bool | False | Whether to include punctuation marks as separate tokens. |
| include_numeric | bool | True | Whether to include numeric tokens (integers, decimals, etc.). |
| include_emoji | bool | False | Whether to include emoji characters as tokens. |
| case_handling | CaseHandling | CaseHandling.LOWERCASE | How to handle character case during tokenization. |
| normalize_unicode | bool | True | Whether to apply Unicode NFKC normalization for consistent character representation. |
| extract_hashtags | bool | True | Whether to preserve hashtags as single tokens. If False, splits into component words. |
| extract_mentions | bool | True | Whether to preserve @mentions as single tokens. If False, splits into component words. |
| include_urls | bool | True | Whether to include URLs as tokens. If False, URLs are completely excluded (not fragmented). |
| include_emails | bool | True | Whether to include email addresses as tokens. If False, emails are completely excluded (not fragmented). |
| min_token_length | int | 1 | Minimum length for tokens to be included in output. |
| max_token_length | Optional[int] | None | Maximum length for tokens. If None, no length limit is applied. |
| strip_whitespace | bool | True | Whether to strip leading/trailing whitespace from tokens. |

All fields are pydantic fields on the model.

Source code in services/tokenizer/core/types.py
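A hedged construction example that exercises the social-media-entity behavior described above (the chosen values are illustrative; the field names and defaults come from the table):

```python
from services.tokenizer.core import TokenizerConfig, CaseHandling

# Keep hashtags/mentions intact, drop URLs and emails entirely,
# and discard single-character tokens.
config = TokenizerConfig(
    case_handling=CaseHandling.LOWERCASE,
    extract_hashtags=True,
    extract_mentions=True,
    include_urls=False,
    include_emails=False,
    min_token_length=2,
)

# Unset fields keep their documented defaults, e.g. normalize_unicode=True.
assert config.normalize_unicode is True
```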
base
AbstractTokenizer abstract base class
This module contains the abstract base class that defines the interface for all tokenizer implementations.
Classes:

| Name | Description |
| --- | --- |
| AbstractTokenizer | Abstract base class for all tokenizer implementations. |
types
TokenizerConfig, enums, and shared types
This module contains configuration models, enumerations, and shared type definitions used across the tokenizer service.
Classes:

| Name | Description |
| --- | --- |
| CaseHandling | How to handle character case during tokenization. |
| LanguageFamily | Language families that affect tokenization strategies. |
| TokenType | Types of tokens that can be extracted. |
| TokenizerConfig | Configuration for tokenizer behavior. |
services.tokenizer.basic
Basic tokenizer implementation.
This module exports the BasicTokenizer implementation that provides fundamental Unicode-aware tokenization capabilities for social media text.
Modules:

| Name | Description |
| --- | --- |
| patterns | Regex patterns for text tokenization. |
| tokenizer | BasicTokenizer implementation. |
Classes:

| Name | Description |
| --- | --- |
| BasicTokenizer | Unicode-aware basic tokenizer for social media text. |
| TokenizerConfig | Configuration for tokenizer behavior. |
Functions:

| Name | Description |
| --- | --- |
| create_basic_tokenizer | Create a BasicTokenizer with optional configuration. |
| get_patterns | Get global TokenizerPatterns instance. |
| tokenize_text | Simple convenience function for basic text tokenization. |
BasicTokenizer
Bases: AbstractTokenizer
Unicode-aware basic tokenizer for social media text.
This tokenizer handles mixed-script content, preserves social media entities (@mentions, #hashtags, URLs), and applies appropriate tokenization strategies for different script families.
Methods:

| Name | Description |
| --- | --- |
| __init__ | Initialize BasicTokenizer with configuration. |
| tokenize | Tokenize input text into a list of tokens. |
Source code in services/tokenizer/basic/tokenizer.py
__init__(config=None)
Initialize BasicTokenizer with configuration.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | Optional[TokenizerConfig] | Tokenizer configuration. If None, default config will be used. | None |
Source code in services/tokenizer/basic/tokenizer.py
tokenize(text)
Tokenize input text into a list of tokens.
Applies appropriate tokenization strategies for mixed-script content while preserving social media entities and handling Unicode correctly.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Input text to tokenize | required |
Returns:

| Type | Description |
| --- | --- |
| TokenList | List of tokens extracted from the input text in document order |
Source code in services/tokenizer/basic/tokenizer.py
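A short, hedged usage sketch of the class as documented (the input string is illustrative; exact output depends on the configuration and on how TokenList represents tokens):

```python
from services.tokenizer.basic import BasicTokenizer
from services.tokenizer.core import TokenizerConfig

# Preserve hashtags and mentions, drop URLs entirely.
config = TokenizerConfig(extract_hashtags=True, include_urls=False)
tokenizer = BasicTokenizer(config)

tokens = tokenizer.tokenize("Check this out @alice #nlp https://example.com")
# Expected behavior per the config: "@alice" and "#nlp" survive as single
# tokens, while the URL is excluded rather than fragmented.
print(tokens)
```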
create_basic_tokenizer(config=None)
Create a BasicTokenizer with optional configuration.
Source code in services/tokenizer/basic/__init__.py
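A hedged one-liner; based on the description, this factory presumably behaves like constructing BasicTokenizer directly:

```python
from services.tokenizer.basic import create_basic_tokenizer
from services.tokenizer.core import TokenizerConfig

# With no argument the default configuration is used (assumption based on
# the config=None signature); a custom config can also be passed.
tokenizer = create_basic_tokenizer(TokenizerConfig(include_emoji=True))
tokens = tokenizer.tokenize("So good 😍 #win")
```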
get_patterns()
Get global TokenizerPatterns instance.
Returns:

| Type | Description |
| --- | --- |
| TokenizerPatterns | Singleton TokenizerPatterns instance |
Source code in services/tokenizer/basic/patterns.py
patterns
Regex patterns for text tokenization.
This module contains compiled regular expressions for extracting different types of tokens from social media text, with fallback support for both regex and re modules.
Classes:

| Name | Description |
| --- | --- |
| TokenizerPatterns | Compiled regex patterns for tokenization. |
Functions:

| Name | Description |
| --- | --- |
| get_patterns | Get global TokenizerPatterns instance. |
TokenizerPatterns
Compiled regex patterns for tokenization.
Organizes patterns logically and provides efficient compiled regex objects for different token types found in social media text.
Methods:

| Name | Description |
| --- | --- |
| __init__ | Initialize and compile all tokenization patterns. |
| get_comprehensive_pattern | Build comprehensive tokenization pattern based on configuration. |
| get_exclusion_pattern | Build pattern to identify and skip excluded entities in text. |
| get_pattern | Get compiled pattern by name. |
| list_patterns | Get list of available pattern names. |
Source code in services/tokenizer/basic/patterns.py
__init__()
Initialize and compile all tokenization patterns.
Source code in services/tokenizer/basic/patterns.py
get_comprehensive_pattern(config)
Build comprehensive tokenization pattern based on configuration.
This creates a single regex pattern that finds ALL tokens in document order, eliminating the need for segmentation and reassembly. URLs and emails are conditionally included in the regex itself based on configuration, avoiding the need for post-processing filtering.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | TokenizerConfig | TokenizerConfig specifying which token types to include | required |
Returns:

| Type | Description |
| --- | --- |
| Any | Compiled regex pattern that matches all desired token types in priority order |
Source code in services/tokenizer/basic/patterns.py
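As an illustration of the idea (not the library's actual pattern), a comprehensive pattern of this kind can be built as a single alternation whose branches are ordered by priority, so one pass over the text yields all tokens in document order:

```python
import re

# Hypothetical, simplified branches; the real module compiles far more
# careful, Unicode-aware patterns and consults the TokenizerConfig.
branches = [
    r"https?://\S+",          # URLs (only if config.include_urls)
    r"[\w.+-]+@[\w-]+\.\w+",  # emails (only if config.include_emails)
    r"#\w+",                  # hashtags
    r"@\w+",                  # mentions
    r"\d+(?:\.\d+)?",         # numbers
    r"\w+",                   # plain words
]
comprehensive = re.compile("|".join(f"(?:{b})" for b in branches))

# One findall pass returns tokens in the order they appear in the text.
print(comprehensive.findall("Ping @bob about #release at https://example.com"))
```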
get_exclusion_pattern(config)
Build pattern to identify and skip excluded entities in text.
This creates a pattern that matches URLs and emails that should be excluded, allowing the tokenizer to skip over them entirely instead of breaking them into component words.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | TokenizerConfig | TokenizerConfig specifying which token types to exclude | required |
Returns:

| Type | Description |
| --- | --- |
| Any | Compiled regex pattern that matches excluded entities, or None if no exclusions |
Source code in services/tokenizer/basic/patterns.py
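The intent, as described, is to remove excluded entities before word-level tokenization so they are never fragmented. A hypothetical sketch of that flow (the pattern and the split step are simplified stand-ins, not the module's implementation):

```python
import re

# Hypothetical exclusion pattern: URLs and emails that should disappear from
# the output entirely (mirrors include_urls=False, include_emails=False).
exclusion = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+\.\w+")

text = "Mail me at someone@example.com or read https://example.com/post first"
# Blank out the excluded spans, then tokenize what remains; the URL and the
# email contribute neither tokens nor word fragments.
cleaned = exclusion.sub(" ", text)
print(cleaned.split())  # ['Mail', 'me', 'at', 'or', 'read', 'first']
```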
get_pattern(pattern_name)
Get compiled pattern by name.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pattern_name | str | Name of the pattern to retrieve | required |
Returns:

| Type | Description |
| --- | --- |
| Any | Compiled regex pattern |
Raises:

| Type | Description |
| --- | --- |
| KeyError | If pattern name is not found |
Source code in services/tokenizer/basic/patterns.py
list_patterns()
Get list of available pattern names.
Source code in services/tokenizer/basic/patterns.py
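A hedged example of the accessor methods: the singleton comes from get_patterns(), and the specific pattern name used below ("url") is a guess, since the available names are only discoverable via list_patterns():

```python
from services.tokenizer.basic import get_patterns

patterns = get_patterns()          # module-level singleton
print(patterns.list_patterns())    # names of all compiled patterns

try:
    url_pattern = patterns.get_pattern("url")  # hypothetical pattern name
    print(bool(url_pattern.search("see https://example.com")))
except KeyError:
    # Raised when the requested pattern name does not exist.
    pass
```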
tokenize_text(text, config=None)
Simple convenience function for basic text tokenization.
Source code in services/tokenizer/basic/__init__.py
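A minimal sketch of the convenience path, assuming it tokenizes with a default BasicTokenizer when no config is given:

```python
from services.tokenizer.basic import tokenize_text
from services.tokenizer.core import TokenizerConfig

# Default behavior.
tokens = tokenize_text("Great thread @carol #python")

# Or with an explicit configuration (e.g. keep punctuation as tokens).
tokens = tokenize_text("Wow!!", config=TokenizerConfig(include_punctuation=True))
```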
tokenizer
BasicTokenizer implementation.
This module contains the main BasicTokenizer class that implements Unicode-aware tokenization for social media text with entity preservation.
Classes:

| Name | Description |
| --- | --- |
| BasicTokenizer | Unicode-aware basic tokenizer for social media text. |