autodetect_search_language.py 4.2 KB

# SPDX-License-Identifier: AGPL-3.0-or-later
# lint: pylint
"""Plugin to detect the search language from the search query.

The language detection is done by using the fastText_ library (`python
fasttext`_). fastText_ distributes the `language identification model`_, for
reference:

- `FastText.zip: Compressing text classification models`_
- `Bag of Tricks for Efficient Text Classification`_

The `language identification model`_ supports the language codes (ISO-639-3)::

   af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr
   ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa
   fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io
   is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv
   mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn
   no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd
   sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep
   vi vls vo wa war wuu xal xmf yi yo yue zh

The `language identification model`_ is harmonized with SearXNG's language
(locale) model.  General conditions of SearXNG's locale model are:

a. SearXNG's locale of a query is passed to
   :py:obj:`searx.locales.get_engine_locale` to get a language and/or region
   code that is used by an engine.

b. SearXNG and most of the engines do not support all the languages from the
   language model, and there might also be a discrepancy in the ISO-639-3 and
   ISO-639-2 handling (:py:obj:`searx.locales.get_engine_locale`).
   Furthermore, in SearXNG locales like ``zh-TW`` (``zh-CN``) are mapped to
   ``zh_Hant`` (``zh_Hans``).

Conclusion: This plugin only auto-detects the languages a user can select in
the language menu (:py:obj:`supported_langs`).

SearXNG's locale of a query comes from (*highest number wins*):

1. The ``Accept-Language`` header from the user's HTTP client.
2. The user selects a locale in the preferences.
3. The user selects a locale from the menu in the query form (e.g. ``:zh-TW``).
4. This plugin is activated in the preferences and the locale (only the language
   code, no region code) comes from fastText's language detection.

Conclusion: There can be a conflict between the language selected by the user
and the language detected by this plugin.  For example, the user explicitly
selects the German locale via the search syntax to search for a term that is
identified as an English term (try ``:de-DE thermomix``).

.. hint::

   To SearXNG maintainers; please take into account: under some circumstances
   the auto-detection of the language by this plugin could be detrimental to
   users' expectations.  It is not recommended to activate this plugin by
   default.  It should always be the user's decision whether to activate this
   plugin or not.

.. _fastText: https://fasttext.cc/
.. _python fasttext: https://pypi.org/project/fasttext/
.. _language identification model: https://fasttext.cc/docs/en/language-identification.html
.. _Bag of Tricks for Efficient Text Classification: https://arxiv.org/abs/1607.01759
.. _`FastText.zip: Compressing text classification models`: https://arxiv.org/abs/1612.03651

"""

from flask_babel import gettext
import babel

from searx.utils import detect_language
from searx.languages import language_codes

name = gettext('Autodetect search language')
description = gettext('Automatically detect the query search language and switch to it.')
preference_section = 'general'
default_on = False

supported_langs = set()
"""Languages supported by most SearXNG engines (:py:obj:`searx.languages.language_codes`)."""


def pre_search(request, search):  # pylint: disable=unused-argument
    # min_probability=0: do not require a minimum probability for the
    # detected language.
    lang = detect_language(search.search_query.query, min_probability=0)
    # Only switch to languages the user could also select in the language menu.
    if lang in supported_langs:
        search.search_query.lang = lang
        try:
            search.search_query.locale = babel.Locale.parse(lang)
        except babel.core.UnknownLocaleError:
            # babel does not know this language code: keep the existing locale.
            pass
    return True


def init(app, settings):  # pylint: disable=unused-argument
    # Collect the language part (region dropped) of every SearXNG locale tag.
    for searxng_locale in language_codes:
        supported_langs.add(searxng_locale[0].split('-')[0])
    return True
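

# ---------------------------------------------------------------------------
# Illustrative sketch only, not used by the plugin: a minimal, self-contained
# model of the filtering that init() and pre_search() perform together.  The
# helper names (_sketch_base_languages, _sketch_pick_language) and the sample
# locale tags are hypothetical; the plugin itself reads its tags from
# searx.languages.language_codes and gets the detected code from
# searx.utils.detect_language.
# ---------------------------------------------------------------------------


def _sketch_base_languages(locale_tags):
    """Drop the region part of each tag, e.g. 'zh-TW' becomes 'zh'."""
    return {tag.split('-')[0] for tag in locale_tags}


def _sketch_pick_language(detected, supported, current):
    """Return the detected language if it is selectable, else keep the current one."""
    return detected if detected in supported else current


# Usage, assuming the sample tags below:
#   supported = _sketch_base_languages(['en-US', 'de-DE', 'zh-TW'])
#   _sketch_pick_language('de', supported, 'en')   # 'de' is selectable
#   _sketch_pick_language('frr', supported, 'en')  # 'frr' is not, 'en' is kept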