
melusine's People

Contributors

adam-benjellounsg, antoinesimoulin, baudelotphilippe, benoitlebreton-perso, crgoncalves, cwoodrow, desmfr, farthur, fdjfdj, ggirou, guilieb, hugo-quantmetry, hugoperrier, maxime-poulain-verlingue, milidris, nicolaspte, remi-perrier, remiadon, sachasamama, snyk-bot, tfa-maif, tstringerquantmetry, victorbigand


melusine's Issues

getting started

Hello, I am writing this report hoping for some help.
I have been trying to work with your project for a while but have had difficulties due to a proxy on my laptop. The workaround I found to install the Melusine library was to download all the libraries on another laptop and then copy them onto mine.
I had yet to finish installing Jupyter, so when I finally had access to melusine I ran some tests to make sure it was working. I wrote a .py file with the code below, ran it in my terminal, and got the error in the screenshot below.
I don't understand the error, so I wanted to know whether the problem comes from how I installed melusine or from the fact that I didn't have Jupyter yet.

from melusine.data.data_loader import load_email_data
df_emails = load_email_data()
print(df_emails[1])

Screenshot 2021-11-22 153201

  • I am using python 3.8
  • I am on Windows


Bug in NeuralModel?

Discussed in #147

Originally posted by aagostinelli86 January 3, 2023
Hello @TFA-MAIF, I have an issue with tutorial notebook n°13, cell 33 (Camembert model): "OSError: Model name 'jplu/tf-camembert-base' was not found in tokenizers model name list (camembert-base). We assumed 'jplu/tf-camembert-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['sentencepiece.bpe.model'] but couldn't find such vocabulary files at this path or url."
I think it's due to the proxy, so I downloaded the pretrained model locally (config.json + .tf_model.h5) and specified its path in your NeuralModel wrapper. The issue persists with a different message: "OSError: Model name 'C:\Users\vgkj536\git_projects\melusine\tutorial\jplu\tf-camembert-base' was not found in tokenizers model name list (camembert-base). We assumed 'C:\Users\vgkj536\git_projects\melusine\tutorial\jplu\tf-camembert-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['sentencepiece.bpe.model'] but couldn't find such vocabulary files at this path or url."
Conversely, if I call TFCamembertModel.from_pretrained() directly with this local path, it works well. Is this due to a bug?
image

Setup.py misses `melusine/config/config/config.json`

When installing with pip, melusine/config/config/conf.json is missing.

image

This happens because the glob call in setup.py only captures JSON files in sub-folders of config and skips JSON files in the config folder itself.

This can be fixed by using glob.glob("melusine/config/**/*.json", recursive=True) instead of glob.glob("melusine/config/**/*.json").
I will create a pull request for this issue.
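For illustration, here is a minimal, self-contained reproduction of the glob behaviour described above (the paths are throwaway temp directories, not Melusine's actual layout):

```python
import glob
import os
import tempfile

# Recreate the shape of the problem: one JSON directly under config/,
# one JSON in a sub-folder.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "config", "sub"))
for rel in ("config/conf.json", "config/sub/extra.json"):
    with open(os.path.join(tmp, *rel.split("/")), "w") as f:
        f.write("{}")

pattern = os.path.join(tmp, "config", "**", "*.json")

# Without recursive=True, "**" behaves like "*": only sub-folder JSONs match.
non_recursive = sorted(glob.glob(pattern))
# With recursive=True, "**" also matches zero directories: both JSONs match.
recursive = sorted(glob.glob(pattern, recursive=True))

print(len(non_recursive), len(recursive))  # 1 2
```

The non-recursive pattern silently drops config/conf.json, which matches the missing-file symptom reported above.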

image

Generic datetime is not handled

Hi !
There is an issue with this type of datetime in the 'date' column: 2021-06-24T13:02:43+00:00. Indeed, the MetaDate class expects the date to match this format: regex_date_format=r"\w+ (\d+) (\w+) (\d{4}) (\d{2}) h (\d{2})". When the date does not fit the format, the except clause returns the raw string, so one might expect processing to continue; but the to_datetime call just after expects date_format="%d/%m/%Y %H:%M". In our case, to_datetime:

 pd.to_datetime(X['date'],
                format=self.date_format,
                infer_datetime_format=False,
                errors="coerce",
            )

returns NaT.
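For reference, the format mismatch can be reproduced with the standard library alone (no Melusine code involved; pandas' errors="coerce" is what turns the ValueError below into NaT):

```python
from datetime import datetime

iso_date = "2021-06-24T13:02:43+00:00"

# The pipeline's fixed date_format cannot parse an ISO-8601 timestamp...
try:
    datetime.strptime(iso_date, "%d/%m/%Y %H:%M")
    fixed_format_ok = True
except ValueError:
    fixed_format_ok = False

# ...while a generic ISO parser handles it fine (Python 3.7+), which
# suggests a fallback the pipeline could try before coercing to NaT.
parsed = datetime.fromisoformat(iso_date)

print(fixed_format_ok, parsed.year, parsed.hour)  # False 2021 13
```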

Python version :
3.7
Melusine version :
2.3.1
Operating System :
Windows 10

Add release notes for Melusine 2.3

Add release notes for Melusine 2.3

New features:

  • Added a class ExchangeConnector to interact with an Exchange Mailbox
  • Added new tutorial tutorial14_exchange_connector to demonstrate the usage of the ExchangeConnector class.

Updates:

  • Gensim upgrade (4.0.0)
  • Propagate modifications stemming from the Gensim upgrade (code and tutorials)
  • Package deployment: switch from Travis CI to GitHub Actions

Different versions between GitHub and Pypi.org

Hey !

I tried to reinstall Melusine since I had version 2.3.1 from October.
I saw on GitHub that changes had been made to the NLP tools (the Phraser object, for example).

First, I uninstalled melusine with pip uninstall melusine.
Then pip install melusine downloaded melusine 2.3.2.
image

However, when I tried to use the new version of the Phraser class, there was an error. The code downloaded with pip still has the old version of the file "phraser.py". So there are differences between the Melusine code on GitHub and the code we can download with pip.

The ...\Lib\site-packages\melusine\nlp_tools\phraser.py that I have in my virtual env :
image

The phraser.py file in GitHub :
image

Is this normal? Am I the only one having this problem?

Thank you !

Issue with attachment type metadata

Hey !

We had a problem with the attachment type in the metadata. As you can see in the screenshot below, we had only two values after applying our metadata pipeline: 0 for the presence of an attachment file and 1 if there is no attachment file in the mail. The screenshot is an extract of the DataFrame called df_emails.

dfemails_error_metadata

Here is the way we create our pipeline and how we apply it on our emails :

Metadatapipeline = Pipeline([
    ('MetaExtension', MetaExtension()),
    ('MetaDate', MetaDate()),
    ('MetaAttachmentType', MetaAttachmentType()),
    ('Dummifier', Dummifier(columns_to_dummify=['extension', 'attachment_type', 'dayofweek', 'hour', 'min'])),
])
df_meta = Metadatapipeline.fit_transform(df_emails)

Then, this is the function which is supposed to extract the type of the attachment file in melusine/prepare_email/metadata_engineering.py:
image
image

We added some prints to understand what the problem is. As you can see, when there is at least one attachment file in the mail, the type of x is str, and when there is no attachment file the value of x is nan.
When the function has to deal with a mail that has an attachment file, the value of row["attachment"] is a str. For example, we could have "['image002.png', 'image003.jpg']". The for loop then just treats it as a str and iterates over it character by character. This seems to be the reason for our issue.
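One way to guard against this (a sketch only; attachment_types is a hypothetical helper, not the actual function in metadata_engineering.py) is to parse the string back into a list before iterating:

```python
import ast
import math

def attachment_types(cell):
    """Return attachment extensions for one cell of the 'attachment' column.

    The raw cell may be NaN (no attachment) or the *string* representation
    of a list, e.g. "['image002.png', 'image003.jpg']". Iterating over that
    string directly yields single characters, which is the bug above.
    """
    if isinstance(cell, float) and math.isnan(cell):
        return []
    if isinstance(cell, str):
        cell = ast.literal_eval(cell)  # "['a.png']" -> ['a.png']
    return [name.rsplit(".", 1)[-1].lower() for name in cell]

print(attachment_types("['image002.png', 'image003.jpg']"))  # ['png', 'jpg']
print(attachment_types(float("nan")))  # []
```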

To fix this problem, we did :
image

This seems to solve our issue :
image
image

Python version : 3.8.12

Melusine version : 2.3.1

Operating System : Windows

melusine is not compatible with Python 3.12

Instructions

In: melusine/_config.py, the following code snippet

    def export_default_config(self, path: str) -> List[str]:
        from distutils.dir_util import copy_tree

        source = self.DEFAULT_CONFIG_PATH
        file_list: List[str] = copy_tree(source, path)

        return file_list

imports distutils, which was removed in Python 3.12.
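A possible distutils-free replacement, sketched with shutil and pathlib (shutil.copytree's dirs_exist_ok flag needs Python 3.8+, which covers 3.12; the demonstration paths are throwaway temp directories):

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

def export_default_config(source: str, path: str) -> List[str]:
    # shutil.copytree mirrors distutils' copy_tree behaviour; the list of
    # copied files is rebuilt with pathlib, since copytree only returns
    # the destination directory.
    shutil.copytree(source, path, dirs_exist_ok=True)
    return [str(p) for p in Path(path).rglob("*") if p.is_file()]

# Quick demonstration against a throwaway directory.
src = Path(tempfile.mkdtemp()) / "default_config"
src.mkdir()
(src / "conf.json").write_text("{}")
copied = export_default_config(str(src), tempfile.mkdtemp() + "/out")
print(copied)
```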

Debug Information

Python version: 3.12

Melusine version: master@7dc9d99a57d

Make the dummifier more flexible

Hi !

If we want to use new metadata, as explained in the tutorials, we may need to modify the Dummifier object.

If we have metadata of type list, we need to modify the code of the Dummifier to be able to use our new metadata, as is done for the list of attachment types.
image
For example, this is the current code of the fit() method from the Dummifier:
image

In our case, I wanted to add the names of the receivers to the metadata used for training our model.
To do so, I needed to handle, in the Dummifier object, the case where df_emails has a receivers column.
It was not very clean, so I modified the Dummifier object to make it more flexible.

In the PR that I will link, the Dummifier object gets a new parameter: a list of the columns of type list.
We then apply the same functions to these columns.

I hope I was clear.

Best regards,

Maxime

Python version : 3.9.7

Melusine version : 2.3.4

Operating System : Windows

Value Error : Input 0 of layer dense_2 is incompatible with the layer

Hi !
I think I have detected a bug in the structure of email metadata. Indeed, I trained a model with a specific metadata dataset structure (see the first picture attached). But when I try to predict a result with a dataset that has a bigger structure, the following error appears:

"ValueError: Input 0 of layer dense_2 is incompatible with the layer: expected axis -1 of input shape to have value 9 but received input with shape [None, 10]"

I attach an example of this dataset (the second picture) so you can see what I mean. I think it could be a problem with TensorFlow, but I'm not sure.

Python version :
3.8

Melusine version :
2.3.1

Operating System :
Windows 10

first dataset

second dataset

Return the histogram of the training

Hello,

Description of Problem: We wanted to get the values of metrics (like accuracy) from the train function. This would let us plot the evolution of the accuracy over the epochs of training, or take the last value as the accuracy on the training dataset.

Overview of the Solution: To do that, we just need to store the output of the fit function inside Melusine's train function and then return it.

image

Possibility:
image
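The proposed change can be sketched like this (TrainerSketch and StubModel are illustrative stand-ins, not Melusine's actual classes; Keras' real fit() returns a History object whose .history dict plays the role of the dict below):

```python
class StubModel:
    """Stand-in for a compiled Keras model: fit() returns per-epoch metrics."""
    def fit(self, X, y, epochs=2):
        return {"accuracy": [0.7 + 0.1 * epoch for epoch in range(epochs)]}

class TrainerSketch:
    """The fix: keep and return the output of fit() instead of discarding it."""
    def __init__(self, model):
        self.model = model

    def train(self, X, y, **kwargs):
        history = self.model.fit(X, y, **kwargs)
        return history  # callers can now plot history["accuracy"]

history = TrainerSketch(StubModel()).train([[0]], [0], epochs=3)
print(history["accuracy"])
```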

body extraction

Hello !
Your project looks awesome and I really want to try it.
I see in your project that you use very clean text in your body data.
In "real" email life the content of a mail body is very dirty (HTML, encoding, formatting, multipart, different languages...).
Did you manage this? (Or maybe you work only with your internal company emails?)

Faithfully

Remove the pinned transformers==3.4.0 version in the dependencies

Description of Problem:
Currently, Melusine only works with transformers 3.4.0 (or no transformers at all).
Latest releases of the transformers library are thus not available (including latest models).

Overview of the Solution:
Make the Melusine code compatible with transformers>=4

Definition of Done:
Melusine works without the pinned transformers==3.4.0 version

Exchange connection error

Hi !
I tried to connect to Exchange with the ExchangeConnector class from melusine. But when I ran the program the following error appeared:

exchangelib.errors.AutoDiscoverFailed: All steps in the autodiscover protocol failed for email '[email protected]'. If you think this is an error, consider doing an official test at https://testconnectivity.microsoft.com

So, I tried to modify the melusine class in order to force the connection. Here is the code:

class ExchangeConnector:

    def __init__(
        self,
        login_address: str,
        password: str,
        mailbox_address: str = None,
        max_wait: int = 60,
        routing_folder_path: str = None,
        correction_folder_path: str = None,
        done_folder_path: str = None,
        target_column="target",
    ):
        """
        Parameters
        ----------
        login_address: str
            Email address used to login and send emails.
        password: str
            Password to login to the Exchange mailbox
        mailbox_address: str
            Email address of the mailbox. By default, the login address is used
        max_wait: int
            Maximum time (in s) to wait when connecting to mailbox
        routing_folder_path: str
            Path of the base routing folder
        correction_folder_path: str
            Path of the base correction folder
        done_folder_path: str
            Path of the Done folder
        target_column: str
            Name of the DataFrame column containing target folder names
        """
        self.login_address = login_address
        self.mailbox_address = mailbox_address or login_address
        self.folder_list = None
        self.target_column = target_column

        # Connect to mailbox
        self.credentials = Credentials(self.login_address, password)
        self.exchangelib_config = Configuration(server='mail.company.domain', retry_policy=FaultTolerance(max_wait=max_wait), credentials=self.credentials, auth_type=NTLM)
        # Add the server name and the auth_type. These parameters can be found in Outlook with 'auto configuration test'.
        # Mailbox account (Routing, Corrections, etc)
        self.mailbox_account = Account(
            primary_smtp_address=self.mailbox_address,
            credentials=self.credentials,
            autodiscover=False,  # turn autodiscover off
            config=self.exchangelib_config,
            access_type=DELEGATE,
            # access_type, like the server name and auth_type, can be found with 'auto configuration test'
        )
        # Sender accounts (send emails)
        self.sender_account = Account(
            primary_smtp_address=self.mailbox_address,
            credentials=self.credentials,
            autodiscover=False,
            config=self.exchangelib_config,
            access_type=DELEGATE
        )

        # Setup correction folder and done folder
        self.routing_folder_path = routing_folder_path
        self.correction_folder_path = correction_folder_path
        self.done_folder_path = done_folder_path

        logger.info(
            f"Connected to mailbox {self.mailbox_address} as user {self.login_address}"
        )

With this new code the connection should succeed. However, an SSL certificate error may still pop up. To work around it, you have to disable HTTPS certificate verification like this:

from exchangelib.protocol import BaseProtocol, NoVerifyHTTPAdapter

BaseProtocol.HTTP_ADAPTER_CLS = NoVerifyHTTPAdapter

Now the connection works!

Python version :
3.8

Melusine version :
2.3.1

Operating System :
Windows 10

Add phraser_on_clean_text

Hello !

Description of Problem: We wanted to concatenate clean_header and clean_body to create a clean_text column. As you can see in the screenshot below, we did this after applying the transformer pipeline. Following the examples in the tutorials, we then want to use an NLP pipeline with a phraser function and a tokenizer. Our problem: we can apply the tokenizer to the column we want (clean_text), but we cannot apply the phraser to that column. Indeed, there are only two phraser functions in melusine (phraser_on_body and phraser_on_header), and they apply the phraser to the clean_body and clean_header columns. We cannot concatenate clean_body and clean_header after applying their phraser functions, because they sit in the same pipeline as the tokenizer function, which can be applied to clean_text.
Pb_phraser

Overview of the Solution: One possible solution is to create a phraser function for clean_text in "melusine/nlp_tools/phraser.py". It would allow us to apply the phraser to a column named clean_text.

image

Result:
Solution_pb_phraser

Make Melusine compatible with python 3.7 (at least)

Description of Problem:
Melusine currently works only with python 3.6.
Making Melusine compatible with python 3.7 would make it possible to use data classes.
Data classes would be very relevant for data validation. For instance, custom configurations could be validated by data classes.

Overview of the Solution:
Currently, the main elements preventing Melusine from working with python 3.7 are conflicts between libraries and python versions, in particular:

  • Not all TensorFlow versions are compatible with python 3.8
  • Gensim 4+ requires numpy 1.20+
  • Tensorflow requires numpy ~= 1.19
  • Transformers 3.4 does not work with all versions of python
  • etc.

Definition of Done:
Melusine works with python version 3.7 (or more)

'NeuralModel' object has no attribute 'bert_tokenizer'

Dear all,
I cannot run the following training from an R environment.

from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel

nn_model = NeuralModel(architecture_function=rnn_model,
                        pretrained_embedding=pretrained_embedding,
                        text_input_column="clean_body",
                        meta_input_list=['extension', 'dayofweek','hour', 'min', 'attachment_type'],
                        n_epochs=10)
SyntaxError: unexpected EOF while parsing (<string>, line 2)
Erreur dans py_call_impl(callable, dots$args, dots$keywords) : 
  AttributeError: 'NeuralModel' object has no attribute 'bert_tokenizer'

Detailed traceback:
  File "/usr/lib/python3.6/pprint.py", line 144, in pformat
    self._format(object, sio, 0, 0, {}, 0)
  File "/usr/lib/python3.6/pprint.py", line 161, in _format
    rep = self._repr(object, context, level)
  File "/usr/lib/python3.6/pprint.py", line 393, in _repr
    self._depth, level)
  File "/usr/lib/python3.6/pprint.py", line 405, in format
    return _safe_repr(object, context, maxlevels, level)
  File "/usr/lib/python3.6/pprint.py", line 555, in _safe_repr
    rep = repr(object)
  File "/home/kirus/.local/lib/python3.6/site-packages/sklearn/base.py", line 260, in __repr__
    repr_ = pp.pformat(self)
  File "/usr/lib/python3.6/pprint.py", line 144, in pformat
    self._format(object, sio, 0, 0, {}, 0)
  File "/usr/lib/python3.6/pprint.py", line 161, in _format
    rep = self._repr(object, context, level)
  File "/usr/lib/python3.6/pprint.py", line 393, in _repr
    self._dept

When I import cnn_model class, I got:

from melusine.models.neural_architectures import cnn_model
2021-09-16 11:13:08.221051: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib:/usr/lib/R/lib::/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server
2021-09-16 11:13:08.221084: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Any idea is welcome. Thanks

I have libcudart9.1 and not libcudart.so10.1.

sudo apt list | grep libcudart
libcudart9.1/bionic,now 9.1.85-3ubuntu1 amd64  [installé]

I do not have GPU set up, nor nvidia card

lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Iris Graphics 6100 (rev 09)

Python version : 3.6

Melusine version : 2.3.1

Operating System : Ubuntu 18.04.5 LTS on a MacBook Pro Retina

Python Module versions

kirus@izis:~$ pip3.6 list
Package                 Version
----------------------- -------------------
absl-py                 0.13.0
altair                  4.1.0
appdirs                 1.4.4
apt-xapian-index        0.47
argon2-cffi             21.1.0
asn1crypto              0.24.0
astor                   0.8.1
astunparse              1.6.3
async-generator         1.10
attrs                   21.2.0
backcall                0.2.0
backports.zoneinfo      0.2.1
base58                  2.1.0
bleach                  4.1.0
blinker                 1.4
blis                    0.7.4
cachetools              4.2.2
catalogue               2.0.6
certifi                 2018.1.18
cffi                    1.14.4
chardet                 3.0.4
charset-normalizer      2.0.4
click                   7.1.2
coala-utils             0.7.0
colorama                0.3.9
command-not-found       0.3
contextvars             2.4
cryptography            2.1.4
cupshelpers             1.0
cycler                  0.10.0
cymem                   2.0.5
dataclasses             0.8
dbr                     8.0.1
decorator               5.0.9
defusedxml              0.7.1
dependency-management   0.4.0
distro-info             0.18ubuntu0.18.04.1
Django                  2.0.3
django-sslserver        0.20
en-core-web-sm          3.1.0
entrypoints             0.3
filelock                3.0.12
flashtext               2.7
fr-core-news-md         3.1.0
gast                    0.3.3
gensim                  4.1.0
gitdb2                  2.0.3
GitPython               2.1.9
google-auth             1.35.0
google-auth-oauthlib    0.4.6
google-pasta            0.2.0
grpcio                  1.40.0
h5py                    2.10.0
httplib2                0.11.3
idna                    2.6
immutables              0.16
importlib-metadata      4.8.1
importlib-resources     5.2.2
ipykernel               5.5.5
ipython                 7.16.1
ipython-genutils        0.2.0
ipywidgets              7.6.4
jedi                    0.18.0
Jinja2                  3.0.1
joblib                  1.0.1
jsonschema              3.2.0
jupyter-client          7.0.2
jupyter-core            4.7.1
jupyterlab-pygments     0.1.2
jupyterlab-widgets      1.0.1
Keras-Preprocessing     1.1.2
keyring                 10.6.0
keyrings.alt            3.0
kiwisolver              1.3.1
language-selector       0.1
libzbar-cffi            0.2.1
Markdown                3.3.4
MarkupSafe              2.0.1
matplotlib              3.3.4
melusine                2.3.1
mistune                 0.8.4
murmurhash              1.0.5
nbclient                0.5.4
nbconvert               6.0.7
nbformat                5.1.3
nest-asyncio            1.5.1
netifaces               0.10.4
nltk                    3.6.2
notebook                6.4.3
numpy                   1.18.5
oauthlib                3.1.1
olefile                 0.45.1
opt-einsum              3.3.0
packaging               21.0
pandas                  1.1.5
pandocfilters           1.4.3
parso                   0.8.2
pathy                   0.6.0
pbr                     4.0.3
pexpect                 4.2.1
pickleshare             0.7.5
Pillow                  8.3.2
pip                     21.2.4
plotly                  5.3.1
preshed                 3.0.5
prometheus-client       0.11.0
prompt-toolkit          3.0.20
protobuf                3.17.3
ptyprocess              0.7.0
pyarrow                 5.0.0
pyasn1                  0.4.8
pyasn1-modules          0.2.8
pycairo                 1.16.2
pycparser               2.20
pycrypto                2.6.1
pycups                  1.9.73
pydantic                1.8.2
pydeck                  0.6.2
Pygments                2.10.0
PyGObject               3.26.1
pyparsing               2.4.7
PyPrint                 0.2.6
pyrsistent              0.18.0
python-apt              1.6.5+ubuntu0.6
python-dateutil         2.8.2
python-debian           0.1.32
pytz                    2018.3
pyxdg                   0.25
PyYAML                  3.12
pyzbar                  0.1.7
pyzmq                   22.2.1
regex                   2021.8.28
reportlab               3.4.0
requests                2.26.0
requests-oauthlib       1.3.0
requests-unixsocket     0.1.5
rsa                     4.7.2
sacremoses              0.0.45
sarge                   0.1.6
scikit-learn            0.24.2
scipy                   1.5.4
scour                   0.36
SecretStorage           2.3.1
Send2Trash              1.8.0
sentencepiece           0.1.96
setuptools              58.0.4
simplemma               0.3.0
six                     1.16.0
smart-open              5.2.1
smmap                   3.0.4
smmap2                  3.0.1
spacy                   3.1.2
spacy-legacy            3.0.8
srsly                   2.4.1
streamlit               0.88.0
systemd-python          234
tenacity                8.0.1
tensorboard             2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.0
tensorflow              2.3.4
tensorflow-estimator    2.3.0
termcolor               1.1.0
terminado               0.12.1
testfixtures            5.3.1
testpath                0.5.0
thinc                   8.0.10
threadpoolctl           2.1.0
tokenizers              0.9.2
toml                    0.10.2
toolz                   0.11.1
tornado                 6.1
tqdm                    4.62.2
traitlets               4.3.3
transformers            3.4.0
typer                   0.3.2
typing-extensions       3.10.0.2
tzlocal                 3.0
ubuntu-advantage-tools  27.2
ubuntu-drivers-common   0.0.0
ufw                     0.36
unattended-upgrades     0.1
Unidecode               1.3.1
urllib3                 1.22
usb-creator             0.3.3
validators              0.18.2
vboxapi                 1.0
virtualenv              16.2.0
vulture                 0.10
wasabi                  0.8.2
watchdog                2.1.5
wcwidth                 0.2.5
webencodings            0.5.1
Werkzeug                2.0.1
wheel                   0.30.0
whitenoise              3.3.1
widgetsnbextension      3.5.1
wordcloud               1.8.1
wrapt                   1.12.1
xkit                    0.0.0
zipp                    3.5.0

Bug to import connectors - missing path in setup.py


image

Python version : Python 3.6

Melusine version : 2.6.0

Operating System :Linux

use flashtext as a replacement for regex

FlashText :

  • is written in pure Python and has no extra dependencies
  • can extract and replace keywords made up of multiple words
  • offers significant speedups compared to re.sub and re.findall
  • can take "spelling errors" into consideration, via Levenshtein distance

I see melusine uses a lot of regexes for preprocessing/cleaning.
I wonder if this would be useful to melusine.

_get_meta returns no columns in tutorial_7

  • Are you using Mac, Linux or Windows?
    Linux
    Python version :
    Python 3.7.10
    Melusine version :
    latest

Hello Melusine,

I face a problem with the _get_meta method but I think I have a solution.

The _get_meta() function does not work because no columns are ever selected at the following line:

meta_columns_list = [
    col for col in columns_list if col.startswith(tuple(meta_input_list))
]

Indeed, each element of meta_input_list has '__' appended to it by the following line (to select the dummified columns only):

meta_input_list = [col + "__" for col in meta_input_list]

On the following image I'm using the debugger in tutorial_7_models.ipynb with a breakpoint in the _get_meta method.
We can see that even though we ask for meta_input_list=['extension', 'attachment_type', 'dayofweek', 'hour', 'min'],
meta_columns_list ends up empty, because we don't have the dummified columns (we only have the original columns).
image

This is all because the columns are not dummified at this step, and they should be
(to have [extension__1, extension__2, ...] columns).
In other words, this tutorial is missing an important step: the encoding of the metadata.

In fact there is no problem in tutorial09_full_pipeline_quick.ipynb because it has dummified the meta.

So I suggest:
A/ either we assume tutorial07 is not using metadata and we set meta_input_list=[]
B/ or we add the dummified metadata columns to the dataset for this tutorial
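The selection logic can be reproduced in isolation (the column names below are illustrative; the point is the '__' prefix mismatch):

```python
meta_input_list = ["extension", "attachment_type"]
# _get_meta appends '__' so that only dummified columns are kept.
prefixes = tuple(col + "__" for col in meta_input_list)

raw_columns = ["extension", "attachment_type", "clean_body"]
dummified_columns = ["extension__pdf", "extension__jpg", "attachment_type__0"]

# Un-dummified columns never match the '__' prefixes: the selection is
# empty, which is exactly what the debugger shows in tutorial 7.
print([c for c in raw_columns if c.startswith(prefixes)])        # []
# Dummified columns all match, as in tutorial09_full_pipeline_quick.ipynb.
print([c for c in dummified_columns if c.startswith(prefixes)])
```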

Sentiments mutualistes ;)

use joblib for multiprocessing

The current TransformerScheduler class uses a multiprocess version of pandas apply, imported from melusine.utils.multiprocessing.

Scikit-learn already provides a high-level interface for multiprocess/multithread processing over numpy/pandas, accessible from its externals namespace:

from sklearn.externals import joblib

(In recent scikit-learn versions, joblib is a standalone dependency: import joblib.)

Using joblib would reduce the code base and improve compatibility with the sklearn ecosystem.

Stemmer and Emoji Flagger

Description of Problem:
Need a stemmed column and a way to flag emojis in text

Overview of the Solution:
Develop a stemmer
Develop preprocess function to flag emojis

Examples:
Stemmer
["envoye", "courrier"] becomes ["envoy", "courri"]
["semblerait", "trouver"] becomes ["sembl", "trouv"]
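A toy suffix-stripper reproducing the two examples above (purely illustrative; a real implementation would likely use an established French stemmer such as Snowball rather than this hand-picked suffix list):

```python
def toy_stem(token: str) -> str:
    # Longest suffix first, and require a minimal stem length so that
    # short words are left untouched.
    for suffix in ("erait", "er", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print([toy_stem(t) for t in ["envoye", "courrier"]])     # ['envoy', 'courri']
print([toy_stem(t) for t in ["semblerait", "trouver"]])  # ['sembl', 'trouv']
```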

tensorflow2.0

Hello,

Since the setup contains the requirement 'tensorflow>=1.10.0',
the installation I launched installed tensorflow==2.0.0a0.

This causes the issue: module 'tensorflow' has no attribute 'get_default_graph'.
So we need to either downgrade tensorflow (to 1.13.1, for example) or upgrade the code to make it work with TensorFlow 2.0.
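One way to express the downgrade path in setup.py is an upper bound on the requirement (the version bounds are illustrative of the approach, not an official fix):

```python
# Cap the requirement until the code supports the TensorFlow 2.x API,
# so pip can no longer resolve to a 2.0 pre-release.
install_requires = ["tensorflow>=1.10.0,<2.0"]
print(install_requires)
```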
Best regards,

Error with unseen values of metadata during inference

Hi !

We encountered an error during our tests on Melusine.

When we train the pipeline with metadata, in the file "metadata_engineering.py", label encoders from sklearn are trained on the metadata values present in our training dataset of emails.

This associates each string with a numerical value.
For example, the attachment type "JPG" will be associated with the numerical value "4".

When we use the metadata pipeline again for inference, it calls the transform function, which calls the encode_extension function.
In this function, if a value has not been seen during the training of the label encoders, it returns the value "other".

So, if we have already encountered the value during the training of the label encoders, the associated numerical value is returned.
However, if it's a new metadata value, unseen in the training dataset, we get errors like this:
image
This is because the value "other" was never used to train the label encoder, so there is no numerical value associated with it.

We have this problem for the extension of the email address and for the type of attachment.

To fix this error, we need to add the value "other" to the list of metadata values used to train the label encoders.
image

I will join a PR with our modifications.
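The failure mode and the fix can be shown with a minimal stand-in encoder (SimpleLabelEncoder is hypothetical, illustrating the sklearn LabelEncoder behaviour described above):

```python
class SimpleLabelEncoder:
    """Minimal stand-in for a label encoder with an 'other' fallback."""
    def fit(self, values):
        # The fix: include "other" among the training values, so that
        # unseen categories map to a code the encoder actually knows.
        vocabulary = sorted(set(values) | {"other"})
        self.mapping = {value: code for code, value in enumerate(vocabulary)}
        return self

    def transform(self, values):
        # Unseen values fall back to the "other" code instead of raising.
        return [self.mapping.get(v, self.mapping["other"]) for v in values]

encoder = SimpleLabelEncoder().fit(["jpg", "pdf", "png"])
print(encoder.transform(["pdf", "docx"]))  # "docx" was never seen at fit time
```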

Best regards,

Maxime

Python version : 3.9.7

Melusine version : 2.3.4

Operating System : Windows

Extract Body

Python version : 3.8.10

Melusine version : 2.3.2

Operating System : Windows

Hello, I have an issue regarding the segmentation of the body to clean: during segmentation, Melusine tags the CC of the email as body, causing the body_header_extract function to treat the CC as the body to clean.

segmentingBody

extractBody

In the example above, you can see that the part selected as the body (yellow) does not correspond to the actual body of the email (between the blue markers).

Thank you for your help.
Best regards.
Camelia

loading a model


Python version : 3.8.10

Melusine version : 2.3.2

Operating System : Windows

Hello, I am trying to save a model and then load it and test it in another notebook.
Unfortunately I can't find any example of this process in your documentation; I can't figure out how to load the model properly and keep getting an error.

savedModel1

savedModel2

savedModel3

Could you give me some pointers ?

Best regards,
Camelia.

Tensorflow-probability

Description of Problem:
For the existing determinist neural networks, the predict_proba method gives a basic estimation of the probability per class.

def predict_proba(self, X, **kwargs):

With a specific type of neural networks we are able to compute a better uncertainty estimation on the outputs of the models.
For users who care about uncertainty estimation (especially useful for datasets with label errors), this type of model may give the same performance as deterministic neural nets while providing better uncertainty estimation.
The only drawback is we need to choose a prior on the weights of the neural net and it needs more computation to train.

Overview of the Solution:
Using the tensorflow-probability package, we can set up a neural network that returns a distribution over the outputs (not only a point estimate).
For each prediction, this estimated distribution gives us:

  • A point estimate (for example, the mean of the distribution): much the same as the existing predict_proba method
  • An estimate of the uncertainty around this prediction (for example, a standard-deviation interval under a Gaussian assumption)

Examples:
Using the Melusine tutorial, instead of just having the point estimate from the predict or predict_proba methods, we can have upper and lower bounds on the estimated probabilities.

In this example the category is "vehicle" and the model finds the right category with a good score, but it also modulates the interval around this probability estimate. We can choose the confidence level of this interval (in this example: 95% under a Gaussian assumption). This approach is highly recommended for critical processes where uncertainty is key. It can also help us find errors linked to mislabelling, or more generally noise in the data.
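The interval computation described above can be sketched as follows (gaussian_interval is a hypothetical helper, not part of Melusine):

```python
import numpy as np

def gaussian_interval(mean, std, z=1.96):
    # 95% confidence interval around a predicted probability,
    # under a Gaussian assumption, clipped to [0, 1].
    lower = np.clip(mean - z * std, 0.0, 1.0)
    upper = np.clip(mean + z * std, 0.0, 1.0)
    return lower, upper

lower, upper = gaussian_interval(np.array([0.9]), np.array([0.04]))
```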

Blockers:

  • Warning about the dependency on tensorflow-probability. In my environment tf-probability is already present thanks to tensorflow, but we could make this dependency optional for anyone who doesn't want it in their environment.
  • The tf-probability versions of cnn_model, rnn_model, transformers_model, etc. will look very much like the existing architectures. To be compatible with NeuralModel, I can just propose new functions that look exactly the same but with small modifications. If the architectures were cut into macro-blocks (embedding/Conv/RNN/Transformer/Outputs), we could avoid the copy-paste I'm about to do.

Definition of Done:
Add new models in neural_architecture (like cnn_model but with tf-probability capabilities)

compatible with the existing NeuralModel class
class NeuralModel(BaseEstimator, ClassifierMixin):

The predict_proba method of this new type of model will provide a better estimation of predicted probability and upper bounds / lower bounds.

I'm currently working on it. Happy to discuss this topic.

Create a class to connect to gmail mailbox as Exchange does

Description of Problem:

Currently, only one connector is implemented, the one for exchange boxes. Many people have Gmail mailboxes, and Melusine would be more interesting for occasional users if a connector allowed them to simply connect to their mailbox and then use the full power of the Melusine framework.

Overview of the Solution:

Copy Paste ExchangeConnector and adapt using Gmail architecture and API.

Examples:

gc = GmailConnector()
df = gc.get_emails(5)
df["target"] = ["TRASH", "TRASH", "USECASE1", "USECASE2", "TRASH"]
gc.route_emails(df)

Blockers:

None

Definition of Done:

GmailCollector class implemented, unit tested and functionally tested

Cross Validation

Python version : 3.8.10

Melusine version : 2.3.2

Operating System : Windows

Hello,
I would like to use cross-validation to validate my results with the neural network, but I get this error:

erreur1

erreur2

Is Melusine not compatible with cross_val ?
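For reference, since NeuralModel subclasses sklearn's BaseEstimator and ClassifierMixin, it should in principle plug into cross_val_score like any sklearn estimator. The sketch below shows the intended pattern, with a plain sklearn classifier standing in for NeuralModel:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy data standing in for the prepared email features.
X, y = make_classification(n_samples=100, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
```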

Best regards,
Camelia

text_body error with our server version

Hey !

We had an issue when we tried to get the mails from the mailbox server using get_emails. It seems it's once again due to the version of our mailbox server.

Capture_error_get_emails

It seems that the text_body attribute is not available in older versions of Exchange. To fix that, we made two changes in melusine/connectors/exchange.py:

image
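The fallback pattern can be sketched like this (a hypothetical helper illustrating the idea, not the exact patch in exchange.py):

```python
def safe_body(item):
    # Older Exchange servers (e.g. Exchange 2010) may not expose
    # text_body; fall back to the plain body attribute instead.
    return getattr(item, "text_body", None) or getattr(item, "body", "")
```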

Mailbox Server : Microsoft Exchange Server 2010

Python version : 3.8.12

Melusine version : 2.3.1

Operating System : Windows

Github action for version upgrade

Description of Problem:
We are currently using Travis CI to upgrade and deploy new versions of Melusine on Pypi.
However, Travis CI has now restricted access for the free version.
OSS by MAIF has chosen to use Github Actions to deploy each new version of our open source libraries.

Overview of the Solution:
Github Actions is directly accessible through Github. Et voilà!

Definition of Done:
Deployment works with Github Actions.
We have created a main.yml file which configures the Github Actions.
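As a sketch, such a workflow could look like the following (assuming a standard PyPI token flow; the exact job names and steps in Melusine's main.yml may differ):

```yaml
# Hypothetical main.yml: publish to PyPI when a version tag is pushed
name: release
on:
  push:
    tags: ["v*"]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install build && python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}
```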

X_input in case of BertModel with no Meta

Hello,

Just a doubt that needs clarification.

else:
    X_seq, X_attention = self._prepare_bert_sequences(X)
    X_meta, nb_meta_features = self._get_meta(X)
    if nb_meta_features == 0:
        X_input = [X_seq, X_meta]

  • architecture_function is a Bert model so X_attention exists
  • there is no meta features (because nb_meta_features==0)

But the X_input is : X_input = [X_seq, X_meta]

Shouldn't it be X_input = [X_seq, X_attention] ?

SSL Certificate Error

Hey !

When we tried to connect to the server where we want to get our mails, we got some errors due to SSL Certificate Error :

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)

MaxRetryError: HTTPSConnectionPool(host='mail.verlingue.fr', port=443): Max retries exceeded with url: /EWS/Exchange.asmx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

SSLError: HTTPSConnectionPool(host='mail.verlingue.fr', port=443): Max retries exceeded with url: /EWS/Exchange.asmx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

TransportError: HTTPSConnectionPool(host='mail.verlingue.fr', port=443): Max retries exceeded with url: /EWS/Exchange.asmx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

This is probably due to the version of the mailbox server we use in our company, because when I tried to connect to an Office 365 mailbox, I didn't have the same issue.
To fix that, we added these lines at the beginning of melusine/connectors/exchange.py, and it works.

from exchangelib.protocol import BaseProtocol, NoVerifyHTTPAdapter
BaseProtocol.HTTP_ADAPTER_CLS = NoVerifyHTTPAdapter

Mailbox Server : Microsoft Exchange Server 2010

Python version : 3.8.12

Melusine version : 2.3.1

Operating System : Windows

Wrapping lemmatizers

Hello Melusine Team,

As discussed together, here is our issue before the pull request for our lemmatizers wrappers.

We would like to contribute to Melusine with different lemmatizers wrapped in sklearn transformers objects.

Overview of the Solution:
We wrapped the different lemmatizers provided by Spacy, as well as the Lefff lemmatizer, in sklearn transformer objects.
These objects are compatible with the TransformerScheduler as well as sklearn pipelines.
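To give an idea of the pattern, here is a minimal sketch with hypothetical names (not the actual classes in the pull request):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LemmatizerTransformer(BaseEstimator, TransformerMixin):
    # Wraps any callable lemmatizer (e.g. a Spacy or Lefff pipeline)
    # so it can be used in a TransformerScheduler or sklearn Pipeline.
    def __init__(self, lemmatize):
        self.lemmatize = lemmatize

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return [self.lemmatize(text) for text in X]
```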

Best Regards,

Unable to open configurations in Mélusine library

Pull Request Title:

Unable to open configurations in Mélusine library

Pull Request Description:

I am trying to use the Mélusine library to configure a pipeline, but I am unable to open any configuration file. I have tried using the demo_pipeline example, but nothing works.

Here are the steps I have taken:

  • I installed the Mélusine library using pip.
  • I created a new Python file and imported the Mélusine library.

The error message is:

KeyError                                  Traceback (most recent call last)
Cell In[869], line 3
      1 from melusine.pipeline import MelusinePipeline
----> 3 pipeline = MelusinePipeline.from_config(config_key="demo_pipeline")

File ~/anaconda3/envs/.../lib/python3.8/site-packages/melusine/pipeline.py:195, in MelusinePipeline.from_config(cls, config_key, config_dict, **kwargs)
    193 # Get config dict
    194 if config_key and not config_dict:
--> 195     raw_config_dict = config[config_key]
    196     config_dict = cls.parse_pipeline_config(raw_config_dict)
    198 elif config_dict and not config_key:

File ~/anaconda3/envs/.../lib/python3.8/site-packages/melusine/config/config.py:104, in __getitem__(self, key)

KeyError: 'demo_pipeline'

I am not sure what I am doing wrong. Can you please help me?

Additional Information:

  • I am using Python 3.8.18
  • I am using the latest version of the Mélusine library.

Actual Behavior:

An error is raised.

Request:

Please help me to troubleshoot this issue.

Deal_with_drafts

Hey !

We found an error while using Melusine.

When we use the function get_emails() to load the emails, it will raise an error if it encounters a draft.
image

Some drafts have no sender or datetime_sent attributes, so it will raise an error in the function _extract_email_attributes().

I will join a PR with the solution we proposed to deal with drafts in the file tree.
For each email, we check whether it has datetime_sent and sender attributes. If not, we replace these attributes with None values.
image
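The idea can be sketched as follows (hypothetical helper name; the actual change lives in _extract_email_attributes):

```python
def extract_email_attributes(item):
    # Drafts may lack sender/datetime_sent; fall back to None
    # instead of raising an AttributeError.
    return {
        "sender": getattr(item, "sender", None),
        "datetime_sent": getattr(item, "datetime_sent", None),
    }
```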

If you have other ideas to deal with this issue, we are listening !

We already told the users of the mailbox to delete drafts in the file tree, or to move them to the "Brouillon" folder.
But it may be safer to have a way to deal with a draft forgotten in the file tree, instead of having our DL-model training script (which can take a long time) fail because of one draft.

Best regards,

Maxime

Python version : 3.9.7

Melusine version : 2.3.4

Operating System : Windows

Non-existent TransformerScheduler class

Hello,

I've encountered an issue when trying to use the transformer_scheduler class as referenced in the README.md file of the repository. The documentation suggests using this class as part of the preprocessing pipeline setup example, but I cannot find the actual class implementation in the codebase.

Steps to Reproduce:

Look through the README.md file and attempt the Pre Processing pipeline snippet.
Attempt to locate the transformer_scheduler class in the repository.

Expected Behavior:
The transformer_scheduler class should be found within the repository or the documentation should provide accurate guidance on how to access this class.

Actual Behavior:
The transformer_scheduler class is not present in the repository, and there is no indication that it is part of an external dependency.

Screenshots:
transformer_scheduler
transformer_scheduler2

Python Version: 3.9.17
Operating System: Windows 11

[BUGFIX] Save Bert model with embeddings matrix

Hello everyone. Thank you for this amazing project. I would like to report a bug I have on my project.

When I'm saving the Bert model with this code:

import joblib
_ = joblib.dump(nn_model,"./data/nn_model.pickle",compress=True)

I'm getting this error:

c:\users\cgoncalves\documents\melusine_fork\melusine\melusine\models\train.py in __getstate__(self)
    215         if "model" in dict_attr:
    216             del dict_attr["model"]
--> 217             del dict_attr["embedding_matrix"]
    218             del dict_attr["pretrained_embedding"]
    219         return dict_attr

KeyError: 'embedding_matrix'

The model I use is:

memory_start = get_available_memory()
CamemBert_model = NeuralModel(
    architecture_function=bert_model,
    pretrained_embedding=pretrained_embedding,
    text_input_column="clean_text",
    meta_input_list=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'],
    n_epochs=1,
    bert_tokenizer='jplu/tf-camembert-base',
    bert_model='jplu/tf-camembert-base',
)
training_start = time.time()
CamemBert_model.fit(X, y)
training_end = time.time()
CamemBert_memory = memory_start - get_available_memory()
CamemBert_memory = round(CamemBert_memory / 1e9 * 1024, 1)
print('CamemBert is using {} Mb memory (RAM).'.format(str(CamemBert_memory)))

Python version : 3.7

Melusine version : 2.3.2 (latest)

Operating System : windows 10

Tensorflow NotImplementedError error with Camembert

Hi,

I think I have detected a bug when saving the model (save_nn_model / load_nn_model) in the melusine.models.train module with the pretrained Camembert hidden layer. I received a Python NotImplementedError from the Tensorflow library (from get_config). I think this is because TFCamembertModel is a subclassed network and not a graph network. It seems there are answers in this topic.

Best regards

Modernize the Melusine tokenizer

Description of Problem:
Depending on the context, tokenization may cover different functionalities. For example:

  • Gensim (gensim.utils.tokenize) : Tokenization is limited to splitting text into tokens
  • HuggingFace tokenizers (encode methods) : Full NLP tokenization pipeline including Text normalization, Pre-tokenization, Tokenizer model and Post-processing.

Tokenization in Melusine is currently a hybrid which covers the following functionalities:

  • Splitting
  • Stopwords removal
  • Name flagging

It seems to me that the Full NLP tokenization pipeline is a bit spread across the Melusine package (prepare_data.cleaning, nlp_tools.tokenizer and even the prepare_data method of the models.train.NeuralModel).

This issue can be split into a few questions:

  • How can we refactor the code to make the full tokenization pipeline stand out?
  • How can we easily configure the tokenization pipeline? (e.g. a user-friendly and readable tokenizer.json file)
  • How can we package the tokenizer to ensure repeatability?

Overview of the Solution:
I suggest to create a revamped MelusineTokenizer class with its load and save method.
The class should neatly package many functionalities commonly found in a "Full NLP Tokenization pipeline" such as:

  • Text cleaning ?
  • Flagging (phone numbers, email addresses, etc)
  • Splitting
  • Stopwords removal

The tokenizer could be saved and loaded from a human readable "json" file.

Examples:

tokenizer = MelusineTokenizer(tokenizer_regex, stopwords, flags)
tokens = tokenizer.tokenize("Hello John how are you")
tokenizer.save("tokenizer.json")
tokenizer_reloaded = MelusineTokenizer.load("tokenizer.json")

Definition of Done:
The new tokenizer class works fine.
The tokenizer can be read from / saved into a human readable config file
The tokenizer centralizes all tokenization functionalities in the broader sense.

structure_email crashes when build_historic returns an empty text --> {text:""}

Hello, I cannot get rid of this error with the following complaint email. I tried to update the config but it did not work (I added "Expéditeur" to transition_list). I think there should be a control/warning to signal an empty "text" as the result of build_historic(row).

Any idea? Is this due to the config file needing to be updated in a different way? Thanks a lot.

"Expéditeur : Moi Reçu le : 04/01/2021 à 14:08:15 Bonjour, Dans votre message ci-dessous, vous attirez notre attention sur un chèque n'apparaissant pas au crédit de votre compte. Afin de pouvoir répondre au mieux à votre demande, nous avons besoin des éléments suivants : - le nom de l'émetteur - le nom et l'adresse de la banque - le numéro de la formule de chèque - la copie lisible du bordereau de dépôt \xa0 Restant à votre disposition, nous vous remercions de votre confiance. Cordialement, La Banque Postale La Banque Postale - Société Anonyme à Directoire et Conseil de Surveillance au capital de 6 585 350 218 euros Siège social et adresse postale : 115, rue de Sèvres - 75 275 Paris Cedex 06 RCS Paris 421 100 645 - Code APE 6419Z, intermédiaire d'assurance, immatriculé à l'ORIAS sous le n° 07 023 424 De : NAME SURNAME Le : 31 décembre 2020 18:55:30 Objet : Dépot de chèques Bonsoir, et bonnes fêtes ! Je me permets de vous contacter car j'ai fait un dépôt de chèques le 19/12/20 d'un montant de 1954,15 euros à l'Agence Postale de Drap (06). A ce jour mon compte n'a toujours pas été crédité. Pourriez-vous s'il vous plait faire de recherches pour résoudre ce soucis. Merci et meilleurs vœux. NAME SURNAME;"

image

image
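As suggested above, a defensive check before structuring could look like this (a hypothetical sketch; the field names are assumed from the build_historic output shown in the screenshots):

```python
import warnings

def has_usable_text(structured_historic):
    # Warn instead of crashing downstream when build_historic
    # produced only empty {"text": ""} parts.
    texts = [part.get("text", "") for part in structured_historic]
    if not any(t.strip() for t in texts):
        warnings.warn("build_historic returned an empty text for this email")
        return False
    return True
```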

ERROR: [Conda] Cannot install melusine because package versions have conflicting dependencies

Hello Melusine developers,

Thanks a lot for your work.
I succeeded in installing Melusine in a Google Colab environment, but I am facing an installation error in my local conda environment.
An error about TensorFlow is raised, even though TensorFlow is not installed in my environment yet (see pip freeze below).

Here is the error traceback following pip install melusine, while attempting to collect Keras>=2.2.0:


The conflict is caused by:
    melusine 2.2.6 depends on tensorflow>=2.0.0
    melusine 2.2.5 depends on tensorflow>=2.0.0
    melusine 2.2.1 depends on tensorflow>=2.0.0
    melusine 2.2.0 depends on tensorflow>=2.0.0
    melusine 2.1.0 depends on tensorflow>=2.0.0
    melusine 2.0.4 depends on tensorflow>=2.0.0
    melusine 2.0.3 depends on tensorflow>=2.0.0
    melusine 1.11.1 depends on tensorflow>=2.0.0
    melusine 1.11.0 depends on tensorflow>=2.0.0
    melusine 1.10.0 depends on tensorflow>=2.0
    melusine 1.9.6 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.9.5 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.9.4 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.9.3 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.9.2 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.9.1 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.9.0 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.9 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.8 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.7 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.6 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.5 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.4 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.3 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.1 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.8.0 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.7.1 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.7.0 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.6.1 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.6.0 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.5.5 depends on tensorflow<=1.13.1 and >=1.10.0
    melusine 1.5.4 depends on tensorflow>=1.10.0
    melusine 1.5.2 depends on tensorflow>=1.10.0
    melusine 1.5.1 depends on tensorflow>=1.10.0
    melusine 1.5.0 depends on tensorflow>=1.10.0
    melusine 1.4.0 depends on tensorflow>=1.10.0
    melusine 1.2.0 depends on tensorflow>=1.10.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

Output of pip freeze:

aiohttp @ file:///home/conda/feedstock_root/build_artifacts/aiohttp_1610358552152/work
alabaster==0.7.12
async-timeout==3.0.1
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1605083924122/work
awscli==1.19.3
Babel==2.9.0
beautifulsoup4 @ file:///tmp/build/80754af9/beautifulsoup4_1601924105527/work
blinker==1.4
botocore==1.20.3
brotlipy==0.7.0
cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1611555765219/work
certifi==2020.12.5
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1606601120025/work
chardet @ file:///home/conda/feedstock_root/build_artifacts/chardet_1602255302154/work
click==7.1.2
colorama==0.4.3
coverage==5.4
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1610338668572/work
docutils==0.15.2
flake8==3.8.4
google-api-core==1.25.1
google-api-python-client==1.12.8
google-auth @ file:///home/conda/feedstock_root/build_artifacts/google-auth_1608136875028/work
google-auth-httplib2==0.0.4
google-auth-oauthlib @ file:///home/conda/feedstock_root/build_artifacts/google-auth-oauthlib_1603996258953/work
googleapis-common-protos==1.52.0
gspread @ file:///home/conda/feedstock_root/build_artifacts/gspread_1594582188011/work
httplib2==0.19.0
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1593328102638/work
imagesize==1.2.0
Jinja2==2.11.3
jmespath==0.10.0
MarkupSafe==1.1.1
mccabe==0.6.1
multidict @ file:///home/conda/feedstock_root/build_artifacts/multidict_1610318999200/work
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1612116912722/work
oauthlib==3.0.1
packaging==20.9
pandas==1.2.1
protobuf==3.14.0
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycodestyle==2.6.0
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1593275161868/work
pyflakes==2.2.0
Pygments==2.7.4
PyJWT @ file:///home/conda/feedstock_root/build_artifacts/pyjwt_1610910308735/work
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1608055815057/work
pyparsing==2.4.7
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1610291451001/work
python-dateutil==2.8.1
python-dotenv==0.15.0
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1612179539967/work
PyYAML==5.3.1
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1608156231189/work
requests-oauthlib @ file:///home/conda/feedstock_root/build_artifacts/requests-oauthlib_1595492159598/work
rsa==4.5
s3transfer==0.3.4
six @ file:///home/conda/feedstock_root/build_artifacts/six_1590081179328/work
snowballstemmer==2.1.0
soupsieve==2.0.1
Sphinx==3.4.3
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
-e [email protected]:Galsor/mailizer.git@0f252e34881382910d191baa4a6aab794444003c#egg=src
typing-extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1602702424206/work
uritemplate==3.0.1
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1611695416663/work
yarl @ file:///home/conda/feedstock_root/build_artifacts/yarl_1610354135407/work

Any ideas on how to resolve this?

Cleanup Melusine Configurations

Description of Problem:
The way configs are handled in Melusine is not ideal:

  • A ConfigJsonReader class object is instantiated at the beginning of each module
  • The conf files are read multiple times
  • Using custom configurations is more complex than it should be
  • The csv format for the names file is not robust

Overview of the Solution:
The following changes are suggested:

  • Load the configs once and import them in modules with a simple import statement
  • Use an env variable to specify a folder containing custom configuration files
  • Change the format of the names file

Examples:

os.environ["MELUSINE_CONFIG_DIR"] = "path/to/custom/config/dir"
from melusine import config
print(config["words_list"])

NeuralModel loading method doesn't work for CNNs

Python version :
3.6

Melusine version :
2.3.1

Operating System :
Mac

Steps to reproduce :
The code below does the following:

  • Train a model
  • Save the model
  • (Re)Load the model
  • Predict with both the original model and the reloaded model

The predictions should be equal but they are different

from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel
from melusine.nlp_tools.embedding import Embedding

X, y = MY_DATA
pretrained_embedding = Embedding().load("MY_EMBEDDING")

model = NeuralModel(
    architecture_function = cnn_model,
    pretrained_embedding = pretrained_embedding,
)
model.fit(X, y)
model.save_nn_model("MY_MODEL")

# Load the saved model
loaded_model = NeuralModel(
    architecture_function = cnn_model,
    pretrained_embedding = pretrained_embedding,
)
loaded_model.load_nn_model("MY_MODEL")

y_pred_base = model.predict(X)
y_pred_loaded = loaded_model.predict(X)

print((y_pred_base == y_pred_loaded).all())

Bug explanation :
When making a prediction (NeuralModel.predict), the NeuralModel uses the attribute vocabulary_dict.
This attribute is an empty dict at model initialization and it is filled when the model is fitted.

The vocabulary_dict attribute is not saved by the save_nn_model method,
therefore when the model is loaded and the predict method is called, all the tokens are mapped to the unknown token and the predicted values make no sense.

Suggestions :
Quick Fix : If you are in a rush, you can use the following code to save and load the model:

import joblib
model.save_nn_model("MY_MODEL")
joblib.dump(model, "MY_MODEL.pkl", compress=True)

loaded_model = joblib.load("MY_MODEL.pkl")
loaded_model.load_nn_model("MY_MODEL")

The NeuralModel save and load methods should be refactored (save all the relevant attributes and make load a class method) to fix the bug.

However, the current NeuralModel class is far from current Deep Learning standards (ex: PyTorch Lightning modules) and could be refactored entirely.
In particular, we could separate the model itself from the trainer class (Training attributes are unnecessary when doing inference).
