maif / melusine
📧 Melusine: Use python to automatize your email processing workflow
Home Page: https://maif.github.io/melusine
License: Other
Description of Problem:
To make email routing effective, Melusine has to be linked to a mailbox system.
Overview of the Solution:
A connector to Exchange, using the Python package exchangelib, is being considered.
Hello, I am writing this report hoping for some help.
I have been trying to work on your project for a while but have had difficulties due to a proxy on my laptop. The workaround I found to install the Melusine library was to download all the libraries on another laptop and then copy them onto mine.
I have yet to finish the installation of jupyter, so when I finally had access to melusine I ran some tests to make sure it was working. I wrote a .py file with the code below, ran it in my terminal, and got the error in the screenshot below.
I don't understand the error, so I wanted to know if the problem comes from how I installed melusine or from the fact that I don't have jupyter yet.
from melusine.data.data_loader import load_email_data
df_emails = load_email_data()
print(df_emails[1])
Python version :
Melusine version :
Operating System :
Originally posted by aagostinelli86 January 3, 2023
Hello @TFA-MAIF I have an issue with the tutorial notebook n 13, cell 33 (Camembert model): "OSError: Model name 'jplu/tf-camembert-base' was not found in tokenizers model name list (camembert-base). We assumed 'jplu/tf-camembert-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['sentencepiece.bpe.model'] but couldn't find such vocabulary files at this path or url."
I think it's due to the proxy, so I downloaded the pretrained model locally (config.json + .tf_model.h5) and I specified its path in your NeuralModel wrapper. The issue persists, with this different message: "OSError: Model name 'C:\Users\vgkj536\git_projects\melusine\tutorial\jplu\tf-camembert-base' was not found in tokenizers model name list (camembert-base). We assumed 'C:\Users\vgkj536\git_projects\melusine\tutorial\jplu\tf-camembert-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['sentencepiece.bpe.model'] but couldn't find such vocabulary files at this path or url."
By contrast, if I call TFCamembertModel.from_pretrained() directly with this local path, it works well. Is this due to a bug?
When installing with pip, melusine/config/config/conf.json is missing.
This is because the glob in setup.py only captures JSON files in sub-folders of config and skips JSON files in the config folder itself.
It can be fixed by using glob.glob("melusine/config/**/*.json", recursive=True) instead of glob.glob("melusine/config/**/*.json").
I will create a pull request for this issue.
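The difference between the two glob calls can be verified in isolation (the file names below are made up for the demonstration):

```python
import glob
import os
import tempfile

# Reproduce the packaging bug: without recursive=True, "**" behaves like a
# single "*", so JSON files directly inside config/ are skipped.
with tempfile.TemporaryDirectory() as root:
    cfg = os.path.join(root, "melusine", "config")
    os.makedirs(os.path.join(cfg, "sub"))
    open(os.path.join(cfg, "conf.json"), "w").close()          # top-level file
    open(os.path.join(cfg, "sub", "names.json"), "w").close()  # nested file

    pattern = os.path.join(root, "melusine", "config", "**", "*.json")
    shallow = glob.glob(pattern)               # misses conf.json
    deep = glob.glob(pattern, recursive=True)  # finds both files

    print(len(shallow), len(deep))  # 1 2
```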
Hi !
There is an issue with this type of datetime in the 'date' column: 2021-06-24T13:02:43+00:00. Indeed, the MetaDate class expects the date to have this format: regex_date_format=r"\w+ (\d+) (\w+) (\d{4}) (\d{2}) h (\d{2})". When our date does not fit that format, the except clause returns the raw value, so we could expect processing to continue; but the to_datetime call just after it expects the format date_format="%d/%m/%Y %H:%M". In our case, to_datetime:
pd.to_datetime(X['date'],
               format=self.date_format,
               infer_datetime_format=False,
               errors="coerce",
               )
returns NaT.
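The behaviour can be reproduced with pandas alone (a minimal sketch; `date_format` here is the value MetaDate expects):

```python
import pandas as pd

date_format = "%d/%m/%Y %H:%M"   # the format MetaDate expects
iso_date = "2021-06-24T13:02:43+00:00"

# errors="coerce" silently turns any non-matching date into NaT
coerced = pd.to_datetime(iso_date, format=date_format, errors="coerce")
print(coerced)  # NaT

# letting pandas infer the format parses the ISO 8601 date correctly
parsed = pd.to_datetime(iso_date)
print(parsed)
```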
Python version :
3.7
Melusine version :
2.3.1
Operating System :
Windows 10
Add release notes for Melusine 2.3
New features:
- ExchangeConnector class to interact with an Exchange mailbox
- tutorial14_exchange_connector to demonstrate the usage of the ExchangeConnector class
Updates:
Hey !
I have tried to reinstall Melusine since I had version 2.3.1 from October.
I saw on GitHub that changes have been made to the NLP tools (the Phraser object for example).
First, I uninstalled melusine with: pip uninstall melusine.
Then I ran: pip install melusine. It downloaded melusine 2.3.2.
However, when I tried to use the new version of the Phraser class, there was a mistake. The code downloaded with pip still has the old version of the file "phraser.py". So there are differences between the Melusine code on GitHub and the code we can download with pip.
The ...\Lib\site-packages\melusine\nlp_tools\phraser.py that I have in my virtual env:
The phraser.py file on GitHub:
Is this normal? Am I the only one to have this problem?
Thank you !
Hey !
We had a problem with the attachment type in metadata. As you can see in the screenshot below, we had only two values after applying our metadata pipeline: 0 for the presence of an attachment file and 1 if there is no attachment file in the mail. The screenshot is an extract of the DataFrame called df_emails.
Here is the way we create our pipeline and how we apply it on our emails :
Metadatapipeline = Pipeline([
    ('MetaExtension', MetaExtension()),
    ('MetaDate', MetaDate()),
    ('MetaAttachmentType', MetaAttachmentType()),
    ('Dummifier', Dummifier(columns_to_dummify=['extension', 'attachment_type', 'dayofweek', 'hour', 'min'])),
])
df_meta = Metadatapipeline.fit_transform(df_emails)
Then, this is the function which is supposed to extract the type of the attachment file in melusine/prepare_email/metadata_engineering.py:
We added some prints to understand the problem. As you can see, when there is at least one attachment file in the mail, x is a str, and when there is no attachment file, x is nan.
When the function has to deal with a mail with an attachment file, the value of row["attachment"] is a str. For example, we could have "['image002.png', 'image003.jpg']". The for loop then just treats it as a str and iterates over it character by character. This seems to be the reason for our issue.
This seems to solve our issue :
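One way to sketch such a fix, assuming the cell really holds the string repr of a list (the helper name is made up for illustration):

```python
import ast

def get_attachment_types(attachment):
    # Hypothetical fix sketch: the 'attachment' cell may hold the string
    # repr of a list (e.g. "['a.png', 'b.jpg']") or NaN when there is no
    # attachment.
    if not isinstance(attachment, str):
        return []  # NaN / missing -> no attachment
    names = ast.literal_eval(attachment)  # "['a.png']" -> ['a.png']
    return [name.rsplit(".", 1)[-1].lower() for name in names]

print(get_attachment_types("['image002.png', 'image003.jpg']"))  # ['png', 'jpg']
print(get_attachment_types(float("nan")))  # []
```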
Python version : 3.8.12
Melusine version : 2.3.1
Operating System : Windows
In melusine/_config.py, the following code snippet:
def export_default_config(self, path: str) -> List[str]:
    from distutils.dir_util import copy_tree

    source = self.DEFAULT_CONFIG_PATH
    file_list: List[str] = copy_tree(source, path)
    return file_list
imports distutils, which was removed in Python 3.12.
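A distutils-free sketch of the same helper using only the standard library (here `source` and `path` are plain arguments rather than the instance attribute):

```python
import shutil
from pathlib import Path
from typing import List

def export_default_config(source: str, path: str) -> List[str]:
    # shutil.copytree replaces distutils.dir_util.copy_tree;
    # dirs_exist_ok=True (Python 3.8+) keeps the merge-into-existing-directory
    # behaviour that copy_tree provided.
    shutil.copytree(source, path, dirs_exist_ok=True)
    # copy_tree returned the list of copied files, so enumerate them
    return [str(p) for p in Path(path).rglob("*") if p.is_file()]
```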
Python version: 3.12
Melusine version: master@7dc9d99a57d
Hi !
If we want to use new metadata as explained in the tutorials, we may need to modify the Dummifier object.
If we have metadata with a list type, we need to modify the code of the Dummifier to be able to use our new metadata, as is done for the list of attachment types.
For example, this is the current code of the fit() method of the Dummifier:
In our case, I wanted to add the names of the receivers to the metadata used for training our model.
To do that, I needed to handle, in the Dummifier object, the case where df_emails has a receivers column.
It was not very clean, so I modified the Dummifier object to make it more flexible.
In the PR that I will link, the Dummifier object gets a new parameter: a list of the columns with a list type.
We then apply the same functions to these columns.
I hope I was clear.
Best regards,
Maxime
Python version : 3.9.7
Melusine version : 2.3.4
Operating System : Windows
Hi !
I think I have detected a bug in the emails metadata structure. Indeed, I trained a model with a specific metadata dataset structure (see the first picture attached). But when I try to predict a result with a dataset which has a bigger structure, the following error appears:
"ValueError: Input 0 of layer dense_2 is incompatible with the layer: expected axis -1 of input shape to have value 9 but received input with shape [None, 10]"
I attach an example of this dataset (the second picture) so you can see what I mean. I think it could be a problem coming from tensorflow, but I'm not sure.
Python version :
3.8
Melusine version :
2.3.1
Operating System :
Windows 10
Hello,
Description of Problem: We wanted to get the values of metrics (like accuracy) from the train function. That would let us plot the evolution of the accuracy over the epochs of training, or take the last value as the accuracy on the training dataset.
Overview of the Solution: To do that, we just need to store the output of the fit function inside Melusine's train function, and then return that output.
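The change can be illustrated on a minimal stand-in (a hypothetical class, not Melusine's NeuralModel; `_fit` fakes the per-epoch metrics a Keras History would carry):

```python
class TinyTrainer:
    """Hypothetical stand-in illustrating the proposed change: keep and
    return the object produced by fit() so callers can inspect metrics."""

    def _fit(self, x, y):
        # stand-in for keras Model.fit: returns per-epoch metrics
        return {"accuracy": [0.6, 0.7, 0.8]}

    def train(self, x, y):
        self.history = self._fit(x, y)  # keep it on the instance...
        return self.history             # ...and return it to the caller

history = TinyTrainer().train(None, None)
print(history["accuracy"][-1])  # 0.8 -> accuracy on the last epoch
```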
Hello !
Your project looks awesome and I really want to try it.
I see in your project that you use very clean text in your body data.
In "real" email life, the content of a mail body is very dirty (HTML, encoding, formatting, multipart, different languages...).
How do you handle that? (Or maybe you work only with your internal company emails?)
Faithfully
Description of Problem:
Currently, Melusine only works with transformers 3.4.0 (or no transformers at all).
Latest releases of the transformers library are thus not available (including latest models).
Overview of the Solution:
Make the Melusine code compatible with transformers>=4
Definition of Done:
Melusine works without the pinned transformers==3.4.0 version
Hi !
I tried to connect to Exchange with the ExchangeConnector class from melusine. But when I ran the program, the following error appeared:
exchangelib.errors.AutoDiscoverFailed: All steps in the autodiscover protocol failed for email '[email protected]'. If you think this is an error, consider doing an official test at https://testconnectivity.microsoft.com
So, I tried to modify the melusine class in order to force the connection. Here is the code:
from exchangelib import Account, Configuration, Credentials, DELEGATE, FaultTolerance, NTLM


class ExchangeConnector:
    def __init__(
        self,
        login_address: str,
        password: str,
        mailbox_address: str = None,
        max_wait: int = 60,
        routing_folder_path: str = None,
        correction_folder_path: str = None,
        done_folder_path: str = None,
        target_column="target",
    ):
        """
        Parameters
        ----------
        login_address: str
            Email address used to login and send emails.
        password: str
            Password to login to the Exchange mailbox
        mailbox_address: str
            Email address of the mailbox. By default, the login address is used
        max_wait: int
            Maximum time (in s) to wait when connecting to mailbox
        routing_folder_path: str
            Path of the base routing folder
        correction_folder_path: str
            Path of the base correction folder
        done_folder_path: str
            Path of the Done folder
        target_column: str
            Name of the DataFrame column containing target folder names
        """
        self.login_address = login_address
        self.mailbox_address = mailbox_address or login_address
        self.folder_list = None
        self.target_column = target_column

        # Connect to mailbox
        self.credentials = Credentials(self.login_address, password)
        # Add the server name and the auth_type. These parameters can be
        # found in Outlook with 'auto configuration test'.
        self.exchangelib_config = Configuration(
            server='mail.company.domain',
            retry_policy=FaultTolerance(max_wait=max_wait),
            credentials=self.credentials,
            auth_type=NTLM,
        )

        # Mailbox account (Routing, Corrections, etc)
        self.mailbox_account = Account(
            primary_smtp_address=self.mailbox_address,
            credentials=self.credentials,
            autodiscover=False,  # turn autodiscover off
            config=self.exchangelib_config,
            # access_type can, like the server name and auth_type, be found
            # with 'auto configuration test'
            access_type=DELEGATE,
        )

        # Sender account (send emails)
        self.sender_account = Account(
            primary_smtp_address=self.mailbox_address,
            credentials=self.credentials,
            autodiscover=False,
            config=self.exchangelib_config,
            access_type=DELEGATE,
        )

        # Setup correction folder and done folder
        self.routing_folder_path = routing_folder_path
        self.correction_folder_path = correction_folder_path
        self.done_folder_path = done_folder_path

        logger.info(
            f"Connected to mailbox {self.mailbox_address} as user {self.login_address}"
        )
With this new code the connection should now succeed. It is still possible that an SSL certificate error pops up; to get around that, you have to disable HTTPS certificate verification like this:
from exchangelib.protocol import BaseProtocol, NoVerifyHTTPAdapter
BaseProtocol.HTTP_ADAPTER_CLS = NoVerifyHTTPAdapter
Now the connection works!
Python version :
3.8
Melusine version :
2.3.1
Operating System :
Windows 10
Hello !
Description of Problem: We wanted to concatenate clean_header and clean_body to create a clean_text column. As you can see in the screenshot below, we did it after applying the Transformer pipeline. Then, following the examples in the tutorials, we want to use an NLP pipeline with a phraser function and a tokenizer. Our problem: we can apply the tokenizer to the column we want (clean_text), but we can't apply the phraser to this column. Indeed, there are only two phraser functions in melusine (phraser_on_body and phraser_on_header), which apply the phraser to the clean_body and clean_header columns. We can't concatenate clean_body and clean_header after applying their phraser functions, because they are in the same pipeline as the tokenizer function, which can be applied to clean_text.
Overview of the Solution: One possible solution is to create a phraser function for clean_text in "melusine/nlp_tools/phraser.py". It would allow us to apply the phraser to a column named clean_text.
Description of Problem:
Melusine currently works only with Python 3.6.
Making Melusine compatible with Python 3.7 would make it possible to use dataclasses.
Dataclasses would be very relevant for data validation. For instance, custom configurations could be validated by dataclasses.
Overview of the Solution:
Currently, the main elements preventing Melusine from working with Python 3.7 are conflicts between libraries and Python versions, in particular:
Definition of Done:
Melusine works with Python version 3.7 (or higher)
Dear all,
I cannot run the following training from an R environment.
from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel
nn_model = NeuralModel(architecture_function=rnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_body",
                       meta_input_list=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'],
                       n_epochs=10)
SyntaxError: unexpected EOF while parsing (<string>, line 2)
Erreur dans py_call_impl(callable, dots$args, dots$keywords) :
AttributeError: 'NeuralModel' object has no attribute 'bert_tokenizer'
Detailed traceback:
File "/usr/lib/python3.6/pprint.py", line 144, in pformat
self._format(object, sio, 0, 0, {}, 0)
File "/usr/lib/python3.6/pprint.py", line 161, in _format
rep = self._repr(object, context, level)
File "/usr/lib/python3.6/pprint.py", line 393, in _repr
self._depth, level)
File "/usr/lib/python3.6/pprint.py", line 405, in format
return _safe_repr(object, context, maxlevels, level)
File "/usr/lib/python3.6/pprint.py", line 555, in _safe_repr
rep = repr(object)
File "/home/kirus/.local/lib/python3.6/site-packages/sklearn/base.py", line 260, in __repr__
repr_ = pp.pformat(self)
File "/usr/lib/python3.6/pprint.py", line 144, in pformat
self._format(object, sio, 0, 0, {}, 0)
File "/usr/lib/python3.6/pprint.py", line 161, in _format
rep = self._repr(object, context, level)
File "/usr/lib/python3.6/pprint.py", line 393, in _repr
self._dept
When I import the cnn_model class, I get:
from melusine.models.neural_architectures import cnn_model
2021-09-16 11:13:08.221051: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib:/usr/lib/R/lib::/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server
2021-09-16 11:13:08.221084: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Any idea is welcome. Thanks.
I have libcudart9.1 and not libcudart.so.10.1:
sudo apt list | grep libcudart
libcudart9.1/bionic,now 9.1.85-3ubuntu1 amd64 [installé]
I do not have a GPU set up, nor an nvidia card:
lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Iris Graphics 6100 (rev 09)
Python version : 3.6
Melusine version : 2.3.1
Operating System : Ubuntu 18.04.5 LTS on a MacBook Pro Retina
Python Module versions
kirus@izis:~$ pip3.6 list
Package Version
----------------------- -------------------
absl-py 0.13.0
altair 4.1.0
appdirs 1.4.4
apt-xapian-index 0.47
argon2-cffi 21.1.0
asn1crypto 0.24.0
astor 0.8.1
astunparse 1.6.3
async-generator 1.10
attrs 21.2.0
backcall 0.2.0
backports.zoneinfo 0.2.1
base58 2.1.0
bleach 4.1.0
blinker 1.4
blis 0.7.4
cachetools 4.2.2
catalogue 2.0.6
certifi 2018.1.18
cffi 1.14.4
chardet 3.0.4
charset-normalizer 2.0.4
click 7.1.2
coala-utils 0.7.0
colorama 0.3.9
command-not-found 0.3
contextvars 2.4
cryptography 2.1.4
cupshelpers 1.0
cycler 0.10.0
cymem 2.0.5
dataclasses 0.8
dbr 8.0.1
decorator 5.0.9
defusedxml 0.7.1
dependency-management 0.4.0
distro-info 0.18ubuntu0.18.04.1
Django 2.0.3
django-sslserver 0.20
en-core-web-sm 3.1.0
entrypoints 0.3
filelock 3.0.12
flashtext 2.7
fr-core-news-md 3.1.0
gast 0.3.3
gensim 4.1.0
gitdb2 2.0.3
GitPython 2.1.9
google-auth 1.35.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.40.0
h5py 2.10.0
httplib2 0.11.3
idna 2.6
immutables 0.16
importlib-metadata 4.8.1
importlib-resources 5.2.2
ipykernel 5.5.5
ipython 7.16.1
ipython-genutils 0.2.0
ipywidgets 7.6.4
jedi 0.18.0
Jinja2 3.0.1
joblib 1.0.1
jsonschema 3.2.0
jupyter-client 7.0.2
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.1
Keras-Preprocessing 1.1.2
keyring 10.6.0
keyrings.alt 3.0
kiwisolver 1.3.1
language-selector 0.1
libzbar-cffi 0.2.1
Markdown 3.3.4
MarkupSafe 2.0.1
matplotlib 3.3.4
melusine 2.3.1
mistune 0.8.4
murmurhash 1.0.5
nbclient 0.5.4
nbconvert 6.0.7
nbformat 5.1.3
nest-asyncio 1.5.1
netifaces 0.10.4
nltk 3.6.2
notebook 6.4.3
numpy 1.18.5
oauthlib 3.1.1
olefile 0.45.1
opt-einsum 3.3.0
packaging 21.0
pandas 1.1.5
pandocfilters 1.4.3
parso 0.8.2
pathy 0.6.0
pbr 4.0.3
pexpect 4.2.1
pickleshare 0.7.5
Pillow 8.3.2
pip 21.2.4
plotly 5.3.1
preshed 3.0.5
prometheus-client 0.11.0
prompt-toolkit 3.0.20
protobuf 3.17.3
ptyprocess 0.7.0
pyarrow 5.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycairo 1.16.2
pycparser 2.20
pycrypto 2.6.1
pycups 1.9.73
pydantic 1.8.2
pydeck 0.6.2
Pygments 2.10.0
PyGObject 3.26.1
pyparsing 2.4.7
PyPrint 0.2.6
pyrsistent 0.18.0
python-apt 1.6.5+ubuntu0.6
python-dateutil 2.8.2
python-debian 0.1.32
pytz 2018.3
pyxdg 0.25
PyYAML 3.12
pyzbar 0.1.7
pyzmq 22.2.1
regex 2021.8.28
reportlab 3.4.0
requests 2.26.0
requests-oauthlib 1.3.0
requests-unixsocket 0.1.5
rsa 4.7.2
sacremoses 0.0.45
sarge 0.1.6
scikit-learn 0.24.2
scipy 1.5.4
scour 0.36
SecretStorage 2.3.1
Send2Trash 1.8.0
sentencepiece 0.1.96
setuptools 58.0.4
simplemma 0.3.0
six 1.16.0
smart-open 5.2.1
smmap 3.0.4
smmap2 3.0.1
spacy 3.1.2
spacy-legacy 3.0.8
srsly 2.4.1
streamlit 0.88.0
systemd-python 234
tenacity 8.0.1
tensorboard 2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorflow 2.3.4
tensorflow-estimator 2.3.0
termcolor 1.1.0
terminado 0.12.1
testfixtures 5.3.1
testpath 0.5.0
thinc 8.0.10
threadpoolctl 2.1.0
tokenizers 0.9.2
toml 0.10.2
toolz 0.11.1
tornado 6.1
tqdm 4.62.2
traitlets 4.3.3
transformers 3.4.0
typer 0.3.2
typing-extensions 3.10.0.2
tzlocal 3.0
ubuntu-advantage-tools 27.2
ubuntu-drivers-common 0.0.0
ufw 0.36
unattended-upgrades 0.1
Unidecode 1.3.1
urllib3 1.22
usb-creator 0.3.3
validators 0.18.2
vboxapi 1.0
virtualenv 16.2.0
vulture 0.10
wasabi 0.8.2
watchdog 2.1.5
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 2.0.1
wheel 0.30.0
whitenoise 3.3.1
widgetsnbextension 3.5.1
wordcloud 1.8.1
wrapt 1.12.1
xkit 0.0.0
zipp 3.5.0
I see melusine uses a lot of regex for preprocessing/cleaning
I wonder if this would be useful to melusine
Hello Melusine,
I face a problem with the _get_meta method, but I think I have a solution.
The _get_meta() function does not work because the columns are never selected at the following line:
melusine/melusine/models/train.py
Lines 571 to 573 in 10424aa
Indeed, each element of column_list ends with '__' because of the following (meant to select the dummified columns only):
melusine/melusine/models/train.py
Line 564 in 10424aa
In the following image I'm using the debugger in tutorial_7_models.ipynb with a breakpoint in the _get_meta method.
We can see that even if we want to use meta_input_list=['extension', 'attachment_type', 'dayofweek', 'hour', 'min'], meta_input_list ends up empty, because we don't have the dummified columns (we have the original columns).
This is all because the columns are not dummified at this step, and they should be (to have [extension__1, extension__2, ...] columns).
So this tutorial is missing an important step: the encoding of the meta.
In fact there is no problem in tutorial09_full_pipeline_quick.ipynb, because it dummifies the meta.
So I suggest:
A/ either we assume tutorial07 is not using meta and we set meta_input_list=[]
B/ or we add the meta columns to the dataset for this tutorial
Sentiments mutualistes ;)
The current TransformerScheduler class uses a multiprocess version of pandas apply, imported from melusine.utils.multiprocessing.
Actually, scikit-learn already provides a high-level interface for multiprocess/multithread processing over numpy/pandas, accessible from the externals namespace:
from sklearn.externals import joblib
(Note that recent scikit-learn versions removed sklearn.externals.joblib; joblib is now imported directly with import joblib.)
Using joblib would reduce the code base and improve compatibility with the sklearn ecosystem.
Description of Problem:
Need a stemmed column and flag emoji in text
Overview of the Solution:
Develop a stemmer
Develop preprocess function to flag emojis
Examples:
Stemmer
["envoye", "courrier"] becomes ["envoy", "courri"]
["semblerait", "trouver"] becomes ["sembl", "trouv"]
Hello,
Since the setup contains the requirement 'tensorflow>=1.10.0',
the installation I launched installed tensorflow==2.0.0a0.
It causes the issue: module 'tensorflow' has no attribute 'get_default_graph'.
So we need to either downgrade the version of tensorflow (1.13.1 for example) or upgrade the code to make it work with tensorflow 2.0.
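As a sketch, the downgrade could be expressed directly as a version cap in setup.py's install_requires (a hypothetical excerpt; the exact pin is a suggestion):

```python
# Hypothetical setup.py excerpt: cap tensorflow below 2.0 until the code
# is migrated off APIs like tf.get_default_graph().
install_requires = [
    "tensorflow>=1.10.0,<2.0",
]
```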
Best regards,
Hi !
We encountered an error during our tests on Melusine.
When we train the pipeline with metadata, in the file "metadata_engineering.py", label encoders from sklearn are trained on the metadata values present in our training dataset of emails.
This associates a string with a numerical value.
For example, the attachment type "JPG" will be associated with the numerical value "4".
When we use the metadata pipeline again for inference, it calls the function "transform", which calls the function "encode_extension".
In this function, if a value has not been seen during the training of the label encoders, it returns the value "other".
So, if we have already encountered the value during the training of the label encoders, it returns the associated numerical value.
However, if it's a new metadata value, unseen in the training dataset, we get errors like this:
This is because the value "other" wasn't used to train the label encoder, so there is no numerical value associated with it.
We have this problem for the extension of the email address and for the type of attachment.
To fix this error, we need to add the value "other" to the list of metadata values used to train the label encoders.
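The proposed fix can be sketched with a plain dictionary in place of sklearn's LabelEncoder (names are illustrative, not Melusine's code):

```python
def fit_encoder(values):
    # Sketch of the proposed fix: append the fallback label "other" to the
    # training values so unseen categories can be encoded at inference time.
    vocab = sorted(set(values) | {"other"})
    return {v: i for i, v in enumerate(vocab)}

def encode(encoder, value):
    # Unseen values fall back to the code learned for "other".
    return encoder.get(value, encoder["other"])

enc = fit_encoder(["jpg", "png", "pdf"])
print(encode(enc, "jpg"))   # 0 (seen during training)
print(encode(enc, "docx"))  # 1 (unseen -> mapped to "other")
```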
I will join a PR with our modifications.
Best regards,
Maxime
Python version : 3.9.7
Melusine version : 2.3.4
Operating System : Windows
Python version : 3.8.10
Melusine version : 2.3.2
Operating System : Windows
Hello, I have an issue regarding the segmentation of the body to clean: during segmentation, Melusine tags the CC of the email as body, resulting in the body_header_extract function considering the CC as the body to clean.
In the example above, you can see that the part selected as the body (yellow) does not correspond to the actual body of the email (between the blue marks).
Thank you for your help.
Best regards.
Camelia
Python version : 3.8.10
Melusine version : 2.3.2
Operating System : Windows
Hello, I am trying to save a model and then load it and test it in another notebook.
Unfortunately I can't find any example of this process in your documentation; I can't figure out how to load the model properly and keep getting an error.
Could you give me some pointers ?
Best regards,
Camelia.
Description of Problem:
For the existing deterministic neural networks, the predict_proba method gives a basic estimate of the probability per class.
melusine/melusine/models/train.py
Line 357 in 4a0a181
Overview of the Solution:
Using the package tensorflow-probability we can set up a neural network that returns a distribution over the outputs (and not only a point estimate).
For each prediction, this estimated distribution allows us to have:
- a richer predict_proba method
Examples:
Using the Melusine tutorial, instead of just having the point estimate from the predict or predict_proba method, we can have upper and lower bounds on the estimated probabilities.
Blockers:
The architectures are hard-coded in NeuralModel, so I can just propose new functions that look exactly the same but with small modifications. If the architectures were cut into macro-blocks (embedding/Conv/RNN/Transformer/Outputs) then we could avoid the copy-paste that I'm about to do.
Definition of Done:
Add new models in neural_architectures (like cnn_model but with tf-probability capabilities)
melusine/melusine/models/train.py
Line 24 in 4a0a181
The predict_proba method of this new type of model will provide a better estimate of the predicted probability, with upper and lower bounds.
I'm currently working on it. Happy to discuss this topic.
Description of Problem:
Currently, only one connector is implemented: the one for Exchange mailboxes. Many people have Gmail mailboxes, and Melusine would be more interesting for occasional users if a connector allowed them to simply connect to their mailbox and then use the full power of the Melusine framework.
Overview of the Solution:
Copy-paste ExchangeConnector and adapt it to the Gmail architecture and API.
Examples:
gc = GmailConnector()
df = gc.get_emails(5)
df["target"] = ["TRASH", "TRASH", "USECASE1", "USECASE2", "TRASH"]
gc.route_emails(df)
Blockers:
None
Definition of Done:
GmailConnector class implemented, unit tested and functionally tested
Hey !
We had an issue when we tried to get the mails from the mailbox server using get_emails. It seems it's once again due to the version of our mailbox server.
It seems that the text_body attribute is not available for older versions of Exchange. To fix that, we made two changes in melusine/connectors/exchange.py:
Mailbox Server : Microsoft Exchange Server 2010
Python version : 3.8.12
Melusine version : 2.3.1
Operating System : Windows
When using NeuralModel with a dataframe that does not have a 'clean_text' column (e.g. with 'clean_body' instead), it will raise an error.
melusine/melusine/models/train.py
Lines 421 to 423 in 553eb40
Description of Problem:
We are currently using Travis CI to upgrade and deploy new versions of Melusine on PyPI.
However, Travis CI has now restricted access for the free version.
OSS by MAIF has chosen to use GitHub Actions to deploy each new version of our open source libraries.
Overview of the Solution:
GitHub Actions is directly accessible through GitHub. Et voilà!
Definition of Done:
Deployment works with GitHub Actions.
We have created a main.yml file which configures the GitHub Actions.
Hello,
Just a doubt that needs clarification.
melusine/melusine/models/train.py
Lines 379 to 383 in 4a0a181
But the X_input is: X_input = [X_seq, X_meta]
Shouldn't it be X_input = [X_seq, X_attention]?
Hey !
When we tried to connect to the server where we want to get our mails, we got some errors due to an SSL certificate problem:
SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)
MaxRetryError: HTTPSConnectionPool(host='mail.verlingue.fr', port=443): Max retries exceeded with url: /EWS/Exchange.asmx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
SSLError: HTTPSConnectionPool(host='mail.verlingue.fr', port=443): Max retries exceeded with url: /EWS/Exchange.asmx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
TransportError: HTTPSConnectionPool(host='mail.verlingue.fr', port=443): Max retries exceeded with url: /EWS/Exchange.asmx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
This is probably due to the version of the mailbox server we use in our company, because when I tried to connect to an Office 365 mailbox, I didn't have the same issue.
To fix that, we added these lines at the beginning of melusine/connectors/exchange.py, and it works:
from exchangelib.protocol import BaseProtocol, NoVerifyHTTPAdapter
BaseProtocol.HTTP_ADAPTER_CLS = NoVerifyHTTPAdapter
Mailbox Server : Microsoft Exchange Server 2010
Python version : 3.8.12
Melusine version : 2.3.1
Operating System : Windows
Hello Melusine Team,
As discussed together, here is our issue, ahead of the pull request for our lemmatizer wrappers.
We would like to contribute to Melusine with different lemmatizers wrapped in sklearn transformer objects.
Overview of the Solution:
We wrapped in sklearn transformer objects the different lemmatizers provided by spaCy, as well as the Lefff lemmatizer.
These objects are compatible with the TransformerScheduler as well as with sklearn pipelines.
Best Regards,
Unable to open configurations in Mélusine library
I am trying to use the Mélusine library to configure a pipeline, but I am unable to open any configuration file. I have tried using the demo_pipeline example, but nothing works.
Here are the steps I have taken:
The error message is:
KeyError                                  Traceback (most recent call last)
Cell In[869], line 3
      1 from melusine.pipeline import MelusinePipeline
----> 3 pipeline = MelusinePipeline.from_config(config_key="demo_pipeline")
File ~/anaconda3/envs/.../lib/python3.8/site-packages/melusine/pipeline.py:195, in MelusinePipeline.from_config(cls, config_key, config_dict, **kwargs)
    193 # Get config dict
    194 if config_key and not config_dict:
--> 195     raw_config_dict = config[config_key]
    196     config_dict = cls.parse_pipeline_config(raw_config_dict)
    198 elif config_dict and not config_key:
File ~/anaconda3/envs/.../lib/python3.8/site-packages/melusine/config/config.py:104, in __getitem__(self, key)
KeyError: 'demo_pipeline'
I am not sure what I am doing wrong. Can you please help me?
An error is raised.
Please help me to troubleshoot this issue.
Hey !
We found an error during our use of Melusine.
When we use the function get_emails() to load the emails, it raises an error if it encounters a draft.
Some drafts have no sender or datetime_sent attributes, so an error is raised in the function _extract_email_attributes().
I will join a PR with the solution we propose to deal with drafts in the folder tree.
For each email, we check whether it has datetime_sent and sender attributes. If not, we replace these attributes with None values.
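The check described above can be sketched like this (with a minimal stand-in for an exchangelib item; the function name mirrors the one in the issue):

```python
def extract_email_attributes(item):
    # Sketch of the draft-safe extraction: drafts may lack 'sender' or
    # 'datetime_sent', so fall back to None instead of raising.
    return {
        "sender": getattr(item, "sender", None),
        "datetime_sent": getattr(item, "datetime_sent", None),
    }

class Draft:
    # minimal stand-in for an exchangelib draft item with no attributes set
    pass

print(extract_email_attributes(Draft()))  # {'sender': None, 'datetime_sent': None}
```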
If you have other ideas to deal with this issue, we are listening !
We already told the users of the mailbox to delete drafts in the folder tree or move them to the "Brouillon" folder.
But it may be safer to have a way to deal with a draft forgotten in the folder tree, instead of having our training script for the DL model (which can be long) fail because of one draft.
Best regards,
Maxime
Python version : 3.9.7
Melusine version : 2.3.4
Operating System : Windows
Hello,
I've encountered an issue when trying to use the transformer_scheduler class as referenced in the README.md file of the repository. The documentation suggests using this class as part of the preprocessing pipeline setup example, but I cannot find the actual class implementation in the codebase.
Steps to Reproduce:
Look through the README.md file and attempt the Pre Processing pipeline snippet.
Attempt to locate the transformer_scheduler class in the repository.
Expected Behavior:
The transformer_scheduler class should be found within the repository or the documentation should provide accurate guidance on how to access this class.
Actual Behavior:
The transformer_scheduler class is not present in the repository, and there is no indication that it is part of an external dependency.
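For reference, here is a minimal, self-contained sketch of what such a scheduler typically does: apply a list of functions row-wise as a scikit-learn-style transformer. The class name, rule format, and behavior are my assumptions for illustration, not the repository's actual API:

```python
class TransformerSchedulerSketch:
    """Apply (function, input_keys, output_key) rules to each row dict."""

    def __init__(self, functions_scheduler):
        self.functions_scheduler = functions_scheduler

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, rows):
        for row in rows:
            for func, input_keys, output_key in self.functions_scheduler:
                # Feed the selected fields to the function, store the result.
                row[output_key] = func(*(row[key] for key in input_keys))
        return rows
```

A rule like `(str.lower, ("body",), "clean_body")` would add a lower-cased copy of the body to each row.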
Python Version: 3.9.17
Operating System: Windows 11
Hello everyone. Thank you for this amazing project. I would like to report a bug I hit on my project.
When I'm saving the BERT model with this code:
import joblib
_ = joblib.dump(nn_model, "./data/nn_model.pickle", compress=True)
I'm getting this error:
c:\users\cgoncalves\documents\melusine_fork\melusine\melusine\models\train.py in __getstate__(self)
215 if "model" in dict_attr:
216 del dict_attr["model"]
--> 217 del dict_attr["embedding_matrix"]
218 del dict_attr["pretrained_embedding"]
219 return dict_attr
KeyError: 'embedding_matrix'
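For what it's worth, the traceback suggests __getstate__ unconditionally deletes embedding_matrix, which BERT-based models may never set. A tolerant version could use dict.pop with a default; this is a sketch of the pattern, not the actual train.py code:

```python
class PicklableModelSketch:
    def __getstate__(self):
        dict_attr = self.__dict__.copy()
        # Drop attributes that cannot be pickled, tolerating their absence
        # (BERT-based models have no embedding_matrix, for example).
        for attr in ("model", "embedding_matrix", "pretrained_embedding"):
            dict_attr.pop(attr, None)
        return dict_attr
```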
The model I use is:
import time

from melusine.models.train import NeuralModel

memory_start = get_available_memory()

CamemBert_model = NeuralModel(
    architecture_function=bert_model,
    pretrained_embedding=pretrained_embedding,
    text_input_column="clean_text",
    meta_input_list=["extension", "dayofweek", "hour", "min", "attachment_type"],
    n_epochs=1,
    bert_tokenizer="jplu/tf-camembert-base",
    bert_model="jplu/tf-camembert-base",
)

training_start = time.time()
CamemBert_model.fit(X, y)
training_end = time.time()

CamemBert_memory = memory_start - get_available_memory()
CamemBert_memory = round(CamemBert_memory / 1e9 * 1024, 1)
print("CamemBert is using {} Mb memory (RAM).".format(CamemBert_memory))
Python version : 3.7
Melusine version : 2.3.2 (latest)
Operating System : windows 10
Hi,
I think I have found a bug when saving the model (save_nn_model / load_nn_model) in the melusine.models.train module with the pretrained CamemBERT hidden layer. I get a NotImplementedError from the TensorFlow library (raised from get_config). I think this is because TFCamembertModel is a subclassed network, not a graph network. There seem to be some answers in this topic.
Best regards
Description of Problem:
Depending on the context, tokenization may cover different functionalities. For example:
Tokenization in Melusine is currently a hybrid which covers the following functionalities:
It seems to me that the full NLP tokenization pipeline is somewhat spread across the Melusine package (prepare_data.cleaning, nlp_tools.tokenizer, and even the prepare_data method of models.train.NeuralModel).
This issue can be split into a few questions:
Overview of the Solution:
I suggest creating a revamped MelusineTokenizer class with its own load and save methods.
The class should neatly package many functionalities commonly found in a "Full NLP Tokenization pipeline" such as:
The tokenizer could be saved and loaded from a human readable "json" file.
Examples:
tokenizer = MelusineTokenizer(tokenizer_regex, stopwords, flags)
tokens = tokenizer.tokenize("Hello John how are you")
tokenizer.save("tokenizer.json")
tokenizer_reloaded = MelusineTokenizer.load("tokenizer.json")
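To make the proposal concrete, here is a rough, self-contained sketch of such a class, covering only regex tokenization, stopword filtering, and JSON round-tripping; the constructor signature and file format are tentative:

```python
import json
import re


class MelusineTokenizer:
    """Sketch of a centralized tokenizer with human-readable persistence."""

    def __init__(self, tokenizer_regex=r"\w+", stopwords=None):
        self.tokenizer_regex = tokenizer_regex
        self.stopwords = set(stopwords or [])

    def tokenize(self, text):
        # Lower-case, split on the regex, then drop stopwords.
        tokens = re.findall(self.tokenizer_regex, text.lower())
        return [token for token in tokens if token not in self.stopwords]

    def save(self, path):
        # Persist parameters to a human-readable JSON file.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(
                {
                    "tokenizer_regex": self.tokenizer_regex,
                    "stopwords": sorted(self.stopwords),
                },
                f,
                ensure_ascii=False,
                indent=2,
            )

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            return cls(**json.load(f))
```

The flags handling and the other pipeline steps would still need to be folded into the same class.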
Definition of Done:
The new tokenizer class works as expected.
The tokenizer can be read from / saved into a human-readable config file.
The tokenizer centralizes all tokenization functionalities in the broader sense.
Hello, I cannot get rid of this error with the following complaint email (réclamation). I tried to update the config but it did not work (I added Expéditeur to transition_list). I think there should be a control/warning to signal an empty "text" resulting from build_historic(row).
Any idea? Is this due to the config file needing to be updated in a different way? Thanks a lot.
"Expéditeur : Moi Reçu le : 04/01/2021 à 14:08:15 Bonjour, Dans votre message ci-dessous, vous attirez notre attention sur un chèque n'apparaissant pas au crédit de votre compte. Afin de pouvoir répondre au mieux à votre demande, nous avons besoin des éléments suivants : - le nom de l'émetteur - le nom et l'adresse de la banque - le numéro de la formule de chèque - la copie lisible du bordereau de dépôt \xa0 Restant à votre disposition, nous vous remercions de votre confiance. Cordialement, La Banque Postale La Banque Postale - Société Anonyme à Directoire et Conseil de Surveillance au capital de 6 585 350 218 euros Siège social et adresse postale : 115, rue de Sèvres - 75 275 Paris Cedex 06 RCS Paris 421 100 645 - Code APE 6419Z, intermédiaire d'assurance, immatriculé à l'ORIAS sous le n° 07 023 424 De : NAME SURNAME Le : 31 décembre 2020 18:55:30 Objet : Dépot de chèques Bonsoir, et bonnes fêtes ! Je me permets de vous contacter car j'ai fait un dépôt de chèques le 19/12/20 d'un montant de 1954,15 euros à l'Agence Postale de Drap (06). A ce jour mon compte n'a toujours pas été crédité. Pourriez-vous s'il vous plait faire de recherches pour résoudre ce soucis. Merci et meilleurs vœux. NAME SURNAME;"
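The kind of control I have in mind could look like the wrapper below; build_historic stands in for the Melusine function of the same name, and the wrapper itself is purely illustrative:

```python
import warnings


def build_historic_with_check(row, build_historic):
    """Call build_historic and warn on any message with an empty "text"."""
    historic = build_historic(row)
    for i, message in enumerate(historic):
        if not message.get("text", "").strip():
            warnings.warn(
                "Empty 'text' in message {} of the historic; check the "
                "transition_list configuration.".format(i)
            )
    return historic
```

An early warning here would point straight at the transition_list instead of failing later in the pipeline.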
Hello Melusine developers,
Thanks a lot for your work.
I succeeded in installing Melusine in a Google Colab environment, but I am facing an installation error in my local conda environment.
An error involving TensorFlow is raised even though TensorFlow is not installed in my environment yet (see pip freeze output below).
Here is the error traceback following pip install melusine, raised while attempting to collect Keras>=2.2.0:
The conflict is caused by:
melusine 2.2.6 depends on tensorflow>=2.0.0
melusine 2.2.5 depends on tensorflow>=2.0.0
melusine 2.2.1 depends on tensorflow>=2.0.0
melusine 2.2.0 depends on tensorflow>=2.0.0
melusine 2.1.0 depends on tensorflow>=2.0.0
melusine 2.0.4 depends on tensorflow>=2.0.0
melusine 2.0.3 depends on tensorflow>=2.0.0
melusine 1.11.1 depends on tensorflow>=2.0.0
melusine 1.11.0 depends on tensorflow>=2.0.0
melusine 1.10.0 depends on tensorflow>=2.0
melusine 1.9.6 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.9.5 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.9.4 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.9.3 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.9.2 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.9.1 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.9.0 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.9 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.8 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.7 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.6 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.5 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.4 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.3 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.1 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.8.0 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.7.1 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.7.0 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.6.1 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.6.0 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.5.5 depends on tensorflow<=1.13.1 and >=1.10.0
melusine 1.5.4 depends on tensorflow>=1.10.0
melusine 1.5.2 depends on tensorflow>=1.10.0
melusine 1.5.1 depends on tensorflow>=1.10.0
melusine 1.5.0 depends on tensorflow>=1.10.0
melusine 1.4.0 depends on tensorflow>=1.10.0
melusine 1.2.0 depends on tensorflow>=1.10.0
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
Output of pip freeze:
aiohttp @ file:///home/conda/feedstock_root/build_artifacts/aiohttp_1610358552152/work
alabaster==0.7.12
async-timeout==3.0.1
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1605083924122/work
awscli==1.19.3
Babel==2.9.0
beautifulsoup4 @ file:///tmp/build/80754af9/beautifulsoup4_1601924105527/work
blinker==1.4
botocore==1.20.3
brotlipy==0.7.0
cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1611555765219/work
certifi==2020.12.5
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1606601120025/work
chardet @ file:///home/conda/feedstock_root/build_artifacts/chardet_1602255302154/work
click==7.1.2
colorama==0.4.3
coverage==5.4
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1610338668572/work
docutils==0.15.2
flake8==3.8.4
google-api-core==1.25.1
google-api-python-client==1.12.8
google-auth @ file:///home/conda/feedstock_root/build_artifacts/google-auth_1608136875028/work
google-auth-httplib2==0.0.4
google-auth-oauthlib @ file:///home/conda/feedstock_root/build_artifacts/google-auth-oauthlib_1603996258953/work
googleapis-common-protos==1.52.0
gspread @ file:///home/conda/feedstock_root/build_artifacts/gspread_1594582188011/work
httplib2==0.19.0
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1593328102638/work
imagesize==1.2.0
Jinja2==2.11.3
jmespath==0.10.0
MarkupSafe==1.1.1
mccabe==0.6.1
multidict @ file:///home/conda/feedstock_root/build_artifacts/multidict_1610318999200/work
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1612116912722/work
oauthlib==3.0.1
packaging==20.9
pandas==1.2.1
protobuf==3.14.0
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycodestyle==2.6.0
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1593275161868/work
pyflakes==2.2.0
Pygments==2.7.4
PyJWT @ file:///home/conda/feedstock_root/build_artifacts/pyjwt_1610910308735/work
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1608055815057/work
pyparsing==2.4.7
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1610291451001/work
python-dateutil==2.8.1
python-dotenv==0.15.0
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1612179539967/work
PyYAML==5.3.1
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1608156231189/work
requests-oauthlib @ file:///home/conda/feedstock_root/build_artifacts/requests-oauthlib_1595492159598/work
rsa==4.5
s3transfer==0.3.4
six @ file:///home/conda/feedstock_root/build_artifacts/six_1590081179328/work
snowballstemmer==2.1.0
soupsieve==2.0.1
Sphinx==3.4.3
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
-e [email protected]:Galsor/mailizer.git@0f252e34881382910d191baa4a6aab794444003c#egg=src
typing-extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1602702424206/work
uritemplate==3.0.1
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1611695416663/work
yarl @ file:///home/conda/feedstock_root/build_artifacts/yarl_1610354135407/work
Any ideas on how to resolve this?
Description of Problem:
The way configs are handled in Melusine is not ideal:
Overview of the Solution:
The following changes are suggested:
Examples:
os.environ["MELUSINE_CONFIG_DIR"] = "path/to/custom/config/dir"
from melusine import config
print(config["words_list"])
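One way the suggested mechanism could work, sketched with JSON files to stay self-contained (the real implementation would presumably keep Melusine's existing config file format; all names here are hypothetical):

```python
import json
import os

# Package defaults, overridden by any JSON file found in the directory
# pointed to by the MELUSINE_CONFIG_DIR environment variable.
DEFAULT_CONFIG = {"words_list": ["bonjour", "merci"]}


def load_config():
    config = dict(DEFAULT_CONFIG)
    config_dir = os.environ.get("MELUSINE_CONFIG_DIR")
    if config_dir:
        for name in sorted(os.listdir(config_dir)):
            if name.endswith(".json"):
                with open(os.path.join(config_dir, name)) as f:
                    config.update(json.load(f))
    return config
```

Users would then customize Melusine by dropping override files into their own directory, without touching the installed package.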
Python version : 3.6
Melusine version : 2.3.1
Operating System : Mac
Steps to reproduce :
The code below fits a model, saves it, reloads it into a fresh NeuralModel, and compares the predictions of the original and reloaded models.
The predictions should be equal, but they are different.
from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel
from melusine.nlp_tools.embedding import Embedding

X, y = MY_DATA
pretrained_embedding = Embedding().load("MY_EMBEDDING")

model = NeuralModel(
    architecture_function=cnn_model,
    pretrained_embedding=pretrained_embedding,
)
model.fit(X, y)
model.save_nn_model("MY_MODEL")

# Load the saved model
loaded_model = NeuralModel(
    architecture_function=cnn_model,
    pretrained_embedding=pretrained_embedding,
)
loaded_model.load_nn_model("MY_MODEL")

y_pred_base = model.predict(X)
y_pred_loaded = loaded_model.predict(X)

# Should print True, but prints False
print((y_pred_base == y_pred_loaded).all())
Bug explanation :
When making a prediction (NeuralModel.predict), the NeuralModel uses the vocabulary_dict attribute.
This attribute is an empty dict at model initialization and is filled when the model is fitted.
The vocabulary_dict attribute is not saved by the save_nn_model method, so when the model is loaded and predict is called, all the tokens are mapped to the unknown token and the predicted values make no sense.
Suggestions :
Quick fix: if you are in a rush, you can use the following code to save and load the model:
import joblib

# Save: save_nn_model stores the Keras network, joblib pickles the wrapper
model.save_nn_model("MY_MODEL")
joblib.dump(model, "MY_MODEL.pkl", compress=True)

# Load: unpickle the wrapper, then restore the Keras network
loaded_model = joblib.load("MY_MODEL.pkl")
loaded_model.load_nn_model("MY_MODEL")
The NeuralModel save and load methods should be refactored (save all the relevant attributes and make load a class method) to fix the bug.
However, the current NeuralModel class is far from current Deep Learning standards (ex: PyTorch Lightning modules) and could be refactored entirely.
In particular, we could separate the model itself from the trainer class (Training attributes are unnecessary when doing inference).
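The refactor could follow this shape: persist every inference-time attribute (vocabulary_dict in particular) alongside the network weights, and make load a classmethod that returns a ready-to-use model. File layout and names below are illustrative only:

```python
import json


class NeuralModelSketch:
    """Sketch of save/load that round-trips inference-time attributes."""

    def __init__(self, vocabulary_dict=None):
        self.vocabulary_dict = vocabulary_dict or {}

    def save(self, path):
        # The real implementation would also write the Keras weights here.
        with open(path + "_meta.json", "w") as f:
            json.dump({"vocabulary_dict": self.vocabulary_dict}, f)

    @classmethod
    def load(cls, path):
        # Classmethod: the caller gets a fully initialized model back,
        # rather than patching attributes onto a freshly built instance.
        with open(path + "_meta.json") as f:
            meta = json.load(f)
        return cls(vocabulary_dict=meta["vocabulary_dict"])
```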