Comments (25)
I will update the code and documentation on how to generate the sequence of numbers in the next few days. It will include two methods, depending on your own logs: a time sliding window and a hard disk ID sequence window.
from logdeep.
Thanks a lot for sharing your expertise.
I've gone through your code, and this is how I understand your idea (I am including the minute details so that it may help someone in the future):
1. First, gather all the logs obtained from normal execution of the application, i.e., logs without errors.
2. Combine these logs and convert them to _structured.csv and _template.csv files using Drain from logpai.
3. Train the model using the _structured.csv obtained in step 2.
4. After the model has been trained and saved, test its accuracy using an abnormal log file (a log file with anomalies) and a normal log file, then run inference with the model on new log files.
5. To implement step 4: the new log files will differ in their sequence of events, so generating _structured.csv and _template.csv with Drain again makes no sense, because the randomly generated event_id for an event will be completely different from the event_id generated for the same event in the log file used for training. So you proposed structure_bgl.py, with which I can generate event_id values for completely new logs based on the event_id values of the training logs, using the generated event_template. Further, sample_bgl.py converts the structured log into sequences of event_id, which can then be replaced by their equivalent integers so that testing can be performed.
6. For inference, a new log line, or the log lines in a particular time window, can be mapped to the training file's event_template to obtain the event_id.
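The mapping idea in the last two points can be illustrated with a small sketch. This is not the code of structure_bgl.py; it only assumes that templates use Drain's `<*>` wildcard, and the template texts, the `E1`/`E2` ids, and the `UNK` fallback are hypothetical.

```python
import re


def template_to_regex(template):
    """Turn a Drain-style template with <*> wildcards into a compiled regex."""
    parts = [re.escape(part) for part in template.split("<*>")]
    return re.compile("^" + "(.*?)".join(parts) + "$")


# Hypothetical templates standing in for the rows of the training _template.csv.
TRAIN_TEMPLATES = {
    "E1": "Receiving block <*> src: <*> dest: <*>",
    "E2": "Deleting block <*> file <*>",
}
PATTERNS = {eid: template_to_regex(t) for eid, t in TRAIN_TEMPLATES.items()}


def match_event_id(message, patterns=PATTERNS, unknown="UNK"):
    """Return the event_id of the first training template that matches the message."""
    for event_id, pattern in patterns.items():
        if pattern.match(message):
            return event_id
    return unknown  # the message fits no template seen during training
```

With this approach a new log line always receives the same event_id as the matching line in the training logs, and lines that fit no training template are flagged explicitly instead of receiving a fresh id.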
Did I figure it out correctly?
Please feel free to correct me if I have misdescribed your approach.
Thanks for your time.
from logdeep.
Sorry for the late reply.
These are the three code snippets I wrote before; run them in order. I hope they will be useful to you!
@huhui, @arunbaruah, @nagsubhadeep, @Magical66
1.
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 23 10:54:57 2019
@author: lidongxu1
"""
import io
import json
import re

import spacy
from tqdm import tqdm


def data_read(filepath):
    """Read a text file and return its lines without trailing newlines."""
    datas = []  # stores the processed lines
    with open(filepath, "r") as fp:
        for line in fp:
            datas.append(line.strip('\n'))  # strip the trailing newline
    return datas


def camel_to_snake(name):
    """
    # To handle more advanced cases specially (this is not reversible anymore):
    # Ref: https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case
    """
    name = re.sub(r'(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name).lower()


def replace_all_blank(value):
    r"""
    Remove all non-letter content from value, including punctuation,
    spaces, newlines, underscores and digits.
    :param value: the content to process
    :return: the processed content
    # https://juejin.im/post/5d50c132f265da03de3af40b
    # \W matches any character that is not a letter, digit or underscore
    """
    result = re.sub(r'\W+', ' ', value).replace("_", ' ')
    result = re.sub(r'\d', ' ', result)
    return result


# https://github.com/explosion/spaCy
# https://github.com/hamelsmu/Seq2Seq_Tutorial/issues/1
nlp = spacy.load('en_core_web_sm')


def lemmatize_stop(text):
    """
    https://stackoverflow.com/questions/45605946/how-to-do-text-pre-processing-using-spacy
    """
    document = nlp(text)
    # lemmas = [token.lemma_ for token in document if not token.is_stop]
    # Despite the name, this keeps the raw token text and only drops stop words.
    lemmas = [token.text for token in document if not token.is_stop]
    return lemmas


def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    with open(target_path, 'w', encoding='utf-8') as file:
        file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))


# One template per line; the line index becomes the event id.
data = data_read('template.txt')
result = {}
for i in range(len(data)):
    temp = data[i]
    temp = camel_to_snake(temp)
    temp = replace_all_blank(temp)
    temp = " ".join(temp.split())
    temp = lemmatize_stop(temp)
    result[i] = temp
print(result)
dump_2_json(result, 'eventid2template.json')

# Save only the fastText word vectors that are actually needed
template_set = set()
for key in result.keys():
    for word in result[key]:
        template_set.add(word)


# https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
def load_vectors(fname):
    data = {}
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        n, d = map(int, fin.readline().split())  # header: vocabulary size, dimension
        for line in tqdm(fin):
            tokens = line.rstrip().split(' ')
            # list(...) materialises the vector (map is lazy in Python 3)
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data


fasttext = load_vectors('cc.en.300.vec')
template_fasttext_map = {}
for word in template_set:
    # Raises KeyError if a template word is missing from the fastText vocabulary
    template_fasttext_map[word] = list(fasttext[word])
dump_2_json(template_fasttext_map, 'fasttext_map.json')
2.
import json
import math
from collections import Counter

import numpy as np
import pandas as pd


def read_json(filename):
    with open(filename, 'r') as load_f:
        file_dict = json.load(load_f)
    return file_dict


eventid2template = read_json('eventid2template.json')
fasttext_map = read_json('fasttext_map.json')
print(eventid2template)
dataset = list()
with open('data/' + 'deepLog_hdfs_train.txt', 'r') as f:
    for line in f.readlines():
        # Shift the ids by one so they match the 0-based keys of eventid2template.json
        line = tuple(map(lambda n: n - 1, map(int, line.strip().split())))
        dataset.append(line)
print(len(dataset))
# One token list per log event in the training sequences
event_tokens = list()
for seq in dataset:
    for event in seq:
        event_tokens.append(eventid2template[str(event)])
print(len(event_tokens))
# Count the words of each event. Iterating over the list directly avoids
# building a ragged NumPy array, which recent NumPy versions reject.
X_counts = []
for tokens in event_tokens:
    word_counts = Counter(tokens)
    X_counts.append(word_counts)
print(X_counts[1000])
X_df = pd.DataFrame(X_counts)
X_df = X_df.fillna(0)
print(len(X_df))
print(X_df.head())
events = X_df.columns
print(events)
X = X_df.values
num_instance, num_event = X.shape
print('tf-idf here')
# Document frequency: in how many events each word occurs
df_vec = np.sum(X > 0, axis=0)
print(df_vec)
print('*' * 20)
print(num_instance)
# smooth idf like sklearn
idf_vec = np.log((num_instance + 1) / (df_vec + 1)) + 1
print(idf_vec)
idf_matrix = X * np.tile(idf_vec, (num_instance, 1))
X_new = idf_matrix
print(X_new.shape)
print(X_new[1000])
word2idf = dict()
for i, j in zip(events, idf_vec):
    word2idf[i] = float(j)
# smooth idf when oov (29 is the hard-coded document frequency from the original snippet)
word2idf['oov'] = (math.log((num_instance + 1) / (29 + 1)) + 1)
print(word2idf)


def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    with open(target_path, 'w', encoding='utf-8') as file:
        file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))


dump_2_json(word2idf, 'word2idf.json')
3.
import json
from collections import Counter

import numpy as np


def read_json(filename):
    with open(filename, 'r') as load_f:
        file_dict = json.load(load_f)
    return file_dict


event2template = read_json('eventid2template.json')
fasttext = read_json('fasttext_map.json')
word2idf = read_json('word2idf.json')
event2semantic_vec = dict()
# todo:
# compute the TF of each template, then compute its sentence vector
for event in event2template.keys():
    template = event2template[event]
    tem_len = len(template)
    count = dict(Counter(template))
    for word in count.keys():
        # TF: relative frequency of the word within the template
        TF = count[word] / tem_len
        # IDF: fall back to the 'oov' value for words unseen in training
        IDF = word2idf.get(word, word2idf['oov'])
        count[word] = TF * IDF
    # Normalise the TF-IDF weights so that they sum to 1
    value_sum = sum(count.values())
    for word in count.keys():
        count[word] = count[word] / value_sum
    # Weighted sum of the 300-dimensional fastText word vectors
    semantic_vec = np.zeros(300)
    for word in count.keys():
        fasttext_weight = np.array(fasttext[word])
        semantic_vec += count[word] * fasttext_weight
    event2semantic_vec[event] = list(semantic_vec)


def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    with open(target_path, 'w', encoding='utf-8') as file:
        file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))


dump_2_json(event2semantic_vec, 'event2semantic_vec_sameoov.json')
from logdeep.
Thanks for your response.
Also, thank you in advance for sharing your methods for generating the sequence of numbers from any log file. Almost every log anomaly detection repo uses HDFS logs with sequences of numbers pre-computed from the log messages, and I could not figure out how to do this for an arbitrary log file.
from logdeep.
Hope it will help you :)
Example of how to sample your own log
from logdeep.
What I do in structure_bgl.py is simply use the templates extracted by Drain to map the log file to event_id, and extract the time and other information to be used in the next part.
You could say this part is essentially data cleaning.
The other parts you mentioned seem fine.
from logdeep.
TBH, this is exactly what I meant: using structure_bgl.py, one can use the templates extracted by Drain during training to map new log files to the event_id of each template.
from logdeep.
Hi! Thank you for posting this amazing project, and thank you @kartikeyporwal for opening this issue! I also have a couple of questions about using the model on my own datasets. Here is my understanding of the workflow:
1. Taking the HDFS raw log dataset as an example, I first need to transform it into a structured log dataset using the LogParser. I will end up with two files, _structured.csv and _template.csv.
2. To get training and test data that look like hdfs_train, hdfs_test_normal and hdfs_test_abnormal from the structured log dataset from step 1, I first need to do the sampling to generate the sequences of numbers, as you showed in the example of how to sample your own log. Then, once I have the event sequences, I need to do the train/test split manually to get the three datasets listed above.
3. With the datasets in hand, we can train a model (e.g. deeplog.py) on the hdfs_train file, which uses the sliding window sampling method to generate the sequence vector, count vector and semantic vector used to train the deep learning model. We can choose our own combination of the feature vectors we want to use.
4. Lastly, use the saved model to do inference on the test dataset.
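The manual train/test split described above could look roughly like the following sketch. It assumes the sessions are already available as a dict of event-id lists with a 0/1 anomaly label per session; the function name, the `train_ratio` value and the idea of training only on normal sessions are assumptions for illustration, not something confirmed by the repository.

```python
import random


def split_sessions(sessions, labels, train_ratio=0.8, seed=0):
    """Split sessions into train / test_normal / test_abnormal line lists.

    sessions: dict mapping session id -> list of integer event ids
    labels:   dict mapping session id -> 0 (normal) or 1 (abnormal)
    """
    normal = sorted(sid for sid in sessions if labels[sid] == 0)
    abnormal = sorted(sid for sid in sessions if labels[sid] == 1)
    random.Random(seed).shuffle(normal)  # reproducible shuffle
    cut = int(len(normal) * train_ratio)

    def to_lines(ids):
        # One space-separated sequence of event ids per line
        return [" ".join(str(e) for e in sessions[sid]) for sid in ids]

    return {
        "hdfs_train": to_lines(normal[:cut]),
        "hdfs_test_normal": to_lines(normal[cut:]),
        "hdfs_test_abnormal": to_lines(abnormal),
    }


def write_split(split, directory="."):
    """Write each list of lines to a file named after its key."""
    for name, lines in split.items():
        with open(f"{directory}/{name}", "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
```

Each output line is one session, which matches the space-separated layout that the second snippet earlier in the thread reads from its training file.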
And the questions I have based on the workflow that I describe above are:
1. When generating the semantic vector, the code uses an event2semantic_vec.json file that maps event numbers 0-28 to different vectors. I guess this file was generated specifically for the HDFS dataset, with one entry per eventID in HDFS, right? And how can we generate such a JSON file for our own log data?
2. I'm also a little confused about the two sampling parts in step 2 and step 3. My understanding is that the sampling in step 2 generates event sequences like those in the hdfs_train file, and I believe it also depends on which window type you choose, right? Based on your sampling example for HDFS, I think that is session window sampling, the same method used in loglizer's dataloader.py. And the sampling in sample.py is mainly for generating the feature vectors, right?
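For reference, session window sampling on HDFS can be sketched as grouping events by block id. This is a minimal illustration, not the repository's sampling script; it assumes each structured row provides the message content and an integer event id, and that block ids look like `blk_<number>`.

```python
import re

# Assumed block id format in HDFS messages, e.g. blk_123 or blk_-456
BLOCK_ID = re.compile(r"blk_-?\d+")


def session_windows(rows):
    """Group event ids into one sequence per block id.

    rows: iterable of (content, event_id) pairs in log order.
    Returns a dict mapping block id -> list of event ids.
    """
    sessions = {}
    for content, event_id in rows:
        # A message may mention a block more than once; count it once per line
        for block in set(BLOCK_ID.findall(content)):
            sessions.setdefault(block, []).append(event_id)
    return sessions
```

Each resulting list is one session, which is then the input to the second, sliding window stage that builds the feature vectors.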
Feel free to correct me if there is anything wrong!! Look forward to any feedback!! :)
from logdeep.
hi @cherishwsx
1.
I just use Facebook's open-source fastText pre-trained word vector model to extract the word vectors, and use the TF-IDF method to generate a sentence vector for each log (corresponding to each eventID).
If you are interested, I can upload my code as a reference.
2.
The LSTM training process needs a fixed input sequence length.
You are right: first sample the original HDFS dataset by session window, then use a sliding window to generate the feature vectors (count vector and sequence vector).
If you use the BGL dataset, just sample the original data by sliding window and use the result directly to generate the features.
In robustlog (supervised learning), I just use a fixed sequence length method (crop and pad) to train the LSTM in the code.
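As an illustration of the crop-and-pad idea, here is a minimal sketch. The function name and the padding value are assumptions; in particular, a padding value of 0 may collide with a real event id, so a dedicated padding id may be preferable in practice.

```python
def crop_and_pad(sequence, length, pad_value=0):
    """Return a list of exactly `length` items: crop long inputs, pad short ones."""
    cropped = list(sequence)[:length]
    return cropped + [pad_value] * (length - len(cropped))
```

For example, `crop_and_pad([5, 5, 22], 5)` pads the session to five entries, while a session longer than `length` keeps only its first `length` events.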
from logdeep.
Thank you for the reply!!
- It would be great if you can upload your code! I really appreciate that!
- Do you mean the sequence length (28 in your case, since this number will differ depending on the parser tool) that you use to initialize the length of some of the feature vectors?
from logdeep.
I mean the length of sequence [5 5 5 22 11 9 11 9 11 9] is fixed as 10 in deeplog and loganomaly.
Example:
sequence [5 5 5 22 11 9 11 9 11 9]
sequence_vector=[5,5,5,22,11,9,11,9,11,9]
count_vector=[0]*28
count_vector[5] = 3
count_vector[9] = 3
count_vector[11] = 3
count_vector[22] = 1
28 is just the number of templates (ground truth, not parsed by myself) in the HDFS dataset.
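The example above can be reproduced with a few lines. This is only a sketch of the counting step, with a hypothetical helper name, and it assumes event ids are integers in the range 0 to num_events - 1.

```python
from collections import Counter


def count_vector(sequence, num_events):
    """Count how often each event id (0 .. num_events - 1) occurs in the window."""
    counts = Counter(sequence)
    return [counts.get(event_id, 0) for event_id in range(num_events)]


window = [5, 5, 5, 22, 11, 9, 11, 9, 11, 9]
sequence_vector = list(window)     # the window itself, length 10
vec = count_vector(window, 28)     # one slot per template, length 28
```

The sequence vector keeps the order of events, while the count vector only records how many times each template occurs in the window.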
from logdeep.
In this case, I think you are referring to the window_size parameter (default is 10), correct? For example, if window_size is set to 20, then the length of each sequence will be 20, and the sequence_vector and count_vector (like the code chunk you showed above) will be created from that length-20 sequence.
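A sliding window with a configurable window_size can be sketched as follows. It assumes the DeepLog-style setup in which the label of each window is the event that immediately follows it; the function name is hypothetical and this is not the repository's sample.py.

```python
def sliding_windows(session, window_size=10):
    """Yield (window, next_event) pairs from one session of event ids."""
    for start in range(len(session) - window_size):
        window = session[start:start + window_size]
        next_event = session[start + window_size]  # label for this window
        yield window, next_event
```

A session of 12 events with window_size=10 yields two windows of length 10; sessions shorter than or equal to window_size yield nothing in this sketch, so they would need padding or separate handling.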
from logdeep.
You are right! @cherishwsx
from logdeep.
@donglee-afar I'm very interested in how to get the event2semantic_vec.json file. Could you please upload the code? Thank you very much!
from logdeep.
Hi @donglee-afar ,
Can you please upload the code that is used to generate the event2semantic_vec.json file?
Thanks,
Deep
from logdeep.
Hi @donglee-afar,
Please give me a hint for creating the event2semantic_vec.json file. Thank you
from logdeep.
Hi @donglee-afar,
How do I use fastText to generate the event2semantic_vec.json file? I look forward to any feedback! Thank you!!!
from logdeep.
Hi @donglee-afar
What is the format of the "template.txt" file? Is it the same as the "templates.csv" file in the repository, which includes EventId and EventTemplate, or the same as the hdfs.log file?
Thanks
from logdeep.
Hi,
These are the templates. You can dump the templates from the templates.csv file into this file as .txt. I am not sure, but I think you can even use the templates.csv file itself, as long as only the templates column is kept and the header row is removed.
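Dumping the templates column could be done along these lines. This is a sketch using Python's csv module; it assumes the column is named EventTemplate, as mentioned earlier in the thread, and the function names and file paths are hypothetical.

```python
import csv


def extract_templates(csv_lines, column="EventTemplate"):
    """Return the template strings from an iterable of CSV lines (header included)."""
    return [row[column] for row in csv.DictReader(csv_lines)]


def templates_csv_to_txt(csv_path, txt_path, column="EventTemplate"):
    """Write one template per line, without a header row."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        templates = extract_templates(src, column)
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(templates) + "\n")
    return len(templates)
```

Because the first snippet in this thread uses the line index as the event key, the line order in template.txt presumably needs to match the integer event ids used in the training sequences.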
from logdeep.
@ZanisAli Thank you for your reply
I have another question: in the testing phase, when predicting "test_normal" and "test_abnormal", is the "template.txt" file updated from the training phase or not? In other words, is the "template.txt" file used in training the same as the one used in testing?
from logdeep.
@Farhodi For the training and testing part, the template.txt file is not used at all. Instead, the sequences generated from the structured file created by the template identification technique are used. Here the template.txt file is only used for robustlog, to generate the event2semantic_vec.json file. Apart from that, this file is not used.
from logdeep.
@ZanisAli, thanks for your help.
Is it possible to edit the "LogAnomaly" demo code to have semantic information?
from logdeep.
Thanks @donglee-afar for this fantastic project and all the good work, and thanks @cherishwsx for the good summary.
"And to get the training data and test data that look like the hdfs_train, hdfs_test_normal and hdfs_test_abnormal from the structured log dataset that I got from step 1, I will need to first do the sampling to generate the sequence of number as you stated in the example of how to sample your own log. Then after having the event sequences, I will need to do the train test split manually to get the three datasets that listed above."
Could you please elaborate a bit on the above, specifically on how to generate the three files hdfs_train, hdfs_test_normal and hdfs_test_abnormal?
What I am trying to achieve is to apply robustlog to Linux logs (e.g. syslogs). I first parsed the syslogs with Drain to get the templates, then labelled the original syslogs by placing "-" (normal) or something else (abnormal) in the first field, and then applied structure_bgl.py and sample_bgl.py to structure and sample the logs, respectively. I am stuck at the next step, which is to train and validate the model and make predictions with it; that is where the question above comes from.
Could you please help? Thanks a lot!
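For logs labelled this way, a time sliding window with a per-window label can be sketched as below. This is not what sample_bgl.py does line for line; it assumes each event is a (timestamp in seconds, event id, label) tuple sorted by time, that "-" marks a normal line, and that a window counts as abnormal if any line in it is abnormal. The function name and parameters are hypothetical.

```python
def time_windows(events, window_seconds, step_seconds):
    """Cut a labelled event stream into time windows.

    events: list of (timestamp_seconds, event_id, label), sorted by timestamp.
    Returns a list of (event_id_sequence, is_abnormal) pairs.
    """
    if not events:
        return []
    windows = []
    start = events[0][0]
    last = events[-1][0]
    while start <= last:
        chunk = [e for e in events if start <= e[0] < start + window_seconds]
        if chunk:  # skip windows with no log lines
            sequence = [e[1] for e in chunk]
            abnormal = any(e[2] != "-" for e in chunk)
            windows.append((sequence, int(abnormal)))
        start += step_seconds
    return windows
```

Windows labelled 0 could then go to the training and normal test files and windows labelled 1 to the abnormal test file, in the same one-sequence-per-line layout as the HDFS files. The linear scan per window keeps the sketch short; for large logs an index-based implementation would be faster.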
from logdeep.
Thank you so much, @donglee-afar, for the excellent project and the code snippets for preprocessing. I have used your example snippets to create a gist for preprocessing (in my case it was Ubuntu system logs, and I used a parser project based on SPELL as well as my own text normalization method). This was mainly to create the event2semantic_vec.json semantics file for use with the LogAnomaly method you've implemented.
Here is the gist in case it helps anyone (please give feedback as you wish): https://gist.github.com/michhar/388d037439da6114d67aa8f793293870
Best regards.
from logdeep.
@gavine Hello buddy, I have the same problem as you. Can you help me? I hope you can reply to me when you see it. This is very important to me. Thank you!
from logdeep.