
Comments (3)

ChristophWurst commented on June 12, 2024

It does not block logins. And in fact, once an attacker gets into the account, they can just delete that notification, which will then also disappear on all other devices.

That's true. Because of that, we also send out an email.

I assume it also uses neither geographical information nor internet topology information, which would be very useful for a classification model pipeline (i.e. turn (ip, uid) into (ip, uid, lat, long) for training with geographic info)

IP addresses have topology if you look at them as a vector of bits. The neural net is able to learn that. For IPv6 we also throw away the lower half of the vector.
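To make that concrete, here is a minimal sketch of such a feature encoding (illustrative only, not the app's exact code; the function name is made up):

<?php
// Illustrative sketch: encode an IP address as a vector of bits.
// For IPv6 only the upper 64 bits are kept, mirroring the idea above.
function ipToBitVector(string $ip): array {
    $packed = inet_pton($ip);               // 4 bytes for IPv4, 16 for IPv6
    if ($packed === false) {
        throw new InvalidArgumentException("invalid IP: $ip");
    }
    if (strlen($packed) === 16) {
        $packed = substr($packed, 0, 8);     // IPv6: drop the lower half
    }
    $bits = [];
    foreach (str_split($packed) as $byte) {
        for ($i = 7; $i >= 0; $i--) {
            $bits[] = (ord($byte) >> $i) & 1;
        }
    }
    return $bits;                            // 32 values for IPv4, 64 for IPv6
}

Nearby addresses then differ only in the trailing entries of that vector, which is the structure the net can pick up on.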

Is there a command with which I can test the classifier on an (ip, uid) tuple of my choosing?

Yes, suspiciouslogin:predict

How can you make these stats? I would wager that 100% of the successful logins on my instance are legitimate, so there are zero "actually suspicious" logins and the real precision is zero. According to the repository description you are also not using failed logins. So where do you get authoritative/"supervised" information from? All you really know is (TP+FP) and (TN+FN).

I think https://blog.wuc.me/2019/04/25/nextcloud-suspicious-login-detection.html covers that well. We don't have supervised data for validated suspiciousness, as such data is hard to find, and failed login attempts are not a reliable source for it either. Therefore we generate random IPs as negative samples, and we simulate other users trying to log in with other users' IDs.
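Roughly, the negative sampling can be pictured like this (a simplified sketch, not the app's actual implementation; the function names are made up):

<?php
// Simplified sketch of the negative-sampling idea described above.
// $positives would come from the real login history.
function randomIpv4(): string {
    return long2ip(random_int(0, 0xFFFFFFFF));
}

function generateNegatives(array $positives): array {
    $negatives = [];
    foreach ($positives as $sample) {
        // 1) same user, but a completely random IP
        $negatives[] = ['uid' => $sample['uid'], 'ip' => randomIpv4(), 'label' => 0];
        // 2) a real IP, but paired with another user's uid
        $other = $positives[array_rand($positives)];
        if ($other['uid'] !== $sample['uid']) {
            $negatives[] = ['uid' => $other['uid'], 'ip' => $sample['ip'], 'label' => 0];
        }
    }
    return $negatives;
}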

My best guess is that you 1) assume all captured (successful) logins L are legitimate, and that 2) all of G \ L are suspicious (G being all possible (ip, uid) tuples), of which you pick a random set. You then run M(t) on the non-training part of set L (and get FP & TN), and run M(t) on a random sample of G \ L (and get TP & FN).

See

// Load
$collectedData = $this->dataLoader->loadTrainingAndValidationData(
    $dataConfig,
    $strategy
);
$data = $this->dataLoader->generateRandomShuffledData(
    $collectedData,
    $config,
    $strategy
);
// Train
$result = $this->trainer->train(
    $config,
    $data,
    $strategy
);
// Persist
$this->store->persist(
    $result->getClassifier(),
    $result->getModel()
);
We load the real data, then we create the complementary negative samples, and this is what goes into the training. During training the set is split again into two sets, and we validate with data the classifier doesn't see during training.
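The split itself can be pictured like this (a sketch; the 80/20 ratio is just an assumption for illustration, not necessarily what the app uses):

<?php
// Illustrative train/validation split on the shuffled, labelled samples.
function splitTrainValidation(array $samples, float $trainFraction = 0.8): array {
    shuffle($samples);
    $cut = (int) floor(count($samples) * $trainFraction);
    return [
        'train'      => array_slice($samples, 0, $cut),
        'validation' => array_slice($samples, $cut),
    ];
}
// The classifier is fitted on 'train' only; precision and recall are then
// measured on 'validation', which it never saw during training.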

or a certain subset with minimum distance to L according to some metric

This is very well possible, but it needs input from a human, which means you have to adjust this parameter based on how your instance is used. You might have people who typically log in from within a small radius and others who travel more, and then there are those who commute and have two regular but distinct locations they use. You can detect anomalies in this data with rules; where the machine learning approach shines is that it can adapt without user input. As an admin you install the app and can basically forget about it.

I hope this clears things up.


mueslo commented on June 12, 2024

After trying this add-on for multiple years, I have to be a bit blunt and say it just straight up doesn't seem to work reliably as intended. Since there are not really any configuration options, I don't think it is a configuration error; it must be an error in the model choice or the programming. Maybe it's due to the small number of users on my instance, but I have no idea.

A login from a new IP address often yields a warning, even if it is in the same /24 subnet as previous attempts. But more than that, the exact IPs I have logged in from in the past are regularly classified as suspicious on new logins (at least once every six months, sometimes once a month). This strongly implies something is fundamentally broken.

[screenshot]

[screenshot]

On average I get two "suspicious login" e-mails per week (see image). None of those are actually suspicious (precision = 0). When the failure rate is so high, it just becomes another warning you ignore. Roughly one quarter of the warnings are logins via my mobile carrier. All logins via my mobile carrier come from the same narrow /22 subnet.

[screenshot]


mueslo commented on June 12, 2024

My Nextcloud instance's trained model on average claims roughly 94% precision and 90% recall.
[screenshot]

So far the app has captured 733494 logins (including client connections), of which 1795 are distinct (IP, UID) tuples.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Based on your previous explanation I therefore take it that you just calculate it with the random sampling of all IPs and assume that all random IPs that trigger the model are TP, random IPs that don't trigger the model are FN, previous (historical) IPs that trigger the model are FP, and previous IPs that don't trigger the model are TN.

Precision = random trigger / (random trigger + historical trigger)
Recall = random trigger / (random trigger + random notrigger)

So:
Precision = What fraction of triggering IPs comes from the random set
Recall = What fraction of the random set causes a trigger
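In code, that interpretation boils down to simple counting (my sketch of how I read it, not the app's actual code):

<?php
// Precision/recall as they fall out of the interpretation above: every
// random-IP sample counts as truly "suspicious", every historical sample
// as truly "legitimate". Historical samples that do not trigger would be
// the true negatives and enter neither formula.
function metrics(int $randomTrigger, int $randomNoTrigger, int $historicalTrigger): array {
    $tp = $randomTrigger;      // random IP flagged     -> true positive
    $fn = $randomNoTrigger;    // random IP not flagged -> false negative
    $fp = $historicalTrigger;  // known IP flagged      -> false positive
    return [
        'precision' => $tp / ($tp + $fp),
        'recall'    => $tp / ($tp + $fn),
    ];
}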

In this model, the precision is an absolutely meaningless value, since it depends on the relative number of "random" and "historical" samples put into the validation dataset. Assuming it's 1:1, the precision of 94% tells me that basically 6% of the triggers were due to the 50% historical data, which together with the 90% recall works out to roughly 6% of the historical samples triggering a warning. That's pretty bad.
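Spelled out as a back-of-the-envelope calculation (taking the reported numbers at face value and assuming the 1:1 split):

<?php
// With equal numbers of random and historical samples in the validation set:
$precision = 0.94;
$recall    = 0.90;

// All random samples are "positives", so the recall is the fraction of
// random samples that trigger.
$tpRate = $recall;
// False positives per historical sample (same count as random samples):
$fpRate = $tpRate * (1 - $precision) / $precision;

printf("historical samples that trigger a warning: %.1f%%\n", $fpRate * 100); // ~5.7%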

Simultaneously, the recall of 90% implies that a supposed attacker logging in from a random IP has a 10% chance of not triggering a notification. Keep in mind: these are for completely random IP addresses. This value is harder to get to 100% without triggering too many false positives, but 99%+ should be easily doable too...

Are the IPs even weighted by frequency for training? It is much worse if an IP I log in from daily gets classified as suspicious than one I logged in from once five months ago and never since. Taking just the unique (userid, ip) tuples is not a good idea.
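What I mean by weighting could be as simple as this (a sketch of my suggestion, not something the app does as far as I can tell):

<?php
// Sketch of frequency weighting: instead of one sample per distinct
// (uid, ip) tuple, each tuple contributes in proportion to how often
// it actually logged in.
function weightedSamples(array $logins): array {
    $counts = [];
    foreach ($logins as $login) {                  // one entry per login event
        $key = $login['uid'] . '|' . $login['ip'];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    $samples = [];
    foreach ($counts as $key => $count) {
        [$uid, $ip] = explode('|', $key, 2);
        $samples[] = ['uid' => $uid, 'ip' => $ip, 'weight' => $count, 'label' => 1];
    }
    return $samples;
}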

This further reinforces my thinking that the chosen model and the means of training and validation are just not appropriate. A simple hardcoded IP address distance check would probably work leaps and bounds better than this, or a k-nearest-neighbors classifier. A big part of the problem also seems to be that the daily retrained neural net will sometimes produce a bad model.
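For comparison, the kind of hardcoded check I have in mind (purely a sketch, with an arbitrary prefix threshold that would need tuning):

<?php
// Trivial rule: a login is "suspicious" unless the new IP shares a
// sufficiently long prefix with some IP this user has logged in from before.
function commonPrefixBits(string $ipA, string $ipB): int {
    $a = inet_pton($ipA);
    $b = inet_pton($ipB);
    if ($a === false || $b === false || strlen($a) !== strlen($b)) {
        return 0; // invalid input or mixed IPv4/IPv6: treat as no overlap
    }
    $bits = 0;
    for ($i = 0; $i < strlen($a); $i++) {
        $diff = ord($a[$i]) ^ ord($b[$i]);
        if ($diff === 0) { $bits += 8; continue; }
        for ($j = 7; $j >= 0; $j--) {
            if (($diff >> $j) & 1) { return $bits; }
            $bits++;
        }
    }
    return $bits;
}

function isSuspicious(string $newIp, array $knownIps, int $minPrefixBits = 16): bool {
    foreach ($knownIps as $known) {
        if (commonPrefixBits($newIp, $known) >= $minPrefixBits) {
            return false;
        }
    }
    return true;
}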

And further:

I assume it also uses neither geographical information nor internet topology information, which would be very useful for a classification model pipeline (i.e. turn (ip, uid) into (ip, uid, lat, long) for training with geographic info)

IP addresses have topology if you look at them as a vector of bits. The neural net is able to learn that. For IPv6 we also throw away the lower half of the vector.

Even assuming a perfect model (which this obviously isn't), their topology is not good enough to avoid producing a lot of false positives. IPs that are close numerically tend to be close geographically, that much is true. But the converse is not true: IPs that are close geographically can be very far apart numerically, due to wildly different IP blocks per carrier/company/institution, and sometimes even multiple distant IP blocks within a single institution.
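For example, two addresses from unrelated allocation blocks share almost no numeric prefix, no matter where the networks behind them sit geographically (documentation-range addresses used here purely as stand-ins):

<?php
// Count how many leading bits two IPv4 addresses have in common.
$a = ip2long('198.51.100.20');
$b = ip2long('203.0.113.20');
$common = 0;
for ($i = 31; $i >= 0; $i--) {
    if ((($a >> $i) & 1) !== (($b >> $i) & 1)) {
        break;
    }
    $common++;
}
echo "common prefix bits: $common\n"; // only 4 of 32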

Also, the training approach you mentioned is bad in general and needs to overfit to work properly in the limit. It's like training an image-recognition net to detect cats and giving it random images as negatives, including images of cats. The resulting net will not detect cats; it will detect specific images of cats. Similarly, your choice of model wants to converge on simply becoming a list of previous logins (but it evidently also fails at that).

