Comments (3)
It does not block logins. And in fact, once an attacker gets into the account, they can just delete that notification, which will then also disappear on all other devices.
That's true. That's why we also send out an email.
I assume it also uses neither geographical information nor internet topology information, which would be very useful for a classification model pipeline (i.e. turn (ip, uid) into (ip, uid, lat, long) for training with geographic info)
IP addresses have topology if you look at them as vector of bits. The neural net is able to learn those. For ipv6 we also throw away the lower half of the vector.
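A minimal sketch of what "IP as a vector of bits" could look like; the exact feature layout in suspicious_login may differ, and this encoding is an assumption, not the app's actual code:

```python
import ipaddress

def ip_to_bits(ip: str) -> list[int]:
    """Encode an IP address as a fixed-length vector of 0/1 features.

    IPv4 -> 32 bits; IPv6 -> upper 64 bits only (the lower half is
    discarded, as described above). The layout is illustrative.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        n, width = int(addr), 32
    else:
        n, width = int(addr) >> 64, 64  # keep only the network half
    return [(n >> (width - 1 - i)) & 1 for i in range(width)]

print(ip_to_bits("192.168.0.1")[:8])  # first octet 192 -> [1, 1, 0, 0, 0, 0, 0, 0]
```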
Is there a command with which I can test the classifier on an (ip, uid) tuple of my choosing?
Yes, suspiciouslogin:predict
How can you make these stats? I would wager that 100% of the successful logins on my instance are legitimate, so there are zero "actually suspicious" logins and the real precision is zero. According to the repository description you are also not using failed logins. So where do you get authoritative/"supervised" information from? All you really know is (TP+FP) and (TN+FN).
I think https://blog.wuc.me/2019/04/25/nextcloud-suspicious-login-detection.html covers that well. We don't have supervised data for validated suspiciousness as it's hard to find this data. Failed login attempts are not a reliable source for this data. Therefore we generate random IPs as negative samples, and we simulate that other users try to log in with other user IDs.
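The sampling scheme described above could be sketched roughly like this (random IPs plus shuffled user IDs as "suspicious" samples); names and proportions are my assumptions, not the app's actual code, which lives in TrainService.php:

```python
import random

def make_training_set(logins, n_random=None):
    """Build labeled samples from observed (uid, ip) logins.

    Positives ("legitimate"): the observed logins themselves.
    Negatives ("suspicious"): randomly generated IPs, plus observed
    IPs paired with *other* users' IDs. Proportions are illustrative.
    """
    n_random = n_random or len(logins)
    samples = [(uid, ip, 1) for uid, ip in logins]          # legitimate
    uids = [uid for uid, _ in logins]
    for _ in range(n_random):                               # random IPs
        uid = random.choice(uids)
        ip = ".".join(str(random.randrange(256)) for _ in range(4))
        samples.append((uid, ip, 0))
    for uid, ip in logins:                                  # shuffled uids
        other = random.choice([u for u in uids if u != uid] or [uid])
        samples.append((other, ip, 0))
    return samples
```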
My best guess is that you 1) assume all captured (successful) logins L are legitimate, and 2) assume all of G \ L (with G the set of all possible (ip, uid) tuples) are suspicious, of which you pick a random set. You then run M(t) on the non-training part of L (and get FP & TN), and run M(t) on a random sample of G \ L (and get TP & FN).
See suspicious_login/lib/Service/TrainService.php, lines 66 to 88 at commit 770a62c.
or a certain subset with minimum distance to L according to some metric
This is very well possible, but it needs input from a human, which means you have to adjust this parameter based on how your instance is used. Some people log in from a small radius of typical locations, others travel more, and then there are those who commute between two regular but distinct locations. You can detect anomalies in this data with rules; where the machine learning approach shines is that it can adapt without user input. As an admin you install the app and can basically forget about it.
I hope this clears things up.
from suspicious_login.
After trying this add-on for multiple years, I have to be a bit blunt and say it just straight up doesn't seem to work reliably as intended. Since there are not really any configuration options, I don't think it is a configuration error; it must be an error in the model choice or programming. Maybe it's due to the small number of users on my instance, but I have no idea.
Any login from a new IP address yields a warning, even if it is in the same /24 subnet as previous attempts. But worse than that, the exact IPs I have logged in from in the past are regularly classified as suspicious on new logins (at least once every six months, sometimes once a month)! This strongly implies something is fundamentally broken.
On average I get two "suspicious login" e-mails per week (see image). None of those are actually suspicious (precision = 0). When the failure rate is so high, it just becomes another warning you ignore. Roughly one quarter of the warnings are logins via my mobile carrier. All logins via my mobile carrier come from the same narrow /22 subnet.
My Nextcloud instance's trained model on average claims roughly 94% precision and 90% recall.
So far the app has captured 733494 logins (including client connections), of which 1795 are distinct (IP, UID) tuples.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Based on your previous explanation I therefore take it that you just calculate it with the random sampling of all IPs and assume that all random IPs that trigger the model are TP, random IPs that don't trigger the model are FN, previous IPs that trigger the model are FP, and previous IPs that don't trigger the model are TN.
Precision = random trigger / (random trigger + historical trigger)
Recall = random trigger / (random trigger + random no-trigger)
So:
Precision = What fraction of triggering IPs comes from the random set
Recall = What fraction of the random set causes a trigger
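Under that reading, the reported figures reduce to simple counts over the two validation pools. This is my reconstruction of the evaluation, not the app's actual code:

```python
def reported_metrics(random_trigger, random_total, hist_trigger, hist_total):
    """Precision/recall if every random-IP trigger counts as a TP and
    every historical-IP trigger counts as an FP (the reading above)."""
    tp = random_trigger                   # random IPs that trigger
    fn = random_total - random_trigger    # random IPs that don't
    fp = hist_trigger                     # historical IPs that trigger
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# e.g. 90 of 100 random IPs trigger, 6 of 100 historical IPs trigger:
p, r = reported_metrics(90, 100, 6, 100)
print(round(p, 3), round(r, 2))  # 0.938 0.9
```

With counts like these, the headline numbers land close to the ~94%/~90% quoted above.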
In this model, the precision is an absolutely meaningless value, since it depends on the relative number of "random" and "historical" samples put into the validation dataset. Assuming a 1:1 split and taking the ~90% recall at face value, a precision of 94% implies that roughly 6% of the historical samples trigger a warning, i.e. about one in seventeen known-good (IP, UID) tuples is flagged. That's pretty bad.
Simultaneously, the recall of 90% implies that a supposed attacker logging in from a random IP has a 10% chance of not triggering a notification. Keep in mind: these are for completely random IP addresses. This value is harder to get to 100% without triggering too many false positives, but 99%+ should be easily doable too...
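Accounting for the ~90% recall, the historical trigger rate implied by the reported metrics can be computed directly. This is a sketch of my own arithmetic, assuming a 1:1 random:historical validation split:

```python
def historical_fp_rate(precision, recall):
    """False-positive rate on historical (legitimate) samples implied
    by reported precision/recall, for a 1:1 random:historical split.

    With N samples per pool: TP = recall * N, and
    precision = TP / (TP + FP) gives FP = TP * (1 - precision) / precision,
    so the per-sample FP rate is FP / N.
    """
    tp_per_sample = recall                                  # TP / N
    fp_per_sample = tp_per_sample * (1 - precision) / precision
    return fp_per_sample

print(round(historical_fp_rate(0.94, 0.90), 3))  # ~0.057
```

So even on its own terms, the model flags several percent of known-good samples.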
Are the IPs even weighted by frequency for training? It is much worse if an IP I log in from daily gets classified as suspicious than an IP I logged in from once five months ago and never since. Taking just the unique (userid, ip) tuples is not a good idea.
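Weighting by observed frequency instead of deduplicating could look like this; a sketch only, nothing in the source suggests the app does this:

```python
from collections import Counter

def weighted_samples(login_events):
    """Collapse raw login events into (uid, ip, weight) samples,
    where weight is the observed login count; a frequently used IP
    then contributes more to training/validation than a one-off."""
    counts = Counter(login_events)            # (uid, ip) -> count
    return [(uid, ip, n) for (uid, ip), n in counts.items()]

events = [("alice", "1.2.3.4")] * 30 + [("alice", "9.9.9.9")]
print(weighted_samples(events))  # [('alice', '1.2.3.4', 30), ('alice', '9.9.9.9', 1)]
```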
This further reinforces my thinking that the chosen model and the means of training and validation are just not appropriate. A simple hardcoded IP address distance check would probably work leaps and bounds better than this. Or a K-Neighbors Classifier. A big part of the problem also seems to be that the daily trained neural net will sometimes produce something bad.
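The hardcoded-distance baseline suggested above could be as little as the following (IPv4 only; the /22 cutoff and the helper name are illustrative, not a vetted design):

```python
import ipaddress

def is_suspicious(uid, ip, history, prefix=22):
    """Flag a login as suspicious unless the same user has previously
    logged in from the same /`prefix` IPv4 network. The /22 cutoff is
    arbitrary and would need tuning per deployment."""
    net = int(ipaddress.ip_address(ip)) >> (32 - prefix)
    for hist_uid, hist_ip in history:
        if hist_uid == uid and int(ipaddress.ip_address(hist_ip)) >> (32 - prefix) == net:
            return False
    return True

hist = [("alice", "203.0.113.7")]
print(is_suspicious("alice", "203.0.113.250", hist))  # same /22 -> False
print(is_suspicious("alice", "198.51.100.1", hist))   # new network -> True
```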
And further:
I assume it also uses neither geographical information nor internet topology information, which would be very useful for a classification model pipeline (i.e. turn (ip, uid) into (ip, uid, lat, long) for training with geographic info)
IP addresses have topology if you look at them as vector of bits. The neural net is able to learn those. For ipv6 we also throw away the lower half of the vector.
Even assuming a perfect model (which this obviously isn't), IP topology alone is not good enough to avoid a lot of false positives. IPs that are numerically close tend to be geographically close, that much is true. But the inverse is not true: IPs that are geographically close can be numerically very far apart, due to wildly different IP blocks per carrier/company/institution, and sometimes even multiple distant IP blocks within a single institution.
Also, the training approach you mentioned is bad in general and needs to overfit to work properly in the limit case. It's like training an image recognition net to detect cats, and as negatives giving random images - including cats. The resulting neural net will not detect cats, it will detect specific images of cats. Similarly, your choice of model wants to converge on simply becoming a list of previous logins (but it evidently also fails at that).