Many thanks for this great tool! I need to map the original patient


How to map original to anonymized data in recursive mode? about dicognito HOT 7 CLOSED

blairconrad commented on June 2, 2024

How to map original to anonymized data in recursive mode?

from dicognito.

Comments (7)

blairconrad commented on June 2, 2024

Hi, Dr. Woodard.

Thanks for the kind words. I'm glad you're enjoying dicognito.

No, there's currently no way to expose this data from the command-line wrapper. One could always use dicognito as a module and get this, but then you'd have to rewrite the tree-walking and the whole bit. How tedious.
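For illustration, the "use it as a module and rewrite the tree-walking" route might look roughly like the sketch below. It is only a sketch under stated assumptions: `build_id_map` and the `anonymize_file` callback are hypothetical names, and the callback stands in for whatever code reads one file, anonymizes it with dicognito, and reports the patient ID before and after.

```python
import os
from typing import Callable, Dict, Tuple


def build_id_map(
    root: str,
    anonymize_file: Callable[[str], Tuple[str, str]],
) -> Dict[str, str]:
    """Walk `root` recursively and collect original -> anonymized IDs.

    `anonymize_file` is a hypothetical callback (not part of dicognito):
    it anonymizes one file and returns (original_id, anonymized_id).
    """
    id_map: Dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            original_id, anonymized_id = anonymize_file(path)
            # Later files for the same patient overwrite with the same value,
            # assuming IDs are mapped consistently within a single run.
            id_map[original_id] = anonymized_id
    return id_map
```

The callback keeps the directory walk separate from the per-file work, so the walk can be exercised without any DICOM files at hand.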

I think your new feature sounds useful and fun, and see no reason not to add it! If you want to do so, go nuts! Happy to have you aboard.

I'm keen to hear how you envision this working, and everything's open for discussion, but I'm currently imagining a command-line option or options that would enable the feature. I'm suggesting not enabling this additional output by default for a couple of reasons:

backwards compatibility. We're in the 0.xx.yy release range, so technically we could break anything with impunity, but that's no reason to do it. I wouldn't want to surprise my as many as 10 users with new data, and
my use of the tool is not for analyzing clinical data. More often, I'm deidentifying patient data to use when troubleshooting a bug in an enterprise imaging software suite. In these cases, I really don't want to know anything about the original data if I can avoid it, so having it pop up on the screen by default is not ideal.

What think you? Don't hesitate to come back with counter-ideas, suggestions, questions, or whatnot.
If you're still interested in working on the issue, say the word and I'll assign it to you!

from dicognito.

blairconrad commented on June 2, 2024

@annawoodard, if you are interested in contributing a change, be aware that I just now merged #125, which changes the appearance of the summary slightly, and more importantly should make it easier for you to make any further change to the summary content.

from dicognito.

annawoodard commented on June 2, 2024

I'm currently imagining a command-line option or options that would enable the feature.

Absolutely, I totally agree that we shouldn't surprise users with unexpected sensitive data. And please do assign it to me. I'm thinking of just a simple flag (--save-anonymization-map, --key, --save-key, or --save-map; I don't have a strong preference, although 'key' may be too overloaded a choice. Let me know if you have one). The output would then be a CSV file with a fixed name. Users could rename the file if they wished. We could either refuse to overwrite an existing map file with the same name or append to it. Another possibility would be to let the user pass in a filename for the output key; I don't like that, though, as it may be unclear what suffix the file should have. It's easy enough to just rename a file, so I favor not making this configurable, to reduce mental overhead.
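A minimal sketch of the proposal above, assuming the refuse-to-overwrite variant: the flag name is one of the candidates floated in this comment, while `MAP_FILENAME`, `parse_args`, `write_map`, and the column headers are hypothetical and not part of dicognito.

```python
import argparse
import csv
from typing import Dict, List, Optional

# Hypothetical fixed output name; the thread does not settle on one.
MAP_FILENAME = "anonymization_map.csv"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="sketch of an opt-in map flag")
    # Off by default, so existing users see no new (sensitive) output.
    parser.add_argument(
        "--save-anonymization-map",
        action="store_true",
        help="write a CSV mapping original IDs to anonymized IDs",
    )
    return parser.parse_args(argv)


def write_map(id_map: Dict[str, str], path: str = MAP_FILENAME) -> None:
    # Mode "x" raises FileExistsError instead of overwriting an existing map.
    with open(path, "x", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["original_id", "anonymized_id"])
        writer.writerows(sorted(id_map.items()))
```

Opening with mode `"x"` leaves the decision about an existing file to the caller, who can catch `FileExistsError` and report it.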

be aware that I just now merged #125

Thanks for the heads up!

from dicognito.

blairconrad commented on June 2, 2024

Ah. I misunderstood your intent. I thought that you were proposing to alter the output from the tool. Given your preferred approach, #125 shouldn't be a factor at all. Ah, well.

A separate anonymization map file makes sense to me. As you suggest, "key" is probably a little too broad. Either of your *-map options sounds fine.

Appending to the file would never have occurred to me. I would've naively overwritten. I'm never in a situation where I'd keep anonymizing new batches of files and considering them to be part of the same body of work. I assume you have a specific workflow in mind where this sort of thing happens? Would you describe it? In particular, I'm curious about how having one growing CSV file is preferable to individual ones that might later be aggregated, either by concatenating or by importing into another tool.

In sadder news, I have a differing opinion about specifying the filename. I think it's preferable to do so. Here's my reasoning:

if users don't specify the filename, it'll be dropped… somewhere. The user has to have read the docs (or help) to know where this file was created. If they haven't, they may not find the file, possibly moving on in frustration. There's a(n admittedly low) chance that they'll move on and the map file will be left sitting around on disk, to be discovered later by those who shouldn't see it.
moving a file later is easy enough, as you say, but it's still a thing that has to be remembered after running the tool. What if someone forgets?
the user may be confused about what suffix the file should have, but another way to look at it is that this gives them choice! Maybe they want .csv, or maybe there's a special .map suffix that some tool likes to read CSVs from. As an idea for a future feature, dicognito could choose the output format based on the file suffix: JSON for .json, XML for .xml, and so on.
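The suffix-based idea in the last point could be sketched as follows. This is an illustration only, not an existing dicognito feature: `write_map_by_suffix` is a hypothetical helper, treating `.map` as CSV is an assumption taken from the example above, and XML is left out for brevity.

```python
import csv
import json
from pathlib import Path
from typing import Dict


def write_map_by_suffix(id_map: Dict[str, str], destination: str) -> None:
    """Pick the output format from the user-supplied file suffix."""
    path = Path(destination)
    suffix = path.suffix.lower()
    if suffix == ".json":
        path.write_text(json.dumps(id_map, indent=2, sort_keys=True))
    elif suffix in (".csv", ".map"):
        # Assumption: ".map" is just CSV under another name.
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["original_id", "anonymized_id"])
            writer.writerows(sorted(id_map.items()))
    else:
        # Fail loudly rather than guess a format for an unknown suffix.
        raise ValueError(f"unsupported map file suffix: {suffix!r}")
```

Rejecting unknown suffixes means a typo in the filename cannot produce a map in an unexpected format.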

I'm not completely closed to a default filename, but do wonder if I've swayed you at all. Regardless, that consideration is largely separable from the work of constructing the file, so I'll assign to you and we can refine the vision while (or if you prefer, before) you work.

Thanks!

from dicognito.

annawoodard commented on June 2, 2024

Apologies for the delay.

Appending to the file would never have occurred to me. I would've naively overwritten. I'm never in a situation where I'd keep anonymizing new batches of files and considering them to be part of the same body of work. I assume you have a specific workflow in mind where this sort of thing happens? Would you describe it? In particular, I'm curious about how having one growing CSV file is preferable to individual ones that might later be aggregated, either by concatenating or by importing into another tool.

I'm building deep learning models to predict breast cancer risk from screening imaging (MRI or MG) data. We follow a cohort of patients at our institution who have consented to share their imaging data with us. They typically undergo screening imaging every 6-12 months. With time, more patients consent, and previously consented patients accumulate additional exams. Model development is iterative, so rather than waiting until the study closes to download all the data at once and start building and training models, I run a pipeline every few months to pull down new exams, preprocess them, and add them to the training set. De-identification is handled for us by a centralized group that oversees research on imaging data at our institution. We are establishing a data-sharing agreement with collaborators who have no such group, so I'm writing a pipeline for them that includes de-identification. We'll have a similar workflow of incrementally updating their dataset.

Thanks for your questions. Now that I reflect on it, my previous plan was ill-formed. Whether we append or aggregate, the result would be a many-to-one mapping of anon IDs to original IDs. While it would be possible to disentangle, it's not ideal. I've opened #126 to discuss this. Solving the checkpointing issue would obviate the need to append, so let's not consider that further.
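To make the many-to-one concern concrete, here is a small sketch that flags patients who end up with more than one anonymized ID once per-run maps are combined. It assumes each run produces its own original-to-anonymized dictionary and that runs assign IDs independently; `find_split_patients` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def find_split_patients(run_maps: Iterable[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Return original IDs that received different anonymized IDs across runs."""
    anon_ids_by_original: Dict[str, Set[str]] = defaultdict(set)
    for run_map in run_maps:
        for original_id, anonymized_id in run_map.items():
            anon_ids_by_original[original_id].add(anonymized_id)
    # Keep only the patients whose exams would be scattered across several IDs.
    return {
        original_id: anon_ids
        for original_id, anon_ids in anon_ids_by_original.items()
        if len(anon_ids) > 1
    }
```

An empty result would mean the combined map is still one-to-one; anything else lists the IDs that would need disentangling.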

I'm not completely closed to a default filename, but do wonder if I've swayed you at all.

Yes, you've made great points. Let's require the user to specify a filename.

from dicognito.

blairconrad commented on June 2, 2024

No need to apologize for the delay! This issue is your metaphorical baby. And while it appears to be relevant to your actual job, most of my interactions with people on GitHub involve folk using or maintaining projects as hobbies (while I use dicognito at my work, writing it was very much entertainment for me), so it's rare that things move quickly.
I was considering checking in on you in a few days in case I'd turned you off with my previous comments. I'm glad you're still here. Anyhow, I'm content to work to your pace.

Thanks for sharing the details about your iterative model. It's very interesting!

While it would be possible to disentangle, it's not ideal.

We could have a map of maps to the deidentification maps! 😉

from dicognito.
