mljs / generate-dataset

Generate synthetic datasets for testing

License: MIT License

JavaScript 100.00%

generate-dataset's Introduction

ml-generate-dataset

Installation

npm install --save ml-generate-dataset

Example

var generateDataset = require('ml-generate-dataset');
/*
 * These options are used to create the dataset. They allow several classes to be defined; the markers are
 * the elements whose distributions differ between classes. Note the element with index 1: its distribution
 * has means 9.4 and 10.3 for the first and second classes, respectively.
 */
var options = {
    keepDataClass: true,
    keepCompositionMatrix: true,
    dummyMatrix: true,
    seed: 22,
    classes: [
        {
            nbSample: 500,
            elements: [
                {
                    index: 0,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDesviation: 0.1
                        }
                    }
                },
                {
                    index: 1,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDesviation: 0.1
                        }
                    }
                },
                {
                    index: 2,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDesviation: 0.1
                        }
                    }
                }
            ]
        },
        {
            nbSample: 500,
            elements: [
                {
                    index: 0,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDesviation: 0.1
                        }
                    }
                },
                {
                    index: 1,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 10.3,
                            standardDesviation: 0.15
                        }
                    }
                },
                {
                    index: 2,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDesviation: 0.1
                        }
                    }
                }
            ]
        }
    ]
};
// The pureElements matrix can hold any kind of signal, such as NMR or IR spectra.
var pureElements = [
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0]
];
/* The rows of the pureElements matrix are linearly combined:
 * var pureElements = [
 *  element A,
 *  element B,
 *  element C,
 *      .
 *      .
 *      .
 *  ];
 *  so each sample of the dataset is  AA = aA + bB + cC + ...
 *  and the composition matrix contains those percentages (a, b, c, ...)
 */
var dataset = generateDataset(pureElements, options);
// dataset is now an object holding the dataset, the composition matrix and the dataClass matrix,
// ready for statistical analysis and debugging
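
The linear combination described in the comment above can be sketched as follows. This only illustrates the arithmetic and is not the library's implementation; `mixSample` and the composition values are hypothetical.

```javascript
// Hypothetical helper, not part of ml-generate-dataset: combines the rows of
// pureElements, using one row of the composition matrix as the weights.
function mixSample(pureElements, composition) {
    const nbPoints = pureElements[0].length;
    const sample = new Array(nbPoints).fill(0);
    for (let i = 0; i < pureElements.length; i++) {
        for (let j = 0; j < nbPoints; j++) {
            sample[j] += composition[i] * pureElements[i][j];
        }
    }
    return sample;
}

const pureElements = [
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0]
];
// Made-up composition row: a = 9.4, b = 10.3, c = 9.4
const sample = mixSample(pureElements, [9.4, 10.3, 9.4]);
console.log(sample); // [0, 9.4, 0, 9.4, 0, 10.3, 0]
```

Each pure element contributes at the positions where its row is non-zero, scaled by its coefficient.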

License

MIT

generate-dataset's People

Contributors

Stargazers

Watchers

Forkers

jobo322

generate-dataset's Issues

format of the config file

In my opinion the config file is too complex to handle large cases. If we have a system with 1000 molecules, the file will become very big.

The variable pureElements should come first and could also be a filename (in that case the package should be able to read a CSV file with one element per line).

Then, because we don't want to fill in information for 1000 molecules, we should be able to define a default behaviour.

// the pureElements matrix could be whatever you want like NMR or IR spectra.
var pureElements = [
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0]
];

// This vector is optional. It allows tuning the composition if it is not already included in pureElements: each row of pureElements is multiplied by the corresponding element of meanComposition (min 0, max 100).
var meanComposition = [10, 15, 70]; // for 3 classes


var options = {
    keepDataClass: true,
    keepCompositionMatrix: true,
    dummyMatrix: true,
    seed: 22,
    defaultBehavior: {
        distribution: {
            name: ['normal', 'normal', 'normal'], // one for each class
            parameters: {
                standardDeviation: [0.1, 0.1, 0.2], // one for each class
                meanType: 'sd/diff/absolute', // one of 'sd', 'diff' or 'absolute'
                // 'sd': mean is defined as X times the sd
                // 'diff': mean is defined as the difference with respect to meanComposition
                // 'absolute': mean overwrites the meanComposition entry
                mean: [0, -0.1, 0.1]
            }
        }
    },
    classes: [
        {
            nbSample: 500,
            elements: [
                {
                    index: 0,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            meanType: 'sd/diff/absolute',
                            mean: 9.4, 
                            standardDeviation: 0.1
                        }
                    }
                },
                {
                    index: 1,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            meanType: 'sd/diff/absolute',
                            mean: 9.4,
                            standardDeviation: 0.1
                        }
                    }
                },
                {
                    index: 2,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            meanType: 'sd/diff/absolute',
                            mean: 9.4,
                            standardDeviation: 0.1
                        }
                    }
                }
            ]
        },
        {
            nbSample: 500,
            elements: [
                {
                    index: 0,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDeviation: 0.1
                        }
                    }
                },
                {
                    index: 1,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 10.3,
                            standardDeviation: 0.15
                        }
                    }
                },
                {
                    index: 2,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDeviation: 0.1
                        }
                    }
                }
            ]
        },
        {
            nbSample: 400,
            elements: [
                {
                    index: 0,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDeviation: 0.1
                        }
                    }
                },
                {
                    index: 1,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 11,
                            standardDeviation: 0.15
                        }
                    }
                },
                {
                    index: 2,
                    distribution: {
                        name: 'normal',
                        parameters: {
                            mean: 9.4,
                            standardDeviation: 0.1
                        }
                    }
                }
            ]
        }
    ]
};
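
To make the proposal concrete, here is a minimal sketch of how `defaultBehavior` and `meanType` could be resolved into the per-element configuration of one class. Everything here is an assumption: `resolveMean` and `buildClassElements` are hypothetical names, none of this exists in ml-generate-dataset, and the `'sd'` case follows one possible reading of "X times the sd" (an offset from the `meanComposition` entry).

```javascript
// Hypothetical sketch of the proposed meanType semantics.
function resolveMean(meanType, mean, standardDeviation, baseComposition) {
    switch (meanType) {
        case 'sd':
            // Assumed reading: shift the base composition by `mean` standard deviations.
            return baseComposition + mean * standardDeviation;
        case 'diff':
            // Offset relative to the meanComposition entry.
            return baseComposition + mean;
        case 'absolute':
            // Ignore the meanComposition entry entirely.
            return mean;
        default:
            throw new Error('Unknown meanType: ' + meanType);
    }
}

// Build the elements of one class from the defaults, so that only the
// elements that differ have to be listed explicitly in `overrides`
// (an object keyed by element index, holding full element configurations).
function buildClassElements(meanComposition, defaults, classIndex, overrides) {
    const params = defaults.distribution.parameters;
    const sd = params.standardDeviation[classIndex];
    return meanComposition.map(function (base, index) {
        const fallback = {
            index: index,
            distribution: {
                name: defaults.distribution.name[classIndex],
                parameters: {
                    mean: resolveMean(params.meanType, params.mean[classIndex], sd, base),
                    standardDeviation: sd
                }
            }
        };
        return (overrides && overrides[index]) || fallback;
    });
}

const defaults = {
    distribution: {
        name: ['normal', 'normal', 'normal'],
        parameters: {
            standardDeviation: [0.1, 0.1, 0.2],
            meanType: 'diff',
            mean: [0, -0.1, 0.1]
        }
    }
};
// Elements of the second class (classIndex 1), with no explicit overrides.
const elements = buildClassElements([10, 15, 70], defaults, 1);
console.log(elements.length); // 3
```

With a helper like this, a system with 1000 molecules would only need explicit entries for the few elements that act as markers.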

command line

I would like to be able to use the npm package from the command line, something like this:

node generate.js -dataType=nmr fromSDF=pathtosdf outputType=csv outputPrefix='blablah'
node generate.js -dataType=nmr fromCSV=pathtocsv outputType=csv outputPrefix='blablah'
node generate.js -dataType=nmr fromCSV=pathtocsv outputType=jcamp outputPrefix='blablah'

The output should contain the class vector, the composition matrix and the dataset.
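
A minimal sketch of how such `key=value` arguments could be read in Node.js. `generate.js` and the option names come from the proposal above, not from an existing CLI, and `parseArgs` is a hypothetical helper.

```javascript
// Hypothetical parser for the key=value style shown above.
function parseArgs(argv) {
    const options = {};
    argv.forEach(function (arg) {
        const separator = arg.indexOf('=');
        if (separator === -1) return; // ignore anything that is not key=value
        // Drop leading dashes so that "-dataType" and "dataType" are equivalent.
        const key = arg.slice(0, separator).replace(/^-+/, '');
        // Strip one pair of surrounding quotes, in case the shell kept them.
        options[key] = arg.slice(separator + 1).replace(/^(['"])(.*)\1$/, '$2');
    });
    return options;
}

// In a real script the input would be process.argv.slice(2).
const cliOptions = parseArgs([
    '-dataType=nmr',
    'fromCSV=pathtocsv',
    'outputType=csv',
    "outputPrefix='blablah'"
]);
console.log(cliOptions.outputType); // csv
```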

random noise

We should be able to add random noise to the rows of the compositionMatrix after the "mixing" is done. This would simulate the experimental noise present when data are acquired. The random noise is the same for all classes, since it depends only on the experiment and not on the sample preparation.
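
One possible sketch of this step, assuming the noise is Gaussian and is added to every value of the mixed rows with a single standard deviation for all classes. `gaussian` and `addNoise` are hypothetical helpers; the random generator is injectable so that a seeded one could be passed in.

```javascript
// Draw one sample from a zero-mean normal distribution with the Box-Muller
// transform. `random` must return uniform values in [0, 1), like Math.random.
function gaussian(random, standardDeviation) {
    const u1 = 1 - random(); // in (0, 1], avoids Math.log(0)
    const u2 = random();
    return standardDeviation * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Hypothetical helper: returns a noisy copy of the mixed data. The same
// standard deviation is used for every row, whatever class it belongs to.
function addNoise(rows, standardDeviation, random) {
    const rng = random || Math.random;
    return rows.map(function (row) {
        return row.map(function (value) {
            return value + gaussian(rng, standardDeviation);
        });
    });
}

// Made-up mixed rows, one per sample.
const mixed = [
    [0, 9.4, 0],
    [0, 10.3, 0]
];
const noisy = addNoise(mixed, 0.05);
console.log(noisy.length, noisy[0].length); // 2 3
```

The input rows are left untouched, so the noise-free data stay available for comparison.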

compositionMatrix

We should be able to output the composition matrix as a CSV file that can be used to multiply the pureElements matrix outside the npm package.
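
A minimal sketch of the CSV round trip, assuming a purely numeric matrix with one sample per line and no header. `matrixToCsv` and `csvToMatrix` are hypothetical helpers, not functions of the package.

```javascript
// Serialize a numeric matrix as CSV, one row per line.
function matrixToCsv(matrix) {
    return matrix.map(function (row) {
        return row.join(',');
    }).join('\n');
}

// Read the CSV back into a matrix of numbers.
function csvToMatrix(csv) {
    return csv.trim().split('\n').map(function (line) {
        return line.split(',').map(Number);
    });
}

// Made-up composition matrix: two samples, three elements.
const compositionMatrix = [
    [9.4, 9.4, 9.4],
    [9.4, 10.3, 9.4]
];
const csv = matrixToCsv(compositionMatrix);
console.log(csv);
// 9.4,9.4,9.4
// 9.4,10.3,9.4
```

The matrix read back with `csvToMatrix` can then be multiplied with any pureElements matrix in another tool.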
