Setting up
==========

Reference Object
----------------

Git is an important aspect of PaDRe because the source code is versioned in order to track the lifetime of experiments. Sometimes the code comes from another Python package. To keep track of this provenance, PaDRe needs a reference that points to the source, and this is provided by the reference object.

A reference object can be either a Python package or a Python file. If a Python file is given as a reference, it should be part of a git repository. This repository is created by the user and can live in any directory of the user's system. If the reference is a Python package, PaDRe obtains the information by inspecting the package and storing the package identifier together with the function that is called for execution.

The reference object is specified explicitly only when creating projects and experiments in pure code; when decorators are used, the PaDRe framework picks it up automatically and resolves the required names and file objects into references.

Creating a Project in PaDRe
---------------------------

A project contains one or more experiments that are semantically grouped together. These could be different experimental methods working towards an identical goal, or many different experiments that are parts of a larger goal. The parameters of a project are:

- Name: The name of the project.
- Description: A short description that provides information about the project.
- Reference: A reference object that specifies how the project was created and how it is handled.

Creating an experiment in PaDRe
-------------------------------

An experiment requires the following parameters to be initialized:

- Name: The name of the experiment. This should be unique.
- Description: A short description of the intention of the experiment.
- Dataset: The dataset on which the experiment will work.
- Pipeline: A workflow consisting of one or more algorithms.
- Project: The project to which this experiment belongs. The name of the project can be specified, and PaDRe automatically searches for it and groups the experiment under the specified project.
- strategy: The splitting strategy for the dataset (see the sketch after this list). The supported strategies are random; cv, for cross-validation; explicit, where the user explicitly specifies the indices for training, testing, and validation; function, where the user passes a function that returns the indices; and index, where a list of indices is passed. If no option is given, random splitting is chosen.
- preprocessing_pipeline: A preprocessing workflow for cases where an algorithm has to be applied to the dataset as a whole before the experiment. This could be, for example, computing the mean and standard deviation of a dataset, or creating an embedding, which normally should be based on the whole dataset.
- reference: A reference object to the source code being executed.
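To make the splitting strategy concrete, here is a minimal, hedged sketch of a cross-validated experiment. It reuses the imports, ``project``, ``dataset``, and pipeline helper of the class-instantiation example in the next section; the ``strategy`` keyword follows the parameter list above, so verify the exact name against your PaDRe version.

.. code-block:: python

   from pypadre.core.model.experiment import Experiment

   # Sketch only: `project`, `dataset` and the pipeline helper are assumed to
   # be set up exactly as in the class-instantiation example below; the
   # `strategy` keyword is taken from the parameter list above.
   experiment = Experiment(name='Sample CV Experiment',
                           description='Example with cross-validation splitting',
                           dataset=dataset.pop(),
                           project=project,
                           pipeline=SKLearnPipeline(pipeline_fn=create_test_pipeline_multiple_estimators),
                           strategy='cv',  # random | cv | explicit | function | index
                           reference=self.test_full_stack)
   experiment.execute()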
Single Pipeline Experiments
---------------------------

Single pipeline experiments can be created in different ways:

1. Through class instantiation

   .. code-block:: python

      from pypadre.core.model.project import Project
      from pypadre.core.model.experiment import Experiment
      # Import the metrics to register them
      from pypadre.binding.metrics import sklearn_metrics
      print(sklearn_metrics)

      # `Function`, `SKLearnPipeline` and `create_test_pipeline_multiple_estimators`
      # are assumed to be imported/defined in the surrounding test module.
      self.app.datasets.load_defaults()
      project = Project(name='Sample Project', description='Example Project',
                        creator=Function(fn=self.test_full_stack))

      _id = '_iris_dataset'
      dataset = self.app.datasets.list({'name': _id})

      experiment = Experiment(name='Sample Experiment', description='Example Experiment',
                              dataset=dataset.pop(), project=project,
                              pipeline=SKLearnPipeline(pipeline_fn=create_test_pipeline_multiple_estimators),
                              reference=self.test_full_stack)

      experiment.execute()

2. Through decorators

   .. code-block:: python

      import numpy as np
      from sklearn.datasets import load_iris

      # Import the metrics to register them
      from pypadre.binding.metrics import sklearn_metrics

      from pypadre.examples.base_example import example_app

      app = example_app()


      @app.dataset(name="iris",
                   columns=['sepal length (cm)', 'sepal width (cm)',
                            'petal length (cm)', 'petal width (cm)', 'class'],
                   target_features='class')
      def dataset():
          data = load_iris().data
          target = load_iris().target.reshape(-1, 1)
          return np.append(data, target, axis=1)


      @app.experiment(dataset=dataset, reference_git=__file__,
                      experiment_name="Iris SVC - User Defined Metrics",
                      seed=1, allow_metrics=True, project_name="Examples")
      def experiment():
          from sklearn.pipeline import Pipeline
          from sklearn.svm import SVC
          estimators = [('SVC', SVC(probability=True))]
          return Pipeline(estimators)

3. Creating an experiment via the CLI

   .. code-block:: bash

      pypadre > project create --name PROJECT_NAME
      pypadre > experiment initialize --name EXPERIMENT_NAME

   The ``experiment initialize`` command opens an editor where the user edits the code for the dataset and experiment, similar to the decorator example above.

   .. code-block:: bash

      pypadre > experiment execute --name EXPERIMENT_NAME

   Alternatively, the user can pass the path to the experiment with ``--path``.

Hyperparameter Optimization
---------------------------

1. Through parameters passed to the experiment's ``execute`` function. The parameters are passed as a dictionary whose keys are component names and whose values are inner dictionaries. Each inner dictionary maps a parameter name to an array of values to be used for hyperparameter optimization.

   .. code-block:: python

      parameter_dict = {'SVR': {'C': [0.1, 0.2]}}
      experiment.execute(parameters={'SKLearnEvaluator': {'write_results': True},
                                     'SKLearnEstimator': {'parameters': parameter_dict}})

2. Through decorators, using the ``parameters`` keyword

   .. code-block:: python

      @app.dataset(name="iris",
                   columns=['sepal length (cm)', 'sepal width (cm)',
                            'petal length (cm)', 'petal width (cm)', 'class'],
                   target_features='class')
      def dataset():
          data = load_iris().data
          target = load_iris().target.reshape(-1, 1)
          return np.append(data, target, axis=1)


      @app.parameter_map()
      def parameters():
          return {'SKLearnEstimator': {'parameters': {'SVC': {'C': [0.1, 0.5, 1.0]},
                                                      'PCA': {'n_components': [1, 2, 3]}}}}


      @app.experiment(dataset=dataset, reference_package=__file__, parameters=parameters,
                      experiment_name="Iris SVC", project_name="Examples",
                      ptype=SKLearnPipeline)
      def experiment():
          from sklearn.pipeline import Pipeline
          from sklearn.svm import SVC
          from sklearn.decomposition import PCA  # needed for the PCA step below
          estimators = [('PCA', PCA()), ('SVC', SVC(probability=True))]
          return Pipeline(estimators)

Multi-pipeline, multi-data Experiments
--------------------------------------

Currently not supported.
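As an aside on the hyperparameter dictionaries used above: the nesting is component name -> parameter name -> list of candidate values. The plain-Python sketch below uses no PaDRe APIs and only assumes that the listed values are combined as a full grid, one run per combination (the usual grid-search reading; verify this against your PaDRe version).

.. code-block:: python

   from itertools import product

   # Same shape as the decorator example above: component -> parameter -> values.
   parameter_dict = {'SVC': {'C': [0.1, 0.5, 1.0]},
                     'PCA': {'n_components': [1, 2, 3]}}

   # Flatten to (component, parameter, values) triples, then take the Cartesian
   # product of the value lists: 3 x 3 = 9 combinations here.
   triples = [(component, parameter, values)
              for component, params in parameter_dict.items()
              for parameter, values in params.items()]

   for combination in product(*(values for _, _, values in triples)):
       run = {f"{component}.{parameter}": value
              for (component, parameter, _), value in zip(triples, combination)}
       print(run)  # e.g. {'SVC.C': 0.1, 'PCA.n_components': 1}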