Module contents

methylprep.get_manifest(raw_datasets, array_type=None, manifest_filepath=None)

Generates a Manifest instance for the array type of the given raw datasets.

Arguments:
raw_datasets {list(RawDataset)} – Collection of RawDataset instances that
require a manifest file for the related array_type.
Keyword Arguments:
array_type {ArrayType} – The type of array to process. If not provided, it
will be inferred from the number of probes in the IDAT file. (default: {None})
manifest_filepath {path-like} – Path to the manifest file. If not provided,
it will be inferred from the array_type and downloaded if necessary (default: {None})
Returns:
[Manifest] – A Manifest instance.
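A minimal sketch of typical use (the IDAT directory path is hypothetical); the raw datasets usually come from get_raw_datasets(), with the array type autodetected and the manifest downloaded if needed:

    import methylprep

    # hypothetical folder containing IDAT files and a samplesheet CSV
    sample_sheet = methylprep.get_sample_sheet('path/to/idats')
    raw_datasets = methylprep.get_raw_datasets(sample_sheet)
    manifest = methylprep.get_manifest(raw_datasets)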
methylprep.get_raw_datasets(sample_sheet, sample_name=None, from_s3=None, meta_only=False)

Generates a collection of RawDataset instances for the samples in a sample sheet.

Arguments:
sample_sheet {SampleSheet} – The SampleSheet from which the data originates.
Keyword Arguments:
sample_name {string} – Optional: one sample to process from the sample_sheet. (default: {None})
from_s3 {zip_reader} – pass in an S3ZipReader object to extract IDAT files from a zip file hosted on S3. (default: {None})
meta_only {True/False} – if True, does not read IDAT files; only parses the metadata about them. (A RawMetaDataset is the same as a RawDataset but stores no IDAT probe values, because these are not needed in the pipeline.)
Raises:
ValueError: If the number of probes between raw datasets differ.
Returns:
[RawDataset] – A list of RawDataset instances.
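For example (the directory path and sample name are hypothetical); omit sample_name to load every sample in the sheet:

    import methylprep

    sample_sheet = methylprep.get_sample_sheet('path/to/idats')
    # build RawDataset objects for one named sample only
    raw_datasets = methylprep.get_raw_datasets(sample_sheet, sample_name='Sample_1')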
methylprep.run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None, betas=False, m_value=False, make_sample_sheet=False, batch_size=None, save_uncorrected=False, meta_data_frame=True)

The main CLI processing pipeline. This does every processing step and returns a data set.

Arguments:
data_dir [required]
path where IDAT files and a samplesheet CSV can be found.
array_type [default: autodetect]
27k, 450k, EPIC, or EPIC+. If omitted, the array type is autodetected.
export [default: False]
if True, exports a CSV of the processed data for each IDAT file in the sample set.
betas
if True, saves a pickle (beta_values.pkl) of beta values for all samples
m_value
if True, saves a pickle (m_values.pkl) of m_values for all samples
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download the appropriate one for you.
sample_sheet_filepath [optional]
it will be autodetected if omitted.
sample_name [optional, list]
if you don’t want to process all samples, you can specify individual samples as a list. If sample_name is specified, batch_size is ignored (large batches must process all samples).
make_sample_sheet [optional]
if True, generates a sample sheet named ‘samplesheet.csv’ from the IDAT files, so that processing will work. From the CLI, pass in "--no_sample_sheet" to trigger sample sheet auto-generation.
batch_size [optional]
if set to any integer, samples will be processed and saved in batches no greater than the specified batch size. This will yield multiple output files in the format of “beta_values_1.pkl … beta_values_N.pkl”.
Returns:

By default, if called as a function, a list of SampleDataContainer objects is returned.

betas
if True, will return a single data frame of beta values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.
m_value
if True, will return a single data frame of m_values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.

if batch_size is set to more than 200 samples, nothing is returned but all the files are saved. You can recreate the output by loading the files.

Processing note:
The sample_sheet parser will ensure every sample has a unique name, assigning one (e.g. Sample1) if missing, or appending a number (e.g. _1) if not unique. This may cause sample sheets and processed dataframes not to match up. This will be fixed in a future version.
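A minimal sketch of calling the pipeline as a function (the directory path is hypothetical):

    import methylprep

    # returns a wide DataFrame of beta values (rows = samples, columns = probes)
    # instead of a list of SampleDataContainer objects
    betas = methylprep.run_pipeline(
        'path/to/idats',
        betas=True,
        export=True,   # also write one processed CSV per sample
    )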
methylprep.get_sample_sheet(dir_path, filepath=None)

Generates a SampleSheet instance for a given directory of processed data.

Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Keyword Arguments:
filepath {string or path-like} – path to the sample sheet file, if known; otherwise
the directory will be searched for one. (default: {None})
Returns:
[SampleSheet] – A SampleSheet instance.
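For example (paths hypothetical):

    import methylprep

    # searches dir_path for a sample sheet if filepath is not given
    sample_sheet = methylprep.get_sample_sheet('path/to/idats')

    # or point at a specific file
    sample_sheet = methylprep.get_sample_sheet(
        'path/to/idats', filepath='path/to/idats/samplesheet.csv')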
methylprep.consolidate_values_for_sheet(data_containers, postprocess_func_colname='beta_value')

Given data_containers (a list of processed SampleDataContainer objects), this transforms the results into a single dataframe of the chosen post-processing values, with probe names in rows and each sample’s values for those probes in columns.

Input:
data_containers – the output of run_pipeline(), a list of SampleDataContainer objects.
Arguments for postprocess_func_colname:
calculate_beta_value –> ‘beta_value’
calculate_m_value –> ‘m_value’
calculate_copy_number –> ‘cm_value’

note: these column names are hard-coded in pipeline.py as part of the process_all() step.
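For example, collecting m_values from the containers returned by run_pipeline() (directory path hypothetical):

    import methylprep

    data_containers = methylprep.run_pipeline('path/to/idats')
    # probes in rows, one column of m_values per sample
    m_values = methylprep.consolidate_values_for_sheet(
        data_containers, postprocess_func_colname='m_value')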

methylprep.run_series(id, path, dict_only=False, batch_size=100, clean=True, verbose=False)

Downloads the IDATs and metadata for a series, then generates one metadata dictionary and one beta-value matrix for each platform in the series.

Arguments:
id [required]
the series ID (can be a GEO or ArrayExpress ID)
path [required]
the path of the directory to download the data to. A dictionaries directory and a beta-values directory are assumed to exist for each platform (and will be created for each if not).
dict_only
if True, only downloads IDAT files and metadata and creates data dictionaries for each platform, without processing the samples
batch_size
the batch size to use when processing samples (number of samples run at a time). Defaults to 100.
clean
if True, removes intermediate processing files
verbose
if True, adds additional debugging information
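A sketch with a hypothetical GEO series ID and download directory:

    import methylprep

    # downloads IDATs and metadata for the series into 'geo_data/', then builds
    # one metadata dictionary and one beta-value matrix per platform
    methylprep.run_series('GSE12345', 'geo_data', batch_size=100, clean=True)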
methylprep.run_series_list(list_file, path, dict_only=False, batch_size=100)

Downloads the IDATs and metadata for a list of series, creating metadata dictionaries and dataframes of sample beta values.

Arguments:
list_file [required]
the name of the file containing the series to download and process. This file must be located in the directory the data is downloaded to (path). Each line of the file contains the name of one series.
path [required]
the path of the directory to download the data to. A dictionaries directory and a beta-values directory are assumed to exist for each platform (and will be created for each if not).
dict_only
if True, only downloads data and creates dictionaries for each platform
batch_size
the batch size to use when processing samples (number of samples run at a time). Defaults to 100.
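A sketch, assuming a hypothetical ‘series_list.txt’ saved inside the download directory, with one series ID per line:

    import methylprep

    # 'series_list.txt' must live inside 'geo_data/'
    methylprep.run_series_list('series_list.txt', 'geo_data',
                               dict_only=False, batch_size=100)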
methylprep.convert_miniml(geo_id, data_dir='.', merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=False)

This scans data_dir for an XML file with the geo_id in it, parses it, and saves the useful metadata to a dataframe called “sample_sheet_meta_data.pkl”. DOES NOT REQUIRE IDATs.
Arguments:
merge:
If merge==True and there is a file with ‘samplesheet’ in its name in the folder, and that sheet has GSM_IDs, that data will be merged into this samplesheet. Useful when you have IDATs and want one combined samplesheet for the dataset.
download_it:
if the MiniML file is not found in the data_dir path, it will be downloaded from the web.
extract_controls [experimental]:
if you only want to retain samples from the whole set that contain certain keywords, such as “control” or “blood”, this experimental flag rewrites the samplesheet with only the samples you want, which can then be fed into run_pipeline as named samples.
require_keyword [experimental]:
another way to filter samples from the samplesheet before passing it to the processor: if specified, this keyword must appear somewhere in a sample’s samplesheet values for the sample to be kept.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.

Also attempts to read IDAT filenames, if they exist, but won’t fail if they don’t.
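A sketch with a hypothetical GEO ID and directory:

    import methylprep

    # parses the MiniML XML for the series (downloading it if absent), saves
    # 'sample_sheet_meta_data.pkl' in data_dir, and removes IDATs not listed
    # in the filtered samplesheet
    methylprep.convert_miniml(
        'GSE12345',
        data_dir='geo_data',
        download_it=True,
        sync_idats=True,
    )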