API Reference

methylprep.processing
methylprep.run_pipeline(data_dir[, …]) The main CLI processing pipeline.
methylprep.files.create_sample_sheet(dir_path) Creates a samplesheet.csv file from the .IDAT files of a GEO series directory
methylprep.download
methylprep.run_series(id, path[, dict_only, …]) Downloads the IDATs and metadata for a series then generates one metadata dictionary and one beta value matrix for each platform in the series
methylprep.read_geo
methylprep.build_composite_dataset(…[, …]) A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria.
methylprep.models
methylprep.files
class methylprep.ArrayType[source]

Bases: enum.Enum

This class stores meta data about array types, such as numbers of probes of each type, and how to guess the array from probes in idat files.

CUSTOM = 'custom'
ILLUMINA_27K = '27k'
ILLUMINA_450K = '450k'
ILLUMINA_EPIC = 'epic'
ILLUMINA_EPIC_PLUS = 'epic+'
ILLUMINA_MOUSE = 'mouse'
from_probe_count (classmethod)[source]

Guesses the array type from the number of probes found in an IDAT file.
num_controls
num_probes

Used to load the normal cg+ch probes from the start of the manifest up to this point; the control probe dataframe starts after that.

num_snps
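
Example (a minimal usage sketch of the enum; the probe count passed to from_probe_count is illustrative, and the single-argument call is an assumption based on the classmethod listed above):

    from methylprep import ArrayType

    array = ArrayType('450k')      # same member as ArrayType.ILLUMINA_450K
    print(array.num_probes)        # cg+ch probe count used when reading the manifest

    # guess the platform from a probe count read out of an IDAT file
    guessed = ArrayType.from_probe_count(622399)
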
class methylprep.Manifest(array_type, filepath_or_buffer=None, on_lambda=False, verbose=True)[source]

Bases: object

Provides an object interface to an Illumina array manifest file.

Arguments:
array_type {ArrayType} – The type of array to process. values are styled like ArrayType.ILLUMINA_27K, ArrayType.ILLUMINA_EPIC or ArrayType(‘epic’), ArrayType(‘mouse’)
Keyword Arguments:
filepath_or_buffer {file-like} – a pre-existing manifest filepath (default: {None})
Raises:
ValueError: The sample sheet is not formatted properly or a sample cannot be found.
columns
control_data_frame
data_frame
static download_default(array_type, on_lambda=False)[source]

Downloads the appropriate manifest file if one does not already exist.

Arguments:
array_type {ArrayType} – The type of array to process.
Returns:
[PurePath] – Path to the manifest file.
get_data_types()[source]
get_probe_details(probe_type, channel=None)[source]

used by infer_channel_switch. Given a probe type (I, II, SnpI, SnpII, Control) and a channel (Channel.RED | Channel.GREEN), this will return info needed to map probes to their names (e.g. cg0031313 or rs00542420), which are NOT in the idat files.

mouse_data_frame
read_control_probes(manifest_file)[source]

Unlike other probes, control probes have no IlmnID because they are not locus-specific. They also use arbitrary columns, ignoring the header at the start of the manifest file.

read_mouse_probes(manifest_file)[source]

ILLUMINA_MOUSE contains unique probes whose names begin with ‘mu’ and ‘rp’ for ‘murine’ and ‘repeat’, respectively. This creates a dataframe of these probes, which are not processed like normal cg/ch probes.

read_probes(manifest_file)[source]
read_snp_probes(manifest_file)[source]

Unlike CpG and control probes, these rs probes are NOT sequential in all arrays.

static seek_to_start(manifest_file)[source]

Finds the start of the data section of the manifest; the first (left-most) column must be "IlmnID" for the start to be detected.

snp_data_frame
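
Example (a minimal sketch of loading a manifest directly; run_pipeline normally does this for you, and the array type shown is illustrative):

    from methylprep import Manifest, ArrayType

    manifest = Manifest(ArrayType('epic'))    # downloads the default EPIC manifest if it is not cached

    probes = manifest.data_frame              # cg/ch probe annotation
    controls = manifest.control_data_frame    # control probe annotation
    snps = manifest.snp_data_frame            # rs probe annotation
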
methylprep.get_sample_sheet(dir_path, filepath=None)[source]

Generates a SampleSheet instance for a given directory of processed data.

Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Keyword Arguments:
filepath {string or path-like} – path of the sample sheet file if provided; otherwise the directory will be searched for one. (default: {None})
Returns:
[SampleSheet] – A SampleSheet instance.
methylprep.parse_sample_sheet_into_idat_datasets(sample_sheet, sample_name=None, from_s3=None, meta_only=False, bit='float32')[source]

Generates a collection of IdatDatasets from samples in a sample sheet.

Arguments:
sample_sheet {SampleSheet} – The SampleSheet from which the data originates.
Keyword Arguments:
sample_name {string} – Optional: one sample to process from the sample_sheet. (default: {None})
from_s3 {zip_reader} – pass in an S3ZipReader object to extract idat files from a zipfile hosted on S3.
meta_only {True/False} – does not read idat files, only parses the metadata about them. (RawMetaDataset is the same as RawDataset but stores no idat probe values, because they are not needed in the pipeline.)
Raises:
ValueError: If the number of probes between raw datasets differ.
Returns:
[RawDatasets] – A list of idat data pairs, each a dict like {‘green_idat’: green_idat, ‘red_idat’: red_idat}
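
Example (a minimal sketch combining get_sample_sheet with this function; the folder name is illustrative):

    import methylprep

    sample_sheet = methylprep.get_sample_sheet('GSE000000_idats/')
    idat_datasets = methylprep.parse_sample_sheet_into_idat_datasets(sample_sheet)

    for pair in idat_datasets:
        green_idat, red_idat = pair['green_idat'], pair['red_idat']
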
methylprep.consolidate_values_for_sheet(data_containers, postprocess_func_colname='beta_value', bit='float32', poobah=True, poobah_sig=0.05, exclude_rs=True)[source]

Transforms results into a single dataframe with all of the function values, with probe names in rows, and sample beta values for probes in columns.

Input:
data_containers – the output of run_pipeline() is this, a list of data_containers. (a list of processed SampleDataContainer objects)
Arguments for postprocess_func_colname:
calculate_beta_value –> ‘beta_value’
calculate_m_value –> ‘m_value’
calculate_copy_number –> ‘cm_value’

Note: these functions are hard-coded in pipeline.py as part of the process_all() step.
Note: if run_pipeline included the ‘sesame’ option, then the quality mask is automatically applied to all pickle outputs and saved as a column in the processed CSV.

Options:
bit (float16, float32, float64) – change the default data type from float32 to another type to save disk space. float16 works fine, but might not be compatible with all numpy/pandas functions, or with outside packages, so float32 is the default. This is specified from the methylprep process command line.
poobah
If True, filters probes using the poobah_pval column. (The beta and m_value estimators pass True in for this.)
data_container.quality_mask (True/False)
If ‘quality_mask’ is present in df, True filters these probes from pickle output.
exclude_rs
As of v1.5.0, SigSet keeps SNP (‘rs’) probes with the other probe types (if qualityMask is False), so they need to be separated here before exporting to file.
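
Example (a minimal sketch, assuming you already have the list of SampleDataContainer objects returned by run_pipeline; paths are illustrative):

    import methylprep

    containers = methylprep.run_pipeline('data/')   # returns a list of SampleDataContainer objects
    betas = methylprep.consolidate_values_for_sheet(containers, postprocess_func_colname='beta_value')
    # betas: probe names in rows, one column of beta values per sample
    betas.to_pickle('beta_values.pkl')
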
methylprep.run_series(id, path, dict_only=False, batch_size=100, clean=True, abort_if_no_idats=True, decompress=True)[source]

Downloads the IDATs and metadata for a series then generates one metadata dictionary and one beta value matrix for each platform in the series

Arguments:
id [required]
the series ID (can be a GEO or ArrayExpress ID)
path [required]
the path to the directory to download the data to. It is assumed that a directory for dictionaries and beta values has been created for each platform (and one will be created for each if not).
dict_only
if True, downloads idat files and meta data and creates data dictionaries for each platform, but does not process them further.
batch_size
the batch_size to use when processing samples (number of samples run at a time). By default is set to the constant 100.
clean
if True, removes intermediate processing files
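
Example (a minimal sketch; the GEO series ID and download path are illustrative):

    from methylprep import run_series

    # download the IDATs and metadata for one GEO series, then process it in batches of 50 samples
    run_series('GSE000000', 'geo_data/', dict_only=False, batch_size=50)
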
methylprep.run_series_list(list_file, path, dict_only=False, batch_size=100, **kwargs)[source]

Downloads the IDATs and metadata for a list of series, creating metadata dictionaries and dataframes of sample beta_values

Arguments:
list_file [required]
the name of the file containing a list of GEO IDs and/or ArrayExpress IDs to download and process. This file must be located in the directory the data is downloaded to. Each line of the file should contain one data series ID.
path [required]
the path to the directory to download the data to. It is assumed that a directory for dictionaries and beta values has been created for each platform (and one will be created for each if not).
dict_only
if True, only downloads data and creates dictionaries for each platform
batch_size
the batch_size to use when processing samples (number of samples run at a time). By default is set to the constant 100.
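
Example (a minimal sketch, assuming a plain-text file with one series ID per line sits inside the download directory; the file and folder names are illustrative):

    from methylprep import run_series_list

    # geo_data/series_list.txt lists one GEO or ArrayExpress ID per line
    run_series_list('series_list.txt', 'geo_data/', dict_only=False, batch_size=100)
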
methylprep.convert_miniml(geo_id, data_dir='.', merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=False, remove_tgz=False, verbose=False)[source]
This scans the datadir for an xml file with the geo_id in it.
Then it parses it and saves the useful stuff to a dataframe called “sample_sheet_meta_data.pkl”. DOES NOT REQUIRE idats.
CLI version:
python -m meta_data -i GSExxxxx -d <my_folder>
Arguments:
merge:
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it:
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls [experimental]:
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword [experimental]:
another way to eliminate samples from samplesheets, before passing into the processor. if specified, this keyword needs to appear somewhere in the values of a samplesheet.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.

Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
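
Example (a minimal sketch; the GEO ID and folder are illustrative):

    from methylprep import convert_miniml

    # parse (and download, if needed) the MINiML metadata for one series and save it
    # as sample_sheet_meta_data.pkl inside data_dir; idats are not required
    convert_miniml('GSE000000', data_dir='geo_data/', download_it=True, sync_idats=False)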

methylprep.build_composite_dataset(geo_id_list, data_dir, merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=True, betas=False, m_value=False, export=False)[source]

A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria. Specifically - grab the “control” or “normal” samples from a bunch of experiments for one tissue type (e.g. “blood”), process them, and put all the resulting beta_values and/or m_values pkl files in one place, so that you can run methylize.load_both() to create a combined reference dataset for QC, analysis, or meta-analysis.

Arguments:
geo_id_list (required):
A list of GEO “GSEnnn” ids. From command line, pass these in as separate values
data_dir:
folder to save data
merge (True):
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it (True):
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls (True):
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword (None):
another way to eliminate samples from samplesheets, before passing into the processor. if specified, the “keyword” string passed in must appear somewhere in the values of a samplesheet for sample to be downloaded, processed, retained.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.
betas:
process beta_values
m_value:
process m_values
  • Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
  • removes unneeded files as it goes, but leaves the xml MINiML file and folder there as a marker if a geo dataset fails to download. So it won’t try again on resume.
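
Example (a minimal sketch of pooling control samples from several series into one set of beta values; the GEO IDs and folder are illustrative):

    from methylprep import build_composite_dataset

    build_composite_dataset(
        ['GSE000001', 'GSE000002'],
        data_dir='composite/',
        extract_controls=True,   # keep only samples whose metadata marks them as controls
        betas=True,              # save beta_values pkl files for the retained samples
    )
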
methylprep.run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None, betas=False, m_value=False, make_sample_sheet=False, batch_size=None, save_uncorrected=False, save_control=True, meta_data_frame=True, bit='float32', poobah=False, export_poobah=False, poobah_decimals=3, poobah_sig=0.05, low_memory=True, sesame=True, quality_mask=None, pneg_ecdf=False, file_format='pickle', **kwargs)[source]

The main CLI processing pipeline. This does every processing step and returns a data set.

Required Arguments:
data_dir [required]
path where idat files can be found, and a samplesheet csv.
Optional file and sub-sampling inputs:
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download the appropriate one for you.
sample_sheet_filepath [optional]
it will autodetect if omitted.
make_sample_sheet [optional]
if True, generates a sample sheet from the idat files called ‘samplesheet.csv’, so that processing will work. From the CLI, pass in "--no_sample_sheet" to trigger sample sheet auto-generation.
sample_name [optional, list]
if you don’t want to process all samples, you can specify individual samples as a list. if sample_names are specified, this will not also do batch sizes (large batches must process all samples)
Optional processing arguments:
sesame [default: True]
If True, applies offsets, poobah, noob, infer_channel_switch, nonlinear-dye-bias-correction, and qualityMask to imitate the output of openSesame function. If False, outputs will closely match minfi’s processing output. Prior to version 1.4.0, file processing matched minfi.
array_type [default: autodetect]
27k, 450k, EPIC, or EPIC+. If omitted, this will be autodetected.
batch_size [optional]
if set to any integer, samples will be processed and saved in batches no greater than the specified batch size. This will yield multiple output files in the format of “beta_values_1.pkl … beta_values_N.pkl”.
bit [default: float32]
You can change the processed output files to one of: {float16, float32, float64}. This will make files & memory usage smaller, often with no loss in precision. However, using float16 may cause an overflow error, resulting in "inf" appearing instead of numbers, and numpy/pandas functions do not universally support float16.
low_memory [default: True]
If False, pipeline will not remove intermediate objects and data sets during processing. This provides access to probe subsets, foreground, and background probe sets in the SampleDataContainer object returned when this is run in a notebook (not CLI).
quality_mask [default: None]
If False, process will NOT remove sesame’s list of unreliable probes. If True, removes probes. The default None defers to the sesame setting, which defaults to True; if explicitly set, it overrides the sesame setting.
Optional export files:
meta_data_frame [default: True]
if True, saves a file, “sample_sheet_meta_data.pkl” with samplesheet info.
export [default: False]
if True, exports a CSV of the processed data for each idat file in sample.
file_format [default: pickle; optional: parquet]
Matrix style files are faster to load and process than CSVs, and python supports two types of binary formats: pickle and parquet. Parquet is readable by other languages, so it is an option starting v1.7.0.
save_uncorrected [default: False]
if True, adds two additional columns to the processed.csv per sample (meth and unmeth), representing the raw fluorescence intensities for all probes. It does not apply NOOB correction to values in these columns.
save_control [default: True]
if True, adds all Control and SnpI type probe values to a separate pickled dataframe, with probes in rows and sample_name in the first column. These non-CpG probe names are excluded from processed data and must be stored separately.
poobah [default: False]
If specified as True, the pipeline will run Sesame’s p-value probe detection method (poobah) on samples to remove probes that fail the signal/noise ratio on their fluorescence channels. These will appear as NaNs in the resulting dataframes (beta_values.pkl or m_values.pkl). All probes, regardless of p-value cutoff, will be retained in CSVs, but there will be a ‘poobah_pval’ column in CSV files that methylcheck.load uses to exclude failed probes upon import at a later step.
poobah_sig [default: 0.05]
the p-value level of significance, above which, will exclude probes from output (typical range of 0.001 to 0.1)
poobah_decimals [default: 3]
The number of decimal places to round p-value column in the processed CSV output files.
mouse probes
Mouse-specific probes will be saved if processing a mouse array.
Optional final estimators:
betas
if True, saves a pickle (beta_values.pkl) of beta values for all samples
m_value
if True, saves a pickle (m_values.pkl) of M-values for all samples
Note on meth/unmeth:
if either betas or m_value is True, this will also save two additional files: ‘meth_values.pkl’ and ‘unmeth_values.pkl’ with the same dataframe structure, representing raw, uncorrected meth probe intensities for all samples. These are useful in some methylcheck functions and load/produce results 100X faster than loading from processed CSV output.
Returns:

By default, if called as a function, a list of SampleDataContainer objects is returned, with the following exceptions:

betas
if True, will return a single data frame of beta values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.
m_value
if True, will return a single data frame of M-values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.

if batch_size is set to more than ~600 samples, nothing is returned but all the files are saved. You can recreate/merge output files by loading the files using methylcheck.load().

Processing notes:

The sample_sheet parser will ensure every sample has a unique name and assign one (e.g. Sample1) if missing, or append a number (e.g. _1) if not unique. This may cause sample_sheets and processed data in dataframes to not match up. Will fix in future version.

pipeline steps:

1. make sample sheet or read sample sheet into a list of samples’ data
2. split large projects into batches, if necessary, and ensure unique sample names
3. read idats
4. select and read manifest
5. put everything into SampleDataContainer class objects
6. process everything, using the pipeline steps specified:

idats -> channel_swaps -> poobah -> quality_mask -> noob -> dye_bias

7. apply the final estimator function (beta, m_value, or copy number) to all data
8. export all the data into multiple files, as defined by pipeline
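
Example (a minimal sketch of a typical call, using the defaults described above; the data folder is illustrative):

    import methylprep

    # process a folder of IDATs with sesame-style corrections and poobah filtering,
    # returning one wide beta-value dataframe (rows = samples, columns = probes)
    betas = methylprep.run_pipeline(
        'data/GSE000000/',
        betas=True,
        poobah=True,
        export=True,     # also write one processed CSV per sample
    )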

methylprep.make_pipeline(data_dir='.', steps=None, exports=None, estimator='beta', **kwargs)[source]

Specify a list of processing steps for run_pipeline, then instantiate and run that pipeline.

steps:
list of processing steps [‘all’, ‘infer_channel_switch’, ‘poobah’, ‘quality_mask’, ‘noob’, ‘dye_bias’]
exports:
list of files to be saved; anything not specified is not saved; [‘all’] saves everything. [‘all’, ‘csv’, ‘poobah’, ‘meth’, ‘unmeth’, ‘noob_meth’, ‘noob_unmeth’, ‘sample_sheet_meta_data’, ‘mouse’, ‘control’]
estimator:
which final format? [beta | m_value | copy_number | None (returns containers instead)]

This feeds a Class that runs the run_pipeline function of transforms with a final estimator. It replaces all of the kwargs that are in run_pipeline() and adds a few more options:

[steps] – you can set all of these with [‘all’] or any combination of these in a list of steps:
Also note that adding "sesame=True" to kwargs will enable all of: infer_channel_switch, poobah, quality_mask, noob, dye_bias. The available steps are ‘infer_channel_switch’, ‘poobah’, ‘quality_mask’, ‘noob’, and ‘dye_bias’; specifying ‘dye_bias’ selects sesame’s nonlinear dye-bias correction, while omitting it causes NOOB to use minfi’s linear dye correction (unless NOOB is also missing).
[exports]
export=False, make_sample_sheet=False, export_poobah=False, save_uncorrected=False, save_control=False, meta_data_frame=True,
[final estimator] – default: return list of sample data containers.
betas=False, m_value=False, copy_number. You may override that by specifying `estimator` = ‘betas’ or ‘m_value’.
[how it works]

make_pipeline calls run_pipeline(), which has a **kwargs final keyword that maps many additional esoteric settings that you can define here.

These are used for more granular unit testing on methylsuite, but could allow you to change how data is processed in very fine-tuned ways.

The rest of these are additional optional kwargs you can include:

[inputs] – omitting these kwargs will assume the defaults, as shown below
data_dir, array_type=None, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None,
[processing] – omitting these kwargs will assume the defaults, as shown below
batch_size=None – if you have low RAM or >500 samples, you might need to process the batch in chunks.
bit=’float32’ – float16 or float64 are also supported, for lower or higher memory/disk usage.
low_memory=True – if True, processing deletes intermediate objects; you can keep them in the SampleDataContainer by setting this to False.
poobah_decimals=3 – decimal places used in csv file output.
poobah_sig=0.05
[logging] – how much information do you want on the screen? Default is minimal information.
verbose=False (True for more)
debug=False (True for a LOT more info)
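
Example (a minimal sketch of selecting a custom subset of steps; the folder name is illustrative):

    import methylprep

    # run channel-switch inference, NOOB, and sesame's nonlinear dye-bias correction,
    # skip poobah and the quality mask, and return beta values
    betas = methylprep.make_pipeline(
        data_dir='data/GSE000000/',
        steps=['infer_channel_switch', 'noob', 'dye_bias'],
        exports=['csv'],
        estimator='beta',
    )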

processing

class methylprep.processing.SampleDataContainer(idat_dataset_pair, manifest=None, retain_uncorrected_probe_intensities=False, bit='float32', pval=False, poobah_decimals=3, poobah_sig=0.05, do_noob=True, quality_mask=True, switch_probes=True, do_nonlinear_dye_bias=True, debug=False, sesame=True, pneg_ecdf=False, file_format='csv')[source]

Wrapper that provides easy access to red+green idat datasets, the sample, manifest, and processing params.

Arguments:

raw_dataset {RawDataset} – A sample’s RawDataset for a single well on the processed array.
manifest {Manifest} – The Manifest for the correlated RawDataset’s array type.
bit (default: float32) – option to store data as float16 or float32 to save space.
pval (default: False) – whether to apply the p-value detection algorithm to remove unreliable probes (based on the signal/noise ratio of fluorescence); uses the sesame method (pOOBah) based on out-of-band background levels.

Jan 2020: added .snp_(un)methylated property, used in postprocess.consolidate_control_snp().
Mar 2020: added p-value detection option.
Mar 2020: added mouse probe post-processing separation.
June 2020: major refactor to use SigSet, like sesame. Removed raw_dataset and methylationDataset; SigSet is now a super-class of SampleDataContainer.

export(output_path)[source]

Saves a CSV for each sample with all processing intermediate data

process_all()[source]
Runs all pre and post-processing calculations for the dataset.
Combines the SigSet methylated and unmethylated parts of SampleDataContainer, and modifies them, whilst creating self.__data_frame with noob/dye processed data.
Order:
  • poobah
  • quality_mask
  • noob (background correction)
  • build data_frame
  • nonlinear dye-bias correction
  • reduce memory/bit-depth of data
  • copy over uncorrected values
  • split out mouse probes
process_beta_value(input_dataframe, quality_mask_probes=None)[source]

Calculate Beta value from methylation data

process_copy_number(input_dataframe)[source]

Calculate copy number value from methylation data

process_m_value(input_dataframe)[source]

Calculate M value from methylation data
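
For orientation, a hedged sketch of the standard formulas these post-processing steps correspond to (the offset of 100 in the beta calculation follows the common minfi-style convention and is an assumption here, not a confirmed methylprep default):

    import numpy as np

    def beta_value(meth, unmeth, offset=100):
        # beta = M / (M + U + offset); bounded between 0 and 1
        return meth / (meth + unmeth + offset)

    def m_value(meth, unmeth):
        # M-value = log2(M / U); unbounded, better suited to statistical testing
        return np.log2(meth / unmeth)

    def copy_number(meth, unmeth):
        # copy number = log2 of total intensity (assumed convention)
        return np.log2(meth + unmeth)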

methylprep.processing.preprocess_noob(container, offset=15, pval_probes_df=None, quality_mask_df=None, nonlinear_dye_correction=True, debug=False, unit_test_oob=False)[source]

NOOB: pythonized copy of https://github.com/zwdzwd/sesame/blob/master/R/background_correction.R
  • The function takes a SigSet and returns a modified SigSet with the background subtracted.
  • Background is modelled as a normal distribution and true signal as an exponential distribution.
  • The Norm-Exp deconvolution is parameterized using Out-Of-Band (oob) probes.
  • includes snps, but not control probes yet
  • output should replace the container instead of returning debug dataframes
  • II RED and II GREEN both have data, but the manifest doesn’t have a way to track this, so the function tracks it.
  • keep IlmnID as index for meth/unmeth snps, and convert fg_green

if nonlinear_dye_correction=True, this uses a sesame method in place of minfi method, in a later step. if unit_test_oob==True, returns the intermediate data instead of updating the SigSet/SampleDataContainer.
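
A usage sketch, purely to illustrate the call signature; in normal use run_pipeline calls this for you during process_all(), and the data folder is illustrative:

    import methylprep
    from methylprep.processing import preprocess_noob

    containers = methylprep.run_pipeline('data/', low_memory=False)  # keep intermediate objects
    container = containers[0]

    # normal-exponential background correction parameterized by out-of-band (oob) probes;
    # with unit_test_oob=False the container's SigSet is updated in place
    preprocess_noob(container, offset=15, nonlinear_dye_correction=True)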

methylprep.processing.run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None, betas=False, m_value=False, make_sample_sheet=False, batch_size=None, save_uncorrected=False, save_control=True, meta_data_frame=True, bit='float32', poobah=False, export_poobah=False, poobah_decimals=3, poobah_sig=0.05, low_memory=True, sesame=True, quality_mask=None, pneg_ecdf=False, file_format='pickle', **kwargs)[source]

The main CLI processing pipeline. This does every processing step and returns a data set.

Required Arguments:
data_dir [required]
path where idat files can be found, and a samplesheet csv.
Optional file and sub-sampling inputs:
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download the appropriate one for you.
sample_sheet_filepath [optional]
it will autodetect if omitted.
make_sample_sheet [optional]
if True, generates a sample sheet from the idat files called ‘samplesheet.csv’, so that processing will work. From the CLI, pass in "--no_sample_sheet" to trigger sample sheet auto-generation.
sample_name [optional, list]
if you don’t want to process all samples, you can specify individual samples as a list. if sample_names are specified, this will not also do batch sizes (large batches must process all samples)
Optional processing arguments:
sesame [default: True]
If True, applies offsets, poobah, noob, infer_channel_switch, nonlinear-dye-bias-correction, and qualityMask to imitate the output of openSesame function. If False, outputs will closely match minfi’s processing output. Prior to version 1.4.0, file processing matched minfi.
array_type [default: autodetect]
27k, 450k, EPIC, or EPIC+. If omitted, this will be autodetected.
batch_size [optional]
if set to any integer, samples will be processed and saved in batches no greater than the specified batch size. This will yield multiple output files in the format of “beta_values_1.pkl … beta_values_N.pkl”.
bit [default: float32]
You can change the processed output files to one of: {float16, float32, float64}. This will make files & memory usage smaller, often with no loss in precision. However, using float16 may cause an overflow error, resulting in "inf" appearing instead of numbers, and numpy/pandas functions do not universally support float16.
low_memory [default: True]
If False, pipeline will not remove intermediate objects and data sets during processing. This provides access to probe subsets, foreground, and background probe sets in the SampleDataContainer object returned when this is run in a notebook (not CLI).
quality_mask [default: None]
If False, process will NOT remove sesame’s list of unreliable probes. If True, removes probes. The default None defers to the sesame setting, which defaults to True; if explicitly set, it overrides the sesame setting.
Optional export files:
meta_data_frame [default: True]
if True, saves a file, “sample_sheet_meta_data.pkl” with samplesheet info.
export [default: False]
if True, exports a CSV of the processed data for each idat file in sample.
file_format [default: pickle; optional: parquet]
Matrix style files are faster to load and process than CSVs, and python supports two types of binary formats: pickle and parquet. Parquet is readable by other languages, so it is an option starting v1.7.0.
save_uncorrected [default: False]
if True, adds two additional columns to the processed.csv per sample (meth and unmeth), representing the raw fluorescence intensities for all probes. It does not apply NOOB correction to values in these columns.
save_control [default: True]
if True, adds all Control and SnpI type probe values to a separate pickled dataframe, with probes in rows and sample_name in the first column. These non-CpG probe names are excluded from processed data and must be stored separately.
poobah [default: False]
If specified as True, the pipeline will run Sesame’s p-value probe detection method (poobah) on samples to remove probes that fail the signal/noise ratio on their fluorescence channels. These will appear as NaNs in the resulting dataframes (beta_values.pkl or m_values.pkl). All probes, regardless of p-value cutoff, will be retained in CSVs, but there will be a ‘poobah_pval’ column in CSV files that methylcheck.load uses to exclude failed probes upon import at a later step.
poobah_sig [default: 0.05]
the p-value level of significance, above which, will exclude probes from output (typical range of 0.001 to 0.1)
poobah_decimals [default: 3]
The number of decimal places to round p-value column in the processed CSV output files.
mouse probes
Mouse-specific probes will be saved if processing a mouse array.
Optional final estimators:
betas
if True, saves a pickle (beta_values.pkl) of beta values for all samples
m_value
if True, saves a pickle (m_values.pkl) of M-values for all samples
Note on meth/unmeth:
if either betas or m_value is True, this will also save two additional files: ‘meth_values.pkl’ and ‘unmeth_values.pkl’ with the same dataframe structure, representing raw, uncorrected meth probe intensities for all samples. These are useful in some methylcheck functions and load/produce results 100X faster than loading from processed CSV output.
Returns:

By default, if called as a function, a list of SampleDataContainer objects is returned, with the following exceptions:

betas
if True, will return a single data frame of beta values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.
m_value
if True, will return a single data frame of M-values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.

if batch_size is set to more than ~600 samples, nothing is returned but all the files are saved. You can recreate/merge output files by loading the files using methylcheck.load().

Processing notes:

The sample_sheet parser will ensure every sample has a unique name and assign one (e.g. Sample1) if missing, or append a number (e.g. _1) if not unique. This may cause sample_sheets and processed data in dataframes to not match up. Will fix in future version.

pipeline steps:

1. make sample sheet or read sample sheet into a list of samples’ data
2. split large projects into batches, if necessary, and ensure unique sample names
3. read idats
4. select and read manifest
5. put everything into SampleDataContainer class objects
6. process everything, using the pipeline steps specified:

idats -> channel_swaps -> poobah -> quality_mask -> noob -> dye_bias

7. apply the final estimator function (beta, m_value, or copy number) to all data
8. export all the data into multiple files, as defined by pipeline

methylprep.processing.consolidate_values_for_sheet(data_containers, postprocess_func_colname='beta_value', bit='float32', poobah=True, poobah_sig=0.05, exclude_rs=True)[source]

Transforms results into a single dataframe with all of the function values, with probe names in rows, and sample beta values for probes in columns.

Input:
data_containers – the output of run_pipeline() is this, a list of data_containers. (a list of processed SampleDataContainer objects)
Arguments for postprocess_func_colname:
calculate_beta_value –> ‘beta_value’
calculate_m_value –> ‘m_value’
calculate_copy_number –> ‘cm_value’

Note: these functions are hard-coded in pipeline.py as part of the process_all() step.
Note: if run_pipeline included the ‘sesame’ option, then the quality mask is automatically applied to all pickle outputs and saved as a column in the processed CSV.

Options:
bit (float16, float32, float64) – change the default data type from float32 to another type to save disk space. float16 works fine, but might not be compatible with all numpy/pandas functions, or with outside packages, so float32 is the default. This is specified from the methylprep process command line.
poobah
If True, filters probes using the poobah_pval column. (The beta and m_value estimators pass True in for this.)
data_container.quality_mask (True/False)
If ‘quality_mask’ is present in df, True filters these probes from pickle output.
exclude_rs
As of v1.5.0, SigSet keeps SNP (‘rs’) probes with the other probe types (if qualityMask is False), so they need to be separated here before exporting to file.

models

class methylprep.models.ArrayType[source]

This class stores meta data about array types, such as numbers of probes of each type, and how to guess the array from probes in idat files.

num_probes

Used to load normal cg+ch probes from start of manifest until this point. Then start control df.

class methylprep.models.Channel[source]

idat probes measure either a red or green fluorescence. This specifies which to return within idat.py: red_idat or green_idat.

class methylprep.models.ControlProbe(address, control_type, color, extended_type)[source]

NOT USED ANYWHERE

class methylprep.models.ControlType[source]

An enumeration.

methylprep.models.parse_sample_sheet_into_idat_datasets(sample_sheet, sample_name=None, from_s3=None, meta_only=False, bit='float32')[source]

Generates a collection of IdatDatasets from samples in a sample sheet.

Arguments:
sample_sheet {SampleSheet} – The SampleSheet from which the data originates.
Keyword Arguments:
sample_name {string} – Optional: one sample to process from the sample_sheet. (default: {None})
from_s3 {zip_reader} – pass in an S3ZipReader object to extract idat files from a zipfile hosted on S3.
meta_only {True/False} – does not read idat files, only parses the metadata about them. (RawMetaDataset is the same as RawDataset but stores no idat probe values, because they are not needed in the pipeline.)
Raises:
ValueError: If the number of probes between raw datasets differ.
Returns:
[RawDatasets] – A list of idat data pairs, each a dict like {‘green_idat’: green_idat, ‘red_idat’: red_idat}
class methylprep.models.ProbeType[source]

probes can either be type I or type II for CpG or Snp sequences. Control probes are used for background correction in different fluorescence ranges and staining efficiency. Type I probes record EITHER a red or a green value. Type II probes record both values together. NOOB uses the red fluorescence on a green probe and vice versa to calculate background fluorescence.

class methylprep.models.Sample(data_dir, sentrix_id, sentrix_position, **addl_fields)[source]

Object representing a row in a SampleSheet file

Arguments:
data_dir {string or path-like} – Base directory of the sample sheet and associated IDAT files.
sentrix_id {string} – The slide number of the processed array.
sentrix_position {string} – The position on the processed slide.
Keyword Arguments:

addl_fields {} – Additional metadata describing the sample, including experiment subject metadata:

name (sample name, unique id), Sample_Type, Control, GSM_ID (same as sample name if using GEO public data)

and array metadata:

group, plate, pool, well

alternate_base_filename

GEO data sets use this file name convention.

get_export_filepath(extension='csv')[source]

Called by run_pipeline to find the folder/filename to export data as CSV, but CSV file doesn’t exist yet.

get_file_s3(zip_reader, extension, suffix=None)[source]

replaces get_filepath, but for s3 context. Since these files are compressed within a single zipfile in the bucket, they don’t resolve to PurePaths.

get_filepath(extension, suffix=None, verify=True)[source]

builds the filepath based on custom file extensions and suffixes during processing.

Params (verify):
tests whether file exists, either in data_dir or somewhere in recursive search path of data_dir.
Export:
uses this later to fetch the place where a file ought to be created – but doesn’t exist yet, so use verify=False.
Notes:
_suffix – used to create the <file>_processed files.
class methylprep.models.SigSet(sample, green_idat, red_idat, manifest, debug=False)[source]

I’m gonna try to create a fresh methylprep “SigSet” to replace our methylationDataset and RawDataset objects, which are redundant, and even have redundant functions within them. Part of why I have been frustrated/confused by our code. Central to the SeSAMe platform is the SigSet data structure, an S4 class with slots containing signals for six different classes of probes:
  • II – Type-II probes
  • IR – Type-I Red channel probes
  • IG – Type-I Grn channel probes
  • oobG – Out-of-band Grn channel probes (matching Type-I Red channel probes in number)
  • oobR – Out-of-band Red channel probes (matching Type-I Grn channel probes in number)
  • ctrl_green, ctrl_red – control probes
  • methylated, unmethylated, snp_methylated, snp_unmethylated
  • fg_green, fg_red (opposite of oobG and oobR), AKA ibG, ibR for in-band probes

  • just tidying up how we access this stuff, and trying to stick to IlmnID everywhere because the illumina_id within IDAT files is no longer unique as a ref.
  • I checked again, and no other array breaks these rules. But sounds like Bret won’t stick to this pattern going forward with designs. So I suspect other software will break with new arrays, unless they rewrite for this too.
  • this combines every layer of objects between IdatDatasets and SampleDataContainers.
  • this avoids looping through probe subsets, instead referring to a lookup-dataframe of how these relate.
  • avoids probes.py
    probe_type is a derived label, not in manifest (I, II, SnpI, SnpII, control)
address_code = None

Example probe counts, SigSet EPIC:
  • @IG probes: 49989
  • @IR probes: 92294
  • @II probes: 724612
  • @oobG probes: 92294
  • @oobR probes: 49989
  • @ctl probes: 635
  • @pval: 866895

Example probe counts, SigSet 450k:
  • @II 350076, @IG 46298, @IR 89203
  • oobR 46298, oobG 89203
  • methylated 485512, unmethylated 485512
  • snp_methylated 65, snp_unmethylated 65
  • fg_green 396325 vs ibG 396374 (incl 40 + 9 SNPs), flattened: 442672
  • fg_red 439223 vs ibR 439279 (incl 40 + 16 SNPs), flattened: 528482

check_for_probe_loss(stage='')[source]

Debugger runs this during processing to see where mouse probes go missing or get duplicated.

detect_and_drop_duplicates()[source]

as of v1.5.0, mouse manifest includes a few probes that cause duplicate values, and breaks processing. So this removes them. About 5 probes in all.

Note: This runs during SigSet.__init__, and might fail if any of these probes are affected by inter_type_I_probe_switch(), which theoretically should never happen in mouse. But infer-probes affects the idat probe_means directly, and runs before SigSet is created in SampleDataContainer, to avoid double-reading confusion.

set_noob(red_factor)[source]

same method as update_probe_means, but simply applies a linear correction to all RED channel values

update_probe_means(noob_green, noob_red, red_factor=None)[source]

pass in two dataframes (green and red) with IlmnIDs in index and a ‘bg_corrected’ column in each.

because __init__ has created each subset as a dataframe with IlmnID in index, this matches to index. and uses decoder to parse whether ‘Meth’ or ‘Unmeth’ values get updated.

upstream: container.sigset.update_probe_means(noob_green, noob_red)

replaces ‘bg_corrected’ column with ‘noob_Meth’ or ‘noob_Unmeth’ column.

does NOT update ctrl_red or ctrl_green; these are updated within the NOOB function because they are structurally different.

class methylprep.models.RawMetaDataset(sample)[source]

Wrapper for a sample and meta data, without its pair of raw IdatDataset values.

methylprep.models.get_array_type(idat_dataset_pairs)[source]

provide a list of idat_dataset_pairs and it will return the array type, confirming probe counts match in batch.

files

class methylprep.files.IdatDataset(filepath_or_buffer, channel, idat_id='IDAT', idat_version=3, verbose=False, std_dev=False, nbeads=False, bit='float32')[source]

Validates and parses an Illumina IDAT file.

Arguments:
filepath_or_buffer {file-like} – the IDAT file to parse.
channel {Channel} – the fluorescent channel (Channel.RED or Channel.GREEN) that produced the IDAT dataset.
Keyword Arguments:

idat_id {string} – expected IDAT file identifier (default: {DEFAULT_IDAT_FILE_ID})
idat_version {integer} – expected IDAT version (default: {DEFAULT_IDAT_VERSION})
bit {string, default ‘float32’} – ‘float16’ will pre-normalize intensities, capping max intensity at 32127. This cuts data size in half, but will reduce precision on ~0.01% of probes. [effectively downscaling fluorescence]
Raises:
ValueError: The IDAT file has an incorrect identifier or version specifier.
meta(idat_file)[source]

To enable this, initialize IdatDataset with verbose=True.

read(idat_file)[source]

Reads the IDAT file and parses the appropriate sections. Joins the mean probe intensity values with their Illumina probe ID.

Arguments:
idat_file {file-like} – the IDAT file to process.
Returns:
DataFrame – mean probe intensity values indexed by Illumina ID.
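
Example (a minimal sketch of reading one IDAT file directly; the file name is illustrative, and the probe_means attribute name follows the reference to "idat probe_means" elsewhere in these docs):

    from methylprep.files import IdatDataset
    from methylprep.models import Channel

    green = IdatDataset('9247377085_R04C02_Grn.idat', channel=Channel.GREEN)
    probe_means = green.probe_means   # mean intensities indexed by Illumina ID
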
class methylprep.files.Manifest(array_type, filepath_or_buffer=None, on_lambda=False, verbose=True)[source]

Provides an object interface to an Illumina array manifest file.

Arguments:
array_type {ArrayType} – The type of array to process. values are styled like ArrayType.ILLUMINA_27K, ArrayType.ILLUMINA_EPIC or ArrayType(‘epic’), ArrayType(‘mouse’)
Keyword Arguments:
filepath_or_buffer {file-like} – a pre-existing manifest filepath (default: {None})
Raises:
ValueError: The sample sheet is not formatted properly or a sample cannot be found.
static download_default(array_type, on_lambda=False)[source]

Downloads the appropriate manifest file if one does not already exist.

Arguments:
array_type {ArrayType} – The type of array to process.
Returns:
[PurePath] – Path to the manifest file.
get_probe_details(probe_type, channel=None)[source]

used by infer_channel_switch. Given a probe type (I, II, SnpI, SnpII, Control) and a channel (Channel.RED | Channel.GREEN), this will return info needed to map probes to their names (e.g. cg0031313 or rs00542420), which are NOT in the idat files.

read_control_probes(manifest_file)[source]

Unlike other probes, control probes have no IlmnID because they are not locus-specific. They also use arbitrary columns, ignoring the header at the start of the manifest file.

read_mouse_probes(manifest_file)[source]

ILLUMINA_MOUSE contains unique probes whose names begin with ‘mu’ and ‘rp’ for ‘murine’ and ‘repeat’, respectively. This creates a dataframe of these probes, which are not processed like normal cg/ch probes.

read_snp_probes(manifest_file)[source]

Unlike CpG and control probes, these rs probes are NOT sequential in all arrays.

static seek_to_start(manifest_file)[source]

Finds the start of the data section of the manifest; the first (left-most) column must be "IlmnID" for the start to be detected.

class methylprep.files.SampleSheet(filepath_or_buffer, data_dir)[source]

Validates and parses an Illumina sample sheet file.

Arguments:
filepath_or_buffer {file-like} – the sample sheet file to parse.
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Raises:
ValueError: The sample sheet is not formatted properly or a sample cannot be found.
build_meta_data(samples=None)[source]

Takes a list of samples and returns a data_frame that can be saved as a pickle.

build_samples()[source]

Builds Sample objects from the processed sample sheet rows.

Added to Sample as a class method: if the idat file is not in the same folder, this checks that it exists, looks recursively for that filename, and updates the data_dir for that Sample.

contains_column(column_name)[source]

helper function to determine if sample_sheet contains a specific column, such as GSM_ID. SampleSheet must already have __data_frame in it.

get_sample(sample_name)[source]

scans all samples for one matching sample_name, if provided. If no sample_name, then it returns all samples.

get_samples()[source]

Retrieves Sample objects from the processed sample sheet rows, building them if necessary.

methylprep.files.get_sample_sheet(dir_path, filepath=None)[source]

Generates a SampleSheet instance for a given directory of processed data.

Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Keyword Arguments:
filepath {string or path-like} – path of the sample sheet file if provided; otherwise the directory will be searched for one. (default: {None})
Returns:
[SampleSheet] – A SampleSheet instance.
methylprep.files.get_sample_sheet_s3(zip_reader)[source]

reads a zipfile and considers all filenames containing ‘sample_sheet’, but will test any csv. The zip_reader is an Amazon S3ZipReader object capable of reading the zipfile header.

methylprep.files.create_sample_sheet(dir_path, matrix_file=False, output_file='samplesheet.csv', sample_type='', sample_sub_type='')[source]

Creates a samplesheet.csv file from the .IDAT files of a GEO series directory

Arguments:

dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
matrix_file {boolean} – Whether or not a Series Matrix File should be searched for names. (default: {False})

parameter       | required | type                 | effect
sample_type     | optional | string               | label all samples in the created sheet as this type (i.e. blood, saliva, tumor cells)
sample_sub_type | optional | string               | further detail sample type for batch
controls        | optional | list of sample_names | assign all samples in the controls list to be "control samples", not treatment samples

Note:
Because sample_names are only generated from Matrix files, this method won’t let you assign controls to samples from CLI. Would require all sample names be passed in from CLI as well, a pretty messy endeavor.
Raises:
FileNotFoundError: The directory could not be found.
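
Example (a minimal sketch; the folder name and sample_type are illustrative):

    from methylprep.files import create_sample_sheet

    # scan a folder of GEO idats and write samplesheet.csv alongside them
    create_sample_sheet('GSE000000_idats/', matrix_file=False, sample_type='blood')
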
methylprep.files.find_sample_sheet(dir_path, return_all=False)[source]

Find sample sheet file for Illumina methylation array.

Notes:
looks for csv files in {dir_path}. If more than one csv file found, returns the one that has “sample_sheet” or ‘samplesheet’ in its name. Otherwise, raises error.
Arguments:

dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
return_all – if True, returns a list of paths to samplesheets, if multiple are present, instead of raising an error.
Raises:
FileNotFoundError: [description] Exception: [description]
Returns:
[string] – Path to sample sheet in base directory

geo download

methylprep.download.run_series(id, path, dict_only=False, batch_size=100, clean=True, abort_if_no_idats=True, decompress=True)[source]

Downloads the IDATs and metadata for a series then generates one metadata dictionary and one beta value matrix for each platform in the series

Arguments:
id [required]
the series ID (can be a GEO or ArrayExpress ID)
path [required]
the path to the directory to download the data to. It is assumed that a directory for dictionaries and beta values has been created for each platform (and one will be created for each if not).
dict_only
if True, downloads idat files and meta data and creates data dictionaries for each platform, but does not process them further.
batch_size
the batch_size to use when processing samples (number of samples run at a time). By default is set to the constant 100.
clean
if True, removes intermediate processing files
methylprep.download.run_series_list(list_file, path, dict_only=False, batch_size=100, **kwargs)[source]

Downloads the IDATs and metadata for a list of series, creating metadata dictionaries and dataframes of sample beta_values

Arguments:
list_file [required]
the name of the file containing a list of GEO IDs and/or ArrayExpress IDs to download and process. This file must be located in the directory the data is downloaded to. Each line of the file should contain one data series ID.
path [required]
the path to the directory to download the data to. It is assumed that a directory for dictionaries and beta values has been created for each platform (and one will be created for each if not).
dict_only
if True, only downloads data and creates dictionaries for each platform
batch_size
the batch_size to use when processing samples (number of samples run at a time). By default is set to the constant 100.
methylprep.download.convert_miniml(geo_id, data_dir='.', merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=False, remove_tgz=False, verbose=False)[source]
This scans the datadir for an xml file with the geo_id in it.
Then it parses it and saves the useful stuff to a dataframe called “sample_sheet_meta_data.pkl”. DOES NOT REQUIRE idats.
CLI version:
python -m meta_data -i GSExxxxx -d <my_folder>
Arguments:
merge:
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it:
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls [experimental]:
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword [experimental]:
another way to eliminate samples from samplesheets, before passing into the processor. if specified, this keyword needs to appear somewhere in the values of a samplesheet.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.

Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.

methylprep.download.build_composite_dataset(geo_id_list, data_dir, merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=True, betas=False, m_value=False, export=False)[source]

A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria. Specifically - grab the “control” or “normal” samples from a bunch of experiments for one tissue type (e.g. “blood”), process them, and put all the resulting beta_values and/or m_values pkl files in one place, so that you can run methylize.load_both() to create a combined reference dataset for QC, analysis, or meta-analysis.

Arguments:
geo_id_list (required):
A list of GEO “GSEnnn” ids. From command line, pass these in as separate values
data_dir:
folder to save data
merge (True):
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it (True):
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls (True):
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword (None):
another way to eliminate samples from samplesheets, before passing into the processor. if specified, the “keyword” string passed in must appear somewhere in the values of a samplesheet for sample to be downloaded, processed, retained.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.
betas:
process beta_values
m_value:
process m_values
  • Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
  • removes unneeded files as it goes, but leaves the xml MINiML file and folder there as a marker if a geo dataset fails to download. So it won’t try again on resume.
methylprep.download.search(keyword, filepath='.', verbose=True)[source]
CLI/cron function to check for new datasets.
Set it up as a weekly cron job. It uses a local storage file to compare with old datasets in <pattern>_meta.csv, saves the dates of each dataset from GEO, calculates any new ones as new rows, and updates the csv.
options:
pass in -k keyword. verbose (True|False) – reports to page; saves csv too
returns:
saves a CSV to disk and returns a dataframe of results
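
Example (a minimal sketch; the keyword and folder are illustrative):

    from methylprep.download import search

    # look for GEO methylation datasets matching a keyword and update <keyword>_meta.csv
    results = search('glioma', filepath='geo_alerts/', verbose=True)
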
methylprep.download.pipeline_find_betas_any_source(**kwargs)[source]

beta_bake: Sets up a script to run methylprep that saves directly to path or S3. The slowest part of processing GEO datasets is downloading, so this handles that.

STEPS
  • uses methylprep alert -k <keywords> to curate a list of GEO IDs worth grabbing.
    note that version 1 will only process idats. also runs methylcheck.load on processed files, if installed.
  • downloads a zipfile, uncompresses it,
  • creates a samplesheet,
  • moves it into foxo-test-pipeline-raw for processing.
  • You get back a zipfile with all the output data.
required kwargs:
  • project_name: string, like GSE123456, to specify one GEO data set to download.
    To initialize, specify one GEO id as an input when starting the function. Beforehand, you can use methylprep alert to verify the data exists. OR you can pass in a string of GEO_IDs separated by commas (without any spaces) and it will split them.
optional kwargs:
  • function: ‘geo’ (optional, ignored; used to specify this pipeline to run from command line)
  • data_dir:
    • default is current working directory (‘.’) if omitted
    • use to specify where all files will be downloaded, processed, and finally stored, unless --cleanup=False.
    • if using AWS S3 settings below, this will be ignored.
  • verbose: False, default is minimal logging messages.
  • save_source: if True, it will retain .idat and/or -tbl-1.txt files used to generate beta_values dataframe pkl files.
  • compress: if True, it will package everything together in a {geo_id}.zip file, or use gzip if files are too big for zip.
    • default is False
  • clean: If True, removes files from the folder, except the compressed output zip file. (Requires compress to be True too.)
It will use local disk by default, but if you want it to run in AWS batch + efs provide these:
  • efs (AWS elastic file system name, for lambda or AWS batch processing)

  • clean: default True. If False, does not explicitly remove the tempfolder files at the end, or move files into the data_dir output filepath/folder.
    • set this to False if you need to keep folders in the working/efs folder instead of moving them to the data_dir.
    • use cleanup=False when embedding this in an AWS/batch/S3 context, then use the working tempfolder path and filenames returned to copy these files into S3.

returns:
  • if a single GEO_ID, returns a dict with “error”, “filenames”, and “tempdir” keys.
  • if multiple GEO_IDs, returns a dict with “error”, “geo_ids” (nested dict), and “tempdir” keys. Uses same tempdir for everything, so clean should be set to True.
  • “error” will be None if it worked okay.
  • “filenames” will be a list of filenames that were created as outputs (type=string)
  • “tempdir” will be the python tempfile temporary-directory object. Passing this out prevents the
    garbage collector from removing it when the function ends, so you can retrieve these files and run tempdir.cleanup() manually. Otherwise, python will remove the tempdir for you when python closes, so copy whatever you want out of it first. This makes it possible to use this function with AWS EFS (elastic file systems) as part of a lambda or aws-batch function where disk space is more limited.

NOTE: v1.3.0 does NOT support multiple GEO IDs yet.
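
Example (a minimal sketch using only the kwargs documented above; the GEO ID and folder are illustrative):

    from methylprep.download import pipeline_find_betas_any_source

    result = pipeline_find_betas_any_source(
        project_name='GSE000000',
        data_dir='geo_data/',
        verbose=True,
        compress=True,
    )
    print(result['error'], result['filenames'])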