Module contents

methylprep.processing
methylprep.run_pipeline(data_dir[, …]) The main CLI processing pipeline.
methylprep.files.create_sample_sheet(dir_path) Creates a samplesheet.csv file from the .IDAT files of a GEO series directory
methylprep.download
methylprep.run_series(id, path[, dict_only, …]) Downloads the IDATs and metadata for a series, then generates one metadata dictionary and one beta value matrix for each platform in the series
methylprep.read_geo(filepath[, verbose, …]) Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M_values
methylprep.build_composite_dataset(…[, …]) A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria.
methylprep.models
methylprep.files
methylprep.get_manifest(raw_datasets, array_type=None, manifest_filepath=None)[source]

Return a Manifest, given a list of raw_datasets (from idats).

Arguments:
raw_datasets {list(RawDataset)} – Collection of RawDataset instances that
require a manifest file for the related array_type.
Keyword Arguments:
array_type {ArrayType} – The type of array to process. If not provided, it
will be inferred from the number of probes in the IDAT file. (default: {None})
manifest_filepath {path-like} – Path to the manifest file. If not provided,
it will be inferred from the array_type and downloaded if necessary (default: {None})
Returns:
[Manifest] – A Manifest instance.
methylprep.get_raw_datasets(sample_sheet, sample_name=None, from_s3=None, meta_only=False)[source]

Generates a collection of RawDataset instances for the samples in a sample sheet.

Arguments:
sample_sheet {SampleSheet} – The SampleSheet from which the data originates.
Keyword Arguments:
sample_name {string} – Optional: one sample to process from the sample_sheet. (default: {None})
from_s3 {zip_reader} – pass in an S3ZipReader object to extract idat files from a zipfile hosted on S3.
meta_only {True/False} – doesn’t read idat files; only parses the metadata about them. (RawMetaDataset is the same as RawDataset, but stores no idat probe values in the object, because they are not needed in the pipeline.)
Raises:
ValueError: If the number of probes between raw datasets differ.
Returns:
[list(RawDataset)] – A collection of RawDataset instances.
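
For orientation, a minimal sketch chaining these loaders together; the folder name is hypothetical, and get_sample_sheet is documented further below:

    from methylprep import get_sample_sheet, get_raw_datasets, get_manifest

    # 'GSE000_idats' is a hypothetical folder containing IDAT files and a samplesheet.csv
    sample_sheet = get_sample_sheet('GSE000_idats')
    raw_datasets = get_raw_datasets(sample_sheet)   # one RawDataset per sample in the sheet
    manifest = get_manifest(raw_datasets)           # array type inferred from probe counts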
methylprep.run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None, betas=False, m_value=False, make_sample_sheet=False, batch_size=None, save_uncorrected=False, save_control=True, meta_data_frame=True, bit='float32', poobah=False, export_poobah=False, poobah_decimals=3, poobah_sig=0.05, low_memory=True, sesame=True, quality_mask=None, **kwargs)[source]

The main CLI processing pipeline. This does every processing step and returns a data set.

Required Arguments:
data_dir [required]
path where the idat files and a samplesheet csv can be found.
Optional file and sub-sampling inputs:
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download the appropriate one for you.
sample_sheet_filepath [optional]
it will autodetect if omitted.
make_sample_sheet [optional]
if True, generates a sample sheet named ‘samplesheet.csv’ from the idat files, so that processing will work. From the CLI, pass in “--no_sample_sheet” to trigger sample sheet auto-generation.
sample_name [optional, list]
if you don’t want to process all samples, you can specify individual samples as a list. If sample_names are specified, batch sizes are ignored (large batches must process all samples).
Optional processing arguments:
sesame [default: True]
If True, applies offsets, poobah, noob, infer_channel_switch, nonlinear-dye-bias-correction, and qualityMask to imitate the output of openSesame function. If False, outputs will closely match minfi’s processing output. Prior to version 1.4.0, file processing matched minfi.
array_type [default: autodetect]
27k, 450k, EPIC, EPIC+ If omitted, this will autodetect it.
batch_size [optional]
if set to any integer, samples will be processed and saved in batches no greater than the specified batch size. This will yield multiple output files in the format of “beta_values_1.pkl … beta_values_N.pkl”.
bit [default: float32]
You can change the processed output files to one of: {float16, float32, float64}. This will make files & memory usage smaller, often with no loss in precision. However, using float16 may cause an overflow error, resulting in “inf” appearing instead of numbers, and numpy/pandas functions do not universally support float16.
low_memory [default: True]
If False, pipeline will not remove intermediate objects and data sets during processing. This provides access to probe subsets, foreground, and background probe sets in the SampleDataContainer object returned when this is run in a notebook (not CLI).
quality_mask [default: None]
If False, processing will NOT remove sesame’s list of unreliable probes. If True, it removes those probes. The default, None, defers to sesame, which defaults to True; but if set explicitly, this overrides the sesame setting.
Optional export files:
meta_data_frame [default: True]
if True, saves a file, “sample_sheet_meta_data.pkl” with samplesheet info.
export [default: False]
if True, exports a CSV of the processed data for each sample.
save_uncorrected [default: False]
if True, adds two additional columns to the processed.csv per sample (meth and unmeth), representing the raw fluorescence intensities for all probes. It does not apply NOOB correction to values in these columns.
save_control [default: True]
if True, adds all Control and SnpI type probe values to a separate pickled dataframe, with probes in rows and sample_name in the first column. These non-CpG probe names are excluded from processed data and must be stored separately.
poobah [default: False]
If specified as True, the pipeline will run Sesame’s p-value probe detection method (poobah) on samples to remove probes that fail the signal/noise ratio on their fluorescence channels. These will appear as NaNs in the resulting dataframes (beta_values.pkl or m_values.pkl). All probes, regardless of p-value cutoff, will be retained in CSVs, but there will be a ‘poobah_pval’ column in CSV files that methylcheck.load uses to exclude failed probes upon import at a later step.
poobah_sig [default: 0.05]
the p-value significance level; probes above this cutoff are excluded from output (typical range: 0.001 to 0.1).
poobah_decimals [default: 3]
The number of decimal places to round the p-value column in the processed CSV output files.
mouse probes
Mouse-specific probes will be saved if processing a mouse array.
Optional final estimators:
betas
if True, saves a pickle (beta_values.pkl) of beta values for all samples
m_value
if True, saves a pickle (m_values.pkl) of M values for all samples
Note on meth/unmeth:
if either betas or m_value is True, this will also save two additional files: ‘meth_values.pkl’ and ‘unmeth_values.pkl’ with the same dataframe structure, containing the raw, uncorrected methylated and unmethylated probe intensities for all samples. These are useful in some methylcheck functions and load/produce results 100X faster than loading from processed CSV output.
Returns:

By default, if called as a function, a list of SampleDataContainer objects is returned, with the following exceptions:

betas
if True, will return a single data frame of beta values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.
m_value
if True, will return a single data frame of M values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.

if batch_size is set to more than ~600 samples, nothing is returned but all the files are saved. You can recreate/merge output files by loading the files using methylcheck.load().

Processing notes:

The sample_sheet parser will ensure every sample has a unique name and assign one (e.g. Sample1) if missing, or append a number (e.g. _1) if not unique. This may cause sample_sheets and processed data in dataframes to not match up. Will fix in future version.

pipeline steps:

1. make a sample sheet, or read the sample sheet into a list of samples’ data
2. split large projects into batches, if necessary, and ensure unique sample names
3. read idats
4. select and read the manifest
5. put everything into SampleDataContainer class objects
6. process everything, using the pipeline steps specified

idats -> channel_swaps -> poobah -> quality_mask -> noob -> dye_bias

7. apply the final estimator function (beta, m_value, or copy number) to all data
8. export all the data into multiple files, as defined by the pipeline
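
A minimal usage sketch, assuming a folder of IDATs plus a samplesheet.csv (the folder name is hypothetical):

    from methylprep import run_pipeline

    # betas=True returns one wide DataFrame (rows are samples, columns are probes)
    # instead of a list of SampleDataContainer objects
    betas = run_pipeline(
        'GSE000_idats',   # hypothetical folder of IDATs plus a samplesheet.csv
        betas=True,
        export=True,      # also writes one processed CSV per sample
        poobah=True,      # probes failing p-value detection become NaN
    )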

methylprep.get_sample_sheet(dir_path, filepath=None)[source]

Generates a SampleSheet instance for a given directory of processed data.

Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Keyword Arguments:
filepath {string or path-like} – path of the sample sheet file, if known; otherwise
one will be searched for. (default: {None})
Returns:
[SampleSheet] – A SampleSheet instance.
methylprep.consolidate_values_for_sheet(data_containers, postprocess_func_colname='beta_value', bit='float32', poobah=True, poobah_sig=0.05)[source]

Given data_containers (a list of processed SampleDataContainer objects), this transforms the results into a single dataframe with all of the function values, with probe names in rows and per-sample values for those probes in columns.

Input:
data_containers – the output of run_pipeline(): a list of SampleDataContainer objects.
Arguments for postprocess_func_colname:
calculate_beta_value –> ‘beta_value’
calculate_m_value –> ‘m_value’
calculate_copy_number –> ‘cm_value’

note: these functions are hard-coded in pipeline.py as part of the process_all() step.
note: if run_pipeline included the ‘sesame’ option, then the quality mask is automatically applied to all pickle outputs, and saved as a column in the processed CSV.

Options:
bit (float16, float32, float64) – change the default data type from float32
to another type to save disk space. float16 works fine, but might not be compatible with all numpy/pandas functions, or with outside packages, so float32 is the default. This is specified from the methylprep process command line.
poobah
If True, filters by the poobah_pval column. (The beta and m_value estimators pass True in for this.)
data_container.quality_mask (True/False)
If a ‘quality_mask’ column is present in the dataframe, True filters those probes from the pickle output.
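
A short sketch of consolidating a finished run into one matrix, using the documented default column name (the folder name is hypothetical):

    from methylprep import run_pipeline, consolidate_values_for_sheet

    containers = run_pipeline('GSE000_idats')   # list of SampleDataContainer objects
    # collect each sample's 'beta_value' column into one probes-by-samples dataframe
    betas = consolidate_values_for_sheet(containers, postprocess_func_colname='beta_value')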
methylprep.run_series(id, path, dict_only=False, batch_size=100, clean=True, abort_if_no_idats=True, decompress=True)[source]

Downloads the IDATs and metadata for a series, then generates one metadata dictionary and one beta value matrix for each platform in the series.

Arguments:
id [required]
the series ID (can be a GEO or ArrayExpress ID)
path [required]
the path to the directory to download the data to. Assumes a dictionaries directory and a beta-values directory exist for each platform (and creates them if not).
dict_only
if True, downloads idat files and meta data and creates data dictionaries for each platform, but does not process them further.
batch_size
the batch_size to use when processing samples (number of samples run at a time). Defaults to 100.
clean
if True, removes intermediate processing files
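
A minimal sketch, with a hypothetical series ID and download folder:

    from methylprep import run_series

    # downloads the hypothetical series GSE123456 into 'geo_data' and
    # processes its samples in batches of 50
    run_series('GSE123456', 'geo_data', batch_size=50)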
methylprep.run_series_list(list_file, path, dict_only=False, batch_size=100, **kwargs)[source]

Downloads the IDATs and metadata for a list of series, creating metadata dictionaries and dataframes of sample beta_values

Arguments:
list_file [required]
the name of the file containing the list of GEO and/or ArrayExpress series IDs to download and process. This file must be located in the directory the data is downloaded to. Each line of the file should contain one data series ID.
path [required]
the path to the directory to download the data to. Assumes a dictionaries directory and a beta-values directory exist for each platform (and creates them if not).
dict_only
if True, only downloads data and creates dictionaries for each platform
batch_size
the batch_size to use when processing samples (number of samples run at a time). Defaults to 100.
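
A minimal sketch; the list file and folder names are hypothetical:

    from methylprep import run_series_list

    # 'series_list.txt' is a hypothetical file inside 'geo_data' that lists
    # one GEO or ArrayExpress series ID per line
    run_series_list('series_list.txt', 'geo_data', dict_only=True)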
methylprep.convert_miniml(geo_id, data_dir='.', merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=False, remove_tgz=False, verbose=False)[source]
This scans the data_dir for an XML file with the geo_id in it,
then parses it and saves the useful metadata to a dataframe called “sample_sheet_meta_data.pkl”. DOES NOT REQUIRE idats.
CLI version:
python -m meta_data -i GSExxxxx -d <my_folder>
Arguments:
merge:
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it:
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls [experimental]:
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword [experimental]:
another way to eliminate samples from samplesheets, before passing into the processor. if specified, this keyword needs to appear somewhere in the values of a samplesheet.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.

Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
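
A minimal sketch, with a hypothetical series ID and folder:

    from methylprep import convert_miniml

    # parses (and downloads, if absent) the MINiML file for a hypothetical GEO series,
    # keeps only control-like samples, and removes idats absent from the filtered samplesheet
    convert_miniml('GSE123456', data_dir='geo_data', extract_controls=True, sync_idats=True)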

methylprep.read_geo(filepath, verbose=False, debug=False, as_beta=True, column_pattern=None, test_only=False, rename_probe_column=True, decimals=3)[source]
Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M values in the CSV/TXT/XLSX file and turn it into a clean dataframe, with probe IDs in the index/rows.

Version 3 (introduced June 2020):

  • reads a downloaded file, either in csv, xlsx, pickle, txt
  • looks for \d_RxxCxx patterned headings and a probe index
  • sets index in df to probes
  • sets columns to sample names
  • forces probe values to be floats, if strings/mixed
  • if the filename contains ‘intensit’ or ‘signal’, this converts the values to betas and saves them; even if the filename doesn’t match, it will convert and save if columns contain ‘Methylated’
  • detects multi-line headers and adjusts dataframe columns accordingly
  • returns the usable dataframe

as_beta == True – converts meth/unmeth into a df of sample betas.
column_pattern=None (Sample21 | Sample_21 | Sample 21) – some string of characters that precedes the number part of each sample in the columns of the file to be ingested.

FIXED:

[x] handle files with .Signal_A and .Signal_B instead of Meth/Unmeth
[x] BUG: can’t parse matrix_… files if they use underscores instead of spaces around sample numbers, or where sampleXXX has no separator.
[x] handle processed files with sample_XX
[x] returns IlmnID as index/probe column, unless ‘rename_probe_column’ == False
[x] pass in sample_column names from the header parser, so that logic is in one place (makes the output much larger, so add a kwarg to exclude this)
[x] decimals (default 3) – round all probe beta/intensity/p values returned to this number of decimal places.
[x] bug: can only recognize beta samples if ‘sample’ is in the column name, or the sentrix_id pattern matches columns. Need to expand this to handle arbitrary sample naming styles (limited to one-column-per-sample patterns).
TODO:
[-] BUG: meth_unmeth_pval works with as_beta but is not returning full data yet
[-] multiline header not working with all files yet
notes:
this makes inferences based on strings in the filename, and based on the column names.
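
A minimal sketch; the filename is hypothetical:

    from methylprep import read_geo

    # 'GSE000_matrix_processed.txt' is a hypothetical downloaded GEO file; returns a
    # dataframe with probe IDs in the index and one column of betas per sample
    df = read_geo('GSE000_matrix_processed.txt', as_beta=True, verbose=True)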
methylprep.detect_header_pattern(test, filename, return_sample_column_names=False)[source]

test is a dataframe with the first 100 rows of the data set, and all columns. This puts all of the header-pattern assumptions in one place, to make them easier to read.

Recognized file patterns: betas, non-normalized, matrix_processed, matrix_signal, series_matrix, methylated_signal_intensities and unmethylated_signal_intensities, _family.

TODO: GSM12345-tbl-1.txt type files (in _family.tar.gz packages) are possible, but need more work.
TODO: combining two files with meth/unmeth values

  • numbered samples handled differently from sample_ids in columns
  • won’t detect columns with no separators in strings
methylprep.build_composite_dataset(geo_id_list, data_dir, merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=True, betas=False, m_value=False, export=False)[source]

A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria. Specifically - grab the “control” or “normal” samples from a bunch of experiments for one tissue type (e.g. “blood”), process them, and put all the resulting beta_values and/or m_values pkl files in one place, so that you can run methylize.load_both() to create a combined reference dataset for QC, analysis, or meta-analysis.

Arguments:
geo_id_list (required):
A list of GEO “GSEnnn” ids. From command line, pass these in as separate values
data_dir:
folder to save data
merge (True):
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it (True):
if miniml file not in data_dir path, it will download it from web.
extract_controls (False):
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword (None):
another way to eliminate samples from samplesheets, before passing into the processor. if specified, the “keyword” string passed in must appear somewhere in the values of a samplesheet for sample to be downloaded, processed, retained.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.
betas:
process beta_values
m_value:
process m_values
  • Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
  • removes unneeded files as it goes, but leaves the xml MINiML file and folder there as a marker if a geo dataset fails to download. So it won’t try again on resume.
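
A minimal sketch, with hypothetical GEO IDs and folder:

    from methylprep import build_composite_dataset

    # hypothetical GEO IDs; keeps only samples whose samplesheet values mention 'blood'
    # and saves beta_values pkl files for the combined set into 'composite_data'
    build_composite_dataset(['GSE111111', 'GSE222222'], 'composite_data',
                            require_keyword='blood', betas=True)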
class methylprep.Manifest(array_type, filepath_or_buffer=None, on_lambda=False, verbose=True)[source]

Bases: object

Provides an object interface to an Illumina array manifest file.

Arguments:
array_type {ArrayType} – The type of array to process. values are styled like ArrayType.ILLUMINA_27K, ArrayType.ILLUMINA_EPIC or ArrayType(‘epic’), ArrayType(‘mouse’)
Keyword Arguments:
filepath_or_buffer {file-like} – a pre-existing manifest filepath (default: {None})
Raises:
ValueError: The sample sheet is not formatted properly or a sample cannot be found.
columns
control_data_frame
data_frame
static download_default(array_type, on_lambda=False)[source]

Downloads the appropriate manifest file if one does not already exist.

Arguments:
array_type {ArrayType} – The type of array to process.
Returns:
[PurePath] – Path to the manifest file.
get_data_types()[source]
get_genome_data()[source]
get_loci_count()[source]

Returns the number of unique loci/identifiers in the manifest

get_loci_names()[source]

Returns the list of unique loci/identifiers in the manifest

get_probe_details(probe_type, channel=None)[source]

Given a probe type (I, II, SnpI, SnpII, Control) and a channel (Channel.RED | Channel.GREEN), this returns the info needed to map probes to their names (e.g. cg0031313 or rs00542420), which are NOT in the idat files.

map_to_genome(data_frame)[source]
mouse_data_frame
read_control_probes(manifest_file)[source]

Unlike other probes, control probes have no IlmnID, because they’re not locus-specific. They also use arbitrary columns, ignoring the header at the start of the manifest file.

read_mouse_probes(manifest_file)[source]

ILLUMINA_MOUSE contains unique probes whose names begin with ‘mu’ and ‘rp’ for ‘murine’ and ‘repeat’, respectively. This creates a dataframe of these probes, which are not processed like normal cg/ch probes.

read_probes(manifest_file)[source]
read_snp_probes(manifest_file)[source]

Unlike cpg and control probes, these rs probes are NOT sequential in all arrays.

static seek_to_start(manifest_file)[source]

Finds the start of the data section of the manifest; the first left-most column must be “IlmnID” for it to be found.

snp_data_frame
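
A minimal sketch of constructing a Manifest from an array-type string (per the ArrayType values documented below):

    from methylprep import Manifest, ArrayType

    manifest = Manifest(ArrayType('450k'))   # downloads the manifest file if not already present
    print(manifest.get_loci_count())         # number of unique loci in the manifest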
class methylprep.ArrayType[source]

Bases: enum.Enum

This class stores meta data about array types, such as numbers of probes of each type, and how to guess the array from probes in idat files.

CUSTOM = 'custom'
ILLUMINA_27K = '27k'
ILLUMINA_450K = '450k'
ILLUMINA_EPIC = 'epic'
ILLUMINA_EPIC_PLUS = 'epic+'
ILLUMINA_MOUSE = 'mouse'
from_probe_count = <bound method ArrayType.from_probe_count of <enum 'ArrayType'>>[source]
num_controls
num_probes

used to load normal cg+ch probes from start of manifest until this point.

num_snps
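
A small illustration of the enum’s string values:

    from methylprep import ArrayType

    # enum members can be looked up from their string values
    assert ArrayType('epic') is ArrayType.ILLUMINA_EPIC
    assert ArrayType('mouse') is ArrayType.ILLUMINA_MOUSE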

processing

class methylprep.processing.SampleDataContainer(raw_dataset, manifest, retain_uncorrected_probe_intensities=False, bit='float32', pval=False, poobah_decimals=3, poobah_sig=0.05, do_noob=True, quality_mask=True, switch_probes=True, correct_dye_bias=True, debug=False, sesame=True)[source]

Wrapper that provides easy access to slices of data for a Sample, its RawDataset, and the pre-configured MethylationDataset subsets of probes.

Arguments:

raw_dataset {RawDataset} – A sample’s RawDataset for a single well on the processed array.
manifest {Manifest} – The Manifest for the correlated RawDataset’s array type.
bit (default: float32) – option to store data as float16 or float32 to save space.
pval (default: False) – whether to apply the p-value-detection algorithm to remove unreliable probes (based on the signal/noise ratio of fluorescence), using the sesame method (pOOBah) based on out-of-band background levels.

Jan 2020: added .snp_(un)methylated property; used in postprocess.consolidate_control_snp()
Mar 2020: added p-value detection option
Mar 2020: added mouse probe post-processing separation

IG

research function to match sesame’s IG function; not used in processing. Only works if save_uncorrected=True.

II

research function to match sesame’s II function; not used in processing. Only works if save_uncorrected=True.

IR

research function to match sesame’s IR function; not used in processing. Only works if save_uncorrected=True.

preprocess()[source]

combines the methylated and unmethylated columns from the SampleDataContainer.

process_all()[source]

Runs all pre and post-processing calculations for the dataset.

process_beta_value(input_dataframe, quality_mask_probes=None)[source]

Calculate Beta value from methylation data

process_copy_number(input_dataframe)[source]

Calculate copy number value from methylation data

process_m_value(input_dataframe)[source]

Calculate M value from methylation data

raw_IG

Uncorrected type-I GREEN probes from idats. Only works if save_uncorrected=True; should match sesame SigSet.IG output before noob or any other transformations. Note that running dye_bias_correction will modify the raw idat probe means in memory (changing this).

raw_II

Uncorrected type-II probes from idats. Only works if save_uncorrected=True; should match sesame SigSet.II output before noob or any other transformations. Note that running dye_bias_correction will modify the raw idat probe means in memory (changing this).

raw_IR

Uncorrected type-I RED probes from idats. Only works if save_uncorrected=True; should match sesame SigSet.IR output before noob or any other transformations. Note that running dye_bias_correction will modify the raw idat probe means in memory (changing this).

snp_IG

used by dye-bias to copy IG ‘rs’ probes into @IG

snp_IR

used by dye-bias to copy IR ‘rs’ probes into @IR
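
A sketch of inspecting these research properties after a notebook run, assuming the properties return dataframes (the folder name is hypothetical):

    from methylprep import run_pipeline

    # save_uncorrected=True is required for the IG/II/IR and raw_* properties;
    # low_memory=False retains the intermediate probe subsets
    containers = run_pipeline('GSE000_idats', save_uncorrected=True, low_memory=False)
    sample = containers[0]
    print(sample.raw_IG.head())   # uncorrected type-I GREEN probes from the idats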

methylprep.processing.get_manifest(raw_datasets, array_type=None, manifest_filepath=None)[source]

Return a Manifest, given a list of raw_datasets (from idats).

Arguments:
raw_datasets {list(RawDataset)} – Collection of RawDataset instances that
require a manifest file for the related array_type.
Keyword Arguments:
array_type {ArrayType} – The type of array to process. If not provided, it
will be inferred from the number of probes in the IDAT file. (default: {None})
manifest_filepath {path-like} – Path to the manifest file. If not provided,
it will be inferred from the array_type and downloaded if necessary (default: {None})
Returns:
[Manifest] – A Manifest instance.
methylprep.processing.preprocess_noob(data_container, linear_dye_correction=False, offset=15)[source]

the main preprocessing function. Applies background subtraction and NOOB. Sets the data_container.methylated and unmethylated values for the sample.

methylprep.processing.run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None, betas=False, m_value=False, make_sample_sheet=False, batch_size=None, save_uncorrected=False, save_control=True, meta_data_frame=True, bit='float32', poobah=False, export_poobah=False, poobah_decimals=3, poobah_sig=0.05, low_memory=True, sesame=True, quality_mask=None, **kwargs)[source]

The main CLI processing pipeline. This does every processing step and returns a data set.

Required Arguments:
data_dir [required]
path where the idat files and a samplesheet csv can be found.
Optional file and sub-sampling inputs:
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download the appropriate one for you.
sample_sheet_filepath [optional]
it will autodetect if omitted.
make_sample_sheet [optional]
if True, generates a sample sheet named ‘samplesheet.csv’ from the idat files, so that processing will work. From the CLI, pass in “--no_sample_sheet” to trigger sample sheet auto-generation.
sample_name [optional, list]
if you don’t want to process all samples, you can specify individual samples as a list. If sample_names are specified, batch sizes are ignored (large batches must process all samples).
Optional processing arguments:
sesame [default: True]
If True, applies offsets, poobah, noob, infer_channel_switch, nonlinear-dye-bias-correction, and qualityMask to imitate the output of openSesame function. If False, outputs will closely match minfi’s processing output. Prior to version 1.4.0, file processing matched minfi.
array_type [default: autodetect]
27k, 450k, EPIC, EPIC+ If omitted, this will autodetect it.
batch_size [optional]
if set to any integer, samples will be processed and saved in batches no greater than the specified batch size. This will yield multiple output files in the format of “beta_values_1.pkl … beta_values_N.pkl”.
bit [default: float32]
You can change the processed output files to one of: {float16, float32, float64}. This will make files & memory usage smaller, often with no loss in precision. However, using float16 may cause an overflow error, resulting in “inf” appearing instead of numbers, and numpy/pandas functions do not universally support float16.
low_memory [default: True]
If False, pipeline will not remove intermediate objects and data sets during processing. This provides access to probe subsets, foreground, and background probe sets in the SampleDataContainer object returned when this is run in a notebook (not CLI).
quality_mask [default: None]
If False, processing will NOT remove sesame’s list of unreliable probes. If True, it removes those probes. The default, None, defers to sesame, which defaults to True; but if set explicitly, this overrides the sesame setting.
Optional export files:
meta_data_frame [default: True]
if True, saves a file, “sample_sheet_meta_data.pkl” with samplesheet info.
export [default: False]
if True, exports a CSV of the processed data for each sample.
save_uncorrected [default: False]
if True, adds two additional columns to the processed.csv per sample (meth and unmeth), representing the raw fluorescence intensities for all probes. It does not apply NOOB correction to values in these columns.
save_control [default: True]
if True, adds all Control and SnpI type probe values to a separate pickled dataframe, with probes in rows and sample_name in the first column. These non-CpG probe names are excluded from processed data and must be stored separately.
poobah [default: False]
If specified as True, the pipeline will run Sesame’s p-value probe detection method (poobah) on samples to remove probes that fail the signal/noise ratio on their fluorescence channels. These will appear as NaNs in the resulting dataframes (beta_values.pkl or m_values.pkl). All probes, regardless of p-value cutoff, will be retained in CSVs, but there will be a ‘poobah_pval’ column in CSV files that methylcheck.load uses to exclude failed probes upon import at a later step.
poobah_sig [default: 0.05]
the p-value significance level; probes above this cutoff are excluded from output (typical range: 0.001 to 0.1).
poobah_decimals [default: 3]
The number of decimal places to round the p-value column in the processed CSV output files.
mouse probes
Mouse-specific probes will be saved if processing a mouse array.
Optional final estimators:
betas
if True, saves a pickle (beta_values.pkl) of beta values for all samples
m_value
if True, saves a pickle (m_values.pkl) of M values for all samples
Note on meth/unmeth:
if either betas or m_value is True, this will also save two additional files: ‘meth_values.pkl’ and ‘unmeth_values.pkl’ with the same dataframe structure, containing the raw, uncorrected methylated and unmethylated probe intensities for all samples. These are useful in some methylcheck functions and load/produce results 100X faster than loading from processed CSV output.
Returns:

By default, if called as a function, a list of SampleDataContainer objects is returned, with the following exceptions:

betas
if True, will return a single data frame of beta values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.
m_value
if True, will return a single data frame of M values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.

if batch_size is set to more than ~600 samples, nothing is returned but all the files are saved. You can recreate/merge output files by loading the files using methylcheck.load().

Processing notes:

The sample_sheet parser will ensure every sample has a unique name and assign one (e.g. Sample1) if missing, or append a number (e.g. _1) if not unique. This may cause sample_sheets and processed data in dataframes to not match up. Will fix in future version.

pipeline steps:

1. make a sample sheet, or read the sample sheet into a list of samples’ data
2. split large projects into batches, if necessary, and ensure unique sample names
3. read idats
4. select and read the manifest
5. put everything into SampleDataContainer class objects
6. process everything, using the pipeline steps specified

idats -> channel_swaps -> poobah -> quality_mask -> noob -> dye_bias

7. apply the final estimator function (beta, m_value, or copy number) to all data
8. export all the data into multiple files, as defined by the pipeline

methylprep.processing.read_geo(filepath, verbose=False, debug=False, as_beta=True, column_pattern=None, test_only=False, rename_probe_column=True, decimals=3)[source]
Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M values in the CSV/TXT/XLSX file and turn it into a clean dataframe, with probe IDs in the index/rows.

Version 3 (introduced June 2020):

  • reads a downloaded file, either in csv, xlsx, pickle, txt
  • looks for \d_RxxCxx patterned headings and a probe index
  • sets index in df to probes
  • sets columns to sample names
  • forces probe values to be floats, if strings/mixed
  • if the filename contains ‘intensit’ or ‘signal’, this converts the values to betas and saves them; even if the filename doesn’t match, it will convert and save if columns contain ‘Methylated’
  • detects multi-line headers and adjusts dataframe columns accordingly
  • returns the usable dataframe

as_beta == True – converts meth/unmeth into a df of sample betas.
column_pattern=None (Sample21 | Sample_21 | Sample 21) – some string of characters that precedes the number part of each sample in the columns of the file to be ingested.

FIXED:

[x] handle files with .Signal_A and .Signal_B instead of Meth/Unmeth
[x] BUG: can’t parse matrix_… files if they use underscores instead of spaces around sample numbers, or where sampleXXX has no separator.
[x] handle processed files with sample_XX
[x] returns IlmnID as index/probe column, unless ‘rename_probe_column’ == False
[x] pass in sample_column names from the header parser, so that logic is in one place (makes the output much larger, so add a kwarg to exclude this)
[x] decimals (default 3) – round all probe beta/intensity/p values returned to this number of decimal places.
[x] bug: can only recognize beta samples if ‘sample’ is in the column name, or the sentrix_id pattern matches columns. Need to expand this to handle arbitrary sample naming styles (limited to one-column-per-sample patterns).
TODO:
[-] BUG: meth_unmeth_pval works with as_beta but is not returning full data yet
[-] multiline header not working with all files yet
notes:
this makes inferences based on strings in the filename, and based on the column names.
methylprep.processing.detect_header_pattern(test, filename, return_sample_column_names=False)[source]

test is a dataframe with the first 100 rows of the data set, and all columns. This puts all of the header-pattern assumptions in one place, to make them easier to read.

Recognized file patterns: betas, non-normalized, matrix_processed, matrix_signal, series_matrix, methylated_signal_intensities and unmethylated_signal_intensities, _family.

TODO: GSM12345-tbl-1.txt type files (in _family.tar.gz packages) are possible, but need more work.
TODO: combining two files with meth/unmeth values

  • numbered samples handled differently from sample_ids in columns
  • won’t detect columns with no separators in strings

models

class methylprep.models.ArrayType[source]

This class stores meta data about array types, such as numbers of probes of each type, and how to guess the array from probes in idat files.

num_probes

used to load normal cg+ch probes from start of manifest until this point.

class methylprep.models.Channel[source]

idat probes measure either a red or green fluorescence. This specifies which to return within idat.py: red_idat or green_idat.

class methylprep.models.ControlType[source]

An enumeration.

class methylprep.models.Probe(address, illumina_id, probe_type)[source]

this doesn’t appear to be instantiated anywhere in methylprep

class methylprep.models.ProbeAddress[source]

AddressA_ID and AddressB_ID are columns in the manifest csv that contain internal Illumina probe identifiers.

Type II probes use AddressA_ID; Type I uses both, because there are two probes, two colors.

probe intensities in .idat files are keyed to one of these ids, but processed data is always keyed to the IlmnID probe “names” – so this is used in converting between IDs. It is used to define the probe subsets in probes.py.

class methylprep.models.ProbeSubset(data_channel, probe_address, probe_channel, probe_type)[source]

used below in probes.py to define sub-sets of probes: foreground-(red|green|all), or (un)methylated probes

class methylprep.models.ProbeType[source]

probes can either be type I or type II for CpG or Snp sequences. Control probes are used for background correction in different fluorescence ranges and staining efficiency. Type I probes record EITHER a red or a green value. Type II probes record both values together. NOOB uses the red fluorescence on a green probe and vice versa to calculate background fluorescence.

class methylprep.models.Sample(data_dir, sentrix_id, sentrix_position, **addl_fields)[source]

Object representing a row in a SampleSheet file

Arguments:
data_dir {string or path-like} – Base directory of the sample sheet and associated IDAT files.
sentrix_id {string} – The slide number of the processed array.
sentrix_position {string} – The position on the processed slide.
Keyword Arguments:

addl_fields {} – Additional metadata describing the sample, including experiment subject metadata:

name (sample name, unique id), Sample_Type, Control, GSM_ID (same as sample name if using GEO public data)

and array metadata:

group, plate, pool, well
alternate_base_filename

GEO data sets use this file name convention.

get_export_filepath()[source]

Called by run_pipeline to find the folder/filename for exporting data as CSV; the CSV file doesn’t exist yet at that point.

get_file_s3(zip_reader, extension, suffix=None)[source]

replaces get_filepath, but for an S3 context. Since these files are compressed within a single zipfile in the bucket, they don’t resolve to PurePaths.

get_filepath(extension, suffix=None, verify=True)[source]

builds the filepath based on custom file extensions and suffixes during processing.

Params (verify):
tests whether file exists, either in data_dir or somewhere in recursive search path of data_dir.
Export:
uses this later to fetch the place where a file ought to be created – but doesn’t exist yet, so use verify=False.
Notes:
_suffix – used to create the <file>_processed files.
class methylprep.models.MethylationDataset(raw_dataset, manifest, probe_subsets)[source]

Wrapper for a collection of methylated or unmethylated probes and their mean intensity values, providing common functionality for the subset of probes.

Arguments:
raw_dataset {RawDataset} – A sample’s RawDataset for a single well on the processed array.
manifest {Manifest} – The Manifest for the correlated RawDataset’s array type.
probe_subsets {list(ProbeSubset)} – Collection of ProbeSubsets that correspond to the probe type (methylated or unmethylated).

note: self.methylated.data_frame ‘bg_corrected’ and ‘noob’ values will be the same under preprocess_sesame_noob, but different under minfi/legacy pre-v1.4.0 results. And this ‘noob’ will not match SampleDataContainer.dataframe, because dye-bias correction happens later in processing.

classmethod methylated(raw_dataset, manifest)[source]

convenience method that feeds in a pre-defined list of methylated CpG loci probes

classmethod snp_methylated(raw_dataset, manifest)[source]

convenience method that feeds in a pre-defined list of methylated Snp loci probes

classmethod snp_unmethylated(raw_dataset, manifest)[source]

convenience method that feeds in a pre-defined list of UNmethylated Snp loci probes

classmethod unmethylated(raw_dataset, manifest)[source]

convenience method that feeds in a pre-defined list of UNmethylated CpG loci probes

class methylprep.models.RawDataset(sample, green_idat, red_idat)[source]

Wrapper for a sample and its pair of raw IdatDataset values.

Arguments:
sample {Sample} – A Sample parsed from the sample sheet.
green_idat {IdatDataset} – The sample’s GREEN channel IdatDataset.
red_idat {IdatDataset} – The sample’s RED channel IdatDataset.
Raises:
ValueError: If the IDAT file pair have differing numbers of probes.
TypeError: If an invalid Channel is provided when parsing an IDAT file.
filter_oob_probes(channel, manifest, idat_dataset, include_rs=True)[source]

adds both channels, and rs probes

get_fg_values(manifest, channel)[source]

appears to be used only in the bg_correct part of the NOOB function

get_infer_channel_probes(manifest, debug=False)[source]

Like filter_oob_probes, but returns two dataframes (green and red channels) with meth and unmeth columns. Effectively criss-crosses the channels: the red-oob probes are appended to green, and the green-oob probes to red. Returns a dict with ‘green’ and ‘red’ channel probes.

get_oob_controls(manifest, include_rs=True)[source]

Out-of-bound controls are the mean intensity values for the channel in the opposite channel’s probes (IG oob and IR oob)

get_subset_means(probe_subset, manifest)[source]

called by get_fg_values for each of 6 probe subsets

class methylprep.models.RawMetaDataset(sample)[source]

Wrapper for a sample and meta data, without its pair of raw IdatDataset values.

Arguments:
sample {Sample} – A Sample parsed from the sample sheet.
each Sample contains (at a minimum):
data_dir=self.data_dir, sentrix_id=sentrix_id, sentrix_position=sentrix_position
methylprep.models.get_array_type(raw_datasets)[source]

provide a list of raw_datasets and it will return the array type by counting probes

files

class methylprep.files.IdatDataset(filepath_or_buffer, channel, idat_id='IDAT', idat_version=3, verbose=False, std_dev=False, nbeads=False, bit='float32')[source]

Validates and parses an Illumina IDAT file.

Arguments:
filepath_or_buffer {file-like} – the IDAT file to parse.
channel {Channel} – the fluorescent channel (Channel.RED or Channel.GREEN) that produced the IDAT dataset.
Keyword Arguments:

idat_id {string} – expected IDAT file identifier (default: {DEFAULT_IDAT_FILE_ID})
idat_version {integer} – expected IDAT version (default: {DEFAULT_IDAT_VERSION})
bit {string, default ‘float32’} – ‘float16’ will pre-normalize intensities, capping max intensity at 32127. This cuts data size in half, but will reduce precision on ~0.01% of probes. [effectively downscaling fluorescence]
Raises:
ValueError: The IDAT file has an incorrect identifier or version specifier.
read(idat_file)[source]

Reads the IDAT file and parses the appropriate sections. Joins the mean probe intensity values with their Illumina probe ID.

Arguments:
idat_file {file-like} – the IDAT file to process.
Returns:
DataFrame – mean probe intensity values indexed by Illumina ID.
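
A minimal construction sketch; the filename is hypothetical:

    from methylprep.files import IdatDataset
    from methylprep.models import Channel

    # 'GSM000_Grn.idat' is a hypothetical filename for one sample's green channel;
    # the file is validated and parsed on construction
    green_idat = IdatDataset('GSM000_Grn.idat', channel=Channel.GREEN)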
class methylprep.files.Manifest(array_type, filepath_or_buffer=None, on_lambda=False, verbose=True)[source]

Provides an object interface to an Illumina array manifest file.

Arguments:
array_type {ArrayType} – The type of array to process. values are styled like ArrayType.ILLUMINA_27K, ArrayType.ILLUMINA_EPIC or ArrayType(‘epic’), ArrayType(‘mouse’)
Keyword Arguments:
filepath_or_buffer {file-like} – a pre-existing manifest filepath (default: {None})
Raises:
ValueError: The sample sheet is not formatted properly or a sample cannot be found.
static download_default(array_type, on_lambda=False)[source]

Downloads the appropriate manifest file if one does not already exist.

Arguments:
array_type {ArrayType} – The type of array to process.
Returns:
[PurePath] – Path to the manifest file.
get_loci_count()[source]

Returns the number of unique loci/identifiers in the manifest

get_loci_names()[source]

Returns the list of unique loci/identifiers in the manifest

get_probe_details(probe_type, channel=None)[source]

Given a probe type (I, II, SnpI, SnpII, Control) and a channel (Channel.RED | Channel.GREEN), this returns the info needed to map probes to their names (e.g. cg0031313 or rs00542420), which are NOT in the idat files.

read_control_probes(manifest_file)[source]

Unlike other probes, control probes have no IlmnID, because they’re not locus-specific. They also use arbitrary columns, ignoring the header at the start of the manifest file.

read_mouse_probes(manifest_file)[source]

ILLUMINA_MOUSE contains unique probes whose names begin with ‘mu’ and ‘rp’ for ‘murine’ and ‘repeat’, respectively. This creates a dataframe of these probes, which are not processed like normal cg/ch probes.

read_snp_probes(manifest_file)[source]

Unlike cpg and control probes, these rs probes are NOT sequential in all arrays.

static seek_to_start(manifest_file)[source]

Finds the start of the data section of the manifest; the first left-most column must be “IlmnID” for it to be found.

class methylprep.files.SampleSheet(filepath_or_buffer, data_dir)[source]

Validates and parses an Illumina sample sheet file.

Arguments:
filepath_or_buffer {file-like} – the sample sheet file to parse.
data_dir {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Raises:
ValueError: The sample sheet is not formatted properly or a sample cannot be found.
build_samples()[source]

Builds Sample objects from the processed sample sheet rows.

Also added to Sample as a class method: if the idat file is not in the same folder, checks whether it exists and looks recursively for that filename, updating the data_dir for that Sample.

contains_column(column_name)[source]

helper function to determine if sample_sheet contains a specific column, such as GSM_ID. SampleSheet must already have __data_frame in it.

get_sample(sample_name)[source]

scans all samples for one matching sample_name, if provided. If no sample_name is given, returns all samples.

get_samples()[source]

Retrieves Sample objects from the processed sample sheet rows, building them if necessary.

methylprep.files.get_sample_sheet(dir_path, filepath=None)[source]

Generates a SampleSheet instance for a given directory of processed data.

Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Keyword Arguments:
filepath {string or path-like} – path of the sample sheet file, if known; otherwise
one will be searched for. (default: {None})
Returns:
[SampleSheet] – A SampleSheet instance.
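
A minimal sketch; the folder name is hypothetical, and the printed ‘name’ field assumes the sample sheet provided one:

    from methylprep.files import get_sample_sheet

    sheet = get_sample_sheet('GSE000_idats')   # hypothetical folder with a samplesheet.csv
    for sample in sheet.get_samples():         # Sample objects built from the sheet rows
        print(sample.name)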
methylprep.files.get_sample_sheet_s3(zip_reader)[source]

reads a zipfile and prefers filenames containing ‘sample_sheet’, but will test all CSVs. The zip_reader is an Amazon S3ZipReader object capable of reading the zipfile header.

methylprep.files.create_sample_sheet(dir_path, matrix_file=False, output_file='samplesheet.csv', sample_type='', sample_sub_type='')[source]

Creates a samplesheet.csv file from the .IDAT files of a GEO series directory

Arguments:

dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
matrix_file {boolean} – Whether or not a Series Matrix File should be searched for names. (default: {False})

========== | ======== | ==== | ======
parameter | required | type | effect
========== | ======== | ==== | ======
sample_type | optional | string | label all samples in the created sheet as this type (i.e. blood, saliva, tumor cells)
sample_sub_type | optional | string | further detail on the sample type, for the batch
controls | optional | list of sample_names | assign all samples in the controls list to be “control samples”, not treatment samples
========== | ======== | ==== | ======

Note:
Because sample_names are only generated from Matrix files, this method won’t let you assign controls to samples from CLI. Would require all sample names be passed in from CLI as well, a pretty messy endeavor.
Raises:
FileNotFoundError: The directory could not be found.
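
A minimal sketch; the folder name is hypothetical:

    from methylprep.files import create_sample_sheet

    # writes 'samplesheet.csv' into the folder, labeling every sample as blood;
    # matrix_file=True also searches a GEO series matrix file for sample names
    create_sample_sheet('GSE000_idats', matrix_file=True, sample_type='blood')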
methylprep.files.find_sample_sheet(dir_path)[source]

Find sample sheet file for Illumina methylation array.

Notes:
looks for csv files in {dir_path}. If more than one csv file is found, returns the one that has “sample_sheet” or ‘samplesheet’ in its name. Otherwise, raises an error.
Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Raises:
FileNotFoundError: if no CSV file is found in dir_path.
Exception: if multiple CSV files are found and none can be identified as the sample sheet.
Returns:
[string] – Path to sample sheet in base directory

geo download

methylprep.download.run_series(id, path, dict_only=False, batch_size=100, clean=True, abort_if_no_idats=True, decompress=True)[source]

Downloads the IDATs and metadata for a series, then generates one metadata dictionary and one beta value matrix for each platform in the series.

Arguments:
id [required]
the series ID (can be a GEO or ArrayExpress ID)
path [required]
the path to the directory to download the data to. Assumes a dictionaries directory and a beta-values directory exist for each platform (and creates them if not).
dict_only
if True, downloads idat files and meta data and creates data dictionaries for each platform, but does not process them further.
batch_size
the batch_size to use when processing samples (number of samples run at a time). Defaults to 100.
clean
if True, removes intermediate processing files
methylprep.download.run_series_list(list_file, path, dict_only=False, batch_size=100, **kwargs)[source]

Downloads the IDATs and metadata for a list of series, creating metadata dictionaries and dataframes of sample beta_values

Arguments:
list_file [required]
the name of the file containing the list of GEO and/or ArrayExpress series IDs to download and process. This file must be located in the directory the data is downloaded to. Each line of the file should contain one data series ID.
path [required]
the path to the directory to download the data to. Assumes a dictionaries directory and a beta-values directory exist for each platform (and creates them if not).
dict_only
if True, only downloads data and creates dictionaries for each platform
batch_size
the batch_size to use when processing samples (number of samples run at a time). Defaults to 100.
methylprep.download.convert_miniml(geo_id, data_dir='.', merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=False, remove_tgz=False, verbose=False)[source]
This scans the data_dir for an XML file with the geo_id in it,
then parses it and saves the useful metadata to a dataframe called “sample_sheet_meta_data.pkl”. DOES NOT REQUIRE idats.
CLI version:
python -m meta_data -i GSExxxxx -d <my_folder>
Arguments:
merge:
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it:
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls [experimental]:
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword [experimental]:
another way to eliminate samples from samplesheets, before passing into the processor. if specified, this keyword needs to appear somewhere in the values of a samplesheet.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.

Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.

methylprep.download.build_composite_dataset(geo_id_list, data_dir, merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=True, betas=False, m_value=False, export=False)[source]

A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria. Specifically - grab the “control” or “normal” samples from a bunch of experiments for one tissue type (e.g. “blood”), process them, and put all the resulting beta_values and/or m_values pkl files in one place, so that you can run methylize.load_both() to create a combined reference dataset for QC, analysis, or meta-analysis.

Arguments:
geo_id_list (required):
A list of GEO “GSEnnn” ids. From command line, pass these in as separate values
data_dir:
folder to save data
merge (True):
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it (True):
if miniml file not in data_dir path, it will download it from web.
extract_controls (False):
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword (None):
another way to eliminate samples from samplesheets, before passing into the processor. if specified, the “keyword” string passed in must appear somewhere in the values of a samplesheet for sample to be downloaded, processed, retained.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.
betas:
process beta_values
m_value:
process m_values
  • Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
  • removes unneeded files as it goes, but leaves the xml MINiML file and folder there as a marker if a geo dataset fails to download. So it won’t try again on resume.
methylprep.download.search(keyword, filepath='.', verbose=True)[source]
CLI/cron function to check for new datasets.
Set up as a weekly cron job. Uses a local storage file to compare with old datasets in <pattern>_meta.csv; saves the dates of each dataset from GEO, calculates any new ones as new rows, and updates the csv.
options:
pass in -k keyword
verbose (True|False) – reports to page; saves csv too
returns:
saves a CSV to disk and returns a dataframe of results
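
A minimal sketch; the keyword and folder are hypothetical:

    from methylprep.download import search

    # checks GEO for datasets matching the keyword, updates the local <keyword>_meta.csv,
    # and returns the results as a dataframe
    df = search('blood', filepath='geo_alerts')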
methylprep.download.pipeline_find_betas_any_source(**kwargs)[source]

Sets up a script to run methylprep that saves directly to path or S3. The slowest part of processing GEO datasets is downloading, so this handles that.

STEPS
  • uses methylprep alert -k <keywords> to curate a list of GEO IDs worth grabbing.
note that version 1 will only process idats. It also runs methylcheck.load on processed files, if installed.
  • downloads a zipfile, uncompresses it,
  • creates a samplesheet,
  • moves it into foxo-test-pipeline-raw for processing.
  • You get back a zipfile with all the output data.
required kwargs:
  • project_name: string, like GSE123456, to specify one GEO data set to download.
    to initialize, specify one GEO id as an input when starting the function.
    - beforehand, you can use methylprep alert to verify the data exists.
    - OR you can pass in a string of GEO IDs separated by commas (without any spaces) and it will split them.
optional kwargs:
  • function: ‘geo’ (optional, ignored; used to specify this pipeline to run from command line)
  • data_dir:
    • default is current working directory (‘.’) if omitted
    • use to specify where all files will be downloaded, processed, and finally stored, unless --cleanup=False.
    • if using AWS S3 settings below, this will be ignored.
  • verbose: False, default is minimal logging messages.
  • save_source: if True, it will retain .idat and/or -tbl-1.txt files used to generate beta_values dataframe pkl files.
It will use local disk by default, but if you want it to save to S3, provide these:
  • bucket (where downloaded files are stored)

  • efs (AWS elastic file system name, for lambda or AWS batch processing)

  • processed_bucket (where final files are saved)

  • clean: default True. If False, does not explicitly remove the tempfolder files at the end, or move files into the data_dir output filepath/folder.
    • use clean=False if you need to keep folders in the working/efs folder instead of moving them to the data_dir.
    • use clean=False when embedding this in an AWS/batch/S3 context; then use the working tempfolder path and filenames returned to copy these files into S3.

returns:
  • if a single GEO_ID, returns a dict with “error”, “filenames”, and “tempdir” keys.
  • if multiple GEO_IDs, returns a dict with “error”, “geo_ids” (nested dict), and “tempdir” keys. Uses the same tempdir for everything, so clean should be set to True.
  • “error” will be None if it worked okay.
  • “filenames” will be a list of filenames that were created as outputs (type=string)
  • “tempdir” will be the python tempfile temporary-directory object. Passing this out prevents the
    garbage collector from removing it when the function ends, so you can retrieve these files and run tempdir.cleanup() manually. Otherwise, python will remove the tempdir for you when python closes, so copy whatever you want out of it first. This makes it possible to use this function with AWS EFS (elastic file systems) as part of a lambda or aws-batch function where disk space is more limited.

NOTE: v1.3.0 does NOT support multiple GEO IDs yet.
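
A minimal sketch for a single series (hypothetical ID), using the documented return keys:

    from methylprep.download import pipeline_find_betas_any_source

    result = pipeline_find_betas_any_source(project_name='GSE123456', data_dir='.', verbose=True)
    if result['error'] is None:
        print(result['filenames'])   # output files created for this series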