Troubleshooting, Tips, and FAQs¶
Loading Processed Data¶
If you have a small data set (under 200 samples), and did not use the batch_size option:
import pandas as pd
data = pd.read_pickle('{path_to_file}/beta_values.pkl')
Otherwise, loading hundreds of samples can be slow. But there’s a helper function that loads a bunch of smaller
batches into one dataframe, if they were processed with methylprep’s batch_size option:
import methylize
data = methylize.load(path_to_file)
That will load 1600 samples in 8 seconds, compared to taking minutes if you read_pickle them and concat them using pandas.
Processing Notes¶
- DataFrames are the format that methylcheck/methylize functions expect. For saving and loading, the probe names belong in columns and the sample names belong in the index. But when processing data, some functions auto-transpose this to decrease processing time.
- By default,
methylprep processalso creates a file calledsample_sheet_meta_data.pklfrom various data sources- if there is a GEO series MINiML xml file, it reads this data preferentially
- if there is a samplesheet, it will convert this to a meta DataFrame
- if these sources are missing required data, such as Sentrix_Position and Sentrix_ID, it will look for idat files and read these from the filenames.
idat filenames¶
Some GEO datasets have unconventionally named IDATs, and cannot be processed without renaming them first.
- There are two acceptable formats:
<GSM_ID>_<Sentrix_ID>_<Sentrix_Position>_<Red|Grn>.idat<.gz><Sentrix_ID>_<Sentrix_Position>_<Red|Grn>.idat<.gz>- Methylprep will convert
.gzfiles to.idatuncompressed files when processing.
- methylprep does not recognize the older 27k array filename format:
<GSM_ID>_<Sentrix_ID>_<Sentrix_Position>_<SOME_LETTER>.idat
Why didn’t the methylprep download function work for GEO dataset GSEnnn?¶
A significant number of GEO datasets do not store their data in a consistent format. Here are some reasons a GEO dataset fails to download:
idatfilename format is off (missing R00C00 position) Thedownloadfunction ONLY supports.idatfiles.- no raw idats in zip, only processed data
- The meta data in MINiML format xml file is incomplete. (There are ways to make it work without meta data)
- the NIH GEO FTP server is down (yes, we’ve experienced this whilst testing too)
idatfiles in dataset have varying number of probes. If a dataset combines results from two array types (EPIC and 450k), it can sometimes split the data into two sample sheets and you can process each folder separately.methylprepusually resolves this for you by moving each batch of files into separate folders by array, so you can process them separately.
In most cases where download fails, there will be processed data available in some other format.
Use methylprep beta_bake to download and convert this data instead.
In other cases where beta_bake also fails, we’ve made an effort to giving you clear error messages, detailing why. IF you detect a pattern that could be parsed with our code, we’re happy to support additional
file formats for methylation data found on GEO.
What if methylprep download and methylprep beta_bake fails. How do I use the data anyway?¶
- Download the raw idat zipfile manually to a folder.
- Uncompress it.
- Confirm that there are
.idatfiles present. - If the files end in
.idat.gz, gunzip them first. (In a mac/linux bash window you can navigate to the folder and typegunzip *to uncompress all those files. On a PC, use some software like7ziporwinzip.) - IF the filenames don’t include Sentrix IDs and Sentrix array positions, like
<GSM ID>_<some long number>_<R01C01>_<Red or Grn>.idat, you’ll need to manually edit the samplesheet to match the files. - Run
methylprep processon the folder, possibly with the--no_sample_sheetoption. It should work, but you won’t have any of the sample meta data bundled for you for analysis with datasets in nonstandard formats.
How to process only part of a GEO dataset¶
The methylprep meta_data command line interface (CLI) option allows you to specify which samples to process using keyword matching against the GEO meta data. You can use this before methylprep download and methylprep process to create smaller data sets, faster.
You can also specify sample names individually from a large set like this: methylprep process --sample_name Bob -n Suzy -n Doug. This will reduce processing time.
Examples¶
(1) Grab a samplesheet for a GEO data set¶
python -m methylprep meta_data -i GSE125105
Yields three files on local disk:
- GSE125105_family.xml (the original GEO meta data file in MINiML format)
- GSE125105_GPL13534_samplesheet.csv (used for processing)
- GSE125105_GPL13534_meta_data.pkl (used in analysis to describe samples)
| GSM_ID | Sample_Name | Sentrix_ID | Sentrix_Position | source | diagnosis | age | Sex | tissue | cellcount-cd8t | cellcount-cd4t | cellcount-nk | cellcount-bcell | cellcount-mono | cellcount-gran | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM3562834 | genomic DNA from sample291 | 3999840035 | R01C01 | control_whole blood | control | 73 | F | whole blood | 0.07679 | 0.09099 | 0.06041 | 0.08542 | 0.09072 | 0.62266 | whole blood control sample |
| GSM3562835 | genomic DNA from sample612 | 3999840035 | R01C02 | case_whole blood | case | 32 | M | whole blood | 0.05544 | 0.07946 | 0.0159 | 0.09557 | 0.05515 | 0.72663 | whole blood case sample |
| GSM3562836 | genomic DNA from sample611 | 3999840035 | R02C01 | case_whole blood | case | 51 | F | whole blood | 0.08279 | 0.22216 | 0.03107 | 0.0769 | 0.07915 | 0.54165 | whole blood case sample |
| GSM3562837 | genomic DNA from sample375 | 3999840035 | R02C02 | case_whole blood | case | 30 | M | whole blood | 0.03779 | 0.07368 | 0.00385 | 0.07548 | 0.0891 | 0.74809 | whole blood case sample |
(2) Grab just the samplesheet and create a csv and dataframe pkl of it.¶
python -m methylprep -v meta_data -i GSE84727 -d GSE84727
Verbose CLI output
INFO:methylprep.download.miniml:Downloading GSE84727_family.xml
INFO:methylprep.download.miniml:Downloaded GSE84727_family.xml
INFO:methylprep.download.miniml:Unpacking GSE84727_family.xml
INFO:methylprep.download.miniml:Downloaded and unpacked GSE84727
INFO:methylprep.download.miniml:MINiML file does not provide `methylprep_name` (sentrix_id_R00C00) for 847/847 samples.
INFO:methylprep.download.miniml:dropped Sentrix_ID (empty)
INFO:methylprep.download.miniml:dropped Sentrix_Position (empty)
INFO:methylprep.download.miniml:source == description; dropping source
INFO:methylprep.download.miniml:dropped `platform` ({'GPL13534'})
INFO:methylprep.download.miniml:title == Sample_Name; dropping title
INFO:methylprep.download.miniml:Final samplesheet contains 847 rows and 7 columns
INFO:methylprep.download.miniml:['GSM_ID', 'Sample_Name', 'sentrixids', 'Sex', 'age', 'disease_status', 'description']
Resulting samplesheet
| GSM_ID | Sample_Name | sentrixids | Sex | age | disease_status | description |
|---|---|---|---|---|---|---|
| GSM2250273 | 33262604 | 3998567027_R01C01 | M | 47.3 | 1 | control blood |
| GSM2250274 | 33261623 | 3998567027_R02C01 | M | 60.4 | 1 | control blood |
| GSM2250275 | 33262614 | 3998567027_R04C01 | M | 30.1 | 1 | control blood |
You’ll notice that one column sentrixids is misnamed. It should be split into Sentrix_Position and Sentrix_ID columns for processing to work on this GEO series.
You can edit the csv and fix that prior to running the pipeline with methylprep process. If you don’t, you’ll get an error saying, “methylprep could not find the samplesheet.” This error is caused by the researcher putting an arbitrary name into the GSE84727_family.xml MINiML meta data file:
<Characteristics tag="sentrixids">
3998567027_R02C02
</Characteristics>
If you use the methylprep download option by itself, it can generally avoid this type of XML parsing error, but it will download everything. Doing analysis on just part of a dataset requires some debugging like this.
(3) Samplesheet with only “normal” samples¶
python -m methylprep -v meta_data -i GSE52270 -d GSE52270 -k normal
-k is shorthand for --keyword. The resulting sample sheet only includes samples that include the keyword normal
| GSM_ID | Sample_Name | source | disease state | description |
|---|---|---|---|---|
| GSM1278809 | Colon 61 | Large Intestine | normal | 1011N |
| GSM1278812 | Colon 64 | Large Intestine | normal | 1082N |
| GSM1278823 | Colon 75 | Large Intestine | normal | 1184N |
| GSM1278825 | White matter 77 | Central Nervous System | normal | 12_03 |
(4) Generate filtered samplesheet with only control samples from blood¶
python -m methylprep meta_data -i GSE125105 -d GSE125105 --control -k blood
| GSM_ID | Sample_Name | Sentrix_ID | Sentrix_Position | source | diagnosis | age | Sex | tissue | cellcount-cd8t | cellcount-cd4t | cellcount-nk | cellcount-bcell | cellcount-mono | cellcount-gran | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM3562834 | genomic DNA from sample291 | 3999840035 | R01C01 | control_whole blood | control | 73 | F | whole blood | 0.07679 | 0.09099 | 0.06041 | 0.08542 | 0.09072 | 0.62266 | whole blood control sample |
| GSM3562839 | genomic DNA from sample176 | 3999840035 | R03C02 | control_whole blood | control | 43 | M | whole blood | 0.06946 | 0.12989 | 0.04703 | 0.09808 | 0.14105 | 0.54662 | whole blood control sample |
| GSM3562842 | genomic DNA from sample161 | 3999840035 | R05C01 | control_whole blood | control | 44 | F | whole blood | 0.10986 | 0.13565 | 0.07657 | 0.09125 | 0.12521 | 0.49317 | whole blood control sample |
| GSM3562846 | genomic DNA from sample270 | 3999840037 | R01C01 | control_whole blood | control | 64 | M | whole blood | 0.11508 | 0.14116 | 0.0679 | 0.09415 | 0.15311 | 0.47829 | whole blood control sample |
| GSM3562855 | genomic DNA from sample162 | 3999840037 | R06C01 | control_whole blood | control | 65 | M | whole blood | 0.01668 | 0.14318 | 0.1096 | 0.05545 | 0.09695 | 0.60283 | whole blood control sample |
This only retains 211 of the 699 samples in a samplesheet. Next, you download the .idat files with with methylprep download and then remove the idat files you won’t need like this:
python -m methylprep -v download -i GSE125105 -d GSE125105
python -m methylprep -v meta_data -i GSE125105 -d GSE125105 --sync_idats --control -k blood
And then process the 6.1GB file using this samplesheet, like this:
python -m methylprep -v process -d GSE125105 --betas --m_value -e
Those options will create two big files. One is a dataframe of beta_values for each sample. The other, m_values for each sample (kind of the same thing, but sometimes you want m_values). the -e or --no_export option will suppress the function creating files of probe values for each sample, as these are not needed by most methylize and methylcheck functions. There is also a --save_uncorrected option that prevents any sort of background and NOOB signal enhancement during processing. Uncorrected files are needed for a few analysis functions, namely p-value probe detection.
In general, partial-dataset processing fails because the meta data for a GEO dataset is incomplete. Either the array positions are missing, or misnamed. Careful checking can allow one to fix this and build a large data set from multiple GEO datasets.
(4b) Another condensed example of downloading GEO data and only processing control samples¶
python -m methylprep -v download -i GSE130030 -d GSE130030
# next, remove the treatment samples using `-c` and remove extra idats with `-s`
python -m methylprep -v meta_data -i GSE130030 -d GSE130030 --control -s
# finally, process it
python -m methylprep -v process -d GSE130030 --betas --m_value --no_export
This creates two files, beta_values.pkl and GSE130030_GPL13534_meta_data.pkl, that you can work with in methylize like this:
Navigate to the GSE130030 folder created by methylrep, and start a python interpreter:
import methylize
data,meta = methylize.load_both()
INFO:methylize.helpers:Found several meta_data files; using: GSE130030_GPL13534_meta_data.pkl)
INFO:methylize.helpers:loaded data (485512, 14) from 1 pickled files (0.159s)
INFO:methylize.helpers:meta.Sample_IDs match data.index (OK)
Or if you are running in a notebook, specify the full path:
import methylize
data,meta = methylize.load_both('<path_to...>/GSE105018')
Why won’t methylprep composite parse some GEO data sets’ meta data?¶
Here are some examples of logical, but unexpected ways data can be stored in the MINiML file format:
The composite expects “control” to be in one of the rows in a spreadsheet. Instead, the authors have recoded “control” as a number, “1” in the column header name. Our parser just isn’t smart enough to read that.
['GSM_ID', 'Sample_Name', 'diseasestatus (1=control, 2=scz patient)', 'source', 'gender', 'sample type', 'plate', 'sentrix barcode', 'sentrix position', 'well id', 'age', 'used_in_analysis', 'description']
Here, instead of having the same names for data for each sample, they have split the smoking status into a bunch of columns, and not provided values for every sample. (smoking_evernever and smoke_free_years don’t add up to 95.) Fixing this requires putting in null values for each incomplete column in a sample sheet.
ValueError - array lengths vary in sample meta data: [('GSM_ID', 95), ('Sample_Name', 95),
('smoking_evernever', 52), ('smoke_free_years', 30), ('Sentrix_ID', 95), ('Sentrix_Position', 95), ('source', 95), ('gender', 95), ('slide', 95), ('array', 95), ('array_name', 95), ('sample_group', 52), ('smoking_5years', 52), ('ms_case_control', 52), ('sample_year', 52), ('age_sampling', 52), ('py', 52), ('description', 95), ('Sample_ID', 95)]