Recover cell cycle in mESC

Here we use the mESC dataset. For simplicity we have converted the dataset into TPM. The original count data is available at ArrayExpress: E-MTAB-2805. Tools to transform data are also provided and explained in the following sections.

Import necessary packages

[1]:
%load_ext autoreload
%autoreload 1
[2]:
import sys

import pandas as pd
import numpy as np
import pickle as pkl
import sklearn as skl
import sklearn.preprocessing

import matplotlib as mpl

import matplotlib.pyplot as plt

Warning information from TensorFlow may occur. It doesn’t matter.

[3]:
import cyclum
from cyclum import writer
/home/shaoheng/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[4]:
input_file_mask = 'data/mESC/mesc-tpm'
output_file_mask = './results/mESC_original/mesc-tpm'

Read data

Here we have label, so we load both. However, the label is not used until evaluation.

[5]:
def preprocess(input_file_mask):
    """
    Read in data and perform log transform (log2(x+1)), centering (mean = 1) and scaling (sd = 1).
    """
    tpm = writer.read_df_from_binary(input_file_mask).T
    sttpm = pd.DataFrame(data=skl.preprocessing.scale(np.log2(tpm.values + 1)), index=tpm.index, columns=tpm.columns)

    label = pd.read_csv(input_file_mask + '-label.txt', sep="\t", index_col=0).T
    return sttpm, label

sttpm, label = preprocess(input_file_mask)

There is no convention whether cells should be columns or rows. Here we require cells to be rows.

[6]:
sttpm.head()
[6]:
Gnai3 Pbsn Cdc45 H19 Scml2 Apoh Narf Cav2 Klf6 Scmh1 ... RP23-345J21.2 AC121960.1 AC136147.1 AC122013.1 AC132389.1 Gm11392 AC160109.2 AC154675.1 AC156980.1 RP23-429I18.1
G1_cell1_count -0.411123 -0.059028 -0.099416 5.385822 -0.691219 0.0 -0.690715 -0.059028 -1.051909 -0.350978 ... -0.146722 0.0 -0.079577 -0.374972 -0.824399 -0.059028 -0.079861 0.0 -0.144843 0.090295
G1_cell2_count -0.180800 -0.059028 0.777223 -0.165725 -0.820206 0.0 0.362341 -0.059028 1.458881 0.207421 ... -0.146722 0.0 -0.079577 -0.374972 -0.824399 -0.059028 -0.079861 0.0 -0.144843 -1.271033
G1_cell3_count -1.409101 -0.059028 -1.218187 -0.165725 -0.820206 0.0 -0.690715 -0.059028 -1.271394 -0.657735 ... 2.593349 0.0 -0.079577 -0.374972 -0.592938 -0.059028 -0.079861 0.0 -0.144843 -1.271033
G1_cell4_count -1.867558 -0.059028 0.923695 -0.165725 -0.820206 0.0 0.903266 -0.059028 1.430708 -0.657735 ... -0.146722 0.0 -0.079577 -0.374972 2.938898 -0.059028 -0.079861 0.0 -0.144843 -1.271033
G1_cell5_count -1.646290 -0.059028 0.001887 -0.165725 -0.820206 0.0 -0.690715 -0.059028 -0.811233 -0.657735 ... -0.146722 0.0 -0.079577 -0.374972 -0.824399 -0.059028 -0.079861 0.0 -0.144843 -0.111558

5 rows × 38293 columns

[7]:
label.head()
[7]:
stage
G1_cell1_count g0/g1
G1_cell2_count g0/g1
G1_cell3_count g0/g1
G1_cell4_count g0/g1
G1_cell5_count g0/g1

Set up the model and fit the model

Fitting the model may take some time. Using a GTX 960M GPU it takes 6 minutes.

[8]:
model = cyclum.PreloadCyclum(sttpm.values, q_circular=3, q_linear=1)
[9]:
pseudotime, rotation = model.fit()
pretrain burnin
step  2000: loss [0.1452561, 45035228.0, 141884.6] time 9.83
pretrain train
step  2000: loss [0.14524707, 45035230.0, 141884.84] time 7.32
step  4000: loss [0.14524722, 45035230.0, 141884.84] time 6.24
midtrain burnin
step  2000: loss [0.14524722, 13479.106, 65288.434] time 31.53
midtrain train
step  2000: loss [0.14524722, 13475.677, 65288.434] time 31.44
step  4000: loss [0.14524722, 13475.676, 65288.434] time 29.15
finaltrain train
step  2000: loss [0.23462683, 12895.94, 35502.785] time 41.88
step  4000: loss [0.25569192, 12895.507, 18958.936] time 38.46
step  6000: loss [0.276326, 12895.623, 9609.924] time 38.45
finaltrain refine
step  2000: loss [0.28275946, 12895.442, 7700.0806] time 42.01
step  4000: loss [0.28499442, 12895.469, 6226.207] time 41.03
step  6000: loss [0.28545943, 12895.44, 5038.1367] time 39.55
step  8000: loss [0.2893271, 12895.435, 4072.1836] time 40.60
step 10000: loss [0.29161924, 12895.42, 3288.5647] time 40.55
Full time 438.50

Illustrations

We illustrate the results on a circle, to show its circular nature. There is virtually no start and end of the circle. Red, green and blue represents G0/G1, S and G2/M phase respectively. The inner lines represents single cells. The cells spread across the The areas outside

[10]:
%aimport cyclum.illustration
[11]:
color_map = {'stage': {"g0/g1": "red", "s": "green", "g2/m": "blue"},
                 'subcluster': {"intact": "cyan", "perturbed": "violet"}}
cyclum.illustration.plot_round_distr_color(pseudotime, label['stage'], color_map['stage'])
[11]:
_images/example_mESC_18_0.png
_images/example_mESC_18_1.png