{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Recover cell cycle in mESC\n", "\n", "Here we use the mESC dataset. For simplicity, we have converted the dataset to TPM.\n", "The original count data is available at ArrayExpress: [E-MTAB-2805](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2805/). Tools for transforming the data are also provided and explained in the following sections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import necessary packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import pickle as pkl\n", "import sklearn as skl\n", "import sklearn.preprocessing\n", "\n", "import matplotlib as mpl\n", "\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Importing cyclum may trigger warnings from TensorFlow and its dependencies. They can be safely ignored." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/shaoheng/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n" ] } ], "source": [ "import cyclum\n", "from cyclum import writer" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "input_file_mask = 'data/mESC/mesc-tpm'\n", "output_file_mask = './results/mESC_original/mesc-tpm'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read data\n", "Here the cells are labeled, so we load both the expression data and the label. 
However, the label is not used until evaluation." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def preprocess(input_file_mask):\n", " \"\"\"\n", " Read in data and perform log transform (log2(x+1)), centering (mean = 0) and scaling (sd = 1).\n", " \"\"\"\n", " tpm = writer.read_df_from_binary(input_file_mask).T\n", " sttpm = pd.DataFrame(data=skl.preprocessing.scale(np.log2(tpm.values + 1)), index=tpm.index, columns=tpm.columns)\n", " \n", " label = pd.read_csv(input_file_mask + '-label.txt', sep=\"\\t\", index_col=0).T\n", " return sttpm, label\n", "\n", "sttpm, label = preprocess(input_file_mask)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no universal convention on whether cells should be rows or columns. Here we require cells to be rows." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Gnai3 | Pbsn | Cdc45 | H19 | Scml2 | Apoh | Narf | Cav2 | Klf6 | Scmh1 | ... | RP23-345J21.2 | AC121960.1 | AC136147.1 | AC122013.1 | AC132389.1 | Gm11392 | AC160109.2 | AC154675.1 | AC156980.1 | RP23-429I18.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G1_cell1_count | -0.411123 | -0.059028 | -0.099416 | 5.385822 | -0.691219 | 0.0 | -0.690715 | -0.059028 | -1.051909 | -0.350978 | ... | -0.146722 | 0.0 | -0.079577 | -0.374972 | -0.824399 | -0.059028 | -0.079861 | 0.0 | -0.144843 | 0.090295 |
| G1_cell2_count | -0.180800 | -0.059028 | 0.777223 | -0.165725 | -0.820206 | 0.0 | 0.362341 | -0.059028 | 1.458881 | 0.207421 | ... | -0.146722 | 0.0 | -0.079577 | -0.374972 | -0.824399 | -0.059028 | -0.079861 | 0.0 | -0.144843 | -1.271033 |
| G1_cell3_count | -1.409101 | -0.059028 | -1.218187 | -0.165725 | -0.820206 | 0.0 | -0.690715 | -0.059028 | -1.271394 | -0.657735 | ... | 2.593349 | 0.0 | -0.079577 | -0.374972 | -0.592938 | -0.059028 | -0.079861 | 0.0 | -0.144843 | -1.271033 |
| G1_cell4_count | -1.867558 | -0.059028 | 0.923695 | -0.165725 | -0.820206 | 0.0 | 0.903266 | -0.059028 | 1.430708 | -0.657735 | ... | -0.146722 | 0.0 | -0.079577 | -0.374972 | 2.938898 | -0.059028 | -0.079861 | 0.0 | -0.144843 | -1.271033 |
| G1_cell5_count | -1.646290 | -0.059028 | 0.001887 | -0.165725 | -0.820206 | 0.0 | -0.690715 | -0.059028 | -0.811233 | -0.657735 | ... | -0.146722 | 0.0 | -0.079577 | -0.374972 | -0.824399 | -0.059028 | -0.079861 | 0.0 | -0.144843 | -0.111558 |

5 rows × 38293 columns

| | stage |
|---|---|
| G1_cell1_count | g0/g1 |
| G1_cell2_count | g0/g1 |
| G1_cell3_count | g0/g1 |
| G1_cell4_count | g0/g1 |
| G1_cell5_count | g0/g1 |