Profile

gwern branwen
2,335 followers | 1,366,798 views

Stream

gwern branwen

Shared publicly
Always interesting to become part of the story.
> /r/DarkNetMarkets has received its first known LE subpoena: a request for 5 accounts' data, including mine, related to Evolution and the...
 
Dang!!!!

gwern branwen

Shared publicly
Reminds me of that earlier paper on immune systems.
We assume that microbes evolved to attack humans when actually we are just civilian casualties in a much older war
 
Makes a nice backdrop to my motto: ambiguities are like microbes, the pathogenic ones kidnap attention.
 
Messing around with my daily weight data using probabilistic graphs / #bayesnets in #R.

---

As the datasets I'm interested in grow in number of variables, it becomes harder to justify doing analysis by simply writing down a simple linear model with a single dependent variable and throwing in the independent variables and maybe a few transformations chosen by hand.
I can instead write down a simultaneous-equations/structural-equation model (SEM), but while the right structure is usually obvious for k<4 variables (and when it isn't, I can compare the possible variants), at 4 variables the right SEM is already questionable, and at >5 it's hopeless.
Factor analysis to extract some latent variables is a possibility, but the more general solution here seems to be probabilistic graphical models such as Bayesian networks.

I thought I'd try out some Bayes net inference on some of my datasets.
In this case, I have ~150 daily measurements from my Omron body composition scale, measuring total weight, body fat percentage, and some other things (see an Omron manual http://ecx.images-amazon.com/images/I/B1v0aFQLGFS.pdf ):

1. Total weight
2. BMI
3. Body fat percentage
4. Muscle percentage
5. Resting metabolism in calories
6. "Body age"
7. Visceral fat index

The 7 variables are interrelated, so this is definitely a case where a simple `lm` is not going to do the trick.
It's also not 100% clear how to set up a SEM; some definitions are obvious (the much-criticized BMI is going to be determined solely by total weight; muscle and fat percentages might be inversely related) but others are not (how does "visceral fat" relate to body fat?).
And it's not a hopelessly small amount of data.

The Bayes net R library I'm trying out is [`bnlearn`](http://www.bnlearn.com) ([paper](http://www.jstatsoft.org/v35/i03/paper)).


library(bnlearn)
# https://www.dropbox.com/s/4nsrszm85m47272/2015-03-22-gwern-weight.csv
weight <- read.csv("selfexperiment/weight.csv")
weight$Date <- NULL; weight$Weight.scale <- NULL
# remove missing data
weightC <- na.omit(weight)
# bnlearn can't handle integers, oddly enough
weightC <- as.data.frame(sapply(weightC, as.numeric))
summary(weightC)
#   Weight.Omron        Weight.BMI        Weight.body.fat    Weight.muscle
#  Min.   :193.0000   Min.   : 26.90000   Min.   :27.00000   Min.   :32.60000
#  1st Qu.:195.2000   1st Qu.: 27.20000   1st Qu.:28.40000   1st Qu.:34.20000
#  Median :196.4000   Median : 27.40000   Median :28.70000   Median :34.50000
#  Mean   :196.4931   Mean   : 28.95409   Mean   :28.70314   Mean   :34.47296
#  3rd Qu.:197.8000   3rd Qu.: 27.60000   3rd Qu.:29.10000   3rd Qu.:34.70000
#  Max.   :200.6000   Max.   : 28.00000   Max.   :31.70000   Max.   :35.50000
#  Weight.resting.metabolism Weight.body.age    Weight.visceral.fat
#  Min.   :1857.000          Min.   :52.00000   Min.   : 9.000000
#  1st Qu.:1877.000          1st Qu.:53.00000   1st Qu.:10.000000
#  Median :1885.000          Median :53.00000   Median :10.000000
#  Mean   :1885.138          Mean   :53.32704   Mean   : 9.949686
#  3rd Qu.:1893.000          3rd Qu.:54.00000   3rd Qu.:10.000000
#  Max.   :1914.000          Max.   :56.00000   Max.   :11.000000
cor(weightC)
#                             Weight.Omron     Weight.BMI Weight.body.fat  Weight.muscle
# Weight.Omron               1.00000000000  0.98858376919    0.1610643221 -0.06976934825
# Weight.BMI                 0.98858376919  1.00000000000    0.1521872557 -0.06231142104
# Weight.body.fat            0.16106432213  0.15218725566    1.0000000000 -0.98704369855
# Weight.muscle             -0.06976934825 -0.06231142104   -0.9870436985  1.00000000000
# Weight.resting.metabolism  0.96693236051  0.95959140245   -0.0665001241  0.15621294274
# Weight.body.age            0.82581939626  0.81286141659    0.5500409365 -0.47408608681
# Weight.visceral.fat        0.41542744168  0.43260100665    0.2798756916 -0.25076619829
#                           Weight.resting.metabolism Weight.body.age Weight.visceral.fat
# Weight.Omron                           0.9669323605    0.8258193963        0.4154274417
# Weight.BMI                             0.9595914024    0.8128614166        0.4326010067
# Weight.body.fat                       -0.0665001241    0.5500409365        0.2798756916
# Weight.muscle                          0.1562129427   -0.4740860868       -0.2507661983
# Weight.resting.metabolism              1.0000000000    0.7008354776        0.3557229425
# Weight.body.age                        0.7008354776    1.0000000000        0.4840752389
# Weight.visceral.fat                    0.3557229425    0.4840752389        1.0000000000

## create alternate dataset expressing the two percentage variables as pounds, since this might fit better
weightC2 <- weightC
weightC2$Weight.body.fat <- weightC2$Weight.Omron * (weightC2$Weight.body.fat / 100)
weightC2$Weight.muscle   <- weightC2$Weight.Omron * (weightC2$Weight.muscle / 100)


Begin analysis:


pdap <- hc(weightC)
pdapc2 <- hc(weightC2)
## bigger is better:
score(pdap, weightC)
# [1] -224.2563072
score(pdapc2, weightC2)
# [1] -439.7811072
## stick with the original, then
pdap
#   Bayesian network learned via Score-based methods
#
#   model:
#    [Weight.Omron][Weight.body.fat][Weight.BMI|Weight.Omron]
#    [Weight.resting.metabolism|Weight.Omron:Weight.body.fat]
#    [Weight.body.age|Weight.Omron:Weight.body.fat]
#    [Weight.muscle|Weight.body.fat:Weight.resting.metabolism][Weight.visceral.fat|Weight.body.age]
#   nodes:                                 7
#   arcs:                                  8
#     undirected arcs:                     0
#     directed arcs:                       8
#   average markov blanket size:           2.57
#   average neighbourhood size:            2.29
#   average branching factor:              1.14
#
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              2.534452101
#   tests used in the learning procedure:  69
#   optimized:                             TRUE
plot(pdap)
## https://i.imgur.com/nipmqta.png


This inferred graph is obviously wrong in several respects, violating prior knowledge about some of the relationships.

More specifically, my prior knowledge:

- `Weight.Omron` == total weight; should be influenced by `Weight.body.fat` (%), `Weight.muscle` (%), & `Weight.visceral.fat`
- `Weight.visceral.fat`: ordinal variable, <=9 = normal; 10-14 = high; 15+ = very high; from the Omron manual:

    > Visceral fat area (0 - approx. 300 cm², 1 inch = 2.54 cm) distribution with 30 levels. NOTE: Visceral fat levels are relative and not absolute values.
- `Weight.BMI`: BMI is a simple function of total weight & height (specifically `BMI = round(weight / height^2)`), so it should be influenced only by `Weight.Omron`, and influence nothing else
- `Weight.body.age`: should be influenced by `Weight.Omron`, `Weight.body.fat`, and `Weight.muscle`, based on the description in the manual:

    > Body age is based on your resting metabolism. Body age is calculated by using your weight, body fat percentage and skeletal muscle percentage to produce a guide to whether your body age is above or below the average for your actual age.
- `Weight.resting.metabolism`: a function of the others, but I'm not sure which exactly; manual talks about what resting metabolism is generically and specifies it has the range "385 to 3999 kcal with 1 kcal increments"; https://en.wikipedia.org/wiki/Basal_metabolic_rate suggests the Omron may be using one of several approximation equations based on age/sex/height/weight, but it might also be using lean body mass as well.
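
Before fitting any networks, the two 'deterministic function' claims above can be sanity-checked with plain regressions: if they hold, the R^2s should be ~1, with the residuals reflecting only rounding (a quick check, not part of the analysis proper):

summary(lm(Weight.BMI ~ Weight.Omron, data=weightC))$r.squared
summary(lm(Weight.resting.metabolism ~ Weight.Omron + Weight.body.fat, data=weightC))$r.squared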

Unfortunately, bnlearn doesn't seem to support any easy way of encoding the prior knowledge - for example, you can't say 'no outgoing arrows from node X' - so I iterate, adding bad arrows to the blacklist.
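
A workaround sketch: the 'no outgoing arrows from node X' constraint can at least be generated mechanically rather than typed out arc by arc (`no.outgoing` is my own helper, not part of bnlearn; bnlearn's `tiers2blacklist()` can do something similar if you can supply a full temporal ordering):

no.outgoing <- function(node, nodes) data.frame(from=node, to=setdiff(nodes, node))
## e.g. Weight.BMI is purely derived & should influence nothing else:
no.outgoing("Weight.BMI", colnames(weightC))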

Which arrows violate prior knowledge?

- `[Weight.visceral.fat|Weight.body.age]` (read backwards, as `Weight.body.age ~> Weight.visceral.fat`)
- `[Weight.muscle|Weight.resting.metabolism]`

Retry, blacklisting those 2 arrows:


bl <- data.frame(from=c("Weight.body.age", "Weight.resting.metabolism"),
                 to  =c("Weight.visceral.fat", "Weight.muscle"))
pdap2 <- hc(weightC, blacklist=bl)


New violations:

- `[Weight.visceral.fat|Weight.BMI]`
- `[Weight.muscle|Weight.Omron]`


bl <- rbind(bl, data.frame(from=c("Weight.BMI", "Weight.Omron"),
                           to  =c("Weight.visceral.fat", "Weight.muscle")))
pdap3 <- hc(weightC, blacklist=bl)


New violations:

- `[Weight.visceral.fat|Weight.Omron]`
- `[Weight.muscle|Weight.BMI]`

bl <- rbind(bl, data.frame(from=c("Weight.Omron", "Weight.BMI"),
                           to  =c("Weight.visceral.fat", "Weight.muscle")))
pdap4 <- hc(weightC, blacklist=bl)


One violation:

- `[Weight.muscle|Weight.body.age]`


bl <- rbind(bl, data.frame(from="Weight.body.age", to="Weight.muscle"))
pdap5 <- hc(weightC, blacklist=bl)
pdap5
#   Bayesian network learned via Score-based methods
#
#   model:
#    [Weight.body.fat][Weight.muscle|Weight.body.fat][Weight.visceral.fat|Weight.body.fat]
#    [Weight.Omron|Weight.visceral.fat][Weight.BMI|Weight.Omron]
#    [Weight.resting.metabolism|Weight.Omron:Weight.body.fat]
#    [Weight.body.age|Weight.Omron:Weight.body.fat]
#   nodes:                                 7
#   arcs:                                  8
#     undirected arcs:                     0
#     directed arcs:                       8
#   average markov blanket size:           2.57
#   average neighbourhood size:            2.29
#   average branching factor:              1.14
#
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              2.534452101
#   tests used in the learning procedure:  62
#   optimized:                             TRUE
plot(pdap5)
## https://i.imgur.com/nxCfmYf.png

## implementing all the prior knowledge cost ~30 points of network score:
score(pdap5, weightC)
# [1] -254.6061724


No violations, so let's use the network and estimate the specific parameters:


fit <- bn.fit(pdap5, weightC); fit
#   Bayesian network parameters
#
#   Parameters of node Weight.Omron (Gaussian distribution)
#
# Conditional density: Weight.Omron | Weight.visceral.fat
# Coefficients:
#         (Intercept)  Weight.visceral.fat
#       169.181651376          2.744954128
# Standard deviation of the residuals: 1.486044472
#
#   Parameters of node Weight.BMI (Gaussian distribution)
#
# Conditional density: Weight.BMI | Weight.Omron
# Coefficients:
#   (Intercept)   Weight.Omron
# -0.3115772322   0.1411044216
# Standard deviation of the residuals: 0.03513413381
#
#   Parameters of node Weight.body.fat (Gaussian distribution)
#
# Conditional density: Weight.body.fat
# Coefficients:
# (Intercept)
# 28.70314465
# Standard deviation of the residuals: 0.644590085
#
#   Parameters of node Weight.muscle (Gaussian distribution)
#
# Conditional density: Weight.muscle | Weight.body.fat
# Coefficients:
#     (Intercept)  Weight.body.fat
#   52.1003347352    -0.6141270921
# Standard deviation of the residuals: 0.06455478599
#
#   Parameters of node Weight.resting.metabolism (Gaussian distribution)
#
# Conditional density: Weight.resting.metabolism | Weight.Omron + Weight.body.fat
# Coefficients:
#     (Intercept)     Weight.Omron  Weight.body.fat
#   666.910582196      6.767607964     -3.886694779
# Standard deviation of the residuals: 1.323176507
#
#   Parameters of node Weight.body.age (Gaussian distribution)
#
# Conditional density: Weight.body.age | Weight.Omron + Weight.body.fat
# Coefficients:
#     (Intercept)     Weight.Omron  Weight.body.fat
#  -32.2651379176     0.3603672788     0.5150134225
# Standard deviation of the residuals: 0.2914301529
#
#   Parameters of node Weight.visceral.fat (Gaussian distribution)
#
# Conditional density: Weight.visceral.fat | Weight.body.fat
# Coefficients:
#     (Intercept)  Weight.body.fat
#    6.8781100009     0.1070118125
# Standard deviation of the residuals: 0.2373649058
## residuals look fairly good, except for Weight.resting.metabolism, where there are some extreme residuals in what looks a bit like a sigmoid sort of pattern, suggesting nonlinearities in the Omron scale's formula?
bn.fit.qqplot(fit)
## https://i.imgur.com/mSallOv.png


We can double-check these estimates by turning the Bayes net model into a SEM, seeing how the coefficients compare, and seeing whether the p-values suggest we've found a good model:


library(lavaan)
Weight.model1 <- '
    Weight.visceral.fat ~ Weight.body.fat
    Weight.Omron ~ Weight.visceral.fat
    Weight.BMI ~ Weight.Omron
    Weight.body.age ~ Weight.Omron + Weight.body.fat
    Weight.muscle ~ Weight.body.fat
    Weight.resting.metabolism ~ Weight.Omron + Weight.body.fat
                   '
Weight.fit1 <- sem(model = Weight.model1,  data = weightC)
summary(Weight.fit1)
# lavaan (0.5-16) converged normally after 139 iterations
#
#   Number of observations                           159
#
#   Estimator                                         ML
#   Minimum Function Test Statistic               71.342
#   Degrees of freedom                                 7
#   P-value (Chi-square)                           0.000
#
# Parameter estimates:
#
#   Information                                 Expected
#   Standard Errors                             Standard
#
#                    Estimate  Std.err  Z-value  P(>|z|)
# Regressions:
#   Weight.visceral.fat ~
#     Weight.bdy.ft     0.107    0.029    3.676    0.000
#   Weight.Omron ~
#     Wght.vscrl.ft     2.745    0.477    5.759    0.000
#   Weight.BMI ~
#     Weight.Omron      0.141    0.002   82.862    0.000
#   Weight.body.age ~
#     Weight.Omron      0.357    0.014   25.162    0.000
#     Weight.bdy.ft     0.516    0.036   14.387    0.000
#   Weight.muscle ~
#     Weight.bdy.ft    -0.614    0.008  -77.591    0.000
#   Weight.resting.metabolism ~
#     Weight.Omron      6.730    0.064  104.631    0.000
#     Weight.bdy.ft    -3.860    0.162  -23.837    0.000
#
# Covariances:
#   Weight.BMI ~~
#     Weight.body.g    -0.000    0.001   -0.116    0.907
#     Weight.muscle    -0.000    0.000   -0.216    0.829
#     Wght.rstng.mt     0.005    0.004    1.453    0.146
#   Weight.body.age ~~
#     Weight.muscle     0.001    0.001    0.403    0.687
#     Wght.rstng.mt    -0.021    0.030   -0.700    0.484
#   Weight.muscle ~~
#     Wght.rstng.mt     0.007    0.007    1.003    0.316
#
# Variances:
#     Wght.vscrl.ft     0.056    0.006
#     Weight.Omron      2.181    0.245
#     Weight.BMI        0.001    0.000
#     Weight.body.g     0.083    0.009
#     Weight.muscle     0.004    0.000
#     Wght.rstng.mt     1.721    0.193


Comparing the coefficients by eye, the two sets tend to be quite close (usually within 0.1), and the p-values are all statistically significant.
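
Rather than comparing by eye, both sets of coefficients can be pulled out programmatically, e.g. for one node (assuming bnlearn's `coef()` accessor and lavaan's `parameterEstimates()` behave as documented):

coef(fit$Weight.body.age)
#    (Intercept)     Weight.Omron  Weight.body.fat
#  -32.2651379176     0.3603672788     0.5150134225
subset(parameterEstimates(Weight.fit1), lhs == "Weight.body.age" & op == "~")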

The network itself looks right, although some of the edges are surprises: I didn't know visceral fat was predictable from body fat (I thought they were measuring separate things). And the relative independence of muscle suggests that in any exercise plan I might be better off focusing on body fat percentage rather than muscle percentage, since the former may be effectively determining the latter.
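
That conjecture about muscle is directly testable as a conditional-independence claim with bnlearn's `ci.test`; a large p-value here would be consistent with body fat screening off total weight from muscle percentage (a quick check, not proof):

ci.test("Weight.muscle", "Weight.Omron", "Weight.body.fat", data=weightC)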

So what did I learn here?

- learning network structure and direction of arrows is hard; even with only 7 variables and n=159 (accurate, clean data), the hill-climbing algorithm will learn at least 7 wrong arcs.

    - and the derived graphs depend disturbingly heavily on choice of algorithm; I used the `hc` hill-climbing algorithm (since I'm lazy and didn't want to specify arrow directions), but when I try out alternatives like `iamb` on the same data & blacklist, the found graph looks rather different
- Gaussians are, as always, sensitive to outliers: I was surprised the first graph didn't show BMI connected to anything, so I took a closer look and found I had miscoded a BMI of 28 as 280!
- bnlearn, while not as hard to use as I expected, could still use usability improvements: I should not need to coerce integer data into exactly equivalent numeric types just because bnlearn doesn't recognize integers, and blacklisting/whitelisting needs to be more powerful - iteratively generating graphs, manually inspecting them, and manually blacklisting arcs is tedious and does not scale

    - hence, it may make more sense to find a graph using `bnlearn` and then convert it into simultaneous equations and manipulate it using more mature SEM libraries (a rough sketch of this conversion follows)
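
That conversion can be mostly automated, since the learned graph already encodes one regression per node; a rough sketch (`nodes()` & `parents()` are bnlearn accessors, while the string-pasting glue is mine, and it ignores latent variables and residual covariances):

bn2lavaan <- function(bn) {
    eqs <- sapply(nodes(bn), function(node) {
        ps <- parents(bn, node)
        if (length(ps) == 0) "" else paste(node, "~", paste(ps, collapse=" + "))
    })
    paste(eqs[eqs != ""], collapse="\n")
}
cat(bn2lavaan(pdap5))
## then e.g.: summary(sem(model = bn2lavaan(pdap5), data = weightC))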

#statistics #SEM  
 
Interesting analysis. How dependent muscle and fat mass are might depend on what you are doing (and there could be a time lag)?

gwern branwen

Shared publicly
Special Article from The New England Journal of Medicine — Compliance with Results Reporting at ClinicalTrials.gov
 
Many Labs 3 examines interactions, finding few (and worse replicability of main effects than Many Labs 1):

"The university participant pool is a key resource for behavioral research, and data quality is believed to vary over the course of the academic semester. This crowdsourced project examined time of semester variation in 10 known effects, 10 individual differences, and 3 data quality indicators over the course of the academic semester in 20 participant pools (N = 2,696) and with an online sample (N = 737). Seven of the 10 effects did not replicate. Three of those were interaction effects for which a main effect did replicate. Weak time of semester effects were observed on data quality indicators, participant sex, and a few individual differences— conscientiousness, mood, and stress. However, there was little evidence for time of semester qualifying experimental or correlational effects. This suggests a provocative conclusion. Mean characteristics of pool samples change slightly during the semester, but those changes are mostly irrelevant for detecting effects."

Commentary: https://hardsci.wordpress.com/2015/03/12/an-open-review-of-many-labs-3-much-to-learn/

Something to keep in mind when you read a psych paper whose main effect fails but the authors argue that there was really just an interaction and/or an unfortunate moderator of participant payment / country / time of year / etc.

#psychology #statistics #replicability  
Many Labs 3 is a crowdsourced project that will systematically evaluate time-of-semester effects across many participant pools. The project team will create a single experimental protocol assessing 10 known effects within 30 minutes. As many labs as can join will administer the same protocol across the academic semester. The aggregate data will provide high-powered evidence whether time-of-semester moderates the detectability of effects. Effe...

gwern branwen

Shared publicly
The Plowshares movement was inspired by Dorothy Day, a Greenwich Village bohemian who converted to Catholicism and urged resistance to all wars. In the Vietnam era, Philip Berrigan led actions to symbolically destroy the nuclear arsenal. Credit Illustrations by Alex Williamson; Left to Right: Getty (Dorothy Day); Bettman / Corbis (Philip Berrigan and Protest March)
 
Previously, with weight data: https://plus.google.com/103530621949492999968/posts/aBwZ57xFRjB

Here I look at my Zeo sleep data; more variables, more complex relations, and more unknown ones, but on the positive side, ~9x more data to work with (1,379 vs 159 complete cases).

zeo <- read.csv("http://www.gwern.net/docs/zeo/gwern-zeodata.csv")
zeo$Sleep.Date <- as.Date(zeo$Sleep.Date, format="%m/%d/%Y")

## convert "05/12/2014 06:45" to "06:45"
zeo$Start.of.Night <- sapply(strsplit(as.character(zeo$Start.of.Night), " "), function(x) { x[2] })
## convert "06:45" to minutes since midnight (405); strings like "24300 s" are parsed as the bare integer
interval <- function(x) {
    if (is.na(x)) return(NA)
    if (grepl(" s", x)) return(as.integer(sub(" s", "", x)))
    y <- unlist(strsplit(x, ":"))
    as.integer(y[[1]])*60 + as.integer(y[[2]])
}
zeo$Start.of.Night <- sapply(zeo$Start.of.Night, interval)
## correct for the switch to new unencrypted firmware in March 2013;
## I don't know why the new firmware subtracts 15 hours
zeo[(zeo$Sleep.Date >= as.Date("2013-03-11")),]$Start.of.Night <- (zeo[(zeo$Sleep.Date >= as.Date("2013-03-11")),]$Start.of.Night + 900) %% (24*60)

## after midnight (24*60=1440), Start.of.Night wraps around to 0, which obscures any trends,
## so we'll map anything before 7AM to time+1440
zeo[zeo$Start.of.Night<420 & !is.na(zeo$Start.of.Night),]$Start.of.Night <- (zeo[zeo$Start.of.Night<420 & !is.na(zeo$Start.of.Night),]$Start.of.Night + (24*60))

zeoSmall <- subset(zeo, select=c(ZQ,Total.Z,Time.to.Z,Time.in.Wake,Time.in.REM,Time.in.Light,Time.in.Deep,Awakenings,Start.of.Night,Morning.Feel))
zeoClean <- na.omit(zeoSmall)
# bnlearn doesn't like the 'integer' class that most of the data-frame is in
zeoClean <- as.data.frame(sapply(zeoClean, as.numeric))

Prior knowledge:

- `Start.of.Night` is temporally first, and cannot be caused
- `Time.to.Z` is temporally second, and can be influenced by `Start.of.Night` (likely a connection between how late I go to bed and how fast I fall asleep) and is connected with `Time.in.Wake` (since if it takes 10 minutes to fall asleep, I must spend >=10 minutes in wake), but not others
- `Morning.Feel` is temporally last, and cannot cause anything
- `ZQ` is a synthetic variable invented by Zeo according to an opaque formula, which cannot cause anything but is determined by others
- `Total.Z` should be the sum of `Time.in.Light`, `Time.in.REM`, and `Time.in.Deep` (checked directly below)
- `Awakenings` should have an arrow with `Time.in.Wake` but it's not clear which way it should run
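
That sum identity is easy to verify directly before fitting anything (a quick check; any nonzero differences should just be rounding in the exported minutes):

summary(with(zeoClean, Total.Z - (Time.in.Light + Time.in.REM + Time.in.Deep)))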

library(bnlearn)
## after a bunch of iteration, blacklisting arrows which violate the prior knowledge
bl <- data.frame(from=c("Morning.Feel", "ZQ", "ZQ", "ZQ", "ZQ", "ZQ", "ZQ", "Time.in.REM", "Time.in.Light", "Time.in.Deep", "Morning.Feel", "Awakenings", "Time.in.Light", "Morning.Feel", "Morning.Feel","Total.Z", "Time.in.Wake", "Time.to.Z", "Total.Z", "Total.Z", "Total.Z"),
                 to=c("Start.of.Night", "Total.Z", "Time.in.Wake", "Time.in.REM", "Time.in.Deep", "Morning.Feel","Start.of.Night", "Start.of.Night","Start.of.Night","Start.of.Night", "Time.to.Z", "Time.to.Z", "Time.to.Z", "Total.Z", "Time.in.Wake","Time.to.Z","Time.to.Z", "Start.of.Night", "Time.in.Deep", "Time.in.REM", "Time.in.Light"))

zeo.hc <- hc(zeoClean, blacklist=bl)
zeo.iamb <- iamb(zeoClean, blacklist=bl)
## problem: undirected arc: Time.in.Deep/Time.in.REM; since hc inferred [Time.in.Deep|Time.in.REM], I'll copy that for iamb:
zeo.iamb <- set.arc(zeo.iamb, from = "Time.in.REM", to = "Time.in.Deep")
zeo.gs <- gs(zeoClean, blacklist=bl)
## same undirected arc:
zeo.gs <- set.arc(zeo.gs, from = "Time.in.REM", to = "Time.in.Deep")

## Bigger is better:
score(zeo.iamb, data=zeoClean)
# [1] -44776.79185
score(zeo.gs, data=zeoClean)
# [1] -44776.79185
score(zeo.hc, data=zeoClean)
# [1] -44557.6952
## hc scores best, so let's look at it:
zeo.hc
#   Bayesian network learned via Score-based methods
#
#   model:
#    [Start.of.Night][Time.to.Z|Start.of.Night][Time.in.Light|Time.to.Z:Start.of.Night]
#    [Time.in.REM|Time.in.Light:Start.of.Night][Time.in.Deep|Time.in.REM:Time.in.Light:Start.of.Night]
#    [Total.Z|Time.in.REM:Time.in.Light:Time.in.Deep][Time.in.Wake|Total.Z:Time.to.Z]
#    [Awakenings|Time.to.Z:Time.in.Wake:Time.in.REM:Time.in.Light:Start.of.Night]
#    [Morning.Feel|Total.Z:Time.to.Z:Time.in.Wake:Time.in.Light:Start.of.Night]
#    [ZQ|Total.Z:Time.in.Wake:Time.in.REM:Time.in.Deep:Awakenings]
#   nodes:                                 10
#   arcs:                                  28
#     undirected arcs:                     0
#     directed arcs:                       28
#   average markov blanket size:           7.40
#   average neighbourhood size:            5.60
#   average branching factor:              2.80
#
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              3.614556939
#   tests used in the learning procedure:  281
#   optimized:                             TRUE

plot(zeo.hc)
## https://i.imgur.com/nD3LXND.png

fit <- bn.fit(zeo.hc, zeoClean); fit
#
#   Bayesian network parameters
#
#   Parameters of node ZQ (Gaussian distribution)
#
# Conditional density: ZQ | Total.Z + Time.in.Wake + Time.in.REM + Time.in.Deep + Awakenings
# Coefficients:
#    (Intercept)         Total.Z    Time.in.Wake     Time.in.REM    Time.in.Deep      Awakenings
# -0.12468522173   0.14197043518  -0.07103211437   0.07053271816   0.21121000076  -0.56476256303
# Standard deviation of the residuals: 0.3000223604
#
#   Parameters of node Total.Z (Gaussian distribution)
#
# Conditional density: Total.Z | Time.in.Wake + Start.of.Night
# Coefficients:
#    (Intercept)    Time.in.Wake  Start.of.Night
# 907.6406157850   -0.4479377278   -0.2680771514
# Standard deviation of the residuals: 68.90853885
#
#   Parameters of node Time.to.Z (Gaussian distribution)
#
# Conditional density: Time.to.Z | Start.of.Night
# Coefficients:
#    (Intercept)  Start.of.Night
# -1.02898431407   0.01568450832
# Standard deviation of the residuals: 13.51606719
#
#   Parameters of node Time.in.Wake (Gaussian distribution)
#
# Conditional density: Time.in.Wake | Time.to.Z
# Coefficients:
#   (Intercept)      Time.to.Z
# 14.7433880499   0.3289378711
# Standard deviation of the residuals: 19.0906685
#
#   Parameters of node Time.in.REM (Gaussian distribution)
#
# Conditional density: Time.in.REM | Total.Z + Start.of.Night
# Coefficients:
#      (Intercept)           Total.Z    Start.of.Night
# -120.62442964234     0.37864195651     0.06275760841
# Standard deviation of the residuals: 19.32560757
#
#   Parameters of node Time.in.Light (Gaussian distribution)
#
# Conditional density: Time.in.Light | Total.Z + Time.in.REM + Time.in.Deep
# Coefficients:
#   (Intercept)        Total.Z    Time.in.REM   Time.in.Deep
#  0.6424267863   0.9997862624  -1.0000587988  -1.0001805537
# Standard deviation of the residuals: 0.5002896274
#
#   Parameters of node Time.in.Deep (Gaussian distribution)
#
# Conditional density: Time.in.Deep | Total.Z + Time.in.REM
# Coefficients:
#   (Intercept)        Total.Z    Time.in.REM
# 15.4961459056   0.1283622577  -0.1187382535
# Standard deviation of the residuals: 11.90756843
#
#   Parameters of node Awakenings (Gaussian distribution)
#
# Conditional density: Awakenings | Time.to.Z + Time.in.Wake + Time.in.REM + Time.in.Light + Start.of.Night
# Coefficients:
#     (Intercept)        Time.to.Z     Time.in.Wake      Time.in.REM    Time.in.Light
# -18.41014329148    0.02605164827    0.05736596152    0.02291139969    0.01060661963
#  Start.of.Night
#   0.01129521977
# Standard deviation of the residuals: 2.427868657
#
#   Parameters of node Start.of.Night (Gaussian distribution)
#
# Conditional density: Start.of.Night
# Coefficients:
# (Intercept)
# 1413.382886
# Standard deviation of the residuals: 64.43144125
#
#   Parameters of node Morning.Feel (Gaussian distribution)
#
# Conditional density: Morning.Feel | Total.Z + Time.to.Z + Time.in.Wake + Time.in.Light + Start.of.Night
# Coefficients:
#     (Intercept)          Total.Z        Time.to.Z     Time.in.Wake    Time.in.Light
# -0.924662971061   0.004808652252  -0.010127269154  -0.008636841492  -0.002766602019
#  Start.of.Night
#  0.001672816480
# Standard deviation of the residuals: 0.7104115719

## some issues with big residuals at the extremes in the variables Time.in.Light, Time.in.Wake, and Time.to.Z;
## not sure how to fix those
bn.fit.qqplot(fit)
# https://i.imgur.com/fmP1ca0.png

library(lavaan)
Zeo.model1 <- '
    Time.to.Z ~ Start.of.Night
    Time.in.Wake ~ Total.Z + Time.to.Z
    Awakenings ~ Time.to.Z + Time.in.Wake + Time.in.REM + Time.in.Light + Start.of.Night
    Time.in.Light ~ Time.to.Z + Start.of.Night
    Time.in.REM ~ Time.in.Light + Start.of.Night
    Time.in.Deep ~ Time.in.REM + Time.in.Light + Start.of.Night
    Total.Z ~ Time.in.REM + Time.in.Light + Time.in.Deep
    ZQ ~ Total.Z + Time.in.Wake + Time.in.REM + Time.in.Deep + Awakenings
    Morning.Feel ~ Total.Z + Time.to.Z + Time.in.Wake + Time.in.Light + Start.of.Night
                   '
Zeo.fit1 <- sem(model = Zeo.model1,  data = zeoClean)
summary(Zeo.fit1)
# lavaan (0.5-16) converged normally after 183 iterations
#
#   Number of observations                          1379
#
#   Estimator                                         ML
#   Minimum Function Test Statistic               22.737
#   Degrees of freedom                                16
#   P-value (Chi-square)                           0.121
#
# Parameter estimates:
#
#   Information                                 Expected
#   Standard Errors                             Standard
#
#                    Estimate  Std.err  Z-value  P(>|z|)
# Regressions:
#   Time.to.Z ~
#     Start.of.Nght     0.016    0.006    2.778    0.005
#   Time.in.Wake ~
#     Total.Z          -0.026    0.007   -3.592    0.000
#     Time.to.Z         0.314    0.038    8.277    0.000
#   Awakenings ~
#     Time.to.Z         0.026    0.005    5.233    0.000
#     Time.in.Wake      0.057    0.003   16.700    0.000
#     Time.in.REM       0.023    0.002   10.107    0.000
#     Time.in.Light     0.011    0.002    6.088    0.000
#     Start.of.Nght     0.011    0.001   10.635    0.000
#   Time.in.Light ~
#     Time.to.Z        -0.348    0.085   -4.121    0.000
#     Start.of.Nght    -0.195    0.018  -10.988    0.000
#   Time.in.REM ~
#     Time.in.Light     0.358    0.018   19.695    0.000
#     Start.of.Nght     0.034    0.013    2.725    0.006
#   Time.in.Deep ~
#     Time.in.REM       0.081    0.012    6.657    0.000
#     Time.in.Light     0.034    0.009    3.713    0.000
#     Start.of.Nght    -0.017    0.006   -3.014    0.003
#   Total.Z ~
#     Time.in.REM       1.000    0.000 2115.859    0.000
#     Time.in.Light     1.000    0.000 2902.045    0.000
#     Time.in.Deep      1.000    0.001  967.322    0.000
#   ZQ ~
#     Total.Z           0.142    0.000  683.980    0.000
#     Time.in.Wake     -0.071    0.000 -155.121    0.000
#     Time.in.REM       0.071    0.000  167.090    0.000
#     Time.in.Deep      0.211    0.001  311.454    0.000
#     Awakenings       -0.565    0.003 -178.407    0.000
#   Morning.Feel ~
#     Total.Z           0.005    0.001    8.488    0.000
#     Time.to.Z        -0.010    0.001   -6.948    0.000
#     Time.in.Wake     -0.009    0.001   -8.592    0.000
#     Time.in.Light    -0.003    0.001   -2.996    0.003
#     Start.of.Nght     0.002    0.000    5.414    0.000

Again no major surprises, but one thing I notice is that ZQ does not seem to connect to Time.in.Light, though Time.in.Light does connect to Morning.Feel. I've long suspected that ZQ is a flawed summary, and thought it was insufficiently taking into account wakes or something else; it looks like it's Time.in.Light specifically which is missing.
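
This can be read off the fitted graph: bnlearn's `mb()` returns a node's Markov blanket, and (assuming I'm using it correctly) Time.in.Light should be absent from ZQ's:

mb(zeo.hc, "ZQ")
## should list only ZQ's five parents (Total.Z, Time.in.Wake, Time.in.REM,
## Time.in.Deep, Awakenings), with Time.in.Light conspicuously absent
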
`Start.of.Night` is also more highly connected than I had expected.

Comparing the graphs from the 3 algorithms, they don't seem to differ as badly as the weight ones did. Is this thanks to the much greater amount of data, or to the constraints?
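
That impression can be quantified instead of eyeballed: bnlearn ships `compare()` (arc-level true/false positives) and `shd()` (structural Hamming distance), so something like the following (a sketch; I haven't rerun the weight comparison this way):

## lower SHD = more similar structures
shd(zeo.hc, zeo.iamb)
shd(zeo.hc, zeo.gs)
compare(target = zeo.hc, current = zeo.gs)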

#bayesnet #statistics #R  

gwern branwen

Shared publicly
Near-sightedness increases dramatically in all industrializing countries; why? Is it food, lack of physical activity, lack of focusing on distant objects, too much reading, or lack of bright light?

(Pointer from +Evelyn Mitchell)
Short-sightedness is reaching epidemic proportions. Some scientists think they have found a reason why.
 
Aargh. "... a new notion: that spending too long indoors..." It is not a new notion. Where do they get such shit?
 
The Accidental Space Spy: an earthling is accidentally impressed into chasing after an anthropologist and thrown into exotic (and deadly) alien cultures, where he struggles to understand their biologies & reproductive strategies & evolutionary pressures while staying alive. He does, most characters don't.

The art is best described as 'dire but it may grow on you', and the real joy is just the sheer variety of aliens on offer (a bit like Stapledon's Star Maker) and trying to figure out what evo biology scenario has been pushed to the max by the author - costly signaling of fitness, mate guarding, short-term memories, placebo effect etc. Some are pretty cool (I really liked the Invisibility Zone one), others not so much (placebo effect scenario less convincing than Robin Hanson's).

#evolution #biology #webcomics  
PART ONE: HUUG · Commander Harp · Briefing · Gluuduu · Cave · Screams · Huugian women · Evolution of the huugians · Acid · Jumping · Pleading · Animals · Fighting · Big ugly one · Eating · Doohickey · Math · Reasons for living · Dignity · Gluurds · Vjuurgs and Buutuus · Silence · Tomorrow ...

gwern branwen

Shared publicly
"Eight thousand years of [human] natural selection in Europe", Mathieson et al 2015; abstract:

"The arrival of farming in Europe beginning around 8,500 years ago required adaptation to new environments, pathogens, diets, and social organizations. While evidence of natural selection can be revealed by studying patterns of genetic variation in present-day people, these pattern are only indirect echoes of past events, and provide little information about where and when selection occurred. Ancient DNA makes it possible to examine populations as they were before, during and after adaptation events, and thus to reveal the tempo and mode of selection. Here we report the first genome-wide scan for selection using ancient DNA, based on 83 human samples from Holocene Europe analyzed at over 300,000 positions. We find five genome-wide signals of selection, at loci associated with diet and pigmentation. Surprisingly in light of suggestions of selection on immune traits associated with the advent of agriculture and denser living conditions, we find no strong sweeps associated with immunological phenotypes. We also report a scan for selection for complex traits, and find two signals of selection on height: for short stature in Iberia after the arrival of agriculture, and for tall stature on the Pontic-Caspian steppe earlier than 5,000 years ago. A surprise is that in Scandinavian hunter-gatherers living around 8,000 years ago, there is a high frequency of the derived allele at the EDAR gene that is the strongest known signal of selection in East Asians and that is thought to have arisen in East Asia. These results document the power of ancient DNA to reveal features of past adaptation that could not be understood from analyses of present-day people."

Discussion: http://www.unz.com/gnxp/selection-in-europeans-but-it-still-sweeps/ http://infoproc.blogspot.com/2015/03/eight-thousand-years-of-natural.html

As sample size increases (this is based on just n=83) and more ancient DNA becomes available, we can expect to find more soft sweeps showing up at the population level.

gwern branwen

Shared publicly
GWAS hits replicate across populations; dominance genetic variation explains little of variation (most continues to be additive).
The above is a figure from The Fourth Law of Behavior Genetics (ungated), accepted for publication in Current Directions in Psychological Science. It’s an excellent overview of the intersection of behavior genetics and genomics over the past 10 years or so. The full story is outlined in the paper, fleshed out by the copious and […]

gwern branwen

Shared publicly
One thing I find interesting about the nasal cycle is that even though it's ultra-noticeable when you have a cold or stuffy nose, a lot of people seem to have either never realized that they have a cycle or assume it's a personal idiosyncrasy of their nose. (Sort of the opposite of the typical mind fallacy.)
 
I'd read the 1991 experiment on forced unilateral nostril breathing but I didn't know there were more recent ones. Fascinating.

http://www.ncbi.nlm.nih.gov/pubmed/1938166
Basic Information
Gender
Male