Response to Bem’s Comments
January 6, 2011
James Alcock responds to Daryl Bem's comments on his critique.
Outrage and ad hominem condescension. Bem’s response brings to mind an old adage from the legal world: “If the facts are against you, argue the law; if the law is against you, argue the facts; if the facts and the law are against you, yell like hell.” And yell like hell he does. Rather than deal with much of the substance of my critique, he directs his attack at my abilities. However, the issue is not about my intelligence or his, nor about my character or his. It is about the data and the way in which they were gathered and analyzed.
Not only does he shoot the messenger; much of his defence rests on an appeal to authority – in this case, the authority of the reviewers at the Journal of Personality and Social Psychology who recommended that his article be published. While I do not understand why they decided as they did, their decision cannot make a silk purse out of a sow’s ear; it does not make the serious flaws in this research go away. Bem also employs the tired defence that it is not enough for a critic to find flaws in a study; the critic must also show that the obvious flaws could themselves account for the observed results. This is, of course, simply nonsense. The burden of proof is on the individual who presents the data, and significant flaws in the research militate against confidence that the researcher did not make other undetected and unreported errors as well. This is all the more concerning when one claims evidence for phenomena that contradict well-established knowledge in physics, neurology, and psychology.
In response to his various points:
- He accuses me of an imaginative rewriting of parapsychological history. However, in essence, I simply pointed out that none of the much-touted supposed parapsychological “breakthroughs” of the past have succeeded in establishing the reality of psychic phenomena so far as the larger scientific community is concerned. If he believes otherwise, then it is his imagination and not mine that is at play.
- Bem chooses not to address most of my criticisms about his procedure, but instead misleadingly states that my major criticism is the “selection and deployment of the pictorial stimuli used in six of the nine experiments.” My major methodological criticism actually has to do with the chaotic, careless nature of his procedures; his “selection and deployment of stimuli” only reflects the more general problem, and is not the key problem itself. Whether gay participants were provided with homosexual pictures is not in and of itself at issue. However, his response actually adds further obfuscation. Let me explain: Although the presumed impact of erotic stimuli is central to several of his experiments, he made no effort to assess whether any of the erotic stimuli are actually erotic. Instead, begging the question in the true philosophical sense of the term, he simply assumed that because females “psychically” scored above chance with the erotic pictures, while males did not, this gender difference must have come about because the males did not find the pictures to be erotic enough. He states, in regard to my critique of Experiment 1:
“[Male] and female raters differed markedly in their ratings of negative and erotic pictures. Male raters rated every one of the negative pictures as less negative and less arousing than did the female raters, and they gave more positive ratings than the female raters to the most explicit erotic pictures. Possibly reflecting this sex difference, female participants showed significant psi effects with negative and erotic stimuli in my earliest experiment but male participants did not. Accordingly, I decided to introduce different sets of pictures for men and women in subsequent experiments, choosing more extreme and more arousing pictures for the [men].”
This would seem to suggest that he directly assessed the erotic nature of the stimuli – that his participants actually rated the pictures. They did not. His conclusion about gender differences in erotic response is based on the ratings provided with the IAPS set of pictures that he used, ratings that were available to him before he began his experiments. He apparently had no concern about those gender differences until after he had collected his data, at which time, having found that males did not show what he refers to as “significant psi effects” (actually, just significant deviations from the chance response rate), he decided to modify his stimuli set for males for Experiment 6, and in other subsequent experiments, including the strangely numbered Experiment 1. However, in his original article, there is nothing in his description of Experiment 1 that discusses male and female reactions to the stimuli, apart from a reference to Experiment 5 (which was run before Experiment 1!). It is in Experiment 6 where he makes clear where the “ratings” came from. He states (in his original article):
“… we decided to use sets of negative and erotic pictures that were different for men and women. As noted above, women showed a significant psi effect on the negative trials in Experiment 5, but men did not. ... The ratings supplied with the IAPS pictures reveal that male raters rated every one of the negative pictures in the set as less negative and less arousing than did female raters. … So, for this replication, we supplemented the IAPS pictures for men with stronger and more explicit negative and erotic images obtained from Internet sites.” (My emphasis.) Thus, after relying on a standardized set of pictures, he then chose who-knows-what pictures from the internet to “supplement” the IAPS pictures (how many internet pictures were involved, he does not say). Again, let me make clear that it is not the picture set per se that is my concern; it is the arbitrary way in which pictures were chosen and switched about, which reflects the systemic experimental carelessness.
- Bem points out, correctly, that I was confused with regard to the procedure in Experiment 1. I was initially mistaken in interpreting the chance rate for erotic pictures as being 33% for some of the participants, although I did indeed work out that his primary analysis concerned the hit rate on the erotic targets for all 100 participants, with a chance rate of 50%. Upon review, I can assure the reader that none of this has any bearing on my overall conclusions.
Note the relative clarity of his present description compared with the description in the original article. (Keep in mind that by “session,” he is referring to the trials conducted with a single participant.) From his present response to me:
“[There] were 100 sessions in this experiment, and on each of 36 trials, participants saw images of two curtains side-by-side on the computer screen. They were told that a picture would be behind one of the curtains and a blank wall would be behind the other. Their task on each trial was to click on the curtain they felt concealed the picture. After they made their selection, the selected curtain opened, revealing either a picture or a blank wall. ... On randomly selected trials, the picture was erotic; on other trials, it was a nonerotic [picture], and the participant had no (non-psi) way of knowing which kind of picture would be used on any given trial. Because there were two alternatives on each trial—left curtain or right curtain—the probability that the participant would correctly select the location of the picture by chance was always 50%.”
And from his original article: “[For] this purpose, 40 of the sessions comprised 12 trials using erotic pictures, 12 trials using negative pictures, and 12 trials using neutral pictures. The sequencing of the pictures and their left/right positions were randomly determined by the programming language’s internal random function. The remaining 60 sessions comprised 18 trials using erotic pictures and 18 trials using nonerotic positive pictures with both high and low arousal ratings. These included eight pictures featuring couples in romantic but nonerotic situations (e.g., a romantic kiss, a bride and groom at their wedding). The sequencing of the pictures on these trials was randomly determined by a randomizing algorithm … Although it is always desirable to have as many trials as possible in an experiment, there are practical constraints limiting the number of critical trials that can be included in this and several other experiments reported in this article. In particular, on all the experiments using highly arousing erotic or negative stimuli a relatively large number of nonarousing trials must be included to permit the participant’s arousal level to “settle down” between critical trials. This requires including many trials that do not contribute directly to the effect being tested...”
On thinking again about his procedure, another problem becomes apparent. There was a relatively large data set for the erotic pictures, since all participants were exposed to them, but the data set for each of the various categories of non-erotic images was considerably smaller, which makes it less likely that one would detect small but significant effects in those data even were they to exist. Thus, the procedure is weighted such that significant above-chance rates, if they exist, are less likely to be detected for the non-erotic stimuli, because smaller sample sizes provide less power in the statistical tests. So, while he makes much of having found significant “psi” effects only with the erotic stimuli (notwithstanding the statistical problems associated with the multiple testing), this difference might well be due only to differences in the power of the tests, reflecting the differences in sample size.
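The power difference can be illustrated with a simple sketch. (The numbers are illustrative only; Bem's actual analysis used t tests across participants, not a pooled binomial test.) With the same modest true hit rate of 53%, a one-tailed exact binomial test against the 50% chance rate has far more power over a large pooled trial count than over a small subcategory:

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space to avoid underflow."""
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))

def power_one_tailed(n, p_true, p_null=0.5, alpha=0.05):
    """Power of a one-tailed exact binomial test of H0: p = p_null."""
    # Find the smallest hit count whose upper-tail probability under H0
    # drops to alpha or below.
    sf, k = 1.0, 0
    while sf > alpha:
        sf -= binom_pmf(k, n, p_null)
        k += 1
    # Power: probability of reaching that criterion when p = p_true.
    return sum(binom_pmf(i, n, p_true) for i in range(k, n + 1))

# Trial counts implied by the session design quoted above:
# 40 x 12 + 60 x 18 = 1,560 erotic trials, but only 40 x 12 = 480
# trials in the negative-picture category.
for n in (1560, 480):
    print(n, round(power_one_tailed(n, p_true=0.53), 2))
```

Under these illustrative assumptions, the large pool has power well above the small one, so a null result in the small categories tells us little.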
- Bem deals with my critique of his statistical analysis in a purely ad hominem manner. He completely fails to respond to one of my most serious criticisms, about his use of one-tailed tests. However, he sneers at my concerns about the multiple t-tests upon which he relies, and argues:
“… t tests demonstrated that participants did no better than chance on any of the subcategories of nonerotic pictures. It is here that Alcock first complains about my performing multiple tests without adjusting the significance level for the number of tests performed. In this case, Alcock is almost right. Suppose that in testing each of the four subcategories of nonerotic pictures, I had found that one of them (e.g., romantic pictures) showed a significant precognitive effect. Because this finding would have emerged post hoc, only after I had first performed separate tests on four different picture types, I would have had to adjust the significance level to be less significant. If I did not, I would be illegitimately capitalizing on the likelihood that at least one of the four tests would have yielded a positive result just by chance. But there was no psi effect on any of the subcategories of nonerotic pictures.”
He thus justifies the decision not to control for multiple testing on the basis of an examination of the data themselves! Because he found no psi effect for the non-erotic categories, he argues that there was no problem with conducting the several tests without controlling for multiple testing. One simply cannot make one’s choice of statistical procedure after looking at the data. One must adjust the statistical criterion in advance to reflect the number of tests to be done, taking into account not only the t-tests for the non-erotic images but the test for the erotic images as well. Since he obviously did not do so, he employed the wrong criterion value, and that, combined with the use of one-tailed testing, greatly magnifies the likelihood of finding statistical significance where none exists.
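The point can be checked with a simple simulation (a sketch only, not part of either analysis; the five-test count merely mirrors the erotic category plus four nonerotic subcategories). It estimates how often at least one of several independent tests reaches “significance” when no effect exists at all, with and without a Bonferroni adjustment:

```python
import random

random.seed(42)

def family_wise_error(n_tests, alpha, n_sims=100_000):
    """Estimated probability that at least one of n_tests independent
    tests is 'significant' at level alpha when every null is true."""
    hits = 0
    for _ in range(n_sims):
        # Under the null hypothesis, each test's p-value is uniform on [0, 1].
        if any(random.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_sims

# Five uncorrected tests at the .05 level:
print(round(family_wise_error(5, 0.05), 3))       # ~0.23, far above the nominal .05
# A Bonferroni-adjusted threshold of .05/5 restores the nominal rate:
print(round(family_wise_error(5, 0.05 / 5), 3))   # ~0.05
```

With five uncorrected tests, the chance of at least one spurious “significant” result is roughly 1 − 0.95⁵ ≈ 0.23, which is why the criterion must be set in advance.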
- With regard to his comments about the Priming studies, the reader should note that, contrary to Bem’s complaint, I did not object to the use of complex data analyses involving two transformations and outlier cut-off criteria. I simply said that because of the complexity of the analyses, one cannot fairly judge his interpretations of the data without seeing the data and the analyses that were done.
It is interesting that Bem states:
“[At] least one expert in priming experiments has also argued that one should always perform several analyses using different transformations and different cut-off criteria to ensure that the priming effects hold up across these variations. That is precisely what I did.”
That may be just fine, but the cautious observer will reserve judgment in the absence of detailed information about those various analyses and the transformations and cut-off criteria upon which they relied.
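Such a robustness report is straightforward to produce. As a minimal sketch (using hypothetical reaction-time data, since Bem's raw priming data are not available here), one would show the same summary statistic under each combination of outlier cutoff and transformation:

```python
import math
import random
import statistics

random.seed(1)
# Hypothetical response times in milliseconds, for illustration only;
# Bem's raw priming data are not reproduced in his article.
rts = [random.lognormvariate(6.5, 0.4) for _ in range(300)]

def summarize(data, cutoff_ms, transform):
    """Mean of the transformed data after dropping responses slower than cutoff_ms."""
    return statistics.mean(transform(x) for x in data if x <= cutoff_ms)

# A robustness report shows the same statistic under every combination
# of outlier cutoff and transformation, so readers can judge whether
# the conclusion depends on any one analytic choice.
for cutoff in (1500, 2500, 3500):
    for name, fn in (("raw", float), ("log", math.log), ("inverse", lambda x: 1000.0 / x)):
        print(f"cutoff={cutoff}ms  {name:8s} mean={summarize(rts, cutoff, fn):.3f}")
```

It is precisely this kind of table, for the actual data, that a reader would need in order to verify the claim that the effects “hold up across these variations.”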
- For all his sound and fury, this seriously flawed set of experiments is still a seriously flawed set of experiments. If Bem wants the scientific world to pay attention to his claims of psi, he must first produce meaningful data from a well-designed, well-executed, and well-analyzed experiment. Neither excuses for careless research nor angry defences of it will achieve this; he must simply do it right.