April 30, 2016

On rethinking the p-value: I think this is the right attitude.


by Jim Frost, April 30, 2015

Banned! In February 2015, editor David Trafimow and associate editor Michael Marks of the Journal of Basic and Applied Social Psychology declared that the null hypothesis statistical testing procedure is invalid. They promptly banned P values, confidence intervals, and hypothesis testing from the journal.

The journal now requires descriptive statistics and effect sizes. They also encourage large sample sizes but don't require them.

This is the first of two posts in which I focus on the ban. In this post, I’ll start by showing how hypothesis testing provides crucial information that descriptive statistics alone just can't convey. In my next post, I’ll explain the editors' rationale for the ban—and why I disagree with them.

P Values and Confidence Intervals Are Valuable!

It’s really easy to show how P values and confidence intervals are valuable. Take a look at the graph below and determine which study found a true treatment effect and which one didn’t. The difference between the treatment group and the control group is the effect size, which is what the editors want authors to focus on.

[Figure: bar chart comparing the effect sizes of the two studies]

Can you tell? The truth is that the results from both of these studies could represent either a true treatment effect or a random fluctuation due to sampling error.

So, how do you know? There are three factors at play.

  • Effect size: The larger the effect size, the less likely it is to be a random fluctuation. Clearly, Study A has a larger effect size. The large effect seems significant, but it’s not enough by itself.
  • Sample size: A larger sample size allows you to detect smaller effects. If the sample size for Study B is large enough, its smaller treatment effect may very well be real.
  • Variability in the data: The greater the variability, the more likely you’ll see large differences between the experimental groups due to random sampling error. If the variability in Study A is large enough, its larger difference may be attributable to random error rather than a treatment effect.

The effect size from either study could be meaningful, or not, depending on the other factors. As you can see, there are scenarios where the larger effect size in Study A can be random error while the smaller effect size in Study B can be a true treatment effect.

Presumably, these statistics will all be reported under the journal's new focus on effect size and descriptive statistics. However, assessing different combinations of effect sizes, sample sizes, and variability gets fairly complicated. The ban forces journal readers to use a subjective eyeball approach to determine whether the difference is a true effect. And this is just for comparing two means, which is about as simple as it can get! (How the heck would you even perform multiple regression analysis with only descriptive statistics?!)

Wouldn’t it be nice if there was some sort of statistic that incorporated all of these factors and rolled them into one objective number?

Hold on . . . that’s the P value! The P value provides an objective standard for everyone assessing the results from a study.
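To make the interplay of the three factors concrete, here is a minimal simulation sketch in Python (using numpy and scipy). All of the numbers (means, spreads, sample sizes) are invented for illustration: Study A gets the bigger raw difference but noisy data and small samples, Study B the smaller difference but tight data and large samples. With most random seeds, Study A's larger difference is not statistically significant while Study B's smaller one is.

```python
# Minimal sketch: how effect size, sample size, and variability combine
# into a p-value. All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Study A: big raw difference (10), but noisy data and small samples.
a_treatment = rng.normal(loc=110, scale=40, size=15)
a_control   = rng.normal(loc=100, scale=40, size=15)

# Study B: small raw difference (3), but tight data and large samples.
b_treatment = rng.normal(loc=103, scale=5, size=200)
b_control   = rng.normal(loc=100, scale=5, size=200)

for name, treatment, control in [("Study A", a_treatment, a_control),
                                 ("Study B", b_treatment, b_control)]:
    t_stat, p_value = stats.ttest_ind(treatment, control)
    diff = treatment.mean() - control.mean()
    print(f"{name}: observed difference = {diff:5.2f}, p = {p_value:.4f}")
```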

Now, let’s consider two different experiments that have studied the same treatment and have come up with the following two estimates of the effect size.

Effect size, Study C: 10
Effect size, Study D: 10

Which estimate is better? It is pretty hard to say which 10 is better, right? Wouldn’t it be nice if there was a procedure that incorporated the effect size, sample size, and variability to provide a range of probable values and indicate the precision of the estimate?

Oh wait . . . that’s the confidence interval!

If we compute the confidence intervals for Study C [-5, 25] and Study D [8, 12], we gain some very valuable information. The interval for Study C is both very wide and contains 0: the estimate is imprecise, and we can't rule out the possibility of no treatment effect, so we learn very little from this study. The estimate from Study D, on the other hand, is both very precise and statistically significant.

The two studies produced the same point estimate of the effect size, but the confidence interval shows that they're actually very different.
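For the curious, here is a sketch of the arithmetic behind such intervals: a 95% confidence interval is the point estimate plus or minus a t critical value times the standard error. The standard errors and degrees of freedom below are invented to land near the intervals quoted above; they are not from any real study.

```python
# Sketch of the 95% confidence interval arithmetic. The standard errors
# and degrees of freedom are invented to reproduce intervals close to
# [-5, 25] and [8, 12] for the same point estimate of 10.
from scipy import stats

def ci95(estimate, std_err, df):
    """Two-sided 95% confidence interval for a point estimate."""
    t_crit = stats.t.ppf(0.975, df)
    return estimate - t_crit * std_err, estimate + t_crit * std_err

print("Study C:", ci95(estimate=10, std_err=7.2, df=20))   # wide, spans 0
print("Study D:", ci95(estimate=10, std_err=1.0, df=200))  # narrow, excludes 0
```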

Focusing solely on effect sizes and descriptive statistics is inadequate. P values and confidence intervals contribute truly important information that descriptive statistics alone can’t provide. That's why banning them is a mistake.

See a graphical explanation of how hypothesis tests work.

If you'd like to see some fun examples of hypothesis tests in action, check out my posts about the Mythbusters!

The editors do raise some legitimate concerns about the hypothesis testing process. In part two, I assess their arguments and explain why I believe a ban still is not justified.


http://bit.ly/1SAsBYW

April 27, 2016

Can you use data analysis to determine whether this claim holds? Su Chi: Taiwanese support for independence is conditional | 新頭殼 newtalk

In class, we will use the Duke data to replicate and test this conclusion.
Former Mainland Affairs Council chairman Su Chi, speaking at a book launch on the 7th, said that Taiwanese support for independence is "conditional," and that a key variable is "whether China would attack Taiwan." Photo: 鄭佑漢

President-elect Tsai Ing-wen will take office after May 20, and the blue and green camps are still arguing over whether a "1992 Consensus" exists between the two sides of the strait. Former Mainland Affairs Council chairman Su Chi attended a launch event on the 7th for his new book 《波濤滾滾:1986-2015兩岸談判30年關鍵秘辛》 (roughly, Rolling Waves: Key Inside Stories from 30 Years of Cross-Strait Negotiations, 1986-2015). Discussing cross-strait relations, he said that Taiwanese support for independence is conditional, with the two main factors being "whether the Chinese Communists would attack Taiwan" and "whether the United States would come to the rescue."

At the event, Su Chi said that in data from a survey on Taiwan independence commissioned by Duke University professor Emerson Niou (牛銘實) and carried out by National Chengchi University, he found that Taiwanese support for independence is "conditional," with two key variables: "whether China would attack Taiwan" and "whether the US would come to Taiwan's rescue."

The data show that in the scenario where China would not attack and the US would not come to the rescue (the status quo), support for independence is nearly 60%; if China would not attack and the US would come to the rescue, support rises to 78%. Once the factor "China would attack" is added, however, support for independence is immediately cut in half: even if the US would intervene, support falls below 40%. From these numbers Su Chi concluded that Taiwanese support independence conditionally, and that the China factor matters more than the US factor; Tsai Ing-wen's stance on the "1992 Consensus" at her May 20 inauguration will therefore affect cross-strait relations.

Su Chi also noted that the "cross-strait peace" the Ma administration took great pride in over its eight years in power was, in his view, both Ma Ying-jeou's success and his failure: the more peaceful and secure President Ma made cross-strait relations feel, the higher support for independence climbed. Su Chi added that Tsai Ing-wen had already denied the existence of the 1992 Consensus back in 2000, so he does not expect her to turn around and accept it after winning the 2016 presidency by a wide margin; China's position on the 1992 Consensus, however, has never changed.

Su Chi also stressed that after taking office on May 20, Tsai Ing-wen can no longer take an ambiguous line; she should tackle cross-strait relations head-on and steer the two sides toward peace. As for whether Tsai's inaugural address will affect relations with China, Su Chi would only say, "Blessings, blessings, and more blessings."
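As a starting point for the in-class replication mentioned above, here is a hypothetical sketch in Python. The file name and column names are invented; the real Duke/NCCU survey has its own codebook, so its variables would need to be mapped onto these scenarios before any of this runs against actual data.

```python
# Hypothetical sketch of the in-class replication: tabulating support for
# independence under each conditional scenario. File and column names are
# invented; map the real survey's codebook variables onto them.
import pandas as pd

df = pd.read_csv("duke_taiwan_survey.csv")  # hypothetical data extract

# Assume each scenario column is coded 1 = supports independence under
# that scenario, 0 = does not.
scenarios = {
    "no_attack_no_rescue": "China won't attack, US won't rescue (status quo)",
    "no_attack_rescue":    "China won't attack, US will rescue",
    "attack_rescue":       "China will attack, US will rescue",
}

for column, label in scenarios.items():
    support = df[column].mean()  # share of respondents supporting independence
    print(f"{label}: {support:.1%} support independence")
```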


http://bit.ly/234JlJh

April 04, 2016

Online surveys are moving down the path of being combined with experiments. It is happening now.

Perhaps before long this will become the new mainstream. The excerpt below gives a glimpse of it.
Michael Smith writes: I have a research challenge and I was hoping you could spare a minute of your time. I hope it isn’t a bother—I first came across you when I saw your post on how psychology researchers can learn from statisticians. I figure even if you don’t know the answer to this question, you might know someone who would. My colleagues and I want to explore implicit biases using the trolley problem as the mechanism for discovering these biases. The problem we have is we have very specific needs for o…

http://bit.ly/22459Eh