author: niplav, created: 2022-02-04, modified: 2024-04-15, language: english, status: in progress, importance: 4, confidence: unlikely
In sparse-ish linear causal networks, absence of correlation coincides with causation at most ≈5% of the time, but not never. The number of causal non-correlations first grows and then shrinks with the number of nodes in the causal network, with the maximum at ≈30 nodes. I can't explain why.
"Correlation ⇏ Causation" is trite by now. And we know that the inverse is also false: "¬Correlation ⇏ ¬Causation".
Spencer Greenberg summarizes:

> All of this being said, while causation does not NECESSARILY imply correlation, causation USUALLY DOES imply correlation. Some software that attempts to discover causation in observational data even goes so far as to make this assumption of causation implying correlation.
I, however, have an inner computer scientist.
And he demands answers.
He will not rest until he knows how often ¬Correlation ⇒ ¬Causation, and how often it doesn't.
This can be tested by creating a Monte-Carlo simulation over random linear structural equation models with $n$ variables, computing the correlations between the different variables for random inputs, and checking whether the correlations being very small implies that there is no causation.
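To see how causation without correlation can arise at all, here is a minimal hypothetical example (mine, not from the simulation below): two causal paths whose effects exactly cancel.

```julia
using Statistics, Random

# Illustration only: X causes Z along two paths that exactly cancel,
# so cor(X, Z) ≈ 0 even though X unambiguously causes Z.
Random.seed!(0)
n = 100_000
x = randn(n)
y = x .+ randn(n)        # X → Y with coefficient 1
z = x .- y .+ randn(n)   # X → Z (coefficient 1) and Y → Z (coefficient -1)

cor(x, z)  # ≈ 0: the direct effect 1 and the indirect effect 1 · (-1) cancel
```

Substituting, $z = x - (x + \epsilon_y) + \epsilon_z = \epsilon_z - \epsilon_y$, which is independent of $x$, so the correlation is zero in expectation despite two causal paths from X to Z.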
So we start by generating a random linear SEM with $n$ variables (code in Julia). The parameters are normally distributed with mean 0 and variance 1, but for now we'll assume there is no noise.
using Graphs

# A linear SEM: a DAG together with a coefficient for each edge.
struct LinearSEM
    g::SimpleDiGraph{Int64}
    coefficients::Dict
end
We can decide how dense/sparse we want the SEM to be via the threshold
parameter, the probability that two different nodes have an edge
between them. The higher the threshold, the more edges in the SEM.
function random_linear_sem(n::Int, threshold=0.5)
    g = DiGraph(n)
    # Only add edges i → j with i < j, so the graph is a DAG by construction.
    for i in 1:n
        for j in (i+1):n
            if rand() < threshold
                add_edge!(g, i, j)
            end
        end
    end
    # Draw a standard-normal coefficient for every edge.
    coefficients = Dict()
    for edge in edges(g)
        coefficients[edge] = randn()
    end
    return LinearSEM(g, coefficients)
end
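As a sanity check on the density parameter: the expected edge count is threshold · n(n−1)/2, since each of the n(n−1)/2 index-ordered pairs gets an edge independently with probability threshold. A self-contained sketch of just the edge-sampling step (standalone, so it runs without Graphs.jl):

```julia
using Random, Statistics

# Replicates only the coin-flipping loop of random_linear_sem:
# count how many upper-triangular pairs get an edge, many times over.
Random.seed!(1)
n, threshold = 20, 0.25
edge_counts = [sum(rand() < threshold for i in 1:n for j in (i+1):n)
               for _ in 1:2000]
mean(edge_counts)  # ≈ 0.25 * 20 * 19 / 2 = 47.5
```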
We can then run a bunch of inputs through that model, and compute their correlations:
using Statistics, LinearAlgebra

function correlations(sem::LinearSEM, inner_samples::Int)
    n = size(vertices(sem.g), 1)
    input_nodes = [node for node in vertices(sem.g) if indegree(sem.g, node) == 0]
    results = Matrix{Float64}(undef, inner_samples, n) # Preallocate results matrix
    for i in 1:inner_samples
        # Draw standard-normal values for the exogenous (parentless) nodes
        # and propagate them through the SEM.
        input_values = Dict([node => randn() for node in input_nodes])
        sem_values = calculate_sem_values(sem, input_values)
        sem_value_row = reshape(collect(values(sort(sem_values))), 1, :)
        results[i, :] = sem_value_row
    end
    cor_matrix = cor(results)
    cor_matrix[diagind(cor_matrix)] .= 0
    return abs.(cor_matrix)
end
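The helper `calculate_sem_values` is used above but not shown. Presumably it evaluates each node in topological order as the coefficient-weighted sum of its parents, with root nodes taking their exogenous input. A hedged sketch of that logic, written against a plain weight matrix (nodes assumed already topologically ordered, `W[i, j]` the coefficient on edge i → j) so it runs standalone; the name `sem_values_from_weights` is mine, not the post's:

```julia
# Sketch of what calculate_sem_values plausibly does, on a weight matrix:
# roots take their exogenous input, every other node is the weighted sum
# of its parents' values.
function sem_values_from_weights(W::Matrix{Float64}, inputs::Vector{Float64})
    n = size(W, 1)
    values = zeros(n)
    for j in 1:n
        parents = findall(!iszero, W[:, j])
        if isempty(parents)
            values[j] = inputs[j]  # root node: exogenous input
        else
            values[j] = sum(W[i, j] * values[i] for i in parents)
        end
    end
    return values
end

# Chain 1 → 2 → 3 with weights 2 and 3: input 1.0 at the root
# propagates to (1.0, 2.0, 6.0).
W = [0.0 2.0 0.0; 0.0 0.0 3.0; 0.0 0.0 0.0]
sem_values_from_weights(W, [1.0, 0.0, 0.0])
```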
We can then check how many correlations are "incorrectly small".
Let's take all the correlations between variables which don't have any causal relationship. The largest of those is the "largest uncaused correlation". Correlations between two variables of which one causes the other, but which are smaller than the largest uncaused correlation, are "too small": there is a causation, but it is not detected.
We can now take the correlations and separate them into correlations between pairs of variables with a causal relationship (i.e., a directed path through the DAG representing the SEM), and pairs of variables without a causal relationship.
correlation = correlations(sem, inner_samples)
# influence[i, j] is true iff j causally influences i via some directed path.
influence = Matrix(Bool.(transpose(adjacency_matrix(transitiveclosure(sem.g)))))
not_influence = tril(.!(influence), -1)
non_causal_cors = not_influence .* correlation
causal_cors = influence .* correlation
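What the transitive closure in the influence matrix encodes can be seen on a toy chain; a minimal illustration with plain matrices instead of Graphs.jl's `transitiveclosure`:

```julia
# Illustration only: for the chain 1 → 2 → 3, the transitive closure also
# contains the indirect edge 1 → 3. With 3 nodes, paths of length ≤ 2
# suffice, so the closure is A ∪ A².
A = Bool[0 1 0; 0 0 1; 0 0 0]              # adjacency matrix of 1 → 2 → 3
closure = A .| (Int.(A) * Int.(A) .> 0)
closure  # entry (1, 3) is now true: node 1 causally influences node 3
```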
This gives us two distributions, the distribution of non_causal_cors
and the distribution of causal_cors
, e.g. for SEMs with 48 variables:
One may notice that some variables that are not causing each other still have high correlations with each other; this is because they have a common cause. So we have to decide what it means for a correlation to be too small to be relevant.
I can see three different salient options to decide whether a correlation is small:
And, in the outermost loop, we compute the number of misclassifications for a number of linear SEMs (with a threshold of 0.25, since 0.5 usually produces SEMs which are too dense):
function misclassified_absence_mc(n::Int, outer_samples::Int, inner_samples::Int)
    return [misclassifications(random_linear_sem(n, 0.25), inner_samples) for i in 1:outer_samples]
end
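The function `misclassifications` is called here but not shown. Going by the definition above — causal correlations smaller than the largest uncaused correlation count as "too small" — a hedged, matrix-only sketch (the name `count_too_small` and the argument conventions are mine):

```julia
using LinearAlgebra

# Sketch only: `correlation` is a symmetric matrix of absolute correlations,
# `influence[i, j]` is true (in the lower triangle) iff there is a causal
# path between i and j. Count causal pairs whose correlation is smaller
# than the largest correlation among non-causal pairs.
function count_too_small(correlation::Matrix{Float64}, influence::Matrix{Bool})
    not_influence = tril(.!influence, -1)      # non-causal pairs
    largest_uncaused = maximum(not_influence .* correlation)
    causal_cors = correlation[influence]       # correlations of causal pairs
    return count(<(largest_uncaused), causal_cors)
end

# Toy case: 1 causes 2 (cor 0.9) and 3 (cor 0.05); 2 and 3 are non-causally
# correlated at 0.3. Only the 0.05 falls below 0.3.
correlation = [0.0 0.9 0.05; 0.9 0.0 0.3; 0.05 0.3 0.0]
influence = Bool[0 0 0; 1 0 0; 1 0 0]
count_too_small(correlation, influence)
```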
So we collect a bunch of samples. SEMs with one, two and three variables are ignored because when running the code, they never give me any causal non-correlations. (I'd be interested in seeing examples to the contrary).
results = Dict{Int, Array{Int, 1}}()
results_lock = ReentrantLock()
sem_samples = 400
inputs_samples = 10000
upperlim = 52
stepsize = 4
Threads.@threads for i in 4:stepsize:upperlim
    sample = misclassified_absence_mc(i, sem_samples, inputs_samples)
    # Dict is not thread-safe, so guard the shared insertion with a lock.
    lock(results_lock) do
        results[i] = sample
    end
end
We can now calculate the mean number of small causal correlations and the proportion of small causal correlations, using the formula for the triangular number:
result_means=[mean(values) for (key, values) in sort(results)]
result_props=[mean(values)/((key^2+key)/2) for (key, values) in sort(results)]
So it looks like at most 5% of causal relationships have a correlation that is "too small", which happens when the SEM has ≈30 variables; that proportion shrinks for larger SEMs.
I find this very surprising: Is there a specific mathematical reason why SEMs with 30 variables have the highest number of causal non-correlations? The mean number of causal relationships with small correlations seems to grow quite steadily, so I'm not sure what's going on here.
Is the issue with the number of inner samples, are we simply not checking enough? But 10k samples ought to be enough for anybody—if that's not sufficient, I don't know what is.
But better to write some code and check:
more_samples = Dict{Int, Array{Int, 1}}()
samples_test_size = 20
sem_samples = 400
inputs_samples = 2 .^ (6:16)
for inputs_sample in inputs_samples
    println(inputs_sample)
    more_samples[inputs_sample] = misclassified_absence_mc(samples_test_size, sem_samples, inputs_sample)
end
Plotting the number of causal non-correlations reveals that 10k samples ought to be enough, at least for small numbers of variables:
The densities fluctuate, sure, but not so much that I'll throw out the baby with the bathwater. If I were a better person, I'd run a statistical test here, but alas, I am not.
I don't know why causal non-correlations are so common, especially in linear SEMs.