## Phi_K Correlation Analyzer Library

Phi_K is a new and practical correlation coefficient based on several refinements to Pearson’s hypothesis test of independence of two variables.

The combined features of Phi_K form an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables. Second, it captures non-linear dependency. Third, it reverts to the Pearson correlation coefficient in case of a bi-variate normal input distribution. These are useful features when studying the correlation matrix of variables with mixed types.

The presented algorithms are easy to use and available through this public Python library: the correlation analyzer package. Emphasis is placed on the proper evaluation of the statistical significance of correlations and on the interpretation of variable relationships in a contingency table, in particular for low-statistics samples.

For example, the Phi_K correlation analyzer package has been used to study surveys, insurance claims, correlograms, etc. For details on the methodology behind the calculations, please see our publication.

### Documentation

The entire Phi_K documentation, including tutorials, can be found at read-the-docs. See the tutorials for detailed examples on how to run the code with pandas. We also have one example on how to calculate the Phi_K correlation matrix for a Spark dataframe.

### Check it out

The Phi_K library requires Python 3 and is pip friendly. To get started, simply do:

$ pip install phik

or check out the code from our GitHub repository:

$ git clone https://github.com/KaveIO/PhiK.git

$ pip install -e PhiK/

where in this example the code is installed in edit mode (option -e).

You can now use the package in Python with:

import phik

**Congratulations, you are now ready to use the PhiK correlation analyzer library!**

### Quick run

As a quick example, you can do:

    import pandas as pd
    import phik
    from phik import resources, report

    # open fake car insurance data
    df = pd.read_csv( resources.fixture('fake_insurance_data.csv.gz') )
    df.head()

    # Pearson's correlation matrix between numeric variables (pandas functionality)
    df.corr()

    # get the phi_k correlation matrix between all variables
    df.phik_matrix()

    # get global correlations based on phi_k correlation matrix
    df.global_phik()

    # get the significance matrix (expressed as one-sided Z)
    # of the hypothesis test of each variable-pair dependency
    df.significance_matrix()

    # contingency table of two columns
    cols = ['mileage','car_size']
    df[cols].hist2d()

    # normalized residuals of contingency test applied to cols
    df[cols].outlier_significance_matrix()

    # show the normalized residuals of each variable-pair
    df.outlier_significance_matrices()

    # generate a phik correlation report and save as test.pdf
    report.correlation_report(df, pdf_file_name='test.pdf')
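The significance matrix in the quick run is described as being expressed as a one-sided Z. As a minimal sketch of what such a number means, assuming the usual convention that a one-sided Z is the standard-normal quantile of `1 - p`, the conversion between a p-value and a Z-score can be written with the Python standard library alone. The helper names below are hypothetical and are not part of the phik API.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
_STD_NORMAL = NormalDist()


def p_value_to_one_sided_z(p_value):
    """Convert a one-sided p-value (0 < p < 1) into a Z-score.

    Assumption: Z is the standard-normal quantile of 1 - p, so small
    p-values map to large positive Z values.
    """
    return _STD_NORMAL.inv_cdf(1.0 - p_value)


def one_sided_z_to_p_value(z):
    """Convert a one-sided Z-score back into a p-value."""
    return 1.0 - _STD_NORMAL.cdf(z)


# A p-value of 5% corresponds to roughly 1.64 standard deviations.
z_5_percent = p_value_to_one_sided_z(0.05)

# A Z of 3 corresponds to a p-value of roughly 0.00135.
p_at_3_sigma = one_sided_z_to_p_value(3.0)

print(round(z_5_percent, 3), round(p_at_3_sigma, 5))
```

Under this convention, larger Z values indicate stronger evidence against the hypothesis that two variables are independent.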

For all available examples, please see the tutorials at read-the-docs.

### Contact and support

- Issues & Ideas: https://github.com/kaveio/phik/issues
- Q&A Support: contact us at: kave [at] kpmg [dot] com

Please note that KPMG provides support only on a best-effort basis.

### Contents

- Why did we build this?
- Tutorials
- Publication & Talks
- Publication
- Talks
- References

- Working on the package
- Contributing
- Tips and Tricks

## Simple code for phi(k) correlation matrix in Python

I am looking for a simple way (2 or 3 lines of code) to generate a Phi(k) correlation matrix in Python. That should be possible since pandas_profiling does it, and it works fine. But I want to be able to do it without pandas_profiling, which is too heavy and computes things I don’t need.

pandas_profiling uses the phik library. I tried the phik library (I didn’t find anything else), but I don’t understand the error I got:

`TypeError: sequence item 0: expected str instance, int found`

I have no int in my dataframe. It seems like a bug in phik, but then how does pandas_profiling manage, since it uses phik too? What’s happening here? Many thanks.

I have this code:

    import numpy as np
    import pandas as pd
    import phik

    NB_SAMPLES = 200
    NB_VARIABLES = 3

    rand_mat = np.random.uniform(low=0.5, high=15, size=(NB_SAMPLES,NB_VARIABLES))
    df = pd.DataFrame(rand_mat)

    df['cat_column'] = pd.cut(df[0], bins=5, labels=['F1','F2','F3','F4','F5'])
    print(df)
    df.phik_matrix()

                 0          1          2 cat_column
    0     0.911098   8.549206   9.270484         F1
    1    13.591250   9.161498   5.614470         F5
    2     3.308305   1.589402   5.394675         F1
    3    12.031064   9.968686   7.519628         F5
    4    14.427813   1.533533   2.352659         F5
    ..         ...        ...        ...        ...
    195  10.556285   3.541869   4.804826         F4
    196   5.721784  11.783908  13.104844         F2
    197   7.336637  14.512256  14.993096         F3
    198   4.375895  11.881784   1.129816         F2
    199   0.519900   6.624423   9.239070         F1

    [200 rows x 4 columns]
    interval_cols not set, guessing: [0, 1, 2]
    ---------------------------------------------------------------------------
    _RemoteTraceback                          Traceback (most recent call last)
    _RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
        r = call_item()
      File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
        return self.fn(*self.args, **self.kwargs)
      File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 608, in __call__
        return self.func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 256, in __call__
        for func, args, kwargs in self.items]
      File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 256, in <listcomp>
        for func, args, kwargs in self.items]
      File "/opt/conda/lib/python3.7/site-packages/phik/phik.py", line 162, in _calc_phik
        combi = ':'.join(comb)
    TypeError: sequence item 0: expected str instance, int found
    """

    The above exception was the direct cause of the following exception:

    TypeError                                 Traceback (most recent call last)
     in
         11 df['cat_column'] = pd.cut(df[0], bins=5, labels=['F1','F2','F3','F4','F5'])
         12 print(df)
    ---> 13 df.phik_matrix()

    /opt/conda/lib/python3.7/site-packages/phik/phik.py in phik_matrix(df, interval_cols, bins, quantile, noise_correction, dropna, drop_underflow, drop_overflow)
        215     data_binned, binning_dict = bin_data(df_clean, cols=interval_cols_clean, bins=bins, quantile=quantile, retbins=True)
        216     return phik_from_rebinned_df(data_binned, noise_correction, dropna=dropna, drop_underflow=drop_underflow,
    --> 217                                  drop_overflow=drop_overflow)
        218
        219

    /opt/conda/lib/python3.7/site-packages/phik/phik.py in phik_from_rebinned_df(data_binned, noise_correction, dropna, drop_underflow, drop_overflow)
        145
        146     phik_list = Parallel(n_jobs=NCORES)(delayed(_calc_phik)(co, data_binned[list(co)], noise_correction)
    --> 147                                         for co in itertools.combinations_with_replacement(data_binned.columns.values, 2))
        148
        149     phik_overview = create_correlation_overview_table(dict(phik_list))

    /opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
       1015
       1016             with self._backend.retrieval_context():
    -> 1017                 self.retrieve()
       1018             # Make sure that we get a last message telling us we are done
       1019             elapsed_time = time.time() - self._start_time

    /opt/conda/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
        907             try:
        908                 if getattr(self._backend, 'supports_timeout', False):
    --> 909                     self._output.extend(job.get(timeout=self.timeout))
        910                 else:
        911                     self._output.extend(job.get())

    /opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
        560         AsyncResults.get from multiprocessing."""
        561         try:
    --> 562             return future.result(timeout=timeout)
        563         except LokyTimeoutError:
        564             raise TimeoutError()

    /opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
        433                 raise CancelledError()
        434             elif self._state == FINISHED:
    --> 435                 return self.__get_result()
        436             else:
        437                 raise TimeoutError()

    /opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
        382     def __get_result(self):
        383         if self._exception:
    --> 384             raise self._exception
        385         else:
        386             return self._result

    TypeError: sequence item 0: expected str instance, int found
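Judging only from the traceback, the failing statement is `':'.join(comb)`, where `comb` is a pair of column names taken from `data_binned.columns.values`. A DataFrame built from a bare NumPy array gets the integer column labels `0, 1, 2`, so the integers are most likely the column names rather than the data. The sketch below reproduces that behaviour with the standard library only; the `columns` list is a stand-in for the DataFrame's column labels, not phik's actual code.

```python
import itertools

# Stand-in for df.columns in the question: three integer labels
# (from pd.DataFrame(rand_mat)) plus one string label.
columns = [0, 1, 2, "cat_column"]

# str.join only accepts strings, so joining integer labels fails
# with the same message as in the traceback.
try:
    ":".join((columns[0], columns[1]))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Assumed workaround: make every column label a string first.
# With pandas this would be: df.columns = df.columns.astype(str)
string_columns = [str(name) for name in columns]

# Same pairing scheme as shown in the traceback
# (itertools.combinations_with_replacement over the column labels).
labels = [
    ":".join(pair)
    for pair in itertools.combinations_with_replacement(string_columns, 2)
]

print(error_message)
print(labels)
```

If this reading is right, renaming the columns to strings before calling `df.phik_matrix()` should avoid the error in the phik version shown in the traceback; whether other versions behave the same way is not something the traceback can confirm.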

**How to calculate Pearson, Spearman and Phik correlation between variables using Python**

### A comparison on the use of different methods that are commonly used to calculate correlation and why we should consider measuring all 3 of them in our data analyses.


Jan 16, 2022

Pairwise comparisons between data variables are commonly used by data analysts. The association between variables is typically measured by Pearson’s correlation coefficient, which measures the strength of the linear relationship between two variables. While useful in most cases, Pearson’s correlation can fail to capture non-linear relationships, and it can be distorted in datasets that have many outliers.
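To make the limitation concrete, here is a minimal sketch that computes Pearson’s coefficient from its definition, using only the standard library. The function name `pearson` and the toy data are illustrative assumptions. A perfect linear relationship scores 1, while a perfect but symmetric quadratic relationship scores 0.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, all without the common 1/n factor,
    # which cancels in the ratio below.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
parabola = [x ** 2 for x in xs]    # y depends on x, but not linearly

r_linear = pearson(xs, linear)
r_parabola = pearson(xs, parabola)

print(r_linear, r_parabola)
```

In the second case `y` is fully determined by `x`, yet the coefficient vanishes because the positive and negative halves of the parabola cancel.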

To better capture non-linear (monotonic) relationships and reduce the effect of outliers, we can use the Spearman correlation, which pandas can compute with `DataFrame.corr(method='spearman')` and which Seaborn can then plot. Instead of calculating correlations from the raw numerical values, the Spearman method calculates correlations based on rank, where points are ranked in ascending order. The correlation value is high when both variables increase or decrease together.
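The rank-based idea can be sketched as follows: replace each value by its rank (ties share the average rank), then apply the ordinary Pearson formula to the ranks. This is a minimal standard-library sketch with hypothetical helper names, not the implementation used by any particular package.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values in ascending order (1-based); ties get their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of equal values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]  # monotonic, but strongly non-linear

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)

print(r_pearson, r_spearman)
```

Because the exponential curve is strictly increasing, the ranks of `xs` and `ys` agree exactly and the Spearman value is 1, while the Pearson value on the raw numbers stays below 1.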

Finally, we have the Phik (φK) correlation, which was recently developed and has been demonstrated to capture non-linear dependencies efficiently. The Phik correlation is obtained by inverting the chi-square contingency test statistic, thereby allowing users to also analyse correlations between numerical, categorical, interval and ordinal variables. For more details on the mathematics involved, readers can read the research article here. From the figure below, you can also quickly appreciate that the Phik correlation can detect correlations that would otherwise be missed if you had only analysed the data using Pearson’s correlation.
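The starting point for Phik is the chi-square statistic of a contingency table. The sketch below computes only that statistic for a table of observed counts, using the standard library; the inversion step that turns the statistic into a Phik value is described in the research article and is not reproduced here. The function name and the toy tables are illustrative assumptions, and the code assumes every row and column total is non-zero.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic of a 2D table of observed counts.

    Expected counts under independence are
    (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows and columns are exactly proportional: no sign of dependency.
independent = [[10, 20],
               [30, 60]]

# Counts concentrate on the diagonal: a clear dependency.
dependent = [[40, 10],
             [10, 40]]

chi2_independent = chi_square_statistic(independent)
chi2_dependent = chi_square_statistic(dependent)

print(chi2_independent, chi2_dependent)
```

Because the statistic only needs counts per cell, the same calculation applies to categorical variables and to binned numerical variables, which is what lets a chi-square based coefficient handle mixed variable types.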

Fortunately, these correlations can be easily calculated using the Python programming language. First, we import the required packages (you can use the pip install command if you are missing any of these packages):

    import numpy as np
    import pandas as pd
    import phik
    import seaborn as sns

    from phik import…

## Statistics > Methodology

**arXiv:1811.11440** (stat)

[Submitted on 28 Nov 2018 (v1), last revised 9 Mar 2019 (this version, v2)]

## Title: A new correlation coefficient between categorical, ordinal and interval variables with Pearson characteristics

Authors: M. Baak and 3 other authors

Abstract: A prescription is presented for a new and practical correlation coefficient, $\phi_K$, based on several refinements to Pearson’s hypothesis test of independence of two variables. The combined features of $\phi_K$ form an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables. Second, it captures non-linear dependency. Third, it reverts to the Pearson correlation coefficient in case of a bi-variate normal input distribution. These are useful features when studying the correlation between variables with mixed types. Particular emphasis is paid to the proper evaluation of statistical significance of correlations and to the interpretation of variable relationships in a contingency table, in particular in case of low statistics samples and significant dependencies. Three practical applications are discussed. The presented algorithms are easy to use and available through a public Python library.

Comments: Submitted to Computational Statistics and Data analysis. See for examples and Python code: this https URL

Subjects: Methodology (stat.ME)

Cite as: arXiv:1811.11440 [stat.ME] (or arXiv:1811.11440v2 [stat.ME] for this version)

https://doi.org/10.48550/arXiv.1811.11440

arXiv-issued DOI via DataCite

### Submission history

From: Max Baak

**[v1]** Wed, 28 Nov 2018 08:34:55 UTC (1,814 KB)

**[v2]** Sat, 9 Mar 2019 08:54:42 UTC (1,814 KB)
