Overview of Ambrosia Tester class Spark data support
This example gives a brief overview of the Tester class functionality on Spark DataFrames. It uses synthetic data on the time MTS KION users spent viewing content.
The functionality of the Tester class on Spark data is currently limited: only the two-sample independent t-test can be used. See the main Tester tutorial on pandas data to learn the full functionality.
[2]:
import os
import pandas as pd
import pyspark
from ambrosia.tester import Tester
Build a local Spark session
[3]:
os.environ['SPARK_LOCAL_IP'] = '127.0.0.1'
spark = pyspark.sql.SparkSession.builder.master("local[1]").getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/21 17:40:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/04/21 17:40:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Create Spark DataFrame
[4]:
kion_watch_results_agg = pd.read_csv('../tests/test_data/watch_result_agg.csv')
sdf = spark.createDataFrame(kion_watch_results_agg)
[5]:
sdf.printSchema()
root
|-- id: long (nullable = true)
|-- watched: double (nullable = true)
|-- group: string (nullable = true)
Using t-test for Spark data
The interface of the Tester class is exactly the same as for pandas data.
Let’s create an instance of the class and pass the parameters:
[6]:
spark_tester = Tester(dataframe=sdf,
                      column_groups='group',
                      first_type_errors=0.05,
                      metrics='watched')
Now take a look at the absolute results of the experiment:
[7]:
spark_tester.run(effect_type='absolute', method='theory')
[7]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.05 | 0.000022 | 55.314679 | (26.54, 84.0893) | watched | A | B |
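To make the columns of this result easier to read, here is a minimal sketch of what a two-sample test of the absolute effect computes. It is not Ambrosia's implementation: the function name `absolute_effect_test` and the toy samples are hypothetical, and a normal approximation stands in for the t distribution (reasonable only for large samples) so that the sketch needs nothing beyond the standard library.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def absolute_effect_test(group_a, group_b, alpha=0.05):
    """Test the difference in means (B minus A) of two independent samples.

    Hypothetical helper for illustration only. Uses the unpooled (Welch)
    standard error and a normal approximation for the p-value and interval.
    """
    n_a, n_b = len(group_a), len(group_b)
    effect = mean(group_b) - mean(group_a)

    # Standard error of the difference of two independent sample means.
    std_error = sqrt(variance(group_a) / n_a + variance(group_b) / n_b)

    # Two-sided p-value under the null hypothesis of equal means.
    statistic = effect / std_error
    pvalue = 2 * (1 - NormalDist().cdf(abs(statistic)))

    # Symmetric (1 - alpha) confidence interval around the effect.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    interval = (effect - critical * std_error, effect + critical * std_error)

    return {"effect": effect, "pvalue": pvalue, "confidence_interval": interval}


# Toy samples, not taken from the KION dataset.
toy_a = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_b = [3.0, 4.0, 5.0, 6.0, 7.0]
print(absolute_effect_test(toy_a, toy_b))
```

Under this reading, `effect` is the difference between the group means in the units of the metric, and the experiment is significant at level `first_type_error` when `pvalue` falls below it, which is equivalent to the confidence interval excluding zero.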
Now take a look at the relative effect:
[8]:
spark_tester.run(effect_type='relative', method='theory')
[8]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.05 | 0.00004 | 0.079901 | (0.0419, 0.1183) | watched | A | B |
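A relative effect expresses the change as a fraction of the group A mean. The tutorial does not say how Ambrosia derives the interval for it, so the sketch below assumes one common choice, the delta method for a ratio of independent sample means, again with a normal approximation. The function name `relative_effect_test` and the toy samples are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def relative_effect_test(group_a, group_b, alpha=0.05):
    """Test the relative difference in means, mean(B) / mean(A) - 1.

    Hypothetical helper for illustration only. The variance of the ratio is
    approximated with the delta method, assuming independent samples and a
    group A mean that is well away from zero.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    effect = mean_b / mean_a - 1

    # Variances of the two sample means.
    var_mean_a = variance(group_a) / n_a
    var_mean_b = variance(group_b) / n_b

    # Delta method: Var(B/A) ~ Var(B)/A^2 + B^2 * Var(A)/A^4.
    var_ratio = var_mean_b / mean_a**2 + mean_b**2 * var_mean_a / mean_a**4
    std_error = sqrt(var_ratio)

    # Two-sided p-value under the null hypothesis of no relative change.
    statistic = effect / std_error
    pvalue = 2 * (1 - NormalDist().cdf(abs(statistic)))

    # Symmetric (1 - alpha) confidence interval around the effect.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    interval = (effect - critical * std_error, effect + critical * std_error)

    return {"effect": effect, "pvalue": pvalue, "confidence_interval": interval}


# Toy samples, not taken from the KION dataset.
toy_a = [8.0, 10.0, 12.0]
toy_b = [10.0, 12.0, 14.0]
print(relative_effect_test(toy_a, toy_b))
```

The relative p-value can differ from the absolute one because the uncertainty in the group A mean enters the denominator as well, which is consistent with the two slightly different p-values reported in the tables.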
[9]:
spark.stop()