The basic problem (blurred and disguised a bit here, but kept real in its essentials) seemed simple: the client had run an A/B test on a new sort order for their product listings. At first, it looked like a fairly straightforward analysis problem: an e-commerce site with conversions and a well-tagged A/B test. The client used Coremetrics for Web analytics and had Explore, Core's quite powerful, Discover-like advanced analysis tool.
The first-glance analysis in Coremetrics was disappointing: in terms of performance, the new split was almost indistinguishable from the old one. Running sampled Explore data, it looked as if the new split had increased conversion but reduced the average ticket. It also looked as if the apparent reduction in average ticket might be due to outliers: a few very large orders on the current method. Unfortunately, running unsampled Explore data pretty much destroyed that hypothesis. We were also having a lot of trouble isolating the cases where the product listings were actually re-ordered. Many searches were unaffected by the new sort order, and even for affected searches the user could override the behavior by adding specific criteria. What we wanted to know was whether the change made a difference in the cases where the sort order actually changed, and we were stuck without any way to tell.
I mentioned that the client had a fairly robust tagging infrastructure. They collect the actual sorted product listings served on the page for each search, and that information was available in Coremetrics. Unfortunately, there wasn't any way to use it to find the specific instances where the results returned for a given search term differed between the two splits. Web analytics tools don't typically provide any method of comparing the values of one field across two subsets as a selection criterion.
We could have done this by hand, laboriously selecting individual searches and then aggregating the results, but it would have taken a lot of time to analyze even a small percentage of the search result sets, and it would have risked producing the wrong answer if the tail didn't happen to behave quite like the head.
Fortunately, the client also takes a data feed and provided us with the raw data (search sessions only). It wasn't a huge data set: fewer than 10 million search records during the selected test period, plus a smaller file with the actual orders. I loaded their files into SQL Server and joined searches to orders to create a single table with the following fields:
Date/Time
Session
Brand
Split Test
Country
Feature Specified
Search Field
Products Returned
Order Flag
Order Value
Ordered Product
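The load itself was routine. As a rough sketch of how the combined table came together (the raw table and column names here are illustrative, not the client's actual feed layout), it was essentially a left join of searches to orders on Session:

-- Combine searches and orders into one row per search (illustrative names)
SELECT s.EventDateTime, s.Session, s.Brand, s.SplitTest, s.Country,
       s.FeatureSpecified, s.SearchField, s.ProductsReturned,
       CASE WHEN o.Session IS NULL THEN 0 ELSE 1 END AS OrderFlag,
       o.OrderValue, o.OrderedProduct
INTO SearchAndOrder
FROM RawSearches s
LEFT JOIN RawOrders o ON s.Session = o.Session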
Our first goal was to create a reasonable study group. The best method, we decided, was to isolate the group of visitors with a single search type and a single order. Having multiple types of product searches greatly complicated attribution, and we were still concerned about outliers skewing the analysis, especially since the vast majority of orders were single orders anyway.
To accomplish this, I created a temporary table of search and order counts per Session, grouping on Session and counting the number of rows and the number of orders. Then I built the study-group table by selecting from the initial table only those Sessions with a count of 1 for both searches and orders in the temporary table.
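In rough form, those two steps look something like this (SINGLESEARCHANDORDERALL is the study-group table used in the queries below; the other names are illustrative):

-- Step 1: search and order counts per Session
SELECT Session, COUNT(*) AS SearchCount, SUM(OrderFlag) AS OrderCount
INTO #SessionCounts
FROM SearchAndOrder
GROUP BY Session

-- Step 2: keep only the single-search, single-order Sessions as the study group
SELECT a.*
INTO SINGLESEARCHANDORDERALL
FROM SearchAndOrder a
JOIN #SessionCounts c ON a.Session = c.Session
WHERE c.SearchCount = 1 AND c.OrderCount = 1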
With robust tagging, this study-group selection is a step I might have duplicated in tools like Explore or Discover. But the next step took me beyond what is easily achievable in Web analytics tools. My goal was to isolate all the search cases where the returned results differed between the two splits. To do this, I needed to do some simple massaging of the search term (the application was uneven in its spacing and didn't consistently append the same information). This could have been fixed at the tagging level in a Web analytics solution, but it wasn't. And Web analytics solutions are unforgiving when it comes to this sort of thing: they don't allow for any post-load field manipulation.
Not so SQL, of course.
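A couple of simple string functions handle that kind of fix-up. A minimal sketch, assuming the raw term sits in a SEARCHFIELD column (an illustrative name) and the cleaned version goes into the TRIMSEARCH column used in the queries below; the real rules depended on exactly what the application appended:

-- Normalize the raw search term (illustrative; the actual rules were application-specific)
UPDATE SINGLESEARCHANDORDERALL
SET TRIMSEARCH = UPPER(REPLACE(SEARCHFIELD, ' ', ''))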
Stripping spaces and fixing up the field along those lines was a cinch, and it made the search term significantly more consistent in my study-group table. But I needed to go beyond that simple clean-up. To isolate the searches where the results were actually affected by the split test, I created a new temporary table, grouped by the cleaned-up search term, the split code, and the results returned, with a count of the number of rows (actual searches) for each combination. I then selected the top row for each search term/split combination:
INSERT INTO SPLITATopSearchCountsWithOrder (SEARCH, BRAND, SEARCHRESULTS, SEARCHCOUNT, ORDERCOUNT)
SELECT SEARCH, BRAND, SEARCHRESULTS, SEARCHCOUNT, ORDERCOUNT
FROM (
    SELECT SEARCH, BRAND, SEARCHRESULTS, SEARCHCOUNT, ORDERCOUNT,
           ROW_NUMBER() OVER (PARTITION BY BRAND, SEARCH ORDER BY ORDERCOUNT DESC, SEARCHCOUNT DESC) AS rn
    FROM SPLITASearchCountsWithOrder
) t
WHERE rn = 1
ORDER BY SEARCH, BRAND
This is the one place where I had to use some slightly tricky code: the "row_number() over" construct is relatively new in SQL. It was introduced in SQL Server 2005, and it allows you to select the top N rows from each "Group By." I'm by no means a SQL guru and had to look this one up to get the syntax right, but it's the only piece of even slightly esoteric code I used in the whole process.
Finally, I joined the A-split and B-split tables to each other to identify the rows where the top search results for the two splits actually differed.
INSERT INTO SIGKEYSEARCHES (BRAND, TRIMSEARCH, ASEARCHRESULTS, BSEARCHRESULTS, SearchCountA, SearchCountB)
SELECT a.BRAND, a.SEARCH, a.SEARCHRESULTS, b.SEARCHRESULTS, a.SearchCount, b.SearchCount
FROM SPLITATopSearchCounts a, SPLITBTopSearchCounts b
WHERE a.BRAND = b.BRAND AND a.SEARCH = b.SEARCH AND a.SEARCHRESULTS <> b.SEARCHRESULTS
You just can’t do this kind of thing in a Web analytics tool.
I wrote this out as a table of key searches and then used it to study performance in cases where the Split Test made an actual difference in what the user saw.
SELECT SPLIT, SCOUNT AS Searches, BCOUNT AS Orders,
       (CAST(BCOUNT AS FLOAT) / CAST(SCOUNT AS FLOAT)) AS OrdersPerSearch,
       BREV / BCOUNT AS RevenuePerOrder,
       BREV / SCOUNT AS RevenuePerSearch
FROM (
    SELECT SPLIT, COUNT(*) AS SCOUNT, SUM(REV) AS BREV,
           SUM(CASE WHEN REV > 0 THEN 1 ELSE 0 END) AS BCOUNT
    FROM SINGLESEARCHANDORDERALL x, SIGKEYSEARCHES y
    WHERE x.BRAND = y.BRAND AND x.TRIMSEARCH = y.TRIMSEARCH
      AND x.TRIMSEARCH <> ''
    GROUP BY SPLIT
) A
This gave me Orders, Order Rates, Revenue per Order, and Revenue per Search for each split in every case where the two split methods had actually differed in the product lists they showed the visitor.
When I did this, the new method showed a modest but significant incremental lift. With this greatly refined sample, we were also able to see Brand and Selection patterns much more clearly, patterns that had been submerged in the noisier data set. Not only could we identify the lift, we could identify how often the change actually made a difference in what the user saw. For a split test in a dynamic system like search, that is a non-trivial and surprisingly difficult piece of information to come by.

I was also able, with SQL, to drive the analysis deeper. I counted, by split code, the number of times the purchased product did (and didn't) appear on the first page of the search results. To do this, I simply selected on whether the ordered product was contained in the search results field (which held the first 10 items). Again, Web analytics tools just don't let you do selections that compare the values of two different fields on a row-by-row, visit, or visitor basis. It turned out that the new method performed significantly better than the old one; the improvement here was larger than the measured order lift. Orders were much more likely to come from the top search results in the new method.
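The comparison itself is a single row-level check. A sketch, with illustrative column names, of counting how often the ordered product did and didn't show up in the first-page results for each split:

-- Was the ordered product among the first 10 results shown? (illustrative column names)
SELECT SPLIT,
       SUM(CASE WHEN CHARINDEX(ORDEREDPRODUCT, SEARCHRESULTS) > 0 THEN 1 ELSE 0 END) AS OnFirstPage,
       SUM(CASE WHEN CHARINDEX(ORDEREDPRODUCT, SEARCHRESULTS) = 0 THEN 1 ELSE 0 END) AS NotOnFirstPage
FROM SINGLESEARCHANDORDERALL
WHERE ORDEREDPRODUCT IS NOT NULL AND ORDEREDPRODUCT <> ''
GROUP BY SPLIT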
That gap suggested that while visitors could often find what they wanted in the old method by repeatedly paging through results, they had to do less work with the new method. An improvement in the GUI that eliminates friction will almost never produce an equivalent improvement in conversion, an important fact for any analyst to keep in mind. This analysis also surfaced some data quality problems in specific markets where the tagging wasn't working correctly. If you've done a deep-dive analysis and haven't found a data quality problem somewhere, you've probably missed something important!
Having the data in SQL also made it possible to see where placement differences within the search results mattered. By parsing the search results into individual position fields (1 through 10) and then matching orders, by split, to the positions that changed, I could measure the lift from changing each position in the results.
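One way to get at position is to compute where the ordered product fell in the delimited results list. This is only a sketch; it assumes a semicolon-delimited results field and illustrative column names, and the actual analysis split all ten positions into separate fields and compared them across the splits:

-- Position of the ordered product within the first-page results (1 = top slot)
SELECT SPLIT, OrderPosition, COUNT(*) AS Orders, SUM(REV) AS Revenue
FROM (
    SELECT SPLIT, REV,
           LEN(LEFT(SEARCHRESULTS, CHARINDEX(ORDEREDPRODUCT, SEARCHRESULTS)))
             - LEN(REPLACE(LEFT(SEARCHRESULTS, CHARINDEX(ORDEREDPRODUCT, SEARCHRESULTS)), ';', '')) + 1 AS OrderPosition
    FROM SINGLESEARCHANDORDERALL
    WHERE ORDEREDPRODUCT IS NOT NULL AND CHARINDEX(ORDEREDPRODUCT, SEARCHRESULTS) > 0
) p
GROUP BY SPLIT, OrderPosition
ORDER BY SPLIT, OrderPosition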
You may think this is overkill for measuring the current test, and in a way it is. Knowing the impact by position isn't going to change the decision about which method performs better; we already knew that. But this type of analysis makes it possible to understand the likely impact of future changes to search, depending on how much, and where, those changes would alter the search results.
The improved control group and analysis by position also gave us a better way to do additional segmentations. After I had the key population selected, I ran a set of queries measuring performance by Brand, Country, and Features. There were significant differences in lift, particularly in some of the international markets (each market returns distinct result sets). I might have been able to use Explore or Discover to find some of those performance differences, but I’d never have been able to identify the root causes.
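Mechanically, those segmentations were just GROUP BY variations on the earlier query. For example, by Brand (a sketch, again with illustrative column names):

-- Performance by split and brand over the key searches
SELECT x.SPLIT, x.BRAND, COUNT(*) AS Searches,
       SUM(CASE WHEN x.REV > 0 THEN 1 ELSE 0 END) AS Orders,
       SUM(x.REV) AS Revenue
FROM SINGLESEARCHANDORDERALL x
JOIN SIGKEYSEARCHES y ON x.BRAND = y.BRAND AND x.TRIMSEARCH = y.TRIMSEARCH
WHERE x.TRIMSEARCH <> ''
GROUP BY x.SPLIT, x.BRAND
ORDER BY x.BRAND, x.SPLIT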
SQL is not, of course, a data analysis tool. It's a data manipulation tool. None of the measurements in this analysis involved more than simple counts, sums, and ratios. But as any analyst knows (or soon finds out), a big part of most analysis projects is organizing the data the right way. Tasks like selecting records based on frequency counts (single search types), cleaning fields (stripping spaces and fixing up the search term), creating intermediate sums and counts (counts of top search results by search term and split) that can be used in further selections, doing intra- and inter-field comparisons (did the top search results for a given search term differ by split?), and reorganizing the data within a record or field (splitting the search results positionally and analyzing them) are bread-and-butter for SQL but hard-crust in traditional Web analytics tools.
The entire SQL Server project took place over 4-5 days, required no new tagging, and produced a set of answers that went beyond the basic research question.
For most of us, the knowledge that the data necessary to answer an important question exists in the solution but can’t be used in the right way is a frustrating commonplace. SQL may lack important analytic capabilities, but it provides extraordinary flexibility in getting at the data the way you need to and organizing it however seems necessary or convenient. That’s one of the main reasons a database system can deliver answers that a dedicated analytics platform sometimes can’t.