Increasingly, the first point of
contact a company has with its customers is at its Web site,
where a staggering amount of consumer data can be aggregated for
analysis and mining. The Web provides companies with an
unprecedented opportunity to analyze customer behavior and
preferences. Every visit to a Web site generates important
consumer behavioral data, regardless of whether or not a sale is
made. Every visitor action is a digital gesture exhibiting
habits, preferences, and tendencies. These interactions reveal
important trends and patterns that can help a company design a
Web site that effectively communicates and markets its products
and services. Companies can aggregate, enhance, and mine Web data
to learn what sells, what works and what doesn't, and who is or
isn't buying.
Figure 1: Web data utilization in large U.S. corporations
Web Data Applications
- Marketing: 18%
- Customer Service: 16%
- Don't Use Web Data: 72%
However, according to a recent survey by Forrester Research, few
companies are listening: Of 50 of the largest U.S. corporations,
only 18 percent are using their Web data (see Figure 1, above).
Why are so few companies taking advantage of this resource? There
seem to be two reasons:
- In the frenzy to become the next Amazon.com, companies of all
sizes and types are scrambling to set up e-commerce sites. They
often concentrate on the mechanics of transactional processing,
setting up inventory and shopping carts, but fail to plan to use
the vast amount of customer data their sites will generate. Most
companies fail to see that e-commerce success will depend on how
this Web data is leveraged to convert visitors into customers and
customers into loyal clients. The Web data generated with a
single sale is of more value than the sale itself, because it can
lead to a long and profitable relationship with that customer.
The goal of marketers today is not to capture market share but to
capture a share of a customer over a long period of time. The Web
provides an ideal marketplace for doing this.
- The process of mining Web data is complicated because of the
diversity of the data collected. A single visit to a site can be
captured not only on server log files but also in cookies with ad
networks or databases created by CGI scripts from registration
and purchase forms. One of the challenges to mining Web data is
organizing it into a cohesive view of visitors and customers.
Most of today's log analyzers and ad networks report on TCP/IP
activity and not consumer demographics, lifestyle, values,
behavior, and attributes. They are limited to reporting the
activity of browsers, not individuals.
GATHERING WEB DATA
Let's take a look at how to collect data on visitors to your
e-business. The main sources for Web site data are log files,
cookies, and forms:
Log Files.
Server log files provide domain types, time
of access, keywords, and search engines used by visitors. Figure 2 (below)
illustrates the amount of information gathered in a log file. The
referer section of a log file provides valuable information about
where visitors are coming from. It can tell you what your
visitors were looking for when they came to your site by
identifying the keywords they used in their search (assuming they
found you through a search engine) and what search engine or
banner ad they were referred from.
Figure 2: Information included in a typical log file.
Anatomy of a Log File
- Internet provider IP address: This can be either webminer.com or 204.58.155.58
- Identification field: This usually appears as a dash, "-"
- AuthUser: This is an ID or password for accessing a protected area
- Date, time, and GMT (Greenwich Mean Time): Thu July 17 12:38:09 1999
- Transaction: Usually "GET" filename such as /index.html/products.htm
- Status or error code of transaction: Usually 200 (success)
- Size in bytes of transaction (file size): 3234
Additional Fields in the Extended Log Format
- Referer: search engine and keyword used to find your Web site, such as
http://search.yahoo.com/bin/search?p=data+mining -> /index.html
- Agent: browser used by your visitor, such as Mozilla/2.0 (Win95; I)
- Cookie: .snap.com TRUE / FALSE 946684799 u_vid_0_0 00ed7085
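The fields listed in Figure 2 can be pulled apart programmatically. What follows is a minimal sketch in Python, assuming the server writes the widely used "combined" layout (host, ident, authuser, bracketed timestamp, quoted request, status, size, then quoted referer and agent). The sample line, the function names parse_log_line and search_keywords, and the query parameter names "p" and "q" are illustrative assumptions, not something a particular server or search engine guarantees.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumed layout: the common "combined" log format. Adjust the pattern
# if your server is configured to write its fields differently.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Split one log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


def search_keywords(referer, param_names=("p", "q")):
    """Return the search terms found in a referer URL's query string, if any."""
    if not referer:
        return []
    query = parse_qs(urlparse(referer).query)
    for name in param_names:  # hypothetical list of keyword parameters
        if name in query:
            return query[name][0].split()
    return []


# Hypothetical sample line built from the values shown in Figure 2.
sample_line = (
    '204.58.155.58 - - [17/Jul/1999:12:38:09 +0000] '
    '"GET /index.html HTTP/1.0" 200 3234 '
    '"http://search.yahoo.com/bin/search?p=data+mining" '
    '"Mozilla/2.0 (Win95; I)"'
)
record = parse_log_line(sample_line)
keywords = search_keywords(record["referer"])
```

Each parsed record is an ordinary dictionary, so the referer keywords can be stored alongside the host, request, and browser fields for later analysis.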
Cookies.
Cookies dispensed from the server can track
browser visits and pages viewed and can provide some insights
into how often a visitor has been to your site and what sections
they wander into. Cookies are small pieces of data that servers
pass to a browser in special HTTP headers. They reside in small
text files on the visitor's hard disk. You can find the cookie value in the last field of the
extended log format file. A retail Web site can issue cookies
to:
- First-time visitors to introduce products and services
- Returning visitors to acknowledge their preferences
- All visitors at the point of registration in order to
associate a cookie with a customer's personal information from
online forms.
Cookies are standard components for tracking customer activity
in most e-commerce sites. They are used as counters and unique
identification values that tell retailers who is a first-time
visitor and where returning visitors have been within a site.
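To make the first-time versus returning distinction concrete, here is a rough sketch using Python's standard http.cookies module. The cookie name visitor_id, the one-year lifetime, and the helper identify_visitor are assumptions chosen for illustration; a real site would wire this into whatever server framework it uses.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"           # hypothetical cookie name
ONE_YEAR_SECONDS = 60 * 60 * 24 * 365  # assumed cookie lifetime


def identify_visitor(cookie_header):
    """Return (visitor_id, is_first_visit, set_cookie_value_or_None).

    cookie_header is the raw Cookie header sent by the browser, or None.
    """
    jar = SimpleCookie()
    if cookie_header:
        jar.load(cookie_header)
    if COOKIE_NAME in jar:
        # Returning visitor: reuse the ID the browser sent back.
        return jar[COOKIE_NAME].value, False, None

    # First-time visitor: mint a new ID and build a Set-Cookie value.
    visitor_id = uuid.uuid4().hex
    outgoing = SimpleCookie()
    outgoing[COOKIE_NAME] = visitor_id
    outgoing[COOKIE_NAME]["path"] = "/"
    outgoing[COOKIE_NAME]["max-age"] = ONE_YEAR_SECONDS
    return visitor_id, True, outgoing[COOKIE_NAME].OutputString()


# A first request carries no cookie; the second sends the ID back.
new_id, first_visit, set_cookie = identify_visitor(None)
same_id, repeat_first_visit, _ = identify_visitor(f"{COOKIE_NAME}={new_id}")
```

The returned ID is the value that would show up in the cookie field of the extended log, which is what lets later visits be tied back to the same browser.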
Forms.
By far the most effective method of gathering Web site visitor
and customer information is via registration and purchase forms
(see Figure 3, below). Forms can provide important personal
information about visitors, such as gender, age, and ZIP code.
Form submissions can launch a CGI program that returns a response
to the Web site visitor. Forms are simple browser-to-server
mechanisms that can lead to a complex array of customer
interaction from which relationships can evolve. These customer
relationships can evolve into direct feedback systems through
which consumers can communicate with a retailer and servers can
continue to gather information from browsers.
Figure 3: A Web registration form for collecting visitor information.
Using CGI forms, you can create either relational tables or
comma-delimited flat files recording the entries from your forms.
These customer-provided information files can be analyzed
directly or imported into a relational database such as DB2. It's
a good idea to import the files into a relational database as
your file volume grows. The database engine not only makes data
management easier, but it also handles such issues as integrity,
security, backup, and restoration. Having the data in a
relational database environment also gives you access to
enterprise-strength analysis tools, such as IBM's Intelligent
Miner, which can turn your Web data into valuable business
insights (I'll say more on this later).
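The flat-file-then-database path described above might look something like the following sketch. It uses Python's csv module for the comma-delimited file and the built-in sqlite3 module purely as a stand-in for a relational database such as DB2; the field list, the registrations table, and both function names are assumptions made for the example, and the INSERT OR REPLACE statement is SQLite-specific.

```python
import csv
import io
import sqlite3

# Hypothetical form fields; a real registration form defines its own.
FIELDS = ["visitor_id", "gender", "age", "zip_code"]


def append_to_flat_file(handle, form):
    """Append one form submission as a comma-delimited row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writerow({name: form.get(name, "") for name in FIELDS})


def load_into_database(conn, handle):
    """Import the flat file into a relational table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS registrations ("
        "visitor_id TEXT PRIMARY KEY, gender TEXT, age INTEGER, zip_code TEXT)"
    )
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    rows = [
        (
            row["visitor_id"],
            row["gender"],
            int(row["age"]) if row["age"] else None,  # blank age becomes NULL
            row["zip_code"],                          # kept as text to preserve leading zeros
        )
        for row in reader
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO registrations VALUES (?, ?, ?, ?)", rows
        )
    return len(rows)


# In-memory stand-ins for the flat file and the database.
flat_file = io.StringIO()
append_to_flat_file(flat_file, {"visitor_id": "v1", "gender": "F", "age": "34", "zip_code": "02139"})
append_to_flat_file(flat_file, {"visitor_id": "v2", "zip_code": "94105"})
flat_file.seek(0)
connection = sqlite3.connect(":memory:")
imported = load_into_database(connection, flat_file)
```

Storing the ZIP code as text rather than a number is a deliberate choice here, since codes with leading zeros would otherwise be altered on import.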
As a Web site retailer, you want to place menus, links, and
contests on your home page in order to capture visitor
preferences via forms and cookies. The more you interact with
your customers, the more information you should be collecting
about their needs, values, choices, and preferences. Take care,
however, to ask for only the most essential information. No one
likes lengthy and intrusive questionnaires. Keep in mind that
there are methods and sources for gathering demographic
information that don't involve asking for it directly. A ZIP code
captured from a contest registration form can provide some
demographic data, while a physical address culled from a purchase
form in your store can provide valuable household information for
subsequent data mining.
Your home page should quickly solicit information about your
visitors' needs and offer information about your various products
and services. By taking the time to consider the overall design
of your site, such as what prompts and links you position on your
home page, you can direct the movements of your visitors. In
addition, a quick and short registration at the onset of a visit,
inquiry, or purchase can capture important personal information
that you can later enhance and mine. Focus on interacting with
your customers to learn what their needs are so you can service
them better over time and retain them.
One key to compiling and capturing this shopper information is
a unique identifier: a visitor ID number. A proven strategy for
collecting key visitor data is to entice new visitors to register
at your site with a special service or incentive. Offer access to
a special section of your site or have contests and door prizes.
The point is that you need them to register in order to set a
cookie, which can be used as the unique ID number. From that
point, the unique key can enable you to track every interaction
with that visitor. This unique key will allow your site to link
its log files and forms database with your company's data
warehouse and with third-party demographic and household
information, ad server networks, or collaborative filtering
engines.
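As a rough illustration of what the unique key buys you, the sketch below joins clickstream records and registration records on a shared visitor ID. The record layouts, the field names visitor_id and path, and the function build_visitor_profiles are hypothetical; they stand in for whatever your log parser and forms database actually produce.

```python
from collections import defaultdict


def build_visitor_profiles(log_records, registrations):
    """Combine page views and registration data under one key per visitor."""
    profiles = defaultdict(lambda: {"pages": [], "registration": None})
    for record in log_records:
        profiles[record["visitor_id"]]["pages"].append(record["path"])
    for form in registrations:
        profiles[form["visitor_id"]]["registration"] = form
    return dict(profiles)


# Hypothetical inputs: parsed log lines (with the cookie value as the ID)
# and rows from a registration form database.
log_records = [
    {"visitor_id": "v1", "path": "/index.html"},
    {"visitor_id": "v1", "path": "/products.htm"},
    {"visitor_id": "v2", "path": "/index.html"},
]
registrations = [
    {"visitor_id": "v1", "zip_code": "02139", "gender": "F"},
]
profiles = build_visitor_profiles(log_records, registrations)
```

Visitors who have browsed but never registered still get a profile, with the registration slot left empty until a form arrives.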
ENHANCING YOUR WEB DATA
Of the three data sources I've mentioned - log files, cookies,
and forms - forms provide the most important customer view
because they contain information that can be used to append
additional data, such as records from a data warehouse or a
third-party provider.
The kinds of additional data you may want to append include
such demographic and household data as a visitor's probable
income, the type of car they drive, and the number of children
they have. By linking this external information to your Web-site
database, you gain additional insight into the identity,
attributes, lifestyle, and behavior of your visitors and
customers. For example, a ZIP code allows you to provide visitors
with local news, coupons, services, and weather while enabling
you to discover the demographics of your visitors (see Figure 4, below).
Median income, age, presence of children, type of automobile, age
of home, and other factors are available when a physical address
is known. Various data providers make this information available,
and some are beginning to provide their information via the Web.
Experian and Acxiom can today match and append the consumer
information you capture in your registration or purchase forms in
real time. Other vendors of this type of demographic information
include CACI and Polk. There is an entire industry devoted to
segmenting, classifying, and reselling consumer behavior
information.
Figure 4: ZIP code collection as a way to gather user demographics.
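Appending external data by ZIP code amounts to a lookup keyed on the value captured in the form. The sketch below assumes the third-party data has already been delivered as a table keyed by ZIP code; the table contents are placeholder strings, not real demographic figures, and the names ZIP_DEMOGRAPHICS and append_demographics are invented for the example.

```python
# Placeholder lookup table. The values are dummy labels, NOT real
# demographic data; a data provider would supply the actual figures.
ZIP_DEMOGRAPHICS = {
    "02139": {"income_band": "placeholder-band-A", "household_type": "placeholder-type-1"},
    "94105": {"income_band": "placeholder-band-B", "household_type": "placeholder-type-2"},
}


def append_demographics(visitors, zip_table):
    """Return copies of visitor records enriched with ZIP-level attributes."""
    enriched = []
    for visitor in visitors:
        extra = zip_table.get(visitor.get("zip_code"), {})
        # Visitor-supplied fields win if a key appears in both sources.
        enriched.append({**extra, **visitor})
    return enriched


visitors = [
    {"visitor_id": "v1", "zip_code": "02139"},
    {"visitor_id": "v3", "zip_code": "00000"},  # no match in the table
]
enriched_visitors = append_demographics(visitors, ZIP_DEMOGRAPHICS)
```

Records whose ZIP code has no match pass through unchanged, so the enrichment step never drops a visitor.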
In addition, new providers of webographics - details on browsing
activity, such as length of visits, number of return visits, and
preferences exhibited by clickthroughs on banner ads - have
recently emerged, selling software or services, and sometimes
both, for collaborative filtering, relational marketing, and
visitor profiling. These new data providers - including
Andromedia, DoubleClick, Engage Technologies, Firefly, Manna, Net
Perceptions, and Personify -
represent a whole new genre of Web companies seeking to capture
and generate information about Internet users' behavior and
preferences. They use a myriad of solutions to track and profile
visitors - everything from proprietary software and databases to
commingling cookies via server networks.
Collaborative filtering software such as Andromedia's
LikeMinds uses individual purchase history or preferences to find
people with similar tastes and make suggestions to shoppers.
LikeMinds can help Web sites make personal recommendations and
offer direct marketing based on visitors' past behavior. Its
Preference Server delivers personalized recommendations based on
preferences either explicitly stated by the visitor or customer
via forms or implicitly determined by sales records,
clickthroughs, or other interactions within the site.
Collaborative filtering networks like Firefly provide the same
matching functionality over multiple Web sites.
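The general idea behind collaborative filtering can be shown with a small, generic sketch. This is a textbook user-to-user approach based on cosine similarity, not the method used by LikeMinds, Firefly, or any other product named here; the ratings data, item names, and function names are all made up for illustration.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Similarity of two {item: rating} dicts; 0.0 if they share no items."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, ratings, top_n=3):
    """Suggest items the target has not rated, weighted by similar shoppers."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine_similarity(ratings[target], other_ratings)
        if similarity <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical preference data (explicit ratings or implied by purchases).
ratings = {
    "ann": {"python_book": 5, "sql_book": 4},
    "bob": {"python_book": 5, "sql_book": 5, "mining_book": 4},
    "cat": {"cookbook": 5, "garden_guide": 4},
}
suggestions = recommend("ann", ratings)
```

Because "cat" shares no items with "ann", her tastes contribute nothing to the suggestions, which is the "people with similar tastes" filter at work.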
Ad networks such as DoubleClick and Flycast also capture and
store webographics. The DoubleClick system tracks user movements
among more than 170 sites that commingle their cookies in order
to serve the appropriate ad to each visitor. DoubleClick targets
ads based on a user's interests as expressed via their selections
in the member Web sites in the ad network. DoubleClick recently
purchased Abacus Direct, which manages a database of more than 80
million households and 1,100 consumer mail catalogs. The mix of
online webographics and offline demographics will give
DoubleClick an enhanced view of consumer behavior.
Webographics are also being captured in proprietary databases
from such companies as Engage Technologies. Engage provides
member clients with access to its proprietary database of 30
million anonymous behavior-based consumer profiles. Engage tracks
the interests and preferences of Web site visitors without
tracking their identity. Profiles are based on the content
viewed, the time spent viewing, and the frequency of visits.
Profiles include identification number, interest category code,
and interest score, but no identity.
Another company, Aptex Software, uses both proprietary
real-time content analysis techniques and neural networks to
predict Web user behavior. Its two main products are SelectCast
and SelectResponse. The Aptex profiling technology doesn't store
personal customer information; instead, it uses a neural network
to profile users based exclusively on their real-time actions and
observed user behavior.
Other webographic players include MatchLogic, which collects
profiles from interactive sites that track where users go after
viewing online ads; Net Perceptions, which offers real-time ad
targeting via the use of neural networks, fuzzy logic, and
genetic algorithms; Personify, which tracks clickstreams,
registration, and transaction data for segmentation and
anonymous profiles; and Primary Knowledge, which also collects
clickstream information from large consumer sites to identify the
paths buyers navigate to goods and sells these vital statistics
to online retailers.
All this internal and external demographic and webographic
information can be written to a relational table or a flat file,
which can then be linked or imported into a data mining tool.
Such tools include automated tools, which have principally been
used in data warehouses to extract patterns, trends, and
relationships, and new-generation data mining tools with
graphical interfaces that are designed for business and marketing
personnel. These data mining analyses can provide actionable
solutions in many formats, which can be shared with those
individuals responsible for the design, maintenance, and
marketing of e-commerce and content-providing Web sites.
MINING DYNAMIC DATABASES
Most analysis of Web data until now has involved log traffic
reports, which mainly provide cumulative accounts of server
activity but do not provide any true business insight about
customer demographics and online behavior. Most current traffic
analysis software, including WebManage Technologies'
NetIntellect, Marketwave's HitList, Sane Solutions' NetTracker,
Netrics.com's Surf Report, and WebTrends Corp.'s WebTrends, offers
predefined reports about server activity based on the analysis of
log files. One of the best log analyzers is Marketwave's
HitList, which uses cookies as part of its report and allows log
files to be compressed and prepared for Web mining. These tools,
however, deal exclusively with domain names, IP addresses,
cookies, browsers, and other TCP/IP-specific machine-to-machine
activity.
On the other hand, mining Web data for an e-commerce site
yields insight into visitor behavior and profiles, rather than
server statistics. Your e-commerce site needs to know about the
preferences and lifestyles of its visitors. Data mining in this
context enables you to address such business questions as, "Who
is buying what items and at what rates?"
You should also know what is selling so you can adjust your
inventory and plan your orders and shipping. You need to know how
to sell: which incentives, offers, and ads work, and how you
should design your site to optimize your profits. Data mining
algorithms can search for relationships in Web data to determine
if patterns exist that can yield actionable business and
marketing intelligence. Data mining solutions come in many types,
such as association, segmentation, clustering, classification
(prediction), and visualization:
Figure 5: Data association
Figure 6: Data segmentation
Figure 7: Data clustering
Figure 8: Data prediction
Figure 9: Data visualization
Using a data mining tool that incorporates these algorithms,
you can segment a Web site database into unique groups of
visitors, each with specific behavioral characteristics. These
tools perform statistical tests on the data and partition it into
multiple market segments independent of the analyst or marketer
and can identify key intervals and ranges in the data that
distinguish good prospects from bad ones.
If you're in a DB2 environment and using Intelligent Miner as
your data mining tool, you have access to all of these processes.
Intelligent Miner performs clustering, classification, and
prediction - a form of classification into the future. For
prediction, Intelligent Miner uses either a tree induction
algorithm or a neural network to predict a field, such as the
number of purchases a customer is likely to make. Using a
self-organizing map, also known as a Kohonen neural network,
Intelligent Miner can be used to segment a population of similar
customer accounts. In addition to conducting association analysis
to identify items frequently sold in the same transactions,
Intelligent Miner can also perform a more powerful sequential
pattern analysis to match different transactions from the same
customer over time.
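Association analysis, in its simplest form, counts how often items appear together in the same transaction. The sketch below is a generic pairwise version written for illustration; it is not Intelligent Miner's implementation, and the basket contents, the min_support threshold, and the function name pair_rules are assumptions. Support is the share of all transactions containing both items, and confidence is the share of transactions containing the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.2):
    """Return (if_item, then_item, support, confidence) for frequent pairs."""
    total = len(transactions)
    if total == 0:
        return []
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (first, second), count in pair_counts.items():
        support = count / total
        if support < min_support:
            continue
        # Emit the rule in both directions; confidence differs by direction.
        rules.append((first, second, support, count / item_counts[first]))
        rules.append((second, first, support, count / item_counts[second]))
    return rules


# Hypothetical purchase transactions.
transactions = [
    ["book", "cd"],
    ["book", "cd", "dvd"],
    ["book"],
    ["dvd"],
]
rules = pair_rules(transactions, min_support=0.5)
```

In this toy data every basket with a CD also holds a book, but not the reverse, which is why the two directions of the same pair can carry very different confidence.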
Most data mining tools incorporate versions of such algorithms
as CART (classification and regression trees), CHAID (chi-squared
automatic interaction detection), and ID3 (Iterative
Dichotomiser 3), or its successors C4.5 and C5.0. They segment a
database into statistically significant clusters based on a
desired output. They generate decision trees, which provide a
graphical breakdown of a data set in the form of a map of
significant clusters. These tools produce rules that can point
out important ranges and characteristics. For example, this rule
might point out a higher-than-average propensity to make an
online purchase when a particular category exists in combination
with a certain number of visits:
IF Last Sale Category is "Computer Book"
and Number of Visits is 8.00 (average = 5.94)
THEN Number of Total Sales is more than 3.76
Rule's probability: 0.879
The rule exists in 2900 records.
Significance Level: Error probability < 0.01
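The two headline numbers attached to a rule like this one can be recomputed from the underlying records. The sketch below assumes "The rule exists in N records" means the number of records matching the IF part, and "probability" means the fraction of those records for which the THEN part also holds; that reading, the field names, and the sample records are assumptions made for illustration rather than a description of how any particular tool reports its rules.

```python
def evaluate_rule(records, condition, conclusion):
    """Return (coverage, probability) for an IF condition THEN conclusion rule."""
    matching = [record for record in records if condition(record)]
    if not matching:
        return 0, 0.0
    hits = sum(1 for record in matching if conclusion(record))
    return len(matching), hits / len(matching)


# Hypothetical visitor records.
records = [
    {"last_sale_category": "Computer Book", "visits": 8, "total_sales": 5},
    {"last_sale_category": "Computer Book", "visits": 8, "total_sales": 4},
    {"last_sale_category": "Computer Book", "visits": 8, "total_sales": 2},
    {"last_sale_category": "Music", "visits": 8, "total_sales": 6},
]

coverage, probability = evaluate_rule(
    records,
    condition=lambda r: r["last_sale_category"] == "Computer Book" and r["visits"] == 8,
    conclusion=lambda r: r["total_sales"] > 3.76,
)
```

Checking a mined rule against fresh records in this way is a simple guard against acting on a pattern that held only in the data used to discover it.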
The process of stratification is automated by data mining
algorithms on the basis of the data. For example, a Web-site
database created from registration or purchase forms can be
segmented by these algorithms to discover the key attributes
(domain, referring search engine, age, gender, and so forth) that
distinguish profitable from nonprofitable visitors.
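One way such algorithms pick the key attributes is by measuring how much each attribute reduces uncertainty about the outcome. The sketch below uses information gain, the splitting criterion of ID3, on a tiny invented data set; the attribute names, the "profitable" label, and the records are assumptions, and production tools add pruning, numeric ranges, and significance tests that are omitted here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(records, attribute, target):
    """Reduction in entropy of target after splitting on attribute."""
    base = entropy([record[target] for record in records])
    groups = {}
    for record in records:
        groups.setdefault(record[attribute], []).append(record[target])
    remainder = sum(
        len(group) / len(records) * entropy(group) for group in groups.values()
    )
    return base - remainder


def best_attribute(records, attributes, target):
    """Pick the attribute whose split separates the target classes best."""
    return max(attributes, key=lambda name: information_gain(records, name, target))


# Hypothetical visitor records with a known outcome.
visitors = [
    {"referrer": "search", "gender": "F", "profitable": True},
    {"referrer": "search", "gender": "M", "profitable": True},
    {"referrer": "banner", "gender": "F", "profitable": False},
    {"referrer": "banner", "gender": "M", "profitable": False},
]
key_attribute = best_attribute(visitors, ["referrer", "gender"], "profitable")
```

A full decision tree repeats this choice inside each resulting group until the groups are pure or too small to split further.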
RECOGNIZING OPPORTUNITIES
Web data mining goes beyond log analysis and ad clickstreams; it
focuses on identifying customer attributes and consumer behavior.
The goals are generally to find out who is likely to buy your
products and services and identify the features of your most
loyal and profitable customers so that you can find more like
them. Today, sites inundated with data face the challenge of
recognizing the patterns of opportunities.
One of the common traits of firms that have traditionally used
data mining, such as cellular phone and credit card companies, is
that they have mountains of transactional data and compete for
customer loyalty and dollars in crowded markets where it costs
little for customers to switch to another company. The same
description applies to the evolving e-commerce landscape.
The Web is a fast, competitive marketplace in which millions
of online transactions are generated (and captured) in log files
and registration forms every hour of every day - and that
marketplace doubles every 100 days. Online shoppers browse retail
sites with fingers poised over their mouse, ready to buy or move
on should they not find what they are looking for or should the
content, wording, incentive, promotion, product, or service of
that site not meet their preferences. Browsers are retained based
on how well the retailer remembers their needs and whims. The
goal is to know and serve every customer, one at a time, and
build long-term, mutually beneficial relationships.
Data mining is the key to customer
knowledge and intimacy in this type of competitive and crowded
marketplace. In hyper-competitive markets, the strategic use of
customer information is critical to survival. In a networked
electronic environment, the margins and profits go to the fast,
responsive players who are able to leverage predictive models to
anticipate customer behavior and preferences.
Retailing on the Web is an interactive process that allows
consumers to negotiate, exchange information, and specify and
customize the products and services they want from the retailer.
For the electronic retailer, it is essential to analyze what
consumers are doing and asking for.
As billions of business interactions evolve and organize
themselves into revenue streams, subtle transformations occur in
the relationships between consumers and retailers in this dynamic
marketplace. Mining Web site data with data mining tools such as
neural networks and machine-learning and genetic algorithms is an
attempt to recognize, anticipate, and understand customer buying
habits and preferences in a constantly evolving business
environment.
4 Key Web Data Strategies
1. Leverage your data mining findings in your overall
approach to communicating with your customers.
- Use segmentation analysis to stratify your email offers
to prospects you have identified via your mining analysis.
- Use targeted email to provide incentives only to those
individuals likely to be interested in your products or services.
Remember that email to individuals about whom you know little will
be little more than spam.
- Automatically reply, route, manage, and segment email so
you can efficiently and effectively respond to your customers
through email via direct marketing.
2. Manage your customer contacts as you interact with them
online and offline.
- Provide prompt customer service via automated or segmented
email.
- Pool together data about customer behavior and
transactions as customers interact with you online and offline
through sales calls, meetings, phone, and email inquiries
as they buy your products and services.
3. Track your marketing ad efforts to identify what works
and why.
- Monitor which ads are getting clickthroughs and which
actually lead to sales.
- Develop profiles that include demographics, tastes, and
email addresses of your best prospects.
4. Manage your back-end logistics effectively via your
supply chain.
- Close the supply chain in the inventory loop
and translate the knowledge of your customers' tastes and
purchases into a quick turnaround by customizing your products
and services for them.
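For the third strategy, telling which ads merely attract clicks from which ones lead to sales comes down to two ratios per ad. Here is a minimal sketch, assuming ad activity has been collected as (ad ID, event type) pairs; the event names "impression", "click", and "sale", the ad IDs, and the function name ad_performance are hypothetical.

```python
def ad_performance(events):
    """Compute click-through and click-to-sale rates for each ad."""
    counts = {}
    for ad_id, kind in events:
        row = counts.setdefault(ad_id, {"impression": 0, "click": 0, "sale": 0})
        if kind in row:  # silently skip event types this sketch does not track
            row[kind] += 1

    report = {}
    for ad_id, row in counts.items():
        report[ad_id] = {
            "click_through_rate": row["click"] / row["impression"] if row["impression"] else 0.0,
            "conversion_rate": row["sale"] / row["click"] if row["click"] else 0.0,
        }
    return report


# Hypothetical event stream gathered from ad server and purchase records.
events = (
    [("banner_a", "impression")] * 4
    + [("banner_a", "click")] * 2
    + [("banner_a", "sale")]
    + [("banner_b", "impression")] * 2
)
report = ad_performance(events)
```

Keeping the two rates separate matters: an ad with a high click-through rate but a low conversion rate draws attention without producing buyers.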
Jesus Mena is CEO of webminer.com, a Web data mining company,
and is the author of Data Mining Your Website (Digital Press,
1999), a book on how to mine Web data for e-commerce and
relational marketing. You can reach him at jmena@webminer.com.