Data mining, the process of finding useful patterns in data, has become common as computers have grown more powerful and the amount of data generated and stored electronically has increased.
Amazon.com uses it to recommend products to its customers; Netflix uses it to recommend movies; many businesses use it to decide who should receive their catalogs; and biologists use it for protein structure prediction.
In practice, many factors complicate the data mining process, and that is where Gary M. Weiss, Ph.D., assistant professor of computer and information science, comes in.
“Every phone call you make, the phone number, the time of day—all of that is recorded. We’re talking about hundreds of billions of records in a database,” he said.
Weiss likened today’s computer-based data accumulation to digital photography of the heavens. “There are telescopes continuously recording digital images; an astronomer could never look at all of that. Data mining, which leverages the power of the computer, is needed.”
Phone systems are of particular interest to Weiss, who was a senior member of the technical staff at AT&T Laboratories before coming to Fordham in 2004. His work at AT&T focused on programming “expert systems,” which mimic the diagnostic abilities of human technicians, and data mining of customer data, particularly calling patterns.
“Can you tell if a phone line, based on the usage pattern, belongs to a business or residence? Businesses tend to make more calls during the day, residences more at night. We used data mining to automatically identify more subtle patterns for differentiating the two types of customers,” he said. “That’s certainly useful for marketing.”
Since coming to Fordham, he has expanded his data mining interests to include cyber security. Last year, he won a $25,000 grant from a major multinational bank to conduct research on “Enhancing Anti-Money Laundering Systems Using Machine Learning Methods.”
The challenge was to develop a program that could respond quickly to a federal government request naming a crime suspect by identifying all possible matches. The task is harder than it sounds, because aliases, nicknames and abbreviations can trip up a program.
“The bank was basically saying, ‘If the last name matches, you get so many points; if the first name matches, you get so many points; if the date of birth matches, you get so many points.’ So they were manually coming up with those points, but a machine approach would be, ‘Here’s some training data; here are some names; here are some matches that we had an expert go over very carefully. Let’s have the computer decide what weights should be used and how they should be combined,’” he said.
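The contrast Weiss draws, between hand-assigned points and weights learned from expert-labeled examples, can be sketched in a few lines of Python. This is only an illustration, not the bank's system or Weiss' method: the training rows are invented, the three match features are the ones mentioned in the quote, and logistic regression is simply one common way to let a computer choose the weights.

```python
import math

# Hypothetical, hand-made training data (not from the bank or the article).
# Each row: (last_name_match, first_name_match, dob_match), expert label,
# where label 1 means the expert judged the two records to be the same person.
TRAINING = [
    ((1, 1, 1), 1),
    ((1, 1, 0), 1),
    ((1, 0, 1), 1),
    ((0, 1, 1), 0),
    ((1, 0, 0), 0),
    ((0, 1, 0), 0),
    ((0, 0, 1), 0),
    ((0, 0, 0), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Estimated probability that two records refer to the same person."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(examples, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by plain batch gradient descent."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in examples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(examples)
    return weights, bias


# The learned weights play the role of the bank's manually chosen "points".
weights, bias = train(TRAINING)
full_match = predict(weights, bias, (1, 1, 1))
no_match = predict(weights, bias, (0, 0, 0))
```

In this toy data the expert labels depend more on the last name than on the first name, so the learned weight for a last-name match ends up larger, without anyone having to pick that number by hand.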
A recurring theme in Weiss’ research is utility-based data mining, a term he and two colleagues recently popularized. Much of the existing work in data mining ignores the complex environment in which data mining occurs and relies on overly simplistic assumptions, he said.
Utility-based data mining, on the other hand, takes the relevant factors, or utilities, into account, such as the cost of acquiring training data, the cost of computer time and the benefits associated with the “mined” knowledge in the business or scientific setting.
“If you’re a company, you may have records on your customers, but you might want to develop records on other people,” Weiss said. “Sometimes you don’t have enough data, and then the question is, ‘How much data is enough, and if I need to pay for this data, what amount is optimal?’” This question was examined by Weiss and Ye Tian, a Fordham graduate student, with the results published in a leading data mining journal.
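The trade-off behind that question can be illustrated with a toy calculation. The sketch below is not the analysis from the published paper; it assumes a made-up learning curve in which accuracy improves quickly at first and then levels off, a made-up price per training example, and a made-up dollar value for accuracy, and then looks for the purchase size with the highest net benefit.

```python
def accuracy(n, ceiling=0.95, gap=0.45, rate=0.35):
    """Hypothetical power-law learning curve: rises with n, then flattens."""
    return ceiling - gap * n ** (-rate)


def net_utility(n, value_per_accuracy=100_000.0, cost_per_example=0.50):
    """Assumed benefit of the model's accuracy minus the cost of buying n examples."""
    return value_per_accuracy * accuracy(n) - cost_per_example * n


# Candidate training-set sizes to compare (arbitrary choices for illustration).
candidates = [100, 200, 500, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000]
best_n = max(candidates, key=net_utility)

for n in candidates:
    print(f"{n:>6} examples: accuracy {accuracy(n):.3f}, net utility {net_utility(n):,.0f}")
print("best candidate size:", best_n)
```

Under these assumed numbers, buying more data keeps raising accuracy, but past a certain point each extra example costs more than the improvement is worth, so the best size is neither the smallest nor the largest candidate.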
The advent of the Web and social networking sites has presented many new challenges for the data mining community, and Weiss has begun to shift some of his research efforts in this direction. Weiss recently advised Qiang Ma, a Fordham graduate student now pursuing a doctorate at Rutgers University, on his master’s thesis, which was inspired by a challenge offered by the film-rental website Netflix. The company is offering $1 million to anyone who can improve the accuracy of its movie recommendations by 10 percent.
The amount of data that Netflix provides is enormous, so Ma and Weiss used a smaller set from another database, MovieLens. Weiss said they approached the problem not by analyzing a person in isolation based on his or her movie likes and dislikes, but by looking at the person’s similarity to other users. The movie recommendation problem was translated into a network graph problem in which users’ movie preferences were iteratively propagated through the graph.
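The general idea of propagating preferences through a graph can be shown with a small example. The following is a minimal sketch, not the algorithm from Ma's thesis: the four users, the similarity weights and the two known ratings are all invented, and the update rule (each unrated user repeatedly takes the similarity-weighted average of his or her neighbors' scores) is just one simple form of propagation.

```python
# Hypothetical similarity graph: user -> {neighbor: similarity weight}.
GRAPH = {
    "ann": {"bob": 1.0, "cat": 0.5},
    "bob": {"ann": 1.0, "dan": 1.0},
    "cat": {"ann": 0.5, "dan": 0.5},
    "dan": {"bob": 1.0, "cat": 0.5},
}

# Hypothetical known ratings (1 to 5) for a single movie.
KNOWN = {"ann": 5.0, "cat": 1.0}


def propagate(graph, known, iterations=50, default=3.0):
    """Spread known ratings to unrated users through the similarity graph."""
    scores = {user: known.get(user, default) for user in graph}
    for _ in range(iterations):
        updated = {}
        for user, neighbors in graph.items():
            if user in known:
                # Known ratings stay fixed and act as the sources.
                updated[user] = known[user]
                continue
            total = sum(neighbors.values())
            updated[user] = sum(w * scores[v] for v, w in neighbors.items()) / total
        scores = updated
    return scores


scores = propagate(GRAPH, KNOWN)
for user in sorted(scores):
    print(user, round(scores[user], 2))
```

Here Bob, who is closely tied to Ann, ends up with a higher predicted rating than Dan, whose neighbors include Cat, who disliked the movie.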
Weiss also is interested in applying data mining to social networks such as Facebook to predict which person within a group of friends will most influence others.
“It used to be that you couldn’t automatically identify those relationships very easily. Sociologists had to interview people; it was time consuming,” he said. “Now you go to Facebook; if they give you the data, you have a huge network with millions of nodes and edges and you know who everyone’s friends are.”