Locating Product Information from the Web using Simhash Fingerprints

  • Pham Nguyen Tuan Anh
  • Nguyen Khanh Van


We consider the problem of creating efficient search schemes that are specialized for product information; this is an important issue given the explosive growth of commercial websites and Internet-based services. We share the observation made in PEWeb [24] that products are almost always displayed as a series of similar-looking information blocks showing features and prices for customers to choose from, and so the webpage DOM tree has similar subtrees in the parts corresponding to the product display areas.

We propose to use a special hash function, namely Simhash [18], to identify the product regions. Our basic idea is that subtrees (in the webpage DOM tree) with similar structures have similar Simhash fingerprints (differing in just a few bits). To eliminate possible misidentifications from the first, Simhash-based phase, we combine it with a decision tree approach, which gives us more flexibility, especially with product websites developed by Vietnamese companies, which prefer certain display formats that are not very popular worldwide. Compared to PEWeb, our scheme can be more refined and flexible, since it offers more options for adjustment. This improvement in precision is strongly supported by experimental results.
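
The basic idea of comparing subtrees by fingerprint can be illustrated with a minimal sketch. It is not the implementation used in this work: the tuple representation of a subtree, the choice of root-to-node tag paths as features, the use of MD5 as the underlying 64-bit hash, and all function names are assumptions made only for this example.

```python
import hashlib
from collections import Counter

def features(node, path=""):
    """Yield root-to-node tag paths of a subtree given as (tag, [children])."""
    tag, children = node
    here = f"{path}/{tag}"
    yield here
    for child in children:
        yield from features(child, here)

def hash64(text):
    """Hypothetical helper: first 8 bytes of MD5 as a 64-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")

def simhash(node, bits=64):
    """Simhash fingerprint of a subtree: each feature votes +w or -w per bit."""
    counts = Counter(features(node))
    totals = [0] * bits
    for feat, weight in counts.items():
        h = hash64(feat)
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    # A bit of the fingerprint is 1 when the weighted vote for it is positive.
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

# Two product blocks with the same structure (text content is ignored here).
product_a = ("div", [("img", []), ("h3", []), ("span", [])])
product_b = ("div", [("img", []), ("h3", []), ("span", [])])
# A block with one extra element, and a structurally unrelated block.
product_c = ("div", [("img", []), ("h3", []), ("span", []), ("span", [])])
nav_menu = ("ul", [("li", [("a", [])]), ("li", [("a", [])])])

print(hamming(simhash(product_a), simhash(product_b)))
print(hamming(simhash(product_a), simhash(product_c)))
print(hamming(simhash(product_a), simhash(nav_menu)))
```

Structurally identical subtrees yield identical fingerprints, so their Hamming distance is zero. In a scheme of this kind, sibling subtrees whose fingerprints differ in at most some small number of bits would be grouped as candidate product regions; that threshold is a tunable parameter and is not fixed by this sketch.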

