How to extract data from a site?

Web scraping (also called web harvesting or web data extraction) is a software technique for extracting information from websites.
Here is code that extracts the content of a specific site:

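// Requires: using System.Collections.Generic; using System.Net; using System.Text.RegularExpressions;
// urlPath, destinationList and MapUrl are defined elsewhere in the class.
WebClient wc = new WebClient();
string html = string.Empty;
MatchCollection matches;
string url = string.Empty;
int id = 0;
clsDestinations destination;

// Download the page and strip the outer tags that are not needed for matching
html = wc.DownloadString(urlPath)
    .Replace("<html>", "").Replace("</html>", "").Replace("<!DOCTYPEHTML>", "")
    .Replace("<head>", "").Replace("</head>", "")
    .Replace("<script>", "").Replace("</script>", "");

// NOTE: the original pattern was lost when this post was published.
// An anchor-tag pattern is assumed here: group 1 = href value, group 2 = link text.
matches = Regex.Matches(html, "<a[^>]*?href=\"(.*?)\"[^>]*>(.*?)</a>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

if (destinationList == null)
    destinationList = new List<clsDestinations>();

foreach (Match match in matches)
{
    string matchUrl = match.Groups[1].Value;
    //For internal links, build the url mapped to the base address
    if (match.Groups[0].Value.Contains("travel/landing_page_hotels.cfm"))
    {
        url = MapUrl(urlPath, matchUrl);
        if (url.Length > 0)
        {
            destination = new clsDestinations();
            id += 1;
            destination.ID = id;
            destination.Url = url;
            destination.CityName = match.Groups[2].Value;
            // Skip cities that are already in the list
            if (!destinationList.Exists(d => d.CityName == destination.CityName))
                destinationList.Add(destination);
        }
    }
}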
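
The snippet above relies on MapUrl and clsDestinations, which are not shown in this post. To make the flow easier to try out, here is a minimal sketch that runs the same steps against an HTML string held in memory instead of a downloaded page. The class shape, the MapUrl behaviour (resolve the link against the page address, return an empty string on failure), the anchor pattern, and the names ScrapeSketch and ExtractDestinations are all assumptions made for illustration, not the original implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical shape of the class used in the post; the real one is not shown.
public class clsDestinations
{
    public int ID { get; set; }
    public string Url { get; set; }
    public string CityName { get; set; }
}

public static class ScrapeSketch
{
    // Assumed pattern: group 1 = href value, group 2 = link text.
    private static readonly Regex AnchorPattern = new Regex(
        "<a[^>]*?href=\"(.*?)\"[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Assumed behaviour: resolve a link against the page address and
    // return an empty string when either part cannot be parsed.
    public static string MapUrl(string baseAddress, string link)
    {
        if (string.IsNullOrEmpty(link)) return string.Empty;
        Uri baseUri;
        if (!Uri.TryCreate(baseAddress, UriKind.Absolute, out baseUri)) return string.Empty;
        Uri resolved;
        if (!Uri.TryCreate(baseUri, link, out resolved)) return string.Empty;
        return resolved.AbsoluteUri;
    }

    // Same steps as the snippet in the post, applied to an HTML string.
    public static List<clsDestinations> ExtractDestinations(string html, string baseAddress)
    {
        List<clsDestinations> list = new List<clsDestinations>();
        int id = 0;
        foreach (Match match in AnchorPattern.Matches(html))
        {
            // Keep only links that point at the hotel landing page
            if (!match.Groups[0].Value.Contains("travel/landing_page_hotels.cfm")) continue;

            string url = MapUrl(baseAddress, match.Groups[1].Value);
            if (url.Length == 0) continue;

            // As in the post, the id is incremented before the duplicate check
            id += 1;
            string city = match.Groups[2].Value;
            if (!list.Exists(d => d.CityName == city))
                list.Add(new clsDestinations { ID = id, Url = url, CityName = city });
        }
        return list;
    }
}
```

Calling ScrapeSketch.ExtractDestinations(html, "http://example.com/index.cfm") with a page that links to travel/landing_page_hotels.cfm twice for the same city should give one entry for that city, with the link resolved against the example.com address. A regular expression is fine for a quick extraction like this, but it will miss links whose markup differs from the assumed pattern.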

Once you have the data in your collection, you can save the items one by one like this:

foreach (clsDestinations cy in destinationList)
{
    if (!cityBll.CheckForDuplicateCity(cy, false))
        result += cityBll.InsertCity(cy);
}
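
The cityBll object is not shown in this post, so the loop above cannot be run as it stands. The sketch below swaps in an in-memory stand-in to show how the duplicate check and the insert count fit together. InMemoryCityBll, SaveSketch and SaveAll are hypothetical names; the sketch assumes InsertCity returns the number of rows inserted (since the post adds it to result) and ignores the second argument of CheckForDuplicateCity, whose meaning is not stated.

```csharp
using System;
using System.Collections.Generic;

public static class SaveSketch
{
    // Hypothetical shape of the post's destination class.
    public class clsDestinations
    {
        public int ID { get; set; }
        public string Url { get; set; }
        public string CityName { get; set; }
    }

    // In-memory stand-in for cityBll; the real class presumably talks to a database.
    public class InMemoryCityBll
    {
        // Case-insensitive name comparison is an assumption of this sketch.
        private readonly HashSet<string> names =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // The meaning of the second argument is not stated in the post, so it is ignored here.
        public bool CheckForDuplicateCity(clsDestinations city, bool unusedFlag)
        {
            return names.Contains(city.CityName);
        }

        // Assumed to return the number of rows inserted.
        public int InsertCity(clsDestinations city)
        {
            return names.Add(city.CityName) ? 1 : 0;
        }
    }

    // Same loop as in the post, returning the total number of inserted cities.
    public static int SaveAll(IEnumerable<clsDestinations> destinations, InMemoryCityBll cityBll)
    {
        int result = 0;
        foreach (clsDestinations cy in destinations)
        {
            if (!cityBll.CheckForDuplicateCity(cy, false))
                result += cityBll.InsertCity(cy);
        }
        return result;
    }
}
```

Running SaveAll twice with the same list and the same InMemoryCityBll instance should insert nothing the second time, which is the point of checking for duplicates before each insert.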


This entry was posted in C# 3.5.