A crawler written in PHP that scrapes every article and image from a blog

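Honestly, the cracked software shared on this site is really good. A lot of the software I use was downloaded from there, so I genuinely like the site. That gave me the idea of crawling it, then re-crawling it every day with crontab, to bring a bit of traffic to my own feehi.com. All right, I admit this is a tiny bit unethical, but I'm doing it in the spirit of research and learning…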

PHP crawler

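Frustratingly, PHP has no built-in multithreading, so it takes a very, very long time to crawl a site of roughly 3,000 articles (including downloading the images and writing everything to the database). There was no way around it: PHP is what I know best, I'm not familiar with even the basic Python APIs, and I wasn't going to figure out multithreading in a hurry, so my first crawler experiment is in PHP. To practice my regular expressions (I rarely use them; for efficiency I normally stick to the built-in string functions, since the matching is seldom complex enough to need a regex), I didn't use an extension library to look up DOM nodes this time. Instead I analyzed the HTML source by hand and wrote the regexes myself. Luckily the site doesn't use ajax, otherwise it would have been more trouble.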

Crawler

Log

Here's the code:

Source download
<?php
error_reporting(0);
set_time_limit(0); // remove the execution time limit
date_default_timezone_set('PRC');

define('DB_HOST', 'localhost');
define('DB_USER', 'root');
define('DB_PASSWORD', 'xxx');

// One log file per run, named after the start time.
$logTxt = date('Y-m-d H-i-s') . '.txt';

// Print a message and append it, with a timestamp, to the log file.
function writeLog($logTxt, $log)
{
    echo $log;
    file_put_contents($logTxt, date('Y-m-d H:i:s') . " $log", FILE_APPEND);
}

writeLog($logTxt, "Crawl started\r\n");
writeLog($logTxt, "Connecting to the database...\r\n");
$conn = mysql_connect(DB_HOST, DB_USER, DB_PASSWORD);
if ($conn) {
    writeLog($logTxt, "Database connection succeeded\r\n");
} else {
    writeLog($logTxt, "Database connection failed, exiting\r\n");
    exit();
}
mysql_select_db('soft');
mysql_query("set names utf8");

for ($i = 0; $i < 154; $i++) {
    $num = $i + 1;
    writeLog($logTxt, "Analyzing page {$num}\r\n");
    $url = 'http://www.guofs.com/page/' . $num;
    $content = file_get_contents($url);

    // Article links on the list page.
    preg_match_all('/<h2>\s*<a href="(.*)"/U', $content, $matches);
    writeLog($logTxt, "Found " . count($matches[1]) . " articles on page {$num}\r\n");

    // Thumbnails on the list page, in the same order as the articles.
    preg_match_all('/id="customImg"\s+class="customImg"><a\s+href=".*"><img\s+src="(.*)"/U', $content, $matchesThumb);
    $thumbPic = array();
    foreach ($matchesThumb[1] as $ThumbK => $ThumbV) {
        writeLog($logTxt, "Downloading thumbnail " . ($ThumbK + 1) . " of page {$num}\r\n");
        $dataThumb = file_get_contents($ThumbV);
        $infoThumb = pathinfo($ThumbV);
        $pathThumb = '/thumb/' . date('Y-m-d') . '/';
        $filePathThumb = dirname(__FILE__) . $pathThumb;
        if (!is_dir($filePathThumb)) {
            mkdir($filePathThumb, 0777, true);
        }
        // Random prefix to reduce the chance of file name collisions.
        $rand = rand(0, 10000) . '_';
        $filePathThumb .= $rand . urlencode($infoThumb['basename']);
        $thumbPic[] = $pathThumb . $rand . urlencode($infoThumb['basename']);
        $fp = @fopen($filePathThumb, 'w');
        @fwrite($fp, $dataThumb);
        @fclose($fp);
        writeLog($logTxt, "Thumbnail " . ($ThumbK + 1) . " of page {$num} downloaded\r\n");
    }

    foreach ($matches[1] as $k => $v) {
        mysql_query("update ttt set checked_times=checked_times+1");
        writeLog($logTxt, "Analyzing article " . ($k + 1) . " of page {$num}...\r\n");

        // Skip articles that are already in the database.
        $urlSql = mysql_real_escape_string($v);
        if (is_array($row = mysql_fetch_assoc(mysql_query("select * from ttt where url='$urlSql'")))) {
            writeLog($logTxt, "{$v} was already crawled at " . date('Y-m-d H:i:s', $row['created_at']) . ", skipped this time.\r\n");
            continue;
        }

        $content2 = file_get_contents($v);
        preg_match('/class="postTitle">\s*<h1>(.*)<\/h1>/U', $content2, $matches2);
        $title = $matches2[1];
        writeLog($logTxt, "Fetched {$v}. Title: {$title}...\r\n");

        // Article body: everything after class="entry"> up to the next <div.
        preg_match('/class="entry">([\S\s]+)<div/U', $content2, $matches2);
        $article = $matches2[1];

        // Download every image in the body and point its src at the local copy.
        preg_match_all('/<img[\s\S]*src="(.*)"/U', $article, $pics);
        foreach ($pics[1] as $k2 => $v2) {
            writeLog($logTxt, 'This article contains ' . count($pics[1]) . " images, downloading image " . ($k2 + 1) . "...\r\n");
            $data = file_get_contents($v2);
            $info = pathinfo($v2);
            $path = '/uploads/' . date('Y-m-d') . '/';
            $filePath = dirname(__FILE__) . $path;
            if (!is_dir($filePath)) {
                mkdir($filePath, 0777, true);
            }
            $rand_pic = rand(0, 10000) . '_';
            $filePath .= $rand_pic . urlencode($info['basename']);
            $fp = @fopen($filePath, 'w');
            @fwrite($fp, $data);
            @fclose($fp);
            writeLog($logTxt, "Image " . ($k2 + 1) . " downloaded...\r\n");
            $article = str_replace($v2, $path . $rand_pic . urlencode($info['basename']), $article);
        }

        // Swap the original domain and site name for my own.
        $article = str_replace(array('www.guofs.com', '独木成林'), array('soft.feehi.com', '飞嗨'), $article);

        // Set the timestamp here so it exists even when the article has no images.
        $time = time();
        $titleSql = mysql_real_escape_string($title);
        $articleSql = mysql_real_escape_string($article);
        $thumbSql = mysql_real_escape_string($thumbPic[$k]);
        mysql_query("insert into ttt(title,content,thumb,created_at,url) values('$titleSql','$articleSql','$thumbSql',$time,'$urlSql')");
        writeLog($logTxt, "{$v} saved to the database...\r\n");
    }
}

mysql_close($conn);
writeLog($logTxt, "Crawl finished. See the log file {$logTxt} in the same directory as the script\r\n");
?>
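First I analyzed the URLs. They are very traditional: the page number is passed directly in the URL (/page/N), which makes things easy. There are 154 pages in total, so a single for loop covers them. Then, for each page, I work out the article list, collect the article URLs, and crawl them one by one, with for loops nested inside for loops... Nothing fancy, since there's no ajax anyway…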
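The two steps described above, building the list-page URL and pulling the article links out of it, can be sketched roughly as follows. The helper names listPageUrl and extractArticleUrls, the example.com URLs, and the sample markup are all made up for illustration; only the regex pattern comes from the script.

```php
<?php
// Hypothetical helper: build the list-page URL for a given page number.
function listPageUrl($base, $page)
{
    return rtrim($base, '/') . '/page/' . $page;
}

// Hypothetical helper: pull article URLs out of a list page's HTML.
function extractArticleUrls($html)
{
    // The U modifier makes quantifiers lazy, so (.*) stops at the first closing quote.
    preg_match_all('/<h2>\s*<a href="(.*)"/U', $html, $matches);
    return $matches[1];
}

// Made-up sample markup standing in for one downloaded list page.
$sampleHtml = <<<HTML
<h2>
  <a href="http://example.com/post/1" title="First">First</a></h2>
<h2><a href="http://example.com/post/2">Second</a></h2>
HTML;

echo listPageUrl('http://example.com/', 3), "\n";

$urls = extractArticleUrls($sampleHtml);
foreach ($urls as $url) {
    echo $url, "\n";
}
```

Keeping the extraction in its own function makes it easy to try the regex against a saved page before pointing the loop at the live site.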

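The main hassle here is writing the regexes. I finally understand the trap that catches so many people: by default, .* does not match newline characters. Oh my god, lesson learned. I quickly switched to [\s\S]*.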
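A small check of that pitfall, using a made-up HTML string (not taken from the crawled site). It also shows the s modifier, which is another way to get the same effect and is not something the script itself uses.

```php
<?php
// Made-up markup where the content spans two lines.
$html = "<div class=\"entry\">line one\nline two</div><div class=\"footer\">";

// "." does not match "\n" by default, so this pattern finds nothing.
$dotMatched = preg_match('/class="entry">(.+)<div/U', $html, $m1);

// [\s\S] matches any character, newlines included.
$classMatched = preg_match('/class="entry">([\s\S]+)<div/U', $html, $m2);

// The s (DOTALL) modifier makes "." match newlines as well.
$dotallMatched = preg_match('/class="entry">(.+)<div/Us', $html, $m3);

var_dump($dotMatched, $classMatched, $dotallMatched);
echo $m2[1], "\n";
```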

Downloaded images

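After that it's just a matter of logging and printing whatever is needed, and saving everything to the database. Of course, I replace every occurrence of his site name and domain with mine... Once I find time to pick a template, the SEO site soft.feehi.com will go live.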
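The rewriting step can be sketched on its own like this. The helper localImagePath, the article fragment, the date, and the fixed prefix are assumptions for illustration (the script uses the current date and a random prefix); the str_replace arrays are the ones from the script.

```php
<?php
// Hypothetical helper: map a remote image URL to a local relative path,
// following the "/uploads/<date>/<prefix><basename>" layout used by the script.
function localImagePath($remoteUrl, $date, $prefix)
{
    $info = pathinfo($remoteUrl);
    return '/uploads/' . $date . '/' . $prefix . urlencode($info['basename']);
}

// Made-up article fragment.
$article = '<p>独木成林 <img src="http://www.guofs.com/img/a b.png"></p>';

// Point the <img> at the local copy first, while the original URL is still intact...
$remote = 'http://www.guofs.com/img/a b.png';
$article = str_replace($remote, localImagePath($remote, '2015-01-01', '42_'), $article);

// ...then swap the site name and domain in whatever text remains.
$article = str_replace(
    array('www.guofs.com', '独木成林'),
    array('soft.feehi.com', '飞嗨'),
    $article
);

echo $article, "\n";
```

The order matters: if the domain were replaced first, the image URL in the article would no longer match the URL that was downloaded, and the src would not be rewritten to the local file.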
