
Introduction to Web Crawling with Python


Hello everyone,

In this post we will make a simple introduction to web crawling and, relying on the authority vested in us by Python, fetch the trending topics on Ekşi Sözlük together with their URLs and save them to a file.

Let's start slowly. First of all, we will create a new virtual environment in which we install only the packages this project needs. To do this, on Linux we use the virtualenv command to create a virtual environment named eksi_crawler.

If the virtualenv package is not already installed, you can install it with pip.

Next, we install the packages we will need in the later sections.

I will show what these packages are good for later in the post.

Now let's look at how we can fetch the topic titles and their links. To do that, we need to take a look at the HTML of the site in question (Ekşi Sözlük, in this case). In the browser we go to the URL “https://eksisozluk.com/basliklar/gundem”. We right-click one of the titles there and click “Ögeyi İncele” or “Inspect Element”. In the panel that opens on the side, we can see the part of the site's source code that contains the titles.

After examining the DOM a little, we conclude that all the titles sit inside a single list and follow the same regular pattern.

In other words, this pattern can be expressed as an XPath.

You can find more information about XPath here.
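
The exact markup and XPath are not reproduced here, so the following is only a minimal sketch. It assumes a hypothetical list (a `ul` with class `topic-list` whose `li` items each wrap an `a`), which may not match the live site, and shows how such a pattern could be written as an XPath and evaluated with lxml.

```python
from lxml import html

# Hypothetical markup: the class name and nesting are assumptions for
# illustration, not the site's verified structure.
SAMPLE_HTML = """
<html><body>
  <ul class="topic-list">
    <li><a href="/first-topic--1">first topic</a></li>
    <li><a href="/second-topic--2">second topic</a></li>
  </ul>
</body></html>
"""

# "Every <a> inside an <li> inside the <ul> whose class is topic-list."
TOPIC_XPATH = '//ul[@class="topic-list"]/li/a'

tree = html.fromstring(SAMPLE_HTML)
matches = tree.xpath(TOPIC_XPATH)
print(len(matches))  # 2
```

If the real page uses different tags or class names, only `TOPIC_XPATH` needs to change.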

Now that we have worked out how to extract the DOM elements that contain the titles and their links, we can start writing the Python script that does the real work.

First, let's import our modules.
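
A minimal sketch of the imports, inferred from the modules this post names (requests for fetching, lxml.html for parsing). The standard library json module is an assumption here, based on the JSON file written at the end.

```python
import json            # standard library; assumed for saving the data at the end

import requests        # third-party; sends the HTTP GET request
from lxml import html  # third-party; builds the DOM tree from HTML text
```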

Then let's fetch the HTML content of the web page using the requests module.

We use the requests module here so that we can pull the content of a web page onto our own machine. For this we used the module's get function. Put very simply, this function sends a GET request to the server we are dealing with (here eksisozluk.com) for the page/URL we want (here /basliklar/gundem).
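
The fetching step might look roughly like the sketch below. The helper name `fetch_page`, the timeout value, and the injectable `get` parameter are assumptions added for illustration; only `requests.get` and the variable name `pg` come from the text.

```python
import requests

GUNDEM_URL = "https://eksisozluk.com/basliklar/gundem"


def fetch_page(url, get=requests.get):
    """Send a GET request and return the response object.

    `get` is injectable only so the function can be tried without network access.
    """
    # The timeout value is an assumption; without one, requests can wait indefinitely.
    pg = get(url, timeout=10)
    # Raise an exception for 4xx/5xx answers instead of parsing an error page.
    pg.raise_for_status()
    return pg


# Usage (performs a real network request):
# pg = fetch_page(GUNDEM_URL)
```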


Note: We can see the content of the response by looking at pg.content. However, pg.content holds the encoded site content (bytes), and we want to work with a string in the following steps, so we will use the decoded form instead. The decoded content is available as pg.text.
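
The difference between the two attributes can be illustrated without any network access. In this sketch the sample string and the UTF-8 encoding are assumptions; requests itself picks the encoding from the response headers or guesses it.

```python
# Stand-in for pg.content: the raw bytes as they arrive over the network.
content = "gündem başlıkları".encode("utf-8")

# Stand-in for pg.text: the same data decoded into a str.
text = content.decode("utf-8")

print(type(content).__name__, type(text).__name__)  # bytes str
# Non-ASCII letters take more than one byte, so the lengths differ.
print(len(content), len(text))  # 21 17
```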


For more information about the requests module, see its documentation.

Now that we have fetched the site's source code, it is time to build a DOM tree and pull out the information we want. To build the DOM tree we will use the lxml.html module.

Note: If you do not know what a DOM tree is, it is worth taking a look here.

Here the pg.text variable contains the HTML we fetched. The html.fromstring function takes this HTML and builds a DOM tree for us.
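
As a small offline illustration of this step, the sketch below parses a short hand-written HTML string in place of pg.text. The sample markup and the variable name `page_text` are assumptions.

```python
from lxml import html

# Stand-in for pg.text; a real page would be far larger.
page_text = "<html><body><h1>gündem</h1><p>hello</p></body></html>"

tree = html.fromstring(page_text)

# The result is the root element; its children can be walked like a tree.
print(tree.tag)  # html
body = tree.find("body")
print([child.tag for child in body])  # ['h1', 'p']
```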

Now we can pull out the elements we want using XPath.

If we look at the contents of the links variable, we see a list of DOM element objects.

The items that make up this list are the DOM elements of our trending topics. As an example, let's look at the first title and its link.
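
Reading the title and link from the first element could look like the sketch below. The markup and the XPath are hypothetical stand-ins rather than the site's verified structure; `text_content()` and `get()` are standard lxml element methods.

```python
from lxml import html

# Hypothetical markup and XPath (assumptions, not the site's verified structure).
sample = """
<ul class="topic-list">
  <li><a href="/first-topic--1">first topic</a></li>
  <li><a href="/second-topic--2">second topic</a></li>
</ul>
"""
tree = html.fromstring(sample)
links = tree.xpath('//ul[@class="topic-list"]/li/a')

first = links[0]
title = first.text_content().strip()  # visible text of the element
href = first.get("href")              # value of the href attribute
print(title, href)  # first topic /first-topic--1
```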

As you may have noticed, the title's link is relative; as it stands, someone looking at it cannot tell that it belongs to eksisozluk.com. For this reason, we will prepend “https://eksisozluk.com” to every link when saving.
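
A minimal sketch of turning a relative link into an absolute one. The sample path is hypothetical, and `urljoin` from the standard library is shown only as an alternative to the plain concatenation described in the text.

```python
from urllib.parse import urljoin

BASE_URL = "https://eksisozluk.com"

relative = "/first-topic--1"  # hypothetical relative link

# Plain concatenation, as described in the text.
absolute = BASE_URL + relative

# urljoin gives the same result here and also copes with links
# that are already absolute.
joined = urljoin(BASE_URL, relative)
print(absolute)  # https://eksisozluk.com/first-topic--1
```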

Now we come to the last part of the job: saving the data we obtained to a file. I decided to save this data to a JSON file, so we first need to reshape the data to fit the JSON format.
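
One possible shape for the data is a list of dictionaries, which maps directly onto a JSON array of objects. The key names `title` and `url` and the sample pairs below are assumptions, not necessarily the layout used originally.

```python
BASE_URL = "https://eksisozluk.com"

# Hypothetical (title, relative link) pairs standing in for the scraped elements.
scraped = [
    ("first topic", "/first-topic--1"),
    ("second topic", "/second-topic--2"),
]

# Build one dict per topic, prepending the site address to each relative link.
entries = [{"title": title, "url": BASE_URL + href} for title, href in scraped]
print(entries[0])
```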

Now we can save our data.
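
The saving step can be sketched with the standard library json module. The helper name `save_entries`, the sample entry, and the `ensure_ascii` and `indent` settings are assumptions; only the file name eksi_data.json comes from the text.

```python
import json


def save_entries(entries, path="eksi_data.json"):
    """Write the entries to a JSON file encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Turkish characters readable in the file.
        json.dump(entries, f, ensure_ascii=False, indent=2)


# Hypothetical data; the key names are assumptions.
entries = [{"title": "gündem başlığı", "url": "https://eksisozluk.com/first-topic--1"}]
# save_entries(entries)  # writes eksi_data.json in the current directory
```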

After running this code as well, the titles we fetched and their links are saved to the eksi_data.json file created in the current directory.

 

 
