Python RPA (Business Process Automation): Concepts and Practice

Python RPA (Business Process Automation): Concepts and Practice - Crawling (BeautifulSoup) (2)

developers developing 2022. 9. 28. 12:00

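RPA (Robotic Process Automation)

A set of processes that automatically operates web pages, Windows, and applications (such as Excel) according to a predefined scenario, minimizing manual work
RPA software
- UiPath, Blue Prism, Automation Anywhere, WinAutomation
RPA libraries
- pyautogui, pyperclip, selenium

Crawling: collecting, classifying, and storing web sites, hyperlinks, and data resources by automated means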

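Working with URLs: Python provides the urllib library

urllib.request
1. urlretrieve()
  - Saves the content of the requested URL to a file
  - Returns a tuple: (filename, headers)
  - Useful for saving a large amount of data at once, such as CSV files or API data
2. urlopen()
  - Loads the response into memory for analysis instead of saving it to a file
  - read(): reads the data held in memory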
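As a rough illustration of the difference between the two functions, the sketch below runs both against a local temporary file exposed through a file:// URL. The temporary file, its contents, and the name copy.csv are assumptions made so the example runs offline; with a real http(s) URL the calls look the same.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve

# Assumption: a local temp file stands in for a remote resource
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_bytes(b"name,price\napple,1000\n")
url = source.as_uri()  # file:///... URL pointing at the temp file

# urlretrieve(): writes the resource to a file, returns (filename, headers)
saved_path, headers = urlretrieve(url, str(workdir / "copy.csv"))
saved_bytes = Path(saved_path).read_bytes()

# urlopen(): keeps the response in memory; read() returns bytes
with urlopen(url) as response:
    body = response.read()

print(saved_path)
print(body.decode("utf-8"))
```

urlretrieve() fits the case where the file itself is the goal, while urlopen() fits the case where the content is parsed right away and never needs to touch the disk.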

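requests + beautifulsoup4 combination

A BeautifulSoup object must be created before use
- Create the object: BeautifulSoup(page source, parser)
- parser: lxml. Implemented in C
- parser: html.parser (built into Python). No installation required
- lxml is faster than html.parser
Finding specific elements
- By tag (find() returns only the first matching tag)
- find(): find("tag name", class_="class name")
- find_all()
- find_*()
Finding with CSS selectors
- select
- select_one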
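The points above can be sketched on a small inline document. SAMPLE_HTML and the helper make_soup are hypothetical names introduced for illustration only; the helper prefers lxml and falls back to html.parser when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, not taken from any real site
SAMPLE_HTML = """
<html><body>
  <h3 class="tit_view">Sample headline</h3>
  <p class="lead">First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""


def make_soup(html):
    """Prefer lxml; fall back to the built-in html.parser if lxml is missing."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


soup = make_soup(SAMPLE_HTML)

first_p = soup.find("p")                # first <p> only
lead_p = soup.find("p", class_="lead")  # filter by class name
all_p = soup.find_all("p")              # every <p>, as a list-like ResultSet
missing = soup.find("table")            # None when nothing matches

print(first_p.get_text())
print([p.get_text() for p in all_p])
print(missing)
```

Because find() returns None when nothing matches, calling get_text() on the result without a check raises AttributeError on pages whose layout has changed.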

RPAbasic\crawl\beautifulsoup folder - 1_실습.py

Finding content in a Daum news article

import requests
from bs4 import BeautifulSoup

# URL of the first news article on Daum
res = requests.get("https://news.v.daum.net/v/20220613093413149")
soup = BeautifulSoup(res.text, "lxml")

# Get the article title
# For the first matching tag, soup.h3 also works
news_title = soup.find("h3")
print("Article title (tag): ", news_title)
print("Article title (text): ", news_title.get_text())

# Get the article's publication date and time
num_date = soup.find("span", "num_date")
print("Date and time (tag): ", num_date)
print("Date and time (text): ", num_date.get_text())

# Get the article's author
writer = soup.find("span", "txt_info")
print("Author (tag): ", writer)
print("Author (text): ", writer.get_text())

# Get the first paragraph of the article
para1 = soup.find("p")
print("First paragraph (tag): ", para1)
print("First paragraph (text): ", para1.get_text())

# Get the whole article body: every p tag
contents = soup.find_all("p")
# print(contents)
for para in contents:
    print(para.get_text())
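The script above assumes every element exists on the page. A minimal sketch of a more defensive variant follows; the helper text_or_default and the inline markup are hypothetical and only illustrate the idea of checking for None before reading text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a headline and a date, but no author element
html = (
    '<div><h3>Sample headline</h3>'
    '<span class="num_date">2022. 6. 13. 09:34</span></div>'
)
soup = BeautifulSoup(html, "html.parser")


def text_or_default(tag, default=""):
    """Return the stripped text of tag, or default when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default


title = text_or_default(soup.find("h3"))
date = text_or_default(soup.find("span", "num_date"))
writer = text_or_default(soup.find("span", "txt_info"), default="(not found)")

print(title, date, writer)
```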

RPAbasic\crawl\beautifulsoup folder - gmarket1.py

Getting all Gmarket categories

import requests
from bs4 import BeautifulSoup

# Gmarket URL
url = "https://www.gmarket.co.kr"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
print(soup.prettify())  # Check that the page was read

# Extract the first-level categories
# Class name copied from the li inside the ul: link__1depth-item
one_depth = soup.find_all("a", class_="link__1depth-item", limit=9)
print(one_depth)  # Check the data. The list appears twice on the page, so only 9 categories are taken
for item in one_depth:
    print(item.get_text())

# Extract the second-level categories
# Class name copied from the li: link__2depth-item
item__2depth = soup.find_all("li", "list-item__2depth")
for item in item__2depth:
    print(item)  # Check that the content is retrieved

# Names only. Each item is returned twice, so the count is limited
item__2depth = soup.find_all("li", "list-item__2depth", limit=69)
print("Number of categories: ", len(item__2depth))  # Check the count
for item in item__2depth:
    print(item.get_text())

item__2depth = soup.find_all("li", "list-item__2depth", limit=69)
# Find the a tag inside each item and take only its href
for depth in item__2depth:
    href = depth.find("a")["href"]
    print(depth.get_text(), href)
# Only links that can be visited directly are retrieved. # .aspx: an ASP.NET page (typically written in C#)
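The limit=9 and limit=69 values above depend on how many categories the page happens to contain. As an alternative sketch, duplicates can be dropped by content instead; the inline markup and the /a, /b links are made up for illustration and do not reflect the real Gmarket page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup in which the same category list appears twice
html = """
<ul>
  <li class="list-item__2depth"><a href="/a">Fashion</a></li>
  <li class="list-item__2depth"><a href="/b">Beauty</a></li>
</ul>
<ul>
  <li class="list-item__2depth"><a href="/a">Fashion</a></li>
  <li class="list-item__2depth"><a href="/b">Beauty</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
for li in soup.find_all("li", "list-item__2depth"):
    link = li.find("a")
    if link is None or not link.has_attr("href"):
        continue  # skip items without a usable link
    pairs.append((link.get_text(strip=True), link["href"]))

# dict.fromkeys keeps the first occurrence of each pair, in order
categories = list(dict.fromkeys(pairs))

for name, href in categories:
    print(name, href)
```

This keeps working if the number of categories changes, at the cost of treating two items with the same name and link as one.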

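# Difference between string and get_text
item__2depth = soup.find_all("li", "list-item__2depth")
print(len(item__2depth))
for item in item__2depth:
    print(item.string)

get_text(): returns all the text inside the tag, including the text of its descendant tags

string: returns the tag's text only when the tag has a single child (a string, or one tag that itself has a single string); otherwise it returns None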
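The difference is easiest to see on a tag with more than one child. The markup below is a hypothetical sample, not taken from the Gmarket page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: three list items with different child structures
html = (
    '<ul>'
    '<li id="plain">Books</li>'
    '<li id="wrapped"><a href="#">Games</a></li>'
    '<li id="nested"><a href="#">Music</a> <span>(new)</span></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

plain = soup.find("li", id="plain")
wrapped = soup.find("li", id="wrapped")
nested = soup.find("li", id="nested")

print(plain.string)       # single string child
print(wrapped.string)     # single child tag with a single string
print(nested.string)      # several children, so .string is None
print(nested.get_text())  # get_text() joins the text of all descendants
```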

RPAbasic\crawl\beautifulsoup folder - bs7.py

from bs4 import BeautifulSoup

# Load the document
with open("./RPAbasic/crawl/beautifulsoup/story.html", "r") as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml")

# Get the element whose class name is title
title = soup.select_one("p.title")
print(title)
print(title.get_text())

# Get the element whose id is link1: ids are unique, so only one element can match
link1 = soup.select_one("#link1")
print(link1)
print(link1.get_text())
print(link1.string)

# Get an a tag by its data-* attribute
link2 = soup.select_one("a[data-io='tillie']")
print(link2)
print(link2.get_text())

# Get every a tag that is a direct child of p.story
# > : child combinator
# find_all() and select() return lists, so iterate with a for loop
all_a = soup.select("p.story > a")  # returned as a list
print(all_a)
print()
for link in all_a:
    print(link)
    print(link.get_text())
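Since story.html itself is not shown here, the sketch below runs the same selectors on an inline document. The markup is an assumption modeled on the selectors used above (p.title, #link1, data-io, p.story > a), and the example.com links are placeholders.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for story.html; the real file may differ
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3" data-io="tillie">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

title_text = soup.select_one("p.title").get_text()           # class selector
link1_text = soup.select_one("#link1").get_text()            # id selector
tillie_href = soup.select_one("a[data-io='tillie']")["href"]  # attribute selector
names = [a.get_text() for a in soup.select("p.story > a")]    # child combinator
nothing = soup.select_one("p.missing")                        # None when no match

print(title_text, link1_text, tillie_href, names, nothing)
```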

RPAbasic\crawl\beautifulsoup folder - bs8.py

Fetching the Daum article with select

import requests
from bs4 import BeautifulSoup

# URL of the first news article on Daum
res = requests.get("https://news.v.daum.net/v/20220613093413149")
print(res.text)  # Check the data
soup = BeautifulSoup(res.text, "lxml")

# Get the contents of the <head> tag
print(soup.head)

# Get the contents of the <body> tag
print(soup.body)

# title tag
print(soup.title)
print(soup.title.name)
print(soup.title.get_text())
print(soup.title.string)

# Get the article title
news_title = soup.select_one("h3")
print(news_title)
print(news_title.get_text())

# Get the article's publication date and time
num_date = soup.select_one("span.num_date")
print(num_date)
print(num_date.get_text())

# Get the article's author
writer = soup.select_one("span.txt_info")
print(writer)
print(writer.get_text())

# Get the first paragraph of the article
para1 = soup.select_one("p")
print(para1)
print(para1.get_text())
print()

# Get every paragraph of the article
contents = soup.select("p")
for para in contents:
    print(para.get_text())

RPAbasic\crawl\beautifulsoup folder - bs9.py

Wikipedia: saving Seoul subway images

import os
import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

res = requests.get(
    "https://ko.wikipedia.org/wiki/%EC%84%9C%EC%9A%B8_%EC%A7%80%ED%95%98%EC%B2%A0"
)
soup = BeautifulSoup(res.text, "lxml")
print(soup.prettify())  # Check the data

# Save the subway line image
# copy - Copy selector: easy to obtain, but the resulting selector is long
# copy - Copy selector: #mw-content-text > div.mw-parser-output > table.infobox > tbody > tr:nth-child(1) > td > a > img

# First subway line image
image1 = soup.select_one(
    "#mw-content-text > div.mw-parser-output > table.infobox > tbody > tr:nth-child(1) > td > a > img"
)
print(image1)
print(image1["src"])

# Download the image with urlretrieve
# Download directory (created if it does not exist yet)
path = "./RPAbasic/crawl/download/"
os.makedirs(path, exist_ok=True)

# Usage: urlretrieve(image source URL, destination file path)
urlretrieve("http:" + image1["src"], path + "subway1.jpg")

# Second image: the stamp
# Find the element, then its URL
# #mw-content-text > div.mw-parser-output > div.thumb.tright > div > a > img
image2 = soup.select_one(
    "#mw-content-text > div.mw-parser-output > div.thumb.tright > div > a > img"
)
print(image2)
print(image2["src"])

# Download it
urlretrieve("http:" + image2["src"], path + "anniversiry.jpg")
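Prepending "http:" works when src starts with "//". As a hedged alternative, urljoin from the standard library resolves the src against the page URL whatever its form. The helper names and the sample src values below are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# The page on which the <img> tags were found
PAGE_URL = "https://ko.wikipedia.org/wiki/%EC%84%9C%EC%9A%B8_%EC%A7%80%ED%95%98%EC%B2%A0"


def absolute_url(src, page_url=PAGE_URL):
    """Resolve an img src (protocol-relative, root-relative, or absolute)."""
    return urljoin(page_url, src)


def filename_from_url(url):
    """Take the last path segment so the saved file keeps its real extension."""
    return PurePosixPath(urlparse(url).path).name


# Hypothetical src values covering the three common forms
protocol_relative = absolute_url("//upload.wikimedia.org/sample/subway_map.png")
root_relative = absolute_url("/static/images/logo.png")
already_absolute = absolute_url("https://example.org/a.jpg")

print(protocol_relative)
print(root_relative)
print(already_absolute)
print(filename_from_url(protocol_relative))
```

Deriving the file name from the URL also avoids saving a PNG or SVG under a .jpg name.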

RPAbasic\crawl\beautifulsoup folder - bs10.py

Using the practice site provided by the book: https://pythonscraping.com/pages/page3.html


import requests
from bs4 import BeautifulSoup

res = requests.get("https://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(res.text, "lxml")

# Get the h1 tag
h1 = soup.find("h1")
print(h1)
print(h1.get_text())

# Get the introductory content at the top
content = soup.select_one("#content")
print(content)
print(content.get_text())

# Get every img tag
img_list = soup.find_all("img")  # same as soup.select("img")
print(img_list)

# Get the header row
row = soup.select_one("table#giftList > tr:nth-child(1)")
print(row)
# Iterate over the cells only; iterating over row directly also yields whitespace strings
for item in row.find_all(["th", "td"]):
    print(item.get_text())

# Get the table contents
table = soup.find_all("table", id="giftList")
print(table)
table = soup.find("table", id="giftList")
print(table.get_text())

# Get only the prices (third cell of each gift row)
cost_list = soup.select("tr.gift")
for tr in cost_list:
    print(tr.find_all("td")[2].get_text())
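Building on the price loop above, each row can be collected into a dictionary instead of printing a single column. The inline table is a made-up sample that only imitates the tr.gift layout with four cells per row; the item names and prices are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up sample imitating a gift table: title, description, cost, image
html = """
<table id="giftList">
  <tr><th>Item Title</th><th>Description</th><th>Cost</th><th>Image</th></tr>
  <tr class="gift"><td> Sample Basket </td><td>Sample text</td><td> $15.00 </td><td><img src="img1.jpg"></td></tr>
  <tr class="gift"><td> Sample Doll </td><td>Sample text</td><td> $10,000.52 </td><td><img src="img2.jpg"></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

gifts = []
for tr in soup.select("tr.gift"):
    cells = tr.find_all("td")
    if len(cells) < 3:
        continue  # skip rows that do not have a cost cell
    gifts.append(
        {
            "title": cells[0].get_text(strip=True),
            "cost": cells[2].get_text(strip=True),
        }
    )

for gift in gifts:
    print(gift["title"], gift["cost"])
```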

RPAbasic\crawl\beautifulsoup folder - 2_실습_stock.py

Naver Finance: most-searched stocks

import requests
from bs4 import BeautifulSoup

res = requests.get("https://finance.naver.com/")
print(res.text)  # Check that the page was fetched
soup = BeautifulSoup(res.text, "lxml")

# Most-searched stocks: name and current price
stock1 = soup.select("div.aside_area.aside_popular > table > tbody > tr")
print(stock1)  # Check that the data was retrieved
for item in stock1:
    # Stock name
    stock_name = item.find("a").get_text()
    # Current price
    stock_price = item.find("td").get_text()
    print(stock_name, stock_price)

# Overseas markets: name and price
stock2 = soup.select("div.aside_area.aside_stock > table > tbody > tr")
print(stock2)
for item in stock2:
    # Index name
    stock_name = item.find("a").get_text()
    # Current price
    stock_price = item.find("td").get_text()
    print(stock_name, stock_price)

RPAbasic\crawl\beautifulsoup folder - 3_실습_clien.py

Crawling the clien Tips and Lectures board

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.clien.net/service/board/lecture")
soup = BeautifulSoup(res.text, "lxml")

# Get the post titles
# div_content > div.list_content > div:nth-child(18) > div.list_title > a.list_subject > span.subject_fixed
# If nothing is printed, add the selector parts back one at a time
title_list = soup.select("a.list_subject > span.subject_fixed")
print(title_list)
for title in title_list:
    print(title.get_text().strip())

# Get the post lists of pages 1 to 5
for page_num in range(5):  # range: 0 to 4
    if page_num == 0:  # page 1
        res = requests.get("https://www.clien.net/service/board/lecture")
    else:
        res = requests.get(
            "https://www.clien.net/service/board/lecture?&od=T31&category=0&po="
            + str(page_num)  # Python needs an explicit str() to concatenate a number
        )
    soup = BeautifulSoup(res.text, "lxml")
    title_list = soup.select("a.list_subject > span.subject_fixed")
    for title in title_list:
        print(title.get_text().strip())
    print("*" * 80)  # Separator between pages
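The page URLs in the loop above can also be built with urlencode instead of string concatenation. This is a small sketch: the function name page_url is hypothetical, the parameter names od, category, and po are copied from the URL above, and the stray "&" right after "?" is dropped on the assumption that the server ignores it.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.clien.net/service/board/lecture"


def page_url(page_index):
    """Return the board URL for a zero-based page index (0 is the first page)."""
    if page_index == 0:
        return BASE_URL
    query = urlencode({"od": "T31", "category": 0, "po": page_index})
    return f"{BASE_URL}?{query}"


urls = [page_url(i) for i in range(5)]
for url in urls:
    print(url)
```

urlencode converts the numbers to strings and escapes special characters, so no manual str() call is needed.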

License: Attribution, NonCommercial, NoDerivatives
