[python] Help with Python RE

Petr Messner petr.messner at gmail.com
Sat Jan 12 23:45:43 CET 2013


Only Chuck Norris can parse HTML with a regular expression. For the rest of
us, there are HTML parsers.

Try something like this:

>>> import lxml.html
>>> lxml.html.fromstring("<p>foo</p><script>bar</script>").xpath("//script")[0].text
'bar'
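
A slightly longer sketch of the same idea, for the case from the question
where both the <script ...> start tag and its body are wanted. Note that it
uses html.parser from the standard library rather than lxml (my substitution,
so that it runs without extra packages) and that it is written for Python 3.
ScriptCollector, extract_scripts and the shortened sample string are made-up
names and data for illustration only, not part of the original script.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect (start tag, body) pairs for every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []   # finished (start_tag_text, body) pairs
        self._tag = None    # raw text of the <script ...> tag being read
        self._body = []     # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            # get_starttag_text() returns the tag exactly as written
            self._tag = self.get_starttag_text()
            self._body = []

    def handle_data(self, data):
        # The parser may deliver the script body in several chunks
        if self._tag is not None:
            self._body.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._tag is not None:
            self.scripts.append((self._tag, "".join(self._body)))
            self._tag = None


def extract_scripts(html):
    """Return a list of (start_tag_text, body) for each <script> in html."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


# Hypothetical, shortened stand-in for the data in the question
sample = (
    '<small>by <script language="javascript">'
    "document.write('<a class=\"cap\" href=\"x\">y</a>')"
    "</script><noscript>rattle</noscript></small>"
)

for tag, body in extract_scripts(sample):
    print("---" + tag)
    print("---" + body)
```

The parser treats the content of <script> as raw text, so the <a ...> markup
inside document.write() ends up in the body instead of being mistaken for a
real tag, which is exactly the part that is painful to get right with re.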

Recommended reading:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

PM



2013/1/12 Bystroushaak <bystrousak at kitakitsune.org>

> Hello.
>
> I need help with Python's re module. I have been playing with it for
> several hours and I am at my wits' end.
>
> I have a script:
>
>
> -------------------------------------------------------------------------------
> import re
>
> data = """<tr><td class="newscap"><b style="font-size:13px">Downtime for
> Christmas</b>
>                 <br><small>by <script
> language="javascript">document.write('<a
> class=\"cap\"
> href=\"mailto:'+rot(5,'mvoogz na vrvmzizorjmf.jmb
> ')+'\">'+rot(5,'mvoogz na vrvmzizorjmf.jmb
> ')+'</a>')</script><noscript>rattle</noscript>
> on 12/30/12 10:48</small></td></tr>
>                 <tr><td class="aware" colspan="2">
>                 So, it appears the site was down for christmas. I could
> try to find
> out why, but I don't care enough. Went to <a
> href="https://events.ccc.de/congress/2012/wiki/Main_Page">29c3</a>,
> didn't get much done, ate a lot of fast food. I'm old, fat, and boring
> now. However, I found out about <a
> href="http://www.hyperelliptic.org/tanja/newelliptic/newelliptic.html
> ">Edwards
> curves</a>, that shit is rad.
>                 </td></tr>"""
>
> print re.sub(r'.*(<script.*>)(.*)(</script>).*',
> r"\n\n---\1\n---\2\n---\3", data)
>
> -------------------------------------------------------------------------------
>
> When run, it prints:
>
>
> -------------------------------------------------------------------------------
> <tr><td class="newscap"><b style="font-size:13px">Downtime for
> Christmas</b>
>
>
> ---<script language="javascript">document.write('<a class="cap"
> href="mailto:'+rot(5,'mvoogz na vrvmzizorjmf.jmb
> ')+'">'+rot(5,'mvoogz na vrvmzizorjmf.jmb')+'</a>
> ---')
> ---</script>
>                 <tr><td class="aware" colspan="2">
>                 So, it appears the site was down for christmas. I could
> try to find
> out why, but I don't care enough. Went to <a
> href="https://events.ccc.de/congress/2012/wiki/Main_Page">29c3</a>,
> didn't get much done, ate a lot of fast food. I'm old, fat, and boring
> now. However, I found out about <a
> href="http://www.hyperelliptic.org/tanja/newelliptic/newelliptic.html
> ">Edwards
> curves</a>, that shit is rad.
>                 </td></tr>
>
> -------------------------------------------------------------------------------
>
> My goal is to have the <script> tag, i.e. <script
> language="javascript">, in group \1, and the body of the tag in \2. As it
> stands, both end up merged into \1.
>
> "Live" demo: http://ideone.com/TfbmB1
>
> Please give me a kick in the right direction.