[python] Help with Python RE

rajcze rajcze na gmail.com
Sat Jan 12 23:51:25 CET 2013


OT: I do understand that XML/HTML needs a pushdown automaton, but IMHO
there is a trivial subset of tasks for which regexps are enough...
Of course you need to know what you want and what limits that approach
has, but I wouldn't necessarily claim that only our almighty favorite can
pull a subset of data out of valid XML/HTML :D
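
For the script quoted below, the greedy `.*` in `<script.*>` appears to be the
culprit: it runs to the last `>` before `</script>`, so \1 swallows the body and
only `')` is left for \2. A minimal sketch of one way around it, assuming a
shortened stand-in for the posted data and a hypothetical helper name
`split_script` (neither is from the original post):

```python
import re

# Shortened stand-in for the page fragment from the original post (an assumption).
data = """<b>Downtime</b>
<br><small>by <script
language="javascript">document.write('<a class="cap">x</a>')</script><noscript>rattle</noscript>
on 12/30/12</small>"""

# [^>]* cannot run past the end of the opening tag, and the lazy .*? stops at
# the first </script>; re.DOTALL lets "." match newlines inside the element.
SCRIPT_RE = re.compile(r'(<script\b[^>]*>)(.*?)(</script>)',
                       re.DOTALL | re.IGNORECASE)


def split_script(html):
    """Return (opening tag, body, closing tag) of the first <script>, or None."""
    match = SCRIPT_RE.search(html)
    return match.groups() if match else None


tag, body, closing = split_script(data)
print(tag)
print(body)
```

The known limits: an attribute value containing `>` would cut the opening tag
short, and a literal `</script>` inside a JavaScript string would end the body
early. For input where that can happen, a real parser is the safer choice.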

2013/1/12 Petr Messner <petr.messner na gmail.com>

> Only Chuck Norris can parse HTML with a regular expression. For the rest of
> us, there are HTML parsers.
>
> Try something like this:
>
> >>> import lxml.html
> >>> lxml.html.fromstring("<p>foo</p><script>bar</script>").xpath("//script")[0].text
> 'bar'
>
> Recommended reading:
>
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
>
> PM
>
>
>
> 2013/1/12 Bystroushaak <bystrousak na kitakitsune.org>
>
>> Hello.
>>
>> I need help with Python's re module. I have been playing with it for
>> several hours and I am at my wits' end.
>>
>> I have this script:
>>
>>
>> -------------------------------------------------------------------------------
>> import re
>>
>> data = """<tr><td class="newscap"><b style="font-size:13px">Downtime for
>> Christmas</b>
>>                 <br><small>by <script
>> language="javascript">document.write('<a
>> class=\"cap\"
>> href=\"mailto:'+rot(5,'mvoogz na vrvmzizorjmf.jmb
>> ')+'\">'+rot(5,'mvoogz na vrvmzizorjmf.jmb
>> ')+'</a>')</script><noscript>rattle</noscript>
>> on 12/30/12 10:48</small></td></tr>
>>                 <tr><td class="aware" colspan="2">
>>                 So, it appears the site was down for christmas. I could
>> try to find
>> out why, but I don't care enough. Went to <a
>> href="https://events.ccc.de/congress/2012/wiki/Main_Page">29c3</a>,
>> didn't get much done, ate a lot of fast food. I'm old, fat, and boring
>> now. However, I found out about <a
>> href="http://www.hyperelliptic.org/tanja/newelliptic/newelliptic.html
>> ">Edwards
>> curves</a>, that shit is rad.
>>                 </td></tr>"""
>>
>> print re.sub(r'.*(<script.*>)(.*)(</script>).*',
>> r"\n\n---\1\n---\2\n---\3", data)
>>
>> -------------------------------------------------------------------------------
>>
>> Which prints the following when run:
>>
>>
>> -------------------------------------------------------------------------------
>> <tr><td class="newscap"><b style="font-size:13px">Downtime for
>> Christmas</b>
>>
>>
>> ---<script language="javascript">document.write('<a class="cap"
>> href="mailto:'+rot(5,'mvoogz na vrvmzizorjmf.jmb
>> ')+'">'+rot(5,'mvoogz na vrvmzizorjmf.jmb')+'</a>
>> ---')
>> ---</script>
>>                 <tr><td class="aware" colspan="2">
>>                 So, it appears the site was down for christmas. I could
>> try to find
>> out why, but I don't care enough. Went to <a
>> href="https://events.ccc.de/congress/2012/wiki/Main_Page">29c3</a>,
>> didn't get much done, ate a lot of fast food. I'm old, fat, and boring
>> now. However, I found out about <a
>> href="http://www.hyperelliptic.org/tanja/newelliptic/newelliptic.html
>> ">Edwards
>> curves</a>, that shit is rad.
>>                 </td></tr>
>>
>> -------------------------------------------------------------------------------
>>
>> My goal is to have the <script> tag, i.e. <script
>> language="javascript">, in group \1 and the body of the tag in \2. As it
>> stands, both end up merged into \1.
>>
>> "Live" demo: http://ideone.com/TfbmB1
>>
>> Please give me a nudge in the right direction.
>> _______________________________________________
>> Python mailing list
>> Python na py.cz
>> http://www.py.cz/mailman/listinfo/python
>>
>
>



-- 
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.