
[Course] Calculs distribués sur données massives

AMA!

    11 October 2018 at 12:18:46

    Hello,

    I have a question about the working environment.

    Is it possible to use Spark SQL (activity 2) on Windows with Anaconda?

    Thanks.

      11 October 2018 at 20:26:28

      Hello.

      Spark SQL works fine with Python. To use it, install pyspark, run your script with spark-submit, and initialize a SparkSession object at the beginning, for example like this:

          from pyspark import SparkContext
          from pyspark.sql import SparkSession
          sc = SparkContext()
          spark = SparkSession.builder.appName("myapp").getOrCreate()

      However, to give you a complete answer I would need to know which versions of Spark and Python (Anaconda) you are using. Some Spark/Python version combinations simply don't work together, for example Spark 2.1.0 with Python 3.6.
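
      Once the SparkSession exists, Spark SQL queries work as usual. Purely as an illustration (the data and table name below are made up, not taken from the course), a quick smoke test could look like this:

          # Small Spark SQL check using the SparkSession created above
          # (illustrative data only).
          df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
          df.createOrReplaceTempView("people")
          spark.sql("SELECT name FROM people WHERE age > 40").show()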

      Michel.

      -
      Edited by Michel D., 11 October 2018 at 20:28:16

        12 October 2018 at 11:03:33

        Hello Michel,

        Thanks for your reply.

        Regarding versions, I am using Anaconda 1.9.2, notebook 5.6 and Python 3.7.0.

        Here is the error I got after running these lines:

         from pyspark import SparkContext
         from pyspark.sql import SparkSession

         sc = SparkContext()

        ---------------------------------------------------------------------------
        Exception                                 Traceback (most recent call last)
        <ipython-input-3-2dfc28fca47d> in <module>()
        ----> 1 sc = SparkContext()
        
        ~\Anaconda3\lib\site-packages\pyspark\context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
         113         """
         114         self._callsite = first_spark_call() or CallSite(None, None, None)
        --> 115 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
         116         try:
         117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
        
        ~\Anaconda3\lib\site-packages\pyspark\context.py in _ensure_initialized(cls, instance, gateway, conf)
         290         with SparkContext._lock:
         291             if not SparkContext._gateway:
        --> 292 SparkContext._gateway = gateway or launch_gateway(conf)
         293                 SparkContext._jvm = SparkContext._gateway.jvm
         294 
        
        ~\Anaconda3\lib\site-packages\pyspark\java_gateway.py in launch_gateway(conf)
         91 
         92             if not os.path.isfile(conn_info_file):
        ---> 93 raise Exception("Java gateway process exited before sending its port number")
         94 
         95             with open(conn_info_file, "rb") as info:
        
        Exception: Java gateway process exited before sending its port number
          1 November 2018 at 22:32:48

          Hello,

          The link to the movies dataset in the PageRank exercise returns a 404 error.

          Could the dataset be re-uploaded to the OpenClassrooms site?

          Thanks in advance

          -------

          EDIT:

          Answering my own question:

          The dataset was saved on archive.org; I found it at this URL:

          https://web.archive.org/web/20070314100741/http://www.cs.toronto.edu/~tsap/experiments/download/_movies.tar.Z

          Can those who have done this course confirm that it is indeed the same file? Could the exercise link be updated with this one, now that the original page is gone?

          -
          Edited by mikachou34, 1 November 2018 at 22:48:07

            3 November 2018 at 12:55:17

            @Aimad

            Sorry for the late reply! You may have already solved your problem; if not, could you tell me which version of Spark you are using and on which OS you are working?

            I see you are using Python 3.7. I don't know whether it works with 3.7, which is quite recent.

            I also have Anaconda 1.9.2, but I use Spyder 3.3.1, which loads a Python 3.6.3 environment, and there I get the same error.

            However, in a terminal with the Python 3.6 CLI, I don't get this error.
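
            If it helps to narrow things down, a small snippet like this (nothing course-specific, just standard calls) prints the interpreter and pyspark versions actually in use:

                import sys
                import pyspark

                # Show which Python interpreter and which pyspark version are being used,
                # to spot an incompatible Spark/Python combination.
                print(sys.version)
                print(pyspark.__version__)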

            Hope this helps.

              24 November 2018 at 23:12:41

              Hello,

              I'm back with another small issue: the iliad_odyssey.py script from the « Apprenez à débugger une application spark » part does not seem to work:

              $ env PYSPARK_PYTHON=python3 ./spark-2.4.0-bin-hadoop2.7/bin/spark-submit ./iliad_odyssey.py
              18/11/24 22:47:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                File "/home/michael/openclassrooms/spark/./iliad_odyssey.py", line 31
                  .map(lambda (word, count): (word, count/float(word_count)))\
                              ^
              SyntaxError: invalid syntax

              Note that even after removing the parentheses after the lambda keyword, I run into a new error:

              $ env PYSPARK_PYTHON=python3 ./spark-2.4.0-bin-hadoop2.7/bin/spark-submit ./iliad_odyssey.py
              18/11/24 22:54:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
              18/11/24 22:54:34 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
              org.apache.spark.api.python.PythonException: Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 357, in main
                  eval_type = read_int(infile)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
                  raise EOFError
              EOFError
              
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
              	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
              	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
              	at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
              	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
              	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
              	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
              	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
              	at org.apache.spark.scheduler.Task.run(Task.scala:121)
              	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
              	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
              	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
              	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              	at java.lang.Thread.run(Thread.java:748)
              Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
                  process()
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
                  serializer.dump_stream(func(split_index, iterator), outfile)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 390, in dump_stream
                  vs = list(itertools.islice(iterator, batch))
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
                  return f(*args, **kwargs)
              TypeError: <lambda>() missing 1 required positional argument: 'count'
              
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
              	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
              	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
              	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
              	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
              	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
              	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
              	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
              	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
              18/11/24 22:54:34 ERROR TaskSetManager: Task 3 in stage 4.0 failed 1 times; aborting job
              Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/./iliad_odyssey.py", line 60, in <module>
                  main()
                File "/home/michael/openclassrooms/spark/./iliad_odyssey.py", line 47, in main
                  emerging_words = join_words.takeOrdered(10, lambda word, freq_diff: -freq_diff)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1304, in takeOrdered
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 844, in reduce
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
              py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
              : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 1 times, most recent failure: Lost task 3.0 in stage 4.0 (TID 11, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
                  process()
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
                  serializer.dump_stream(func(split_index, iterator), outfile)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 390, in dump_stream
                  vs = list(itertools.islice(iterator, batch))
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
                  return f(*args, **kwargs)
              TypeError: <lambda>() missing 1 required positional argument: 'count'
              


              Next, coming back to the _movies link in the PageRank exercise of the Hadoop part: it is a 404 (the page no longer exists). I found the dataset a few weeks ago on archive.org, and I suggest the course maintainers update the link to this one: https://web.archive.org/web/20070314100741/http://www.cs.toronto.edu/~tsap/experiments/download/_movies.tar.Z

              Finally, for the Hadoop exercise I was asked to submit 6 peer reviews. Along the way that dropped to 5 ("you have done your share of the work"), but since this evening it is back up to 6. Why???

                27 November 2018 at 9:43:22

                Hello.

                Regarding the exception in the iliad script: the error comes from the fact that the course uses Python 2 syntax, while you must be running it with Python 3. Python 3 no longer allows unpacking a tuple in a lambda's parameter list, so the (word, count) pair arrives as a single argument that you have to index. You therefore need to write:

                .map(lambda wordcount: (wordcount[0], wordcount[1]/float(word_count)))\
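
                To make the difference concrete, here is a minimal standalone illustration (toy data, not the course script) of the Python 2 style versus the Python 3 equivalent:

                    pairs = [("sing", 2), ("goddess", 1)]
                    word_count = 3

                    # Python 2 only (SyntaxError in Python 3): tuple unpacking in the lambda parameters.
                    # freqs = map(lambda (word, count): (word, count / float(word_count)), pairs)

                    # Python 3: the pair arrives as a single argument, so index it instead.
                    freqs = list(map(lambda wc: (wc[0], wc[1] / float(word_count)), pairs))
                    print(freqs)  # [('sing', 0.6666666666666666), ('goddess', 0.3333333333333333)]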

                For _movies.tar, I could have suggested a link, but you found one on your own.

                As for the number of peer reviews, send an email to hello@openclassrooms.com if you get no answer in this forum.

                Michel.

                  Team OC, 28 November 2018 at 14:52:49

                  Hello Mika,

                  Thank you for your feedback! As a first step, we have updated the activity with the new link.

                  Best regards,

                    4 December 2018 at 0:07:53

                    Good evening Michel and Luc,

                    Thank you for your replies, and sorry for answering so late.

                    @Michel:

                    Indeed, the tuples had to be replaced by indexed arguments in the function parameters, at the place you pointed out and also a bit further down. Building on your hints, here is the script I propose (it works for me with Python 3):

                    from pyspark import SparkContext
                    
                    sc = SparkContext()
                    
                    def filter_stop_words(word):
                        from nltk.corpus import stopwords
                        english_stop_words = stopwords.words("english")
                        return word not in english_stop_words
                    
                    def load_text(text_path):
                        # Split text in words
                        # Remove empty word artefacts
                        # Remove stop words ('I', 'you', 'a', 'the', ...)
                        vocabulary = sc.textFile(text_path)\
                            .flatMap(lambda lines: lines.lower().split())\
                            .flatMap(lambda word: word.split("."))\
                            .flatMap(lambda word: word.split(","))\
                            .flatMap(lambda word: word.split("!"))\
                            .flatMap(lambda word: word.split("?"))\
                            .flatMap(lambda word: word.split("'"))\
                            .flatMap(lambda word: word.split("\""))\
                            .filter(lambda word: word is not None and len(word) > 0)\
                            .filter(filter_stop_words)
                    
                        # Count the total number of words in the text
                        word_count = vocabulary.count()
                    
                        # Compute the frequency of each word: frequency = #appearances/#word_count
                        word_freq = vocabulary.map(lambda word: (word, 1))\
                            .reduceByKey(lambda count1, count2: count1 + count2)\
                            .map(lambda wordcount: (wordcount[0], wordcount[1]/float(word_count)))
                    
                        return word_freq
                    
                    def main():
                        iliad = load_text('./iliad.mb.txt')
                        odyssey = load_text('./odyssey.mb.txt')
                    
                        # Join the two datasets and compute the difference in frequency
                        # Note that we need to write (freq or 0) because some words do not appear
                        # in one of the two books. Thus, some frequencies are equal to None after
                        # the full outer join.
                        join_words = iliad.fullOuterJoin(odyssey)\
                            .map(lambda wordfreqs: (wordfreqs[0], (wordfreqs[1][1] or 0) - (wordfreqs[1][0] or 0)))
                    
                        # 10 words that get a boost in frequency in the sequel
                        emerging_words = join_words.takeOrdered(10, lambda freq_diff: -freq_diff[1])
                        # 10 words that get a decrease in frequency in the sequel
                        disappearing_words = join_words.takeOrdered(10, lambda freq_diff: freq_diff[1])
                    
                        # Print results
                        for word, freq_diff in emerging_words:
                            print("%.2f" % (freq_diff*10000), word)
                        for word, freq_diff in disappearing_words[::-1]:
                            print("%.2f" % (freq_diff*10000), word)
                    
                        input("press ctrl+c to exit")
                    
                    if __name__ == "__main__":
                        main()
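
                    For the record, I run it the same way as earlier in this thread, i.e.:

                        $ env PYSPARK_PYTHON=python3 ./spark-2.4.0-bin-hadoop2.7/bin/spark-submit ./iliad_odyssey.py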

                    @Luc:

                    Great! Thanks for updating the exercise!

                      5 December 2018 at 18:56:06

                      Good evening Mika.

                      Yes, that's exactly it. I think your script could usefully replace or complement (for Python 3) the one in the course... but that's not my decision!

                      All the best.

                      Michel.
